Parsing HTML with Python and Extracting Content from a Specific Region

DDD

Published: 2025-08-06 18:32:19

Source: php中文网

Original article

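This article shows how to use Python and the BeautifulSoup library to extract a specific region of an HTML document. The approach is to identify a start tag and an end tag by the text they contain, walk the document's top-level tags, and collect every tag that lies between the two.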

Parsing HTML with BeautifulSoup

BeautifulSoup is a Python library for extracting data from HTML and XML files. It provides a simple, Pythonic way to navigate, search, and modify the parse tree.

First, install BeautifulSoup:

pip install beautifulsoup4

Then import the BeautifulSoup class:

from bs4 import BeautifulSoup

Loading the HTML Content

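Suppose we have the following HTML content (the markup was stripped from this listing; the <p> and <div> tags shown here are assumed):

<p>
    Something other ...
</p>
<p>
    Notes to Unaudited Condensed Consolidated Financial Statements
</p>
<div>I want this...</div>
<div>I want this too...</div>
<p>
    Item 2.
</p>
<div>I DON'T want this...</div>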

We can load it into a BeautifulSoup object:

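# NOTE: the original markup was lost; the <p>/<div> tag names are assumed.
html_text = """
<p>
    Something other ...
</p>
<p>
    Notes to Unaudited Condensed Consolidated Financial Statements
</p>
<div>I want this...</div>
<div>I want this too...</div>
<p>
    Item 2.
</p>
<div>I DON'T want this...</div>
"""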

soup = BeautifulSoup(html_text, "html.parser")

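html.parser is the parser that ships with Python's standard library. BeautifulSoup also supports other parsers, such as lxml, which is usually faster but must be installed separately.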
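As a minimal sketch of choosing a parser, the helper below tries a preferred parser and falls back to html.parser when it is not installed. The name make_soup and its preferred parameter are hypothetical and not part of BeautifulSoup; FeatureNotFound is the exception bs4 raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html_text, preferred="lxml"):
    """Parse with the preferred parser, falling back to html.parser.

    make_soup is a hypothetical helper written for this illustration.
    """
    try:
        return BeautifulSoup(html_text, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; use the stdlib one.
        return BeautifulSoup(html_text, "html.parser")


soup = make_soup("<p>Item 2.</p>")
print(soup.p.get_text())
```

One caveat: lxml wraps a fragment in <html> and <body> tags, whereas html.parser does not. Code that searches only the direct children of soup, as the rest of this article does, would need to start from soup.body when lxml is used.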

Locating the Start and End Tags

We need to find the start tag (the one containing "Notes to Unaudited Condensed Consolidated Financial Statements") and the end tag (the one containing "Item 2."). Both can be located with find() and a lambda function:

tag_start = soup.find(
    lambda tag: "Notes to Unaudited Condensed Consolidated Financial Statements"
    in tag.text,
    recursive=False,
)

tag_end = soup.find(
    lambda tag: "Item 2." in tag.text,
    recursive=False,
)

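recursive=False restricts the search to the direct children of soup instead of the whole document tree. Because tag.text includes the text of every descendant, a recursive search would also match any ancestor that wraps the target tag, so limiting the search to direct children returns the sibling-level tag that the loop in the next step compares against. It also avoids walking the entire tree in a large document.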
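The effect of recursive=False can be seen in a small sketch. The document below is an assumed variant of the sample wrapped in <html> and <body>, and is_start is a hypothetical helper name; the point is only to contrast the default recursive search with a search limited to direct children.

```python
from bs4 import BeautifulSoup

# Assumed variant of the sample, wrapped in <html><body> for illustration.
html_text = """
<html><body>
<p>Something other ...</p>
<p>Notes to Unaudited Condensed Consolidated Financial Statements</p>
<div>I want this...</div>
<p>Item 2.</p>
</body></html>
"""

soup = BeautifulSoup(html_text, "html.parser")


def is_start(tag):
    # tag.text contains the text of all descendants of the tag.
    return "Notes to Unaudited" in tag.text


# Default (recursive) search: the outermost ancestor matches first.
recursive_hit = soup.find(is_start)
print(recursive_hit.name)  # html

# Searching only the direct children of <body> finds the <p> itself.
direct_hit = soup.body.find(is_start, recursive=False)
print(direct_hit.name)  # p
```

When the region of interest is nested inside a container, calling find() on that container with recursive=False keeps the start tag, the end tag, and the tags between them at the same sibling level.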

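Extracting the Content Between the Tags

Now we can iterate over the top-level tags and collect those that lie between the start tag and the end tag: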

tags_in_between, state = [], False
for tag in soup.find_all(recursive=False):
    if tag is tag_start:
        state = True
    elif tag is tag_end:
        state = False
    elif state:
        tags_in_between.append(tag)

print(tags_in_between)

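The code works as follows:

The tags_in_between list stores the extracted tags.
The state variable is a boolean that tracks whether we are currently between the start tag and the end tag.
find_all(recursive=False) iterates over the direct child tags only.
If the current tag is the start tag, state is set to True.
If the current tag is the end tag, state is set to False.
Otherwise, if state is True, we are between the two markers and the current tag is appended to tags_in_between. The start and end tags themselves are never added.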
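The same idea can also be sketched without a state flag by walking the siblings that follow the start tag and stopping at the end marker. This is an alternative to the loop above, not the article's method; tags_between is a hypothetical helper name, and the <p>/<div> markup in the sample is assumed.

```python
from bs4 import BeautifulSoup


def tags_between(container, start_text, end_text):
    """Return the sibling tags after the start marker and before the end marker.

    Hypothetical helper: markers are matched by the text they contain.
    """
    start = container.find(lambda tag: start_text in tag.text, recursive=False)
    if start is None:
        return []
    collected = []
    # find_next_siblings() with no arguments yields the following sibling tags.
    for sibling in start.find_next_siblings():
        if end_text in sibling.text:
            break
        collected.append(sibling)
    return collected


# Sample markup with assumed tag names.
sample = """
<p>Something other ...</p>
<p>Notes to Unaudited Condensed Consolidated Financial Statements</p>
<div>I want this...</div>
<div>I want this too...</div>
<p>Item 2.</p>
<div>I DON'T want this...</div>
"""

soup = BeautifulSoup(sample, "html.parser")
result = tags_between(
    soup,
    "Notes to Unaudited Condensed Consolidated Financial Statements",
    "Item 2.",
)
print([tag.get_text() for tag in result])
```

If the end marker never appears, this sketch collects every sibling after the start tag, which matches the behavior of the flag-based loop.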

Complete Code Example

from bs4 import BeautifulSoup

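# NOTE: the original markup was lost; the <p>/<div> tag names are assumed.
html_text = """
<p>
    Something other ...
</p>
<p>
    Notes to Unaudited Condensed Consolidated Financial Statements
</p>
<div>I want this...</div>
<div>I want this too...</div>
<p>
    Item 2.
</p>
<div>I DON'T want this...</div>
"""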

soup = BeautifulSoup(html_text, "html.parser")

tag_start = soup.find(
    lambda tag: "Notes to Unaudited Condensed Consolidated Financial Statements"
    in tag.text,
    recursive=False,
)

tag_end = soup.find(
    lambda tag: "Item 2." in tag.text,
    recursive=False,
)

tags_in_between, state = [], False
for tag in soup.find_all(recursive=False):
    if tag is tag_start:
        state = True
    elif tag is tag_end:
        state = False
    elif state:
        tags_in_between.append(tag)

print(tags_in_between)

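Output: a list containing the two tags whose text is "I want this..." and "I want this too...". The tag that follows "Item 2." is excluded.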
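When only the text of the region is needed, or when a marker might be missing, the steps can be wrapped in a function. The sketch below reuses the flag-based loop from the article; region_text is a hypothetical name, raising ValueError for a missing marker is one possible design choice, and the sample markup uses assumed tag names.

```python
from bs4 import BeautifulSoup


def region_text(html_text, start_text, end_text):
    """Return the text of each top-level tag between two marker tags.

    Hypothetical helper; raises ValueError if either marker is missing.
    """
    soup = BeautifulSoup(html_text, "html.parser")
    tag_start = soup.find(lambda tag: start_text in tag.text, recursive=False)
    tag_end = soup.find(lambda tag: end_text in tag.text, recursive=False)
    if tag_start is None or tag_end is None:
        raise ValueError("start or end marker not found among top-level tags")

    texts, state = [], False
    for tag in soup.find_all(recursive=False):
        if tag is tag_start:
            state = True
        elif tag is tag_end:
            state = False
        elif state:
            # strip=True removes surrounding whitespace from the tag's text.
            texts.append(tag.get_text(strip=True))
    return texts


# Sample markup with assumed tag names.
sample = """
<p>Something other ...</p>
<p>Notes to Unaudited Condensed Consolidated Financial Statements</p>
<div>I want this...</div>
<div>I want this too...</div>
<p>Item 2.</p>
<div>I DON'T want this...</div>
"""

print(region_text(
    sample,
    "Notes to Unaudited Condensed Consolidated Financial Statements",
    "Item 2.",
))
```

Without the None check, a missing marker would not raise an error: the loop would either collect nothing or keep collecting to the end of the document.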