Extracting data with Beautiful Soup when there are no divs
P粉818306280 2024-02-26 16:22:47
[HTML discussion group]

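I am trying to extract table data from a few thousand HTML files (saved site data), but the tables are not wrapped in divs that would make this easy, and I am still very new to Beautiful Soup. Right now I manually edit every HTML file after converting it to CSV and then load the results into my database to create the tables, but I would rather just scrape what I already have.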

Center Banner

5k Run
Overall Finish List
September 24, 2022

1st Alarm 5k

Place  Name      City        Bib No  Age  Gender  Age Group  Total Time  Pace
1      Runner 1  ANYTOWN PA  390     52   M       1:Overall  18:43.93    6:03/M
2      Runner 2  ANYTOWN PA  380     33   M       1:19-39    19:31.27    6:18/M
3      Runner 3  ANYTOWN PA  389     65   F       1:Overall  45:45.20    14:46/M
4      Runner 4  ANYTOWN PA  381     18   F       1: 1-18    53:28.84    17:15/M
5      Runner 5  ANYTOWN PA  382     41   F       1:40-59    53:30.48    17:16/M
6      Runner 6  ANYTOWN PA  384     14   M       1: 1-18    57:38.66    18:36/M
7      Runner 7  ANYTOWN PA  385     72   F       1:60-99    57:40.11    18:36/M

I have tried adding divs, but without much success.


All replies (1)
P粉463291248

BeautifulSoup lets you search for elements other than div.
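
As a minimal sketch of what that can look like: Beautiful Soup can match on any tag name, on an attribute such as class, or on a CSS selector. The markup below is made up for illustration, since the real pages are not shown; only the `racetable` class name comes from the answer's own code.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real pages may differ.
html = """
<table class="racetable">
  <tr><td>Place</td><td>Name</td></tr>
  <tr><td>1</td><td>Runner 1</td></tr>
  <tr><td>2</td><td>Runner 2</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Match by tag name alone: the first <table> in the document.
first_table = soup.find("table")

# Match by tag name plus class attribute.
race_table = soup.find("table", class_="racetable")

# Match with a CSS selector: every <td> in a row of table.racetable.
cells = [td.get_text(strip=True) for td in soup.select("table.racetable tr td")]

print(first_table is race_table)
print(cells)
```

None of these lookups needs a div; a table, row, or cell can be addressed directly.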

Assuming the HTML you showed, and that you want to retrieve the rows that look like runners, you could do something like this.

from bs4 import BeautifulSoup

file_path = 'scrap.html'

# Simulate the response of an HTTP request by reading a local .html file
with open(file_path, 'r', encoding='utf-8') as file:
    html_content = file.read()

soup = BeautifulSoup(html_content, 'html.parser')
# Look for the table with the 'racetable' class
table = soup.find('table', {"class": "racetable"})
if table is None:
    raise ValueError(f"No table with class 'racetable' found in {file_path}")
# All rows of the table except the first one
rows_table = table.find_all('tr')[1:]

# The first remaining row holds the column names; strip whitespace so the
# .index() lookups below match exactly
columns_name = [
    cell.get_text().strip() for cell in rows_table[0].find_all('td')
]

runners = []
# Iterate over every row after the column-name row
for row in rows_table[1:]:
    data = [
        elem.get_text().strip() for elem in row.find_all('td')
    ]
    runner = {
        "place": data[columns_name.index("Place")],
        "name": data[columns_name.index("Name")],
        "city": data[columns_name.index("City")],
        "bib_no": data[columns_name.index("Bib No")],
        "age": data[columns_name.index("Age")],
        "gender": data[columns_name.index("Gender")],
        "age_group": data[columns_name.index("Age Group")],
        "total_time": data[columns_name.index("Total Time")],
        "pace": data[columns_name.index("Pace")]
    }
    print(runner)
    runners.append(runner)

The printed output looks like this:

{'place': '1', 'name': 'Runner 1', 'city': 'ANYTOWN  PA', 'bib_no': '390', 'age': '52', 'gender': 'M', 'age_group': '1:Overall', 'total_time': '18:43.93', 'pace': '6:03/M'}
{'place': '2', 'name': 'Runner 2', 'city': 'ANYTOWN  PA', 'bib_no': '380', 'age': '33', 'gender': 'M', 'age_group': '1:19-39', 'total_time': '19:31.27', 'pace': '6:18/M'}
{'place': '3', 'name': 'Runner 3', 'city': 'ANYTOWN  PA', 'bib_no': '389', 'age': '65', 'gender': 'F', 'age_group': '1:Overall', 'total_time': '45:45.20', 'pace': '14:46/M'}
{'place': '4', 'name': 'Runner 4', 'city': 'ANYTOWN  PA', 'bib_no': '381', 'age': '18', 'gender': 'F', 'age_group': '1: 1-18', 'total_time': '53:28.84', 'pace': '17:15/M'}
{'place': '5', 'name': 'Runner 5', 'city': 'ANYTOWN  PA', 'bib_no': '382', 'age': '41', 'gender': 'F', 'age_group': '1:40-59', 'total_time': '53:30.48', 'pace': '17:16/M'}
{'place': '6', 'name': 'Runner 6', 'city': 'ANYTOWN  PA', 'bib_no': '384', 'age': '14', 'gender': 'M', 'age_group': '1: 1-18', 'total_time': '57:38.66', 'pace': '18:36/M'}
{'place': '7', 'name': 'Runner 7', 'city': 'ANYTOWN  PA', 'bib_no': '385', 'age': '72', 'gender': 'F', 'age_group': '1:60-99', 'total_time': '57:40.11', 'pace': '18:36/M'}
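
Since the question mentions a few thousand files that end up as CSV for a database, the same idea could be wrapped in a loop over a folder. The following is only a sketch under assumptions: the function names `parse_race_table` and `export_folder`, the `source_file` column, and the folder layout are hypothetical. It also assumes every file uses the `racetable` class with a title row, then a header row, then data rows, as in the code above.

```python
import csv
from pathlib import Path

from bs4 import BeautifulSoup


def parse_race_table(html):
    """Return one dict per runner, keyed by the table's own column names."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="racetable")
    if table is None:  # page without a results table
        return []
    # Assumed layout: title row, header row, then data rows.
    rows = table.find_all("tr")[1:]
    if not rows:
        return []
    header = [td.get_text(strip=True) for td in rows[0].find_all("td")]
    runners = []
    for row in rows[1:]:
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) == len(header):  # skip spacer or malformed rows
            runners.append(dict(zip(header, cells)))
    return runners


def export_folder(html_dir, csv_path):
    """Parse every .html file in html_dir and write all runners to one CSV.

    Returns the number of runner rows written.
    """
    all_rows = []
    for path in sorted(Path(html_dir).glob("*.html")):
        html = path.read_text(encoding="utf-8")
        for runner in parse_race_table(html):
            # Record which file each row came from (hypothetical extra column).
            all_rows.append({"source_file": path.name, **runner})
    if not all_rows:
        return 0
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        # Columns are taken from the first row; extra keys in later rows
        # are ignored rather than raising an error.
        writer = csv.DictWriter(
            f, fieldnames=list(all_rows[0]), extrasaction="ignore"
        )
        writer.writeheader()
        writer.writerows(all_rows)
    return len(all_rows)
```

A call such as `export_folder("results", "runners.csv")` (both names are placeholders) would then produce a single CSV that can be loaded into the database in one step. Files whose table is missing are skipped silently here; logging them instead may be preferable when checking thousands of pages.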