
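This tutorial shows how to extract article content from a web page accurately with Python's BeautifulSoup library. Working through a real example, it exposes a common problem, a CSS class name in the selector that does not match the page, and gives the fix. Readers will learn to find the correct selector by inspecting the page source, which avoids failed extractions and makes a scraper more robust.
BeautifulSoup is a Python library for pulling data out of HTML and XML files. It parses a document and offers simple, Pythonic ways to search, navigate, and modify the parse tree. It is one of the most widely used tools for web scraping and is particularly well suited to static HTML content.
In practice, though, developers often find that extraction fails because a selector is not accurate. This article works through one concrete case of that problem and its solution.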
Suppose we want to extract the article text from a specific page (for example https://economictimes.indiatimes.com/industry/cons-products/food/heinz-braces-up-for-aggressive-marketing/articleshow/5417995.cms). We might write the following Python code:

from bs4 import BeautifulSoup
import requests

url = 'https://economictimes.indiatimes.com/industry/cons-products/food/heinz-braces-up-for-aggressive-marketing/articleshow/5417995.cms'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Try to locate the article body
article = soup.find('article', class_='artData clr paywall')
if article:
    # Try to locate the article text, using 'artText medium' as the class name
    content = article.find('div', class_='artText medium')
    text_contents = content.text.strip() if content else "No data"
else:
    text_contents = "No data"
print(text_contents)

Running this code, however, prints:
No data
This means the program did not find the target content. The parent article element was located, but the narrower selection inside it failed.
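How find() treats class_ depends on the string it receives. A single class name, such as class_='artText', matches any element that has that class, whatever other classes it carries. A string with several space-separated names, such as class_='artText medium', is compared against the exact string value of the element's class attribute. So if an element has class="artText" and we search with class_='artText medium', find() does not return it, because the attribute's value is not that exact string.
That is what happened here: inspecting the target page's HTML shows that the div holding the article text has class="artText", not class="artText medium". The extra medium in the original code made the match fail.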
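The matching behaviour described above can be checked on a small document. The following is a minimal sketch; the markup is invented for illustration and is not copied from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; not the real page's HTML.
html = """
<article class="artData clr paywall">
  <div class="artText">Body text</div>
  <div class="artText medium">Other text</div>
</article>
"""
soup = BeautifulSoup(html, "html.parser")

# A single class name matches every element that carries that class.
single = soup.find_all("div", class_="artText")
# A multi-class string is compared with the exact attribute string.
exact = soup.find_all("div", class_="artText medium")
# The same classes in a different order are a different string.
reordered = soup.find_all("div", class_="medium artText")

print(len(single), len(exact), len(reordered))  # prints: 2 1 0
```

Only the second div is returned for 'artText medium', and nothing is returned when the order of the names is swapped, which is why a multi-class string is a fragile choice for class_.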
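The key to fixing this is to identify the target element's actual CSS class names. This usually means inspecting the page's HTML with the browser's developer tools (F12 in Chrome): open the page, right-click the article text, choose Inspect, and read the class attribute of the enclosing div.
Doing so confirms that the class attribute of the target div is artText, so the correct selector is class_='artText'.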
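The same check can be done from Python by printing the class lists that BeautifulSoup actually sees. This is a minimal sketch: list_div_classes is a hypothetical helper name, and the sample markup is invented for illustration.

```python
from bs4 import BeautifulSoup

def list_div_classes(html, parent_tag="article"):
    """Return the class list of every div under the first parent_tag."""
    soup = BeautifulSoup(html, "html.parser")
    parent = soup.find(parent_tag)
    if parent is None:
        return []
    # tag.get("class") is a list of class names, or None when absent
    return [div.get("class") or [] for div in parent.find_all("div")]

# Hypothetical markup for illustration only.
sample = (
    '<article class="artData clr paywall">'
    '<div class="artText">x</div><div>y</div>'
    '</article>'
)
print(list_div_classes(sample))  # prints: [['artText'], []]
```

Passing response.text to this helper shows the classes in the HTML the server returned, which can differ from what the browser's inspector displays if scripts modify the page after it loads.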
The corrected code is:

from bs4 import BeautifulSoup
import requests

url = 'https://economictimes.indiatimes.com/industry/cons-products/food/heinz-braces-up-for-aggressive-marketing/articleshow/5417995.cms'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Locate the article body (this part of the original code was correct)
article = soup.find('article', class_='artData clr paywall')
if article:
    # Fix: use the correct class name 'artText'
    content = article.find('div', class_='artText')
    text_contents = content.text.strip() if content else "No data"
else:
    text_contents = "No data"
print(text_contents)

Running the corrected code prints the expected article text:
'MUMBAI: US foods major Heinz, which owns brands such as Glucon D and Complan in India, has asked the Indian subsidiary to gun for more growth and scout for local acquisitions. It is ramping up investments in R&D and marketing. The aggression is in wake of the double digit growth rates recorded by markets such as India and China which has propelled Heinz’s global sales, said Chris Warmoth, executive vice-president, Asia-Pac, in an ET exclusive. The Rs-900 crore plus Heinz India competes with HUL, Nestle and Glaxo Smithkline. Consumer has intensified the localisation and regionalisation of its brands to cater to specific consumer needs and tastes."We have been dramatically increasing our investment in terms of marketing, building new factory, information systems in India. It is hard not to be extremely upbeat on India. I think we have a very strong organisation and we feel we really know India very well. We have two excellent brands in Complan and Glucon D and we got a lot of proven successes and a great new product pipeline,” said Warmoth. Heinz’s Asia-Pacific division also includes Japan and high-growth emerging markets such as China, India and Indonesia. During fiscal 2009, sales in emerging markets grew by 15.7% propelled by double-digit organic sales growth in these regions. The focus is on leveraging its first-mover advantage and go-to-market capabilities to drive accelerated growth, Warmoth said. After a couple of mistakes such as launching global food brands in a diverse consumer market, Heinz also known for its Heinz ketchup got its act together and focused on a more localised strategy of focusing on specific consumer needs and tastes across Indian markets, strengthened relationships with the customer and trade.Heinz India’s brands like Complan has a market share of 15.7% in the milk drinks segment while Glucon-D has a 62% in the glucose drinks segment with Nycil prickly heat powder at 36.8% and Heinz Ketchup at 2.2 %. 
Heinz has invested over Rs 300 crore in India since 2007 and is looking at another Rs 100 crore plus investment this year company officials said. Heinz relaunched Complan, launched Complan Nutri Bowl Muesli in TN, Complan Memory and Complan Milk Biscuits in AP with local flavours such as Strawberry and Kesar Badam. The company launched a top-down squeeze pack of Heinz Tomato Ketchup and recently introduced Heinz condiments portfolio with the launch of Heinz Kitchen Klassics, Ready To Eat range which is currently being test marketed in Mumbai. Another key brand from the Heinz portfolio – Glucon-D is available in three flavours – Natural, Orange and more localised Nimbu Paani across the country.“What we found out over the last 6-7 years we have been here is the country being what it is. The food challenges in India are very unique. So every 100 kilometers you drive in this country, the taste preferences change. So, we have learned our lessons and we also know that ketchup is just an entry point. We are looking at other Indian interpretations of ketchups, we are looking at other packaged food, we are looking at other sauces,” said N Thiruambalam, managing director of Heinz India. In 2009, Heinz sales in emerging markets grew 8.8% propelled by sales in India, Indonesia, Latin America and Poland. Emerging markets contribute now 14% of Heinz’s total sales. Heinz is now focusing on building strong operations in fast growing merging markets and stepping up investments in R&D and marketing to drive growth. Emerging markets are expected to contribute about a third of the company’s total global sales growth over the next two years.“We don’t start off necessarily with global brands because I think in food it is much harder to be global than in shampoos or washing detergents or feminine protection or whatever. 
If you look at lot of the brands we compete within, Glucose category is very Indian and even the flavoured milk segment is very Indian,” said Warmoth.So we start off with more local brands. But in terms of leveraging global scale we are very active. So we have something called the Heinz Marketing Academy, we have something called the Heinz Purchasing Academy, we have something called the Heinz Sales Academy, we have a manufacturing system called the Heinz Global Performance System, which is a standardized set up measures on running factories,"said Warmoth.Heinz is in the middle of a multi year process to roll out a global common information that allows it to start leveraging global scale and have a better view on the commodities when purchasing them.H. J. Heinz Company is a global marketers and producer of healthy, convenient and affordable foods specializing in ketchup, sauces, meals, soups, snacks and infant nutrition. Its leading branded products, including Heinz Ketchup, sauces, soups, beans, pasta and infant foods (representing over one third of Heinz’s total sales), Ore-Ida potato products, Weight Watchers Smart Onesentrees, Boston Marketmeals, T.G.I. Friday’s snacks, and Plasmon infant nutrition.'
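As an alternative to find() with class_, a CSS selector matches each class independently, so neither the order of the class names nor extra classes on the element matter. The following is a minimal sketch, assuming the article and div class names discussed above; extract_article_text is a hypothetical helper name and the sample markup is invented.

```python
from bs4 import BeautifulSoup

def extract_article_text(html):
    """Return the text of div.artText inside article.artData, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Each .name in a CSS selector is tested on its own, in any order
    node = soup.select_one("article.artData div.artText")
    return node.get_text(strip=True) if node else None

# Hypothetical markup: classes reordered and an extra class added.
sample = (
    '<article class="paywall artData clr">'
    '<div class="medium artText"> Hello </div>'
    '</article>'
)
print(extract_article_text(sample))  # prints: Hello
```

Whether this is preferable depends on the page: a looser selector survives small markup changes, but it can also match elements that an exact string would have excluded.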
Beyond precise selectors, a few other points deserve attention when scraping web pages.
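One such point, offered here as an assumption rather than something the original text spells out, is handling requests that fail or hang. The sketch below wraps requests.get with a timeout and a status check; fetch_html and the User-Agent value are hypothetical.

```python
import requests

def fetch_html(url, timeout=10.0):
    """Return the page HTML as text, or None if the request fails."""
    # Hypothetical User-Agent value, chosen only for illustration
    headers = {"User-Agent": "example-scraper/0.1"}
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        # Raise an exception for 4xx and 5xx responses
        response.raise_for_status()
    except requests.RequestException:
        return None
    return response.text
```

A caller would check the result for None before passing it to BeautifulSoup, in the same way the code above checks article and content before reading their text.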
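This tutorial examined a common cause of failed extraction with BeautifulSoup: a class selector that does not match the page. The fix is to identify the target element's actual class names and to remember that a multi-class string passed to class_ must equal the attribute's exact string value. Combined with inspecting the HTML in the browser's developer tools, this makes scraping with BeautifulSoup considerably more reliable. Following best practices also helps build scrapers that are more robust and more responsible.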