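I have scraped a few static sites before without much trouble. This time I ran into AJAX. After reading a few docs it did not look hard, so I dove straight in, but I still got stuck...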
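Goal:
Scrape job postings from Dajie (大街网).
Process:
1. Use the browser's Inspect Element feature to find the URL that the dynamically loaded data comes from.
2. Configure the requests parameters according to the information shown there.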
data = {
    'keyword': 'python',
    'order': '0',
    'city': '',
    'recruitType': '',
    'salary': '',
    'experience': '',
    'page': '5',
    'positionFunction': '',
    '_CSRFToken': '',
    'ajax': '1'
}
headers = {
    'accept': 'application/json, text/javascript, */*; q=0.01',
    'accept-language': 'zh-CN,zh;q=0.8',
    'accept-encoding': 'gzip, deflate, sdch',
    'cookie': 'DJ_UVID=MTQ5MDMyMTExNTAzODM2MTc5; DJ_RF=empty; DJ_EU=http%3A%2F%2Fjob.dajie.com%2F; __login_tips=1; dj_cap=9c8c95bdef72e84a9bd7493a5ab91694; USER_ACTION="request^A-^A-^Ajobdetail:^A-"; SO_COOKIE_V2=0c7cGprjIH0q9RHc53CWLLXf151DQ5QvUP5ccPQj4g0B/izuXHm8sp41lJjJJh3nmjAkroj8JczFN/SCLPAUzbOHW7wYWmQ6Zu7s',
    'referer': 'https://so.dajie.com/job/search?keyword=%E9%A3%9E%E5%88%A9%E6%B5%A6&from=job&clicktype=blank',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
    'method': 'get'
}
3. Pass the query parameters and request headers to requests.get().
import requests
response = requests.get('https://so.dajie.com/job/ajax/search/filter', params=data, headers=headers)
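One thing that may be worth checking before sending: the 'method' entry in the headers dict is not a real HTTP request header (the method is part of the request line, and browsers that list it usually show it as the pseudo-header ':method'). Whether dropping it changes the server's answer is not known here. The following is only a rough sketch, assuming Python 3 and using hypothetical helper names (drop_pseudo_headers, preview_url), of how one might strip such entries and preview the URL that will be requested without touching the network:

```python
from urllib.parse import urlencode

# Only the URL and the parameter names come from the post;
# the helper names below are made up for illustration.
BASE_URL = 'https://so.dajie.com/job/ajax/search/filter'


def drop_pseudo_headers(headers):
    """Return a copy without entries that are not real request headers."""
    return {
        name: value
        for name, value in headers.items()
        if not name.startswith(':') and name.lower() != 'method'
    }


def preview_url(base_url, params):
    """Build the URL that a GET with these query parameters would request."""
    return base_url + '?' + urlencode(params)


if __name__ == '__main__':
    sample_headers = {'X-Requested-With': 'XMLHttpRequest', 'method': 'get'}
    sample_params = {'keyword': 'python', 'page': '5', 'ajax': '1'}
    print(drop_pseudo_headers(sample_headers))
    print(preview_url(BASE_URL, sample_params))
```

The cleaned dict can then be passed as headers= in place of the original one, and the previewed URL compared against the address the browser actually requested.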
4. Inspect what comes back.
print response.url
print ''
print response.request.headers
print ''
print response.headers
print ''
print response.content[-1000:]
print ''
print response
5. Why is the result not the JSON data I expected...?
response.url:
https://so.dajie.com/job/ajax/search/filter?salary=&city=&ajax=1&positionFunction=&_CSRFToken=&keyword=python&recruitType=&order=0&experience=&page=5
response.request.headers:
{'accept-language': 'zh-CN,zh;q=0.8', 'accept-encoding': 'gzip, deflate, sdch', 'X-Requested-With': 'XMLHttpRequest', 'accept': 'application/json, text/javascript, */*; q=0.01', 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36', 'Connection': 'keep-alive', 'referer': 'https://so.dajie.com/job/search?keyword=%E9%A3%9E%E5%88%A9%E6%B5%A6&from=job&clicktype=blank', 'cookie': 'DJ_UVID=MTQ5MDMyMTExNTAzODM2MTc5; DJ_RF=empty; DJ_EU=http%3A%2F%2Fjob.dajie.com%2F; __login_tips=1; dj_cap=9c8c95bdef72e84a9bd7493a5ab91694; USER_ACTION="request^A-^A-^Ajobdetail:^A-"; SO_COOKIE_V2=0c7cGprjIH0q9RHc53CWLLXf151DQ5QvUP5ccPQj4g0B/izuXHm8sp41lJjJJh3nmjAkroj8JczFN/SCLPAUzbOHW7wYWmQ6Zu7s', 'method': 'get'}
response.headers:
{'Date': 'Wed, 19 Apr 2017 02:00:47 GMT', 'Content-Length': '5944', 'ETag': '"552f21de-1738"', 'Content-Type': 'text/html; charset=UTF-8', 'Connection': 'keep-alive'}
response.content[-1000:]:
,这个页面去火星了,试试搜索一下吧: (in English: ", this page has gone to Mars, try a search instead:")
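The response headers above report Content-Type 'text/html; charset=UTF-8', which suggests the server sent back an HTML error page rather than JSON. A small check like the one below can make that explicit before any parsing is attempted. It is a sketch under assumptions: Python 3, and looks_like_json is a hypothetical helper name, not part of requests.

```python
import json


def looks_like_json(content_type, body):
    """Return True only if the header claims JSON and the body parses as JSON."""
    if 'json' not in content_type.lower():
        return False
    try:
        json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


if __name__ == '__main__':
    # Content-Type value copied from the response headers in the post.
    print(looks_like_json('text/html; charset=UTF-8', '<html></html>'))
    print(looks_like_json('application/json', '{"total": 0}'))
```

With a real response this could be called as looks_like_json(response.headers.get('Content-Type', ''), response.text), and the status code (response.status_code) printed alongside it to see whether the server answered with an error.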