Let's analyze the Baidu Index, using 爬虫 (web crawler) as our keyword.
Open the developer tools with F12, refresh the page, then click through Network -> XHR -> index?area=0&word=... -> Preview, and you'll see something like this:
What on earth is all this? The data field is obviously encrypted. Enough to make you tear your hair out. Let's set that aside for now and keep going.
https://index.baidu.com/api/SearchApi/index?area=0&word=[[%7B%22name%22:%22爬虫%22,%22wordType%22:1%7D]]&startDate=2011-01-02&endDate=2022-01-02
Clearly, the URL takes three parameters: word (the keyword), startDate, and endDate. If you can get these three under control, the data is as good as in hand!
It's a GET request, and it returns JSON encoded as UTF-8.
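Decoded, the word parameter is just a URL-encoded JSON array. As a quick sanity check, here is a minimal sketch (standard library only; the payload shape is read straight off the URL above) of building the same URL yourself:

import json
from urllib.parse import quote

keyword = "爬虫"
# decoded, word is: [[{"name": "爬虫", "wordType": 1}]]
payload = json.dumps([[{"name": keyword, "wordType": 1}]],
                     ensure_ascii=False, separators=(',', ':'))
url = ("https://index.baidu.com/api/SearchApi/index"
       "?area=0&word={}&startDate=2011-01-02&endDate=2022-01-02").format(quote(payload, safe='[]'))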
Now that we know the URL pattern and the response format, the task is simply to construct the URL and request the data:
url = "https://index.baidu.com/api/SearchApi/index?area=0&word=[[%7B%22name%22:%22{}%22,%22wordType%22:1%7D]]&startDate=2011-01-02&endDate=2022-01-02".format(keyword)
So let's just go for it and request it directly. For convenience, we wrap the request logic in a function get_html(url): pass in the URL, get back the response body.
import requests

def get_html(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36",
        "Host": "index.baidu.com",
        "Referer": "http://index.baidu.com/v2/main/index.html",
    }
    cookies = {
        "Cookie": "YOUR_COOKIE",  # paste your own cookie string here
    }
    response = requests.get(url, headers=headers, cookies=cookies)
    return response.text
Note: you must replace YOUR_COOKIE with your own cookie, or the request will come back empty. To get it, find the same index request in the Network panel and copy the Cookie value from its request headers.
Next, parse the response as JSON and pull out the field we need:
import json

# get_html is the same function defined above
def get_data(keyword):
    url = "https://index.baidu.com/api/SearchApi/index?area=0&word=[[%7B%22name%22:%22{}%22,%22wordType%22:1%7D]]&startDate=2011-01-02&endDate=2022-01-02".format(keyword)
    data = get_html(url)
    data = json.loads(data)
    data = data['data']['userIndexes'][0]['all']['data']
OK, that's the data fetched, see you next time, bye~
...Just kidding. So how do we actually handle this encrypted data???
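For orientation, the relevant slice of the response looks roughly like this (field names as they appear in Preview; the values here are placeholders, not real data):

response_shape = {
    "data": {
        "userIndexes": [{
            "all":  {"data": "<encrypted string>"},   # overall index
            "pc":   {"data": "<encrypted string>"},   # PC-only index
            "wise": {"data": "<encrypted string>"},   # mobile-only index
        }],
        "uniqid": "<used later to fetch the decryption key>",
    },
}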
Looking more closely: data is evidently encrypted; all is the overall index, pc the PC-only index, and wise the mobile index, all of which you can confirm in the site's JS files. First, let's work out how this encrypted-looking data gets decrypted. We know the payload is JSON, so the page's own code must be extracting these data fields somewhere. Refresh the page so all the JS files load, then use the developer tools' search to hunt through them. I'll spare you the screenshots; searching for decrypt led me to a JS file containing a method named decrypt, which translates to Python as:
def decrypt(t, e):
    n = list(t)
    a = {}
    result = []
    ln = int(len(n) / 2)
    end = n[:ln]    # first half of the key: the cipher characters
    start = n[ln:]  # second half: the plaintext characters they stand for
    for j, k in zip(start, end):
        a[k] = j    # map each cipher character to its plaintext character
    for j in e:
        result.append(a.get(j))
    return ''.join(result)
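A toy run (made-up key and ciphertext, not real Baidu output) shows the idea: the first half of t lists the cipher characters and the second half what they decode to, so with the real key, e decrypts to a comma-separated string of index values.

# key "abcd1234" maps a->1, b->2, c->3, d->4
print(decrypt("abcd1234", "badcab"))  # prints 214312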
By now you might think we're done, but we still don't know what the parameter t is or where it comes from. I won't walk through the whole trace here (do try it yourself); cutting straight to the result: t is obtained by sending the uniqid to another endpoint.
This t is actually called ptbk, and it's fetched from http://index.baidu.com/Interface/ptbk?uniqid= with a single parameter, uniqid. It's a GET request and returns JSON.
The data field in that response is our t, and the data inside all from the previous step is our e.
def get_ptbk(uniqid):
    url = 'http://index.baidu.com/Interface/ptbk?uniqid={}'
    resp = get_html(url.format(uniqid))
    return json.loads(resp)['data']
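Putting the pieces together, the flow is two requests plus one decrypt (a sketch; url, get_html, decrypt, and get_ptbk are as defined above):

raw = json.loads(get_html(url))                           # request 1: SearchApi/index
uniqid = raw['data']['uniqid']
encrypted = raw['data']['userIndexes'][0]['all']['data']
ptbk = get_ptbk(uniqid)                                   # request 2: Interface/ptbk
plaintext = decrypt(ptbk, encrypted)                      # comma-separated index values

And here is the full script: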
# -*- coding:utf-8 -*-
# @time: 2022/1/4 8:35
# @Author: 韩国麦当劳
# @Environment: Python 3.7
import datetime
import json

import requests


def get_html(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36",
        "Host": "index.baidu.com",
        "Referer": "http://index.baidu.com/v2/main/index.html",
    }
    cookies = {
        "Cookie": "YOUR_COOKIE",  # replace with your own cookie
    }
    response = requests.get(url, headers=headers, cookies=cookies)
    return response.text


def decrypt(t, e):
    n = list(t)
    a = {}
    result = []
    ln = int(len(n) / 2)
    end = n[:ln]    # cipher characters
    start = n[ln:]  # plaintext characters
    for j, k in zip(start, end):
        a[k] = j
    for j in e:
        result.append(a.get(j))
    return ''.join(result)


def get_ptbk(uniqid):
    url = 'http://index.baidu.com/Interface/ptbk?uniqid={}'
    resp = get_html(url.format(uniqid))
    return json.loads(resp)['data']


def get_data(keyword, start='2011-01-02', end='2022-01-02'):
    url = "https://index.baidu.com/api/SearchApi/index?area=0&word=[[%7B%22name%22:%22{}%22,%22wordType%22:1%7D]]&startDate={}&endDate={}".format(keyword, start, end)
    data = get_html(url)
    data = json.loads(data)
    uniqid = data['data']['uniqid']
    data = data['data']['userIndexes'][0]['all']['data']
    ptbk = get_ptbk(uniqid)
    result = decrypt(ptbk, data)
    result = result.split(',')
    start = start.split("-")  # use the function's own parameters, not the globals
    end = end.split("-")
    a = datetime.date(int(start[0]), int(start[1]), int(start[2]))
    b = datetime.date(int(end[0]), int(end[1]), int(end[2]))
    node = 0
    for i in range(a.toordinal(), b.toordinal()):
        date = datetime.date.fromordinal(i)
        print(date, result[node])
        node += 1


if __name__ == '__main__':
    keyword = "爬虫"
    start_date = "2011-01-02"
    end_date = "2022-01-02"
    get_data(keyword, start_date, end_date)
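If you'd rather save the results than just print them, one possible tweak (only a sketch, using the standard csv module; the file name is made up) replaces the print loop inside get_data with a CSV writer. Iterating over result directly also avoids reading past the end if the API returns fewer points than days:

import csv

# inside get_data, in place of the print loop:
with open('baidu_index_{}.csv'.format(keyword), 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerow(['date', 'index'])
    for offset, value in enumerate(result):
        writer.writerow([datetime.date.fromordinal(a.toordinal() + offset), value])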
If this helped, a like, a favorite, and a share are much appreciated!
Which site's scraper would you like to see next? Leave a comment, and the next one analyzed might be exactly the one you asked for!