What is a web crawler: the process of writing a program that simulates a browser surfing the web and then sending it off to grab data from the internet.
Is crawling actually legal or illegal?
The risks brought by crawling can show up in the following two areas:
How do you avoid ending up on the wrong side of the law while writing and using a crawler?
How crawlers are classified by usage scenario: general-purpose crawlers, which fetch an entire page, and focused crawlers, which extract specific data from a page (both appear in the examples below).
The "spear and shield" of crawling: websites deploy anti-crawling mechanisms (such as the UA detection described later), and crawlers respond with counter-strategies (such as UA spoofing).
The robots.txt protocol: a gentlemen's agreement that states which data on a site may be crawled and which may not (see the sketch below for checking it programmatically).
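As a minimal illustration (not from the original notes), the following sketch uses Python's standard urllib.robotparser to honor the agreement before fetching a page; example.com and the path are placeholders:

from urllib.robotparser import RobotFileParser

# download and parse the site's robots.txt (example.com is a placeholder domain)
rp = RobotFileParser()
rp.set_url('https://www.example.com/robots.txt')
rp.read()

# ask whether a given user agent is allowed to fetch a given URL before crawling it
print(rp.can_fetch('*', 'https://www.example.com/some/page'))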
HTTP protocol concept: a form of data exchange between a server and a client.
Common request headers: User-Agent (the identity of the request carrier) and Connection (whether to keep the connection alive after the request finishes).
Common response headers: Content-Type (the data type of the response body the server returns); see the sketch below for inspecting headers with requests.
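To see both kinds of headers in practice, here is a small sketch (not part of the original notes; any URL works, https://www.sogou.com/ is simply reused from the example further down) that prints the request headers requests actually sent and the response headers the server returned:

import requests

response = requests.get('https://www.sogou.com/')

# headers that were sent with the request, including the default User-Agent
print(response.request.headers)

# headers the server sent back; Content-Type tells you how to interpret the body
print(response.headers.get('Content-Type'))
print(response.headers)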
HTTPS concept: the secure hypertext transfer protocol.
Asymmetric key encryption
Drawbacks: first, when the receiving end sends its public key to the sending end, there is no guarantee that what arrives is the intended key rather than one planted by an interceptor; any key that is transmitted can be hijacked. Second, asymmetric encryption is relatively inefficient: the processing is heavier, so using it for the whole conversation slows communication down.
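For a rough illustration of the idea only (not part of the original notes; it assumes the third-party cryptography package, installed with pip install cryptography), the sketch below generates an RSA key pair, encrypts with the public key and decrypts with the private key:

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

# the receiver generates a key pair and publishes only the public key
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

# the sender encrypts with the public key...
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
ciphertext = public_key.encrypt(b'secret session key', oaep)

# ...and only the holder of the private key can decrypt.
# If an attacker swaps the public key in transit, the sender cannot tell --
# exactly the first drawback above, and what certificates are meant to address.
print(private_key.decrypt(ciphertext, oaep))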
Certificate-based key encryption: a trusted certificate authority signs the server's public key, so the client can verify that the key it receives really comes from the server and was not swapped out in transit.
The requests module: a Python library for sending network requests that is powerful, simple to use and very efficient. (Note: it is a third-party package, not part of the standard library; see the installation step below.)
Purpose: simulate a browser sending requests.
How to use it (the requests coding workflow): specify the URL -> send the request -> get the response data -> persist the data.
Environment setup: pip install requests
Hands-on coding:
import requests

if __name__ == '__main__':
    # step 1: specify the url
    url = 'https://www.sogou.com/'
    # step 2: send the request; get() returns a response object
    response = requests.get(url=url)
    # step 3: get the response data; .text is the response body as a string
    page_text = response.text
    print(page_text)
    # step 4: persist the data
    with open('./sogou.html', 'w', encoding='utf-8') as fp:
        fp.write(page_text)
    print('Finished crawling!')
# UA: User-Agent, the identity of the request carrier.
# UA detection: a portal site's server inspects the User-Agent of each incoming request.
# If the carrier identifies itself as a normal browser, the request is treated as legitimate;
# otherwise the request is treated as abnormal (a crawler) and the server is likely to reject it.
# UA spoofing: disguise the crawler's User-Agent as that of a real browser to get past UA detection.
import requests

if __name__ == '__main__':
    # UA spoofing: wrap the User-Agent in a dict
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
    }
    # step 1: specify the url; the query is carried as a parameter
    url = 'https://www.sogou.com/web'
    # pack the url parameters into a dict
    kw = input('Enter a word:')
    param = {
        'query': kw
    }
    # step 2: send a GET request to the url with the parameters attached
    response = requests.get(url=url, params=param, headers=headers)
    # step 3: get the response data
    page_text = response.text
    # step 4: persist the data
    fileName = kw + '.html'
    with open(fileName, 'w', encoding='utf-8') as fp:
        fp.write(page_text)
    print(fileName, 'saved successfully!!')
import requests
import json

if __name__ == '__main__':
    # step 1: specify the URL
    post_url = 'https://fanyi.baidu.com/sug'
    # step 2: UA spoofing
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
    }
    # step 3: build the POST parameters (much like a GET request)
    word = input('Enter a word: ')
    data = {
        'kw': word
    }
    # step 4: send the request
    response = requests.post(url=post_url, data=data, headers=headers)
    # step 5: get the response data; json() returns a Python object
    # (only call json() if the response really is JSON -- check the Content-Type response header)
    dict_obj = response.json()
    print(dict_obj)
    # step 6: persist the data
    fileName = word + '.json'
    fp = open(fileName, 'w', encoding='utf-8')
    json.dump(dict_obj, fp=fp, ensure_ascii=False)
    print('Over!')
import requests
import json

if __name__ == '__main__':
    url = 'https://movie.douban.com/j/chart/top_list'
    param = {
        'type': '24',
        'interval_id': '100:90',
        'action': '',
        'start': '0',    # index of the first movie to fetch
        'limit': '20'    # number of movies to fetch per request
    }
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
    }
    response = requests.get(url=url, params=param, headers=headers)
    list_data = response.json()
    fp = open('./douban.json', 'w', encoding='utf-8')
    json.dump(list_data, fp=fp, ensure_ascii=False)
    print('Over!')
import requests
import json

if __name__ == '__main__':
    post_url = 'https://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'
    keyword = input('Enter the city to search: ')
    data = {
        'cname': '',
        'pid': '',
        'keyword': keyword,
        'pageindex': '1',
        'pageSize': '10'
    }
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
    }
    response = requests.post(url=post_url, data=data, headers=headers)

    # Option 1: persist to a file
    # page_text = response.text
    # fileName = keyword + '.html'
    # with open(fileName, 'w', encoding='utf-8') as fp:
    #     fp.write(page_text)
    # print(fileName, 'Over!')

    # Option 2: print the results directly
    page = response.json()
    for store in page['Table1']:
        StoreName = store['storeName']
        address = store['addressDetail']
        print('StoreName: ' + StoreName, 'address: ' + address + '\n')
Requirement: scrape every company's licence detail data from the NMPA portal (http://scxk.nmpa.gov.cn:81/xk/). Both the company list and the detail pages are loaded dynamically, so the data has to be requested from the page's POST interfaces rather than parsed out of the page HTML.
import requests
import json

if __name__ == '__main__':
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
    }
    id_list = []        # stores the company ids
    all_data_list = []  # stores the detail data of every company

    # Fetch the company ids in batches
    url = 'http://scxk.nmpa.gov.cn:81/xk/itownet/portalAction.do?method=getXkzsList'
    for page in range(1, 11):
        # pack the parameters for the current page
        data = {
            'on': 'true',
            'page': str(page),
            'pageSize': '15',
            'productName': '',
            'conditionType': '1',
            'applyname': '',
            'applysn': '',
        }
        json_ids = requests.post(url=url, headers=headers, data=data).json()
        # take the value of the 'list' key from json_ids and iterate over it
        for dic in json_ids['list']:
            id_list.append(dic['ID'])
    # print(id_list)

    # The detail data is also loaded dynamically; the interface takes a single
    # parameter `id`, whose values come from the id list built above
    post_url = 'http://scxk.nmpa.gov.cn:81/xk/itownet/portalAction.do?method=getXkzsById'
    for company_id in id_list:
        data = {
            'id': company_id
        }
        json_detail = requests.post(url=post_url, data=data, headers=headers).json()
        # print(json_detail, '-------------END----------')
        all_data_list.append(json_detail)
        all_data_list.append('---------------------------------------------------------')

    # persist all_data_list
    fp = open('./allData.json', 'w', encoding='utf-8')
    json.dump(all_data_list, fp=fp, ensure_ascii=False, indent=4)  # indent: pretty-print the output
    print('Over!')
Common regular expressions

Single characters:
    .    : any character except a newline
    []   : a character class, e.g. [aoe], [a-w] -- matches any one character in the set
    \d   : a digit, same as [0-9]
    \D   : a non-digit
    \w   : a digit, letter, underscore or Chinese character
    \W   : anything \w does not match
    \s   : any whitespace character (space, tab, form feed, etc.), equivalent to [ \f\n\r\t\v]
    \S   : a non-whitespace character

Quantifiers:
    *      : any number of times (>= 0)
    +      : at least once (>= 1)
    ?      : optional, 0 or 1 time
    {m}    : exactly m times
    {m,}   : at least m times, e.g. hello{3,}
    {m,n}  : between m and n times

Anchors:
    $ : ends with ...
    ^ : starts with ...

Grouping:
    (ab)

Greedy mode: .*
Non-greedy (lazy) mode: .*?

Flags:
    re.I : ignore case
    re.M : multi-line matching
    re.S : let . match newlines too (used to match across a whole page)

re.sub(pattern, replacement, string) : replace every match of the pattern in the string
Regex practice

import re

# extract 'python'
key = 'javapythonc++php'
re.findall('python', key)[0]

# extract 'hello world'
key = '<html><h1>hello world<h1></html>'
re.findall('<h1>(.*)<h1>', key)[0]

# extract '170'
string = '我喜欢身高为170的女孩'
re.findall(r'\d+', string)

# extract 'http://' and 'https://'
key = 'http://www.baidu.com and https://boob.com'
re.findall('https?://', key)

# extract the content between the <html> tags, matching the tags case-insensitively
key = 'lalala<hTml><hello></HtMl>hahah'
re.findall('<[Hh][Tt][mM][lL]>(.*)</[Hh][Tt][mM][lL]>', key)

# extract 'hit.'
key = 'bobo@hit.edu.com'
re.findall(r'h.*?\.', key)

# match 'sas' and 'saas'
key = 'sasa and sas and saaas'
re.findall('sa{1,2}s', key)
import requests

if __name__ == '__main__':
    # how to crawl an image
    url = 'https://pic.qiushibaike.com/system/pictures/12409/124098453/medium/YNPHJQC101MS31E1.jpg'
    # .content returns the image data in binary form
    # .text -> string, .content -> bytes, .json() -> Python object
    img_data = requests.get(url=url).content
    with open('./qiutu.jpg', 'wb') as fp:
        fp.write(img_data)
# Requirement: crawl all images from the 糗图 board of 糗事百科 (Qiushibaike)
# Each image on the page sits inside a block like this:
# <div class="thumb">
#     <a href="/article/124098472" target="_blank">
#         <img src="//pic.qiushibaike.com/system/pictures/12409/124098472/medium/HSN2WWN0TP1VUPNG.jpg"
#              alt="糗事#124098472" class="illustration" width="100%" height="auto">
#     </a>
# </div>
import re
import os
import requests

if __name__ == '__main__':
    # create a folder to hold all the images
    if not os.path.exists('./qiutuLibs'):
        os.mkdir('./qiutuLibs')
    url = 'https://www.qiushibaike.com/imgrank/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
    }
    # general-purpose crawl: fetch the whole page
    page_text = requests.get(url=url, headers=headers).text
    # print(page_text)
    # focused crawl: parse all image urls out of the page
    ex = '<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>'
    img_src_list = re.findall(ex, page_text, re.S)
    print(img_src_list)
    for src in img_src_list:
        # build the full image url
        src = 'https:' + src
        img_data = requests.get(url=src, headers=headers).content
        # derive the image file name from the url
        img_name = src.split('/')[-1]
        imgPath = './qiutuLibs/' + img_name
        with open(imgPath, 'wb') as fp:
            fp.write(img_data)
        print(img_name, 'downloaded!')
# A further refinement of the code above: crawl the images page by page
import re
import os
import requests

if __name__ == '__main__':
    # create a folder to hold all the images
    if not os.path.exists('./qiutuLibs'):
        os.mkdir('./qiutuLibs')
    # a generic url template for the page number
    url = 'https://www.qiushibaike.com/imgrank/page/%d/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
    }
    for pageNum in range(1, 11):
        # url for the current page number
        new_url = format(url % pageNum)
        # general-purpose crawl: fetch the whole page
        page_text = requests.get(url=new_url, headers=headers).text
        # focused crawl: parse all image urls out of the page
        ex = '<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>'
        img_src_list = re.findall(ex, page_text, re.S)
        print(img_src_list)
        for src in img_src_list:
            # build the full image url
            src = 'https:' + src
            img_data = requests.get(url=src, headers=headers).content
            # derive the image file name from the url
            img_name = src.split('/')[-1]
            imgPath = './qiutuLibs/' + img_name
            with open(imgPath, 'wb') as fp:
                fp.write(img_data)
            print(img_name, 'downloaded!')
<html lang="en">
<head>
    <meta charset="UTF-8" />
    <title>测试bs4</title>
</head>
<body>
    <div>
        <p>百里守约</p>
    </div>
    <div class="song">
        <p>李清照</p>
        <p>王安石</p>
        <p>苏轼</p>
        <p>柳宗元</p>
        <a title="赵匡胤" target="_self">
            <span>this is span</span>
            宋朝是最强大的王朝,不是军队的强大,而是经济很强大,国民都很有钱</a>
        <a href="" class="du">总为浮云能蔽日,长安不见使人愁</a>
        <img src="http://www.baidu.com/meinv.jpg" alt="" />
    </div>
    <div class="tang">
        <ul>
            <li><a title="qing">清明时节雨纷纷,路上行人欲断魂,借问酒家何处有,牧童遥指杏花村</a></li>
            <li><a title="qin">秦时明月汉时关,万里长征人未还,但使龙城飞将在,不教胡马度阴山</a></li>
            <li><a alt="qi">岐王宅里寻常见,崔九堂前几度闻,正是江南好风景,落花时节又逢君</a></li>
            <li><a class="du">杜甫</a></li>
            <li><a class="du">杜牧</a></li>
            <li><b>杜小月</b></li>
            <li><i>度蜜月</i></li>
            <li><a id="feng">凤凰台上凤凰游,凤去台空江自流,吴宫花草埋幽径,晋代衣冠成古丘</a></li>
        </ul>
    </div>
</body>
</html>
from bs4 import BeautifulSoup

if __name__ == '__main__':
    # load the data of the local html document into a BeautifulSoup object
    fp = open('./test.html', 'r', encoding='utf-8')
    soup = BeautifulSoup(fp, 'lxml')
    # print(soup)
    # for a live page it would be:
    # page_text = response.text
    # soup = BeautifulSoup(page_text, 'lxml')
    print(soup.a)                    # soup.tagName returns the first tagName tag in the html
    print(soup.div)
    print(soup.find('div'))          # find(tagName) is equivalent to soup.div
    print(soup.find('div', class_='song'))  # locate by attribute
    print(soup.find_all('a'))        # all matching tags, returned as a list
    print(soup.select('.tang'))      # returns a list
    print(soup.select('.tang > ul > li > a')[0])  # hierarchy selector; > means one level
    print(soup.select('.tang > ul a')[0])         # a space means one or more levels
    print(soup.select('.tang > ul a')[0].text)
    print(soup.select('.tang > ul a')[0].get_text())
    print(soup.select('.tang > ul a')[0].string)
    print(soup.select('.tang > ul a')[0]['href'])
# Requirement: crawl the title and content of every chapter of 三国演义
# https://www.shicimingju.com/book/sanguoyanyi.html
import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    # crawl the table-of-contents page
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
    }
    url = 'https://www.shicimingju.com/book/sanguoyanyi.html'
    response = requests.get(url=url, headers=headers)
    response.encoding = 'utf-8'
    page_text = response.text
    # parse the chapter titles and the detail-page urls out of the home page:
    # instantiate a BeautifulSoup object and load the page source into it
    soup = BeautifulSoup(page_text, 'lxml')
    li_list = soup.select('.book-mulu > ul > li')
    fp = open('./sanguo.txt', 'w', encoding='utf-8')
    for li in li_list:
        title = li.a.string
        detail_url = 'http://www.shicimingju.com' + li.a['href']
        # request the detail page and parse out the chapter content
        detail_response = requests.get(url=detail_url, headers=headers)
        detail_response.encoding = 'utf-8'
        detail_page_text = detail_response.text
        detail_soup = BeautifulSoup(detail_page_text, 'lxml')
        div_tag = detail_soup.find('div', class_='chapter_content')
        content = div_tag.text  # the chapter content
        fp.write(title + ':' + content + '\n')
        print(title, 'crawled successfully!')
from lxml import etree

if __name__ == "__main__":
    # instantiate an etree object and load the source to be parsed into it
    tree = etree.parse('test.html')
    # r = tree.xpath('/html/body/div')
    # r = tree.xpath('/html//div')
    # r = tree.xpath('//div')
    # r = tree.xpath('//div[@class="song"]')
    # r = tree.xpath('//div[@class="tang"]//li[5]/a/text()')[0]
    # r = tree.xpath('//li[7]//text()')
    # r = tree.xpath('//div[@class="tang"]//text()')
    r = tree.xpath('//div[@class="song"]/img/@src')
    print(r)
# Requirement: parse and download the images from http://pic.netbian.com/4kmeinv/
import requests
from lxml import etree
import os

if __name__ == "__main__":
    url = 'http://pic.netbian.com/4kmeinv/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
    }
    response = requests.get(url=url, headers=headers)
    # manually set the response encoding if needed
    # response.encoding = 'utf-8'
    page_text = response.text
    # parse out the src and alt attributes of each image
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//div[@class="slist"]/ul/li')
    # create a folder for the images
    if not os.path.exists('./picLibs'):
        os.mkdir('./picLibs')
    for li in li_list:
        img_src = 'http://pic.netbian.com' + li.xpath('./a/img/@src')[0]
        img_name = li.xpath('./a/img/@alt')[0] + '.jpg'
        # generic fix for garbled Chinese file names
        img_name = img_name.encode('iso-8859-1').decode('gbk')
        # print(img_name, img_src)
        # request the image data and persist it
        img_data = requests.get(url=img_src, headers=headers).content
        img_path = 'picLibs/' + img_name
        with open(img_path, 'wb') as fp:
            fp.write(img_data)
        print(img_name, 'downloaded!!!')
    print('------------------------OVER!---------------------------------')
# Requirement: parse out all city names from https://www.aqistudy.cn/historydata/
import requests
from lxml import etree

if __name__ == '__main__':
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
    }
    url = 'https://www.aqistudy.cn/historydata/'
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)

    # Method 1: parse the hot cities and the full city list separately
    hot_li_list = tree.xpath('//div[@class="bottom"]/ul/li')
    all_city_names = []
    # hot city names
    for li in hot_li_list:
        hot_city_name = li.xpath('./a/text()')[0]
        all_city_names.append(hot_city_name)
    # all remaining city names
    city_names_list = tree.xpath('//div[@class="bottom"]/ul/div[2]/li')
    for li in city_names_list:
        city_name = li.xpath('./a/text()')[0]
        all_city_names.append(city_name)
    print(all_city_names, len(all_city_names))

    # Method 2: parse both groups with a single expression (reusing the tree parsed above)
    # hot city tag hierarchy:  div/ul/li/a
    # all city tag hierarchy:  div/ul/div[2]/li/a
    a_list = tree.xpath('//div[@class="bottom"]/ul/li/a | //div[@class="bottom"]/ul/div[2]/li/a')
    all_city_names = []
    for a in a_list:
        a_name = a.xpath('./text()')[0]
        all_city_names.append(a_name)
    print(all_city_names, len(all_city_names))
CAPTCHAs and crawlers: a love-hate relationship.
How to recognize a CAPTCHA: the example below uses a third-party recognition service (超级鹰 / Chaojiying) to upload the CAPTCHA image and get the recognized text back.
#!/usr/bin/env python
# coding:utf-8
import requests
from hashlib import md5


class Chaojiying_Client(object):

    def __init__(self, username, password, soft_id):
        self.username = username
        password = password.encode('utf8')
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        """
        im: the image bytes
        codetype: the captcha type, see http://www.chaojiying.com/price.html
        """
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php',
                          data=params, files=files, headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        """
        im_id: the image ID of a wrongly recognized captcha
        """
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php',
                          data=params, headers=self.headers)
        return r.json()


def tranformImgCode(imgPath, imgType):
    # fill in your own account, password and software ID (user center >> software ID)
    chaojiying = Chaojiying_Client('your account', 'your password', 'your software ID')
    im = open(imgPath, 'rb').read()
    return chaojiying.PostPic(im, imgType)['pic_str']


# 1902 is the captcha type; see the official site >> price list (client version 3.4+)
print(tranformImgCode('./a.jpg', 1902))