使用场景:如果爬取解析的数据不在同一张页面中(需要深度爬取)
在爬虫文件中,需要回调的函数那里添加一个meta
字典对象
使用方法:手动发送请求时,需要传入一个新定义回调函数,同时这个函数需要参数,那么就需要meta
来发送,在调用时用meta={}
字典形式发送,接收就用response.meta['xxx']
获取其中的字典
class BossSpider(scrapy.Spider): name = 'boss' start_urls = ['https://www.test.com/chaxun/'] def parse_detail(self,response): #获取传送过来的meta对象 item=response.meta['item'] job_desc=response.xpath('//div[2]/div/div[1]/ul/li[1]/a') job_desc=''.join(job_desc) item['job_desc']=job_desc def parse(self, response): div_list = response.xpath('//div[@class="shici_list_main"]') for div in div_list: item=BossproItem() job_name=div.xpath('.//div/h3/a/text()').extract() item['job_name']=job_name detail_url='https://www.test.com'+div.xpath('.//div/h3/h4/text()').extract_first() #手动请求发送 #请求传参,根据meta={},可以将meta字典传送给请求对应的回调函数 yield scrapy.Request(detail_url,callback=self.parse_detail,meta={'item':item})
图片数据爬取之ImagesPipeline
基于scrapy
爬取字符串类型的数据和爬取图片类型的数据区别:
xpath
进行解析且提交管道进行持久化存储xpath
解析出图片src
属性值,单独的对图片地址发起请求获取图片二进制类型的数据ImagesPipeline
只需要将img
的src
属性值进行解析,提交到管道,管道就会对图片的src
进行发送获取图片的二进制类型数据,且还会帮我们进行持久化存储注意
:有些网站会对图片的src
标签修改为src2
,只有当图片标签定位到当前窗口才会变为正常标签属性,这样可以避免加载过多图片,减轻服务器压力
使用步骤:
item
提交到判定的管道类imagesPipeLine
的一个管道类,并重写三个方法:get_media_requests
,file_path
,item_completed
settings.py
指定刚刚自定义的定制的管道类,指定图片存储目录:IMAGES_STORE='./imgs'
图片爬虫文件,里面包含了主要的解析类文件信息
爬虫文件
import scrapy from imgsPro.items import ImgsproItem class ImgSpider(scrapy.Spider): name = 'img' # allowed_domains = ['www.xxx.com'] # 模拟站长素材地址 start_urls = ['http://sc.test.com/tupian/'] # 数据解析 def parse(self, response): div_list = response.xpath('//div[@id="container"]/div') for div in div_list: src='https:'+div.xpath('./div/a/img/@src2').extract_first() print(src) item = ImgsproItem() item['src']=src return item
item
文件
import scrapy class ImgsproItem(scrapy.Item): # define the fields for your item here like: # name = scrapy.Field() src=scrapy.Field()
写一个继承于ImagesPipeLine
的管道类,并重写三个方法:get_media_requests
,file_path
,item_completed
from scrapy.pipelines.images import ImagesPipeline import scrapy class ImgsPipeline(ImagesPipeline): 可以根据图片的地址进行图片数据的请求 def get_media_requests(self, item, info): yield scrapy.Request(item['src']) 指定图片存储名字 具体路径在配置文件中指定的 def file_path(self,request,response=None,info=None): imgName=request.url.split('/')[-1] return imgName 返回下一个即将被执行的管道类 def item_completed(self, results, item, info): return item
指定刚新建的管道类,以及图片储存地址
指定管道类 ITEM_PIPELINES = { 'imgsPro.pipelines.ImgsPipeline': 300, } ROBOTSTXT_OBEY = False 日志级别 LOG_LEVEL='ERROR' 图片目录 IMAGES_STORE='./imgs_sucai/'
注意
:可能会有图片下不下来问题,比如,代码没有报错,但是图片就是保存不下来,可以试着安下pillow
这个包:pip install pillow
中间件在scrapy
工程里的middlewares.py
中
爬虫中间件(MiddleproSpiderMiddleware
):在引擎和爬虫文件之间的中间件
下载中间件(MiddleproDownloaderMiddleware
):
UA
伪装;代理IP
selenium
抓取响应信息并返回)如图所示:
请求时的UA
伪装
下载中间件文件:
middlewares.py文件
class MiddleproDownloaderMiddleware: # Not all methods need to be defined. If a method is not defined, # scrapy acts as if the downloader middleware does not modify the # passed objects. agents_list = [ "Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; Nexus S Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1", "Avant Browser/1.2.789rel1 (http://www.avantbrowser.com)", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5", "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9", "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7", "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1", "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7", "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre", "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10", "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)", "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB5", "Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729; .NET4.0E)", "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1", "Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110622 Firefox/6.0a2", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100815 Minefield/4.0b4pre", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0 )", "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)", "Mozilla/5.0 (Windows; U; Windows XP) Gecko MultiZilla/1.6.1.0a", "Mozilla/2.02E (Win95; U)", "Mozilla/3.01Gold (Win95; I)", "Mozilla/4.8 [en] (Windows NT 5.1; U)", "Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4) Gecko Netscape/7.1 (ax)", "HTC_Dream Mozilla/5.0 (Linux; U; Android 1.5; en-ca; Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1", "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.2; U; de-DE) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/234.40.1 Safari/534.6 TouchPad/1.0", "Mozilla/5.0 (Linux; U; Android 1.5; en-us; sdk Build/CUPCAKE) AppleWebkit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1", "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17", "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1", "Mozilla/5.0 (Linux; U; Android 1.5; en-us; htc_bahamas Build/CRB17) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1", "Mozilla/5.0 (Linux; U; Android 2.1-update1; de-de; HTC Desire 1.19.161.5 Build/ERE27) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17", "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1", "Mozilla/5.0 (Linux; U; Android 1.5; de-ch; HTC Hero Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1", "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1", "Mozilla/5.0 (Linux; U; Android 2.1; en-us; HTC Legend Build/cupcake) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17", "Mozilla/5.0 (Linux; U; Android 1.5; de-de; HTC Magic Build/PLAT-RC33) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1 FirePHP/0.3", "Mozilla/5.0 (Linux; U; Android 1.6; en-us; HTC_TATTOO_A3288 Build/DRC79) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1", "Mozilla/5.0 (Linux; U; Android 1.0; en-us; dream) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2", "Mozilla/5.0 (Linux; U; Android 1.5; en-us; T-Mobile G1 Build/CRB43) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari 525.20.1", "Mozilla/5.0 (Linux; U; Android 1.5; en-gb; T-Mobile_G2_Touch Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1", "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17", "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Droid Build/FRG22D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1", "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Milestone Build/ SHOLS_U2_01.03.1) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17", "Mozilla/5.0 (Linux; U; Android 2.0.1; de-de; Milestone Build/SHOLS_U2_01.14.0) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17", "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2", "Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522 (KHTML, like Gecko) Safari/419.3", "Mozilla/5.0 (Linux; U; Android 1.1; en-gb; dream) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2", "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17", "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17", "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1", "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1", "Mozilla/5.0 (Linux; U; Android 2.2; en-ca; GT-P1000M Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1", "Mozilla/5.0 (Linux; U; Android 3.0.1; fr-fr; A500 Build/HRI66) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13", "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2", "Mozilla/5.0 (Linux; U; Android 1.6; es-es; SonyEricssonX10i Build/R1FA016) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1", "Mozilla/5.0 (Linux; U; Android 1.6; en-us; SonyEricssonX10i Build/R1AA056) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1", ] def process_request(self, request, spider): request.headers['User-Agent']=random.choice(self.agents_list) return None def process_response(self, request, response, spider): # Called with the response returned from the downloader. # Must either; # - return a Response object # - return a Request object # - or raise IgnoreRequest return response def process_exception(self, request, exception, spider): pass def spider_opened(self, spider): spider.logger.info('Spider opened: %s' % spider.name)
在settings.py文件中开启对下载中间件的支持:
DOWNLOADER_MIDDLEWARES = { 'middlePro.middlewares.MiddleproDownloaderMiddleware': 543, }
模拟爬取网易新闻信息
由于动态响应数据需要使用selenium
来获取,因此需要在spider文件中添加浏览器驱动对象selenium用法详解
middleSpider.py
import scrapy from selenium.webdriver import webdriver class MiddlespiderSpider(scrapy.Spider): 实例化一个浏览器对象 def __init__(self): self.bro=webdriver.Chrome(executable_paht='浏览器驱动地址') name = 'MiddleSpider' # allowed_domains = ['www.xxx.com'] start_urls = ['https://news.163.com/'] models_urls=[]#存储的板块内的详情页 def parse(self, response): pass 爬虫结束后关闭浏览器对象 def closed(self,spider): self.bro.quit()
MiddleproDownloaderMiddleware
是下载中间件文件
篡改响应主要说的是下载中间件中的process_response
方法的修改,其中使用到了selenium
语法selenium用法详解
import time from scrapy.http import HtmlResponse class MiddleproDownloaderMiddleware: def process_request(self, request, spider): return None 参数spider表示爬虫对象,即MiddleSpider def process_response(self, request, response, spider): #挑选出指定响应对象进行篡改 #通过url指定request #通过request指定response 此处的参数spider表示爬虫对象,即MiddleSpider bro=spider.bro if request.url in spider.models_urls: bro.get(request.url) time.sleep(2) page_text=bro.page_source #五大板块对应的对象 需要替换response new_response=HtmlResponse(url=request.url,body=page_text,encode='utf-8',request=request) return new_response else: return response def process_exception(self, request, exception, spider): pass
下载中间件修改后需要在settings.py
文件中开启
DOWNLOADER_MIDDLEWARES = { 'middlePro.middlewares.MiddleproDownloaderMiddleware': 543, }