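The site crawled in this post is https://image.so.com/. Open the page and switch to the Beauty (美女) channel, open the browser's developer tools, switch to the XHR tab, and scroll down the page. Many Ajax requests will appear, as shown in the screenshot:
Analyzing these requests shows that the image data we want is returned by requests such as zjl?ch=beauty&sn=30. sn=0 returns images 1-30, sn=30 returns images 31-60, and so on in steps of 30. Click one of these requests, as shown in the screenshot:
Switch to the Headers tab and find the URL to request (Request URL). The URLs follow a regular pattern, so they can be built with simple string concatenation.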
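The URL pattern described above can be sketched as follows. This is a minimal illustration, not part of the original project: the helper name `build_page_url` and the constant `PAGE_SIZE` are hypothetical, and the step of 30 is taken from the sn values observed in the XHR tab.

```python
from urllib.parse import urlencode

BASE_URL = "https://image.so.com/zjl"
PAGE_SIZE = 30  # assumption: each request returns 30 images, as the sn values suggest


def build_page_url(page_index, channel="beauty"):
    """Return the Ajax URL for a zero-based page index (hypothetical helper)."""
    params = {"ch": channel, "sn": page_index * PAGE_SIZE}
    return BASE_URL + "?" + urlencode(params)


# Build the URLs for the first three pages
urls = [build_page_url(i) for i in range(3)]
for url in urls:
    print(url)
```

Building the query string with `urlencode` instead of manual concatenation keeps the URL valid if the channel name ever contains characters that need escaping.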
Implementation
Code in Spiders.py
import json

import scrapy

from Pro360.items import Pro360Item


class ImaSpider(scrapy.Spider):
    name = 'Ima'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://image.so.com/zjl?ch=beauty&sn=0']
    MAx_page = 50  # number of further pages to crawl after the sn=0 page
    for i in range(1, MAx_page + 1):
        # Build the URL for each page; sn advances by 30 per page
        url = 'https://image.so.com/zjl?ch=beauty&sn={}'.format(i * 30)
        start_urls.append(url)  # add it to start_urls

    def parse(self, response):
        # The response body is JSON; the images are under the 'list' key
        result = json.loads(response.text)
        for image in result['list']:
            # Collect each image's fields and hand the item to the pipelines
            item = Pro360Item()
            item['id'] = image.get('id')
            item['url'] = image.get('qhimg_url')
            item['title'] = image.get('title')
            item['thumb'] = image.get('qhimg_thumb')
            yield item
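To see what the parse step does without running Scrapy, here is a small standalone sketch. The sample payload is made up for illustration; only the key names (`list`, `id`, `qhimg_url`, `title`, `qhimg_thumb`) come from the spider above, and `extract_items` is a hypothetical helper.

```python
import json

# Hypothetical sample payload; real responses contain more fields
sample_body = json.dumps({
    "list": [
        {"id": "a1", "qhimg_url": "https://example.com/a1.jpg",
         "title": "first", "qhimg_thumb": "https://example.com/a1_t.jpg"},
        {"id": "b2", "qhimg_url": "https://example.com/b2.jpg",
         "title": "second"},  # thumbnail deliberately missing
    ]
})


def extract_items(body):
    """Map each entry under 'list' to the four fields the spider keeps."""
    result = json.loads(body)
    items = []
    for image in result.get("list", []):
        items.append({
            "id": image.get("id"),
            "url": image.get("qhimg_url"),
            "title": image.get("title"),
            "thumb": image.get("qhimg_thumb"),  # None when the key is absent
        })
    return items


items = extract_items(sample_body)
for item in items:
    print(item)
```

Using `dict.get` means a missing key yields `None` rather than raising `KeyError`, which is why the second entry still produces an item.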
Code in items.py
import scrapy


class Pro360Item(scrapy.Item):
    # One field per value extracted in the spider
    id = scrapy.Field()
    url = scrapy.Field()
    title = scrapy.Field()
    thumb = scrapy.Field()
Code in pipelines.py
import pymongo
import pymysql
from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


# Store items in MongoDB
class MongoPipeline:
    def open_spider(self, spider):
        print('Spider started (MongoDB)')
        self.client = pymongo.MongoClient(host='127.0.0.1', port=27017)
        self.db = self.client['Image360']
        self.collection = self.db['images']

    def process_item(self, item, spider):
        # insert() was removed in PyMongo 4; insert_one() is the replacement
        self.collection.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        print('Spider finished (MongoDB)')
        self.client.close()


# Store items in MySQL
class MysqlPipeline:
    def open_spider(self, spider):
        print('Spider started (MySQL)')
        self.db = pymysql.connect(host='127.0.0.1', user='root', password='123456',
                                  database='image360', charset='utf8', port=3306)
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        data = dict(item)
        keys = ','.join(data.keys())
        values = ','.join(['%s'] * len(data))
        # Column names are interpolated; the values are passed as parameters
        sql = 'insert into images(%s) values(%s)' % (keys, values)
        try:
            self.cursor.execute(sql, tuple(data.values()))
            self.db.commit()
        except Exception:
            # Roll back the failed insert and keep crawling
            self.db.rollback()
        return item

    def close_spider(self, spider):
        print('Spider finished (MySQL)')
        self.db.close()


# Download the images to local disk
class ImagePipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        # Use the last segment of the image URL as the file name
        url = request.url
        file_name = url.split('/')[-1]
        return file_name

    def item_completed(self, results, item, info):
        # Drop the item if none of its downloads succeeded
        image_path = [x['path'] for ok, x in results if ok]
        if not image_path:
            raise DropItem('Image download Failed')
        return item

    def get_media_requests(self, item, info):
        # Schedule a download for the item's image URL
        yield Request(item['url'])
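The way MysqlPipeline assembles its insert statement can be checked in isolation. The sketch below reproduces that string-building step without a database connection; `build_insert` is a hypothetical helper and the sample values are invented.

```python
def build_insert(table, data):
    """Return (sql, params) for a parameterized insert, as MysqlPipeline builds it."""
    keys = ",".join(data.keys())
    placeholders = ",".join(["%s"] * len(data))
    # Only the table and column names are formatted into the string;
    # the values stay separate so the driver can escape them.
    sql = "insert into {}({}) values({})".format(table, keys, placeholders)
    return sql, tuple(data.values())


# Invented sample item with the four fields defined in items.py
sample = {
    "id": "a1",
    "url": "https://example.com/a1.jpg",
    "title": "first",
    "thumb": "https://example.com/a1_t.jpg",
}
sql, params = build_insert("images", sample)
print(sql)
print(params)
```

This relies on the `images` table having columns named exactly like the item fields. The table definition is not shown in this post, so the schema has to be created to match.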
Configuration in settings.py
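1. Set the robots.txt option (ROBOTSTXT_OBEY) to False, set the log level to ERROR, and add a User-Agent.
2. Register the three classes written in pipelines.py in settings.py and assign each a priority.
3. Specify the local path where the downloaded images are stored.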
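The three steps above might look like the following fragment of settings.py. This is a sketch under assumptions: the User-Agent string, the priority numbers, and the `./images` path are placeholders rather than values from the original project, and the module path `Pro360.pipelines` is inferred from the `from Pro360.items import Pro360Item` import in the spider.

```python
# settings.py (fragment)

# 1. Ignore robots.txt, log errors only, and send a browser-like User-Agent
ROBOTSTXT_OBEY = False
LOG_LEVEL = "ERROR"
USER_AGENT = (  # assumption: any realistic browser string can be used here
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)

# 2. Register the three pipelines; lower numbers run first (priorities are assumptions)
ITEM_PIPELINES = {
    "Pro360.pipelines.ImagePipeline": 300,
    "Pro360.pipelines.MongoPipeline": 301,
    "Pro360.pipelines.MysqlPipeline": 302,
}

# 3. Directory where ImagesPipeline saves the downloaded files (placeholder path)
IMAGES_STORE = "./images"
```

Running ImagePipeline first means an item whose download failed is dropped before it reaches either database. Scrapy's ImagesPipeline also needs the Pillow package installed.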
Results:
Data stored in MongoDB
Data in the MySQL database
Images saved to local disk