http://www.doupo321.com/doupocangqiong/
The page is straightforward and needs little analysis: all of the content sits directly in the page source, so this is just a multi-level link crawler. The steps are to first scrape the chapter links from the index page, then follow each link to scrape that chapter's text.
Because the page source is very regular, we will use XPath to do the matching; if you are more comfortable with regular expressions or bs4, those work just as well. Now let's start writing the code.
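Before building the full crawler, a quick sanity check is useful. The minimal sketch below (the XPath expressions simply mirror the ones used in the scripts that follow) fetches the index page and prints the first few chapter titles and links to confirm the match works:

# Minimal sketch: verify the chapter-list XPath before writing the full crawler.
import requests
from lxml import etree

index_url = "http://www.doupo321.com/doupocangqiong/"

resp = requests.get(index_url)
resp.encoding = 'utf-8'
tree = etree.HTML(resp.text)

# Each <li> under the chapter list holds one link; print the first few to verify.
lis = tree.xpath("/html/body/div[1]/div[2]/div[1]/div[3]/div[2]/ul/li")
for li in lis[:5]:
    href = "http://" + li.xpath("./a/@href")[0].strip('//')
    title = li.xpath("./a/text()")[0]
    print(title, href)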
Single-threaded version (斗破2.py), which downloads the chapters one after another:

# @Time: 2022/1/13 12:04
# @Author: 中意灬
# @File: 斗破2.py
# @ps: tutu qqnum:2117472285
import os
import time
import requests
from lxml import etree


def download(url, title):  # download one chapter and save it as a .txt file
    resp = requests.get(url)
    resp.encoding = 'utf-8'
    html = resp.text
    tree = etree.HTML(html)
    body = tree.xpath("/html/body/div/div/div[4]/p/text()")
    body = '\n'.join(body)
    with open(f'斗破2/{title}.txt', mode='w', encoding='utf-8') as f:
        f.write(body)


def geturl(url):  # collect the chapter links from the index page and download each one
    resp = requests.get(url)
    resp.encoding = 'utf-8'
    html = resp.text
    tree = etree.HTML(html)
    lis = tree.xpath("/html/body/div[1]/div[2]/div[1]/div[3]/div[2]/ul/li")
    for li in lis:
        href = li.xpath("./a/@href")[0].strip('//')  # the hrefs are scheme-relative ('//...')
        href = "http://" + href
        title = li.xpath("./a/text()")[0]
        download(href, title)


if __name__ == '__main__':
    url = "http://www.doupo321.com/doupocangqiong/"
    os.makedirs('斗破2', exist_ok=True)  # make sure the output directory exists
    t1 = time.time()
    geturl(url)
    t2 = time.time()
    print("耗时:", t2 - t1)
Result:
Multithreaded version (斗破1.py), which hands each chapter download to a thread pool:

# @Time: 2022/1/13 11:42
# @Author: 中意灬
# @File: 斗破1.py
# @ps: tutu qqnum:2117472285
import os
import time
import requests
from lxml import etree
from concurrent.futures import ThreadPoolExecutor


def download(url, title):  # download one chapter and save it as a .txt file
    resp = requests.get(url)
    resp.encoding = 'utf-8'
    html = resp.text
    tree = etree.HTML(html)
    body = tree.xpath("/html/body/div/div/div[4]/p/text()")
    body = '\n'.join(body)
    with open(f'斗破1/{title}.txt', mode='w', encoding='utf-8') as f:
        f.write(body)


def geturl(url):  # fetch the index page and return the <li> nodes holding the chapter links
    resp = requests.get(url)
    resp.encoding = 'utf-8'
    html = resp.text
    tree = etree.HTML(html)
    lis = tree.xpath("/html/body/div[1]/div[2]/div[1]/div[3]/div[2]/ul/li")
    return lis


if __name__ == '__main__':
    url = "http://www.doupo321.com/doupocangqiong/"
    os.makedirs('斗破1', exist_ok=True)  # make sure the output directory exists
    t1 = time.time()
    lis = geturl(url)
    with ThreadPoolExecutor(1000) as t:  # create a thread pool with 1000 worker threads
        for li in lis:
            href = li.xpath("./a/@href")[0].strip('//')
            href = "http://" + href
            title = li.xpath("./a/text()")[0]
            t.submit(download, url=href, title=title)
    t2 = time.time()
    print("耗时:", t2 - t1)
Result:
Async coroutine version (斗破.py), which uses aiohttp for the chapter requests and aiofiles for writing the files:

# @Time: 2022/1/13 10:30
# @Author: 中意灬
# @File: 斗破.py
# @ps: tutu qqnum:2117472285
import os
import time
import asyncio
import aiohttp
import aiofiles
import requests
from lxml import etree


async def download(url, title, session):
    async with session.get(url) as resp:  # async counterpart of requests.get()
        html = await resp.text()
        tree = etree.HTML(html)
        body = tree.xpath("/html/body/div/div/div[4]/p/text()")
        body = '\n'.join(body)
        async with aiofiles.open(f'斗破/{title}.txt', mode='w', encoding='utf-8') as f:  # save the chapter
            await f.write(body)


async def geturl(url):
    resp = requests.get(url)  # the index page is fetched only once, so plain requests is fine here
    resp.encoding = 'utf-8'
    html = resp.text
    tree = etree.HTML(html)
    lis = tree.xpath("/html/body/div[1]/div[2]/div[1]/div[3]/div[2]/ul/li")
    tasks = []
    async with aiohttp.ClientSession() as session:  # async counterpart of a requests session
        for li in lis:
            href = li.xpath("./a/@href")[0].strip('//')
            href = "http://" + href
            title = li.xpath("./a/text()")[0]
            # schedule each chapter download as an asyncio task
            tasks.append(asyncio.create_task(download(href, title, session)))
        await asyncio.wait(tasks)


if __name__ == '__main__':
    url = "http://www.doupo321.com/doupocangqiong/"
    os.makedirs('斗破', exist_ok=True)  # make sure the output directory exists
    t1 = time.time()
    loop = asyncio.get_event_loop()
    loop.run_until_complete(geturl(url))  # asyncio.run(geturl(url)) also works on Python 3.7+
    t2 = time.time()
    print("耗时:", t2 - t1)
Result:
Because nothing sorts the results, the scraped chapters come out in whatever order they finish. When you write the crawler you can adjust the chapter titles yourself so that the saved files end up in order; see the sketch below.
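One way to do this is a minimal sketch written as a drop-in replacement for the submission loop in 斗破1.py above (it reuses that script's lis, download, and thread pool variable t; the zero-padding width of 5 is just an assumption):

# Prefix each title with a zero-padded chapter index so the saved files sort in reading order.
for i, li in enumerate(lis, start=1):
    href = "http://" + li.xpath("./a/@href")[0].strip('//')
    title = li.xpath("./a/text()")[0]
    t.submit(download, url=href, title=f"{i:05d}_{title}")  # files become 00001_..., 00002_..., and so on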
As you can see, with multithreading the crawler pulled down a novel of more than 1,600 chapters in only about 5 seconds, but multithreading puts a heavier load on the system. With async coroutines the crawl is a bit slower, taking roughly 20-odd seconds, but the overhead is lower, so the coroutine approach is the one I recommend. Crawling with a single thread is far slower, taking over 9 minutes for the whole novel, so it is not really recommended.