
[Python] Lab 2, Project 1: Scraping the Mtime TV Series Top 100 with Coroutines and a Queue


Task: using multiple coroutines and a queue, scrape the Mtime TV series Top 100 (title, director, cast, and synopsis), and save the data with the csv module (file name: time100.csv).
Mtime TV ranking page: http://list.mtime.com/listIndex

Key points:
The site uses cookie-based anti-scraping, so you need to copy your own request headers exactly, for example:
a = '''Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Accept-Encoding: gzip, deflate
Accept-Language: zh-CN,zh;q=0.9
Connection: keep-alive
Cookie: userId=0; defaultCity=%25E5%258C%2597%25E4%25BA%25AC%257C290; waf_cookie=59ca4180-5a16-459e122021f2731eb3889667e33bee3b5cd0; _ydclearance=dca49a10afc623028d11eefe-48d8-4053-bde0-dea67b20ab57-1586501304; userCode=20204101248277038; userIdentity=2020410124827743; tt=731C76D4E29CB5ED5BD5F19F3774A2AC; Hm_lvt_6dd1e3b818c756974fb222f0eae5512e=1586494108; __utma=196937584.377597232.1586494108.1586494108.1586494108.1; __utmc=196937584; __utmz=196937584.1586494108.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utmt=1; _utmt~1=1; __utmb=196937584.18.10.1586494108; Hm_lpvt_6dd1e3b818c756974fb222f0eae5512e=1586495472
Host: www.mtime.com
Referer: http://www.mtime.com/top/tv/top100/index-2.html
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'''
Note that this multi-line headers string must be split into lines and converted to a dict, which takes a little string work:
splitting each line into a key/value pair: line.split(': ', 1)
iterating over the lines: for line in a.split('\n')
passing the resulting pairs to dict(...) to build the dictionary
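The three string operations above combine into a single expression. A minimal sketch, using shortened example header values (the real string copied from your browser will be much longer):

```python
# Raw headers as copied from the browser's Network panel (values shortened here)
a = '''Accept: text/html,application/xhtml+xml
Accept-Language: zh-CN,zh;q=0.9
Host: www.mtime.com
User-Agent: Mozilla/5.0'''

# Split the block into lines, then split each line once on ': ' to get (key, value)
headers = dict(line.split(': ', 1) for line in a.split('\n'))
print(headers['Host'])        # → www.mtime.com
print(headers['User-Agent'])  # → Mozilla/5.0
```

Splitting on `': '` with a maxsplit of 1 matters: header values like the Referer URL contain further colons, and only the first separator divides key from value.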
Required imports:
from gevent import monkey
monkey.patch_all()
import gevent,requests,bs4,csv
from gevent.queue import Queue

Key steps for a multi-coroutine crawler with gevent:
define the crawl function
create tasks with gevent.spawn()
run the tasks with gevent.joinall()
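The three steps can be sketched with a toy function; `gevent.sleep` stands in here for a real blocking request, so this is an illustration of the pattern, not the assignment's crawler:

```python
from gevent import monkey
monkey.patch_all()  # patch blocking stdlib calls so coroutines can switch
import gevent

def crawl(n):
    gevent.sleep(0)  # stand-in for a blocking network request
    return n * n

# Pass the function and its arguments to spawn() separately -- do not call it
# yourself, or the work runs immediately and nothing is left to run concurrently
tasks = [gevent.spawn(crawl, n) for n in range(5)]
gevent.joinall(tasks)  # block until every task has finished
results = [t.value for t in tasks]
print(results)  # → [0, 1, 4, 9, 16]
```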

Key steps for using the queue module:
create a queue with Queue()
store data with put_nowait()
retrieve data with get_nowait()
Other queue methods: empty() checks whether the queue is empty, full() checks whether it is full, and qsize() returns the number of items still in it.
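A minimal sketch of those queue methods (the URLs are placeholders):

```python
from gevent.queue import Queue

work = Queue()
for i in range(3):
    work.put_nowait('http://example.com/page-%d' % i)  # enqueue without blocking

print(work.qsize())  # → 3 items waiting
print(work.full())   # → False (no maxsize was set, so the queue is never full)

while not work.empty():
    print(work.get_nowait())  # dequeue without blocking, in FIFO order
```

In the crawler, each coroutine would loop on `while not work.empty()` and pull its next URL with `get_nowait()`, so the tasks share one pool of work.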

Steps for writing CSV:
open the file with open()
create the writer object with csv.writer()
write rows with the writer object's writerow() method
close the file with close()
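The four steps map onto four calls; a minimal sketch with placeholder data (the file name and row values here are illustrative, not the assignment's):

```python
import csv

f = open('demo.csv', 'w', newline='', encoding='gb18030')  # step 1: open the file
writer = csv.writer(f)                                     # step 2: create the writer object
writer.writerow(['Title', 'Example Show'])                 # step 3: one row per writerow() call
writer.writerow(['Director', 'Example Director'])
f.close()                                                  # step 4: close the file
```

`newline=''` is the documented way to open a file for the csv module on Windows; without it every row is followed by a blank line.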
Solution approach
The site uses cookie-based anti-scraping, so copy your headers exactly; first locate them in the browser's developer tools.

Locate the fields we need (title, director, cast, synopsis) and iterate over the Top 100 entries. When a field is missing, e.g. the director is empty or not a string, record it as 'Unknown' so the script doesn't crash.
Then create tasks with gevent.spawn(), run them with gevent.joinall(), and finally write the results out in CSV format.

from gevent import monkey
monkey.patch_all()  # patch the stdlib before requests is imported

from json.decoder import JSONDecodeError
import gevent
import requests
import csv

TOP_N = 100
ids = []
# Pre-sized result lists: the tasks run concurrently, so each one writes its
# results by index rather than appending, which keeps the output in rank order
director = [None] * TOP_N
actor = [None] * TOP_N
movie = [None] * TOP_N
story = [None] * TOP_N
tasks = []

# Browser-like headers; the API rejects requests without them
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.63',
    'Content-Type': 'application/json'
}

# JSON endpoint behind the ranking page
url = 'http://front-gateway.mtime.com/library/index/app/topList.api?tt=1616811596867&'

def fetch_ids():
    """Fetch the Top-100 list once and collect each show's movieId."""
    html = requests.get(url, headers=headers).json()
    items = html['data']['tvTopList']['topListInfos'][0]['items']
    for i in range(TOP_N):
        ids.append(items[i]['movieInfo']['movieId'])

def catch(x, y):
    """Fetch the detail pages for shows x..y-1 and fill the result lists."""
    for i in range(x, y):
        url1 = ('http://front-gateway.mtime.com/library/movie/detail.api'
                '?tt=1617412224076&movieId=' + str(ids[i]) + '&locationId=290')
        try:
            tvhtml = requests.get(url=url1, headers=headers).json()
        except JSONDecodeError:
            # The API occasionally returns a non-JSON body; retry once
            tvhtml = requests.get(url=url1, headers=headers).json()

        basic = tvhtml['data']['basic']

        # Director: fall back to the English name, or 'Unknown' if missing
        d = basic['director']
        if d is None:
            director[i] = 'Unknown'
        elif d['name'] == '':
            director[i] = d['nameEn']
        else:
            director[i] = d['name']

        # Actors: prefer the Chinese name, fall back to the English one
        a = []
        for j in basic['actors']:
            a.append(j['nameEn'] if j['name'] == '' else j['name'])
        actor[i] = a

        movie[i] = basic['name']
        # Synopsis: write 'Unknown' when the field is empty
        story[i] = basic['story'] if basic['story'] else 'Unknown'

if __name__ == '__main__':
    fetch_ids()
    # Pass catch and its arguments to spawn() separately; writing
    # gevent.spawn(catch(x, x + 10)) would call catch right here and
    # run the batches one by one instead of concurrently
    for x in range(0, TOP_N, 10):
        tasks.append(gevent.spawn(catch, x, x + 10))
    gevent.joinall(tasks)

    f = open('Timetop100.csv', 'w', newline='', encoding='gb18030')
    csv_write = csv.writer(f)
    for i in range(TOP_N):
        csv_write.writerow(['Title', movie[i]])
        csv_write.writerow(['Director', director[i]])
        csv_write.writerow(['Actors'])
        for name in actor[i]:
            csv_write.writerow([name])
        csv_write.writerow(['Story', story[i]])
    print('Done')
    f.close()