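I. Background of the Topic
Why was this topic chosen, and what is the expected goal of the data analysis? (10 points) Describe it in terms of society, economy, technology, data sources, and so on (within 200 characters).
Reason for the topic: A web crawler is a program that automatically fetches information from the Internet and extracts the parts that are valuable to us. I chose this topic because, as informatization advances, the big-data era demands ever more information collection and ever more processing. Crawler-related jobs are growing accordingly, so learning this course well lays a solid foundation for future employment. Among today's many video sites, Bilibili has a comparatively young audience, which makes it worth crawling: the data gives a better picture of what young people currently like to watch.
Expected goals: become proficient at crawling web page information; clean, de-duplicate, and process the stored data; store it persistently and in an updatable way; produce simple visualizations of the data; and finally, assuming a customer request, provide the corresponding data quickly and conveniently.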
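II. Design of the Topic-Based Web Crawler (10 points)
1. Name of the crawler
A program that crawls information on Bilibili original videos and anime, processes it, and serves it back to the user
2. Content crawled and data feature analysis
Content: the Bilibili popular original-video ranking (video title, rank, play count, danmaku count, uploader name, video URL, uploader space URL); the Bilibili popular anime ranking (rank, anime title, play count, danmaku count, latest episode, anime URL).
Data feature analysis: bar charts for the top ten entries (video title vs. play count, video rank vs. danmaku count, anime title vs. play count, anime rank vs. danmaku count).
3. Overview of the crawler design (implementation approach and technical difficulties)
Implementation approach:
1. Crawl the Bilibili content and data for analysis
2. Clean the data and compute statistics
3. Store the data in a MySQL database
Technical difficulties: locating the tags and attributes that hold each piece of information on the page; writing custom functions with def; cleaning and de-duplicating the data stored in CSV files; converting specific fields to integers (for example rank, play count, danmaku count); adding to and trimming URLs (for example adding "https://" and removing the redundant "//"); and learning to call the sklearn machine-learning library and the selenium library.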
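The two conversions named in the difficulties above (turning display counts into integers and repairing protocol-relative links) can be sketched as follows. This is a minimal illustration, not the code used later in the report: the helper names `parse_count` and `normalize_url` are hypothetical, and it assumes the counts use only the unit suffixes 万 (10,000) and 亿 (100,000,000) with "--" marking a hidden value.

```python
from decimal import Decimal

# Multipliers for the unit suffixes assumed to appear in the scraped counts.
UNITS = {"万": 10_000, "亿": 100_000_000}


def parse_count(text: str) -> int:
    """Convert a display string such as '15.5万' into an integer."""
    text = text.strip()
    if text in ("", "--"):  # hidden or missing value
        return 0
    unit = UNITS.get(text[-1])
    if unit is None:  # plain number without a unit suffix
        return int(text)
    # Decimal avoids floating-point rounding surprises such as 2.3 * 10000.
    return int(Decimal(text[:-1]) * unit)


def normalize_url(href: str) -> str:
    """Turn a protocol-relative link ('//host/path') into an https URL."""
    href = href.strip()
    if href.startswith("//"):
        return "https:" + href
    if href.startswith(("http://", "https://")):
        return href
    return "https://" + href.lstrip("/")


print(parse_count("15.5万"))
print(normalize_url("//www.bilibili.com/video/BV1op4y1X7N2"))
```

Parsing each value on its own, rather than editing the string form of a whole list, keeps values without a decimal point (such as "15万") correct as well.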
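III. Structural Analysis of the Target Pages (10 points)
This project crawls two ranking pages on the same site (the Bilibili original-video ranking and the Bilibili anime ranking). Their URLs are "https://www.bilibili.com/v/popular/rank/all" and "https://www.bilibili.com/v/popular/rank/bangumi".
Scheme : https
Host : www.bilibili.com
Path : /v/popular/rank/all
/v/popular/rank/bangumi
The pages are structured as: <html>
<head>...</head>
<body class="header-v2">...</body>
</html>
The <head> of both ranking pages contains five kinds of tags: <meta>, <title>, <script>, <link>, and <style>. The <head> tag defines the document head and is the container for all head elements. (Figure attached.)
This project mainly parses the <body> part. The <body> contains four kinds of tags: <svg>, <div>, <script>, and <style>. After locating the data, the items to crawl were found in the <li ... class="rank-item"> tags inside <div id="app">.
The following code fetches everything inside the <li ... class="rank-item"> tags: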
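As a quick check of the scheme, host, and path listed above, the two URLs can be split with the standard library. This is only an illustrative sketch; the variable name `RANK_URLS` is introduced here and is not part of the project code.

```python
from urllib.parse import urlsplit

# The two ranking pages named in this section.
RANK_URLS = [
    "https://www.bilibili.com/v/popular/rank/all",
    "https://www.bilibili.com/v/popular/rank/bangumi",
]

for url in RANK_URLS:
    parts = urlsplit(url)
    # scheme -> "https", netloc -> host name, path -> ranking-specific part
    print(parts.scheme, parts.netloc, parts.path)
```

Only the path differs between the two pages, which is why the same parsing approach can be tried on both.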
import requests
from bs4 import BeautifulSoup
url = 'https://www.bilibili.com/v/popular/rank/all'
bdata = requests.get(url).text
soup = BeautifulSoup(bdata,'html.parser')
items = soup.find_all('li',{'class':'rank-item'})  # extract the list of ranking items
print(items)
1. Methods for finding and traversing nodes (tags), with a node tree diagram where needed
import requests
from bs4 import BeautifulSoup
r=requests.get('https://www.bilibili.com/v/popular/rank/all')
demo=r.text
soup=BeautifulSoup(demo,'html.parser')
# Traversal:
print(soup.contents)  # child nodes of the whole tag tree
print(soup.body.contents)  # child nodes of the body tag
print(soup.head)  # the head tag
# Searching:
print(soup.title)  # find a tag by name, here the title tag
print(soup.li['class'])  # read an attribute of a tag, here the class of the first li tag
print(soup.find_all('li'))  # find all elements with a given tag name, here every li tag
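Node tree diagram:
IV. Crawler Program Design (60 points)
The crawler program must include the parts below, with source code and fairly detailed comments, and a screenshot of the output after each part.
① Obtaining the bvid URL
② Obtaining the aid
③ Crawling the pages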
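In place of a drawn diagram, a node tree can also be printed as an indented outline. The sketch below uses only the standard library and a small hand-written sample, so the markup in `SAMPLE` is an assumption that mirrors the structure described above rather than the real page; `TreePrinter` and `outline` are hypothetical names.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not increase the depth.
VOID_TAGS = {"meta", "link", "br", "img", "input", "hr"}


class TreePrinter(HTMLParser):
    """Collect one indented line per opening tag."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        label = f"<{tag} class={cls!r}>" if cls else f"<{tag}>"
        self.lines.append("  " * self.depth + label)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1


# Simplified stand-in for the ranking page (assumed structure).
SAMPLE = (
    '<html><head><meta charset="utf-8"><title>rank</title></head>'
    '<body><div id="app"><ul><li class="rank-item">'
    '<a class="title">demo</a></li></ul></div></body></html>'
)


def outline(markup):
    """Return the indented tag outline of an HTML string."""
    printer = TreePrinter()
    printer.feed(markup)
    return printer.lines


print("\n".join(outline(SAMPLE)))
```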
# Import libraries
import requests
from bs4 import BeautifulSoup
import csv
import datetime
import pandas as pd
import numpy as np
from matplotlib import rcParams
import matplotlib.pyplot as plt
import matplotlib.font_manager as font_manager
from selenium import webdriver
from time import sleep
import matplotlib
url = 'https://www.bilibili.com/v/popular/rank/all'
# Send the HTTP request
response = requests.get(url)
html_text = response.text
soup = BeautifulSoup(html_text,'html.parser')
# Define the Video class (one row of the original-video ranking)
class Video:
    def __init__(self,rank,title,visit,barrage,up_id,url,space):
        self.rank = rank
        self.title = title
        self.visit = visit
        self.barrage = barrage
        self.up_id = up_id
        self.url = url
        self.space = space
    def to_csv(self):
        return [self.rank,self.title,self.visit,self.barrage,self.up_id,self.url,self.space]
    @staticmethod
    def csv_title():
        return ['排名','标题','播放量','弹幕量','Up_ID','URL','作者空间']
# Extract the list of ranking items
items = soup.find_all('li',{'class':'rank-item'})
# Holds the extracted Video objects
videos = []
for itm in items:
    title = itm.find('a',{'class':'title'}).text  # video title
    rank = itm.find('i',{'class':'num'}).text  # rank
    visit = itm.find_all('span')[3].text  # play count
    barrage = itm.find_all('span')[4].text  # danmaku count
    up_id = itm.find('span',{'class':'data-box up-name'}).text  # uploader name
    url = itm.find_all('a')[1].get('href')  # video URL
    space = itm.find_all('a')[2].get('href')  # uploader space URL
    v = Video(rank,title,visit,barrage,up_id,url,space)
    videos.append(v)
# Build a timestamp suffix
now_str = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')
# Build the output file name
file_name1 = f'哔哩哔哩视频top100_{now_str}.csv'
# Write the data to the file
with open(file_name1,'w',newline='',encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(Video.csv_title())
    for v in videos:
        writer.writerow(v.to_csv())
url = 'https://www.bilibili.com/v/popular/rank/bangumi'
# Send the HTTP request
response = requests.get(url)
html_text = response.text
soup = BeautifulSoup(html_text,'html.parser')
# Redefine the Video class for the anime ranking (different columns)
class Video:
    def __init__(self,rank,title,visit,barrage,new_word,url):
        self.rank = rank
        self.title = title
        self.visit = visit
        self.barrage = barrage
        self.new_word = new_word
        self.url = url
    def to_csv(self):
        return [self.rank,self.title,self.visit,self.barrage,self.new_word,self.url]
    @staticmethod
    def csv_title():
        return ['排名','标题','播放量','弹幕量','更新话数至','URL']
# Extract the list of ranking items
items = soup.find_all('li',{'class':'rank-item'})
# Holds the extracted Video objects
videos = []
for itm in items:
    rank = itm.find('i',{'class':'num'}).text  # rank
    title = itm.find('a',{'class':'title'}).text  # anime title
    url = itm.find_all('a')[0].get('href')  # anime URL
    visit = itm.find_all('span')[2].text  # play count
    barrage = itm.find_all('span')[3].text  # danmaku count
    new_word = itm.find('span',{'class':'data-box'}).text  # latest episode
    v = Video(rank,title,visit,barrage,new_word,url)
    videos.append(v)
# Build a timestamp suffix
now_str = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')
# Build the output file name
file_name2 = f'哔哩哔哩番剧top50_{now_str}.csv'
# Write the data to the file
with open(file_name2,'w',newline='',encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(Video.csv_title())
    for v in videos:
        writer.writerow(v.to_csv())
④ Data cleaning
# Import libraries
import pandas as pd
file_name1 = '哔哩哔哩视频top100_20211215_154744.csv'
file_name2 = '哔哩哔哩番剧top50_20211215_154745.csv'
# Load the saved CSV files for cleaning and processing
paiming1 = pd.read_csv(file_name1,encoding="utf_8_sig")
paiming2 = pd.read_csv(file_name2,encoding="utf_8_sig")
print(paiming1.head())
print(paiming2.head())
# Check for duplicate rows
print(paiming1.duplicated())
print(paiming2.duplicated())
# Check for null and missing values
print(paiming1['标题'].isnull().value_counts())
print(paiming2['标题'].isnull().value_counts())
print(paiming1['URL'].isnull().value_counts())
print(paiming2['URL'].isnull().value_counts())
print(paiming1['播放量'].isnull().value_counts())
print(paiming2['播放量'].isnull().value_counts())
print(paiming1['弹幕量'].isnull().value_counts())
print(paiming2['弹幕量'].isnull().value_counts())
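The checks above only report duplicates and missing values. As a dependency-free sketch of what acting on them could look like, the following drops repeated rows and counts empty cells per column. The data in `SAMPLE_CSV` is invented for illustration, `clean_rows` is a hypothetical helper, and the standard `csv` module stands in for the pandas calls used in the project.

```python
import csv
import io

# Invented sample rows: one exact duplicate, one empty cell, one '--' cell.
SAMPLE_CSV = """排名,标题,播放量,弹幕量
1,视频A,15.5万,1234
2,视频B,,56
2,视频B,,56
3,视频C,9876,--
"""


def clean_rows(text):
    """Return (unique_rows, duplicate_count, missing_by_column)."""
    reader = csv.DictReader(io.StringIO(text))
    seen = set()
    unique_rows = []
    duplicates = 0
    missing = {name: 0 for name in reader.fieldnames}
    for row in reader:
        key = tuple(row.values())
        if key in seen:  # exact repeat of an earlier row
            duplicates += 1
            continue
        seen.add(key)
        unique_rows.append(row)
        for name, value in row.items():
            if value.strip() in ("", "--"):  # treat '--' as missing too
                missing[name] += 1
    return unique_rows, duplicates, missing


rows, dup_count, missing_counts = clean_rows(SAMPLE_CSV)
print(len(rows), dup_count, missing_counts)
```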
3. Storing the data in a MySQL database
① Crawling the site
import re
import time
import requests
from bs4 import BeautifulSoup
# Crawl the Bilibili daily ranking
def BilibiliNews():
    newsList=[]
    # Send a browser-like User-Agent header
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36'}
    res=requests.get('https://www.bilibili.com/ranking/all/0/0/3',headers=headers)  # request the page
    soup = BeautifulSoup(res.text,'html.parser')  # parse the page
    result=soup.find_all(class_='rank-item')  # find the tags that hold the ranking
    num=0
    startTime=time.strftime("%Y-%m-%d", time.localtime())  # record the crawl time
    for i in result:
        try:
            num=int(i.find(class_='num').text)  # current rank
            con=i.find(class_='content')
            title=con.find(class_='title').text  # title
            detail=con.find(class_='detail').find_all(class_='data-box')
            play=detail[0].text.strip()  # play count
            view=detail[1].text.strip()  # danmaku count
            # Both values can look like "15.5万", so convert them to plain integers for storage
            if(play[-1]=='万'):
                play=int(round(float(play[:-1])*10000))
            if(view[-1]=='万'):
                view=int(round(float(view[:-1])*10000))
            # Guard against values that are not displayed
            if(view=='--'):
                view=0
            if(play=='--'):
                play=0
            play=int(play)
            view=int(view)
            author=detail[2].text.strip()  # uploader
            url=con.find(class_='title')['href']  # video link
            BV=re.findall(r'https://www.bilibili.com/video/(.*)', url)[0]  # extract the BV number with a regular expression
            pts=int(con.find(class_='pts').find('div').text)  # overall score of the video
            newsList.append([num,title,author,play,view,BV,pts,startTime])  # append the record
        except Exception:
            # Skip entries that do not match the expected layout
            continue
    return newsList  # return the list of records
② Creating the database table
mysql> create table BILIBILI(
-> NUM INT,
-> TITLE CHAR(80),
-> UP CHAR(20),
-> VIEW INT,
-> COMMENT INT,
-> BV_NUMBER CHAR(20),
-> SCORE INT,
-> EXECUTION_TIME DATETIME);
③ Inserting the data into MySQL
import time
import pymysql
import schedule
def GetMessageInMySQL():
    # Connect to the database
    db = pymysql.connect(host="cdb-cdjhisi3hih.cd.tencentcdb.com",port=10056,user="root",password="xxxxxx",database="weixinNews",charset='utf8')
    cursor = db.cursor()  # create a cursor
    news=BilibiliNews()  # call BilibiliNews() to get the ranking data
    sql = "INSERT INTO BILIBILI(NUM,TITLE,UP,`VIEW`,`COMMENT`,BV_NUMBER,SCORE,EXECUTION_TIME) VALUES (%s,%s,%s,%s,%s,%s,%s,%s)"  # insert statement
    timebegin=time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())  # record the start time to help trace errors
    try:
        # Run the statement; executemany inserts the rows in bulk
        cursor.executemany(sql, news)
        # Commit the transaction
        db.commit()
        print(timebegin+" succeeded!")
    except Exception:
        # Roll back if an error occurs
        db.rollback()
        print(timebegin+" failed!")
    # Close the cursor
    cursor.close()
    # Close the database connection
    db.close()
# Record when the program started
time1=time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
print("Crawling started, program running normally: "+time1)
# Run the job every 20 minutes
schedule.every(20).minutes.do(GetMessageInMySQL)
# Keep checking the schedule and run the job whenever it is due
while True:
    schedule.run_pending()
    time.sleep(1)
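The insert-commit-rollback pattern above can be tried without a MySQL server. The sketch below swaps in the standard-library `sqlite3` module as a stand-in, so the `?` placeholders, the lowercase column names, and the `save_rows` helper are assumptions of this example, not the project's PyMySQL code; the sample row is invented.

```python
import sqlite3


def save_rows(conn, rows):
    """Bulk-insert rows; commit on success, roll back on any database error."""
    sql = (
        "INSERT INTO bilibili "
        "(num, title, up, play, danmaku, bv_number, score, execution_time) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?)"
    )
    try:
        conn.executemany(sql, rows)
        conn.commit()
        return True
    except sqlite3.Error:
        conn.rollback()
        return False


# In-memory database so the example leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bilibili (num INTEGER, title TEXT, up TEXT, play INTEGER, "
    "danmaku INTEGER, bv_number TEXT, score INTEGER, execution_time TEXT)"
)

# Invented sample record in the same field order as the crawler's list.
sample_rows = [(1, "demo video", "demo up", 155000, 1234, "BV1op4y1X7N2", 100, "2021-12-15")]
ok = save_rows(conn, sample_rows)
count = conn.execute("SELECT COUNT(*) FROM bilibili").fetchone()[0]
print(ok, count)
```

Note that the BV number is stored as text here, since a value such as BV1op4y1X7N2 is not numeric.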
4. Developing the server side with Flask
from collections import Counter
from pyecharts import WordCloud  # import path used by pre-1.0 pyecharts
import jieba.analyse
# Split a list of (word, count) pairs into two lists
def counter2list(counter):
    keyList,valueList = [],[]
    for c in counter:
        keyList.append(c[0])
        valueList.append(c[1])
    return keyList,valueList
# Extract keywords with jieba and accumulate their weights
def extractTag(content,tagsList):
    if content:
        tags = jieba.analyse.extract_tags(content, topK=100, withWeight=True)
        for tex, weight in tags:
            tagsList[tex] += int(weight*10000)
# Render the word cloud to an HTML file
def drawWorldCloud(content,count):
    outputFile = './测试词云.html'
    cloud = WordCloud('词云图', width=1000, height=600, title_pos='center')
    cloud.add(
        ' ',content,count,
        shape='circle',
        background_color='white',
        max_words=200
    )
    cloud.render(outputFile)
if __name__ == '__main__':
    c = Counter()  # counter that accumulates keyword weights
    filePath = './新建文本文档.txt'  # path of the document to analyze
    with open(filePath) as file_object:
        contents = file_object.read()
    extractTag(contents, c)
    contentList,countList = counter2list(c.most_common(200))
    drawWorldCloud(contentList, countList)
# Reading submitted form fields; these lines belong inside a Flask view function (needs: from flask import request)
username = request.form.get("username")
password = request.form.get("password", type=str, default=None)
cpuCount = request.form.get("cpuCount", type=int, default=None)
memorySize = request.form.get("memorySize", type=int, default=None)
③ Crawling comments by BV number
# _*_ coding: utf-8 _*_
from urllib.request import urlopen, Request
from http.client import HTTPResponse
from bs4 import BeautifulSoup
import gzip
import json
def get_all_comments_by_bv(bv: str, time_order=False) -> tuple:
    """
    Return the comment list of a Bilibili video (including replies to comments) given its BV number.
    :param bv: BV number of the video
    :param time_order: whether to return comments in time order; by default they are ordered by popularity
    :return: a 3-tuple: the list of all comments as dicts (replies are appended in their original form),
    the AV number of the video (string), and the number of comments actually collected (including replies)
    """
    video_url = 'https://www.bilibili.com/video/' + bv
    headers = {
        'Host': 'www.bilibili.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
        'Accept-Encoding': 'gzip',  # only gzip is decoded below, so do not advertise deflate or br
        'Connection': 'keep-alive',
        'Cookie': '',
        'Upgrade-Insecure-Requests': '1',
        'Cache-Control': 'max-age=0',
        'TE': 'Trailers',
    }
    rep = Request(url=video_url, headers=headers)  # fetch the page
    html_response = urlopen(rep)  # type: HTTPResponse
    raw = html_response.read()
    if html_response.headers.get('Content-Encoding') == 'gzip':
        raw = gzip.decompress(raw)
    html_content = raw.decode(encoding='utf-8')
    bs = BeautifulSoup(markup=html_content, features='html.parser')
    comment_meta = bs.find(name='meta', attrs={'itemprop': 'commentCount'})
    av_meta = bs.find(name='meta', attrs={'property': 'og:url'})
    comment_count = int(comment_meta.attrs['content'])  # total number of comments
    av_number = av_meta.attrs['content'].split('av')[-1][:-1]  # AV number
    print(f'Video {bv} has AV number {av_number}; its metadata reports {comment_count} comments (including replies).')
    page_num = 1
    replies_count = 0
    res = []
    while True:
        # Ordered by time: type=1&sort=0
        # Ordered by popularity: type=1&sort=2
        comment_url = f'https://api.bilibili.com/x/v2/reply?pn={page_num}&type=1&oid={av_number}' + \
                      f'&sort={0 if time_order else 2}'
        comment_response = urlopen(comment_url)  # type: HTTPResponse
        comments = json.loads(comment_response.read().decode('utf-8'))  # type: dict
        comments = comments.get('data').get('replies')  # type: list
        if comments is None:
            break
        replies_count += len(comments)
        for c in comments:  # type: dict
            if c.get('replies'):
                rp_id = c.get('rpid')
                rp_num = 10
                rp_page = 1
                while True:  # fetch the replies under this comment
                    reply_url = ('https://api.bilibili.com/x/v2/reply/reply?' +
                                 f'type=1&pn={rp_page}&oid={av_number}&ps={rp_num}&root={rp_id}')
                    reply_response = urlopen(reply_url)  # type: HTTPResponse
                    reply_reply = json.loads(reply_response.read().decode('utf-8'))  # type: dict
                    reply_reply = reply_reply.get('data').get('replies')  # type: list
                    if reply_reply is None:
                        break
                    replies_count += len(reply_reply)
                    for r in reply_reply:  # type: dict
                        res.append(r)
                    if len(reply_reply) < rp_num:
                        break
                    rp_page += 1
            c.pop('replies', None)
            res.append(c)
        if replies_count >= comment_count:
            break
        page_num += 1
    print(f'Collected {replies_count} comments in total for video {bv}.')
    return res, av_number, replies_count
if __name__ == '__main__':
    cts, av, cnt = get_all_comments_by_bv('BV1op4y1X7N2')
    for i in cts:
        print(i.get('content').get('message'))
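The paging logic in the function above (request page 1, 2, 3, ... until a page comes back empty) can be isolated from the network code. The sketch below is a generic illustration with hypothetical names (`iter_pages`, `fake_fetch`, `FAKE_PAGES`); the fake fetcher stands in for the real API call so the loop can be run offline.

```python
from typing import Callable, Iterator, List, Optional


def iter_pages(
    fetch_page: Callable[[int], Optional[List[dict]]],
    max_pages: int = 1000,
) -> Iterator[dict]:
    """Yield items page by page until a page is empty or missing."""
    for page in range(1, max_pages + 1):
        items = fetch_page(page)
        if not items:  # None or [] both mean there is nothing left
            return
        yield from items


# Invented data standing in for two pages of API results.
FAKE_PAGES = {1: [{"rpid": 1}, {"rpid": 2}], 2: [{"rpid": 3}]}


def fake_fetch(page: int) -> Optional[List[dict]]:
    """Return the fake page, or None once the pages run out."""
    return FAKE_PAGES.get(page)


all_items = list(iter_pages(fake_fetch))
print([item["rpid"] for item in all_items])
```

The `max_pages` cap is a safety limit chosen for this sketch so that an endpoint which never returns an empty page cannot cause an endless loop.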
2. Data analysis and visualization (for example: bar charts, histograms, scatter plots, box plots, distribution plots)
# Data analysis and visualization
filename1 = file_name1
filename2 = file_name2
# Use a font that can render Chinese text; it must be set before drawing
matplotlib.rcParams['font.sans-serif'] = ['KaiTi']
# One color per bar
colors = ["red","yellow","green","blue","black","gold","pink","purple","violet","Chocolate"]
# Convert strings such as '15.5万' or '1.2亿' to integers ('--' or empty becomes 0)
def to_int(text):
    text = text.strip()
    if text in ('','--'):
        return 0
    if text.endswith('万'):
        return int(round(float(text[:-1])*10000))
    if text.endswith('亿'):
        return int(round(float(text[:-1])*100000000))
    return int(text)
with open(filename1,encoding="utf_8_sig") as f1:
    # Create the reader by passing the open file object to csv.reader()
    reader1 = csv.reader(f1)
    # next() is called once, so it returns only the first line: the header row
    header_row1 = next(reader1)
    print(header_row1)
    # Print the index of each column header
    for index,column_header in enumerate(header_row1):
        print(index,column_header)
    # Create empty lists
    title1 = []
    rank1 = []
    highs1=[]
    url1 = []
    visit1 = []
    space1 = []
    up_id1 = []
    for row in reader1:
        rank1.append(row[0])
        title1.append(row[1])
        visit1.append(row[2].strip())
        highs1.append(row[3].strip())
        up_id1.append(row[4].strip())
        url1.append(row[5].strip().strip('/'))
        space1.append(row[6].strip().strip('/'))
# Convert play counts and danmaku counts to integers
visit_list_new1 = [to_int(v) for v in visit1]
highs_list_new1 = [to_int(h) for h in highs1]
print(highs_list_new1)
# x-axis data: rank
x=np.array(rank1[0:10])
# y-axis data: danmaku count
y=np.array(highs_list_new1[0:10])
# Draw the bar chart with one color per bar and a fixed bar width
plt.bar(x,y,color = colors,width = 0.5)
plt.show()
# x-axis data: title
x=np.array(title1[0:10])
# y-axis data: play count
y=np.array(visit_list_new1[0:10])
plt.bar(x,y,color = colors,width = 0.5)
plt.show()
# Set the figure size
fig = plt.figure(figsize = (15,8))
# Add the main title
plt.title("各视频播放量")
# Label the x and y axes
plt.xlabel("视频名称")
plt.ylabel("播放量")
# Show grid lines
plt.grid(True)
x=np.array(title1[0:10])
y=np.array(visit_list_new1[0:10])
plt.bar(x,y,color = colors,width = 0.6)
# Save the figure
plt.savefig(r"C:\Users\24390\Desktop\bilibili-up-v.png")
with open(filename2,encoding="utf_8_sig") as f2:
    reader2 = csv.reader(f2)
    header_row2 = next(reader2)
    print(header_row2)
    for index,column_header in enumerate(header_row2):
        print(index,column_header)
    rank2 = []
    title2 = []
    highs2 = []
    url2 = []
    visit2 = []
    new_word2 = []
    for row in reader2:
        rank2.append(row[0])
        title2.append(row[1])
        visit2.append(row[2].strip())
        highs2.append(row[3].strip())
        new_word2.append(row[4])
        url2.append(row[5].strip().strip('/'))
print(highs2)
# Convert play counts and danmaku counts to integers
visit_list_new2 = [to_int(v) for v in visit2]
highs_list_new2 = [to_int(h) for h in highs2]
print(highs_list_new2)
# x-axis data: rank
x=np.array(rank2[0:10])
# y-axis data: danmaku count
y=np.array(highs_list_new2[0:10])
plt.bar(x,y,color = colors,width = 0.5)
plt.show()
# x-axis data: title
x=np.array(title2[0:10])
# y-axis data: play count
y=np.array(visit_list_new2[0:10])
plt.bar(x,y,color = colors,width = 0.5)
plt.show()
# Set the figure size
fig = plt.figure(figsize = (15,8))
# Add the main title
plt.title("番剧播放量")
# Label the x and y axes
plt.xlabel("番剧名称")
plt.ylabel("播放量")
# Show grid lines
plt.grid(True)
x=np.array(title2[0:10])
y=np.array(visit_list_new2[0:10])
plt.bar(x,y,color = colors,width = 0.6)
# Save the figure
plt.savefig(r"C:\Users\24390\Desktop\bilibili-draw-v.png")
3. Based on the relationship between the data, analyze the correlation coefficient between two variables, draw a scatter plot, and build a regression equation between the variables (simple or multiple).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pandas import DataFrame,Series
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import csv
file_name2 = '哔哩哔哩番剧top50_20211215_154745.csv'
filename2 = file_name2
# Convert strings such as '15.5万' or '1.2亿' to integers ('--' or empty becomes 0)
def to_int(text):
    text = text.strip()
    if text in ('','--'):
        return 0
    if text.endswith('万'):
        return int(round(float(text[:-1])*10000))
    if text.endswith('亿'):
        return int(round(float(text[:-1])*100000000))
    return int(text)
with open(filename2,encoding="utf_8_sig") as f2:
    reader2 = csv.reader(f2)
    header_row2 = next(reader2)
    print(header_row2)
    for index,column_header in enumerate(header_row2):
        print(index,column_header)
    rank2 = []
    title2 = []
    highs2 = []
    url2 = []
    visit2 = []
    new_word2 = []
    for row in reader2:
        rank2.append(row[0])
        title2.append(row[1])
        visit2.append(row[2].strip())
        highs2.append(row[3].strip())
        new_word2.append(row[4])
        url2.append(row[5].strip().strip('/'))
print(highs2)
# Convert play counts and danmaku counts to integers
visit_list_new2 = [to_int(v) for v in visit2]
highs_list_new2 = [to_int(h) for h in highs2]
# Save the (danmaku count, play count) pairs
with open('output.csv','w',newline='') as f:
    writer = csv.writer(f)
    writer.writerows(zip(highs_list_new2,visit_list_new2))
# Build the data set
examDict = {'弹幕量':highs_list_new2[0:10],
            '播放量':visit_list_new2[0:10]}
# Convert to a DataFrame
examDf = DataFrame(examDict)
# Draw the scatter plot
plt.scatter(examDf.播放量,examDf.弹幕量,color = 'b',label = "Bangumi data")
# Label the x and y axes
plt.xlabel("Play count")
plt.ylabel("Danmaku count")
# Show the figure
plt.show()
# Correlation coefficient matrix
rDf = examDf.corr()
print(rDf)
exam_X=examDf.弹幕量
exam_Y=examDf.播放量
# Split the data set into a training set and a test set
X_train,X_test,Y_train,Y_test = train_test_split(exam_X,exam_Y,train_size=.8)
# X_train / X_test are the training / test features, Y_train / Y_test the training / test labels;
# exam_X holds the sample features, exam_Y the sample labels, and train_size is the share of training data
print("Original features:",exam_X.shape,
      ", training features:",X_train.shape,
      ", test features:",X_test.shape)
print("Original labels:",exam_Y.shape,
      ", training labels:",Y_train.shape,
      ", test labels:",Y_test.shape)
# Scatter plot of the split
plt.scatter(X_train, Y_train, color="blue", label="train data")
plt.scatter(X_test, Y_test, color="red", label="test data")
# Add the legend and axis labels
plt.legend(loc=2)
plt.xlabel("Danmaku count")
plt.ylabel("Play count")
# Save and show the figure
plt.savefig("tests.jpg")
plt.show()
model = LinearRegression()
# Calling model.fit(X_train,Y_train) directly raises an error because the features are one-dimensional
# model.fit(X_train,Y_train)
# reshape(-1,1): -1 lets NumPy infer the number of rows, 1 gives a single column
# The model expects a 2-D feature array, but there is only one feature here, so reshape is required
X_train = X_train.values.reshape(-1,1)
X_test = X_test.values.reshape(-1,1)
model.fit(X_train,Y_train)
a = model.intercept_  # intercept
b = model.coef_  # regression coefficient
print("Best-fit line: intercept",a,", regression coefficient:",b)
# Predictions on the training data
y_train_pred = model.predict(X_train)
# Draw the best-fit line using the predictions on the training data
plt.plot(X_train, y_train_pred, color='black', linewidth=3, label="best line")
# Scatter plot of the test data
plt.scatter(X_test, Y_test, color='red', label="test data")
# Add the legend and axis labels
plt.legend(loc=2)
plt.xlabel("Danmaku count")
plt.ylabel("Play count")
# Save and show the figure
plt.savefig("lines.jpg")
plt.show()
# Coefficient of determination (R^2) on the test set
score = model.score(X_test,Y_test)
print(score)
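To make explicit what the correlation coefficient and the fitted line above compute, here is a hand-written version of Pearson's r and simple least squares using only the standard library. The numbers in `xs` and `ys` are invented so the result is easy to verify; they are not the crawled data, and `pearson_r` and `fit_line` are hypothetical helper names.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope


# Invented, perfectly linear data: y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
intercept, slope = fit_line(xs, ys)
print(pearson_r(xs, ys), intercept, slope)
```

For a single feature, the intercept and slope here play the same roles as `model.intercept_` and `model.coef_` in the scikit-learn code.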
4. Data persistence
file_name1 = f'哔哩哔哩视频top100_{now_str}.csv'
with open(file_name1,'w',newline='',encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(Video.csv_title())
    for v in videos:
        writer.writerow(v.to_csv())
file_name2 = f'哔哩哔哩番剧top50_{now_str}.csv'
with open(file_name2,'w',newline='',encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(Video.csv_title())
    for v in videos:
        writer.writerow(v.to_csv())
plt.savefig(r"C:\Users\24390\Desktop\bilibili-up-v.png")  # save the figure
plt.savefig(r"C:\Users\24390\Desktop\bilibili-draw-v.png")  # save the figure
5. The complete program, combining all of the parts above
import requests
from bs4 import BeautifulSoup
import csv
import datetime
import pandas as pd
import numpy as np
from matplotlib import rcParams
import matplotlib.pyplot as plt
import matplotlib.font_manager as font_manager
from selenium import webdriver
from time import sleep
import matplotlib
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from pandas import DataFrame,Series
url = 'https://www.bilibili.com/v/popular/rank/all'
response = requests.get(url)  # send the HTTP request
html_text = response.text
soup = BeautifulSoup(html_text,'html.parser')
class Video:  # one row of the original-video ranking
    def __init__(self,rank,title,visit,barrage,up_id,url,space):
        self.rank = rank
        self.title = title
        self.visit = visit
        self.barrage = barrage
        self.up_id = up_id
        self.url = url
        self.space = space
    def to_csv(self):
        return [self.rank,self.title,self.visit,self.barrage,self.up_id,self.url,self.space]
    @staticmethod
    def csv_title():
        return ['排名','标题','播放量','弹幕量','Up_ID','URL','作者空间']
items = soup.find_all('li',{'class':'rank-item'})  # extract the list of ranking items
videos = []  # holds the extracted Video objects
for itm in items:
    title = itm.find('a',{'class':'title'}).text  # video title
    rank = itm.find('i',{'class':'num'}).text  # rank
    visit = itm.find_all('span')[3].text  # play count
    barrage = itm.find_all('span')[4].text  # danmaku count
    up_id = itm.find('span',{'class':'data-box up-name'}).text  # uploader name
    url = itm.find_all('a')[1].get('href')  # video URL
    space = itm.find_all('a')[2].get('href')  # uploader space URL
    v = Video(rank,title,visit,barrage,up_id,url,space)
    videos.append(v)
now_str = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')
file_name1 = f'哔哩哔哩视频top100_{now_str}.csv'
with open(file_name1,'w',newline='',encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(Video.csv_title())
    for v in videos:
        writer.writerow(v.to_csv())
url = 'https://www.bilibili.com/v/popular/rank/bangumi'
response = requests.get(url)  # send the HTTP request
html_text = response.text
soup = BeautifulSoup(html_text,'html.parser')
class Video:  # redefined for the anime ranking (different columns)
    def __init__(self,rank,title,visit,barrage,new_word,url):
        self.rank = rank
        self.title = title
        self.visit = visit
        self.barrage = barrage
        self.new_word = new_word
        self.url = url
    def to_csv(self):
        return [self.rank,self.title,self.visit,self.barrage,self.new_word,self.url]
    @staticmethod
    def csv_title():
        return ['排名','标题','播放量','弹幕量','更新话数至','URL']
items = soup.find_all('li',{'class':'rank-item'})  # extract the list of ranking items
videos = []  # holds the extracted Video objects
for itm in items:
    rank = itm.find('i',{'class':'num'}).text  # rank
    title = itm.find('a',{'class':'title'}).text  # anime title
    url = itm.find_all('a')[0].get('href')  # anime URL
    visit = itm.find_all('span')[2].text  # play count
    barrage = itm.find_all('span')[3].text  # danmaku count
    new_word = itm.find('span',{'class':'data-box'}).text  # latest episode
    v = Video(rank,title,visit,barrage,new_word,url)
    videos.append(v)
now_str = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')
file_name2 = f'哔哩哔哩番剧top50_{now_str}.csv'
with open(file_name2,'w',newline='',encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(Video.csv_title())
    for v in videos:
        writer.writerow(v.to_csv())
paiming1 = pd.read_csv(file_name1,encoding="utf_8_sig")  # load the data for cleaning and processing
paiming2 = pd.read_csv(file_name2,encoding="utf_8_sig")
print(paiming1.head())
print(paiming2.head())
print(paiming1.duplicated())  # check for duplicate rows
print(paiming2.duplicated())
print(paiming1['标题'].isnull().value_counts())  # check for null and missing values
print(paiming2['标题'].isnull().value_counts())
print(paiming1['URL'].isnull().value_counts())
print(paiming2['URL'].isnull().value_counts())
print(paiming1['播放量'].isnull().value_counts())
print(paiming2['播放量'].isnull().value_counts())
print(paiming1['弹幕量'].isnull().value_counts())
print(paiming2['弹幕量'].isnull().value_counts())
# Data analysis and visualization
filename1 = file_name1
filename2 = file_name2
matplotlib.rcParams['font.sans-serif'] = ['KaiTi']  # font that can render Chinese text; set before drawing
colors = ["red","yellow","green","blue","black","gold","pink","purple","violet","Chocolate"]  # one color per bar
def to_int(text):  # convert strings such as '15.5万' or '1.2亿' to integers ('--' or empty becomes 0)
    text = text.strip()
    if text in ('','--'):
        return 0
    if text.endswith('万'):
        return int(round(float(text[:-1])*10000))
    if text.endswith('亿'):
        return int(round(float(text[:-1])*100000000))
    return int(text)
with open(filename1,encoding="utf_8_sig") as f1:
    reader1 = csv.reader(f1)  # create the reader by passing the open file object to csv.reader()
    header_row1 = next(reader1)  # next() is called once, so it returns only the header row
    print(header_row1)
    for index,column_header in enumerate(header_row1):  # print the index of each column header
        print(index,column_header)
    title1 = []
    rank1 = []
    highs1=[]
    url1 = []
    visit1 = []
    space1 = []
    up_id1 = []
    for row in reader1:
        rank1.append(row[0])
        title1.append(row[1])
        visit1.append(row[2].strip())
        highs1.append(row[3].strip())
        up_id1.append(row[4].strip())
        url1.append(row[5].strip().strip('/'))
        space1.append(row[6].strip().strip('/'))
visit_list_new1 = [to_int(v) for v in visit1]  # play counts as integers
highs_list_new1 = [to_int(h) for h in highs1]  # danmaku counts as integers
print(highs_list_new1)
x=np.array(rank1[0:10])  # x-axis data: rank
y=np.array(highs_list_new1[0:10])  # y-axis data: danmaku count
plt.bar(x,y,color = colors,width = 0.5)  # bar chart with one color per bar and a fixed bar width
plt.show()
x=np.array(title1[0:10])  # x-axis data: title
y=np.array(visit_list_new1[0:10])  # y-axis data: play count
plt.bar(x,y,color = colors,width = 0.5)
plt.show()
fig = plt.figure(figsize = (15,8))  # set the figure size
plt.title("各视频播放量")  # add the main title
plt.xlabel("视频名称")  # label the x and y axes
plt.ylabel("播放量")
plt.grid(True)  # show grid lines
x=np.array(title1[0:10])
y=np.array(visit_list_new1[0:10])
plt.bar(x,y,color = colors,width = 0.6)
plt.savefig(r"C:\Users\24390\Desktop\bilibili-up-v.png")  # save the figure
with open(filename2,encoding="utf_8_sig") as f2:
    reader2 = csv.reader(f2)
    header_row2 = next(reader2)
    print(header_row2)
    for index,column_header in enumerate(header_row2):
        print(index,column_header)
    rank2 = []
    title2 = []
    highs2 = []
    url2 = []
    visit2 = []
    new_word2 = []
    for row in reader2:
        rank2.append(row[0])
        title2.append(row[1])
        visit2.append(row[2].strip())
        highs2.append(row[3].strip())
        new_word2.append(row[4])
        url2.append(row[5].strip().strip('/'))
print(highs2)
visit_list_new2 = [to_int(v) for v in visit2]  # play counts as integers
highs_list_new2 = [to_int(h) for h in highs2]  # danmaku counts as integers
print(highs_list_new2)
x=np.array(rank2[0:10])  # x-axis data: rank
y=np.array(highs_list_new2[0:10])  # y-axis data: danmaku count
plt.bar(x,y,color = colors,width = 0.5)
plt.show()
x=np.array(title2[0:10])  # x-axis data: title
y=np.array(visit_list_new2[0:10])  # y-axis data: play count
plt.bar(x,y,color = colors,width = 0.5)
plt.show()
fig = plt.figure(figsize = (15,8))  # set the figure size
plt.title("番剧播放量")  # add the main title
plt.xlabel("番剧名称")  # label the x and y axes
plt.ylabel("播放量")
plt.grid(True)  # show grid lines
x=np.array(title2[0:10])
y=np.array(visit_list_new2[0:10])
plt.bar(x,y,color = colors,width = 0.6)
plt.savefig(r"C:\Users\24390\Desktop\bilibili-draw-v.png")  # save the figure
# Save the (danmaku count, play count) pairs
with open('output.csv','w',newline='') as f:
    writer = csv.writer(f)
    writer.writerows(zip(highs_list_new2,visit_list_new2))
# Build the data set
examDict = {'弹幕量':highs_list_new2[0:10],
            '播放量':visit_list_new2[0:10]}
# Convert to a DataFrame
examDf = DataFrame(examDict)
# Draw the scatter plot
plt.scatter(examDf.播放量,examDf.弹幕量,color = 'b',label = "Bangumi data")
# Label the x and y axes
plt.xlabel("Play count")
plt.ylabel("Danmaku count")
# Show the figure
plt.show()
# Correlation coefficient matrix
rDf = examDf.corr()
print(rDf)
exam_X=examDf.弹幕量
exam_Y=examDf.播放量
# Split the data set into a training set and a test set
X_train,X_test,Y_train,Y_test = train_test_split(exam_X,exam_Y,train_size=.8)
# X_train / X_test are the training / test features, Y_train / Y_test the training / test labels;
# exam_X holds the sample features, exam_Y the sample labels, and train_size is the share of training data
print("Original features:",exam_X.shape,
      ", training features:",X_train.shape,
      ", test features:",X_test.shape)
print("Original labels:",exam_Y.shape,
      ", training labels:",Y_train.shape,
      ", test labels:",Y_test.shape)
# Scatter plot of the split
plt.scatter(X_train, Y_train, color="blue", label="train data")
plt.scatter(X_test, Y_test, color="red", label="test data")
# Add the legend and axis labels
plt.legend(loc=2)
plt.xlabel("Danmaku count")
plt.ylabel("Play count")
# Save and show the figure
plt.savefig("tests.jpg")
plt.show()
model = LinearRegression()
# Calling model.fit(X_train,Y_train) directly raises an error because the features are one-dimensional
# model.fit(X_train,Y_train)
# reshape(-1,1): -1 lets NumPy infer the number of rows, 1 gives a single column
# The model expects a 2-D feature array, but there is only one feature here, so reshape is required
X_train = X_train.values.reshape(-1,1)
X_test = X_test.values.reshape(-1,1)
model.fit(X_train,Y_train)
a = model.intercept_  # intercept
b = model.coef_  # regression coefficient
print("Best-fit line: intercept",a,", regression coefficient:",b)
# Predictions on the training data
y_train_pred = model.predict(X_train)
# Draw the best-fit line using the predictions on the training data
plt.plot(X_train, y_train_pred, color='black', linewidth=3, label="best line")
# Scatter plot of the test data
plt.scatter(X_test, Y_test, color='red', label="test data")
# Add the legend and axis labels
plt.legend(loc=2)
plt.xlabel("Danmaku count")
plt.ylabel("Play count")
# Save and show the figure
plt.savefig("lines.jpg")
plt.show()
# Coefficient of determination (R^2) on the test set
score = model.score(X_test,Y_test)
print(score)
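print(title1[1],title2[1])
print('Would you like to watch an uploader video, watch an anime, or look up an uploader space page?\nEnter 1 for an uploader video, 2 for an anime, 3 for an uploader space page.')
z = int(input())
if z == 2:
    print(title2)
    print('Enter the anime you want to watch:')
    name = input()
    if name in title2:
        i = title2.index(name)  # position of the requested title
        print(i)
        print(url2[i])
        to_url2=url2[i]
        d = webdriver.Chrome()  # open Chrome and keep the driver in d
        d.get('https://'+to_url2)  # open the page in the current window
        sleep(2)
    else:
        print('No matching anime found')
elif z == 1:
    print(title1)
    print('Enter the uploader video you want to watch:')
    name = input()
    if name in title1:
        i = title1.index(name)  # position of the requested title
        print(i)
        print(url1[i])
        to_url1=url1[i]
        d = webdriver.Chrome()  # open Chrome and keep the driver in d
        d.get('https://'+to_url1)  # open the page in the current window
        sleep(2)
    else:
        print('No matching video found')
elif z == 3:
    print(up_id1)
    print('Enter the uploader whose space page you want to look up:')
    name = input()
    if name in up_id1:
        i = up_id1.index(name)  # position of the requested uploader
        print(i)
        print(space1[i])
        to_space11=space1[i]
        d = webdriver.Chrome()  # open Chrome and keep the driver in d
        d.get('https://'+to_space11)  # open the page in the current window
        sleep(2)
    else:
        print('No matching uploader found')
else:
    print('Invalid input')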
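The lookup step of the menu above (match a name typed by the user against the crawled titles and build the address to open) can be separated from the browser call so that it is easy to test. This is a small sketch with a hypothetical helper, `find_url`, and invented sample lists; it does not open a browser.

```python
from typing import List, Optional


def find_url(titles: List[str], urls: List[str], name: str) -> Optional[str]:
    """Return the https URL of the first entry whose title equals `name`, else None."""
    wanted = name.strip()
    for title, url in zip(titles, urls):
        if title == wanted:
            # Stored links have no scheme, so add one.
            return "https://" + url.lstrip("/")
    return None


# Invented sample data in the same shape as title2 / url2.
sample_titles = ["anime A", "anime B"]
sample_urls = ["www.bilibili.com/bangumi/play/ss1", "//www.bilibili.com/bangumi/play/ss2"]

print(find_url(sample_titles, sample_urls, "anime B"))
print(find_url(sample_titles, sample_urls, "missing title"))
```

Returning `None` for an unknown name lets the caller print a message instead of opening the wrong page.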
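V. Summary
1. What conclusions can be drawn from analyzing and visualizing the topic data? Were the expected goals reached?
Conclusion: What impressed me most in this course project was that, when I ran into a problem, I could look up the cause of the bug online and resolve it. When designing a project, it is worth combining it with machine learning and other areas: the scatter plots and bar charts drawn here go beyond the crawler topic itself and involve applying machine learning, which showed me how broad the knowledge and applications behind a project topic can be.
Goals: First, I need to master the basic crawler steps of sending requests and storing data. Collecting and extracting information and then visualizing it is also a focus of my next stage of study. Persisting the data reduces how often the collected data has to be cleaned and processed. This project made me realize that I need to deepen my understanding of Python so that I can quickly find my weak points, work on them specifically, and keep improving my Python skills.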