This article walks through a crawler exercise: summing all the numbers on a web page (exercise site: http://www.glidedsky.com/). It should be a useful reference for solving this kind of scraping problem, so interested programmers can follow along.
Exercise 1: the problem is as follows.
1. Target URL: http://www.glidedsky.com/, the first exercise on this site.
2. Python implementation with requests + lxml.etree + XPath (for cookie_str, use the cookies from your own logged-in session; for safety, the cookie_str below has been deliberately invalidated!)
import requests
from lxml import etree
# The cookies copied from the browser come as a single string, but requests expects a dict
cookie_str="__gads=ID=22f6a2858602bd803-2259684f8fc40016:T=1604466297:RT=1604466297:S=ALNI_MYdpnvlmVSlQmK8R_5QLUQZrnNr6A; _ga=GA1.2.361744110.1604466299; footprints=eyJpdiI6IldHVmx6d1wvU2NYR0dJaFZOOHZQRW1RPT0iLCJ2YWx1ZSI6IkdrSmNXV2VlT2RERmZcL0FORXdVQlljc29taUxVMklvaGZ3cCtDM09QV1VHOEhjakRCNmhlNTg2ZWJBVk9pSjdOIiwibWFjIjoiZDhiYTgzMGZkODg5YjRkMmY4MzQ1MjZmMTUxOTU2YTY3YzAxYzk1OWY4NDkwNGExMjI2NDQ0YzY1NDkzMTg2ZCJ9; Hm_lvt_020fbaad6104bcddd1db12d6b78812f6=1604466299,1604466352,1604591546; _gid=GA1.2.414782633.1604591546; remember_web_59ba36addc2b2f9401580f014c7f58ea4e30989d=eyJpdiI6IllaSUdOTmxxWEJRSmJRXC9EbkhwKzRRPT0iLCJ2YWx1ZSI6IkRON1RlVTNJd3M5RTRRTVwvTTh3b0JBSW44RVVVeTlwaE5oQzR1Q1NjQlhseEVMbHRBa2owMlFTUzhyeXlFa1JGTHdKU1wvclVjS1Y3Slo5blptUW02ekhxSGxGTjhGK05hSzJPTjRKb0NROG56NkY5SUswOWFYSjhubklUemtaNmlqanp2bXAxRCt1K0o0ZGlaS0htYWlzbllsR1wvbGIrRURSeDhJV2QxNktYTT0iLCJtYWMiOiIyZDNjY2ZlY2MzM2YwZjc4MzVkNmQyMzQ4M2QwMDgxODkzNTE3YjFmZWFhMTk3MDkxNGJkNTI5Nzg3Njc2Mjc5In0%3D; _gat_gtag_UA_75859356_3=1; XSRF-TOKEN=eyJpdiI6InEyNkhoa1B1WHIrVkwyZzdrSTdlUXc9PSIsInZhbHVlIjoiUTM0VEpXc1IrMnlsWm9WOW9CQmpuVjNwUmhDY2JIZWE1WmZTWDNHXC9ucndDbUFsemNiU0ZiQ21qaWRGb2FaS00iLCJtYWMiOiIzZGQzMjEzNGM4ZDQyNjJlYjNkY2IxNWFmZGFiMTM1ZjdiZmQ0MmIxMDMyMDUwNWYzMWNmYWEzNTM2ZGY2ZWMwIn0%3D; glidedsky_session=eyJpdiI6IlJkQ0pvbTNFTHJNUFJvZkFZOWgrM0E9PSIsInZhbHVlIjoic25uKzJYV1hxcCtqcEViVjVRcnZzU045SVN5ek45MExlRm55YWxaT3M5aUZKaVBEV290M1F4VmFIVmM1UHcycyIsIm1hYyI6ImNlNmJjMWY2OTg3OWY1MTBjOTg0ZTRhZmEzZWMxOWVmODMxODk0ZTY0N2IwOTI2YjNiNmZjYmY3MmViZWUwMjEifQ%3D%3D; Hm_lpvt_020fbaad6104bcddd1db12d6b78812f6=1604591625"
# Convert the cookie string to a dict with a dict comprehension over the "name=value" pairs
# (maxsplit=1 keeps "=" characters inside the values intact; .strip() removes the space after each ";")
cookie_dict = {i.split("=", 1)[0].strip(): i.split("=", 1)[1] for i in cookie_str.split(";")}
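# Alternative (a sketch, not part of the original write-up): the standard library's
# http.cookies.SimpleCookie can also parse a raw Cookie header string, which avoids
# hand-rolling the splitting. cookie_dict_alt is just an illustrative name here.
from http.cookies import SimpleCookie
_jar = SimpleCookie()
_jar.load(cookie_str)
cookie_dict_alt = {name: morsel.value for name, morsel in _jar.items()}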
# print(cookie_dict)
url = "http://www.glidedsky.com/level/web/crawler-basic-1"
# The page's only anti-scraping measure is requiring a login to see the data, so passing the login cookies gets around it
res = requests.get(url, cookies=cookie_dict)
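# Optional check (a sketch): fail fast on HTTP errors. Note that an expired session usually
# still returns 200 with the login page, in which case the XPath below simply matches nothing.
res.raise_for_status()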
# Parse the data: match the target text nodes with XPath
data = etree.HTML(res.text)
data_last = data.xpath("//div[@class='row']/div/text()")
# data_last looks like: ['\n 339\n ', '\n 80\n ', ...]
# Process the data: the matched strings above cannot be summed as they are, so they need converting first. Two simple approaches are shown below.
# Approach 1: use the built-in map() function
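# int() itself ignores surrounding whitespace, e.g. int('\n 339\n ') == 339,
# so the raw XPath strings can be passed to int() without any cleanup.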
data_new = sum(map(int, data_last))
print("Sum:", data_new)
# Approach 2: first convert the data to ['339', '80', ...], then sum it in a loop
num = "".join(data_last).split()
a = 0
for i in num:
    a += int(i)
print("Sum:", a)
This concludes the article on the crawler exercise: summing all the numbers on a web page (exercise site: http://www.glidedsky.com/). We hope the recommended article is helpful, and we hope you will keep supporting 为之网!