Tool: Jupyter Notebook
Example page: NetEase News, https://3g.163.com/touch/news?referFrom=
Import the requests library (note that the name is requests, not request):
import requests
Fetch the page source:
r = requests.get("https://3g.163.com/touch/news")
r.encoding = "utf-8"
r.text
(The output is too long to reproduce here.)
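The request above assumes the server answers promptly and with a 200 status. A minimal sketch of a more defensive version follows; the helper name fetch_html and the 10-second timeout are my own assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the decoded body of `url`, raising on network or HTTP errors.

    fetch_html is a hypothetical helper; the timeout value is an assumption.
    """
    response = requests.get(url, timeout=timeout)
    # Turn 4xx/5xx status codes into requests.exceptions.HTTPError.
    response.raise_for_status()
    # Decode explicitly as UTF-8, as in the cell above.
    response.encoding = "utf-8"
    return response.text
```

Calling fetch_html("https://3g.163.com/touch/news") should give the same text as r.text above. Every failure it raises is a subclass of requests.exceptions.RequestException, so one except clause is enough to catch them.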
Import the html module from the lxml library:
from lxml import html
Parse the source into an element tree:
tree = html.fromstring(r.text)
tree
<Element html at 0x1d70dffe130>
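Scrape the headlines. Inspect the elements in the browser's developer tools (F12) and write an XPath expression that matches them. The first call below returns the text and the second returns the elements: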
tree.xpath("//div[contains(@class, 'tab-content')]//*[contains(@class, 'title')]/text()")
tree.xpath("//div[contains(@class, 'tab-content')]//*[contains(@class, 'title')]")
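To see the difference between the two calls without hitting the network, here is a small sketch on an inline snippet. The markup in SAMPLE is hypothetical and only loosely modelled on a news list; it is not the real page structure.

```python
from lxml import html

# Hypothetical markup for illustration only, not copied from the real page.
SAMPLE = """
<div class="tab-content">
  <article><a href="//example.com/a.html"><span class="title">First headline</span></a></article>
  <article><a href="https://example.com/b.html"><span class="title">Second headline</span></a></article>
</div>
"""

tree = html.fromstring(SAMPLE)
path = "//div[contains(@class, 'tab-content')]//*[contains(@class, 'title')]"

# Ending the path with /text() yields the text nodes as plain strings.
texts = tree.xpath(path + "/text()")
# Without /text() the same path yields the matching elements themselves.
elements = tree.xpath(path)

print(texts)
print([el.tag for el in elements])
# text_content() recovers the text from an element, including nested text.
print([el.text_content() for el in elements])
```

Keeping the elements is handy when you want more than the text, for example an attribute or a child node of the same element.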
Scrape the links. Selecting @href returns the attribute values as strings rather than elements:
t = tree.xpath("//div[contains(@class, 'tab-content')]//article/a/@href")
['//3g.163.com/news/article/GEP9DPO5000189FH.html?clickfrom=channel2018_news_newsList#offset=0', '//3g.163.com/news/article/GEP9GL4K000189FH.html?clickfrom=channel2018_news_newsList#offset=1', '//3g.163.com/news/article/GENUISCU000189FH.html?clickfrom=channel2018_news_newsList#offset=2', '//3g.163.com/news/article/GEKDOC04000189FH.html?clickfrom=channel2018_news_newsList#offset=3', '//3g.163.com/news/article/GENIFQ29053469LG.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14890', '//3g.163.com/news/article/GEOFCFP5053469LG.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14891', '//3g.163.com/news/article/GEP44IPC05503FCU.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14892', '//3g.163.com/news/article/GENR4E7O0515CCSC.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14893', '//3g.163.com/news/article/GENSJU9B0001899O.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14894', '//3g.163.com/news/article/GENC1JKF051795VD.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14895', '//3g.163.com/news/article/GEKMO4J80512B07B.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14896', '//3g.163.com/news/article/GEPCQS8000258152.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14897', '//3g.163.com/news/article/GENG81J405528G7P.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14898', '//3g.163.com/news/article/GENTU50O05390TQD.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14899', '//3g.163.com/news/article/GEKDCO2T05238V2G.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14900', '//3g.163.com/news/article/GENEQK0K05527WCX.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14901', '//3g.163.com/news/article/GEO9ATNH05527EP3.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14902', 
'//3g.163.com/news/article/GENGMEH505128ELF.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14903', '//3g.163.com/news/article/GENKOU0S0552C180.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14904', '//3g.163.com/news/article/GEO70EKO051796Q9.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14905', '//3g.163.com/news/article/GEOC9O4D051484S5.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14906', '//3g.163.com/news/article/GEMVDSV90534M1TZ.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14907', '//3g.163.com/news/article/GENU1BE505148JTU.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14908', '//3g.163.com/news/article/GEMV89510534MH06.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14909', '//3g.163.com/news/article/GENOLQAF0537A693.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14910', '//3g.163.com/news/article/GENBORD10537N9PG.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14911', '//3g.163.com/news/article/GENNO2UA05521A2M.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14912', '//3g.163.com/news/article/GEKS7KAG0552CPF4.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14913', '//3g.163.com/news/article/GEN0IBGO0512B07B.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14914', '//3g.163.com/news/article/GEN9DQOA0534MH06.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14915', '//3g.163.com/news/article/GENASTAJ051100DH.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14916', '//3g.163.com/news/article/GEN5LQCK0517KC40.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14917', '//3g.163.com/news/article/GEN19FNP00058781.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14918', 
'//3g.163.com/news/article/GEN0DVG70514R9P4.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14919', 'https://3g.163.com/news/article/EUM2KO9N000189FH.html?#offset=0', 'https://3g.163.com/news/article/EUM24BRS000189FH.html?#offset=1', 'https://3g.163.com/news/article/EUJ790P7000189FH.html?#offset=2', 'https://3g.163.com/news/article/EUJ76SJ2000189FH.html?#offset=3', 'https://3g.163.com/news/article/EUJ75M7V000189FH.html?#offset=4', 'https://3g.163.com/news/article/EUJ72ESB000189FH.html?#offset=5', 'https://3g.163.com/news/article/EUJ70D5R000189FH.html?#offset=6', 'https://3g.163.com/news/article/E6KATJC70514HDK6.html?#offset=7', 'https://3g.163.com/news/article/E6KA44UI0514HDK6.html?#offset=8', 'https://3g.163.com/news/article/E6K77H210514HDK6.html?#offset=9']
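Titles and links scraped as two separate lists can drift out of step if some article lacks one of them. One possible way to keep them paired is to loop over each article element and run relative XPath queries inside it. This is only a sketch: the SAMPLE markup and the records list are hypothetical and do not reflect the real page.

```python
from lxml import html

# Hypothetical markup for illustration only, not copied from the real page.
SAMPLE = """
<div class="tab-content">
  <article><a href="//example.com/a.html"><span class="title">First headline</span></a></article>
  <article><a href="https://example.com/b.html"><span class="title">Second headline</span></a></article>
</div>
"""

tree = html.fromstring(SAMPLE)

records = []
for article in tree.xpath("//div[contains(@class, 'tab-content')]//article"):
    # A leading "." makes the query relative to this <article> element.
    titles = article.xpath(".//*[contains(@class, 'title')]/text()")
    hrefs = article.xpath("./a/@href")
    # Skip articles that are missing either piece.
    if titles and hrefs:
        records.append({"title": titles[0].strip(), "href": hrefs[0]})

for record in records:
    print(record["title"], record["href"])
```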
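Notice that some of the scraped links do not start with https:. Looking at the original site shows that these links are followed directly from the current page: they are protocol-relative URLs (they begin with //), which a browser resolves against the scheme of the page it is on. So the scheme needs to be added. Import the urljoin function from urllib.parse (part of the standard library, not a separate package):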
from urllib.parse import urljoin
Use urljoin to resolve each scraped link against the page URL:
for i in t:
    x = urljoin("https://3g.163.com/touch/news?referFrom=", i)
    print(x)
https://3g.163.com/news/article/GEP9DPO5000189FH.html?clickfrom=channel2018_news_newsList#offset=0 https://3g.163.com/news/article/GEP9GL4K000189FH.html?clickfrom=channel2018_news_newsList#offset=1 https://3g.163.com/news/article/GENUISCU000189FH.html?clickfrom=channel2018_news_newsList#offset=2 https://3g.163.com/news/article/GEKDOC04000189FH.html?clickfrom=channel2018_news_newsList#offset=3 https://3g.163.com/news/article/GENIFQ29053469LG.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14890 https://3g.163.com/news/article/GEOFCFP5053469LG.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14891 https://3g.163.com/news/article/GEP44IPC05503FCU.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14892 https://3g.163.com/news/article/GENR4E7O0515CCSC.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14893 https://3g.163.com/news/article/GENSJU9B0001899O.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14894 https://3g.163.com/news/article/GENC1JKF051795VD.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14895 https://3g.163.com/news/article/GEKMO4J80512B07B.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14896 https://3g.163.com/news/article/GEPCQS8000258152.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14897 https://3g.163.com/news/article/GENG81J405528G7P.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14898 https://3g.163.com/news/article/GENTU50O05390TQD.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14899 https://3g.163.com/news/article/GEKDCO2T05238V2G.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14900 https://3g.163.com/news/article/GENEQK0K05527WCX.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14901 https://3g.163.com/news/article/GEO9ATNH05527EP3.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14902 
https://3g.163.com/news/article/GENGMEH505128ELF.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14903 https://3g.163.com/news/article/GENKOU0S0552C180.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14904 https://3g.163.com/news/article/GEO70EKO051796Q9.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14905 https://3g.163.com/news/article/GEOC9O4D051484S5.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14906 https://3g.163.com/news/article/GEMVDSV90534M1TZ.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14907 https://3g.163.com/news/article/GENU1BE505148JTU.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14908 https://3g.163.com/news/article/GEMV89510534MH06.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14909 https://3g.163.com/news/article/GENOLQAF0537A693.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14910 https://3g.163.com/news/article/GENBORD10537N9PG.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14911 https://3g.163.com/news/article/GENNO2UA05521A2M.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14912 https://3g.163.com/news/article/GEKS7KAG0552CPF4.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14913 https://3g.163.com/news/article/GEN0IBGO0512B07B.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14914 https://3g.163.com/news/article/GEN9DQOA0534MH06.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14915 https://3g.163.com/news/article/GENASTAJ051100DH.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14916 https://3g.163.com/news/article/GEN5LQCK0517KC40.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14917 https://3g.163.com/news/article/GEN19FNP00058781.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14918 
https://3g.163.com/news/article/GEN0DVG70514R9P4.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14919 https://3g.163.com/news/article/EUM2KO9N000189FH.html#offset=0 https://3g.163.com/news/article/EUM24BRS000189FH.html#offset=1 https://3g.163.com/news/article/EUJ790P7000189FH.html#offset=2 https://3g.163.com/news/article/EUJ76SJ2000189FH.html#offset=3 https://3g.163.com/news/article/EUJ75M7V000189FH.html#offset=4 https://3g.163.com/news/article/EUJ72ESB000189FH.html#offset=5 https://3g.163.com/news/article/EUJ70D5R000189FH.html#offset=6 https://3g.163.com/news/article/E6KATJC70514HDK6.html#offset=7 https://3g.163.com/news/article/E6KA44UI0514HDK6.html#offset=8 https://3g.163.com/news/article/E6K77H210514HDK6.html#offset=9
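For reference, here is a short sketch of how urljoin treats the different kinds of href you may run into. The first two samples are shortened from the links above; the last two paths (example.html) are made up for illustration, and the names BASE, samples and absolute are my own.

```python
from urllib.parse import urljoin

BASE = "https://3g.163.com/touch/news?referFrom="

samples = [
    # Protocol-relative: only the scheme is taken from BASE.
    "//3g.163.com/news/article/GEP9DPO5000189FH.html",
    # Already absolute: returned as is.
    "https://3g.163.com/news/article/EUM2KO9N000189FH.html",
    # Hypothetical root-relative path: scheme and host come from BASE.
    "/news/article/example.html",
    # Hypothetical document-relative path: resolved against BASE's directory.
    "article/example.html",
]

# Collect the results in a list instead of only printing them.
absolute = [urljoin(BASE, href) for href in samples]

for href, url in zip(samples, absolute):
    print(href, "->", url)
```

Collecting the joined URLs in a list, rather than only printing them, keeps them available for the next request.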
References: Requests: HTTP for Humans, Requests 2.18.1 documentation (python-requests.org)
XPath Syntax | 菜鸟教程 (runoob.com)