Contents
1 Parsing HTML with regular expressions
2 Parsing HTML with BeautifulSoup
Web scraping is when we write a program that pretends to be a web browser and retrieves pages, then examines the data in those pages looking for patterns.
▲ Example: extracting a link from page content:
Page content:
<h1>The First Page</h1>
<p>
If you like, you can switch to the
<a href="http://www.dr-chuck.com/page2.htm">
Second Page</a>.
</p>
A regular expression to extract "http://www.dr-chuck.com/page2.htm":
href="http[s]?://.+?"
Here, [s]? matches zero or one 's', so the pattern covers both http:// and https://; .+? then matches one or more of any character, and the trailing ? makes the match non-greedy, so it stops at the first closing double quote.
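To see what non-greedy buys you, here is a minimal sketch (the two URLs are invented for illustration) comparing the greedy and non-greedy forms on a line that contains two links:

import re

# Invented one-line snippet with two links
line = 'See <a href="http://a.example/x">A</a> and <a href="http://b.example/y">B</a>'

print(re.findall('href="(http[s]?://.+)"', line))
# Greedy: ['http://a.example/x">A</a> and <a href="http://b.example/y']
print(re.findall('href="(http[s]?://.+?)"', line))
# Non-greedy: ['http://a.example/x', 'http://b.example/y']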
▲ Code using urllib plus a regular expression
import urllib.request, urllib.parse, urllib.error
import re
import ssl

# Ignore SSL certificate errors - this context lets the program reach sites that strictly enforce HTTPS
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')  # the page the user wants to scrape
html = urllib.request.urlopen(url, context=ctx).read()  # read the whole page at once, as a bytes object
links = re.findall(b'href="(http[s]?://.*?)"', html)  # find and extract the links; b'' marks a bytes pattern; returns a list
for link in links:
    print(link.decode())  # decode each bytes result to str and print it
Output:
Enter - https://docs.python.org
https://docs.python.org/3/index.html
https://www.python.org/
https://docs.python.org/3.8/
https://docs.python.org/3.7/
https://docs.python.org/3.5/
https://docs.python.org/2.7/
https://www.python.org/doc/versions/
https://www.python.org/dev/peps/
https://wiki.python.org/moin/BeginnersGuide
https://wiki.python.org/moin/PythonBooks
https://www.python.org/doc/av/
https://www.python.org/
https://www.python.org/psf/donations/
http://sphinx.pocoo.org/
▲ Regular expressions work very well when the HTML is well formed and predictable. But since there are plenty of broken HTML pages out there, a solution that relies only on regular expressions may miss some valid links or end up with bad data.
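One concrete failure mode, shown as a small sketch (the snippet is invented): the pattern above insists on double quotes, so a page that writes its attributes with single quotes is silently skipped:

import re

# Invented snippet: the same link, but with single-quoted attributes
page = "<a href='http://www.dr-chuck.com/page2.htm'>Second Page</a>"
print(re.findall('href="(http[s]?://.+?)"', page))  # [] - a valid link is missed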
▲ Installing BeautifulSoup
beautifulsoup4 · PyPI
After downloading the wheel, run from the command line:
# install or upgrade pip first
py -m pip install --upgrade pip setuptools wheel
# I downloaded the wheel to my desktop, hence desktop/; adjust the path to wherever you saved it
pip install desktop/beautifulsoup4-4.10.0-py3-none-any.whl
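If the machine is online, pip install beautifulsoup4 fetches the package from PyPI directly, with no manual download. Either way, a quick sanity check from Python confirms the install (the version printed will match whichever release you installed):

import bs4
print(bs4.__version__)  # e.g. 4.10.0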
▲ Parsing HTML
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')  # the page to scrape
html = urllib.request.urlopen(url, context=ctx).read()  # open and read the page with urllib
soup = BeautifulSoup(html, 'html.parser')  # hand the content to BeautifulSoup to parse

# Retrieve all of the anchor tags; see https://www.w3school.com.cn/tags/index.asp for a tag reference
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))
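The script imports urllib.parse but never uses it; one natural use, shown here as a sketch with an invented two-link fragment, is urllib.parse.urljoin, which resolves relative hrefs into absolute URLs against the page's base address:

from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Invented fragment: one absolute link and one relative link
html = '<a href="https://www.python.org/">Python</a> <a href="/3/index.html">Docs</a>'
base = 'https://docs.python.org'  # pretend this is the page we fetched

soup = BeautifulSoup(html, 'html.parser')
for tag in soup('a'):
    print(urljoin(base, tag.get('href', None)))
# https://www.python.org/
# https://docs.python.org/3/index.html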
▲ You can pull out other parts of a tag too
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urlopen(url, context=ctx).read()
print(html.decode())
soup = BeautifulSoup(html, "html.parser")

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    # Look at the parts of a tag
    print('TAG:', tag)
    print('URL:', tag.get('href', None))
    print('Contents:', tag.contents[0])
    print('Attrs:', tag.attrs)
Output:
Enter - http://www.dr-chuck.com/page1.htm
TAG: <a href="http://www.dr-chuck.com/page2.htm">
Second Page</a>
URL: http://www.dr-chuck.com/page2.htm
Contents: ['\nSecond Page']
Attrs: [('href', 'http://www.dr-chuck.com/page2.htm')]
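The same call works for any element, not just anchors. As a final sketch (the fragment and class name are invented), soup('span') collects span tags and tag.contents exposes the text inside each one:

from bs4 import BeautifulSoup

# Invented fragment: a few span tags holding numbers
html = '<span class="count">97</span> <span class="count">90</span>'
soup = BeautifulSoup(html, 'html.parser')

total = 0
for tag in soup('span'):
    total += int(tag.contents[0])  # the text inside each span
print('Sum:', total)  # Sum: 187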