Contents
1 Parsing HTML with regular expressions
2 Parsing HTML with BeautifulSoup
Web scraping is when we write a program that pretends to be a web browser and retrieves pages, then examines the data in those pages looking for patterns.
▲ Example: extracting a link from page content:
Page content:
<h1>The First Page</h1>
<p>
If you like, you can switch to the
<a href="http://www.dr-chuck.com/page2.htm">
Second Page</a>.
</p>
A regular expression to extract "http://www.dr-chuck.com/page2.htm":
href="http[s]?://.+?"
Here, [s]? matches zero or one 's', so the pattern covers both http:// and https://; .+? then matches one or more of any character, and the trailing ? makes the match non-greedy, so it stops at the first closing double quote.
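To see what non-greedy buys you, here is a minimal sketch (the two URLs are invented for illustration) comparing the greedy and non-greedy forms on a line that contains two links:

import re

# Invented one-line snippet with two links
line = 'See <a href="http://a.example/x">A</a> and <a href="http://b.example/y">B</a>'

print(re.findall('href="(http[s]?://.+)"', line))
# Greedy: ['http://a.example/x">A</a> and <a href="http://b.example/y']
print(re.findall('href="(http[s]?://.+?)"', line))
# Non-greedy: ['http://a.example/x', 'http://b.example/y']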
▲ Code using urllib plus a regular expression
import urllib.request, urllib.parse, urllib.error
import re
import ssl

# Ignore SSL certificate errors - this context lets the program reach sites that strictly enforce HTTPS
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')  # the page the user wants to scrape
html = urllib.request.urlopen(url, context=ctx).read()  # read the whole page at once, as a bytes object
links = re.findall(b'href="(http[s]?://.*?)"', html)  # find and extract the links; b'' marks a bytes pattern; returns a list
for link in links:
    print(link.decode())  # decode each bytes result to str and print it
Output:
Enter - https://docs.python.org
https://docs.python.org/3/index.html
https://www.python.org/
https://docs.python.org/3.8/
https://docs.python.org/3.7/
https://docs.python.org/3.5/
https://docs.python.org/2.7/
https://www.python.org/doc/versions/
https://www.python.org/dev/peps/
https://wiki.python.org/moin/BeginnersGuide
https://wiki.python.org/moin/PythonBooks
https://www.python.org/doc/av/
https://www.python.org/
https://www.python.org/psf/donations/
http://sphinx.pocoo.org/
▲ Regular expressions work very well when the HTML is well formed and predictable. But since there are plenty of broken HTML pages out there, a solution that relies only on regular expressions may miss some valid links or end up with bad data.
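One concrete failure mode, shown as a small sketch (the snippet is invented): the pattern above insists on double quotes, so a page that writes its attributes with single quotes is silently skipped:

import re

# Invented snippet: the same link, but with single-quoted attributes
page = "<a href='http://www.dr-chuck.com/page2.htm'>Second Page</a>"
print(re.findall('href="(http[s]?://.+?)"', page))  # [] - a valid link is missed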
▲ Installing BeautifulSoup
beautifulsoup4 · PyPI
After downloading the wheel, run from the command line:
# install or upgrade pip first
py -m pip install --upgrade pip setuptools wheel
# I downloaded the wheel to my desktop, hence desktop/; adjust the path to wherever you saved it
pip install desktop/beautifulsoup4-4.10.0-py3-none-any.whl
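If the machine is online, pip install beautifulsoup4 fetches the package from PyPI directly, with no manual download. Either way, a quick sanity check from Python confirms the install (the version printed will match whichever release you installed):

import bs4
print(bs4.__version__)  # e.g. 4.10.0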
▲ Parsing HTML
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')  # the page to scrape
html = urllib.request.urlopen(url, context=ctx).read()  # open and read the page with urllib
soup = BeautifulSoup(html, 'html.parser')  # hand the content to BeautifulSoup to parse

# Retrieve all of the anchor tags; see https://www.w3school.com.cn/tags/index.asp for a tag reference
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))
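The script imports urllib.parse but never uses it; one natural use, shown here as a sketch with an invented two-link fragment, is urllib.parse.urljoin, which resolves relative hrefs into absolute URLs against the page's base address:

from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Invented fragment: one absolute link and one relative link
html = '<a href="https://www.python.org/">Python</a> <a href="/3/index.html">Docs</a>'
base = 'https://docs.python.org'  # pretend this is the page we fetched

soup = BeautifulSoup(html, 'html.parser')
for tag in soup('a'):
    print(urljoin(base, tag.get('href', None)))
# https://www.python.org/
# https://docs.python.org/3/index.html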
▲ You can pull out other parts of a tag too
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urlopen(url, context=ctx).read()
print(html.decode())
soup = BeautifulSoup(html, "html.parser")

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    # Look at the parts of a tag
    print('TAG:', tag)
    print('URL:', tag.get('href', None))
    print('Contents:', tag.contents[0])
    print('Attrs:', tag.attrs)
Output:
Enter - http://www.dr-chuck.com/page1.htm
TAG: <a href="http://www.dr-chuck.com/page2.htm">
Second Page</a>
URL: http://www.dr-chuck.com/page2.htm
Contents: ['\nSecond Page']
Attrs: [('href', 'http://www.dr-chuck.com/page2.htm')]
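The same call works for any element, not just anchors. As a final sketch (the fragment and class name are invented), soup('span') collects span tags and tag.contents exposes the text inside each one:

from bs4 import BeautifulSoup

# Invented fragment: a few span tags holding numbers
html = '<span class="count">97</span> <span class="count">90</span>'
soup = BeautifulSoup(html, 'html.parser')

total = 0
for tag in soup('span'):
    total += int(tag.contents[0])  # the text inside each span
print('Sum:', total)  # Sum: 187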