Import the package and create a BeautifulSoup object, using the following HTML source as the example.
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.select('a'))
The result is a list containing every matching <a> tag.
Select the tags with class="sister":
Prefix the class value with a dot (.) to select by class:
print(soup.select('.sister'))
Select the tag whose id is link1:
Prefix the id value with a hash sign (#) to select by id:
print(soup.select('#link1'))
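The three selector forms above can be compared side by side. The following is a minimal sketch on a trimmed copy of the sample markup; the variable names are illustrative, and it uses the built-in 'html.parser' instead of 'lxml' only so that it runs without an extra install (an assumption of this sketch, not a requirement of select()).

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample markup: only the three links are kept.
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

by_tag = soup.select("a")          # tag name: every <a>
by_class = soup.select(".sister")  # dot prefix: class="sister"
by_id = soup.select("#link1")      # hash prefix: id="link1"

# select() always returns a list, even when only one tag matches.
print(len(by_tag), len(by_class), len(by_id))  # 3 3 1
print(by_id[0].get_text())                     # Elsie

# select_one() returns the first match directly (or None if nothing matches).
first = soup.select_one("a.sister")
print(first["id"])                             # link1
```

Because select() returns a list even for an id selector, the single match still has to be taken out with [0]; select_one() skips that step when only the first match is needed.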
# Get the text inside the <title> tag
print(soup.select('title'))
print('_' * 100)
print(soup.select('title')[0].string)
print('_' * 100)
print(soup.select('title')[0].get_text())
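Output: the first call prints a one-element list, [<title>The Dormouse's story</title>]; the other two calls both print the text The Dormouse's story.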
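For <title>, .string and get_text() give the same text, but they are not interchangeable in general. The sketch below, on a trimmed copy of the sample markup and again assuming the built-in 'html.parser', shows the case where they differ: a tag with several children has no single string, so .string is None, while get_text() joins all the nested text.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample markup (one link instead of three).
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>.</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

title = soup.select("title")[0]
story = soup.select("p.story")[0]

# <title> has exactly one child, a text node, so both accessors agree.
print(title.string)      # The Dormouse's story
print(title.get_text())  # The Dormouse's story

# <p class="story"> contains text plus an <a> tag, so .string is None,
# while get_text() concatenates the text of all descendants.
print(story.string)      # None
print(story.get_text())
```

When the structure of the selected tag is not known in advance, get_text() is usually the safer choice.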
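Get the href attribute of every <a> tag:
a_tag = soup.select('a')
for i in a_tag:
    print(i['href'])
(i is a Tag object, so i.href does not work: attribute-style access looks for a child tag named href and returns None. Use i['href'] to read the attribute.)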
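The different ways of reading an attribute from a Tag can be checked directly. This is a minimal sketch on a trimmed copy of the sample markup, assuming the built-in 'html.parser'; the title attribute is used only as an example of an attribute that is not present.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample markup: two of the three links.
html_doc = """
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Dictionary-style access reads the attribute value.
hrefs = [tag["href"] for tag in soup.select("a")]
print(hrefs)  # ['http://example.com/elsie', 'http://example.com/lacie']

first = soup.select("a")[0]

# Dot access searches for a child tag named <href>, which does not exist.
print(first.href)          # None

# .get() returns None for a missing attribute instead of raising.
print(first.get("href"))   # http://example.com/elsie
print(first.get("title"))  # None

# Square brackets raise KeyError when the attribute is missing.
try:
    first["title"]
    missing_raises = False
except KeyError:
    missing_raises = True
print(missing_raises)      # True
```

If some links may lack an href, tag.get('href') avoids the KeyError that tag['href'] would raise.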