News sites typically present articles as a list of headlines, i.e. an overview, surrounded by other links at the top and on both sides: advertisements, navigation, recommended-news lists, hot-news lists, and so on. This makes it tricky for a crawler to pick out the main article list. Methods for extracting an article's *body* are well covered: common algorithms score the text-bearing tags of the HTML DOM tree and combine that with text-density analysis, and the tools that turn up in a search are mainly readability, newspaper, and html2article. For extracting the article *list*, however, I have not found a suitable off-the-shelf tool, so this post describes a crawler-based method for extracting the list of articles from a web page.
The goal is to use a crawler to extract the article list from a page such as https://news.163.com/domestic/, as shown in the figure below:
Once the article titles and links have been obtained, the article bodies can be crawled as a further step.
The simplest implementation is to fetch the page HTML with a crawler and then use XPath to collect the text and href attribute of every <a> tag, which yields the article titles and links. This generally works for clean pages that contain little besides the article list, such as government websites:
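As a baseline, here is a minimal sketch of that naive approach, using requests and lxml (requests is used here only for brevity; JavaScript-heavy pages need the Selenium setup shown later). It treats every <a> tag on the page as a candidate article link, which is exactly why noisy pages defeat it:

import requests
from lxml import etree

# Naive extraction: every <a> tag with text and an href is a candidate article
html = requests.get('https://news.163.com/domestic/').text
doc = etree.HTML(html)
for node in doc.xpath('//a'):
    title = (node.text or '').strip()
    href = node.get('href')
    if title and href:
        print(title, href)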
The page in Figure 3 is fairly clean: the large central block is the article list. Even so, the top of the page still carries section navigation and links into other systems, and the bottom has pagination and links to related sites. Hardly any page is a 100% pure article list, so a list extracted this way will contain some noise.
Web pages are usually laid out as several regions, each in its own block: recommendation lists, hot lists, navigation, advertisements, the article list, and so on. On a news page the article list is the main region, and visually it is also the largest and densest one.
Based on this observation, the article list can be extracted by combining a clustering algorithm with text-density analysis.
This solution uses the density-based OPTICS (Ordering Points To Identify the Clustering Structure) algorithm to identify clusters of hyperlinks on the page, i.e. its regions, as preparation for extracting the article list.
Most clustering algorithms are sensitive to the input parameters ε (the maximum neighborhood radius) and MinPts (the minimum number of points required in a core object's neighborhood): small changes in the inputs can produce very different clusters. OPTICS weakens this sensitivity. Rather than outputting clusters directly, it outputs a cluster ordering, a linear list that encodes the density-based clustering structure of the data set and from which concrete clusters can be derived. The algorithm itself is described in detail in data-mining textbooks and online, so it is not repeated here; readers unfamiliar with OPTICS should study it first and then return to this post.
The key part of the approach is data modeling: constructing a similarity model for the data. Density-based clustering needs a distance between data points, where a smaller distance means higher similarity and a larger distance lower similarity. For article-list analysis, the points being clustered are the <a> tags of the HTML DOM.
How, then, to define the distance between two tags in the HTML DOM? The DOM is a tree, as shown below:
dist(i, j) = (P_max - M) / P_max

where P_max = max(P_i, P_j), and P_i is the number of ancestors of tag i (its depth in the DOM tree).
M measures how many ancestors the two tags share, or how many of their ancestors are of the same kind, as sketched below:
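The following is a condensed version of the dist method from the complete code listing further down. Ancestor chains are compared pairwise from the immediate parent upward, adding 1 for an identical ancestor node and 0.9 for an ancestor of the same kind (same tag name and CSS classes):

def dist(node1, ancestors1, node2, ancestors2):
    if node1 is node2:
        return 0
    p = max(len(ancestors1), len(ancestors2))   # P_max
    m = 0                                       # M
    for i in range(p):
        if i < len(ancestors1) and i < len(ancestors2):
            p1, p2 = ancestors1[i], ancestors2[i]
            if p1 is p2:
                m += 1     # identical ancestor node
            elif p1.tag == p2.tag and p1.get("class") == p2.get("class"):
                m += 0.9   # same kind of ancestor (tag name + CSS classes)
    return (p - m) / p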
This normalizes the distance between any two tags into the interval [0, 1]. All <a> tags of the DOM tree are placed in a linear list and the distance between every pair is computed, producing an N × N distance matrix whose main diagonal is all zeros (a tag's distance to itself is 0).
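Building the matrix is a straightforward double loop; this mirrors the matrix construction inside extract_article_list in the full listing (get_ancestors collects a node's ancestor chain):

import numpy as np

n = len(link_nodes)                                  # all <a> tags as a linear list
ancestors = [get_ancestors(node) for node in link_nodes]
dist_matrix = np.zeros((n, n))                       # main diagonal stays 0
for i in range(n):
    for j in range(i + 1, n):
        d = dist(link_nodes[i], ancestors[i], link_nodes[j], ancestors[j])
        dist_matrix[i][j] = dist_matrix[j][i] = d    # the matrix is symmetric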
The OPTICS algorithm then outputs the cluster ordering of the <a> tags, from which the clustering result is generated as follows:
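The logic below is condensed from the clustering method in the full listing: walk the cluster ordering; a point whose reachability distance is within ε joins the current cluster, a core point with a larger reachability distance starts a new cluster, and everything else is treated as noise:

def clustering(link_nodes, core_dists, reach_dists, ordered_indices, epsilon=0.135):
    clusters, cluster = [], []
    for p in ordered_indices:
        if reach_dists[p] <= epsilon:
            cluster.append(link_nodes[p])   # reachable: extend the current cluster
        elif core_dists[p] <= epsilon:
            if cluster:
                clusters.append(cluster)
            cluster = [link_nodes[p]]       # core point: start a new cluster
        # otherwise p is noise and is skipped
    if cluster:
        clusters.append(cluster)            # don't forget the last cluster
    return clusters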
Having the <a>-tag clusters is still not enough to tell which one is the article list. Inspection of a large number of news sites shows that the article list is generally the cluster with the highest text density. The text density T of a cluster is defined as:
T = (W_1 + W_2 + ... + W_N) / N

where W_i is the number of characters of text in the i-th <a> tag of the cluster, and N is the number of <a> tags in the cluster.
In other words, T is simply the average text length of the tags in the cluster. Compute the text density of every cluster and sort by it; the cluster with the highest density is the article list.
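This corresponds to get_article_list_cluster in the full listing, condensed here (get_title returns a node's link text):

def get_article_list_cluster(node_clusters):
    # The article list usually has the highest average link-text length
    densities = []
    for nodes in node_clusters:
        words = sum(len(get_title(node)) for node in nodes)
        densities.append(words / len(nodes))   # T = (W_1 + ... + W_N) / N
    return node_clusters[int(np.argmax(densities))]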
The code is implemented with Python + Selenium + Chrome + chromedriver. The numpy and selenium libraries must be installed, and since the code below also imports lxml and scrapy (only for its Selector), install those as well:
python -m pip install numpy
python -m pip install selenium
python -m pip install lxml
python -m pip install scrapy
Install Chrome on Linux (the yum command assumes a CentOS/RHEL system):
yum install https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm
Download chromedriver: https://npm.taobao.org/mirrors/chromedriver/2.40/chromedriver_linux64.zip
After unpacking, place the chromedriver binary in the same directory as the code.
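As a quick smoke test of the environment (a minimal sketch; the URL is just an example), the following should print the page title if Chrome and chromedriver are wired up correctly:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

options = Options()
options.add_argument('--headless')      # run Chrome without a display
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(service=Service('./chromedriver'), options=options)
driver.get('https://news.163.com/domestic/')
print(driver.title)
driver.quit()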
The Article class, a simple container for one article (saved as Article.py, since the spider imports it with from Article import Article):

class Article(object):
    def __init__(self):
        self.title = ''           # article title (the link text)
        self.text = ''            # article body, filled in by a later crawl
        self.url = ''             # absolute URL of the article
        self.published_date = ''  # publication date, if available
The spider itself:

from Article import Article
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from lxml import etree
import numpy as np
from functools import cmp_to_key
from scrapy.selector import Selector
import re
import traceback


class ArticleListSpider:
    _undefined_dist = 2      # sentinel for "core/reachability distance not set yet"
    _epsilon = 0.135         # neighborhood radius for OPTICS
    _min_pts = 3             # MinPts for OPTICS
    _min_title_len = 5       # links with shorter text are discarded

    def __init__(self, url):
        self.start_url = url
        self.domain = ArticleListSpider.get_domain(url)
        if self.start_url.startswith('http://'):
            self.protocol = 'http://'
        if self.start_url.startswith('https://'):
            self.protocol = 'https://'

    def download_articles(self):
        url = self.start_url
        options = Options()
        options.add_argument('--headless')
        options.add_argument('--disable-gpu')
        service = Service('./chromedriver')
        driver = webdriver.Chrome(service=service, options=options)
        driver.get(url)
        # driver.implicitly_wait(5)
        html = driver.page_source
        driver.quit()
        articles = self.extract_article_list(html)
        return articles

    def make_article_url(self, node):
        start_url = self.start_url
        href = node.get("href")
        if not href or href.startswith("javascript:") or href.strip() == '#':
            # Some sites put the real URL in an onclick handler instead of href
            href = None
            onclick = node.get("onclick")
            if onclick:
                try:
                    m = re.search(r'\((.+)\)', onclick)
                    if m:
                        href = m.group(1).strip()
                        href = re.sub(r'[\'"]', '', href)
                except Exception:
                    traceback.print_exc()
        if not href:
            raise AttributeError("Unable to make article URL from node: %s, from start URL: %s"
                                 % (etree.tostring(node, encoding='utf-8').decode('utf-8'), start_url))
        # Turn a relative href into an absolute URL
        article_url = href
        article_url = re.sub('^./', '', article_url)
        m = re.search("^https?://", article_url)
        if not m:
            if article_url.startswith('/'):
                article_url = self.protocol + self.domain + article_url
            else:
                if start_url.endswith('html'):
                    start_url = re.sub(r'/[a-z_]+\.[s]?html', '', start_url)
                    start_url = start_url + '/'
                article_url = start_url + article_url
        return article_url

    def extract_article_list(self, html):
        doc = etree.HTML(html)
        link_nodes = doc.xpath('//a')
        # Filter out links with an empty href or a short title
        tmp_nodes = []
        for node in link_nodes:
            title = ArticleListSpider.get_title(node)
            if title and len(title) >= ArticleListSpider._min_title_len and node.get("href"):
                tmp_nodes.append(node)
        link_nodes = tmp_nodes
        link_nodes_num = len(link_nodes)
        ancestors = [None for _ in range(link_nodes_num)]
        # Construct the distance matrix
        dist_matrix = np.zeros((link_nodes_num, link_nodes_num))
        for i in range(link_nodes_num):
            node_i = link_nodes[i]
            ancestors_i = ancestors[i]
            if not ancestors_i:
                ancestors_i = ArticleListSpider.get_ancestors(node_i)
                ancestors[i] = ancestors_i
            for j in range(i + 1, link_nodes_num):
                node_j = link_nodes[j]
                ancestors_j = ancestors[j]
                if not ancestors_j:
                    ancestors_j = ArticleListSpider.get_ancestors(node_j)
                    ancestors[j] = ancestors_j
                dist_matrix[i][j] = ArticleListSpider.dist(node_i, ancestors_i, node_j, ancestors_j)
                dist_matrix[j][i] = dist_matrix[i][j]
        core_dists = [ArticleListSpider._undefined_dist for _ in range(link_nodes_num)]
        reach_dists = [ArticleListSpider._undefined_dist for _ in range(link_nodes_num)]
        ordered_indices = ArticleListSpider.optics_ordering(link_nodes, core_dists, reach_dists, dist_matrix)
        node_clusters = ArticleListSpider.clustering(link_nodes, core_dists, reach_dists, ordered_indices)
        articles = []
        if len(node_clusters) > 0:
            article_list_nodes = ArticleListSpider.get_article_list_cluster(node_clusters)
            for node in article_list_nodes:
                article = Article()
                article.title = ArticleListSpider.get_title(node)
                article.url = self.make_article_url(node)
                articles.append(article)
        return articles

    @staticmethod
    def get_title(node):
        # node.text misses text that follows a sub-element inside the <a> tag,
        # but XPath can retrieve it. For example, in <a href="./index.html"><i>Some</i>thing</a>
        # the "thing" inside the <a> tag cannot be retrieved by node.text.
        title = Selector(text=etree.tostring(node, encoding='utf-8').decode('utf-8')).xpath('//a//text()').extract()
        title = [t.strip() for t in title]
        return ''.join(title)

    @staticmethod
    def get_domain(start_url):
        url = start_url.replace("http://", '')
        url = url.replace("https://", '')
        components = url.split('/')
        domain = components[0]
        return domain

    @staticmethod
    def get_article_list_cluster(node_clusters):
        # Article links usually have the highest text density
        words_density = [0.0 for _ in range(len(node_clusters))]
        for i in range(len(node_clusters)):
            nodes = node_clusters[i]
            words = 0
            for node in nodes:
                title = ArticleListSpider.get_title(node)
                words = words + len(title)
            words_density[i] = words / len(nodes)
        i = np.argsort(words_density)[len(words_density) - 1]
        return node_clusters[i]

    @staticmethod
    def clustering(link_nodes, core_dists, reach_dists, ordered_indices):
        clusters = []
        cluster = []
        for i in range(len(ordered_indices)):
            p = ordered_indices[i]
            node = link_nodes[p]
            if reach_dists[p] <= ArticleListSpider._epsilon:
                cluster.append(node)
            elif core_dists[p] <= ArticleListSpider._epsilon:
                if len(cluster) > 0:
                    clusters.append(cluster)
                # Start a new cluster
                cluster = [node]
            else:
                # p is noise
                pass
        # Append the last cluster
        if len(cluster) > 0:
            clusters.append(cluster)
        return clusters

    @staticmethod
    def optics_ordering(link_nodes, core_dists, reach_dists, dist_matrix):
        link_nodes_num = len(link_nodes)
        neighbors = [[] for _ in range(link_nodes_num)]
        core_indices = [False for _ in range(link_nodes_num)]
        for i in range(link_nodes_num):
            dist = dist_matrix[i]
            # Gather all distances from the current node that are within epsilon
            core_dist = dist[dist <= ArticleListSpider._epsilon]
            # Gather the indices of all nodes within epsilon of the current node:
            # they are its neighbors
            neighbors_i = np.argwhere(dist <= ArticleListSpider._epsilon).flatten()
            # Remove the index of the current node itself
            neighbors_i = neighbors_i[neighbors_i != i]
            if len(core_dist) > ArticleListSpider._min_pts:
                core_indices[i] = True
                # After sorting core_dist, the min_pts-th position is exactly
                # the core distance of the current node
                core_dists[i] = np.sort(core_dist)[ArticleListSpider._min_pts]
            neighbors[i] = neighbors_i
        # Update reachability distances and order all nodes by reachability,
        # in terms of the OPTICS algorithm
        visited = [False for _ in range(link_nodes_num)]
        order_seeds = []
        p_queue = [i for i in range(link_nodes_num)]
        np.random.shuffle(p_queue)
        p = ArticleListSpider.next(p_queue, order_seeds, visited)
        ordered_indices = []
        while p > -1:
            ordered_indices.append(p)
            visited[p] = True
            if core_indices[p]:
                neighbors_p = neighbors[p]
                for q in neighbors_p:
                    if visited[q]:
                        continue
                    ArticleListSpider.update(p, q, order_seeds, reach_dists, core_dists, dist_matrix)
            p = ArticleListSpider.next(p_queue, order_seeds, visited)
        return ordered_indices

    @staticmethod
    def update(p, q, order_seeds, reach_dists, core_dists, dist_matrix):
        # p is a core object
        if q not in order_seeds:
            order_seeds.append(q)
        q_reach_dist = reach_dists[q]
        new_reach_dist = max(core_dists[p], dist_matrix[p][q])
        if q_reach_dist == ArticleListSpider._undefined_dist:
            # If the reachability distance of q was not set yet, set it
            reach_dists[q] = new_reach_dist
        else:
            # Otherwise update it if the new one is smaller than the old one
            reach_dists[q] = min(q_reach_dist, new_reach_dist)
        order_seeds.sort(key=cmp_to_key(lambda x, y: reach_dists[x] - reach_dists[y]))

    @staticmethod
    def next(p_queue, order_seeds, visited):
        # Take the next unvisited node: seeds ordered by reachability first,
        # then the shuffled queue; -1 means no nodes are left
        while True:
            if len(order_seeds) > 0:
                p = order_seeds[0]
                del order_seeds[0]
            elif len(p_queue) > 0:
                p = p_queue.pop()
            else:
                p = -1
            if p == -1 or not visited[p]:
                break
        return p

    @staticmethod
    def get_ancestors(node):
        # Collect the ancestor chain of a node, from its parent up to the root
        ancestors = []
        node = node.getparent()
        while node is not None:
            ancestors.append(node)
            node = node.getparent()
        return ancestors

    @staticmethod
    def dist(node1, ancestors1, node2, ancestors2):
        if node1 is node2:
            return 0
        p = max(len(ancestors1), len(ancestors2))
        m = 0
        for i in range(p):
            if i < len(ancestors1) and i < len(ancestors2):
                p1 = ancestors1[i]
                p2 = ancestors2[i]
                if p1 is p2:
                    # If the ancestor nodes are identical, add 1
                    m += 1
                elif p1.tag == p2.tag and p1.get("class") == p2.get("class"):
                    # Otherwise, if the tag names and CSS classes are equal, add 0.9
                    m += 0.9
        return (p - m) / p
article_list_url = "https://news.163.com/domestic/"
spider = ArticleListSpider(article_list_url)
article_list = spider.download_articles()
for a in article_list:
    print(a.title + "\t" + a.url)
1. With density-based clustering, the result is usually sensitive to the value of ε. Although OPTICS reduces this sensitivity, page designs vary enormously and so does DOM depth: the deeper the hierarchy, the smaller the distance between elements of the same kind, and vice versa. No single ε therefore suits every site. The value used here, about 0.13 (0.135 in the code above), is an empirical value obtained from crawling over a hundred websites.
2. The text-density measure is somewhat crude, being simply the average length of the link texts within a cluster. On pages with a hot list or recommendation list alongside the main content, the extracted cluster may be that neighboring list instead. As shown in Figure 5, the algorithm sometimes picks up the "Today's Recommendations" block rather than the main news list.
(Example page: http://47.119.182.42/news/index.html)
Using the density-based OPTICS clustering algorithm together with text-density analysis, the news list of a news site can be extracted; from there the article bodies can be crawled, providing data for text mining, information retrieval, and machine learning.