[Automation] PyChromeDevTools Source Code Analysis


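Contents

- Introduction
  - Development environment
- Background
  - Chrome DevTools Protocol
  - Headless browser
  - Setting up the test environment
- Source code analysis
  - Getting the source
  - Test code
  - Source analysis: initializing the CDP interface object
  - Source analysis: chrome.Page.getFrameTree()
- Summary
- References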

Introduction

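Chrome is a mainstream browser, so automating it has always drawn attention and new solutions keep appearing. Selenium and Puppeteer need no introduction, and browser extensions for the purpose abound. Today's subject, PyChromeDevTools, is a Python library built on the Chrome DevTools Protocol. Many similar open-source libraries exist; a curated list is available at https://github.com/ChromeDevTools/awesome-chrome-devtools. The Python libraries on that list are shown in the image below (surprisingly, PyChromeDevTools is not among them):
[Image: Python libraries listed in awesome-chrome-devtools]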
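Why analyze PyChromeDevTools?

- Concise code: 155 lines in total, about 130 lines of effective code once blank lines and comments are removed.
- Powerful: those hundred-odd lines form a small framework, and you assemble whatever CDP calls you need on top of it.
- Elegant code that is good for learning:
  - Chrome DevTools Protocol
  - Python syntax: __getattr__, __setattr__
  - websocket usage
- More than 200 stars on GitHub at the time of writing, well above the libraries recommended in the list above.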

Development environment

Software        Version                                                        Description
OS              Win10-1607
Python (venv)   Python 3.8.6 (virtualenv)
Google Chrome   96.0.4664.110 (Official Build) (64-bit) (cohort: 97_Win_99)

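Background

Chrome DevTools Protocol

The Chrome DevTools Protocol (CDP below) is a very powerful tool. Put simply, it unlocks capabilities that Chrome normally keeps sealed, letting you work on a page from the browser's side (and on other targets, including workers) and carry out operations that are otherwise hard to do.

Chrome exposes a WebSocket debugging interface for inspecting the DOM, network, performance, storage and more of the page in the current tab. The Developer Tools we use every day are built on this interface.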

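Common commands (HTTP endpoints on the debugging port, 9222 in these examples):

http://127.0.0.1:9222/json : list the tabs that are currently open
http://127.0.0.1:9222/json/version : show browser version information
http://127.0.0.1:9222/json/new?http://www.baidu.com : open the given address in a new tab
http://127.0.0.1:9222/json/close/ac5a6adb-bb53-44f1-a9e6-2354bd724924 : close the tab with the given id
http://127.0.0.1:9222/json/activate/69301801-d503-42a3-9335-3e448a780857 : switch to the target tab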
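The endpoints above all follow the same pattern, so they are easy to build from Python. The following is a minimal sketch using only the standard library; the helper names devtools_url and list_tabs are made up for this illustration and are not part of PyChromeDevTools, and list_tabs only works while a Chrome instance is listening on the given port.

```python
import json
from urllib.request import urlopen


def devtools_url(action="", arg="", host="127.0.0.1", port=9222):
    """Build the URL of a DevTools HTTP endpoint (hypothetical helper)."""
    base = f"http://{host}:{port}/json"
    if not action:
        return base                      # /json lists the open tabs
    if action == "new":
        return f"{base}/new?{arg}"       # target address goes after '?'
    if arg:
        return f"{base}/{action}/{arg}"  # close/activate take a tab id
    return f"{base}/{action}"            # e.g. /json/version


def list_tabs(host="127.0.0.1", port=9222):
    """Fetch and decode the tab list; requires a running Chrome."""
    with urlopen(devtools_url(host=host, port=port)) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    print(devtools_url("version"))
    print(devtools_url("new", "http://www.baidu.com"))
    print(devtools_url("close", "ac5a6adb-bb53-44f1-a9e6-2354bd724924"))
```

Keeping URL construction separate from the network call makes the pattern easy to check without a browser running.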

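Headless browser

Headless mode means running without a user interface. Add the --headless flag when starting Chrome to enable it.

The term comes from systems that run without a display, keyboard, or mouse.

Servers (for example, hosts that provide web services) often lack those devices, yet still need the functionality they offer in order to generate data for applications. That is where headless mode is useful.

Headless mode only skips showing the page. Images, videos and other resources are still downloaded as usual.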

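Setting up the test environment

On Chrome 96.0.4664.110, adding --remote-debugging-port=9991 on its own did not take effect in my tests; --headless had to be added as well before it worked. The full command is: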

"C:\Program Files\Google\Chrome\Application\chrome.exe" "https://www.baidu.com" --remote-debugging-port=9991 --headless

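Open http://localhost:9991/ in a browser to reach the headless browser.
[Image: screenshot of http://localhost:9991/]

Now visit http://127.0.0.1:9991/json/new?http://www.csdn.com from any page and refresh http://localhost:9991/. The newly created tab appears in the list.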
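The launch command from this section can also be issued from Python, which is convenient in test scripts. This is only a sketch: chrome_command and launch_chrome are hypothetical helper names, and the default path is the Windows install location used in this article, so adjust it for your machine.

```python
import subprocess

# Install path taken from the command shown in this article (Windows).
CHROME_PATH = r"C:\Program Files\Google\Chrome\Application\chrome.exe"


def chrome_command(url, port=9991, headless=True, chrome_path=CHROME_PATH):
    """Return the argument list for starting Chrome with remote debugging."""
    cmd = [chrome_path, url, f"--remote-debugging-port={port}"]
    if headless:
        cmd.append("--headless")
    return cmd


def launch_chrome(url, **kwargs):
    """Start Chrome in the background and return the Popen handle."""
    return subprocess.Popen(chrome_command(url, **kwargs))


if __name__ == "__main__":
    # Print the command instead of launching, so the sketch is safe to run.
    print(chrome_command("https://www.baidu.com"))
```

Passing the arguments as a list avoids shell quoting problems with the space in "Program Files".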

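Source code analysis

Here is the structure of the source first. There are only two classes: ChromeInterface and GenericElement.
[Image: source structure diagram]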

Getting the source

You can fetch PyChromeDevTools with git:

git clone https://github.com/marty90/PyChromeDevTools

Or, better, install it together with its dependencies using pip:

pip install PyChromeDevTools

Test code

import PyChromeDevTools


def test_PyChromeDevTools():
    # Create the CDP interface (connects to the first tab by default)
    chrome = PyChromeDevTools.ChromeInterface(port=9991)

    # Print the URL and title of every tab
    for tab in chrome.tabs:
        print(tab['url'], tab['title'])

    # Using the CSDN home page as an example, inspect the frames in Page.
    # result is the dict parsed from the JSON string returned over CDP.
    result, messages = chrome.Page.getFrameTree()
    print(f'Command ID: {result.get("id")}')

    frameTree = result.get('result', {}).get('frameTree', {})
    # Top-level frame
    frame_top = frameTree.get('frame', {})
    print('id: {}, parentId: {}, url: {}'.format(
        frame_top.get('id', ''), frame_top.get('parentId', ''), frame_top.get('url', '')
    ))

    # Child frames; the key is absent when the page has none,
    # so default to an empty list instead of iterating over None
    frame_children = frameTree.get('childFrames', [])
    for child in frame_children:
        child = child.get('frame', {})
        print('\tid: {}, parentId: {}, url: {}'.format(
            child.get('id', ''), child.get('parentId', ''), child.get('url', '')
        ))

Output:
[Image: screenshot of the program output]
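The test code prints only one level of child frames, while the frameTree structure returned by Page.getFrameTree is nested: each entry in childFrames is itself a frame tree. A recursive walk handles any depth. The sketch below runs on a hand-written sample dict rather than on live browser output; walk_frames and the sample ids and URLs are made up for illustration.

```python
def walk_frames(frame_tree, depth=0):
    """Yield (depth, id, parentId, url) for a frame tree and all descendants."""
    frame = frame_tree.get("frame", {})
    yield depth, frame.get("id", ""), frame.get("parentId", ""), frame.get("url", "")
    # childFrames is optional, so fall back to an empty list
    for child in frame_tree.get("childFrames", []):
        yield from walk_frames(child, depth + 1)


# Hand-written sample in the shape of result['result']['frameTree']
SAMPLE_TREE = {
    "frame": {"id": "TOP", "url": "https://www.csdn.net/"},
    "childFrames": [
        {
            "frame": {"id": "C1", "parentId": "TOP", "url": "https://example.com/a"},
            "childFrames": [
                {"frame": {"id": "C2", "parentId": "C1", "url": "https://example.com/b"}}
            ],
        }
    ],
}

for depth, frame_id, parent_id, url in walk_frames(SAMPLE_TREE):
    print("{}id: {}, parentId: {}, url: {}".format("\t" * depth, frame_id, parent_id, url))
```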

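Source analysis: initializing the CDP interface object

PyChromeDevTools initializes the interface object simply by calling the ChromeInterface constructor.

chrome = PyChromeDevTools.ChromeInterface(port=9991)

The call stack looks like this:
[Image: call stack]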

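Constructor

Every parameter has a default value. By default it connects to tab 0 (index 0 is the most recently created tab).
[Image: constructor source]
The constructor only stores the parameters on the object and then calls the connect method: self.connect(tab=tab).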

The connect method

    def connect(self, tab=0, update_tabs=True):
        # Call get_tabs to populate self.tabs
        if update_tabs or self.tabs is None:
            self.get_tabs()

        # WebSocket debugging URL of the selected tab, for example:
        #   ws://localhost:9991/devtools/page/4873CA17C4E484B54405D71AAE7BDC84
        wsurl = self.tabs[tab]['webSocketDebuggerUrl']
        # Close any previous connection
        self.close()
        # Open the new websocket connection
        self.ws = websocket.create_connection(wsurl)
        self.ws.settimeout(self.timeout)

The get_tabs method

    def get_tabs(self):
        # This simply requests http://localhost:9991/json with the requests library
        response = requests.get(f'http://{self.host}:{self.port}/json')
        self.tabs = json.loads(response.text)
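To make the data flow between get_tabs and connect concrete, here is a small sketch that parses a hand-written sample of the /json response and picks out a webSocketDebuggerUrl. The sample entries are invented, and ws_url_for is a hypothetical helper. Unlike connect, which indexes self.tabs directly, this version first keeps only entries of type "page" that actually carry a WebSocket URL; that filtering is a defensive assumption of this sketch, not something the library does.

```python
import json

# Invented sample shaped like the response of http://localhost:9991/json
SAMPLE_RESPONSE = """[
  {"id": "W", "type": "service_worker", "title": "worker", "url": "https://example.com/sw.js",
   "webSocketDebuggerUrl": "ws://localhost:9991/devtools/page/W"},
  {"id": "B", "type": "page", "title": "CSDN", "url": "https://www.csdn.net/",
   "webSocketDebuggerUrl": "ws://localhost:9991/devtools/page/B"},
  {"id": "A", "type": "page", "title": "Baidu", "url": "https://www.baidu.com/",
   "webSocketDebuggerUrl": "ws://localhost:9991/devtools/page/A"}
]"""


def ws_url_for(tabs, index=0):
    """Return the WebSocket URL of the index-th debuggable page tab."""
    pages = [
        tab for tab in tabs
        if tab.get("type") == "page" and "webSocketDebuggerUrl" in tab
    ]
    return pages[index]["webSocketDebuggerUrl"]


tabs = json.loads(SAMPLE_RESPONSE)
print(ws_url_for(tabs))
```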

Source analysis: chrome.Page.getFrameTree()

How chrome.Page is resolved

class ChromeInterface(object):
    # Called only when normal lookup cannot find the attribute on the object
    def __getattr__(self, attr):
        genericelement = GenericElement(attr, self)
        # Store genericelement as an attribute of the object
        self.__setattr__(attr, genericelement)
        return genericelement
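The effect of calling __setattr__ inside __getattr__ can be seen in isolation with a stripped-down stand-in. The classes below are simplified illustrations, not the library's code; the getattr_calls counter is added purely to observe how often __getattr__ runs.

```python
class Element:
    """Simplified stand-in for GenericElement."""
    def __init__(self, name):
        self.name = name


class Interface:
    """Simplified stand-in for ChromeInterface with a call counter."""
    def __init__(self):
        self.getattr_calls = 0

    def __getattr__(self, attr):
        # Reached only when the attribute is not found by normal lookup
        self.getattr_calls += 1
        element = Element(attr)
        setattr(self, attr, element)  # cache it on the instance
        return element


chrome = Interface()
first = chrome.Page
for _ in range(49):
    # Later accesses hit the cached instance attribute directly
    assert chrome.Page is first
print(chrome.getattr_calls)  # prints 1
```

After the first access the element lives in the instance dictionary, so Python never falls back to __getattr__ for that name again.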

How chrome.Page.getFrameTree is resolved

class GenericElement(object):
    def __init__(self, name, parent):
        self.name = name
        self.parent = parent

    def __getattr__(self, attr):
        # Full CDP method name, for example 'Page.getFrameTree'
        func_name = '{}.{}'.format(self.name, attr)

        def generic_function(**args):
            # Discard all pending messages
            self.parent.pop_messages()

            # Increment the message ID
            self.parent.message_counter += 1
            message_id = self.parent.message_counter

            # Build the call object and send it over the websocket
            call_obj = {'id': message_id, 'method': func_name, 'params': args}
            self.parent.ws.send(json.dumps(call_obj))

            # Wait for the CDP result and return the data
            result, messages = self.parent.wait_result(message_id)
            return result, messages

        # Return a function
        return generic_function
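To see exactly what goes over the wire, the same two-level pattern can be run against a fake socket. Everything here is a sketch: FakeSocket, Domain and Client are invented names, the canned replies are made up, and the receive loop in call is only one plausible way to match a reply by id, since the article does not show wait_result. It also has no timeout handling.

```python
import json


class FakeSocket:
    """Stand-in for a websocket: records sent text and queues canned replies."""
    def __init__(self):
        self.sent = []
        self._inbox = []

    def send(self, text):
        self.sent.append(text)
        msg = json.loads(text)
        # Queue an unrelated event first, then the reply with the matching id
        self._inbox.append(json.dumps({"method": "Page.frameNavigated", "params": {}}))
        self._inbox.append(json.dumps({"id": msg["id"], "result": {"ok": True}}))

    def recv(self):
        return self._inbox.pop(0)


class Domain:
    """Second level: turns Domain('Page').getFrameTree into a CDP call."""
    def __init__(self, name, client):
        self.name = name
        self.client = client

    def __getattr__(self, attr):
        method = "{}.{}".format(self.name, attr)

        def call(**params):
            return self.client.call(method, params)
        return call


class Client:
    """First level: creates and caches a Domain per attribute name."""
    def __init__(self, ws):
        self.ws = ws
        self.message_counter = 0

    def __getattr__(self, attr):
        domain = Domain(attr, self)
        setattr(self, attr, domain)
        return domain

    def call(self, method, params):
        self.message_counter += 1
        message_id = self.message_counter
        self.ws.send(json.dumps({"id": message_id, "method": method, "params": params}))
        events = []
        while True:
            message = json.loads(self.ws.recv())
            if message.get("id") == message_id:
                return message, events  # reply to our command
            events.append(message)      # anything else is an event


client = Client(FakeSocket())
result, events = client.Page.getFrameTree()
print(client.ws.sent[0])
print(result, events)
```

The printed request shows the whole protocol surface used by the library: an id, a "Domain.method" string, and a params dict built from the keyword arguments.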

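Thoughts on checking whether an object has a given member

Once a class defines __getattr__, dotted access to an attribute that does not exist is routed to that method. In this project, __getattr__ calls __setattr__, so the element is stored on the object the first time it is looked up.

As a result the project keeps assigning new attributes onto the object, which is not strictly necessary. So how do we check whether a member is really there?

The built-in hasattr, which is implemented on top of getattr, always returns True here because __getattr__ creates the attribute on demand, so it no longer meets our needs.
Use dir together with in instead. For this project, evaluate 'Page' in dir(chrome) to check whether the element Page exists.

PS: __getattr__ is called only when the attribute is not found on the object, so evaluating chrome.Page 50 times still calls __getattr__ just once.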
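The difference between the two checks is easy to demonstrate on a stand-in class (the Interface class below is a simplified illustration, not the library's code). Note one subtlety: calling hasattr is itself an attribute access, so it triggers __getattr__ and leaves the attribute cached on the object.

```python
class Interface:
    """Simplified stand-in that creates attributes on demand."""
    def __getattr__(self, attr):
        value = object()
        setattr(self, attr, value)
        return value


chrome = Interface()

before = "Page" in dir(chrome)         # nothing has been accessed yet
probe = hasattr(chrome, "Page")        # runs __getattr__, which creates Page
after = "Page" in dir(chrome)          # Page is now stored on the instance
untouched = "Network" in vars(chrome)  # vars() reads the instance dict only

print(before, probe, after, untouched)  # prints: False True True False
```

Checking vars(chrome) or dir(chrome) does not perform attribute lookup by name, so neither of them creates anything as a side effect.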

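Summary

All pages and their WebSocket URLs are obtained from http://127.0.0.1:9991/json.
CDP communication then goes over the WebSocket.
A CDP method name has only two levels:
- chrome.Page corresponds to an instance of the GenericElement class
- chrome.Page.getFrameTree corresponds to the __getattr__ method of the GenericElement class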

References

PyChromeDevTools: https://github.com/marty90/PyChromeDevTools
CDP protocol documentation: https://chromedevtools.github.io/devtools-protocol

PS: The content of this article is for technical discussion only. Do not use it for any illegal or non-compliant purpose.
