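Python crawler: how to get rid of the authentication popup when collecting data through a proxy IP with selenium + chrome driver. The runtime environment is as follows: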
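1. Language: python3
2. selenium is installed
3. The chrome browser does not have to be the latest version, as long as the local chrome version matches the driver you download. Only the first three segments of the chrome version number matter, for example: 100.0.4896
4. A crawler proxy or a proxy server address

Once this environment is ready, the program is as follows: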
from selenium import webdriver

# Credentials are defined here but never used: --proxy-server only takes host:port
username = 'username'
password = 'password'
url = 'http://whatismyipaddress.com'
PROXY = "www.16yun.cn:8000"  # IP:PORT or HOST:PORT

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=%s' % PROXY)
chrome = webdriver.Chrome(options=chrome_options)
chrome.get(url)
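If the script fails to start a browser session at all, a mismatch between chrome and the driver (item 3 of the environment list) is a common cause. The sketch below shows one way to apply the "first three segments" rule to two version strings. The function name `same_build` and the sample version strings are illustrative assumptions, not part of selenium or chromedriver.

```python
def same_build(chrome_version: str, driver_version: str, segments: int = 3) -> bool:
    """Return True when the first `segments` dot-separated parts are identical.

    Hypothetical helper: it only compares strings and does not query the
    installed browser or driver.
    """
    chrome_parts = chrome_version.strip().split(".")[:segments]
    driver_parts = driver_version.strip().split(".")[:segments]
    # Reject strings that are too short to compare reliably
    if len(chrome_parts) < segments or len(driver_parts) < segments:
        return False
    return chrome_parts == driver_parts


if __name__ == "__main__":
    # Sample values for illustration only
    print(same_build("100.0.4896.127", "100.0.4896.60"))
    print(same_build("100.0.4896.127", "101.0.4951.41"))
```

The version strings themselves would have to come from the browser's "About" page and from the name of the driver download, since this helper does no detection of its own.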
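Running the script that passes only --proxy-server brings up a popup asking for a username and password every time. This is a common problem when using a proxy IP with selenium. The solution is as follows: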
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import os
import zipfile

url = 'https://whatismyipaddress.com/'  # target website
PROXY = 'www.16yun.cn'  # proxy server address
port = '31111'  # proxy server port
user = 'username'  # proxy server username
passw = 'password'  # proxy server password

# Extension manifest (Manifest V2; recent Chrome releases are phasing out V2 extensions)
manifest_json = """
{
    "version": "1.0.0",
    "manifest_version": 2,
    "name": "Chrome Proxy",
    "permissions": [
        "proxy",
        "tabs",
        "unlimitedStorage",
        "storage",
        "<all_urls>",
        "webRequest",
        "webRequestBlocking"
    ],
    "background": {
        "scripts": ["background.js"]
    },
    "minimum_chrome_version": "22.0.0"
}
"""

# Background script: sets the proxy and answers the auth challenge automatically
background_js = """
var config = {
    mode: "fixed_servers",
    rules: {
        singleProxy: {
            scheme: "http",
            host: "%s",
            port: parseInt(%s)
        },
        bypassList: ["localhost"]
    }
};

chrome.proxy.settings.set({value: config, scope: "regular"}, function() {});

function callbackFn(details) {
    return {
        authCredentials: {
            username: "%s",
            password: "%s"
        }
    };
}

chrome.webRequest.onAuthRequired.addListener(
    callbackFn,
    {urls: ["<all_urls>"]},
    ['blocking']
);
""" % (PROXY, port, user, passw)


def get_chromedriver(use_proxy=False, user_agent=None):
    path = os.path.dirname(os.path.abspath(__file__))
    chrome_options = webdriver.ChromeOptions()
    if use_proxy:
        # Pack the two files into a zip and load it as an extension
        pluginfile = 'proxy_auth_plugin.zip'
        with zipfile.ZipFile(pluginfile, 'w') as zp:
            zp.writestr("manifest.json", manifest_json)
            zp.writestr("background.js", background_js)
        chrome_options.add_extension(pluginfile)
    if user_agent:
        chrome_options.add_argument('--user-agent=%s' % user_agent)
    # Selenium 4 style: the driver path goes through Service, options through options=
    driver = webdriver.Chrome(
        service=Service(os.path.join(path, 'chromedriver')),
        options=chrome_options)
    return driver


driver = get_chromedriver(use_proxy=True)
driver.get(url)
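The get_chromedriver script needs to be in the same directory as chromedriver.exe (otherwise it will not have permission to write and read the temporary file). If you copy this code, check that the indentation survived and that the proxy credentials are correct.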
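One possible variation on the packaging step, as a minimal sketch: it writes the zip into a temporary directory instead of the working directory, and quotes the proxy values with `json.dumps` so that a password containing a quote character does not break the generated JavaScript. The function name `build_proxy_extension`, the `MANIFEST` dict, and the template below are assumptions made for illustration; they reuse the manifest and background script from the solution rather than any selenium API.

```python
import json
import os
import tempfile
import zipfile

# Same manifest as in the solution, kept as a dict (illustrative)
MANIFEST = {
    "version": "1.0.0",
    "manifest_version": 2,
    "name": "Chrome Proxy",
    "permissions": ["proxy", "tabs", "unlimitedStorage", "storage",
                    "<all_urls>", "webRequest", "webRequestBlocking"],
    "background": {"scripts": ["background.js"]},
    "minimum_chrome_version": "22.0.0",
}

# Placeholders are filled with already-quoted JavaScript literals
BACKGROUND_TEMPLATE = """
var config = {
    mode: "fixed_servers",
    rules: {
        singleProxy: {scheme: "http", host: %(host)s, port: %(port)d},
        bypassList: ["localhost"]
    }
};
chrome.proxy.settings.set({value: config, scope: "regular"}, function() {});
chrome.webRequest.onAuthRequired.addListener(
    function(details) {
        return {authCredentials: {username: %(user)s, password: %(password)s}};
    },
    {urls: ["<all_urls>"]},
    ["blocking"]
);
"""


def build_proxy_extension(host, port, user, password, directory=None):
    """Write the extension zip and return its path (hypothetical helper)."""
    directory = directory or tempfile.mkdtemp(prefix="proxy_ext_")
    background_js = BACKGROUND_TEMPLATE % {
        "host": json.dumps(host),          # becomes a quoted JS string
        "port": int(port),                 # becomes a bare integer
        "user": json.dumps(user),
        "password": json.dumps(password),
    }
    path = os.path.join(directory, "proxy_auth_plugin.zip")
    with zipfile.ZipFile(path, "w") as zp:
        zp.writestr("manifest.json", json.dumps(MANIFEST, indent=2))
        zp.writestr("background.js", background_js)
    return path
```

The returned path could then be passed to `chrome_options.add_extension(...)` in place of the fixed `proxy_auth_plugin.zip` file name.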