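I recently ran a GitHub project that uses Hugging Face's Datasets library. The library downloads the raw dataset files from the network on its own, but it fetches them from each dataset's original link. The Spider dataset, for example, is downloaded from the Google Drive link published by its authors. The servers at my school cannot reach the external network, so a proxy is needed for the program to access Google Drive.
The short script and error output below illustrate the problem.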
from datasets import load_dataset

dataset = load_dataset('spider')
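Running this code as is, the program tries to download the Spider dataset from Google Drive, and because of the network restriction it fails with the following error.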
(slurm) jxqi@main-2:~/Text-to-SQL/tmp$ python test_google.py
Using the latest cached version of the module from /home/jxqi/.cache/huggingface/modules/datasets_modules/datasets/spider/edbe505fd96c6218feb563fa547869bbc170052a1484d675f9d96d090a9473cf (last modified on Wed Oct 20 15:33:00 2021) since it couldn't be found locally at spider/spider.py or remotely (ConnectionError).
Downloading and preparing dataset spider/spider (download: 95.12 MiB, generated: 5.17 MiB, post-processed: Unknown size, total: 100.29 MiB) to /home/jxqi/.cache/huggingface/datasets/spider/spider/1.0.0/edbe505fd96c6218feb563fa547869bbc170052a1484d675f9d96d090a9473cf...
Traceback (most recent call last):
  File "test_google.py", line 3, in <module>
    dataset = load_dataset('spider')
  File "/home/jxqi/anaconda3/envs/slurm/lib/python3.8/site-packages/datasets/load.py", line 742, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/jxqi/anaconda3/envs/slurm/lib/python3.8/site-packages/datasets/builder.py", line 574, in download_and_prepare
    self._download_and_prepare(
  File "/home/jxqi/anaconda3/envs/slurm/lib/python3.8/site-packages/datasets/builder.py", line 630, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
  File "/home/jxqi/.cache/huggingface/modules/datasets_modules/datasets/spider/edbe505fd96c6218feb563fa547869bbc170052a1484d675f9d96d090a9473cf/spider.py", line 78, in _split_generators
    downloaded_filepath = dl_manager.download_and_extract(_URL)
  File "/home/jxqi/anaconda3/envs/slurm/lib/python3.8/site-packages/datasets/utils/download_manager.py", line 287, in download_and_extract
    return self.extract(self.download(url_or_urls))
  File "/home/jxqi/anaconda3/envs/slurm/lib/python3.8/site-packages/datasets/utils/download_manager.py", line 195, in download
    downloaded_path_or_paths = map_nested(
  File "/home/jxqi/anaconda3/envs/slurm/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 195, in map_nested
    return function(data_struct)
  File "/home/jxqi/anaconda3/envs/slurm/lib/python3.8/site-packages/datasets/utils/download_manager.py", line 218, in _download
    return cached_path(url_or_filename, download_config=download_config)
  File "/home/jxqi/anaconda3/envs/slurm/lib/python3.8/site-packages/datasets/utils/file_utils.py", line 281, in cached_path
    output_path = get_from_cache(
  File "/home/jxqi/anaconda3/envs/slurm/lib/python3.8/site-packages/datasets/utils/file_utils.py", line 623, in get_from_cache
    raise ConnectionError("Couldn't reach {}".format(url))
ConnectionError: Couldn't reach https://drive.google.com/uc?export=download&id=1_AckYkinAnhqmRQtGsQgUKAnTHxxX5J0
As the last line shows, the error occurs because the server cannot reach the Google Drive link.
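To confirm that the failure really comes from the network and not from the Datasets library, a quick direct-connection check can help. The following is only a minimal sketch: the helper name can_reach is made up for illustration, and it tests a plain TCP connection without any proxy, so it reports whether the machine can reach a host directly.

```python
import socket


def can_reach(host, port=443, timeout=5.0):
    """Return True if a direct TCP connection to host:port succeeds.

    Hypothetical helper for illustration. It does not use any proxy,
    so a False result only means there is no direct route to the host.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# Example call (not executed here):
#   can_reach("drive.google.com")
```

On a server without external access, a call such as can_reach("drive.google.com") would be expected to return False, which matches the ConnectionError above.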
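Searching for a fix, I found reports of similar problems. Following "Linux 让终端走代理的几种方法" [1], the shell configuration file .bashrc can be modified so that programs run by the current user go through a proxy.
The steps are: open the .bashrc file, then append the following two lines at the end of the file: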
export http_proxy="http://proxy_host:port"
export https_proxy="http://proxy_host:port"
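Replace proxy_host with the host name of your proxy server and port with the proxy port. If the proxy requires authentication, add the username and password as well: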
export http_proxy="http://username:password@proxy_host:port"
export https_proxy="http://username:password@proxy_host:port"
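One detail worth noting: if the username or password contains characters such as @ or :, they generally need to be percent-encoded before being placed in the proxy URL, otherwise the URL is parsed incorrectly. The sketch below builds such a URL with the standard library; build_proxy_url is a hypothetical helper, and the host, port, and credentials are placeholders.

```python
from urllib.parse import quote


def build_proxy_url(host, port, username=None, password=None):
    """Build an http:// proxy URL, percent-encoding any credentials.

    Hypothetical helper for illustration; all arguments are placeholders.
    """
    if username is None:
        return f"http://{host}:{port}"
    user = quote(username, safe="")
    if password is None:
        return f"http://{user}@{host}:{port}"
    # safe="" makes quote() encode reserved characters such as '@' and ':'.
    return f"http://{user}:{quote(password, safe='')}@{host}:{port}"


# Prints http://username:p%40ss%3Aword@proxy_host:8080
print(build_proxy_url("proxy_host", 8080, "username", "p@ss:word"))
```

The printed string can then be used as the value of http_proxy and https_proxy in .bashrc.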
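After that, the shell configuration has to be reloaded so that the new variables take effect in the current session (opening a new terminal also works). Use the following command: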
source ~/.bashrc
Once the configuration is reloaded, programs started from that shell can reach the external network through the proxy.
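A simple way to check that the variables are visible to Python is to ask the standard library which proxies it detects. This is a small sketch, with detected_proxies as a made-up wrapper name; it relies on urllib.request.getproxies(), which reads http_proxy and https_proxy from the environment. HTTP libraries such as requests read the same variables by default, and the assumption here is that the download code in Datasets does so too.

```python
import urllib.request


def detected_proxies():
    """Return the proxy mapping Python detects, e.g. {'http': ..., 'https': ...}.

    Thin wrapper (hypothetical name) around urllib.request.getproxies().
    """
    return urllib.request.getproxies()


# Print whatever proxies are visible to this Python process.
for scheme, url in sorted(detected_proxies().items()):
    print(f"{scheme}: {url}")
```

If nothing is printed for http or https, the variables were most likely not exported in the shell that launched Python.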
[1] Linux 让终端走代理的几种方法 (Several ways to route a Linux terminal through a proxy), https://zhuanlan.zhihu.com/p/46973701