Problem: Check a list of over 1000 URLs and get each URL's return code (status_code).
The script I have works, but it is very slow.
I am thinking there has to be a better, more Pythonic (more beautiful) way of doing this, where I can spawn 10 or 20 threads to check the URLs and collect the responses, i.e.:
200 -> www.yahoo.com
404 -> www.badurl.com
...
Input file: url10.txt
www.example.com
www.yahoo.com
www.testsite.com
....
import requests

with open("url10.txt") as f:
    urls = f.read().splitlines()
print(urls)

for url in urls:
    url = 'http://' + url  # Add http:// to each url (there has to be a better way to do this; see the sketch after this block)
    try:
        resp = requests.get(url, timeout=1)
        print(len(resp.content), '->', resp.status_code, '->', resp.url)
    except Exception as e:
        print("Error", url)
Challenge: Improve speed with multiprocessing.
With multiprocessing
But it is not working. I get the following error (note: I am not sure if I have even implemented this correctly):
AttributeError: Can't get attribute 'checkurl' on <module '__main__' (built-in)>
--
import requests
from multiprocessing import Pool

with open("url10.txt") as f:
    urls = f.read().splitlines()

def checkurlconnection(url):
    for url in urls:
        url = 'http://' + url
        try:
            resp = requests.get(url, timeout=1)
            print(len(resp.content), '->', resp.status_code, '->', resp.url)
        except Exception as e:
            print("Error", url)

if __name__ == "__main__":
    p = Pool(processes=4)
    result = p.map(checkurlconnection, urls)
In this case your task is I/O bound and not processor bound - it takes far longer for a website to reply than it does for your CPU to loop once through your script (not counting the TCP request). What this means is that you won't get any speedup from spreading this task across multiple processes (which is what multiprocessing does). What you want is multi-threading. The way to achieve that is with the little-documented, perhaps poorly named, multiprocessing.dummy:
import requests
from multiprocessing.dummy import Pool as ThreadPool

urls = ['https://www.python.org',
        'https://www.python.org/about/']

def get_status(url):
    r = requests.get(url)
    return r.status_code

if __name__ == "__main__":
    pool = ThreadPool(4)                  # Make the Pool of workers
    results = pool.map(get_status, urls)  # Fetch the urls, each in its own thread
    pool.close()                          # Close the pool to new work
    pool.join()                           # Wait for the work to finish
See here for examples of multiprocessing vs multithreading in Python.
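To adapt the ThreadPool answer to the original use case (reading hostnames from url10.txt, prefixing the scheme, and tolerating bad urls), a minimal sketch along the same lines might look like the one below. The 1-second timeout and the pool size of 20 are assumptions carried over from the question, and check_url is a hypothetical helper name, not part of the answer above:

import requests
from multiprocessing.dummy import Pool as ThreadPool

def check_url(url):
    # Return (status, url); report failures instead of raising so pool.map keeps going.
    try:
        resp = requests.get('http://' + url, timeout=1)
        return resp.status_code, url
    except requests.RequestException:
        return 'Error', url

if __name__ == "__main__":
    with open("url10.txt") as f:
        urls = f.read().splitlines()
    pool = ThreadPool(20)                   # 20 worker threads, as the question suggests
    results = pool.map(check_url, urls)     # each url is fetched in its own thread
    pool.close()
    pool.join()
    for status, url in results:
        print(status, '->', url)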