Problem description
I am using selenium and chrome-driver to scrape data from some pages and then run some additional tasks with that information (for example, typing comments on some pages).
My program has a button. Every time it is pressed, it calls thread_(self) (below), which starts a new thread. The target function self.main contains the code that runs all the selenium work on a chrome-driver.
def thread_(self):
    th = threading.Thread(target=self.main)
    th.start()
My problem starts after the user presses the button the first time. This th thread will open browser A and do some stuff. While browser A is doing some stuff, the user will press the button again and open browser B, which runs the same self.main. I want each opened browser to run simultaneously. The problem I faced is that when I run that thread function, the first browser stops and the second browser is opened.
I know my code can create threads infinitely, and I know that this will affect the PC performance, but I am OK with that. I want to speed up the work done by self.main!
Recommended answer
Threading for selenium speed up
Consider the following functions to exemplify how threads with selenium give some speed-up compared to a single-driver approach. The code below scrapes the html title from a page opened by selenium, using BeautifulSoup. The list of pages is links.
import time
import threading
from bs4 import BeautifulSoup
from selenium import webdriver

def create_driver():
    """returns a new chrome webdriver"""
    chromeOptions = webdriver.ChromeOptions()
    chromeOptions.add_argument("--headless")  # make it not visible, just comment out if you like seeing opened browsers
    return webdriver.Chrome(options=chromeOptions)

def get_title(url, driver=None):
    """get the url html title using BeautifulSoup
    if driver is None, uses a new chrome-driver and quit() after
    otherwise uses the driver provided and doesn't quit() after"""
    # the parameter is named `driver` so it does not shadow the `webdriver` module
    def print_title(driver):
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, "lxml")
        item = soup.find('title')
        print(item.string.strip())

    if driver:
        print_title(driver)
    else:
        driver = create_driver()
        print_title(driver)
        driver.quit()

links = ["https://www.amazon.com", "https://www.google.com", "https://www.youtube.com/", "https://www.facebook.com/", "https://www.wikipedia.org/",
         "https://us.yahoo.com/?p=us", "https://www.instagram.com/", "https://www.globo.com/", "https://outlook.live.com/owa/"]
Now calling get_title on the links above.
Sequential approach
A single chrome driver, passing all links sequentially. Takes 22.3 s on my machine (note: Windows).
start_time = time.time()
driver = create_driver()
for link in links:  # could be 'like' clicks
    get_title(link, driver)
driver.quit()
print("sequential took ", (time.time() - start_time), " seconds")
Multiple threads approach
Using a thread for each link. Results in 10.5 s, more than 2x faster.
start_time = time.time()
threads = []
for link in links:  # each thread could be like a new 'click'
    th = threading.Thread(target=get_title, args=(link,))
    th.start()  # could `time.sleep` between 'clicks' to see what's up without the headless option
    threads.append(th)
for th in threads:
    th.join()  # Main thread waits for threads to finish
print("multiple threads took ", (time.time() - start_time), " seconds")
This here and this (better) are some other working examples. The second uses a fixed number of threads on a ThreadPool, and suggests that storing the chrome-driver instance initialized on each thread is faster than creating and starting it every time.
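To make the "fixed number of threads, one stored driver per thread" idea concrete, here is a minimal sketch of how I would write it. It is not the code from the linked examples. The DriverPool class and its method names are hypothetical, and make_driver is assumed to be any zero-argument callable that returns a driver exposing get(), title and quit(), such as the create_driver function above.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class DriverPool:
    """Hypothetical helper: a fixed-size thread pool where each worker
    thread creates its driver once and reuses it for every url it handles."""

    def __init__(self, make_driver, max_workers=3):
        self._make_driver = make_driver      # assumed: zero-argument driver factory
        self._max_workers = max_workers
        self._local = threading.local()      # per-thread storage for the driver
        self._drivers = []                   # every driver created, for cleanup
        self._lock = threading.Lock()

    def _driver(self):
        """Return the calling thread's driver, creating it on first use only."""
        driver = getattr(self._local, "driver", None)
        if driver is None:
            driver = self._make_driver()
            self._local.driver = driver
            with self._lock:
                self._drivers.append(driver)
        return driver

    def _title(self, url):
        driver = self._driver()
        driver.get(url)
        return driver.title

    def titles(self, urls):
        """Fetch the titles of all urls; results keep the order of `urls`."""
        try:
            with ThreadPoolExecutor(max_workers=self._max_workers) as pool:
                return list(pool.map(self._title, urls))
        finally:
            # Quit every browser once all the work is done.
            for driver in self._drivers:
                driver.quit()
            self._drivers.clear()
```

With the functions above it would be called as DriverPool(create_driver, max_workers=3).titles(links). At most max_workers browsers are started, no matter how many links there are.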
Still, I was not sure this was the optimal approach for selenium to get considerable speed-ups, since threading on non-I/O-bound code will end up executing sequentially (one thread after another). Due to the Python GIL (Global Interpreter Lock), a Python process cannot run threads in parallel (utilizing multiple cpu cores).
To try to overcome the Python GIL limitation, I wrote the following code using the multiprocessing package and its Process class, and ran multiple tests. I even added random page hyperlink clicks to the get_title function above. Additional code is here.
import multiprocessing

# on Windows this must run under `if __name__ == "__main__":`
start_time = time.time()
processes = []
for link in links:  # each process a new 'click'
    ps = multiprocessing.Process(target=get_title, args=(link,))
    ps.start()  # could sleep 1 between 'clicks' with `time.sleep(1)`
    processes.append(ps)
for ps in processes:
    ps.join()  # Main waits for processes to finish
print("multiple processes took ", (time.time() - start_time), " seconds")
Contrary to what I would expect, Python multiprocessing.Process based parallelism for selenium was on average around 8% slower than threading.Thread. But obviously both were on average more than twice as fast as the sequential approach. I just found out that selenium chrome-driver commands use HTTP requests (like POST, GET), so the work is I/O bound and therefore releases the Python GIL, which indeed makes it parallel in threads.
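The I/O-bound effect can be seen without any browser. The sketch below is only an illustration under an assumption: time.sleep stands in for a driver command that blocks while waiting for an HTTP response (sleeping also releases the GIL). The function names are made up for this example.

```python
import threading
import time


def fake_command(delay):
    """Stand-in for one blocking driver command; sleeping releases the GIL."""
    time.sleep(delay)


def run_sequential(n, delay):
    """Run n blocking commands one after another and return the elapsed time."""
    start = time.perf_counter()
    for _ in range(n):
        fake_command(delay)
    return time.perf_counter() - start


def run_threaded(n, delay):
    """Run n blocking commands, one thread each, and return the elapsed time."""
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_command, args=(delay,)) for _ in range(n)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()  # wait for every thread to finish
    return time.perf_counter() - start


if __name__ == "__main__":
    print("sequential:", round(run_sequential(4, 0.2), 2), "s")
    print("threaded:  ", round(run_threaded(4, 0.2), 2), "s")
```

The sequential run takes roughly the sum of the waits, while the threaded run takes roughly the longest single wait, because the waits overlap. Real browser work also burns CPU in the Chrome processes, so the real speed-up is smaller than in this toy case.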
This is not a definitive answer, as my tests were only a tiny example. Also, I'm using Windows, and multiprocessing has many limitations in this case. Each new Process is not a fork like on Linux, meaning, among other downsides, that a lot of memory is wasted.
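To check how your own platform starts a new Process, a small sketch like the following can help. The start_method_info function is a name I made up. The multiprocessing calls are from the standard library. On Windows only "spawn" is available, which starts a fresh interpreter for each Process instead of forking the parent.

```python
import multiprocessing


def start_method_info():
    """Report how this platform creates new processes."""
    return {
        "default": multiprocessing.get_start_method(),
        "available": multiprocessing.get_all_start_methods(),
    }


# With "spawn" the child re-imports the main module, so process creation
# must sit under this guard or every child would try to spawn again.
if __name__ == "__main__":
    info = start_method_info()
    print("default start method:", info["default"])
    print("available start methods:", info["available"])

    # The target is a builtin here so it can be pickled under any start method.
    ps = multiprocessing.Process(target=print, args=("hello from the child process",))
    ps.start()
    ps.join()
```

Since a spawned child does not share the parent's memory, everything passed in args has to be picklable, and each child pays the interpreter start-up cost again on top of the browser start-up.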
Taking all that into account: it seems that, depending on the use case, threads may be as good as or better than trying the heavier process approach (especially for Windows users).