Problem description
How can I make multiprocessing.pool.map distribute processes in numerical order?
More Info:
I have a program that processes a few thousand data files, making a plot of each one. I'm using multiprocessing.pool.map to distribute each file to a processor, and it works great. Sometimes this takes a long time, and it would be nice to look at the output images while the program is running. This would be a lot easier if map distributed the snapshots in order; instead, for the particular run I just executed, the first 8 snapshots analyzed were: 0, 78, 156, 234, 312, 390, 468, 546. Is there a way to make it distribute them closer to numerical order?
Example:
Here's some sample code that contains the same key elements and shows the same basic result:
from multiprocessing import Pool
import time

num_proc = 4
num_calls = 20
sleeper = 0.1

def SomeFunc(arg):
    time.sleep(sleeper)
    # end=" " keeps the output on a single line; flush=True shows it immediately
    print("%5d" % arg, end=" ", flush=True)

if __name__ == "__main__":
    proc_pool = Pool(num_proc)
    proc_pool.map(SomeFunc, range(num_calls))
Output:
0 4 2 6 1 5 3 7 8 10 12 14 13 11 9 15 16 18 17 19
Answer:
From @Hayden: use the 'chunksize' argument, def map(self, func, iterable, chunksize=None).
More Info:
The chunksize determines how many iterations are allocated to each processor at a time. My example above, for instance, uses a chunksize of 2, which means that each processor goes off and runs 2 iterations of the function, then comes back for more (a 'check-in'). The trade-off is that each check-in carries overhead, because the processor has to sync up with the others, which suggests using a large chunksize. On the other hand, with large chunks one processor might finish its chunk while another still has a long way to go, which suggests a small chunksize. The useful extra piece of information is how much the duration of each function call varies. If all calls take about the same amount of time, a large chunksize is more efficient. If some calls could take twice as long as others, a small chunksize keeps processors from sitting idle.
For my problem, every function call should take very close to the same amount of time (I think), so if I want the processes to be called in order, I'm going to sacrifice efficiency because of the check-in overhead.
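To make the effect of chunksize on ordering concrete, here is a minimal sketch that splits range(20) into chunks and shows which item each of 4 workers would start on. The helper split_into_chunks is a hypothetical name used for illustration only; it imitates the idea of chunking, not the library's internal code.

```python
from itertools import islice

def split_into_chunks(iterable, chunksize):
    """Yield successive lists of at most `chunksize` items (illustrative helper)."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, chunksize))
        if not chunk:
            return
        yield chunk

chunks_of_2 = list(split_into_chunks(range(20), 2))
chunks_of_1 = list(split_into_chunks(range(20), 1))

# First item of the first four chunks: what 4 workers would pick up first.
first_round_2 = [chunk[0] for chunk in chunks_of_2[:4]]
first_round_1 = [chunk[0] for chunk in chunks_of_1[:4]]

print(first_round_2)  # [0, 2, 4, 6]
print(first_round_1)  # [0, 1, 2, 3]
```

With a chunksize of 2 the workers begin with 0, 2, 4 and 6, which is the same set of numbers that appears first in the output above; with a chunksize of 1 they begin with 0, 1, 2 and 3.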
Recommended answer
This occurs because map splits the work into chunks at the start of the call and hands each process a whole chunk at a time; the size of those chunks is set by the chunksize. We can work out the default chunksize by looking at the source for pool.map:
chunksize, extra = divmod(len(iterable), len(self._pool) * 4)
if extra:
    chunksize += 1
So for a range of 20 and 4 processes, divmod(20, 16) gives (1, 4); the remainder is nonzero, so the chunksize is rounded up from 1 to 2.
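The same arithmetic can be wrapped in a small function to check other combinations of input size and process count. This is only a sketch that mirrors the snippet quoted above; the name default_chunksize and its parameters are made up for illustration and are not part of the multiprocessing API.

```python
def default_chunksize(num_items, num_workers):
    """Mirror the default chunksize calculation quoted from pool.map (illustrative)."""
    chunksize, extra = divmod(num_items, num_workers * 4)
    if extra:
        # Round up so that no items are left over.
        chunksize += 1
    return chunksize

print(default_chunksize(20, 4))  # 2
print(default_chunksize(16, 4))  # 1
```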
If we modify your code to pass this value explicitly, we should get results similar to the ones you are getting now:
proc_pool.map(SomeFunc, range(num_calls), chunksize=2)
This yields the output:
0 2 6 4 1 7 5 3 8 10 12 14 9 13 15 11 16 18 17 19
Now, setting chunksize=1 will ensure that each process within the pool is given only one task at a time.
proc_pool.map(SomeFunc, range(num_calls), chunksize=1)
This should give a reasonably good numerical ordering compared with leaving the chunksize unspecified. For example, a chunksize of 1 yields the output:
0 1 2 3 4 5 6 7 9 10 8 11 13 12 15 14 16 17 19 18