Question
I'm wondering about the way that python's Multiprocessing.Pool class works with map, imap, and map_async. My particular problem is that I want to map on an iterator that creates memory-heavy objects, and don't want all these objects to be generated into memory at the same time. I wanted to see if the various map() functions would wring my iterator dry, or intelligently call the next() function only as child processes slowly advanced, so I hacked up some tests as such:
import time
from multiprocessing import Pool

def g():
    for el in xrange(100):
        print el
        yield el

def f(x):
    time.sleep(1)
    return x*x

if __name__ == '__main__':
    pool = Pool(processes=4)  # start 4 worker processes
    go = g()
    g2 = pool.imap(f, go)
    g2.next()
And so on with map, imap, and map_async. This is the most flagrant example however, as simply calling next() a single time on g2 prints out all my elements from my generator g(), whereas if imap were doing this 'lazily' I would expect it to only call go.next() once, and therefore print out only '0'.
Can someone clear up what is happening, and if there is some way to have the process pool 'lazily' evaluate the iterator as needed?
Thanks,
Gabe
Answer
Let's look at the end of the program first.
The multiprocessing module uses atexit to call multiprocessing.util._exit_function when your program ends.
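As a quick illustration of that exit-hook mechanism (a generic atexit sketch in Python 3 syntax, not multiprocessing's actual registration code), running a child interpreter shows that registered callbacks fire only after the main program has finished:

```python
import subprocess
import sys

# Run a child interpreter so the exit-hook ordering is observable from outside.
child_code = (
    "import atexit\n"
    "atexit.register(lambda: print('exit hook: cleaning up'))\n"
    "print('main: done')\n"
)
proc = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True, text=True,
)
lines = proc.stdout.splitlines()
print(lines)  # ['main: done', 'exit hook: cleaning up']
```

multiprocessing registers _exit_function the same way, which is why pool teardown happens after your last statement runs.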
If you remove g2.next(), your program ends quickly.
The _exit_function eventually calls Pool._terminate_pool. The main thread changes the state of pool._task_handler._state from RUN to TERMINATE. Meanwhile the pool._task_handler thread is looping in Pool._handle_tasks and bails out when it reaches the condition
if thread._state:
    debug('task handler found thread._state != RUN')
    break
(See /usr/lib/python2.6/multiprocessing/pool.py)
This is what stops the task handler from fully consuming your generator, g(). If you look in Pool._handle_tasks you'll see
for i, task in enumerate(taskseq):
    ...
    try:
        put(task)
    except IOError:
        debug('could not put task on queue')
        break
This is the code which consumes your generator. (taskseq is not exactly your generator, but as taskseq is consumed, so is your generator.)
In contrast, when you call g2.next() the main thread calls IMapIterator.next, and waits when it reaches self._cond.wait(timeout).
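That wait-on-a-Condition pattern can be sketched in isolation (Python 3 syntax; the producer/consumer names are illustrative, not taken from pool.py): one thread blocks in cond.wait(timeout) until another thread, playing the role of the result handler, delivers a value and notifies:

```python
import threading
import time

cond = threading.Condition()
items = []

def consumer():
    # Roughly what IMapIterator.next does: block until an item is available.
    with cond:
        while not items:
            cond.wait(timeout=5)
        return items.pop(0)

def producer():
    time.sleep(0.1)  # simulate a worker computing a result
    with cond:
        items.append(42)
        cond.notify()

t = threading.Thread(target=producer)
t.start()
result = consumer()
t.join()
print(result)  # 42
```

While the main thread is parked in that wait, the rest of the interpreter (including the pool's helper threads) keeps running.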
That the main thread is waiting instead of calling _exit_function is what allows the task handler thread to run normally, which means fully consuming the generator as it puts tasks in the workers' inqueue in the Pool._handle_tasks function.
The bottom line is that all Pool map functions consume the entire iterable they are given. If you'd like to consume the generator in chunks, you could do this instead:
import multiprocessing as mp
import itertools
import time

def g():
    for el in xrange(50):
        print el
        yield el

def f(x):
    time.sleep(1)
    return x * x

if __name__ == '__main__':
    pool = mp.Pool(processes=4)  # start 4 worker processes
    go = g()
    result = []
    N = 11
    while True:
        g2 = pool.map(f, itertools.islice(go, N))
        if g2:
            result.extend(g2)
            time.sleep(1)
        else:
            break
    print(result)
這篇關(guān)于Python Multiprocessing.Pool 延遲迭代的文章就介紹到這了,希望我們推薦的答案對大家有所幫助,也希望大家多多支持html5模板網(wǎng)!