The Question
Multiprocessing is a powerful tool in Python, and I want to understand it in more depth. I want to know when to use regular Locks and Queues and when to use a multiprocessing Manager to share these among all processes.
I came up with the following testing scenarios with four different conditions for multiprocessing:
1. Using a pool and NO manager
2. Using a pool and a manager
3. Using single processes and NO manager
4. Using single processes and a manager
The Job
All conditions execute a job function the_job. the_job consists of some printing which is secured by a lock. Moreover, the input to the function is simply put into a queue (to see if it can be recovered from the queue). This input is simply an index idx from range(10) created in the main script called start_scenario (shown at the bottom).
def the_job(args):
    """The job for multiprocessing.

    Prints some stuff secured by a lock and
    finally puts the input into a queue.

    """
    idx = args[0]
    lock = args[1]
    queue = args[2]

    lock.acquire()
    print 'I'
    print 'was '
    print 'here '
    print '!!!!'
    print '1111'
    print 'einhundertelfzigelf\n'
    who = ' By run %d\n' % idx
    print who
    lock.release()

    queue.put(idx)
The success of a condition is defined as perfectly recalling the input from the queue, see the function read_queue at the bottom.
Conditions 1 and 2 are rather self-explanatory. Condition 1 involves creating a lock and a queue, and passing these to a process pool:
def scenario_1_pool_no_manager(jobfunc, args, ncores):
    """Runs a pool of processes WITHOUT a Manager for the lock and queue.

    FAILS!

    """
    mypool = mp.Pool(ncores)
    lock = mp.Lock()
    queue = mp.Queue()

    iterator = make_iterator(args, lock, queue)
    mypool.imap(jobfunc, iterator)

    mypool.close()
    mypool.join()

    return read_queue(queue)
(The helper function make_iterator is given at the bottom of this post.) Condition 1 fails with RuntimeError: Lock objects should only be shared between processes through inheritance.
Condition 2 is rather similar but now the lock and queue are under the supervision of a manager:
def scenario_2_pool_manager(jobfunc, args, ncores):
    """Runs a pool of processes WITH a Manager for the lock and queue.

    SUCCESSFUL!

    """
    mypool = mp.Pool(ncores)
    lock = mp.Manager().Lock()
    queue = mp.Manager().Queue()

    iterator = make_iterator(args, lock, queue)
    mypool.imap(jobfunc, iterator)

    mypool.close()
    mypool.join()

    return read_queue(queue)
In condition 3 new processes are started manually, and the lock and queue are created without a manager:
def scenario_3_single_processes_no_manager(jobfunc, args, ncores):
    """Runs an individual process for every task WITHOUT a Manager,

    SUCCESSFUL!

    """
    lock = mp.Lock()
    queue = mp.Queue()

    iterator = make_iterator(args, lock, queue)
    do_job_single_processes(jobfunc, iterator, ncores)

    return read_queue(queue)
Condition 4 is similar, but now using a manager again:
def scenario_4_single_processes_manager(jobfunc, args, ncores):
    """Runs an individual process for every task WITH a Manager,

    SUCCESSFUL!

    """
    lock = mp.Manager().Lock()
    queue = mp.Manager().Queue()

    iterator = make_iterator(args, lock, queue)
    do_job_single_processes(jobfunc, iterator, ncores)

    return read_queue(queue)
In both conditions - 3 and 4 - I start a new process for each of the 10 tasks of the_job, with at most ncores processes operating at the very same time. This is achieved with the following helper function:
def do_job_single_processes(jobfunc, iterator, ncores):
    """Runs a job function by starting individual processes for every task.

    At most `ncores` processes operate at the same time

    :param jobfunc: Job to do

    :param iterator:

        Iterator over different parameter settings,
        contains a lock and a queue

    :param ncores:

        Number of processes operating at the same time

    """
    keep_running = True
    process_dict = {}  # Dict containing all subprocesses

    while len(process_dict) > 0 or keep_running:

        terminated_procs_pids = []
        # First check if some processes did finish their job
        for pid, proc in process_dict.iteritems():

            # Remember the terminated processes
            if not proc.is_alive():
                terminated_procs_pids.append(pid)

        # And delete these from the process dict
        for terminated_proc in terminated_procs_pids:
            process_dict.pop(terminated_proc)

        # If we have less active processes than ncores and there is still
        # a job to do, add another process
        if len(process_dict) < ncores and keep_running:
            try:
                task = iterator.next()
                proc = mp.Process(target=jobfunc,
                                  args=(task,))
                proc.start()
                process_dict[proc.pid] = proc
            except StopIteration:
                # All tasks have been started
                keep_running = False

        time.sleep(0.1)
The Outcome
Only condition 1 fails (RuntimeError: Lock objects should only be shared between processes through inheritance) whereas the other 3 conditions are successful. I try to wrap my head around this outcome.
Why does the pool need to share a lock and queue between all processes but the individual processes from condition 3 don't?
What I know is that for the pool conditions (1 and 2) all data from the iterators is passed via pickling, whereas in the single-process conditions (3 and 4) all data from the iterators is passed by inheritance from the main process (I am using Linux). I guess that until the memory is changed from within a child process, the same memory that the parent process uses is accessed (copy-on-write). But as soon as one says lock.acquire(), this should be changed and the child processes do use different locks placed somewhere else in memory, don't they? How does one child process know that a brother has activated a lock that is not shared via a manager?
Finally, somewhat related is the question of how different conditions 3 and 4 really are. Both use individual processes, but they differ in their usage of a manager. Are both considered valid code? Or should one avoid using a manager if there is actually no need for one?
For those who simply want to copy and paste everything to execute the code, here is the full script:
__author__ = 'Me and myself'

import multiprocessing as mp
import time


def the_job(args):
    """The job for multiprocessing.

    Prints some stuff secured by a lock and
    finally puts the input into a queue.

    """
    idx = args[0]
    lock = args[1]
    queue = args[2]

    lock.acquire()
    print 'I'
    print 'was '
    print 'here '
    print '!!!!'
    print '1111'
    print 'einhundertelfzigelf\n'
    who = ' By run %d\n' % idx
    print who
    lock.release()

    queue.put(idx)


def read_queue(queue):
    """Turns a queue into a normal python list."""
    results = []
    while not queue.empty():
        result = queue.get()
        results.append(result)
    return results


def make_iterator(args, lock, queue):
    """Makes an iterator over args and passes the lock and queue to each element."""
    return ((arg, lock, queue) for arg in args)


def start_scenario(scenario_number=1):
    """Starts one of four multiprocessing scenarios.

    :param scenario_number: Index of scenario, 1 to 4

    """
    args = range(10)
    ncores = 3
    if scenario_number == 1:
        result = scenario_1_pool_no_manager(the_job, args, ncores)

    elif scenario_number == 2:
        result = scenario_2_pool_manager(the_job, args, ncores)

    elif scenario_number == 3:
        result = scenario_3_single_processes_no_manager(the_job, args, ncores)

    elif scenario_number == 4:
        result = scenario_4_single_processes_manager(the_job, args, ncores)

    if result != args:
        print 'Scenario %d fails: %s != %s' % (scenario_number, args, result)
    else:
        print 'Scenario %d successful!' % scenario_number


def scenario_1_pool_no_manager(jobfunc, args, ncores):
    """Runs a pool of processes WITHOUT a Manager for the lock and queue.

    FAILS!

    """
    mypool = mp.Pool(ncores)
    lock = mp.Lock()
    queue = mp.Queue()

    iterator = make_iterator(args, lock, queue)
    mypool.map(jobfunc, iterator)

    mypool.close()
    mypool.join()

    return read_queue(queue)


def scenario_2_pool_manager(jobfunc, args, ncores):
    """Runs a pool of processes WITH a Manager for the lock and queue.

    SUCCESSFUL!

    """
    mypool = mp.Pool(ncores)
    lock = mp.Manager().Lock()
    queue = mp.Manager().Queue()

    iterator = make_iterator(args, lock, queue)
    mypool.map(jobfunc, iterator)

    mypool.close()
    mypool.join()

    return read_queue(queue)


def scenario_3_single_processes_no_manager(jobfunc, args, ncores):
    """Runs an individual process for every task WITHOUT a Manager,

    SUCCESSFUL!

    """
    lock = mp.Lock()
    queue = mp.Queue()

    iterator = make_iterator(args, lock, queue)
    do_job_single_processes(jobfunc, iterator, ncores)

    return read_queue(queue)


def scenario_4_single_processes_manager(jobfunc, args, ncores):
    """Runs an individual process for every task WITH a Manager,

    SUCCESSFUL!

    """
    lock = mp.Manager().Lock()
    queue = mp.Manager().Queue()

    iterator = make_iterator(args, lock, queue)
    do_job_single_processes(jobfunc, iterator, ncores)

    return read_queue(queue)


def do_job_single_processes(jobfunc, iterator, ncores):
    """Runs a job function by starting individual processes for every task.

    At most `ncores` processes operate at the same time

    :param jobfunc: Job to do

    :param iterator:

        Iterator over different parameter settings,
        contains a lock and a queue

    :param ncores:

        Number of processes operating at the same time

    """
    keep_running = True
    process_dict = {}  # Dict containing all subprocesses

    while len(process_dict) > 0 or keep_running:

        terminated_procs_pids = []
        # First check if some processes did finish their job
        for pid, proc in process_dict.iteritems():

            # Remember the terminated processes
            if not proc.is_alive():
                terminated_procs_pids.append(pid)

        # And delete these from the process dict
        for terminated_proc in terminated_procs_pids:
            process_dict.pop(terminated_proc)

        # If we have less active processes than ncores and there is still
        # a job to do, add another process
        if len(process_dict) < ncores and keep_running:
            try:
                task = iterator.next()
                proc = mp.Process(target=jobfunc,
                                  args=(task,))
                proc.start()
                process_dict[proc.pid] = proc
            except StopIteration:
                # All tasks have been started
                keep_running = False

        time.sleep(0.1)


def main():
    """Runs 1 out of 4 different multiprocessing scenarios"""
    start_scenario(1)


if __name__ == '__main__':
    main()
The Answer
multiprocessing.Lock is implemented using a Semaphore object provided by the OS. On Linux, the child just inherits a handle to the Semaphore from the parent via os.fork. This isn't a copy of the semaphore; it's actually inheriting the same handle the parent has, the same way file descriptors can be inherited. Windows, on the other hand, doesn't support os.fork, so it has to pickle the Lock. It does this by creating a duplicate handle to the Windows Semaphore used internally by the multiprocessing.Lock object, using the Windows DuplicateHandle API, which states:
The duplicate handle refers to the same object as the original handle. Therefore, any changes to the object are reflected through both handles.
The DuplicateHandle API allows you to give ownership of the duplicated handle to the child process, so that the child process can actually use it after unpickling it. By creating a duplicated handle owned by the child, you can effectively "share" the lock object.
Here is the relevant code from multiprocessing/synchronize.py:
class SemLock(object):

    def __init__(self, kind, value, maxvalue):
        sl = self._semlock = _multiprocessing.SemLock(kind, value, maxvalue)
        debug('created semlock with handle %s' % sl.handle)
        self._make_methods()

        if sys.platform != 'win32':
            def _after_fork(obj):
                obj._semlock._after_fork()
            register_after_fork(self, _after_fork)

    def _make_methods(self):
        self.acquire = self._semlock.acquire
        self.release = self._semlock.release
        self.__enter__ = self._semlock.__enter__
        self.__exit__ = self._semlock.__exit__

    def __getstate__(self):  # This is called when you try to pickle the `Lock`.
        assert_spawning(self)
        sl = self._semlock
        return (Popen.duplicate_for_child(sl.handle), sl.kind, sl.maxvalue)

    def __setstate__(self, state):  # This is called when unpickling a `Lock`
        self._semlock = _multiprocessing.SemLock._rebuild(*state)
        debug('recreated blocker with handle %r' % state[0])
        self._make_methods()
Note the assert_spawning call in __getstate__, which gets called when pickling the object. Here's how that is implemented:
#
# Check that the current thread is spawning a child process
#

def assert_spawning(self):
    if not Popen.thread_is_spawning():
        raise RuntimeError(
            '%s objects should only be shared between processes'
            ' through inheritance' % type(self).__name__
            )
That function is the one that makes sure you're "inheriting" the Lock, by calling thread_is_spawning. On Linux, that method just returns False:
@staticmethod
def thread_is_spawning():
    return False
This is because Linux doesn't need to pickle to inherit Lock, so if __getstate__ is actually being called on Linux, we must not be inheriting. On Windows, there's more going on:
def dump(obj, file, protocol=None):
    ForkingPickler(file, protocol).dump(obj)

class Popen(object):
    '''
    Start a subprocess to run the code of a process object
    '''
    _tls = thread._local()

    def __init__(self, process_obj):
        ...
        # send information to child
        prep_data = get_preparation_data(process_obj._name)
        to_child = os.fdopen(wfd, 'wb')
        Popen._tls.process_handle = int(hp)
        try:
            dump(prep_data, to_child, HIGHEST_PROTOCOL)
            dump(process_obj, to_child, HIGHEST_PROTOCOL)
        finally:
            del Popen._tls.process_handle
            to_child.close()

    @staticmethod
    def thread_is_spawning():
        return getattr(Popen._tls, 'process_handle', None) is not None
Here, thread_is_spawning returns True if the Popen._tls object has a process_handle attribute. We can see that the process_handle attribute gets created in __init__, then the data we want inherited is passed from the parent to child using dump, then the attribute is deleted. So thread_is_spawning will only be True during __init__. According to this python-ideas mailing list thread, this is actually an artificial limitation added to simulate the same behavior as os.fork on Linux. Windows actually could support passing the Lock at any time, because DuplicateHandle can be run at any time.
All of the above applies to the Queue object because it uses Lock internally.
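As a quick check - this is my own sketch, not part of the original answer, assuming Linux and CPython 2.x - you can trigger the same RuntimeError for both types directly, without any Pool involved, by pickling them outside of process spawning:

import multiprocessing as mp
import pickle

if __name__ == '__main__':
    # Pickling outside of the spawning machinery calls __getstate__, which
    # calls assert_spawning, so both objects raise the RuntimeError quoted above.
    for obj in (mp.Lock(), mp.Queue()):
        try:
            pickle.dumps(obj)
        except RuntimeError as exc:
            print exc  # '... objects should only be shared between processes through inheritance'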
I would say that inheriting Lock objects is preferable to using a Manager.Lock(), because when you use a Manager.Lock, every single call you make to the Lock must be sent via IPC to the Manager process, which is going to be much slower than using a shared Lock that lives inside the calling process. Both approaches are perfectly valid, though.
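To make that overhead tangible, here is a minimal timing sketch of my own (not from the original answer; exact numbers will vary by machine): every acquire/release on a Manager().Lock() is an IPC round-trip to the manager process, while mp.Lock() operates on a local semaphore handle.

import multiprocessing as mp
import time

def time_lock(lock, n=10000):
    """Times n acquire/release pairs on the given lock."""
    start = time.time()
    for _ in xrange(n):
        lock.acquire()
        lock.release()
    return time.time() - start

if __name__ == '__main__':
    print 'mp.Lock():        %.3f s' % time_lock(mp.Lock())
    print 'Manager().Lock(): %.3f s' % time_lock(mp.Manager().Lock())

The manager version should come out much slower, which is exactly the cost of the IPC round-trips described above.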
Finally, it is possible to pass a Lock to all members of a Pool without using a Manager, using the initializer/initargs keyword arguments:
lock = None
queue = None

def initialize_shared(l, q):
    """Pool initializer: runs once inside each worker process and stores
    the inherited lock and queue as module-level globals."""
    global lock, queue
    lock = l
    queue = q

def scenario_1_pool_no_manager(jobfunc, args, ncores):
    """Runs a pool of processes WITHOUT a Manager for the lock and queue.
    """
    lock = mp.Lock()
    queue = mp.Queue()  # The Queue has the same pickling restriction as the
                        # Lock, so it must be inherited the same way.
    mypool = mp.Pool(ncores, initializer=initialize_shared,
                     initargs=(lock, queue))

    mypool.imap(jobfunc, args)  # Don't pass the lock or queue with each task.
                                # They have to be used as globals in the
                                # children. (This means `jobfunc` would need
                                # to be re-written slightly; see the sketch
                                # below.)
    mypool.close()
    mypool.join()

    return read_queue(queue)
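For completeness, here is a sketch of that slight rewrite of the_job (my own variant, not from the original answer; the name initialize_shared above is also mine): the lock and queue are no longer part of each task tuple, so the worker reads them from the module-level globals set by the initializer.

def the_job(idx):
    """Variant of the_job for the initializer-based pool.
    `lock` and `queue` are the module-level globals that
    initialize_shared stored in this worker process."""
    lock.acquire()
    print ' By run %d\n' % idx
    lock.release()
    queue.put(idx)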
This works because arguments passed to initargs get passed to the __init__ method of the Process objects that run inside the Pool, so they end up being inherited, rather than pickled.