Problem description
I am using the Pool class from Python's multiprocessing library to write a program that will run on an HPC cluster.
Here is an abstraction of what I am trying to do:
from multiprocessing import Pool

def myFunction(x):
    # myObject is a global variable in this case
    return myFunction2(x, myObject)

def myFunction2(x, myObject):
    myObject.modify()  # here I am calling some method that changes myObject
    return myObject.f(x)

poolVar = Pool()
argsArray = [...]  # ARGS ARRAY GOES HERE
output = poolVar.map(myFunction, argsArray)
The function f(x) is contained in a *.so file, i.e., it calls a C function.
The problem I am having is that the value of the output variable is different each time I run my program (even though the function myObject.f() is deterministic). (If I use only one process, the output variable is the same each time I run the program.)
I have tried creating the object inside the function rather than storing it as a global variable:
def myFunction(x):
    myObject = createObject()
    return myFunction2(x, myObject)
However, in my program the object creation is expensive, so it is much easier to create myObject once and then modify it each time I call myFunction2(). Thus, I would prefer not to create the object each time.
Do you have any tips? I am very new to parallel programming, so I could be going about this all wrong. I decided to use the Pool class because I wanted to start with something simple, but I am willing to try a better way of doing it.
Recommended answer
"I am using the Pool class from Python's multiprocessing library to do some shared memory processing on an HPC cluster."
Processes are not threads! You cannot simply replace Thread with Process and expect everything to work the same. Processes do not share memory, which means that global variables are copied into each process, so a change made in a worker does not change their value in the original process.
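The copying behaviour can be seen in a minimal sketch. The names counter, increment and run_demo are made up for illustration and are not part of the original program:

```python
from multiprocessing import Process

counter = 0  # module-level global; every process works on its own copy


def increment():
    """Modify the global, but only inside the child process."""
    global counter
    counter += 1


def run_demo():
    """Run increment() in a child process and return the parent's counter."""
    child = Process(target=increment)
    child.start()
    child.join()
    return counter


if __name__ == "__main__":
    print(run_demo())  # prints 0: the child's change never reaches the parent
```

The same applies to myObject in the question: each pool worker modifies its own private copy, so the state a given call sees depends on which earlier calls happened to run in that worker.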
If you want to use shared memory between processes, you must use multiprocessing's data types, such as Value and Array, or use a Manager to create shared lists etc.
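As a rough sketch of what Value and Array look like in use (the worker function and the "i" type codes are chosen for illustration only; the real object in the question is more complex than a C int):

```python
from multiprocessing import Array, Process, Value


def worker(total, squares, index):
    """Write into shared memory from a child process."""
    squares[index] = index * index
    with total.get_lock():  # += on a Value is a read-modify-write, so lock it
        total.value += index


def run_demo(n=4):
    total = Value("i", 0)    # shared C int, initially 0
    squares = Array("i", n)  # shared C int array of length n, zero-filled
    procs = [Process(target=worker, args=(total, squares, i)) for i in range(n)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return total.value, list(squares)


if __name__ == "__main__":
    print(run_demo())  # (6, [0, 1, 4, 9])
```

Unlike an ordinary global, the writes made by the children are visible in the parent after join().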
In particular, you might be interested in the manager's register method (BaseManager.register in multiprocessing.managers), which allows a manager to create shared custom objects (although they must be picklable).
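A hedged sketch of the register approach follows. Counter and ObjectManager are hypothetical names standing in for the real object; the object lives in the manager's process and the workers talk to it through a proxy, so method arguments and return values have to be picklable:

```python
import threading
from multiprocessing import Process
from multiprocessing.managers import BaseManager


class Counter:
    """Hypothetical stand-in for the expensive shared object."""

    def __init__(self):
        self._lock = threading.Lock()  # the manager serves clients from threads
        self._total = 0

    def add(self, amount):
        with self._lock:
            self._total += amount

    def total(self):
        return self._total


class ObjectManager(BaseManager):
    """Manager subclass; register() below gives it a Counter() factory."""


# register() is a classmethod: afterwards manager.Counter() returns a proxy
ObjectManager.register("Counter", Counter)


def worker(shared, amount):
    shared.add(amount)  # forwarded to the single real object in the manager


def run_demo():
    with ObjectManager() as manager:
        shared = manager.Counter()
        procs = [Process(target=worker, args=(shared, n)) for n in (1, 2, 3)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        return shared.total()


if __name__ == "__main__":
    print(run_demo())  # 6
```

Every call on the proxy is a round trip to the manager process, which is why this can easily be slower than giving each worker its own object.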
However, I'm not sure whether this will improve performance, since any communication between processes requires pickling, and pickling usually takes more time than simply instantiating the object.
Note that you can do some initialization of the worker processes by passing the initializer and initargs arguments when creating the Pool.
For example, in its simplest form, to create a global variable in the worker process:
def initializer():
    global data
    data = createObject()
Used as:
pool = Pool(4, initializer, ())
Then the worker functions can use the data global variable without worries.
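Putting the pieces together, a self-contained sketch might look like the following. create_object, init_worker and work are hypothetical stand-ins (the real createObject and the C-backed f(x) are not shown in the question), and the dictionary is only a placeholder for the expensive object:

```python
from multiprocessing import Pool

data = None  # set once per worker by init_worker


def create_object(scale):
    """Hypothetical stand-in for the expensive constructor."""
    return {"scale": scale}


def init_worker(scale):
    """Runs once in each worker process when the pool starts."""
    global data
    data = create_object(scale)


def work(x):
    """Uses the per-worker object without rebuilding or mutating it."""
    return data["scale"] * x


def run_demo():
    # initargs is the tuple of arguments passed to the initializer
    with Pool(2, initializer=init_worker, initargs=(10,)) as pool:
        return pool.map(work, [1, 2, 3])


if __name__ == "__main__":
    print(run_demo())  # [10, 20, 30]
```

The object is built once per worker rather than once per task. Note that if the work function mutates data and later results depend on that accumulated state, the output can still vary with how tasks are distributed among workers, so any per-call modification should start from a known state.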
Style note: Never use the name of a built-in for your variables or modules. In your case, object is a built-in. Otherwise you will end up with unexpected errors that may be obscure and hard to track down.
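A small illustration of the kind of error this can cause (the function check and its dictionary are invented for the example):

```python
def check(value):
    """Show what happens once the name 'object' is rebound."""
    object = {"name": "myObject"}  # shadows the built-in object type
    try:
        # isinstance() now receives a dict instance instead of a type
        return isinstance(value, object)
    except TypeError as exc:
        return f"TypeError: {exc}"


if __name__ == "__main__":
    print(check(3))  # prints a TypeError message instead of True
```

The failure shows up far from the assignment that caused it, which is what makes such bugs hard to trace.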