Problem description
I am using Python's multiprocessing module to process large numpy arrays in parallel. The arrays are memory-mapped using numpy.load(mmap_mode='r') in the master process. After that, multiprocessing.Pool() forks the process (I presume).
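Roughly, the setup looks like this (a hypothetical sketch; 'data.npy', the chunk size, and calculation are placeholders rather than the real code):

import numpy as np
import multiprocessing

def calculation(chunk):
    """Placeholder for the real per-chunk work."""
    return chunk.mean()

def main():
    data = np.load('data.npy', mmap_mode='r')    # memory-mapped in the master
    pool = multiprocessing.Pool()                 # worker processes fork from here
    slices = (data[i:i + 100] for i in range(0, data.size, 100))
    results = list(pool.imap(calculation, slices))

if __name__ == '__main__':
    main()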
Everything seems to work fine, except I am getting lines like:
AttributeError("'NoneType' object has no attribute 'tell'",) in <bound method memmap.__del__ of memmap([ 0.57735026, 0.57735026, 0.57735026, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ], dtype=float32)> ignored
in the unittest logs. The tests pass fine, nevertheless.
Any idea what's going on there?
Using Python 2.7.2, OS X, NumPy 1.6.1.
Update:
After some debugging, I hunted down the cause to a code path that was using a (small slice of) this memory-mapped numpy array as input to a Pool.imap call.
Apparently the "issue" is with the way multiprocessing.Pool.imap passes its input to the new processes: it uses pickle. This doesn't work with mmapped numpy arrays, and something inside breaks which leads to the error.
I found this reply by Robert Kern which seems to address the same issue. He suggests creating a special code path for when the imap input comes from a memory-mapped array: memory-mapping the same array manually in the spawned process.
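Roughly, that special code path might look something like this (a hypothetical sketch with made-up names such as 'data.npy' and worker; each task carries only the slice bounds, and the worker re-maps the file itself; mapping once per worker via a Pool initializer would be less wasteful, but this keeps the sketch short):

import numpy as np
import multiprocessing

def calculation(chunk):
    """Dummy calculation, standing in for the real work."""
    return chunk.mean() - chunk.std()

def worker(bounds):
    start, stop = bounds
    data = np.load('data.npy', mmap_mode='r')   # re-map the file in the child
    return calculation(data[start:stop])

def main():
    data = np.load('data.npy', mmap_mode='r')
    pool = multiprocessing.Pool()
    bounds = [(i, min(i + 100, data.size)) for i in range(0, data.size, 100)]
    results = pool.imap(worker, bounds)
    print(np.fromiter(results, dtype=np.float64))

if __name__ == '__main__':
    main()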
This would be so complicated and ugly that I'd rather live with the error and the extra memory copies. Is there any other way that would be lighter on modifying existing code?
Recommended answer
My usual approach (if you can live with extra memory copies) is to do all IO in one process and then send things out to a pool of worker threads. To load a slice of a memmapped array into memory just do x = np.array(data[yourslice]) (data[yourslice].copy() doesn't actually do this, which can lead to some confusion).
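As a quick illustration of that copy() gotcha (a hedged snippet; it assumes a raw float file such as the data.dat generated just below, and the variable names are made up):

import numpy as np

data = np.memmap('data.dat', dtype=np.float, mode='r')

a = data[:100].copy()      # still reported as a numpy.memmap instance
b = np.array(data[:100])   # a plain, fully in-memory numpy.ndarray
print(type(a))
print(type(b))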
First off, let's generate some test data:
import numpy as np
np.random.random(10000).tofile('data.dat')
You can reproduce your errors with something like this:
import numpy as np
import multiprocessing

def main():
    data = np.memmap('data.dat', dtype=np.float, mode='r')
    pool = multiprocessing.Pool()
    results = pool.imap(calculation, chunks(data))
    results = np.fromiter(results, dtype=np.float)

def chunks(data, chunksize=100):
    """Overly-simple chunker..."""
    intervals = range(0, data.size, chunksize) + [None]
    for start, stop in zip(intervals[:-1], intervals[1:]):
        yield data[start:stop]

def calculation(chunk):
    """Dummy calculation."""
    return chunk.mean() - chunk.std()

if __name__ == '__main__':
    main()
And if you just switch to yielding np.array(data[start:stop]) instead, you'll fix the problem:
import numpy as np
import multiprocessing

def main():
    data = np.memmap('data.dat', dtype=np.float, mode='r')
    pool = multiprocessing.Pool()
    results = pool.imap(calculation, chunks(data))
    results = np.fromiter(results, dtype=np.float)

def chunks(data, chunksize=100):
    """Overly-simple chunker..."""
    intervals = range(0, data.size, chunksize) + [None]
    for start, stop in zip(intervals[:-1], intervals[1:]):
        yield np.array(data[start:stop])

def calculation(chunk):
    """Dummy calculation."""
    return chunk.mean() - chunk.std()

if __name__ == '__main__':
    main()
Of course, this does make an extra in-memory copy of each chunk.
In the long run, you'll probably find that it's easier to switch away from memmapped files and move to something like HDF. This is especially true if your data is multidimensional. (I'd recommend h5py, but pyTables is nice if your data is "table-like".)
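For what it's worth, a minimal sketch of the h5py route might look like this (file and dataset names are made up; it reuses the data.dat file from above). Slicing an h5py dataset hands back a plain in-memory ndarray, so the chunks pickle cleanly when sent to a Pool:

import numpy as np
import h5py

# One-time conversion of the raw file into an HDF5 dataset.
raw = np.fromfile('data.dat')            # default dtype is float64, matching tofile() above
with h5py.File('data.h5', 'w') as f:
    f.create_dataset('data', data=raw)

# Later, read slices on demand; each slice is an ordinary ndarray.
with h5py.File('data.h5', 'r') as f:
    chunk = f['data'][:100]              # plain numpy.ndarray, safe to send to workers
    print(type(chunk))
    print(chunk.mean())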
Good luck, at any rate!