Problem description
I am using Python's multiprocessing module to process large numpy arrays in parallel. The arrays are memory-mapped using numpy.load(mmap_mode='r') in the master process. After that, multiprocessing.Pool() forks the process (I presume).
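Roughly, the setup looks like this (a hypothetical sketch; 'data.npy', the chunk size, and calculation are placeholders rather than the real code):

import numpy as np
import multiprocessing

def calculation(chunk):
    """Placeholder for the real per-chunk work."""
    return chunk.mean()

def main():
    data = np.load('data.npy', mmap_mode='r')    # memory-mapped in the master
    pool = multiprocessing.Pool()                 # worker processes fork from here
    slices = (data[i:i + 100] for i in range(0, data.size, 100))
    results = list(pool.imap(calculation, slices))

if __name__ == '__main__':
    main()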
Everything seems to work fine, except I am getting lines like:
AttributeError("'NoneType' object has no attribute 'tell'",) in <bound method memmap.__del__ of memmap([ 0.57735026, 0.57735026, 0.57735026, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ], dtype=float32)> ignored
in the unittest logs. The tests pass fine, nevertheless.
Any idea what's going on there?
Using Python 2.7.2, OS X, NumPy 1.6.1.
Update:
After some debugging, I hunted down the cause to a code path that was using a (small slice of) this memory-mapped numpy array as input to a Pool.imap call.
Apparently the "issue" is with the way multiprocessing.Pool.imap passes its input to the new processes: it uses pickle. This doesn't work with mmapped numpy arrays, and something inside breaks which leads to the error.
I found this reply by Robert Kern which seems to address the same issue. He suggests creating a special code path for when the imap input comes from a memory-mapped array: memory-mapping the same array manually in the spawned process.
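Roughly, that special code path might look something like this (a hypothetical sketch with made-up names such as 'data.npy' and worker; each task carries only the slice bounds, and the worker re-maps the file itself; mapping once per worker via a Pool initializer would be less wasteful, but this keeps the sketch short):

import numpy as np
import multiprocessing

def calculation(chunk):
    """Dummy calculation, standing in for the real work."""
    return chunk.mean() - chunk.std()

def worker(bounds):
    start, stop = bounds
    data = np.load('data.npy', mmap_mode='r')   # re-map the file in the child
    return calculation(data[start:stop])

def main():
    data = np.load('data.npy', mmap_mode='r')
    pool = multiprocessing.Pool()
    bounds = [(i, min(i + 100, data.size)) for i in range(0, data.size, 100)]
    results = pool.imap(worker, bounds)
    print(np.fromiter(results, dtype=np.float64))

if __name__ == '__main__':
    main()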
This would be so complicated and ugly that I'd rather live with the error and the extra memory copies. Is there any other way that would be lighter on modifying existing code?
Recommended answer
My usual approach (if you can live with extra memory copies) is to do all IO in one process and then send things out to a pool of worker threads. To load a slice of a memmapped array into memory just do x = np.array(data[yourslice]) (data[yourslice].copy() doesn't actually do this, which can lead to some confusion).
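As a quick illustration of that copy() gotcha (a hedged snippet; it assumes a raw float file such as the data.dat generated just below, and the variable names are made up):

import numpy as np

data = np.memmap('data.dat', dtype=np.float, mode='r')

a = data[:100].copy()      # still reported as a numpy.memmap instance
b = np.array(data[:100])   # a plain, fully in-memory numpy.ndarray
print(type(a))
print(type(b))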
First off, let's generate some test data:
import numpy as np
np.random.random(10000).tofile('data.dat')
You can reproduce your errors with something like this:
import numpy as np
import multiprocessing

def main():
    data = np.memmap('data.dat', dtype=np.float, mode='r')
    pool = multiprocessing.Pool()
    results = pool.imap(calculation, chunks(data))
    results = np.fromiter(results, dtype=np.float)

def chunks(data, chunksize=100):
    """Overly-simple chunker..."""
    intervals = range(0, data.size, chunksize) + [None]
    for start, stop in zip(intervals[:-1], intervals[1:]):
        yield data[start:stop]

def calculation(chunk):
    """Dummy calculation."""
    return chunk.mean() - chunk.std()

if __name__ == '__main__':
    main()
And if you just switch to yielding np.array(data[start:stop]) instead, you'll fix the problem:
import numpy as np
import multiprocessing

def main():
    data = np.memmap('data.dat', dtype=np.float, mode='r')
    pool = multiprocessing.Pool()
    results = pool.imap(calculation, chunks(data))
    results = np.fromiter(results, dtype=np.float)

def chunks(data, chunksize=100):
    """Overly-simple chunker..."""
    intervals = range(0, data.size, chunksize) + [None]
    for start, stop in zip(intervals[:-1], intervals[1:]):
        yield np.array(data[start:stop])

def calculation(chunk):
    """Dummy calculation."""
    return chunk.mean() - chunk.std()

if __name__ == '__main__':
    main()
Of course, this does make an extra in-memory copy of each chunk.
In the long run, you'll probably find that it's easier to switch away from memmapped files and move to something like HDF. This is especially true if your data is multidimensional. (I'd recommend h5py, but pyTables is nice if your data is "table-like".)
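For what it's worth, a minimal sketch of the h5py route might look like this (file and dataset names are made up; it reuses the data.dat file from above). Slicing an h5py dataset hands back a plain in-memory ndarray, so the chunks pickle cleanly when sent to a Pool:

import numpy as np
import h5py

# One-time conversion of the raw file into an HDF5 dataset.
raw = np.fromfile('data.dat')            # default dtype is float64, matching tofile() above
with h5py.File('data.h5', 'w') as f:
    f.create_dataset('data', data=raw)

# Later, read slices on demand; each slice is an ordinary ndarray.
with h5py.File('data.h5', 'r') as f:
    chunk = f['data'][:100]              # plain numpy.ndarray, safe to send to workers
    print(type(chunk))
    print(chunk.mean())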
Good luck, at any rate!