Question
I'm trying to process the contents of a tarfile using multiprocessing.Pool. I'm able to successfully use the ThreadPool implementation within the multiprocessing module, but would like to use processes instead of threads, as that would possibly be faster and eliminate some changes made for Matplotlib to handle the multithreaded environment. I'm getting an error that I suspect is related to processes not sharing address space, but I'm not sure how to fix it:
Traceback (most recent call last):
  File "test_tarfile.py", line 32, in <module>
    test_multiproc()
  File "test_tarfile.py", line 24, in test_multiproc
    pool.map(read_file, files)
  File "/ldata/whitcomb/epd-7.1-2-rh5-x86_64/lib/python2.7/multiprocessing/pool.py", line 225, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/ldata/whitcomb/epd-7.1-2-rh5-x86_64/lib/python2.7/multiprocessing/pool.py", line 522, in get
    raise self._value
ValueError: I/O operation on closed file
The actual program is more complicated, but this is an example of what I'm doing that reproduces the error:
from multiprocessing.pool import ThreadPool, Pool
import StringIO
import tarfile

def write_tar():
    tar = tarfile.open('test.tar', 'w')
    contents = 'line1'
    info = tarfile.TarInfo('file1.txt')
    info.size = len(contents)
    tar.addfile(info, StringIO.StringIO(contents))
    tar.close()

def test_multithread():
    tar = tarfile.open('test.tar')
    files = [tar.extractfile(member) for member in tar.getmembers()]
    pool = ThreadPool(processes=1)
    pool.map(read_file, files)
    tar.close()

def test_multiproc():
    tar = tarfile.open('test.tar')
    files = [tar.extractfile(member) for member in tar.getmembers()]
    pool = Pool(processes=1)
    pool.map(read_file, files)
    tar.close()

def read_file(f):
    print f.read()

write_tar()
test_multithread()
test_multiproc()
I suspect that something goes wrong when the TarInfo object is passed into the other process but the parent TarFile is not, but I'm not sure how to fix it in the multiprocess case. Can I do this without having to extract files from the tarball and write them to disk?
Answer
You're not passing a TarInfo object into the other process; you're passing the result of tar.extractfile(member) into the other process, where member is a TarInfo object. The extractfile(...) method returns a file-like object which has, among other things, a read() method that operates on the original tar file you opened with tar = tarfile.open('test.tar').
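Under Python 3 (the question uses Python 2, where the failure shows up later as the ValueError above), the underlying problem surfaces even earlier: the file-like object returned by extractfile() cannot be pickled at all, and pickling is exactly what Pool.map must do to ship each argument to a worker process. A minimal sketch, assuming Python 3:

```python
# Sketch assuming Python 3: extracted file objects are not picklable,
# so they cannot cross the process boundary that Pool.map creates.
import io
import pickle
import tarfile

# Build a one-member tar entirely in memory (no disk needed for the demo).
buf = io.BytesIO()
data = b'line1'
info = tarfile.TarInfo('file1.txt')
info.size = len(data)
with tarfile.open(fileobj=buf, mode='w') as tar:
    tar.addfile(info, io.BytesIO(data))
buf.seek(0)

tar = tarfile.open(fileobj=buf)
member_file = tar.extractfile('file1.txt')  # file-like object tied to `tar`

try:
    pickle.dumps(member_file)  # what Pool.map does to each argument
    picklable = True
except TypeError:
    picklable = False
print(picklable)
```

So the arguments you hand to Pool.map should be plain picklable data (such as member names), not live file handles.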
However, you can't use an open file from one process in another process; you have to re-open the file. I replaced your test_multiproc() with this:
def test_multiproc():
    tar = tarfile.open('test.tar')
    files = [name for name in tar.getnames()]
    pool = Pool(processes=1)
    result = pool.map(read_file2, files)
    tar.close()
and added this:
def read_file2(name):
    t2 = tarfile.open('test.tar')
    print t2.extractfile(name).read()
    t2.close()
and was able to get your code working.