Problem description
I need to read some very large text files (100+ MB), process every line with a regex, and store the data in a structure. My structure inherits from defaultdict and has a read(self) method that reads the self.file_name file.
Look at this very simple (but not realistic) example; I'm not using a regex here, just splitting lines:
import multiprocessing
from collections import defaultdict

def SingleContainer():
    return list()

class Container(defaultdict):
    """
    this class store odd line in self["odd"] and even line in self["even"].
    It is stupid, but it's only an example. In the real case the class
    has additional methods that do computation on readen data.
    """
    def __init__(self, file_name):
        if type(file_name) != str:
            raise AttributeError, "%s is not a string" % file_name
        defaultdict.__init__(self, SingleContainer)
        self.file_name = file_name
        self.readen_lines = 0

    def read(self):
        f = open(self.file_name)
        print "start reading file %s" % self.file_name
        for line in f:
            self.readen_lines += 1
            values = line.split()
            key = {0: "even", 1: "odd"}[self.readen_lines % 2]
            self[key].append(values)
        print "readen %d lines from file %s" % (self.readen_lines, self.file_name)

def do(file_name):
    container = Container(file_name)
    container.read()
    return container.items()

if __name__ == "__main__":
    file_names = ["r1_200909.log", "r1_200910.log"]
    pool = multiprocessing.Pool(len(file_names))
    result = pool.map(do, file_names)
    pool.close()
    pool.join()
    print "Finish"
At the end I need to join all the results into a single Container. It is important that the order of the lines is preserved. My approach is too slow when returning the values. Is there a better solution? I'm using Python 2.6 on Linux.
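For context, the merge step described above could look something like this (a minimal sketch; the "merged" label is arbitrary, and result is the list returned by pool.map above, which preserves the input order):

merged = Container("merged")    # the label only exists because __init__ requires a string
for items in result:            # one items() list per file, in the original file order
    for key, values in items:
        merged[key].extend(values)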
Recommended answer
You're probably hitting two problems.
One of them was mentioned: you're reading multiple files at once. Those reads will end up being interleaved, causing disk thrashing. You want to read whole files at once, and then only multithread the computation on the data.
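One way to act on that (a rough sketch with an assumed helper name, not code from the answer): read each file completely in the parent process, one after another, so the disk only ever serves a single sequential read, and give the workers nothing but in-memory computation to do.

import multiprocessing
from collections import defaultdict

def process_lines(lines):
    # pure computation: the worker never touches the disk
    container = defaultdict(list)
    for n, line in enumerate(lines, 1):
        key = {0: "even", 1: "odd"}[n % 2]
        container[key].append(line.split())
    return container.items()

if __name__ == "__main__":
    file_names = ["r1_200909.log", "r1_200910.log"]
    # sequential, non-interleaved reads in the parent process
    all_lines = [open(name).readlines() for name in file_names]
    pool = multiprocessing.Pool(len(file_names))
    result = pool.map(process_lines, all_lines)
    pool.close()
    pool.join()

Note that pool.map still has to pickle the input lines into the workers (and the items back out), which for 100+ MB files is exactly the kind of bulk transfer the next point is about.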
Second, you're hitting the overhead of Python's multiprocessing module. It's not actually using threads, but instead starting multiple processes and serializing the results through a pipe. That's very slow for bulk data--in fact, it seems to be slower than the work you're doing in the thread (at least in the example). This is the real-world problem caused by the GIL.
If I modify do() to return None instead of container.items() to disable the extra data copy, this example is faster than a single thread, as long as the files are already cached:
Two threads: 0.36elapsed 168%CPU
One thread (replace pool.map with map): 0:00.52elapsed 98%CPU
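For reference, the modification described there is just this (with the builtin map substituted for pool.map in the single-thread run):

def do(file_name):
    container = Container(file_name)
    container.read()
    # keep the parsed data in the worker instead of pickling it back through
    # the pipe; this isolates the reading/parsing cost for the timing comparison
    return None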
Unfortunately, the GIL problem is fundamental and can't be worked around from inside Python.