Problem Description
I've got an XML file I want to parse with Python. What is the best way to do this? Reading the entire document into memory would be disastrous; I need to somehow read it one node at a time.
Existing XML solutions I know of:
- ElementTree
- minixml
but I'm afraid they aren't quite going to work because of the problem I mentioned. Also, I can't open the file in a text editor - any good tips in general for working with giant text files?
Recommended Answer
First, have you tried ElementTree (either the built-in pure-Python or C version, or, better, the lxml version)? I'm pretty sure none of them actually reads the whole file into memory.
The problem, of course, is that whether or not it reads the whole file into memory, the resulting parsed tree ends up in memory.
ElementTree has a nifty solution that's pretty simple, and often sufficient: iterparse.
import xml.etree.ElementTree as ET

for event, elem in ET.iterparse(xmlfile, events=('end',)):
    ...
The key here is that you can modify the tree as it's built up (by replacing the contents with a summary containing only what the parent node will need). By throwing out all the stuff you don't need to keep in memory as it comes in, you can stick to parsing things in the usual order without running out of memory.
The linked page gives more details, including some examples for modifying XML-RPC and plist documents as they're processed. (In those cases, it's to make the resulting object simpler to use, not to save memory, but they should be enough to get the idea across.)
This only helps if you can think of a way to summarize as you go. (In the most trivial case, where the parent doesn't need any info from its children, this is just elem.clear().) Otherwise, this won't work for you.
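For instance, a minimal sketch of that trivial case might look like the following (the file name huge.xml and the record tag are placeholders for illustration, not anything from the original question):

import xml.etree.ElementTree as ET

count = 0
for event, elem in ET.iterparse('huge.xml', events=('end',)):
    if elem.tag == 'record':   # hypothetical element you actually care about
        count += 1             # do your per-record work here
        elem.clear()           # discard the element's contents so memory use stays flat
print(count)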
The standard solution is SAX, which is a callback-based API that lets you operate on the tree a node at a time. You don't need to worry about truncating nodes as you do with iterparse, because the nodes don't exist after you've parsed them.
Most of the best SAX examples out there are for Java or Javascript, but they're not too hard to figure out. For example, if you look at http://cs.au.dk/~amoeller/XML/programming/saxexample.html you should be able to figure out how to write it in Python (as long as you know where to find the documentation for xml.sax).
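As a rough sketch of what that might look like with xml.sax (again, huge.xml and the idea of simply counting elements are assumptions made up for illustration):

import xml.sax

class CountingHandler(xml.sax.ContentHandler):
    # Receives callbacks as the parser streams through the file;
    # nothing is kept around once an element has been handled.
    def __init__(self):
        super().__init__()
        self.element_count = 0

    def startElement(self, name, attrs):
        self.element_count += 1

    def endElement(self, name):
        pass  # per-element cleanup or summarizing would go here

handler = CountingHandler()
xml.sax.parse('huge.xml', handler)
print(handler.element_count)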
There are also some DOM-based libraries that work without reading everything into memory, but there aren't any that I know of that I'd trust to handle a 40GB file with reasonable efficiency.