Problem Description
I've got an XML file I want to parse with Python. What is the best way to do this? Reading the entire document into memory would be disastrous; I need to somehow read it one node at a time.
Existing XML solutions I know of:
- ElementTree
- minixml
but I'm afraid they aren't quite going to work because of the problem I mentioned. Also, I can't open the file in a text editor - any good tips in general for working with giant text files?
Recommended Answer
First, have you tried ElementTree (either the built-in pure-Python or C version, or, better, the lxml version)? I'm pretty sure none of them actually reads the whole file into memory.
The problem, of course, is that whether or not it reads the whole file into memory, the resulting parsed tree ends up in memory.
ElementTree has a nifty solution that's pretty simple, and often sufficient: iterparse.
import xml.etree.ElementTree as ET

for event, elem in ET.iterparse(xmlfile, events=('end',)):
    ...
The key here is that you can modify the tree as it's built up (by replacing the contents with a summary containing only what the parent node will need). By throwing out all the stuff you don't need to keep in memory as it comes in, you can stick to parsing things in the usual order without running out of memory.
The linked page gives more details, including some examples for modifying XML-RPC and plist documents as they're processed. (In those cases, it's to make the resulting object simpler to use, not to save memory, but they should be enough to get the idea across.)
This only helps if you can think of a way to summarize as you go. (In the most trivial case, where the parent doesn't need any info from its children, this is just elem.clear().) Otherwise, this won't work for you.
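For instance, a minimal sketch of that trivial case might look like the following (the file name huge.xml and the record tag are placeholders for illustration, not anything from the original question):

import xml.etree.ElementTree as ET

count = 0
for event, elem in ET.iterparse('huge.xml', events=('end',)):
    if elem.tag == 'record':   # hypothetical element you actually care about
        count += 1             # do your per-record work here
        elem.clear()           # discard the element's contents so memory use stays flat
print(count)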
The standard solution is SAX, which is a callback-based API that lets you operate on the tree a node at a time. You don't need to worry about truncating nodes as you do with iterparse, because the nodes don't exist after you've parsed them.
Most of the best SAX examples out there are for Java or Javascript, but they're not too hard to figure out. For example, if you look at http://cs.au.dk/~amoeller/XML/programming/saxexample.html you should be able to figure out how to write it in Python (as long as you know where to find the documentation for xml.sax).
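As a rough sketch of what that might look like with xml.sax (again, huge.xml and the idea of simply counting elements are assumptions made up for illustration):

import xml.sax

class CountingHandler(xml.sax.ContentHandler):
    # Receives callbacks as the parser streams through the file;
    # nothing is kept around once an element has been handled.
    def __init__(self):
        super().__init__()
        self.element_count = 0

    def startElement(self, name, attrs):
        self.element_count += 1

    def endElement(self, name):
        pass  # per-element cleanup or summarizing would go here

handler = CountingHandler()
xml.sax.parse('huge.xml', handler)
print(handler.element_count)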
There are also some DOM-based libraries that work without reading everything into memory, but there aren't any that I know of that I'd trust to handle a 40GB file with reasonable efficiency.