
Parsing a large (~40GB) XML text file in Python

Question

I've got an XML file I want to parse with Python. What is the best way to do this? Loading the entire document into memory would be disastrous, so I need to read it a single node at a time.

Existing XML solutions I know of:

  • ElementTree
  • minixml

But I'm afraid they aren't going to work because of the memory problem I mentioned. Also, I can't open the file in a text editor. Are there any good tips in general for working with giant text files?

Recommended answer

First, have you tried ElementTree (either the built-in pure-Python or C versions, or, better, the lxml version)? I'm pretty sure none of them actually read the whole file into memory.

The problem, of course, is that whether or not the parser reads the whole file into memory, the resulting parsed tree ends up in memory.

ElementTree has a nifty solution that's pretty simple, and often sufficient: iterparse.

import xml.etree.ElementTree as ET

for event, elem in ET.iterparse(xmlfile, events=("end",)):  # events must be a sequence, so use a one-element tuple
  ...

The key here is that you can modify the tree as it's built up (by replacing the contents of a node with a summary containing only what the parent node will need). By throwing out everything you don't need to keep in memory as it comes in, you can stick to parsing things in the usual order without running out of memory.

The linked page gives more details, including some examples for modifying XML-RPC and plist as they're processed. (In those cases, it's to make the resulting object simpler to use, not to save memory, but they should be enough to get the idea across.)

This only helps if you can think of a way to summarize as you go. (In the most trivial case, where the parent doesn't need any info from its children, this is just elem.clear().) Otherwise, this won't work for you.
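As a rough illustration of summarizing as you go, here is a minimal sketch. The element name "record", the "id" attribute, and the helper iter_records are made up for the example, and it assumes the records are direct children of the root element. It clears the root after each record rather than calling elem.clear() on the record itself, so that emptied elements don't pile up under the root.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, tag="record"):
    """Yield (id, text) for each <record>, discarding each one after use.

    "record" and "id" are hypothetical names chosen for this sketch.
    """
    # Request "start" events as well, so the first event hands us the root.
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            # Keep only the summary the caller needs...
            yield elem.get("id"), elem.text
            # ...then drop the finished children so they can be freed.
            root.clear()


# Tiny in-memory stand-in for the real 40GB file.
sample = io.BytesIO(b"<root><record id='1'>a</record><record id='2'>b</record></root>")
records = list(iter_records(sample))
print(records)
```

For a real file you would pass the file name (or an open binary file object) instead of the BytesIO sample, and consume the generator in a loop rather than building a list.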

The standard solution is SAX, a callback-based API that lets you operate on the document a node at a time. You don't need to worry about truncating nodes as you do with iterparse, because the nodes don't exist after you've parsed them.

Most of the best SAX examples out there are for Java or JavaScript, but they're not too hard to figure out. For example, if you look at http://cs.au.dk/~amoeller/XML/programming/saxexample.html you should be able to figure out how to write it in Python (as long as you know where to find the documentation for xml.sax).
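For orientation, a Python SAX handler might look roughly like the sketch below. It is not a port of the linked example. The class name RecordCounter and the element name "record" are invented for illustration; the callbacks startElement, characters, and endElement come from xml.sax.ContentHandler.

```python
import io
import xml.sax


class RecordCounter(xml.sax.ContentHandler):
    """Count <record> elements and the characters inside them.

    "record" is a hypothetical element name used only for this sketch.
    """

    def __init__(self, tag="record"):
        super().__init__()
        self.tag = tag
        self.count = 0
        self.chars = 0
        self._inside = False

    def startElement(self, name, attrs):
        # Called when an opening tag is seen; no tree is built.
        if name == self.tag:
            self._inside = True
            self.count += 1

    def characters(self, content):
        # May be called several times for one text node.
        if self._inside:
            self.chars += len(content)

    def endElement(self, name):
        if name == self.tag:
            self._inside = False


handler = RecordCounter()
# Tiny in-memory stand-in for the real file; a file name also works here.
xml.sax.parse(io.BytesIO(b"<root><record>ab</record><record>cde</record></root>"), handler)
print(handler.count, handler.chars)
```

The handler keeps only the running totals it needs, so memory use does not grow with the size of the input.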

There are also some DOM-based libraries that work without reading everything into memory, but there aren't any that I know of that I'd trust to handle a 40GB file with reasonable efficiency.
