Problem Description
I have a large xml file (about 84MB) which is in this form:
My goal is to extract every single book and get its properties. I tried to parse it (as I did with other xml files) as follows:
but the code seems to fail in the parse instruction. Why is this happening and how can I solve this?
I should point out that the file may contain Greek, Spanish, and Arabic characters.
This is the output I got in ipython:
I would like to point out that the computer freezes during the execution, so this may be related to memory consumption as stated below.
Recommended Answer
I would strongly recommend using a SAX parser here. I wouldn't recommend using minidom on any XML document larger than a few megabytes; I've seen it use about 400MB of RAM reading in an XML document that was about 10MB in size. I suspect the problems you are having are being caused by minidom requesting too much memory.
Python comes with an XML SAX parser. To use it, do something like the following.
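The original code sample did not survive, so here is a minimal sketch of what it might look like. It assumes the books are `<book>` elements carrying their properties as XML attributes (e.g. `<book id="..." title="...">`); the real element and attribute names in your file may differ.

```python
import xml.sax
import io

class BookHandler(xml.sax.ContentHandler):
    """Collects the attributes of each <book> element as it is parsed."""
    def __init__(self):
        super().__init__()
        self.books = []

    def startElement(self, name, attrs):
        if name == "book":
            # attrs behaves like a read-only mapping of the element's attributes
            self.books.append(dict(attrs.items()))

handler = BookHandler()
# For the real 84MB file you would call: xml.sax.parse("books.xml", handler)
sample = io.StringIO('<catalog><book id="1" title="A"/><book id="2" title="B"/></catalog>')
xml.sax.parse(sample, handler)
print(handler.books)  # → [{'id': '1', 'title': 'A'}, {'id': '2', 'title': 'B'}]
```

Because the parser fires `startElement` as it streams through the file, only one book's worth of data needs to be held at a time.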
Your ContentHandler subclass will override various methods in ContentHandler (such as startElement, startElementNS, endElement, endElementNS, or characters). These handle events generated by the SAX parser as it reads your XML document in.
SAX is a more 'low-level' way to handle XML than DOM; in addition to pulling the relevant data out of the document, your ContentHandler will need to keep track of which elements it is currently inside. On the upside, however, since SAX parsers don't keep the whole document in memory, they can handle XML documents of potentially any size, including those larger than yours.
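To illustrate the bookkeeping this requires, here is one way to track the current element path with a stack so that character data ends up in the right place. The `<book><title>` structure is an assumption for illustration; adapt the element names to your file.

```python
import xml.sax
import io

class TitleHandler(xml.sax.ContentHandler):
    """Tracks which elements we are inside so text content can be routed correctly."""
    def __init__(self):
        super().__init__()
        self.stack = []    # path of currently open elements
        self.buffer = []   # text fragments of the current element
        self.titles = []

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.buffer = []

    def characters(self, content):
        # characters() may fire several times for one text node, so accumulate
        self.buffer.append(content)

    def endElement(self, name):
        # only collect text from a <title> that sits directly inside a <book>
        if self.stack[-2:] == ["book", "title"]:
            self.titles.append("".join(self.buffer).strip())
        self.stack.pop()

handler = TitleHandler()
sample = io.StringIO(
    '<catalog><book><title>First</title></book>'
    '<book><title>Second</title></book></catalog>'
)
xml.sax.parse(sample, handler)
print(handler.titles)  # → ['First', 'Second']
```

The stack check means the handler only reacts to titles in the right context, which is exactly the state-tracking work a DOM parser would otherwise do for you.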
I haven't tried using other DOM parsers such as lxml on XML documents of this size, but I suspect that lxml would still take considerable time and use a considerable amount of memory to parse your XML document. That could slow down your development if you have to wait for it to read in an 84MB XML document every time you run your code.
Finally, I don't believe the Greek, Spanish and Arabic characters you mention will cause a problem.