

Troubles while parsing with python very large xml file

Problem description


I have a large xml file (about 84MB) which is in this form:

<books>
    <book>...</book>
    ....
    <book>...</book>
</books>


My goal is to extract every single book and get its properties. I tried to parse it (as I did with other xml files) as follows:

from xml.dom.minidom import parse, parseString

fd = "myfile.xml"
parser = parse(fd)
## other python code here


but the code seems to fail in the parse instruction. Why is this happening and how can I solve this?


I should point out that the file may contain greek, spanish and arabic characters.


This is the output i got in ipython:

In [2]: fd = "myfile.xml"

In [3]: parser = parse(fd)
Killed


I would like to point out that the computer freezes during the execution, so this may be related to memory consumption as stated below.

Recommended answer


I would strongly recommend using a SAX parser here. I wouldn't recommend using minidom on any XML document larger than a few megabytes; I've seen it use about 400MB of RAM reading in an XML document that was about 10MB in size. I suspect the problems you are having are being caused by minidom requesting too much memory.


Python comes with an XML SAX parser. To use it, do something like the following.

from xml.sax.handler import ContentHandler
from xml.sax import parse

class MyContentHandler(ContentHandler):
    # override various ContentHandler methods as needed...
    pass


handler = MyContentHandler()
parse("mydata.xml", handler)

Your ContentHandler subclass will override various methods in ContentHandler (such as startElement, startElementNS, endElement, endElementNS or characters). These handle events generated by the SAX parser as it reads your XML document in.
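For instance, a handler for the `<books>` document above could look like the following sketch. The `id` attribute and `<title>` child element are assumptions for illustration; adapt the names to whatever properties your `<book>` elements actually carry.

```python
# A sketch of a ContentHandler that collects each <book>'s attributes
# and title text. parseString is used here so the example is
# self-contained; with an 84MB file you would call parse(filename, handler).
from xml.sax import parseString
from xml.sax.handler import ContentHandler

class BookHandler(ContentHandler):
    def __init__(self):
        super().__init__()
        self.books = []        # one dict per <book>
        self.current = None    # the book currently being read
        self.in_title = False
        self.title_chars = []

    def startElement(self, name, attrs):
        if name == "book":
            # copy the element's XML attributes into a plain dict
            self.current = dict(attrs)
        elif name == "title" and self.current is not None:
            self.in_title = True
            self.title_chars = []

    def characters(self, content):
        if self.in_title:
            # characters() may be called several times per text node
            self.title_chars.append(content)

    def endElement(self, name):
        if name == "title" and self.in_title:
            self.current["title"] = "".join(self.title_chars)
            self.in_title = False
        elif name == "book":
            self.books.append(self.current)
            self.current = None

data = b'<books><book id="1"><title>Don Quijote</title></book></books>'
handler = BookHandler()
parseString(data, handler)
print(handler.books)   # [{'id': '1', 'title': 'Don Quijote'}]
```

Note that each book is reduced to a small dict and the parsed tree is never kept, so memory use stays flat no matter how many `<book>` elements the file contains.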


SAX is a more 'low-level' way to handle XML than DOM; in addition to pulling out the relevant data from the document, your ContentHandler will need to do work keeping track of what elements it is currently inside. On the upside, however, as SAX parsers don't keep the whole document in memory, they can handle XML documents of potentially any size, including those larger than yours.

I haven't tried using other DOM parsers such as lxml on XML documents of this size, but I suspect that lxml will still take a considerable time and use a considerable amount of memory to parse your XML document. That could slow down your development if every time you run your code you have to wait for it to read in an 84MB XML document.


Finally, I don't believe the Greek, Spanish and Arabic characters you mention will cause a problem.
