Problem description
I have an XML file:
<temp>
<email id="1" Body="abc"/>
<email id="2" Body="fre"/>
.
.
<email id="998349883487454359203" Body="hi"/>
</temp>
I want to process the file one email tag at a time. That is, read the email with id=1 and extract its Body, then read the email with id=2 and extract its Body, and so on.
I tried the DOM model for XML parsing, but my file is 100 GB, so that approach does not work. I then tried:
from xml.etree import ElementTree as ET
tree=ET.parse('myfile.xml')
root=ET.parse('myfile.xml').getroot()
for i in root.findall('email/'):
    print i.get('Body')
Even once I have the root, I do not understand why my code is unable to parse the file.
When I use iterparse, the code throws the following error:
"UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 437: ordinal not in range(128)"
Can anyone help?
Recommended answer
An example using iterparse:
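from io import BytesIO
from xml.etree.ElementTree import iterparse

# In-memory stand-in for the real file (bytes, so it runs on Python 2 and 3).
fakefile = BytesIO(b"""<temp>
<email id="1" Body="abc"/>
<email id="2" Body="fre"/>
<email id="998349883487454359203" Body="hi"/>
</temp>
""")

# iterparse yields each element once its end tag has been read.
for _, elem in iterparse(fakefile):
    if elem.tag == 'email':
        print('%s %s' % (elem.attrib['id'], elem.attrib['Body']))
    # Free the element's contents so the tree does not grow with the file.
    elem.clear()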
Just replace fakefile with your real file. Also read this for further details.
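For a file as large as 100 GB, two further points may matter. First, elem.clear() empties each element but leaves it attached to the root, so the root's child list still grows by one entry per email; clearing the root after each email avoids that. Second, the UnicodeEncodeError in the question looks like Python 2 implicitly encoding non-ASCII text (the euro sign, u'\u20ac') with the ascii codec on output, although the failing line is not shown, so this is an assumption. The following is a minimal sketch along those lines, not a definitive implementation. The helper name iter_email_bodies and the sample data are hypothetical, and the StringIO object stands in for a real output file opened with an explicit encoding.

```python
import io
from xml.etree.ElementTree import iterparse


def iter_email_bodies(source):
    """Yield (id, Body) for each <email> element without building the whole tree."""
    # Request "start" events as well, so the first event hands us the root element.
    context = iter(iterparse(source, events=("start", "end")))
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == "email":
            yield elem.get("id"), elem.get("Body")
            # Drop finished children from the root; elem.clear() alone would
            # leave one empty element per email attached to it.
            root.clear()


# Hypothetical sample data; the euro sign mimics the character in the error.
sample = io.BytesIO(
    u'<temp><email id="1" Body="abc"/>'
    u'<email id="2" Body="5 \u20ac"/></temp>'.encode("utf-8")
)
rows = list(iter_email_bodies(sample))

# Stand-in for io.open("bodies.txt", "w", encoding="utf-8"): writing through
# an explicitly encoded text stream avoids relying on the console's codec.
out = io.StringIO()
for email_id, body in rows:
    out.write(u"%s\t%s\n" % (email_id, body))
```

With a real file, pass its path (or a file object opened in binary mode) to iter_email_bodies and consume the generator in a for loop rather than with list(), so that only one email is held in memory at a time.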