Problem description
I need help to understand why parsing my xml file* with xml.etree.ElementTree produces the following errors.
*My test XML file contains Arabic characters.
Task: open and parse the file utf8_file.xml.
My first attempt:
import codecs
import xml.etree.ElementTree as etree

# codecs.open() decodes the file to unicode before the parser sees it
with codecs.open('utf8_file.xml', 'r', encoding='utf-8') as utf8_file:
    xml_tree = etree.parse(utf8_file)
Result 1:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 236-238: ordinal not in range(128)
My second attempt:
import codecs
import xml.etree.ElementTree as etree

with codecs.open('utf8_file.xml', 'r', encoding='utf-8') as utf8_file:
    # the open file object is passed straight to tostring()
    xml_string = etree.tostring(utf8_file, encoding='utf-8', method='xml')
    xml_tree = etree.fromstring(xml_string)
Result 2:
AttributeError: 'file' object has no attribute 'getiterator'
Please explain the errors above and comment on the possible solution.
Recommended answer
Leave decoding the bytes to the parser; do not decode first:
import xml.etree.ElementTree as etree

# Open in binary mode so the parser receives the raw, undecoded bytes
with open('utf8_file.xml', 'rb') as xml_file:
    xml_tree = etree.parse(xml_file)
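An XML file must carry enough information at its very start, in the XML declaration (for example <?xml version="1.0" encoding="utf-8"?>), for the parser to handle the decoding. If the declaration is missing and there is no byte order mark, the parser must assume UTF-8.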
Because it is the XML header that holds this information, it is the responsibility of the parser to do all decoding.
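To illustrate that point, here is a small sketch, assuming Python 3 and a made-up one-element document (the sample text and variable names are mine, not from your file). The same Arabic string is serialized under two different encodings plus once with no declaration at all, and the parser recovers it from the raw bytes each time:

```python
import xml.etree.ElementTree as etree

# Hypothetical sample text: an Arabic word written with escapes so the
# example itself stays ASCII-only.
text = u"\u0645\u0631\u062d\u0628\u0627"

# Same document, declared and encoded as UTF-8.
utf8_bytes = (
    u'<?xml version="1.0" encoding="utf-8"?><root>' + text + u"</root>"
).encode("utf-8")

# Same document, declared and encoded as UTF-16 (encode() adds the BOM).
utf16_bytes = (
    u'<?xml version="1.0" encoding="utf-16"?><root>' + text + u"</root>"
).encode("utf-16")

# No declaration at all: the parser falls back to UTF-8.
bare_bytes = (u"<root>" + text + u"</root>").encode("utf-8")

# In every case the parser is handed bytes and does the decoding itself.
root_utf8 = etree.fromstring(utf8_bytes)
root_utf16 = etree.fromstring(utf16_bytes)
root_bare = etree.fromstring(bare_bytes)

print(root_utf8.text == root_utf16.text == root_bare.text == text)
```

In none of the three cases does the calling code name a codec when parsing; the bytes themselves tell the parser what to do.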
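Your first attempt failed because the parser expects byte strings, so Python tried to encode the already-decoded Unicode values back to bytes, and it did so with the default ASCII codec, which cannot represent Arabic characters. The second attempt failed because etree.tostring() expects a parsed element or tree as its first argument, not an open file object; it serializes a tree, it does not read a file.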
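If what you were after in the second attempt was a round trip through a string, the order is parse first, serialize second. The following is only a sketch under a few assumptions: Python 3, and a temporary stand-in file with invented content, because I do not have your utf8_file.xml.

```python
import os
import tempfile
import xml.etree.ElementTree as etree

# Hypothetical stand-in for utf8_file.xml; the content is invented.
sample = (
    u'<?xml version="1.0" encoding="utf-8"?>\n'
    u'<greeting lang="ar">\u0645\u0631\u062d\u0628\u0627</greeting>'
)
path = os.path.join(tempfile.mkdtemp(), "utf8_file.xml")
with open(path, "wb") as out_file:
    out_file.write(sample.encode("utf-8"))

# Step 1: hand the parser raw bytes (binary mode, no codecs.open()).
with open(path, "rb") as xml_file:
    xml_tree = etree.parse(xml_file)
root = xml_tree.getroot()

# Step 2: tostring() takes the parsed element, not the file object,
# and returns encoded bytes when an encoding such as utf-8 is given.
xml_bytes = etree.tostring(root, encoding="utf-8", method="xml")

# Step 3: fromstring() accepts those bytes and decodes them again.
reparsed = etree.fromstring(xml_bytes)

print(reparsed.tag, reparsed.get("lang"))
```

Element text comes back as a decoded string, while tostring() with encoding="utf-8" gives you bytes again, so encoding only ever happens at the file or serialization boundary.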