Problem description
I use Python's iterparse to parse the XML result of a Nessus scan (a .nessus file). The parsing fails on unexpected records, while similar ones are parsed correctly.
The general structure of the XML file is a long series of records like the one below:
<ReportHost>
    <ReportItem>
        <foo>9.3</foo>
        <bar>hello</bar>
    </ReportItem>
    <ReportItem>
        <foo>10.0</foo>
        <bar>world</bar>
    </ReportItem>
</ReportHost>
<ReportHost>
    ...
</ReportHost>
In other words, there are many hosts (ReportHost), each with many items to report (ReportItem), and each item has several characteristics (foo, bar). I want to generate one line per item, with its characteristics.
The parsing fails in the middle of the file at a simple line (foo in that case being cvss_base_score):
<cvss_base_score>9.3</cvss_base_score>
even though about 200 similar lines before it were parsed without problems.
The relevant piece of code is below. It sets context markers (inReportHost and inReportItem) that tell me where in the structure of the XML file I am, and it either assigns or prints a value depending on the context.
import xml.etree.cElementTree as ET

inReportHost = False
inReportItem = False
for event, elem in ET.iterparse("test2.nessus", events=("start", "end")):
    if event == 'start' and elem.tag == "ReportHost":
        inReportHost = True
    if event == 'end' and elem.tag == "ReportHost":
        inReportHost = False
        elem.clear()
    if inReportHost:
        if event == 'start' and elem.tag == 'ReportItem':
            inReportItem = True
            cvss = ''
        if event == 'start' and inReportItem:
            if event == 'start' and elem.tag == 'cvss_base_score':
                cvss = elem.text
        if event == 'end' and elem.tag == 'ReportItem':
            print cvss
            inReportItem = False
cvss sometimes has the value None (after the cvss = elem.text assignment), even though identical entries were parsed properly earlier in the file.
If I add, below the assignment, something along the lines of
if cvss is None: cvss = "0"
then many of the later cvss values are assigned correctly (and some others are still None).
When I take the <ReportHost>...</ReportHost> block that is parsed incorrectly and run it through the program on its own, it works fine (i.e. cvss is assigned 9.3 as expected).
I cannot see where the mistake in my code is: within a large set of similar records, some are processed correctly and some are not (some of the records are identical and are still processed differently). I also cannot find anything particular about the records that fail; identical ones earlier and later in the file are fine.
Recommended answer
From the iterparse() documentation:
Note: iterparse() only guarantees that it has seen the ">" character of a starting tag when it emits a "start" event, so the attributes are defined, but the contents of the text and tail attributes are undefined at that point. The same applies to the element children; they may or may not be present. If you need a fully populated element, look for "end" events instead.
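The note above can be reproduced deterministically with a small sketch. It uses XMLPullParser (available in xml.etree.ElementTree since Python 3.4) instead of iterparse so that the amount of input the parser has seen is under our control; the two-step feed and the names start_texts and end_texts are illustrative assumptions, not part of the original code. With iterparse the same effect depends on where the internal read buffer happens to end, which would explain why identical records behave differently.

```python
import xml.etree.ElementTree as ET

# XMLPullParser lets us decide exactly how much input the parser has seen.
parser = ET.XMLPullParser(events=("start", "end"))

# Feed everything up to and including the opening tag, but not its text.
parser.feed("<ReportItem><cvss_base_score>")
start_texts = {
    elem.tag: elem.text
    for event, elem in parser.read_events()
    if event == "start"
}

# Feed the rest; the "end" events arrive once each element is complete.
parser.feed("9.3</cvss_base_score></ReportItem>")
end_texts = {
    elem.tag: elem.text
    for event, elem in parser.read_events()
    if event == "end"
}
parser.close()

print(start_texts["cvss_base_score"])  # text as seen at the "start" event
print(end_texts["cvss_base_score"])    # text as seen at the "end" event
```

At the "start" event the text has not been read yet, so the first value should be None, while the "end" event sees the full text.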
Drop the inReport* variables and process each ReportHost only on its "end" event, when it has been fully parsed. Use the ElementTree API to get the necessary information, such as cvss_base_score, from the current ReportHost element.
To keep memory usage low, do:
# cElementTree was removed in Python 3.9; ElementTree uses the C accelerator when available
import xml.etree.ElementTree as etree

def getelements(filename_or_file, tag):
    """Yield each fully parsed element with the given tag."""
    context = iter(etree.iterparse(filename_or_file, events=('start', 'end')))
    _, root = next(context)  # get root element
    for event, elem in context:
        if event == 'end' and elem.tag == tag:
            yield elem
            root.clear()  # preserve memory: drop elements already processed

for host in getelements("test2.nessus", "ReportHost"):
    for cvss_el in host.iter("cvss_base_score"):
        print(cvss_el.text)
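The question asks for one line per ReportItem with its characteristics. A minimal sketch of that, assuming the simplified foo/bar structure from the question: the SAMPLE document, the enclosing Report root element, and the iter_item_rows helper are hypothetical and only serve as illustration. Children are read only on "end" events, and findtext supplies a default when a child is missing.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the simplified structure in the question.
SAMPLE = b"""<Report>
<ReportHost>
<ReportItem><foo>9.3</foo><bar>hello</bar></ReportItem>
<ReportItem><bar>world</bar></ReportItem>
</ReportHost>
</Report>"""

def iter_item_rows(source, fields=("foo", "bar")):
    """Yield one tuple per ReportItem, reading children only on 'end' events."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "ReportItem":
            # findtext returns the second argument when the child is missing.
            yield tuple(elem.findtext(name, "") for name in fields)
        elif elem.tag == "ReportHost":
            elem.clear()  # drop the host's processed items to limit memory use

rows = list(iter_item_rows(io.BytesIO(SAMPLE)))
for row in rows:
    print("\t".join(row))
```

Because an item is complete by the time its "end" event fires, a missing child shows up as the default value rather than as an intermittent None.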