Problem description
I have a script that parses some XML. The XML contains:
<SD TITLE="A" FLAGS="" HOST="9511.com">
<TITLE TEXT="9511 domain"/>
<ADDR STREET="Pmb#400, San Pablo Ave" CITY="Berkeley" STATE="CA" COUNTRY="US"/>
<CREATED DATE="13-Oct-1990" DAY="13" MONTH="10" YEAR="1990"/>
<OWNER NAME="9511.Org Domain Name Proxy Agents"/>
<EMAIL ADDR="proxy@9511.org"/><LANG LEX="en" CODE="us-ascii"/>
<LINKSIN NUM="75"/><SPEED TEXT="3158" PCT="17"/>
<CHILD SRATING="0"/>
</SD>
<SD>
<POPULARITY URL="9511.com/" TEXT="1417678" SOURCE="panel"/>
</SD>
How can I get the value of the 'TEXT' attribute of the POPULARITY tag (in my case 1417678)? I'm using regexp + Python. The regexp string:
my_value = re.findall(r"POPULARITY[^\d]*(\d+)", xml)
It returns '9511', but I need '1417678'.
Recommended answer
You are only matching the first sequence of decimal digits that occurs after the element's name. The first sequence of digits '(\d+)' after an arbitrary number of non-digits '[^\d]*' is 9511, which comes from the URL attribute.
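A minimal sketch of why the original pattern stops at 9511. The `xml` string below is a shortened copy of the POPULARITY element from the question, and the variable names are only illustrative:

```python
import re

# Shortened copy of the POPULARITY element from the question.
xml = '<SD><POPULARITY URL="9511.com/" TEXT="1417678" SOURCE="panel"/></SD>'

# [^\d]* skips every non-digit after the element name, so (\d+) captures
# the first run of digits it meets, which sits inside the URL attribute.
first_digits = re.findall(r"POPULARITY[^\d]*(\d+)", xml)
print(first_digits)  # ['9511']
```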
To findall the values of @TEXT attributes, something like this would work:
my_values = re.findall(r'<POPULARITY(?:\D+="\S*")*\s+TEXT="(\d*)"', xml)  # returns a list, by the way
Or, if no attribute other than @TEXT will have a digit-only value:
re.findall(r'<POPULARITY\s+(?:\S+\s+)*\w+="(\d+)"', xml)
Here (?:...) matches the enclosed expression but does not act as an addressable group the way (...) does. The special sequences \S and \D are the inversions of their lowercase counterparts, matching anything but whitespace and anything but digits, respectively.
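As a rough illustration, the two patterns can be tried on a shortened copy of the POPULARITY element. The third call is not part of the answer: it swaps (?:...) for a plain (...) only to show how a capturing group changes what findall returns. All variable names are made up for this sketch:

```python
import re

# Shortened copy of the POPULARITY element from the question.
xml = '<SD><POPULARITY URL="9511.com/" TEXT="1417678" SOURCE="panel"/></SD>'

# Pattern 1: skip attributes whose names contain no digits, then require TEXT.
by_name = re.findall(r'<POPULARITY(?:\D+="\S*")*\s+TEXT="(\d*)"', xml)

# Pattern 2: take the first attribute whose whole value consists of digits.
digit_only = re.findall(r'<POPULARITY\s+(?:\S+\s+)*\w+="(\d+)"', xml)

# Same as pattern 2, but with a capturing group instead of (?:...).
# findall now returns one tuple per match, with one item per group.
with_group = re.findall(r'<POPULARITY\s+(\S+\s+)*\w+="(\d+)"', xml)

print(by_name)     # ['1417678']
print(digit_only)  # ['1417678']
print(with_group)  # a list of 2-tuples; the digits are the second item
```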
However, as already mentioned, regular expressions are not meant to be used on XML, because XML is not a regular language.
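For comparison, here is a sketch of the same lookup with an XML parser, using the standard library's xml.etree.ElementTree. It is not part of the answer above. The snippet in the question has two top-level SD elements, so this example assumes it is a fragment and wraps it in a made-up <root> element; in the real document the existing root would be used instead.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the XML from the question (two sibling <SD> elements).
fragment = '''<SD TITLE="A" FLAGS="" HOST="9511.com">
<TITLE TEXT="9511 domain"/>
<LINKSIN NUM="75"/><SPEED TEXT="3158" PCT="17"/>
</SD>
<SD>
<POPULARITY URL="9511.com/" TEXT="1417678" SOURCE="panel"/>
</SD>'''

# Assumption: <root> is a hypothetical wrapper, added only so that the
# fragment parses as a single well-formed document.
root = ET.fromstring("<root>" + fragment + "</root>")

# Collect the TEXT attribute of every POPULARITY element, at any depth.
popularity_texts = [el.get("TEXT") for el in root.iter("POPULARITY")]
print(popularity_texts)  # ['1417678']
```

Selecting by element name means the TEXT attributes on TITLE and SPEED are never considered, and the order of attributes inside the tag no longer matters.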