Problem description
My aim is to create a HashMap with a String as the key and a HashSet of Strings as each entry's value.
Output
The output currently looks like this:
Hudson+(surname)=[Q2720681], Hudson,+Quebec=[Q141445], Hudson+(given+name)=[Q5928530], Hudson,+Colorado=[Q2272323], Hudson,+Illinois=[Q2672022], Hudson,+Indiana=[Q2710584], Hudson,+Ontario=[Q5928505], Hudson,+Buenos+Aires+Province=[Q10298710], Hudson,+Florida=[Q768903]]
According to my idea, it should look like this:
[Hudson+(surname)=[Q2720681,Q141445,Q5928530,Q2272323,Q2672022]]
The purpose is to store a particular name in Wikidata and then all of the Q values associated with its disambiguation, so for example:
This is the page for "Bush".
I want Bush to be the key. Then, for all of the different points of departure, that is, all of the different ways that Bush could be associated with a terminal page of Wikidata, I want to store the corresponding "Q value", or unique alphanumeric identifier.
What I'm actually doing is trying to scrape the different names (values) from the Wikipedia disambiguation page and then look up the unique alphanumeric identifier associated with each value in Wikidata.
For example, with Bush we have:
George H. W. Bush
George W. Bush
Jeb Bush
Bush family
Bush (surname)
The corresponding Q values are:
George H. W. Bush (Q23505)
George W. Bush (Q207)
Jeb Bush (Q221997)
Bush family (Q2743830)
Bush (Q1484464)
My idea is that the data structure should be laid out as follows:
Key: Bush
Entry set: Q23505, Q207, Q221997, Q2743830, Q1484464
But the code I have now doesn't do that.
It creates a separate entry for each name and Q value, i.e.
Key: Jeb Bush
Entry set: Q221997
Key: George W. Bush
Entry set: Q207
and so on.
The full code, in all its glory, can be seen on my GitHub page, but I'll also summarize it below.
This is what I'm using to add values to my data structure:
// add Q values to their arrayList in the hash map at the index of the appropriate entity
public static HashSet<String> put_to_hash(String key, String value)
{
if (!q_valMap.containsKey(key))
{
return q_valMap.put(key, new HashSet<String>() );
}
HashSet<String> list = q_valMap.get(key);
list.add(value);
return q_valMap.put(key, list);
}
This is how I fetch the content:
while ((line_by_line = wiki_data_pagecontent.readLine()) != null)
{
// if we can determine it's a disambig page we need to send it off to get all
// the possible senses in which it can be used.
Pattern disambig_pattern = Pattern.compile("<div class=\"wikibase-entitytermsview-heading-description \">Wikipedia disambiguation page</div>");
Matcher disambig_indicator = disambig_pattern.matcher(line_by_line);
if (disambig_indicator.matches())
{
//off to get the different usages
Wikipedia_Disambig_Fetcher.all_possibilities( variable_entity );
}
else
{
//get the Q value off the page by matching
Pattern q_page_pattern = Pattern.compile("<!-- wikibase-toolbar --><span class=\"wikibase-toolbar-container\"><span class=\"wikibase-toolbar-item " +
"wikibase-toolbar \">\\[<span class=\"wikibase-toolbar-item wikibase-toolbar-button wikibase-toolbar-button-edit\"><a " +
"href=\"/wiki/Special:SetSiteLink/(.*?)\">edit</a></span>\\]</span></span>");
Matcher match_Q_component = q_page_pattern.matcher(line_by_line);
if ( match_Q_component.matches() )
{
String Q = match_Q_component.group(1);
// 'Q' should be appended to an array, since each entity can hold multiple
// Q values on that basis of disambig
put_to_hash( variable_entity, Q );
}
}
}
And this is how I deal with a disambiguation page:
public static void all_possibilities( String variable_entity ) throws Exception
{
System.out.println("this is a disambig page");
//if it's a disambig page we know we can go right to the wikipedia
//get it's normal wiki disambig page
Document docx = Jsoup.connect( "https://en.wikipedia.org/wiki/" + variable_entity ).get();
//this can handle the less structured ones.
Elements linx = docx.select( "p:contains(" + variable_entity + ") ~ ul a:eq(0)" );
for (Element linq : linx)
{
System.out.println(linq.text());
String linq_nospace = linq.text().replace(' ', '+');
Wikidata_Q_Reader.getQ( linq_nospace );
}
}
I was thinking maybe I could pass the key value around, but I really don't know. I'm kind of stuck. Maybe someone can see how I can implement this functionality.
Recommended answer
I'm not clear from your question what isn't working, or whether you're seeing actual errors. But while your basic data structure idea (a HashMap of String to Set<String>) is sound, there's a bug in the "add" function.
public static HashSet<String> put_to_hash(String key, String value)
{
if (!q_valMap.containsKey(key))
{
return q_valMap.put(key, new HashSet<String>() );
}
HashSet<String> list = q_valMap.get(key);
list.add(value);
return q_valMap.put(key, list);
}
In the case where a key is seen for the first time (if (!q_valMap.containsKey(key))), it vivifies a new HashSet for that key, but it doesn't add value to it before returning. (And the returned value is the old value for that key, so it'll be null.) So you're going to lose one of the Q values for every term.
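To make the lost value concrete, here is a minimal sketch that runs the original method twice for the same key. The class name LostValueDemo and the map declaration are assumptions added for illustration (the question does not show how q_valMap is declared); the method body is copied from the question.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

// Hypothetical wrapper class; only put_to_hash comes from the question.
class LostValueDemo {
    // Assumed declaration of the map used in the question.
    static final Map<String, HashSet<String>> q_valMap = new HashMap<>();

    // The original method, unchanged.
    static HashSet<String> put_to_hash(String key, String value) {
        if (!q_valMap.containsKey(key)) {
            // Map.put returns the previous mapping (null here), and value is never added.
            return q_valMap.put(key, new HashSet<String>());
        }
        HashSet<String> list = q_valMap.get(key);
        list.add(value);
        return q_valMap.put(key, list);
    }

    public static void main(String[] args) {
        System.out.println(put_to_hash("Bush", "Q23505")); // prints null
        put_to_hash("Bush", "Q207");
        System.out.println(q_valMap); // prints {Bush=[Q207]}
    }
}
```

The first Q value passed for each key (Q23505 here) never reaches the set, which matches the description above.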
For multi-layered data structures like this, I usually special-case only the vivification of the intermediate structure, and then do the adding and the return in a single code path. I think this would fix it. (I'm also going to call it valSet because it's a set and not a list. And there's no need to re-add the set to the map each time; it's a reference type and gets added the first time you encounter that key.)
public static HashSet<String> put_to_hash(String key, String value)
{
if (!q_valMap.containsKey(key)) {
q_valMap.put(key, new HashSet<String>());
}
HashSet<String> valSet = q_valMap.get(key);
valSet.add(value);
return valSet;
}
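On Java 8 or later, the same "vivify, then add" pattern can be written with Map.computeIfAbsent, which creates the set only when the key is missing and returns the set in either case. This is a sketch of an alternative, not the answer's own code; the class name QValueIndex and the map declaration are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical wrapper class for illustration.
class QValueIndex {
    // Assumed declaration; the question does not show the original one.
    private static final Map<String, Set<String>> q_valMap = new HashMap<>();

    static Set<String> put_to_hash(String key, String value) {
        // Creates and stores a new set only on the first sighting of the key.
        Set<String> valSet = q_valMap.computeIfAbsent(key, k -> new HashSet<>());
        valSet.add(value);
        return valSet;
    }

    public static void main(String[] args) {
        put_to_hash("Bush", "Q23505");
        put_to_hash("Bush", "Q207");
        put_to_hash("Bush", "Q207"); // duplicate, ignored by the set
        System.out.println(put_to_hash("Bush", "Q221997").size()); // prints 3
    }
}
```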
Also be aware that the Set you return is a reference to the live Set for that key, so you need to be careful about modifying it in callers, and if you're doing multithreading you're going to have concurrent access issues.
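One way to reduce both risks, sketched here under assumptions, is to hand callers a read-only view and to back the map with concurrent collections. The class name SafeQValueIndex and the accessor valuesFor are hypothetical and not part of the original code; whether this is needed depends on how the scraper is actually run.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical class for illustration.
class SafeQValueIndex {
    private static final Map<String, Set<String>> q_valMap = new ConcurrentHashMap<>();

    static void put_to_hash(String key, String value) {
        // computeIfAbsent is atomic on ConcurrentHashMap; newKeySet() gives a thread-safe set.
        q_valMap.computeIfAbsent(key, k -> ConcurrentHashMap.newKeySet()).add(value);
    }

    // Hypothetical accessor: callers can read the values but cannot modify them.
    static Set<String> valuesFor(String key) {
        Set<String> valSet = q_valMap.get(key);
        return valSet == null
                ? Collections.<String>emptySet()
                : Collections.unmodifiableSet(valSet);
    }
}
```

The returned view still reflects later additions made through put_to_hash; if callers need a snapshot, they would have to copy it into a new HashSet.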
Or just use a Guava Multimap so you don't have to worry about writing the implementation yourself.