Problem description
I have an index built from a large corpus with several fields. Only one of these fields contains text. I need to extract the unique words from the whole index based on this field. Does anyone know how I can do that with Lucene in Java?
Recommended answer
You're looking for term vectors (the set of all the words that were in the field and the number of times each word was used, excluding stop words). You'll use IndexReader's getTermFreqVector(docid, field) for each document in the index, and populate a HashSet with them. Note that this only works if the field was indexed with term vectors enabled.
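To make the collection step concrete, here is a minimal sketch in plain Java. It deliberately does not call Lucene: each String[] is a stand-in for the terms of one document's term vector, and a null entry models a document with no stored vector for the field. The class and method names (UniqueTermsSketch, collectUniqueTerms) are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class UniqueTermsSketch {
    // Each String[] stands in for the terms of one document's term vector.
    // A null entry models a document that has no stored vector for the field.
    static Set<String> collectUniqueTerms(List<String[]> perDocumentTerms) {
        Set<String> unique = new HashSet<String>();
        for (String[] terms : perDocumentTerms) {
            if (terms == null) {
                continue; // nothing stored for this document, skip it
            }
            // The set silently drops words already seen in earlier documents.
            unique.addAll(Arrays.asList(terms));
        }
        return unique;
    }

    public static void main(String[] args) {
        List<String[]> docs = Arrays.asList(
            new String[] {"lucene", "index"},
            null,
            new String[] {"index", "term"});
        // Prints the three distinct words; HashSet order is unspecified.
        System.out.println(collectUniqueTerms(docs));
    }
}
```

In real code the String[] for each document would come from the term vector returned for that document, and the null check matters because a document may have no vector for the field.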
The alternative would be to use terms() and pick only the terms for the field you're interested in:
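import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
// Lucene 3.x API; `index` is assumed to be the Directory holding the index.
IndexReader reader = IndexReader.open(index);
TermEnum terms = reader.terms();
Set<String> uniqueTerms = new HashSet<String>();
// next() must be called before term(), so the loop advances first.
while (terms.next()) {
    final Term term = terms.term();
    if (term.field().equals("field_name")) {
        uniqueTerms.add(term.text());
    }
}
terms.close();
reader.close();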
This is not the optimal solution, since you're reading and then discarding the terms of all other fields. In Lucene 4 there is a class, Fields, whose terms(field) method returns the terms for a single field only.
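The cost of scanning every field can be avoided because Lucene orders terms by field name first and by term text second, so all terms of one field sit next to each other. The sketch below models that idea in plain Java rather than with the Lucene API: a TreeSet plays the role of the term dictionary, and FieldTerm, FieldSeekSketch and termsForField are hypothetical names invented for this example. It seeks to the first term of the wanted field and stops as soon as the field changes.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.TreeSet;

class FieldSeekSketch {
    // Hypothetical stand-in for an index term, ordered by field, then by text.
    static final class FieldTerm implements Comparable<FieldTerm> {
        final String field;
        final String text;

        FieldTerm(String field, String text) {
            this.field = field;
            this.text = text;
        }

        public int compareTo(FieldTerm other) {
            int byField = field.compareTo(other.field);
            return byField != 0 ? byField : text.compareTo(other.text);
        }
    }

    static Set<String> termsForField(TreeSet<FieldTerm> dictionary, String field) {
        Set<String> result = new LinkedHashSet<String>();
        // Seek: the empty text sorts at or before every term of the same field.
        for (FieldTerm t : dictionary.tailSet(new FieldTerm(field, ""), true)) {
            if (!t.field.equals(field)) {
                break; // left the field, so no later term can match
            }
            result.add(t.text);
        }
        return result;
    }

    public static void main(String[] args) {
        TreeSet<FieldTerm> dictionary = new TreeSet<FieldTerm>();
        dictionary.add(new FieldTerm("author", "smith"));
        dictionary.add(new FieldTerm("body", "zebra"));
        dictionary.add(new FieldTerm("body", "apple"));
        dictionary.add(new FieldTerm("title", "intro"));
        System.out.println(termsForField(dictionary, "body"));
    }
}
```

Only the terms of the requested field are visited, plus one term from the next field that ends the loop. The terms of fields that sort earlier are never touched.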