Problem description
I need to iterate over all documents in a Lucene index and obtain the positions at which each term occurs in each document. As far as I can tell from the Lucene javadoc, the way to do this is something like the following:
IndexReader ir = obtainIndexReader();
Terms tv = ir.getTermVector( doc, field );
TermsEnum terms = tv.iterator();
PostingsEnum p = null;
while( terms.next() != null ) {
    p = terms.postings( p, PostingsEnum.ALL );
    while( p.nextDoc() != PostingsEnum.NO_MORE_DOCS ) {
        int freq = p.freq();
        for( int i = 0; i < freq; i++ ) {
            int pos = p.nextPosition(); // Always returns -1!!!
            BytesRef data = p.getPayload();
            doStuff( freq, pos, data ); // Fails miserably, of course.
        }
    }
}
However, even though (1) the index does indeed include positions on the relevant field and (2) the term vector claims to have positions (i.e.: tv.hasPositions() == true), I keep getting "-1" for all positions.
First, am I doing something wrong? Is there an alternative way of iterating over postings on a per-document basis? Second: What is going on anyway? The index contains positions, the Terms instance returned by getTermVector claims to include positions, and I'm looking at the correct position values in Luke, yet I still get -1 when I try to access said values in my code. What gives?
The relevant field was configured with the following options:
FieldType ft = new FieldType();
ft.setIndexOptions( IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS );
ft.setStoreTermVectors( true );
ft.setStoreTermVectorOffsets( true );
ft.setStoreTermVectorPayloads( true );
ft.setStoreTermVectorPositions( true );
ft.setTokenized( true );
return ft;
Recommended answer
Did you set FieldType.setStoreTermVectorPositions(true) on your field type at index time? http://lucene.apache.org/core/5_5_0/core/org/apache/lucene/document/FieldType.html#setStoreTermVectorPositions(boolean)
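One way to check whether the index-time setting is taking effect is a small round trip: index a single document into an in-memory directory with term vector positions enabled, then read the positions back from that document's term vector. The following is only a minimal sketch, assuming Lucene 5.5 with lucene-core and lucene-analyzers-common on the classpath. The class name TermVectorPositions, the helper positionsOf, and the field name "body" are hypothetical and not part of the original question.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.BytesRef;

public class TermVectorPositions {

    // Hypothetical helper: indexes `text` as a single document and returns
    // term -> positions as read back from that document's term vector.
    public static Map<String, List<Integer>> positionsOf(String text) {
        FieldType ft = new FieldType();
        ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
        ft.setTokenized(true);
        ft.setStoreTermVectors(true);
        ft.setStoreTermVectorPositions(true); // the setting the answer asks about
        ft.freeze();

        Map<String, List<Integer>> result = new LinkedHashMap<>();
        try (Directory dir = new RAMDirectory()) {
            // Closing the writer commits the single document.
            try (IndexWriter w = new IndexWriter(dir, new IndexWriterConfig(new WhitespaceAnalyzer()))) {
                Document doc = new Document();
                doc.add(new Field("body", text, ft));
                w.addDocument(doc);
            }
            try (DirectoryReader ir = DirectoryReader.open(dir)) {
                Terms tv = ir.getTermVector(0, "body"); // doc id 0: the only document
                if (tv == null || !tv.hasPositions()) {
                    return result; // no term vector, or no positions stored in it
                }
                TermsEnum terms = tv.iterator();
                BytesRef term;
                while ((term = terms.next()) != null) {
                    PostingsEnum p = terms.postings(null, PostingsEnum.POSITIONS);
                    p.nextDoc(); // a term vector covers exactly one document
                    List<Integer> positions = new ArrayList<>();
                    for (int i = 0, freq = p.freq(); i < freq; i++) {
                        positions.add(p.nextPosition());
                    }
                    result.put(term.utf8ToString(), positions);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(positionsOf("a b a"));
    }
}
```

If this prints non-negative positions while the real index still yields -1, the difference is likely in how the real index was written or opened rather than in the reading loop. Commenting out the setStoreTermVectorPositions line shows what the reader sees when positions were not stored at index time.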