Question
The Lucene (4.6) highlighter is very slow when a frequent term is searched. The search itself is fast (100 ms), but highlighting can take more than an hour(!).
Details: a large text corpus was used (1.5 GB of plain text). Performance does not depend on whether the text is split into smaller pieces (tested with 500 MB and 5 MB pieces as well). Positions and offsets are stored. If a very frequent term or pattern is searched, TopDocs are retrieved quickly (100 ms), but each "searcher.doc(id)" call is expensive (5-50 s), and getBestFragments() is extremely expensive (more than an hour), even though the fields are stored and indexed for this very purpose. (Hardware: Core i7, 8 GB RAM.)
Greater background: this would serve language-analysis research. A special stemmer is used that also stores part-of-speech information. For example, if "adj adj adj adj noun" is searched, it returns all occurrences of that pattern in the text, with context.
Can I tune its performance, or should I choose another tool?
Code used:
// indexing
FieldType offsetsType = new FieldType(TextField.TYPE_STORED);
offsetsType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
offsetsType.setStored(true);   // redundant: already implied by TYPE_STORED
offsetsType.setIndexed(true);  // redundant: already implied by TYPE_STORED
// term vectors with positions and offsets let FastVectorHighlighter
// read precomputed token data instead of re-analyzing the stored text
offsetsType.setStoreTermVectors(true);
offsetsType.setStoreTermVectorOffsets(true);
offsetsType.setStoreTermVectorPositions(true);
offsetsType.setStoreTermVectorPayloads(true);
doc.add(new Field("content", fileContent, offsetsType));
// querying
TopDocs results = searcher.search(query, limitStart + limit);
int endPos = Math.min(results.scoreDocs.length, limitStart + limit);
int startPos = Math.min(results.scoreDocs.length, limitStart);
FastVectorHighlighter h = new FastVectorHighlighter(); // create once, outside the loop
for (int i = startPos; i < endPos; i++) {
    int id = results.scoreDocs[i].doc;
    // bottleneck #1 (5-50 s): loads the full stored document,
    // including the huge "content" field
    Document doc = searcher.doc(id);
    // bottleneck #2 (more than 1 hour); m is the IndexReader:
    String[] hs = h.getBestFragments(h.getFieldQuery(query), m, id, "content", contextSize, 10000);
}
相關(guān)(未回答)問題:https://stackoverflow.com/questions/19416804/very-slow-solr-performance-when-highlighting
Accepted answer
BestFragments 依賴于您正在使用的分析器完成的標(biāo)記化.如果要分析這么大的文本,最好在索引時存儲詞向量WITH_POSITIONS_OFFSETS
.
BestFragments relies on the tokenization done by the analyzer that you're using. If you have to analyse such a big text, you'd better to store term vector WITH_POSITIONS_OFFSETS
at indexing time.
Please read this and this book.
By doing that, you won't need to analyze all the text at runtime, because you can pick a method that reuses the existing term vectors, and this will reduce the highlighting time.
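To illustrate why reusing stored offsets is so much cheaper than re-analysis, here is a minimal, self-contained sketch (plain Java, no Lucene; all names are hypothetical). With term vectors stored WITH_POSITIONS_OFFSETS, the highlighter can build fragments directly from precomputed (start, end) character offsets; without them, it must first re-scan the entire stored text of every hit, which is what dominates the hour-long runtime on a 1.5 GB corpus:

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetHighlightSketch {

    // A precomputed term occurrence, as a term vector would store it.
    static final class Offset {
        final int start, end;
        Offset(int start, int end) { this.start = start; this.end = end; }
    }

    // Fast path: build a highlighted fragment directly from stored offsets,
    // without touching the tokenizer at all.
    static String highlightFromOffsets(String text, List<Offset> offsets) {
        StringBuilder sb = new StringBuilder();
        int pos = 0;
        for (Offset o : offsets) {
            sb.append(text, pos, o.start)
              .append("<b>").append(text, o.start, o.end).append("</b>");
            pos = o.end;
        }
        return sb.append(text.substring(pos)).toString();
    }

    // Slow path (no term vectors): the whole text must be re-scanned
    // to recover the offsets before any highlighting can start.
    static List<Offset> reanalyze(String text, String term) {
        List<Offset> found = new ArrayList<>();
        int i = text.indexOf(term);
        while (i >= 0) {
            found.add(new Offset(i, i + term.length()));
            i = text.indexOf(term, i + 1);
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "fast search, slow highlight, fast fix";
        // Without term vectors, this recomputation happens per document:
        List<Offset> offsets = reanalyze(text, "fast");
        System.out.println(highlightFromOffsets(text, offsets));
        // prints: <b>fast</b> search, slow highlight, <b>fast</b> fix
    }
}
```

The indexing code in the question already stores term vectors with positions and offsets, so the remaining cost likely comes from loading the huge stored "content" field per hit; splitting the corpus into many small documents reduces the amount of text each searcher.doc(id) and getBestFragments() call has to handle.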
這篇關(guān)于lucene 中的高光性能非常慢的文章就介紹到這了,希望我們推薦的答案對大家有所幫助,也希望大家多多支持html5模板網(wǎng)!