Problem description
I am new to Lucene, and I'm having some problems writing simple code to query a collection of text files.
I tried this example, but it is incompatible with the new version of Lucene.
UPDATE: This is my new code, but it still doesn't work.
Recommended answer
Lucene is quite a big topic with many classes and methods to cover, and you normally cannot use it without understanding at least some basic concepts. If you need a service that is quickly available, use Solr instead. If you need full control over Lucene, read on. I will cover some core Lucene concepts and the classes that represent them. (For information on how to read text files into memory, see, for example, this article.)
Whatever you are going to do in Lucene, indexing or searching, you need an analyzer. The goal of an analyzer is to tokenize (break into words) and stem (reduce a word to its base form) your input text. It also throws out the most frequent words such as "a", "the", etc. You can find analyzers for more than 20 languages, or you can use SnowballAnalyzer and pass the language as a parameter.
To create an instance of SnowballAnalyzer for English, do this:
Analyzer analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English");
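To make the three analyzer steps concrete, here is a toy sketch in plain Java (no Lucene dependency). It is not how SnowballAnalyzer is implemented: the class name ToyAnalyzer, the five-word stop list and the suffix-stripping rules are all assumptions made up for illustration, and it only handles ASCII English text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class ToyAnalyzer {
    // Hypothetical, deliberately tiny stop-word list; real analyzers ship longer ones.
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "the", "and", "of"));

    public static List<String> analyze(String text) {
        List<String> terms = new ArrayList<String>();
        // 1. Tokenize: lowercase, then split on anything that is not an ASCII letter or digit.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue; // 2. Drop empty tokens and stop words.
            }
            terms.add(stem(token)); // 3. Reduce the token to a crude base form.
        }
        return terms;
    }

    // Naive suffix stripping for illustration only, NOT the Snowball algorithm.
    private static String stem(String token) {
        if (token.endsWith("ing") && token.length() > 5) {
            return token.substring(0, token.length() - 3);
        }
        if (token.endsWith("s") && !token.endsWith("ss") && token.length() > 3) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }
}
```

With this sketch, ToyAnalyzer.analyze("The indexing of files") yields the terms "index" and "file". The same normalization has to be applied at index time and at query time, otherwise a query term will not match the indexed form of the word.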
If you are going to index texts in different languages and want to select the analyzer automatically, you can use Tika's LanguageIdentifier.
You need to store your index somewhere. There are two major possibilities: an in-memory index, which is easy to try out, and a disk index, which is the most widespread.
Use either of the next two lines:
Directory directory = new RAMDirectory(); // RAM index storage
Directory directory = FSDirectory.open(new File("/path/to/index")); // disk index storage
When you want to add, update or delete documents, you need an IndexWriter:
IndexWriter writer = new IndexWriter(directory, analyzer, true, new IndexWriter.MaxFieldLength(25000));
Any document (a text file in your case) is a set of fields. To create a document that will hold information about your file, use this:
Document doc = new Document();
String title = nameOfYourFile;
doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED)); // adding title field
String content = contentsOfYourFile;
doc.add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED)); // adding content field
writer.addDocument(doc); // writing new document to the index
The Field constructor takes the field's name, its text and at least two more parameters. The first is a flag that shows whether Lucene must store this field. If it equals Field.Store.YES, you will be able to get all your text back from the index; otherwise only index information about it will be stored. The second parameter shows whether Lucene must index this field. Use Field.Index.ANALYZED for any field you are going to search on. Normally, you use both parameters as shown above.
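The document snippet above leaves nameOfYourFile and contentsOfYourFile undefined. One possible way to obtain them is sketched below using only the JDK. The class name TextFileLoader and its two methods are hypothetical helpers, the code assumes Java 7 or later (for java.nio.file), and it assumes the files are UTF-8 encoded.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class TextFileLoader {
    // File name without its directory, suitable for the "title" field.
    public static String titleOf(Path file) {
        return file.getFileName().toString();
    }

    // Whole file decoded as UTF-8 (an assumption; use your files' real encoding).
    public static String contentOf(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }
}
```

The two return values can then be passed as the title and content strings when building the Document. Reading the whole file into memory is fine for small text files; very large files would need a different approach.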
Don't forget to close your IndexWriter after the job is done:
writer.close();
Searching is a bit trickier. You will need several classes: Query and QueryParser to build a Lucene query from a string, IndexSearcher for the actual searching, TopScoreDocCollector to store results (it is passed to IndexSearcher as a parameter) and ScoreDoc to iterate through the results. The next snippet shows how all this fits together:
IndexSearcher searcher = new IndexSearcher(directory);
QueryParser parser = new QueryParser(Version.LUCENE_30, "content", analyzer);
Query query = parser.parse("terms to search");
TopScoreDocCollector collector = TopScoreDocCollector.create(HOW_MANY_RESULTS_TO_COLLECT, true);
searcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
// hits[i].doc is Lucene's internal document number. Note that this number may change after documents are deleted
for (int i = 0; i < hits.length; i++) {
    Document hitDoc = searcher.doc(hits[i].doc); // getting the actual document
    System.out.println("Title: " + hitDoc.get("title"));
    System.out.println("Content: " + hitDoc.get("content"));
    System.out.println();
}
Note the second argument to the QueryParser constructor: it is the default field, i.e. the field that will be searched if no qualifier is given. For example, if your query is "title:term", Lucene will search for the word "term" in the field "title" of all docs, but if your query is just "term", it will search in the default field, in this case "content". For more info see Lucene Query Syntax. QueryParser also takes an analyzer as its last argument. This must be the same analyzer you used to index your text.
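The default-field rule can be pictured with a few lines of plain Java. This is only a toy model of the rule for a single-term query, not Lucene's QueryParser (which supports a much richer syntax); the class name DefaultFieldDemo and the resolve method are made up for illustration.

```java
public class DefaultFieldDemo {
    // Returns {field, term} for a single-term query such as "title:term" or "term".
    public static String[] resolve(String query, String defaultField) {
        int colon = query.indexOf(':');
        if (colon > 0) {
            // Explicit qualifier: everything before the colon names the field.
            return new String[] { query.substring(0, colon), query.substring(colon + 1) };
        }
        // No qualifier: fall back to the default field.
        return new String[] { defaultField, query };
    }
}
```

Here resolve("title:term", "content") picks the field "title", while resolve("term", "content") falls back to "content".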
The last thing you need to know about is the first parameter of TopScoreDocCollector.create. It is just a number that represents how many results you want to collect. For example, if it equals 100, Lucene will collect only the top 100 results (by score) and drop the rest. This is just an optimization: you collect the best results, and if you are not satisfied with them, you repeat the search with a larger number.
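The idea of keeping only the best N results can be sketched with a bounded min-heap in plain Java. This illustrates the general top-N technique under my own assumptions; it is not a description of how TopScoreDocCollector is implemented, and the class name TopNDemo and method topN are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class TopNDemo {
    // Keeps the n highest scores and returns them best-first.
    public static List<Float> topN(float[] scores, int n) {
        // Min-heap: the weakest score kept so far sits on top.
        PriorityQueue<Float> heap = new PriorityQueue<Float>();
        for (float score : scores) {
            heap.add(score);
            if (heap.size() > n) {
                heap.poll(); // evict the current lowest score
            }
        }
        List<Float> best = new ArrayList<Float>(heap);
        Collections.sort(best, Collections.reverseOrder());
        return best;
    }
}
```

Memory use stays proportional to N rather than to the number of matching documents, which is why asking for a small N first and retrying with a larger one is cheap.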
Finally, don't forget to close the searcher and the directory so that you don't leak system resources:
searcher.close();
directory.close();
Also see the IndexFiles demo class from the Lucene 3.0 sources.