Problem Description
I am using Lucene 3.5.0 and I want to output the term vectors of each document. For example, I want to know the frequency of a term across all documents and within each specific document. My indexing code is:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Indexer {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            throw new IllegalArgumentException("Usage: java " + Indexer.class.getName() + " <index dir> <data dir>");
        }
        String indexDir = args[0];
        String dataDir = args[1];
        long start = System.currentTimeMillis();
        Indexer indexer = new Indexer(indexDir);
        int numIndexed;
        try {
            numIndexed = indexer.index(dataDir, new TextFilesFilter());
        } finally {
            indexer.close();
        }
        long end = System.currentTimeMillis();
        System.out.println("Indexing " + numIndexed + " files took " + (end - start) + " milliseconds");
    }

    private IndexWriter writer;

    public Indexer(String indexDir) throws IOException {
        Directory dir = FSDirectory.open(new File(indexDir));
        writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_35),
                true,
                IndexWriter.MaxFieldLength.UNLIMITED);
    }

    public void close() throws IOException {
        writer.close();
    }

    public int index(String dataDir, FileFilter filter) throws Exception {
        File[] files = new File(dataDir).listFiles();
        for (File f : files) {
            if (!f.isDirectory() &&
                    !f.isHidden() &&
                    f.exists() &&
                    f.canRead() &&
                    (filter == null || filter.accept(f))) {
                // Read the first line of the file as its URL. Open the File
                // itself, not just f.getName(), so the path resolves correctly
                // regardless of the working directory.
                BufferedReader inputStream = new BufferedReader(new FileReader(f));
                String url = inputStream.readLine();
                inputStream.close();
                indexFile(f, url);
            }
        }
        return writer.numDocs();
    }

    private static class TextFilesFilter implements FileFilter {
        public boolean accept(File path) {
            return path.getName().toLowerCase().endsWith(".txt");
        }
    }

    protected Document getDocument(File f, String url) throws Exception {
        Document doc = new Document();
        doc.add(new Field("contents", new FileReader(f)));
        doc.add(new Field("urls", url, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("filename", f.getName(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("fullpath", f.getCanonicalPath(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        return doc;
    }

    private void indexFile(File f, String url) throws Exception {
        System.out.println("Indexing " + f.getCanonicalPath());
        Document doc = getDocument(f, url);
        writer.addDocument(doc);
    }
}
Can anybody help me write a program to do that? Thanks.
Recommended Answer
First of all, you don't need to store term vectors just to know the frequency of a term in documents; Lucene stores these numbers anyway for use in TF-IDF calculation. You can access this information by calling IndexReader.termDocs(term) and iterating over the result.
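A minimal sketch of that approach, assuming an existing index with a "contents" field (the index directory path and query term are passed as arguments; requires the lucene-core 3.5 jar on the classpath):

```java
import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.store.FSDirectory;

public class TermFreqPrinter {
    public static void main(String[] args) throws Exception {
        // args[0] = index directory, args[1] = term text to look up
        IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
        try {
            Term term = new Term("contents", args[1]);
            // Document frequency: how many documents contain the term.
            System.out.println("docFreq: " + reader.docFreq(term));
            // Per-document frequency, via the postings for this term.
            TermDocs termDocs = reader.termDocs(term);
            while (termDocs.next()) {
                System.out.println("doc " + termDocs.doc() + ": freq " + termDocs.freq());
            }
            termDocs.close();
        } finally {
            reader.close();
        }
    }
}
```

Note that the term text must match what the analyzer produced at index time (StandardAnalyzer lowercases, so look up "lucene" rather than "Lucene").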
If you have some other purpose in mind and actually need to access the term vectors, then you need to tell Lucene to store them by passing Field.TermVector.YES as the last argument of the Field constructor. You can then retrieve the vectors, e.g. with IndexReader.getTermFreqVector().
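As a sketch of that second route: the "contents" field in the indexing code above would be created with a term vector, and the vector for a given document could then be read back with a helper like the hypothetical dumpVector below (again assuming lucene-core 3.5 on the classpath):

```java
import java.io.FileReader;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

public class VectorDumper {
    // At index time: create the field with a stored term vector.
    // (Replaces the plain new Field("contents", new FileReader(f)) call.)
    static Field contentsField(java.io.File f) throws Exception {
        return new Field("contents", new FileReader(f), Field.TermVector.YES);
    }

    // At search time: print every term and its frequency in one document.
    // Returns silently if no vector was stored for this document/field.
    static void dumpVector(IndexReader reader, int docId) throws Exception {
        TermFreqVector vector = reader.getTermFreqVector(docId, "contents");
        if (vector == null) {
            System.out.println("no term vector stored for doc " + docId);
            return;
        }
        String[] terms = vector.getTerms();
        int[] freqs = vector.getTermFrequencies();
        for (int i = 0; i < terms.length; i++) {
            System.out.println(terms[i] + ": " + freqs[i]);
        }
    }
}
```

Storing term vectors enlarges the index, so enable them only on fields where you really need per-document term listings.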
這篇關(guān)于如何在 Lucene 3.5.0 中提取文檔術(shù)語向量的文章就介紹到這了,希望我們推薦的答案對大家有所幫助,也希望大家多多支持html5模板網(wǎng)!