Question
Has anybody compared these stemmers from Lucene (package org.tartarus.snowball.ext): EnglishStemmer, PorterStemmer, LovinsStemmer? What are the strong and weak points of the algorithms behind them? When should each of them be used? Or are there other algorithms available for stemming English words?
Thanks.
Recommended answer
The Lovins stemmer is a very old algorithm that is not of much practical use, since the Porter stemmer is much stronger. Based on a quick skim of the source code, it seems PorterStemmer implements Porter's original (1980) algorithm, while EnglishStemmer implements his updated version (commonly known as Porter2), which should be better.
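To make "Porter's original (1980) algorithm" a little more concrete, here is a minimal sketch of just its first step (step 1a, plural stripping). It is not the Lucene code and not a full stemmer; the class and method names are made up for illustration, and the input is assumed to be already lowercased.

```java
/** Illustrative sketch of step 1a of Porter's 1980 algorithm only; not Lucene code. */
final class PorterStep1aSketch {

    /** Applies the four step-1a rules, trying the longest suffix first. */
    static String step1a(String word) {
        if (word.endsWith("sses")) {
            return word.substring(0, word.length() - 2); // SSES -> SS
        }
        if (word.endsWith("ies")) {
            return word.substring(0, word.length() - 2); // IES -> I
        }
        if (word.endsWith("ss")) {
            return word;                                  // SS -> SS (unchanged)
        }
        if (word.endsWith("s")) {
            return word.substring(0, word.length() - 1); // S -> removed
        }
        return word;
    }

    public static void main(String[] args) {
        String[] samples = {"caresses", "ponies", "caress", "cats"};
        for (String w : samples) {
            System.out.println(w + " -> " + step1a(w));
        }
    }
}
```

Note that "ponies" becomes "poni": a suffix-stripping stemmer only needs to map related forms to the same string, so its output is not necessarily a dictionary word. The real algorithm applies several further steps after this one.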
A stronger stemming algorithm (actually a lemmatizer) is available in the Stanford NLP tools. A Lucene-Stanford NLP bridge by yours truly is available here (API docs).
See also Manning, Raghavan & Schütze for general information about stemming and lemmatization.