
Lucene crawler (needs to build a Lucene index)

This article presents an approach to "Lucene crawler (needs to build a Lucene index)". It should be a useful reference for anyone facing the same problem; read on below.

Problem Description

I am looking for an Apache Lucene web crawler, written in Java if possible, or in any other language. The crawler must use Lucene and create a valid Lucene index and document files, so this is the reason why Nutch is eliminated, for example...

Does anybody know whether such a web crawler exists, and if the answer is yes, where can I find it? Thanks...

Recommended Answer

What you're asking for is two components:

1. A web crawler
2. A Lucene-based automatic indexer

First, a word of encouragement: been there, done that. I'll tackle both of the components individually, from the point of view of making your own, since I don't believe you could use Lucene to do what you've requested without really understanding what's going on underneath.

So you have a web site/directory you want to "crawl" through to collect specific resources. Assuming it's any common web server that lists directory contents, making a web crawler is easy: just point it to the root of the directory and define rules for collecting the actual files, such as "ends with .txt". Very simple stuff, really.

The actual implementation could be something like this: use HttpClient to fetch the actual web pages/directory listings, then parse them in whatever way you find most efficient, such as using XPath to select all the links from the fetched document, or just parsing it with regex using Java's readily available Pattern and Matcher classes. If you decide to go the XPath route, consider using JDOM for DOM handling and Jaxen for the actual XPath evaluation.
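
As a concrete illustration of the fetch-and-parse step, here is a minimal sketch of the regex route. Note one assumption: the answer's era used the Apache Commons HttpClient, but this sketch substitutes the JDK's built-in java.net.http.HttpClient (Java 11+) to stay dependency-free; the class name, the href pattern, and the "ends with .txt" rule are illustrative, not prescribed by the answer.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkCollector {

    // A deliberately simple collection rule: href attributes ending in .txt.
    private static final Pattern TXT_LINK =
            Pattern.compile("href=\"([^\"]+\\.txt)\"", Pattern.CASE_INSENSITIVE);

    private final HttpClient client = HttpClient.newHttpClient();

    /** Fetches one directory-listing page and returns every link ending in .txt. */
    public List<String> collectTxtLinks(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        List<String> links = new ArrayList<>();
        Matcher m = TXT_LINK.matcher(response.body());
        while (m.find()) {
            links.add(m.group(1));  // group 1 is the captured link target
        }
        return links;
    }
}
```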

Once you get the actual resources you want, such as a bunch of text files, you need to identify the type of data to know what to index and what you can safely ignore. For simplicity's sake I'm assuming these are plain-text files with no fields or anything, and I won't go deeper into that; but if you have multiple fields to store, I suggest you make your crawler produce 1..n specialized beans with accessors and mutators (bonus points: make the bean immutable, don't allow accessors to mutate the internal state of the bean, and create a copy constructor for the bean) to be used in the other component. A minimal sketch of such a bean follows.
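
Here is one way such a bean could look; the name CrawledDocument and its two fields are hypothetical stand-ins for the answer's "YourBean", not anything it prescribes.

```java
/** An immutable bean the crawler produces, one per collected resource. */
public final class CrawledDocument {

    private final String url;      // where the resource was fetched from
    private final String contents; // the plain-text payload to index

    public CrawledDocument(String url, String contents) {
        this.url = url;
        this.contents = contents;
    }

    /** Copy constructor, as suggested for bonus points. */
    public CrawledDocument(CrawledDocument other) {
        this(other.url, other.contents);
    }

    // Accessors only, no mutators: the internal state cannot be changed.
    public String getUrl()      { return url; }
    public String getContents() { return contents; }
}
```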

In terms of API calls, you should have something like HttpCrawler#getDocuments(String url), which returns a List<YourBean> to use in conjunction with the actual indexer.
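
Pinned down as an interface, that could look like the sketch below, reusing the hypothetical CrawledDocument bean from above in place of YourBean.

```java
import java.util.List;

/** The crawler-side API surface named in the answer. */
public interface HttpCrawler {
    /** Crawls from the given root URL and returns one bean per collected resource. */
    List<CrawledDocument> getDocuments(String url) throws Exception;
}
```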

Beyond the obvious stuff with Lucene, such as setting up a directory and understanding its threading model (only one write operation is allowed at any time; multiple reads can exist even while the index is being updated), you of course want to feed your beans to the index. The five-minute tutorial I already linked to basically does exactly that: look into the example addDoc(..) method and just replace the String with YourBean.
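
A minimal sketch of that indexing step, assuming a reasonably recent Lucene (5+, where FSDirectory.open takes a Path and documents are assembled from StringField/TextField); the class name and field names are illustrative, with CrawledDocument again standing in for YourBean.

```java
import java.nio.file.Paths;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class BeanIndexer {

    /** The tutorial's addDoc(..) shape, with the bare String replaced by the bean. */
    static Document toDocument(CrawledDocument bean) {
        Document doc = new Document();
        doc.add(new StringField("url", bean.getUrl(), Field.Store.YES));         // exact match, not tokenized
        doc.add(new TextField("contents", bean.getContents(), Field.Store.YES)); // analyzed, full-text searchable
        return doc;
    }

    public static void indexAll(String indexDir, List<CrawledDocument> beans) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get(indexDir)), config)) {
            for (CrawledDocument bean : beans) {
                writer.addDocument(toDocument(bean));
            }
        } // try-with-resources closes the writer and releases the write lock
    }
}
```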

Note that Lucene's IndexWriter does have some cleanup methods which are handy to execute in a controlled manner. For example, calling IndexWriter#commit() only after a bunch of documents have been added to the index is good for performance, and calling IndexWriter#optimize() now and then to make sure the index isn't getting hugely bloated over time is a good idea too. Always remember to close the index too, to avoid unnecessary LockObtainFailedExceptions being thrown; as with all IO in Java, such an operation should of course be done in a finally block.
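
A sketch of that controlled-cleanup pattern, reusing the hypothetical classes above. One caveat: IndexWriter#optimize() belongs to the Lucene 2.x/3.x era of this answer and was removed in Lucene 4; forceMerge(..) is its closest modern relative and stands in for it here.

```java
import java.io.IOException;
import java.util.List;

import org.apache.lucene.index.IndexWriter;

public class IndexMaintenance {

    /** Adds a whole batch, commits once, and always closes the writer. */
    public static void indexBatch(IndexWriter writer, List<CrawledDocument> beans)
            throws IOException {
        try {
            for (CrawledDocument bean : beans) {
                writer.addDocument(BeanIndexer.toDocument(bean));
            }
            writer.commit();      // one commit per batch, not one per document
            writer.forceMerge(1); // expensive; run occasionally, not on every batch
        } finally {
            // Releasing the write lock here prevents later writers from hitting
            // LockObtainFailedException.
            writer.close();
        }
    }
}
```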

• You need to remember to expire your Lucene index's contents every now and then too; otherwise you'll never remove anything, and the index will get bloated and eventually just die because of its own internal complexity.
• Because of the threading model, you most likely need to create a separate read/write abstraction layer for the index itself to ensure that only one instance can write to the index at any given time.
• Since the source data acquisition is done over HTTP, you need to consider data validation and possible error situations, such as the server not being available, to avoid any kind of malformed indexing and client hangups.
• You need to know what you want to search from the index to be able to decide what you are going to put into it. Note that indexing by date must be done so that you split the date into, say, year, month, day, hour, minute and second instead of a millisecond value, because when doing range queries against a Lucene index, [0 TO 5] actually gets transformed into +0 +1 +2 +3 +4 +5, which means the range query dies out very quickly because there's a maximum number of query sub-parts. A sketch of indexing dates at a coarser resolution follows this list.
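
As an illustration of the last point, Lucene's built-in DateTools can truncate a date to a coarser resolution before indexing, which is one way to implement the split described above; newer Lucene versions also offer point fields (e.g. LongPoint) for efficient numeric ranges. The field name here is an illustrative assumption.

```java
import java.util.Date;

import org.apache.lucene.document.DateTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;

public class DateFields {

    /** Indexes the date at day resolution ("yyyyMMdd") rather than as raw
        milliseconds, keeping date range queries small. */
    static void addCreatedDate(Document doc, Date created) {
        String day = DateTools.dateToString(created, DateTools.Resolution.DAY);
        doc.add(new StringField("created", day, Field.Store.YES));
    }
}
```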

With this information, I do believe you could make your own special Lucene indexer in less than a day, or three if you want to test it rigorously.

This concludes this article on "Lucene crawler (needs to build a Lucene index)". We hope the recommended answer is helpful.
