163java
About 68 results
  1. Nutch Search Engine (Part 1)_ Introduction to Nutch and Installation

    http://www.cnblogs.com/xia520pi/p/3615554.html
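    March 21, 2014 - 1. Introduction to Nutch   Nutch is an open-source web search engine implemented in Java. It is a set of tools used mainly to collect web page data, analyze it, and build an index, and it provides interfaces for querying that data. Underneath, it uses Hadoop for distributed computation and storage, and indexing is handled by the Solr distributed indexing framework. Solr is an open-source full-text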
  2. Crawl PDF documents using nutch

    http://stackoverflow.com/questions/18054889/crawl-pdf-documents-using-nutch
    I have to crawl PDF documents too from given URL... suggest any tool/API to crawl PDF docs also... now I am using nutch to crawl bu
  3. What database options are available for Nutch 2.1?

    http://stackoverflow.com/questions/13483970/what-database-options-are-available-for-nutch-2-1
    I'm trying to test out Nutch 2.1 on a single Windows machine. The following command dies: nutch crawl seeds -dir crawl -solr http
  4. Solr+Nutch+AjaxSolr query

    http://stackoverflow.com/questions/11487394/solrnutchajaxsolr-query
    1) I referred https://github.com/evolvingweb/ajax-solr/wiki/reuters-tutorial for Ajax-Solr setup. I want to know that although ajax
  5. eclipse - Nutch plugin development

    http://stackoverflow.com/questions/1213343/nutch-plugin-development
    The nutch wiki has instructions on how to build nutch plugins, but only if you download the entire nutch source tree and put it i
  6. Effect of depth, topn in nutch crawl

    http://stackoverflow.com/questions/11304550/effect-of-depth-topn-in-nutch-crawl
    I have always wondered what is the effect of depth and topn for a nutch crawl? For example, let's assume a depth of 100 and topn of
  7. java - Can I Define a Custom extension point in Apache Nutch 1.8

    http://stackoverflow.com/questions/35301335/can-i-define-a-custom-extension-point-in-apacahe-nutch-1-8
    I need to use a postgre sql database instead of txt file for seed urls before running injector job. Can i achieve this problem by u
  8. Nutch : Crawl Broken Links & Index it in Solr

    http://stackoverflow.com/questions/20513035/nutch-crawl-broken-links-index-it-in-solr
    My purpose is to find how many URLs in an HTML page are invalid (404, 500, HostNotFound). So in Nutch is there a config change that
  9. hbase - Nutch in Hadoop 2.x

    http://stackoverflow.com/questions/23436168/nutch-in-hadoop-2-x
    I have a three-node cluster running Hadoop 2.2.0 and HBase 0.98.1 and I need to use a Nutch 2.2.1 crawler on top of that. But it on
  10. Crawling with Apache Nutch 1.9 using Java code

    http://stackoverflow.com/questions/32949695/crawling-with-apache-nutch-1-9-using-java-code
    We have developed a data processing pipeline which crawls web data given a set of configured URLs using Apache Nutch 1.4. The pipel