About 68 results
  1. Nutch Search Engine (Part 1): Introduction to Nutch and Installation

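    March 21, 2014 - 1. Introduction to Nutch   Nutch is an open-source web search engine implemented in Java. It is a set of tools that collects web page data, analyzes and indexes it, and provides interfaces for querying that data. Under the hood it uses Hadoop for distributed computation and storage, and the Solr distributed indexing framework for indexing. Solr is an open-source full-text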
  2. Crawl PDF documents using nutch

    I have to crawl PDF documents too from given URL... suggest any tool/API to crawl PDF docs also... now I am using nutch to crawl bu
  3. What database options are available for Nutch 2.1?

    I'm trying to test out Nutch 2.1 on a single Windows machine. The following command dies: nutch crawl seeds -dir crawl -solr http
  4. Solr+Nutch+AjaxSolr query

    1) I referred https://github.com/evolvingweb/ajax-solr/wiki/reuters-tutorial for Ajax-Solr setup. I want to know that although ajax
  5. eclipse - Nutch plugin development

    The nutch wiki has instructions on how to build nutch plugins, but only if you download the entire nutch source tree and put it i
  6. Effect of depth, topn in nutch crawl

    I have always wondered what is the effect of depth and topn for a nutch crawl? For example, let's assume a depth of 100 and topn of
  7. java - Can I Define a Custom extension point in Apache Nutch 1.8

    I need to use a PostgreSQL database instead of a txt file for seed URLs before running the injector job. Can I achieve this problem by u
  8. Nutch : Crawl Broken Links & Index it in Solr

    My purpose is to find how many URLs in an HTML page are invalid (404, 500, HostNotFound). So in Nutch is there a config change that
  9. hbase - Nutch in Hadoop 2.x

    I have a three-node cluster running Hadoop 2.2.0 and HBase 0.98.1 and I need to use a Nutch 2.2.1 crawler on top of that. But it on
  10. Crawling with Apache Nutch 1.9 using Java code

    We have developed a data processing pipeline which crawls web data given a set of configured URLs using Apache Nutch 1.4. The pipel