Download and install (download the bin zip distribution; otherwise the bin directory contains no run scripts)
http://www.apache.org/dyn/closer.lua/nutch/1.11/apache-nutch-1.11-bin.zip
Extract it into the Documents directory
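The download and extraction can be sketched as follows (a minimal sketch; the archive.apache.org URL is one possible mirror, and `wget`/`unzip` are assumed to be installed):

````shell
cd ~/Documents
# fetch the binary zip from the Apache release archive (one possible mirror)
wget http://archive.apache.org/dist/nutch/1.11/apache-nutch-1.11-bin.zip
unzip apache-nutch-1.11-bin.zip
cd apache-nutch-1.11
bin/nutch   # should print the Nutch usage/help text
````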
- Verify that the installation / extraction succeeded
````Perl
wdxxl@ubuntu:~/Documents/apache-nutch-1.11$ bin/nutch
wdxxl@ubuntu:~/Documents/apache-nutch-1.11$ bin/crawl
````
- Configure Nutch (add the http.agent.name property in conf/nutch-site.xml)
````Perl
wdxxl@ubuntu:~/Documents/apache-nutch-1.11$ gedit conf/nutch-site.xml
````
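A minimal conf/nutch-site.xml carrying the required http.agent.name property might look like this (the agent string `wdxxl-crawler` is just an example value):

````xml
<?xml version="1.0"?>
<configuration>
  <!-- Nutch refuses to crawl unless an agent name is set -->
  <property>
    <name>http.agent.name</name>
    <value>wdxxl-crawler</value>
  </property>
</configuration>
````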
- Define the seed URLs
````Perl
wdxxl@ubuntu:~/Documents/apache-nutch-1.11/bin$ mkdir -p seed_urls
wdxxl@ubuntu:~/Documents/apache-nutch-1.11/bin$ cd seed_urls/
wdxxl@ubuntu:~/Documents/apache-nutch-1.11/bin/seed_urls$ touch seed.txt
wdxxl@ubuntu:~/Documents/apache-nutch-1.11/bin/seed_urls$ echo http://wdxxl.github.io/ > seed.txt
wdxxl@ubuntu:~/Documents/apache-nutch-1.11/bin/seed_urls$ cat seed.txt
````
- Crawl
````Perl
wdxxl@ubuntu:~/Documents/apache-nutch-1.11/bin$ ./crawl seed_urls crawl_dir 1
````
or, with more rounds (so that at least wdxxl.github.io gets crawled completely):
````Perl
wdxxl@ubuntu:~/Documents/apache-nutch-1.11/bin$ ./crawl seed_urls crawl_data 3
````
Note that leftover Linux editor backup files can still affect what gets crawled, for example:
````Perl
wdxxl@ubuntu:~/Documents/apache-nutch-1.11/bin/seed_urls$ rm seed.txt~
````
- Check the crawldb status
````Perl
wdxxl@ubuntu:~/Documents/apache-nutch-1.11/bin$ ./nutch readdb crawl_dir/crawldb/ -stats
````
- Dump the data to a file (some information is missing, most likely just because the crawl used only 1 round)
````Perl
wdxxl@ubuntu:~/Documents/apache-nutch-1.11/bin$ ./nutch readdb crawl_dir/crawldb -dump output/crawldb
````
- Start the Solr server (solr-4.10.4.tgz)
Prepare Solr's schema file:
````Perl
wdxxl@ubuntu:~$ cp ~/Documents/apache-nutch-1.11/conf/schema.xml ~/Documents/solr-4.10.4/example/solr/collection1/conf
````
Start the Solr server (Solr 4.10.4 ships with a default collection1)
Stop the Solr server
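Neither of the two steps above lists a command; a minimal sketch, assuming the stock Jetty launcher in the example/ directory (Solr 4.10 also ships a bin/solr helper script):

````shell
# start Solr (runs in the foreground on port 8983)
cd ~/Documents/solr-4.10.4/example
java -jar start.jar

# stop Solr: press Ctrl+C in that terminal,
# or use the bundled script with the port it was started on
~/Documents/solr-4.10.4/bin/solr stop -p 8983
````

Once started, the admin UI should be reachable at http://localhost:8983/solr/.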
- Import the crawl results into Solr
````Perl
wdxxl@ubuntu:~/Documents/apache-nutch-1.11/bin$ ./nutch solrindex http://localhost:8983/solr/ crawl_dir/crawldb -linkdb crawl_dir/linkdb/ crawl_dir/segments/*
````
- Query from the Solr server web page
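Besides the admin UI, the select handler can be queried directly to confirm the indexed documents (assuming the default collection1 core and port 8983 used above):

````shell
# match all documents, return the first 5 as JSON
curl "http://localhost:8983/solr/collection1/select?q=*:*&wt=json&rows=5"
````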
- Open the Solr index data files with Luke