I've been reading up on web crawling lately. I work mainly in Java, so after a bit of googling I came across the webmagic framework, which turned out to be quite easy to use. As a first exercise I decided to crawl all the articles on Jianshu's 新上榜 (newly trending) list. The analysis goes as follows (full code at the end):
First page: http://www.jianshu.com/recommendations/notes?category_id=56
The structure of the article list is shown in the screenshot below:
For this demo I only extract each article's title and link.
Full article list on the first page: note the id in the last red box, it is used later.
Scroll down to load the second page and look at the request that gets fired:
The request URL adds two parameters to the first-page URL: max_id and page. page is easy, it is just the page number; the real task is working out how max_id changes. By scrolling further and clicking "load more" a few times, it turns out that max_id is the id of the last article minus 1, as shown in the red box of the second screenshot. Once the pattern is clear, the rest is straightforward: time to write the code (a worked example of the URL construction follows below).
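To make the pattern concrete, here is a minimal sketch of how the next-page URL is built. The numbers are made up for illustration; in the real code below, the value comes from the data-recommended-at attribute of the last li on the current page:

    // hypothetical values: the last article's data-recommended-at is 261762 and the next page is page 2
    String moreUrl = "http://www.jianshu.com/recommendations/notes?category_id=56&max_id=%d&page=%d";
    String nextUrl = String.format(moreUrl, 261762 - 1, 2);
    // nextUrl -> http://www.jianshu.com/recommendations/notes?category_id=56&max_id=261761&page=2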
1、JianShuNewHotProcessor.java
package com.test.spider.common;

import com.test.spider.po.News;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Selectable;

import java.util.List;

/**
 * Created by Administrator on 2017/1/20.
 */
public class JianShuNewHotProcessor implements PageProcessor {

    // URL template for the next page
    public String more_url = "http://www.jianshu.com/recommendations/notes?category_id=56&max_id=%d&page=%d";
    // page counter
    private int count = 1;

    private Site site = Site.me()
            .setDomain("jianshu.com")
            .setSleepTime(2000)
            .setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36");

    @Override
    public void process(Page page) {
        // model object
        News news = null;
        // all <li> nodes of the article list
        List<Selectable> nodes = page.getHtml().xpath("//div[@id='list-container']/ul/li").nodes();
        if (nodes.isEmpty()) {
            // no more articles: stop scheduling further pages
            return;
        }
        for (Selectable s : nodes) {
            // article title
            String title = s.xpath("div[@class='content']/a/text()").toString();
            // article link
            String link = s.xpath("div[@class='content']/a").links().toString();
            news = new News();
            news.setTitle(title);
            news.setLink(link);
            page.putField("news_" + title, news);
        }
        // data-recommended-at of the last article, used as the next max_id
        int max_id = Integer.parseInt(nodes.get(nodes.size() - 1).regex("data-recommended-at=\"(\\d+)\"").toString());
        // bump the page counter
        count++;
        // build the URL of the next page
        String nextUrl = String.format(more_url, max_id - 1, count);
        // add the next page to the crawl queue
        page.addTargetRequest(nextUrl);
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new JianShuNewHotProcessor()) // the page processor: downloads and parses pages
                .addUrl("http://www.jianshu.com/recommendations/notes?category_id=56") // entry URL: first page of Jianshu's 新上榜 list
                .addPipeline(new NewsPipeline()) // result handling (persistence); here we only print to the console
                .thread(5) // 5 worker threads
                .setExitWhenComplete(true) // exit when the queue is drained
                .start(); // start asynchronously
    }
}
2、NewsPipeline.java
package com.test.spider.common;

import com.test.spider.po.News;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

import java.util.Map;

/**
 * Created by Administrator on 2017/1/17.
 */
public class NewsPipeline implements Pipeline {

    @Override
    public void process(ResultItems resultItems, Task task) {
        System.out.println("get page: " + resultItems.getRequest().getUrl());
        // only the fields stored under a "news_" key are News objects
        for (Map.Entry<String, Object> entry : resultItems.getAll().entrySet()) {
            if (entry.getKey().contains("news")) {
                News news = (News) entry.getValue();
                System.out.println(news);
            }
        }
    }
}
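NewsPipeline above only prints to the console. If you would rather not write a pipeline yourself, webmagic also ships with built-in ones such as ConsolePipeline and JsonFilePipeline; here is a minimal sketch using the latter (the output directory is just an example):

    Spider.create(new JianShuNewHotProcessor())
            .addUrl("http://www.jianshu.com/recommendations/notes?category_id=56")
            // writes each ResultItems to a JSON file under the given directory
            .addPipeline(new us.codecraft.webmagic.pipeline.JsonFilePipeline("/tmp/jianshu"))
            .thread(5)
            .run(); // run() blocks until the crawl finishes, unlike the asynchronous start()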
3、News.java
package com.test.spider.po;

/**
 * Created by Administrator on 2017/1/17.
 */
public class News {

    private String title;
    private String link;

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getLink() {
        return link;
    }

    public void setLink(String link) {
        this.link = link;
    }

    @Override
    public String toString() {
        return "News{" +
                "title='" + title + '\'' +
                ", link='" + link + '\'' +
                '}';
    }
}
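When it runs, the output from NewsPipeline should look roughly like this (the titles and links here are placeholders, not real crawl results):

    get page: http://www.jianshu.com/recommendations/notes?category_id=56
    News{title='some article title', link='http://www.jianshu.com/p/xxxxxxxx'}
    News{title='another article title', link='http://www.jianshu.com/p/yyyyyyyy'}
    ...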
This is my first crawler, so any pointers from the pros are much appreciated~~