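I. WebMagic overall architecture
WebMagic is built from four components: Downloader, PageProcessor, Scheduler, and Pipeline. They correspond to the download, processing, URL management, and persistence stages of a crawler's life cycle.
Spider ties these components together so that they can interact and run as one workflow. You can think of Spider as a container for the other four; it is also the core of WebMagic's logic.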
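To make the division of labour concrete, here is a minimal sketch of the crawl loop in plain Java. It is only a toy model under stated assumptions: the `MiniSpider` class, its nested interfaces, and the in-memory "site" are all hypothetical names invented for illustration, not the real WebMagic API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

/** Toy model of the four-component flow; none of these types are WebMagic classes. */
class MiniSpider {
    /** Downloader: turns a URL into raw page text (null means the download failed). */
    interface Downloader { String download(String url); }

    /** What processing one page yields: extracted items plus newly discovered links. */
    static class Result {
        final List<String> items = new ArrayList<>();
        final List<String> links = new ArrayList<>();
    }

    /** PageProcessor: extracts data and discovers links. */
    interface PageProcessor { Result process(String url, String body); }

    /** Pipeline: consumes extracted items (print, save to file, database, ...). */
    interface Pipeline { void accept(String item); }

    /** Spider: wires the components together and drives the loop. */
    static void run(String startUrl, Downloader downloader, PageProcessor processor, Pipeline pipeline) {
        Queue<String> queue = new ArrayDeque<>(); // Scheduler part 1: pending URLs
        Set<String> seen = new HashSet<>();       // Scheduler part 2: de-duplication
        queue.add(startUrl);
        seen.add(startUrl);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            String body = downloader.download(url);
            if (body == null) {
                continue; // skip pages that could not be downloaded
            }
            Result result = processor.process(url, body);
            for (String item : result.items) {
                pipeline.accept(item);
            }
            for (String link : result.links) {
                if (seen.add(link)) { // add() returns false for URLs already seen
                    queue.add(link);
                }
            }
        }
    }

    /** Crawls a tiny in-memory site and returns the page titles in crawl order. */
    static List<String> demo() {
        // Each fake page is stored as "title|comma-separated links".
        Map<String, String> site = new HashMap<>();
        site.put("/a", "Page A|/b,/c");
        site.put("/b", "Page B|/a");
        site.put("/c", "Page C|");
        List<String> saved = new ArrayList<>();
        run("/a", site::get, (url, body) -> {
            Result r = new Result();
            String[] parts = body.split("\\|", -1);
            r.items.add(parts[0]);
            for (String link : parts[1].split(",")) {
                if (!link.isEmpty()) {
                    r.links.add(link);
                }
            }
            return r;
        }, saved::add);
        return saved;
    }
}
```

Calling `MiniSpider.demo()` visits `/a`, `/b`, and `/c` once each, even though `/b` links back to `/a`, because the scheduler part of the loop refuses URLs it has already seen.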
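II. The four WebMagic components
1. Downloader
The Downloader fetches pages from the internet for later processing. By default WebMagic uses Apache HttpClient as the download tool.
2. PageProcessor
The PageProcessor parses pages, extracts useful information, and discovers new links. WebMagic uses Jsoup as its HTML parser and builds Xsoup, an XPath tool, on top of it.
Of the four components, the PageProcessor is the one that differs for every site and every page, so it is the part users have to write themselves.
3. Scheduler
The Scheduler manages the URLs waiting to be crawled and handles de-duplication. By default WebMagic manages URLs with an in-memory JDK queue and de-duplicates with a set. Redis is also supported for distributed management.
Unless the project has special distributed requirements, there is no need to write a custom Scheduler.
4. Pipeline
The Pipeline handles extracted results, including computation and persistence to files, databases, and so on. WebMagic ships with two result handlers by default: "print to console" and "save to file".
The Pipeline defines how results are saved. To save to a particular database, write a matching Pipeline. One Pipeline per category of requirement is usually enough.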
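The "in-memory queue plus set" behaviour described for the default Scheduler can be sketched in a few lines of plain Java. This is an illustration only: `DedupQueueScheduler` and its method names are hypothetical and do not mirror WebMagic's actual `Scheduler` interface.

```java
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical scheduler: a thread-safe queue of pending URLs plus a set of URLs already seen. */
class DedupQueueScheduler {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    /** Enqueues the URL unless it was pushed before; returns true if it was accepted. */
    boolean push(String url) {
        if (!seen.add(url)) {
            return false; // duplicate, ignore it
        }
        return queue.offer(url);
    }

    /** Returns the next URL to crawl, or null when nothing is pending. */
    String poll() {
        return queue.poll();
    }

    /** Number of URLs still waiting to be crawled. */
    int pending() {
        return queue.size();
    }
}
```

A set gives exact de-duplication but keeps every URL in memory. A Bloom filter, which appears later in this post, trades a small false-positive rate for much lower memory use.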
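III. Simulating login with Selenium
Selenium is a browser automation tool originally built for testing. It drives a real browser to load pages, so a program can log in to a site or fetch AJAX content automatically.
This matters most for AJAX-generated content: an ordinary crawler only fetches the static HTML of a page and never runs the scripts that produce dynamic content, whereas Selenium loads the page the way a browser would.
The drawback is that Selenium needs a browser driver to be set up first, and each page has to be rendered through Selenium before its dynamically generated content can be scraped.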
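IV. Downloading the browser and driver
Crawling pages with Selenium requires a browser driver, and the driver version has to match the browser version.
Installing Google Chrome on CentOS:
1.1 Install Chrome: yum install https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm
1.2 Check the Chrome version: google-chrome --version
1.3 Download the driver that corresponds to that Chrome version (index: http://chromedriver.storage.googleapis.com/index.html).
For example, for Chrome 89.0.4389.82 a driver from the same 89.0.4389 build can be used; the last number does not have to be identical:
http://chromedriver.storage.googleapis.com/89.0.4389.23/chromedriver_linux64.zip
1.4 Unzip the downloaded driver package into /home/chrome/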
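The version check in step 1.3 is easy to get wrong by eye, so here is a small sketch that compares the two version strings. It assumes the matching rule used in the example above (same major.minor.build, any patch number); the `ChromeVersions` class and its method names are hypothetical helpers, not part of Selenium or ChromeDriver.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for comparing a Chrome version with a chromedriver version. */
class ChromeVersions {
    // Matches four dot-separated numbers and captures the first three (major.minor.build).
    private static final Pattern VERSION = Pattern.compile("(\\d+\\.\\d+\\.\\d+)\\.\\d+");

    /** Extracts "major.minor.build" from text such as "Google Chrome 89.0.4389.82"; null if absent. */
    static String buildPrefix(String text) {
        Matcher m = VERSION.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    /** True when both strings contain a version and the major.minor.build parts are equal. */
    static boolean compatible(String chromeVersionOutput, String driverVersion) {
        String chrome = buildPrefix(chromeVersionOutput);
        return chrome != null && chrome.equals(buildPrefix(driverVersion));
    }
}
```

With the values from the example, `ChromeVersions.compatible("Google Chrome 89.0.4389.82", "89.0.4389.23")` is true, while a driver from a different build such as `90.0.4430.24` is rejected.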
V. Project setup
1. Add dependencies
<!-- Selenium bindings for Java -->
<dependency>
<groupId>org.seleniumhq.selenium</groupId>
<artifactId>selenium-java</artifactId>
<version>3.141.59</version>
</dependency>
<!-- ChromeDriver support jar -->
<dependency>
<groupId>org.seleniumhq.selenium</groupId>
<artifactId>selenium-chrome-driver</artifactId>
<version>3.141.59</version>
</dependency>
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-core</artifactId>
<version>0.7.4</version>
</dependency>
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-extension</artifactId>
<version>0.7.4</version>
</dependency>
<!-- commons-collections -->
<dependency>
<groupId>commons-collections</groupId>
<artifactId>commons-collections</artifactId>
<version>3.2.2</version>
</dependency>
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-selenium</artifactId>
<version>0.7.4</version>
</dependency>
2. Modify WebDriverPool
package com.nieyue.news.webmagic.downloader;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriverService;
import org.openqa.selenium.remote.DesiredCapabilities;
import org.openqa.selenium.remote.RemoteWebDriver;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.atomic.AtomicInteger;
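/**
* @author lsj
* WebDriverPool: starting and stopping browser processes (PhantomJS or Chrome) repeatedly is expensive,
* so this class keeps a pool of reusable WebDriver instances to limit resource usage.
*/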
class WebDriverPool {
private Logger logger= LoggerFactory.getLogger(this.getClass());
private final static int DEFAULT_CAPACITY = 5;
private final int capacity;
private final static int STAT_RUNNING = 1;
private final static int STAT_CLODED = 2;
private AtomicInteger stat = new AtomicInteger(STAT_RUNNING);
/*
* new fields for configuring phantomJS
*/
private WebDriver mDriver = null;
private boolean mAutoQuitDriver = true;
private static final String DEFAULT_CONFIG_FILE = "selenium.properties";
private static final String DRIVER_FIREFOX = "firefox";
private static final String DRIVER_CHROME = "chrome";
private static final String DRIVER_PHANTOMJS = "phantomjs";
protected static Properties sConfig;
protected static DesiredCapabilities sCaps;
/**
* Configure the GhostDriver, and initialize a WebDriver instance. This part
* of code comes from GhostDriver.
* https://github.com/detro/ghostdriver/tree/master/test/java/src/test/java/ghostdriver
*
* @throws IOException
*/
public void configure() throws IOException {
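// Read config file
sConfig = new Properties();
String configFile = DEFAULT_CONFIG_FILE;
// Note: the system property name is spelled "selenuim_config" (sic); pass it exactly like this to override the file
if (System.getProperty("selenuim_config")!=null){
configFile = System.getProperty("selenuim_config");
}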
sConfig.load(Thread.currentThread().getContextClassLoader().getResourceAsStream(configFile));
// sConfig.load(new FileReader(configFile));
// Prepare capabilities
sCaps = new DesiredCapabilities();
sCaps.setJavascriptEnabled(true);
sCaps.setCapability("takesScreenshot", false);
String driver = sConfig.getProperty("driver", DRIVER_PHANTOMJS);
// Fetch PhantomJS-specific configuration parameters
if (driver.equals(DRIVER_PHANTOMJS)) {
// "phantomjs_exec_path"
if (sConfig.getProperty("phantomjs_exec_path") != null) {
sCaps.setCapability(
PhantomJSDriverService.PHANTOMJS_EXECUTABLE_PATH_PROPERTY,
sConfig.getProperty("phantomjs_exec_path"));
} else {
throw new IOException(
String.format(
"Property '%s' not set!",
PhantomJSDriverService.PHANTOMJS_EXECUTABLE_PATH_PROPERTY));
}
// "phantomjs_driver_path"
if (sConfig.getProperty("phantomjs_driver_path") != null) {
System.out.println("Test will use an external GhostDriver");
sCaps.setCapability(
PhantomJSDriverService.PHANTOMJS_GHOSTDRIVER_PATH_PROPERTY,
sConfig.getProperty("phantomjs_driver_path"));
} else {
System.out
.println("Test will use PhantomJS internal GhostDriver");
}
}
// Disable "web-security", enable all possible "ssl-protocols" and
// "ignore-ssl-errors" for PhantomJSDriver
// sCaps.setCapability(PhantomJSDriverService.PHANTOMJS_CLI_ARGS, new
// String[] {
// "--web-security=false",
// "--ssl-protocol=any",
// "--ignore-ssl-errors=true"
// });
ArrayList<String> cliArgsCap = new ArrayList<String>();
cliArgsCap.add("--web-security=false");
cliArgsCap.add("--ssl-protocol=any");
cliArgsCap.add("--ignore-ssl-errors=true");
sCaps.setCapability(PhantomJSDriverService.PHANTOMJS_CLI_ARGS,
cliArgsCap);
// Control LogLevel for GhostDriver, via CLI arguments
sCaps.setCapability(
PhantomJSDriverService.PHANTOMJS_GHOSTDRIVER_CLI_ARGS,
new String[] { "--logLevel="
+ (sConfig.getProperty("phantomjs_driver_loglevel") != null ? sConfig
.getProperty("phantomjs_driver_loglevel")
: "INFO") });
// String driver = sConfig.getProperty("driver", DRIVER_PHANTOMJS);
// Start appropriate Driver
if (isUrl(driver)) {
sCaps.setBrowserName("phantomjs");
mDriver = new RemoteWebDriver(new URL(driver), sCaps);
} else if (driver.equals(DRIVER_FIREFOX)) {
mDriver = new FirefoxDriver(sCaps);
} else if (driver.equals(DRIVER_CHROME)) {
ChromeOptions options = new ChromeOptions();
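// Headless mode; Google's documentation mentions adding disable-gpu to work around a bug
options.addArguments("headless");
options.addArguments("disable-gpu");
options.addArguments("disable-dev-shm-usage");
options.addArguments("disable-plugins");
// Disable Java
options.addArguments("disable-java");
// Disable the Chrome sandbox (usually needed when Chrome runs as root on a server)
options.addArguments("no-sandbox");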
// options.addArguments("user-agent=\"Mozilla/5.0 (iPod; U; CPU iPhone OS 2_1 like Mac OS X; ja-jp) AppleWebKit/525.18.1 (KHTML, like Gecko) Version/3.1.1 Mobile/5F137 Safari/525.20\"");
// Do not open a visible browser window
options.setHeadless(true);
mDriver = new ChromeDriver(options);
} else if (driver.equals(DRIVER_PHANTOMJS)) {
mDriver = new PhantomJSDriver(sCaps);
}
}
/**
* check whether input is a valid URL
*
* @param urlString urlString
* @return true means yes, otherwise no.
*/
private boolean isUrl(String urlString) {
try {
new URL(urlString);
return true;
} catch (MalformedURLException mue) {
return false;
}
}
/**
* store webDrivers created
*/
private List<WebDriver> webDriverList = Collections
.synchronizedList(new ArrayList<WebDriver>());
/**
* store webDrivers available
*/
private BlockingDeque<WebDriver> innerQueue = new LinkedBlockingDeque<WebDriver>();
public WebDriverPool(int capacity) {
this.capacity = capacity;
}
public WebDriverPool() {
this(DEFAULT_CAPACITY);
}
/**
*
* @return
* @throws InterruptedException
*/
public WebDriver get() throws InterruptedException {
checkRunning();
WebDriver poll = innerQueue.poll();
if (poll != null) {
return poll;
}
if (webDriverList.size() < capacity) {
synchronized (webDriverList) {
if (webDriverList.size() < capacity) {
// add new WebDriver instance into pool
try {
configure();
innerQueue.add(mDriver);
webDriverList.add(mDriver);
} catch (IOException e) {
e.printStackTrace();
}
// ChromeDriver e = new ChromeDriver();
// WebDriver e = getWebDriver();
// innerQueue.add(e);
// webDriverList.add(e);
}
}
}
return innerQueue.take();
}
public void returnToPool(WebDriver webDriver) {
checkRunning();
innerQueue.add(webDriver);
}
protected void checkRunning() {
if (!stat.compareAndSet(STAT_RUNNING, STAT_RUNNING)) {
throw new IllegalStateException("Already closed!");
}
}
public void closeAll() {
boolean b = stat.compareAndSet(STAT_RUNNING, STAT_CLODED);
if (!b) {
throw new IllegalStateException("Already closed!");
}
for (WebDriver webDriver : webDriverList) {
logger.info("Quit webDriver" + webDriver);
webDriver.quit();
webDriver = null;
}
}
}
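Stripped of the Selenium details, the pooling idea in WebDriverPool is: reuse an idle instance if there is one, create a new one while under capacity, otherwise wait for one to be returned. The sketch below shows that pattern with a generic type so it runs without a browser. `SimplePool`, `borrow`, and `giveBack` are hypothetical names, and unlike the listing above the new instance is handed straight to the caller instead of going through the queue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.function.Supplier;

/** Hypothetical bounded pool that creates instances lazily, up to a fixed capacity. */
class SimplePool<T> {
    private final int capacity;
    private final Supplier<T> factory;
    private final List<T> created = new ArrayList<>();               // every instance ever created
    private final BlockingDeque<T> idle = new LinkedBlockingDeque<>(); // instances free to borrow

    SimplePool(int capacity, Supplier<T> factory) {
        this.capacity = capacity;
        this.factory = factory;
    }

    /** Returns an idle instance, creates one if under capacity, otherwise blocks until one is returned. */
    T borrow() {
        T item = idle.poll();
        if (item != null) {
            return item;
        }
        synchronized (created) {
            if (created.size() < capacity) {
                T fresh = factory.get();
                created.add(fresh);
                return fresh;
            }
        }
        try {
            return idle.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt visible to callers
            throw new IllegalStateException("interrupted while waiting for a pooled instance", e);
        }
    }

    /** Makes a borrowed instance available again. */
    void giveBack(T item) {
        idle.add(item);
    }

    /** Number of instances created so far; never exceeds the capacity. */
    int createdCount() {
        synchronized (created) {
            return created.size();
        }
    }
}
```

In the real class the factory step corresponds to `configure()`, which starts a browser process, so keeping the capacity equal to the number of crawler threads avoids both idle browsers and blocked threads.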
3. Modify SeleniumDownloader
package com.nieyue.news.webmagic.downloader;
import org.openqa.selenium.By;
import org.openqa.selenium.Cookie;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Request;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.downloader.Downloader;
import us.codecraft.webmagic.selector.PlainText;
import java.io.Closeable;
import java.io.IOException;
import java.util.Map;
/**
* Uses Selenium to drive a browser for page rendering. Only Chrome is supported at the moment.
* A Selenium driver must be downloaded separately.
*/
public class SeleniumDownloader implements Downloader, Closeable {
private volatile WebDriverPool webDriverPool;
private Logger logger= LoggerFactory.getLogger(this.getClass());
private int sleepTime = 0;
private int poolSize = 1;
private static final String DRIVER_PHANTOMJS = "phantomjs";
/**
* Creates a downloader that uses ChromeDriver.
*
* @param chromeDriverPath path to the chromedriver executable
*/
public SeleniumDownloader(String chromeDriverPath) {
System.getProperties().setProperty("webdriver.chrome.driver",
chromeDriverPath);
}
/**
* Constructor without any field. Construct PhantomJS browser
*/
public SeleniumDownloader() {
// System.setProperty("phantomjs.binary.path",
// "/Users/Bingo/Downloads/phantomjs-1.9.7-macosx/bin/phantomjs");
}
/**
* set sleep time to wait until load success
*
* @param sleepTime sleepTime
* @return this
*/
public SeleniumDownloader setSleepTime(int sleepTime) {
this.sleepTime = sleepTime;
return this;
}
@Override
public Page download(Request request, Task task) {
checkInit();
WebDriver webDriver;
try {
webDriver = webDriverPool.get();
} catch (InterruptedException e) {
logger.warn("interrupted", e);
return null;
}
logger.info("downloading page " + request.getUrl());
try {
webDriver.get(request.getUrl());
try {
Thread.sleep(sleepTime);
} catch (InterruptedException e) {
// Restore the interrupt flag instead of swallowing the interruption
Thread.currentThread().interrupt();
}
WebDriver.Options manage = webDriver.manage();
Site site = task.getSite();
if (site.getCookies() != null) {
for (Map.Entry<String, String> cookieEntry : site.getCookies()
.entrySet()) {
Cookie cookie = new Cookie(cookieEntry.getKey(),
cookieEntry.getValue());
manage.addCookie(cookie);
}
}
/*
* TODO You can add mouse event or other processes
*
*/
WebElement webElement = webDriver.findElement(By.xpath("/html"));
String content = webElement.getAttribute("outerHTML");
Page page = new Page();
page.setRawText(content);
// page.setHtml(new Html(content, request.getUrl()));
page.setUrl(new PlainText(request.getUrl()));
page.setRequest(request);
return page;
} finally {
// Always return the driver to the pool, even if loading or parsing throws
webDriverPool.returnToPool(webDriver);
}
}
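private void checkInit() {
if (webDriverPool == null) {
synchronized (this) {
// Re-check inside the lock so that only one thread creates the pool
if (webDriverPool == null) {
webDriverPool = new WebDriverPool(poolSize);
}
}
}
}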
@Override
public void setThread(int thread) {
this.poolSize = thread;
}
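@Override
public void close() throws IOException {
// The pool is created lazily, so it is still null if nothing was ever downloaded
if (webDriverPool != null) {
webDriverPool.closeAll();
}
}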
}
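The downloader above waits a fixed `sleepTime` after every page load. An alternative worth considering is to poll for a condition and stop as soon as it holds, which is the idea behind Selenium's own `WebDriverWait`. The sketch below shows only the polling loop in plain Java; `Waits.waitUntil` is a hypothetical helper, and in a real downloader the condition would be something like "the target element exists", which is an assumption not covered here.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical polling helper: wait for a condition instead of sleeping a fixed time. */
class Waits {
    /**
     * Checks the condition immediately and then every pollMillis until it is true
     * or timeoutMillis has elapsed. Returns whether the condition became true.
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long start = System.nanoTime();
        long timeoutNanos = timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - start >= timeoutNanos) {
                return false; // gave up: the condition never held within the timeout
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt visible to callers
                return false;
            }
        }
    }
}
```

Pages that render quickly return early, and slow pages are bounded by the timeout rather than by a guess that has to fit every page.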
4. Add the selenium.properties configuration file
# What WebDriver to use for the tests
#driver=phantomjs
#driver=firefox
driver=chrome
#driver=http://localhost:8910
#driver=http://localhost:4444/wd/hub
# PhantomJS specific config (change according to your installation)
#phantomjs_exec_path=/Users/Bingo/bin/phantomjs-qt5
#phantomjs_exec_path=d:/phantomjs.exe
#chrome_exec_path=E:\\demo\\crawler\\chromedriver.exe
#phantomjs_driver_path=/Users/Bingo/Documents/workspace/webmagic/webmagic-selenium/src/main.js
#phantomjs_driver_loglevel=DEBUG
chrome_driver_loglevel=DEBUG
# Local
#chrome_driver_path=D://MyProject//chromedriver.exe
# Test environment
chrome_driver_path=/home/chrome/chromedriver
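To show how these keys are meant to be read, here is a small sketch that parses the same format with `java.util.Properties`. The key names `driver` and `chrome_driver_path` and the `phantomjs` default come from the listings in this post; the `SeleniumConfig` class itself and the fail-fast check are hypothetical additions for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical reader for the selenium.properties keys used in this post. */
class SeleniumConfig {
    /** Parses properties text; lines starting with '#' are comments and are ignored. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Returns the configured driver name, falling back to "phantomjs" as WebDriverPool does. */
    static String driver(Properties props) {
        return props.getProperty("driver", "phantomjs");
    }

    /** Returns chrome_driver_path, failing fast with a clear message when it is missing or blank. */
    static String requireChromeDriverPath(Properties props) {
        String path = props.getProperty("chrome_driver_path");
        if (path == null || path.trim().isEmpty()) {
            throw new IllegalStateException("chrome_driver_path is not set in selenium.properties");
        }
        return path.trim();
    }
}
```

Failing fast on a missing path gives a clearer error than passing null on to the `webdriver.chrome.driver` system property.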
5. Usage example
package com.nieyue.news.webmagic.processor;
import com.nieyue.news.bean.ArticleWebmagic;
import com.nieyue.news.webmagic.downloader.SeleniumDownloader;
import com.nieyue.news.webmagic.pipeline.ArticlePipeline;
import com.nieyue.news.webmagic.utils.WebmagicRedisUtil;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.scheduler.BloomFilterDuplicateRemover;
import us.codecraft.webmagic.scheduler.QueueScheduler;
import us.codecraft.webmagic.selector.Html;
import java.util.*;
/**
* ifeng.com (Phoenix New Media)
* PageProcessor implementation
*/
@Component
public class ArticleProcessor implements PageProcessor {
// Start page URL (recommended feed)
private static final String URL = "https://finance.ifeng.com/c/84cRLNKrrar";
private Logger logger= LoggerFactory.getLogger(this.getClass());
@Autowired
private WebmagicRedisUtil webmagicRedisUtil;
// Path to chromedriver.exe
private static final String address = "D:\\MyProject\\chromedriver.exe";
private Site site;
@Override
public Site getSite() {
if (site == null) {
site = Site.me()
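.setCharset("utf8") // Character set; use whatever the target site uses
.setSleepTime(3 * 1000) // Interval between fetches; all times are in milliseconds
.setTimeOut(5 * 1000) // Timeout
.setRetrySleepTime(3 * 1000) // Interval between retries
.setRetryTimes(3); // Number of retries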
}
return site;
}
/**
* The actual parsing logic
* @param page Page object that WebMagic builds automatically once the Downloader has fetched the page
*/
@Override
public void process(Page page) {
Html html = page.getHtml();
Document document = html.getDocument();
int select = document.select("div#root").select("div.layout-u18-agac").size();
if (select == 0 ){
// Detail page
// getElementsByAttributeValueContaining finds elements whose attribute `key` has a value containing `match`
Elements elements = document.getElementsByAttributeValueContaining("class", "main_content-");
if (elements != null && elements.size()>0){
Element element = elements.get(0);
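// Skip articles that contain restricted keywords (original content, no reprinting, reprinting prohibited, reprinting prohibited in any form, without permission)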
Elements words1 = element.getElementsContainingText("原创");
Elements words2 = element.getElementsContainingText("不得转载");
Elements words3 = element.getElementsContainingText("禁止转载");
Elements words4 = element.getElementsContainingText("禁止任何方式转载");
Elements words5 = element.getElementsContainingText("未经允许");
if ((words1 != null && words1.size()>0)||(words2 != null && words2.size()>0)|| (words3 != null && words3.size()>0)
||(words4 != null && words4.size()>0)||(words5 != null && words5.size()>0)){
logger.info("凤凰网->此界面包含敏感词:"+page.getRequest().getUrl());
return;
}
// Remove ads inside the article
element.select("div#embed_hzh_div").remove();
// Remove auto-playing video
element.getElementsByAttributeValueContaining("class", "video_box-").remove();
// Remove the bottom ad
element.getElementsByAttributeValue("style","position: relative;").remove();
// Object to be saved
ArticleWebmagic articleWebmagic = new ArticleWebmagic();
// Title
// String title = document.getElementsByAttributeValueContaining("class", "leftContent-").select("h1").text();
String title = document.select("h1").text();
// Images
Elements img = element.getElementsByTag("img");
if (img != null && img.size()>0){
StringBuilder sb = new StringBuilder();
// Three-image layout
if (img.size()>=3){
articleWebmagic.setImgMode(6);
for (int i = 0;i<3;i++){
String src = img.get(i).attr("src");
sb.append(src).append(",");
}
} else { // Single small image on the right
articleWebmagic.setImgMode(4);
String src = img.get(0).attr("src");
sb.append(src).append(",");
}
articleWebmagic.setImgAddress(sb.substring(0,sb.length()-1));
}
String url = page.getRequest().getUrl();
articleWebmagic.setTitle(title);
articleWebmagic.setContent(element.toString());
articleWebmagic.setUrl(url);
// Store the result so the Pipeline can pick it up
page.putField("articleWebmagic",articleWebmagic);
} else {
logger.info("凤凰网->此界面不满足:"+page.getRequest().getUrl());
}
} else {
// Hot news
Elements elements = document.select("div.hot_box-1yXFLW7e").select("div.news_list-1dYUdgWQ").get(0).select("a");
// Top stories
elements.addAll(document.select("div.center_box-2F8qYPeE").select("div.tabBodyItemActive-H7rMJtKB").select("a"));
// Military
elements.addAll(document.select("div.left_box-aXjri-Gu").select("div.news_list-1dYUdgWQ").select("a"));
// Technology
elements.addAll(document.select("div.center_box-_l_Nle8B").select("div.news_list-1dYUdgWQ").select("a"));
// Sports
elements.addAll(document.select("div.left_box-7AdOw5gz").select("div.news_list-1dYUdgWQ").select("a"));
// Entertainment
elements.addAll(document.select("div.center_box-2d2syNWk").select("div.news_list-1dYUdgWQ").select("a"));
// Fashion
elements.addAll(document.select("div.center_box-39hkxdBA").select("div.news_list-1dYUdgWQ").select("a"));
// Education
elements.addAll(document.select("div.left_box-3iQHsHjU").select("div.news_list-1dYUdgWQ").select("a"));
// Culture and books
elements.addAll(document.select("div.center_box-2ghWH00s").select("div.news_list-1dYUdgWQ").select("a"));
// URLs found on this run (entries already known are removed from it below)
List<String> article = new ArrayList<>();
// Untouched copy of the same URLs, written back to Redis afterwards
List<String> article1 = new ArrayList<>();
for(Element element : elements){
String url = element.attr("href");
article.add(url);
article1.add(url);
}
// Fetch the article URLs saved in Redis by the previous run
// webmagicRedisUtil.del("articleWebmagic");
List<String> articlebmagic = (List<String>) webmagicRedisUtil.get("articleWebmagic");
if (articlebmagic == null || articlebmagic.size() ==0){
webmagicRedisUtil.set("articleWebmagic",article);
for(String a : article){
page.addTargetRequest(a);
}
} else {
for(String article0 : articlebmagic){
Iterator<String> iterator1 = article.iterator();
while (iterator1.hasNext()){
String next = iterator1.next();
if (article0.equals(next)){
iterator1.remove();
}
}
}
// Request only the URLs that are new since the previous run
if (article !=null && article.size() > 0){
for(String url : article){
page.addTargetRequest(url);
}
}
// Update Redis with the full list from this run
webmagicRedisUtil.del("articleWebmagic");
webmagicRedisUtil.set("articleWebmagic",article1);
}
}
}
// public static void main(String[] args) {
// Run
// Spider.create(new ArticleProcessor())
// .addUrl(URL)
// .thread(5)
// // Custom Pipeline that saves to the database
// .addPipeline(new ArticlePipeline())
// /**
// * Set a sleep time for SeleniumDownloader:
// * when a page loads dynamically, some data may not have arrived yet; the sleep gives it time to finish loading
// */
// .setDownloader(new SeleniumDownloader(address).setSleepTime(3 * 1000))
// // Set the scheduler and de-duplication strategy (Bloom filter sized for an expected 10 * 1000 = 10,000 URLs)
// .setScheduler(new QueueScheduler().setDuplicateRemover(new BloomFilterDuplicateRemover(10 * 1000)))
// .run();
// }
}
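The list-page branch of `process` boils down to one operation: take the URLs found on this run, subtract the URLs stored by the previous run, and request only the difference. The nested iterator loop in the listing does this in quadratic time; the sketch below shows the same idea with a hash set. `UrlDiff.newUrls` is a hypothetical helper, and unlike the listing it also drops duplicates within the current list, which is an assumption about what is wanted.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper that computes which URLs are new compared with the previous run. */
class UrlDiff {
    /**
     * Returns the URLs in current that are not in previous, keeping their original order
     * and dropping repeats. A null previous list is treated as "nothing seen yet".
     */
    static List<String> newUrls(List<String> previous, List<String> current) {
        Set<String> known = previous == null ? new HashSet<>() : new HashSet<>(previous);
        Set<String> fresh = new LinkedHashSet<>(); // preserves insertion order
        for (String url : current) {
            if (!known.contains(url)) {
                fresh.add(url);
            }
        }
        return new ArrayList<>(fresh);
    }
}
```

In the processor, `previous` would be the list read from Redis and `current` the links collected from the page; each element of the result would then be passed to `page.addTargetRequest`.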
Crawling on a schedule
package com.nieyue.news.webmagic.schedule;
import com.nieyue.common.comments.lock.LockTemplate;
import com.nieyue.common.comments.lock.LockedCallback;
import com.nieyue.common.util.RedisUtil;
import com.nieyue.news.webmagic.downloader.SeleniumDownloader;
import com.nieyue.news.webmagic.pipeline.ArticlePipeline;
import com.nieyue.news.webmagic.processor.ArticleProcessor;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.boot.autoconfigure.condition.ConditionalOnProperty;
import org.springframework.scheduling.annotation.EnableScheduling;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import org.springframework.util.StringUtils;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.scheduler.BloomFilterDuplicateRemover;
import us.codecraft.webmagic.scheduler.QueueScheduler;
import java.io.IOException;
import java.util.Properties;
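/**
* Scheduled task that crawls ifeng.com articles automatically
*/
@Component
// Enable scheduled tasks
@EnableScheduling
// Only active when the scheduling.enabled property is set to true
@ConditionalOnProperty(prefix = "scheduling", name = "enabled", havingValue = "true")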
public class FengArticleScheduled {
private static final String URL = "https://www.ifeng.com/";
@Autowired
private ArticleProcessor articleProcessor;
@Autowired
private ArticlePipeline articlePipeline;
@Value("${server.port}")
private int serverPort;
private Logger logger= LoggerFactory.getLogger(this.getClass());
@Autowired
private LockTemplate lockTemplate;
// Runs three times a day, at 09:00, 12:00 and 15:00
@Scheduled(cron = "0 0 9,12,15 * * ?")
public void updateArticle() {
lockTemplate.doBiz(new LockedCallback<String>() {
@Override
public String callback() {
Properties sConfig = new Properties();
try {
sConfig.load(Thread.currentThread().getContextClassLoader().getResourceAsStream("selenium.properties"));
} catch (IOException e) {
e.printStackTrace();
}
// Run
Spider.create(articleProcessor)
.addUrl(URL)
// Custom Pipeline that saves to the database
.addPipeline(articlePipeline)
.thread(5)
/*
* Set a sleep time for SeleniumDownloader:
* when a page loads dynamically, some data may not have arrived yet; the sleep gives it time to finish loading
*/
.setDownloader(new SeleniumDownloader((String)sConfig.get("chrome_driver_path")).setSleepTime(3000))
// Set the scheduler and de-duplication strategy (Bloom filter sized for an expected 10 * 1000 = 10,000 URLs)
.setScheduler(new QueueScheduler().setDuplicateRemover(new BloomFilterDuplicateRemover(10 * 1000)))
.run();
logger.info("凤凰网文章定时任务执行完成,端口号: "+serverPort);
return "";
}
},"campusNewFengArticleScheduledUpdateArticleScene","","campusNewFengArticleScheduledUpdateArticleKey",10L,"锁异常campusNewFengArticleScheduledUpdateArticleSceneUniqueId");
}
}
References:
http://webmagic.io/docs/zh/
https://blog.csdn.net/qixinbruce/article/details/71105444
https://blog.csdn.net/panchang199266/article/details/85413746