1. Set up the Python environment; see "Python 3 installation (CentOS)" for details.
2. Install Docker:
yum install -y docker
3. Configure a domestic (China) registry mirror:
Edit daemon.json in the Docker configuration directory (default /etc/docker/):
vim /etc/docker/daemon.json
Add the following content:
{
  "registry-mirrors": [
    "https://kfwkfulq.mirror.aliyuncs.com",
    "https://2lqq34jg.mirror.aliyuncs.com",
    "https://pee6w651.mirror.aliyuncs.com",
    "https://registry.docker-cn.com",
    "http://hub-mirror.c.163.com"
  ],
  "dns": ["8.8.8.8", "8.8.4.4"]
}
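If Docker was already running when you edited daemon.json, restart it so the mirror settings take effect (otherwise the start in the next step will pick them up):
systemctl restart docker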
4. Start Docker:
systemctl start docker
5. Pull the Splash image:
docker pull scrapinghub/splash
6. Run Splash:
docker run -d -p 8050:8050 scrapinghub/splash
(If you are on an Alibaba Cloud server, remember to open port 8050 in the security group.)
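A quick way to verify Splash is up is to hit its render.html endpoint (part of Splash's HTTP API; the URL here is just a placeholder):
curl "http://localhost:8050/render.html?url=http://example.com"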
7. Tune MySQL to avoid "Too many connections" errors (skip this if you have already done it; see "Fixing the MySQL 'Too many connections' Error" for details):
vim /etc/my.cnf
Add under the [mysqld] section:
max_connections=1000
wait_timeout=100
interactive_timeout=100
max_allowed_packet=15M
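Then restart MySQL and verify the new limit (the service name may be mysqld or mariadb depending on your installation):
systemctl restart mysqld
mysql -uroot -p -e "SHOW VARIABLES LIKE 'max_connections';"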
8. Install scrapyd:
pip install scrapyd --upgrade
9. Create symlinks:
ln -s /usr/local/python3/bin/scrapy /usr/bin/scrapy
ln -s /usr/local/python3/bin/scrapyd /usr/bin/scrapyd
ln -s /usr/local/python3/bin/twist /usr/bin/twist
ln -s /usr/local/python3/bin/twistd /usr/bin/twistd
10. Edit the scrapyd config file to allow remote connections:
vim /usr/local/python3/lib/python3.6/site-packages/scrapyd/default_scrapyd.conf
Change bind_address (127.0.0.1 by default) to:
bind_address = 0.0.0.0
11. Start scrapyd (it listens on port 6800; again, remember the Alibaba Cloud security group):
nohup scrapyd &
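You can confirm scrapyd is running via its daemonstatus.json endpoint:
curl http://localhost:6800/daemonstatus.json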
12. Create a directory to store scrapy spider logs (this path will be configured in the spider's settings.py):
mkdir -p /var/log/spider/log
(The following steps are performed on your local machine.)
13. Install scrapyd-client:
pip install scrapyd-client
14. Configure scrapy.cfg and point it at the remote scrapyd server;
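A minimal scrapy.cfg sketch, assuming a project named myproject and a server at 1.2.3.4 (replace both with your own values):
[settings]
default = myproject.settings

[deploy]
url = http://1.2.3.4:6800/
project = myproject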
15. In the spider's settings.py, configure sys_evn and LOG_FILE_DIR (the LOG_FILE variable in settings.py must be hard-coded to the path created in step 12). A sketch of the LOG_FILE line follows.
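A minimal sketch of the relevant settings.py line, assuming the log lives under the directory from step 12 (sys_evn and LOG_FILE_DIR are this project's own custom settings, not built-in Scrapy ones; LOG_FILE is the standard Scrapy setting, and the file name myspider.log is a placeholder):
# LOG_FILE is a built-in Scrapy setting; hard-code it to the path from step 12
LOG_FILE = '/var/log/spider/log/myspider.log'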
16. From the directory containing scrapy.cfg, run the scrapyd-deploy command:
scrapyd-deploy
If the command is not found (typically on Windows), create a scrapyd-deploy.bat file in the environment's Scripts\ directory with the following content:
@echo off
"D:\anaconda3-5.0.1\envs\py36\python.exe" "D:\anaconda3-5.0.1\envs\py36\Scripts\scrapyd-deploy" %1 %2 %3 %4 %5 %6 %7 %8 %9
(Adjust the paths to match your own installation.)
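If you gave the deploy target a name in scrapy.cfg (e.g. a [deploy:remote] section instead of plain [deploy]), pass it explicitly along with the project name:
scrapyd-deploy remote -p myproject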
17. Use curl to call the scrapyd API to start and stop spiders.
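For reference, these are the standard scrapyd endpoints (1.2.3.4, myproject, and myspider are placeholders; schedule.json returns the jobid that cancel.json expects):
# start a spider run; the response contains a jobid
curl http://1.2.3.4:6800/schedule.json -d project=myproject -d spider=myspider
# cancel a running job using that jobid
curl http://1.2.3.4:6800/cancel.json -d project=myproject -d job=<jobid>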