1. Pull the images
I usually just pull the latest tags; if you need a specific version, pin the tag yourself.
docker pull python
docker pull redis
docker pull selenium/standalone-chrome
2. Write the spider
I'll only show the Dockerfile here:
# Use an official Python runtime as a parent image
FROM python
MAINTAINER yeyangfengqi <825681476@qq.com>
# Set the working directory to /app
WORKDIR /app
# Copy the current directory contents into the container at /app
COPY . /app
# Install any needed packages specified in requirements.txt
RUN pip install --trusted-host pypi.python.org -r requirements.txt
# Make port 80 available to the world outside this container
EXPOSE 80
# Define environment variable
# Run app.py when the container launches
CMD ["python", "-u", "spider.py"]
requirements.txt lists the packages for pip to install; its contents are:
requests
selenium
beautifulsoup4
redis
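For context, the Dockerfile above expects a spider.py sitting next to requirements.txt. The real spider isn't shown in this post, so the following is only a minimal hypothetical sketch of its shape (the URL and the Redis key are placeholders; selenium 3.x client API, matching the code later in this post), wiring together the pieces the later sections set up:

# minimal hypothetical spider.py: open one page through the remote Chrome,
# parse it with BeautifulSoup and push the result into Redis
import os

import redis
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

selenium_host = os.getenv("SELENIUM_HOST", "127.0.0.1")
redis_host = os.getenv("REDIS_HOST", "127.0.0.1")

browser = webdriver.Remote(
    command_executor="http://{}:4444/wd/hub".format(selenium_host),
    desired_capabilities=DesiredCapabilities.CHROME,
)
try:
    browser.get("https://example.com")  # placeholder URL
    soup = BeautifulSoup(browser.page_source, "html.parser")
    title = soup.title.string if soup.title else ""
    redis.StrictRedis(host=redis_host, port=6379, db=0).lpush("spider:results", title)  # placeholder key
    print("scraped:", title)
finally:
    browser.quit()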
3. Deploy Redis & Selenium
docker run -itd --name redis-test -p 6379:6379 redis
docker run -d --name selenium -p 4444:4444 --shm-size=2g selenium/standalone-chrome:3.141.59-zirconium
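Before pointing the spider at them, a quick sanity check from the host (assuming the port mappings above, selenium 3.x client API) confirms that both containers answer; this snippet is only an illustration:

# check that the redis and selenium containers answer on the mapped ports
import redis
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

r = redis.StrictRedis(host="127.0.0.1", port=6379, db=0)
print("redis ping:", r.ping())  # True when Redis is reachable

browser = webdriver.Remote(
    command_executor="http://127.0.0.1:4444/wd/hub",
    desired_capabilities=DesiredCapabilities.CHROME,
)
browser.get("https://example.com")
print("selenium title:", browser.title)
browser.quit()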
4. Deploy the spider
1. Build the image
docker build -t spider_name:test .
2. Get the Redis & Selenium IP addresses
These IPs are used in the run command below to set the Redis & Selenium environment variables. Anyone comfortable on Linux can probably do this in a single command (please share it in the comments, I'm a beginner).
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' redis-test
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' selenium
3. Run the spider container
Note: the hosts below must use the IPs obtained above!
docker run -itd --name spider_name -e SELENIUM_HOST=172.18.0.3 -e REDIS_HOST=172.18.0.2 -e PYTHONUNBUFFERED=0 spider_name:test /bin/bash
5. Read environment variables inside the spider to connect to Redis & Selenium
I added a bit of fallback logic here for running locally; drop it if you don't need it.
import os

import redis
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# selenium: read the host from the environment, fall back to localhost for local runs
selenium_host = os.getenv("SELENIUM_HOST", "127.0.0.1")
browser = webdriver.Remote(
    command_executor="http://{}:4444/wd/hub".format(selenium_host),
    desired_capabilities=DesiredCapabilities.CHROME
)

# redis: same pattern, environment variable with a local fallback
def get_redis_connect():
    redis_host = os.environ.get("REDIS_HOST", "127.0.0.1")
    return redis.StrictRedis(host=redis_host, port=6379, db=0)
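As a quick usage sketch (the queue key "spider:urls" is only an example name), the remote browser and the Redis connection from above can then be combined like this:

# push a URL into a redis list, pop it again and load it in the remote Chrome
conn = get_redis_connect()
conn.lpush("spider:urls", "https://example.com")

raw_url = conn.rpop("spider:urls")  # redis-py returns bytes by default
if raw_url:
    browser.get(raw_url.decode("utf-8"))
    print(browser.title)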
6. Fix docker logs -f showing no output
Replace every print with get_module_logger(name).info().
That alone turned out not to help, so don't bother trying it; if anyone has a working approach, please share, I tried several without luck.
As a fallback I write everything to a log file as well.
Update: the docker logs -f problem is solved now (unbuffered output via python -u / the PYTHONUNBUFFERED variable takes care of it),
and while I was at it I also fixed the duplicate-output problem in the logging module.
import logging

def get_module_logger(mod_name):
    # initialise the logger; basicConfig attaches a console handler (with this format) to the root logger
    logger = logging.getLogger(mod_name)
    logging.basicConfig(format='%(asctime)s - %(name)s - %(levelname)s - %(message)s', level=logging.INFO)
    # two handlers: one for the console and one for the log file
    to_file_handler = logging.FileHandler("/app/logs/agoda_spider.log")
    console_handler = logging.StreamHandler()
    formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
    to_file_handler.setFormatter(formatter)
    # attach the handlers only once, otherwise repeated calls would stack them
    if not logger.handlers:
        logger.addHandler(to_file_handler)
        logger.addHandler(console_handler)
    # drop the console handler again: console output already arrives through the root
    # handler set up by basicConfig, so keeping both would print every record twice
    logger.removeHandler(console_handler)
    return logger

if __name__ == "__main__":
    get_module_logger(__name__).warning("HELLO WORLD!")
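A small usage sketch (the module name below is just an example): calling the factory repeatedly reuses the existing handlers instead of stacking new ones, so each record shows up once on the console and once in the file:

log = get_module_logger("agoda_spider.detail")
log.info("parsing detail page 1")

log = get_module_logger("agoda_spider.detail")  # second call: handlers already attached, nothing duplicated
log.info("parsing detail page 2")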
7. Fix slow pip installs and timeout errors in the Python image
Point pip at the Aliyun mirror (-i plus --trusted-host).
I'll admit installs are still on the slow side after the change, but at least the timeout errors are gone.
Also add the PYTHONPATH environment variable, otherwise the modules can't be found; that error cost me a whole day.
Final version: vim is installed so I can work inside the container by hand,
and the timezone problem is fixed as well.
# Use an official Python runtime as a parent image
FROM python:3.7.6
MAINTAINER yeyangfengqi <825681476@qq.com>
# Set the working directory to /app
WORKDIR /app
# Copy the current directory contents into the container at /app
COPY . /app
# Install any needed packages specified in requirements.txt
RUN apt-get update && apt-get install -y vim tzdata \
    && ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime \
    && pip install -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com -r requirements.txt
# Make port 80 available to the world outside this container
EXPOSE 80
# Define environment variables
ENV PYTHONPATH /app
# any non-empty value enables unbuffered output, so docker logs -f shows prints immediately
ENV PYTHONUNBUFFERED 1
# declare the log directory; bind-mount the host path at run time, e.g.
# docker run -v /root/python_script/spider/logs:/app/logs ...
VOLUME ["/app/logs"]
# Run app.py when the container launches
CMD bash -c "python agoda_spider.py"
That covers basically all the problems; the code is already live and being tested, so I'm just writing down these pitfalls.
P.S. I used to think Docker was a hassle. Now........ I'm a convert!