10. Scrapy Framework in Practice: Lieyunwang Crawler (3)
Saving data to MySQL asynchronously
1. Use twisted.enterprise.adbapi to create a connection pool.
def __init__(self, mysql_config):
    self.dbpool = adbapi.ConnectionPool(
        mysql_config['DRIVER'],
        host=mysql_config['HOST'],
        port=mysql_config['PORT'],
        user=mysql_config['USER'],
        password=mysql_config['PASSWORD'],
        db=mysql_config['DATABASE'],
        charset='utf8'  # pymysql expects 'utf8', not 'utf-8'
    )
@classmethod
def from_crawler(cls, crawler):
    mysql_config = crawler.settings['MYSQL_CONFIG']
    return cls(mysql_config)
2. In the function that inserts the data, use runInteraction to run the function that actually executes the SQL statement.
def process_item(self, item, spider):
    # Besides the function that runs the SQL, runInteraction can also be
    # given extra arguments, which are passed on to that function.
    result = self.dbpool.runInteraction(self.insert_item, item)
    # If an error occurs, self.insert_error is called.
    result.addErrback(self.insert_error)
    return item

def insert_item(self, cursor, item):
    sql = "insert into article(id, title, author, pub_time, content, origin) values(null, %s, %s, %s, %s, %s)"
    args = (item['title'], item['author'], item['pub_time'], item['content'], item['origin'])
    cursor.execute(sql, args)

def insert_error(self, failure):
    print("=" * 30)
    print(failure)
    print("=" * 30)
3. In the function that executes the SQL, the first parameter after self is the cursor object; use it to run the SQL statement.
Scrapy Shell:
From the command line, change into the project directory and run: scrapy shell <url>
Inside the shell you can try out your extraction rules first; once they work, copy the code into the project. This makes writing extraction code much easier, as the session sketched below shows.
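A minimal session might look like the following; the URL and the XPath expression are illustrative placeholders, not taken from the original article:

scrapy shell https://www.lieyunwang.com/archives/example
>>> response.status
>>> response.xpath("//h1/text()").get()  # prototype a rule against the fetched page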
Example pipelines.py code:
from twisted.enterprise import adbapi


class LywPipeline(object):
    def __init__(self, mysql_config):
        self.dbpool = adbapi.ConnectionPool(
            mysql_config['DRIVER'],
            host=mysql_config['HOST'],
            port=mysql_config['PORT'],
            user=mysql_config['USER'],
            password=mysql_config['PASSWORD'],
            db=mysql_config['DATABASE'],
            charset='utf8'  # pymysql expects 'utf8', not 'utf-8'
        )

    @classmethod
    def from_crawler(cls, crawler):
        # Once from_crawler is overridden, Scrapy calls it to create
        # the pipeline object.
        mysql_config = crawler.settings['MYSQL_CONFIG']
        return cls(mysql_config)

    def process_item(self, item, spider):
        result = self.dbpool.runInteraction(self.insert_item, item)
        result.addErrback(self.insert_error)
        return item

    def insert_item(self, cursor, item):
        sql = "insert into article(id, title, author, pub_time, content, origin) values(null, %s, %s, %s, %s, %s)"
        args = (item['title'], item['author'], item['pub_time'], item['content'], item['origin'])
        cursor.execute(sql, args)

    def insert_error(self, failure):
        print("=" * 30)
        print(failure)
        print("=" * 30)

    def close_spider(self, spider):
        self.dbpool.close()
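The pipeline reads five fields from each item. The project's items.py is not shown in the original; a minimal sketch that would match the pipeline might look like this (the class name LywItem is an assumption):

import scrapy

class LywItem(scrapy.Item):
    # These five fields are exactly the ones the pipeline inserts.
    title = scrapy.Field()
    author = scrapy.Field()
    pub_time = scrapy.Field()
    content = scrapy.Field()
    origin = scrapy.Field()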
Add to settings.py:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'
}

ITEM_PIPELINES = {
    'lyw.pipelines.LywPipeline': 300,
}

MYSQL_CONFIG = {
    'DRIVER': "pymysql",
    'HOST': "127.0.0.1",
    'PORT': 3306,  # the port must be an integer, not a string
    'USER': "root",
    'PASSWORD': "root",
    'DATABASE': "lieyunwang"
}
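The insert statement above assumes an article table with an auto-increment id column. The original schema is not given; a sketch that would satisfy the insert (the column types are assumptions) could be:

CREATE TABLE article (
    id INT PRIMARY KEY AUTO_INCREMENT,
    title VARCHAR(255),
    author VARCHAR(100),
    pub_time DATETIME,
    content TEXT,
    origin VARCHAR(255)
) DEFAULT CHARSET = utf8;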
Previous article: Chapter 6, Scrapy Framework (9), 2020-03-11:
https://www.jianshu.com/p/b2cf81332955
Next article: Chapter 6, Scrapy Framework (11), 2020-03-13:
https://www.jianshu.com/p/5a4b5bc44a99
The material above was collected from the internet and is shared for learning and exchange only; if it infringes on your rights, please message me and I will remove it. Thank you.