<strong>Let's start with an example:</strong>
<code>TABLES['urls'] = (
    "CREATE TABLE `urls` ("
    "  `index` int(11) NOT NULL AUTO_INCREMENT,"  # position in the queue
    "  `url` varchar(512) NOT NULL,"
    "  `md5` varchar(16) NOT NULL,"
    "  `status` varchar(11) NOT NULL DEFAULT 'new',"  # one of: new, downloading, finish
    "  `depth` int(11) NOT NULL,"
    "  `queue_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,"
    "  `done_time` timestamp NOT NULL DEFAULT 0 ON UPDATE CURRENT_TIMESTAMP,"
    "  PRIMARY KEY (`index`),"
    "  UNIQUE KEY `md5` (`md5`)"
    ") ENGINE=InnoDB")
</code>
<ul>
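<li><strong>Let's go through the columns one by one:</strong></li>
<li><code>index</code> is the auto-incrementing primary key;</li>
<li><code>md5</code> (the hash of the url) is declared UNIQUE, which guarantees that the same url is never queued twice. De-duplication is done by the database itself instead of a separate filter, and the unique key is backed by an index (a B-tree in InnoDB), so checking whether a url already exists is an index lookup. Without the UNIQUE key every check would scan the whole table;</li>
<li><code>status</code> is required to mark whether a url is new (new), is being crawled right now, or has already been crawled (done; the code comment above names these values new, downloading and finish). Without it, several crawler processes could pick the same url at the same time, which we do not want;</li>
<li><code>depth</code> records how many link levels away from the start the crawler found this url;</li>
<li><code>queue_time</code> is the time the url was added to the queue;</li>
<li><code>done_time</code> is the time the crawl of this url finished.</li></ul>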
<em>It is best to declare every column NOT NULL to avoid errors.</em>
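The de-duplication described above can be sketched roughly as follows. This is a minimal illustration, not the crawler's actual code: it uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server, the index column is renamed idx, and the helpers url_key and enqueue are hypothetical names. Taking the middle 16 characters of the MD5 hex digest is an assumption made only to fit the varchar(16) column.

```python
import hashlib
import sqlite3


def url_key(url: str) -> str:
    """Return a 16-character key derived from the MD5 of the url."""
    # Assumption: keep the middle 16 of the 32 hex characters so the
    # value fits a varchar(16) column.
    return hashlib.md5(url.encode("utf-8")).hexdigest()[8:-8]


def enqueue(conn: sqlite3.Connection, url: str, depth: int) -> bool:
    """Add url to the queue; return False if it was already queued."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO urls (url, md5, depth) VALUES (?, ?, ?)",
                (url, url_key(url), depth),
            )
        return True
    except sqlite3.IntegrityError:
        # The UNIQUE constraint on md5 rejected a duplicate url.
        return False


# In-memory SQLite database standing in for the MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE urls ("
    " idx INTEGER PRIMARY KEY AUTOINCREMENT,"
    " url TEXT NOT NULL,"
    " md5 TEXT NOT NULL UNIQUE,"
    " status TEXT NOT NULL DEFAULT 'new',"
    " depth INTEGER NOT NULL)"
)

print(enqueue(conn, "http://example.com/a", 0))  # True
print(enqueue(conn, "http://example.com/a", 1))  # False
```

With MySQL the idea should carry over unchanged: the duplicate insert is rejected by the unique key and the driver reports it as an integrity error, so no separate in-memory filter is needed.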
The overall database workflow can be summarized as:
<em><strong>read → update the status, locking the urls taken by this process → commit (on the connection) → unlock</strong></em>
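The read, update, commit sequence can be sketched like this. Again this is only an illustration under stated assumptions: sqlite3 stands in for MySQL, BEGIN IMMEDIATE plays the role of the lock, the table is reduced to the columns needed here, and claim_next_url is a hypothetical helper name. On MySQL with InnoDB, a comparable effect would typically be obtained with SELECT ... FOR UPDATE inside a transaction.

```python
import sqlite3


def claim_next_url(conn: sqlite3.Connection):
    """Pick one 'new' url and mark it 'downloading' in one transaction.

    Returns (idx, url), or None when no new url is left.
    """
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT idx, url FROM urls WHERE status = 'new' "
            "ORDER BY idx LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE urls SET status = 'downloading' WHERE idx = ?",
                (row[0],),
            )
        conn.commit()  # called on the connection; this releases the lock
        return row
    except Exception:
        conn.rollback()
        raise


# isolation_level=None: transactions are started only by the explicit BEGIN.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE urls ("
    " idx INTEGER PRIMARY KEY AUTOINCREMENT,"
    " url TEXT NOT NULL,"
    " status TEXT NOT NULL DEFAULT 'new')"
)
conn.executemany(
    "INSERT INTO urls (url) VALUES (?)",
    [("http://example.com/a",), ("http://example.com/b",)],
)

first = claim_next_url(conn)
print(first)  # (1, 'http://example.com/a')
```

Because the status change is committed before the lock is released, a second process running the same function at the same time would see the first url as downloading and move on to the next one.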