Background:
A couple of days ago the boss handed me the main-site URL of a website and asked me to do information gathering on it.
Fine. My usual approach would be to first get a rough picture of the site's content, then enumerate subdomains, check the whois record, google for sensitive information about the site, grab banner information for the server, application, and components, and finally see which ports the server has open and which services are running...
It turned out I was being naive: a Baidu search on the site's keywords showed that it had plenty of clones, so I was stuck miserably clicking through the results page by page, collecting the related URLs and looking up each site's whois record and IP attribution. Right then I decided I had to write a script to automate this time-consuming, mindless work. Hence the attempt below (with a couple of follow-up sketches after the script):
Feature 1:
Crawl the links returned by a Baidu keyword search:
# -*- coding: utf-8 -*-
# coding by gooyii 2016/09/25
import urllib2 as url
import urllib
import re
from pyquery import PyQuery as py
links = []    # de-duplicated result links collected so far
pagee = []    # Baidu pagination URLs discovered so far
visited = []  # pagination URLs that have already been fetched
def baidu_Search(keyword):
    # Request the Baidu result page for the keyword and return its HTML
    p = {'wd': keyword}
    res = url.urlopen("http://www.baidu.com/s?" + urllib.urlencode(p))
    html = res.read()
    return html
def surf(URL):
    # Fetch an arbitrary URL and return the response body
    res = url.urlopen(URL)
    html = res.read()
    return html
def get_List(regex, text):
    # Helper: return every match of regex found in text (not used below yet)
    arr = []
    res = re.findall(regex, text)
    if res:
        for r in res:
            arr.append(r)
    return arr
def get_links(html):
    # Pull the result links out of the <h3 class="t"> blocks on a result page
    py_html = py(html)
    h3s = py_html('.t')
    for h3 in h3s.items():
        h3_hrefs = h3('a')
        for h3_href in h3_hrefs.items():
            if h3_href.attr('href') not in links:
                links.append(h3_href.attr('href'))
    for link in links:
        print link
    return links
def get_pages(html):
    # Collect the pagination links from the #page block of a result page
    py_html = py(html)
    pages = py_html('#page')
    for p in pages.items():
        page_hrefs = p('a')
        for page_hrefs_a in page_hrefs.items():
            next_page = page_hrefs_a.attr('href')
            URL = "http://www.baidu.com" + next_page
            if URL not in pagee:
                pagee.append(URL)
    return pagee
search_html = baidu_Search("天下无贼")
visits = get_pages(search_html)          # visits is the shared pagee list
for visit in visits:
    if visit not in visited:
        print("LINK: %s" % visit)
        visited.append(visit)            # remember this page so it is fetched only once
        html = surf(visit)
        # get_pages() appends newly discovered pagination URLs to the shared
        # list, so the loop walks through those pages as well
        get_pages(html)
        get_links(html)
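One thing worth noting about the output: the hrefs scraped from Baidu result pages are usually redirect links rather than the real site addresses. A possible follow-up step, not part of the script above, is to open each collected link and keep the final hostname. A minimal sketch, reusing the links list and the url (urllib2) alias from the script; the helper name collect_targets is mine:

from urlparse import urlparse

def collect_targets(link_list):
    # Follow each (possibly redirected) result link and keep the real hostname
    hosts = []
    for link in link_list:
        try:
            final_url = url.urlopen(link).geturl()  # geturl() returns the URL after redirects
        except Exception:
            continue
        host = urlparse(final_url).hostname
        if host and host not in hosts:
            hosts.append(host)
    return hosts

for host in collect_targets(links):
    print host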
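And for the whois / IP-attribution part mentioned in the background, something along these lines could be bolted on for each hostname collected. This is only a sketch under assumptions: it relies on the third-party python-whois package (whois.whois()) plus the standard socket module, and lookup() and the example domain are illustrative, not from the original script:

import socket
import whois   # third-party package: pip install python-whois

def lookup(domain):
    # Resolve the domain to an IP and pull its whois record
    try:
        ip = socket.gethostbyname(domain)
    except socket.error:
        ip = None
    try:
        registrar = whois.whois(domain).registrar
    except Exception:
        registrar = None
    return ip, registrar

ip, registrar = lookup("example.com")   # illustrative domain
print("ip=%s registrar=%s" % (ip, registrar))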