A web crawler over HTTP

My manager asked me to build a feature that scrapes Taobao data, but Taobao is fairly hard to crawl, so I am starting with Sina news.

Environment: Apache provides the HttpClient source and JAR packages for free download from the Apache HttpComponents project page. I am using version 4.5.1.
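If your project uses Maven instead of a manually downloaded JAR, the same library can be pulled in with a dependency declaration (coordinates shown for the 4.5.1 release on Maven Central):

```xml
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.1</version>
</dependency>
```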

Following the examples Apache provides, I put together the program below, using regular expressions to pull out the content.


```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class Main {
    
    // Fetch a detail page and print out its title and paragraph text.
    public static void Detail(String url) throws Exception {
        CloseableHttpClient httpclient = HttpClients.createDefault();
        String oldStr;
        try {
            HttpGet httpget = new HttpGet(url);
            // Sina detail pages are GBK-encoded, while the comment pages use UTF-8.
            String encoding = "gbk";
            if (url.contains("comments")) {
                encoding = "utf-8";
            }
            System.out.println(encoding);
            System.out.println("Executing request " + httpget.getURI());
            CloseableHttpResponse response = httpclient.execute(httpget);
          
            try {
                System.out.println("----------------------------------------");
                System.out.println(response.getStatusLine());
                HttpEntity entity = response.getEntity();
                oldStr = EntityUtils.toString(entity, encoding);
                // The entity has been fully consumed, so the request
                // does not need to be aborted.
            } finally {
                response.close();
            }
        } finally {
            httpclient.close();
        }

        // Grab the page title; substring(7, length - 8) strips the
        // "<title>" and "</title>" tags.
        Pattern pattern = Pattern.compile("<title>[^<]*</title>");
        Matcher matcher = pattern.matcher(oldStr);
        if(matcher.find()){
            String str = matcher.group();
            str = str.substring(7,str.length()-8);
            System.out.println("---"+str);
        }
        
        // Collect paragraph text; substring(3, length - 4) strips "<p>" and "</p>".
        pattern = Pattern.compile("<p>[^<]*</p>");
        matcher = pattern.matcher(oldStr);
        while(matcher.find()){
            String str = matcher.group();
            str = str.substring(3,str.length()-4);
            System.out.println(str);
        }

    }

     

    
    public static void main(String[] args) throws Exception {
        CloseableHttpClient httpclient = HttpClients.createDefault();
        String oldStr;
        try {
            
            String str = "http://news.sina.com.cn/hotnews/";
            HttpGet httpget = new HttpGet(str);
            System.out.println("Executing request " + httpget.getURI());
            CloseableHttpResponse response = httpclient.execute(httpget);
            try {
                System.out.println("----------------------------------------");
                System.out.println(response.getStatusLine());
                HttpEntity entity = response.getEntity();
                oldStr = EntityUtils.toString(entity, "UTF-8");
                // The entity has been fully consumed, so the request
                // does not need to be aborted.
            } finally {
                response.close();
            }
        } finally {
            httpclient.close();
        }
        // Match href='http://…' links; substring(6, length - 1) strips the
        // leading "href='" and the trailing quote.
        Pattern pattern = Pattern.compile("href='http://[^']*'");
        Matcher matcher = pattern.matcher(oldStr);
        int i = 1;
        while(matcher.find()){
            String str = matcher.group();
            str = str.substring(6,str.length()-1);
            System.out.println(str);
            Detail(str);
            System.out.println(i++);
        }
    }
}
```
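The substring offsets in the code above (7 and 8 for the `<title>` tags, 3 and 4 for `<p>`, 6 and 1 for `href`) are easy to get wrong. A capture group lets the regex engine hand back just the inner text, with no offset arithmetic. A minimal offline sketch (the sample HTML string is made up for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    // Return the text inside <title>…</title>, or null if there is none.
    // group(1) refers to the parenthesized part of the pattern, so no
    // substring offsets are needed.
    static String extractTitle(String html) {
        Matcher m = Pattern.compile("<title>([^<]*)</title>").matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String sample = "<html><head><title>Demo Page</title></head></html>";
        System.out.println(extractTitle(sample)); // prints: Demo Page
    }
}
```

The same idea applies to the link pattern: `href='(http://[^']*)'` with `group(1)` would replace the `substring(6, str.length() - 1)` call.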