Scraping LinkedIn public pages with selenium and playwright (Java version)

I. selenium

1. Add the dependency to the pom file:

       <dependency>
           <groupId>org.seleniumhq.selenium</groupId>
           <artifactId>selenium-java</artifactId>
           <version>3.141.59</version>
       </dependency>

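2. On Windows, download the chromedriver whose version matches your installed Chrome browser; skip this step if you already have it.

On Linux, install google-chrome-stable, then download the chromedriver whose version matches the browser.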

After google-chrome-stable installs successfully, the result looks like this:

[image: google-chrome-stable installed successfully]

After extracting chromedriver, give it execute permission: chmod +x /opt/chrome/chromedriver
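
The version match between browser and driver can be checked in code before launching. The sketch below is a minimal illustration, assuming that the outputs of `google-chrome-stable --version` and `chromedriver --version` each contain a dotted version number and that agreement on the major version is the check you care about. `DriverVersionCheck` and its methods are hypothetical names, and the sample strings are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: compares the major versions reported by Chrome and chromedriver. */
final class DriverVersionCheck {
    // Matches the first dotted number, capturing the part before the first dot
    private static final Pattern VERSION = Pattern.compile("(\\d+)\\.\\d+");

    /** Returns the major version found in a --version output line, or -1 if none is found. */
    static int majorVersion(String versionOutput) {
        Matcher m = VERSION.matcher(versionOutput);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    /** True when both outputs contain a version and the major versions agree. */
    static boolean sameMajor(String chromeOutput, String driverOutput) {
        int chrome = majorVersion(chromeOutput);
        return chrome != -1 && chrome == majorVersion(driverOutput);
    }

    public static void main(String[] args) {
        // Illustrative strings, not captured from a real machine
        String chrome = "Google Chrome 110.0.5481.77";
        String driver = "ChromeDriver 110.0.5481.77";
        System.out.println(sameMajor(chrome, driver));
    }
}
```

Running such a check at startup turns a version mismatch into a clear message instead of a failed browser launch.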

3. Code

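        import org.openqa.selenium.By;
        import org.openqa.selenium.WebElement;
        import org.openqa.selenium.chrome.ChromeDriver;
        import org.openqa.selenium.chrome.ChromeOptions;

        // Path to the downloaded chromedriver
        System.setProperty("webdriver.chrome.driver", "C:\\Program Files\\Google\\Chrome\\Application\\chromedriver.exe");

        ChromeOptions chromeOptions = new ChromeOptions();
        chromeOptions.addArguments("--disable-dev-shm-usage");
        chromeOptions.addArguments("--incognito");        // incognito mode: outside incognito, LinkedIn redirects a user's public page to the login page
        chromeOptions.addArguments("--disable-gpu");      // disable the GPU; used together with headless mode when deployed to a server
        chromeOptions.addArguments("--headless");         // run the browser in headless mode
        chromeOptions.addArguments("--no-sandbox");
        chromeOptions.addArguments("--proxy-server=http://localhost:7890");   // LinkedIn is reached through a proxy here; without it only a stripped-down page with no useful information comes back
        ChromeDriver chromeDriver = new ChromeDriver(chromeOptions);
        chromeDriver.get(url);                            // open the url

        Thread.sleep(9500);                               // give the page time to finish loading; tune this to your network speed

        // findElement locates an element by class, id, tag, xpath, text and so on; the text of the located element can then be scraped
        WebElement element = chromeDriver.findElement(By.className("contextual-sign-in-modal__modal-dismiss-icon"));

        // Text of the element with class="top-card-layout__headline", i.e. the user's headline; other fields work the same way
        String headline = chromeDriver.findElement(By.className("top-card-layout__headline")).getText();
        // Run a JS snippet that scrolls the page to the given element. Images only load their real source once they are
        // scrolled into view; otherwise only the default placeholder image is read.
        chromeDriver.executeScript("arguments[0].scrollIntoView();", element);
        Thread.sleep(50);
        // LinkedIn work experience entries have an expand button: locate the entry, scroll it into view with the JS call
        // above, then click the button to get the expanded text (`element` here is the element located for that entry)
        if (element.getText().contains("展开") || element.getText().contains("Show more")) {
            element.findElement(By.className("show-more-less-text__button")).click();
        }
        ...
        chromeDriver.quit();          // close the browser and release resources when done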

The expand button in a LinkedIn work experience entry, before and after expanding:

[image: before expanding]

[image: after expanding]

Which fields to extract, and how, depends on your needs; adapt the code accordingly.
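
The code above hard-codes the Windows chromedriver path, while the Linux setup in step 2 places the driver at /opt/chrome/chromedriver. One way to run the same program on both systems is to pick the path from the `os.name` system property. This is only a sketch, assuming the two install locations used in this article; `DriverPath` and `forOs` are hypothetical names, not part of selenium.

```java
import java.util.Locale;

/** Hypothetical helper: picks the chromedriver path for the current operating system. */
final class DriverPath {
    // Both paths are the ones used in this article; adjust them to your own installation
    static final String WINDOWS_PATH = "C:\\Program Files\\Google\\Chrome\\Application\\chromedriver.exe";
    static final String LINUX_PATH = "/opt/chrome/chromedriver";

    /** Returns the Windows path for Windows OS names and the Linux path for everything else. */
    static String forOs(String osName) {
        boolean windows = osName.toLowerCase(Locale.ROOT).startsWith("windows");
        return windows ? WINDOWS_PATH : LINUX_PATH;
    }

    public static void main(String[] args) {
        String path = forOs(System.getProperty("os.name"));
        // Same system property the selenium snippet sets before creating the ChromeDriver
        System.setProperty("webdriver.chrome.driver", path);
        System.out.println(path);
    }
}
```

Passing the OS name in as a parameter, rather than reading it inside the method, keeps the choice easy to test on a single machine.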

II. playwright

Dependency

       <dependency>
           <groupId>com.microsoft.playwright</groupId>
           <artifactId>playwright</artifactId>
           <version>1.30.0</version>
       </dependency>

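Unlike selenium, playwright needs no environment setup: the first run of the program downloads the files it needs automatically, and after opening a page playwright waits for it to load. If you ask for an element before the page has finished loading, selenium throws a NoSuchElementException straight away (unless an implicit or explicit wait is configured), whereas playwright waits up to 30 seconds by default, returning successfully if the element appears within that time and failing with a timeout error otherwise. This makes the time spent waiting for the DOM to load easier to control than in selenium.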
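
The difference between "fail immediately" and "wait up to a timeout" can be sketched without either library. The helper below is a library-free illustration of polling with a deadline, not the API of selenium or playwright; `PollingWait`, `waitFor`, and the 50 ms polling interval are assumptions made for the example. In selenium, the `WebDriverWait` class plays a comparable role and is a common replacement for a fixed `Thread.sleep`.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical helper illustrating "wait up to a timeout" versus "fail immediately". */
final class PollingWait {
    /**
     * Calls lookup repeatedly until it yields a value or the timeout elapses.
     * Throws IllegalStateException on timeout or if the thread is interrupted.
     */
    static <T> T waitFor(Supplier<Optional<T>> lookup, Duration timeout, Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            Optional<T> found = lookup.get();
            if (found.isPresent()) {
                return found.get();
            }
            if (System.nanoTime() - deadline >= 0) {
                throw new IllegalStateException("Timed out after " + timeout.toMillis() + " ms");
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();   // preserve the interrupt flag
                throw new IllegalStateException("Interrupted while waiting", e);
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in for a page element that only "appears" after about 300 ms
        Supplier<Optional<String>> lookup = () ->
                System.currentTimeMillis() - start >= 300 ? Optional.of("headline") : Optional.empty();
        String text = waitFor(lookup, Duration.ofSeconds(30), Duration.ofMillis(50));
        System.out.println(text);
    }
}
```

With a fixed sleep the program always pays the full delay; with polling it returns as soon as the lookup succeeds and only pays the full timeout when the element never shows up.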

Code

        import com.microsoft.playwright.*;
        import com.microsoft.playwright.options.Proxy;

        try (Playwright playwright = Playwright.create()) {
            BrowserType.LaunchOptions launchOptions = new BrowserType.LaunchOptions();
            launchOptions.setChannel("firefox");                         // use the binary build of the Firefox browser
            launchOptions.setProxy(new Proxy("http://IP:port"));         // proxy server settings
            Browser browser = playwright.firefox().launch(launchOptions);
            Page page = browser.newPage();
            page.navigate(url);                                          // open the url

            // locator() finds the element with class="top-card-layout__title"; textContent() returns its text
            String fullName = page.locator(".top-card-layout__title").textContent().trim();
            System.out.println("Name: " + fullName);

            // `locator` is the Locator of the section being processed; its definition is not shown here
            Locator showMore = locator.locator(".show-more-less-text__button--more");
            try {
                // Wait at most 100 ms for the button; an exception is thrown if it is not found in time
                showMore.waitFor(new Locator.WaitForOptions().setTimeout(100));
                showMore.click();                                        // click the expand button
            } catch (Exception e) {
                System.out.println("No expand button");
            }
        }
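
The snippet above calls `trim()` on the result of `textContent()`. Since `textContent()` returns the raw text of the node, the value can also contain line breaks and runs of spaces inside the string, which `trim()` alone leaves untouched. A minimal sketch of a tidy-up step follows; `TextCleaner` and `normalize` are hypothetical names, the sample input is made up, and the null guard is a defensive assumption rather than something the article's code relies on.

```java
/** Hypothetical helper: tidies the raw string returned by a text lookup such as textContent(). */
final class TextCleaner {
    /** Collapses runs of whitespace (including non-breaking spaces) and trims; null becomes "". */
    static String normalize(String raw) {
        if (raw == null) {
            return "";
        }
        // \s covers spaces, tabs and line breaks; \u00A0 is the non-breaking space
        return raw.replaceAll("[\\s\\u00A0]+", " ").trim();
    }

    public static void main(String[] args) {
        String raw = "\n        Jane   Doe\n      ";   // made-up sample, not real scraped data
        System.out.println("[" + normalize(raw) + "]");
    }
}
```

The same helper can be applied to text read with selenium's `getText()`, so both sections can share one clean-up routine.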

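Locators support many ways of finding elements; see the official documentation at https://playwright.dev/java/docs/locators. For scraping work like this, Python code is still more concise.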
