[Original] Using Puppeteer to Gather Article Statistics for 纵横研究院

I have recently been learning Puppeteer, so I used gathering statistics on the 纵横研究院 articles as a practice exercise.

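Puppeteer is the official headless Chrome tool from the Google Chrome team. It is a Node library that provides a high-level API for controlling headless Chrome over the DevTools Protocol. With Puppeteer you can simulate most of what a user does in a browser, such as taking screenshots, scraping the rendered content of a page, and interacting with the page.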

The scraped article data is displayed at: http://47.104.205.189:30000/

Next, let's walk through how Puppeteer simulates user actions to scrape the data.

1. Get all 纵横研究院 topics

Launch a Puppeteer browser

const browser = await puppeteer.launch({
  headless: false
})

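headless controls whether the browser runs in headless mode. Setting it to false opens a visible browser window driven by the code, which makes debugging easier.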
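
The snippets in this walkthrough never close the browser they launch. A minimal sketch of one way to guarantee cleanup, even when scraping fails, is shown below. withBrowser is a hypothetical helper name (it is not part of Puppeteer) and it receives the puppeteer module as a parameter; only puppeteer.launch and browser.close are actual Puppeteer calls.

```javascript
// Hypothetical helper (not part of Puppeteer): launches a browser, runs `work`,
// and always closes the browser afterwards, even if `work` throws.
async function withBrowser (puppeteer, options, work) {
  const browser = await puppeteer.launch(options)
  try {
    return await work(browser)
  } finally {
    await browser.close()
  }
}

// Usage sketch, assuming `puppeteer` was loaded with require('puppeteer'):
// withBrowser(puppeteer, { headless: false }, async browser => {
//   const page = await browser.newPage()
//   // ... scrape with page ...
// })
```

Passing the module in as an argument keeps the helper easy to exercise with a stand-in object during development.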

Go to the page https://www.jianshu.com/u/9b797d42a0cc

// Page load options
const pageOptions = {
  timeout: 0, 
  waitUntil: [
    'domcontentloaded',
    'networkidle0'
  ]
}
const page = await browser.newPage()
await page.goto('https://www.jianshu.com/u/9b797d42a0cc', pageOptions)

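timeout: the page load timeout. When Jianshu pages are loaded frequently, resources can become slow to load, so it is set to 0 here to disable the timeout.
waitUntil: when navigation is considered complete. domcontentloaded means the page's DOMContentLoaded event has fired; networkidle0 means there have been no network connections for at least 500 ms. When an array is given, navigation finishes once all of the listed events have occurred.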
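
Disabling the timeout means a page that never settles would block the script indefinitely. As an alternative, here is a rough sketch that keeps a finite timeout and retries the navigation a few times. gotoWithRetry is a hypothetical helper, and the attempt count and the 30 second timeout in the usage note are assumed values, not something taken from the original script.

```javascript
// Hypothetical helper: tries page.goto up to `attempts` times.
// `page` is any object with an async goto(url, options), such as a Puppeteer page.
async function gotoWithRetry (page, url, options, attempts = 3) {
  let lastError
  for (let i = 0; i < attempts; i++) {
    try {
      return await page.goto(url, options)
    } catch (e) {
      lastError = e // for example a navigation timeout; try again
    }
  }
  throw lastError
}

// Usage sketch with an assumed 30 second timeout:
// await gotoWithRetry(page, 'https://www.jianshu.com/u/9b797d42a0cc', {
//   timeout: 30000,
//   waitUntil: ['domcontentloaded', 'networkidle0']
// })
```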

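Click "view more" (查看更多) under "Topics he created" (他创建的专题) to show all 纵横研究院 topics

The right side of the page shows only 10 topics by default, so a click has to be simulated to view the rest.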


async function safeFunc (func) {
  try {
    const res = await func()
    return [null, res]
  } catch (e) {
    return [e, null]
  }
}
await safeFunc(async () => {
  await page.click('.list .check-more')
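  // delay is assumed to be a helper that returns a promise resolving after the given number of milliseconds
  await delay(1000)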
})

page.click simulates a user click. It throws an error if the selector matches no element, so the general-purpose safeFunc helper is used to catch that error.
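
Since safeFunc returns an [error, result] pair, the caller can also check whether the click actually happened. The sketch below shows one way to do that. clickIfPresent and the delay definition are hypothetical additions for illustration; the original script may define delay differently.

```javascript
// Hypothetical helper: resolves after `ms` milliseconds.
const delay = ms => new Promise(resolve => setTimeout(resolve, ms))

// Same shape as the safeFunc helper in this article: never throws,
// returns [error, null] on failure and [null, result] on success.
async function safeFunc (func) {
  try {
    const res = await func()
    return [null, res]
  } catch (e) {
    return [e, null]
  }
}

// Hypothetical wrapper: clicks `selector` if it exists and reports whether it did.
// `page` is any object with an async click(selector), such as a Puppeteer page.
async function clickIfPresent (page, selector, waitMs = 1000) {
  const [err] = await safeFunc(async () => {
    await page.click(selector)
    await delay(waitMs)
  })
  return err === null
}

// Usage sketch:
// const expanded = await clickIfPresent(page, '.list .check-more')
```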

Get all topics

const res = await page.evaluate(async () => {
  const titleDom = Array.from(document.querySelectorAll('.title'))
    .find(one => one.innerText === '他创建的专题')
  if (!titleDom) return []
  // Use selectors and DOM methods to read the topic data from the page
  return Array.from(titleDom.nextElementSibling.querySelectorAll('li'))
    .reduce((acc, current) => {
      const item = current.querySelector('.name')
      if (!item) return acc
      return acc.concat({
        topicName: item.innerText,
        topicHome: item.href
      })
    }, [])
})

page.evaluate runs the given function in the browser context, so the function has access to window, document, and so on, and can call the browser's DOM methods.
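
One consequence worth spelling out: the function given to page.evaluate is serialized and executed inside the page, so it cannot see variables from the surrounding Node code. Values from Node have to be passed as extra arguments to page.evaluate. The sketch below restates the topic extraction with the heading text passed in as an argument; extractTopics is a hypothetical name used only for this illustration.

```javascript
// Runs in the browser when handed to page.evaluate, so it may only use its
// arguments and browser globals such as document.
function extractTopics (headingText) {
  const titleDom = Array.from(document.querySelectorAll('.title'))
    .find(one => one.innerText === headingText)
  if (!titleDom) return []
  return Array.from(titleDom.nextElementSibling.querySelectorAll('li'))
    .map(li => li.querySelector('.name'))
    .filter(Boolean) // skip list items without a .name link
    .map(item => ({ topicName: item.innerText, topicHome: item.href }))
}

// Usage sketch, inside an async function with a Puppeteer page:
// const topics = await page.evaluate(extractTopics, '他创建的专题')
```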

2. Get all articles in each topic

The article list is read from a topic page as follows:

async function getArticles (page) {
  await autoScroll(page)
  const articles = await page.evaluate(async () => {
    return Array.from(document.querySelectorAll('.note-list > li'))
      .reduce((acc, current) => {
        const titleDom = current.querySelector('.title')
        const nicknameDom = current.querySelector('.nickname')
        if (!titleDom || !nicknameDom) return acc

        const starIcon = nicknameDom.parentElement.querySelector('.ic-list-like')
        const stars = (starIcon && Number.parseInt(starIcon.nextSibling.data)) || 0
        const commentIcon = nicknameDom.parentElement.querySelector('.ic-list-comments')
        const comments = (commentIcon && Number.parseInt(commentIcon.nextSibling.data)) || 0
        return acc.concat({
          authorName: nicknameDom.innerText, // author name
          authorHome: nicknameDom.href, // author home page
          title: titleDom.innerText, // article title
          url: titleDom.href, // article URL
          stars, // number of likes
          comments // number of comments
        })
      }, [])
  })
  return articles
}

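getArticles also runs in the browser context, using selectors to find the relevant DOM elements and reading each article's data one by one. Before reading the articles it calls autoScroll, which scrolls the page to the bottom: the article list in a topic is lazy-loaded, so all articles are only available after scrolling to the bottom. The autoScroll function is as follows: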

async function autoScroll (page) {
  await page.evaluate(async () => {
    await new Promise((resolve, reject) => {
      let totalHeight = 0
      let distance = 100
      let timer = setInterval(() => {
        let scrollHeight = document.body.scrollHeight
        window.scrollBy(0, distance)
        totalHeight += distance
        if (totalHeight >= scrollHeight) {
          clearInterval(timer)
          resolve()
        }
      }, 100)
    })
  })
}

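As shown above, a timer scrolls the page down step by step to load more articles, and stops once the total scrolled distance reaches the actual page height, at which point the articles are taken to be fully loaded.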
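
The loop stops as soon as the scrolled distance catches up with the current page height, which could in principle happen before a slow lazy-load request has appended more items. One possible variation is to keep scrolling until the height has stayed the same for a few consecutive checks. The sketch below is an assumption-laden illustration, not the original script's method: scrollUntilStable, its callback parameters, and the 500 ms wait in the usage note are all hypothetical.

```javascript
// Hypothetical helper: scrolls until the page height is unchanged for
// `stableChecks` consecutive checks. The three callbacks are injected so the
// loop itself does not depend on Puppeteer.
async function scrollUntilStable ({ getHeight, scrollToBottom, wait }, stableChecks = 3) {
  let lastHeight = -1
  let stableCount = 0
  while (stableCount < stableChecks) {
    await scrollToBottom()
    await wait()
    const height = await getHeight()
    if (height === lastHeight) {
      stableCount++
    } else {
      stableCount = 0
      lastHeight = height
    }
  }
  return lastHeight
}

// Usage sketch with a Puppeteer page:
// await scrollUntilStable({
//   getHeight: () => page.evaluate(() => document.body.scrollHeight),
//   scrollToBottom: () => page.evaluate(() => window.scrollTo(0, document.body.scrollHeight)),
//   wait: () => new Promise(resolve => setTimeout(resolve, 500))
// })
```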

Iterate over the topic list and visit each topic page to get its articles, as follows:

const topics = await getTopics(browser)
const page = await browser.newPage()
for (const topic of topics) {
  await page.goto(topic.topicHome, pageOptions)
  const articles = await getArticles(page)
  Object.assign(topic, {
    articles: articles.map(one => ({ ...topic, ...one }))
  })
}

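3. Get each article's read count and publish time from the user pages

If the topic page showed each article's read count and publish time directly, the data from the two steps above would be enough for the statistics. It does not, so the articles in all topics have to be grouped by author, and the details of each article are then read from the author's home page.

Group by author: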

const authors = topics.reduce((acc, topic) => {
  topic.articles.forEach(article => {
    const { authorName, authorHome } = article
    const existingAuthor = acc.find(one => one.authorHome === authorHome)
    if (existingAuthor) {
      Object.assign(existingAuthor, { articles: [...existingAuthor.articles, article] })
    } else {
      acc.push({ authorName, authorHome, articles: [article] })
    }
  })
  return acc
}, [])
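
The reduce above scans the accumulated authors with find for every article. A Map keyed by authorHome gives the same grouping in a single pass; the sketch below is one way to write it, with groupByAuthor being a hypothetical name introduced here.

```javascript
// Groups articles by authorHome using a Map, so each article is handled in
// constant time instead of scanning the accumulated authors with find.
function groupByAuthor (topics) {
  const byHome = new Map()
  for (const topic of topics) {
    for (const article of topic.articles) {
      const { authorName, authorHome } = article
      if (!byHome.has(authorHome)) {
        byHome.set(authorHome, { authorName, authorHome, articles: [] })
      }
      byHome.get(authorHome).articles.push(article)
    }
  }
  return Array.from(byHome.values()) // a Map iterates in insertion order
}

// Usage sketch:
// const authors = groupByAuthor(topics)
```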

Get the read count and publish time of the articles from an author's home page:

async function getArticlesDetail (page) {
  await autoScroll(page)
  const articles = await page.evaluate(async () => {
    return Array.from(document.querySelectorAll('.note-list > li')).map(one => {
      if (!one) return {}
      const titleDom = one.querySelector('.title')
      const url = titleDom && titleDom.href
      const readIcon = one.querySelector('.ic-list-read')
      const readCount = (readIcon && Number.parseInt(readIcon.nextSibling.data)) || 0
      const timeDom = one.querySelector('.time')
      // This callback runs in the browser, so moment is assumed to be available as a global on the page
      const publishTime = timeDom && moment(timeDom.dataset.sharedAt).format('YYYY-MM-DD HH:mm')
      return { url, readCount, publishTime }
    })
  })
  return articles
}

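Iterate over the users who have published articles in the topics and visit each user's page to get the article details, as follows: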

for (const author of authors) {
  const { authorHome, articles } = author
  await page.goto(authorHome, pageOptions)
  const authorAllArticles = await getArticlesDetail(page)
  articles.forEach(article => {
    const articleExtraInfo = authorAllArticles.find(one => article.url === one.url)
    Object.assign(article, articleExtraInfo)
  })
}
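
If an article from a topic cannot be matched on its author's page, its readCount stays undefined, and any later sum that includes it becomes NaN. The sketch below shows one way to guard against that by applying defaults during the merge. mergeDetails is a hypothetical helper, and the default values (0 and null) are assumptions chosen for illustration.

```javascript
// Hypothetical helper: copies readCount and publishTime onto each article by URL.
// Articles missing from `details` get readCount 0 and publishTime null (assumed defaults).
function mergeDetails (articles, details) {
  const byUrl = new Map(details.map(one => [one.url, one]))
  for (const article of articles) {
    const extra = byUrl.get(article.url) // undefined when there is no match
    Object.assign(article, { readCount: 0, publishTime: null }, extra)
  }
  return articles
}

// Usage sketch inside the loop over authors:
// mergeDetails(articles, authorAllArticles)
```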

4. Sort and format the data, then export JSON

const allArticles = authors.reduce((acc, current) => acc.concat(current.articles), [])
const allReadCount = allArticles.reduce((acc, current) => (acc + current.readCount), 0)

// Save the article list
output({
  articleCount: allArticles.length,
  readCount: allReadCount,
  articles: allArticles.sort((a, b) => (b.readCount - a.readCount))
}, './纵横研究院文章列表.json')

// Fill in the article details for each topic
topics.forEach(one => {
  one.articles.forEach(article => {
    const articleExtraInfo = allArticles.find(one => article.url === one.url)
    Object.assign(article, articleExtraInfo)
  })
})

// Save the topic statistics
output({
  articleCount: allArticles.length,
  readCount: allReadCount,
  topicCount: topics.length,
  topics: topics
    .sort((a, b) => (b.articles.length - a.articles.length))
    .map(one => ({
      articleCount: one.articles.length,
      readCount: one.articles.reduce((acc, current) => (acc + current.readCount), 0),
      ...one,
      articles: one.articles.sort((a, b) => (b.readCount - a.readCount))
    }))
}, './纵横研究院专题统计.json')

// Save the author statistics
output({
  articleCount: allArticles.length,
  readCount: allReadCount,
  authorCount: authors.length,
  authors: authors
    .sort((a, b) => (b.articles.length - a.articles.length))
    .map(one => ({
      articleCount: one.articles.length,
      readCount: one.articles.reduce((acc, current) => (acc + current.readCount), 0),
      ...one,
      articles: one.articles.sort((a, b) => (b.readCount - a.readCount))
    }))
}, './纵横研究院作者统计.json')
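
The topic and author outputs apply the same sort-and-summarize steps, so they could share one helper. The sketch below is a possible refactoring under that assumption; withStats is a hypothetical name. Unlike the code above, it sorts copies of the arrays, because Array.prototype.sort reorders the array it is called on.

```javascript
// Hypothetical helper: adds articleCount and readCount to each group (topic or
// author), sorting groups by article count and articles by read count, both
// descending. Copies are sorted so the input arrays are left untouched.
function withStats (groups) {
  return [...groups]
    .sort((a, b) => b.articles.length - a.articles.length)
    .map(one => ({
      articleCount: one.articles.length,
      readCount: one.articles.reduce((acc, current) => acc + current.readCount, 0),
      ...one,
      articles: [...one.articles].sort((a, b) => b.readCount - a.readCount)
    }))
}

// Usage sketch:
// const topicStats = withStats(topics)
// const authorStats = withStats(authors)
```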

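Those are all the steps. The final code and run results can be viewed here.

Going further

Running the steps above to collect the statistics takes about 6 minutes each time, because the script has to visit 20 topic pages and more than 60 user home pages one by one, and pages with many articles must be scrolled to the bottom to lazy-load everything.

Opening several pages at once and handling the navigation, lazy loading, and data extraction in parallel should reduce the run time. Handling the tasks with multiple pages looks like this: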

async function execTasks (browser, tasks, maxPageCount = 5) {
  const taskStatus = new Array(tasks.length).fill(0)
  await Promise.all(Array.from({ length: maxPageCount }).map(async (one, i) => {
    const page = await browser.newPage()
    while (true) {
      const index = taskStatus.findIndex(status => !status)
      if (index === -1) break
      taskStatus[index] = 1
      await tasks[index](page)
    }
  }))
}

const topics = await getTopics(browser)
await execTasks(browser, topics.map(topic => async (page) => {
  await page.goto(topic.topicHome, pageOptions)
  const articles = await getArticles(page)
  Object.assign(topic, {
    articles: articles.map(one => ({ ...topic, ...one }))
  })
}))

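The code above opens 5 pages that work through the topic tasks together. Unfortunately, it did not work as hoped.

Jianshu may limit concurrent page requests from the same browser: only one page actually opened normally. After some experimenting, even running just two page windows in parallel still produced load failures, so in the end I compromised and used a single page.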
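
Should parallel loading be revisited, it may help to record which tasks fail instead of letting a single rejection abort the whole Promise.all. The sketch below is a hypothetical variant of execTasks, here called runPool, that does this; with the default maxPageCount of 1 it behaves like the single-page approach. browser.newPage and page.close are actual Puppeteer calls, and everything else is an assumption for illustration.

```javascript
// Hypothetical variant of execTasks: workers pull tasks from a shared cursor and
// each task's error (or null) is recorded instead of rejecting the whole run.
async function runPool (browser, tasks, maxPageCount = 1) {
  const errors = new Array(tasks.length).fill(null)
  let next = 0
  const workerCount = Math.min(maxPageCount, tasks.length)
  const workers = Array.from({ length: workerCount }, async () => {
    const page = await browser.newPage()
    while (next < tasks.length) {
      const index = next++ // no await between the check and the increment
      try {
        await tasks[index](page)
      } catch (e) {
        errors[index] = e
      }
    }
    await page.close()
  })
  await Promise.all(workers)
  return errors
}

// Usage sketch:
// const errors = await runPool(browser, taskList, 2)
// errors.forEach((e, i) => { if (e) console.warn(`task ${i} failed:`, e.message) })
```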
