Processing data inside a script tag with js2xml

Beginner's ramblings

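To be honest, today's websites are really unfriendly to those of us who are new to writing crawlers.
Today I fell into a classic pit: the page shows everything in the browser, but the HTML source contains none of it.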


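At this point we should realize that we need to find where the data actually lives. A quick search of the page source shows that all of it sits in a variable inside a script tag. There are two ways to handle this situation: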

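Use scrapy-splash to render the page and load the full HTML, then process that. Installing Splash is fairly tedious and unfriendly to some readers, though, so set it aside for now and come back to it only if the other method fails.
Use js2xml to convert the data in the script into an lxml tree, then use XPath to locate each value we need.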

# Reply details
import js2xml

# `soup` is the BeautifulSoup object built from the downloaded page.
# On this page the data sat in the 16th <script> (index 15); other pages will differ.
reply_area = soup.select('script')[15].string
# js2xml.parse returns an lxml element tree, so XPath can be run on it directly.
reply_tree = js2xml.parse(reply_area)
username = reply_tree.xpath('//property[@name="kind"]/following-sibling::property[@name="name"]/string/text()')
content = reply_tree.xpath('//property[@name="text"]/string/text()')
# Named create_time so it does not shadow the standard time module.
create_time = reply_tree.xpath('//property[@name="create_time"]/string/text()')
comment_id = reply_tree.xpath('//property[@name="create_time"]/following-sibling::property[@name="id"]/string/text()')
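
Picking the script by a fixed index such as `[15]` breaks as soon as the site adds or removes a script tag. One possible alternative is to pick the script by what it contains. The sketch below uses only the standard library's `html.parser` as a stand-in for BeautifulSoup; `ScriptCollector`, `find_script`, the sample HTML, and the marker `"var replies"` are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script bodies may arrive in several chunks, so append to the last entry.
        if self._in_script:
            self.scripts[-1] += data


def find_script(html, marker):
    """Return the first script body containing `marker`, or None if there is none."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    for script in collector.scripts:
        if marker in script:
            return script
    return None


# Made-up page with two scripts; only the second one holds the data.
html = (
    "<html><head><script>var a = 1;</script></head>"
    '<body><script>var replies = [{name: "Tom"}];</script></body></html>'
)
reply_area = find_script(html, "var replies")
```

The returned string can then be passed to `js2xml.parse` just like `reply_area` above. With BeautifulSoup, the same idea would be a loop over `soup.select('script')` that keeps the first tag whose `.string` contains the marker.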
