Scraping TripAdvisor attraction reviews (Part 1)

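This was my first time learning how to find the real URL behind an asynchronously loaded page. I spent a whole afternoon on it, and it really was hard. But I have this habit: the harder something is to find, the more I want to find it.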

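I have finally found the real URL I was after, and I could cry.
Let's take Huangshan as the example. After searching for Huangshan, the attraction page lists its reviews.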


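What does asynchronous loading mean here? When I pick a different review language, the URL in the address bar does not change. That is the giveaway: the reviews are being fetched by a separate request behind the scenes.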


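Once I understood what packet capture is and how to do it, I set off on a long hunt for the right request. I won't recount the whole process.
The first finding: if the request for the starting page carries browser information (request headers), the English version of the page can be parsed.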
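The "browser information" step can be sketched roughly as follows. This is a minimal sketch rather than the exact request used here: `browser_headers` and `fetch_page` are hypothetical helper names, and the idea that the `TALanguage` cookie (it appears in the cookie string of the script at the end of this post) is what selects English is an assumption.

```python
# Starting page for Huangshan, taken from the Referer header in the script at the end.
PAGE_URL = (
    "http://www.tripadvisor.cn/Attraction_Review-g303685-d550738-Reviews-"
    "Mt_Huangshan_Yellow_Mountain-Huangshan_Anhui.html"
)


def browser_headers(language="en"):
    """Return headers that make the request look like it came from a browser."""
    return {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36"
        ),
        # Assumption: this cookie controls the language of the returned reviews.
        "Cookie": f"TALanguage={language}",
    }


def fetch_page(url=PAGE_URL, language="en"):
    """Download the page with browser-like headers (hypothetical helper)."""
    import requests  # third-party; imported lazily so browser_headers works without it

    response = requests.get(url, headers=browser_headers(language), timeout=10)
    response.raise_for_status()
    return response.text
```

Calling `fetch_page()` would download the starting page. Whether the site still answers this way is not something the sketch can guarantee.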
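But there is a catch: each review ends with a "More" link, and that is yet another asynchronous load! So the search continues.
In the developer tools, click Clear to empty the list of captured requests.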


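After clicking "More" a few times, a new request showed up in the list: the ExpandedUserReviews request whose URL is used in the code at the end of this post.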


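Lesson learned: names matter. The name already tells us this request is an expansion. Sure enough, opening the URL I found finally showed the full text of the reviews.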

Is that the end?

Of course not. Where do those long strings of digits in the URL come from? I'll cover that in the next post. To be continued...

As usual, here is the standalone parsing code:


import requests
from lxml import etree
url='http://www.tripadvisor.cn/ExpandedUserReviews-g303685-d550738?target=410115359&context=1&reviews=410115359,409344604,407255372,401140048,400179383,398229741,396111020,395334568,394200191,393782571&servlet=Attraction_Review&expand=1'
headers = {'Accept': '*/*',
           'Accept-Encoding': 'gzip, deflate, sdch',
           'Accept-Language': 'zh-CN,zh;q=0.8',
           'Connection': 'keep-alive',
           'Cookie': 'ServerPool=X; TATravelInfo=V2*A.2*MG.-1*HP.2*FL.3*RVL.550738_100*RS.1; TASSK=enc%3AAGMMZ%2Bwe98u9po0Y%2FIY8pNbyuAGi9fbnqnNLKXa4%2BK5cWP0RMuCHTRZhu0uFf1yydRIPPAQ%2FpF7EdW0NLOpBZZId19ek1a9GHWZKvnuTIJ0QcXx1ULQXtiMx%2F%2BHhNCUrIg%3D%3D; TAUnique=%1%enc%3AjrXWw0qqncCEQMzfl5keG315t9yL8iOg6jLwcPiP6q8%3D; _jzqckmp=1; bdshare_firstime=1491815789350; __gads=ID=e5060e1a6b1ed08f:T=1491815796:S=ALNI_MbFkpxx2-zq7ubsIoe4wvdJnbQWoA; TALanguage=en; TAReturnTo=%1%%2FAttraction_Review-g303685-d550738-Reviews-Mt_Huangshan_Yellow_Mountain-Huangshan_Anhui.html; TASession=%1%V2ID.DA0C735ECBB05FFBD2F31EA11943410C*SQ.15*LP.%2FAttraction_Review-g303685-d550738-Reviews-Mt_Huangshan_Yellow_Mountain-Huangshan_Anhui%5C.html*LS.Attraction_Review*GR.70*TCPAR.53*TBR.19*EXEX.62*ABTR.65*PHTB.78*FS.82*CPU.26*HS.popularity*ES.popularity*AS.popularity*DS.5*SAS.popularity*FPS.oldFirst*LF.en*FA.1*DF.0*MS.-1*RMS.-1*FLO.550738*TRA.false*LD.550738; CM=%1%HanaPersist%2C%2C-1%7CPremiumMobSess%2C%2C-1%7Ct4b-pc%2C%2C-1%7CHanaSession%2C%2C-1%7CRCPers%2C%2C-1%7CWShadeSeen%2C%2C-1%7CFtrPers%2C%2C-1%7CTheForkMCCPers%2C%2C-1%7CHomeASess%2C%2C-1%7CPremiumSURPers%2C%2C-1%7CPremiumMCSess%2C%2C-1%7Csesscoestorem%2C%2C-1%7CCpmPopunder_1%2C1%2C1491902222%7CCCSess%2C%2C-1%7CCpmPopunder_2%2C1%2C-1%7CViatorMCPers%2C%2C-1%7Csesssticker%2C%2C-1%7C%24%2C%2C-1%7CPremiumORSess%2C%2C-1%7Ct4b-sc%2C%2C-1%7CMC_IB_UPSELL_IB_LOGOS2%2C%2C-1%7Cb2bmcpers%2C%2C-1%7CMC_IB_UPSELL_IB_LOGOS%2C%2C-1%7CPremMCBtmSess%2C%2C-1%7CPremiumSURSess%2C%2C-1%7CLaFourchette+Banners%2C%2C-1%7Csess_rev%2C%2C-1%7Csessamex%2C%2C-1%7Cperscoestorem%2C%2C-1%7CPremiumRRSess%2C%2C-1%7CSaveFtrPers%2C%2C-1%7CTheForkRRSess%2C%2C-1%7Cpers_rev%2C%2C-1%7CMetaFtrSess%2C%2C-1%7CRBAPers%2C%2C-1%7CWAR_RESTAURANT_FOOTER_PERSISTANT%2C%2C-1%7CFtrSess%2C%2C-1%7CHomeAPers%2C%2C-1%7CPremiumMobPers%2C%2C-1%7CRCSess%2C%2C-1%7CLaFourchette+MC+Banners%2C%2C-1%7Cbookstickcook%2C%2C-1%7Csh%2C%2C-1%7CLastPopunderId%2C137-1859-null%2C-1%7Cpssamex%2C%2C-1%7CTheForkMCCSess%2C%2C-1%7C2016sticksess%2C%2C-1%7CCCPers%2C%2C-1%7CWAR_RESTAURANT_FOOTER_SESSION%2C%2C-1%7Cb2bmcsess%2C%2C-1%7C2016stickpers%2C%2C-1%7CViatorMCSess%2C%2C-1%7CPremiumMCPers%2C%2C-1%7CPremiumRRPers%2C%2C-1%7CPremMCBtmPers%2C%2C-1%7CTheForkRRPers%2C%2C-1%7CSaveFtrSess%2C%2C-1%7CPremiumORPers%2C%2C-1%7CRBASess%2C%2C-1%7Cbookstickpers%2C%2C-1%7Cperssticker%2C%2C-1%7CMetaFtrPers%2C%2C-1%7C; TAUD=LA-1491815815299-1*LG-14277644-2.1.F.*LD-14277645-.....; roybatty=TNI1625!AP9YRq1oHIHfPtXcJCINRrDe7hLPCe8L8uurjbOYo996M1NrdEF3UC8F2w%2BA%2FvgIK20Ptfm2qFK2Y7gBNq3fPyswrYVGd%2BwBp%2FhQTse54C7MDQU3%2FCl9pe%2FrrYw8WiSNYgQ6pewgJ',
           'Host': 'www.tripadvisor.cn',
           'Referer': 'http://www.tripadvisor.cn/Attraction_Review-g303685-d550738-Reviews-Mt_Huangshan_Yellow_Mountain-Huangshan_Anhui.html',
           'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
           }
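# Request the expanded reviews, sending the browser headers captured above
html=requests.post(url,headers=headers).content
selector=etree.HTML(html)
# Each expanded review body sits in a <div class="entry">
infos = selector.xpath('//div[@class="entry"]')
print(len(infos))
for info in infos:
    texts = info.xpath('p/text()')
    if texts:  # skip entries with no paragraph text instead of raising IndexError
        print(texts[0])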
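The request URL in the script can also be assembled from its parts instead of being pasted whole. Here is a minimal sketch, assuming that the long numbers in `reviews=` are review IDs and that `target` is simply the first of them. This matches the captured URL, but nothing in this post confirms it. `build_expanded_reviews_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.tripadvisor.cn"


def build_expanded_reviews_url(geo_id, detail_id, review_ids,
                               servlet="Attraction_Review"):
    """Assemble an ExpandedUserReviews URL shaped like the captured one."""
    if not review_ids:
        raise ValueError("at least one review id is required")
    ids = [str(review_id) for review_id in review_ids]
    query = [
        ("target", ids[0]),          # assumption: target is the first review id
        ("context", "1"),
        ("reviews", ",".join(ids)),  # comma-separated, as in the captured URL
        ("servlet", servlet),
        ("expand", "1"),
    ]
    path = f"/ExpandedUserReviews-g{geo_id}-d{detail_id}"
    # safe="," keeps the commas literal instead of encoding them as %2C
    return f"{BASE_URL}{path}?{urlencode(query, safe=',')}"


# The ten ids from the captured URL for Huangshan (g303685, d550738).
huangshan_ids = [410115359, 409344604, 407255372, 401140048, 400179383,
                 398229741, 396111020, 395334568, 394200191, 393782571]
rebuilt_url = build_expanded_reviews_url(303685, 550738, huangshan_ids)
```

With those ten IDs, `rebuilt_url` is character for character the same as the `url` string in the script, so it could be passed to `requests.post` in its place.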