WeChat Official Account Article Crawler
優采云 Published: 2020-08-27 08:38
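Many WeChat official accounts publish high-quality articles, so I wanted a crawler that collects every article from the accounts I like. Fetching all of an account's articles requires two key parameters: the account's unique ID (__biz) and wap_sid2, the value that grants access to a single account's article list. The approach is described below.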
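Search results
XPath of the account link in the search results:
//*[@id="sogou_vr_11002301_box_n"]/div/div[2]/p[1]/a
Profile URL that the link points to:
http://mp.weixin.qq.com/profile?src=3&timestamp=1508003829&ver=1&signature=Eu9LOYSA47p6WE0mojhMtFR-gSr7zsQOYo6*w5VxrUgy7RbCsdkuzfFQ1RiSgM3i9buMZPrYzmOne6mJxCtW*g==
This page lists only the 10 most recent articles, which is not enough for our purpose.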
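// The biz value is the unique ID of the official account.
// Most of the code before and after this snippet is omitted; it sits inside a <script> tag.
// The same script also holds the data for the 10 most recent articles, so if you only
// want those, you can extract them directly with a regular expression.
var biz = "MzIwNDA1OTM4NQ==" || "";
var src = "3" ;
var ver = "1" ;
var timestamp = "1508003829" ;
var signature = "Eu9LOYSA47p6WE0mojhMtFR-gSr7zsQOYo6*w5VxrUgy7RbCsdkuzfFQ1RiSgM3i9buMZPrYzmOne6mJxCtW*g==" ;
var name="python6359"||"python";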
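The regular-expression route mentioned in the comment can be sketched as follows. This is a minimal illustration, not the author's code: `extract_js_vars` and `VAR_PATTERN` are hypothetical names, and the pattern only assumes that each value is written as `var name = "value"` the way it is in the snippet above.

```python
import re

# Fragment copied from the profile page's <script> block shown above.
SCRIPT_SNIPPET = '''
var biz = "MzIwNDA1OTM4NQ==" || "";
var src = "3" ;
var name="python6359"||"python";
'''

# Matches: var <name> = "<value>" (whitespace around "=" is optional).
VAR_PATTERN = re.compile(r'var\s+(\w+)\s*=\s*"([^"]*)"')


def extract_js_vars(script_text):
    """Return a dict holding the first string literal assigned to each JS variable."""
    found = {}
    for var_name, value in VAR_PATTERN.findall(script_text):
        # Keep the first assignment if a variable appears more than once.
        found.setdefault(var_name, value)
    return found


page_vars = extract_js_vars(SCRIPT_SNIPPET)
print(page_vars["biz"])
```

Only the first quoted literal of each assignment is captured, so the `|| "python"` fallback in the page source is ignored.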
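Once you have the account's ID, the next step is to obtain the wap_sid2 value (the access value for a single account's articles). It has to come from the WeChat client, and here it is captured with the Fiddler packet-capture tool. If you have not set up a capture environment before, see the article on capturing Mobike bicycle data packets with Fiddler.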
The captured wap_sid2 and __biz values
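Once a request has been captured, both values can be pulled out of it programmatically. The sketch below assumes that __biz travels in the request's query string and wap_sid2 in its Cookie header. The helper names and the wap_sid2 value are placeholders invented for illustration.

```python
from urllib.parse import urlsplit, parse_qs


def parse_cookie_header(cookie_header):
    """Split a raw Cookie header into a {name: value} dict."""
    cookies = {}
    for part in cookie_header.split(";"):
        name, sep, value = part.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


def extract_credentials(request_url, cookie_header):
    """Return (__biz, wap_sid2) from one captured request; missing values are None."""
    query = parse_qs(urlsplit(request_url).query)
    biz = query.get("__biz", [None])[0]
    wap_sid2 = parse_cookie_header(cookie_header).get("wap_sid2")
    return biz, wap_sid2


# Placeholder capture: the wap_sid2 value here is made up for illustration.
captured_url = "https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=MzIwNDA1OTM4NQ=="
captured_cookie = "wxuin=1563216401; wap_sid2=EXAMPLE_TOKEN=; pgv_pvid=7103943278"

biz, wap_sid2 = extract_credentials(captured_url, captured_cookie)
print(biz, wap_sid2)
```

Splitting each cookie on the first "=" only keeps any trailing "=" padding in the value intact.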
# -*- coding: utf-8 -*-
import json

import scrapy
from scrapy import Request

from .mongo import MongoOperate
from .settings import *


class DataSpider(scrapy.Spider):
    name = "data"
    allowed_domains = ["mp.weixin.qq.com"]
    start_urls = ['https://mp.weixin.qq.com/']
    url = "https://mp.weixin.qq.com/mp/profile_ext?action=getmsg&__biz={biz}&f=json&offset={index}&count=10&is_ok=1&scene=124&uin=777&key=777&pass_ticket=ULeI%2BILkTLA2IpuIDqbIla4jG6zBTm1jj75UIZCgIUAFzOX29YQeTm5UKYuXU6JY&wxtoken=&appmsg_token=925_%252B4oEmoVo6AFzfOotcwPrPnBvKbEdnLNzg5mK8Q~~&x5=0&f=json"

    def start_requests(self):
        # Load the stored accounts (biz and wap_sid2) from MongoDB
        MongoObj = MongoOperate(MONGO_URI, MONGO_DATABASE, MONGO_USER, MONGO_PASS, RESPONSE)
        MongoObj.connect()
        items = MongoObj.finddata()
        for item in items:
            headers = {
                'Accept-Encoding': 'gzip, deflate',
                'Connection': 'keep-alive',
                'Accept': '*/*',
                'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 10_0_1 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) Mobile/14A403 MicroMessenger/6.5.18 NetType/WIFI Language/zh_CN',
                'Accept-Language': 'zh-cn',
                'X-Requested-With': 'XMLHttpRequest',
                'X-WECHAT-KEY': '62526065241838a5d44f7e7e14d5ffa3e87f079dc50a66e615fe9b6169c8fdde0f7b9f36f3897212092d73a3a223ffd21514b690dd8503b774918d8e86dfabbf46d1aedb66a2c7d29b8cc4f017eadee6',
                'X-WECHAT-UIN': 'MTU2MzIxNjQwMQ%3D%3D',
                'Cookie': ';wxuin=1563216401;pass_ticket=oQDl45NRtfvQIxv2j2pYDSOOeflIXU7V3x1TUaOTpi6SkMp2B3fJwF6TE40ATCpU;ua_id=Wz1u21T8nrdNEyNaAAAAAOcFaBcyz4SH5DoQIVDcnao=;pgv_pvid=7103943278;sd_cookie_crttime=1501115135519;sd_userid=8661501115135519;3g_guest_id=-8872936809911279616;tvfe_boss_uuid=8ed9ed1b3a838836;mobileUV=1_15c8d374ca8_da9c8;pgv_pvi=8005854208'
            }
            biz = item["biz"]
            # wap_sid2 is what the server actually validates; a different pass_ticket does not matter
            headers["Cookie"] = "wap_sid2=" + item["wap_sid2"] + headers["Cookie"]
            yield Request(url=self.url.format(biz=biz, index="10"), headers=headers, callback=self.parse, dont_filter=True, meta={"biz": biz, "headers": headers, "offset": 10})

    def parse(self, response):
        biz = response.request.meta["biz"]
        headers = response.request.meta["headers"]
        offset = response.request.meta["offset"]
        resText = json.loads(response.text)
        print(resText)
        # general_msg_list is a JSON string nested inside the JSON response
        msg_list = json.loads(resText["general_msg_list"])
        print(msg_list)
        yield msg_list
        if resText["can_msg_continue"] == 1:
            # The offset is carried per account in meta instead of on the spider,
            # so several accounts do not share one counter
            next_offset = offset + 10
            yield Request(url=self.url.format(biz=biz, index=str(next_offset)), headers=headers, callback=self.parse, dont_filter=True, meta={"biz": biz, "headers": headers, "offset": next_offset})
        else:
            print("end")
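The paging logic in the parse callback can be tried without Scrapy or network access. The sketch below isolates the two steps the spider relies on: decoding general_msg_list, which is a JSON string nested inside the JSON response, and using can_msg_continue to decide whether another page exists. `parse_page`, `PAGE_SIZE` and the sample payload are hypothetical; only the two top-level keys come from the spider.

```python
import json

PAGE_SIZE = 10  # matches count=10 in the request URL


def parse_page(response_text, offset):
    """Decode one getmsg response.

    Returns (messages, next_offset); next_offset is None on the last page.
    """
    outer = json.loads(response_text)
    # general_msg_list is itself a JSON document stored as a string.
    messages = json.loads(outer["general_msg_list"])
    has_more = outer.get("can_msg_continue") == 1
    next_offset = offset + PAGE_SIZE if has_more else None
    return messages, next_offset


# Hypothetical payload: the contents of "list" are invented for illustration.
sample = json.dumps({
    "can_msg_continue": 1,
    "general_msg_list": json.dumps({"list": [{"id": 1}, {"id": 2}]}),
})

messages, next_offset = parse_page(sample, offset=10)
print(len(messages["list"]), next_offset)
```

Returning the next offset rather than storing it on a shared object keeps each account's pagination independent.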
The data finally captured
static function OnBeforeResponse(oSession: Session) {
    // Highlight the account home page response and save it to a file
    if (oSession.HostnameIs("mp.weixin.qq.com") && oSession.uriContains("/mp/profile_ext?action=home")) {
        oSession["ui-color"] = "orange";
        // Backslashes in the path must be escaped inside a string literal
        oSession.SaveResponse("C:\\Users\\Administrator\\Desktop\\2.txt", false);
        //oSession.SaveResponseBody("C:\\Users\\Administrator\\Desktop\\1.txt")
    }
    if (m_Hide304s && oSession.responseCode == 304) {
        oSession["ui-hide"] = "true";
    }
}
Response headers
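The response saved by the Fiddler rule can then be read back in Python. The sketch below assumes that the saved text is a raw HTTP response (status line, headers, blank line, body) and that wap_sid2 arrives in a Set-Cookie header. The article does not spell out either point, so treat both as assumptions. The function names and the sample response are invented for illustration.

```python
def split_raw_response(raw_text):
    """Split a saved HTTP response into (header_lines, body)."""
    normalized = raw_text.replace("\r\n", "\n")
    head, _, body = normalized.partition("\n\n")
    return head.split("\n"), body


def find_set_cookie(header_lines, cookie_name):
    """Return the value of cookie_name from the Set-Cookie headers, or None."""
    for line in header_lines:
        field, _, rest = line.partition(":")
        if field.strip().lower() != "set-cookie":
            continue
        # Only the first "name=value" pair matters; attributes follow the ";".
        name, _, value = rest.split(";", 1)[0].strip().partition("=")
        if name == cookie_name:
            return value
    return None


# Made-up response text standing in for the file written by SaveResponse.
raw = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=UTF-8\r\n"
    "Set-Cookie: wxuin=1563216401; Path=/\r\n"
    "Set-Cookie: wap_sid2=EXAMPLE_TOKEN=; Path=/; HttpOnly\r\n"
    "\r\n"
    "<html>...</html>"
)

headers, body = split_raw_response(raw)
print(find_set_cookie(headers, "wap_sid2"))
```

In practice `raw` would come from reading the saved file, after which the extracted value can be stored next to the account's __biz for the spider to use.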
Haha
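The readme.md file in the source code explains how to use it. If you need the code, you can get it directly from GitHub (GitHub source link); if you like it, give it a star.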
Author: Evtion
Link: