
Data Collection in Practice: Scraping Dynamic Web Pages

優采云 Published: 2022-05-04 17:00

  Part 1: Introduction

  The previous post covered how to scrape static web pages. This post covers how to scrape dynamic web pages.

  The example site for this article is www.xfz.cn. Our goal is to collect specific text fields from its pages and save them.

  The complete code is attached at the end of the article.

  Part 2: What Is a Dynamic Web Page

  Often, the data we want to extract is not in the HTML source we download. For example, when we scroll through Qzone or Weibo comments, the page keeps getting longer and showing more content without ever refreshing.

  More precisely, while we browse the site, the page keeps sending requests to the server in response to user actions (such as scrolling down with the mouse wheel to load more content) and uses JavaScript to add the returned data to the page. Take Baidu Images as an example: after searching for a picture and scrolling down, new images keep appearing although the page never refreshes. This is a dynamically loaded page.
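  The point that JavaScript-loaded data is missing from the downloaded HTML can be illustrated with a small sketch. The HTML below is a made-up page, not the source of any real site, and the helper names (TextExtractor, visible_text, is_in_static_html) are hypothetical. It only shows the idea: the list container is empty in the static source, so searching the source for an article title finds nothing.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes of an HTML document, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text a reader would see in the static HTML, joined by spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def is_in_static_html(html, needle):
    """True if `needle` appears in the text of the static HTML."""
    return needle in visible_text(html)


# Hypothetical page source: the list stays empty until JavaScript fills it.
STATIC_HTML = """
<html><body>
  <h1>Latest articles</h1>
  <ul id="article-list"></ul>
  <button id="load-more">Load more</button>
  <script>fetch('/api/articles?p=1').then(render);</script>
</body></html>
"""

print(is_in_static_html(STATIC_HTML, "Latest articles"))     # heading is in the source
print(is_in_static_html(STATIC_HTML, "Some article title"))  # list items are not
```

  If a value that is visible in the browser cannot be found this way in the downloaded source, that is a strong hint that the page loads it dynamically.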

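  Part 3: Collecting the Data by Hand

  The example site for this article is www.xfz.cn. Its content is shown in the figure below:

  

  Suppose we need four pieces of information for each article: the title, the keywords, the publication date, and the link to the detail page. The title, keywords, and publication date are all visible on the list page. To get the detail link, we have to click through to the article's detail page on the site, as shown in the figure below:

  

  If there is a lot to collect, doing it by hand wastes a great deal of time, so we can use Python to automate the collection.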

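  Part 4: Collecting the Data Automatically

  (1) Analyzing the dynamically loaded page

  This site loads new data without refreshing the page only when the button at the bottom of the page is clicked, as shown in the figure below:

  

  Open the developer tools (press F12 in Chrome), select the XHR filter, and click the button at the bottom of the page several times to load more content. Each click produces one captured request. Inspecting a captured request shows that its response contains the data we want. The procedure is shown in the figure below:

  

  The content displayed on the page:

  

  So we can request this API directly to get the data. First, we extract the URLs of the three requests: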

  Page 2: https://www.xfz.cn/api/website/articles/?p=2&n=20&type=
  Page 3: https://www.xfz.cn/api/website/articles/?p=3&n=20&type=
  Page 4: https://www.xfz.cn/api/website/articles/?p=4&n=20&type=

  Tip: this URL is a GET request with parameters. The address and the parameters are separated by ?, and the parameters are separated from each other by &.
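  The structure described in the tip can be checked with Python's standard urllib.parse module. This is a minimal sketch that splits the page 2 URL from above into its parts. Note that keep_blank_values=True is needed, otherwise parse_qs drops the empty type parameter.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://www.xfz.cn/api/website/articles/?p=2&n=20&type="

# Everything after "?" is the query string; "&" separates the parameters.
parts = urlsplit(url)
params = parse_qs(parts.query, keep_blank_values=True)

print(parts.netloc)  # www.xfz.cn
print(parts.path)    # /api/website/articles/
print(parts.query)   # p=2&n=20&type=
print(params)        # {'p': ['2'], 'n': ['20'], 'type': ['']}
```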

  Comparing the URLs page by page, only p changes among the three parameters: each click increases p by 1, so p controls pagination. We can fetch different pages by changing p, and we can infer that p=1 requests the first page of the site.
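  The pagination rule above can be sketched as a small URL builder. The function name build_page_url and its parameter names are hypothetical. Reading n as the number of items per page and type as a category filter is an assumption, since the article only establishes what p does.

```python
from urllib.parse import urlencode

API_BASE = "https://www.xfz.cn/api/website/articles/"


def build_page_url(page, per_page=20, article_type=""):
    """Build the API URL for one page (p is the page number, starting at 1).

    Assumption: n is the page size and type is a category filter;
    the article only confirms the meaning of p.
    """
    if page < 1:
        raise ValueError("page numbers start at 1")
    query = urlencode({"p": page, "n": per_page, "type": article_type})
    return f"{API_BASE}?{query}"


for page in range(1, 4):
    print(build_page_url(page))
```

  Building the query with urlencode instead of string concatenation keeps the URL valid if a parameter value ever contains characters that need escaping.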

 ?。ǘ┐a實(shí)現1. 請求頁(yè)面并解析數據

  import?requests<br style="outline: 0px;max-width: 100%;box-sizing: border-box !important;overflow-wrap: break-word !important;" />import?time<br style="outline: 0px;max-width: 100%;box-sizing: border-box !important;overflow-wrap: break-word !important;" /><br style="outline: 0px;max-width: 100%;box-sizing: border-box !important;overflow-wrap: break-word !important;" />for?page?in?range(1,?6):??#?獲取5頁(yè)數據<br style="outline: 0px;max-width: 100%;box-sizing: border-box !important;overflow-wrap: break-word !important;" />????#?利用format構造URL<br style="outline: 0px;max-width: 100%;box-sizing: border-box !important;overflow-wrap: break-word !important;" />????url?=?'https://www.xfz.cn/api/website/articles/?p={}&n=20&type='.format(page)<br style="outline: 0px;max-width: 100%;box-sizing: border-box !important;overflow-wrap: break-word !important;" />????#?發(fā)送請求獲取響應<br style="outline: 0px;max-width: 100%;box-sizing: border-box !important;overflow-wrap: break-word !important;" />????res?=?requests.get(url=url)<br style="outline: 0px;max-width: 100%;box-sizing: border-box !important;overflow-wrap: break-word !important;" />????#?將響應的json格式字符串,解析成為Python字典格式<br style="outline: 0px;max-width: 100%;box-sizing: border-box !important;overflow-wrap: break-word !important;" />????info_dic?=?res.json()<br style="outline: 0px;max-width: 100%;box-sizing: border-box !important;overflow-wrap: break-word !important;" />????#?提取我們想要的數據,并格式化輸出<br style="outline: 0px;max-width: 100%;box-sizing: border-box !important;overflow-wrap: break-word !important;" />????for?info?in?info_dic['data']:<br style="outline: 0px;max-width: 100%;box-sizing: border-box !important;overflow-wrap: break-word !important;" />????????result?=?{<br style="outline: 0px;max-width: 100%;box-sizing: border-box !important;overflow-wrap: break-word !important;" />????????????'title':?info['title'],<br style="outline: 0px;max-width: 100%;box-sizing: border-box !important;overflow-wrap: break-word !important;" 
/>????????????'date':?info['time'],<br style="outline: 0px;max-width: 100%;box-sizing: border-box !important;overflow-wrap: break-word !important;" />????????????'keywords':?'-'.join(info['keywords']),<br style="outline: 0px;max-width: 100%;box-sizing: border-box !important;overflow-wrap: break-word !important;" />????????????'href':?'https://www.xfz.cn/post/'?+?str(info['uid'])?+?'.html'<br style="outline: 0px;max-width: 100%;box-sizing: border-box !important;overflow-wrap: break-word !important;" />????????}<br style="outline: 0px;max-width: 100%;box-sizing: border-box !important;overflow-wrap: break-word !important;" />????????print(result)<br style="outline: 0px;max-width: 100%;box-sizing: border-box !important;overflow-wrap: break-word !important;" />????time.sleep(1)??#?控制訪(fǎng)問(wèn)頻率<br style="outline: 0px;max-width: 100%;box-sizing: border-box !important;overflow-wrap: break-word !important;" />

  Output (partial):

    {'title': '「分貝通」完成C+輪1.4億美元融資', 'date': '2022-02-17 10:17:13', 'keywords': '分貝通-DST Global', 'href': 'https://www.xfz.cn/post/10415.html'}
    {'title': '「塬數科技」完成近億元A輪融資,凡卓資本擔任獨家財務顧問', 'date': '2022-02-15 10:17:42', 'keywords': '塬數科技-凡卓資本-晨山資本-博將資本', 'href': 'https://www.xfz.cn/post/10412.html'}
    {'title': '「BUD」獲1500萬美元A+輪融資', 'date': '2022-02-14 10:15:35', 'keywords': '啟明創投-源碼資本-GGV紀源資本-云九資本', 'href': 'https://www.xfz.cn/post/10411.html'}
    {'title': '以圖計算引擎切入千億級數據分析市場,它要讓人人成為分析師,能否造就國內百億級黑馬', 'date': '2022-02-10 11:04:52', 'keywords': '歐拉認知智能-新一代BI', 'href': 'https://www.xfz.cn/post/10410.html'}
    {'title': '前有Rivian市值千億,后有經緯、博原頻頻押注,滑板底盤賽道將誕生新巨頭?丨什么值得投', 'date': '2022-02-09 11:51:36', 'keywords': '什么值得投', 'href': 'https://www.xfz.cn/post/10409.html'}
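  The parsing step can also be separated from the network request, which makes it possible to try it without contacting the site. The sketch below is one way to do that, not the article's own code: extract_articles is a hypothetical helper, and the sample payload is made up. It only mirrors the fields used above (data, title, time, keywords, uid).

```python
def extract_articles(payload):
    """Turn one decoded API response (a dict) into a list of flat dicts."""
    rows = []
    for info in payload.get("data", []):
        rows.append({
            "title": info["title"],
            "date": info["time"],
            # join multiple keywords with "-"; tolerate a missing or empty list
            "keywords": "-".join(info.get("keywords") or []),
            "href": "https://www.xfz.cn/post/{}.html".format(info["uid"]),
        })
    return rows


# Hypothetical payload with the same shape as the fields used in the article.
sample = {
    "data": [
        {
            "title": "Example article",
            "time": "2022-02-17 10:17:13",
            "keywords": ["alpha", "beta"],
            "uid": 10415,
        }
    ]
}

for row in extract_articles(sample):
    print(row)
```

  In the scraping loop, the function would be called as extract_articles(res.json()).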

  2. Saving to a local CSV file

  Building on the previous code, we add a few lines to save the scraped content to a CSV file. There are many ways to write a CSV file. Here we use the third-party pandas module, which is installed with pip install pandas.

    import requests
    import time
    import pandas as pd  # import the module

    # create a data set to hold the rows
    data_set = [
        ('標題', '日期', '關鍵詞', '詳情鏈接'),  # define the header row first
    ]
    for page in range(1, 6):  # fetch 5 pages of data
        # build the URL with format
        url = 'https://www.xfz.cn/api/website/articles/?p={}&n=20&type='.format(page)
        # send the request; the timeout keeps a stalled connection from hanging the script
        res = requests.get(url=url, timeout=10)
        # stop early on an HTTP error status instead of parsing an error page
        res.raise_for_status()
        # parse the JSON string in the response into a Python dict
        info_dic = res.json()
        # extract the fields we want
        for info in info_dic['data']:
            result = {
                'title': info['title'],
                'date': info['time'],
                'keywords': '/'.join(info['keywords']),  # there can be several keywords; separate them with slashes
                'href': 'https://www.xfz.cn/post/' + str(info['uid']) + '.html'  # build the detail page URL
            }
            # take the values of the dict and convert them to a list
            info_list = list(result.values())
            # append to the data set
            data_set.append(info_list)
        time.sleep(1)  # throttle the request rate

    # save as a CSV file; mode='a' appends, so re-running adds to an existing xfz.csv
    df = pd.DataFrame(data_set)
    df.to_csv('xfz.csv', mode='a', encoding='utf-8-sig', header=False, index=False)

  Output (partial):

  
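  Since there are many ways to write a CSV file, here is a minimal sketch of one alternative that uses only the standard library's csv module instead of pandas. The names FIELDS and write_rows and the sample row are hypothetical. The example writes to an in-memory buffer so it can be run without creating a file.

```python
import csv
import io

FIELDS = ("title", "date", "keywords", "href")


def write_rows(rows, stream):
    """Write a header line and one CSV line per row dict to an open text stream."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Made-up row; the comma in the title shows that csv quotes fields when needed.
rows = [
    {
        "title": "Example, with a comma",
        "date": "2022-02-17 10:17:13",
        "keywords": "alpha/beta",
        "href": "https://www.xfz.cn/post/10415.html",
    }
]

buffer = io.StringIO()
write_rows(rows, buffer)
print(buffer.getvalue())
```

  To write a real file, the same function can be given a file opened with open('xfz.csv', 'w', newline='', encoding='utf-8-sig'). The newline='' argument is what the csv module documentation recommends for files passed to a writer.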

  Part 5: Summary

  This article covered the basic workflow and methods for collecting data from dynamic websites. Combined with the previous post on scraping static web pages, you should now have the basic skills for data collection. So how do we process the data once it has been collected? Stay tuned for the next post: Basic Methods of Data Processing in Python.

  Attachment: get_web_data.py

    import requests
    import time
    import pandas as pd  # import the module

    # create a data set to hold the rows
    data_set = [
        ('標題', '日期', '關鍵詞', '詳情鏈接'),  # define the header row first
    ]
    for page in range(1, 6):  # fetch 5 pages of data
        # build the URL with format
        url = 'https://www.xfz.cn/api/website/articles/?p={}&n=20&type='.format(page)
        # send the request; the timeout keeps a stalled connection from hanging the script
        res = requests.get(url=url, timeout=10)
        # stop early on an HTTP error status instead of parsing an error page
        res.raise_for_status()
        # parse the JSON string in the response into a Python dict
        info_dic = res.json()
        # extract the fields we want
        for info in info_dic['data']:
            result = {
                'title': info['title'],
                'date': info['time'],
                'keywords': '/'.join(info['keywords']),  # there can be several keywords; separate them with slashes
                'href': 'https://www.xfz.cn/post/' + str(info['uid']) + '.html'  # build the detail page URL
            }
            # take the values of the dict and convert them to a list
            info_list = list(result.values())
            # append to the data set
            data_set.append(info_list)
        time.sleep(1)  # throttle the request rate

    # save as a CSV file; mode='a' appends, so re-running adds to an existing xfz.csv
    df = pd.DataFrame(data_set)
    df.to_csv('xfz.csv', mode='a', encoding='utf-8-sig', header=False, index=False)
