General solution: how to collect web page data with CSS selectors
优采云 Published: 2022-10-23 19:30
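Press F12 to open the developer tools and inspect the HTML structure of the article list:
The article title can be obtained with the CSS selector .post-item-title;
The article URL can be obtained with the CSS selector .post-item-title (its href attribute);
The article summary can be obtained with the CSS selector .post-item-summary;
The author can be obtained with the CSS selector .post-item-author;
The user avatar can be obtained with the CSS selector img.avatar;
The number of likes can be obtained with the CSS selector .post-item-foot a.post-meta-item;
The number of comments can be obtained with the CSS selector .post-item-foot a[class*=post-meta-item]:nth-of-type(3);
The number of views can be obtained with the CSS selector .post-item-foot a[class*=post-meta-item]:nth-of-type(4) span;
Now write the collection rules, save them, and open the page to check whether the data is being collected.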
{
  "title": "cnblogs home page article list",
  "match": "https://www.cnblogs.com/*",
  "demo": "https://www.cnblogs.com/#p2",
  "delay": 2,
  "rules": [
    {
      "root": "#post_list .post-item",
      "multi": true,
      "desc": "Article list",
      "fetches": [
        { "name": "Article title", "selector": ".post-item-title" },
        { "name": "Article URL", "selector": ".post-item-title", "type": "attr", "attr": "href" },
        { "name": "Article summary", "selector": ".post-item-summary" },
        { "name": "Author", "selector": ".post-item-author" },
        { "name": "Avatar", "selector": "img.avatar", "type": "attr", "attr": "src" },
        { "name": "Likes", "selector": ".post-item-foot a.post-meta-item" },
        { "name": "Comments", "selector": ".post-item-foot a[class*=post-meta-item]:nth-of-type(3)" },
        { "name": "Views", "selector": ".post-item-foot a[class*=post-meta-item]:nth-of-type(4)" }
      ]
    }
  ]
}
Writing the content-page collection rules
The method is the same as above, so the code is given here directly.
{
  "title": "cnblogs article content",
  "match": "https://www.cnblogs.com/*/p/*.html",
  "demo": "https://www.cnblogs.com/bianchengyouliao/p/15541078.html",
  "delay": 2,
  "rules": [
    {
      "multi": false,
      "desc": "Article content",
      "fetches": [
        { "name": "Article title", "selector": "#cb_post_title_url" },
        { "name": "Body", "selector": "#cnblogs_post_body", "type": "html" }
      ]
    }
  ]
}
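Adding a scheduled task (for batch collection and paginated collection).
In the scheduled task, a dynamic URL is used to obtain the addresses of the article pages to be collected. Once the addresses have been obtained, the plugin opens each page automatically. As soon as a page is open, the plugin matches it against the collection rules and collects the data.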
https://www.cnblogs.com/
[a.post-item-title,href]:https://www.cnblogs.com/#p[2,10,1]
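The bracketed suffix of the dynamic URL reads like a numeric range, and the [a.post-item-title,href] prefix presumably names the selector and attribute from which article links are taken. The sketch below is only an illustration under two assumptions that the source does not confirm: that [2,10,1] means start 2, end 10 (inclusive), step 1, and that * in a rule's match field is a wildcard for any run of characters. The helper names expand_range_url and matching_rules are made up for this example.

```python
import re
from fnmatch import fnmatchcase

# "match" patterns copied from the two rules above.
RULES = {
    "list": "https://www.cnblogs.com/*",
    "content": "https://www.cnblogs.com/*/p/*.html",
}


def expand_range_url(template):
    """Expand one [start,end,step] placeholder into a list of URLs.

    Assumption: the end value is inclusive.
    """
    m = re.search(r"\[(\d+),(\d+),(\d+)\]", template)
    if m is None:
        return [template]
    start, end, step = (int(g) for g in m.groups())
    return [
        template[:m.start()] + str(n) + template[m.end():]
        for n in range(start, end + 1, step)
    ]


def matching_rules(url):
    """Return the names of the rules whose pattern matches url."""
    return [name for name, pattern in RULES.items() if fnmatchcase(url, pattern)]


pages = expand_range_url("https://www.cnblogs.com/#p[2,10,1]")
print(pages[0], pages[-1], len(pages))
print(matching_rules(pages[0]))
print(matching_rules("https://www.cnblogs.com/bianchengyouliao/p/15541078.html"))
```

Under these assumptions an article URL satisfies both patterns, so it is worth checking in the plugin's own documentation how overlapping rules are prioritized.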
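Optimized solution: keyword crawler, collecting Huaban board keywords with Python and storing them in a database
If you are looking for images, don't miss this site. It has all kinds of images, and the keywords of its recommended boards are quite good, although unfortunately many have been censored. If you want to collect Huaban boards, a Python crawler is certainly up to the job, and Huaban's data is rather interesting!
Looking at the page source, there is something that resembles a data interface: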
app.page["explores"] = [{"keyword_id":1541, "name":"創意燈", "urlname":"創藝燈籠", "cover":{"farm":"farm1", "bucket":"hbimg", "key":"f77b1c1df184ce91ff529a4d0b5211aa883872c91345f-tdQn2g", "type":"image/jpeg", "width":468, "height":702, "frames":1, "file_id":15723730}, "
On reflection, extracting it with a regular expression is simpler and more convenient!
Regular expression:
explores = re.findall(r'app\.page\["explores"\] = \[(.+?)\];.+?app\.page\["followers"\]', html, re.S)[0]
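Note the escape characters here: the dots, square brackets, and quotes that belong to the page source must be matched literally.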
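To make the escaping concrete, here is a minimal sketch run against a made-up stand-in for the page source (the sample string and its second entry are invented for illustration). It also parses the captured array with json.loads instead of a second regular expression, which is a swapped-in alternative and assumes the embedded array is valid JSON with no "];" inside any string.

```python
import json
import re

# Made-up stand-in for the page source; real pages carry many more fields.
html = (
    'app.page["explores"] = [{"keyword_id":1541, "name":"lamp", "urlname":"lamp_url"}, '
    '{"keyword_id":1542, "name":"chair", "urlname":"chair_url"}];\n'
    'app.page["followers"] = [];'
)

# \. \[ and \] are escaped so they match literally instead of acting as regex syntax.
pattern = r'app\.page\["explores"\] = (\[.*?\]);'
match = re.search(pattern, html, re.S)

# Fall back to an empty list when the marker is absent from the page.
explores = json.loads(match.group(1)) if match else []
pairs = [(item["name"], item["urlname"]) for item in explores]
print(pairs)
```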
Source code:
# -*- coding: UTF-8 -*-
# Huaban recommended board keyword collection
# 20200314 by WeChat: huguo00289
import re
import time

import requests
from fake_useragent import UserAgent

from csql import Save  # local module; see the Save class at the end of this article

key_informations = []


def search(key, keyurl):
    print(f"Querying: {key}")
    ua = UserAgent()
    headers = {"User-Agent": ua.random}
    url = f"https://huaban.com/explore/{keyurl}/"
    html = requests.get(url, headers=headers).content.decode("utf-8")
    time.sleep(2)
    if 'app.page["category"]' in html:
        found = re.findall(r'app\.page\["explores"\] = \[(.+?)\];.+?app\.page\["followers"\]', html, re.S)
        if not found:
            return  # marker present but no keyword block: nothing to collect
        explores = found[0]
        # Each match is a (name, urlname) tuple.
        keyfins = re.findall(r', "name":"(.+?)", "urlname":"(.+?)",', explores, re.S)
        print(keyfins)
        sa = Save(keyfins)
        sa.sav()
        for keyfin in keyfins:
            if keyfin not in key_informations:
                key_informations.append(keyfin)
                # Recurse into every keyword that has not been seen yet.
                search(keyfin[0], keyfin[1])
                print(len(key_informations))
    else:
        print(f"Keyword {key} is not in the industrial design category; skipping!")
    print(len(key_informations))
    print(key_informations)


search('3D打印', '3dp')
The function calls itself recursively, visiting page after page to collect the data!
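Deep chains of related keywords can run into Python's recursion limit. As an alternative to the recursion used in the article, not the author's method, the same traversal can be written with an explicit queue and a set of visited keywords. The keyword graph FAKE_RELATED and the function crawl below are hypothetical stand-ins so the sketch runs without touching the site.

```python
from collections import deque

# Hypothetical keyword graph standing in for the live site.
FAKE_RELATED = {
    "3dp": ["lamp", "chair"],
    "lamp": ["chair", "3dp"],
    "chair": [],
}


def crawl(start, fetch_related):
    """Visit every keyword reachable from start exactly once (breadth-first)."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        key = queue.popleft()
        order.append(key)
        for nxt in fetch_related(key):
            if nxt not in seen:  # skip keywords that are already queued or visited
                seen.add(nxt)
                queue.append(nxt)
    return order


print(crawl("3dp", lambda k: FAKE_RELATED.get(k, [])))
```

In a real crawler, fetch_related would wrap the HTTP request and regular expression from the listing above.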
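Huaban board keyword collection
The data is loaded as you scroll down, through AJAX requests.
There is also a pattern: the max parameter of the next scroll request is the seq of the last board in the previous response!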
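The pattern just described is a form of cursor pagination. The sketch below illustrates it against made-up data; fake_fetch stands in for the AJAX request, and it assumes that seq values descend and that a request returns boards with seq below max, neither of which the source states explicitly.

```python
# Made-up data: seq values descend, as a newest-first feed would.
ALL_BOARDS = [{"seq": s, "title": f"board {s}"} for s in range(50, 0, -1)]


def fake_fetch(max_seq, limit=20):
    """Stand-in for the AJAX request: up to limit boards with seq below max_seq."""
    older = [b for b in ALL_BOARDS if b["seq"] < max_seq]
    return older[:limit]


def iter_boards(first_max, fetch):
    """Yield boards page by page, feeding the last seq back in as the next max."""
    max_seq = first_max
    while True:
        boards = fetch(max_seq)
        if not boards:
            return  # an empty page means there is nothing left to load
        yield from boards
        max_seq = boards[-1]["seq"]


titles = [b["title"] for b in iter_boards(51, fake_fetch)]
print(len(titles), titles[0], titles[-1])
```

Stopping on an empty page gives the loop a clear exit condition.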
Source code:
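# -*- coding: UTF-8 -*-
# Huaban board keyword collection
# 20200320 by WeChat: huguo00289
import json
import time

import requests

from csql import Save  # local module; see the Save class at the end of this article


def get_board(id):
    # Request headers copied from a logged-in browser session.
    headers = {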
'Cookie': 'UM_distinctid=170c29e8d8f84f-0b44fc835bc8e3-43450521-1fa400-170c29e8d903de; CNZZDATA1256914954=1367860536-1583810242-null%7C1583837292; _uab_collina=158415646085953266966037; __auc=30586f3f170d7154a5593583b24; __gads=ID=28115786a916a7a1:T=1584156505:S=ALNI_MbtohAUwMbbd5Yoa5OBBaSO0tSJkw; _hmt=1; sid=s%3AkwSz9iaMxZf-XtcJX9rrY4ltNDbqkeYs.bc8fvfAq6DLGxsRQ6LF9%2FmHcjOGIhRSZC0RkuKyHd7w; referer=https%3A%2F%2Fwww.baidu.com%2Flink%3Furl%3Df1FbGruB8SzQQxEDyaJ_mefz-bVnJFZJaAcQYJGXTZq%26wd%3D%26eqid%3Dda22ff4e0005f208000000065e74adf2; uid=29417717; _f=iVBORw0KGgoAAAANSUhEUgAAADIAAAAUCAYAAADPym6aAAABJ0lEQVRYR%2B1VuxHCMAyVFqKjomEjVgkb0VDRMQgrmJMdBcUn2VbAXDiSJpb9%2FHl6%2BiCEEAAAAiL9AJP5sgHSQuMXAOIB6NxXO354DOlhxodMhB8vicQxjgxrN4l1IrMRMRzmVkSeQ4pMIUdRp4RNaU4LsRzPNt9rKekmooWWDJVvjqVTuxKJeTWqJL1vkV2CZzJdifRWZ5EitfJrxbI2r6nEj8rxs5w08pAwLkXUgrGg%2FDoqdTN0IzK5ylAkXG6pgx%2F3sfPntuZqxsh9JUkk%2Fry7FtWbdXZvaNFFkgiPLRJyXe5txZfIbEQ4nMjLNe9K7FS9hJqrUeTnibQm%2BeoV0R5olZZctZqKGr5bsnuISPXy8muRssrv6X6AnNRbVau5LX8A%2BDed%2FQkRsJAorSTxBAAAAABJRU5ErkJggg%3D%3D%2CWin32.1920.1080.24; Hm_lvt_d4a0e7c3cd16eb58a65472f40e7ee543=1584330161,1584348316,1584516528,1584705015; __asc=c7dc256a170f7c78b1b2b6abc60; CNZZDATA1256903590=1599552095-1584151635-https%253A%252F%252Fwww.baidu.com%252F%7C1584704759; _cnzz_CV1256903590=is-logon%7Clogged-in%7C1584705067566%26urlname%7Cxpmvxxfddh%7C1584705067566; Hm_lpvt_d4a0e7c3cd16eb58a65472f40e7ee543=1584705067',
        'Referer': 'https://huaban.com/discovery/industrial_design/boards/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
        'X-Request': 'JSON',
        'X-Requested-With': 'XMLHttpRequest',
    }
    url = "https://huaban.com/discovery/industrial_design/boards/?k804hb1m&max=%s&limit=20&wfl=1" % id
    html = requests.get(url, headers=headers, timeout=8).content.decode('utf-8')
    time.sleep(1)
    if not html:
        return None
    req = json.loads(html)
    boards = req['boards']
    print(len(boards))
    if not boards:
        return None  # no more boards: stop paging
    for board in boards:
        print(board['title'])
        sa = Save(board['title'])
        sa.sav2()
    # The seq of the last board is the max value for the next request.
    return boards[-1]['seq']


if __name__ == '__main__':
    id = "1584416341304281760"
    # Keep paging until get_board() reports that nothing more was returned.
    while id:
        id = get_board(id)
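A while loop drives the paging, and each request starts from the seq returned by the previous one.
Finally, the results are saved to the database.
Source code: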
import pymysql


class Save(object):
    def __init__(self, key):
        self.host = "localhost"
        self.user = "root"
        self.password = "123456"
        self.db = "xiaoshuo"
        self.port = 3306
        self.connect = pymysql.connect(
            host=self.host,
            user=self.user,
            password=self.password,
            db=self.db,
            port=self.port,
        )
        self.cursor = self.connect.cursor()  # create a cursor
        self.key = key

    def insert(self):
        # self.key is a list of (name, urlname) tuples; only the name is stored.
        for keyword in self.key:
            try:
                sql = "INSERT INTO huaban(keyword) VALUES (%s)"
                val = (keyword[0],)  # parameters are passed as a one-element tuple
                self.cursor.execute(sql, val)
                self.connect.commit()
                print(f'>>> Inserted {keyword[0]} successfully!')
            except Exception as e:
                print(e)
                print(f'>>> Failed to insert {keyword[0]}!')

    def insert2(self):
        # self.key is a single board title.
        keyword = self.key
        try:
            sql = "INSERT INTO huaban2(keyword) VALUES (%s)"
            val = (keyword,)
            self.cursor.execute(sql, val)
            self.connect.commit()
            print(f'>>> Inserted {keyword} successfully!')
        except Exception as e:
            print(e)
            print(f'>>> Failed to insert {keyword}!')

    def cs(self):
        # Close the cursor and the database connection.
        self.cursor.close()
        self.connect.close()

    def sav(self):
        self.insert()
        self.cs()

    def sav2(self):
        self.insert2()
        self.cs()
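The class above opens a new connection for every batch of keywords or every single title. As a hedged sketch of an alternative, the example below reuses one connection and writes all rows in a single transaction. It swaps in the standard library's sqlite3 for pymysql so that it runs without a MySQL server; the single-column table layout is an assumption, and save_keywords is a made-up helper name. Note that sqlite3 uses ? placeholders where pymysql uses %s.

```python
import sqlite3


def save_keywords(conn, keywords):
    """Insert all keywords in one transaction; return how many rows were written."""
    rows = [(k,) for k in keywords]  # one-element tuples, one per row
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT INTO huaban(keyword) VALUES (?)", rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE huaban (keyword TEXT)")  # assumed table layout
print(save_keywords(conn, ["lamp", "chair"]))
print(conn.execute("SELECT keyword FROM huaban ORDER BY rowid").fetchall())
conn.close()
```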