Collecting WeChat Official Account Articles with querylist (free video tutorials on Python crawlers, data analysis, web development, and more)
優(yōu)采云 發(fā)布時(shí)間: 2022-02-21 08:11querylist采集微信公眾號文章(
Python爬蟲(chóng)、數據分析、網(wǎng)站開(kāi)發(fā)等案例教程視頻免費在線(xiàn)觀(guān)看
)
Preface

The text and images in this article have been filtered from the web and are for learning and exchange only; they serve no commercial purpose. If there is any problem, please contact us promptly.

Free video tutorials on Python crawlers, data analysis, web development, and more: https://space.bilibili.com/523606542
基礎開(kāi)發(fā)環(huán)境爬取兩個(gè)公眾號的文章:
1、爬取所有文章
青燈編程公眾號
2、爬取所有關(guān)于python的公眾號文章
爬取所有文章
青燈編程公眾號
1. Log in to the official-account platform and click "圖文" (image-and-text message)
2. Open the browser's developer tools
3. Click "超鏈接" (hyperlink)

As the related data loads, one of the network packets contains the article title, link, digest, publish time, and so on. You can also scrape other official accounts this way, but it requires having a WeChat official account of your own.
Add your own cookie to the headers, then run:
import pprint
import time
import requests
import csv

# CSV file to collect the results: title, publish time, article URL
f = open('青燈公眾號文章.csv', mode='a', encoding='utf-8', newline='')
csv_writer = csv.DictWriter(f, fieldnames=['標題', '文章發布時間', '文章地址'])
csv_writer.writeheader()

for page in range(0, 40, 5):
    url = f'https://mp.weixin.qq.com/cgi-bin/appmsg?action=list_ex&begin={page}&count=5&fakeid=&type=9&query=&token=1252678642&lang=zh_CN&f=json&ajax=1'
    headers = {
        'cookie': '加cookie',  # paste your own cookie here
        'referer': 'https://mp.weixin.qq.com/cgi-bin/appmsg?t=media/appmsg_edit_v2&action=edit&isNew=1&type=10&createType=0&token=1252678642&lang=zh_CN',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
    }
    response = requests.get(url=url, headers=headers)
    html_data = response.json()
    pprint.pprint(html_data)
    lis = html_data['app_msg_list']
    for li in lis:
        title = li['title']
        link_url = li['link']
        update_time = li['update_time']
        # update_time is a Unix timestamp; format it as a date string
        timeArray = time.localtime(int(update_time))
        otherStyleTime = time.strftime("%Y-%m-%d %H:%M:%S", timeArray)
        dit = {
            '標題': title,
            '文章發布時間': otherStyleTime,
            '文章地址': link_url,
        }
        csv_writer.writerow(dit)
        print(dit)
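A note on the loop above: the appmsg endpoint pages with begin (the offset of the first item) and count (the page size), so range(0, 40, 5) walks the first eight pages of five articles each. To cover more pages, extend the stop value in steps of count; a small sketch:

```python
count = 5   # page size; must match the count= parameter in the URL
pages = 8   # how many pages to fetch

# Offsets for the begin= parameter, one per page
offsets = [page * count for page in range(pages)]
print(offsets)  # → [0, 5, 10, 15, 20, 25, 30, 35]

# Equivalent to the range(0, pages * count, count) used in the script above.
```

When you raise the page count, it is also worth adding a short time.sleep between requests so the server does not throttle or block you.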
Task 2: scrape all official-account articles about Python

1. Search "python" on Sogou and switch to the WeChat (微信) tab.

Note: without logging in you can only scrape the first ten pages of results; after logging in you can scrape more than 20,000 articles.

2. From the static result pages, scrape the title, official-account name, publish time, and article URL.
import time
import requests
import parsel
import csv

# CSV file to collect the results: title, account name, publish time, URL
f = open('公眾號文章.csv', mode='a', encoding='utf-8', newline='')
csv_writer = csv.DictWriter(f, fieldnames=['標題', '公眾號', '文章發布時間', '文章地址'])
csv_writer.writeheader()

for page in range(1, 2447):
    url = f'https://weixin.sogou.com/weixin?query=python&_sug_type_=&s_from=input&_sug_=n&type=2&page={page}&ie=utf8'
    headers = {
        'Cookie': '自己的cookie',  # paste your own cookie here
        'Host': 'weixin.sogou.com',
        'Referer': 'https://www.sogou.com/web?query=python&_asf=www.sogou.com&_ast=&w=01019900&p=40040100&ie=utf8&from=index-nologin&s_from=index&sut=1396&sst0=1610779538290&lkt=0%2C0%2C0&sugsuv=1590216228113568&sugtime=1610779538290',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
    }
    response = requests.get(url=url, headers=headers)
    selector = parsel.Selector(response.text)
    lis = selector.css('.news-list li')
    for li in lis:
        # Sogou highlights the keyword with <em> tags, which splits the
        # title text into fragments; rejoin them around the keyword.
        title_list = li.css('.txt-box h3 a::text').getall()
        if len(title_list) == 1:
            title_str = 'python' + title_list[0]
        else:
            title_str = 'python'.join(title_list)
        href = li.css('.txt-box h3 a::attr(href)').get()
        article_url = 'https://weixin.sogou.com' + href
        name = li.css('.s-p a::text').get()
        # The t attribute holds the publish time as a Unix timestamp
        date = li.css('.s-p::attr(t)').get()
        timeArray = time.localtime(int(date))
        otherStyleTime = time.strftime("%Y-%m-%d %H:%M:%S", timeArray)
        dit = {
            '標題': title_str,
            '公眾號': name,
            '文章發布時間': otherStyleTime,
            '文章地址': article_url,
        }
        csv_writer.writerow(dit)
        print(title_str, name, otherStyleTime, article_url)
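One detail worth calling out from the script above: because Sogou wraps the search keyword in <em> highlight tags, extracting the link's text nodes yields the title split at every occurrence of "python". The join logic rebuilds the full title, sketched here with plain strings (the sample title is made up for illustration):

```python
keyword = 'python'

# Text nodes around <em>python</em>, as ::text would return them
title_list = ['Learn ', ' in 30 days']

if len(title_list) == 1:
    # A single fragment means the keyword sat at the very start of the title
    title_str = keyword + title_list[0]
else:
    # Otherwise the keyword was removed wherever it was highlighted;
    # joining the fragments with it restores the original title
    title_str = keyword.join(title_list)

print(title_str)  # → Learn python in 30 days
```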