Scrapy Crawler Framework: Scraping Tmall and Taobao Data

優采云, published: 2020-05-05 08:05

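Building on the previous two articles, this one walks through scraping Tmall and Taobao data to show in detail how to extract the content you want with Scrapy. Complete code: [version without database][database version].

The task: run a search on Tmall and, for every product in the results, collect its sales volume, number of favorites, and price.

That means fetching two kinds of pages. The first is the search results page, from which we extract the detail-page URL of each product. The second is the product detail page, from which we read the sales volume, price, and so on.

With that plan in place, we first download the search results page and then download the detail page for each item on it.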

    def _parse_handler(self, response):
        """Download the page."""
        # Load the URL in the Selenium-driven browser so that JavaScript runs
        self.driver.get(response.url)

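This is simple: self.driver.get(response.url) has Selenium load the page in a real browser, so content rendered by JavaScript is available. The page content carried by response itself is only the static HTML.

That covers downloading. Once a page has been downloaded, we need to pull the useful information out of it, which is the job of selectors. A selector can be constructed in several ways; only one is shown here: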

    >>> from scrapy.selector import Selector
    >>> body = '<html><body><span>good</span></body></html>'
    >>> Selector(text=body).xpath('//span/text()').extract()
    ['good']

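This uses XPath to extract the word good. For more detail, see an XPath tutorial.

Besides XPath, Selector also supports CSS selectors and regular expressions (a Chinese-language tutorial covers these). We will not go into the details here; it is enough to know that these tools let you get at the content you need quickly.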

With that brief introduction to extracting content, we can now get the product detail links from the search results page. Viewing the page source shows that each product link looks like this:

    ...
    <p class="title">
      <a class="J_ClickStat" data-nid="523242229702" href="//detail.tmall.com/item.htm?spm=a230r.1.14.46.Mnbjq5&id=523242229702&ns=1&abbucket=14" target="_blank" trace="msrp_auction" traceidx="5" trace-pid="" data-spm-anchor-id="a230r.1.14.46">WD/西部數據 WD30EZRZ臺式機3T電腦<span class="H">硬盤</span> 西數藍盤3TB 替綠盤</a>
    </p>
    ...

Using the rules introduced above, the href attribute of the a element is the content we need:

    # Do not omit "text=": the first positional argument of Selector is a
    # response object, not a string. self.driver.page_source is the HTML of the page.
    selector = Selector(text=self.driver.page_source)
    selector.css(".title").css(".J_ClickStat").xpath("./@href").extract()

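In short: css selects the p element whose class is title, then the a element whose class is J_ClickStat, and finally the XPath rule reads the href attribute of that a element. One more note: to select by id instead, you would write selector.css("#title"), exactly as in an ordinary CSS selector.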
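The href values in the sample markup are protocol-relative (they start with `//`), so they need a scheme before they can be requested. The following is a minimal sketch using only the standard library; the helper names `normalize_href` and `item_id` and the base URL are assumptions made for this example, not part of the original project.

```python
from urllib.parse import parse_qs, urljoin, urlparse


def normalize_href(href, base_url="https://s.taobao.com/search"):
    """Turn a protocol-relative href into an absolute URL.

    base_url is an assumed address of the search page; only its scheme
    matters when the href already contains a host.
    """
    return urljoin(base_url, href)


def item_id(url):
    """Return the value of the id query parameter, or None if it is absent."""
    return parse_qs(urlparse(url).query).get("id", [None])[0]


href = "//detail.tmall.com/item.htm?spm=a230r.1.14.46.Mnbjq5&id=523242229702&ns=1&abbucket=14"
url = normalize_href(href)
print(url)
print(item_id(url))
```

Inside a Scrapy callback, response.urljoin(href) does the same joining against the URL of the current response.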

The product detail page works the same way. Taking the sales volume as an example, the page source looks like this:

    <ul class="tm-ind-panel">
      <li class="tm-ind-item tm-ind-sellCount" data-label="月銷量"><div class="tm-indcon"><span class="tm-label">月銷量</span><span class="tm-count">881</span></div></li>
      <li class="tm-ind-item tm-ind-reviewCount canClick tm-line3" id="J_ItemRates"><div class="tm-indcon"><span class="tm-label">累計評價</span><span class="tm-count">4593</span></div></li>
      <li class="tm-ind-item tm-ind-emPointCount" data-spm="1000988"><div class="tm-indcon"><a href="//vip.tmall.com/vip/index.htm" target="_blank"><span class="tm-label">送天貓積分</span><span class="tm-count">55</span></a></div></li>
    </ul>

Get the monthly sales volume (月銷量):

    selector.css(".tm-ind-sellCount").xpath("./div/span[@class='tm-count']/text()").extract_first()

Get the total number of reviews (累計評價):

    selector.css(".tm-ind-reviewCount").xpath("./div[@class='tm-indcon']/span[@class='tm-count']/text()").extract_first()

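Finally, the extracted data is wrapped in an Item and returned. Tmall and Taobao pages are structured differently, so each site needs its own extraction rules.

An Item holds the result scraped by Scrapy; later stages can then process these results.

Items are usually defined in items.py: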

    import scrapy

    class Product(scrapy.Item):
        name = scrapy.Field()
        price = scrapy.Field()
        stock = scrapy.Field()
        last_updated = scrapy.Field(serializer=str)

    >>> product = Product(name='Desktop PC', price=1000)
    >>> print(product)
    Product(name='Desktop PC', price=1000)

    >>> product['name']
    'Desktop PC'
    >>> product.get('name')
    'Desktop PC'
    >>> product['price']
    1000
    >>> product['last_updated']
    Traceback (most recent call last):
        ...
    KeyError: 'last_updated'
    >>> product.get('last_updated', 'not set')
    'not set'
    >>> product['lala']  # getting unknown field
    Traceback (most recent call last):
        ...
    KeyError: 'lala'
    >>> product.get('lala', 'unknown field')
    'unknown field'
    >>> 'name' in product  # is name field populated?
    True
    >>> 'last_updated' in product  # is last_updated populated?
    False
    >>> 'last_updated' in product.fields  # is last_updated a declared field?
    True
    >>> 'lala' in product.fields  # is lala a declared field?
    False

    >>> product['last_updated'] = 'today'
    >>> product['last_updated']
    'today'
    >>> product['lala'] = 'test'  # setting unknown field
    Traceback (most recent call last):
        ...
    KeyError: 'Product does not support field: lala'

There is only one thing to watch out for: a field cannot be read as product.name, and it cannot be set with product.name = "name" either.

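Once an Item has been collected in the Spider, it is passed to the Item Pipeline, where several components process it in a defined order.

Each item pipeline component (sometimes just called an "Item Pipeline") is a Python class that implements a simple method. It receives an Item, performs some action on it, and also decides whether the Item continues through the pipeline or is dropped and not processed further.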

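Typical uses of an item pipeline include cleaning and validating scraped data, checking for duplicates, and storing items in a database.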

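Now let's implement an Item filter: fields whose scraped value is None are given a default value such as 0, and if the Item object itself is None the record is dropped.

Pipelines are usually placed in pipelines.py: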

    from scrapy.exceptions import DropItem

    # text_utils is the project's own helper module (not shown in this article)

    class TTDataHandlerPipeline(object):

        def process_item(self, item, spider):
            if item is not None:
                # If only one of the two prices is present, use it for both
                if item["p_standard_price"] is None:
                    item["p_standard_price"] = item["p_shop_price"]
                if item["p_shop_price"] is None:
                    item["p_shop_price"] = item["p_standard_price"]

                # Normalize the counters to int and the prices to str
                item["p_collect_count"] = text_utils.to_int(item["p_collect_count"])
                item["p_comment_count"] = text_utils.to_int(item["p_comment_count"])
                item["p_month_sale_count"] = text_utils.to_int(item["p_month_sale_count"])
                item["p_sale_count"] = text_utils.to_int(item["p_sale_count"])
                item["p_standard_price"] = text_utils.to_string(item["p_standard_price"], "0")
                item["p_shop_price"] = text_utils.to_string(item["p_shop_price"], "0")

                # Compare by value: "is not" tests identity and is unreliable for strings
                item["p_pay_count"] = item["p_pay_count"] if item["p_pay_count"] != "-" else "0"
                return item
            else:
                raise DropItem("Item is None %s" % item)
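The text_utils helpers used by the pipeline are not shown in the article. The sketch below is one guess at how to_int and to_string might behave, inferred only from how they are called; the real project code may differ.

```python
import re


def to_int(value, default=0):
    """Return the first run of digits in value as an int, or default.

    Hypothetical stand-in for text_utils.to_int.
    """
    if value is None:
        return default
    match = re.search(r"\d+", str(value))
    return int(match.group()) if match else default


def to_string(value, default=""):
    """Return value as a stripped string, or default when value is None.

    Hypothetical stand-in for text_utils.to_string.
    """
    if value is None:
        return default
    return str(value).strip()


print(to_int("881"))
print(to_int(None))
print(to_string(None, "0"))
```

Falling back to a default instead of raising keeps a single missing field from discarding the whole item, which matches the intent of the filter described above.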

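Finally, the pipeline has to be registered in settings.py:

    ITEM_PIPELINES = {
        'TaoBao.pipelines.TTDataHandlerPipeline': 250,
        'TaoBao.pipelines.MysqlPipeline': 300,
    }

The smaller the number, the earlier the pipeline runs. Here the data is filtered and cleaned first, and only then does TaoBao.pipelines.MysqlPipeline write the corrected data to the database.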
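The ordering rule can be illustrated with plain Python. This sketch only sorts the dictionary by its priority values to show the resulting order; it is not how Scrapy itself loads pipelines.

```python
# Same entries as in settings.py, deliberately listed out of order.
ITEM_PIPELINES = {
    'TaoBao.pipelines.MysqlPipeline': 300,
    'TaoBao.pipelines.TTDataHandlerPipeline': 250,
}

# Lower numbers run first, regardless of the order they are written in.
execution_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(execution_order)
```

Leaving gaps between the numbers (250, 300) makes it easy to slot another pipeline in between later.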


Everything so far has been started from the command line with scrapy crawl tts. How do you use an IDE's debugger instead? Simply start the crawler from a main block:

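    # Put this in the spider module
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    if __name__ == "__main__":
        settings = get_project_settings()
        process = CrawlerProcess(settings)
        spider = TmallAndTaoBaoSpider
        process.crawl(spider)
        process.start()  # blocks until the crawl has finished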

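Redirects are a common problem when scraping. If Scrapy returns the 302 response without automatically following it to crawl the new address, redirect handling can be switched on in the Scrapy settings. With it enabled, Scrapy follows a redirected URL to its final address and fetches the content there.

Solution: add REDIRECT_ENABLED = True to settings.py. (Current Scrapy releases default this setting to True, so it is also worth checking that it has not been disabled elsewhere.)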

Crawlers often need custom input. For example, the search keyword "hard disk" was hard-coded earlier; how can it be passed in as a parameter instead?

Solution:

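Most of the time we get the complete page. But when a page makes many Ajax requests and the network is slow, Selenium has no way of knowing when those requests have finished. If we load the page with self.driver.get(response.url) and immediately read the data with Selector, the page may not have finished loading and the data will be missing.

Solution: use the waiting tools that Selenium provides to delay reading the content until the data is present or a timeout expires.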
