Issue
Code:
# -*- coding: utf-8 -*-
import scrapy
from ..items import LowesspiderItem
from scrapy.http import Request


class LowesSpider(scrapy.Spider):
    name = 'lowes'

    def start_requests(self):
        start_urls = ['https://www.lowes.com/search?searchTerm=8654RM-42']
        for url in start_urls:
            yield Request(url, cookies={'sn':'2333'})  # Added cookie to bypass location req

    def parse(self, response):
        items = response.css('.grid-container')
        for product in items:
            item = LowesspiderItem()
            # get product price
            productPrice = product.css('.art-pd-price::text').get()
            # get lowesNum
            productLowesNum = response.url.split("/")[-1]
            # get SKU
            productSKU = product.css('.met-product-model::text').get()
            item["productLowesNum"] = productLowesNum
            item["productSKU"] = productSKU
            item["productPrice"] = productPrice
            yield item
Output:
{'productLowesNum': '1001440644',
'productPrice': None,
'productSKU': '8654RM-42'}
Now, I'll have a list of SKUs, so this is how I'm going to format start_urls:
start_urls = ['https://www.lowes.com/search?searchTerm=<some sku>']
This URL redirects me to this link: https://www.lowes.com/pd/ZLINE-KITCHEN-BATH-Ducted-Red-Matte-Wall-Mounted-Range-Hood-Common-42-Inch-Actual-42-in/1001440644
The redirect is handled by Scrapy.
Now the problem
When I have:
start_urls = ['https://www.lowes.com/search?searchTerm=8654RM-42']
I get the SKU but not the price.
However, when I use the actual URL in start_urls:
start_urls = ['https://www.lowes.com/pd/ZLINE-KITCHEN-BATH-Ducted-Red-Matte-Wall-Mounted-Range-Hood-Common-42-Inch-Actual-42-in/1001440644']
then my output is fine:
{'productLowesNum': '1001440644',
'productPrice': '1,449.95',
'productSKU': '8654RM-42'}
So I believe that using a URL that has to be redirected causes my scraper to miss the price for some reason, even though I still get the SKU.
Here's my guess: I had to preset a location cookie because the Lowes website does not show the price unless the user provides a zip code or location. So I assume I have to move or adjust cookies={'sn':'2333'} to make my program work as expected.
Solution
Problem
The main issue here is that some cookies set by the response to the first request
are carried forward to the request that follows the redirect.
These cookies override the cookies you set yourself.
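To make the override easier to picture, here is a toy model of a session cookie jar in plain Python. It is only a sketch and not Scrapy's actual middleware: the ToyCookieJar class is hypothetical, and the value '0000' stands in for whatever the search response might set, which the question does not show.

```python
class ToyCookieJar:
    """Tiny stand-in for a per-session cookie jar (hypothetical, not Scrapy's class)."""

    def __init__(self):
        self.cookies = {}

    def outgoing(self, request_cookies=None):
        # Cookies passed with a request are stored in the jar first,
        # then the whole jar is sent with that request.
        self.cookies.update(request_cookies or {})
        return dict(self.cookies)

    def store_response(self, set_cookie):
        # Cookies set by the server overwrite entries with the same name.
        self.cookies.update(set_cookie)


jar = ToyCookieJar()
first = jar.outgoing({"sn": "2333"})   # search request with the preset store cookie
jar.store_response({"sn": "0000"})     # assumed: the search response sets its own 'sn'
second = jar.outgoing()                # redirected request carries the jar's cookies
print(first, second)
```

In this model the redirected request no longer carries the store value chosen by the spider, which matches the symptom of a missing price.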
Solution
You need to send the cookies explicitly with each request and prevent cookies from previous responses from being added to the next request.
Scrapy has a request meta key called dont_merge_cookies
that is used for this purpose. Set it in your request meta so that cookies stored from earlier requests are not merged into the next request.
With merging disabled, the cookies argument of Request is no longer applied either, so you need to set the cookies explicitly in the request header. Something like this:
def start_requests(self):
    start_urls = ['https://www.lowes.com/search?searchTerm=8654RM-42']
    for url in start_urls:
        yield Request(url, headers={'Cookie': 'sn=2333;'}, meta={'dont_merge_cookies': True})
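Since the question mentions a whole list of SKUs, the same idea can be wrapped in a small helper. This is a minimal sketch using only the standard library: the names cookie_header and request_kwargs are hypothetical, and only the search URL, the sn cookie, and the dont_merge_cookies meta key come from the post.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://www.lowes.com/search"  # search endpoint taken from the question


def cookie_header(cookies):
    """Serialize a dict of cookies into a single Cookie header value."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


def request_kwargs(sku, cookies):
    """Build keyword arguments for one SKU search request (hypothetical helper)."""
    return {
        "url": f"{SEARCH_URL}?{urlencode({'searchTerm': sku})}",
        # The cookie goes into the header because the cookies argument
        # is not applied once dont_merge_cookies is set.
        "headers": {"Cookie": cookie_header(cookies)},
        "meta": {"dont_merge_cookies": True},
    }


if __name__ == "__main__":
    for sku in ["8654RM-42"]:
        print(request_kwargs(sku, {"sn": "2333"}))
```

Inside start_requests, each result could then be passed on with yield Request(**request_kwargs(sku, {'sn': '2333'})), assuming the spider loops over its own SKU list.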
Hope it helps.
Answered By - asimhashmi