Issue
I am using the following code in a spider to scrape shoes from an e-commerce website.
import scrapy


class HugobossSpider(scrapy.Spider):
    name = 'hugoboss'
    # allowed_domains takes bare domain names, not URLs with a path
    allowed_domains = ['hugoboss.com']
    start_urls = ['http://hugoboss.com/de/boss-herren-neuheiten-schuhe//']

    def parse(self, response):
        # Extract the content using XPath and CSS selectors
        url = response.xpath('//div/@data-mouseoverimage').extract()
        product_title = response.xpath('//*[@class="product-tile__productInfoWrapper product-tile__productInfoWrapper--is-small font__subline"]/text()').extract()
        price = response.css('.product-tile__offer .price-sales::text').getall()
        # Combine the extracted content row-wise
        for item in zip(url, product_title, price):
            # Create a dictionary to store the scraped info
            scraped_info = {
                'url': item[0],
                'product_title': item[1],
                'price': item[2],
            }
            # Yield the row so Scrapy can pass it to the CSV export
            yield scraped_info
The shell returns the output normally, but the output CSV file looks very disorganized.
I can't tell where the problem is coming from.
Solution
By the looks of it, your scraper has picked up a bunch of newline characters (\n) together with the product name.
It also seems to pick up the word von, which I presume is not necessary either.
My suggestion would be to do some string manipulation to get rid of them:
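# product_title is a list, so apply the replacement to each entry
product_title = [title.replace("\n", "").replace("von", "") for title in product_title]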
The reason it's best to use .replace(x, y) is that .strip()/.lstrip()/.rstrip() treat their argument as a set of characters and remove any of those characters from the ends of the string, so they may remove necessary characters from your product name.
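To illustrate the difference, here is a minimal sketch. The raw title string and the clean_titles helper are hypothetical, made up for this example and not taken from the site's actual markup. Dropping entries that end up empty is also an assumption on my part: text nodes that contain only whitespace would otherwise shift the rows when the lists are zipped together.

```python
def clean_titles(raw_titles):
    """Remove newlines and the word 'von' from each title (hypothetical helper)."""
    cleaned = [
        # strip() with no argument only trims surrounding whitespace
        t.replace("\n", "").replace("von", "").strip()
        for t in raw_titles
    ]
    # Assumption: entries that were only whitespace are noise, so drop them
    return [t for t in cleaned if t]


raw = "\nvon\nSlip-on\n"  # hypothetical raw title

# replace() removes exactly the given substrings
replaced = raw.replace("\n", "").replace("von", "")
# strip("von\n") removes any of the characters v, o, n, \n from both ends
stripped = raw.strip("von\n")

print(repr(replaced))  # 'Slip-on'
print(repr(stripped))  # 'Slip-'
print(clean_titles([raw, "\n", "  Derby \n"]))  # ['Slip-on', 'Derby']
```

Note that .replace("von", "") also removes "von" inside longer words, so if a product name could contain those letters it may be worth matching it more narrowly.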
Hope this helps
Answered By - Martin Martynas Markevičius