Issue
Looking for advice please on methods to scrape the gender of clothing items on a website that doesn't specify the gender on the product page.
The website I'm crawling is www.very.co.uk and an example of a product page would be this - https://www.very.co.uk/berghaus-combust-reflect-long-jacket-red/1600352465.prd
Looking at that page, there looks to be no easy way to create a script that could identify this item as womenswear. Other websites might have breadcrumbs to use, or the gender might be in the title / URL but this has nothing.
As I'm using scrapy, with the crawl template and Rules to build a hierarchy of links to scrape, I was wondering if it's possible to pass a variable in one of the rules or the starting_URL to identify all items scraped following this rule / starting URL would have a variable as womenswear? I can then feed this variable into a method / loader statement to tag the item as womenswear before putting it into a database.
If not, would anyone have any other ideas on how to categorise this item as womenswear. I saw an example where you could use an excel spreadsheet to create the start_urls and in that excel spreadsheet tag each row as womenswear, mens etc. However, I feel this method might cause issues further down the line and would prefer to avoid it if possible. I'll spare the details of why I think this would be problematic unless anyone asks.
Thanks in advance
Solution
There does seem to be a breadcrumb in your example, however for an alternative you can usually check the page source by simply searching your term - maybe there's some embedded javascript/json that can be extract?
Here you can see some javascript for subcategory that indicates that it's a "womens_everyday_sports_jacket".
You can parse it quite easily with some regex:
re.findall('subcategory: "(.+?)"', response.body_as_unicode())
# womens_everyday_sports_jacket
Answered By - Granitosaurus
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.