Issue
I'm starting out with Scrapy and have managed to extract some of the data I need. However, not everything is obtained properly. I'm applying the knowledge from the official tutorial found here, but it's not working. I've Googled around a bit, and also read this SO question, but I'm fairly certain that isn't the problem here.
Anyhow, I'm trying to parse the product information from this webshop. I'm trying to obtain the product name, price, RRP, release date, category, universe, author and publisher. Here is the relevant CSS for one product: https://pastebin.com/9tqnjs7A. Here's my code; every line with a #! at the end isn't working as expected.
import scrapy
import pprint


class ForbiddenPlanetSpider(scrapy.Spider):
    name = "fp"
    start_urls = [
        'https://forbiddenplanet.com/catalog/?q=mortal%20realms&sort=release-date&page=1',
    ]

    def parse(self, response):
        for item in response.css("section.zshd-00"):
            print(response.css)
            name = item.css("h3.h4::text").get()  #!
            price = item.css("span.clr-price::text").get() + item.css("span.t-small::text").get()
            rrp = item.css("del.mqr::text").get()
            release = item.css("dd.mzl").get()  #!
            category = item.css("li.inline-list__item::text").get()  #!
            universe = item.css("dt.txt").get()  #!
            authors = item.css("a.SubTitleItems").get()  #!
            publisher = item.css("dd.mzl").get()  #!
            pprint.pprint(dict(name=name,
                               price=price,
                               rrp=rrp,
                               release=release,
                               category=category,
                               universe=universe,
                               authors=authors,
                               publisher=publisher
                               )
                          )
I think I need to add some sub-searching (at the moment release and publisher have the same criteria, for example), but I don't know what that's called, so I can't search for it (I've tried, but ended up with generic tutorials that don't cover it). Anything pointing me in the right direction is appreciated!
Oh, and I didn't include ' ' spaces in the selectors, because whenever I used one Scrapy immediately failed to find anything.
Solution
Scrapy doesn't render JS. Try disabling JavaScript in your browser and refreshing the page: the HTML structure is different for the version of the site without JS.
You should rewrite your selectors against that new HTML structure. Also, try using XPath instead of CSS; it's much more flexible.
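One thing XPath handles well is the "sub-searching" problem from the question: picking the dd value that belongs to a particular dt label, e.g. //dt[text()="Publisher"]/following-sibling::dd[1]/text() in a Scrapy selector. As a self-contained stdlib illustration of the same dt/dd pairing idea (the snippet and class names here are illustrative, not copied from the real site):

```python
import xml.etree.ElementTree as ET

# A simplified definition list of the kind the no-JS page uses.
# The labels and values are made up for this example.
snippet = (
    '<dl>'
    '<dt class="txt">Release Date</dt><dd class="mzl">2021-03-02</dd>'
    '<dt class="txt">Publisher</dt><dd class="mzl">Games Workshop</dd>'
    '</dl>'
)

root = ET.fromstring(snippet)
# Pair each <dt> label with the <dd> value that follows it,
# instead of selecting all dd.mzl elements and guessing which is which.
labels = [dt.text for dt in root.findall('dt')]
values = [dd.text for dd in root.findall('dd')]
info = dict(zip(labels, values))
print(info['Publisher'])  # Games Workshop
```

With full XPath (as available in Scrapy's selectors) you can do this in one expression; ElementTree only supports a small XPath subset, so the pairing is done in Python here.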
UPD
The easiest way to scrape this website is to make a request to https://forbiddenplanet.com/api/products/listing/?q=mortal%20realms&sort=release-date
The response is a JSON object with all the necessary data. You can transform the "results" field (or the whole JSON object) into a Python dictionary and get all the fields with dictionary methods.
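As a minimal sketch of that transformation — assuming the listing endpoint returns a JSON object with a "results" list of products and a "next" pagination URL; the sample payload below is a made-up, heavily trimmed stand-in for the real response:

```python
import json

# Hypothetical, trimmed example of the listing API's JSON shape;
# real responses contain many more fields per product.
raw = '''
{
  "results": [
    {"title": "Mortal Realms #1", "site_price": "2.99", "rrp": "8.99"}
  ],
  "next": "https://forbiddenplanet.com/api/products/listing/?q=mortal%20realms&page=2"
}
'''

data = json.loads(raw)           # str -> dict
for product in data["results"]:  # each product is a plain dict
    print(product["title"], product["site_price"])

# Pagination: keep requesting data["next"] until it is null/absent.
print(data.get("next"))
```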
Here is a code draft that works and shows the idea:
import scrapy
import json


def get_tags(tags: list):
    # Flatten a list of tag objects into a list of their "name" values.
    parsed_tags = []
    if tags:
        for tag in tags:
            parsed_tags.append(tag.get('name'))
        return parsed_tags
    return None


class ForbiddenplanetSpider(scrapy.Spider):
    name = 'forbiddenplanet'
    allowed_domains = ['forbiddenplanet.com']
    start_urls = ['https://forbiddenplanet.com/api/products/listing/?q=mortal%20realms&sort=release-date']

    def parse(self, response):
        response_dict = json.loads(response.body)
        items = response_dict.get('results')
        for item in items:
            yield {
                'name': item.get('title'),
                'price': item.get('site_price'),
                'rrp': item.get('rrp'),
                'release': item.get('release_date'),
                'category': get_tags(item.get('derived_tags').get('type')),
                'universe': get_tags(item.get('derived_tags').get('universe')),
                'authors': get_tags(item.get('derived_tags').get('author')),
                'publisher': get_tags(item.get('derived_tags').get('publisher')),
            }
        # The API paginates via a "next" URL; follow it until it is null.
        next_page = response_dict.get('next')
        if next_page:
            yield scrapy.Request(
                url=next_page,
                callback=self.parse,
            )
Answered By - soldy