Issue
I'm starting out with Scrapy and have managed to extract some of the data I need. However, not everything is obtained properly. I'm applying the knowledge from the official tutorial found here, but it's not working. I've Googled around a bit, and also read this SO question, but I'm fairly certain that isn't the problem here.
Anyhow, I'm trying to parse the product information from this webshop. I'm trying to obtain the product name, price, RRP, release date, category, universe, author and publisher. Here is the relevant CSS for one product: https://pastebin.com/9tqnjs7A. Here's my code; every line with a #! at the end isn't working as expected.
import scrapy
import pprint


class ForbiddenPlanetSpider(scrapy.Spider):
    name = "fp"
    start_urls = [
        'https://forbiddenplanet.com/catalog/?q=mortal%20realms&sort=release-date&page=1',
    ]

    def parse(self, response):
        for item in response.css("section.zshd-00"):
            print(response.css)
            name = item.css("h3.h4::text").get()  #!
            price = item.css("span.clr-price::text").get() + item.css("span.t-small::text").get()
            rrp = item.css("del.mqr::text").get()
            release = item.css("dd.mzl").get()  #!
            category = item.css("li.inline-list__item::text").get()  #!
            universe = item.css("dt.txt").get()  #!
            authors = item.css("a.SubTitleItems").get()  #!
            publisher = item.css("dd.mzl").get()  #!
            pprint.pprint(dict(name=name,
                               price=price,
                               rrp=rrp,
                               release=release,
                               category=category,
                               universe=universe,
                               authors=authors,
                               publisher=publisher
                               )
                          )
I think I need to add some sub-searching (at the moment release and publisher have the same criteria, for example), but I don't know what that's called, so I can't search for it (I've tried, but ended up with generic tutorials that don't cover it). Anything pointing me in the right direction is appreciated!
Oh, and I didn't include ' ' spaces in the selectors, because whenever I used one Scrapy immediately failed to find anything.
Solution
Scrapy doesn't render JS. Try disabling JavaScript in your browser and refreshing the page: the HTML structure is different for the version of the site without JS.
You should rewrite your selectors against that new HTML structure. Also, try using XPath instead of CSS; it's much more flexible.
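One thing XPath handles well is the "sub-searching" problem from the question: picking the dd value that belongs to a particular dt label, e.g. //dt[text()="Publisher"]/following-sibling::dd[1]/text() in a Scrapy selector. As a self-contained stdlib illustration of the same dt/dd pairing idea (the snippet and class names here are illustrative, not copied from the real site):

```python
import xml.etree.ElementTree as ET

# A simplified definition list of the kind the no-JS page uses.
# The labels and values are made up for this example.
snippet = (
    '<dl>'
    '<dt class="txt">Release Date</dt><dd class="mzl">2021-03-02</dd>'
    '<dt class="txt">Publisher</dt><dd class="mzl">Games Workshop</dd>'
    '</dl>'
)

root = ET.fromstring(snippet)
# Pair each <dt> label with the <dd> value that follows it,
# instead of selecting all dd.mzl elements and guessing which is which.
labels = [dt.text for dt in root.findall('dt')]
values = [dd.text for dd in root.findall('dd')]
info = dict(zip(labels, values))
print(info['Publisher'])  # Games Workshop
```

With full XPath (as available in Scrapy's selectors) you can do this in one expression; ElementTree only supports a small XPath subset, so the pairing is done in Python here.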
UPD
The easiest way to scrape this website is to make a request to https://forbiddenplanet.com/api/products/listing/?q=mortal%20realms&sort=release-date
The response is a JSON object with all the necessary data. You can transform the "results" field (or the whole JSON object) into a Python dictionary and get all the fields with dictionary methods.
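As a minimal sketch of that transformation — assuming the listing endpoint returns a JSON object with a "results" list of products and a "next" pagination URL; the sample payload below is a made-up, heavily trimmed stand-in for the real response:

```python
import json

# Hypothetical, trimmed example of the listing API's JSON shape;
# real responses contain many more fields per product.
raw = '''
{
  "results": [
    {"title": "Mortal Realms #1", "site_price": "2.99", "rrp": "8.99"}
  ],
  "next": "https://forbiddenplanet.com/api/products/listing/?q=mortal%20realms&page=2"
}
'''

data = json.loads(raw)           # str -> dict
for product in data["results"]:  # each product is a plain dict
    print(product["title"], product["site_price"])

# Pagination: keep requesting data["next"] until it is null/absent.
print(data.get("next"))
```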
Here is a code draft that works and shows the idea:
import scrapy
import json


def get_tags(tags: list):
    # Flatten a list of tag objects into a list of their "name" values.
    parsed_tags = []
    if tags:
        for tag in tags:
            parsed_tags.append(tag.get('name'))
        return parsed_tags
    return None


class ForbiddenplanetSpider(scrapy.Spider):
    name = 'forbiddenplanet'
    allowed_domains = ['forbiddenplanet.com']
    start_urls = ['https://forbiddenplanet.com/api/products/listing/?q=mortal%20realms&sort=release-date']

    def parse(self, response):
        response_dict = json.loads(response.body)
        items = response_dict.get('results')
        for item in items:
            yield {
                'name': item.get('title'),
                'price': item.get('site_price'),
                'rrp': item.get('rrp'),
                'release': item.get('release_date'),
                'category': get_tags(item.get('derived_tags').get('type')),
                'universe': get_tags(item.get('derived_tags').get('universe')),
                'authors': get_tags(item.get('derived_tags').get('author')),
                'publisher': get_tags(item.get('derived_tags').get('publisher')),
            }
        # The API paginates via a "next" URL; follow it until it is null.
        next_page = response_dict.get('next')
        if next_page:
            yield scrapy.Request(
                url=next_page,
                callback=self.parse,
            )
Answered By - soldy