Sunday, August 7, 2022

[FIXED] Scrapy file, only running the initial start_urls instead of running though the whole list

August 07, 2022 python, scrapy No comments

Issue

As the title states, I am trying to run my scrapy program, the issue I am running into is that it seems to be only returning the yield from the initial url (https://www.antaira.com/products/10-100Mbps).

I am unsure on where my program is not working, in my code I have also left some commented code on what I have attempted.

import scrapy
from ..items import AntairaItem


class ProductJumperFix(scrapy.Spider):  # classes should be TitleCase

    name = 'productJumperFix'
    allowed_domains = ['antaira.com']
    start_urls = [
        'https://www.antaira.com/products/10-100Mbps',
        'https://www.antaira.com/products/unmanaged-gigabit'
        'https://www.antaira.com/products/unmanaged-10-100Mbps-PoE'
        'https://www.antaira.com/products/Unmanaged-Gigabit-PoE'
        'https://www.antaira.com/products/Unmanaged-10-gigabit'
        'https://www.antaira.com/products/Unmanaged-10-gigabit-PoE'
    ]
    
    #def start_requests(self):
    #    yield scrappy.Request(start_urls, self.parse)

    def parse(self, response):
        # iterate through each of the relative urls
        for url in response.xpath('//div[@class="product-container"]//a/@href').getall():
            product_link = response.urljoin(url)  # use variable
            yield scrapy.Request(product_link, callback=self.parse_new_item)

    def parse_new_item(self, response):
        for product in response.css('main.products'):
            items = AntairaItem() # Unique item for each iteration
            items['product_link'] = response.url # get the product link from response
            name = product.css('h1.product-name::text').get().strip()
            features = product.css(('section.features h3 + ul').strip()).getall()
            overview =   product.css('.products .product-overview::text').getall()
            main_image = response.urljoin(product.css('div.selectors img::attr(src)').get())
            rel_links = product.xpath("//script/@src[contains(., '/app/site/hosting/scriptlet.nl')]").getall()
            items['name'] = name,
            items['features'] = features,
            items['overview'] = overview,
            items['main_image'] = main_image,
            items['rel_links'] = rel_links,
            yield items

Thank you everyone!

Follow up question, for some reason when I run "scrapy crawl productJumperFix" im not getting any output from the terminal,not sure how to debug since I can't even see the output errors.

Solution

Try using the start_requests method:

For example:

import scrapy
from ..items import AntairaItem

class ProductJumperFix(scrapy.Spider):

    name = 'productJumperFix'
    allowed_domains = ['antaira.com']

    def start_requests(self):
        urls = [
            'https://www.antaira.com/products/10-100Mbps',
            'https://www.antaira.com/products/unmanaged-gigabit',
            'https://www.antaira.com/products/unmanaged-10-100Mbps-PoE',
            'https://www.antaira.com/products/Unmanaged-Gigabit-PoE',
            'https://www.antaira.com/products/Unmanaged-10-gigabit',
            'https://www.antaira.com/products/Unmanaged-10-gigabit-PoE',
        ]
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        for url in response.xpath('//div[@class="product-container"]//a/@href').getall():
            product_link = response.urljoin(url)  # use variable
            yield scrapy.Request(product_link, callback=self.parse_new_item)

    def parse_new_item(self, response):
        for product in response.css('main.products'):
            items = AntairaItem() 
            items['product_link'] = response.url
            name = product.css('h1.product-name::text').get().strip()
            features = product.css(('section.features h3 + ul').strip()).getall()
            overview =   product.css('.products .product-overview::text').getall()
            main_image = response.urljoin(product.css('div.selectors img::attr(src)').get())
            rel_links = product.xpath("//script/@src[contains(., '/app/site/hosting/scriptlet.nl')]").getall()
            items['name'] = name,
            items['features'] = features,
            items['overview'] = overview,
            items['main_image'] = main_image,
            items['rel_links'] = rel_links,
            yield items

Answered By - Alexander

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Sunday, August 7, 2022

[FIXED] Scrapy file, only running the initial start_urls instead of running though the whole list

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels