Issue
I have created a script that scrapes some elements from a webpage and then follows the link attached to each listing to grab additional info from that page. However, it scrapes relatively slowly: I get ~300 items/min, and my guess is that the cause is the structure of my scraper — how it gathers the requests, follows the URLs, and scrapes the info. Might this be the case, and how can I improve the speed?
import scrapy
from scrapy.item import Field
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose, Join
from scrapy.crawler import CrawlerProcess
from price_parser import Price


def get_price(price_raw):
    price_object = Price.fromstring(price_raw)
    return price_object.amount_float


def get_currency(price_raw):
    price_object = Price.fromstring(price_raw)
    return price_object.currency


class VinylItem(scrapy.Item):
    title = Field(output_processor=TakeFirst())
    label = Field()
    media_condition = Field(input_processor=MapCompose(str.strip),
                            output_processor=TakeFirst())
    sleeve_condition = Field(output_processor=TakeFirst())
    location = Field(input_processor=MapCompose(str.strip),
                     output_processor=Join())
    price = Field(input_processor=MapCompose(get_price),
                  output_processor=TakeFirst())
    currency = Field(input_processor=MapCompose(get_currency),
                     output_processor=TakeFirst())
    rated = Field(input_processor=MapCompose(str.strip),
                  output_processor=Join())
    have_vinyl = Field(output_processor=TakeFirst())
    want_vinyl = Field(output_processor=TakeFirst())
    format = Field(input_processor=MapCompose(str.strip),
                   output_processor=Join())
    released = Field(input_processor=MapCompose(str.strip),
                     output_processor=Join())
    genre = Field(input_processor=MapCompose(str.strip),
                  output_processor=Join())
    style = Field(input_processor=MapCompose(str.strip),
                  output_processor=Join())


class VinylSpider(scrapy.Spider):
    name = 'vinyl'
    # allowed_domains = ['x']
    start_urls = ['https://www.discogs.com/sell/list?format=Vinyl']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        content = response.xpath("//table[@class='table_block mpitems push_down table_responsive']//tbody//tr")
        for items in content:
            loader = ItemLoader(VinylItem(), selector=items)
            loader.add_xpath('title', "(.//strong//a)[position() mod 2=1]//text()")
            loader.add_xpath('label', './/p[@class="hide_mobile label_and_cat"]//a//text()')
            loader.add_xpath('media_condition', '(.//p[@class="item_condition"]//span)[position() mod 3=0]//text()')
            loader.add_xpath('sleeve_condition', './/p[@class="item_condition"]//span[@class="item_sleeve_condition"]//text()')
            loader.add_xpath('location', '(.//td[@class="seller_info"]//li)[position() mod 3=0]//text()')
            loader.add_xpath('price', '(//tbody//tr//td//span[@class="price"])[position() mod 2=0]//text()')
            loader.add_xpath('currency', '(//tbody//tr//td//span[@class="price"])[position() mod 2=0]//text()')
            loader.add_xpath('rated', './/td//div[@class="community_rating"]//text()')
            loader.add_xpath('have_vinyl', '(.//td//div[@class="community_result"]//span[@class="community_label"])[contains(text(),"have")]//text()')
            loader.add_xpath('want_vinyl', '(.//td//div[@class="community_result"]//span[@class="community_label"])[contains(text(),"want")]//text()')
            links = items.xpath('.//td[@class="item_description"]//strong//@href').get()
            yield response.follow(
                response.urljoin(links),
                callback=self.parse_vinyls,
                cb_kwargs={'loader': loader}
            )
        next_page = response.xpath('(//ul[@class="pagination_page_links"]//a)[last()]//@href').get()
        if next_page:
            yield response.follow(
                response.urljoin(next_page),
                callback=self.parse
            )

    def parse_vinyls(self, response, loader):
        # loader = ItemLoader(VinylItem(), selector=response)
        loader.add_value('format', response.xpath("(.//div[@id='page_content']//div[5])[1]//text()").get())
        loader.add_value('released', response.xpath("(.//div[@id='page_content']//div[9])[1]//text()").get())
        loader.add_value('genre', response.xpath("(.//div[@id='page_content']//div[11])[1]//text()").get())
        loader.add_value('style', response.xpath("(.//div[@id='page_content']//div[13])[1]//text()").get())
        yield loader.load_item()


process = CrawlerProcess(
    settings={
        'FEED_URI': 'vinyl.jl',
        'FEED_FORMAT': 'jsonlines'
    }
)
process.crawl(VinylSpider)
process.start()
Solution
From the code snippet you have provided, your scraper is already set up efficiently: it yields many requests at a go, which lets Scrapy handle the concurrency.

There are a couple of settings you can tweak to increase the scraping speed. However, note that the first rule of scraping is to do no harm to the website you are scraping. Below is a sample of the settings you can tweak:
- Increase the value of CONCURRENT_REQUESTS (defaults to 16 in Scrapy).
- Increase the value of CONCURRENT_REQUESTS_PER_DOMAIN (defaults to 8 in Scrapy).
- Increase REACTOR_THREADPOOL_MAXSIZE (the Twisted IO thread pool maximum size) so that DNS resolution is faster.
- Reduce the log level: LOG_LEVEL = 'INFO'.
- Disable cookies if you do not require them: COOKIES_ENABLED = False.
- Reduce the download timeout: DOWNLOAD_TIMEOUT = 15.
- Reduce the value of DOWNLOAD_DELAY if your internet speed is fast and you are sure the website you are targeting can handle it. This is not recommended.
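To make this concrete, here is a minimal sketch of how those settings could be passed to the existing CrawlerProcess. The specific values are illustrative assumptions, not tuned for Discogs — raise them gradually and watch for bans or errors:

```python
# Illustrative values only -- tune gradually and respect the target site.
speed_settings = {
    'FEED_URI': 'vinyl.jl',
    'FEED_FORMAT': 'jsonlines',
    'CONCURRENT_REQUESTS': 32,             # Scrapy default: 16
    'CONCURRENT_REQUESTS_PER_DOMAIN': 16,  # Scrapy default: 8
    'REACTOR_THREADPOOL_MAXSIZE': 20,      # bigger Twisted thread pool -> faster DNS
    'LOG_LEVEL': 'INFO',                   # less logging overhead than the default DEBUG
    'COOKIES_ENABLED': False,              # skip cookie processing entirely
    'DOWNLOAD_TIMEOUT': 15,                # give up on very slow responses sooner
}

# Pass the dict to the existing CrawlerProcess in place of the current settings:
# process = CrawlerProcess(settings=speed_settings)
```

Doubling the concurrency settings while keeping everything else unchanged is usually the first experiment worth running, since the bottleneck for a crawl like this is almost always waiting on network responses rather than parsing.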
Read more about these settings in the Scrapy docs.

If the above settings do not solve your problem, you may need to look into distributed crawling.
Answered By - msenior_