Issue
The question has been solved; the answer is given below.
I have been running a Scrapy script for crawling and scraping, and it was working fine. But while running, it keeps getting stuck at some point. Here is what it shows:
[scrapy.extensions.logstats] INFO: Crawled 1795 pages (at 0 pages/min), scraped 1716 items (at 0 items/min)
I then stopped the run with Control+Z and reran the spider. Again, after crawling and scraping some data, it gets stuck. Have you faced this problem before? How did you overcome it?
Here is the link to the whole code.
Here is the code of the spider:
import scrapy
from scrapy.loader import ItemLoader
from healthgrades.items import HealthgradesItem
from scrapy_playwright.page import PageMethod
# turn a raw header block (one 'key: value' pair per line) into a dictionary
def get_headers(s, sep=': ', strip_cookie=True, strip_cl=True, strip_headers: list = []) -> dict:
    d = dict()
    for kv in s.split('\n'):
        kv = kv.strip()
        if kv and sep in kv:
            v = ''
            k = kv.split(sep)[0]
            if len(kv.split(sep)) == 1:
                v = ''
            else:
                v = kv.split(sep)[1]
            if v == '\'\'':
                v = ''
            # optionally drop cookie, content-length, and any explicitly listed headers
            if strip_cookie and k.lower() == 'cookie': continue
            if strip_cl and k.lower() == 'content-length': continue
            if k in strip_headers: continue
            d[k] = v
    return d
# spider class
class DoctorSpider(scrapy.Spider):
    name = 'doctor'
    allowed_domains = ['healthgrades.com']
    url = 'https://www.healthgrades.com/usearch?what=Massage%20Therapy&entityCode=PS444&where=New%20York&pageNum={}&sort.provider=bestmatch&='

    # change the bot's headers so it looks like a browser
    def start_requests(self):
        h = get_headers(
            '''
            accept: */*
            accept-encoding: gzip, deflate, br
            accept-language: en-US,en;q=0.9
            dnt: 1
            origin: https://www.healthgrades.com
            referer: https://www.healthgrades.com/
            sec-ch-ua: ".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"
            sec-ch-ua-mobile: ?0
            sec-ch-ua-platform: "Windows"
            sec-fetch-dest: empty
            sec-fetch-mode: cors
            sec-fetch-site: cross-site
            user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
            '''
        )
        for i in range(1, 6):  # adjust the range to cover the page numbers you need; more improvement is possible
            # GET request to each results page
            yield scrapy.Request(self.url.format(i), headers=h, meta=dict(
                playwright=True,
                playwright_include_page=True,
                # wait for a particular element to load before parsing
                playwright_page_methods=[PageMethod('wait_for_selector', 'h3.card-name a')]
            ))

    def parse(self, response):
        for link in response.css('div h3.card-name a::attr(href)'):  # individual doctor's link
            yield response.follow(link.get(), callback=self.parse_categories)  # enter the doctor's page

    def parse_categories(self, response):
        l = ItemLoader(item=HealthgradesItem(), selector=response)
        l.add_xpath('name', '//*[@id="summary-section"]/div[1]/div[2]/div/div/div[1]/div[1]/h1')
        l.add_xpath('specialty', '//*[@id="summary-section"]/div[1]/div[2]/div/div/div[1]/div[1]/div[2]/p/span[1]')
        l.add_xpath('practice_name', '//*[@id="summary-section"]/div[1]/div[2]/div/div/div[2]/div[1]/p')
        l.add_xpath('address', 'string(//*[@id="summary-section"]/div[1]/div[2]/div/div/div[2]/div[1]/address)')
        yield l.load_item()
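As a quick sanity check, the get_headers helper can be exercised on its own. The snippet below is purely illustrative (the sample header string is made up) and is not part of the original post:

# illustrative check of get_headers; the sample input is invented
sample = 'accept: */*\ncookie: session=abc\ncontent-length: 42'
print(get_headers(sample))  # prints {'accept': '*/*'}; cookie and content-length are stripped by default

The spider itself runs as usual with scrapy crawl doctor.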
Solution
The issue is that Scrapy limits the number of concurrent requests through its settings, and with the default limits a large crawl can appear to stall.
Concurrent Requests
Adding concurrency to Scrapy is a very simple task. There is already a setting for the number of concurrent requests allowed, which you just have to modify.
You can modify it either in the custom settings of the spider you’ve made, or in the global settings, which affect all spiders.
Global
To apply this globally, just add the following line to your project's settings file (settings.py).
CONCURRENT_REQUESTS = 30
This sets the number of concurrent requests to 30. You can use any value you like, within reason.
Local
To apply the setting locally, use the spider's custom_settings attribute.
custom_settings = {'CONCURRENT_REQUESTS': 30}
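Applied to the spider from the question, this would look roughly as follows (a sketch; the value 30 is illustrative):

class DoctorSpider(scrapy.Spider):
    name = 'doctor'
    allowed_domains = ['healthgrades.com']

    # per-spider override of the global concurrency limit (30 is an illustrative value)
    custom_settings = {
        'CONCURRENT_REQUESTS': 30,
    }

    # ... start_requests, parse, and parse_categories as shown above ...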
Additional Settings
There are many additional settings that you can use instead of, or together with, CONCURRENT_REQUESTS.
CONCURRENT_REQUESTS_PER_DOMAIN – Sets the maximum number of concurrent requests allowed for each domain.
CONCURRENT_REQUESTS_PER_IP – Sets the number of concurrent requests allowed per IP address; when set to a non-zero value, it overrides the per-domain limit.
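Put together in settings.py, a configuration combining these limits might look like the sketch below. The values are illustrative:

# settings.py - illustrative concurrency configuration
CONCURRENT_REQUESTS = 30            # global cap on in-flight requests (Scrapy's default is 16)
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # cap per target domain (8 is the default)
CONCURRENT_REQUESTS_PER_IP = 0      # 0 (the default) disables the per-IP cap; non-zero overrides the per-domain one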
Answered By - Shahidul Islam Pranto