Issue
The question has been solved; the answer is given below.
I have been running a Scrapy script for crawling and scraping, and it was working fine. But while running, it keeps getting stuck at some point. Here is what it shows:
[scrapy.extensions.logstats] INFO: Crawled 1795 pages (at 0 pages/min), scraped 1716 items (at 0 items/min)
I then stopped the run with Control+Z and reran the spider. Again, after crawling and scraping some data, it gets stuck. Have you faced this problem before? How did you overcome it?
Here is the link to the whole code.
Here is the code of the spider:
import scrapy
from scrapy.loader import ItemLoader
from healthgrades.items import HealthgradesItem
from scrapy_playwright.page import PageMethod
# turn a raw header block (one 'key: value' pair per line) into a dictionary
def get_headers(s, sep=': ', strip_cookie=True, strip_cl=True, strip_headers: list = []) -> dict:
    d = dict()
    for kv in s.split('\n'):
        kv = kv.strip()
        if kv and sep in kv:
            v = ''
            k = kv.split(sep)[0]
            if len(kv.split(sep)) == 1:
                v = ''
            else:
                v = kv.split(sep)[1]
            if v == '\'\'':
                v = ''
            # optionally drop cookie, content-length, and any explicitly listed headers
            if strip_cookie and k.lower() == 'cookie': continue
            if strip_cl and k.lower() == 'content-length': continue
            if k in strip_headers: continue
            d[k] = v
    return d
# spider class
class DoctorSpider(scrapy.Spider):
    name = 'doctor'
    allowed_domains = ['healthgrades.com']
    url = 'https://www.healthgrades.com/usearch?what=Massage%20Therapy&entityCode=PS444&where=New%20York&pageNum={}&sort.provider=bestmatch&='

    # change the bot's headers so it looks like a browser
    def start_requests(self):
        h = get_headers(
            '''
            accept: */*
            accept-encoding: gzip, deflate, br
            accept-language: en-US,en;q=0.9
            dnt: 1
            origin: https://www.healthgrades.com
            referer: https://www.healthgrades.com/
            sec-ch-ua: ".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"
            sec-ch-ua-mobile: ?0
            sec-ch-ua-platform: "Windows"
            sec-fetch-dest: empty
            sec-fetch-mode: cors
            sec-fetch-site: cross-site
            user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
            '''
        )
        for i in range(1, 6):  # adjust the range to cover the page numbers you need; more improvement is possible
            # GET request to each results page
            yield scrapy.Request(self.url.format(i), headers=h, meta=dict(
                playwright=True,
                playwright_include_page=True,
                # wait for a particular element to load before parsing
                playwright_page_methods=[PageMethod('wait_for_selector', 'h3.card-name a')]
            ))

    def parse(self, response):
        for link in response.css('div h3.card-name a::attr(href)'):  # individual doctor's link
            yield response.follow(link.get(), callback=self.parse_categories)  # enter the doctor's page

    def parse_categories(self, response):
        l = ItemLoader(item=HealthgradesItem(), selector=response)
        l.add_xpath('name', '//*[@id="summary-section"]/div[1]/div[2]/div/div/div[1]/div[1]/h1')
        l.add_xpath('specialty', '//*[@id="summary-section"]/div[1]/div[2]/div/div/div[1]/div[1]/div[2]/p/span[1]')
        l.add_xpath('practice_name', '//*[@id="summary-section"]/div[1]/div[2]/div/div/div[2]/div[1]/p')
        l.add_xpath('address', 'string(//*[@id="summary-section"]/div[1]/div[2]/div/div/div[2]/div[1]/address)')
        yield l.load_item()
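As a quick sanity check, the get_headers helper can be exercised on its own. The snippet below is purely illustrative (the sample header string is made up) and is not part of the original post:

# illustrative check of get_headers; the sample input is invented
sample = 'accept: */*\ncookie: session=abc\ncontent-length: 42'
print(get_headers(sample))  # prints {'accept': '*/*'}; cookie and content-length are stripped by default

The spider itself runs as usual with scrapy crawl doctor.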
Solution
The issue is that Scrapy limits the number of concurrent requests through its settings, and with the default limits a large crawl can appear to stall.
Concurrent Requests
Adding concurrency to Scrapy is a very simple task. There is already a setting for the number of concurrent requests allowed, which you just have to modify.
You can modify it either in the custom settings of the spider you’ve made, or in the global settings, which affect all spiders.
Global
To apply this globally, just add the following line to your project's settings file (settings.py).
CONCURRENT_REQUESTS = 30
This sets the number of concurrent requests to 30. You can use any value you like, within reason.
Local
To apply the setting locally, use the spider's custom_settings attribute.
custom_settings = {'CONCURRENT_REQUESTS': 30}
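Applied to the spider from the question, this would look roughly as follows (a sketch; the value 30 is illustrative):

class DoctorSpider(scrapy.Spider):
    name = 'doctor'
    allowed_domains = ['healthgrades.com']

    # per-spider override of the global concurrency limit (30 is an illustrative value)
    custom_settings = {
        'CONCURRENT_REQUESTS': 30,
    }

    # ... start_requests, parse, and parse_categories as shown above ...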
Additional Settings
There are many additional settings that you can use instead of, or together with, CONCURRENT_REQUESTS.
CONCURRENT_REQUESTS_PER_DOMAIN – Sets the maximum number of concurrent requests allowed for each domain.
CONCURRENT_REQUESTS_PER_IP – Sets the number of concurrent requests allowed per IP address; when set to a non-zero value, it overrides the per-domain limit.
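Put together in settings.py, a configuration combining these limits might look like the sketch below. The values are illustrative:

# settings.py - illustrative concurrency configuration
CONCURRENT_REQUESTS = 30            # global cap on in-flight requests (Scrapy's default is 16)
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # cap per target domain (8 is the default)
CONCURRENT_REQUESTS_PER_IP = 0      # 0 (the default) disables the per-IP cap; non-zero overrides the per-domain one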
Answered By - Shahidul Islam Pranto