Issue
I need to collect a lot (really a lot) of data for statistics. All the necessary information is inside a <script type="application/ld+json"></script> tag in the HTML, and I wrote a Scrapy parser for it, but crawling is very slow (about 3 pages per second). Is there any way to speed up the process? Ideally I would like to see 10+ pages per second.
#spider.py:
import scrapy
import json

class Spider(scrapy.Spider):
    name = 'scrape'
    start_urls = [
        # about 10000 urls
    ]

    def parse(self, response):
        # All the structured data sits in the page's JSON-LD <script> block.
        raw = response.css('script[type="application/ld+json"]::text').extract_first()
        if raw is None:
            return  # no JSON-LD block on this page; skip it
        data = json.loads(raw)
        yield {
            'name': data['name'],
            'image': data['image'],
            # breadcrumb path from the itemprop="name" spans
            'path': response.css('span[itemprop="name"]::text').extract(),
        }
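With roughly 10,000 start URLs, hard-coding the list in the spider works, but a sketch like the following keeps the spider readable; it assumes the URLs sit one per line in a plain-text file (the urls.txt name is hypothetical):

import scrapy

class Spider(scrapy.Spider):
    name = 'scrape'

    def start_requests(self):
        # Hypothetical file with one URL per line; the filename is an assumption.
        with open('urls.txt') as f:
            for line in f:
                url = line.strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse)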
#settings.py:
USER_AGENT = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:67.0) Gecko/20100101 Firefox/67.0"
ROBOTSTXT_OBEY = False
CONCURRENT_REQUESTS = 32
DOWNLOAD_DELAY = 0.33
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
EXTENSIONS = {
    'scrapy.extensions.telnet.TelnetConsole': None,
}
AUTOTHROTTLE_DEBUG = False
LOG_ENABLED = False
My PC specs:
16 GB RAM, i5-2400, SSD, 1 Gbit Ethernet
Solution

The original DOWNLOAD_DELAY = 0.33 on its own caps the crawl at roughly 1/0.33 ≈ 3 requests per second per domain, which matches the observed rate. Removing the delay and raising the concurrency limits lifts that cap.

settings.py
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 0
DOWNLOAD_TIMEOUT = 30
RANDOMIZE_DOWNLOAD_DELAY = True
REACTOR_THREADPOOL_MAXSIZE = 128
CONCURRENT_REQUESTS = 256
CONCURRENT_REQUESTS_PER_DOMAIN = 256
CONCURRENT_REQUESTS_PER_IP = 256
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 0.25
AUTOTHROTTLE_TARGET_CONCURRENCY = 128
AUTOTHROTTLE_DEBUG = True
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 400, 401, 403, 404, 405, 406, 407, 408, 409, 410, 429]
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy.spidermiddlewares.referer.RefererMiddleware': 80,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 120,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 130,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 900,
    'scraper.middlewares.ScraperDownloaderMiddleware': 1000,
}
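These values apply project-wide. If the overrides should affect only this spider, a minimal sketch using Scrapy's per-spider custom_settings attribute (the subset of keys shown is illustrative, not exhaustive):

import scrapy

class Spider(scrapy.Spider):
    name = 'scrape'
    # Per-spider overrides; Scrapy merges these over the project settings.py.
    custom_settings = {
        'DOWNLOAD_DELAY': 0,
        'CONCURRENT_REQUESTS': 256,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 256,
        'AUTOTHROTTLE_ENABLED': True,
        'AUTOTHROTTLE_TARGET_CONCURRENCY': 128,
    }

Note that with AutoThrottle enabled, Scrapy adjusts the download delay dynamically from observed server latency, so the concurrency settings act as upper bounds rather than guaranteed throughput.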
Answered By - 0x01h