Issue
I am using custom settings for Scrapy spiders, and some of the settings appear to be ignored when the spider runs. Most importantly, 'DOWNLOADER_MIDDLEWARES'.
Below are the spider's custom settings:
custom_settings = {
    'DOWNLOADER_MIDDLEWARES': {
        'app.sitescrapper.sitescrapper.middlewares.RotateUserAgentMiddleware': 400,
        'app.sitescrapper.sitescrapper.middlewares.ProjectDownloaderMiddleware': 543,
        'app.sitescrapper.sitescrapper.selenium_middlewares.SeleniumMiddleware': 123,
    },
    'COOKIES_ENABLED': False,
    'CONCURRENT_REQUESTS': 6,
    'DOWNLOAD_DELAY': 2,
    'CELERYD_MAX_TASKS_PER_CHILD': 1,
    'TELNETCONSOLE_ENABLED': False,
    'AUTOTHROTTLE_ENABLED': True,
    'LOG_LEVEL': 'WARNING',
    # Duplicates pipeline
    'ITEM_PIPELINES': {'app.sitescrapper.sitescrapper.pipelines.DuplicatesPipeline': 300},
}
According to the log, the following settings are overridden:
Overridden settings:
{'AUTOTHROTTLE_ENABLED': True,
 'CONCURRENT_REQUESTS': 6,
 'COOKIES_ENABLED': False,
 'DOWNLOAD_DELAY': 2,
 'LOG_LEVEL': 'WARNING',
 'TELNETCONSOLE_ENABLED': False}
The pipelines are also running correctly. How can 'DOWNLOADER_MIDDLEWARES' be activated?
Update
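# Assumption: p1 is not defined in the original snippet; it is taken here
# to be an alias for multiprocessing.Process.
from multiprocessing import Process as p1
from scrapy.crawler import CrawlerProcess

@celery.task(name='CeleryTask.crawl')
def scrape(baseURL):
    crawl_data = [baseURL]

    def run_process():
        process = CrawlerProcess()
        process.crawl(myCrawler, category=crawl_data)
        process.start()

    # Run the crawl in a child process and wait for it to finish.
    p = p1(target=run_process)
    p.start()
    p.join()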
The spiders are run as asynchronous Celery jobs, not from the command line. When the spider is executed from the CLI, the middlewares are activated.
Update 2
From the CLI:
When using scrapy runspider file_name.py, the middlewares in the custom settings are activated.
But when using scrapy crawl spider_name, the middlewares in the custom settings are not activated.
Solution
The Overridden settings log entry lists only settings that meet both of these conditions:
- they are defined in /scrapy/settings/default_settings.py, that is, they are Scrapy's own settings (settings from third-party modules are not covered);
- their values are not dictionaries.
Middlewares are configured as dictionaries, so they will never be listed there, even when they are active.
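The filtering rule described above can be sketched in plain Python. This is a simplified model, not Scrapy's actual code: DEFAULTS is a small hypothetical stand-in for default_settings.py, and overridden() and the 'myproject.middlewares.Example' path are illustrative names only.

```python
# Simplified model of how the "Overridden settings" log entry is built.
# DEFAULTS is a tiny hypothetical stand-in for scrapy/settings/default_settings.py.
DEFAULTS = {
    "COOKIES_ENABLED": True,
    "CONCURRENT_REQUESTS": 16,
    "DOWNLOAD_DELAY": 0,
    "DOWNLOADER_MIDDLEWARES": {},
    "ITEM_PIPELINES": {},
}


def overridden(settings, defaults=DEFAULTS):
    """Return the settings that differ from a known, non-dict default."""
    result = {}
    for name, default in defaults.items():
        value = settings.get(name, default)
        # Dict-valued defaults (middlewares, pipelines, extensions) are skipped.
        if not isinstance(default, dict) and value != default:
            result[name] = value
    return result


custom = {
    "COOKIES_ENABLED": False,
    "CONCURRENT_REQUESTS": 6,
    "DOWNLOAD_DELAY": 2,
    # Not one of the known defaults, so it is never reported.
    "CELERYD_MAX_TASKS_PER_CHILD": 1,
    # Dict value, so it is never reported (hypothetical middleware path).
    "DOWNLOADER_MIDDLEWARES": {"myproject.middlewares.Example": 543},
}

reported = overridden(custom)
print(reported)
```

Under this model, both DOWNLOADER_MIDDLEWARES and CELERYD_MAX_TASKS_PER_CHILD are absent from the result, which matches the log shown in the question.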
To check whether custom DOWNLOADER_MIDDLEWARES, SPIDER_MIDDLEWARES, ITEM_PIPELINES or EXTENSIONS are active, look at the log entries that appear right after the Overridden settings entry:
[scrapy.middleware] Enabled extensions:...
[scrapy.middleware] Enabled downloader middlewares:
[scrapy.middleware] Enabled spider middlewares:
[scrapy.middleware] Enabled item pipelines:
If the custom middlewares are connected correctly, they will appear in these lists. If they do not, it is probably a path issue.
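If a path issue is suspected, one way to narrow it down is to check whether each dotted path can be imported from the environment the worker runs in. The following is a minimal sketch using only the standard library; resolve() and check_paths() are hypothetical helpers, not part of Scrapy, and the two example paths are stand-ins to be replaced with the paths from custom_settings.

```python
import importlib


def resolve(dotted_path):
    """Import and return the object named by 'package.module.ClassName'."""
    module_path, _, name = dotted_path.rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, name)


def check_paths(paths):
    """Map each dotted path to True if it resolves, else to an error message."""
    report = {}
    for path in paths:
        try:
            resolve(path)
            report[path] = True
        except (ImportError, AttributeError, ValueError) as exc:
            report[path] = f"{type(exc).__name__}: {exc}"
    return report


# Stand-in paths: one from the standard library that resolves,
# and one made-up path that does not.
report = check_paths([
    "collections.OrderedDict",
    "no_such_package.middlewares.SeleniumMiddleware",
])
for path, status in report.items():
    print(path, "->", status)
```

Running this inside the Celery worker's environment rather than an interactive shell is probably more informative, since the import path may differ between the two.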
Answered By - Georgiy