Issue
I built a crawler using the Python Scrapy library. It works perfectly and reliably when run locally. I have attempted to port it over to AWS Lambda (packaged appropriately). However, when I run it there, the process isn't blocked while the crawl runs; instead, it completes before the crawler can return anything, giving no results. These are the last lines I get in the logs before it exits:
2018-09-12 18:58:07 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-09-12 18:58:07 [scrapy.core.engine] INFO: Spider opened
Normally I would get a whole stream of information about the pages being crawled. I've tried sleeping after starting the crawl, installing crochet and adding its decorators, and installing and using a framework that specifically claims to address this problem, but none of it works.
I'm sure this is an issue with Lambda not respecting Scrapy's blocking, but I have no idea how to address it.
Solution
I had the same problem and fixed it by creating empty modules for sqlite3, as described in this answer: https://stackoverflow.com/a/44532317/5441099. Apparently, Scrapy imports sqlite3 but doesn't necessarily use it. Python 3 expects sqlite3 to be available on the host machine, but the AWS Lambda machines don't have it, and the resulting error message doesn't always show up in the logs. That means you can make it work either by switching to Python 2 or by creating empty modules for sqlite3, as I did.
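If you want to confirm that the missing module is the culprit, one quick check (a minimal sketch; the exact error text is an assumption and may vary by runtime) is to attempt the import yourself inside the handler so the failure lands in the Lambda logs:
# diagnostic sketch: surface the otherwise-silent import failure
try:
    import sqlite3  # fails on Lambda when the _sqlite3 C extension is absent
except ImportError as e:
    print('sqlite3 unavailable:', e)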
My entry file for running the crawler is as follows, and it works on Lambda with Python 3.6:
# run_crawler.py
# crawl() is invoked from the handler function in Lambda
import os
from my_scraper.spiders.my_spider import MySpider
from scrapy.crawler import CrawlerProcess
# Start sqlite3 fix
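# (registering stub modules lets Scrapy's "import sqlite3" succeed without
#  loading the missing _sqlite3 C extension; the stubs are never used)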
import imp
import sys
sys.modules["sqlite"] = imp.new_module("sqlite")
sys.modules["sqlite3.dbapi2"] = imp.new_module("sqlite.dbapi2")
# End sqlite3 fix
def crawl():
    process = CrawlerProcess(dict(
        FEED_FORMAT='json',
        FEED_URI='s3://my-bucket/my_scraper_feed/' +
                 '%(name)s-%(time)s.json',
        AWS_ACCESS_KEY_ID=os.getenv('AWS_ACCESS_KEY_ID'),
        AWS_SECRET_ACCESS_KEY=os.getenv('AWS_SECRET_ACCESS_KEY'),
    ))
    process.crawl(MySpider)
    process.start()  # the script will block here until all crawling jobs are finished

if __name__ == '__main__':
    crawl()
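For completeness, the handler itself can simply delegate to crawl(). The sketch below is an assumption about the wiring, not part of the original answer: the file and function names are hypothetical, and the Lambda function's handler setting would need to point at lambda_function.lambda_handler.
# lambda_function.py (hypothetical entry point)
from run_crawler import crawl

def lambda_handler(event, context):
    crawl()  # blocks until the crawl finishes and the feed is written to S3
    return {'status': 'done'}
Keep in mind that the S3 feed export needs botocore (or boto, depending on your Scrapy version) available in the deployment package, and the Lambda timeout must be long enough for the whole crawl to complete.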
Answered By - Viktor Andersen