Issue
I built a crawler using the Python Scrapy library. It works perfectly and reliably when run locally. I have attempted to port it over to AWS Lambda (packaged appropriately). However, when I run it there, the process isn't blocked while the crawl runs; instead, it completes before the crawler can return anything, giving no results. These are the last lines I get in the logs before it exits:
2018-09-12 18:58:07 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-09-12 18:58:07 [scrapy.core.engine] INFO: Spider opened
Normally I would get a whole stream of information about the pages being crawled. I've tried sleeping after starting the crawl, installing crochet and adding its decorators, and installing and using a framework that specifically claims to address this problem, but none of it works.
I'm sure this is an issue with Lambda not respecting Scrapy's blocking, but I have no idea how to address it.
Solution
I had the same problem and fixed it by creating empty modules for sqlite3, as described in this answer: https://stackoverflow.com/a/44532317/5441099. Apparently, Scrapy imports sqlite3 but doesn't necessarily use it. Python 3 expects sqlite3 to be available on the host machine, but the AWS Lambda machines don't have it, and the resulting error message doesn't always show up in the logs. That means you can make it work either by switching to Python 2 or by creating empty modules for sqlite3, as I did.
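If you want to confirm that the missing module is the culprit, one quick check (a minimal sketch; the exact error text is an assumption and may vary by runtime) is to attempt the import yourself inside the handler so the failure lands in the Lambda logs:
# diagnostic sketch: surface the otherwise-silent import failure
try:
    import sqlite3  # fails on Lambda when the _sqlite3 C extension is absent
except ImportError as e:
    print('sqlite3 unavailable:', e)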
My entry file for running the crawler is as follows, and it works on Lambda with Python 3.6:
# run_crawler.py
# crawl() is invoked from the handler function in Lambda
import os
from my_scraper.spiders.my_spider import MySpider
from scrapy.crawler import CrawlerProcess
# Start sqlite3 fix
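# (registering stub modules lets Scrapy's "import sqlite3" succeed without
#  loading the missing _sqlite3 C extension; the stubs are never used)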
import imp
import sys
sys.modules["sqlite"] = imp.new_module("sqlite")
sys.modules["sqlite3.dbapi2"] = imp.new_module("sqlite.dbapi2")
# End sqlite3 fix
def crawl():
    process = CrawlerProcess(dict(
        FEED_FORMAT='json',
        FEED_URI='s3://my-bucket/my_scraper_feed/' +
                 '%(name)s-%(time)s.json',
        AWS_ACCESS_KEY_ID=os.getenv('AWS_ACCESS_KEY_ID'),
        AWS_SECRET_ACCESS_KEY=os.getenv('AWS_SECRET_ACCESS_KEY'),
    ))
    process.crawl(MySpider)
    process.start()  # the script will block here until all crawling jobs are finished

if __name__ == '__main__':
    crawl()
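For completeness, the handler itself can simply delegate to crawl(). The sketch below is an assumption about the wiring, not part of the original answer: the file and function names are hypothetical, and the Lambda function's handler setting would need to point at lambda_function.lambda_handler.
# lambda_function.py (hypothetical entry point)
from run_crawler import crawl

def lambda_handler(event, context):
    crawl()  # blocks until the crawl finishes and the feed is written to S3
    return {'status': 'done'}
Keep in mind that the S3 feed export needs botocore (or boto, depending on your Scrapy version) available in the deployment package, and the Lambda timeout must be long enough for the whole crawl to complete.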
Answered By - Viktor Andersen