Issue
I have a Scrapy project with ScrapyRT on top of it to expose an API. First, I deployed the application to Heroku with the default settings and the following Procfile:
web: scrapyrt -i 0.0.0.0 -p $PORT
Everything is fine so far; it runs as expected.
The Scrapy project has a pipeline that sends the scraped items to a MongoDB database. That works fine as well.
Now, since I am already saving the scraped data into a database, my intention was to create an additional resource to handle the GET requests, so that ScrapyRT checks the database for an item that was scraped before and returns it instead of running the spider. According to the ScrapyRT documentation, in order to add a new resource I need to pass custom settings through the command line (PowerShell on Windows) like this:
scrapyrt -S nist_scraper.scrapyrt.settings
where nist_scraper is the name of the project, scrapyrt is a subdirectory inside the project, and settings is the name of the Python file where the settings are located.
# nist_scraper/scrapyrt/settings.py
RESOURCES = {
    'crawl.json': 'nist_scraper.scrapyrt.resources.CheckDatabaseBeforeCrawlResource',
}
# nist_scraper/scrapyrt/resources.py
import json
import os

from dotenv import load_dotenv
from pymongo import MongoClient
from scrapyrt.resources import CrawlResource

# Load variables from a local .env file, if one exists.
load_dotenv()


class CheckDatabaseBeforeCrawlResource(CrawlResource):

    def render_GET(self, request, **kwargs):
        # Decode the URL query parameters (the request exposes them as bytes).
        api_params = dict(
            (name.decode('utf-8'), value[0].decode('utf-8'))
            for name, value in request.args.items()
        )

        try:
            cas = json.loads(api_params["crawl_args"])["cas"]
            collection_name = "substances"
            client = MongoClient(os.environ.get("MONGO_URI"))
            db = client[os.environ.get("MONGO_DB")]
        except Exception:
            # Any failure here (no "cas" argument, bad configuration)
            # falls back to the default behavior: run the spider.
            return super(CheckDatabaseBeforeCrawlResource, self).render_GET(
                request, **kwargs)

        # Return the stored item, if there is one, instead of crawling.
        substance = db[collection_name].find_one({"cas": cas}, {"_id": 0})
        if substance:
            response = {
                "status": "ok",
                "items": [substance],
            }  # <== Here is supposed to be the metadata but is gone on purpose
            return response

        return super(CheckDatabaseBeforeCrawlResource, self).render_GET(
            request, **kwargs)
Again, locally, when I send the GET request
{{BASE_URL}}crawl.json?spider_name=webbook_nist&start_requests=true&crawl_args={"cas":"74828"}
I get the desired behavior: the resource returns the item from the database and not from the spider in the Scrapy project. I know the item came from the database because I modified the response returned by ScrapyRT and removed all the metadata.
However, here is the issue. I uploaded the same local project to Heroku, replacing the original deployment mentioned at the beginning (which worked fine), and changed the Procfile to:
web: scrapyrt -S nist_scraper.scrapyrt.settings -i 0.0.0.0 -p $PORT
But when I send the same GET request, ScrapyRT calls the spider and does not check whether the item is in the database. To be clear, the database is the same, and the item is indeed recorded in that database. The response that comes back includes the metadata I removed in the custom resource.
I am not proficient at either Heroku or ScrapyRT, but I am assuming the issue is that Heroku is not applying my custom settings when starting the API, so ScrapyRT runs with its default settings, which always scrape the website using the spider.
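One way to test that assumption is to import the settings module by the same dotted path given to -S and look at what it defines. The following is only a minimal sketch: the helper name describe_resources is hypothetical, and it only shows whether the module is importable in the deployed environment, not what ScrapyRT itself ends up loading.

```python
import importlib


def describe_resources(settings_path):
    """Return the RESOURCES mapping of a settings module, or a short
    message explaining why it could not be read."""
    try:
        module = importlib.import_module(settings_path)
    except ImportError as exc:
        return f"cannot import {settings_path}: {exc}"
    return getattr(module, "RESOURCES", "RESOURCES is not defined")


if __name__ == "__main__":
    # Dotted path taken from the Procfile in the question.
    print(describe_resources("nist_scraper.scrapyrt.settings"))
```

If this prints the mapping with 'crawl.json' in it when run from the project root, the settings module itself is reachable and the cause is more likely somewhere else, for example in the resource's own configuration.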
The project is live here: https://nist-scrapyrt.herokuapp.com/crawl.json?spider_name=webbook_nist&start_requests=true&crawl_args={%22cas%22:%227732185%22}
And there is a GitHub repo here: https://github.com/oscarcontrerasnavas/nist-webbook-scrapyrt-spider
As far as I know, if I do not add the custom settings through the command line arguments, the default settings from the scrapy.cfg are overwritten by the defaults for ScrapyRT.
I want the same behavior as in the local environment, but on Heroku. I do not want to run the spider every time because I know it is less "expensive" to pull the info from the database.
Any suggestions?
Solution
The implementation shown in the question is correct. The problem was a typo in the environment variables on Heroku. If you have questions on how to do it yourself, you can leave a comment.
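A typo like this is easy to miss because the resource in the question wraps the database setup in a catch-all except, so a misnamed variable can fall back to the spider without any visible error. As a minimal sketch, assuming the variable names MONGO_URI and MONGO_DB used in the question (the helper missing_env_vars is hypothetical), a check like the following reports which variables are unset:

```python
import os


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment.

    Defaults to the real process environment when no mapping is passed.
    """
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    # Variable names taken from the resource in the question.
    missing = missing_env_vars(["MONGO_URI", "MONGO_DB"])
    if missing:
        print("Missing config vars:", ", ".join(missing))
    else:
        print("All required config vars are set.")
```

Running the check at startup, or logging its result inside the except branch, would make a misspelled config var show up in the Heroku logs instead of being hidden behind the fallback.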
Answered By - Oscar Contreras