Issue
I'm trying to make a Scrapy scraper work using Cloud Run. The main idea is that every 20 minutes a Cloud Scheduler cron should trigger the web scraper and get data from different sites. All sites have the same structure, so I would like to use the same code and parallelize the execution of the scraping job, doing something like scrapy crawl scraper -a site=www.site1.com and scrapy crawl scraper -a site=www.site2.com.
I have already deployed a version of the scraper, but it can only do scrapy crawl scraper. How can I change the site argument of the command at execution time?
Also, should I be using a Cloud Run job or a service?
Solution
According to the Cloud Run jobs documentation, there is a trick.
- Define a number of tasks; for instance, set the number of tasks equal to the number of sites to scrape. Use the --tasks parameter for that.
- In your container (or in Cloud Storage, but in that case you have to download the file before starting the process), add a file with one website to scrape per line.
- At runtime, use the CLOUD_RUN_TASK_INDEX environment variable. That variable indicates the index of the task in the execution. For each different value, pick the corresponding line in your file of websites (the line number equal to the env var value), as in the sketch after this list.
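As a minimal sketch, a container entrypoint script could pick the site for the current task and launch the existing spider. The file name and path sites.txt are assumptions (any file baked into the image works), and the spider name scraper comes from the question:

```sh
#!/bin/sh
set -e

# CLOUD_RUN_TASK_INDEX is set by Cloud Run jobs for each task (0-based).
TASK_INDEX="${CLOUD_RUN_TASK_INDEX:-0}"

# sites.txt is an assumed file in the image, one website per line;
# pick the line matching this task's index (sed lines are 1-based).
SITE=$(sed -n "$((TASK_INDEX + 1))p" sites.txt)

if [ -z "$SITE" ]; then
  echo "No site defined for task index $TASK_INDEX, exiting."
  exit 0
fi

echo "Task $TASK_INDEX: scraping $SITE"
scrapy crawl scraper -a site="$SITE"
```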
Like that, you can leverage Cloud Run jobs and parallelism.
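For example, assuming the job is deployed with gcloud (the job name, image, region, and task count below are illustrative), the task count and parallelism could be set like this:

```sh
# --tasks should match the number of lines in sites.txt;
# --parallelism controls how many tasks run at the same time.
gcloud run jobs create scraper-job \
  --image=gcr.io/PROJECT_ID/scraper \
  --tasks=10 \
  --parallelism=10 \
  --region=us-central1

# Cloud Scheduler can then be configured to trigger an execution of this
# job every 20 minutes.
```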
The main tradeoff here is the static form of the list of websites to scrape.
Answered By - guillaume blaquiere