Issue
I have a crontab job that runs a myautorun.sh file which contains 3 spiders:
crontab -e
14 * * * * * ~bin/myautorun.sh
and myautorun.sh looks like this:
scrapy crawl spider1 &
scrapy crawl spider2 &
scrapy crawl spider3
Each spider scrapes 20 pages.
When the cron job runs, the total number of scraped pages varies around 30 and is never 60. Each spider reads a few pages but not all 20. However, if I comment out two of the spiders and run them one at a time, I get all 60 pages.
So I am puzzled why it can't run the spiders in parallel properly. I am running the spiders through Crawlera, on a virtual server with 1 GB of RAM.
Is there a setting or anything else that I am missing? Thanks.
Solution
The & means you are running the spiders in parallel, and that is most likely the issue: the three spiders together were taking more RAM than the server has, so they were being killed before they could finish.
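If that is what happened, the kernel's OOM killer usually leaves a trace in the system log. A quick way to check on a typical Linux server (the exact log file varies by distribution; /var/log/syslog is the Debian/Ubuntu location) is:
dmesg | grep -iE 'killed process|out of memory'
grep -i 'oom' /var/log/syslog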
In any case, you should also redirect each spider's output to a log file, so that if there is an error in the future you will be able to see what happened:
scrapy crawl spider1 > logs_1.txt 2>&1 &
scrapy crawl spider2 > logs_2.txt 2>&1 &
scrapy crawl spider3 > logs_3.txt 2>&1
(The 2>&1 is needed because Scrapy writes its log output to stderr, not stdout.)
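If 1 GB really is not enough for three spiders at once, the simplest fix is to run them sequentially, or to keep them parallel but reduce each spider's footprint. Below is a minimal sketch of myautorun.sh assuming the spider names above; CONCURRENT_REQUESTS and MEMUSAGE_LIMIT_MB are standard Scrapy settings passed with -s, and the values shown are only guesses for a 1 GB machine:
#!/bin/sh
# Option 1: run the spiders one after another, so only one is in memory at a time
scrapy crawl spider1 > logs_1.txt 2>&1
scrapy crawl spider2 > logs_2.txt 2>&1
scrapy crawl spider3 > logs_3.txt 2>&1

# Option 2: keep them parallel, but lower concurrency (less memory per spider)
# and let Scrapy close a spider gracefully if it exceeds the memory limit,
# instead of the whole process being OOM-killed mid-crawl
# scrapy crawl spider1 -s CONCURRENT_REQUESTS=4 -s MEMUSAGE_LIMIT_MB=256 > logs_1.txt 2>&1 &
# scrapy crawl spider2 -s CONCURRENT_REQUESTS=4 -s MEMUSAGE_LIMIT_MB=256 > logs_2.txt 2>&1 &
# scrapy crawl spider3 -s CONCURRENT_REQUESTS=4 -s MEMUSAGE_LIMIT_MB=256 > logs_3.txt 2>&1
Either way, the log files will show whether a spider finished on its own or was cut short.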
Answered By - Umair Ayub