Issue
I have created a spider that crawls news articles inside a Django project. I want to run that spider and also schedule it. The spider crawls the data and stores it in the database, which Django then uses to display the same data. Here's my spider:
```python
import scrapy
from scrapy.spiders import CrawlSpider

from news_scraper.items import NewsScraperItem  # adjust to your project layout


class NewsSpider(CrawlSpider):
    name = "news"
    start_urls = ['https://zeenews.india.com/latest-news']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        item = NewsScraperItem()
        data = response.css('div.sec-con-box')
        item['headlines'] = data.css('h3::text').extract_first()
        item['content'] = data.css('p::text').extract_first()
        return item
```
items.py:

```python
import scrapy
from scrapy_djangoitem import DjangoItem

from news.models import LatestNews


class NewsScraperItem(DjangoItem):
    # Fields are generated automatically from the Django model
    django_model = LatestNews
```
Solution
To enable scheduling and run the crawler in the background, I suggest using the django-background-tasks package.
Check out the documentation here.
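As a rough illustration of that approach, the task below launches the spider from a background task. This is a hedged sketch, not the package's prescribed pattern: it assumes django-background-tasks is installed and that `scrapy crawl news` works from the project root. The spider is run in a child process because Scrapy's Twisted reactor cannot be restarted inside a long-lived worker.

```python
# Hedged sketch: scheduling the spider with django-background-tasks.
# Assumes `pip install django-background-tasks` and that
# `scrapy crawl news` works from the project root.
import subprocess

try:
    from background_task import background
except Exception:
    # Fallback no-op decorator so the sketch stays readable/runnable
    # without the package (or a configured Django settings module).
    def background(schedule=None):
        def decorator(fn):
            return fn
        return decorator

CRAWL_CMD = ["scrapy", "crawl", "news"]  # the spider's `name` is "news"


@background(schedule=60)  # run 60 seconds after being queued
def crawl_news():
    # Run Scrapy in a child process so its Twisted reactor does not
    # conflict with the task worker process.
    subprocess.run(CRAWL_CMD, check=True)
```

Queue it once with something like `crawl_news(repeat=3600)` to re-run hourly, and keep a worker running with `python manage.py process_tasks`.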
Answered By - Jeffrey