Issue
I'm crawling all news from the first page of over 50 news websites on a daily basis and storing the articles in a MongoDB database, using each article's URL as its _id so it serves as a unique identifier. Some websites take significantly longer than others to crawl. To speed up the process, I need to check my database first and only crawl the newly extracted URLs.
I don't want to use Scrapy's built-in persistence support, as it's not exactly what I'm looking for. Also, as shown below, I have written a duplicate filter, but it only helps within a single crawling session: the set of seen URLs is gone after each process terminates.
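For clarity, the built-in persistence being ruled out here is presumably Scrapy's job support, which keeps the scheduler queue and seen-request fingerprints on disk between runs via the JOBDIR setting (the directory name below is just an example):

# settings.py -- Scrapy's job persistence, shown only for contrast
# (equivalent to running: scrapy crawl bbc -s JOBDIR=crawls/bbc-1)
JOBDIR = 'crawls/bbc-1'  # stores scheduler and dupe-filter state on disk between runs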
Here's what my pipeline.py looks like:
import logging

import pymongo
from scrapy.exceptions import DropItem


class DuplicatesPipeline:
    """Drops items whose URL has already been seen in the current session."""

    def __init__(self):
        self.urls_seen = set()

    def process_item(self, item, spider):
        if item['_id'] in self.urls_seen:
            raise DropItem("Duplicate item found: %s" % item)
        self.urls_seen.add(item['_id'])
        return item
class MongoDBPipeline:
    """Writes each item to a per-spider MongoDB collection."""

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Pull the connection details from the project settings.
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[spider.name].insert_one(dict(item))
        logging.debug("Article added to MongoDB")
        return item
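Both pipelines expect their registration and the Mongo connection details in settings.py. A minimal sketch, where the smartcrawler module path and the database name are placeholders for whatever the project actually uses:

# settings.py -- module paths and values are assumptions, adjust to your project
ITEM_PIPELINES = {
    'smartcrawler.pipelines.DuplicatesPipeline': 100,
    'smartcrawler.pipelines.MongoDBPipeline': 300,
}
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'news'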
I need to extract the URLs first, then check them against the _id values already in the database, and only then start crawling the URLs that aren't there yet.
Is there an easier way to do this? If not, how can I implement it?
Solution
I managed to check the database for all previously crawled URLs before following new links, which prevents duplicates and improved performance by roughly 50%. I got the idea from a guide written by Adrien Di Pasquale. Here's what my spider looks like after the modification; as suggested in the article, pipeline.py was also slightly modified (see the sketches after the spider code).
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# SmartCrawlerItem, MongoProvider, get_unique_date and clean_response are
# project-specific helpers (item definition, Mongo access, post-processing).


class BBCSpider(CrawlSpider):
    name = 'bbc'
    allowed_domains = ['www.bbc.com']
    start_urls = [
        'https://www.bbc.com/news/',
        'https://www.bbc.com/news/world/us_and_canada',
    ]
    rules = [Rule(
        LinkExtractor(allow=r'https://www\.bbc\.com/news/world-us-canada-[0-9]+$',
                      deny=r'https://www\.bbc\.com/news/av/.*'),
        callback='parse_item',
        process_links='filter_links',
    )]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # Hand the Mongo settings to __init__ so the spider can query the DB.
        kwargs['mongo_uri'] = crawler.settings.get('MONGO_URI')
        kwargs['mongo_database'] = crawler.settings.get('MONGO_DATABASE')
        return super(BBCSpider, cls).from_crawler(crawler, *args, **kwargs)

    def __init__(self, mongo_uri=None, mongo_database=None, *args, **kwargs):
        super(BBCSpider, self).__init__(*args, **kwargs)
        self.mongo_provider = MongoProvider(mongo_uri, mongo_database)
        self.collection = self.mongo_provider.get_collection(self)
        # URLs already scraped in previous crawling sessions; a set gives
        # O(1) membership checks in filter_links.
        self.scraped_urls = set(self.collection.find().distinct('_id'))

    def filter_links(self, links):
        # Drop links whose URL was scraped in a previous session. _id stores
        # the full article URL, so an exact match suffices; building a new
        # list avoids the bug of removing from a list while iterating over it.
        return [link for link in links if link.url not in self.scraped_urls]

    def parse_item(self, response):
        if response.status == 200:
            item = SmartCrawlerItem()
            item['_id'] = response.url
            item['title'] = response.css('title::text').get()
            item['date'] = response.xpath('//div[@class="story-body"]'
                                          '//ul[@class="mini-info-list"]//div/text()').get()
            item['article'] = response.css('div.story-body__inner>*::text').getall()
            # Skip incomplete articles instead of storing partial records.
            if None in item.values():
                return
            item['date'] = get_unique_date(item['date'])
            item['article'] = clean_response(item['article'])
            yield item
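MongoProvider is not a Scrapy or pymongo class but a project helper that isn't shown here. A minimal sketch of what the spider above needs from it, assuming one collection per spider name (to match MongoDBPipeline, which writes to db[spider.name]):

import pymongo


class MongoProvider:
    """Hypothetical minimal helper: hands out one collection per spider name."""

    def __init__(self, uri, database):
        self.client = pymongo.MongoClient(uri)
        self.db = self.client[database]

    def get_collection(self, spider):
        # Mirrors MongoDBPipeline, which writes items to db[spider.name].
        return self.db[spider.name]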
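The article's exact change to pipeline.py isn't quoted here. One plausible sketch, assuming the goal is to make inserts idempotent: since _id is the article URL and MongoDB enforces uniqueness on _id, catching DuplicateKeyError lets a re-crawled URL pass through without crashing the pipeline (this implementation is an assumption, not the article's code):

import logging

from pymongo.errors import DuplicateKeyError


class MongoDBPipeline:
    # __init__, from_crawler, open_spider and close_spider stay as before.

    def process_item(self, item, spider):
        try:
            self.db[spider.name].insert_one(dict(item))
            logging.debug("Article added to MongoDB")
        except DuplicateKeyError:
            # _id (the article URL) already exists, so skip the write.
            logging.debug("Article already in MongoDB: %s", item['_id'])
        return item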
Answered By - Rasool