Issue
I'm crawling all news from the first page of over 50 news websites on a daily basis and storing the articles in a MongoDB database, using each article's URL as its _id so it serves as a unique identifier. Some websites take significantly longer than others to crawl. To speed up the process, I need to check my database first and only crawl the newly extracted URLs.
I don't want to use Scrapy's built-in persistence support, as it's not exactly what I'm looking for. Also, as shown below, I have written a duplicate filter, but it only helps within a single crawling session: the set of seen URLs is gone after each process terminates.
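For clarity, the built-in persistence being ruled out here is presumably Scrapy's job support, which keeps the scheduler queue and seen-request fingerprints on disk between runs via the JOBDIR setting (the directory name below is just an example):

# settings.py -- Scrapy's job persistence, shown only for contrast
# (equivalent to running: scrapy crawl bbc -s JOBDIR=crawls/bbc-1)
JOBDIR = 'crawls/bbc-1'  # stores scheduler and dupe-filter state on disk between runs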
Here's what my pipeline.py looks like:
import logging

import pymongo
from scrapy.exceptions import DropItem


class DuplicatesPipeline:
    """Drops items whose URL has already been seen in the current session."""

    def __init__(self):
        self.urls_seen = set()

    def process_item(self, item, spider):
        if item['_id'] in self.urls_seen:
            raise DropItem("Duplicate item found: %s" % item)
        self.urls_seen.add(item['_id'])
        return item
class MongoDBPipeline:
    """Writes each item to a per-spider MongoDB collection."""

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Pull the connection details from the project settings.
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[spider.name].insert_one(dict(item))
        logging.debug("Article added to MongoDB")
        return item
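Both pipelines expect their registration and the Mongo connection details in settings.py. A minimal sketch, where the smartcrawler module path and the database name are placeholders for whatever the project actually uses:

# settings.py -- module paths and values are assumptions, adjust to your project
ITEM_PIPELINES = {
    'smartcrawler.pipelines.DuplicatesPipeline': 100,
    'smartcrawler.pipelines.MongoDBPipeline': 300,
}
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'news'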
I need to extract the URLs first, then check them against the _id values already in the database, and only then start crawling the URLs that aren't there yet.
Is there an easier way to do this? If not, how can I implement it?
Solution
I managed to check the database for all previously crawled URLs before following new links, which prevents duplicates and improved performance by roughly 50%. I got the idea from a guide written by Adrien Di Pasquale. Here's what my spider looks like after the modification; as suggested in the article, pipeline.py was also slightly modified (see the sketches after the spider code).
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# SmartCrawlerItem, MongoProvider, get_unique_date and clean_response are
# project-specific helpers (item definition, Mongo access, post-processing).


class BBCSpider(CrawlSpider):
    name = 'bbc'
    allowed_domains = ['www.bbc.com']
    start_urls = [
        'https://www.bbc.com/news/',
        'https://www.bbc.com/news/world/us_and_canada',
    ]
    rules = [Rule(
        LinkExtractor(allow=r'https://www\.bbc\.com/news/world-us-canada-[0-9]+$',
                      deny=r'https://www\.bbc\.com/news/av/.*'),
        callback='parse_item',
        process_links='filter_links',
    )]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # Hand the Mongo settings to __init__ so the spider can query the DB.
        kwargs['mongo_uri'] = crawler.settings.get('MONGO_URI')
        kwargs['mongo_database'] = crawler.settings.get('MONGO_DATABASE')
        return super(BBCSpider, cls).from_crawler(crawler, *args, **kwargs)

    def __init__(self, mongo_uri=None, mongo_database=None, *args, **kwargs):
        super(BBCSpider, self).__init__(*args, **kwargs)
        self.mongo_provider = MongoProvider(mongo_uri, mongo_database)
        self.collection = self.mongo_provider.get_collection(self)
        # URLs already scraped in previous crawling sessions; a set gives
        # O(1) membership checks in filter_links.
        self.scraped_urls = set(self.collection.find().distinct('_id'))

    def filter_links(self, links):
        # Drop links whose URL was scraped in a previous session. _id stores
        # the full article URL, so an exact match suffices; building a new
        # list avoids the bug of removing from a list while iterating over it.
        return [link for link in links if link.url not in self.scraped_urls]

    def parse_item(self, response):
        if response.status == 200:
            item = SmartCrawlerItem()
            item['_id'] = response.url
            item['title'] = response.css('title::text').get()
            item['date'] = response.xpath('//div[@class="story-body"]'
                                          '//ul[@class="mini-info-list"]//div/text()').get()
            item['article'] = response.css('div.story-body__inner>*::text').getall()
            # Skip incomplete articles instead of storing partial records.
            if None in item.values():
                return
            item['date'] = get_unique_date(item['date'])
            item['article'] = clean_response(item['article'])
            yield item
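MongoProvider is not a Scrapy or pymongo class but a project helper that isn't shown here. A minimal sketch of what the spider above needs from it, assuming one collection per spider name (to match MongoDBPipeline, which writes to db[spider.name]):

import pymongo


class MongoProvider:
    """Hypothetical minimal helper: hands out one collection per spider name."""

    def __init__(self, uri, database):
        self.client = pymongo.MongoClient(uri)
        self.db = self.client[database]

    def get_collection(self, spider):
        # Mirrors MongoDBPipeline, which writes items to db[spider.name].
        return self.db[spider.name]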
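The article's exact change to pipeline.py isn't quoted here. One plausible sketch, assuming the goal is to make inserts idempotent: since _id is the article URL and MongoDB enforces uniqueness on _id, catching DuplicateKeyError lets a re-crawled URL pass through without crashing the pipeline (this implementation is an assumption, not the article's code):

import logging

from pymongo.errors import DuplicateKeyError


class MongoDBPipeline:
    # __init__, from_crawler, open_spider and close_spider stay as before.

    def process_item(self, item, spider):
        try:
            self.db[spider.name].insert_one(dict(item))
            logging.debug("Article added to MongoDB")
        except DuplicateKeyError:
            # _id (the article URL) already exists, so skip the write.
            logging.debug("Article already in MongoDB: %s", item['_id'])
        return item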
Answered By - Rasool