Issue
I am trying to scrape the Reuters search results page. It is loaded via JavaScript, as explained in this question.
I changed numResultsToShow to more than 2000, e.g. 9999. There are over 45,000 items in total, but no matter what number I put in, Scrapy returns exactly 5,000 scraped items.
My code is as follows:
import json
import re

import scrapy

# ReuterItem is the project's item class (href, date, headline fields);
# its definition is shown in the solution below.


class ReutersSpider(scrapy.Spider):
    name = "reuters"
    start_urls = [
        'https://www.reuters.com/assets/searchArticleLoadMoreJson?blob=steel.&bigOrSmall=big&articleWithBlog=true&sortBy=&dateRange=&numResultsToShow=9999&pn=1&callback=addMoreNewsResults',
    ]

    def parse(self, response):
        html = response.body.decode('utf-8')
        json_string = re.search(r'addMoreNewsResults\((.+?) \);', html, re.DOTALL).group(1)

        # Transform the Javascript-ish, JSON-like payload into valid JSON.
        json_string = re.sub(r'^\s*(\w+):', r'"\1":', json_string, flags=re.MULTILINE)
        json_string = re.sub(r'(\w+),\s*$', r'"\1",', json_string, flags=re.MULTILINE)
        json_string = re.sub(r':\s*\'(.+?)\',\s*$', r': "\1",', json_string, flags=re.MULTILINE)

        results = json.loads(json_string)
        for result in results["news"]:
            item = ReuterItem()
            item["href"] = result["href"]
            item["date"] = result["date"]
            item["headline"] = result["headline"]
            yield item
How can I get past this limit and cover all of the search results?
Solution
There are more than a few considerations when crawling sites like this, even more so when you are using their internal APIs. Here are a few points of advice from my experience, in no particular order:
- Since you will likely be making a lot of requests while changing the query arguments, it's good practice to build the URLs dynamically so you don't go crazy.
- Always try to remove as much boilerplate from your requests as possible: extra query parameters, headers, and so on. It's useful to play around with the API in a tool like Postman to arrive at the bare minimum working request.
- As the spider gets more complicated and/or the crawling logic grows, it's useful to extract the relevant code into separate methods for readability and easier maintenance.
- You can pass along valuable information in the meta of your request, which will be copied to the response's meta. In the example below this is used to keep track of the current page being crawled. Alternatively, you can extract the page number from the URL itself, which is more robust (a small sketch of that follows this list).
- Consider whether you need any cookies in order to visit a certain page. You might not be able to get a response directly from the API (or any page, for that matter) without the proper cookies. Usually it's enough to visit the main page first, and Scrapy will take care of storing the cookies for you.
- Always be polite, to avoid being banned and to keep the load on the target site low: use a high download delay if possible, and keep the concurrency down (a settings sketch also follows this list).
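As a minimal sketch of the page-number alternative, using only the standard library (the helper name page_from_url is just for illustration):

from urllib.parse import parse_qs, urlparse

def page_from_url(url):
    # Read the 'pn' query argument back out of a search API URL,
    # e.g. '...&numResultsToShow=1000&pn=3&callback=...' -> 3
    query = parse_qs(urlparse(url).query)
    return int(query['pn'][0])

Inside the spider this would be called as page_from_url(response.url) instead of reading response.meta['page'].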
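And a minimal politeness sketch using standard Scrapy settings; the exact values are a conservative starting point, not something tuned for this particular site:

custom_settings = {
    # Wait between requests and keep concurrency low.
    'DOWNLOAD_DELAY': 5,
    'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
    # Optionally let Scrapy adapt the delay to the site's response times.
    'AUTOTHROTTLE_ENABLED': True,
    'AUTOTHROTTLE_START_DELAY': 5,
}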
All that said, I've given it a quick run and put together a semi-working example that should be enough to get you started. There are still improvements to be made, such as more complex retry logic, revisiting the main page when the cookies expire, and so on:
# -*- coding: utf-8 -*-
import json
import re
from urllib.parse import urlencode

import scrapy


class ReuterItem(scrapy.Item):
    href = scrapy.Field()
    date = scrapy.Field()
    headline = scrapy.Field()


class ReutersSpider(scrapy.Spider):
    name = "reuters"

    NEWS_URL = 'https://www.reuters.com/search/news?blob={}'
    SEARCH_URL = 'https://www.reuters.com/assets/searchArticleLoadMoreJson?'
    RESULTS_PER_PAGE = 1000
    BLOB = 'steel.'

    custom_settings = {
        # blend in
        'USER_AGENT': ('Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0)'
                       ' Gecko/20100101 Firefox/40.1'),
        # be polite
        'DOWNLOAD_DELAY': 5,
    }

    def _build_url(self, page):
        params = {
            'blob': self.BLOB,
            'bigOrSmall': 'big',
            'callback': 'addMoreNewsResults',
            'articleWithBlog': True,
            'numResultsToShow': self.RESULTS_PER_PAGE,
            'pn': page,
        }
        return self.SEARCH_URL + urlencode(params)

    def _parse_page(self, response):
        html = response.body.decode('utf-8')
        json_string = re.search(r'addMoreNewsResults\((.+?) \);', html, re.DOTALL).group(1)

        # Transform the Javascript-ish, JSON-like payload into valid JSON.
        json_string = re.sub(r'^\s*(\w+):', r'"\1":', json_string, flags=re.MULTILINE)
        json_string = re.sub(r'(\w+),\s*$', r'"\1",', json_string, flags=re.MULTILINE)
        json_string = re.sub(r':\s*\'(.+?)\',\s*$', r': "\1",', json_string, flags=re.MULTILINE)

        return json.loads(json_string)

    def start_requests(self):
        # Visit the news page first to get the cookies needed
        # to query the API in the next steps.
        url = self.NEWS_URL.format(self.BLOB)
        yield scrapy.Request(url, callback=self.start_crawl)

    def start_crawl(self, response):
        # Now that the cookies are set, start crawling from the first page.
        yield scrapy.Request(self._build_url(1), meta=dict(page=1))

    def parse(self, response):
        data = self._parse_page(response)

        # Extract news items from the current page.
        for item in self._parse_news(data):
            yield item

        # Paginate if needed.
        current_page = response.meta['page']
        total_results = int(data['totalResultNumber'])
        if total_results > (current_page * self.RESULTS_PER_PAGE):
            page = current_page + 1
            url = self._build_url(page)
            yield scrapy.Request(url, meta=dict(page=page))

    def _parse_news(self, data):
        for article in data["news"]:
            item = ReuterItem()
            item["href"] = article["href"]
            item["date"] = article["date"]
            item["headline"] = article["headline"]
            yield item
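As a starting point for the retry logic mentioned above, Scrapy's built-in retry middleware can be tuned through settings; a minimal sketch, with values that are assumptions rather than anything tested against this site:

custom_settings = {
    # ... the USER_AGENT and DOWNLOAD_DELAY settings from above ...
    # Retry transient failures a few times before giving up.
    'RETRY_ENABLED': True,
    'RETRY_TIMES': 3,
    'RETRY_HTTP_CODES': [408, 429, 500, 502, 503, 504],
}

Since the item and the spider live in a single file, the example can be tried directly with scrapy runspider (e.g. scrapy runspider reuters.py -o results.json, where the file name is simply wherever you saved it).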
Answered By - bosnjak