Issue
I am trying to scrape the Reuters search results page. It is loaded via JavaScript, as explained in this question.
I changed numResultsToShow to more than 2000, e.g. 9999. There are over 45,000 items in total, but no matter what number I put in, Scrapy returns exactly 5,000 scraped items.
My code is as follows:
import json
import re

import scrapy

# ReuterItem is the project's item class (href, date, headline fields);
# its definition is shown in the solution below.


class ReutersSpider(scrapy.Spider):
    name = "reuters"
    start_urls = [
        'https://www.reuters.com/assets/searchArticleLoadMoreJson?blob=steel.&bigOrSmall=big&articleWithBlog=true&sortBy=&dateRange=&numResultsToShow=9999&pn=1&callback=addMoreNewsResults',
    ]

    def parse(self, response):
        html = response.body.decode('utf-8')
        json_string = re.search(r'addMoreNewsResults\((.+?) \);', html, re.DOTALL).group(1)

        # Transform the Javascript-ish, JSON-like payload into valid JSON.
        json_string = re.sub(r'^\s*(\w+):', r'"\1":', json_string, flags=re.MULTILINE)
        json_string = re.sub(r'(\w+),\s*$', r'"\1",', json_string, flags=re.MULTILINE)
        json_string = re.sub(r':\s*\'(.+?)\',\s*$', r': "\1",', json_string, flags=re.MULTILINE)

        results = json.loads(json_string)
        for result in results["news"]:
            item = ReuterItem()
            item["href"] = result["href"]
            item["date"] = result["date"]
            item["headline"] = result["headline"]
            yield item
How can I get past this limit and cover all of the search results?
Solution
There are more than a few considerations when crawling sites like this, even more so when you are using their internal APIs. Here are a few points of advice from my experience, in no particular order:
- Since you will likely be making a lot of requests while changing the query arguments, it's good practice to build the URLs dynamically so you don't go crazy.
- Always try to remove as much boilerplate from your requests as possible: extra query parameters, headers, and so on. It's useful to play around with the API in a tool like Postman to arrive at the bare minimum working request.
- As the spider gets more complicated and/or the crawling logic grows, it's useful to extract the relevant code into separate methods for readability and easier maintenance.
- You can pass along valuable information in the meta of your request, which will be copied to the response's meta. In the example below this is used to keep track of the current page being crawled. Alternatively, you can extract the page number from the URL itself, which is more robust (a small sketch of that follows this list).
- Consider whether you need any cookies in order to visit a certain page. You might not be able to get a response directly from the API (or any page, for that matter) without the proper cookies. Usually it's enough to visit the main page first, and Scrapy will take care of storing the cookies for you.
- Always be polite, to avoid being banned and to keep the load on the target site low: use a high download delay if possible, and keep the concurrency down (a settings sketch also follows this list).
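As a minimal sketch of the page-number alternative, using only the standard library (the helper name page_from_url is just for illustration):

from urllib.parse import parse_qs, urlparse

def page_from_url(url):
    # Read the 'pn' query argument back out of a search API URL,
    # e.g. '...&numResultsToShow=1000&pn=3&callback=...' -> 3
    query = parse_qs(urlparse(url).query)
    return int(query['pn'][0])

Inside the spider this would be called as page_from_url(response.url) instead of reading response.meta['page'].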
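And a minimal politeness sketch using standard Scrapy settings; the exact values are a conservative starting point, not something tuned for this particular site:

custom_settings = {
    # Wait between requests and keep concurrency low.
    'DOWNLOAD_DELAY': 5,
    'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
    # Optionally let Scrapy adapt the delay to the site's response times.
    'AUTOTHROTTLE_ENABLED': True,
    'AUTOTHROTTLE_START_DELAY': 5,
}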
All that said, I've given it a quick run and put together a semi-working example that should be enough to get you started. There are still improvements to be made, such as more complex retry logic, revisiting the main page when the cookies expire, and so on:
# -*- coding: utf-8 -*-
import json
import re
from urllib.parse import urlencode

import scrapy


class ReuterItem(scrapy.Item):
    href = scrapy.Field()
    date = scrapy.Field()
    headline = scrapy.Field()


class ReutersSpider(scrapy.Spider):
    name = "reuters"

    NEWS_URL = 'https://www.reuters.com/search/news?blob={}'
    SEARCH_URL = 'https://www.reuters.com/assets/searchArticleLoadMoreJson?'
    RESULTS_PER_PAGE = 1000
    BLOB = 'steel.'

    custom_settings = {
        # blend in
        'USER_AGENT': ('Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0)'
                       ' Gecko/20100101 Firefox/40.1'),
        # be polite
        'DOWNLOAD_DELAY': 5,
    }

    def _build_url(self, page):
        params = {
            'blob': self.BLOB,
            'bigOrSmall': 'big',
            'callback': 'addMoreNewsResults',
            'articleWithBlog': True,
            'numResultsToShow': self.RESULTS_PER_PAGE,
            'pn': page,
        }
        return self.SEARCH_URL + urlencode(params)

    def _parse_page(self, response):
        html = response.body.decode('utf-8')
        json_string = re.search(r'addMoreNewsResults\((.+?) \);', html, re.DOTALL).group(1)

        # Transform the Javascript-ish, JSON-like payload into valid JSON.
        json_string = re.sub(r'^\s*(\w+):', r'"\1":', json_string, flags=re.MULTILINE)
        json_string = re.sub(r'(\w+),\s*$', r'"\1",', json_string, flags=re.MULTILINE)
        json_string = re.sub(r':\s*\'(.+?)\',\s*$', r': "\1",', json_string, flags=re.MULTILINE)

        return json.loads(json_string)

    def start_requests(self):
        # Visit the news page first to get the cookies needed
        # to query the API in the next steps.
        url = self.NEWS_URL.format(self.BLOB)
        yield scrapy.Request(url, callback=self.start_crawl)

    def start_crawl(self, response):
        # Now that the cookies are set, start crawling from the first page.
        yield scrapy.Request(self._build_url(1), meta=dict(page=1))

    def parse(self, response):
        data = self._parse_page(response)

        # Extract news items from the current page.
        for item in self._parse_news(data):
            yield item

        # Paginate if needed.
        current_page = response.meta['page']
        total_results = int(data['totalResultNumber'])
        if total_results > (current_page * self.RESULTS_PER_PAGE):
            page = current_page + 1
            url = self._build_url(page)
            yield scrapy.Request(url, meta=dict(page=page))

    def _parse_news(self, data):
        for article in data["news"]:
            item = ReuterItem()
            item["href"] = article["href"]
            item["date"] = article["date"]
            item["headline"] = article["headline"]
            yield item
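As a starting point for the retry logic mentioned above, Scrapy's built-in retry middleware can be tuned through settings; a minimal sketch, with values that are assumptions rather than anything tested against this site:

custom_settings = {
    # ... the USER_AGENT and DOWNLOAD_DELAY settings from above ...
    # Retry transient failures a few times before giving up.
    'RETRY_ENABLED': True,
    'RETRY_TIMES': 3,
    'RETRY_HTTP_CODES': [408, 429, 500, 502, 503, 504],
}

Since the item and the spider live in a single file, the example can be tried directly with scrapy runspider (e.g. scrapy runspider reuters.py -o results.json, where the file name is simply wherever you saved it).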
Answered By - bosnjak