Issue
I am trying to scrape articles from this website. My approach:
- open the main URL
- follow each sub-URL that holds the complete article
- extract all the details I need from the complete article
My first run returned a 403 response, so, following some answers I read, I added headers to the requests for the start_urls. That got me past the main page, but the script now fails with a 403 response when following the sub-URLs that contain the information I need.
My current code is below:
import scrapy
from scrapy import Request
from scrapy.crawler import CrawlerProcess

class climateupdate(scrapy.Spider):
    name = 'climateupdate'
    start_urls = ['http://www.bom.gov.au/climate/updates/']

    def start_requests(self):
        headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}
        for url in self.start_urls:
            yield Request(url, headers=headers)

    def parse(self, response):
        for link in response.xpath('//*[@id="content"]/ul/li[1]/a/@href'):
            yield response.follow(
                url=link.get(),
                callback=self.parse_item
            )

    def parse_item(self, response):
        yield {
            'date': response.xpath('//*[@id="updates"]/p[1]/time/text()').extract(),
            'title': response.xpath('//*[@id="updates"]/div[1]/h1/text()').get(),
            'text': ''.join([x.get().strip() for x in response.xpath('//*[@class="key-points box-notice bg-grey"]//p//text()')])
        }

if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(climateupdate)
    process.start()
How should I write my script so that it follows the sub-URLs and extracts all the details about the articles?
Thank you in advance.
Solution
Passing headers in start_requests is not the right fix here; that is why you are still getting a 403. Inject the user-agent via the spider's custom_settings instead. Your XPath expressions for date and text are also incorrect, and //*[@id="content"]/ul/li[1]/a/@href selects only a single detail URL instead of all of them.
Full working code:
import scrapy
from scrapy import Request
from scrapy.crawler import CrawlerProcess

class climateupdate(scrapy.Spider):
    name = 'climateupdate'
    start_urls = ['http://www.bom.gov.au/climate/updates/']
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 1,
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
    }

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, callback=self.parse)

    def parse(self, response):
        for link in response.xpath('//*[@class="list-archive"]/li/a/@href'):
            yield response.follow(
                url=link.get(),
                callback=self.parse_item
            )

    def parse_item(self, response):
        yield {
            'date': response.xpath('//*[@id="updates"]/p[1]/time/text()').get(),
            'title': ''.join(response.xpath('//*[@id="updates"]//h1//text()').getall()).strip(),
            'text': ''.join(response.xpath('//*[@id="updates"]//p//text()').getall()[1:]).strip()
        }

if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(climateupdate)
    process.start()
Answered By - F.Hoque