Issue
I'm still learning web scraping. I'm trying to scrape a website by collecting all the articles from an index page and then grabbing their details, including the full text. With the code below, I can get everything I need (date, time, category, title) except for the full article.
'text': news.css('p.categoryArticle__excerpt::text').get()
did not capture the full article text.
Here is the code I wrote so far:
import scrapy


class CoalNewsFromOilPrice(scrapy.Spider):
    name = 'coalnews'
    start_urls = ['https://oilprice.com/Energy/Coal/']

    def parse(self, response):
        for news in response.css('div.categoryArticle__content'):
            yield {
                'datetime': news.css('p.categoryArticle__meta::text').get(),
                'category': news.xpath('//h1[@class="categoryHeading"]/text()').extract()[0].replace('/', '').replace(' ', ''),
                'title': news.css('h2.categoryArticle__title::text').get(),
                'text': news.css('p.categoryArticle__excerpt::text').get(),
            }
        next_page = response.css('a.num').attrib['href']
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
Here are the elements I need. When I open the article URL from the HTML, the page shows the full text, but I have not figured out how to retrieve it. I am thinking of extracting that URL, but I don't know how.
<div class="categoryArticle__content">
    <a href="https://oilprice.com/Energy/Coal/Russias-Coal-Exports-Are-On-The-Rise-As-EU-Ban-Looms.html">
        <h2 class="categoryArticle__title">Russia’s Coal Exports Are On The Rise As EU Ban Looms</h2>
    </a>
    <p class="categoryArticle__meta">Jul 06, 2022 at 09:41 | Tsvetana Paraskova</p>
    <p class="categoryArticle__excerpt">
        Russian seaborne coal exports are estimated to have increased since Putin’s
        invasion of Ukraine and the EU announcement it was banning Russian coal imports
        from August.
    </p>
</div>
What should I do to get the full text of the articles?
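As a rough illustration of the "extract the URL" idea, here is a minimal sketch that pulls the article link out of markup shaped like the snippet above. It uses only the standard library's html.parser as a stand-in so it can run without Scrapy; the class name ArticleLinkParser and the helper extract_article_links are hypothetical names made up for this example, and the sample markup is a shortened copy of the snippet. Inside a Scrapy spider, the same value would more likely come from a selector such as news.css('a::attr(href)').get().

```python
from html.parser import HTMLParser

# Shortened copy of the index-page snippet from the question.
SAMPLE = """
<div class="categoryArticle__content">
  <a href="https://oilprice.com/Energy/Coal/Russias-Coal-Exports-Are-On-The-Rise-As-EU-Ban-Looms.html">
    <h2 class="categoryArticle__title">Russia’s Coal Exports Are On The Rise As EU Ban Looms</h2>
  </a>
  <p class="categoryArticle__meta">Jul 06, 2022 at 09:41 | Tsvetana Paraskova</p>
</div>
"""


class ArticleLinkParser(HTMLParser):
    """Collects the href of every <a> nested in a categoryArticle__content div."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # how many <div> levels deep we are inside a target div
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            # Start counting at a target div; keep counting for nested divs.
            if self.depth or "categoryArticle__content" in classes:
                self.depth += 1
        elif tag == "a" and self.depth and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1


def extract_article_links(html):
    """Return all article links found in the given HTML string."""
    parser = ArticleLinkParser()
    parser.feed(html)
    return parser.links


links = extract_article_links(SAMPLE)
print(links)
```

Each extracted URL can then be requested in turn, and the full text read from the article page rather than from the index page.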
Solution
The spider below pulls the full text and handles pagination through start_urls. For each index page, I follow every article link to its details page, and from the details page I grab all the required data items using XPath expressions.
import scrapy
from scrapy.crawler import CrawlerProcess


class CoalNewsFromOilPrice(scrapy.Spider):
    name = 'coalnews'
    # Index pages 1 to 17; the upper bound is hardcoded.
    start_urls = ['https://oilprice.com/Energy/Coal/Page-' + str(x) + '.html' for x in range(1, 18)]

    def parse(self, response):
        # Follow each article link on the index page to its details page.
        for link in response.xpath('//*[@class="categoryArticle__content"]/a/@href'):
            yield scrapy.Request(
                url=link.get(),
                callback=self.parse_item
            )

    def parse_item(self, response):
        # Extract every field from the article's details page.
        yield {
            'datetime': response.xpath('//*[@class="article_byline"]/text()[2]').get(),
            'category': response.xpath('(//*[@itemprop="name"])[3]/text()').get(),
            'title': response.xpath('//*[@class="singleArticle__content"]/h1/text()').get(),
            # Join all text nodes found inside the article's <p> tags.
            'text': ''.join([x.get().strip() for x in response.xpath('//*[@id="article-content"]//p//text()')])
        }


if __name__ == '__main__':
    # Run the spider as a plain Python script.
    process = CrawlerProcess()
    process.crawl(CoalNewsFromOilPrice)
    process.start()
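The 'text' field above is built by stripping each text node and joining the pieces with an empty string, which can glue the end of one paragraph to the start of the next. The sketch below shows that joining step in isolation, in plain Python. The helper name join_fragments and the sample fragments are invented for illustration; they stand in for the strings the XPath query might return on a real page, and whether a space separator gives cleaner output depends on the page's markup.

```python
def join_fragments(fragments, sep=" "):
    """Strip each text fragment, drop the empty ones, and join the rest."""
    cleaned = (fragment.strip() for fragment in fragments)
    return sep.join(piece for piece in cleaned if piece)


# Made-up fragments standing in for the text nodes of an article page.
fragments = [
    "\n  Russian seaborne coal exports are estimated to have increased.  ",
    "\n",
    "The EU ban takes effect in August.",
]

spaced = join_fragments(fragments)
glued = join_fragments(fragments, sep="")

print(spaced)  # sentences separated by a single space
print(glued)   # sentences run together, as with ''.join(...)
```

If the glued form is a problem, one option would be to join with a space in the spider as well, keeping in mind that text split across inline tags may then gain a stray space.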
Answered By - F.Hoque