Tuesday, August 2, 2022

[FIXED] how to get a full news article from a website with scrapy

python, scrapy

Issue

I'm still learning web scraping. I'm trying to scrape a website by collecting all the articles from an index page and then grabbing their details along with the full text. With the code below I can get everything I need (date, time, category, title) except the full article.

'text': news.css('p.categoryArticle__excerpt::text').get() did not capture all the text.

Here is the code I wrote so far:

import scrapy

class CoalNewsFromOilPrice(scrapy.Spider):
    name = 'coalnews'
    start_urls = ['https://oilprice.com/Energy/Coal/']

    def parse(self, response):
        for news in response.css('div.categoryArticle__content'):
            yield {
                'datetime': news.css('p.categoryArticle__meta::text').get(),
                'category': news.xpath('//h1[@class="categoryHeading"]/text()').extract()[0].replace('/', '').replace(' ',''),
                'title': news.css('h2.categoryArticle__title::text').get(),
                'text':  news.css('p.categoryArticle__excerpt::text').get(),
            }
        next_page = response.css('a.num').attrib['href']
        if next_page is not None:
            yield response.follow(next_page, callback = self.parse)

And here are the elements I need. When I open the article URL from the href, the page shows the full text, but I still don't understand how to get it. I'm thinking of extracting that URL, but I don't know how.

<div class="categoryArticle__content">
    
       <a href="https://oilprice.com/Energy/Coal/Russias-Coal-Exports-Are-On-The-Rise-As-EU-Ban-Looms.html">
          <h2 class="categoryArticle__title">Russia’s Coal Exports Are On The Rise As EU Ban Looms</h2>
       </a>
       <p class="categoryArticle__meta">Jul 06, 2022 at 09:41 | Tsvetana Paraskova</p>
       <p class="categoryArticle__excerpt"></p>
        Russian seaborne coal exports are estimated to have increased since Putin’s 
        invasion of Ukraine and the EU announcement it was banning Russian coal imports 
        from August.&nbsp;&nbsp;&nbsp;

                        </div>

What should I do to get the full text of the articles?

Solution

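The index page only carries a short excerpt of each article, so the spider has to visit each article's detail page to get the full text. The version below lists the paginated index pages in start_urls, follows every article link found on them, and extracts all the required fields from the detail page with XPath expressions.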

import scrapy
from scrapy.crawler import CrawlerProcess


class CoalNewsFromOilPrice(scrapy.Spider):
    name = 'coalnews'
    # Index pages 1 to 17. The page count is hard-coded, so adjust the range if it changes.
    start_urls = [f'https://oilprice.com/Energy/Coal/Page-{x}.html' for x in range(1, 18)]

    def parse(self, response):
        # Each article block on the index page wraps its title in a link to the full article.
        for link in response.xpath('//*[@class="categoryArticle__content"]/a/@href').getall():
            yield response.follow(link, callback=self.parse_item)

    def parse_item(self, response):
        # Text nodes of every paragraph in the article body.
        paragraphs = response.xpath('//*[@id="article-content"]//p//text()').getall()
        yield {
            'datetime': response.xpath('//*[@class="article_byline"]/text()[2]').get(),
            'category': response.xpath('(//*[@itemprop="name"])[3]/text()').get(),
            'title': response.xpath('//*[@class="singleArticle__content"]/h1/text()').get(),
            # Join with spaces so words from adjacent text nodes do not run together.
            'text': ' '.join(p.strip() for p in paragraphs if p.strip()),
        }


if __name__ == '__main__':
    # Run the spider as a plain script instead of through the `scrapy crawl` command.
    process = CrawlerProcess()
    process.crawl(CoalNewsFromOilPrice)
    process.start()
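
The 'text' field above is built by joining the text nodes of the article's paragraphs. Scraped text nodes often carry stray newlines, indentation, and non-breaking spaces (the &nbsp; entities visible in the question's markup arrive as '\xa0' characters). The following is a minimal sketch of a cleanup helper that could be applied to the list of text nodes. The name clean_text and the sample fragments are hypothetical and not part of the original answer.

```python
import re


def clean_text(fragments):
    """Join text fragments and collapse every run of whitespace to one space.

    `fragments` is assumed to be a list of strings, such as the result of
    calling .getall() on a text() XPath. This helper is illustrative only.
    """
    joined = " ".join(fragments)
    # In Python 3, \s also matches the non-breaking space '\xa0'.
    return re.sub(r"\s+", " ", joined).strip()


if __name__ == "__main__":
    sample = [
        "  Russian seaborne coal exports ",
        "are estimated to have increased\n",
        "from August.\xa0\xa0\xa0",
    ]
    print(clean_text(sample))
```

Joining with a space keeps words from neighbouring nodes apart, at the cost of an occasional extra space before punctuation when an inline element (a link, for example) ends right before a full stop or comma.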

Answered By - F.Hoque

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
