Issue
I tried to use Scrapy to get the article body from a news site.
import scrapy
import sys
import json

class ReutersPage(scrapy.Spider):
    name = "reutersPage"
    start_urls = [
        'https://www.reuters.com/article/chile-sqm-stocks/lithium-miner-sqm-shares-up-2-7-pct-chile-court-clears-way-for-tianqi-stake-purchase-idUSC0N1OX01C'
    ]

    def parse(self, response):
        articleBody = response.css('div.StandardArticleBody_body::text').extract_first()
        print('######## Article body ##########')
        print(articleBody)
        yield {
            'body': articleBody
        }
I am trying to get the text in the div StandardArticleBody_body, but I always get a None value.
The output is:
2018-10-26 14:23:44 [scrapy.core.engine] INFO: Spider opened
2018-10-26 14:23:44 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-10-26 14:23:44 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-10-26 14:23:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.reuters.com/robots.txt> (referer: None)
2018-10-26 14:23:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.reuters.com/article/chile-sqm-stocks/lithium-miner-sqm-shares-up-2-7-pct-chile-court-clears-way-for-tianqi-stake-purchase-idUSC0N1OX01C> (referer: None)
######## Parse article ##########
######## Article body ##########
None
2018-10-26 14:23:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.reuters.com/article/chile-sqm-stocks/lithium-miner-sqm-shares-up-2-7-pct-chile-court-clears-way-for-tianqi-stake-purchase-idUSC0N1OX01C>
{'body': None}
2018-10-26 14:23:45 [scrapy.core.engine] INFO: Closing spider (finished)
Solution
There isn't any text belonging directly to the div you're selecting; the text belongs to its descendants. A space between the selector path and the ::text pseudo-element selects the text of all descendants, not just the text of the node you've selected.
Try this:
articleBody = response.css('div.StandardArticleBody_body ::text').extract_first()
This way you get the text of all of the div's descendants.
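Note that extract_first() still returns only the first matching text node. If you want the whole article body, a minimal sketch (assuming the same page structure and Scrapy's standard selector API) could join all the extracted fragments instead:

    def parse(self, response):
        # The space before ::text selects text nodes from all descendants,
        # not just text that is a direct child of the div.
        fragments = response.css('div.StandardArticleBody_body ::text').extract()
        # extract_first() would return only the first fragment; joining
        # every non-empty fragment reconstructs the full article body.
        articleBody = ' '.join(f.strip() for f in fragments if f.strip())
        yield {'body': articleBody}

In newer Scrapy versions, .get() and .getall() are the preferred aliases for extract_first() and extract().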
Answered By - pwinz