Sunday, December 26, 2021

[FIXED] Scraping for related news using Scrapy

Tags: python, scrapy, web-scraping

Issue

I want to scrape the Snopes fact-checking website using Scrapy and find news related to a search term supplied by the user. The user enters a word and the crawler should return related news; for example, if I enter NASA, it should return NASA-related news. I tried the code below, but it produces no output.

import scrapy

class fakenews(scrapy.Spider):
    name = "snopes5"
    allowed_domains = ["snopes.com"]
    start_urls = [
            "https://www.snopes.com/fact-check/category/science/"
    ]

    def parse(self, response):
            name1=input('Please Enter the search item you want for fake news: ')
            headers = response.xpath('//div[@class="media-body"]/h5').extract()
            headers = [c.strip().lower() for c in headers]
            if name1 in headers:
                print(response.xpath('//div[@class="navHeader"]/ul'))
                filename = response.url.split("/")[-2] + '.html'
                with open(filename, 'wb') as f:
                    f.write(response.body)

Solution

There's one vital error in your code:

c=response.xpath('//div[@class="navHeader"]/ul')
if name1 in c:
    ...

Here c ends up being a SelectorList object, and you are checking whether the string name1 is in that SelectorList, which will always be False.
To remedy this, you need to extract the values:

c=response.xpath('//div[@class="navHeader"]/ul').extract()
                                                ^^^^^^^^^^

Additionally, you will probably want to normalize the values to make matching more forgiving:

headers = response.xpath('//div[@class="navHeader"]/ul').extract()
headers = [c.strip().lower() for c in headers]
if name1 in headers:
    ...

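The above strips leading and trailing whitespace and lowercases every value for case-insensitive matching. The search term in name1 needs the same treatment, otherwise an input such as NASA can never match a lowercased value.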
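To illustrate what that normalization does and does not buy you, here is a minimal sketch in plain Python. The raw_headers list is a hypothetical stand-in for whatever .extract() returns; it is not real Snopes data. Note that the in operator on a list compares whole elements, so it only succeeds when an element equals the search term exactly.

```python
# Hypothetical sample values standing in for the result of .extract()
raw_headers = ["  NASA  ", "\nDid This Gorilla Learn How to Knit?\n"]

# Same normalization as in the answer: trim whitespace, then lowercase
headers = [h.strip().lower() for h in raw_headers]

# The search term gets the same treatment
term = "NASA".strip().lower()

# List membership compares whole elements
print(term in headers)       # True: one element is exactly "nasa"
print("gorilla" in headers)  # False: no element equals "gorilla"

# Substring matching has to look inside each element
print(any("gorilla" in h for h in headers))  # True
```

So if the goal is to find a word inside a longer headline, a per-element substring check is probably what you want rather than list membership.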

Your use case example:

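# /text() selects only the heading text, not the surrounding <h5> markup
headers = response.xpath('//div[@class="media-body"]/h5/text()').extract()
headers = [c.strip().lower() for c in headers]
for header in headers:
    # substring check against each individual header
    if 'gorilla' in header:
        print(f'yay matching header: "{header}"')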

outputs:

yay matching header: "did this gorilla learn how to knit?"
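The loop above could be wrapped in a small helper so that the user's search term is normalized the same way as the headers. This is only a sketch: find_related is a hypothetical name, not part of Scrapy, and the sample list stands in for the strings extracted from the response.

```python
def find_related(headlines, term):
    """Return the headlines containing `term`, ignoring case and outer whitespace."""
    needle = term.strip().lower()
    if not needle:
        # An empty search term would otherwise match every headline
        return []
    return [h.strip() for h in headlines if needle in h.lower()]


# Hypothetical sample data in place of a live response
sample = [
    "\n  Did This Gorilla Learn How to Knit?  ",
    "Hypothetical NASA headline",
]

for headline in find_related(sample, "  nasa "):
    print(f'matching header: "{headline}"')
```

Inside parse() you would pass the extracted header strings and the value of name1 to the helper. Returning the original (only trimmed) headline keeps its capitalization for display while the comparison itself stays case-insensitive.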

Answered By - Granitosaurus

This answer was collected from Stack Overflow, tested by PythonFixing community admins, and is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
