Issue
I want to scrape the Snopes fact-checking website using Scrapy and return news related to a word the user enters. For example, if I enter NASA, the crawler should return NASA-related news. I tried the code below, but there is no output.
import scrapy


class fakenews(scrapy.Spider):
    name = "snopes5"
    allowed_domains = ["snopes.com"]
    start_urls = [
        "https://www.snopes.com/fact-check/category/science/"
    ]

    def parse(self, response):
        name1=input('Please Enter the search item you want for fake news: ')
        headers = response.xpath('//div[@class="media-body"]/h5').extract()
        headers = [c.strip().lower() for c in headers]
        if name1 in headers:
            print(response.xpath('//div[@class="navHeader"]/ul'))
            filename = response.url.split("/")[-2] + '.html'
            with open(filename, 'wb') as f:
                f.write(response.body)
Solution
There's one vital error in your code:
c=response.xpath('//div[@class="navHeader"]/ul')
if name1 in c:
    ...
Here c ends up being a SelectorList object, and you are checking whether the string name1 is in that SelectorList, which will of course always be False.
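The reason can be sketched without Scrapy: a membership test on a list compares the string against each element with ==, and a string is never equal to a selector object. The StandInSelector class below is a hypothetical stand-in written only for this illustration, not Scrapy's real Selector, and the markup is made up:

```python
class StandInSelector:
    """Hypothetical stand-in for a selector: it wraps markup but is not a str."""

    def __init__(self, markup):
        self.markup = markup

    def get(self):
        # Return the wrapped markup as a plain string.
        return self.markup


# A list of selector-like objects, similar in spirit to a SelectorList.
selectors = [StandInSelector("<ul>NASA</ul>")]

# Plain strings, similar in spirit to what extracting the values gives you.
extracted = [s.get() for s in selectors]

print("<ul>NASA</ul>" in selectors)  # False: a str never equals the wrapper object
print("<ul>NASA</ul>" in extracted)  # True: str compared with str
```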
To remedy this you need to extract your values:
c=response.xpath('//div[@class="navHeader"]/ul').extract()
                                                ^^^^^^^^^^
Additionally, you will probably want to normalize the values to make matching more forgiving:
headers = response.xpath('//div[@class="navHeader"]/ul').extract()
headers = [c.strip().lower() for c in headers]
if name1 in headers:
    ...
The above ignores leading and trailing whitespace and lowercases everything for case-insensitive matching.
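One caveat: name1 in headers is a list membership test, so it only succeeds when the search term equals an entire header. A minimal sketch of substring matching instead, assuming the headers are already plain strings (as .extract() returns); the function name find_matching_headers and the sample headers are made up for illustration:

```python
def find_matching_headers(headers, query):
    """Return the normalized headers that contain the query as a substring."""
    needle = query.strip().lower()
    if not needle:
        # An empty query would match every header, so return nothing instead.
        return []
    normalized = [h.strip().lower() for h in headers]
    return [h for h in normalized if needle in h]


# Illustrative sample data, not real scraped output.
headers = ["  Did This Gorilla Learn How to Knit? ", "Is NASA Hiding a Planet?"]

# Membership test: False, because no header is exactly "nasa".
print("nasa" in [h.strip().lower() for h in headers])

# Substring test: finds the header that mentions the term.
print(find_matching_headers(headers, " NASA "))  # ['is nasa hiding a planet?']
```

Normalizing the user's input the same way as the headers keeps the comparison case-insensitive on both sides.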
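Applied to your use case (note the /text() in the XPath, so only the header text is extracted rather than the whole h5 element, and the substring check against each header; sel is a selector for the page, so in your spider use response):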
headers = sel.xpath('//div[@class="media-body"]/h5/text()').extract()
headers = [c.strip().lower() for c in headers]
for header in headers:
    if 'gorilla' in header:
        print(f'yay matching header: "{header}"')
outputs:
yay matching header: "did this gorilla learn how to knit?"
Answered By - Granitosaurus