Issue
I tried to scrape the CNN homepage with Scrapy. I used the following XPath selectors, but all of them returned empty lists.
Current results: all of these return []
"//strong"
"//h2"
"//span[@class='cd__headline-text']"
Expected results:
[Headline_1, Headline_2, Headline_3, ...]
Can someone help me figure out why? Is CNN doing something to stop people from scraping headlines?
I use Scrapy.
Solution
Before writing an XPath or CSS selector for any web page, first check the page source to see whether the elements you are targeting actually exist there. In this case, none of the selectors above match anything in the page source. CNN loads the page content through several separate requests, so check the Network tab in your browser's developer tools and find the requests that return the data you need. You then need to make those requests in your spider in order to scrape news from CNN.
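The two checks described above can be sketched with the Python standard library alone. This is only an illustration, not CNN's real markup or API: RAW_HTML, API_BODY, the "articles" and "headline" keys, and the helper names count_in_source and headlines_from_json are all hypothetical. The first helper confirms that a tag (optionally with a given class) is absent from the raw page source; the second pulls headlines out of a JSON body of the kind you might find in the Network tab.

```python
import json
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start tags with a given name and, optionally, a given class."""

    def __init__(self, tag, class_name=None):
        super().__init__()
        self.tag = tag
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        # A class attribute may hold several space-separated names.
        # Note: this is a token match, slightly looser than @class='...'.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name is None or self.class_name in classes:
            self.count += 1


def count_in_source(html, tag, class_name=None):
    """Return how many matching elements appear in the raw HTML string."""
    counter = TagCounter(tag, class_name)
    counter.feed(html)
    return counter.count


# Hypothetical raw page source: an empty shell that JavaScript fills in later.
RAW_HTML = """
<html><body>
  <div id="homepage"></div>
  <script src="/bundle.js"></script>
</body></html>
"""

# Hypothetical JSON body from one of the page's background requests.
# The real endpoint and its structure must be read from the Network tab.
API_BODY = '{"articles": [{"headline": "Headline_1"}, {"headline": "Headline_2"}]}'


def headlines_from_json(body):
    """Extract headlines from the assumed JSON structure above."""
    data = json.loads(body)
    return [item["headline"] for item in data.get("articles", [])]


if __name__ == "__main__":
    # None of the question's selectors would match this raw source.
    print(count_in_source(RAW_HTML, "strong"))
    print(count_in_source(RAW_HTML, "h2"))
    print(count_in_source(RAW_HTML, "span", "cd__headline-text"))
    # The data lives in the separate JSON response instead.
    print(headlines_from_json(API_BODY))
```

In a Scrapy spider, the equivalent would be to yield a scrapy.Request for whichever endpoint you find in the Network tab and parse its body with json.loads(response.text) in the callback. Running scrapy shell against the page URL and calling view(response) is a quick way to see what Scrapy actually receives before any JavaScript runs.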
Answered By - Ahmed Buksh