Issue
I am new to web scraping and to Scrapy in general. I am trying to scrape yellowpages and am running into problems. When I run fetch in the terminal, I get a 200 response. But when I then try, for example, response.css('article.address-indicators'), I get back an empty list. I tested the same approach with books.toscrape.com and it works fine there.
fetch("https://www.yellowpages.com/search?search_term=hairdressers%20&search_location=Los%20Angeles%2C%20CA&search_type=searchbox_top")
Solution
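When ROBOTSTXT_OBEY is enabled (the project template generated by scrapy startproject turns it on), Scrapy honors the rules in robots.txt. Note that the 200 in the log is for robots.txt itself; the search request is then rejected as forbidden. See the log below: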
>>> fetch('https://www.yellowpages.com/search?search_term=hairdressers%20&search_location=Los%20Angeles%2C%20CA&search_type=searchbox_top')
2023-12-31 11:09:26 [scrapy.core.engine] INFO: Spider opened
2023-12-31 11:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yellowpages.com/robots.txt> (referer: None)
2023-12-31 11:09:27 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://www.yellowpages.com/search?search_term=hairdressers%20&search_location=Los%20Angeles%2C%20CA&search_type=searchbox_top>
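To see why a rule in robots.txt can block the search URL while the rest of the site stays reachable, here is a minimal sketch using Python's standard-library parser. The rules in ASSUMED_ROBOTS_TXT are made up for illustration and are not the real yellowpages.com file, the helper is_allowed is hypothetical, and Scrapy uses its own robots.txt parser backend rather than this one, so treat it only as an approximation of the check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT the real yellowpages.com robots.txt.
ASSUMED_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


search_url = "https://www.yellowpages.com/search?search_term=hairdressers"
home_url = "https://www.yellowpages.com/"

# Under the assumed rules, the search path is disallowed and the home page is not.
print("search allowed:", is_allowed(ASSUMED_ROBOTS_TXT, search_url))
print("home allowed:", is_allowed(ASSUMED_ROBOTS_TXT, home_url))
```

Running a check like this against a site's actual robots.txt before crawling shows which paths a robots-obeying crawler will skip.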
You can override the default (at your own risk):
scrapy shell --set ROBOTSTXT_OBEY=False
and then you can use response.css('....') or similar expressions.
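If you want the same override outside the shell, one option is to put it in the project's settings.py or in a spider's custom_settings attribute. The sketch below assumes the second approach; the name SPIDER_OVERRIDES is hypothetical, and the DOWNLOAD_DELAY value is an optional addition (not part of the original answer) to slow requests down when ignoring robots.txt.

```python
# Hypothetical settings dict; assign it to `custom_settings` on a Scrapy spider class,
# or copy the two keys as top-level variables into the project's settings.py.
SPIDER_OVERRIDES = {
    # Same override as the --set flag on the command line (use at your own risk).
    "ROBOTSTXT_OBEY": False,
    # Assumption: wait about 2 seconds between requests to the same site.
    "DOWNLOAD_DELAY": 2.0,
}
```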
Answered By - noam cohen