Issue
I am scraping a web page of tech components with Scrapy and Python, collecting results to compare later. After two months of scraping the site, I started getting a 403 status error. I have tried to change:
- The bot name
- The User-Agent, trying several different agents
- Launching the scraper from a friend's computer
- Launching the scraper from different IPs
- Steps 3 and 4 together
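For reference, the first two changes above live in a Scrapy project's settings.py. A minimal sketch, with placeholder values rather than the ones actually used:

```python
# settings.py (sketch; names and values are placeholders)
BOT_NAME = "components_bot"  # step 1: change the bot name

# Step 2: replace the default User-Agent with a browser-like one
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) "
    "Gecko/20100101 Firefox/89.0"
)

DOWNLOAD_DELAY = 2  # be polite between requests
```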
These five steps make me think they have information about my scraper itself, not just about my machine, and that they have blocked the bot. This is not the first time it has happened: they blocked my bot a month ago and unblocked the same bot a week later.
I am looking for fresh ideas, because everybody on forums and scraping sites just recommends changing user agents.
I have tried to make a simple request with this code:
import requests

url = 'https://www.webwithcloudflareprotection.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0'
}

r = requests.get(url, headers=headers)
print(r.status_code)
This code always gets a 403, from every IP I launch it from, which is very strange. Someone told me about Cloudflare, but I don't know how to check whether that software is behind all this.
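One way to check whether Cloudflare fronts a site is to inspect the response headers: Cloudflare normally adds a CF-RAY header and sets Server: cloudflare. A small helper sketch (the function name is mine, not part of any library):

```python
def looks_like_cloudflare(headers):
    """Heuristic: True if response headers suggest Cloudflare is in front.

    `headers` is any mapping of header names to values, e.g. `r.headers`
    from a requests response.
    """
    lowered = {k.lower(): str(v).lower() for k, v in headers.items()}
    return 'cf-ray' in lowered or 'cloudflare' in lowered.get('server', '')

# Usage with the request above:
# r = requests.get(url, headers=headers)
# print(looks_like_cloudflare(r.headers))
```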
Solution
Finally, the problem was a third-party service sitting between my machine and their servers. I found a way around it by integrating Scrapy with Selenium and ChromeDriver.
It may not be the best solution, but it works. Performance is slower, but the results are the same!
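A minimal sketch of the Selenium fallback the answer describes, assuming Selenium 4 with a compatible ChromeDriver available; the function name and options are my own, and the imports are deferred so the helper can be defined even where no browser is installed:

```python
def fetch_with_browser(url):
    """Fetch a page through a real headless Chrome session.

    Cloudflare-style checks that block plain HTTP clients usually pass
    a real browser. Requires `pip install selenium`; Selenium 4 can
    locate or download a matching ChromeDriver on its own.
    """
    from selenium import webdriver  # deferred so the sketch imports cleanly
    from selenium.webdriver.chrome.options import Options

    opts = Options()
    opts.add_argument("--headless=new")  # run without a visible window
    driver = webdriver.Chrome(options=opts)
    try:
        driver.get(url)
        return driver.page_source  # hand this HTML to a Scrapy Selector
    finally:
        driver.quit()
```

Inside a Scrapy project this is usually wired in as a downloader middleware (or via the community scrapy-selenium package) rather than called directly from the spider; the browser round-trip is where the slowdown mentioned above comes from.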
Answered By - Leon Lopez