Issue
My Scrapy crawler uses random proxies and works on my computer, but when I run it on a VPS it returns a 403 error on every request.
2018-05-26 09:43:18 [scrapy.proxies] DEBUG: Proxy user pass not found
2018-05-26 09:43:18 [scrapy.proxies] DEBUG: Using proxy <http://104.237.210.29:2716>, 20 proxies left
2018-05-26 09:43:19 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.yelp.com/> (failed 1 times): 403 Forbidden
2018-05-26 09:43:19 [scrapy.proxies] DEBUG: Proxy user pass not found
2018-05-26 09:43:19 [scrapy.proxies] DEBUG: Using proxy <http://104.237.210.173:5195>, 20 proxies left
2018-05-26 09:43:19 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.yelp.com/> (failed 1 times): 403 Forbidden
2018-05-26 09:43:19 [scrapy.proxies] DEBUG: Proxy user pass not found
2018-05-26 09:43:19 [scrapy.proxies] DEBUG: Using proxy <http://104.237.210.93:3410>, 20 proxies left
2018-05-26 09:43:19 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.yelp.com/> (failed 1 times): 403 Forbidden
I manually checked the proxies in Firefox on the VPS and I can access the websites without any error.
These are my settings, identical to the ones on my computer:
DOWNLOADER_MIDDLEWARES = {
# 'monitor.middlewares.MonitorDownloaderMiddleware': 543,
# Proxies
'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
'scrapy_proxies.RandomProxy': 100,
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
# Proxies end
# Useragent
'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
'random_useragent.RandomUserAgentMiddleware': 400,
# Useragent end
}
# Random useragent list
USER_AGENT_LIST = r"C:\Users\Administrator\Desktop\useragents.txt"
# Retry many times since proxies often fail
RETRY_TIMES = 5
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]
# Proxy list containing entries like
# http://host1:port
# http://username:password@host2:port
# http://host3:port
PROXY_LIST = r"C:\Users\Administrator\Desktop\proxies.txt"
# Proxy mode
# 0 = Every request gets a different proxy
# 1 = Take only one proxy from the list and assign it to every request
# 2 = Put a custom proxy to use in the settings
PROXY_MODE = 0
Solution
Sometimes you get a 403 because robots.txt disallows robots on the whole website, or on the part of the website you are scraping.
So first of all, set ROBOTSTXT_OBEY = False in settings.py. I don't see it in your settings here.
Ignoring robots.txt alone is generally not enough, though. You must also disguise your user agent as a conventional browser, again in settings.py. For instance: USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7'
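Put together, a minimal settings.py sketch of these two changes could look like this (the user agent string is just an example):

# settings.py
# Do not fetch or obey robots.txt, so disallowed pages are still requested
ROBOTSTXT_OBEY = False
# Present the spider as a conventional browser (example string)
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7'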
Better still, create a list of user agents in the settings, for instance like this:
USER_AGENT_LIST = [
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7',
...,
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.55.3 (KHTML, like Gecko) Version/5.1.3 Safari/534.53.10'
]
You seem to have done this already. Then it should be made random, which you also seem to have done.
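For reference, here is a minimal hand-rolled sketch of such a random user agent middleware; the class name RandomUserAgentMiddleware is hypothetical, standing in for whatever the random_useragent package does internally, and it assumes USER_AGENT_LIST is defined as a Python list in settings.py as shown above:

import random

class RandomUserAgentMiddleware:
    # Downloader middleware that picks a random entry from
    # USER_AGENT_LIST for every outgoing request.

    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # Read the list defined in settings.py
        return cls(crawler.settings.getlist('USER_AGENT_LIST'))

    def process_request(self, request, spider):
        # Overwrite the User-Agent header before the request is sent
        request.headers['User-Agent'] = random.choice(self.user_agents)

You would enable it in DOWNLOADER_MIDDLEWARES with a priority around 400, just as you already do for random_useragent.RandomUserAgentMiddleware.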
Finally, and this is somewhat optional (I leave it to you to see whether it is useful), set DOWNLOAD_DELAY = 3 in settings.py, with a value of at least 1. Ideally, make it random too. This makes your spider act more like a browser: as far as I know, a download delay that is too quick can make the website realize it is dealing with a robot hiding behind a fake user agent. An experienced webmaster will have set up rules with many such barriers to protect his website from robots.
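Scrapy can randomize the delay for you: when RANDOMIZE_DOWNLOAD_DELAY is enabled (it is on by default), each wait is drawn between 0.5 and 1.5 times DOWNLOAD_DELAY. A sketch of these settings:

# settings.py
# Base delay between requests to the same site, in seconds
DOWNLOAD_DELAY = 3
# Randomize each wait to 0.5 * DOWNLOAD_DELAY .. 1.5 * DOWNLOAD_DELAY
RANDOMIZE_DOWNLOAD_DELAY = True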
I tested this in my scrapy shell this morning, on the same problem as yours. I hope it will be useful for you.
Answered By - AvyWam