Issue
My Scrapy crawler uses random proxies and works on my computer, but when I run it on a VPS it returns a 403 error on every request.
2018-05-26 09:43:18 [scrapy.proxies] DEBUG: Proxy user pass not found
2018-05-26 09:43:18 [scrapy.proxies] DEBUG: Using proxy <http://104.237.210.29:2716>, 20 proxies left
2018-05-26 09:43:19 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.yelp.com/> (failed 1 times): 403 Forbidden
2018-05-26 09:43:19 [scrapy.proxies] DEBUG: Proxy user pass not found
2018-05-26 09:43:19 [scrapy.proxies] DEBUG: Using proxy <http://104.237.210.173:5195>, 20 proxies left
2018-05-26 09:43:19 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.yelp.com/> (failed 1 times): 403 Forbidden
2018-05-26 09:43:19 [scrapy.proxies] DEBUG: Proxy user pass not found
2018-05-26 09:43:19 [scrapy.proxies] DEBUG: Using proxy <http://104.237.210.93:3410>, 20 proxies left
2018-05-26 09:43:19 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.yelp.com/> (failed 1 times): 403 Forbidden
I manually checked the proxies in Firefox on the VPS and I can access the websites without any error.
These are my settings, identical to the ones on my computer:
DOWNLOADER_MIDDLEWARES = {
# 'monitor.middlewares.MonitorDownloaderMiddleware': 543,
# Proxies
'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
'scrapy_proxies.RandomProxy': 100,
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
# Proxies end
# Useragent
'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
'random_useragent.RandomUserAgentMiddleware': 400,
# Useragent end
}
# Random useragent list
USER_AGENT_LIST = r"C:\Users\Administrator\Desktop\useragents.txt"
# Retry many times since proxies often fail
RETRY_TIMES = 5
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]
# Proxy list containing entries like
# http://host1:port
# http://username:password@host2:port
# http://host3:port
PROXY_LIST = r"C:\Users\Administrator\Desktop\proxies.txt"
# Proxy mode
# 0 = Every request gets a different proxy
# 1 = Take only one proxy from the list and assign it to every request
# 2 = Put a custom proxy to use in the settings
PROXY_MODE = 0
Solution
Sometimes you get a 403 because robots.txt disallows robots on the whole website, or on the part of the website you are scraping.
So first of all, set ROBOTSTXT_OBEY = False in settings.py. I don't see it in your settings here.
Ignoring robots.txt alone is generally not enough, though. You must also disguise your user agent as a conventional browser, again in settings.py. For instance: USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7'
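Put together, a minimal settings.py sketch of these two changes could look like this (the user agent string is just an example):

# settings.py
# Do not fetch or obey robots.txt, so disallowed pages are still requested
ROBOTSTXT_OBEY = False
# Present the spider as a conventional browser (example string)
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7'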
Better still, create a list of user agents in the settings, for instance like this:
USER_AGENT_LIST = [
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7',
...,
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.55.3 (KHTML, like Gecko) Version/5.1.3 Safari/534.53.10'
]
You seem to have done this already. Then it should be made random, which you also seem to have done.
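For reference, here is a minimal hand-rolled sketch of such a random user agent middleware; the class name RandomUserAgentMiddleware is hypothetical, standing in for whatever the random_useragent package does internally, and it assumes USER_AGENT_LIST is defined as a Python list in settings.py as shown above:

import random

class RandomUserAgentMiddleware:
    # Downloader middleware that picks a random entry from
    # USER_AGENT_LIST for every outgoing request.

    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # Read the list defined in settings.py
        return cls(crawler.settings.getlist('USER_AGENT_LIST'))

    def process_request(self, request, spider):
        # Overwrite the User-Agent header before the request is sent
        request.headers['User-Agent'] = random.choice(self.user_agents)

You would enable it in DOWNLOADER_MIDDLEWARES with a priority around 400, just as you already do for random_useragent.RandomUserAgentMiddleware.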
Finally, and this is somewhat optional (I leave it to you to see whether it is useful), set DOWNLOAD_DELAY = 3 in settings.py, with a value of at least 1. Ideally, make it random too. This makes your spider act more like a browser: as far as I know, a download delay that is too quick can make the website realize it is dealing with a robot hiding behind a fake user agent. An experienced webmaster will have set up rules with many such barriers to protect his website from robots.
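Scrapy can randomize the delay for you: when RANDOMIZE_DOWNLOAD_DELAY is enabled (it is on by default), each wait is drawn between 0.5 and 1.5 times DOWNLOAD_DELAY. A sketch of these settings:

# settings.py
# Base delay between requests to the same site, in seconds
DOWNLOAD_DELAY = 3
# Randomize each wait to 0.5 * DOWNLOAD_DELAY .. 1.5 * DOWNLOAD_DELAY
RANDOMIZE_DOWNLOAD_DELAY = True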
I tested this in my scrapy shell this morning, on the same problem as yours. I hope it will be useful for you.
Answered By - AvyWam