Issue
I need to make many requests to one URL, but after ~20 requests I get a 429 Too Many Requests response. So my plan was to route the requests through proxies. I have tried 3 things:
- Tor-proxy using python
- Free proxy lists
- ScraperApi
But all of them (even the ScraperAPI trial) are unbelievably slow, around 5-10 seconds per request. An example looks like this:

import requests

url = "https://httpbin.org/ip"
# Free proxy taken from a public proxy list.
proxies = {"https": "http://164.155.149.1:80"}

r = requests.get(url, proxies=proxies)
print(r.text)

The proxy IP was from some free proxy website. Sure, a proxy is an extra node in between, but I was hoping to find proxies that take at most 1 second per request.
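For reference, a per-request timeout at least makes slow proxies fail fast instead of blocking for 5-10 seconds; a minimal sketch (the proxy IP is the same placeholder as above):

import requests

url = "https://httpbin.org/ip"
# Placeholder proxy - substitute one from your own list.
proxies = {"https": "http://164.155.149.1:80"}

try:
    # timeout=1 raises if the proxy takes longer than ~1 s to connect or respond.
    r = requests.get(url, proxies=proxies, timeout=1)
    print(r.text)
except requests.exceptions.RequestException as exc:
    print(f"proxy too slow or unreachable: {exc}")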
Is there any way to solve this issue?
Thanks in advance
Solution
Codedor, one way I can think of is:
- Create a pool of EC2 instances on AWS (or any other cloud provider of your choice). These can be the cheapest ones, even spot instances on AWS.
- Round-robin your requests across these VMs. Since each VM has its own public IP, you are less likely to hit "429 Too Many Requests" as soon. The more instances you have, the less likely.
E.g.:
- Say you have 10 VMs.
- On each VM you make 1 request/5 s = 12 requests/min.
- Altogether you make 12 × 10 = 120 requests/min.
- Add reasonable delays, as in the pacing sketch below.
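A minimal sketch of that pacing, assuming a hypothetical list of VM host names (the actual dispatch to each VM is left as a stub):

import time
from itertools import cycle

# Hypothetical host names - replace with the public IPs of your 10 VMs.
vm_hosts = cycle([f"vm{i}.example.com" for i in range(10)])

for i in range(120):
    host = next(vm_hosts)
    # Here you would trigger one request from `host` (see the paramiko
    # sketch below); with 10 VMs, each one makes 1 request/5 s.
    print(f"request {i} -> {host}")
    time.sleep(0.5)  # 120 requests/min overall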
Distributing the jobs across the VMs would be a little trickier, but doable. You can have a master node running a Python script that iterates through the VMs and spawns the request command on them. You could use various ways to execute a command on a remote machine from Python, such as paramiko, or subprocess/os with ssh; a sketch follows.
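A minimal sketch of that master script using paramiko (the host list, SSH user, and key path are assumptions; adjust them to your instances):

import paramiko

# Hypothetical values - replace with your instances' IPs, user, and key.
VM_HOSTS = ["1.2.3.4", "5.6.7.8"]
SSH_USER = "ubuntu"
KEY_FILE = "/home/me/.ssh/my-key.pem"
TARGET_URL = "https://httpbin.org/ip"

def run_on_vm(host, command):
    """SSH into one VM, run a single shell command, return its stdout."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username=SSH_USER, key_filename=KEY_FILE)
    try:
        _stdin, stdout, _stderr = client.exec_command(command)
        return stdout.read().decode()
    finally:
        client.close()

for host in VM_HOSTS:
    # Each VM fetches the URL from its own public IP.
    print(run_on_vm(host, f"curl -s {TARGET_URL}"))

Each call here opens a fresh SSH connection per command, which is simple but slow; for sustained scraping you would keep the connections open and reuse them.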
Answered By - Loner