Issue
I was recommended httpx as a way to perform API requests in parallel, with a nice API like requests.
My code:
import asyncio
import time

import httpx


async def main():
    t0 = time.time()
    usernames = [
        "author",
        "abtinf",
        "TheCoelacanth",
        "tomcam",
        "chauhankiran",
        "ulizzle",
        "ulizzle",
        "ulizzle",
        "cratermoon",
        "Aeolun",
        "ulizzle",
        "firexcy",
        "kazinator",
        "blacksoil",
        "lucakiebel",
        "ozim",
        "tomcam",
        "jstummbillig",
        "tomcam",
        "johnchristopher",
        "Tade0",
        "lallysingh",
        "paulddraper",
        "WilTimSon",
        "gumby",
        "kristopolous",
        "zemo",
        "aschearer",
        "why-el",
        "Osiris",
        "mdaniel",
        "ianbutler",
        "vinaypai",
        "samtho",
        "chazeon",
        "taeric",
        "yellowapple",
        "Kye",
    ]
    bios = []
    headers = {"User-Agent": "curl/7.72.0"}
    async with httpx.AsyncClient(headers=headers) as client:
        for username in usernames:
            url = f"https://hn.algolia.com/api/v1/users/{username}"
            response = await client.get(url)
            data = response.json()
            bios.append(data['about'])
            print('.')
    t1 = time.time()
    total = t1 - t0
    print(bios)
    print(f"Total time: {total} seconds")  # 11 seconds async

asyncio.run(main())
How do I make sure that this example runs the requests in parallel?
Solution
First of all, Python's asyncio does not provide true parallelism (as has been discussed repeatedly on this platform). The event loop runs in a single thread. The concurrency just allows context switches between multiple coroutines while they are awaiting some I/O operation to finish, such as an HTTP request. But the requesting function must be implemented in a particular, non-blocking way for this to work. The httpx package apparently provides such functions.
As has been pointed out in the comments, you are not getting any concurrency in your code because you are awaiting each request made by the client sequentially in a for-loop. In other words, there is no chance for a new request to be launched until the previous one returns completely.
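To see this effect in isolation, here is a toy comparison (a hypothetical demo, not your API code) that uses asyncio.sleep as a stand-in for the network wait: the sequential loop takes roughly the sum of the delays, while asyncio.gather (introduced below) takes roughly the longest single delay.

import asyncio
import time


async def fake_request(delay: float) -> float:
    # Stand-in for an awaitable HTTP call; yields control while "waiting".
    await asyncio.sleep(delay)
    return delay


async def demo() -> None:
    delays = [1.0, 1.0, 1.0]

    t0 = time.time()
    for d in delays:  # sequential: each await blocks the next launch
        await fake_request(d)
    print(f"loop:   {time.time() - t0:.1f} s")  # ~3.0 s

    t0 = time.time()
    await asyncio.gather(*(fake_request(d) for d in delays))
    print(f"gather: {time.time() - t0:.1f} s")  # ~1.0 s

asyncio.run(demo())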
A common pattern to concurrently execute the same coroutine with different arguments is to use asyncio.gather. I would suggest factoring out the entire GET request, as well as the retrieval of the about section of the returned data, into its own coroutine function and executing whatever number of those you deem appropriate concurrently:
import asyncio
import time

import httpx

BASE_URL = "https://hn.algolia.com/api/v1/users"


async def get_bio(username: str, client: httpx.AsyncClient) -> str:
    response = await client.get(f"{BASE_URL}/{username}")
    print(".")
    data = response.json()
    return data["about"]


async def main() -> None:
    t0 = time.time()
    usernames = [
        "author",
        "abtinf",
        "TheCoelacanth",
        # ...
    ]
    headers = {"User-Agent": "curl/7.72.0"}
    async with httpx.AsyncClient(headers=headers) as client:
        bios = await asyncio.gather(*(get_bio(user, client) for user in usernames))
    print(dict(zip(usernames, bios)))
    print(f"Total time: {time.time() - t0:.3} seconds")

asyncio.run(main())
Sample output:
.
.
.
{'author': '', 'abtinf': 'You can reach me at [email protected] or @abtinf.', 'TheCoelacanth': '[email protected]'}
Total time: 0.364 seconds
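Note that asyncio.gather by default propagates the first exception raised by any of the coroutines. If you would rather collect failures than abort the whole batch, you can pass return_exceptions=True. A minimal variation of the gather call above (same names as in the example):

    results = await asyncio.gather(
        *(get_bio(user, client) for user in usernames),
        return_exceptions=True,
    )
    # Exceptions now appear in the results list instead of being raised.
    bios = {
        user: bio
        for user, bio in zip(usernames, results)
        if not isinstance(bio, BaseException)
    }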
Since this approach allows a great number of HTTP requests to be made in a very short amount of time (because you are not awaiting previous responses before launching more requests), there is always the danger of being subjected to rate limiting or being blocked outright by the API. I don't know anything about this API in particular though. So I don't know if your list of user names is already "too long".
If you are interested in a flexible control mechanism to manage a pool of asynchronous tasks, I wrote the asyncio-taskpool package to make this easier for my own applications. TaskPool.map allows you to set a specific maximum number of tasks to work concurrently on an arbitrary iterable of arguments. This could help with the rate limiting issue.
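If you prefer to stay with the standard library, an asyncio.Semaphore can impose a similar cap on the number of in-flight requests. A minimal sketch, reusing the get_bio coroutine from above; the limit of 10 is an arbitrary assumption, not something prescribed by the API:

MAX_CONCURRENT = 10  # arbitrary cap; tune to the API's actual tolerance


async def get_bio_limited(
    username: str, client: httpx.AsyncClient, semaphore: asyncio.Semaphore
) -> str:
    async with semaphore:  # at most MAX_CONCURRENT coroutines enter at once
        return await get_bio(username, client)

Inside main(), the plain gather call would then become:

    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    bios = await asyncio.gather(
        *(get_bio_limited(user, client, semaphore) for user in usernames)
    )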
Answered By - Daniil Fajnberg