Issue
When I run the code below, it stops working at URL number 9983 if I slice the list to 10,000 URLs, but no error is displayed in the terminal; the code just stops running (it seems to freeze).
The same thing happens if I slice the list at 5000: it stops just before reaching the 5000th URL.
Oddly, the code works fine if I slice the list down to 1000 URLs.
I don't really know where the problem comes from. I imagine it's something related to a parameter of aiohttp or asyncio that I have to set somewhere to increase the number of allowed requests.
Here is my current code:
import asyncio
import time

import aiohttp
import pandas as pd

found = 0
not_found = 0
counter = 0


async def download_site(session, url):
    global found, not_found, counter
    async with session.get(url) as response:
        if str(response.url) == 'https://fake.notfound.url.com':
            print('\n\n', response.url, '\n\n')
            not_found += 1
        else:
            found += 1
    counter += 1
    print(counter)


async def download_all_sites(sites):
    session_timeout = aiohttp.ClientTimeout(total=None)
    async with aiohttp.ClientSession(timeout=session_timeout) as session:
        tasks = []
        for url in sites:
            task = asyncio.ensure_future(download_site(session, url))
            tasks.append(task)
        await asyncio.gather(*tasks)


if __name__ == "__main__":
    df = pd.read_csv('database_table.csv', sep=';', encoding='utf-8')
    sites = df['urls'].tolist()
    start_time = time.time()
    asyncio.get_event_loop().run_until_complete(download_all_sites(sites[7000:17000]))
    duration = time.time() - start_time
    print(f'404: {not_found / len(sites[7000:17000]) * 100} %')
    print(f'200: {found / len(sites[7000:17000]) * 100} %')
After a long time I press Ctrl+C and get this traceback:
^CTraceback (most recent call last):
File "/home/takamura/Documents/corp/scripts/misc_scripts/links_checker.py", line 74, in <module>
File "/usr/lib/python3.9/asyncio/base_events.py", line 629, in run_until_complete
self.run_forever()
File "/usr/lib/python3.9/asyncio/base_events.py", line 596, in run_forever
self._run_once()
File "/usr/lib/python3.9/asyncio/base_events.py", line 1854, in _run_once
event_list = self._selector.select(timeout)
File "/usr/lib/python3.9/selectors.py", line 469, in select
fd_event_list = self._selector.poll(timeout, max_ev)
KeyboardInterrupt
Exception ignored in: <coroutine object download_all_sites at 0x7fc348b856c0>
RuntimeError: coroutine ignored GeneratorExit
Task was destroyed but it is pending!
task: <Task pending name='Task-609' coro=<download_site() running at /home/takamura/Documents/corp/scripts/misc_scripts/links_checker.py:41> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7fc348101520>()]> cb=[gather.<locals>._done_callback() at /usr/lib/python3.9/asyncio/tasks.py:766]>
What am I doing wrong?
Solution
Adding a semaphore to cap the number of requests in flight, together with limiting the connector:

connector = aiohttp.TCPConnector(limit=80)
async with aiohttp.ClientSession(connector=connector) as session:
    ...

solved my issue.
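A minimal sketch of the semaphore pattern described above. The names (`fetch_one`, `crawl`, the `state` dict) and the URLs are hypothetical, and `asyncio.sleep(0)` stands in for the real `async with session.get(url)` call so the snippet runs without network access; the point is only that `asyncio.Semaphore` bounds how many coroutines are inside the guarded block at once:

```python
import asyncio

MAX_CONCURRENT = 80  # cap on simultaneous "requests", analogous to TCPConnector(limit=80)

async def fetch_one(sem, url, state):
    # Acquire the semaphore before starting the "request"; at most
    # MAX_CONCURRENT coroutines can be inside this block at a time.
    async with sem:
        state["in_flight"] += 1
        state["peak"] = max(state["peak"], state["in_flight"])
        await asyncio.sleep(0)  # stand-in for `async with session.get(url) as response`
        state["in_flight"] -= 1

async def crawl(urls):
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    state = {"in_flight": 0, "peak": 0}
    await asyncio.gather(*(fetch_one(sem, u, state) for u in urls))
    return state["peak"]

# Hypothetical URL list; with real aiohttp, session.get(url) would go where sleep(0) is.
peak = asyncio.run(crawl([f"https://example.com/page/{i}" for i in range(1000)]))
print(f"peak concurrency: {peak}")
```

With this pattern you can still create one task per URL and gather them all; the semaphore, not the task count, is what keeps the number of simultaneous connections (and open file descriptors) bounded.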
Answered By - Takamura