Issue
I've created a script using the asyncio library to scrape the names of post owners from a webpage. The idea is to supply a link to the script, which parses the links of all the posts on each page and traverses the next pages to do the same. The script then uses all of those links within fetch_again() to reach the inner page of each post and get its owner.
Although I could have scraped the owners' names from the landing page itself, I took this approach only to learn how to achieve the same thing with the design I'm trying out. I've used a semaphore within the script to limit the number of requests.
When I run the script, it works for 100 or so posts and then gets stuck. It doesn't throw any error.
Here's what I've tried:
import aiohttp
import asyncio
from lxml.html import fromstring
from urllib.parse import urljoin

link = "https://stackoverflow.com/questions/tagged/web-scraping"

semaphore = asyncio.Semaphore(10)

async def fetch(url):
    async with semaphore:
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                text = await response.text()
                result = await processing_docs(session, text)
            return result

async def processing_docs(session, html):
    coros = []
    tree = fromstring(html)
    titles = [urljoin(link, title.attrib['href']) for title in tree.cssselect(".summary .question-hyperlink")]
    for title in titles:
        coros.append(fetch_again(session, title))

    next_page = tree.cssselect("div.pager a[rel='next']")
    if next_page:
        page_link = urljoin(link, next_page[0].attrib['href'])
        coros.append(fetch(page_link))

    await asyncio.gather(*coros)

async def fetch_again(session, url):
    async with semaphore:
        async with session.get(url) as response:
            text = await response.text()
            tree = fromstring(text)
            title = tree.cssselect("h1[itemprop='name'] a")[0].text
            print(title)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    future = asyncio.ensure_future(asyncio.gather(*(fetch(url) for url in [link])))
    loop.run_until_complete(future)
    loop.close()
How can I keep the script parsing, given that it currently gets stuck somewhere in its execution?
Solution
The script likely blocks due to a deadlock: fetch acquires the semaphore and calls processing_docs, which recursively calls more instances of fetch and fetch_again with the semaphore still held. If the recursion depth of fetch reaches 10, the innermost fetch will never acquire the semaphore because it has already been acquired by its callers. I suggest that you replace the recursion with an asyncio.Queue, and drain (and populate) the queue with a fixed number of worker tasks. That way you don't even need a semaphore, and you are guaranteed not to deadlock.
An even simpler fix, which doesn't require refactoring, is to move the recursive call to processing_docs() outside the async with semaphore block, i.e. to invoke processing_docs() with the semaphore released. After all, the purpose of the semaphore is to limit concurrent access to the remote server, not local processing, which isn't concurrent in the first place since asyncio is single-threaded. That should eliminate the deadlock while still limiting the number of concurrent connections:
async def fetch(url):
    async with aiohttp.ClientSession() as session:
        async with semaphore:
            async with session.get(url) as response:
                text = await response.text()
        result = await processing_docs(session, text)
        return result
Also note that you should probably create a single session in a top-level coroutine and propagate it throughout the code. You are already doing that between fetch, processing_docs and fetch_again, but you could also do it for the top-level calls to fetch.
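For illustration, a minimal sketch of what that might look like, assuming fetch() is changed to accept the session as a parameter (which also means the recursive call in processing_docs() would have to pass the session along):

    # Sketch only: fetch() now takes the shared session instead of creating one.
    # The call in processing_docs() would need to become fetch(session, page_link).
    async def fetch(session, url):
        async with semaphore:
            async with session.get(url) as response:
                text = await response.text()
        # processing_docs() runs with the semaphore already released,
        # as suggested above
        result = await processing_docs(session, text)
        return result

    async def main():
        # one session for the whole crawl, created in the top-level coroutine
        async with aiohttp.ClientSession() as session:
            await asyncio.gather(*(fetch(session, url) for url in [link]))

    if __name__ == '__main__':
        asyncio.run(main())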
Answered By - user4815162342