Issue
I've created a script using the asyncio library to scrape the names of post owners from a webpage. The idea is to supply a link to the script, which parses the links of all the posts on each page and traverses the next pages to do the same. The script then uses all of those links within fetch_again() to reach the inner page of each post and get its owner.
Although I could have scraped the owners' names from the landing page itself, I took this approach only to learn how to achieve the same thing with the design I'm trying out. I've used a semaphore within the script to limit the number of requests.
When I run the script, it works for 100 or so posts and then gets stuck. It doesn't throw any error.
Here's what I've tried:
import aiohttp
import asyncio
from lxml.html import fromstring
from urllib.parse import urljoin

link = "https://stackoverflow.com/questions/tagged/web-scraping"

semaphore = asyncio.Semaphore(10)

async def fetch(url):
    async with semaphore:
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                text = await response.text()
                result = await processing_docs(session, text)
            return result

async def processing_docs(session, html):
    coros = []
    tree = fromstring(html)
    titles = [urljoin(link, title.attrib['href']) for title in tree.cssselect(".summary .question-hyperlink")]
    for title in titles:
        coros.append(fetch_again(session, title))

    next_page = tree.cssselect("div.pager a[rel='next']")
    if next_page:
        page_link = urljoin(link, next_page[0].attrib['href'])
        coros.append(fetch(page_link))

    await asyncio.gather(*coros)

async def fetch_again(session, url):
    async with semaphore:
        async with session.get(url) as response:
            text = await response.text()
            tree = fromstring(text)
            title = tree.cssselect("h1[itemprop='name'] a")[0].text
            print(title)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    future = asyncio.ensure_future(asyncio.gather(*(fetch(url) for url in [link])))
    loop.run_until_complete(future)
    loop.close()
How can I keep the script parsing, given that it currently gets stuck somewhere in its execution?
Solution
The script likely blocks due to a deadlock: fetch acquires the semaphore and calls processing_docs, which recursively calls more instances of fetch and fetch_again with the semaphore still held. If the recursion depth of fetch reaches 10, the innermost fetch will never acquire the semaphore because it has already been acquired by its callers. I suggest that you replace the recursion with an asyncio.Queue, and drain (and populate) the queue with a fixed number of worker tasks. That way you don't even need a semaphore, and you are guaranteed not to deadlock.
An even simpler fix, which doesn't require refactoring, is to move the recursive call to processing_docs() outside the async with semaphore block, i.e. to invoke processing_docs() with the semaphore released. After all, the purpose of the semaphore is to limit concurrent access to the remote server, not local processing, which isn't concurrent in the first place since asyncio is single-threaded. That should eliminate the deadlock while still limiting the number of concurrent connections:
async def fetch(url):
    async with aiohttp.ClientSession() as session:
        async with semaphore:
            async with session.get(url) as response:
                text = await response.text()
        result = await processing_docs(session, text)
        return result
Also note that you should probably create a single session in a top-level coroutine and propagate it throughout the code. You are already doing that between fetch, processing_docs and fetch_again, but you could also do it for the top-level calls to fetch.
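For illustration, a minimal sketch of what that might look like, assuming fetch() is changed to accept the session as a parameter (which also means the recursive call in processing_docs() would have to pass the session along):

    # Sketch only: fetch() now takes the shared session instead of creating one.
    # The call in processing_docs() would need to become fetch(session, page_link).
    async def fetch(session, url):
        async with semaphore:
            async with session.get(url) as response:
                text = await response.text()
        # processing_docs() runs with the semaphore already released,
        # as suggested above
        result = await processing_docs(session, text)
        return result

    async def main():
        # one session for the whole crawl, created in the top-level coroutine
        async with aiohttp.ClientSession() as session:
            await asyncio.gather(*(fetch(session, url) for url in [link]))

    if __name__ == '__main__':
        asyncio.run(main())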
Answered By - user4815162342