Issue
I am using Requests-HTML to render the JavaScript on a page, and concurrent.futures to speed the process up. My code worked perfectly until I added the following line:
response.html.render(timeout=60, sleep=1, wait=3, retries=10)
upon which I got the error:
    response.html.render(timeout=60, sleep=1, wait=3, retries=10)
  File "C:\Users\Ze\Anaconda3\lib\site-packages\requests_html.py", line 586, in render
    self.browser = self.session.browser  # Automatically create a event loop and browser
  File "C:\Users\Ze\Anaconda3\lib\site-packages\requests_html.py", line 727, in browser
    self.loop = asyncio.get_event_loop()
  File "C:\Users\Ze\Anaconda3\lib\asyncio\events.py", line 639, in get_event_loop
    raise RuntimeError('There is no current event loop in thread %r.'
RuntimeError: There is no current event loop in thread 'ThreadPoolExecutor-0_0'.
If I move the problematic line into the section below, it works again, but then the rendering doesn't happen in parallel, right?

for result in concurrent.futures.as_completed(futures):
    result = result.result()
What is causing the problem? I've never used asyncio. Do I have to use it for this? Is it easy to implement?
Thank you very much!
CODE:
from bs4 import BeautifulSoup
import concurrent.futures
from requests_html import HTMLSession

session = HTMLSession()

# get_headers() and urls are defined elsewhere in my script

def load_page_and_extract_items(url):
    response = session.get(url, headers=get_headers())
    # render javascript
    response.html.render(timeout=60, wait=3)
    source = BeautifulSoup(response.html.raw_html, 'lxml')
    return source

def get_pages(remaining_urls):
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # for each of 60 possible pages
        for current_page_number in range(60):
            futures = [executor.submit(load_page_and_extract_items, url) for url in remaining_urls]
            for result in concurrent.futures.as_completed(futures):
                result = result.result()

def main():
    get_pages(urls)
Solution
This doesn't directly answer the question, but it demonstrates a technique for multithreaded web scraping that performed well in my tests. It uses the URL from the original question, searches for certain tags that may contain hrefs, and then processes those URLs. The general idea is to create a pool of sessions: each thread takes a session object from the pool (a queue), uses it, and then puts it back on the queue, making it available to other threads.
from requests_html import HTMLSession
import concurrent.futures
import queue

QUEUE = queue.Queue()

def makeSessions(n=4):
    # pre-create a pool of n sessions
    for _ in range(n):
        QUEUE.put(HTMLSession())

def cleanup():
    # drain the pool and close every session
    while True:
        try:
            getSession(False).close()
        except queue.Empty:
            break

def getSession(block=True):
    # take a session from the pool; by default, block until one is free
    return QUEUE.get(block=block)

def freeSession(session):
    # return a session to the pool so other threads can use it
    if isinstance(session, HTMLSession):
        QUEUE.put(session)

def getURLs():
    # collect the category hrefs from the landing page
    urls = []
    session = getSession()
    try:
        response = session.get('https://www.aliexpress.com')
        response.raise_for_status()
        response.html.render()
        for a in response.html.xpath('//dt[@class="cate-name"]/span/a'):
            if 'href' in a.attrs:
                urls.append(a.attrs['href'])
    finally:
        freeSession(session)
    return urls

def processURL(url):
    print(url)
    session = getSession()
    try:
        response = session.get(url)
        response.raise_for_status()
        response.html.render()
    finally:
        freeSession(session)

if __name__ == '__main__':
    try:
        makeSessions()
        with concurrent.futures.ThreadPoolExecutor() as executor:
            futures = [executor.submit(processURL, url) for url in getURLs()]
            for _ in concurrent.futures.as_completed(futures):
                pass
    finally:
        cleanup()
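
As to what causes the error itself: render() lazily creates a headless browser, and in doing so requests_html calls asyncio.get_event_loop(); asyncio only supplies a default event loop in the main thread, so the call fails inside a ThreadPoolExecutor worker. A commonly suggested workaround is to give each worker thread its own event loop (and its own session) before rendering. The sketch below is illustrative and untested; fetch_rendered and the example URL are placeholder names, and pyppeteer, which requests_html drives under the hood, can still hit other restrictions outside the main thread on some platforms.

import asyncio
import concurrent.futures
from requests_html import HTMLSession

def fetch_rendered(url):
    # Worker threads have no event loop by default; create and register
    # one so requests_html's asyncio.get_event_loop() call succeeds.
    asyncio.set_event_loop(asyncio.new_event_loop())
    session = HTMLSession()  # one session (and thus one browser) per task
    try:
        response = session.get(url)
        response.html.render(timeout=60, wait=3)
        return response.html.raw_html
    finally:
        session.close()

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(fetch_rendered, url) for url in ['https://www.aliexpress.com']]
    for future in concurrent.futures.as_completed(futures):
        html = future.result()

Compared with that, the session pool above has the advantage of bounding how many Chromium instances exist at once: each HTMLSession lazily starts one browser, so a pool of four sessions means at most four browsers, no matter how many URLs are queued.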
Answered By - BrutusForcus