Issue
I am using Requests-HTML to render the JavaScript on a page, and concurrent.futures to speed the process up. My code worked perfectly until I added the following line:
response.html.render(timeout=60, sleep=1, wait=3, retries=10)
upon which I got the error:
    response.html.render(timeout=60, sleep=1, wait=3, retries=10)
  File "C:\Users\Ze\Anaconda3\lib\site-packages\requests_html.py", line 586, in render
    self.browser = self.session.browser  # Automatically create a event loop and browser
  File "C:\Users\Ze\Anaconda3\lib\site-packages\requests_html.py", line 727, in browser
    self.loop = asyncio.get_event_loop()
  File "C:\Users\Ze\Anaconda3\lib\asyncio\events.py", line 639, in get_event_loop
    raise RuntimeError('There is no current event loop in thread %r.'
RuntimeError: There is no current event loop in thread 'ThreadPoolExecutor-0_0'.
If I move the problematic line into the section below, it works again, but then the rendering doesn't happen in parallel, right?

for result in concurrent.futures.as_completed(futures):
    result = result.result()
What is causing the problem? I've never used asyncio. Do I have to use it for this? Is it easy to implement?
Thank you very much!
CODE:
from bs4 import BeautifulSoup
import concurrent.futures
from requests_html import HTMLSession

session = HTMLSession()

# get_headers() and urls are defined elsewhere in my script

def load_page_and_extract_items(url):
    response = session.get(url, headers=get_headers())
    # render javascript
    response.html.render(timeout=60, wait=3)
    source = BeautifulSoup(response.html.raw_html, 'lxml')
    return source

def get_pages(remaining_urls):
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # for each of 60 possible pages
        for current_page_number in range(60):
            futures = [executor.submit(load_page_and_extract_items, url) for url in remaining_urls]
            for result in concurrent.futures.as_completed(futures):
                result = result.result()

def main():
    get_pages(urls)
Solution
This doesn't directly answer the question, but it demonstrates a technique for multithreaded web scraping that performed well in my tests. It uses the URL from the original question, searches for certain tags that may contain hrefs, and then processes those URLs. The general idea is to create a pool of sessions: each thread takes a session object from the pool (a queue), uses it, and then puts it back on the queue, making it available to other threads.
from requests_html import HTMLSession
import concurrent.futures
import queue

QUEUE = queue.Queue()

def makeSessions(n=4):
    # pre-create a pool of n sessions
    for _ in range(n):
        QUEUE.put(HTMLSession())

def cleanup():
    # drain the pool and close every session
    while True:
        try:
            getSession(False).close()
        except queue.Empty:
            break

def getSession(block=True):
    # take a session from the pool; by default, block until one is free
    return QUEUE.get(block=block)

def freeSession(session):
    # return a session to the pool so other threads can use it
    if isinstance(session, HTMLSession):
        QUEUE.put(session)

def getURLs():
    # collect the category hrefs from the landing page
    urls = []
    session = getSession()
    try:
        response = session.get('https://www.aliexpress.com')
        response.raise_for_status()
        response.html.render()
        for a in response.html.xpath('//dt[@class="cate-name"]/span/a'):
            if 'href' in a.attrs:
                urls.append(a.attrs['href'])
    finally:
        freeSession(session)
    return urls

def processURL(url):
    print(url)
    session = getSession()
    try:
        response = session.get(url)
        response.raise_for_status()
        response.html.render()
    finally:
        freeSession(session)

if __name__ == '__main__':
    try:
        makeSessions()
        with concurrent.futures.ThreadPoolExecutor() as executor:
            futures = [executor.submit(processURL, url) for url in getURLs()]
            for _ in concurrent.futures.as_completed(futures):
                pass
    finally:
        cleanup()
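
As to what causes the error itself: render() lazily creates a headless browser, and in doing so requests_html calls asyncio.get_event_loop(); asyncio only supplies a default event loop in the main thread, so the call fails inside a ThreadPoolExecutor worker. A commonly suggested workaround is to give each worker thread its own event loop (and its own session) before rendering. The sketch below is illustrative and untested; fetch_rendered and the example URL are placeholder names, and pyppeteer, which requests_html drives under the hood, can still hit other restrictions outside the main thread on some platforms.

import asyncio
import concurrent.futures
from requests_html import HTMLSession

def fetch_rendered(url):
    # Worker threads have no event loop by default; create and register
    # one so requests_html's asyncio.get_event_loop() call succeeds.
    asyncio.set_event_loop(asyncio.new_event_loop())
    session = HTMLSession()  # one session (and thus one browser) per task
    try:
        response = session.get(url)
        response.html.render(timeout=60, wait=3)
        return response.html.raw_html
    finally:
        session.close()

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(fetch_rendered, url) for url in ['https://www.aliexpress.com']]
    for future in concurrent.futures.as_completed(futures):
        html = future.result()

Compared with that, the session pool above has the advantage of bounding how many Chromium instances exist at once: each HTMLSession lazily starts one browser, so a pool of four sessions means at most four browsers, no matter how many URLs are queued.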
Answered By - BrutusForcus