Thursday, March 24, 2022

[FIXED] RuntimeWarning: coroutine 'NewsExtraction.get_article_data_elements' was never awaited

Labels: python-3.x, python-asyncio

Issue

I have always resisted using asyncio within my code, but using it might help with some performance issues that I'm having.

Here is my scenario:

An end user provides a list of news sites to scrape
Each element is passed to an Article Class
A valid article is passed to an Extraction Class
The Extraction Class passes data to a NewsExtraction Class

This flow works flawlessly about 90% of the time, but occasionally one of the 12 functions in the NewsExtraction Class fails to extract data that exists in the HTML provided. It seems that my code is "stepping on itself," which causes the data element not to be parsed. When I rerun the code, all the elements are parsed correctly.

The NewsExtraction Class has a function, get_article_data_elements, which is called from the Extraction Class.

The function get_article_data_elements calls these methods:

published_date = self._extract_article_published_date()
modified_date = self._extract_article_modified_date()
title = self._extract_article_title()
description = self._extract_article_description()
keywords = self._extract_article_key_words()
tags = self._extract_article_tags()
authors = self._extract_article_author()
top_image = self._extract_top_image()
language = self._extract_article_language()
categories = self._extract_article_category()
text = self._extract_textual_content()
url = self._extract_article_url()

Each of these data elements is used to populate a Python dictionary, which is eventually passed back to the end user.

I have been trying to add asyncio code to the NewsExtraction Class, but I kept getting this error message:

RuntimeWarning: coroutine 'NewsExtraction.get_article_data_elements' was never awaited

I have spent the last 3 days trying to figure this issue out. I have looked at dozens of questions on Stack Overflow on this error RuntimeWarning: coroutine never awaited. I have also looked at numerous articles on using asyncio, but I cannot figure out how to use asyncio with my NewsExtraction Class, which is called from the Extraction Class.

Can someone provide me some pointers to solve my issue?

class NewsExtraction(object):
    """
    This class is used to extract common data elements from a news article
    on xyz
    """

    def __init__(self, url, soup):
        self._url = url
        self._raw_soup = soup


    truncated...


    async def _extract_article_published_date(self):
        """
        This function is designed to extract the publish date for the article being parsed.

        :return: date article was originally published
        :rtype: string
        """
        json_date_published = JSONExtraction(self._url, self._raw_soup).extract_article_published_date()
        if json_date_published is not None:
            if len(json_date_published) != 0:
                return json_date_published
            else:
                return None
        elif json_date_published is None:
            if self._raw_soup.find(name='div', attrs={'class': regex.compile("--publishDate")}):
                date_published = self._raw_soup.find(name='div', attrs={'class': regex.compile("--publishDate")})
                if len(date_published) != 0:
                    return date_published.text
                else:
                    logger.info('The HTML tag to extract the publish date for the following article was not found.')
                    logger.info(f'Article URL -- {self._url}')
                    return None


    truncated...


    async def get_article_data_elements(self):
        """
        This function is designed to extract all the common data elements from a
        news article on xyz.

        :return: dictionary of data elements related to the article
        :rtype: dict
        """
        article_data_elements = {}

        # I have tried this:
        published_date = self._extract_article_published_date().__await__()

        # and this
        published_date = self.task(self._extract_article_published_date())
        await published_date

        truncated...

I have also tried to use:

if __name__ == "__main__":
    asyncio.run(NewsExtraction.get_article_data_elements())
    # asyncio.run(self.get_article_data_elements())

I'm really banging my head on the wall with using asyncio in my news extraction code.

If this question is off base, I will be happy to delete it and keep reading about how to use asyncio correctly.

Can someone provide me some pointers to solve my issue?

Thanks in advance for any guidance on using asyncio

Solution

You are defining _extract_article_published_date and get_article_data_elements as coroutines, and these coroutines must be awaited in your code to get the result of their execution asynchronously.
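
To see where the warning comes from, here is a minimal sketch. The `fetch_title` coroutine is a hypothetical stand-in, not part of the original class. Calling a coroutine function only builds a coroutine object; its body does not run until it is awaited or handed to the event loop. If that object is discarded without ever being awaited, Python emits the "was never awaited" RuntimeWarning.

```python
import asyncio
import inspect


async def fetch_title():
    # Hypothetical stand-in for one of the _extract_* methods.
    return "Example title"


def demo():
    # Calling the coroutine function does not run its body;
    # it only returns a coroutine object.
    pending = fetch_title()
    created_only = inspect.iscoroutine(pending)
    # Handing the coroutine to the event loop actually executes it.
    result = asyncio.run(pending)
    return created_only, result


if __name__ == "__main__":
    print(demo())
```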

You can do this by creating an instance of NewsExtraction and calling these methods with the await keyword in front. An await hands control back to the event loop, which can run other tasks until the awaited coroutine completes. Note that no threads or processes are involved: control is handed over only while the awaited operation is not using CPU time (waiting on I/O or sleeping).
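
The hand-over described above can be observed with a small sketch. The `worker` coroutine and its delays are assumptions for illustration. Two tasks interleave at their `await` points while both run on the same thread.

```python
import asyncio
import threading


async def worker(name, delay, log):
    log.append((name, "start", threading.get_ident()))
    # await hands control back to the event loop while this task sleeps.
    await asyncio.sleep(delay)
    log.append((name, "end", threading.get_ident()))


async def main():
    log = []
    # "a" starts first but sleeps longer, so "b" finishes before it.
    await asyncio.gather(worker("a", 0.05, log), worker("b", 0.01, log))
    return log


def run_demo():
    return asyncio.run(main())


if __name__ == "__main__":
    for entry in run_demo():
        print(entry)
```

The recorded thread identifier is the same for every entry, which is one way to confirm that the switching is cooperative rather than thread-based.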

if __name__ == '__main__':
    extractor = NewsExtraction(...)
    # this creates the event loop and runs the coroutine
    asyncio.run(extractor.get_article_data_elements())
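
The difference from the `asyncio.run(NewsExtraction.get_article_data_elements())` attempt in the question is the instance: a method called on the class itself receives no `self`. A minimal sketch with a hypothetical stand-in class (`DemoExtraction` and its single field are assumptions for illustration):

```python
import asyncio


class DemoExtraction:
    """Hypothetical stand-in for NewsExtraction."""

    def __init__(self, url):
        self._url = url

    async def get_article_data_elements(self):
        return {"url": self._url}


def call_on_class():
    try:
        # No instance, so 'self' is missing and Python raises TypeError
        # before any coroutine object is created.
        asyncio.run(DemoExtraction.get_article_data_elements())
    except TypeError as exc:
        return type(exc).__name__
    return None


def call_on_instance():
    extractor = DemoExtraction("https://example.com/article")
    return asyncio.run(extractor.get_article_data_elements())


if __name__ == "__main__":
    print(call_on_class())
    print(call_on_instance())
```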

Inside _extract_article_published_date you must also await any coroutines that perform requests over the network. If you are using a library for the scraping, make sure that it uses async/await behind the scenes; otherwise asyncio will not give you a real performance gain.
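
Why the library matters can be shown with a rough timing sketch. The sleeps are placeholders for a network request, and all names here are assumptions. A synchronous call inside a coroutine blocks the event loop, so three such calls run back to back, while awaitable waits overlap. For a library that only offers blocking calls, `asyncio.to_thread` (Python 3.9+) is one possible way to offload the call to a worker thread.

```python
import asyncio
import time

DELAY = 0.1  # seconds; stands in for one network request


async def blocking_fetch():
    # A synchronous call inside a coroutine blocks the whole event loop.
    time.sleep(DELAY)


async def async_fetch():
    # An awaitable wait lets other tasks run in the meantime.
    await asyncio.sleep(DELAY)


async def timed(make_awaitable, count=3):
    # Run `count` awaitables concurrently and return the elapsed time.
    start = time.perf_counter()
    await asyncio.gather(*(make_awaitable() for _ in range(count)))
    return time.perf_counter() - start


async def main():
    blocking = await timed(blocking_fetch)
    cooperative = await timed(async_fetch)
    # asyncio.to_thread (Python 3.9+) runs a blocking call in a worker thread.
    offloaded = await timed(lambda: asyncio.to_thread(time.sleep, DELAY))
    return blocking, cooperative, offloaded


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The blocking variant should take roughly three times `DELAY`, while the other two should finish in about one `DELAY`; exact figures depend on the machine.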

async def get_article_data_elements(self):
    article_data_elements = {}

    # note that the instance here is self
    published_date = await self._extract_article_published_date()

    truncated...
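
If the extractors really do await I/O, they could also be started together rather than awaited one by one. This is a minimal sketch using `asyncio.gather`, not the approach from the original answer. The class `DemoNewsExtraction` is a hypothetical stand-in showing only two of the twelve extractors, and the `asyncio.sleep(0)` calls are placeholders for real awaited work. For purely CPU-bound parsing this brings no speedup.

```python
import asyncio


class DemoNewsExtraction:
    """Hypothetical stand-in for NewsExtraction."""

    async def _extract_article_title(self):
        await asyncio.sleep(0)  # placeholder for awaited I/O
        return "Example title"

    async def _extract_article_language(self):
        await asyncio.sleep(0)  # placeholder for awaited I/O
        return "en"

    async def get_article_data_elements(self):
        # gather runs the coroutines concurrently and returns their
        # results in the order the coroutines were passed in.
        title, language = await asyncio.gather(
            self._extract_article_title(),
            self._extract_article_language(),
        )
        return {"title": title, "language": language}


if __name__ == "__main__":
    print(asyncio.run(DemoNewsExtraction().get_article_data_elements()))
```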

You must dive into the asyncio documentation to get a better understanding of these features of Python 3.7+.

Answered By - svex99

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
