Thursday, April 21, 2022

[FIXED] Getting specific values from multiple span tags with the same class when web scraping with Beautiful Soup

April 21, 2022 beautifulsoup, python, web-scraping

Issue

I am having trouble getting numerical values from span tags that all share the same class.

This is what the HTML looks like:

<ul class="sds-definition-list review-breakdown--list">
<li>
<span class="sds-definition-list__display-name">Comfort</span>
<span class="sds-definition-list__value">5.0</span>
</li>
<li>
<span class="sds-definition-list__display-name">Interior design</span>
<span class="sds-definition-list__value">4.0</span>
</li>
<li>
<span class="sds-definition-list__display-name">Performance</span>
<span class="sds-definition-list__value">5.0</span>
</li>
<li>
<span class="sds-definition-list__display-name">Value for the money</span>
<span class="sds-definition-list__value">5.0</span>
</li>
<li>
<span class="sds-definition-list__display-name">Exterior styling</span>
<span class="sds-definition-list__value">5.0</span>
</li>
<li>
<span class="sds-definition-list__display-name">Reliability</span>
<span class="sds-definition-list__value">5.0</span>
</li>
</ul>

I want to take all the numerical values and put each one in its own column. Here is the code I am using:

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

ua = UserAgent()
header = {'User-Agent': str(ua.safari)}
url = 'https://www.cars.com/research/nissan-leaf-2011/consumer-reviews/?page=1'
response = requests.get(url, headers=header)
print(response)
html_soup = BeautifulSoup(response.text, 'lxml')
content_list = html_soup.find_all('div', attrs={'class': 'consumer-review-container'})
data = []

for e in content_list:
    data.append({
      'review_title': e.h3.text,
      'review_content': e.select_one('p.review-body').text,
      'overall_rating': e.select_one('span.sds-rating__count').text,
      'reviewer_name':e.select_one("div.review-byline div:nth-of-type(2)").text,
      'review_date':e.find("div", {"class":"review-byline"}).div.text,
    })

To each entry in the list data I would like to add the ratings for Comfort, Interior design, Performance, Value for the money, Exterior styling and Reliability, taken from the HTML shown above.

Solution

To get the result, you could iterate over the <li> elements and extract their contents with .stripped_strings, passing a generator expression to dict() to build label/value pairs. Then update your existing dict with the result and append it to data.
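
To see the idea in isolation, here is a minimal sketch that runs against a shortened copy of the HTML from the question, with no network request. It assumes the built-in html.parser backend rather than lxml, and the names sample_html and ratings are only illustrative.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (two of the six items).
sample_html = """
<ul class="sds-definition-list review-breakdown--list">
<li>
<span class="sds-definition-list__display-name">Comfort</span>
<span class="sds-definition-list__value">5.0</span>
</li>
<li>
<span class="sds-definition-list__display-name">Interior design</span>
<span class="sds-definition-list__value">4.0</span>
</li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Each <li> yields exactly two non-empty strings (label, value),
# so dict() can consume every item as a key/value pair.
ratings = dict(li.stripped_strings for li in soup.select("ul.sds-definition-list li"))

print(ratings)
```

This prints {'Comfort': '5.0', 'Interior design': '4.0'}. Note that the approach relies on every <li> holding exactly two strings; if an item held more or fewer, dict() would raise a ValueError.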

Creating a DataFrame from data will then give a separate column for each item:

for e in content_list:
    d = {
      'review_title': e.h3.text,
      'review_content': e.select_one('p.review-body').text,
      'overall_rating': e.select_one('span.sds-rating__count').text,
      'reviewer_name':e.select_one("div.review-byline div:nth-of-type(2)").text,
      'review_date':e.find("div", {"class":"review-byline"}).div.text,
    }

    d.update(dict(s.stripped_strings for s in e.select('ul.sds-definition-list li')))

    data.append(d)
data

Output:

[{'review_title': 'Great Electric Car!',
  'review_content': 'This is the perfect electric car for driving around town, doing errands or even for a short daily commuter. It is very comfy and very quick. The only issue was the first gen battery. The 2011-2014 battery degraded quickly and if the owner did not have Nissan replace it, all those cars are now junk and can only go 20 miles or so on a charge. We had Nissan replace our battery with the 2nd gen battery and it is good as new!',
  'overall_rating': '4.7',
  'reviewer_name': 'By EVs are the future from Tucson, AZ',
  'review_date': 'February 24, 2020',
  'Comfort': '5.0',
  'Interior design': '5.0',
  'Performance': '5.0',
  'Value for the money': '5.0',
  'Exterior styling': '3.0',
  'Reliability': '5.0'},...]
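
The DataFrame step itself is not shown above. A minimal sketch, assuming pandas is installed and data is shaped like the output above; the single record below is an abbreviated stand-in, and rating_columns is a name chosen for illustration:

```python
import pandas as pd

# Abbreviated stand-in for the scraped list of dicts.
data = [
    {
        "review_title": "Great Electric Car!",
        "overall_rating": "4.7",
        "Comfort": "5.0",
        "Exterior styling": "3.0",
    },
]

# Each dict becomes a row; each dict key becomes a column.
df = pd.DataFrame(data)

# The scraped ratings are strings, so convert them to floats if
# numeric operations (mean, sorting, filtering) are needed.
rating_columns = ["overall_rating", "Comfort", "Exterior styling"]
df[rating_columns] = df[rating_columns].astype(float)

print(df.dtypes)
```

Converting to float is optional; it only matters if the ratings will be aggregated or compared numerically rather than just displayed.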

Answered By - HedgeHog

This answer was collected from Stack Overflow and tested by the PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
