Sunday, January 7, 2024

[FIXED] How to retrieve data from multiple URLs using BeautifulSoup (Python is only returning the last line)?

Labels: beautifulsoup, for-loop, jupyter, python, web-scraping

Issue

I'm writing code to retrieve all of the review titles from an airline review website. I am using 5 different URLs because I want to compare the titles across 5 airlines. However, my code only lists the review titles for the last URL, which is for Alaska Airlines. I initially put all of the URLs together in one list, but that produced the same problem: only the Alaska Airlines results were shown.

My code:

# Insert the following command into the command prompt before starting for faster run time:

# jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10

#Importing and installing necessary packages
!pip install lxml
from bs4 import BeautifulSoup
import requests
import pandas as pd
from pprint import pprint

base_url = 'https://www.airlinequality.com/airline-reviews/'

endings = ['american-airlines', 'delta-air-lines', 'united-airlines',
           'southwest-airlines', 'alaska-airlines']

for ending in endings:
    url = base_url + ending
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    results = soup.find('div', id='container')

# Retrieving all reviews
titles = results.find_all('h2', class_='text_header')

for title in titles:
    print(title, end="\n"*2)

My output:

<h2 class="text_header">"first class customer service"</h2>

<h2 class="text_header">"deeply unsatisfactory"</h2>

<h2 class="text_header">"Everything was just fabulous"</h2>

<h2 class="text_header">"Messed up airline"</h2>

<h2 class="text_header">"agents who obviously care so much" </h2>

<h2 class="text_header">“communication was sorely lacking”</h2>

<h2 class="text_header">"Never encountered ruder gate workers"</h2>

<h2 class="text_header">"Our check-in bag was badly damaged"</h2>

<h2 class="text_header">"never book again with Alaska Airlines"</h2>

<h2 class="text_header">"I could not get on the plane"</h2>

<h2 class="text_header">The Worlds Best Airlines</h2>

<h2 class="text_header">THE NICEST AIRPORT STAFF</h2>

<h2 class="text_header">THE CLEANEST AIRLINE</h2>

<h2 class="text_header">Alaska Airlines Photos</h2>

I expected to get this output but for all 5 URLs. How can I retrieve the review titles from all URLs?

Solution

You are overwriting results on every iteration of the loop, and the code that extracts the titles sits outside the loop, so it only ever sees the page of the last URL. Either store the results in a list and iterate over them in another for-loop, or scrape the needed information directly inside the loop. Be aware that this only gets the reviews from the first review page per airline; to get all of them you have to add another loop that iterates over all pages per airline (the examples below give an idea of how).
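
To make the overwriting problem concrete without touching the network, here is a minimal sketch. `FAKE_PAGES` and `fetch_titles` are hypothetical stand-ins for the request-and-parse step, not part of the original code:

```python
# Hypothetical stand-in for the scraped pages: airline slug -> titles on its first page.
FAKE_PAGES = {
    'american-airlines': ['"nothing but a headache"', '"Terrible from start to finish"'],
    'alaska-airlines': ['"I could not get on the plane"'],
}


def fetch_titles(airline):
    """Pretend to download and parse one page (no network access)."""
    return FAKE_PAGES[airline]


# Problem pattern: `results` is reassigned on every pass and only read
# after the loop has finished, so only the last page survives.
for airline in FAKE_PAGES:
    results = fetch_titles(airline)
last_only = list(results)

# Fix: extract inside the loop and accumulate into a list.
all_titles = []
for airline in FAKE_PAGES:
    for title in fetch_titles(airline):
        all_titles.append({'airline': airline, 'title': title})

print(last_only)
print(len(all_titles))
```

The first loop mirrors the code in the question; the second mirrors the approach taken in the answer, where every page contributes rows to one shared list.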

The example below focuses on the first page, as described in your question, and stores the results in a list of dicts that you can simply convert into a dataframe:

import requests
import pandas as pd
from bs4 import BeautifulSoup

base_url = 'https://www.airlinequality.com/airline-reviews/'

endings = ['american-airlines', 'delta-air-lines', 'united-airlines',
           'southwest-airlines', 'alaska-airlines']

# One dict per review, collected across all airlines
data = []

for ending in endings:
    url = base_url + ending
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    # Extract inside the loop so every page is processed, not just the last one
    for e in soup.select('article[itemprop="review"]'):
        data.append({
            'title': e.h2.text,
            'airline': ending,
            'rating': e.select_one('span[itemprop="ratingValue"]').text
        })

data will look like:

[{'title': '"nothing but a headache"',
  'airline': 'american-airlines',
  'rating': '1'},
 {'title': '“provide vegan options”',
  'airline': 'american-airlines',
  'rating': '2'},
 {'title': '“created so much stress and hassle”',
  'airline': 'american-airlines',
  'rating': '1'},...]

Transform data into dataframe:

pd.DataFrame(data)

	title	airline	rating
0	"nothing but a headache"	american-airlines	1
1	“provide vegan options”	american-airlines	2
2	“created so much stress and hassle”	american-airlines	1
3	"Terrible from start to finish"	american-airlines	1
4	"my bags are stuck in Charlotte"	american-airlines	1
...
45	“communication was sorely lacking”	alaska-airlines	1
46	"Never encountered ruder gate workers"	alaska-airlines	3
47	"Our check-in bag was badly damaged"	alaska-airlines	1
48	"never book again with Alaska Airlines"	alaska-airlines	4
49	"I could not get on the plane"	alaska-airlines	1
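
Note that the ratings are scraped as strings, so they need converting before any numeric comparison between airlines. A minimal sketch using only the standard library, with a few rows copied from the output above (the four-row sample is only for illustration, so the averages say nothing about the real data):

```python
from collections import defaultdict
from statistics import mean

# A few rows copied from the output above; 'rating' arrives as a string.
data = [
    {'title': '"nothing but a headache"', 'airline': 'american-airlines', 'rating': '1'},
    {'title': '“provide vegan options”', 'airline': 'american-airlines', 'rating': '2'},
    {'title': '"Never encountered ruder gate workers"', 'airline': 'alaska-airlines', 'rating': '3'},
    {'title': '"never book again with Alaska Airlines"', 'airline': 'alaska-airlines', 'rating': '4'},
]

# Group the ratings per airline, converting each one to an integer.
ratings_by_airline = defaultdict(list)
for row in data:
    ratings_by_airline[row['airline']].append(int(row['rating']))

# Average rating per airline for this small sample.
average_rating = {airline: mean(values) for airline, values in ratings_by_airline.items()}
print(average_rating)
```

With the dataframe, the same idea should be expressible as `pd.DataFrame(data).astype({'rating': int}).groupby('airline')['rating'].mean()`.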

How to get "all results"

To give you an idea of how to work with all results, check what the additional while-loop is doing. Keep in mind to be gentle with the websites you scrape and to add some delay between requests:

import time
from urllib.parse import urljoin

for ending in endings:
    url = f'https://www.airlinequality.com/airline-reviews/{ending}/page/1/?sortby=post_date%3ADesc&pagesize=100'
    while True:
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
        for e in soup.select('article[itemprop="review"]'):
            data.append({
                'title': e.h2.text,
                'airline': ending,
                'rating': e.select_one('span[itemprop="ratingValue"]').text
            })
        # Follow the link in the last pagination item, if there is one
        next_link = soup.select_one('article.comp_reviews-pagination ul li:last-of-type a')
        if next_link:
            # urljoin resolves the href whether it is relative or absolute
            url = urljoin(base_url, next_link.get('href'))
            time.sleep(1)  # be gentle: pause between requests
        else:
            break
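
The same pagination pattern can be tried offline before pointing it at the real site. The sketch below is only an illustration: `FAKE_SITE`, `fetch_page`, `scrape_all_pages` and the `max_pages` cap are hypothetical names introduced here, and the site-relative `href` format is an assumption about the site's markup:

```python
import time
from urllib.parse import urljoin

BASE_URL = 'https://www.airlinequality.com/airline-reviews/'

# Hypothetical stand-in for the site:
# URL -> (titles on that page, href of the next page or None).
FAKE_SITE = {
    BASE_URL + 'alaska-airlines/page/1/': (
        ['title A', 'title B'], '/airline-reviews/alaska-airlines/page/2/'),
    BASE_URL + 'alaska-airlines/page/2/': (['title C'], None),
}


def fetch_page(url):
    """Pretend to request and parse one page (no network access)."""
    return FAKE_SITE[url]


def scrape_all_pages(first_url, delay=1.0, max_pages=50):
    """Follow next-page links until none is left or max_pages is reached."""
    titles = []
    url = first_url
    for _ in range(max_pages):
        page_titles, next_href = fetch_page(url)
        titles.extend(page_titles)
        if next_href is None:
            break
        # Resolve the href against the current URL; this works for
        # site-relative, page-relative and absolute links alike.
        url = urljoin(url, next_href)
        time.sleep(delay)  # pause between requests to be gentle with the server
    return titles


titles = scrape_all_pages(BASE_URL + 'alaska-airlines/page/1/', delay=0)
print(titles)
```

Capping the number of pages guards against an endless loop if the pagination markup ever changes so that a "next" link is always present.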

Answered By - HedgeHog

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
