Issue
I'm writing code to retrieve all of the review titles from an airline review website. I am using 5 different URLs because I want to compare the titles across 5 different airlines. However, my code only lists the review titles for the last URL, which is for Alaska Airlines. I initially put all of the URLs together in a single list, but that had the exact same problem: it only showed results for Alaska Airlines.
My code:
# Insert the following command into the command prompt before starting for faster run time:
# jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10
# Importing and installing necessary packages
!pip install lxml
from bs4 import BeautifulSoup
import requests
import pandas as pd
from pprint import pprint

base_url = 'https://www.airlinequality.com/airline-reviews/'
endings = ['american-airlines', 'delta-air-lines', 'united-airlines',
           'southwest-airlines', 'alaska-airlines']
for ending in endings:
    url = base_url + ending
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    results = soup.find('div', id='container')

# Retrieving all reviews
titles = results.find_all('h2', class_='text_header')
for title in titles:
    print(title, end="\n"*2)
My output:
<h2 class="text_header">"first class customer service"</h2>
<h2 class="text_header">"deeply unsatisfactory"</h2>
<h2 class="text_header">"Everything was just fabulous"</h2>
<h2 class="text_header">"Messed up airline"</h2>
<h2 class="text_header">"agents who obviously care so much" </h2>
<h2 class="text_header">“communication was sorely lacking”</h2>
<h2 class="text_header">"Never encountered ruder gate workers"</h2>
<h2 class="text_header">"Our check-in bag was badly damaged"</h2>
<h2 class="text_header">"never book again with Alaska Airlines"</h2>
<h2 class="text_header">"I could not get on the plane"</h2>
<h2 class="text_header">The Worlds Best Airlines</h2>
<h2 class="text_header">THE NICEST AIRPORT STAFF</h2>
<h2 class="text_header">THE CLEANEST AIRLINE</h2>
<h2 class="text_header">Alaska Airlines Photos</h2>
I expected to get this output but for all 5 URLs. How can I retrieve the review titles from all URLs?
Solution
You are overwriting your results on every pass through the loop, so only the content of the last URL is left when the titles are extracted. Either store the results in a list and iterate over them in another for-loop, or scrape the needed information directly inside the loop. Be aware that this only gets the reviews from the first review page per airline; to get all of them, you have to implement another loop that iterates over all pages per airline (the examples below give an idea of how).
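The overwriting can be reproduced without any network access. The following is only a minimal sketch: the `pages` dict and its values are made-up stand-ins for the fetched pages, not real scraped data.

```python
# Stand-in for the fetched pages (hypothetical values, no network access needed).
pages = {
    'american-airlines': ['title A1', 'title A2'],
    'alaska-airlines': ['title K1'],
}

# Pattern from the question: the variable is reassigned on every pass,
# and the processing only happens after the loop has finished.
for name in pages:
    titles = pages[name]
last_only = list(titles)

# Fix: collect the needed information while still inside the loop.
collected = []
for name in pages:
    for title in pages[name]:
        collected.append((name, title))

print(last_only)   # only the titles of the last key survive
print(collected)   # titles of every key, tagged with their source
```

Only the second pattern keeps the titles of every airline, which is what the example below does with real requests.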
The example focuses on the first page, as described in your question, and stores the results in a list of dicts that you can easily convert into a DataFrame:
import requests
import pandas as pd
from bs4 import BeautifulSoup

base_url = 'https://www.airlinequality.com/airline-reviews/'
endings = ['american-airlines', 'delta-air-lines', 'united-airlines',
           'southwest-airlines', 'alaska-airlines']

data = []
for ending in endings:
    url = base_url + ending
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    # Extract each review while still inside the loop, so nothing gets overwritten
    for e in soup.select('article[itemprop="review"]'):
        data.append({
            'title': e.h2.text,
            'airline': ending,
            'rating': e.select_one('span[itemprop="ratingValue"]').text
        })
data
will look like:
[{'title': '"nothing but a headache"',
'airline': 'american-airlines',
'rating': '1'},
{'title': '“provide vegan options”',
'airline': 'american-airlines',
'rating': '2'},
{'title': '“created so much stress and hassle”',
'airline': 'american-airlines',
'rating': '1'},...]
Transform data into a DataFrame:
pd.DataFrame(data)
| | title | airline | rating |
|---|---|---|---|
| 0 | "nothing but a headache" | american-airlines | 1 |
| 1 | “provide vegan options” | american-airlines | 2 |
| 2 | “created so much stress and hassle” | american-airlines | 1 |
| 3 | "Terrible from start to finish" | american-airlines | 1 |
| 4 | "my bags are stuck in Charlotte" | american-airlines | 1 |
| ... | ... | ... | ... |
| 45 | “communication was sorely lacking” | alaska-airlines | 1 |
| 46 | "Never encountered ruder gate workers" | alaska-airlines | 3 |
| 47 | "Our check-in bag was badly damaged" | alaska-airlines | 1 |
| 48 | "never book again with Alaska Airlines" | alaska-airlines | 4 |
| 49 | "I could not get on the plane" | alaska-airlines | 1 |
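Since the goal is to compare airlines, the same list of dicts can also be summarized per airline. This is only a sketch under a few assumptions: the helper name `mean_rating_by_airline` is made up for illustration, the three sample rows are copied from the output above, and every rating is assumed to be a string of digits, as it is in that output.

```python
from collections import defaultdict
from statistics import mean

# A few rows in the same shape as `data`, copied from the output shown above.
sample = [
    {'title': '"nothing but a headache"', 'airline': 'american-airlines', 'rating': '1'},
    {'title': '“provide vegan options”', 'airline': 'american-airlines', 'rating': '2'},
    {'title': '"Never encountered ruder gate workers"', 'airline': 'alaska-airlines', 'rating': '3'},
]

def mean_rating_by_airline(rows):
    """Group the ratings by airline and average them (hypothetical helper)."""
    ratings = defaultdict(list)
    for row in rows:
        # Ratings are scraped as text, so convert them before doing arithmetic.
        ratings[row['airline']].append(int(row['rating']))
    return {airline: mean(values) for airline, values in ratings.items()}

summary = mean_rating_by_airline(sample)
print(summary)
```

The conversion with int() matters: without it the ratings stay strings and cannot be averaged.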
How to get "all results"
To give you an idea of how to collect all results, look at what the additional while-loop is doing. Keep in mind to be gentle with the websites you scrape and to add some delay between requests:
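import time
from urllib.parse import urljoin

# Reuses requests, BeautifulSoup, base_url, endings and data from the example above
for ending in endings:
    url = f'https://www.airlinequality.com/airline-reviews/{ending}/page/1/?sortby=post_date%3ADesc&pagesize=100'
    while True:
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
        for e in soup.select('article[itemprop="review"]'):
            data.append({
                'title': e.h2.text,
                'airline': ending,
                'rating': e.select_one('span[itemprop="ratingValue"]').text
            })
        # Look for the link in the last pagination item; stop when there is none
        next_link = soup.select_one('article.comp_reviews-pagination ul li:last-of-type a')
        if next_link:
            # urljoin copes with both relative and root-relative href values
            url = urljoin(base_url, next_link.get('href'))
            time.sleep(1)  # short pause between requests; the value is an arbitrary choice
        else:
            break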
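The "follow the next link until there is none" pattern of the while-loop can also be isolated in a small generator, which makes it easy to cap the number of pages and to try the logic offline. This is a sketch, not part of the original answer: `crawl`, `fetch`, `find_next`, `max_pages` and `delay` are hypothetical names, and the example.com URLs are placeholders.

```python
import time
from urllib.parse import urljoin

def crawl(start_url, fetch, find_next, max_pages=50, delay=1.0):
    """Yield (url, page) pairs, following next-page links until none is left."""
    url = start_url
    seen = set()
    for _ in range(max_pages):       # hard cap as a safety net
        if url in seen:              # guard against pagination loops
            break
        seen.add(url)
        page = fetch(url)
        yield url, page
        href = find_next(page)
        if not href:                 # no next link: last page reached
            break
        url = urljoin(url, href)     # resolve the href against the current URL
        time.sleep(delay)            # be gentle with the server

# Offline demo: each fake "page" is simply the href of the next page (or None).
fake_site = {
    'https://example.com/reviews/page/1/': '/reviews/page/2/',
    'https://example.com/reviews/page/2/': '/reviews/page/3/',
    'https://example.com/reviews/page/3/': None,
}
visited = [url for url, _ in crawl(
    'https://example.com/reviews/page/1/',
    fetch=lambda url: fake_site[url],
    find_next=lambda page: page,
    delay=0,
)]
print(visited)
```

For real scraping, fetch would wrap requests.get plus BeautifulSoup, and find_next would run the pagination selector from the loop above and return the href of the link it finds.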
Answered By - HedgeHog