Issue
I'm trying to extract data from a table. Some of the td cells contain links, and for those cells I want to retrieve the href value rather than the td text itself.
Here is the code I have been using:
url = 'http://www.milavia.net/airshows/calendar/showdates-2020-world.html'
page = session.get(url)
soup = BeautifulSoup(page.content, 'lxml')
tableOutput = []
for row in soup.find_all('tr')[1:]:
    date, event, location, website, facebook, feature, notes = row.find_all('td')[0:7]
    # print(website)
    p = {
        'Date': date.text.strip(),
        'Event': event.text.strip(),
        'Location': location.text.strip(),
        # 'Site': website.text.strip(),
        'Site': website.select('a', href=True, text='TEXT'),
        'Facebook': facebook.text.strip(),
        'Featuring': feature.text.strip(),
        'Notes': notes.text.strip()
    }
    tableOutput.append(p)
print(tableOutput)
Here is the output:
[{'Date': '15-18 Jan', 'Event': 'Kuwait Aviation Show', 'Location': 'Kuwait International Airport, Kuwait', 'Site': [<a class="asclnk" href="http://kuwaitaviationshow.com/" target="airshow" title="Visit Kuwait Aviation Show Website: kuwaitaviationshow.com">link</a>], 'Facebook': '', 'Featuring': '', 'Notes': 'public 17-18'},
 {'Date': '18 Jan', 'Event': 'Classics of the Sky Tauranga City Airshow', 'Location': 'Tauranga, New Zealand', 'Site': [<a class="asclnk" href="http://www.tcas.nz" target="airshow" title="Visit Classics of the Sky Tauranga City Airshow Website: www.tcas.nz">link</a>], 'Facebook': '', 'Featuring': '', 'Notes': ''},
 {'Date': 'Date', 'Event': 'Event', 'Location': 'Location', 'Site': [], 'Facebook': 'Facebook', 'Featuring': 'Feature', 'Notes': 'Notes'}]
I'm unable to get only the value of the href attribute (e.g. http://www.tcas.nz) from a tag such as:
<a class="asclnk" href="http://www.tcas.nz" target="airshow" title="Visit Classics of the Sky Tauranga City Airshow Website: www.tcas.nz">
I have tried some approaches using website.select() or website.find(), but none of them gave me the result I needed.
Solution
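Your attempt did not work for two reasons. First, select() returns a list of matching tags, not an attribute value, which is why 'Site' came out as a list of <a> elements. Second, you are iterating over every row, and some rows (such as the repeated header rows) have no anchor tag with an href attribute, so reading the href directly fails on them. The code below uses select_one('a[href]') to get a single tag, reads its 'href' attribute, and adds an if condition that skips rows without a link.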
import requests
from bs4 import BeautifulSoup

url = 'http://www.milavia.net/airshows/calendar/showdates-2020-world.html'
session = requests.Session()
page = session.get(url)
soup = BeautifulSoup(page.content, 'lxml')
tableOutput = []
# Skip the first row, which holds the column headers.
for row in soup.find_all('tr')[1:]:
    date, event, location, website, facebook, feature, notes = row.find_all('td')[0:7]
    # select_one() returns the first matching tag, or None if the cell has no link.
    link = website.select_one('a[href]')
    if link:
        p = {
            'Date': date.text.strip(),
            'Event': event.text.strip(),
            'Location': location.text.strip(),
            # Read the URL from the href attribute instead of the cell text.
            'Site': link['href'],
            'Facebook': facebook.text.strip(),
            'Featuring': feature.text.strip(),
            'Notes': notes.text.strip()
        }
        tableOutput.append(p)
print(tableOutput)
Output:
[{'Featuring': '', 'Location': 'Kuwait International Airport, Kuwait', 'Site': 'http://kuwaitaviationshow.com/', 'Date': '15-18 Jan', 'Facebook': '', 'Event': 'Kuwait Aviation Show', 'Notes': 'public 17-18'},
 {'Featuring': '', 'Location': 'Tauranga, New Zealand', 'Site': 'http://www.tcas.nz', 'Date': '18 Jan', 'Facebook': '', 'Event': 'Classics of the Sky Tauranga City Airshow', 'Notes': ''},
 {'Featuring': '', 'Location': 'Lucknow, Uttar Pradesh, India', 'Site': 'https://defexpo.gov.in/', 'Date': '05-08 Feb', 'Facebook': '', 'Event': 'Defexpo India 2020', 'Notes': 'public Sat. 8th'},
 {'Featuring': '', 'Location': 'Changi Exhibition Centre, Singapore', 'Site': 'http://www.singaporeairshow.com/', 'Date': '11-16 Feb', 'Facebook': '', 'Event': 'Singapore Airshow 2020', 'Notes': 'public Sat-SunReports: 2018 2014'},
 {'Featuring': '', 'Location': 'Al Bateen Executive Airport, Abu Dhabi, United Arab Emirates', 'Site': 'http://www.adairexpo.com/', 'Date': '04-06 Mar', 'Facebook': '', 'Event': 'Abu Dhabi Air Expo & Heli Expo 2020', 'Notes': 'trade expo'},
 {'Featuring': '', 'Location': "Djerba–Zarzis Int'l Airport, Djerba, Tunisia", 'Site': 'http://www.iadetunisia.com/en/', 'Date': '04-08 Mar', 'Facebook': '', 'Event': 'IADE Tunisia 2020', 'Notes': 'public days 7-8'},
 {'Featuring': '', 'Location': 'Tyabb Airport, Tyabb VIC, Australia', 'Site': 'http://www.tyabbairshow.com/', 'Date': '08 Mar', 'Facebook': '', 'Event': 'Tyabb Air Show 2020', 'Notes': ''},
 {'Featuring': '', 'Location': 'Echuca Airport, Echuca VIC, Australia', 'Site': 'http://www.antique-aeroplane.com.au/', 'Date': '20-22 Mar', 'Facebook': '', 'Event': 'AAAA National Fly-in', 'Notes': ''},
 {'Featuring': '', 'Location': "Santiago Int'l Airport, Santiago, Chile", 'Site': 'http://www.fidae.cl/', 'Date': '31 Mar / 05 Apr', 'Facebook': '', 'Event': 'FIDAE 2020', 'Notes': 'public Apr 4-5'},
 {'Featuring': '', 'Location': "Santiago Int'l Airport, Santiago, Chile", 'Site': 'http://www.fidae.cl/', 'Date': '31 Mar / 05 Apr', 'Facebook': '', 'Event': 'FIDAE 2020', 'Notes': 'public Apr 4-5'},
 {'Featuring': '', 'Location': 'Wanaka Airport, Otago, New Zealand', 'Site': 'http://www.warbirdsoverwanaka.com/', 'Date': '11-13 Apr', 'Facebook': '', 'Event': 'Warbirds Over Wanaka 2020', 'Notes': 'Report 2010'},
 {'Featuring': '', 'Location': 'Illawarra Regional Airport, Wollongong NSW, Australia', 'Site': 'http://www.woi.org.au/', 'Date': '02-03 May', 'Facebook': '', 'Event': 'Wings over Illawarra', 'Notes': ''},
 {'Featuring': '', 'Location': 'AFB Waterkloof, Centurion, South Africa', 'Site': 'http://www.aadexpo.co.za/', 'Date': '16-20 Sep', 'Facebook': '', 'Event': 'Africa Aerospace & Defence - AAD 2020', 'Notes': 'public 19-20'},
 {'Featuring': '', 'Location': 'JIExpo Kemayoran, Jakarta, Indonesia', 'Site': 'http://www.indoaerospace.com/', 'Date': '04-07 Nov', 'Facebook': '', 'Event': 'Indo Aerospace 2020', 'Notes': 'trade only'},
 {'Featuring': '', 'Location': 'Zhuhai, Guangdong, China', 'Site': 'http://www.airshow.com.cn/', 'Date': '10-15 Nov', 'Facebook': '', 'Event': 'Airshow China 2020', 'Notes': 'public 13-15th'},
 {'Featuring': '', 'Location': 'Sakhir Air Base, Bahrain', 'Site': 'http://www.bahraininternationalairshow.com/', 'Date': '18-20 Nov', 'Facebook': '', 'Event': 'Bahrain International Airshow BIAS 2020', 'Notes': ''}]
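As a supplementary sketch, the difference between select() and select_one() can be checked offline on a small inline table. The HTML snippet, the extract_href helper, and the 'MatchCount' key are hypothetical and only for illustration, and the built-in 'html.parser' backend is assumed here so that lxml is not required. Unlike the code above, this variant keeps rows without a link and stores None for them, which may or may not be what you want.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table standing in for the real page: the first row
# has a link in its website cell, the second row (a header repeat) does not.
HTML = """
<table>
  <tr><td>18 Jan</td><td><a class="asclnk" href="http://www.tcas.nz">link</a></td></tr>
  <tr><td>Date</td><td>Website</td></tr>
</table>
"""


def extract_href(cell):
    """Return the href of the first <a href=...> inside cell, or None."""
    link = cell.select_one('a[href]')  # a single Tag, or None if nothing matches
    return link['href'] if link is not None else None


soup = BeautifulSoup(HTML, 'html.parser')
rows = []
for row in soup.find_all('tr'):
    date, website = row.find_all('td')
    # select() always returns a list of tags (possibly empty), never a string.
    tags = website.select('a')
    rows.append({
        'Date': date.text.strip(),
        'Site': extract_href(website),
        'MatchCount': len(tags),
    })

print(rows)
```

Here the first row yields the URL string and the second yields None, while select() returns a one-element and an empty list respectively.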
Answered By - KunduK