Issue
I was actually trying to scrape the the "Name"
column in the table shown in this link and save it as a csv file.
I wrote a python script like below:
from bs4 import BeautifulSoup
import requests
import csv
# Step 1: Sending a HTTP request to a URL
url = "https://myaccount.umn.edu/lookup?SET_INSTITUTION=UMNTC&type=name&CN=University+of+Minnesota&campus=a&role=any"
# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text
# Step 2: Parse the html content
soup = BeautifulSoup(html_content, "lxml")
# print(soup.prettify()) # print the parsed data of html
# Step 3: Analyze the HTML tag, where your content lives
# Create a data dictionary to store the data.
data = {}
#Get the table having the class wikitable
gdp_table = soup.find("table")
gdp_table_data = gdp_table.find_all("th") # contains 2 rows
# Get all the headings of Lists
headings = []
for td in gdp_table_data[0].find_all("td"):
# remove any newlines and extra spaces from left and right
headings.append(td.b.text.replace('\n', ' ').strip())
# Get all the 3 tables contained in "gdp_table"
for table, heading in zip(gdp_table_data[1].find_all("table"), headings):
# Get headers of table i.e., Rank, Country, GDP.
t_headers = []
for th in table.find_all("th"):
# remove any newlines and extra spaces from left and right
t_headers.append(th.text.replace('\n', ' ').strip())
# Get all the rows of table
table_data = []
for tr in table.tbody.find_all("tr"): # find all tr's from table's tbody
t_row = {}
# Each table row is stored in the form of
# t_row = {'Rank': '', 'Country/Territory': '', 'GDP(US$million)': ''}
# find all td's(3) in tr and zip it with t_header
for td, th in zip(tr.find_all("td"), t_headers):
t_row[th] = td.text.replace('\n', '').strip()
table_data.append(t_row)
# Put the data for the table with his heading.
data[heading] = table_data
print("table_data")
But when I run this script I am not getting anything. Please help me with this
Solution
It seems like your list gdp_table_data[0].find_all("td")
is empty, hence explaining that you don't find anything (your for-loops are not doing anything). Without more context on your strategy, it's hard to help.
By the way, if you're not against using an external library, using pandas
would make it super easy to scrape this kind of webpage. Just so you know:
>>> import pandas as pd
>>> url = "https://myaccount.umn.edu/lookup?SET_INSTITUTION=UMNTC&type=name&CN=University+of+Minnesota&campus=a&role=any"
>>> df = pd.read_html(url)[0]
>>> print(df)
Name Email Work Phone Phone Dept/College
0 AIESEC at the University of Minnesota (aiesec) [email protected] NaN NaN Student Organization
1 Ayn Rand Study Group University of Minnesota (... [email protected] NaN NaN NaN
2 Balance UMD (balance) [email protected] NaN NaN Student Organization
3 Christians on Campus the University of Minneso... [email protected] NaN NaN Student Organization
4 Climb Club University of Minnesota (climb) [email protected] NaN NaN Student Organization
.. ... ... ... ... ...
74 University of Minnesota Tourism Center (tourism) [email protected] NaN NaN Department
75 University of Minnesota Treasury Accounting (t... [email protected] NaN NaN Department
76 University of Minnesota Twin Cities HOSA (umnh... [email protected] NaN NaN Student Organization
77 University of Minnesota U Write (uwrite) NaN NaN NaN Department
78 University of Minnesota VoiceMail (cs-vcml) [email protected] NaN NaN OIT Network & Design
[79 rows x 5 columns]
Now, getting only names is super easy:
>>> print(df.Name)
0 AIESEC at the University of Minnesota (aiesec)
1 Ayn Rand Study Group University of Minnesota (...
2 Balance UMD (balance)
3 Christians on Campus the University of Minneso...
4 Climb Club University of Minnesota (climb)
...
74 University of Minnesota Tourism Center (tourism)
75 University of Minnesota Treasury Accounting (t...
76 University of Minnesota Twin Cities HOSA (umnh...
77 University of Minnesota U Write (uwrite)
78 University of Minnesota VoiceMail (cs-vcml)
Name: Name, Length: 79, dtype: object
To export only that column into a .csv
use:
>>> df[["Name"]].to_csv("./filename.csv")
Answered By - Arnaud
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.