Issue
I have scraped data from a website using BeautifulSoup, and I want to place it into a pandas DataFrame and then write it to a file. Most of the data is written to the file as expected, but some cells are missing values. For example, the first row of the Phone Number column is empty, and so are the 39th, 45th, and 75th rows of the Postal Code column. I am not sure why.
Here is my code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
schools = "https://www.winnipegsd.ca/page/9258/school-directory-a-z"
page = urlopen(schools)
soup = BeautifulSoup(page,features="html.parser")
table_ = soup.find('table')
Name=[]
Address=[]
PostalCode=[]
Phone=[]
Grades=[]
Website=[]
City=[]
Province=[]
for row in table_.findAll("tr"):
    cells = row.findAll('td')
    if len(cells)==6:
        Name.append(cells[1].find(text=True))
        Address.append(cells[4].find(text=True))
        PostalCode.append(cells[4].find(text=True).next_element.getText())
        Phone.append(cells[5].find(text=True).replace('T: ',''))
        Grades.append(cells[2].find(text=True))
        Website.append('https://www.winnipegsd.ca'+cells[1].findAll('a')[0]['href'])
df = pd.DataFrame(Name,columns=['Name'])
df['Street Address']=Address
df['Postal Code']=PostalCode
df['Phone Number']=Phone
df['Grades']=Grades
df['Website']=Website
df.to_csv("file.tsv", sep = "\t",index=False)
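One plausible reason for the blanks, offered as an assumption since the page's markup is not shown here: find(text=True) returns only the first text node in a cell, and .next_element is simply the next node in document order, so a row whose cell is nested slightly differently can yield an empty or unexpected value. A minimal sketch on hypothetical markup (not copied from the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical markup, NOT copied from the real page: both cells look alike
# in a browser, but the second has a <br/> between the street and the span.
html = (
    "<table>"
    "<tr><td>136 Cecil St. <span>R3E 2Y9</span></td></tr>"
    "<tr><td>30 Argyle St.<br/><span>R3B 0H4</span></td></tr>"
    "</table>"
)
cells = BeautifulSoup(html, "html.parser").find_all("td")

# Same pattern as the question: first text node, then whatever node follows it.
# (string=True is the current spelling of the text=True argument.)
fragile = [cell.find(string=True).next_element.get_text() for cell in cells]
print(fragile)  # ['R3E 2Y9', '']

# Collecting every non-empty text node does not depend on the nesting.
robust = [list(cell.stripped_strings) for cell in cells]
print(robust)
```

In the second cell the node after the street text is the empty br tag, so the postal code comes back blank even though it is present in the cell.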
Solution
Try pd.read_html() to extract the data from the table. Then you can do basic .str manipulation:
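from io import StringIO
import requests
import pandas as pd
from bs4 import BeautifulSoup
schools = "https://www.winnipegsd.ca/page/9258/school-directory-a-z"
soup = BeautifulSoup(requests.get(schools).content, "html.parser")
# Wrap the markup in StringIO: recent pandas versions deprecate passing literal HTML strings
df = pd.read_html(StringIO(str(soup)))[0]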
df = df.dropna(how="all", axis=0).drop(columns=["Unnamed: 0", "Unnamed: 3"])
df["Contact"] = (
    df["Contact"]
    .str.replace(r"T:\s*", "", regex=True)
    .str.replace("School Contact Information", "")
    .str.strip()
)
df["Postal Code"] = df["Address"].str.extract(r"(.{3} .{3})$")
df["Website"] = [
    f'https://www.winnipegsd.ca{a["href"]}'
    if "http" not in a["href"]
    else a["href"]
    for a in soup.select("tbody td:nth-child(2) a")
]
print(df.head(10))
df.to_csv("data.csv", index=False)
Prints:
    School Name                           Grades  Address                     Contact       Postal Code  Website
0   Adolescent Parent Centre              9-12    136 Cecil St. R3E 2Y9       204-775-5440  R3E 2Y9      https://www.winnipegsd.ca/AdolescentParentCentre/
1   Andrew Mynarski V.C. School           7-9     1111 Machray Ave. R2X 1H6   204-586-8497  R2X 1H6      https://www.winnipegsd.ca/AndrewMynarski/
2   Argyle Alternative High School        10-12   30 Argyle St. R3B 0H4       204-942-4326  R3B 0H4      https://www.winnipegsd.ca/Argyle/
3   Brock Corydon School                  N-6     1510 Corydon Ave. R3N 0J6   204-488-4422  R3N 0J6      https://www.winnipegsd.ca/BrockCorydon/
4   Carpathia School                      N-6     300 Carpathia Rd. R3N 1T3   204-488-4514  R3N 1T3      https://www.winnipegsd.ca/Carpathia/
5   Champlain School                      N-6     275 Church Ave. R2W 1B9     204-586-5139  R2W 1B9      https://www.winnipegsd.ca/Champlain/
6   Children of the Earth High School     9-12    100 Salter St. R2W 5M1      204-589-6383  R2W 5M1      https://www.winnipegsd.ca/ChildrenOfTheEarth/
7   Collège Churchill High School         7-12    510 Hay St. R3L 2L6         204-474-1301  R3L 2L6      https://www.winnipegsd.ca/Churchill/
8   Clifton School                        N-6     1070 Clifton St. R3E 2T7    204-783-7792  R3E 2T7      https://www.winnipegsd.ca/Clifton/
10  Daniel McIntyre Collegiate Institute  9-12    720 Alverstone St. R3E 2H1  204-783-7131  R3E 2H1      https://www.winnipegsd.ca/DanielMcintyre/
and saves data.csv.
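To get the separate street and postal code columns the question was after, the same .str tools can be pushed one step further. The sketch below runs on a few sample rows rather than the live page; the raw Contact strings and the third row are hypothetical, and the stricter pattern assumes Canadian postal codes of the form A1A 1A1.

```python
import pandas as pd

# Addresses in the first two rows come from the printed output; the raw Contact
# strings and the third row (no postal code) are hypothetical test inputs.
df = pd.DataFrame(
    {
        "Address": ["136 Cecil St. R3E 2Y9", "1111 Machray Ave. R2X 1H6", "100 Main St."],
        "Contact": ["T: 204-775-5440", "School Contact Information T: 204-586-8497", None],
    }
)

# Assumed postal code shape: letter-digit-letter, space, digit-letter-digit.
POSTAL = r"[A-Z]\d[A-Z] \d[A-Z]\d"

df["Contact"] = (
    df["Contact"]
    .str.replace(r"T:\s*", "", regex=True)
    .str.replace("School Contact Information", "", regex=False)
    .str.strip()
)
df["Postal Code"] = df["Address"].str.extract(rf"({POSTAL})\s*$", expand=False)
df["Street Address"] = df["Address"].str.replace(rf"\s*{POSTAL}\s*$", "", regex=True)

# For comparison: the looser pattern from the answer matches any "xxx xxx" tail.
loose = df["Address"].str.extract(r"(.{3} .{3})$", expand=False)

print(df[["Street Address", "Postal Code", "Contact"]])
```

On the hypothetical third row the looser pattern returns "ain St." while the stricter one leaves the cell empty, which makes genuinely missing postal codes easier to spot.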
Answered By - Andrej Kesely