Issue
I have hundreds of similar Excel files, but containing descriptions and other informations too. (see Excel Screenshot) I would like to extract the tables or their data only for further works. I managed an iterative way to but it is inaccurate, and error prone to even a cell of difference, and in some cases there are differences. So is there any way to process this some dynamic way, or manual labour?
from pathlib import Path
import pandas as pd
import numpy as np
import glob
import os
path = r'/Users/.../datafolder'[enter image description here][1]
all_files = glob.glob(os.path.join(path, "*.xlsx"))
#define empty lists to store the values
East_England = []
London = []
Midlands = []
North_East = []
North_West = []
South_East = []
South_West = []
for f in all_files:
# read the csv file
df = pd.read_excel(f)
# read and append the specific cells to a list
East_England.append(df.iloc[14,3])
London.append(df.iloc[15,3])
Midlands.append(df.iloc[16,3])
North_East.append(df.iloc[17,3])
North_West.append(df.iloc[18,3])
South_East.append(df.iloc[19,3])
South_West.append(df.iloc[20,3])
display(East_England)
display(Midlands)
display(North_East)
display(North_West)
display(South_East)
display(South_West)
Solution
It seems that the common denominator is that the tables you want to extract have border styles. So you could open the file with openpyxl
first, and filter out the cells that have border styles, then pass it to pandas.
from openpyxl import load_workbook
import pandas as pd
import requests
from io import BytesIO
url = 'https://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2021/05/COVID-19-daily-announced-vaccinations-09-May-2021.xlsx'
r = requests.get(url)
wb = load_workbook(filename=BytesIO(r.content))
ws = wb[wb.sheetnames[0]] # select first sheet
#iterate cells, set value of cells with no left border to None
for row in ws.iter_rows():
for cell in row:
if not cell.border.left.style:
try:
cell.value = None
except:
continue
df = pd.DataFrame(ws.values).dropna(how='all', axis=0).dropna(how='all', axis=1).reset_index(drop=True) # load into pandas, drop empty rows and columns
df = df.T.set_index(0).T # set first row as header
Output:
NHS Region of Residence | 1st dose | 2nd dose | Cumulative Total Doses to Date | |
---|---|---|---|---|
1 | Total | 29578216 | 14871208 | 44449424 |
2 | East of England | 3541361 | 1773421 | 5314782 |
3 | London | 3734565 | 1769567 | 5504132 |
4 | Midlands | 5657588 | 2798749 | 8456337 |
5 | North East and Yorkshire | 4659848 | 2396705 | 7056553 |
6 | North West | 3744975 | 1975801 | 5720776 |
7 | South East | 4837856 | 2412643 | 7250499 |
8 | South West | 3209877 | 1699091 | 4908968 |
Answered By - RJ Adriaansen
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.