Saturday, October 29, 2022

[FIXED] How can I extract a table from many Excel documents to Pandas?

October 29, 2022 excel, pandas, python No comments

Issue

I have hundreds of similar Excel files, but containing descriptions and other informations too. (see Excel Screenshot) I would like to extract the tables or their data only for further works. I managed an iterative way to but it is inaccurate, and error prone to even a cell of difference, and in some cases there are differences. So is there any way to process this some dynamic way, or manual labour?

from pathlib import Path
import pandas as pd
import numpy as np
import glob
import os

path = r'/Users/.../datafolder'[enter image description here][1]


all_files = glob.glob(os.path.join(path, "*.xlsx"))

#define empty lists to store the values

East_England = []
London = []
Midlands = []
North_East = []
North_West = []
South_East = []
South_West = []

for f in all_files:
    
    # read the csv file
    df = pd.read_excel(f)

    # read and append the specific cells to a list
    East_England.append(df.iloc[14,3])
    London.append(df.iloc[15,3])
    Midlands.append(df.iloc[16,3])
    North_East.append(df.iloc[17,3])
    North_West.append(df.iloc[18,3])
    South_East.append(df.iloc[19,3])
    South_West.append(df.iloc[20,3])
display(East_England)   
display(Midlands)
display(North_East)
display(North_West)
display(South_East)
display(South_West)

Solution

It seems that the common denominator is that the tables you want to extract have border styles. So you could open the file with openpyxl first, and filter out the cells that have border styles, then pass it to pandas.

from openpyxl import load_workbook
import pandas as pd
import requests
from io import BytesIO

url = 'https://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2021/05/COVID-19-daily-announced-vaccinations-09-May-2021.xlsx'
r = requests.get(url)

wb = load_workbook(filename=BytesIO(r.content))
ws = wb[wb.sheetnames[0]] # select first sheet

#iterate cells, set value of cells with no left border to None
for row in ws.iter_rows():
    for cell in row:
        if not cell.border.left.style:
            try:
                cell.value = None
            except:
                continue

df = pd.DataFrame(ws.values).dropna(how='all', axis=0).dropna(how='all', axis=1).reset_index(drop=True) # load into pandas, drop empty rows and columns
df = df.T.set_index(0).T # set first row as header

Output:

	NHS Region of Residence	1st dose	2nd dose	Cumulative Total Doses to Date
1	Total	29578216	14871208	44449424
2	East of England	3541361	1773421	5314782
3	London	3734565	1769567	5504132
4	Midlands	5657588	2798749	8456337
5	North East and Yorkshire	4659848	2396705	7056553
6	North West	3744975	1975801	5720776
7	South East	4837856	2412643	7250499
8	South West	3209877	1699091	4908968

Answered By - RJ Adriaansen

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Saturday, October 29, 2022

[FIXED] How can I extract a table from many Excel documents to Pandas?

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels