Sunday, November 20, 2022

[FIXED] Parse html tables from emails to lists then convert to pandas dataframe


Issue

I’m an absolute beginner in Python, and I am trying to create a script that loops through an email folder, grabs the HTML table from each email, and converts it to a pandas DataFrame for export to Excel.

The code below loops through the folder and appends each table and its contents to a list:

# importing the libraries
import pandas as pd
import win32com.client
from bs4 import BeautifulSoup


# connect to outlook email inbox
outlook = win32com.client.Dispatch("Outlook.Application")
mapi= outlook.GetNamespace("MAPI")

inbox = mapi.Folders['emailaddress'].Folders['Inbox'].Folders['Testfolder']
Mail_Messages = inbox.Items


# loop through email folder, search for a table in each email message and add it to a list

output = []
for mail in Mail_Messages:
    body = mail.HTMLBody
    html_body = BeautifulSoup(body,"lxml")
    html_tables = html_body.find('table')
    
    # read html table to dataframe list

    df = pd.read_html(str(html_tables))
    #pd.concat(df).set_index(0).T
    df= df[0].set_index(0).T
    df.reset_index(level=None, drop=True, inplace=False, col_level=0, col_fill='')
    output.append(df)

    #print (df)
    
print(output)

[0 Column1 Column2 Column3 Column4 Column5 Column6 Column7 Column8
1  Value1  Value2  Value3  Value4  Value5  Value6  Value7  Value8, 0 Column1 Column2 Column3 Column4 Column5 Column6 Column7 Column8
1  Value1  Value2  Value3  Value4  Value5  Value6  Value7  Value8, 0 Column1 Column2 Column3 Column4 Column5 Column6 Column7 Column8
1  Value1  Value2  Value3  Value4  Value5  Value6  Value7  Value8]

What I’m trying to achieve is for every table in the email folder to be added as a new row of a single table, so that it finally ends up looking something like this:

Column1 Column2 Column3 Column4 Column5 Column6 Column7 Column8
Value1  Value2  Value3  Value4  Value5  Value6  Value7  Value8

I’m struggling to get the list that is created into a structure that can be converted into a single pandas DataFrame and exported to Excel.

As I’ve said, I’m a beginner and am looking for some help on how this could be achieved.

Here's a screenshot of the table I'm trying to grab.

Solution

If you had a list of DataFrames (called df here) that looked like

[         0        1
 0  Column1  Value1a
 1  Column2  Value2a
 2  Column3  Value3a
 3  Column4  Value4a
 4  Column5  Value5a
 5  Column6  Value6a
 6  Column7  Value7a
 7  Column8  Value8a,
          0        1
 0  Column1  Value1b
 1  Column2  Value2b
 2  Column3  Value3b
 3  Column4  Value4b
 4  Column5  Value5b
 5  Column6  Value6b
 6  Column7  Value7b
 7  Column8  Value8b,
          0        1
 0  Column1  Value1c
 1  Column2  Value2c
 2  Column3  Value3c
 3  Column4  Value4c
 4  Column5  Value5c
 5  Column6  Value6c
 6  Column7  Value7c
 7  Column8  Value8c]

then

pd.concat([d.set_index(d.columns[0]) for d in df], axis='columns', ignore_index=True).T

would return a single DataFrame

index	Column1	Column2	Column3	Column4	Column5	Column6	Column7	Column8
0	Value1a	Value2a	Value3a	Value4a	Value5a	Value6a	Value7a	Value8a
1	Value1b	Value2b	Value3b	Value4b	Value5b	Value6b	Value7b	Value8b
2	Value1c	Value2c	Value3c	Value4c	Value5c	Value6c	Value7c	Value8c
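To make the first case concrete, here is a minimal sketch that builds three small key/value tables by hand and combines them with the same expression. The list name frames and the sample values are placeholders standing in for what pd.read_html would return from real emails.

```python
import pandas as pd

# Hypothetical sample data: three two-column tables, keys in column 0, values in column 1.
keys = [f"Column{i}" for i in range(1, 9)]
frames = [
    pd.DataFrame({0: keys, 1: [f"Value{i}{suffix}" for i in range(1, 9)]})
    for suffix in "abc"
]

# Use the key column as the index, line the tables up side by side,
# then transpose so that each original table becomes one row.
combined = pd.concat(
    [d.set_index(d.columns[0]) for d in frames],
    axis="columns",
    ignore_index=True,
).T

# The old index name (0) is carried over as the columns' name; clear it for tidier output.
combined.columns.name = None

print(combined)
```

Because set_index turns the key column into the index, the side-by-side concat aligns values on those keys rather than on row position.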

But if df was instead oriented as

[         0        1        2        3        4        5        6        7
 0  Column1  Column2  Column3  Column4  Column5  Column6  Column7  Column8
 1  Value1a  Value2a  Value3a  Value4a  Value5a  Value6a  Value7a  Value8a,
          0        1        2        3        4        5        6        7
 0  Column1  Column2  Column3  Column4  Column5  Column6  Column7  Column8
 1  Value1b  Value2b  Value3b  Value4b  Value5b  Value6b  Value7b  Value8b,
          0        1        2        3        4        5        6        7
 0  Column1  Column2  Column3  Column4  Column5  Column6  Column7  Column8
 1  Value1c  Value2c  Value3c  Value4c  Value5c  Value6c  Value7c  Value8c]

then

pd.concat([d.rename(columns=d.iloc[0]).drop(d.index[0]) for d in df], ignore_index=True)

would return the same combined DataFrame.

If you leave out ignore_index=True, the rows keep the index labels they had before combining, i.e. 1, 1, 1 instead of 0, 1, 2.
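The second case, and the effect of ignore_index, can be checked with a small sketch along the same lines. The helper name promote_header and the sample values are made up for illustration; only the rename/drop/concat calls come from the answer above.

```python
import pandas as pd

# Hypothetical sample data: each table has the header in row 0 and the values in row 1.
headers = [f"Column{i}" for i in range(1, 9)]
frames = [
    pd.DataFrame([headers, [f"Value{i}{suffix}" for i in range(1, 9)]])
    for suffix in "abc"
]


def promote_header(d):
    """Use the first row as column labels, then drop that row."""
    return d.rename(columns=d.iloc[0]).drop(d.index[0])


promoted = [promote_header(d) for d in frames]

renumbered = pd.concat(promoted, ignore_index=True)  # rows relabelled 0, 1, 2
kept = pd.concat(promoted)                           # rows keep their old label, 1

print(renumbered.index.tolist())
print(kept.index.tolist())
```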

ADDED EDIT:

If you just want the tables from all the messages in one combined DataFrame, this should do:

pd.concat([mdf.set_index(mdf.columns[0]) for mdf in [
    (pd.read_html(str(mtable))[0] if mtable else None) for mtable in 
    [BeautifulSoup(m.HTMLBody, "lxml").find('table') for m in Mail_Messages] 
] if mdf is not None], axis='columns', ignore_index=True).T

but if you want/need the loop for anything else, then you can also do

output = []
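for mail in Mail_Messages:
    html_tables = BeautifulSoup(mail.HTMLBody, "lxml").find('table')
    if html_tables is None:
        # skip messages without a table; pd.read_html would raise an error on them
        continue

    # read_html returns a list of DataFrames; take the first (only) one
    # (newer pandas versions expect io.StringIO(str(html_tables)) here)
    df = pd.read_html(str(html_tables))[0]
    output.append(df.set_index(df.columns[0]))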

    ## WHATEVER ELSE YOU NEED TO DO IN THE LOOP ##

output = pd.concat(output, axis='columns', ignore_index=True).T
print(output)

and that should print something like

0 Column1 Column2 Column3 Column4 Column5 Column6 Column7 Column8
0  Value1  Value2  Value3  Value4  Value5  Value6  Value7  Value8
1  Value1  Value2  Value3  Value4  Value5  Value6  Value7  Value8
2  Value1  Value2  Value3  Value4  Value5  Value6  Value7  Value8
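Since the question's end goal is an Excel file, the combined DataFrame could then be written out roughly as follows. This is only a sketch: the helper name export_table, the file name tables.xlsx and the sample frame are assumptions, and the CSV fallback is a convenience added here for machines without an Excel writer engine such as openpyxl installed.

```python
import tempfile
from pathlib import Path

import pandas as pd


def export_table(df, path):
    """Write df to an Excel file; fall back to CSV if no Excel engine is installed."""
    path = Path(path)
    try:
        df.to_excel(path, index=False)  # needs an engine such as openpyxl
        return path
    except ImportError:
        fallback = path.with_suffix(".csv")
        df.to_csv(fallback, index=False)
        return fallback


# Hypothetical stand-in for the combined DataFrame produced above.
combined = pd.DataFrame(
    [[f"Value{i}" for i in range(1, 9)] for _ in range(3)],
    columns=[f"Column{i}" for i in range(1, 9)],
)

out_dir = Path(tempfile.mkdtemp())  # temporary folder, used only for this demo
written = export_table(combined, out_dir / "tables.xlsx")
print(written.name)
```

Passing index=False leaves the 0, 1, 2 row labels out of the file; drop that argument if you want them as a first column.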

Answered By - Driftr95

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
