Issue
I’m an absolute Beginner in Python , and I am trying to create a script which loops through an email folder and grabs a html table within the emails and convert to a pandas dataframe for export to excel.
The code below loops through the folder and adds each table and its contents to a list []
# importing the libraries
import pandas as pd
import win32com.client
from bs4 import BeautifulSoup
# connect to outlook email inbox
outlook = win32com.client.Dispatch("Outlook.Application")
mapi= outlook.GetNamespace("MAPI")
inbox = mapi.Folders['emailaddress'].Folders['Inbox'].Folders['Testfolder']
Mail_Messages = inbox.Items
# loop through email folder and seach for table in email messages and add to a list
output = []
for mail in Mail_Messages:
body = mail.HTMLBody
html_body = BeautifulSoup(body,"lxml")
html_tables = html_body.find('table')
# read html table to dataframe list
df = pd.read_html(str(html_tables))
#pd.concat(df).set_index(0).T
df= df[0].set_index(0).T
df.reset_index(level=None, drop=True, inplace=False, col_level=0, col_fill='')
output.append(df)
#print (df)
print(output)
[0 Column1 Column2 Column3 Column4 Column5 Column6 Column7 Column8
1 Value1 Value2 Value3 Value4 Value5 Value6 Value7 Value8, 0 Column1 Column2 Column3 Column4 Column5 Column6 Column7 Column8
1 Value1 Value2 Value3 Value4 Value5 Value6 Value7 Value8, 0 Column1 Column2 Column3 Column4 Column5 Column6 Column7 Column8
1 Value1 Value2 Value3 Value4 Value5 Value6 Value7 Value8]
What I’m trying to achieve is, for every table in the email folder add the table to a new row
So it would finally end up something like this
Column1 Column2 Column3 Column4 Column5 Column6 Column7 Column8
Value1 Value2 Value3 Value4 Value5 Value6 Value7 Value8
I’m struggling to get the lists that are created into a structure that could be converted into a pandas dataframe to be exported to excel.
As I’ve said I’m a beginner , and looking some help on how this could be achieved.
Here's a screenshot of the table i'm tring to grab..
Solution
If you had a list of dataframes (df
) that looked like
[ 0 1
0 Column1 Value1a
1 Column2 Value2a
2 Column3 Value3a
3 Column4 Value4a
4 Column5 Value5a
5 Column6 Value6a
6 Column7 Value7a
7 Column8 Value8a,
0 1
0 Column1 Value1b
1 Column2 Value2b
2 Column3 Value3b
3 Column4 Value4b
4 Column5 Value5b
5 Column6 Value6b
6 Column7 Value7b
7 Column8 Value8b,
0 1
0 Column1 Value1c
1 Column2 Value2c
2 Column3 Value3c
3 Column4 Value4c
4 Column5 Value5c
5 Column6 Value6c
6 Column7 Value7c
7 Column8 Value8c]
then pd.concat([d.set_index(d.columns[0]) for d in df], axis='columns', ignore_index=True).T
would return a single Dataframe
index | Column1 | Column2 | Column3 | Column4 | Column5 | Column6 | Column7 | Column8 |
---|---|---|---|---|---|---|---|---|
0 | Value1a | Value2a | Value3a | Value4a | Value5a | Value6a | Value7a | Value8a |
1 | Value1b | Value2b | Value3b | Value4b | Value5b | Value6b | Value7b | Value8b |
2 | Value1c | Value2c | Value3c | Value4c | Value5c | Value6c | Value7c | Value8c |
But if df
was instead oriented as
[ 0 1 2 3 4 5 6 7
0 Column1 Column2 Column3 Column4 Column5 Column6 Column7 Column8
1 Value1a Value2a Value3a Value4a Value5a Value6a Value7a Value8a,
0 1 2 3 4 5 6 7
0 Column1 Column2 Column3 Column4 Column5 Column6 Column7 Column8
1 Value1b Value2b Value3b Value4b Value5b Value6b Value7b Value8b,
0 1 2 3 4 5 6 7
0 Column1 Column2 Column3 Column4 Column5 Column6 Column7 Column8
1 Value1c Value2c Value3c Value4c Value5c Value6c Value7c Value8c]
then
pd.concat([d.rename(columns=d.iloc[0]).drop(d.index[0]) for d in df], ignore_index=True)
would return the same combined DataFrame.
If you leave out ignore_index=True
then the rows will have the same indexes from before combining; i.e., 1,1,1 instead of 0,1,2.
ADDED EDIT:
If you just want the tables from each message in one combined Dataframe, this should do:
pd.concat([mdf.set_index(mdf.columns[0]) for mdf in [
(pd.read_html(str(mtable))[0] if mtable else None) for mtable in
[BeautifulSoup(m.HTMLBody, "lxml").find('table') for m in Mail_Messages]
] if mdf is not None], axis='columns', ignore_index=True).T
but if you want/need the loop for anything else, then you can also do
output = []
for mail in Mail_Messages:
html_tables = BeautifulSoup(mail.HTMLBody, "lxml").find('table')
# read html table to dataframe list
df = pd.read_html(str(html_tables))[0]
output.append(df.set_index(df.columns[0]))
## WHATEVER ELSE YOU NEED TO DO IN THE LOOP ##
output = pd.concat(output, axis='columns', ignore_index=True).T
print(output)
and that should print something like
0 Column1 Column2 Column3 Column4 Column5 Column6 Column7 Column8
0 Value1 Value2 Value3 Value4 Value5 Value6 Value7 Value8
1 Value1 Value2 Value3 Value4 Value5 Value6 Value7 Value8
2 Value1 Value2 Value3 Value4 Value5 Value6 Value7 Value8
Answered By - Driftr95
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.