Issue
I need use pd.read_excel to process every sheet in one excel file.
But in most cases,I did not know the sheet name.
So I use this to judge how many sheet in excel:
i_sheet_count=0
i=0
try:
df.read_excel('/tmp/1.xlsx',sheetname=i)
i_sheet_count+=1
i+=1
else:
i+=1
print(i_sheet_count)
During the process,I found that the process is quite slow,
So,can read_excel only read limited rows to improve the speed?
I tried nrows but did not work..still slow..
Solution
Read all worksheets without guessing
Use sheet_name = None
argument to pd.read_excel
. This will read all worksheets into a dictionary of dataframes. For example:
dfs = pd.read_excel('file.xlsx', sheet_name=None)
# access 'Sheet1' worksheet
res = dfs['Sheet1']
Limit number of rows or columns
You can use parse_cols
and skip_footer
arguments to limit the number of columns and/or rows. This will reduce read time, and also works with sheet_name = None
.
For example, the following will read the first 3 columns and, if your worksheet has 100 rows, it will read only the first 20.
df = pd.read_excel('file.xlsx', sheet_name=None, usecols='A:C', skip_footer=80)
If you wish to apply worksheet-specific logic, you can do so by extracting sheet_names:
sheet_names = pd.ExcelFile('file.xlsx', on_demand=True).sheet_names
dfs = {}
for sheet in sheet_names:
dfs[sheet] = pd.read_excel('file.xlsx', sheet)
Improving performance
Reading Excel files into Pandas is naturally slower than other options (CSV, Pickle, HDF5). If you wish to improve performance, I strongly suggest you consider these other formats.
One option, for example, is to use a VBA script to convert your Excel worksheets to CSV files; then use pd.read_csv
.
Edit 02 Nov: correct sheetname
to sheet_name
Answered By - jpp
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.