Issue
I am using `pandas` to operate on dataframes and `logging` to log intermediate results, as well as warnings and errors, into a separate log file. I also need to print a few intermediate dataframes into the same log file. Specifically, I want to:
- Print dataframes into the same log file as the rest of the `logging` messages (to ensure easier debugging and avoid writing many intermediate files, as would be the case with calls to `to_csv` with a file destination),
- Control logging verbosity (as is commonly done) using `logging` levels, such as `DEBUG` or `INFO`, sharing this with the verbosity of other logging messages (including those that are not related to dataframes),
- Control logging verbosity (on a finer level) using a separate variable that determines how many rows of the dataframe to print,
- Pretty-print one row per line, with aligned columns, and with each row preceded by the typical logging metadata, such as `240102 10:58:20 INFO:`.
The best I could come up with is the code below, which is a bit too verbose. Is there a simpler and more pythonic way to log a dataframe slice?
Note: Please include an example of usage.
Example:
import io
import logging

import pandas as pd

# Print into log this many lines of several intermediate dataframes,
# set to 20 or so:
MAX_NUM_DF_LOG_LINES = 4

logging.basicConfig(
    datefmt='%y%m%d %H:%M:%S',
    format='%(asctime)s %(levelname)s: %(message)s')
logger = logging.getLogger(__name__)
# Or logging.DEBUG, etc:
logger.setLevel(level=logging.INFO)

# Example of a simple log message:
logger.info('Reading input.')

TESTDATA = """
enzyme   regions   N  length
AaaI     all      10  238045
AaaI     all      20  170393
AaaI     captured 10  292735
AaaI     captured 20  229824
AagI     all      10   88337
AagI     all      20   19144
AagI     captured 10   34463
AagI     captured 20   19220
"""
df = pd.read_csv(io.StringIO(TESTDATA), sep=r'\s+')

# ...some code...

# Example of a log message with a chunk of a dataframe, here, using
# `head` (but this can be another method that slices a dataframe):
logger.debug('less important intermediate results: df:')
for line in df.head(MAX_NUM_DF_LOG_LINES).to_string().splitlines():
    logger.debug(line)

# ...more code...

logger.info('more important intermediate results: df:')
for line in df.head(MAX_NUM_DF_LOG_LINES).to_string().splitlines():
    logger.info(line)

# ...more code...
Prints:
240102 10:58:20 INFO: Reading input.
240102 10:58:20 INFO: more important intermediate results: df:
240102 10:58:20 INFO: enzyme regions N length
240102 10:58:20 INFO: 0 AaaI all 10 238045
240102 10:58:20 INFO: 1 AaaI all 20 170393
240102 10:58:20 INFO: 2 AaaI captured 10 292735
240102 10:58:20 INFO: 3 AaaI captured 20 229824
Related:
None of this accomplishes what I try to do, but it is getting closer:
- How to print multiline logs using python logging module?
  - See this comment, which is neat, but not very pythonic, as it calls `print` from inside a list comprehension and then discards the result: "Do note that the latter only works on py2 due to map being lazy; you can do `[logger.info(line) for line in 'line 1\nline 2\nline 3'.splitlines()]` on py3. – Kyuuhachi, Jun 22, 2021 at 16:30".
  - Also, the accepted answer by Qeek has issues: (a) it lacks the functionality to dynamically define the max number of dataframe rows to write into the log (define this once per script, not once per call to the logger); and (b) it has no examples of usage, so it is unclear.
- Write or log print output of pandas Dataframe - this prints something like the following, that is, it is missing the timestamp + logging level metadata at the beginning of each line:
240102 12:27:19 INFO: dataframe head - enzyme regions N length
0 AaaI all 10 238045
1 AaaI all 20 170393
2 AaaI captured 10 292735
...
- How to log a data-frame to an output file - same as the previous answer.
Solution
What you are looking for is a custom `Formatter`. Using one is more Pythonic: it provides better flexibility and code readability, as it is part of Python's `logging` system.
import logging

import pandas as pd


class DataFrameFormatter(logging.Formatter):
    def __init__(self, fmt: str, n_rows: int = 4) -> None:
        self.n_rows = n_rows
        super().__init__(fmt)

    def format(self, record: logging.LogRecord) -> str:
        if isinstance(record.msg, pd.DataFrame):
            s = ''
            # A per-call `extra={'n_rows': ...}` overrides the default;
            # use a local variable so the override does not stick for
            # later calls:
            n_rows = getattr(record, 'n_rows', self.n_rows)
            lines = record.msg.head(n_rows).to_string().splitlines()
            if hasattr(record, 'header'):
                # Emit an optional header line before the dataframe rows:
                record.msg = record.header.strip()
                s += super().format(record) + '\n'
            # Format each dataframe row as its own log line, so every
            # row gets the usual timestamp/level prefix:
            for line in lines:
                record.msg = line
                s += super().format(record) + '\n'
            return s.strip()
        return super().format(record)
formatter = DataFrameFormatter('%(asctime)s %(levelname)-8s %(message)s', n_rows=4)
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)

ch = logging.StreamHandler()
ch.setFormatter(formatter)
logger.addHandler(ch)

df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'bb': [10, 20, 30, 40, 50]})

logger.info(df, extra={'header': "this is a header line"})
logger.debug('foo')
logger.info(df, extra={'n_rows': 2})
This will produce the following log:
2024-01-09 15:09:53,384 INFO this is a header line
2024-01-09 15:09:53,384 INFO a bb
2024-01-09 15:09:53,384 INFO 0 1 10
2024-01-09 15:09:53,384 INFO 1 2 20
2024-01-09 15:09:53,384 INFO 2 3 30
2024-01-09 15:09:53,384 INFO 3 4 40
2024-01-09 15:09:53,385 DEBUG foo
2024-01-09 15:09:53,385 INFO a bb
2024-01-09 15:09:53,385 INFO 0 1 10
2024-01-09 15:09:53,385 INFO 1 2 20
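Note that the timestamps above use `asctime`'s default format. To get the `240102 10:58:20 INFO:` style from the question, one option (a minimal sketch, not part of the original answer) is to set the formatter's `datefmt` attribute, which `logging.Formatter` consults when formatting the record's time:
formatter = DataFrameFormatter('%(asctime)s %(levelname)s: %(message)s', n_rows=4)
# logging.Formatter stores the date format in self.datefmt; setting it
# here avoids changing DataFrameFormatter's __init__ signature:
formatter.datefmt = '%y%m%d %H:%M:%S'
ch.setFormatter(formatter)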
This way you can easily control `header` and `n_rows` through the `extra` argument. If you do not provide them, the default values will be used.
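And since the question asks for the dataframes to land in a separate log file, the `StreamHandler` above can be replaced with (or complemented by) a `FileHandler`; a minimal sketch, where the filename `pipeline.log` is just an example:
fh = logging.FileHandler('pipeline.log')  # example filename
fh.setFormatter(formatter)
logger.addHandler(fh)

# Dataframe rows now go to pipeline.log with the same per-line metadata:
logger.info(df, extra={'header': 'df snapshot', 'n_rows': 3})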
Answered By - igrinis