Issue
I am using `pandas` to operate on dataframes and `logging` to log intermediate results, as well as warnings and errors, into a separate log file. I also need to print a few intermediate dataframes into the same log file. Specifically, I want to:
- Print dataframes into the same log file as the rest of the `logging` messages (to ensure easier debugging and avoid writing many intermediate files, as would be the case with calls to `to_csv` with a file destination),
- Control logging verbosity (as is commonly done) using `logging` levels, such as `DEBUG` or `INFO`, sharing this with the verbosity of other logging messages (including those that are not related to dataframes),
- Control logging verbosity (on a finer level) using a separate variable that determines how many rows of the dataframe to print,
- Pretty-print one row per line, with aligned columns, and with each row preceded by the typical logging metadata, such as `240102 10:58:20 INFO:`.
The best I could come up with is the code below, which is a bit too verbose. Is there a simpler and more pythonic way to log a dataframe slice?
Note: Please include an example of usage.
Example:
import io
import logging

import pandas as pd

# Print into log this many lines of several intermediate dataframes,
# set to 20 or so:
MAX_NUM_DF_LOG_LINES = 4

logging.basicConfig(
    datefmt='%y%m%d %H:%M:%S',
    format='%(asctime)s %(levelname)s: %(message)s')
logger = logging.getLogger(__name__)
# Or logging.DEBUG, etc:
logger.setLevel(level=logging.INFO)

# Example of a simple log message:
logger.info('Reading input.')

TESTDATA = """
enzyme   regions   N  length
AaaI     all      10  238045
AaaI     all      20  170393
AaaI     captured 10  292735
AaaI     captured 20  229824
AagI     all      10   88337
AagI     all      20   19144
AagI     captured 10   34463
AagI     captured 20   19220
"""
df = pd.read_csv(io.StringIO(TESTDATA), sep=r'\s+')

# ...some code...

# Example of a log message with a chunk of a dataframe, here, using
# `head` (but this can be another method that slices a dataframe):
logger.debug('less important intermediate results: df:')
for line in df.head(MAX_NUM_DF_LOG_LINES).to_string().splitlines():
    logger.debug(line)

# ...more code...

logger.info('more important intermediate results: df:')
for line in df.head(MAX_NUM_DF_LOG_LINES).to_string().splitlines():
    logger.info(line)

# ...more code...
Prints:
240102 10:58:20 INFO: Reading input.
240102 10:58:20 INFO: more important intermediate results: df:
240102 10:58:20 INFO: enzyme regions N length
240102 10:58:20 INFO: 0 AaaI all 10 238045
240102 10:58:20 INFO: 1 AaaI all 20 170393
240102 10:58:20 INFO: 2 AaaI captured 10 292735
240102 10:58:20 INFO: 3 AaaI captured 20 229824
Related:
None of this accomplishes what I try to do, but it is getting closer:
- How to print multiline logs using python logging module?
  - See this comment, which is neat, but not very pythonic, as it calls `print` from inside a list comprehension and then discards the result: "Do note that the latter only works on py2 due to map being lazy; you can do `[logger.info(line) for line in 'line 1\nline 2\nline 3'.splitlines()]` on py3. – Kyuuhachi, Jun 22, 2021 at 16:30".
  - Also, the accepted answer by Qeek has issues: (a) it lacks the functionality to dynamically define the max number of dataframe rows to write into the log (define this once per script, not once per call to the logger); and (b) it has no examples of usage, so it is unclear.
- Write or log print output of pandas Dataframe - this prints something like the following, that is, it is missing the timestamp + logging level metadata at the beginning of each line:
240102 12:27:19 INFO: dataframe head - enzyme regions N length
0 AaaI all 10 238045
1 AaaI all 20 170393
2 AaaI captured 10 292735
...
- How to log a data-frame to an output file - same as the previous answer.
Solution
What you are looking for is a custom `Formatter`. Using one is more Pythonic: it provides better flexibility and code readability, as it is part of Python's `logging` system.
import logging

import pandas as pd


class DataFrameFormatter(logging.Formatter):
    def __init__(self, fmt: str, n_rows: int = 4) -> None:
        self.n_rows = n_rows
        super().__init__(fmt)

    def format(self, record: logging.LogRecord) -> str:
        if isinstance(record.msg, pd.DataFrame):
            s = ''
            # A per-call `extra={'n_rows': ...}` overrides the default;
            # use a local variable so the override does not stick for
            # later calls:
            n_rows = getattr(record, 'n_rows', self.n_rows)
            lines = record.msg.head(n_rows).to_string().splitlines()
            if hasattr(record, 'header'):
                # Emit an optional header line before the dataframe rows:
                record.msg = record.header.strip()
                s += super().format(record) + '\n'
            # Format each dataframe row as its own log line, so every
            # row gets the usual timestamp/level prefix:
            for line in lines:
                record.msg = line
                s += super().format(record) + '\n'
            return s.strip()
        return super().format(record)
formatter = DataFrameFormatter('%(asctime)s %(levelname)-8s %(message)s', n_rows=4)
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)

ch = logging.StreamHandler()
ch.setFormatter(formatter)
logger.addHandler(ch)

df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'bb': [10, 20, 30, 40, 50]})

logger.info(df, extra={'header': "this is a header line"})
logger.debug('foo')
logger.info(df, extra={'n_rows': 2})
This will produce the following log:
2024-01-09 15:09:53,384 INFO this is a header line
2024-01-09 15:09:53,384 INFO a bb
2024-01-09 15:09:53,384 INFO 0 1 10
2024-01-09 15:09:53,384 INFO 1 2 20
2024-01-09 15:09:53,384 INFO 2 3 30
2024-01-09 15:09:53,384 INFO 3 4 40
2024-01-09 15:09:53,385 DEBUG foo
2024-01-09 15:09:53,385 INFO a bb
2024-01-09 15:09:53,385 INFO 0 1 10
2024-01-09 15:09:53,385 INFO 1 2 20
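Note that the timestamps above use `asctime`'s default format. To get the `240102 10:58:20 INFO:` style from the question, one option (a minimal sketch, not part of the original answer) is to set the formatter's `datefmt` attribute, which `logging.Formatter` consults when formatting the record's time:
formatter = DataFrameFormatter('%(asctime)s %(levelname)s: %(message)s', n_rows=4)
# logging.Formatter stores the date format in self.datefmt; setting it
# here avoids changing DataFrameFormatter's __init__ signature:
formatter.datefmt = '%y%m%d %H:%M:%S'
ch.setFormatter(formatter)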
This way you can easily control `header` and `n_rows` through the `extra` argument. If you do not provide them, the default values will be used.
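And since the question asks for the dataframes to land in a separate log file, the `StreamHandler` above can be replaced with (or complemented by) a `FileHandler`; a minimal sketch, where the filename `pipeline.log` is just an example:
fh = logging.FileHandler('pipeline.log')  # example filename
fh.setFormatter(formatter)
logger.addHandler(fh)

# Dataframe rows now go to pipeline.log with the same per-line metadata:
logger.info(df, extra={'header': 'df snapshot', 'n_rows': 3})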
Answered By - igrinis