Tuesday, December 7, 2021

[FIXED] Pandas: How to easily share a sample dataframe using df.to_dict()?

December 07, 2021 pandas, plotly, plotly-python, python

Issue

Despite the clear guidance on How do I ask a good question? and How to create a Minimal, Reproducible Example, many people neglect to include a reproducible data sample in their question. So what is a practical and easy way to reproduce a data sample when a simple pd.DataFrame(np.random.random(size=(5, 5))) is not enough? How can you, for example, use df.to_dict() and include the output in a question?

Solution

The answer:

In many situations, using an approach with df.to_dict() will do the job perfectly! Here are two cases that come to mind:

Case 1: You've got a dataframe built or loaded in Python from a local source

Case 2: You've got a table in another application (like Excel)

The details:

Case 1: You've got a dataframe built or loaded from a local source

Given that you've got a pandas dataframe named df, just

run df.to_dict() in your console or editor, and
copy the output that is formatted as a dictionary, and
paste the content into pd.DataFrame(<output>) and include that chunk in your now reproducible code snippet.
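
As a minimal sketch of these steps, assuming a small hypothetical dataframe (the column names and values are invented for illustration), the round trip looks like this:

```python
import pandas as pd

# Hypothetical stand-in for your real dataframe
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Step 1: produce the dictionary you would copy from the console
as_dict = df.to_dict()
print(as_dict)

# Steps 2-3: paste the printed output into pd.DataFrame(...)
rebuilt = pd.DataFrame({"a": {0: 1, 1: 2, 2: 3},
                        "b": {0: "x", 1: "y", 2: "z"}})
```

Anyone who copies the last statement gets the same columns, index, and values as the original df.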

Case 2: You've got a table in another application (like Excel)

Depending on the source and its separator (for example ',', ';', or '\\s+', where the last one matches any run of whitespace), you can simply:

Ctrl+C the contents
run df=pd.read_clipboard(sep='\\s+') in your console or editor, and
run df.to_dict(), and
include the output in df=pd.DataFrame(<output>)
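
pd.read_clipboard() needs a real clipboard, so the sketch below stands in for it by passing the same kind of whitespace-separated text to pd.read_csv(), which is where read_clipboard() sends the clipboard text. The table contents are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical text, as it might look after copying a small table
copied = """name score passed
ann 0.9 True
bob 0.4 False
"""

# Same separator as pd.read_clipboard(sep='\\s+'); the raw string
# r"\s+" is the same regex without the doubled backslash
df = pd.read_csv(io.StringIO(copied), sep=r"\s+")

# The dictionary you would paste into pd.DataFrame(<output>)
sample = df.to_dict()
print(sample)
```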

In this case, the start of your question would look something like this:

import pandas as pd
df = pd.DataFrame({0: {0: 0.25474768796402636, 1: 0.5792136563952824, 2: 0.5950396800676201},
                   1: {0: 0.9071073567355232, 1: 0.1657288354283053, 2: 0.4962367707789421},
                   2: {0: 0.7440601352930207, 1: 0.7755487356392468, 2: 0.5230707257648775}})

Of course, this gets a little clumsy with larger dataframes. But very often, all that anyone who seeks to answer your question needs is a small sample of your real-world data, so that they can take its structure into consideration.

And there are two ways you can handle larger dataframes:

run df.head(20).to_dict() to only include the first 20 rows, and
change the format of your dict using, for example, df.to_dict('split') (there are other options besides 'split') to reshape your output to a dict that requires fewer lines.
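
As a rough illustration of the other formats, the sketch below uses a small hypothetical dataframe and two further orient values, 'list' and 'records'. Their output can be passed straight back to pd.DataFrame, assuming the index is the default 0, 1, 2, ... range (neither format stores the index).

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"x": [1, 2, 3], "y": [0.5, 1.5, 2.5]})

# One list per column
as_list = df.to_dict(orient="list")
# {'x': [1, 2, 3], 'y': [0.5, 1.5, 2.5]}

# One dict per row
as_records = df.to_dict(orient="records")
# [{'x': 1, 'y': 0.5}, {'x': 2, 'y': 1.5}, {'x': 3, 'y': 2.5}]

# Both formats are accepted by the DataFrame constructor as-is
from_list = pd.DataFrame(as_list)
from_records = pd.DataFrame(as_records)
```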

Here's an example using the iris dataset, which is available from plotly express, among other places.

If you just run:

import plotly.express as px
import pandas as pd
df = px.data.iris()
df.to_dict()

This will produce an output of nearly 1000 lines, which won't be very practical as a reproducible sample. But if you include .head(10), you'll get:

{'sepal_length': {0: 5.1, 1: 4.9, 2: 4.7, 3: 4.6, 4: 5.0, 5: 5.4, 6: 4.6, 7: 5.0, 8: 4.4, 9: 4.9},
 'sepal_width': {0: 3.5, 1: 3.0, 2: 3.2, 3: 3.1, 4: 3.6, 5: 3.9, 6: 3.4, 7: 3.4, 8: 2.9, 9: 3.1},
 'petal_length': {0: 1.4, 1: 1.4, 2: 1.3, 3: 1.5, 4: 1.4, 5: 1.7, 6: 1.4, 7: 1.5, 8: 1.4, 9: 1.5},
 'petal_width': {0: 0.2, 1: 0.2, 2: 0.2, 3: 0.2, 4: 0.2, 5: 0.4, 6: 0.3, 7: 0.2, 8: 0.2, 9: 0.1},
 'species': {0: 'setosa', 1: 'setosa', 2: 'setosa', 3: 'setosa', 4: 'setosa', 5: 'setosa', 6: 'setosa', 7: 'setosa', 8: 'setosa', 9: 'setosa'},
 'species_id': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1}}

And now we're getting somewhere. But depending on the structure and content of the data, this may not cover the complexity of the contents in a satisfactory manner. You can, however, include more data on fewer lines by using to_dict('split') like this:

import plotly.express as px
df = px.data.iris().head(10)
df.to_dict('split')

Now your output will look like:

{'index': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 'columns': ['sepal_length',
  'sepal_width',
  'petal_length',
  'petal_width',
  'species',
  'species_id'],
 'data': [[5.1, 3.5, 1.4, 0.2, 'setosa', 1],
  [4.9, 3.0, 1.4, 0.2, 'setosa', 1],
  [4.7, 3.2, 1.3, 0.2, 'setosa', 1],
  [4.6, 3.1, 1.5, 0.2, 'setosa', 1],
  [5.0, 3.6, 1.4, 0.2, 'setosa', 1],
  [5.4, 3.9, 1.7, 0.4, 'setosa', 1],
  [4.6, 3.4, 1.4, 0.3, 'setosa', 1],
  [5.0, 3.4, 1.5, 0.2, 'setosa', 1],
  [4.4, 2.9, 1.4, 0.2, 'setosa', 1],
  [4.9, 3.1, 1.5, 0.1, 'setosa', 1]]}

And now you can easily increase the number in .head(10) without cluttering your question too much. There is one minor drawback: you can no longer pass the output directly to pd.DataFrame. But if you specify index, columns, and data separately, you'll be just fine. So for this particular dataset, my preferred approach would be:

import pandas as pd

sample = {'index': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
             'columns': ['sepal_length',
              'sepal_width',
              'petal_length',
              'petal_width',
              'species',
              'species_id'],
             'data': [[5.1, 3.5, 1.4, 0.2, 'setosa', 1],
              [4.9, 3.0, 1.4, 0.2, 'setosa', 1],
              [4.7, 3.2, 1.3, 0.2, 'setosa', 1],
              [4.6, 3.1, 1.5, 0.2, 'setosa', 1],
              [5.0, 3.6, 1.4, 0.2, 'setosa', 1],
              [5.4, 3.9, 1.7, 0.4, 'setosa', 1],
              [4.6, 3.4, 1.4, 0.3, 'setosa', 1],
              [5.0, 3.4, 1.5, 0.2, 'setosa', 1],
              [4.4, 2.9, 1.4, 0.2, 'setosa', 1],
              [4.9, 3.1, 1.5, 0.1, 'setosa', 1],
              [5.4, 3.7, 1.5, 0.2, 'setosa', 1],
              [4.8, 3.4, 1.6, 0.2, 'setosa', 1],
              [4.8, 3.0, 1.4, 0.1, 'setosa', 1],
              [4.3, 3.0, 1.1, 0.1, 'setosa', 1],
              [5.8, 4.0, 1.2, 0.2, 'setosa', 1]]}

df = pd.DataFrame(index=sample['index'], columns=sample['columns'], data=sample['data'])
df

Now you'll have this dataframe to work with:

    sepal_length  sepal_width  petal_length  petal_width species  species_id
0            5.1          3.5           1.4          0.2  setosa           1
1            4.9          3.0           1.4          0.2  setosa           1
2            4.7          3.2           1.3          0.2  setosa           1
3            4.6          3.1           1.5          0.2  setosa           1
4            5.0          3.6           1.4          0.2  setosa           1
5            5.4          3.9           1.7          0.4  setosa           1
6            4.6          3.4           1.4          0.3  setosa           1
7            5.0          3.4           1.5          0.2  setosa           1
8            4.4          2.9           1.4          0.2  setosa           1
9            4.9          3.1           1.5          0.1  setosa           1
10           5.4          3.7           1.5          0.2  setosa           1
11           4.8          3.4           1.6          0.2  setosa           1
12           4.8          3.0           1.4          0.1  setosa           1
13           4.3          3.0           1.1          0.1  setosa           1
14           5.8          4.0           1.2          0.2  setosa           1

This will significantly increase your chances of receiving useful answers!
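
Because the keys of the 'split' dictionary ('index', 'columns', 'data') match keyword arguments of pd.DataFrame, the dictionary can also be unpacked directly. A short sketch, using only the first three rows of the sample to keep it brief:

```python
import pandas as pd

sample = {'index': [0, 1, 2],
          'columns': ['sepal_length', 'sepal_width', 'petal_length',
                      'petal_width', 'species', 'species_id'],
          'data': [[5.1, 3.5, 1.4, 0.2, 'setosa', 1],
                   [4.9, 3.0, 1.4, 0.2, 'setosa', 1],
                   [4.7, 3.2, 1.3, 0.2, 'setosa', 1]]}

# Same as pd.DataFrame(index=sample['index'], columns=sample['columns'],
#                      data=sample['data'])
df = pd.DataFrame(**sample)
print(df)
```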

Edit:

The output of df.to_dict() may contain timestamps like 1: Timestamp('2020-01-02 00:00:00'). Python cannot read these back when the output is pasted into pd.DataFrame() unless you also include from pandas import Timestamp.
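
A minimal sketch of the timestamp case, assuming a hypothetical dataframe with one datetime column. The pasted dictionary only evaluates because Timestamp has been imported:

```python
import pandas as pd
from pandas import Timestamp  # required for the pasted Timestamp(...) entries

# Hypothetical sample with a datetime column
df = pd.DataFrame({"date": pd.to_datetime(["2020-01-01", "2020-01-02"]),
                   "value": [1.5, 2.5]})

# The printed dictionary contains Timestamp('...') entries
print(df.to_dict())

# What a reader would paste; without the import above, this line
# fails with a NameError because Timestamp is undefined
rebuilt = pd.DataFrame(
    {"date": {0: Timestamp("2020-01-01 00:00:00"),
              1: Timestamp("2020-01-02 00:00:00")},
     "value": {0: 1.5, 1: 2.5}})
```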

Answered By - vestland

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0
