Thursday, January 11, 2024

[FIXED] How to fill a pandas dataframe from two 2d numpy arrays in an efficient way?


Issue

import pandas as pd
import numpy as np

s = [ "S" + str(i) for i in range(1,101)]
c = [ "C" + str(i) for i in range(1,51)]

# shape (len(s), len(c)) so that arr1[i, j] pairs s[i] with c[j]
arr1 = np.random.randn(len(s), len(c))
arr2 = np.random.randn(len(s), len(c))

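How can I create and fill a pandas DataFrame df with 100 * 50 = 5000 rows, one for each possible (s, c) pair, such that arr1_col holds arr1[s, c] and arr2_col holds arr2[s, c]? In pseudocode (not valid as written, since the arrays cannot be indexed with lists of labels):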

df = pd.DataFrame({'S':s, 'C':c, 'arr1_col':arr1[s,c] ,  'arr2_col':arr2[s,c]})

Solution

Assuming this 4x3 reproducible input:

import pandas as pd
import numpy as np

s = [ "S" + str(i) for i in range(1,4+1)]
c = [ "C" + str(i) for i in range(1,3+1)]

arr1 = np.arange(len(c)*len(s)).reshape(len(s), len(c))
arr2 = np.arange(len(c)*len(s)).reshape(len(s), len(c))*10

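I imagine that you want to use numpy.repeat, numpy.tile and numpy.ravel: repeat each S label once per C label, tile the C labels once per S label, and flatten both arrays row by row so the values line up with the label pairs: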

# numpy_1
df = pd.DataFrame({'S': np.repeat(s, len(c)),
                   'C': np.tile(c, len(s)),
                   'arr1': arr1.ravel(),
                   'arr2': arr2.ravel(),
                  })

Or with concat and stack:

# pandas_1
df = (pd.concat({'arr1': pd.DataFrame(arr1, index=s, columns=c),
                 'arr2': pd.DataFrame(arr2, index=s, columns=c),},
                axis=1)
        .stack().rename_axis(['S', 'C']).reset_index()
     )

Output:

     S   C  arr1  arr2
0   S1  C1     0     0
1   S1  C2     1    10
2   S1  C3     2    20
3   S2  C1     3    30
4   S2  C2     4    40
5   S2  C3     5    50
6   S3  C1     6    60
7   S3  C2     7    70
8   S3  C3     8    80
9   S4  C1     9    90
10  S4  C2    10   100
11  S4  C3    11   110
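As a quick sanity check, the sketch below rebuilds the numpy_1 frame and confirms that every row's values match the arrays at the position given by its (S, C) labels. The lookup dictionaries s_pos and c_pos and the flag labels_match are names introduced here for illustration only; they are not part of the original answer.

```python
import numpy as np
import pandas as pd

s = ["S" + str(i) for i in range(1, 4 + 1)]
c = ["C" + str(i) for i in range(1, 3 + 1)]
arr1 = np.arange(len(c) * len(s)).reshape(len(s), len(c))
arr2 = arr1 * 10

# numpy_1 approach from above
df = pd.DataFrame({'S': np.repeat(s, len(c)),
                   'C': np.tile(c, len(s)),
                   'arr1': arr1.ravel(),
                   'arr2': arr2.ravel()})

# Hypothetical helpers: map each label back to its integer position
s_pos = {label: i for i, label in enumerate(s)}
c_pos = {label: j for j, label in enumerate(c)}
rows = df['S'].map(s_pos).to_numpy()
cols = df['C'].map(c_pos).to_numpy()

# Every row of df should hold arr1[i, j] and arr2[i, j] for its (S, C) pair
labels_match = bool((arr1[rows, cols] == df['arr1'].to_numpy()).all()
                    and (arr2[rows, cols] == df['arr2'].to_numpy()).all())
print(labels_match)
```

If the arrays were instead shaped (len(c), len(s)), the repeat and tile calls would need to be swapped, as in the C-first variant.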

Ordering C first

# numpy_2
df = pd.DataFrame({'S': np.tile(s, len(c)),
                   'C': np.repeat(c, len(s)),
                   'arr1': arr1.ravel(order='F'),
                   'arr2': arr2.ravel(order='F'),
                  })

Or:

# pandas_2
df = (pd.concat({'arr1': pd.DataFrame(arr1, index=s, columns=c),
                 'arr2': pd.DataFrame(arr2, index=s, columns=c),},
                axis=0)
        .T.stack().rename_axis(['C', 'S']).reset_index()
     )

Output (shown for numpy_2; pandas_2 returns the same rows, with the label columns in the order C, S):

     S   C  arr1  arr2
0   S1  C1     0     0
1   S2  C1     3    30
2   S3  C1     6    60
3   S4  C1     9    90
4   S1  C2     1    10
5   S2  C2     4    40
6   S3  C2     7    70
7   S4  C2    10   100
8   S1  C3     2    20
9   S2  C3     5    50
10  S3  C3     8    80
11  S4  C3    11   110
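The C-first ordering relies on ravel(order='F'), which walks down each column instead of along each row. A small illustration on the same 4x3 layout (the variable names are chosen here for the example):

```python
import numpy as np

arr = np.arange(12).reshape(4, 3)

row_major = arr.ravel()            # default order='C': row by row
col_major = arr.ravel(order='F')   # column by column

# Flattening column by column is the same as flattening the transpose row by row
same_as_transpose = bool(np.array_equal(col_major, arr.T.ravel()))

print(row_major.tolist())  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
print(col_major.tolist())  # [0, 3, 6, 9, 1, 4, 7, 10, 2, 5, 8, 11]
```

The column-major sequence matches the arr1 column of the C-first output above, which is why tile and repeat swap roles for the labels.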

Comparison of timings (figure not available in this copy)
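To compare the approaches on your own machine, a minimal timeit sketch could look like the following. It uses the 100x50 sizes from the question and wraps the numpy_1 and pandas_1 snippets in functions; the repetition count and the variable names t_numpy and t_pandas are arbitrary choices made here, and no particular result is implied.

```python
import timeit

import numpy as np
import pandas as pd

s = ["S" + str(i) for i in range(1, 101)]
c = ["C" + str(i) for i in range(1, 51)]
arr1 = np.random.randn(len(s), len(c))
arr2 = np.random.randn(len(s), len(c))


def numpy_1():
    # repeat/tile the labels and flatten the arrays row by row
    return pd.DataFrame({'S': np.repeat(s, len(c)),
                         'C': np.tile(c, len(s)),
                         'arr1': arr1.ravel(),
                         'arr2': arr2.ravel()})


def pandas_1():
    # build labelled frames, then reshape wide to long with stack
    return (pd.concat({'arr1': pd.DataFrame(arr1, index=s, columns=c),
                       'arr2': pd.DataFrame(arr2, index=s, columns=c)},
                      axis=1)
              .stack().rename_axis(['S', 'C']).reset_index())


n = 20  # arbitrary number of repetitions
t_numpy = timeit.timeit(numpy_1, number=n) / n
t_pandas = timeit.timeit(pandas_1, number=n) / n
print(f"numpy_1:  {t_numpy * 1e3:.3f} ms per call")
print(f"pandas_1: {t_pandas * 1e3:.3f} ms per call")
```

Timings depend on array size, pandas and NumPy versions, and hardware, so it is worth rerunning this with shapes close to your real data. Recent pandas versions may also emit a FutureWarning for stack, which does not affect the measurement.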

Answered By - mozway

This answer, collected from Stack Overflow and tested by the PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
