Issue
I'm a bit stuck on the following: I'm trying to take values from a pandas dataframe, modify them, and then put them into a new dataframe or np.array.
In particular, the dataframe df1 looks like this:
1. 0 0 ... 0.5 0.5 .. 0
2. 0 0 ... 0 1 .. 0
3. 0.5 0 ... 0 0.5 .. 0
...
i.e. I have a lot of zero entries except from some non-zero entries that sum up to one.
What I want to do is take each row (vector), replace the zero entries with values drawn from a uniform distribution between some low value and the minimum of the non-zero entries, and then append the result to a new dataframe or numpy array.
The result, which we can call df2, should look like this:
1. 0.22 0.15 ... 0.5 0.5 .. 0.004
2. 0.7 0.654 ... 0.0567 1 .. 0.45
3. 0.5 0.432 ... 0.354 0.5 .. 0.0432
...
I'm trying with the following code:
arr = np.array([[]])
for j in range(len(df1)):
    for i in range(103):  # 103 is the length of these vectors
        if df1.iloc[j][i] == 0:
            arr = np.append([np.random.uniform(low=0.01, high=df1.iloc[j][3:].min()), arr])
        else:
            arr[j][i] = df1.iloc[j][i]
What I get is the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-256-141abfd58de1> in <module>
      3 for j in range(len(data)):
      4     for i in range(103):
----> 5         if data.iloc[j][i] == 0:
      6             arr=np.append([np.random.uniform(low=0.01, high=data.iloc[j][3:].min()), arr])
      7         else:

~\anaconda3\lib\site-packages\pymatgen\core\composition.py in __eq__(self, other)
    167         # in the elmap, so checking len enables us to only check one
    168         # compositions elements
--> 169         if len(self) != len(other):
    170             return False
    171         for el, v in self.items():
TypeError: object of type 'int' has no len()
Many thanks,
James
Solution
First, let’s make a df1 with 10 rows and 103 columns that has mostly zeros and whose rows all sum to 1:
>>> df1 = pd.DataFrame({r: {val: np.random.randint(20) for val in np.random.choice(np.arange(103), np.random.randint(2, 5))} for r in range(10)}).T
>>> df1 = df1.div(df1.sum(axis='columns'), axis='index').reindex(columns=np.arange(103)).fillna(0)
Let’s check what we did by looking at the data, summing rows, and counting zeros per row:
>>> df1
0 1 2 3 4 5 ... 97 98 99 100 101 102
0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000
1 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000
2 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000
3 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000
4 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000
5 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.1 0.0 0.0 0.0 0.0 0.475
6 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000
7 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000
8 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000
9 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000
[10 rows x 103 columns]
>>> df1.sum(axis='columns')
0 1.0
1 1.0
2 1.0
3 1.0
4 1.0
5 1.0
6 1.0
7 1.0
8 1.0
9 1.0
dtype: float64
>>> df1.ne(0).sum(axis='columns').astype(int)
0 3
1 2
2 3
3 2
4 3
5 4
6 4
7 3
8 3
9 3
dtype: int64
So this respects your specifications for df1; now we can start working.
First, let’s mask all zeros, so we have a dataframe from which to extract the minimum non-zero value of each row:
>>> df1_nz = df1.mask(df1.eq(0))
>>> df1_nz.min(axis='columns')
0 0.282051
1 0.210526
2 0.181818
3 0.464286
4 0.272727
5 0.100000
6 0.068182
7 0.185185
8 0.050000
9 0.222222
dtype: float64
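The masking step works because mask replaces matching cells with NaN, and min skips NaNs by default. A tiny standalone check (the toy frame s here is made up for illustration):

```python
import pandas as pd

# toy frame: mask the zeros so min() only sees the non-zero entries
s = pd.DataFrame({"a": [0.0, 0.5], "b": [0.25, 0.0]})
s_nz = s.mask(s.eq(0))          # zeros become NaN
print(s_nz.min(axis="columns"))  # per-row minimum over non-zero entries
```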
Now from these per-row minima we can call np.random.uniform
once per row to get a dataframe full of random values, and use those random values to fill in df1 wherever it is zero:
>>> random_vals = pd.DataFrame({
... r: np.random.uniform(0.01, n, 103) for r, n in df1_nz.min(axis='columns').items()
... }, index=df1.columns).T
>>> df2 = df1_nz.fillna(random_vals)
>>> df2
0 1 2 3 ... 99 100 101 102
0 0.274312 0.119229 0.200223 0.126925 ... 0.250511 0.076387 0.262691 0.091327
1 0.178858 0.032533 0.171083 0.187775 ... 0.104859 0.141225 0.145604 0.024747
2 0.149279 0.095146 0.067775 0.074993 ... 0.167393 0.109034 0.082226 0.146610
3 0.101093 0.391821 0.266622 0.336723 ... 0.126007 0.438758 0.321557 0.339710
4 0.037873 0.250409 0.123596 0.152685 ... 0.086009 0.190996 0.086574 0.253784
5 0.051473 0.032933 0.085726 0.064984 ... 0.064354 0.050978 0.086429 0.475000
6 0.043807 0.021605 0.049259 0.060036 ... 0.043379 0.052804 0.039904 0.044067
7 0.033173 0.030694 0.178263 0.042904 ... 0.183436 0.019724 0.024167 0.074844
8 0.019714 0.019226 0.028672 0.046260 ... 0.023111 0.042002 0.028637 0.018817
9 0.137686 0.101749 0.127393 0.026675 ... 0.083874 0.197242 0.170042 0.143624
[10 rows x 103 columns]
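The reason df1_nz.fillna(random_vals) slots the random numbers into exactly the right cells is that fillna with a DataFrame argument aligns on both index and column labels. A minimal sketch with made-up toy frames a and b:

```python
import numpy as np
import pandas as pd

# a has holes (NaN); b supplies candidate values with the same labels
a = pd.DataFrame({"x": [np.nan, 2.0], "y": [3.0, np.nan]})
b = pd.DataFrame({"x": [9.0, 8.0], "y": [7.0, 6.0]})

# only the NaN cells of a are taken from b; existing values are kept
filled = a.fillna(b)
print(filled)
```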
If we filter df2 on the locations where df1 is non-zero, we can see those values are unchanged:
>>> df2.where(df1.ne(0)).stack()
0 56 0.410256
58 0.307692
77 0.282051
1 13 0.210526
77 0.789474
2 25 0.181818
51 0.636364
92 0.181818
3 19 0.535714
74 0.464286
4 18 0.454545
33 0.272727
91 0.272727
5 38 0.200000
54 0.225000
97 0.100000
102 0.475000
6 7 0.409091
12 0.068182
30 0.250000
73 0.272727
7 18 0.518519
57 0.185185
69 0.296296
8 7 0.050000
40 0.250000
90 0.700000
9 20 0.259259
38 0.518519
89 0.222222
dtype: float64
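As an aside, if you prefer to skip the per-row construction entirely, the same idea can be expressed with NumPy broadcasting, since np.random draws accept a per-row array as the high bound. A hedged alternative sketch (the small df1 and all names here are illustrative, not from the question):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# small stand-in for df1: mostly zeros, each row sums to 1
df1 = pd.DataFrame([[0.0, 0.5, 0.5, 0.0],
                    [0.2, 0.0, 0.0, 0.8]])

arr = df1.to_numpy()
# per-row minimum over the non-zero entries only
row_min = np.where(arr > 0, arr, np.inf).min(axis=1)
# one uniform draw per cell; the high bound broadcasts row by row
rand = rng.uniform(0.01, row_min[:, None], size=arr.shape)
# keep the non-zero entries, replace the zeros with the random draws
df2 = pd.DataFrame(np.where(arr == 0, rand, arr),
                   index=df1.index, columns=df1.columns)
print(df2)
```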
You didn’t explain the [3:], so I’ll ignore it, but you can reintroduce it in this method with df1_nz = df1.mask(…)[df1.columns[3:]].
Answered By - Cimbali