Issue
I'm a bit stuck on the following: I'm trying to take values from a pandas dataframe, modify them, and then put them into a new dataframe or np.array.
In particular, the dataframe df1 looks like this:
1. 0 0 ... 0.5 0.5 .. 0
2. 0 0 ... 0 1 .. 0
3. 0.5 0 ... 0 0.5 .. 0
...
i.e. I have a lot of zero entries except from some non-zero entries that sum up to one.
What I want to do is take each row (vector), replace the zero entries with values drawn from a uniform distribution between some low value and the minimum of the non-zero entries, and then append the result to a new dataframe or numpy array.
The result, which we can call df2, should look like this:
1. 0.22 0.15 ... 0.5 0.5 .. 0.004
2. 0.7 0.654 ... 0.0567 1 .. 0.45
3. 0.5 0.432 ... 0.354 0.5 .. 0.0432
...
I'm trying with the following code:
arr = np.array([[]])
for j in range(len(df1)):
    for i in range(103):  # 103 is the length of these vectors
        if df1.iloc[j][i] == 0:
            arr = np.append([np.random.uniform(low=0.01, high=df1.iloc[j][3:].min()), arr])
        else:
            arr[j][i] = df1.iloc[j][i]
What I get is the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-256-141abfd58de1> in <module>
      3 for j in range(len(data)):
      4     for i in range(103):
----> 5         if data.iloc[j][i] == 0:
      6             arr=np.append([np.random.uniform(low=0.01, high=data.iloc[j][3:].min()), arr])
      7         else:

~\anaconda3\lib\site-packages\pymatgen\core\composition.py in __eq__(self, other)
    167         # in the elmap, so checking len enables us to only check one
    168         # compositions elements
--> 169         if len(self) != len(other):
    170             return False
    171         for el, v in self.items():
TypeError: object of type 'int' has no len()
Many thanks,
James
Solution
First, let’s make a df1 with 10 rows and 103 columns that has mostly zeros and whose rows all sum to 1:
>>> df1 = pd.DataFrame({r: {val: np.random.randint(20) for val in np.random.choice(np.arange(103), np.random.randint(2, 5))} for r in range(10)}).T
>>> df1 = df1.div(df1.sum(axis='columns'), axis='index').reindex(columns=np.arange(103)).fillna(0)
Let’s check what we did by looking at the data, summing rows, and counting zeros per row:
>>> df1
0 1 2 3 4 5 ... 97 98 99 100 101 102
0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000
1 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000
2 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000
3 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000
4 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000
5 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.1 0.0 0.0 0.0 0.0 0.475
6 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000
7 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000
8 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000
9 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000
[10 rows x 103 columns]
>>> df1.sum(axis='columns')
0 1.0
1 1.0
2 1.0
3 1.0
4 1.0
5 1.0
6 1.0
7 1.0
8 1.0
9 1.0
dtype: float64
>>> df1.ne(0).sum(axis='columns').astype(int)
0 3
1 2
2 3
3 2
4 3
5 4
6 4
7 3
8 3
9 3
dtype: int64
So this respects your specifications for df1; now we can start working.
First, let’s mask all zeros, so we have a dataframe from which to extract the minimum non-zero value of each row:
>>> df1_nz = df1.mask(df1.eq(0))
>>> df1_nz.min(axis='columns')
0 0.282051
1 0.210526
2 0.181818
3 0.464286
4 0.272727
5 0.100000
6 0.068182
7 0.185185
8 0.050000
9 0.222222
dtype: float64
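The masking step works because mask replaces matching cells with NaN, and min skips NaNs by default. A tiny standalone check (the toy frame s here is made up for illustration):

```python
import pandas as pd

# toy frame: mask the zeros so min() only sees the non-zero entries
s = pd.DataFrame({"a": [0.0, 0.5], "b": [0.25, 0.0]})
s_nz = s.mask(s.eq(0))          # zeros become NaN
print(s_nz.min(axis="columns"))  # per-row minimum over non-zero entries
```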
Now from these per-row minima we can call np.random.uniform
once per row to get a dataframe full of random values, and use those random values to fill in df1 wherever it is zero:
>>> random_vals = pd.DataFrame({
... r: np.random.uniform(0.01, n, 103) for r, n in df1_nz.min(axis='columns').items()
... }, index=df1.columns).T
>>> df2 = df1_nz.fillna(random_vals)
>>> df2
0 1 2 3 ... 99 100 101 102
0 0.274312 0.119229 0.200223 0.126925 ... 0.250511 0.076387 0.262691 0.091327
1 0.178858 0.032533 0.171083 0.187775 ... 0.104859 0.141225 0.145604 0.024747
2 0.149279 0.095146 0.067775 0.074993 ... 0.167393 0.109034 0.082226 0.146610
3 0.101093 0.391821 0.266622 0.336723 ... 0.126007 0.438758 0.321557 0.339710
4 0.037873 0.250409 0.123596 0.152685 ... 0.086009 0.190996 0.086574 0.253784
5 0.051473 0.032933 0.085726 0.064984 ... 0.064354 0.050978 0.086429 0.475000
6 0.043807 0.021605 0.049259 0.060036 ... 0.043379 0.052804 0.039904 0.044067
7 0.033173 0.030694 0.178263 0.042904 ... 0.183436 0.019724 0.024167 0.074844
8 0.019714 0.019226 0.028672 0.046260 ... 0.023111 0.042002 0.028637 0.018817
9 0.137686 0.101749 0.127393 0.026675 ... 0.083874 0.197242 0.170042 0.143624
[10 rows x 103 columns]
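The reason df1_nz.fillna(random_vals) slots the random numbers into exactly the right cells is that fillna with a DataFrame argument aligns on both index and column labels. A minimal sketch with made-up toy frames a and b:

```python
import numpy as np
import pandas as pd

# a has holes (NaN); b supplies candidate values with the same labels
a = pd.DataFrame({"x": [np.nan, 2.0], "y": [3.0, np.nan]})
b = pd.DataFrame({"x": [9.0, 8.0], "y": [7.0, 6.0]})

# only the NaN cells of a are taken from b; existing values are kept
filled = a.fillna(b)
print(filled)
```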
If we filter df2 on the locations where df1 is non-zero, we can see those values are unchanged:
>>> df2.where(df1.ne(0)).stack()
0 56 0.410256
58 0.307692
77 0.282051
1 13 0.210526
77 0.789474
2 25 0.181818
51 0.636364
92 0.181818
3 19 0.535714
74 0.464286
4 18 0.454545
33 0.272727
91 0.272727
5 38 0.200000
54 0.225000
97 0.100000
102 0.475000
6 7 0.409091
12 0.068182
30 0.250000
73 0.272727
7 18 0.518519
57 0.185185
69 0.296296
8 7 0.050000
40 0.250000
90 0.700000
9 20 0.259259
38 0.518519
89 0.222222
dtype: float64
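As an aside, if you prefer to skip the per-row construction entirely, the same idea can be expressed with NumPy broadcasting, since np.random draws accept a per-row array as the high bound. A hedged alternative sketch (the small df1 and all names here are illustrative, not from the question):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# small stand-in for df1: mostly zeros, each row sums to 1
df1 = pd.DataFrame([[0.0, 0.5, 0.5, 0.0],
                    [0.2, 0.0, 0.0, 0.8]])

arr = df1.to_numpy()
# per-row minimum over the non-zero entries only
row_min = np.where(arr > 0, arr, np.inf).min(axis=1)
# one uniform draw per cell; the high bound broadcasts row by row
rand = rng.uniform(0.01, row_min[:, None], size=arr.shape)
# keep the non-zero entries, replace the zeros with the random draws
df2 = pd.DataFrame(np.where(arr == 0, rand, arr),
                   index=df1.index, columns=df1.columns)
print(df2)
```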
You didn’t explain the [3:], so I’ll ignore it, but you can reintroduce it in this method with df1_nz = df1.mask(…)[df1.columns[3:]].
Answered By - Cimbali