Issue
The following question is a simplification of this: Iterating through lists within a pandas DataFrame
I have a DataFrame which contains a column of lists:
import numpy as np
import pandas as pd
col = [["A", "B", "C", "D"], ["E", "F"]]
d = {"col" : [["A", "B", "C", "D"], ["E", "F"]]}
df = pd.DataFrame(d)
print(df)
Out[2]:
col
0 [A, B, C, D]
1 [E, F]
For each row I want to iterate through the list and select between the following cases:
- first entry of the list (list index = 0): write the first entry of the list in the first column
- entry between first an last entry (list index = i): write the current entry and the entry before that one in a column depending on the current iteration
- last entry of the list (list index = -1): write the current entry and the entry before that one in a column depending on the current iteration AND write the last entry of the list in a column depending on the current iteration + 1
- if the list index i is greater than the length of the list: write np.nan in a column depending the current iteration
The resulting DataFrame should look like this:
col 0 1 2 3 4 5
0 [A, B, C, D] A B-A C-B D-C D np.nan
1 [E, F] E F-E F np.nan np.nan np.nan
To get to this result I tried a nested numpy.where
function:
for i in range(7):
df[i] = pd.DataFrame(np.where(i == 0,
df["col"].apply(lambda x: x[0]),
np.where(i == df["col"].apply(len),
df["col"].apply(lambda x: x[-1]),
np.where((i > 0) & (i <= df["col"].apply(len) - 1),
df["col"].apply(lambda x: x[i]) + '-' + df["col"].apply(lambda x: x[i-1]),
np.nan
)
)
)
)
print(df)
And here is my problem: I get an IndexError: list index out of range
I suppose it has something to do with the i
.
Even if I catch the invalid cased of i
the whole nested where term is not valid. (I tried it with numpy.select
too, but got the same result.)
If I replace the index i
with 1
it works (of cource it will give me the wrong values, but I don't get an error), so it has to have something to do with this index, but I can't figure out how to solve the problem:
for i in range(7):
df[i] = pd.DataFrame(np.where(i == 0,
df["col"].apply(lambda x: x[0]),
np.where(i == df["col"].apply(len),
df["col"].apply(lambda x: x[-1]),
np.where((i > 0) & (i <= df["col"].apply(len) - 1),
df["col"].apply(lambda x: x[1]) + '-' + df["col"].apply(lambda x: x[1-1]),
np.nan
)
)
)
)
print(df)
col 0 1 2 3 4 5
0 [A, B, C, D] A B-A B-A B-A D np.nan
1 [E, F] E F-E F np.nan np.nan np.nan
Can you think of a solution to this or to an alternative way to get my desired DataFrame?
Solution
I would code the logic as a separate function:
from typing import List
def compute_event_transitions(L: List[str]) -> pd.Series:
if len(L) <= 1:
return pd.Series(L)
first = pd.Series(L[0])
last = pd.Series(L[-1])
s1 = pd.Series(L)
s2 = s1.shift(+1)
middle = (
pd.concat([s2, s1], axis='columns')
[1:] # The first element has no "from" transition
.apply(lambda s: "-".join(s.tolist()), axis='columns')
)
transitions = pd.concat([first, middle, last]).reset_index(drop=True)
return transitions
Now you can apply this calculation to each element in your dataframe:
all_transitions = df['col'].apply(compute_event_transitions)
0 1 2 3 4 0 A A-B B-C C-D D 1 E E-F F NaN NaN
Note that it is indexed the same way as your original dataframe, so you can stitch it back to the column of lists:
pd.concat([df, all_transitions], axis='columns')
col 0 1 2 3 4 0 [A, B, C, D] A A-B B-C C-D D 1 [E, F] E E-F F NaN NaN
Answered By - stevepastelan
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.