Issue
I have a pandas DataFrame as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame({'text':['this is the good student','she wears a beautiful green dress','he is from a friendly family of four','the house is empty','the number four five is new'],
'labels':['O,O,O,ADJ,O','O,O,O,ADJ,ADJ,O','O,O,O,O,ADJ,O,O,NUM','O,O,O,O','O,O,NUM,NUM,O,O']})
I would like to prefix the first ADJ or NUM in each consecutive run with 'B-' (including when it is not repeated right after), and prefix any immediate repetition with 'I-'. Here is my desired output,
output:
                                   text                   labels
0              this is the good student            O,O,O,B-ADJ,O
1     she wears a beautiful green dress      O,O,O,B-ADJ,I-ADJ,O
2  he is from a friendly family of four  O,O,O,O,B-ADJ,O,O,B-NUM
3                    the house is empty                  O,O,O,O
4           the number four five is new      O,O,B-NUM,I-NUM,O,O
So far I have created a list of the unique labels like this:
unique_labels = (np.unique(sum(df["labels"].str.split(',').dropna().to_numpy(), []))).tolist()
unique_labels.remove('O') # no changes required for O label
Then I tried to add the B label first, which raised an error (ValueError: Must have equal len keys and value when setting with an iterable):
for x in unique_labels:
    df.loc[df["labels"].str.contains(x), "labels"]= ['B-' + x for x in df["labels"]]
Solution
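Before the fix, it may help to see where the ValueError in the attempt above most likely comes from. The following is a minimal sketch, not part of the original answer; the names mask and values are introduced here for illustration only. The boolean mask selects only some rows, while the list on the right-hand side holds one value for every row of the frame, so the two lengths disagree.

```python
import pandas as pd

df = pd.DataFrame({"labels": ["O,O,O,ADJ,O", "O,O,O,ADJ,ADJ,O",
                              "O,O,O,O,ADJ,O,O,NUM", "O,O,O,O",
                              "O,O,NUM,NUM,O,O"]})

# Rows the assignment targets (illustrative name, not from the question).
mask = df["labels"].str.contains("ADJ")
# One value for every row of the frame, not only the masked rows.
values = ["B-" + s for s in df["labels"]]

print(mask.sum(), len(values))  # 3 5
```

Even with matching lengths, this would prefix the whole comma-separated string rather than each individual label, which is why the labels need to be processed one at a time.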
Try:
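from itertools import groupby

def fn(x):
    out = []
    # groupby yields one (label, run) pair per stretch of consecutive equal labels
    for k, g in groupby(map(str.strip, x.split(","))):
        if k == "O":
            # "O" labels are kept unchanged
            out.extend(g)
        else:
            # first label of the run gets "B-", the rest of the run gets "I-"
            out.append(f"B-{next(g)}")
            out.extend([f"I-{val}" for val in g])
    return ",".join(out)

df["labels"] = df["labels"].apply(fn)
print(df)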
Prints:
                                   text                   labels
0              this is the good student            O,O,O,B-ADJ,O
1     she wears a beautiful green dress      O,O,O,B-ADJ,I-ADJ,O
2  he is from a friendly family of four  O,O,O,O,B-ADJ,O,O,B-NUM
3                    the house is empty                  O,O,O,O
4           the number four five is new      O,O,B-NUM,I-NUM,O,O
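The approach relies on itertools.groupby grouping only consecutive equal items, so each run of the same label becomes one group. As a small illustration (the variable names labels and runs are chosen here, not taken from the answer), this sketch lists the runs for one of the label strings:

```python
from itertools import groupby

labels = "O,O,NUM,NUM,O,O".split(",")

# Each (key, run) pair covers one stretch of consecutive equal labels;
# the same label appearing again later starts a new group.
runs = [(key, len(list(run))) for key, run in groupby(labels)]

print(runs)  # [('O', 2), ('NUM', 2), ('O', 2)]
```

Because a label that reappears after a gap starts a new group, it would get a fresh 'B-' prefix rather than 'I-'.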
Answered By - Andrej Kesely