Issue
I've got a dataframe with one column called "text" where each row is a series of strings, i.e. [Joe, Biden, Is, President]
I'm trying to drop every row that contains the word "Joe" in the column "text". To do this I wrote:
dfl[~dfl['text'].str.contains("Joe", na=False)]
I thought this would work, but it's just returning the full dataframe again.
I would also like to create a new dataframe with just the rows that contain "Joe" in the text column. Any help with that would be appreciated too!
To clarify, the table looks like:
index | label | score | text | ID |
---|---|---|---|---|
0 | NEGATIVE | 0.983319103717804 | perrosloja,Expresoec,Es,del,partido,social,cristiano,la,lista,6,SenateGOP,SenateDems,JoeBiden,SenBobCasey,AsambleaEcuador,OEADDOT,ONUGeneve,ONUecuador,NoticiasONU,ilo,ONUes,Hay,2,bandos,de,mafias,izquierda,la,derecha,mafias,polc3adticas,ActualidadRT,TelemundoNews,CNNEE,soyfdelrincon,httpstcoYQnZsBKKdF | 0 |
1 | NEGATIVE | 0.990364134311676 | MolinaPvanya,JoeBiden,httpstcojAsJ08durF | 1 |
2 | NEGATIVE | 0.8683468103408813 | Iowa4Nikki,Whoes,Best,Person,PresidentnnDonald,Trump,Robert,F,Kennedy,Vivek,Ramaswamy,rest,presidential,candidates,except,Joe,Biden,good,candidates,due,respect,none,best,best,one,job,United,States,needse280a6 | 2 |
3 | POSITIVE | 0.999308705329895 | amazing,JoeBiden,calls,Dick,f09fa4a3,httpstcopsV0uqG8aL | 3 |
4 | NEGATIVE | 0.7860859036445618 | ChrisDJackson,Whoes,Best,Person,PresidentnnDonald,Trump,Robert,F,Kennedy,Vivek,Ramaswamy,rest,presidential,candidates,except,Joe,Biden,good,candidates,due,respect,none,best,best,one,job,United,States,needse280a6 | 4 |
5 | NEGATIVE | 0.9982330799102783 | PalBint,JoeBiden,much,blow,around,fake,White,House,probably,none | 5 |
6 | POSITIVE | 0.842793345451355 | thehill,e2809cJoeBiden,got,81million,votes,us,presidential,historye2809d,Even,Barack,Hussein,nnSo,wtf,Bolsheviks,afraid,actual,democracy,nf09fa4a3f09fa4a3f09fa4a3f09fa4a3f09fa4a3 | 6 |
7 | NEGATIVE | 0.998753547668457 | tfbow,JimJordan,Weaponization,HouseGOP,JudiciaryGOP,HouseDemocrats,SenateGOP,SenateDems,MaineSenateGOP,2020,Election,Lawfully,CertifiablenJoe,Biden,Win,2020,ElectionnnThe,2022,AZ,Gubernatorial,Election,CertifiablenKatie,Hobbs,Win,2022,ElectionnnThe,Brunsons,Heroes,Others,Follow,LitigationnnDry,WeepyEye | 7 |
8 | NEGATIVE | 0.9963979721069336 | JoeBiden,Hows,youre,boss,httpstcogfuDElJdG1 | 8 |
9 | NEGATIVE | 0.9973702430725098 | mitchellvii,Joe,Biden,committed,treason,intentionally,ensuring,stream,illegal,immigrants,remains,unhindered,least,take,ballot | 9 |
10 | NEGATIVE | 0.9728578925132751 | mirandadevine,DavidHo71155831,JoeBiden,BarackObama,AliMayorkas,SecBlinken,belong,prison | 10 |
Here is the output of df.loc[0, 'text']:
['perrosloja', 'Expresoec', 'Es', 'del', 'partido', 'social', 'cristiano', 'la', 'lista', '6', 'SenateGOP', 'SenateDems', 'JoeBiden', 'SenBobCasey', 'AsambleaEcuador', 'OEADDOT', 'ONUGeneve', 'ONUecuador', 'NoticiasONU', 'ilo', 'ONUes', 'Hay', '2', 'bandos', 'de', 'mafias', 'izquierda', 'la', 'derecha', 'mafias', 'polc3adticas', 'ActualidadRT', 'TelemundoNews', 'CNNEE', 'soyfdelrincon', 'httpstcoYQnZsBKKdF']
So I guess yes, it is a list of strings.
Solution
IIUC, each row contains a list of words. You can try:
m = df.loc[df['text'].notna(), 'text'].map(' '.join).str.contains('Joe', case=False)
joe = df.loc[m[m].index]
The code above joins the words of each row into a single sentence, then uses str.contains
to find 'Joe' (case-insensitive). Because this is a substring match, tokens such as 'JoeBiden' match as well.
>>> df['text'].map(' '.join)
0 Joe Biden Sent Spinning Federal Court
1 Election Cycle
2 Donald Trump running next year
Name: text, dtype: object
>>> df['text'].map(' '.join).str.contains('Joe', case=False)
0 True
1 False
2 False
Name: text, dtype: bool
Output:
>>> joe
text
0 [Joe, Biden, Sent, Spinning, Federal, Court]
Details:
Input:
>>> df
text
0 [Joe, Biden, Sent, Spinning, Federal, Court]
1 [Election, Cycle]
2 [Donald, Trump, running, next, year]
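As for why the original attempt returned the full dataframe: as far as I can tell, the `.str` methods only operate on string elements, so on a column of lists `str.contains` yields NaN for every row. With `na=False` those become False, and negating the mask keeps everything. A minimal sketch on the same toy input as above (the names `raw`, `mask` and `kept` are just for illustration):

```python
import pandas as pd

# Same toy input as in the answer: each cell holds a list of words.
df = pd.DataFrame({
    "text": [
        ["Joe", "Biden", "Sent", "Spinning", "Federal", "Court"],
        ["Election", "Cycle"],
        ["Donald", "Trump", "running", "next", "year"],
    ]
})

# The elements are lists, not strings, so every result is NaN.
raw = df["text"].str.contains("Joe")

# na=False turns those NaN values into False ...
mask = df["text"].str.contains("Joe", na=False)

# ... so ~mask is True everywhere and no row is dropped.
kept = df[~mask]
print(len(kept) == len(df))
```

Joining the lists into strings first, as in the answer, gives `str.contains` actual strings to search.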
Update
If you don't have NaN values in the text column and you want to create two dataframes, one for each case, you can use this code:
m = df['text'].map(' '.join).str.contains('Joe', case=False)
joe = df[m]
oth = df[~m]
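If the column might contain missing values after all, one possible variant is to build the mask with a small helper that treats anything that is not a list as "no match". This is only a sketch: the helper name `mentions` and the sample rows are made up for illustration, and it does the same case-insensitive substring test as the code above.

```python
import pandas as pd

# Hypothetical sample data, including one missing value.
df = pd.DataFrame({
    "text": [
        ["Joe", "Biden", "committed"],
        None,
        ["Election", "Cycle"],
        ["amazing", "JoeBiden", "calls"],
    ]
})

def mentions(words, needle="joe"):
    """Return True if the joined words contain `needle`, ignoring case."""
    if not isinstance(words, list):
        return False  # NaN / None rows never match
    return needle in " ".join(words).lower()

m = df["text"].map(mentions)
joe = df[m]    # rows mentioning "Joe" (including "JoeBiden")
oth = df[~m]   # everything else, including rows with missing text
```

Here `joe` keeps rows 0 and 3, and `oth` keeps rows 1 and 2.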
Answered By - Corralien