Sunday, August 7, 2022

[FIXED] Changing dataframe values after regex function problem

pandas, python, regex

Issue

I am trying to build a pipeline for Twitter sentiment analysis. As usual, data preprocessing is a thing...

Based on real tweets, I made a dataframe with only 3 rows (tweets) to experiment with.

What I am trying to do: 1: remove all @, ', http, etc. from the tweet. 2: once that is done, replace the old tweet with the cleaned tweet.

This only partially works: only part of some tweets ends up back in my dataframe. The code does clean up the tweets, but it writes only part of the original tweet back. I think the problem is somewhere in the conversion of the tweet from string to list, but after many hours of trying I am unable to fix it.

The dataframe contents look like this (only the index and one column, Tweet); the tweets are of type string:

Index   Tweet
0       @justanamehere and a sentence here and a link http://www.test.com
1       @Personsname are a fraud and farce, a lying person together with the fake media. Something else Personname? suppose you work with her .. @company1 @company2 #RETWEET https://x.something"
2       @companyx @companyex1 @company3 etc. AS lot of bad words here. It is a cancelculture, these rats want to badword https://x.Something

My code:

def strip_links(text):
            link_regex    = re.compile('((https?):((//)|(\\\\))+([\w\d:#@%/;$()~_?\+-=\\\.&](#!)?)*)', re.DOTALL)
            links         = re.findall(link_regex, text)
            for link in links:
                text = text.replace(link[0], ', ')    
            return text

def strip_all_entities(text):
            entity_prefixes = ['@','#']
            for separator in  string.punctuation:
                if separator not in entity_prefixes :
                    text = text.replace(separator,' ')
            words = []
            for word in text.split():
                word = word.strip()
                if word:
                    if word[0] not in entity_prefixes:
                        words.append(word)
            row['Tweet'] = ' '.join(words)   
                 
            return ' '.join(words)


# The code below is needed because the text in the df has type str. Convert it to a list.

for index, row in df_tweet.iterrows():
  tweet = list(row['Tweet'].split(","))
      
  for t in tweet: 
    strip_all_entities(strip_links(t))

This produces this:

'and a sentence here and a link' 'are a fraud and farce' 'a lying person together with the fake media Something else Personname suppose you work with her' 'etc AS lot of bad words here It is a cancelculture' 'these rats want to badword'

But in df_tweet it shows only this:

    Tweet
0   and a sentence here and a link
1   a lying person together with the fake media So...
2   these rats want to badword

The expected result is:

index   Tweet
0       and a sentence here and a link
1       are a fraud and farce a lying person together with the fake media 
        Something else Personname? suppose you work with her
2       AS lot of bad words here It is a cancelculture these rats want to 
        badword

Thanks for helping me out!! Cheers Jan

Solution

Try this:

df.Tweet = df.Tweet\
    .str.replace(r'[@#]\w*\b', '', regex=True)\
    .str.replace(r'https?://\S+', '', regex=True)\
    .str.replace(r'\s[#@%/;$()~_?\+\-=\\\.&\']+', '', regex=True)\
    .str.strip()

Output:

        Tweet
Index   
0       and a sentence here and a link
1       are a fraud and farce, a lying person together with the fake media. Something else Personname? suppose you work with her
2       etc. AS lot of bad words here. It is a cancelculture, these rats want to badword
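As for why the loop in the question wrote back only part of each tweet, a likely explanation is that it splits every tweet on commas and then stores the cleaned result once per fragment, so the last fragment overwrites the earlier ones. The sketch below illustrates this with plain Python; `strip_all_entities` here is a simplified stand-in for the question's helper (without the `row['Tweet']` assignment), and the `stored` variable is a hypothetical substitute for the dataframe cell.

```python
import string

def strip_all_entities(text):
    """Simplified stand-in for the question's helper: replace punctuation
    (except @ and #) with spaces, then drop words starting with @ or #."""
    entity_prefixes = ['@', '#']
    for separator in string.punctuation:
        if separator not in entity_prefixes:
            text = text.replace(separator, ' ')
    return ' '.join(w for w in text.split() if w[0] not in entity_prefixes)

tweet = "@Personsname are a fraud and farce, a lying person together with the fake media."

# What the loop in the question effectively does: every comma-separated
# fragment overwrites the previously stored value, so only the last survives.
stored = None
for fragment in tweet.split(","):
    stored = strip_all_entities(fragment)

# Cleaning the whole tweet in a single call keeps both halves.
whole = strip_all_entities(tweet)

print(stored)  # a lying person together with the fake media
print(whole)   # are a fraud and farce a lying person together with the fake media
```

This matches the pattern in the question's output: tweets without a comma come back complete, while tweets with a comma keep only the text after the last comma.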

To remove only the non-ASCII characters (for example, non-Western scripts) from the tweets while keeping the tweets themselves:

df.Tweet = df.Tweet\
    .apply(lambda x: ''.join([i if i.isascii() else '' for i in x]))\
    .str.replace(r'[@#]\w*\b', '', regex=True)\
    .str.replace(r'https?://\S+', '', regex=True)\
    .str.replace(r'\s[#@%/;$()~_?\+\-=\\\.&\']+', '', regex=True)\
    .str.strip()
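It may help to see what the `isascii()` test in the snippet above does to individual strings. The sketch below uses a hypothetical helper, `strip_non_ascii`, with the same logic as the lambda, applied to made-up sample strings rather than the question's data. Note that it drops every character outside ASCII, which includes accented Latin letters as well as other scripts.

```python
def strip_non_ascii(text):
    """Keep only ASCII characters (same logic as the lambda above)."""
    return ''.join(ch for ch in text if ch.isascii())

# Made-up samples: an accented letter, two CJK characters, and plain ASCII.
samples = ["caf\u00e9 time", "hello \u4e16\u754c", "plain"]
cleaned = [strip_non_ascii(s) for s in samples]

print(cleaned)  # ['caf time', 'hello ', 'plain']
```

Removed characters can leave stray spaces behind, which is one reason the chain ends with `.str.strip()`. `str.isascii()` needs Python 3.7 or later.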

To delete tweets containing non-ASCII characters altogether:

df.Tweet = df.Tweet\
    .str.replace(r'[@#]\w*\b', '', regex=True)\
    .str.replace(r'https?://\S+', '', regex=True)\
    .str.replace(r'\s[#@%/;$()~_?\+\-=\\\.&\']+', '', regex=True)\
    .str.strip()
df = df[df.Tweet.apply(lambda x: x.isascii())]
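The last line above is ordinary boolean-mask filtering. A minimal sketch on made-up data (the three sample tweets are hypothetical, and pandas is assumed to be installed) shows which rows survive:

```python
import pandas as pd

# Hypothetical sample data, not taken from the question.
df = pd.DataFrame({"Tweet": ["plain ascii tweet",
                             "tweet with emoji \U0001F600",
                             "another one"]})

# Boolean mask: True where every character of the tweet is ASCII.
mask = df.Tweet.apply(lambda x: x.isascii())

# Keep only the rows where the mask is True; index labels are preserved.
df = df[mask]

print(df.index.tolist())  # [0, 2]
```

Because the original index labels are kept, the result can have gaps; `df.reset_index(drop=True)` renumbers the rows if consecutive labels are wanted.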

Answered By - 99_m4n

This answer was collected from Stack Overflow, tested by PythonFixing community admins, and is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
