Issue
My data contains some foreign characters and other Unicode characters, and I'm trying to remove them to clean the data. For example, the current string values look like the Before column, and the results should look like the After column.
Before                  After
Students Num #          Student Num
无差异()\nLocation      Location
/\nCity                 City
异\nPercent             Percent
I've tried the following code and of course it only eliminates "\n".
df['After'] = df['Before'].str.replace(r'[^\x00-\x7F]+', '').str.strip('\n')
I tried to add other strings like '()\n' in the str.strip argument but it didn't work. How do I modify my code to get rid of all the weird unicode strings?
Thanks.
Solution
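You might find that stripping every character that is not alphanumeric, whitespace, or a backslash achieves what you want. Pass regex=True explicitly, because pandas 2.0 and later treat the pattern as a literal string by default:
df['After'] = df['Before'].str.replace(r'[^A-Za-z0-9\s\\]+', '', regex=True).str.strip()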
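To check the replacement above against the sample values from the question, here is a minimal sketch. It assumes the "\n" shown in the Before column is a real newline character rather than a literal backslash followed by "n", and the small DataFrame is built by hand purely for illustration.

```python
import pandas as pd

# Sample values copied from the question; "\n" is assumed to be a real newline.
df = pd.DataFrame(
    {"Before": ["Students Num #", "无差异()\nLocation", "/\nCity", "异\nPercent"]}
)

# Drop every run of characters that is not a letter, digit, whitespace,
# or backslash, then trim the leading/trailing whitespace (including "\n").
# regex=True is explicit because pandas 2.0+ defaults to a literal match.
df["After"] = (
    df["Before"]
    .str.replace(r"[^A-Za-z0-9\s\\]+", "", regex=True)
    .str.strip()
)

print(df["After"].tolist())
# ['Students Num', 'Location', 'City', 'Percent']
```

Note that str.strip() only trims the ends of each string, so a newline in the middle of a value would survive this cleanup. The character class also keeps backslashes, so if the data holds a literal backslash followed by "n", those two characters would remain in the result.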
Answered By - Tim Biegeleisen