Issue
I have a *.csv file with 2 columns and 4 rows of data. I want to delete the rows that contain words which look English but are not, such as Hinglish words (e.g. "kya haal pyare").
Note: typos like "alot" or "seperate" should also be removed, because they are not English words. However, a typo like "from" in place of "form" may be kept, because the mistyped word is still a valid English word.
The data in the *.csv file, with two columns A and B, is given below:
A                                                  B
This is not good so mai yah row hatana chahta hu.  ok
Nice!, kya haal pyare friend                       thik hu
Please help Me                                     Definitely
Google is a comPaNY                                yes it is
Expected Output:
A                    B
Please help Me       Definitely
Google is a comPaNY  yes it is
Solution
I got the correct output, thanks to Tim Biegeleisen:
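import re

import enchant
import pandas as pd

df = pd.read_csv(r'C:\Users\Mini-PC\Desktop\data.csv')
#print(df.head())

# Dictionary used to decide whether a word is English
d = enchant.Dict("en_US")

def all_english(s):
    # Lowercase each word, strip common punctuation, and require
    # every word to pass the dictionary check
    words = s.split()
    return len(words) == sum([d.check(re.sub(r'[!@#$?:;,.]+', '', x.lower())) for x in words])

# Keep only the rows whose column A consists entirely of English words
df = df[df["A"].map(all_english)]
print(df)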
Output:
A                    B
Please help Me       Definitely
Google is a comPaNY  yes it is
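The row check above can also be sketched without any external library, which makes the rule from the note easy to try out. This is only an illustration under assumptions: SAMPLE_WORDS is a tiny hand-written word set standing in for the real dictionary, and the is_word parameter is a hypothetical hook that is not part of the original answer. The sketch also skips tokens that become empty once punctuation is stripped (for example a lone "..."), since, as far as I know, pyenchant refuses to check an empty string.

```python
import re

# Hypothetical stand-in for a real dictionary such as enchant.Dict("en_US").
SAMPLE_WORDS = {
    "this", "is", "not", "good", "so", "ok", "nice", "friend",
    "please", "help", "me", "google", "a", "company", "from", "form",
}

# Same punctuation set as in the answer above.
PUNCT = re.compile(r"[!@#$?:;,.]+")


def in_sample(word):
    """Dictionary lookup against the illustrative word set."""
    return word in SAMPLE_WORDS


def all_english(text, is_word=in_sample):
    """Return True if every cleaned token in text passes is_word."""
    # Lowercase each token and strip punctuation from it.
    cleaned = (PUNCT.sub("", token.lower()) for token in text.split())
    # Ignore tokens that were punctuation only, then check the rest.
    return all(is_word(token) for token in cleaned if token)


samples = [
    "This is not good so mai yah row hatana chahta hu.",
    "Nice!, kya haal pyare",
    "Please help Me",
    "Google is a comPaNY",
]

# Keep only the strings made up entirely of known words.
kept = [s for s in samples if all_english(s)]
print(kept)
```

Because the check is done word by word, "alot" is rejected (it is not in the word set), while "from" typed in place of "form" is accepted, which matches the note in the question. With pyenchant installed, passing is_word=d.check should give the same behaviour as the answer's function, apart from the empty-token handling.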
Answered By - Murari Mahaseth