Issue
I have identified acronyms in my text using a Python regex, and a number of them have an 's' or a '.' at the end. To clean up my text I am building a dictionary. I need any '.' removed from the end of an acronym, any trailing 's' removed from the end of an acronym, and any regular English words removed from the dictionary entirely.
Input Dictionary:
{'ceos': 'CEOs', 'cis': 'CIS', 'ceo': 'CEO', 'cios': 'CIOs', 'cio.': 'CIO.', 'cio': 'CIO','info': 'INFO', 'update': 'UPDATE', 'additional': 'ADDITIONAL', '.': '.', 'kpis': 'KPIs'}
Desired output dictionary:
{'ceos': 'CEO', 'cis': 'CIS', 'ceo': 'CEO', 'cios': 'CIO', 'cio.': 'CIO', 'cio': 'CIO', '.': '', 'kpis': 'KPI'}
How can I achieve this in Python?
Solution
Never mind, I found a solution. It is very long, so I welcome any suggestions to shorten it:
from nltk.corpus import words  # needs a one-time nltk.download('words')

# overall_dict is the input dictionary shown in the question.
# Only lowercase words work for lookups in words.words(), so the lowercase keys are checked.
english_words = set(words.words())  # build the set once; scanning the list on every key is slow

# Lowercase acronyms that are also in the word list and must be kept.
keep = {'ai','it','us','es','coo','lan','ea','aer','coe','eu','bot','sa','ma','roi','pa','dod','doe','cad','ope','soc','aum','mot','da','ae','ca','swot','iso','ba','sla','mou','dit','ist','wa','ram','wog','la','ad','os','sis','sow','lam','sop','bod','pst','ga','mo'}

overall_dict_1 = overall_dict.copy()
for key, value in overall_dict.items():
    # Strip a trailing 's' or '.' from the value.
    if value.endswith(('s', '.')):
        overall_dict_1[key] = value[:-1]
    # Drop the bare '.' entry, and most English words that are not known acronyms.
    if key == '.' or (key not in keep and key in english_words):
        overall_dict_1.pop(key)
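One possible way to shorten this is a single dictionary comprehension. The following is only a sketch: the names clean_acronyms, english_words, keep and sample_words are hypothetical, and the tiny sample_words set stands in for set(words.words()) so the example runs without NLTK. It also strips a possessive 's, which is an assumption about the input, and it keeps the '.' key with an empty value as in the desired output rather than dropping it.

```python
import re

# Matches a trailing "'s", a lowercase "s", or a "." at the end of a value.
_SUFFIX = re.compile(r"(?:'s|s|\.)$")


def clean_acronyms(acronyms, english_words, keep=frozenset()):
    """Return a cleaned copy of an acronym dictionary.

    acronyms:      lowercase key -> acronym as found in the text
    english_words: set of lowercase ordinary words to drop
    keep:          lowercase keys to retain even if they are also words
    """
    return {
        key: _SUFFIX.sub("", value)
        for key, value in acronyms.items()
        if key in keep or key not in english_words
    }


# Small stand-in word list for illustration only; with NLTK installed
# this could be set(nltk.corpus.words.words()) instead.
sample_words = {"info", "update", "additional"}

overall_dict = {
    'ceos': 'CEOs', 'cis': 'CIS', 'ceo': 'CEO', 'cios': 'CIOs',
    'cio.': 'CIO.', 'cio': 'CIO', 'info': 'INFO', 'update': 'UPDATE',
    'additional': 'ADDITIONAL', '.': '.', 'kpis': 'KPIs',
}

cleaned = clean_acronyms(overall_dict, sample_words)
print(cleaned)
```

With the real NLTK word list, short keys that happen to be English words would be dropped unless they are passed in keep, which is what the long exception list in the loop above is for.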
Answered By - Shraddha Avasthy