Issue
Given the following column in a pandas DataFrame:
Name: Hockey Canada; NAICS: 711211
Name: Hockey Canada; NAICS: 711211
Name: International AIDS Society; NAICS: 813212
Name: Rogers Communications Inc; NAICS: 517112, 551112; Name: Hockey Canada; NAICS: 711211
Name: Health Benefits Trust; NAICS: 524114; Name: Hockey Canada; NAICS: 711211; Name: National Equity Fund; NAICS: 523999, 531110
I'd like to extract the NAICS codes from each row (where they exist) in the pandas column. The desired result is shown in the column "expected_result".
711211
711211
813212
517112; 551112; 711211
524114; 711211; 523999; 531110
Some rows contain NaN. Any suggestion using regex and Python would be very helpful. I tried the regex findall function, but I got an error.
I wrote this function:
def find_number(text):
    num = re.findall(r'[0-9]+',text)
    return " ".join(num)
I used it with the apply function like this:
df['NAICS']=df['Company'].apply(lambda x: find_number(x))
I got this error:
KeyError                                  Traceback (most recent call last)
Input In [81], in <cell line: 1>()
----> 1 df['NAICS']=df['Company'].apply(lambda x: find_number(x))
Solution
There's likely some code golfy or more dataframe-friendly way to pull this off, but the overall logic will look something like:
import pandas as pd
import re

NAICSdf = pd.DataFrame([
    'Name: Hockey Canada; NAICS: 711211',
    'Name: Hockey Canada; NAICS: 711211',
    'Name: International AIDS Society; NAICS: 813212',
    'Name: Rogers Communications Inc; NAICS: 517112, 551112; Name: Hockey Canada; NAICS: 711211',
    'Name: Health Benefits Trust; NAICS: 524114; Name: Hockey Canada; NAICS: 711211; Name: National Equity Fund; NAICS: 523999, 531110',
], columns=['organization'])

def findNAICS(organization):
    NAICSList = []
    # Each match looks like 'NAICS: 517112, 551112'; keep only the codes after ': '.
    for found in re.findall(r'NAICS:\s[0-9, ]*', organization):
        for NAICS in found.split(': ')[1].split(', '):
            NAICSList.append(NAICS)
    return '; '.join(NAICSList)

NAICSdf['NAICS'] = NAICSdf['organization'].apply(findNAICS)
print(NAICSdf)
That will create a new column in your dataframe with a semicolon-delimited list of the NAICS codes found in each string.
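The question also mentions NaN in some rows, and `re.findall` raises a `TypeError` when it is given a non-string value. Below is a minimal sketch of one way to handle that, assuming missing cells should stay missing in the new column. The name `find_naics` and the sample frame `df` are hypothetical, chosen only for illustration. Separately, the `KeyError` in the question probably has nothing to do with NaN: pandas raises `KeyError` when `df['Company']` refers to a column that does not exist, so it is worth checking `df.columns` for the exact column name.

```python
import re

import pandas as pd

# Hypothetical sample data with one missing value; the column name
# 'organization' follows the example above.
df = pd.DataFrame(
    {
        'organization': [
            'Name: Hockey Canada; NAICS: 711211',
            float('nan'),
            'Name: Rogers Communications Inc; NAICS: 517112, 551112; Name: Hockey Canada; NAICS: 711211',
        ]
    }
)


def find_naics(text):
    # Non-string cells (such as NaN) are returned as NaN instead of
    # being passed to re.findall, which only accepts strings.
    if not isinstance(text, str):
        return float('nan')
    codes = []
    # Capture the run of digits, commas and spaces after each 'NAICS:' label.
    for group in re.findall(r'NAICS:\s*([0-9, ]+)', text):
        # Pull the individual codes out of that run.
        codes.extend(re.findall(r'[0-9]+', group))
    return '; '.join(codes)


df['NAICS'] = df['organization'].apply(find_naics)
print(df['NAICS'].tolist())
```

A string with no `NAICS:` label yields an empty string here; returning NaN in that case instead would be an equally reasonable choice.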
Answered By - JNevill