Issue
How can I split a column and the data in it? I have attached an example for reference.
Example: The column cement_water, with values such as "three hundred and two;203.0", has to be split into two columns named cement and water, with the values 302.0 and 203.0 respectively. The column values use different delimiters (; , _), which have to be handled, and some values contain numbers written out as words, which have to be converted to numeric values using word2number.
Original columns:
cement_water                       coarse_fine_aggregate
three hundred and two;203.0        974.0,817.0
one hundred and fifty-one;184.4    992.0;815.9
three hundred and sixty-two_164.9  944.7;755.8
These have to be converted into the following:
cement water coarse_aggregate fine_aggregate
302.0 203.0 974.0 817.0
151.0 184.4 992.0 815.9
362.0 164.9 944.7 755.8
import pandas as pd
from word2number import w2n

df = pd.read_csv('test.csv - Sheet1.csv')

def convert_words_to_numbers(text):
    words = text.replace('_', ' ').replace(';', ' ').replace(',', ' ').split()
    converted_words = [str(w2n.word_to_num(word)) if word.isalpha() else word for word in words]
    return ' '.join(converted_words)

df['cement_water'] = df['cement_water'].apply(lambda x: convert_words_to_numbers(x))
df[['cement', 'water']] = df['cement_water'].str.split(' ', expand=True)
df[['coarse_aggregate', 'fine_aggregate']] = df['coarse_fine_aggregate'].str.split(';', expand=True)
df = df.drop(['cement_water', 'coarse_fine_aggregate'], axis=1)
df = df.apply(pd.to_numeric, errors='ignore')
print(df)
Error: No valid number words found! Please enter a valid number word (eg. two million twenty three thousand and forty nine)
Solution
This works for me using this variant:
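import pandas as pd
from word2number import w2n

# Split cement_water at the delimiter (; , or _) that precedes the trailing number,
# and split coarse_fine_aggregate on ; or ,
out = (pd.concat([df['cement_water'].str.extract(r'(?P<cement>.*)[;,_](?P<water>\d+\.?\d*)$'),
                  df['coarse_fine_aggregate'].str.split('[;,]', expand=True)
                    .rename(columns={0: 'coarse_aggregate', 1: 'fine_aggregate'})], axis=1)
         .assign(cement=lambda d: d['cement'].map(w2n.word_to_num))  # whole phrase -> number
         .astype(float)
      )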
Output:
cement water coarse_aggregate fine_aggregate
0 302.0 203.0 974.0 817.0
1 151.0 184.4 992.0 815.9
2 362.0 164.9 944.7 755.8
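As far as I can tell, the code in the question failed because it converted one word at a time, so "and" reached w2n.word_to_num on its own and raised the error. The pattern above keeps the whole phrase together instead. To see what it captures, here is a minimal sketch using only the standard re module on the three sample values. The helper name split_cement_water is made up for illustration and is not part of the answer.

```python
import re

# The answer's pattern; the decimal point is written as \. so that it
# matches only a literal dot.
PATTERN = re.compile(r'(?P<cement>.*)[;,_](?P<water>\d+\.?\d*)$')

samples = [
    'three hundred and two;203.0',
    'one hundred and fifty-one;184.4',
    'three hundred and sixty-two_164.9',
]

def split_cement_water(value):
    """Return (cement_text, water_text), or None if the value does not match."""
    match = PATTERN.match(value)
    if match is None:
        return None
    return match.group('cement'), match.group('water')

for sample in samples:
    # e.g. ('three hundred and two', '203.0') for the first sample
    print(split_cement_water(sample))
```

The cement part comes back as one string, so it can be handed to w2n.word_to_num as a whole phrase.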
More generic code with an additional example
Here you have a mix of strings and numbers in cement_water, so let's first identify the numbers and only parse the strings:
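tmp = df['cement_water'].str.extract(r'(?P<cement>.*)[;,_](?P<water>\d+\.?\d*)$')
# Values that are already numeric convert directly; everything else becomes NaN
s = pd.to_numeric(tmp['cement'], errors='coerce')
# Rows where cement is present but not numeric, i.e. written out as words
m = s.isna() & tmp['cement'].notna()
# Parse only the extracted cement text, not the full cement_water string
tmp.loc[m, 'cement'] = tmp.loc[m, 'cement'].map(w2n.word_to_num)
out = (pd.concat([tmp,
                  df['coarse_fine_aggregate'].str.split('[;,_]', expand=True)
                    .rename(columns={0: 'coarse_aggregate', 1: 'fine_aggregate'})], axis=1)
         .astype(float)
      )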
Output:
cement water coarse_aggregate fine_aggregate
0 200.0 159.2 1043.6 771.9
1 200.0 192.0 965.4 806.2
2 446.0 162.0 967.0 712.0
3 380.0 158.0 903.0 768.0
4 141.0 173.5 882.6 785.3
.. ... ... ... ...
222 200.0 192.0 965.4 806.2
223 270.0 160.6 973.9 875.6
224 150.0 185.7 1040.6 734.3
225 330.0 174.9 944.7 755.8
226 288.0 177.4 907.9 829.5
[227 rows x 4 columns]
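The "numbers first, words second" idea can also be sketched for a single value without pandas. This is only an illustration under stated assumptions: to_float, demo_parser and DEMO_WORDS are hypothetical names, and demo_parser is a stand-in so the sketch runs without word2number installed. In real use you would pass w2n.word_to_num as the parser.

```python
def to_float(text, word_parser):
    """Convert text to float, calling word_parser only when float() fails."""
    try:
        return float(text)
    except ValueError:
        # Not a plain number, so treat it as a number written in words.
        return float(word_parser(text))

# Hypothetical stand-in for w2n.word_to_num (assumption, for illustration only).
DEMO_WORDS = {'three hundred and two': 302, 'one hundred and fifty-one': 151}

def demo_parser(text):
    return DEMO_WORDS[text]

values = ['200.0', 'three hundred and two', '446']
print([to_float(v, demo_parser) for v in values])  # [200.0, 302.0, 446.0]
```

Trying the cheap numeric conversion first means the word parser only ever sees values that really are words, which is what the mask m does for the whole column.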
Answered By - mozway