Issue
I have a few example dataframes where the text or numeric data I want to extract isn't always in the same column or row, or even in the same order within the strings:
df:
{1: {0: 'sample', 1: 2},
2: {0: 'project: 4568 date: 7 January 2023', 1: 4},
3: {0: 'substance:water', 1: 6}}
df2:
{1: {0: 'sample', 1: 2},
2: {0: 'user ab', 1: 4},
3: {0: 'project: 4568 date: 7 January 2023', 1: 6},
4: {0: 'substance:water', 1: 3}}
df3:
{1: {0: nan, 1: 'sample', 2: 2},
2: {0: 'Monday', 1: 'user ab', 2: 4},
3: {0: nan, 1: 'project: 4568 substance: water date: 7 January 2023', 2: 6},
4: {0: nan, 1: 'plate 2', 2: 3}}
I'd like to extract the numeric value that comes after 'project:' (it always starts with 45 and is always 4 digits), plus the date and the substance, from these dataframes, each into its own variable.
For the project number, I could do this with:
a = df.iloc[0, 1].split(' ')[1]
and for the substance name with:
b = df.iloc[0, 2].split(':')[1]
However, this becomes tedious when the data is in different columns and rows and the split delimiter has to be changed for each dataframe.
Is there a way to extract these substrings without having to specify the columns and rows? And how do I extract a numeric substring from a string?
Solution
I'd create a custom function that searches a pd.Series for string values and then uses regular expressions to find the correct items, e.g.:
import re

def find_project_and_date(series):
    for v in series:
        if not isinstance(v, str):
            continue
        project = re.search(r"project:\s*(45\d{2})", v)
        date = re.search(r"date:\s*(.*?\d{4})\b", v)
        if project and date:
            yield project.group(1), date.group(1)

# find first occurrence of `project` and `date`
out = next(m for c in df.columns if (m := next(find_project_and_date(df[c]), None)))
print(out)
Prints:
('4568', '7 January 2023')
Answered By - Andrej Kesely
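The answer above returns the project number and date, and only when both sit in the same string. The question also asks for the substance, which can be in a different cell (as in df and df2). One possible extension, not part of the original answer, is to search for each field independently across all cells. The sketch below is only an illustration: the names `PATTERNS` and `extract_fields` are hypothetical, and the date and substance patterns assume a "7 January 2023" style date and a single-word substance, as in the examples shown.

```python
import re

# Hypothetical patterns, one per field. Each field is searched independently,
# so the fields may sit in different cells and appear in any order.
PATTERNS = {
    "project": re.compile(r"project:\s*(45\d{2})\b"),
    # Assumes dates look like "7 January 2023" (day, month name, year).
    "date": re.compile(r"date:\s*(\d{1,2}\s+[A-Za-z]+\s+\d{4})"),
    # Assumes the substance is a single word.
    "substance": re.compile(r"substance:\s*(\w+)"),
}

def extract_fields(values):
    """Return the first match for each field found in an iterable of cell values."""
    found = {}
    for v in values:
        if not isinstance(v, str):  # skip numbers and NaN
            continue
        for name, pattern in PATTERNS.items():
            if name not in found:
                m = pattern.search(v)
                if m:
                    found[name] = m.group(1)
    return found

# Cell values of df3 from the question, flattened column by column.
cells = [
    float("nan"), "sample", 2,
    "Monday", "user ab", 4,
    float("nan"), "project: 4568 substance: water date: 7 January 2023", 6,
    float("nan"), "plate 2", 3,
]
print(extract_fields(cells))
# {'project': '4568', 'date': '7 January 2023', 'substance': 'water'}
```

With a real dataframe, the cells could be passed as `extract_fields(df.to_numpy().ravel())`, so no column or row positions need to be specified. Fields that never match are simply missing from the returned dict.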