Monday, January 8, 2024

[FIXED] in a pandas dataframe return a numeric substring from a string without specifying indexes

Labels: dataframe, pandas, python

Issue

I have a few example dataframes in which the text or numeric data I want to extract isn't always in the same column or row, or even in the same order within the string:

df:

{1: {0: 'sample', 1: 2},
 2: {0: 'project: 4568 date: 7 January 2023', 1: 4},
 3: {0: 'substance:water', 1: 6}}

df2:

{1: {0: 'sample', 1: 2},
 2: {0: 'user ab', 1: 4},
 3: {0: 'project: 4568 date: 7 January 2023', 1: 6},
 4: {0: 'substance:water', 1: 3}}

df3:

{1: {0: nan, 1: 'sample', 2: 2},
 2: {0: 'Monday', 1: 'user ab', 2: 4},
 3: {0: nan, 1: 'project: 4568 substance: water date: 7 January 2023', 2: 6},
 4: {0: nan, 1: 'plate 2', 2: 3}}

I'd like to extract the numeric value that comes after 'project:' (it always starts with 45 and is always 4 digits long), along with the date and the substance, each into its own variable.

For the first dataframe, I could use a=df.iloc[0,1].split(' ')[1] to get the project number and b=df.iloc[0,2].split(':')[1] to get the substance name. However, this becomes tedious when the data sits in different columns and rows and the split delimiter has to be changed for each dataframe.

Is there a way to extract these substrings without having to specify the columns and rows? And how do I extract a numeric substring from a string?
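To make the problem concrete, here is a minimal sketch of that positional approach, with df and df2 rebuilt from the dicts above. The variable names `project`, `substance` and `wrong` are only illustrative. The same index that works on df silently returns an unrelated value on df2, because the project string has moved one column to the right.

```python
import pandas as pd

# First two example frames, rebuilt from the dicts shown in the question.
df = pd.DataFrame(
    {
        1: {0: "sample", 1: 2},
        2: {0: "project: 4568 date: 7 January 2023", 1: 4},
        3: {0: "substance:water", 1: 6},
    }
)
df2 = pd.DataFrame(
    {
        1: {0: "sample", 1: 2},
        2: {0: "user ab", 1: 4},
        3: {0: "project: 4568 date: 7 January 2023", 1: 6},
        4: {0: "substance:water", 1: 3},
    }
)

# Hard-coded positions and delimiters work for df ...
project = df.iloc[0, 1].split(" ")[1]
substance = df.iloc[0, 2].split(":")[1]
print(project, substance)  # 4568 water

# ... but on df2 the same position holds 'user ab', so the result is wrong.
wrong = df2.iloc[0, 1].split(" ")[1]
print(wrong)  # ab
```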

Solution

I'd create a custom function that searches a pd.Series for string values and then uses regular expressions to find the matching items. E.g.:

import re


def find_project_and_date(series):
    for v in series:
        if not isinstance(v, str):
            continue
        project = re.search(r"project:\s*(45\d{2})", v)
        date = re.search(r"date:\s*(.*?\d{4})\b", v)

        if project and date:
            yield project.group(1), date.group(1)


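# find the first string that contains both `project` and `date`;
# `out` is None if no column holds such a string (requires Python 3.8+)
out = next(
    (m for c in df.columns if (m := next(find_project_and_date(df[c]), None))),
    None,
)
print(out)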

Prints:

('4568', '7 January 2023')
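The question also asks for the substance, which the function above does not return. One possible extension, offered as a sketch and not as part of the original answer, is to scan every cell of the dataframe and keep the first match for each field independently. That way the fields don't have to share a string. The names `PATTERNS` and `extract_fields` are hypothetical, the substance pattern assumes a single-word substance, and the date pattern assumes the "day Month year" layout seen in the examples.

```python
import re

import pandas as pd

# Assumed patterns, one per field; adjust them if the real data differs.
PATTERNS = {
    "project": re.compile(r"project:\s*(45\d{2})\b"),
    "date": re.compile(r"date:\s*(\d{1,2} [A-Za-z]+ \d{4})\b"),
    "substance": re.compile(r"substance:\s*(\w+)"),
}


def extract_fields(frame):
    """Return the first match for each pattern found in any string cell."""
    found = {}
    # Flatten the frame row by row so no row or column index is needed.
    for value in frame.to_numpy().ravel():
        if not isinstance(value, str):
            continue  # skip numbers and NaN
        for name, pattern in PATTERNS.items():
            if name not in found and (m := pattern.search(value)):
                found[name] = m.group(1)
    return found


# Third example frame from the question (nan written as float("nan")).
nan = float("nan")
df3 = pd.DataFrame(
    {
        1: {0: nan, 1: "sample", 2: 2},
        2: {0: "Monday", 1: "user ab", 2: 4},
        3: {0: nan, 1: "project: 4568 substance: water date: 7 January 2023", 2: 6},
        4: {0: nan, 1: "plate 2", 2: 3},
    }
)

result = extract_fields(df3)
print(result)
# {'project': '4568', 'date': '7 January 2023', 'substance': 'water'}
```

A field that never appears is simply missing from the returned dict, so `result.get("substance")` gives None in that case.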

Answered By - Andrej Kesely

This answer was collected from Stack Overflow and tested by the PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
