Issue
I have a column in my dataframe that holds clinical trial identifiers, including NCT ids.
Each NCT id starts with \nNCT and ends with \n.
Example:
Old column
0 209629\nCTR20191933\nNCT04136145\nTrialTroveID...
1 54767414ALZ001\nDARZAD\nNCT04070378\nTrialTrov...
2 D5495C00005\nNCT04024501\nTrialTroveID-353576
etc
I want to extract only the NCT numbers and put them in a new column of the dataframe.
Expected output:
                                          Old column   New column
0  209629\nCTR20191933\nNCT04136145\nTrialTroveID...  NCT04136145
1  54767414ALZ001\nDARZAD\nNCT04070378\nTrialTrov...  NCT04070378
2      D5495C00005\nNCT04024501\nTrialTroveID-353576  NCT04024501
Solution
Use str.extract:
df['New column'] = df['Old column'].str.extract(r'(NCT\d+)')
print(df)
# Output
                                          Old column   New column
0  209629\nCTR20191933\nNCT04136145\nTrialTroveID...  NCT04136145
1  54767414ALZ001\nDARZAD\nNCT04070378\nTrialTrov...  NCT04070378
2      D5495C00005\nNCT04024501\nTrialTroveID-353576  NCT04024501
Note: the regex NCT\d+ matches the literal string 'NCT' followed by one or more digits. The parentheses form the capture group whose contents str.extract returns.
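For anyone who wants to try this end to end, here is a minimal self-contained sketch. Only the first sample row comes from the question; the other two rows are made up for illustration (an assumption, not real trial data) to show what happens when a row has no NCT id or more than one. The "All NCT ids" column name is also just an illustrative choice.

```python
import pandas as pd

# First row is taken from the question; the other two rows are
# hypothetical, added only to show the edge cases.
df = pd.DataFrame({
    "Old column": [
        "D5495C00005\nNCT04024501\nTrialTroveID-353576",
        "ABC123\nCTR20190000",            # hypothetical: no NCT id
        "NCT00000001\nXYZ\nNCT00000002",  # hypothetical: two NCT ids
    ]
})

# expand=False makes str.extract return a Series for a single capture group.
# Rows without a match get NaN; rows with several matches keep only the first.
df["New column"] = df["Old column"].str.extract(r"(NCT\d+)", expand=False)

# str.findall collects every match in each row as a list.
df["All NCT ids"] = df["Old column"].str.findall(r"NCT\d+")

print(df[["New column", "All NCT ids"]])
```

str.extract returns only the first match per row, so if a row may list several NCT ids, str.findall (or str.extractall) is probably the better fit.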
Answered By - Corralien