Issue
I have CSV files I would like to analyze, but they are not normal CSV files: whoever created them did not keep the formatting consistent throughout.
For the most part they follow standard formatting, with data in columns, but every so often there is a line with some error/warning/info text.
Column 1 holds the date and time, and columns 2 to n hold the data, but every so often column 2 contains error/warning/info text and the other columns are empty.
I can easily exclude these lines from the CSV and run the analysis on the data, but I would like to extract them and store them in a separate dataframe.
I am struggling to do this simply. Am I missing a trick? Is there a simple way to separate out this data from the CSV file in Jupyter?
Solution
Your data looks like the sample below. After loading it with read_csv(), it is simple to split the rows
based on whether the second column is numeric or non-numeric.
from pathlib import Path
import pandas as pd
csv = """date,seq,val0,val1,val2
2021-01-01,1.0,0.919113692407093,1.6229411332628496,0.24659242223048927
2021-01-02,11.473684210526315,0.07253225286428067,0.5829646480126915,0.8417325582368181
2021-01-03,21.94736842105263,0.32438619968096405,1.4561059102864153,0.09907995077630782
2021-01-04,32.421052631578945,0.7926071257043146,1.7922407755587069,0.398524618028244
2021-01-05,42.89473684210526,0.2157414433351048,0.42316983774076333,0.26429821215433835
2021-01-06,53.368421052631575,0.5880798026850204,0.30631991278000203,2.157299668724619
2021-01-07,63.84210526315789,0.05680775379053116,0.09056762487241565,1.8432282529150985
2021-01-08,74.3157894736842,0.8638058796950695,0.956874782181419,0.560113292182499
2021-01-09,84.78947368421052,0.8578723804393844,1.3962261744237703,1.8002069590315575
2021-01-01,Adipisci etincidunt quiquia consectetur numquam dolorem aliquam.
2021-01-10,95.26315789473684,0.1842964050114777,0.9421910982208783,1.1097524348417385
2021-01-11,105.73684210526315,0.26926150072049215,0.3263406301607237,0.8337896257615581
2021-01-03,Aliquam neque porro est.
2021-01-04,Quisquam labore dolorem amet dolore.
2021-01-12,116.21052631578947,0.1487208436794849,1.9384707893168265,1.1932374325424484
2021-01-13,126.68421052631578,0.9738540881030379,1.2959312690277112,1.9354291047422771
2021-01-14,137.15789473684208,0.1420363534166592,0.6564997473347189,0.7491839162744267
2021-01-02,Magnam modi voluptatem quaerat.
2021-01-05,Neque dolor dolore quisquam dolor ut.
2021-01-06,Dolorem porro aliquam quiquia.
2021-01-07,Sit modi adipisci porro porro eius ipsum quisquam.
2021-01-15,147.6315789473684,0.9961022971940973,0.13346940964659093,2.4870460594816794
2021-01-16,158.10526315789474,0.8866086488360403,1.7565870140977553,2.7345560454964826
2021-01-17,168.57894736842104,0.27548274054720157,1.0466205997810067,2.146515617796502
2021-01-18,179.05263157894734,0.5564778653140571,1.0674809651747388,2.1899218384075683
2021-01-19,189.52631578947367,0.20504429969811966,0.2887690704253574,0.005236244550076985
2021-01-20,200.0,0.15569496004718852,0.28625583495153517,1.3681772459983979
2021-01-08,Dolorem tempora dolor consectetur.
2021-01-09,Velit ipsum consectetur neque modi magnam quaerat.
2021-01-10,Dolor quaerat sit sit dolorem sit amet dolore.
"""
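# Write the sample to disk so it can be read back like a real file
fname = Path.cwd().joinpath("mixed.csv")
fname.write_text(csv)
df = pd.read_csv(fname)
# True where "seq" is not a number, i.e. the row is an error/warning/info line
mask = pd.to_numeric(df["seq"], errors="coerce").isna()
# Data rows: everything else, with "seq" converted to float
dfdata = df.loc[~mask].assign(seq=lambda d: d["seq"].astype(float))
# Message rows: drop the columns that are entirely empty
dfmsg = df.loc[mask].pipe(lambda d: d.drop(columns=[c for c in d.columns if d[c].isna().all()]))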
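If you have several files like this, the same split can be wrapped in a small helper. The sketch below is only an illustration: the function name split_mixed_csv, its parameters, and the shortened sample are hypothetical and not part of the answer above. It assumes the message text contains no commas and that a real data row never has an empty second column, since both cases would be classified incorrectly.

```python
import io

import pandas as pd

# Hypothetical, shortened sample in the same layout as the data above.
raw = """date,seq,val0,val1,val2
2021-01-01,1.0,0.91,1.62,0.24
2021-01-02,11.5,0.07,0.58,0.84
2021-01-01,Adipisci etincidunt quiquia.
2021-01-03,21.9,0.32,1.45,0.09
2021-01-03,Aliquam neque porro est.
"""


def split_mixed_csv(source, date_col="date", key="seq"):
    """Return (data, messages), split on whether `key` parses as a number."""
    df = pd.read_csv(source)
    # Non-numeric entries become NaN; those rows are treated as messages.
    numeric_key = pd.to_numeric(df[key], errors="coerce")
    is_msg = numeric_key.isna()
    # Data rows keep every column, with `key` replaced by its numeric form.
    data = (
        df.loc[~is_msg]
        .assign(**{key: numeric_key[~is_msg]})
        .reset_index(drop=True)
    )
    # Message rows keep only the date and the text, renamed for clarity.
    messages = (
        df.loc[is_msg, [date_col, key]]
        .rename(columns={key: "message"})
        .reset_index(drop=True)
    )
    return data, messages


# io.StringIO stands in for a file path here; read_csv accepts either.
data, messages = split_mixed_csv(io.StringIO(raw))
```

Passing a real path instead of the StringIO object works the same way, and the message frame can then be filtered or joined back to the data on the date column.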
Answered By - Rob Raymond