Issue
I have more than 3000 CSV files, each with more than 10 columns. What I need is to get all unique values from just two of those columns. I am able to read the unique values in polars:
import polars as pl
df1 = pl.read_csv("test1.biobank.tsv.gz", sep='\t', dtype={"#chrom": pl.Utf8}, n_threads=8, columns=["#chrom", "pos"], new_columns=["chr", "pos"]).drop_duplicates()
I can read the remaining files one by one, i.e.:
df2 = pl.read_csv("test2.biobank.tsv.gz", sep='\t', dtype={"#chrom": pl.Utf8}, n_threads=8, columns=["#chrom", "pos"], new_columns=["chr", "pos"]).drop_duplicates()
check whether the two frames differ, and if so stack them:
if not df1.frame_equal(df2):
    df = df1.vstack(df2)
    del df1
    del df2
and then call .drop_duplicates() on the stacked frame. But since all the input files are already sorted on the two columns (chr, pos), and the differences are only in the thousands out of roughly 16M input rows per file, I hope there is a better way to do it.
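For reference, here is the file-by-file approach above put together in one loop. This is only a sketch: the glob pattern test*.biobank.tsv.gz is an assumption based on the example file names, and it reuses the same polars calls shown above.

import glob
import polars as pl

# read each file's (chr, pos) pairs, stack them onto the running
# result and de-duplicate as we go
result = None
for fn in sorted(glob.glob("test*.biobank.tsv.gz")):  # assumed naming pattern
    df = pl.read_csv(fn, sep='\t', dtype={"#chrom": pl.Utf8}, n_threads=8,
                     columns=["#chrom", "pos"], new_columns=["chr", "pos"]).drop_duplicates()
    result = df if result is None else result.vstack(df).drop_duplicates()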
Thank you in advance for your help.
DK
Edit
There is another way to do it using Polars and DuckDB.
- create parquet files for each of the inputs:
import glob
import polars as pl

tsv_pattern = "gwas_*.gz"
for fn in glob.glob(tsv_pattern):
    print(fn)
    parquet_fn = fn.replace(".gz", ".chr_pos.parquet")
    df = pl.read_csv(fn, sep='\t', dtype={"#chrom": pl.Utf8}, n_threads=8,
                     columns=["#chrom", "pos"], new_columns=["chr", "pos"]).drop_duplicates()
    df.to_parquet(parquet_fn, compression='zstd')
    del df
- run DuckDB and execute:
CREATE TABLE my_table AS SELECT DISTINCT * FROM 'my_directory/*.parquet'
Credits go to Mark Mytherin from DuckDB.
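If you prefer to drive this from Python rather than the DuckDB CLI, the same query can be issued through DuckDB's Python package. A minimal sketch, assuming the parquet files live in the my_directory used above:

import duckdb

# connect to an in-memory database (pass a file path to persist the table)
con = duckdb.connect()

# DuckDB reads all parquet files matching the glob pattern directly
con.execute("CREATE TABLE my_table AS SELECT DISTINCT * FROM 'my_directory/*.parquet'")

# pull the distinct (chr, pos) pairs back as a pandas DataFrame
result = con.execute("SELECT * FROM my_table").fetchdf()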
Solution
You can use glob patterns to read the CSVs and then call unique.
(pl.scan_csv("**/*.csv")
.unique()
.collect())
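The same lazy pattern also works over the per-file parquet outputs from the edit above. A sketch, assuming the my_directory path from the DuckDB example:

import polars as pl

# scan all per-file parquet outputs lazily and keep distinct (chr, pos) pairs
unique_pairs = (
    pl.scan_parquet("my_directory/*.parquet")
    .unique()
    .collect()
)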
Answered By - ritchie46