Friday, March 4, 2022

[FIXED] Automising the plot of more than a 100 .txt files using pandas, NaN problems

March 04, 2022 matplotlib, pandas, python No comments

Issue

Good afternoon

I am trying to import more than a 100 separate .txt files containing data I want to plot. I would like to automise this process, since doing the same iteration for every individual file is most tedious.

I have read up on how to read multiple .txt files, and found a nice explanation. However, following the example all my data gets imported as NaNs. I read up some more and found a more reliable way of importing .txt files, namely by using pd.read_fwf() as can be seen here.

Although I can at least see my data now, I have no clue how to plot it, since the data is in one column separated by \t, e.g.

0 Extension (mm)\tLoad (kN)\tMachine extension (mm)\tPreload extension

1 0.000000\t\t\t

2 0.152645\t0.000059312\t.....

... etc.

I have tried using different separators in both the pd.read_csv() and pd.read_fwf() including ' ', '\t' and '-s+', but to now avail.

Of course this causes a problem, because now I can not plot my data. Speaking of, I am also not sure how to plot the data in the dataframe. I want to plot each .txt file's data separately on the same scatter plot.

I am very new to stack overflow, so pardon the format of the question if it does not conform to the normal standard. I attach my code below, but unfortunately I can not attach my .txt files. Each .txt file contains about a thousand rows of data. I attach a picture of the general format of all the files. General format of the .txt files.

import numpy as np
import pandas as pd
from matplotlib import pyplot as pp
import os
import glob

# change the working directory
os.chdir(r"C:\Users\Philip de Bruin\Desktop\Universiteit van Pretoria\Nagraads\sterktetoetse_basislyn\trektoetse\speel")

# get the file names
leggername = [i for i in glob.glob("*.txt")]

# put everything in a dataframe
df = [pd.read_fwf(legger) for legger in leggername]
df

EDIT: the output I get now for the DataFrame is:

[ Time (s)\tLoad (kN)\tMachine Extension (mm)\tExtension
0
1 0.000000\t\t\t
2
3 0.152645\t0.000059312\t-...
4
... ...
997 76.0173\t0.037706\t0.005...
998
999 76.1699\t0.037709\t\t
1000
1001

   from  Preload  (mm)

0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
... ... ... ...
997 NaN NaN NaN
998 NaN NaN NaN
999 NaN NaN NaN
1000 NaN NaN NaN
1001 NaN NaN NaN

[1002 rows x 4 columns], Time (s)\tLoad (kN)\tMachine Extension (mm)\tExtension
0
1 0.000000\t\t\t
2
3 0.128151\t0.000043125\t-...
4
... ...
997 63.8191\t0.034977\t-0.00...
998
999 63.9473\t0.034974\t\t
1000
1001

   from  Preload  (mm)

0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
... ... ... ...
997 NaN NaN NaN
998 NaN NaN NaN
999 NaN NaN NaN
1000 NaN NaN NaN
1001 NaN NaN NaN

[1002 rows x 4 columns], Time (s)\tLoad (kN)\tMachine Extension (mm)\tExtension
0
1 0.000000\t\t\t
2
3 0.174403\t0.000061553\t0...
4
... ...
997 86.8529\t0.036093\t-0.00...
998
999 87.0273\t\t-0.0059160\t-...
1000
1001

   from  Preload  (mm)

0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
... ... ... ...
997 NaN NaN NaN
998 NaN NaN NaN
999 NaN NaN NaN
1000 NaN NaN NaN
1001 NaN NaN NaN

... etc

Solution

The basic gist is to skip the first data row (that has a single value in it), then read the individual files with pd.read_csv, using tab as the separator, and stack them together.

There is, however, a more problematic issue: the data files turn out to be UTF-16 encoded (the binary data show a NUL character at the even positions), but there is no byte-order-mark (BOM) to indicate this. As a result, you can't specify the encoding in read_csv, but have to manually read each file as binary, then decode it with UTF-16 to a string, then feed that string to read_csv. Since the latter requires a filename or IO-stream, the text data needs to be put into a StringIO object first (or save the corrected data to disk first, then read the corrected file; might not be a bad idea).

import pandas as pd
import os
import glob
import io

# change the working directory
os.chdir(r"C:\Users\Philip de Bruin\Desktop\Universiteit van Pretoria\Nagraads\sterktetoetse_basislyn\trektoetse\speel")

dfs = []
for filename in glob.glob("*.txt"):
    with open(filename, 'rb') as fp:
        data = fp.read()  # a single file should fit in memory just fine
    # Decode the UTF-16 data that is missing a BOM
    string = data.decode('UTF-16')
    # And put it into a stream, for ease-of-use with `read_csv`
    stream = io.StringIO(string) 

    # Read the data from the, now properly decoded, stream
    # Skip the single-value row, and use tabs as separators
    df = pd.read_csv(stream, sep='\t', skiprows=[1])

    # To keep track of the individual files, add an "origin" column
    # with its value set to the corresponding filename
    df['origin'] = filename
    dfs.append(df)

# Concate all dataframes (default is to stack the rows)
df = pd.concat(dfs)


# For a quick and dirty plot, you can enjoy the power of Seaborn
import seaborn as sns
# Use appropriate (full) column names, and use the 'origin' 
# column for the hue and symbol
sns.scatterplot(data=df, x='Time (s)', y='Machine Extension (mm)', hue='origin', style='origin')

Seaborn's scatterplot documentation.

Answered By - 9769953

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Friday, March 4, 2022

[FIXED] Automising the plot of more than a 100 .txt files using pandas, NaN problems

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels