Sunday, June 5, 2022

[FIXED] Read content of several text files into pandas Dataframe

Labels: dataframe, pandas, python, python-3.x

Issue

In my directory Geolife I have several text files named in the format:

Geolife$ ls
labels106.txt  labels153.txt  labels73.txt
labels107.txt  labels154.txt  labels75.txt
labels108.txt  labels161.txt  labels76.txt
labels10.txt   labels163.txt  labels78.txt
labels110.txt  labels167.txt  labels80.txt
labels111.txt  labels170.txt  labels81.txt
...

Each of these files contains tab-separated data, for example:

Geolife$ cat labels10.txt
Start Time  End Time    Transportation Mode
2007/06/26 11:32:29 2007/06/26 11:40:29 bus
2008/03/28 14:52:54 2008/03/28 15:59:59 train
2008/03/28 16:00:00 2008/03/28 22:02:00 train
2008/03/29 01:27:50 2008/03/29 15:59:59 train
2008/03/29 16:00:00 2008/03/30 15:59:59 train
2008/03/30 16:00:00 2008/03/31 03:13:11 train
2008/03/31 04:17:59 2008/03/31 15:31:06 train
2008/03/31 16:00:08 2008/03/31 16:09:01 taxi
2008/03/31 17:26:04 2008/04/01 00:35:26 train
2008/04/01 00:48:32 2008/04/01 00:59:23 taxi
...

I want to read this data into a pandas DataFrame, keeping the date from the first column of each file and adding a column that records which file number the data comes from. I am not interested in the time part, only the date, so that I can do analysis by year, date, etc.

The intended output (taking the file above as an example) would be:

User-ID  Date   Mode
10  2007-06-26  bus
10  2008-03-28  train
10  2008-03-28  train
10  2008-03-29  train
10  2008-03-29  train
10  2008-03-30  train
10  2008-03-31  train
10  2008-03-31  taxi
10  2008-03-31  train
10  2008-04-01  taxi
...
# and contents of all other files, e.g. labels106.txt
106 2007-10-07 car
106 2007-10-08 car
106 2007-10-09 car
....

How can this be done?

EDIT

labels106.txt (like all the other files) contains data in the same format.

Geolife$ cat labels106.txt
Start Time  End Time    Transportation Mode
2007/10/07 16:00:00 2007/10/08 15:59:59 car
2007/10/08 16:00:00 2007/10/09 15:59:59 car
2007/10/09 16:00:00 2007/10/10 15:59:59 car

Solution

This is not exactly what you asked for, but this solution reads the .txt files and writes the data to a .csv file, which you can then read with the pandas.read_csv(..) method.

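import os

files_dir = 'your-geolife-dir'  # directory that holds the labels*.txt files

for root, dirs, files in os.walk(files_dir):
    for file in files:
        if file.startswith('labels') and file.endswith('.txt'):
            # 'labels10.txt' -> 'labels10' -> '10'
            # (str.strip('.txt') removes characters, not a suffix, so use splitext)
            user = os.path.splitext(file)[0]
            user = user[6:]

            # 'a' appends: delete an existing data.csv before re-running the script
            with open(os.path.join(root, file), 'r') as f, open(os.path.join(root,
                'data.csv'), 'a') as out:  # out - the output csv file
                for line in f:
                    line = line.rstrip()
                    line = line.replace('\t', ',')
                    line = line.replace('/', '-')
                    # skip blank lines and the 'Start Time ...' header row
                    if line and not line.startswith('S'):
                        output = f'{user},{line}'
                        out.write(f'{output}\n')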

Output:

$ cat data.csv
10,2007-06-26 11:32:29,2007-06-26 11:40:29,bus
10,2008-03-28 14:52:54,2008-03-28 15:59:59,train
10,2008-03-28 16:00:00,2008-03-28 22:02:00,train
10,2008-03-29 01:27:50,2008-03-29 15:59:59,train
10,2008-03-29 16:00:00,2008-03-30 15:59:59,train
10,2008-03-30 16:00:00,2008-03-31 03:13:11,train
10,2008-03-31 04:17:59,2008-03-31 15:31:06,train
10,2008-03-31 16:00:08,2008-03-31 16:09:01,taxi
10,2008-03-31 17:26:04,2008-04-01 00:35:26,train
10,2008-04-01 00:48:32,2008-04-01 00:59:23,taxi
106,2007-10-07 16:00:00,2007-10-08 15:59:59,car
106,2007-10-08 16:00:00,2007-10-09 15:59:59,car
106,2007-10-09 16:00:00,2007-10-10 15:59:59,car
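A data.csv file like the one above can then be loaded into pandas and reduced to the three columns from the question. The following is only a minimal sketch: the function name load_labels_csv and the intermediate column names "Start" and "End" are assumptions made for illustration, and it assumes every row has exactly the four comma-separated fields shown above.

```python
import pandas as pd


def load_labels_csv(path):
    """Read the header-less data.csv into a User-ID / Date / Mode DataFrame."""
    # The file has no header row, so the column names are supplied here.
    # "Start" and "End" are illustrative names, not part of the original data.
    df = pd.read_csv(path, header=None, names=["User-ID", "Start", "End", "Mode"])

    # Parse the start timestamp and keep only its calendar date.
    start = pd.to_datetime(df["Start"], format="%Y-%m-%d %H:%M:%S")
    df["Date"] = start.dt.date

    return df[["User-ID", "Date", "Mode"]]


# Example call (assumes data.csv is in the current directory):
# df = load_labels_csv("data.csv")
```

Note that .dt.date produces plain Python date objects. If you plan to group by year or month, keeping a datetime column (for example with start.dt.normalize()) may be more convenient.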

You can customise the solution to your needs (for example, you can add a heading row to the CSV).
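If the intermediate CSV is not needed, one possible alternative (not the approach used in the answer above) is to read each file straight into pandas and concatenate the results. This sketch assumes that every file is named labels<number>.txt, is tab-separated, and starts with the header row "Start Time", "End Time", "Transportation Mode" as in the question. The function name read_geolife_labels is hypothetical.

```python
import re
from pathlib import Path

import pandas as pd


def read_geolife_labels(directory):
    """Combine every labels<N>.txt file in `directory` into one DataFrame."""
    frames = []
    for path in sorted(Path(directory).glob("labels*.txt")):
        # Take the user ID from the file name, e.g. labels106.txt -> 106.
        match = re.fullmatch(r"labels(\d+)\.txt", path.name)
        if match is None:
            continue  # skip files that do not follow the naming scheme

        # Assumption: tab-separated with the header row shown in the question.
        df = pd.read_csv(path, sep="\t")
        start = pd.to_datetime(df["Start Time"], format="%Y/%m/%d %H:%M:%S")

        frames.append(
            pd.DataFrame(
                {
                    "User-ID": int(match.group(1)),  # scalar, repeated per row
                    "Date": start.dt.date,           # drop the time part
                    "Mode": df["Transportation Mode"],
                }
            )
        )

    if not frames:
        return pd.DataFrame(columns=["User-ID", "Date", "Mode"])
    return pd.concat(frames, ignore_index=True)


# Example call (directory name taken from the question):
# df = read_geolife_labels("Geolife")
```

Matching the file name with a regular expression avoids relying on fixed string offsets, and files that do not fit the pattern are simply skipped.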

Answered By - arilwan

This answer was collected from Stack Overflow and tested by the PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
