Issue
In my directory Geolife, I have several text files named as follows:
Geolife$ ls
labels106.txt labels153.txt labels73.txt
labels107.txt labels154.txt labels75.txt
labels108.txt labels161.txt labels76.txt
labels10.txt labels163.txt labels78.txt
labels110.txt labels167.txt labels80.txt
labels111.txt labels170.txt labels81.txt
...
Each of these files contains tab-separated data, for example:
Geolife$ cat labels10.txt
Start Time End Time Transportation Mode
2007/06/26 11:32:29 2007/06/26 11:40:29 bus
2008/03/28 14:52:54 2008/03/28 15:59:59 train
2008/03/28 16:00:00 2008/03/28 22:02:00 train
2008/03/29 01:27:50 2008/03/29 15:59:59 train
2008/03/29 16:00:00 2008/03/30 15:59:59 train
2008/03/30 16:00:00 2008/03/31 03:13:11 train
2008/03/31 04:17:59 2008/03/31 15:31:06 train
2008/03/31 16:00:08 2008/03/31 16:09:01 taxi
2008/03/31 17:26:04 2008/04/01 00:35:26 train
2008/04/01 00:48:32 2008/04/01 00:59:23 taxi
...
I want to read this data into a pandas DataFrame (using the date in the first column of each file) and add a column that tracks the file number each row comes from. I am not interested in the time part of the timestamp, only the date, so that I can do analysis by year, date, etc.
The intended output (taking the file above as an example) would be:
User-ID Date Mode
10 2007-06-26 bus
10 2008-03-28 train
10 2008-03-28 train
10 2008-03-29 train
10 2008-03-29 train
10 2008-03-30 train
10 2008-03-31 train
10 2008-03-31 taxi
10 2008-03-31 train
10 2008-04-01 taxi
...
# and contents of all other files, e.g. labels106.txt
106 2007-10-07 car
106 2007-10-08 car
106 2007-10-09 car
....
How can this be done?
EDIT
labels106.txt (like all the other files) contains data in the same format.
Geolife$ cat labels106.txt
Start Time End Time Transportation Mode
2007/10/07 16:00:00 2007/10/08 15:59:59 car
2007/10/08 16:00:00 2007/10/09 15:59:59 car
2007/10/09 16:00:00 2007/10/10 15:59:59 car
Solution
Not exactly what you asked for, but this solution reads the .txt files and writes the data to a .csv file, which you can then read using the pandas.read_csv(..) method.
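import os

files_dir = 'your-geolife-dir'  # path to the Geolife directory
out_path = os.path.join(files_dir, 'data.csv')  # out - the output csv file

# Open with 'w' rather than 'a' so re-running the script does not duplicate rows
with open(out_path, 'w') as out:
    for root, dirs, files in os.walk(files_dir):
        for file in files:
            if file.startswith('labels') and file.endswith('.txt'):
                # 'labels10.txt' -> '10'; note that str.strip('.txt') removes
                # characters from both ends, not a suffix, so it is avoided here
                user = os.path.splitext(file)[0][len('labels'):]
                with open(os.path.join(root, file), 'r') as f:
                    for line in f:
                        line = line.rstrip()
                        line = line.replace('\t', ',')
                        line = line.replace('/', '-')
                        # Skip blank lines and the 'Start Time ...' header row
                        if line and not line.startswith('S'):
                            out.write(f'{user},{line}\n')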
Output:
$ cat data.csv
10,2007-06-26 11:32:29,2007-06-26 11:40:29,bus
10,2008-03-28 14:52:54,2008-03-28 15:59:59,train
10,2008-03-28 16:00:00,2008-03-28 22:02:00,train
10,2008-03-29 01:27:50,2008-03-29 15:59:59,train
10,2008-03-29 16:00:00,2008-03-30 15:59:59,train
10,2008-03-30 16:00:00,2008-03-31 03:13:11,train
10,2008-03-31 04:17:59,2008-03-31 15:31:06,train
10,2008-03-31 16:00:08,2008-03-31 16:09:01,taxi
10,2008-03-31 17:26:04,2008-04-01 00:35:26,train
10,2008-04-01 00:48:32,2008-04-01 00:59:23,taxi
106,2007-10-07 16:00:00,2007-10-08 15:59:59,car
106,2007-10-08 16:00:00,2007-10-09 15:59:59,car
106,2007-10-09 16:00:00,2007-10-10 15:59:59,car
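To get from data.csv to the User-ID / Date / Mode layout asked for in the question, something like the following might work. It is only a sketch: the function name load_labels and the intermediate column names "Start" and "End" are made up for illustration, and it assumes data.csv has no header row and looks exactly like the output above. A small in-memory sample stands in for the real file here.

```python
import io

import pandas as pd


def load_labels(csv_source):
    """Read the header-less CSV and keep user id, start date and mode.

    `csv_source` may be a path such as 'data.csv' or a file-like object.
    The names "Start" and "End" are illustrative, not from the original files.
    """
    df = pd.read_csv(
        csv_source,
        header=None,
        names=["User-ID", "Start", "End", "Mode"],
    )
    start = pd.to_datetime(df["Start"], format="%Y-%m-%d %H:%M:%S")
    # normalize() resets the time to midnight but keeps the datetime dtype,
    # so .dt.year, .dt.month, etc. remain available for later analysis
    df["Date"] = start.dt.normalize()
    return df[["User-ID", "Date", "Mode"]]


# Two rows copied from the data.csv output above, used as a stand-in file
sample = io.StringIO(
    "10,2007-06-26 11:32:29,2007-06-26 11:40:29,bus\n"
    "106,2007-10-07 16:00:00,2007-10-08 15:59:59,car\n"
)
labels = load_labels(sample)
print(labels)
```

With the real file, load_labels('data.csv') should return one row per trip, and labels.groupby(labels["Date"].dt.year) would then group the trips by year.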
You can customise the solution to your needs (for example, you can add a header row to the CSV).
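If the intermediate CSV is not needed, the files could also be read straight into pandas. The sketch below is an alternative, not the method used above. It assumes every file is named labels<N>.txt, sits directly inside the directory, is tab-separated, and has the header "Start Time", "End Time", "Transportation Mode" as shown in the question. The helper names read_label_file and read_all_labels are hypothetical.

```python
import re
from pathlib import Path

import pandas as pd


def read_label_file(path):
    """Load one labels<N>.txt file and tag its rows with the user id N."""
    path = Path(path)
    match = re.fullmatch(r"labels(\d+)", path.stem)
    if match is None:
        raise ValueError(f"unexpected file name: {path.name}")
    # Assumption: the file is tab-separated with a single header row
    df = pd.read_csv(path, sep="\t")
    start = pd.to_datetime(df["Start Time"], format="%Y/%m/%d %H:%M:%S")
    return pd.DataFrame({
        "User-ID": int(match.group(1)),  # scalar, repeated for every row
        "Date": start.dt.normalize(),    # drop the time, keep datetime dtype
        "Mode": df["Transportation Mode"],
    })


def read_all_labels(directory):
    """Concatenate every labels*.txt file found directly in `directory`."""
    paths = sorted(Path(directory).glob("labels*.txt"))
    if not paths:
        raise FileNotFoundError(f"no labels*.txt files in {directory}")
    frames = [read_label_file(p) for p in paths]
    return pd.concat(frames, ignore_index=True)
```

Calling read_all_labels('your-geolife-dir') would then give a single DataFrame with the User-ID, Date and Mode columns for all users.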
Answered By - arilwan