Issue
I created a DataFrame from a text file using pandas:
df = pd.read_table('inputfile.txt',names=['Line'])
When I display df, I get:
Line
0 17/08/31 13:24:48 INFO spark.SparkContext: Run...
1 17/08/31 13:24:49 INFO spark.SecurityManager: ...
2 17/08/31 13:24:49 INFO spark.SecurityManager: ...
3 17/08/31 13:24:49 INFO spark.SecurityManager: ...
4 17/08/31 13:24:49 INFO util.Utils: Successfull...
5 17/08/31 13:24:49 INFO slf4j.Slf4jLogger: Slf4...
6 17/08/31 13:24:49 INFO Remoting: Starting remo...
7 17/08/31 13:24:50 INFO Remoting: Remoting star...
8 17/08/31 13:24:50 INFO Remoting: Remoting now ...
9 17/08/31 13:24:50 INFO util.Utils: Successfull...
Now I want to save this DataFrame as a CSV file:
df.to_csv('outputfile')
The result I get is:
0,17/08/31 13:24:48 INFO spark.SparkContext: Running Spark version 1.6.0
1,17/08/31 13:24:49 INFO spark.SecurityManager: Changing view acls to: user1
2,17/08/31 13:24:49 INFO spark.SecurityManager: Changing modify acls to: user1
3,17/08/31 13:24:49 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(user1);
4,17/08/31 13:24:49 INFO util.Utils: Successfully started service 'sparkDriver' on port 17101.
5,17/08/31 13:24:49 INFO slf4j.Slf4jLogger: Slf4jLogger started
6,17/08/31 13:24:49 INFO Remoting: Starting remoting
7,17/08/31 13:24:50 INFO Remoting: Remoting started; listening on addresses :
8,17/08/31 13:24:50 INFO Remoting: Remoting now listens on addresses:
9,17/08/31 13:24:50 INFO util.Utils: Successfully started service 'sparkDriverActorSystem' on port 100033.
I want my output to be:
17/08/31 13:24:48 INFO spark.SparkContext: Running Spark version 1.6.0
17/08/31 13:24:49 INFO spark.SecurityManager: Changing view acls to: user1
17/08/31 13:24:49 INFO spark.SecurityManager: Changing modify acls to: user1
17/08/31 13:24:49 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(user1);
17/08/31 13:24:49 INFO util.Utils: Successfully started service 'sparkDriver' on port 17101.
17/08/31 13:24:49 INFO slf4j.Slf4jLogger: Slf4jLogger started
17/08/31 13:24:49 INFO Remoting: Starting remoting
17/08/31 13:24:50 INFO Remoting: Remoting started; listening on addresses :
17/08/31 13:24:50 INFO Remoting: Remoting now listens on addresses:
17/08/31 13:24:50 INFO util.Utils: Successfully started service 'sparkDriverActorSystem' on port 100033.
I have tried a couple of approaches, such as the ones below, but I still get the same result instead of my desired output.
np.savetxt(r'np.txt', df.Line, fmt='%d')
df.to_csv(sep=' ', index=False, header=False)
Solution
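It looks like the number might be part of the string in the Line column. You can replace the leading digits and spaces with nothing and write the result to a file with no index and no header using:
df.Line.str.replace(r'^\d+ +', '', regex=True).to_csv('outputfile.csv', index=False, header=False)
The pattern is written as a raw string and passed with regex=True because, in pandas 2.0 and later, str.replace treats the pattern as a literal string by default.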
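To make the approach easier to check, here is a minimal sketch that runs without any input file. The two sample rows are hypothetical stand-ins for the log, written on the assumption that each string really does begin with a row number, and the result is returned as text rather than written to disk.

```python
import pandas as pd

# Hypothetical sample: log lines whose text already starts with a row number.
df = pd.DataFrame({
    "Line": [
        "0 17/08/31 13:24:48 INFO spark.SparkContext: Running Spark version 1.6.0",
        "1 17/08/31 13:24:49 INFO spark.SecurityManager: Changing view acls to: user1",
    ]
})

# Strip the leading digits and the spaces that follow them.
# regex=True is needed so the pattern is not treated as a literal string.
cleaned = df["Line"].str.replace(r"^\d+ +", "", regex=True)

# With no path argument, to_csv returns the CSV text instead of writing a file.
# index=False drops the row labels; header=False drops the column name.
csv_text = cleaned.to_csv(index=False, header=False)
print(csv_text)
```

If the numbers are not in the strings at all and are only the DataFrame's index, the replace step changes nothing and index=False by itself is what removes them. Note also that to_csv quotes any field that contains the separator, so a log line with a comma in it would be wrapped in double quotes.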
Answered By - James