Issue
"DF","[email protected]","FLTINT1000130394756","26JUL2010","B2C","6799.2"
"Rail","[email protected]","NR251764697478","24JUN2011","B2C","2025"
"DF","[email protected]","NF2513521438550","01JAN2013","B2C","6792"
"Bus","[email protected]","NU27012932319739","26JAN2013","B2C","800"
"Rail","[email protected]","NR251764697526","24JUN2011","B2C","595"
"Rail","[email protected]","NR251277005737","29OCT2011","B2C","957"
"Rail","[email protected]","NR251297862893","21NOV2011","B2C","212"
"DF","[email protected]","NF251327485543","26JUN2011","B2C","17080"
"Rail","[email protected]","NR2512012069809","25OCT2012","B2C","5731"
"DF","[email protected]","NF251355775967","10MAY2011","B2C","2000"
"DF","[email protected]","NF251352240086","22DEC2010","B2C","4006"
"DF","[email protected]","NF251742087846","12DEC2010","B2C","1000"
"DF","[email protected]","NF252022031180","09DEC2010","B2C","3439"
"Rail","[email protected]","NR2151120122283","25JAN2013","B2C","136"
"Rail","[email protected]","NR2151213260036","28NOV2012","B2C","41"
"Rail","[email protected]","NR2151313264432","29NOV2012","B2C","96"
"Rail","[email protected]","NR2151413266728","29NOV2012","B2C","96"
"Rail","[email protected]","NR2512912359037","08DEC2012","B2C","96"
"Rail","[email protected]","NR2517612385569","12DEC2012","B2C","96"
Above is the sample data. The data is sorted by email address, and the file is very large, around 1.5 GB.
I want the output in another CSV file, something like this:
"DF","[email protected]","FLTINT1000130394756","26JUL2010","B2C","6799.2",1,0 days
"Rail","[email protected]","NR251764697478","24JUN2011","B2C","2025",1,0 days
"DF","[email protected]","NF2513521438550","01JAN2013","B2C","6792",1,0 days
"Bus","[email protected]","NU27012932319739","26JAN2013","B2C","800",1,0 days
"Rail","[email protected]","NR251764697526","24JUN2011","B2C","595",1,0 days
"Rail","[email protected]","NR251277005737","29OCT2011","B2C","957",1,0 days
"Rail","[email protected]","NR251297862893","21NOV2011","B2C","212",1,0 days
"DF","[email protected]","NF251327485543","26JUN2011","B2C","17080",1,0 days
"Rail","[email protected]","NR2512012069809","25OCT2012","B2C","5731",1,0 days
"DF","[email protected]","NF251355775967","10MAY2011","B2C","2000",1,0 days
"DF","[email protected]","NF251352240086","09DEC2010","B2C","4006",1,0 days
"DF","[email protected]","NF251742087846","12DEC2010","B2C","1000",2,3 days
"DF","[email protected]","NF252022031180","22DEC2010","B2C","3439",3,10 days
"Rail","[email protected]","NR2151213260036","28NOV2012","B2C","41",1,0 days
"Rail","[email protected]","NR2151313264432","29NOV2012","B2C","96",2,1 days
"Rail","[email protected]","NR2151413266728","29NOV2012","B2C","96",3,0 days
"Rail","[email protected]","NR2512912359037","08DEC2012","B2C","96",4,9 days
"Rail","[email protected]","NR2512912359037","08DEC2012","B2C","96",5,0 days
"Rail","[email protected]","NR2517612385569","12DEC2012","B2C","96",6,4 days
"Rail","[email protected]","NR2517612385569","12DEC2012","B2C","96",7,0 days
"Rail","[email protected]","NR2151120122283","25JAN2013","B2C","136",8,44 days
"Rail","[email protected]","NR2151120122283","25JAN2013","B2C","136",9,0 days
That is, if an entry occurs for the first time I need to append 1, if it occurs a second time I need to append 2, and so on. In other words, I need to count the number of occurrences of each email address in the file, and if an email address exists two or more times I want the difference between the dates. Remember that the dates are not sorted, so they also have to be sorted for each email address. I am looking for a solution in Python using the numpy or pandas library, or any other library that can handle this kind of huge data without running out of memory. I have a dual-core processor running CentOS 6.3 with 4 GB of RAM.
Solution
Another possible (system-admin) way, avoiding databases and SQL queries, as well as heavy demands on runtime processes and hardware resources.
Update 20/04: Added more code and a simplified approach:
1. Convert the timestamp to seconds from the Epoch and use UNIX sort on the email and this new field (that is: sort -t, -k2,2 -k4,4n < converted_input_file > output_file).
2. Initialize 3 variables: EMAIL, PREV_TIME and COUNT.
3. Iterate over each line; if a new email is encountered, add "1,0 days" and update PREV_TIME=timestamp, COUNT=1, EMAIL=new_email.
4. Next line: 3 possible scenarios (implemented in the Python sketch at the end of this answer):
   a) If same email, different timestamp: calculate days, increment COUNT, update PREV_TIME, add "COUNT, difference_in_days".
   b) If same email, same timestamp: increment COUNT, add "COUNT, 0 days".
   c) If new email, start from 3.
An alternative to 1. is to append a new TIMESTAMP field and remove it when printing out the line.
Note: If 1.5 GB is too huge to sort in one go, split it into smaller chunks, using email as the split point. You can run these chunks in parallel on different machines.
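As a rough sketch of such a split (an assumption on my part, not part of the original answer), a plain Python pass over the file can partition on the first letter of the email address, writing hypothetical chunk_<letter>.txt files that can then be converted and sorted independently:

handles = {}
with open("input.txt") as src:
    for line in src:
        email = line.split('","')[1]   # field 2, same delimiter as gawk -F'","'
        key = email[:1].lower()        # first letter of the email address
        if key not in handles:
            handles[key] = open("chunk_%s.txt" % key, "w")
        handles[key].write(line)
for f in handles.values():
    f.close()

Since every row for a given email address lands in the same chunk, the per-email counting and date arithmetic stays correct within each chunk.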
/usr/bin/gawk -F'","' 'BEGIN {
    # build the month-name -> month-number lookup table once
    split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", month, " ");
    for (i = 1; i <= 12; i++) mdigit[month[i]] = i;
} {
    # $4 is the date, e.g. 26JUL2010: day = chars 1-2, month = 3-5, year = 6-9
    print $0 "," mktime(substr($4,6,4) " " mdigit[substr($4,3,3)] " " substr($4,1,2) " 00 00 00");
}' < input.txt | /usr/bin/sort -t, -k2,2 -k7,7n > output_file.txt
output_file.txt:
"DF","[email protected]","FLTINT1000130394756","26JUL2010","B2C","6799.2",1280102400 "DF","[email protected]","NF252022031180","09DEC2010","B2C","3439",1291852800 "DF","[email protected]","NF251742087846","12DEC2010","B2C","1000",1292112000 "DF","[email protected]","NF251352240086","22DEC2010","B2C","4006",1292976000
...
Pipe this output to a Perl, Python or AWK script to process steps 2 through 4.
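For example, here is a minimal Python sketch of steps 2 through 4, assuming the sorted output of the pipeline above is fed in on stdin; the appended epoch field (column 7) is consumed here and dropped from the printed line:

import sys

SECONDS_PER_DAY = 86400

email, prev_time, count = None, 0, 0       # step 2: the 3 state variables
for line in sys.stdin:
    fields = line.rstrip("\n").split(",")  # no embedded commas, per sort -t,
    ts = int(fields[6])                    # epoch seconds appended by gawk
    if fields[1] != email:                 # step 3: a new email address
        email, prev_time, count = fields[1], ts, 1
        diff_days = 0
    elif ts != prev_time:                  # step 4a: same email, new timestamp
        count += 1
        diff_days = (ts - prev_time) // SECONDS_PER_DAY
        prev_time = ts
    else:                                  # step 4b: same email, same timestamp
        count += 1
        diff_days = 0
    # drop the helper timestamp field and append "COUNT,N days"
    print("%s,%d,%d days" % (",".join(fields[:6]), count, diff_days))

Saved as, say, process.py (a hypothetical name), the whole pipeline becomes:

/usr/bin/gawk ... < input.txt | /usr/bin/sort -t, -k2,2 -k7,7n | python process.py > final_output.csv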
Answered By - Alvin K.