Issue
I am asynchronously streaming CSV data from each URL into its own file, one URL after another, like below.
import aiofiles
import backoff
import httpx

# Retry with exponential backoff (up to 7 tries) on the relevant httpx
# exception(s); httpx.SomeException is a placeholder.
@backoff.on_exception(backoff.expo, exception=(httpx.SomeException,), max_tries=7)
async def download_data(url, session):
    async with session.stream("GET", url) as csv_stream:
        csv_stream.raise_for_status()
        # Stream the response body to disk chunk by chunk.
        async with aiofiles.open("someuniquepath", "wb") as f:
            async for data in csv_stream.aiter_bytes():
                await f.write(data)

async def main():
    async with httpx.AsyncClient(headers={"Authorization": 'Token token="sometoken"'}) as session:
        for url in some_urls_list:
            await download_data(url, session)
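One pattern worth noting here (a sketch, not part of the original question): if each file is written directly into the directory that Splunk monitors, Splunk can start reading it before the download finishes. Streaming to a staging path and atomically renaming the finished file into the monitored directory avoids that; staging_dir, monitored_dir, and filename below are hypothetical names.

import os
import aiofiles

# Sketch (an assumption, not from the question): stream each download to a
# staging path outside the monitored directory, then atomically move the
# finished file into place so the Splunk monitor never reads a
# half-written file.
async def download_to_monitored_dir(session, url, staging_dir, monitored_dir, filename):
    tmp_path = os.path.join(staging_dir, filename)
    final_path = os.path.join(monitored_dir, filename)
    async with session.stream("GET", url) as csv_stream:
        csv_stream.raise_for_status()
        async with aiofiles.open(tmp_path, "wb") as f:
            async for data in csv_stream.aiter_bytes():
                await f.write(data)
    # os.replace is atomic when source and destination are on the same filesystem.
    os.replace(tmp_path, final_path)

With this arrangement, a file only appears in the monitored directory once it is complete, which also avoids re-reads of partially written files.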
I am ingesting this data into Splunk via inputs.conf and props.conf as below.
[monitor:///my_main_dir_path]
disabled = 0
index = xx
sourcetype = xx:xx
[xx:xx]
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)
NO_BINARY_CHECK = true
CHARSET = UTF-8
INDEXED_EXTRACTIONS = csv
TIMESTAMP_FIELDS = xx
I am running into several issues:
- Some files are not indexed at all.
- From some files, only some of the rows are indexed.
- Some rows are abruptly split into two events in Splunk.
What can be done on the Splunk configuration side to solve the above issues, while making sure the fix does not cause any data to be indexed twice?
Sample Data: (First line is the header.)
A,B B,C D,E,F,G H?,I J K,L M?,N/O P,Q R S,T U V (w x),Y Z,AA BB,CC DD,EE FF,GG HH,II JJ KK,some timestamp field,LL,MM,NN-OO,PP?,QQ RR ss TT UU,VV,WW,XX,YY,ZZ,AAA BBB,CCC,DDD-EEE,FFF GGG,HHH,III JJJ,KKK LLL,MMM MMM,NNN OOO,PPP QQQ,RRR SSS 1,TTT UUU 2,VVV WWW 3,XX YYY,ZZZ AAAA,BBBB CCCC
[email protected],"bbdata, bbdata",ccdata ccdata,eedata eedata - eedata,ffdata - ffdata - 725 ffdata ffdata,No,,No,,,,,unknown,unknown,unknown,2.0.0,"Sep 26 22:40:18 iidata-iidata-12cb65d081f745a2b iidata/iidata[4783]: iidata: to=<[email protected]>, iidata=iidata.iidata.iidata.iidata[111.111.11.11]:25, iidata=0.35, iidata=0.08/0/0.07/0.2, iidata=2.0.0, iidata=iidata (250 2.0.0 OK 1569537618 iidata.325 - iidata)",9/26/2019 22:40,,,,,,,wwdata,xxdata,5,"zzdata, zzdata",aaadata aaadata aaadata,cccdata - cccdata,ddddata - ddddata,fffdata,hhhdata,25/06/2010,6,2010,"nnndata nnndata nnndata, nnndata.",(pppdata'pppdata) pppdata pppdata,,,,303185,,
Sample Broken Event:
[email protected],"bbdata, bbdata",ccdata ccdata,eedata eedata - eedata,ffdata - ffdata - 725 ffdata ffdata,No,,No,,,,,unknown,un
known,unknown,2.0.0,"Sep 26 22:40:18 iidata-iidata-12cb65d081f745a2b iidata/iidata[4783]: iidata: to=<[email protected]>, iidata=iidata.iidata.iidata.iidata[111.111.11.11]:25, iidata=0.35, iidata=0.08/0/0.07/0.2, iidata=2.0.0, iidata=iidata (250 2.0.0 OK 1569537618 iidata.325 - iidata)",9/26/2019 22:40,,,,,,,wwdata,xxdata,5,"zzdata, zzdata",aaadata aaadata aaadata,cccdata - cccdata,ddddata - ddddata,fffdata,hhhdata,25/06/2010,6,2010,"nnndata nnndata nnndata, nnndata.",(pppdata'pppdata) pppdata pppdata,,,,303185,,
Solution
I hope you are monitoring something much more specific than a top-level directory. Otherwise, you risk Splunk running out of open file descriptors and/or memory.
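For example, assuming the downloads land as .csv files under that path (the whitelist value is an assumption, not part of the original answer), the monitor could be narrowed like this:

[monitor:///my_main_dir_path]
# Only pick up files whose full path matches this regex.
whitelist = \.csv$
disabled = 0
index = xx
sourcetype = xx:xx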
Partial rows and divided rows are symptoms of incorrect props.conf settings; it's hard to suggest specific corrections without seeing the affected events as Splunk receives them.
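One hedged refinement that follows from the sample data (an assumption, not something the original answer specifies): the timestamp column looks like 9/26/2019 22:40, so TIME_FORMAT can be pinned in the existing sourcetype stanza so Splunk does not have to guess the timestamp:

[xx:xx]
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)
NO_BINARY_CHECK = true
CHARSET = UTF-8
INDEXED_EXTRACTIONS = csv
TIMESTAMP_FIELDS = xx
# Matches values like "9/26/2019 22:40" (assumed format; verify
# against the real data).
TIME_FORMAT = %m/%d/%Y %H:%M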
It's also possible Splunk is reading each file before it has been fully written. Try adding these settings to inputs.conf:
multiline_event_extra_waittime = true
time_before_close = 3
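Merged into the monitor stanza from the question, that looks like:

[monitor:///my_main_dir_path]
disabled = 0
index = xx
sourcetype = xx:xx
# Wait longer before deciding a multi-line event is complete.
multiline_event_extra_waittime = true
# Seconds of modtime quiet required after EOF before the file is closed.
time_before_close = 3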
Answered By - RichG