Multi file I/O
When files grow to infinity
It has happened to me, and probably to you as well: somebody hands you 20 TByte of pcaps. You start multiple T2 instances as background processes, and after 7 TByte something goes wrong and you have to start all over again. Grrrrr.
Although T2 has no problem with huge pcap files, they are a nuisance; I guess you concur. But what do you do once you have split them up and ended up with 2000 files of 10 GByte each? Don’t worry, the anteater can handle that.
Now you wrote the most sophisticated and ingenious online post-processing of your flow file, and suddenly you run out of disk space. Bummer! Especially if you are only interested in a certain time span or selection of traffic, you would like to split the resulting flow files into more manageable sizes.
And what happens if the pcaps are copied to your computer by some obscure process, and you don’t want T2 to time out when he runs out of food? He should then wait for the new ones to arrive while preserving his internal state. The polling mode comes to the rescue.
Preparation
First, restore T2 into a pristine state by removing all unnecessary or older plugins from the plugin folder ~/.tranalyzer/plugins:
t2build -e -y
Are you sure you want to empty the plugin folder '/home/wurst/.tranalyzer/plugins' (y/N)? yes
Plugin folder emptied
Then compile the core (tranalyzer2) and the following plugins:
t2build tranalyzer2 basicFlow basicStats tcpStates txtSink
...
BUILD SUCCESSFUL
If you have not yet created separate data and results directories, please do it now in another bash window; it will make your workflow easier:
mkdir ~/data ~/results
The sample PCAP used in this tutorial can be downloaded here: annoloc2.pcap.
Please save it in your ~/data folder.
Now you are all set!
PCAP fragmentation
Now fragment the PCAP file into a sequence of smaller pcaps using tcpdump (10 MB per file) and editcap (100000 packets per file), so that we can test different filename formats.
mkdir ~/data/S
tcpdump -r ~/data/annoloc2.pcap -w ~/data/S/annoloc2S.pcap -C 10
ls ~/data/S
annoloc2S.pcap annoloc2S.pcap1 annoloc2S.pcap2 annoloc2S.pcap3 annoloc2S.pcap4 annoloc2S.pcap5 annoloc2S.pcap6 annoloc2S.pcap7 annoloc2S.pcap8
mkdir ~/data/T
editcap -c 100000 ~/data/annoloc2.pcap ~/data/T/annoloc2T.pcap
ls ~/data/T
annoloc2T_00000_20020523183501.pcap annoloc2T_00003_20020523183507.pcap annoloc2T_00006_20020523183514.pcap annoloc2T_00009_20020523183520.pcap annoloc2T_00012_20020523183526.pcap
annoloc2T_00001_20020523183503.pcap annoloc2T_00004_20020523183509.pcap annoloc2T_00007_20020523183516.pcap annoloc2T_00010_20020523183522.pcap
annoloc2T_00002_20020523183505.pcap annoloc2T_00005_20020523183511.pcap annoloc2T_00008_20020523183518.pcap annoloc2T_00011_20020523183524.pcap
Now you are ready for some kung-fu reading.
Read from several defined pcaps in a row
Assume you have a lot of files which are not conveniently numbered as in our case, but follow a time sequence over months or years. Then you can use the -R option, with which T2 accepts a file containing a list of pcaps.
t2 -R PCAPLIST -w outputfile
It processes all the pcap files listed in PCAPLIST. T2 keeps its internal state across file changes, so all pcaps are treated as one large pcap.
The processing order is defined by the order of the filenames in the text file, so no sequential numbering is necessary. Nevertheless, the absolute path has to be specified.
To generate the PCAPLIST you may use the commands below.
{ echo "# List of PCAP files to process"; ls ~/data/S/annoloc2S* | sort; } > ~/data/pcap_Slist.txt
cat ~/data/pcap_Slist.txt
# List of PCAP files to process
/home/wurst/data/S/annoloc2S.pcap
/home/wurst/data/S/annoloc2S.pcap1
/home/wurst/data/S/annoloc2S.pcap2
/home/wurst/data/S/annoloc2S.pcap3
/home/wurst/data/S/annoloc2S.pcap4
/home/wurst/data/S/annoloc2S.pcap5
/home/wurst/data/S/annoloc2S.pcap6
/home/wurst/data/S/annoloc2S.pcap7
/home/wurst/data/S/annoloc2S.pcap8
Lines starting with # are considered comments and are thus ignored by T2.
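A plain sort orders names lexicographically, so once the index reaches two digits, annoloc2S.pcap10 would be listed before annoloc2S.pcap2. Below is a minimal sketch of a list generator that orders the files by their trailing index instead. make_pcaplist is a hypothetical helper written for this tutorial, not part of T2, and it assumes the tcpdump naming scheme where the index sits at the very end of the file name.

```shell
#!/usr/bin/env bash
# make_pcaplist DIR PREFIX (hypothetical helper, not part of T2)
# Prints a comment header followed by the absolute paths of all files in DIR
# whose names start with PREFIX, ordered by their trailing numeric index.
# A file without a trailing index (e.g. out.pcap) is listed first.
make_pcaplist() {
    local dir prefix f n
    dir=$(cd "$1" && pwd) || return 1     # -R requires absolute paths
    prefix="$2"
    echo "# List of PCAP files to process"
    for f in "$dir/$prefix"*; do
        [ -f "$f" ] || continue           # skips the unexpanded pattern too
        n=${f##*[!0-9]}                   # trailing digits, empty if none
        printf '%s\t%s\n' "${n:--1}" "$f" # sort key, then the path
    done | sort -n -k1,1 | cut -f2-       # numeric sort, then drop the key
}
```

For the files of this tutorial, `make_pcaplist ~/data/S annoloc2S > ~/data/pcap_Slist.txt` would write the same kind of list as shown above.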
An easier way is to use the t2caplist script to generate such a list.
t2caplist -h
Usage:
t2caplist [OPTION...] <FILE|DIR>
Optional arguments:
-d depth List pcaps up to the given depth
-L Follow symbolic links
-r List pcaps recursively
-R Sort the list in reverse order
-z Sort the list by file size
-s Do not sort the list
-v Report invalid files to stderr
Help and documentation arguments:
-h, -?, --help Show this help, then exit
It can even follow symbolic links and sort the files, but here we just generate a list and see what happens. So start t2 using the -R option on the generated ~/data/pcap_Slist.txt file:
t2 -R ~/data/pcap_Slist.txt -w ~/results/S/
First, T2 checks that all files exist and are sound. Then he processes the pcaps listed in ~/data/pcap_Slist.txt one after the other and terminates with a standard end report.
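If you want to catch a broken list before launching a long run, a similar check can be done by hand. The sketch below only covers the two rules stated in this section (comment lines are skipped, paths must be absolute) plus a plain existence test. It does not inspect the pcap contents, and check_pcaplist is a hypothetical name, not a T2 tool.

```shell
#!/usr/bin/env bash
# check_pcaplist LIST (hypothetical helper, not part of T2)
# Reports list entries that are not absolute paths or do not exist.
# Returns 0 if every entry is fine, 1 otherwise.
check_pcaplist() {
    local line rc=0
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            ''|'#'*) continue ;;                      # empty line or comment
            /*) [ -f "$line" ] || { echo "missing: $line" >&2; rc=1; } ;;
            *)  echo "not absolute: $line" >&2; rc=1 ;;
        esac
    done < "$1"
    return "$rc"
}
```

A possible use is `check_pcaplist ~/data/pcap_Slist.txt && t2 -R ~/data/pcap_Slist.txt -w ~/results/S/`, so that T2 only starts when the list looks sane.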
Read from a sequence of pcaps
Imagine you have a humongous number of pcaps to process and, lucky you, they were produced with an index in the file name. Then the -D option is the way to go.
The -D option as specified below requires a FILEPREFIX, which may contain the regex character *. If there is an extension, you have to specify it. The general form of the option is shown below:
t2 -D FILEPREFIX[#Start][*][.ext][#Start][:SCHR][,#Stop]
Here, #Start denotes the start index, which is either embedded in the file name or appended after it, and #Stop denotes the stop index. If #Start is omitted, T2 starts at 0 or assumes there is no number. If #Stop is omitted, T2 will wait for the next pcap when he runs out of food.
SCHR denotes the search characters that tell T2 where to find the #Start number in an arbitrary file name. It can contain up to three characters. By default SCHR is set to p, as defined in tranalyzer.h. Open the latter and search for // -D option parameters:
tranalyzer2
vi src/tranalyzer.h
...
// -D option parameters
#define RROP 0 // round robin operation
#define POLLTM 5 // poll timing in sec for files
#define MFPTMOUT 0 // > 0: timeout in sec for poll timing > POLLTM, 0: no poll timeout
#define SCHR 'p' // separating char for number (refer to the doc for examples)
...
POLLTM denotes the interval at which T2 checks whether the next missing file is available in his data directory. If a file index is missing, aka no more food for the anteater, he will wait and poll every POLLTM seconds.
This and the other constants will be discussed under polling timeout.
We chose 'p' as the default because tcpdump adds the index at the end of the file name, behind the pcap extension, i.e. out.pcapNUM. Nevertheless, t2 also covers the more complicated editcap filename format.
The following table summarizes the supported naming patterns and the configuration required. Note the quotes ("), which are necessary to prevent the shell from interpreting regex characters such as * before T2 sees them.
Filenames | Command |
---|---|
out, out1, out2, … | t2 -D out:t |
out.pcap, out.pcap1, out.pcap2, … | t2 -D out.pcap |
out.pcap, out.pcap01, out.pcap02, … | t2 -D out.pcap00 |
out.pcap, out1.pcap, out2.pcap, … | t2 -D "out*.pcap:t" |
out0.pcap, out1.pcap, out2.pcap, … | t2 -D out0.pcap:t |
out00.pcap, out01.pcap, out02.pcap, … | t2 -D out00.pcap:t |
out_00_Wurst.pcap, out_01_Nudel.pcap, out_02_Knoedel.pcap | t2 -D "out_00_*.pcap:t_,2" |
out_24.4.20h00.pcap, out_24.4.2016.20h00.pcap1, … | t2 -D "out*.pcap" |
out_24.4.20h00.pcap00, out_24.04.20h00.pcap01, … | t2 -D "out*.pcap00" |
out0.pcap, out1.pcap, out2.pcap, … | t2 -D out0.pcap:t |
out.pcap00, out.pcap01, out.pcap02, … | t2 -D out.pcap00 |
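Why the quotes matter can be seen without T2 at all. In the sketch below, count_args is a hypothetical stand-in for t2 that only reports how many arguments the shell handed over, and quoting_demo creates three scratch files in a temporary directory to run the comparison.

```shell
#!/usr/bin/env bash
# count_args is a hypothetical stand-in for t2: it only prints the number of
# arguments it received from the shell.
count_args() { echo "$#"; }

# quoting_demo runs in a subshell, so the cd does not affect the caller.
quoting_demo() (
    dir=$(mktemp -d) && cd "$dir" || exit 1
    touch out.pcap out1.pcap out2.pcap
    count_args -D out*.pcap     # unquoted: the shell expands the pattern first
    count_args -D "out*.pcap"   # quoted: the pattern arrives as one argument
    cd / && rm -rf "$dir"
)
```

Calling quoting_demo prints 4 and then 2. In the unquoted case the program receives -D followed by three file names instead of the pattern, which is not what the -D syntax expects.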
So if you want to process all files in the tcpdump split format from index 2 to 4:
t2 -D "$HOME/data/S/annoloc2S.pcap2,4" -w ~/results/S/
The same for the editcap format. Note again the compulsory quotes for the regex processing.
t2 -D "$HOME/data/T/annoloc2T_00002_*.pcap:T_,4" -w ~/results/T/
The end reports differ because tcpdump and editcap fragment the pcap differently.
Polling timeout
If T2 runs out of files, the default behavior of the -D option is to wait for the next file. So you could leave him running somewhere, lurking for more food until you copy the next pcap into his bowl. Try this:
t2 -D ~/data/S/annoloc2S.pcap -w ~/results/S/
================================================================================
Tranalyzer 0.8.14 (Anteater), Tarantula. PID: 48769
================================================================================
[INF] Creating flows for L2, IPv4, IPv6
Active plugins:
    01: basicFlow, 0.8.14
    02: basicStats, 0.8.14
    03: tcpStates, 0.8.14
    04: txtSink, 0.8.14
[INF] IPv4 Ver: 5, Rev: 16122020, Range Mode: 0, subnet ranges loaded: 406105 (406.11 K)
[INF] IPv6 Ver: 5, Rev: 17122020, Range Mode: 0, subnet ranges loaded: 51345 (51.34 K)
Processing file: /home/wurst/data/S/annoloc2S.pcap
Link layer type: Ethernet [EN10MB/1]
Dump start: 1022171701.691172 sec (Thu 23 May 2002 16:35:01 GMT)
[WRN] snapL2Length: 54 - snapL3Length: 40 - IP length in header: 1500
Processing file: /home/wurst/data/S/annoloc2S.pcap1
Processing file: /home/wurst/data/S/annoloc2S.pcap2
Processing file: /home/wurst/data/S/annoloc2S.pcap3
Processing file: /home/wurst/data/S/annoloc2S.pcap4
Processing file: /home/wurst/data/S/annoloc2S.pcap5
Processing file: /home/wurst/data/S/annoloc2S.pcap6
Processing file: /home/wurst/data/S/annoloc2S.pcap7
Processing file: /home/wurst/data/S/annoloc2S.pcap8
Processing file: /home/wurst/data/S/annoloc2S.pcap9
...........Processing file: /home/wurst/data/S/annoloc2S.pcap9
...........
Now open another bash window and copy annoloc2S.pcap to annoloc2S.pcap9. It does not make sense, but it helps to demonstrate t2’s reaction.
cd ~/data/S
cp annoloc2S.pcap annoloc2S.pcap9
In the T2 window you will suddenly see that he grabs the new file, processes it and waits for the next victim. Now imagine that No 9 is missing: T2 then waits forever, even if additional pcaps with a higher index are copied into his data folder. Sometimes No 9 will never come, which brings everything to a sudden halt. To avoid that, for certain overall statistical analyses or for monitoring, it is preferable to skip the missing file and move on. For that purpose T2 implements a poll timeout constant MFPTMOUT. It defines the number of seconds until T2 moves on to the next file index.
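Before leaving T2 alone with a long sequence, it can help to know where the gaps are. The following sketch lists missing indices for the tcpdump naming scheme (base name followed directly by the index). missing_indices is a hypothetical helper, not part of T2.

```shell
#!/usr/bin/env bash
# missing_indices BASE STOP (hypothetical helper, not part of T2)
# Prints every index in 1..STOP for which the file BASE<index> does not exist.
missing_indices() {
    local base="$1" stop="$2" i=1
    while [ "$i" -le "$stop" ]; do
        [ -f "$base$i" ] || echo "$i"
        i=$((i + 1))
    done
}
```

For example, `missing_indices ~/data/S/annoloc2S.pcap 10` prints 9 when annoloc2S.pcap9 is absent and all other indices up to 10 exist.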
Terminate t2 now with ^C^C. You get the end report, and all flows which have not terminated so far are unloaded into the flow file.
tranalyzer2
vi src/tranalyzer.h
// -D option parameters
#define RROP 0 // round robin operation
#define POLLTM 5 // poll timing in sec for files
#define MFPTMOUT 0 // > 0: timeout in sec for poll timing > POLLTM, 0: no poll timeout
#define SCHR 'p' // separating char for number (refer to the doc for examples)
So rename annoloc2S.pcap9 to annoloc2S.pcap10, so that we have a gap.
cd ~/data/S
mv annoloc2S.pcap9 annoloc2S.pcap10
Then set the poll timeout to 10 seconds, so that T2 waits that long for No 9 to arrive and otherwise moves on to No 10. Recompile and rerun T2 on the same pcap.
t2conf tranalyzer2 -D MFPTMOUT=10 && t2build tranalyzer2
t2 -D ~/data/S/annoloc2S.pcap -w ~/results/S/
================================================================================
Tranalyzer 0.8.14 (Anteater), Tarantula. PID: 49073
================================================================================
[INF] Creating flows for L2, IPv4, IPv6
Active plugins:
    01: basicFlow, 0.8.14
    02: basicStats, 0.8.14
    03: tcpStates, 0.8.14
    04: txtSink, 0.8.14
[INF] IPv4 Ver: 5, Rev: 16122020, Range Mode: 0, subnet ranges loaded: 406105 (406.11 K)
[INF] IPv6 Ver: 5, Rev: 17122020, Range Mode: 0, subnet ranges loaded: 51345 (51.34 K)
Processing file: /home/wurst/data/S/annoloc2S.pcap
Link layer type: Ethernet [EN10MB/1]
Dump start: 1022171701.691172 sec (Thu 23 May 2002 16:35:01 GMT)
[WRN] snapL2Length: 54 - snapL3Length: 40 - IP length in header: 1500
Processing file: /home/wurst/data/S/annoloc2S.pcap1
Processing file: /home/wurst/data/S/annoloc2S.pcap2
Processing file: /home/wurst/data/S/annoloc2S.pcap3
Processing file: /home/wurst/data/S/annoloc2S.pcap4
Processing file: /home/wurst/data/S/annoloc2S.pcap5
Processing file: /home/wurst/data/S/annoloc2S.pcap6
Processing file: /home/wurst/data/S/annoloc2S.pcap7
Processing file: /home/wurst/data/S/annoloc2S.pcap8
.....Processing file: /home/wurst/data/S/annoloc2S.pcap10
...........
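The interplay of POLLTM and MFPTMOUT described in this section can be pictured as a simple polling loop. The sketch below is an illustration in shell under that reading of the two constants. It is not T2's actual implementation, and wait_for_file is a hypothetical name.

```shell
#!/usr/bin/env bash
# wait_for_file PATH POLL TIMEOUT (illustration only, not T2 code)
# Checks every POLL seconds whether PATH exists.
# Returns 0 as soon as it does, 1 once TIMEOUT (> 0) seconds have passed.
# TIMEOUT=0 means wait forever, like the documented meaning of MFPTMOUT=0.
wait_for_file() {
    local path="$1" poll="$2" timeout="$3" waited=0
    while [ ! -f "$path" ]; do
        if [ "$timeout" -gt 0 ] && [ "$waited" -ge "$timeout" ]; then
            return 1                  # give up, the caller moves on
        fi
        sleep "$poll"
        waited=$((waited + poll))
    done
    return 0
}
```

With the values used above, `wait_for_file ~/data/S/annoloc2S.pcap9 5 10` would return 1 after roughly 10 seconds if No 9 never shows up, at which point a caller could continue with No 10.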
Round robin operation
To automate the flow file post-processing and to conserve disk space, a round robin approach is very helpful. The rollover index should be adapted to the post-processing speed and the size of the fragments.
As a test, switch on RROP, set the rollover index to 8 on the command line and reset the polling timeout, as we do not need it for the following demonstration:
t2conf tranalyzer2 -D RROP=1 -D MFPTMOUT=0 && t2build tranalyzer2
t2 -D ~/data/S/annoloc2S.pcap,8 -w ~/results/S/
================================================================================
Tranalyzer 0.8.14 (Anteater), Tarantula. PID: 49401
================================================================================
[INF] Creating flows for L2, IPv4, IPv6
Active plugins:
    01: basicFlow, 0.8.14
    02: basicStats, 0.8.14
    03: tcpStates, 0.8.14
    04: txtSink, 0.8.14
[INF] IPv4 Ver: 5, Rev: 16122020, Range Mode: 0, subnet ranges loaded: 406105 (406.11 K)
[INF] IPv6 Ver: 5, Rev: 17122020, Range Mode: 0, subnet ranges loaded: 51345 (51.34 K)
Processing file: /home/wurst/data/S/annoloc2S.pcap
Link layer type: Ethernet [EN10MB/1]
Dump start: 1022171701.691172 sec (Thu 23 May 2002 16:35:01 GMT)
[WRN] snapL2Length: 54 - snapL3Length: 40 - IP length in header: 1500
Processing file: /home/wurst/data/S/annoloc2S.pcap1
Processing file: /home/wurst/data/S/annoloc2S.pcap2
Processing file: /home/wurst/data/S/annoloc2S.pcap3
Processing file: /home/wurst/data/S/annoloc2S.pcap4
Processing file: /home/wurst/data/S/annoloc2S.pcap5
Processing file: /home/wurst/data/S/annoloc2S.pcap6
Processing file: /home/wurst/data/S/annoloc2S.pcap7
Processing file: /home/wurst/data/S/annoloc2S.pcap8
^C[INF] SIGINT: Stop flow creation: 0x0002
Processing file: /home/wurst/data/S/annoloc2S.pcap
Processing file: /home/wurst/data/S/annoloc2S.pcap1
Processing file: /home/wurst/data/S/annoloc2S.pcap2
Processing file: /home/wurst/data/S/annoloc2S.pcap3
Processing file: /home/wurst/data/S/annoloc2S.pcap4
Processing file: /home/wurst/data/S/annoloc2S.pcap5
Processing file: /home/wurst/data/S/annoloc2S.pcap6
Processing file: /home/wurst/data/S/annoloc2S.pcap7
Processing file: /home/wurst/data/S/annoloc2S.pcap8
Processing file: /home/wurst/data/S/annoloc2S.pcap
Processing file: /home/wurst/data/S/annoloc2S.pcap1
Processing file: /home/wurst/data/S/annoloc2S.pcap2
Processing file: /home/wurst/data/S/annoloc2S.pcap3
^C[INF] SIGINT: Stop flow creation: 0x0001
Dump stop : 1022171713.457599 sec (Thu 23 May 2002 16:35:13 GMT)
Total dump duration: 11.766427 sec
Finished processing. Elapsed time: 1.411131 sec
Finished unloading flow memory. Time: 1.609616 sec
...
Interrupt it with ^C^C or send a t2stat -TERM command from another bash window.
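The file order visible in the output above (annoloc2S.pcap, annoloc2S.pcap1 up to annoloc2S.pcap8, then annoloc2S.pcap again) amounts to a wrap-around of the index. A small sketch of that arithmetic follows. rr_name is a hypothetical helper that only mirrors the observed order for the tcpdump naming scheme.

```shell
#!/usr/bin/env bash
# rr_name BASE STOP N (illustration only, not T2 code)
# Prints the name of the N-th file visited (N counts from 0) when the index
# wraps around after STOP. Index 0 maps to the bare BASE name, as produced by
# tcpdump for the first fragment.
rr_name() {
    local base="$1" stop="$2" n="$3" idx
    idx=$(( n % (stop + 1) ))
    if [ "$idx" -eq 0 ]; then
        echo "$base"
    else
        echo "$base$idx"
    fi
}
```

With a rollover index of 8, `rr_name annoloc2S.pcap 8 9` yields annoloc2S.pcap again, so whatever produces the pcaps has to have replaced that file by the time T2 comes back to it.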
Split output files
As with pcaps you can split flow files into smaller chunks, either measured in bytes or number of flows. The general command line option is defined as follows:
t2 -W PREFIX[:SIZE][,START]
The expression before the : defines the output file name prefix; the expression following it denotes the maximal file size of each fragment. If omitted, it defaults to OFRWFILELN, defined in tranalyzer.h:
tranalyzer2
vi src/tranalyzer.h
// -W option parameters
#define OFRWFILELN 5E8 // default fragmented output file length (500MB)
START defines the index of the first file generated. If omitted, it defaults to 0.
The SIZE of the files can be specified in bytes (default), KB (K), MB (M) or GB (G). Scientific notation, i.e., 1e5 or 1E5 (=100000), can be used as well. If no size is specified, then the : can be omitted.
If an f is appended, the unit is flow count. Hence, file chunks are produced containing the same number of flows.
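To make the SIZE notation concrete, here is a small sketch that turns a SIZE expression into a plain number and a unit. It assumes decimal multipliers (K = 1000, M = 10^6, G = 10^9), which is an assumption of this sketch. parse_size is a hypothetical helper and does not reproduce T2's own parser.

```shell
#!/usr/bin/env bash
# parse_size SPEC (illustration only, not T2 code)
# Prints "<number> <unit>", where unit is "bytes" or "flows".
# Assumption: K, M and G are decimal multipliers.
parse_size() {
    local spec="$1" unit=bytes mult=1
    case $spec in
        *f) unit=flows; spec=${spec%f} ;;   # trailing f switches to flow count
    esac
    case $spec in
        *K) mult=1000;       spec=${spec%K} ;;
        *M) mult=1000000;    spec=${spec%M} ;;
        *G) mult=1000000000; spec=${spec%G} ;;
    esac
    # awk handles plain numbers as well as scientific notation (1e5, 1.5E9)
    awk -v n="$spec" -v m="$mult" -v u="$unit" \
        'BEGIN { printf "%.0f %s\n", n * m, u }'
}
```

Under these assumptions, `parse_size 1.5E9` and `parse_size 1.5G` both give 1500000000 bytes, and `parse_size 5Kf` gives 5000 flows.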
Some typical examples are shown below.
Command | Fragment Size | Start Index | Output Files |
---|---|---|---|
t2 -r ~/data/annoloc2.pcap -W ~/results/out:1.5E9,10 | 1.5 GB | 10 | out10, out11, … |
t2 -r ~/data/annoloc2.pcap -W ~/results/out:1.5e9,5 | 1.5 GB | 5 | out5, out6, … |
t2 -r ~/data/annoloc2.pcap -W ~/results/out:1.5G,1 | 1.5 GB | 1 | out1, out2, … |
t2 -r ~/data/annoloc2.pcap -W ~/results/out:5000K | 5 MB | 0 | out0, out1, … |
t2 -r ~/data/annoloc2.pcap -W ~/results/out:5Kf | 5000 flows | 0 | out0, out1, … |
t2 -r ~/data/annoloc2.pcap -W ~/results/out:2.5G | 2.5 GB | 0 | out0, out1, … |
t2 -r ~/data/annoloc2.pcap -W ~/results/out,6 | OFRWFILELN | 6 | out6, out7, … |
t2 -r ~/data/annoloc2.pcap -W ~/results/out | OFRWFILELN | 0 | out0, out1, … |
Try them out and see what happens. Although the round robin mode is useful in production, it is advisable to reset it now, as it is still switched on from the previous section; otherwise you end up in a loop with files constantly being overwritten.
t2conf tranalyzer2 -D RROP=0 && t2build tranalyzer2
A prominent application in production environments is the combination of the -D and -W options as shown below, with at most 1000 flows per file and the devilish start index 666:
t2 -D ~/data/S/annoloc2S.pcap,8 -W ~/results/F/:1000f,666
ls ~/results/F
annoloc2S_flows.txt666 annoloc2S_flows.txt669 annoloc2S_flows.txt672 annoloc2S_flows.txt675 annoloc2S_flows.txt678 annoloc2S_flows.txt681 annoloc2S_headers.txt
annoloc2S_flows.txt667 annoloc2S_flows.txt670 annoloc2S_flows.txt673 annoloc2S_flows.txt676 annoloc2S_flows.txt679 annoloc2S_flows.txt682
annoloc2S_flows.txt668 annoloc2S_flows.txt671 annoloc2S_flows.txt674 annoloc2S_flows.txt677 annoloc2S_flows.txt680 annoloc2S_flows.txt683
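A quick sanity check of such a run is to count the rows in every chunk. The sketch below does that for all files following the *_flows.txt<index> pattern seen above. It skips lines starting with %, which is assumed here to mark header lines. chunk_flow_counts is a hypothetical helper, not part of T2.

```shell
#!/usr/bin/env bash
# chunk_flow_counts DIR (hypothetical helper, not part of T2)
# Prints "<file> <rows>" for every *_flows.txt<index> chunk in DIR.
# Assumption: lines starting with % are header lines and are not counted.
chunk_flow_counts() {
    local f
    for f in "$1"/*_flows.txt[0-9]*; do
        [ -f "$f" ] || continue       # skips the unexpanded pattern too
        printf '%s %s\n' "${f##*/}" "$(grep -vc '^%' "$f")"
    done
}
```

Running `chunk_flow_counts ~/results/F` after the command above lists one line per chunk, which makes it easy to see whether the chunks respect the 1000f limit.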
How to process several different files
Often a multitude of different pcaps, uncorrelated in time and source, has to be processed in the background. For that, you are better off writing a script yourself. Here is an example:
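#!/usr/bin/env bash
# Run T2 on a numbered series of independent pcaps, one output per input.

# All four arguments are required
if [ "$#" -ne 4 ]; then
    echo "Usage: $0 filename extension startIndex endIndex" >&2
    exit 1
fi

NAME=$1
EXT=$2
START=$3
END=$4

for ((i = START; i <= END; i++)); do
    rfile="$HOME/data/$NAME$i.$EXT"
    wfile="$HOME/results/$NAME$i"
    if [ -f "$rfile" ]; then
        echo "Processing '$rfile', writing to '$wfile'"
        t2 -r "$rfile" -w "$wfile"
    else
        # Report missing inputs instead of silently skipping them
        echo "Skipping '$rfile': file not found" >&2
    fi
done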
Conclusion
Make sure that the polling timeout and the round robin mode are reset for the following tutorials, if not already done earlier.
t2conf --reset tranalyzer2
Have fun and may the anteater be with you!