Post-processing with Tawk
Contents
- Prerequisites
- General introduction
- Examples
- Print all 5 tuples (source and destination IP and ports, protocol)
- Print the hosts involved in the most flows
- Ignore all flows between private IPs
- Print the source and destination addresses of all DNS flows related to Facebook
- Replace the protocol number by its string representation, e.g., 6 -> TCP
- Replace the Unix timestamp used for timeFirst and timeLast by their value in UTC
- Replace the Unix timestamp used for timeFirst and timeLast by their values in localtime
- Print the 10 hosts sending the most bytes over UDP
- Inspect the flow number 1234 in the flow file
- Follow a specific flow, e.g., the flow with flow index 1234, in the packet file
- Inspect the packet number 1234 in the packet file
- Follow a flow (similar to Wireshark follow TCP/UDP stream):
- Recreate a binary file transferred in a B flow:
- Extract all flows whose HTTP Host: header matches google using Wireshark field names
- Extract the DNS query field from all flows where at least one DNS answer was seen (using Wireshark field names)
- Open all ICMP flows involving the network 1.2.3.4/24 in Wireshark
- Create a PCAP file with all TCP flows with port 80 or 8080
- Writing a Tawk function
- Using Tawk within scripts
- Using Tawk with non-Tranalyzer files
- Mapping external column names to Tranalyzer column names
- Using Tawk with Bro/Zeek files
- Examples
- See also
This tutorial presents tawk functionality through various scenarios.
tawk works just like awk, but provides access to the columns via their names.
In addition, it provides access to helper functions, such as host() or port().
For an overview, refer to the Alphabetical list of Tawk functions.
Custom functions can be added in the folder named t2custom, where they will be automatically loaded.
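For example, where plain awk requires the positional index of a column, tawk lets you address it by name. The column position used in the plain awk line below is hypothetical and depends on the plugins loaded:

awk -F'\t' '!/^%/ && $12 == 6' FILE_flows.txt    # plain awk: column 12 is assumed to hold l4Proto, header rows skipped
tawk '$l4Proto == 6' FILE_flows.txt              # tawk: use the column name directly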
Prerequisites
This tutorial assumes a working knowledge of awk
.
Dependencies
gawk version 4.1 or newer is required.
Distribution | Installation command | Notes |
Kali/Ubuntu | sudo apt-get install gawk | |
Arch | sudo pacman -S gawk | |
Fedora/Red Hat | sudo yum install gawk | |
Gentoo | sudo emerge gawk | |
openSUSE | sudo zypper install gawk | |
macOS | brew install gawk | Homebrew package manager |
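To check which gawk version is installed (it should report at least 4.1), run:

gawk --version | head -n1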
Installation
The recommended way to install tawk is to install t2_aliases as documented in README.md:

Append the following lines to ~/.bashrc:

if [ -f "$T2HOME/scripts/t2_aliases" ]; then
    . $T2HOME/scripts/t2_aliases    # Note the leading `.'
fi

Make sure to replace $T2HOME with the actual path, e.g., $HOME/tranalyzer-0.9.0/plugins.
Documentation (man pages)
The man pages for tawk and t2nfdump (more on that later) can be installed by running ./install.sh man.
Once installed, they can be consulted by running man tawk and man t2nfdump respectively.
General introduction
Command line options
First, run tawk -h to list the available command line options:
tawk -h
Usage:
tawk [OPTION...] 'program' file_flows.txt
tawk [OPTION...] -I file_flows.txt 'program'
Input arguments:
-I file Alternative way to specify the input file
Optional arguments:
-N num Row number where column names are to be found
-s char First character for the row listing the columns name
-F fs Use 'fs' as input field separator
-O fs Use 'fs' as output field separator
--csv Set input and output separators to ',' and
extract names from first row
--zeek Configure tawk to work with Bro/Zeek log files
-f file Read (t)awk program from file
-n Load nfdump functions
-e Load examples functions
-H Do not output the header (column names)
-c[=u] Output command line as a comment
(use -c=u for UTC instead of localtime)
-t Validate column names (slow)
-r Try renaming invalid columns (suffix them with '_') (slow)
Tranalyzer specific arguments:
-k Run Wireshark on the extracted data
-x outfile Create a PCAP file with the selected flows/packets
-X xerfile Specify the '.xer' file to use with -k and -x options
-P Extract specific packets instead of whole flows
-b Always extract both directions (A and B flows)
-V vname[=value] Display Tranalyzer variable 'vname' documentation
-L Decode all variables from Tranalyzer log file
Help and documentation arguments:
-l[=n], --list[=n] List column names and numbers
-g[=n], --func[=n] List available functions
-d fname Display function 'fname' documentation
-D Display tawk PDF documentation
-?, -h, --help Show help options and exit
-s and -N options
The -s option can be used to specify the starting character(s) of the row containing the column names (default: %).
If several rows start with the specified character(s), then the last one is used as column names.
To change this behavior, the line number can be specified as well with the help of the -N option.
For example, if rows 1 to 5 start with # and row 3 contains the column names, specify the separator as follows: tawk -s "#" -N 3.
If the row with column names does not start with a special character, use -s "".
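For instance, for a hypothetical tab-separated file whose column names are listed on the first row without any leading character (the duration column is assumed to exist in that file):

tawk -s "" -N 1 '{ print $duration }' file.txt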
What features (columns) are available?
tawk -l FILE_flows.txt
What functions are available?
tawk -g FILE_flows.txt
Alternatively, refer to the Alphabetical list of Tawk functions.
How to use a specific function?
tawk -d function_name
How to interpret a specific column?
tawk -V colName
tawk -V colName=value
How to decode all aggregated fields in Tranalyzer log file?
tawk -L out_log.txt
t2 -r file.pcap | tawk -L
Examples
Print all 5 tuples (source and destination IP and ports, protocol)
tawk '{ print tuple5() }' FILE_flows.txt
Print the hosts involved in the most flows
tawk '{ aggr($srcIP); aggr($dstIP) }' FILE_flows.txt
Ignore all flows between private IPs
tawk 'not(privip($srcIP) && privip($dstIP))' FILE_flows.txt
Print the source and destination addresses of all DNS flows related to Facebook
tawk 'wildcard("^dns.*") ~ /facebook/ { print tuple2() }' FILE_flows.txt
Replace the protocol number by its string representation, e.g., 6 -> TCP
tawk '{ $l4Proto = proto2str($l4Proto); print }' FILE_flows.txt
Replace the Unix timestamp used for timeFirst and timeLast by their value in UTC
tawk '{ $timeFirst = utc($timeFirst); $timeLast = utc($timeLast); print }' FILE_flows.txt
Replace the Unix timestamp used for timeFirst and timeLast by their values in localtime
tawk '{ $timeFirst = localtime($timeFirst); $timeLast = localtime($timeLast); print }' FILE_flows.txt
Print the 10 hosts sending the most bytes over UDP
tawk -H ' udp() && !bitsallset($flowStat, 1) { aggr($srcIP, $numBytesSnt, 10); aggr($dstIP, $numBytesRcvd, 10); } ' FILE_flows.txt
Inspect the flow number 1234 in the flow file
tawk 'flow(1234)' FILE_flows.txt
Follow a specific flow, e.g., the flow with flow index 1234, in the packet file
tawk 'flow(1234)' FILE_packets.txt
Inspect the packet number 1234 in the packet file
tawk 'packet(1234)' FILE_packets.txt
Follow a flow (similar to Wireshark follow TCP/UDP stream):
tawk 'follow_stream(1)' FILE_packets.txt
Recreate a binary file transferred in a B flow:
tawk 'follow_stream(1, 3, "B")' FILE_packets.txt | xxd -p -r > out.data
Extract all flows whose HTTP Host: header matches google using Wireshark field names
tawk 'shark("http.host") ~ /google/' FILE_flows.txt
Extract the DNS query field from all flows where at least one DNS answer was seen (using Wireshark field names)
tawk 'shark("dns.count.answers") { print shark("dns.qry.name") }' FILE_flows.txt
Open all ICMP flows involving the network 1.2.3.4/24 in Wireshark
tawk -k 'icmp() && host("1.2.3.4/24")' FILE_flows.txt
Create a PCAP file with all TCP flows with port 80 or 8080
tawk -x file.pcap 'tcp() && port("80;8080")' FILE_flows.txt
Writing a Tawk function
- Ideally one function per file (where the filename is the name of the function)
- Private functions are prefixed with an underscore
- Always declare local variables 8 spaces after the function arguments
- Local variables are prefixed with an underscore
- Use uppercase letters and two leading and two trailing underscores for global variables
- Include all referenced functions
- Files should be structured as follows:
#!/usr/bin/env awk
#
# Function description
#
# Parameters:
# - arg1: description
# - arg2: description (optional)
#
# Dependencies:
# - plugin1
# - plugin2 (optional)
#
# Examples:
# - tawk `funcname()' file.txt
# - tawk `{ print funcname() }' file.txt
@include "hdr"
@include "_validate_col"
function funcname(arg1, arg2,        _locvar1, _locvar2) {
    _locvar1 = _validate_col("colname1;altcolname1", _my_colname1)
    _validate_col("colname2")
    if (hdr()) {
        if (__PRIHDR__) print "header"
    } else {
        print "something", _locvar1, $colname2
    }
}
- Copy your files into the t2custom folder.
- To have your functions automatically loaded, include them in the file t2custom/t2custom.load.
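As a concrete illustration, here is a minimal sketch of a hypothetical function l4proto() written according to these conventions; it assumes the l4Proto column (basicFlow plugin) and would be saved as t2custom/l4proto and listed in t2custom/t2custom.load:

#!/usr/bin/env awk
#
# Prints the layer 4 protocol of every flow
#
# Dependencies:
#   - basicFlow
#
# Examples:
#   - tawk 'l4proto()' file.txt

@include "hdr"
@include "_validate_col"

function l4proto() {
    _validate_col("l4Proto")
    if (hdr()) {
        if (__PRIHDR__) print "l4Proto"
    } else {
        print $l4Proto
    }
}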
Using Tawk within scripts
To use tawk from within a script:

- Create a TAWK variable pointing to the script: TAWK="$T2HOME/scripts/tawk/tawk" (make sure to replace $T2HOME with the actual path to the scripts folder)
- Call tawk as follows: $TAWK 'dport(80)' file.txt
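For instance, a minimal wrapper script along these lines (the installation path, script name and behavior are assumptions; adapt them to your setup) could look as follows:

#!/usr/bin/env bash
# Hypothetical helper: print the 2-tuple of all flows with destination port 80
# from the flow file given as the first argument
TAWK="$HOME/tranalyzer-0.9.0/scripts/tawk/tawk"    # adjust to your installation
"$TAWK" 'dport(80) { print tuple2() }' "$1"

It would then be invoked as ./http_flows.sh FILE_flows.txt (the script name is hypothetical).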
Using Tawk with non-Tranalyzer files
tawk can also be used with files which were not produced by Tranalyzer.

- The input field separator can be specified with the -F option, e.g., tawk -F ',' 'program' file.csv
- The row listing the column names can start with any character, specified with the -s option, e.g., tawk -s '#' 'program' file.txt
- Column names must not be equal to a function or builtin name (tawk will try renaming them with a trailing underscore if the -r option is used (slow))
- Valid column names must start with a letter (a-z, A-Z) and can be followed by any number of alphanumeric characters or underscores
- If the column names are different from those used by Tranalyzer, refer to the next section.
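For example, a hypothetical comma-separated file conns.csv whose first row lists the column names src, dst and bytes could be queried as follows (using the --csv option described above):

tawk --csv '$bytes > 1000 { print $src, $dst }' conns.csv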
Mapping external column names to Tranalyzer column names
If the column names are different from those used by Tranalyzer, a mapping between the different names can be made in the file scripts/tawk/my_vars.
The format of the file is as follows:

BEGIN {
    _my_srcIP = "non_t2_name_for_srcIP"
    _my_dstIP = "non_t2_name_for_dstIP"
    ...
}
Once edited, run tawk with the -i $T2HOME/scripts/tawk/my_vars option and the external column names will be automatically used by tawk functions, such as tuple2().
For more details, refer to the my_vars file itself.
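For example, assuming a hypothetical CSV export where the source and destination addresses are stored in columns named sa and da, my_vars could contain:

BEGIN {
    _my_srcIP = "sa"    # column holding the source address
    _my_dstIP = "da"    # column holding the destination address
}

The export could then be processed with:

tawk -i $T2HOME/scripts/tawk/my_vars --csv '{ print tuple2() }' export.csv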
Using Tawk with Bro/Zeek files
To use tawk with Bro/Zeek log files, use the --bro or --zeek option:
tawk --bro '{ program }' file.log
tawk --zeek '{ program }' file.log
Examples
Pivoting (variant 1):

First, extract an attribute of interest, e.g., an unresolved IP address in the Host: field of the HTTP header:

tawk 'aggr($httpHosts)' FILE_flows.txt | tawk '{ print unquote($1); exit }'

Then, put the result of the last command in the badguy variable and use it to extract flows involving this IP:

tawk -v badguy="$(!!)" 'host(badguy)' FILE_flows.txt

Pivoting (variant 2):

First, extract an attribute of interest, e.g., an unresolved IP address in the Host: field of the HTTP header, and store it into a badip variable:

badip="$(tawk 'aggr($httpHosts)' FILE_flows.txt | tawk '{ print unquote($1); exit }')"

Then, use the badip variable to extract flows involving this IP:

tawk -v badguy="$badip" 'host(badguy)' FILE_flows.txt
Aggregate the number of bytes sent between source and destination addresses (independent of the protocol and port) and output the top 10 results:
tawk 'aggr($srcIP4 OFS $dstIP4, $numBytesSnt, 10)' FILE_flows.txt
Aggregate the number of bytes, packets and flows sent over TCP between source and destination addresses (independent of the port) and output the top 20 results (output sorted according to numBytesSnt):

tawk 'tcp() { aggr(tuple2(), $numBytesSnt OFS $numPktsSnt OFS "Flows", 20) }' FILE_flows.txt
Sort the flow file according to the duration (longest flows first) and output the top 5 results:
tawk 't2sort(duration, 5)' FILE_flows.txt
Extract all TCP flows:
tawk 'tcp()' FILE_flows.txt
Extract all flows whose destination port is between 6000 and 6008 (inclusive):
tawk 'dport("6000-6008")' FILE_flows.txt
Extract all flows whose destination port is 53, 80 or 8080:
tawk 'dport("53;80;8080")' FILE_flows.txt
Extract all flows involving an IP in the subnet 192.168.1.0/24 (using the host() or net() function):

tawk 'host("192.168.1.0/24")' FILE_flows.txt
tawk 'net("192.168.1.0/24")' FILE_flows.txt

Extract all flows whose destination IP is in subnet 192.168.1.0/24 (using the dhost() or dnet() function):

tawk 'dhost("192.168.1.0/24")' FILE_flows.txt
tawk 'dnet("192.168.1.0/24")' FILE_flows.txt

Extract all flows whose source IP is in subnet 192.168.1.0/24 (using the shost() or snet() function):

tawk 'shost("192.168.1.0/24")' FILE_flows.txt
tawk 'snet("192.168.1.0/24")' FILE_flows.txt

Extract all flows whose source IP is in subnet 192.168.1.0/24 (using the ipinrange() function):

tawk 'ipinrange($srcIP4, "192.168.1.0", "192.168.1.255")' FILE_flows.txt

Extract all flows whose source IP is in subnet 192.168.1.0/24 (using the ipinnet() function):

tawk 'ipinnet($srcIP4, "192.168.1.0", "255.255.255.0")' FILE_flows.txt

Extract all flows whose source IP is in subnet 192.168.1.0/24 (using the ipinnet() function and a hex mask):

tawk 'ipinnet($srcIP4, "192.168.1.0", 0xffffff00)' FILE_flows.txt

Extract all flows whose source IP is in subnet 192.168.1.0/24 (using the ipinnet() function and the CIDR notation):

tawk 'ipinnet($srcIP4, "192.168.1.0/24")' FILE_flows.txt

Extract all flows whose source IP is in subnet 192.168.1.0/24 (using the ipinnet() function and a CIDR mask):

tawk 'ipinnet($srcIP4, "192.168.1.0", 24)' FILE_flows.txt
For more examples, refer to the tawk -d option, e.g., tawk -d aggr, where every function is documented and comes with a set of examples.
For more complex examples, have a look at the scripts/t2fm/tawk/ folder.
The complete documentation can be consulted by running tawk -d all.