Classification of encrypted video streams
Contents
Introduction
This tutorial explains how to use Tranalyzer to extract the bytes-per-burst (BPB) feature from TLS encrypted YouTube video streams and recognize what video title is contained in a new test sample, or detect that it is a new video. This is an implementation of recent work by Dubin et al. [1]
The bytes-per-burst feature
A flow is viewed as a signal of packets over time. This signal is then transformed in a series of bursts. A burst here is defined as the set of packets that were recorded within a certain time window of each other. (Note that this is not a regular binning of the time dimension as a burst can be arbitrarily large in the time dimension as long as the next packet arrives within that window.) Each burst corresponds to the sum of bytes contained in all the packets aggregated into that burst. The total number of bytes in each burst for a given flow is then used to characterize this flow.
Extracting bursts from a flow
Given the following series of packets:
Using a time window of 50ms in the nFrstPkts
plugin, the following bursts are extracted:
(Note the logarithmic scale on the y-axis and the changing y-limits between the two plots.)
Plots like this can also be generated for any given flow file using the fpsGplt
and t2plot
scripts. More information on this can be found in the documentation or in the traffic mining tutorial.
Identifying YouTube flows
In order to identify YouTube flows in a larger PCAP traffic dump, the Server Name Indication (SNI) TLS extension is used. The sslDecode plugin for Tranalyzer makes the server name available in the sslServerName
column in the flow file.
Prerequisites
- Tranalyzer version 0.8.1lmw4 or higher,
- A folder containing your training data:
- Dubin et al [1] provide their data at: http://www.cse.bgu.ac.il/title_fingerprinting/dataset_chrome_100
- The PCAP files used for training are expected to be at the following location:
DATA_PATH/{Class1,Class2,...,ClassN}/Train/*.pcap
Implementation
We are going to use a few modules:
#!/usr/bin/env python3
import argparse
import matplotlib.pyplot as plt
import tempfile
import subprocess
import threading
import os
from os.path import dirname, basename, isfile, splitext, normpath, expanduser
import sys
import glob
import pickle
Configuration
We need the path to the Tranalyzer directory, and to the PCAP files that we are going to be using to create our model.
# Configuration: Adapt to your system.
= '{}/code/tranalyzer2-0.9.2'.format(expanduser('~'))
T2_ROOT = '/mnt/{}/BPB/data_short'.format(os.environ['USER'])
DATA_PATH
= False VERBOSE
# Automatically derived
= '{}/tranalyzer2/src/tranalyzer'.format(T2_ROOT)
T2_PATH = '{}/scripts/t2conf/t2conf'.format(T2_ROOT)
T2CONF = '{}/autogen.sh'.format(T2_ROOT)
T2BUILD = 'bpbs.data' BPBS_PATH
Setting up Tranalyzer
First, we need to set up Tranalyzer to include the required non-standard plugins:
nFrstPkts
to get the signal for the first few packets of a flow, andsslDecode
to identify YouTube flows.
We can use t2conf
to configure the plugins to our liking. For this tutorial, we set the minimum time window that defines a burst to 50ms and the number of packets to analyze to 200.
To build Tranalyzer and the plugins, we use the included autogen
build script.
= ['nFrstPkts', 'sslDecode']
REQUIRED_PLUGINS
def setup_t2():
= {
options 'NFRST_IAT': '0',
'NFRST_MINIATS': '0',
'NFRST_MINIATU': '50000',
'NFRST_PLAVE': '0',
'NFRST_PKTCNT': '200'
}
# Configure
for opt, val in options.items():
'nFrstPkts', '-D', f'{opt}={val}'])
subprocess.call([T2CONF,
# Build
subprocess.call([T2BUILD])for plugin in REQUIRED_PLUGINS:
subprocess.call([T2BUILD, plugin])
def definitely_need_setup_t2():
= subprocess.check_output([T2BUILD, '-l'], text=True).splitlines()
active_plugins return not all(plugin in active_plugins for plugin in REQUIRED_PLUGINS)
Running Tranalyzer
We need a function that, given a PCAP file, runs Tranalyzer to determine the bursts, and returns the path to the resulting flow file for further processing:
def run_t2(infile, outdir):
subprocess.call([
T2_PATH,'-r', infile,
'-w', outdir
=subprocess.DEVNULL)
], stdoutreturn '{}/{}_flows.txt'.format(outdir, splitext(basename(infile))[0])
Extract the BPB features
Given a flow file, we now need to extract a list of numbers that correspond to the total number of bytes in each burst. For this, we run a small (T)AWK script and some basic postprocessing that we execute using tawk
.
def get_bpb(flowfile):
= '''
EXTRACT_SCRIPT source "$1/scripts/t2utils.sh"
flowfile="$2"
cat $flowfile | $TAWK -t -I $flowfile \
'($dir == "B" && $sslServerName ~ /.googlevideo.com/) {
print $numBytesSnt, $L2L3L4Pl_Iat_nP
}' |
sort -nr | # sort by byte count, descending
head -n 1 | # get biggest flow
cut -d$'\t' -f2 | # get nFrstPkts output (burst)
tr ';' '\n' | # one burst per line
cut -d'_' -f1 | # extract byte count from each burst
cat
'''
= subprocess.Popen(['/bin/bash', '-c', EXTRACT_SCRIPT, '', T2_ROOT, flowfile], stdout=subprocess.PIPE)
proc = [int(bpb) for bpb in proc.stdout] if proc.stdout else []
bpbs return bpbs
def extract_bpb(pcap):
with tempfile.TemporaryDirectory() as tempdir:
= run_t2(pcap, tempdir)
flowfile = get_bpb(flowfile)
bpb return bpb
Let’s check the output using a random PCAP in our training data path:
try:
print(extract_bpb(glob.glob(f'{DATA_PATH}/*/Train/*.pcap')[0]))
except:
print('Unable to find any PCAPs in your training data. Are you sure you set DATA_PATH correctly?')
pass
[434420, 1595848, 359344, 1811560, 351882, 1682472, 1665230, 329019, 2101566, 1793397, 356024, 1789932, 2101566, 338444, 2101566, 1377684, 341082, 1830971, 2084460, 337281, 2101566, 1792682, 328908, 2101566, 1957214, 445567, 1129815]
We extract the bursts for several PCAPS in parallel to speed up the process. Each thread is given a PCAP file, runs Tranalyzer, extracts the bursts and stores them into a unique location per thread.
def thread_t2_bpb(bpbs, index, pcap):
= extract_bpb(pcap)
bpb = bpb
bpbs[index]
def get_bpbs_for_class(c):
= '{}/{}/Train'.format(DATA_PATH.rstrip('/'), c)
pcaps_root = glob.glob('{}/*.pcap'.format(pcaps_root))
pcaps = []
bpbs
= 6
max_pcaps = max_pcaps # `chunk_size` must divide `max_pcaps`.
chunk_size = zip(*[iter(pcaps[:max_pcaps])]*chunk_size)
pcaps_chunks for chunk in map(list, pcaps_chunks):
= [None] * chunk_size
threads_results = [threading.Thread(target=thread_t2_bpb, args=(threads_results, i, pcap))
threads for i, pcap in enumerate(chunk)]
for t in threads:
t.start()for t in threads:
t.join()for bpbs_c in threads_results:
bpbs.append(bpbs_c)return bpbs
def load_bpbs(path):
with open(path, 'rb') as fh:
return pickle.load(fh)
Now let’s look at the BPBs for all samples of a class. For each sample of this class, we get a list of numbers representing the number of bytes in a burst:
print(get_bpbs_for_class('Avengers'))
[[4110, 129278, 778713, 268405, 3728018, 4098388, 384251, 3486785, 379773, 3845582, 375686, 1108149], [4990, 66262, 7826406, 384251, 3486785, 379773, 3845582, 375686, 1108149], [4110, 66262, 4098388, 384251, 3486785, 379773, 3845582, 375686, 1108149], [4400, 129278, 149475, 627878, 268405, 402514, 1608766, 405435, 1203498, 4098388, 384251, 3486785, 379773, 1906720, 1938862, 375686, 1108149], [4110, 129278, 149475, 896283, 2011280, 405435, 4931516, 4098388, 384251, 3486785, 379773, 3845582, 375686, 1108149], [4110, 129278, 1287473, 670919, 1554910, 2094745, 8215957, 384251, 3335118, 379773, 3881665]]
Building the model
Using these building blocks, we can now write a function learn
that, given a list of classes, fetches the corresponding PCAP files from the user-defined path at the top in this script, extracts the bursts for all PCAPs of each class using Tranalyzer, and stores the resulting features in a dictionary.
We persist this dictionary to disk so that we can call this program again in test mode, give it a new PCAP, and get the video title that most closely matches the unknown sample, given the model.
Note that here, we are storing the bursts in a list rather than a set (as in the paper). This helps us understand the data better when exploring it visually later on. It does not have an impact on classification accuracy.
def learn(classes, *, write_to=None):
= { c: [] for c in classes }
bpbs for c in classes:
print('Extracting bursts for class {}'.format(c))
= get_bpbs_for_class(c)
c_bpb for b in c_bpb:
# NOTE: Could be set() to improve performance in production, but we're using
# list() to preserve the order of the bursts and produce more meaningful plots.
list(b))
bpbs[c].append(if write_to:
with open(write_to, 'wb') as fh:
pickle.dump(bpbs, fh)return bpbs
Classification
To classify an unknown sample, a simple nearest neighbor approach is used.
A video is represented as a set of integers, each representing one burst in the signal.
The unknown sample is classified to the video title of the known sample that shares the most bursts in common with the unknown sample.
def similarity(xs, ys):
return sum(x in ys for x in xs)
def nearest_neighbors(x, bpbs, n=3):
= [None] * n, [0] * n
top, sims for c in bpbs:
= 0
best_in_class = -1
j for y in bpbs[c]:
+= 1
j = similarity(x, y)
s for i in range(n):
if s > sims[i] and s > best_in_class:
= s
sims[i] = s
best_in_class = c
top[i] break
return list(zip(top, sims))
Program options
Parsing the options for our program:
def parse_args(argv=sys.argv[1:]):
= argparse.ArgumentParser(description='Classifier for YouTube video streams')
parser = parser.add_argument_group('Classifier options')
group = set(['learn', 'test'])
modes 'mode', metavar='mode', choices=modes, nargs=1,
parser.add_argument(help='Mode to run in. Available: ' + ', '.join(modes))
'-t', '--test', default='', type=str, help='Path to pcap to classify')
group.add_argument('-f', '--force', action='store_true', help='Overwrite existing files without prompting')
group.add_argument('-s', '--setup', action='store_true', help='Perform initial setup of tranalyzer')
group.add_argument(return parser.parse_args(argv)
Exploring the data
To understand our data better, we can plot the extracted bursts for each sample and examine them side by side.
def plot_bursts(bpbs, title=''):
= 10
width = 3
cols = (len(bpbs) + cols - 1 ) // cols
rows = plt.figure(figsize=(width, 3 * rows))
fig = plt.subplot(rows, cols, 1)
ax1 for i, bpb in enumerate(map(list, bpbs), 1):
=ax1)
plt.subplot(rows, cols, i, sharey
# Optional: Filter out audio bursts <=500K.
#bpb = [x for x in bpb if (x > 500*1024)]
# Optional: Sort for alternative visualization perspective.
#bpb.sort()
=range(len(bpb)), height=bpb)
plt.bar(x+ ' ' + str(i), fontsize=12, fontweight=0)
plt.title(title
plt.tight_layout()=True) plt.show(block
Putting it all together
We first train our model using the training data. Afterwards, the program can be run in test mode and will output the top matches for a new unknown encrypted video stream sample.
def main(argv=None):
= parse_args(argv)
args
if args.setup or definitely_need_setup_t2():
print('Setting up tranalyzer ... ', end='')
setup_t2()print('done')
# run t2 on all PCAPs and extract BPB set for each class
if args.mode[0] == 'learn':
print('Storing learned BPB models to file: {}'.format(BPBS_PATH))
if isfile(BPBS_PATH) and not args.force:
print('WARNING: The model file {} already exists. Overwrite it? [yN] '.format(BPBS_PATH), end='')
if input().lower() != 'y':
print('Exiting.')
1)
sys.exit(= [basename(normpath(p)) for p in glob.glob('{}/**/'.format(DATA_PATH), recursive=False)]
classes = classes[:10]
classes print('Found {} classes, building model now ...'.format(len(classes)))
= learn(classes, write_to=BPBS_PATH)
bpbs print('Done building model, ready to test now.')
# NOTE: Collected BPB features are stored in a file and are used when this program is
# invoked in "test" mode.
if VERBOSE:
for c in classes[:3]:
print('Bursts for class {}:'.format(c))
list(bpbs[c]), c) # Plot bursts of whichever class happens to be the first.
plot_bursts(else:
if not args.test:
print("Missing path to test PCAP file. See help page.")
1)
sys.exit(
= args.test
pcap_test = load_bpbs(BPBS_PATH)
bpbs = extract_bpb(pcap_test)
bpb_test
print('Bursts of test sample ({}):'.format(basename(pcap_test)))
'Test sample')
plot_bursts([bpb_test],
= nearest_neighbors(bpb_test, bpbs, 3)
top = top[0][0]
result print('Classification result: {}'.format(result))
print('')
print('(Top 3: {})'.format(top))
plot_bursts(bpbs[result], result)
For the purposes of this tutorial, let’s first train our model, then test it on a new PCAP:
if __name__ == '__main__':
'learn'])
main([
print('-' * 40)
= f'{DATA_PATH}/Fast_and_Furious_six/Test/Fast_and_Furious_six_Train00_40_30.pcap'
test_pcap 'test', '-t', test_pcap]) main([
Storing learned BPB models to file: bpbs.data
WARNING: The model file bpbs.data already exists. Overwrite it? [yN] y
Found 10 classes, building model now ...
Extracting bursts for class Jennifer_Lopez_On_The_Floor
Extracting bursts for class Democratic_Town_Hall
Extracting bursts for class Fast_and_Furious_six
Extracting bursts for class Coolio_Gangsters_Paradise
Extracting bursts for class Lenny_Kravitz_American_Woman
Extracting bursts for class Disconnect
Extracting bursts for class Jungle_Book
Extracting bursts for class Robbie_Williams_Supreme
Extracting bursts for class Meghan_Trainor_All_About_That_Bass
Extracting bursts for class fifty_Cent_In_Da_Club
Done building model, ready to test now.
----------------------------------------
Bursts of test sample (Fast_and_Furious_six_Train00_40_30.pcap):
Classification result: Fast_and_Furious_six
(Top 3: [('Fast_and_Furious_six', 30), ('Disconnect', 1), ('Jungle_Book', 1)])
Conclusion
We wrote a short python program to build and train a nearest neighbor model to classify encrypted YouTube video streams using Tranalyzer.
Download the jupyter notebook for this tutorial here.
If you have any questions or feedback, please do not hesitate to contact us!
References
[1] R. Dubin, A. Dvir, O. Pele and O. Hadar, “I Know What You Saw Last Minute—Encrypted HTTP Adaptive Video Streaming Title Classification,” in IEEE Transactions on Information Forensics and Security, vol. 12, no. 12, pp. 3039-3049, Dec. 2017. doi: 10.1109/TIFS.2017.2730819 https://ieeexplore.ieee.org/document/7987775