Auxiliary Loss Optimization for Hypothesis Augmentation for DGA Domain Detection

  • Detecting malicious URLs from Exploit Kits — possible auxiliary labels: Exploit Kit names, Web Proxy Categories, etc.
  • Detecting malicious C2 domains — possible auxiliary labels: malware family names, DGA or not, proxy categories.
  • Detecting DGA Domains — possible auxiliary labels: malware families, DGA type (wordlist, hex based, alphanumeric, etc).
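
To make the idea of auxiliary labels concrete, here is a minimal sketch of how a combined loss could be assembled: the main "malicious or not" loss plus a weighted sum of the auxiliary losses. The function names, the example probabilities, and the 0.1 auxiliary weights are assumptions for illustration, not values taken from the models described in this post.

```python
import math

EPS = 1e-12  # guards against log(0)

def binary_cross_entropy(y_true, p):
    """Loss for the main target (1 = DGA, 0 = benign)."""
    p = min(max(p, EPS), 1 - EPS)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def categorical_cross_entropy(true_index, probs):
    """Loss for an auxiliary multi-class target such as malware family."""
    return -math.log(max(probs[true_index], EPS))

def combined_loss(main_loss, aux_losses, aux_weights):
    """Main loss (weight 1, an assumption) plus weighted auxiliary losses."""
    return main_loss + sum(aux_weights[name] * loss
                           for name, loss in aux_losses.items())

# Hypothetical predictions for a single DGA domain.
main = binary_cross_entropy(1, 0.9)
aux = {
    "family": categorical_cross_entropy(2, [0.1, 0.2, 0.7]),
    "dga_type": categorical_cross_entropy(0, [0.6, 0.4]),
}
weights = {"family": 0.1, "dga_type": 0.1}  # assumed weights
total = combined_loss(main, aux, weights)
print(round(total, 4))
```

The auxiliary heads only shape training; at prediction time only the main output would be used.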

Data:

  • Alexa top 1M domains
  • Classical DGA domains for the following malware families: banjori, corebot, cryptolocker, dircrypt, kraken, lockyv2, pykspa, qakbot, ramdo, ramnit, and simda.
  • Word-based/dictionary DGA domains for the following malware families: gozi, matsnu, and suppobox
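
The data sources above map naturally onto one main target and two auxiliary targets. The sketch below shows one way to derive them from a family label; the `make_targets` helper, the target names, and the "none" placeholder for benign domains are assumptions made for this example.

```python
# Families listed above as word-based/dictionary DGAs.
WORD_BASED = {"gozi", "matsnu", "suppobox"}

def make_targets(family):
    """Derive main and auxiliary targets from a family label (hypothetical helper)."""
    if family == "benign":
        dga_type = "none"
    elif family in WORD_BASED:
        dga_type = "word_based"
    else:
        dga_type = "classical"
    return {
        "main": 0 if family == "benign" else 1,  # DGA or not
        "family": family,                        # auxiliary: malware family
        "dga_type": dga_type,                    # auxiliary: DGA type
    }

for label in ("benign", "gozi", "qakbot"):
    print(label, make_targets(label))
```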

Baseline Models:

  • ALOHA CNN
  • ALOHA Bigram
  • ALOHA LSTM
  • ALOHA CNN+LSTM
  • Training split: 76% training, 4% validation, 20% testing.
  • All models were trained with a batch size of 128.
  • The CNN, LSTM, and CNN+LSTM models used up to 25 epochs, while the bigram models used up to 50 epochs.
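
A 76/4/20 split with batches of 128 could be produced roughly as follows. This is a sketch only: the helper names, the fixed seed, and the use of plain shuffling (rather than, say, stratifying by family) are assumptions, not a description of how the split was actually made.

```python
import random

def split_indices(n, train_pct=76, val_pct=4, seed=0):
    """Shuffle indices 0..n-1 and cut them into train/validation/test parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = n * train_pct // 100
    n_val = n * val_pct // 100
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]  # the remainder, 20% here
    return train, val, test

def batches(items, batch_size=128):
    """Yield consecutive batches; the last one may be smaller."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

train, val, test = split_indices(1000)
print(len(train), len(val), len(test))
print(sum(1 for _ in batches(train)))
```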

Label counts in the training data (traindata.pkl):

In [1]: import pickle
In [2]: from collections import Counter
In [3]: with open('traindata.pkl', 'rb') as f:
   ...:     data = pickle.load(f)
In [4]: Counter(d[0] for d in data).most_common(100)
Out[4]:
[('benign', 139935),
 ('qakbot', 10000),
 ('dircrypt', 10000),
 ('pykspa', 10000),
 ('corebot', 10000),
 ('kraken', 10000),
 ('suppobox', 10000),
 ('gozi', 10000),
 ('ramnit', 10000),
 ('matsnu', 10000),
 ('locky', 9999),
 ('banjori', 9984),
 ('simda', 9984),
 ('ramdo', 9984),
 ('cryptolocker', 9984)]
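
A quick check on the counts above shows how the classes are balanced. The numbers are copied from the output; the variable names are just for this example.

```python
# Counts copied from the Counter output above.
counts = {
    "benign": 139935, "qakbot": 10000, "dircrypt": 10000, "pykspa": 10000,
    "corebot": 10000, "kraken": 10000, "suppobox": 10000, "gozi": 10000,
    "ramnit": 10000, "matsnu": 10000, "locky": 9999, "banjori": 9984,
    "simda": 9984, "ramdo": 9984, "cryptolocker": 9984,
}

total = sum(counts.values())
dga_total = total - counts["benign"]
benign_share = counts["benign"] / total

print(total)         # 279870
print(dga_total)     # 139935
print(benign_share)  # 0.5
```

So the main (benign vs. DGA) target is exactly balanced, while each individual family makes up only about 3.6% of the data.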

Results:

  • aloha_bigram 0.9435
  • bigram 0.9444
  • cnn 0.9817
  • aloha_cnn 0.9820
  • lstm 0.9944
  • aloha_cnn_lstm 0.9947
  • aloha_lstm 0.9950
  • cnn_lstm 0.9957
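
To see what the auxiliary targets bought for each architecture, the scores listed above can be paired up (ALOHA variant minus its baseline). The snippet uses only the numbers from the list.

```python
# Scores copied from the list above.
scores = {
    "bigram": 0.9444, "aloha_bigram": 0.9435,
    "cnn": 0.9817, "aloha_cnn": 0.9820,
    "lstm": 0.9944, "aloha_lstm": 0.9950,
    "cnn_lstm": 0.9957, "aloha_cnn_lstm": 0.9947,
}

deltas = {}
for arch in ("bigram", "cnn", "lstm", "cnn_lstm"):
    # Positive means the ALOHA variant scored higher than its baseline.
    deltas[arch] = round(scores["aloha_" + arch] - scores[arch], 4)
    print(f"{arch:9s} {deltas[arch]:+.4f}")
```

The ALOHA variant is slightly ahead for the CNN and LSTM models and slightly behind for the bigram and CNN+LSTM models; every difference is at most 0.001.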

Future Work:

  • Type of DGA (hex based, alphanumeric, custom alphabet, dictionary/word-based, etc)
  • (Crossed out) Classical DGA domain features like string entropy, count of longest consecutive consonant string, count of longest consecutive vowel string, etc. I am curious if forcing the NN to learn these would improve its primary scoring mechanism. Update 7/23/2019: I augmented the code to add entropy as an aux target and it seemed to have no impact. Oh well, it was worth a try. Code here if interested.
  • Metadata from VT domain report.
  • Summary / stats from Passive DNS (PDNS).
  • Features from various aspects of the domain’s whois record.
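
For reference, the classical string features mentioned in the list above (entropy, longest consonant run, longest vowel run) could be computed along these lines before being used as auxiliary targets. This is a minimal sketch: the function names are made up for this example, "y" is treated as a consonant, and the input is assumed to be the domain label without the TLD.

```python
import math
from collections import Counter

VOWELS = set("aeiou")
CONSONANTS = set("bcdfghjklmnpqrstvwxyz")  # assumption: 'y' counts as a consonant

def shannon_entropy(s):
    """Character-level Shannon entropy in bits; 0.0 for an empty string."""
    if not s:
        return 0.0
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in Counter(s).values())

def longest_run(s, alphabet):
    """Length of the longest run of consecutive characters drawn from alphabet."""
    best = current = 0
    for ch in s.lower():
        current = current + 1 if ch in alphabet else 0
        best = max(best, current)
    return best

def aux_features(label):
    """Bundle the features for one domain label (hypothetical helper)."""
    return {
        "entropy": shannon_entropy(label),
        "max_consonants": longest_run(label, CONSONANTS),
        "max_vowels": longest_run(label, VOWELS),
    }

print(aux_features("strength"))
print(aux_features("queue"))
```

Digits and hyphens belong to neither set, so they simply break a run.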

Jason Trost