diagrams from paper“iDetector: Automate Underground Forum Analysis Based on Heterogeneous Information Network”

This post explores Heterogeneous Information Networks (HIN) and applications to Cyber security.

Over the past few months I have been researching Heterogeneous Information Networks (HIN) and Cyber security use cases. I first encountered HIN’s after discovering this paper: “Gotcha: Sly Malware!- Scorpion A Metagraph2vec Based Malware Detection System” through a Google Scholar Alert I had setup for “Guilt by Association: Large Scale Malware Detection by Mining File-relation Graphs”. If you’re interested in how I setup my Google Alerts to stay abreast of the latest security data science research, see this: Security Data Science Learning Resources.

Heterogeneous Information Networks are a relatively simple way of modelling one or more datasets as a graph consisting of nodes and edges where 1) all nodes and edges have defined types, and 2) types of nodes > 1 or types of edges > 1 (hence “Heterogeneous”). The set of node and edge types represents the schema of the network. This differs from homogeneous networks where the nodes and edges are all the same type (e.g. Facebook Social Network Graph, World Wide Web, etc.). HINs provide a very rich abstraction for modelling complex datasets.

Below, I will walk through important HIN concepts using the HinDom paper as an example. HinDom uses DNS relationship data from passive DNS, DNS query logs, and DNS response logs to build a malicious domain classifier using HIN. They use Alexa Top 1K list, Malwaredomains.com, Malwaredomainlist.com, DGArchive, Google Safe Browsing, and VirusTotal for deriving labels. Below is an example HIN schema taken from this paper.

This schema represents three combined datasets (Passive DNS, DNS query logs, DNS response logs) and it models three node types (Client, Domain, and IP Address) and six edge types (segment, query, CNAME, similar, resolve, and same-domain). Here is an expanded example and descriptions of the relationships:

  • Client-query-Domain — matrix Q denotes that domain i is queried by client j.
  • Client-segment-Client — matrix N denotes that client i and client j belong to the same network segment.
  • Domain-resolve-IP — matrix R denotes that domain i is resolved to IP address j.
  • Domain-similar-Domain — matrix S denotes the character-level similarity between domain i and j.
  • Domain-cname-Domain — matrix C denotes that domain i and domain j are in a CNAME record.
  • IP-domain-IP — matrix D denotes that IP address i and IP address j are once mapped to the same domain.

Once the dataset is represented as a graph, feature vectors need to be extracted before machine learning models can be built. A common technique for featurizing a HIN is by defining Meta-paths or Meta-graphs against the graph and then performing guided random walks against the defined meta-paths/graphs. Meta-paths represent graph traversals through specific node and edge sequences. Meta-paths selection are akin to feature engineering in classical machine learning as it is very important to select meta-paths that provide useful signals for whatever variable is being predicted. As seen in many HIN papers, meta-paths/graphs are often evaluated individually or in combination to determine their influence on model performance. Guided random walks against meta-paths produce a sequence of nodes (similar to sentences of words), which can then be fed into models like Skipgram or Continuous Bag-of-Words (CBOW) to create embeddings. Once the nodes are represented as embeddings many different models (SVM, DNN, etc) can be used to solve many different types of problems (Similarity Search, Classification, Clustering, Recommendation, etc). Below are the meta-paths used in the HinDom paper.

Below is the HinDom Architecture to illustrate how all these concepts come together.

Below are some resources that I found useful for learning more about Heterogeneous Information Networks as well as several security related papers that used HIN.

Books:

HIN Papers:

Security-related HIN Papers:

Malware Detection / Code Analysis:

Mining the Darkweb / Fraud Detection / Social Network Analysis:

Tutorials:

Code:

Prominent Security Researchers using HIN:

As always, feedback is welcome so please leave a message here, on Medium, or @ me on twitter!

–Jason
@jason_trost

Note: this was originally posted on my personal blog covert.io on 1/20/2020.

--

--

Jason Trost

Interests: Network security, Digital Forensics, Machine Learning, Big Data. retweets are not endorsements.