Heterogeneous Information Networks and Applications to Cyber Security
This post explores Heterogeneous Information Networks (HIN) and applications to Cyber security.
Over the past few months I have been researching Heterogeneous Information Networks (HIN) and Cyber security use cases. I first encountered HINs after discovering the paper “Gotcha: Sly Malware!- Scorpion A Metagraph2vec Based Malware Detection System” through a Google Scholar Alert I had set up for “Guilt by Association: Large Scale Malware Detection by Mining File-relation Graphs”. If you’re interested in how I set up my Google Alerts to stay abreast of the latest security data science research, see Security Data Science Learning Resources.
Heterogeneous Information Networks are a relatively simple way of modelling one or more datasets as a graph consisting of nodes and edges where 1) all nodes and edges have defined types, and 2) there is more than one type of node or more than one type of edge (hence “Heterogeneous”). The set of node and edge types represents the schema of the network. This differs from homogeneous networks, where the nodes and edges are all of the same type (e.g. the Facebook social network graph, the World Wide Web, etc.). HINs provide a very rich abstraction for modelling complex datasets.
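To make the definition concrete, here is a minimal sketch of a typed graph in plain Python. The `HIN` class, its method names, and the sample client, domain, and IP values are all hypothetical and only for illustration; they are not taken from any HIN library or paper.

```python
from dataclasses import dataclass, field


@dataclass
class HIN:
    """A tiny typed graph: every node and every edge carries a type."""
    node_types: dict = field(default_factory=dict)  # node id -> node type
    edges: list = field(default_factory=list)       # (source, edge type, target)

    def add_node(self, node_id, node_type):
        self.node_types[node_id] = node_type

    def add_edge(self, source, edge_type, target):
        # Both endpoints must already have a type.
        if source not in self.node_types or target not in self.node_types:
            raise KeyError("add both endpoints before adding an edge")
        self.edges.append((source, edge_type, target))

    def schema(self):
        """Return the set of (source type, edge type, target type) triples."""
        return {(self.node_types[s], t, self.node_types[d])
                for s, t, d in self.edges}

    def is_heterogeneous(self):
        """True if there is more than one node type or more than one edge type."""
        n_node_types = len(set(self.node_types.values()))
        n_edge_types = len({t for _, t, _ in self.edges})
        return n_node_types > 1 or n_edge_types > 1


# Hypothetical sample data (documentation-range addresses and domains).
hin = HIN()
hin.add_node("192.0.2.10", "client")
hin.add_node("shop.example", "domain")
hin.add_node("203.0.113.5", "ip")
hin.add_edge("192.0.2.10", "query", "shop.example")
hin.add_edge("shop.example", "resolve", "203.0.113.5")

print(hin.is_heterogeneous())
print(sorted(hin.schema()))
```

The schema falls out of the data here: it is simply the set of type triples observed on the edges. A real system would more likely declare the schema up front and validate incoming edges against it.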
Below, I will walk through important HIN concepts using the HinDom paper as an example. HinDom uses DNS relationship data from passive DNS, DNS query logs, and DNS response logs to build a malicious domain classifier on top of a HIN. The authors use the Alexa Top 1K list, Malwaredomains.com, Malwaredomainlist.com, DGArchive, Google Safe Browsing, and VirusTotal to derive labels. Below is an example HIN schema taken from this paper.
This schema represents three combined datasets (Passive DNS, DNS query logs, DNS response logs) and it models three node types (Client, Domain, and IP Address) and six edge types (segment, query, CNAME, similar, resolve, and same-domain). Here is an expanded example and descriptions of the relationships:
- Client-query-Domain — matrix Q denotes that domain i is queried by client j.
- Client-segment-Client — matrix N denotes that client i and client j belong to the same network segment.
- Domain-resolve-IP — matrix R denotes that domain i is resolved to IP address j.
- Domain-similar-Domain — matrix S denotes the character-level similarity between domain i and j.
- Domain-cname-Domain — matrix C denotes that domain i and domain j are in a CNAME record.
- IP-domain-IP — matrix D denotes that IP address i and IP address j are once mapped to the same domain.
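As a rough sketch of how relation matrices like these could be assembled, the snippet below builds Q and R from toy (client, domain) and (domain, IP) observations and then derives D from R. The sample data and helper names are hypothetical, and deriving D directly from R is my own simplification rather than the procedure from the HinDom paper.

```python
# Toy observations (hypothetical values, not from the HinDom dataset).
queries = [
    ("client-a", "evil.example"),
    ("client-a", "shop.example"),
    ("client-b", "evil.example"),
]
resolutions = [
    ("evil.example", "198.51.100.7"),
    ("evil.example", "198.51.100.8"),
    ("shop.example", "203.0.113.5"),
]


def index(items):
    """Map each distinct item to a row/column position (sorted for stability)."""
    return {item: i for i, item in enumerate(sorted(set(items)))}


clients = index(c for c, _ in queries)
domains = index([d for _, d in queries] + [d for d, _ in resolutions])
ips = index(ip for _, ip in resolutions)


def relation_matrix(pairs, row_index, col_index):
    """Binary matrix with a 1 wherever a (row item, column item) pair was seen."""
    matrix = [[0] * len(col_index) for _ in row_index]
    for row_item, col_item in pairs:
        matrix[row_index[row_item]][col_index[col_item]] = 1
    return matrix


# Q[i][j] == 1: domain i is queried by client j.
Q = relation_matrix([(d, c) for c, d in queries], domains, clients)
# R[i][j] == 1: domain i is resolved to IP address j.
R = relation_matrix(resolutions, domains, ips)


def same_domain(resolve):
    """D[i][j] == 1: IP i and IP j were mapped to the same domain (i != j)."""
    n_ips = len(resolve[0])
    result = [[0] * n_ips for _ in range(n_ips)]
    for row in resolve:
        hits = [j for j, value in enumerate(row) if value]
        for i in hits:
            for j in hits:
                if i != j:
                    result[i][j] = 1
    return result


D = same_domain(R)
print(Q, R, D, sep="\n")
```

Nested lists keep the example dependency-free. With real DNS volumes these matrices are very sparse, so a sparse matrix library would be the more practical choice.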
Once the dataset is represented as a graph, feature vectors need to be extracted before machine learning models can be built. A common technique for featurizing a HIN is to define meta-paths or meta-graphs over the graph and then perform guided random walks along them. A meta-path represents a graph traversal through a specific sequence of node and edge types. Meta-path selection is akin to feature engineering in classical machine learning: it is very important to select meta-paths that provide useful signal for whatever variable is being predicted. As seen in many HIN papers, meta-paths/graphs are often evaluated individually or in combination to determine their influence on model performance.
Guided random walks along meta-paths produce sequences of nodes (similar to sentences of words), which can then be fed into models like Skip-gram or Continuous Bag-of-Words (CBOW) to create embeddings. Once the nodes are represented as embeddings, many different models (SVM, DNN, etc.) can be used to solve many different types of problems (similarity search, classification, clustering, recommendation, etc.). Below are the meta-paths used in the HinDom paper.
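The guided-walk idea described above can be sketched in a few lines of Python. This is a minimal illustration on a made-up toy graph, not the HinDom implementation and not one of its meta-paths; the `metapath_walk` function, the node names, and the Domain-Client-Domain meta-path are assumptions chosen for the example.

```python
import random

# Toy typed graph (hypothetical data): node -> type, node -> neighbours.
node_type = {
    "evil.example": "domain",
    "shop.example": "domain",
    "client-a": "client",
    "client-b": "client",
    "198.51.100.7": "ip",
}
neighbors = {
    "evil.example": ["client-a", "client-b", "198.51.100.7"],
    "shop.example": ["client-a"],
    "client-a": ["evil.example", "shop.example"],
    "client-b": ["evil.example"],
    "198.51.100.7": ["evil.example"],
}


def metapath_walk(start, metapath, walk_length, rng):
    """Random walk that only steps to neighbours of the type the meta-path expects.

    `metapath` is a sequence of node types that starts and ends on the same
    type (e.g. domain -> client -> domain) so it can be repeated as a cycle.
    """
    if metapath[0] != metapath[-1]:
        raise ValueError("meta-path must start and end on the same node type")
    if node_type[start] != metapath[0]:
        raise ValueError("start node does not match the first meta-path type")
    cycle = metapath[:-1]
    walk = [start]
    while len(walk) < walk_length:
        wanted = cycle[len(walk) % len(cycle)]
        candidates = [n for n in neighbors[walk[-1]] if node_type[n] == wanted]
        if not candidates:
            break  # dead end: no neighbour of the required type
        walk.append(rng.choice(candidates))
    return walk


rng = random.Random(0)  # seeded only to make runs repeatable
walk = metapath_walk("evil.example", ["domain", "client", "domain"], 5, rng)
print(walk)
```

Each walk acts as one "sentence". A collection of such walks from many start nodes could then be passed to a skip-gram implementation (for example gensim's `Word2Vec` with `sg=1`) to learn node embeddings. Note that the IP node is never visited here, because the chosen meta-path only permits domain and client nodes.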
Below is the HinDom Architecture to illustrate how all these concepts come together.
Below are some resources that I found useful for learning more about Heterogeneous Information Networks as well as several security related papers that used HIN.
Books:
- Mining Heterogeneous Information Networks: Principles and Methodologies
- Heterogeneous Information Network Analysis and Applications
HIN Papers:
- Mining Heterogeneous Information Networks- A Structural Analysis Approach
- HIN2Vec: Explore Meta-paths in Heterogeneous Information Networks for Representation Learning
- PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks
- Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema
- Metapath2vec: Scalable Representation Learning for Heterogeneous Networks
- A Survey of Heterogeneous Information Network Analysis
- Adversarial Learning on Heterogeneous Information Networks
Security-related HIN Papers:
Malware Detection / Code Analysis:
- AiDroid: When Heterogeneous Information Network Marries Deep Neural Network for Real-time Android Malware Detection
- Gotcha: Sly Malware!- Scorpion A Metagraph2vec Based Malware Detection System
- HinDroid: An Intelligent Android Malware Detection System Based on Structured Heterogeneous Information Network
- Make Evasion Harder: An Intelligent Android Malware Detection System
- DeepAM: a heterogeneous deep learning framework for intelligent malware detection
- HinDom: A Robust Malicious Domain Detection System based on Heterogeneous Information Network with Transductive Classification
- iTrustSO: An Intelligent System for Automatic Detection of Insecure Code Snippets in Stack Overflow
Mining the Darkweb / Fraud Detection / Social Network Analysis:
- Key Player Identification in Underground Forums over Attributed Heterogeneous Information Network Embedding Framework
- Your Style Your Identity: Leveraging Writing and Photography Styles for Drug Trafficker Identification in Darknet Markets over Attributed Heterogeneous Information Network
- iDetector: Automate Underground Forum Analysis Based on Heterogeneous Information Network
- Cash-out User Detection based on Attributed Heterogeneous Information Network with a Hierarchical Attention Mechanism
- iDev: Enhancing Social Coding Security by Cross-platform User Identification Between GitHub and Stack Overflow
Tutorials:
Code:
- github.com/zhoushengisnoob/HINE — Heterogeneous Information Network Embedding: papers and code implementations.
- github.com/stellargraph/stellargraph (see stellargraph-metapath2vec.ipynb)
- github.com/hetio/hetnetpy — HIN library
- github.com/hetio/hetmatpy — HIN library that represents networks as matrices.
- github.com/csiesheep/hin2vec
Prominent Security Researchers using HIN:
As always, feedback is welcome, so please leave a message here, on Medium, or @ me on Twitter!
–Jason
@jason_trost
Note: this was originally posted on my personal blog covert.io on 1/20/2020.