Improving Malware Detection by Applying
Multi-Inducer Ensemble
Eitan Menahem, Asaf Shabtai, Lior Rokach, and Yuval Elovici
Deutsche Telekom Laboratories at Ben-Gurion University of the Negev,
Be’er Sheva, 84105 Israel
{eitanme, shabtaia, liorrk, elovici}@bgu.ac.il
Abstract. Detection of malicious software (malware) using machine learning
methods has been explored extensively to enable fast detection of newly released
malware. The performance of these classifiers depends on the induction
algorithms being used. In order to benefit from multiple different classifiers
and exploit their strengths, we suggest using an ensemble method that will
combine the results of the individual classifiers into one final result to achieve
overall higher detection accuracy. In this paper we evaluate several combining
methods using five different base inducers (C4.5 Decision Tree, Naïve Bayes,
KNN, VFI and OneR) on five malware datasets. The main goal is to find the
best combining method for the task of detecting malicious files in terms of
accuracy, AUC and execution time.
Keywords: Classification Algorithms, Malicious Code, Ensemble Learning,
Decision Tree, Naïve Bayes Classifier
1 Introduction
Modern computer and communication infrastructures are highly susceptible to various
types of attacks. A common way of launching these attacks is by using malware
(malicious software) such as worms, viruses, Trojans or spyware [Kienzle et al., 2003;
Heidrai, 2004]. Spreading malware can cause great damage to private users,
commercial companies and governments. A single malware instance inside a computer
that is part of a computer network can result in the loss, unauthorized utilization
or modification of large amounts of data, and can cause users to question the
reliability of all the information on the network.
Detection of known malware is commonly performed by anti-virus tools. These
tools detect known malware using signature-based methods [White, 1999;
Dikinson, 2005]. When a new malware is released, the anti-virus vendors need to
catch an instance of the new malware, analyze it, create a new signature and update
their clients. During the period between the appearance of a new, unknown malware
and the update of the anti-virus clients' signature-base, millions of computers
are vulnerable to the new malware. Therefore, although signature-based detection
methods eventually yield a high detection rate, they cannot cope with new,
previously unseen malware [White, 1998]. High-speed communication channels enable
malware to propagate and infect hosts quickly, so it is essential to detect and
eliminate new (unknown) spreading malware at the early stages of an attack.
Recent studies have proposed methods for detecting malicious executables using
Machine Learning (ML) techniques [Kabiri et al., 2005]. Given a training set of
malicious and benign executable binary code, a classifier is trained to classify
unknown executables as malicious or benign. Some of the methods are
inspired by text categorization techniques, in which words are analogous to sequences
in the binary code. Malicious and benign files are represented by vectors of features
extracted from the PE header and the binary code of the executable. These files are
used to train a classifier. During the detection phase, based on the generalization
capability of the classifier, an unknown file can be classified as malicious or benign.
Some of the studies applying machine learning methods to content-based features are
mentioned below.
In [Schultz et al., 2001], ML techniques were applied for the detection of unknown
malicious code based on binary code content. Three different feature extraction
methods were used: program header, printable string features and byte sequence
features, on which four classifiers were applied: a signature-based method (anti-virus),
Ripper (a rule-based learner), Naïve Bayes and Multi Naïve Bayes. The experiments
were conducted using 4,301 programs in Win32 PE format (3,301 malicious programs and
1,000 benign programs) and the results show that all the ML methods outperform the
signature-based classifier.
The study in [Abou-Assaleh et al., 2004] used n-gram features extracted from the
binary code of the file. The L most frequent n-grams of each class in the training set
are chosen and represent the profile of the class. A new instance is assigned to the
class with the closest profile using the KNN algorithm. The experiment, conducted on a
small set of 65 malicious and benign files, achieved 98% accuracy. Caia et al. (2007)
and Moskovitch et al. (2008) compared different feature selection methods for
building a single classifier.
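To make the profile idea concrete, the following is a minimal Python sketch of an n-gram class-profile classifier in the spirit of [Abou-Assaleh et al., 2004]; the function names, the squared-difference distance and the default parameters are our own illustrative choices, not the authors' implementation:

```python
from collections import Counter

def ngram_profile(files_bytes, n=4, L=1000):
    """Build a class profile: the L most frequent byte n-grams
    across all files of one class, as relative frequencies."""
    counts = Counter()
    for data in files_bytes:
        counts.update(data[i:i + n] for i in range(len(data) - n + 1))
    total = sum(counts.values()) or 1
    return {g: c / total for g, c in counts.most_common(L)}

def profile_distance(p, q):
    """Illustrative dissimilarity between two n-gram profiles."""
    return sum((p.get(g, 0.0) - q.get(g, 0.0)) ** 2 for g in set(p) | set(q))

def classify(data, class_profiles, n=4, L=1000):
    """Assign a new file to the class with the closest profile."""
    p = ngram_profile([data], n=n, L=L)
    return min(class_profiles, key=lambda c: profile_distance(p, class_profiles[c]))

# Hypothetical usage:
# profiles = {'malicious': ngram_profile(mal_files), 'benign': ngram_profile(ben_files)}
# label = classify(open('unknown.exe', 'rb').read(), profiles)
```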
Kolter and Maloof (2006) extracted n-grams from the executables. The most relevant
n-gram attributes were chosen using the Information Gain ratio, and the data-sets were
evaluated using several inducers: IBk, Naïve Bayes, Support Vector Machines (SVMs),
Decision Trees, Boosted Naïve Bayes, Boosted SVMs and Boosted Decision Trees.
Since the costs of misclassification errors are usually unknown, the methods
were evaluated using receiver operating characteristic (ROC) analysis. When
classifying new instances into the two classes 'Benign' and 'Malicious', Boosted
Decision Trees outperformed the other methods with an area under the ROC curve of
0.996. When classifying new instances into multiple classes (such as Virus, Backdoor,
etc.), the area under the ROC curve reached an average of 0.9.
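The feature selection step can be sketched as follows; this is not Kolter and Maloof's code, merely an illustration of ranking binary n-gram presence features by gain ratio (information gain divided by the feature's split information) and keeping the top k attributes; all names are ours:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def info_gain(column, labels):
    """Information gain of one binary feature (n-gram present/absent)."""
    n = len(labels)
    gain = entropy(labels)
    for v in (0, 1):
        subset = [y for f, y in zip(column, labels) if f == v]
        if subset:
            gain -= len(subset) / n * entropy(subset)
    return gain

def gain_ratio(column, labels):
    """Gain ratio: information gain divided by the feature's split info."""
    split_info = entropy(column)
    return info_gain(column, labels) / split_info if split_info else 0.0

def top_k_features(X, y, k=500):
    """Rank the columns of X by gain ratio and keep the k best."""
    scores = [(gain_ratio([row[j] for row in X], y), j) for j in range(len(X[0]))]
    return [j for _, j in sorted(scores, reverse=True)[:k]]
```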
By its nature, the measured performance (e.g., accuracy) of a machine learning
classifier is influenced by two main factors: (1) the inducer algorithm that is used to
generate the classifier (e.g., Artificial Neural Network, Decision Tree, Naïve Bayes);
and (2) the type of features that are used to represent the instances. Thus, various
types of classifiers have different "inductive biases". When generating multiple
classifiers using different inducers and various features (extracted from the
same set of instances), each classifier will perform differently. It might be the case
that a certain classifier specializes in classifying instances into a subset of the
classes (for example, a classifier that classifies an executable file as benign or Worm
with high accuracy, but with low accuracy when classifying into other malicious classes
such as Trojan and Virus).
The outcome of this observation is that one should use multiple different classifiers
that each classify a new instance by predicting its real class (like a group of experts),
and then combine all the results, using an intelligent combination method, into one final
result. Hopefully, this "super-classifier" will outperform each of the individual
classifiers and classify new instances with higher accuracy by learning the strengths
and weaknesses of each classifier.
In this paper we introduce three main contributions: (1) we show that multi-inducer
ensembles are capable of detecting malware; (2) we introduce an innovative combining
method, called Troika [Menahem, 2008], which extends Stacking, and show its
superiority in malware detection tasks; and (3) we present empirical results from
an extensive real-world study of various malware using different types of features.
The rest of the paper is organized as follows. Section 2 describes related work
in the field of ensemble applications in the security domain. Section 3 presents the
experimental setup and Section 4 presents the evaluation results. Conclusions and
future work are discussed in Section 5.
2 Related Work
In this section we present studies in which ensemble methods are used to aid in
malicious code detection and in intrusion detection. We also describe the six
ensemble methods used in our experiments.
2.1 Malware Detection Using Ensemble Methods
The main idea of an ensemble methodology is to combine a set of classifiers in order
to obtain a better composite global classifier. Ensemble methodology imitates our
second nature to seek several opinions before making any crucial decision. We weigh
the individual opinions, and combine them to reach a final decision. Recent studies
evaluated the improvement in performance (in terms of accuracy, false positive rate
and area under the ROC graph) when using ensemble algorithms to combine the
individual classifiers.
Several researchers examined the applicability of ensemble methods in detecting
malicious code. Zhang et al. (2007) classify new malicious code based on n-gram
features extracted from the binary code of the file. First, n-gram-based features are
extracted and the best n-grams are chosen by applying the Information-Gain feature
selection method. Then, probabilistic neural networks (PNNs) are used to construct
several classifiers for detection. Finally, the individual decisions of the PNN
classifiers are combined by using the Dempster-Shafer theory to create the combining
rules. The method was evaluated on a set of malicious executables that were
downloaded from the VX Heavens website (http://www.vx.netlux.org) and clean
programs gathered from a Windows 2000 server machine: a total of 423 benign
programs and 450 malicious executable files (150 Viruses, 150 Trojans and 150
Worms), all in Windows PE format. The results show a better ROC curve for the
ensemble of PNNs when compared to the best individual PNN for each of the three
malware classes.
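For illustration, below is a minimal sketch of Dempster's rule of combination under the simplifying assumption that each classifier's output is converted into a mass function over the singleton class hypotheses only, in which case the rule reduces to a normalized element-wise product; Zhang et al.'s actual combining rules may differ:

```python
import numpy as np

def dempster_combine(m1, m2):
    """Dempster's rule for two mass functions whose focal elements are
    the singleton classes: multiply element-wise, then renormalize so
    that the mass assigned to conflicting (disjoint) pairs is discarded."""
    joint = m1 * m2
    if joint.sum() == 0.0:
        raise ValueError("total conflict: the masses cannot be combined")
    return joint / joint.sum()

def ensemble_decision(mass_functions, classes):
    """Fuse the mass functions of all base classifiers, pick the top class."""
    combined = mass_functions[0]
    for m in mass_functions[1:]:
        combined = dempster_combine(combined, m)
    return classes[int(np.argmax(combined))]
```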
An ensemble of classifiers has been evaluated on intrusion detection tasks as well.
Mukkamalaa et al. (2005) used the Majority Voting scheme to ensemble the results
of multiple classifiers trained to detect intrusions within network traffic. The
performance of the ensemble was compared to the performance of the individual
classifiers: SVM, MARS and three types of ANNs (RP, SCG and OSS). It was evaluated
using the DARPA data-set, which is considered an IDS evaluation benchmark. The
classifiers were trained to detect normal traffic as well as four classes of attacks
(probing, DoS, user-to-root and remote-to-user). It is shown that the simple
majority voting scheme improves the detection accuracy.
The ensemble approach in [Kelly et al., 2006] was evaluated using two anomaly-based
network intrusion detection systems, LERAD [Mahoney, 2003] and PAYL [Wang
and Stolfo, 2004]. LERAD's normal behavior model is composed of rule sets of expected
(normal) user behavior, while PAYL uses byte distributions derived from normal
packets. Based on the classifications that the two systems produce on a training set,
an ensemble probability classification matrix (EPCM) is generated using conditional
probability. The method was evaluated on the DARPA data-set and the results show an
increase in the rate of detecting attacks and in the accuracy of determining their exact
classes.
In order to make the ensemble more effective, there should be some sort of
diversity between the classifiers [Kuncheva, 2005]. In the Multi-Inducer strategy,
diversity is obtained by using different types of inducers. Each inducer contains an
explicit or implicit bias that leads it to prefer certain generalizations over others.
Ideally, this multi-inducer strategy would always perform as well as the best of its
ingredients. Even more ambitiously, there is hope that this combination of paradigms
might produce synergistic effects, leading to levels of accuracy that neither atomic
approach by itself would be able to achieve.
In this study we compare several methods for combining various classifiers in
order to understand which of these methods (if any) best improves the detection
accuracy.
2.2 Methods for Combining Ensemble’s Members
In the Majority Voting combining scheme [Kittler et al., 1998], a classification of
an unlabeled instance is performed according to the class that obtains the highest
number of votes (the most frequent vote). This method is also known as the plurality
vote (PV) or the basic ensemble method (BEM). This approach has frequently been
used as a combining method for comparing against newly proposed methods.
Mathematically it can be written as:

    class(x) = \arg\max_{c_i \in \mathrm{dom}(y)} \sum_k g(y_k(x), c_i)        (1)
where y_k(x) is the classification of the k-th classifier and g(y, c) is an indicator
function defined as:

    g(y, c) = \begin{cases} 1 & y = c \\ 0 & y \neq c \end{cases}        (2)
Note that in the case of a probabilistic classifier, the crisp classification y_k(x)
is usually obtained as follows:

    y_k(x) = \arg\max_{c_i \in \mathrm{dom}(y)} \hat{P}_{M_k}(y = c_i \mid x)        (3)
where M_k denotes classifier k and \hat{P}_{M_k}(y = c_i \mid x) denotes the
probability of y obtaining the value c_i given an instance x.
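A compact Python sketch of Eqs. (1)-(3) may help; the variable names are ours:

```python
import numpy as np

def majority_vote(prob_vectors):
    """Eqs. (1)-(3): each probabilistic classifier casts one crisp vote
    via argmax (Eq. 3); the most frequent vote wins (Eq. 1)."""
    votes = [int(np.argmax(p)) for p in prob_vectors]   # crisp y_k(x)
    counts = np.bincount(votes)                         # sum of g(y_k(x), c_i)
    return int(np.argmax(counts))

# Example: three classifiers, three classes
probs = [np.array([0.6, 0.3, 0.1]),
         np.array([0.2, 0.5, 0.3]),
         np.array([0.1, 0.7, 0.2])]
print(majority_vote(probs))  # -> 1 (two of the three votes)
```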
In Performance Weighting, the weight of each classifier is set proportional to its
accuracy performance on a validation set [Opitz and Shavlik, 1996]:

    \alpha_i = \frac{1 - E_i}{\sum_{j=1}^{t} (1 - E_j)}        (4)

where E_i is a normalization factor which is based on the performance evaluation of
classifier i on a validation set.
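A sketch of Eq. (4) follows; applying the weights to the full probability vectors (rather than to crisp votes) is one common variant, and the error rates in the usage line are hypothetical:

```python
import numpy as np

def performance_weights(errors):
    """Eq. (4): alpha_i proportional to (1 - E_i), normalized to sum to 1."""
    quality = 1.0 - np.asarray(errors, dtype=float)
    return quality / quality.sum()

def weighted_vote(prob_vectors, weights):
    """Combine the base classifiers' probability vectors with the weights."""
    combined = sum(w * p for w, p in zip(weights, prob_vectors))
    return int(np.argmax(combined))

# Hypothetical validation-set error rates E_i of three base classifiers:
alphas = performance_weights([0.10, 0.25, 0.40])
```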
The idea of the Distribution Summation combining method is to sum up the
conditional probability vectors obtained from each classifier [Clark and Boswell,
1991]. The selected class is chosen according to the highest value in the total vector.
Mathematically, it can be written as:
    Class(x) = \arg\max_{c_i \in \mathrm{dom}(y)} \sum_k \hat{P}_{M_k}(y = c_i \mid x)        (5)
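In code, Eq. (5) amounts to a single vector sum; a minimal sketch:

```python
import numpy as np

def distribution_summation(prob_vectors):
    """Eq. (5): sum the conditional probability vectors of all base
    classifiers and select the class with the largest total."""
    total = np.sum(prob_vectors, axis=0)
    return int(np.argmax(total))
```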
In the Bayesian Combination method the weight associated with each classifier is
the posterior probability of the classifier given the training set [Buntine, 1990]:

    Class(x) = \arg\max_{c_i \in \mathrm{dom}(y)} \sum_k P(M_k \mid S) \cdot \hat{P}_{M_k}(y = c_i \mid x)        (6)

where P(M_k \mid S) denotes the probability that the classifier M_k is correct given the
training set S. The estimation of P(M_k \mid S) depends on the classifier's representation.
To estimate this value for decision trees the reader is referred to [Buntine, 1990].
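The following sketch uses normalized validation accuracies as a stand-in for P(M_k | S), since, as noted above, the proper estimate depends on the classifier's representation; the accuracy figures are hypothetical:

```python
import numpy as np

def bayesian_combination(prob_vectors, posteriors):
    """Eq. (6): weight each classifier's probability vector by an
    estimate of P(M_k | S) and select the class with the best total."""
    total = sum(p_mk * p for p_mk, p in zip(posteriors, prob_vectors))
    return int(np.argmax(total))

# Stand-in posterior: normalized validation accuracies (an assumption).
accuracies = np.array([0.90, 0.75, 0.60])
posteriors = accuracies / accuracies.sum()
```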
Using Bayes' rule, one can extend the Naive Bayes idea for combining various
classifiers [John and Langley, 1995]:

    Class(x) = \arg\max_{\substack{c_j \in \mathrm{dom}(y) \\ \hat{P}(y = c_j) > 0}} \hat{P}(y = c_j) \cdot \prod_k \frac{\hat{P}_{M_k}(y = c_j \mid x)}{\hat{P}(y = c_j)}        (7)
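A direct transcription of Eq. (7), guarding against classes with zero prior (the function and variable names are ours):

```python
import numpy as np

def naive_bayes_combination(prob_vectors, prior):
    """Eq. (7): scale the class prior P(y=c_j) by the product of the
    likelihood ratios P_Mk(y=c_j|x) / P(y=c_j) over all classifiers."""
    prior = np.asarray(prior, dtype=float)
    valid = prior > 0                       # Eq. (7) requires P(y=c_j) > 0
    score = np.where(valid, prior, 0.0)
    for p in prob_vectors:
        ratio = np.ones_like(prior)
        ratio[valid] = p[valid] / prior[valid]
        score *= ratio
    return int(np.argmax(score))
```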
Stacking is a technique whose purpose is to achieve the highest generalization
accuracy [Wolpert, 1992]. By using a meta-learner, this method tries to induce which
classifiers are reliable and which are not. Stacking is usually employed to combine
models built by different inducers. The idea is to create a meta-dataset containing a
tuple for each tuple in the original dataset. However, instead of using the original
input attributes, it uses the classifications predicted by the classifiers as the input
attributes. The target attribute remains as in the original training set. A test instance
is first classified by each of the base classifiers; these classifications are then fed
to the meta-learner, which produces the final classification.