Android malware classification using deep learning


<span class="text_page_counter">Trang 1</span><div class="page_container" data-page="1">

MINISTRY OF EDUCATION AND TRAINING

<b>HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY</b>

</div><span class="text_page_counter">Trang 2</span><div class="page_container" data-page="2">


<b>Ph.D. Nguyen Kim Khanh Ph.D. Hoang Van Hiep</b>

<b>Hanoi−2024</b>

</div><span class="text_page_counter">Trang 3</span><div class="page_container" data-page="3">

<b>DECLARATION OF AUTHORSHIP</b>

I declare that my dissertation, titled "Android malware classification using deep learning", has been entirely composed by myself, supervised by my co-supervisors, Ph.D. Nguyen Kim Khanh and Ph.D. Hoang Van Hiep. I assure you of the following statements:

• This work was done as a part of the requirements for the degree of Ph.D. at Hanoi University of Science and Technology.

• This dissertation has not previously been submitted for any degree.

• The results in my dissertation are my independent work, except where works in collaboration have been included. Other appropriate acknowledgments are given within this dissertation by explicit references.

Ph.D. NGUYEN KIM KHANH

Ph.D. HOANG VAN HIEP

</div><span class="text_page_counter">Trang 4</span><div class="page_container" data-page="4">

Communication and Technology (SoICT), Hanoi University of Science and Technology (HUST). HUST is a special place where I accumulated immense knowledge in my Ph.D. process.

A Ph.D. process is not a one-man process. Therefore, I am heartily thankful to my supervisors, Ph.D. Nguyen Kim Khanh and Ph.D. Hoang Van Hiep, whose encouragement, guidance, and support from start to finish enabled me to develop my research skills and understanding of the subject. I have learned countless things from them. This dissertation would not have been possible without their precious support.

I would like to thank the Executive Board and all members of the Computer Engineering Department, SoICT, and HUST for their frequent support in my Ph.D. course.

</div><span class="text_page_counter">Trang 5</span><div class="page_container" data-page="5">

1.2.3 Android Malware Classification Evaluation Metrics

1.2.3.1 Metrics for the Binary Classification Problem

1.2.3.2 Metrics for Multi-labelled Classification Problem

</div><span class="text_page_counter">Trang 6</span><div class="page_container" data-page="6">

2.1.4.3 Malware Classification based on CNN Model

2.1.4.4 Summary of Experimental Results

2.2 Feature Augmentation based on Apriori Algorithm

2.2.4.1 Experimental Dataset and Scenario

2.2.4.2 Experiment based on CNN Model

2.2.4.3 Summary of Experimental Results

</div><span class="text_page_counter">Trang 7</span><div class="page_container" data-page="7">

2.4 Chapter Summary

<b>3 DEEP LEARNING-BASED </b>

3.2.2.3 Malware Classification using CNN Model

3.2.2.4 Summary of Experimental Results

3.3 Proposed Method using WDCNN Model for Android Malware Classification

</div><span class="text_page_counter">Trang 9</span><div class="page_container" data-page="9">

<b>No. Abbreviati</b>

2 API Application Programming Interface

3 CNN Convolutional Neural Network

4 DBN Deep Belief Network

5 DNN Deep Neural Network

13 LSTM Long Short-Term Memory

14 PSO Particle Swarm Optimization

16 RNN Recurrent Neural Network

17 SVM Support Vector Machine

18 TF-IDF Term Frequency – Inverse Document Frequency

21 RBM Restricted Boltzmann Machine

22 WDCNN Wide and Deep CNN

23 XML Androidmanifest.xml

</div><span class="text_page_counter">Trang 10</span><div class="page_container" data-page="10">

3.1 Result with Acc measure (%) in scenario 1

3.2 Result with Acc measure (%) in scenario 2

3.3 Results with measures in scenario 3 (%)

3.4 Experimental results using CNN model

3.5 The datasets used for the experiment

3.6 Experimental results of Simple dataset

3.7 Experimental results of Complex dataset

3.8 Experimental results when comparing models

3.9 Accuracy comparison of models. Features: Images 128x128 + permission + API

3.10 Experimental results with scenario 3 (%)

3.11 Average set of weights (accuracy - %)

3.12 Set of weights according to the number of samples (accuracy - %)

3.13 Our proposed set of weights (accuracy - %)

3.14 Summary of results of proposed machine learning, deep learning models and comparison

</div><span class="text_page_counter">Trang 11</span><div class="page_container" data-page="11">

<b>LIST OF FIGURES</b>

1.1 Architecture of Android OS system [37]

1.2 The increase of malware on Android OS

1.3 Types of malware on Android OS

1.4 Anomaly-Based Detection Technique

1.5 Overview of the problem of detecting malware on the Android

1.6 General model of feature extraction methods

1.7 Statistics of papers using machine learning and deep learning from 2019-2022 on dblp

1.8 Architecture of the CNN model [133]

2.1 Evaluation model for Android malware classification using co-occurrence matrix

2.2 Output matrix with different size

2.3 Top (10) malware families in Drebin dataset

2.11 Experimental model when applying feature selection algorithm

2.12 Experimental results when applying feature selection algorithm

3.1 System development and evaluation process using the DBN

</div><span class="text_page_counter">Trang 12</span><div class="page_container" data-page="12">

3.11 Overall model using federated learning

3.12 Compare the results of the weighted aggregation methods

3.13 Classification results with influence factor

</div><span class="text_page_counter">Trang 13</span><div class="page_container" data-page="13">

In the present day, there is a growing inclination towards the adoption of digital transformation and artificial intelligence in smart device applications across diverse operating systems. This trend aligns with the advancements of the fourth industrial

and variety of devices that use the Android operating system (OS) have contributed to the significant increase in the number, style, and appearance of malware. According to the statistics [2], a total of 3.36 million malware samples were found in the Android OS market in 2021. This situation endangers users of mobile operating systems. Solving the problems of malware detection is, therefore, urgent and necessary. As reported in the DBLP database [3], there were 1,081 studies on this issue from 2013 to 2022.

Two main approaches are commonly applied to detect Android malware: static and dynamic analysis. Static analysis involves inspecting a program's executable file structure, characteristics, and source code. The advantage of static analysis is that it does not require the code to be executed (running a malware file on a real system is, of course, dangerous). By examining the decompiled code, static analysis can determine the flows and actions of the executable file and thus identify it as either malware or benign. The disadvantage, however, is that some sophisticated malware can include malicious runtime behavior that can go undetected. On the other hand, dynamic analysis involves executing potentially malicious code in a real or sandbox environment to monitor its behavior. The sandbox environment helps analysts examine potential threats without putting the system at risk of infection. Although dynamic analysis could detect threats that might be ignored by static analysis, this approach requires more time and resources than static analysis, and it may not be able to cover all the possible execution paths of the malware. In summary, static analysis is said to help find known threats and vulnerabilities. In contrast, dynamic analysis is suitable for finding new types and uncovering threats not previously documented (i.e., zero-day threats). For the problem of malware detection, dynamic analysis seems recommended for organizations that need a deeper understanding of malware behavior or impact and have the necessary tools and expertise to perform it. For the problem of malware classification, static analysis is more popular due to its more straightforward

</div><span class="text_page_counter">Trang 14</span><div class="page_container" data-page="14">

implementation. This dissertation also uses static analysis as the main method for feature extraction [4,5,6,7,8,9,10,11,12,13,14].

Malware classification assigns malware samples into specific malware families, including benign ones. Signature-based and machine learning-based methods have usually been used for this problem. Signature-based methods have been traditional and widely used [15,16,17]. They rely on matching the "signature" of known malware samples with unknown ones. As mentioned in the previous paragraph, static or dynamic analysis can extract the "signature" from samples. Several limitations of signature-based methods exist as follows: (i) they

to evade detection; and (iii) they require constant updates of the signature database.

Machine-learning-based methods are emerging and promising techniques for malware classification. They use various machine learning algorithms to learn from a large set of labeled malware samples and then classify new ones based on their features. In contrast to signature-based methods, they can overcome some of the challenges those methods face, such as detecting new or unknown malware, handling complex or dynamic code features, and reducing human intervention and manual analysis. However, the machine learning-based method has some drawbacks: (i) it requires more time and resources than the signature-based method, and (ii) its classification accuracy depends on the quality of the training data labels as well as on the learning model. Since machine-learning-based methods are more advanced than signature-based ones, this work focuses on the machine-learning-based method for Android malware classification.

reliable and explainable, as they may group apps based on arbitrary or irrelevant features or fail to capture the true characteristics of malware. Therefore, supervised learning is still more popular in applications for Android malware classification due to its more accurate and interpretable results [6,18,19,20,21,22,23,24,25,26,27]. This work focuses on a supervised learning model based on the above analysis. Supervised learning requires a large and reliable dataset that labels Android apps as benign or malware. Fortunately, such datasets can be easily found on the Internet.

There are many steps involved in a machine learning problem, but two of the most important ones are data preparation and model evaluation:

• Data preparation is the process of collecting, cleaning, transforming, and selecting the data that will be used for machine learning. Data preparation is crucial because it affects the

</div><span class="text_page_counter">Trang 15</span><div class="page_container" data-page="15">

quality and performance of the machine learning model. If the data is incomplete, inaccurate, irrelevant, or inconsistent, the model cannot learn the correct patterns and make accurate predictions.

• Model evaluation measures and compares the machine learning model's performance on unseen data. Model evaluation is important because it helps to determine how well the model generalizes to new situations and how reliable its predictions are.
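As a small illustration of model evaluation on unseen data, the sketch below computes the usual binary classification measures from held-out labels and predictions. The label lists are made up for the example, and the function name `binary_metrics` is an assumption rather than anything defined in this dissertation.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for labels in {0, 1} (1 = malware)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # malware correctly flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # benign correctly passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign wrongly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # malware missed
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical held-out labels and model predictions (not real experimental data)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

Reporting precision and recall alongside accuracy matters when benign and malicious samples are unevenly represented, since accuracy alone can hide a high miss rate.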

Feature extraction is one step of data preparation. For Android malware classification, the input of this step is a list of APK (Android Application Package) files, and the output is raw features extracted from APK files. By applying static analysis, some examples of raw features would be (i) permissions (a list of permissions that the app requests from the system or user, such as permission to access the Internet, contact information, etc.), (ii) API calls (a list of methods the app invokes from the Android framework or other libraries), and (iii) resources (the files that the app uses to store data or provide user interface elements, such as images, icons, sounds, strings, layouts, etc.). These features are usually presented in "string" format and thus need to be converted into numbers before being used as input for the machine learning model. Many related works have used one or a combination of the above raw features without considering the relationship among these features [4,5,7,26,28,29]. In this dissertation, two methods are proposed for raw feature augmentation, i.e., those that

the relationship between features based on the observation that if a malware requests specific permission, it may tend to call some particular APIs (therefore, observing the co-occurrence of permission and/or API calls may help to make the relationship between individual features).
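The two ideas in this paragraph, turning string features into numbers and counting how often features appear together, can be sketched as follows. This is a generic illustration under stated assumptions: the three feature sets are invented, and the construction shown is a plain pairwise co-occurrence count, not necessarily the co-occurrence matrix method proposed in Chapter 2.

```python
from itertools import combinations

# Hypothetical raw features (permissions and API calls) from three APK files
apps = [
    {"INTERNET", "READ_CONTACTS", "sendTextMessage"},
    {"INTERNET", "sendTextMessage"},
    {"INTERNET", "READ_CONTACTS"},
]

# Fixed vocabulary: every distinct feature string gets a column index
vocab = sorted(set().union(*apps))
index = {name: i for i, name in enumerate(vocab)}


def to_binary_vector(features):
    """Encode a set of feature strings as a 0/1 vector over the vocabulary."""
    vec = [0] * len(vocab)
    for name in features:
        vec[index[name]] = 1
    return vec


def cooccurrence(samples):
    """Count, for each pair of features, how many samples contain both."""
    n = len(vocab)
    matrix = [[0] * n for _ in range(n)]
    for features in samples:
        for a, b in combinations(sorted(features), 2):
            matrix[index[a]][index[b]] += 1
            matrix[index[b]][index[a]] += 1
    return matrix


vectors = [to_binary_vector(a) for a in apps]
matrix = cooccurrence(apps)
```

A high count for a permission and API pair (here, `INTERNET` with `sendTextMessage`) is the kind of relationship between individual features that the paragraph describes.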

For the model evaluation, several typical machine learning models were investigated and adopted for this problem, including SVM, RF, decision tree, KNN, Naive Bayes, etc. [14,25,27,30]. Although

can achieve quite a high classification accuracy, they usually focus on malware detection problems, i.e., binary classification. In recent years, deep learning models, such as the Convolutional Neural Network (CNN), have dominated many fields of machine learning, e.g., fingerprint recognition, face recognition, voice detection, anticipation, etc. However, quite a few works tried CNN for the problem of Android malware detection and classification [31,32,33,34]. The advantage of deep learning models is that they can "learn" features from raw input data. Manual feature extraction may, therefore, be unnecessary in some cases. Some research has proposed the idea that by directly converting APK files into "images," the malware classification problem would, therefore, become an image classification one

</div><span class="text_page_counter">Trang 16</span><div class="page_container" data-page="16">

and thus can be solved by the CNN model [35,36]. This idea shows good performance for malware on the Windows platform

</div><span class="text_page_counter">Trang 17</span><div class="page_container" data-page="17">

but not good performance on the Android platform. The poor result on Android is because, unlike the executable file in Windows, an APK file in Android is not a single file; instead, it contains all the contents needed to run the application, including the Android manifest, classes.dex (compiled Java code), and resources. Therefore, simply converting an APK file into an "image" may make no sense. Even in the case of converting only the classes.dex file (the runtime compiled code) into an "image," the represented "image" may lack a lot of information stored in other files, and this consequently leads to a poor classification result. Therefore, to take full advantage of deep learning, which is the ability to learn some hidden features of the sample files, while still improving the performance of Android malware classification, the Wide and Deep (WD) model was proposed for this problem. Experimental results conducted on different datasets proved the feasibility of the proposed idea.
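To make the point about APK structure concrete, the sketch below treats an APK as the ZIP archive it is, lists its entries, and maps the bytes of `classes.dex` onto a square grayscale grid. It is a minimal illustration only: the archive is a tiny stand-in built in memory, and the helper names and the zero-padded square layout are assumptions, not the image conversion used in the cited works or in this dissertation.

```python
import io
import math
import zipfile


def list_apk_entries(apk_bytes):
    """Return the file names stored inside an APK (an APK is a ZIP archive)."""
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as apk:
        return apk.namelist()


def bytes_to_square_image(data):
    """Map raw bytes to a square grid of pixel values 0-255, zero-padded at the end."""
    side = math.isqrt(len(data))
    if side * side < len(data):
        side += 1
    padded = data + bytes(side * side - len(data))
    return [list(padded[r * side:(r + 1) * side]) for r in range(side)]


# Build a tiny stand-in archive in memory (hypothetical contents, not a real app)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("AndroidManifest.xml", b"<manifest/>")
    archive.writestr("classes.dex", b"dex\n035\x00" + bytes(range(10)))
    archive.writestr("res/layout/main.xml", b"<LinearLayout/>")
apk_bytes = buffer.getvalue()

entries = list_apk_entries(apk_bytes)
with zipfile.ZipFile(io.BytesIO(apk_bytes)) as apk:
    dex_image = bytes_to_square_image(apk.read("classes.dex"))
```

The image built this way sees only `classes.dex`; the permissions declared in the manifest never reach it, which is the information loss the paragraph refers to.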

In summary, this dissertation offers the following main contributions:

• Proposing feature enhancement methods:

<b>– Feature augmentation based on co-occurrence matrix in the work [Pub.2].</b>

<b>– Feature augmentation based on the Apriori algorithm in the work [Pub.6].</b>

comprehensive review and synthesis of pertinent literature to establish a general problem and subsequently scrutinized it to identify unresolved issues. Subsequently, the dissertation introduces several approaches to address these issues during the feature extraction and training stages. The proposed methodologies were tested using three datasets from trusted sources to assess and contrast their performance against alternative approaches. The present dissertation is organized in the following structure:

</div><span class="text_page_counter">Trang 18</span><div class="page_container" data-page="18">

<i><b>• Chapter 2. Proposed Methods for Feature Extraction. This chapter</b></i> based on co-occurrence attributes to find new characteristics and help renovate the characteristic set. A selection method based on popularity and contrast value developed two measures, and a characteristic evaluation method is based on these two measures; characteristic selection is based on the value of the evaluation method.

<i><b>• Chapter 3. Deep Learning Based for Android Malware Classification.</b></i> Presenting the implementation of some deep learning methods for Android malware detection problems and proposing some models of augmentations for the problem. The first part of the chapter proposes and tests the application of deep belief and convolutional neural networks to detect Android malware. Based on the study results at the beginning of the chapter about the superiority of the CNN model in the Android malware detection problem, the dissertation will

based on the sample set size for the Android malware detection problem.

</div><span class="text_page_counter">Trang 19</span><div class="page_container" data-page="19">

<b>Chapter 1 OVERVIEW OF</b>

<b>ANDROID MALWARE CLASSIFICATION BASED ON MACHINE LEARNING</b>

Chapter 1 will provide an overview of foundational knowledge, covering aspects such as Android operating system architecture, malware, Android-specific malware, methods for classifying malware on Android, metrics used in machine learning and deep learning, and related works.

</div><span class="text_page_counter">Trang 20</span><div class="page_container" data-page="20">

Figure 1.1: Architecture of Android OS system[37]

<b>• Linux Kernel</b>

The Android Operating System is built upon the Linux kernel version 2.6. All operations that need to be executed are carried out at this level. These processes include memory management, hardware communications (driver models), security tasks, and process management.

Although Android was built upon the Linux kernel, the kernel has been heavily modified. These modifications are tailor-made to satisfy the characteristics of handheld devices, such as the limited nature of the CPU, memory and storage, screen size, and, most importantly, the continuous need for wireless connections.

</div><span class="text_page_counter">Trang 21</span><div class="page_container" data-page="21">

This level contains the following components:

<b>– Display Driver: controls the screen’s display and captures user interactions </b>

<b>– Binder IPC Driver: handles connections and communication with wireless</b>

networks such as CDMA, GSM, 3G, 4G, and E to ensure seamless

The hardware abstraction layer (HAL) provides standard interfaces that expose device hardware capabilities to the higher-level Java API framework. The HAL consists of multiple library modules, each implementing an interface for a specific hardware component, such as the camera or Bluetooth module. When a framework API makes a call to access device hardware, the Android system loads the library module for that hardware component.

<b>• Android Runtime</b>

The Android Runtime provides the libraries that any programs in Java need to function correctly. It has two main components, much like the Java equivalent on personal computers. The first component is the Core Library, which contains classes such as Java IO, Collections, and File Access. The second component is the Dalvik Virtual Machine, an environment for running Android applications.

<b>• Native C/C++ Libraries</b>

This section comprises numerous libraries written in C/C++ to be utilized by software applications. These libraries are grouped into the following categories:

</div><span class="text_page_counter">Trang 22</span><div class="page_container" data-page="22">

<b>– System C Libraries:</b> these libraries are based on the C standard and are used exclusively by the operating system.

<b>– OpenGL ES:</b> Android supports high-performance 2D and 3D graphics with the Open Graphics Library (OpenGL<small>®</small>), specifically, the OpenGL ES API. OpenGL is a cross-platform graphics API that specifies a standard software interface for 3D graphics processing hardware.

<b>– Media Libraries:</b> this collection contains various code segments to support

<b>• Java API Framework</b>

The entire feature set of the Android OS is available to you through APIs written in the Java language. These APIs form the building blocks you need to create Android apps by simplifying the reuse of core, modular system components, and services, which include the

<b>– Window Manager:</b> manages the construction and display of user interfaces and the organization and management of interfaces

</div><span class="text_page_counter">Trang 23</span><div class="page_container" data-page="23">

System Apps are apps that communicate with the users. Some of these apps include:

– The basic apps that come with the OS, such as Phone, Contacts, Browser, SMS, Calendar, Email, Maps, Camera, etc.

– The user-installed apps, like games, dictionaries, etc.

These applications share these characteristics:

– Written in Java or Kotlin, with extension type APK (APK file).

– When an app is run, a Virtual Machine is initialized for that runtime. The app can be an Active Program with a user interface, a background app, or a service.

– Android is a multitasking operating system, meaning users can run multiple programs and tasks simultaneously. However, for each app, there exists only one instance. This prevents the abuse of resources and generally helps the system run more efficiently.

– Applications in Android are assigned user-specific ID numbers to differentiate their privileges when accessing resources, hardware configurations, and the system.

– Android is an open-source operating system, distinguishing it from many other mobile operating systems. It allows third-party applications to run in the background. However, these background apps have a minor restriction, as they are limited to using only 5-10% of the CPU capacity. This limitation is in place to prevent monopolization of CPU resources.

<i><b>1.1.2 Overview of Android Malware</b></i>

<b>• Malware Definition</b>

According to NIST [38], Malware is defined as:

“Malware, also known as malicious code, refers to a program that is covertly inserted into another program intending to destroy data, run destructive or intrusive programs, or otherwise compromise the confidentiality, integrity, or availability of the victim’s data, applications, or operating system. Malware is the most common external threat to most hosts, causing widespread damage and disruption and necessitating extensive recovery efforts within most organizations”.

From the above definition, it can be seen that malware is harmful to users and systems. Understanding malware and how to prevent it helps protect users in today’s connected environment.

</div><span class="text_page_counter">Trang 24</span><div class="page_container" data-page="24">

<b>• Categories of Malware</b>

The rise of malware comes with the development of the internet, especially now that all activities, including social and financial ones, can be performed online and are subject to anonymous attacks with malicious intentions. Malware will be classified into seven types, as shown in Table 1.1 below [38, 39]:

Table 1.1: Types of malware

<b>Malware </b>

Viruses self-replicate by inserting copies of themselves into host programs or data files. Viruses are often triggered through user interaction, such as opening a file or running a program. Viruses can be divided into the following two subcategories:

<b>– Compiled Viruses:</b> a compiled virus is executed by an operating system. Types of compiled viruses include file infector viruses, which attach themselves to executable programs; boot sector viruses, which infect the master boot records of hard drives or the boot sectors of removable media; and multipartite viruses, which combine the characteristics of file infector and boot sector viruses.

<b>– Interpreted Viruses:</b> interpreted viruses are executed by an application. Within this subcategory, macro viruses take advantage of the capabilities of applications’ macro programming language to infect application documents and document templates. In contrast, scripting viruses infect scripts that are understood by scripting languages processed by services

</div><span class="text_page_counter">Trang 25</span><div class="page_container" data-page="25">

Worms: a worm is a self-replicating, self-contained program that usually executes itself without user intervention. Worms are divided into two categories:

<b>– Network Service Worms:</b> a network service worm takes advantage of a vulnerability in a network service to propagate itself and infect other systems.

<b>– Mass Mailing Worms:</b> a mass mailing worm is similar to an e-mail-borne virus but is self-contained rather than infecting an existing file.

Example: Stuxnet, SQL Slammer

Trojan Horses

A Trojan Horse is a self-contained, nonreplicating program that, while appearing benign, actually has a hidden malicious purpose. Trojan horses either replace existing files with malicious versions or add new ones to systems. They often deliver other attacker tools to systems.

Example: Emotet, Triada.

Spyware is malware that can run secretly on the system without notifying users. To disrupt system processes, spyware aims to collect private information and grant remote access to bad actors. Spyware is often used to steal financial information or private user information.

Example: DarkHotel, Olympic Vision, Keylogger

Adware is the most commonly used malware to collect user data on the system and provide ads to users without permission. Even though adware is not always dangerous, in some situations it can cause system crashes.

</div><span class="text_page_counter">Trang 26</span><div class="page_container" data-page="26">

Ransomware is a kind of malware that gains access to private information on the system; it encrypts data to prevent user access, and then the attackers can take advantage of the situation and blackmail users. Ransomware is usually part of phishing actions. The attacker can encrypt information so that it can only be opened with his key.

Example: RYUK, Robbinhood, Clop, DarkSide

Fileless malware

Fileless malware lives inside the memory. This software will be processed from the victim system’s memory (NOT from files on the hard disk). Thus, it is harder to detect compared to other classic malware. It also makes the encryption process harder because fileless malware will disappear when restarting the system.

Example: Astaroth

<b>• Android Malware Overview</b>

Android OS has consistently held a high share of the mobile operating system market. According to the statistics of [1], in June 2023 Android held 70.79% of the mobile market. Thus, Android OS’s vulnerabilities are attractive to hackers, as all social and financial activities can now be performed on mobile devices. According to AV-Test [2], new types of malware are still being created annually, along with the development of an open-source OS like Android. The malware increase from 2013 to March 2022 is shown in Fig. 1.2.

<b>• Android Malware Characteristics</b>

Malware is a developing threat to every connected individual in the age of mobile phones and the internet. Because of the financial incentives, the number and complexity of Android malware are growing, making it more difficult to detect. Android malware is almost identical to the varieties of malware that users might be familiar with on their desktops, but it is only for Android phones and tablets. Android malware primarily steals private information, which can be as common as the phone number, emails, or contacts of the user or as critical as financial credentials. With that data, the scammers have many unlawful options that can earn them substantial money. There are some signs indicating that a mobile device was infected by malware: (1) users often see sudden pop-up advertisements on their devices; (2) mobile batteries drain faster than usual; (3) users notice applications that they did not intentionally install; and (4) some apps do not appear on the screen after installation. Android malware appears in many forms,

</div><span class="text_page_counter">Trang 27</span><div class="page_container" data-page="27">

Figure 1.2: The increase of malware on Android OS

such as trojans, adware, ransomware, spyware, viruses, phishing apps, or worms. Kaspersky has investigated widespread malware in 2020 and 2021 and categorized them (Fig. 1.3) [40]. Malware often infiltrates via various traditional sources, such as harmful downloads in emails, browsing dubious websites, or following links from unknown senders.

Figure 1.3: Types of malware on Android OS

<b>Common sources of Android malware:</b>

</div><span class="text_page_counter">Trang 28</span><div class="page_container" data-page="28">

<b>– Applications that have been infected:</b> Attackers can collect popular programs, repackage them with malware, and re-distribute them through download links. This method is so effective that many fraudsters tend to design or advertise new apps; naive users may follow customized download links and accidentally install or download malware to their devices.

<b>– Malvertisements:</b> malvertising is malware embedded in and distributed through advertisements. A virus will be downloaded to the user’s device if the user clicks on one of these pop-ups. The user can block ads on the Android device, which is an effective way to prevent malware.

<b>– Scams:</b> phishing assaults and other standard email- or SMS-based frauds are examples of online scams. The email or message will contain a link to malware, which will be installed on the phone when the user clicks the link. It’s one of the most common ways to infect Android phones.

<b>– Direct download to the device:</b> this is the most trivial way to infect a device with malware. The attackers only need to directly connect a gadget or USB to the phone and install the virus programs. However, this is hard to do in practice, because it is difficult for the attacker to gain direct access to the victim’s device.

<b>1.2 Android Malware Classification Methods</b>

Two techniques are often used for malware detection: signature-based and anomaly-based.

A signature-based approach is often employed in commercial antivirus products, as the detection results attain high accuracy and precision. Malware behaviors or features will be retained in a database of samples or characteristics. A malware detection system (a detector) will analyze and recognize malware based on one or several characteristics that match pre-defined patterns. Malware signatures can be static, known byte sequences, or behavior characteristics, such as network behavior. However, this method is useless in detecting unknown or zero-day malware, as their unique traits do not exist in the program database.

On the other hand, the anomaly-based method can detect unknown suspicious behavior. This method is usually based on machine learning techniques. The difference between normal and abnormal behavior can be modeled during training. Since 2017, machine learning and deep learning, in particular, have been extensively applied for malware detection on mobile devices.

</div><span class="text_page_counter">Trang 29</span><div class="page_container" data-page="29">

<i><b>1.2.1 Signature-based Method</b></i>

In this method, the signature of sample malware will be stored in a list of known threats and their indicators of compromise (IOCs). The signature can be extracted by static or dynamic analysis. The method compares the sample’s signature with all the signatures stored in the database to decide whether a sample is malware.
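The comparison step can be sketched as a lookup against a signature store. The following is a minimal sketch under stated assumptions: the "database" is an in-memory set of SHA-256 digests plus a list of byte patterns, and every digest, pattern, and sample shown is invented for illustration rather than taken from a real threat feed.

```python
import hashlib

# Hypothetical signature database: digests of known samples and known byte sequences
KNOWN_BAD_HASHES = {hashlib.sha256(b"known-malicious-sample").hexdigest()}
KNOWN_BAD_PATTERNS = [b"\xde\xad\xbe\xef", b"sendPremiumSms"]


def matches_signature(sample):
    """Return True if the sample's hash or any known byte pattern is in the database."""
    digest = hashlib.sha256(sample).hexdigest()
    if digest in KNOWN_BAD_HASHES:
        return True
    return any(pattern in sample for pattern in KNOWN_BAD_PATTERNS)


exact_copy = matches_signature(b"known-malicious-sample")
with_pattern = matches_signature(b"header" + b"\xde\xad\xbe\xef" + b"trailer")
modified_copy = matches_signature(b"known-malicious-sample!")
```

The last call returns False even though the sample differs from a known one by a single byte, which shows why unknown or slightly altered malware slips past a purely signature-based check.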

One of the attributes of the signature-based method is highaccuracy.Toachievethat,indicatorsstoredinthedatabasemustbeaccurate,havecomprehen sivecoverage, andbeupdatedregularly,as new malware is bornrapidly.On the other hand, using a signature-based method is time-consuming. The larger the number of files or apps thatneedtobechecked,thelongerthetestingtimerequiredbecausethesystemneeds to sequentially decompile each app, extract features, and then compare each feature with the patterns defined in the database. The program can often combine static and dynamicsignatures,e.g.,dataextractedfromthedecompiledcodeandbehavioraldata while the app runs. The combination will provide more comprehensive coverage, but theexaminationtimewillincreaseconsiderably.

Permissions, API calls, class names, intents, services, or opcode patterns are often used to spot malware. In [16], Enck et al. proposed a security service for the Android operating system called Kirin. Kirin authenticates an app at installation time using a set of protection rules designed to match the properties configured in the app. The Kirin system also evaluates configurations extracted from the installer’s manifest files and compares them with the rules set up and saved in the system.
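An install-time rule check of this kind can be sketched as below. The rules shown are illustrative assumptions only (each rule is a set of permissions that must not all be requested together); they are not Kirin's actual rule set, and the function names are hypothetical.

```python
# Illustrative rules: each rule is a set of permissions that must not
# all be requested by the same app (an assumption for this sketch).
RULES = [
    frozenset({
        "android.permission.RECORD_AUDIO",
        "android.permission.INTERNET",
        "android.permission.READ_PHONE_STATE",
    }),
    frozenset({
        "android.permission.RECEIVE_SMS",
        "android.permission.WRITE_SMS",
    }),
]


def violated_rules(requested, rules=RULES):
    """Return every rule whose permissions are all present in the request."""
    requested = set(requested)
    return [rule for rule in rules if rule <= requested]


def passes_install_check(requested, rules=RULES):
    """An app passes only if it violates none of the stored rules."""
    return not violated_rules(requested, rules)


ok = passes_install_check({"android.permission.INTERNET"})
blocked = passes_install_check({
    "android.permission.RECEIVE_SMS",
    "android.permission.WRITE_SMS",
    "android.permission.INTERNET",
})
```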

Batyuk et al. [17] applied static analysis on 1865 top free Android apps retrieved from the Android Market. The experiments showed that at least 167 of the analyzed apps access private information such as IMEI, IMSI, and phone numbers. One hundred fourteen apps read sensitive data and immediately write them to a stream, which indicates a significant privacy concern.

Dynamic analysis is highly efficient when dealing with obfuscation techniques such as polymorphism, binary packaging systems, and encryption. However, running the app (even in a virtual environment) makes dynamic analysis more time-consuming than static analysis. Chen et al. [15] proposed an approach to identify dangerous samples on Android devices using static features and dynamic patterns. The static features were acquired via decompilation of APK files, from which the connections between the app’s classes, attributes, methods, and variables are extracted. The program also analyzes function calls and the relationships between data threads when the Android app runs. All that information can be used to deduce threat patterns and check whether the app accesses private data or conducts any illegal operation, e.g., sending messages without permission or stealing confidential information. The experiments in the report show that the rate of malware found in 252 samples using the dynamic signature-based method is 91.6%.

</div><span class="text_page_counter">Trang 30</span><div class="page_container" data-page="30">

Figure 1.4: Anomaly-Based Detection Technique

The anomaly-based detection technique consists of the training and detection stages, as presented in Fig. 1.4. This technique observes normal behaviors of the app over a period and uses attributes of standard models as vectors to compare and detect abnormal behaviors if any occur. A set of standard behavior attributes will be developed in the training stage. In the detection stage, when any abnormal “vectors” arise between the model and the running app, that app will be defined as an anomaly program. This technique allows for recognizing even unknown malware and zero-day attacks.
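The two stages can be illustrated with a deliberately simple model: a centroid of normal behavior vectors and a distance threshold. This is a swapped-in toy technique chosen for brevity, not the method of any cited work; the feature values, the `slack` factor, and all function names are assumptions.

```python
import math


def centroid(vectors):
    """Component-wise mean of the behavior vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def distance(a, b):
    """Euclidean distance between two behavior vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def train(normal_vectors, slack=1.5):
    """Training stage: model normal behavior as a centroid plus a radius."""
    center = centroid(normal_vectors)
    radius = max(distance(v, center) for v in normal_vectors) * slack
    return center, radius


def is_anomalous(vector, model):
    """Detection stage: flag vectors that fall outside the learned radius."""
    center, radius = model
    return distance(vector, center) > radius


# Hypothetical vectors: (network calls, file writes, SMS sent) per time window.
normal = [[10, 2, 0], [12, 3, 0], [11, 2, 1], [9, 3, 0]]
model = train(normal)
typical = is_anomalous([11, 2, 0], model)
burst_of_sms = is_anomalous([10, 2, 40], model)
```

Because the model only encodes what normal looks like, the SMS burst is flagged even though no signature for it exists.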

In an anomaly-based approach, the behaviors extracted from applications can be obtained in three ways: static, dynamic, or hybrid analysis. Static analysis examines the app’s source code before installation. Dynamic analysis runs the app and collects data during execution, for example, API calls, events, etc. Hybrid methods use both.

However, the abnormal and expected behaviors of the samples are not easily separated because of the large number of behaviors extracted. There is no fixed basis for determining which behavior is normal and which is not. It is not feasible to divide these behaviors based solely on the analyst’s experience. Machine learning models are applied during training to minimize time and increase efficiency. When applying machine learning, the number of behaviors that should be fed into the training model can be enormous, as all behaviors must be collected as features. Nowadays, many machine learning models have been applied to malware detection, such as

</div><span class="text_page_counter">Trang 31</span><div class="page_container" data-page="31">

SVM (Support Vector Machine), KNN (K-Nearest Neighbors), RF (Random Forest), etc., and the modern deep-learning models DNN (Deep Neural Network), DBN (Deep Belief Network), CNN (Convolutional Neural Network), RNN (Recurrent Neural Network), LSTM (Long Short-Term Memory), GAN (Generative Adversarial Network), etc. Those models will be discussed in a later section of the dissertation.

Schmidt et al. [41] analyzed Linux ELF (Executable and Linking Format) object files in an Android environment using the command readelf. The function calls read from the executables are compared with the malware database for classification by using the Decision Tree learner (DT), Nearest Neighbor (NN) algorithm, and Rule Inducer (RI). This technique shows 96% accuracy in the detection phase with 10% false positives. Schmidt et al. extended their function calls-based technique to Symbian OS [42]. They extracted function calls from binaries and applied their centroid machine, based on a lightweight clustering algorithm, to identify benign and malware executables. The technique provides 70-90% detection accuracy and 0-20% false positives.

Schmidt et al. [43] proposed a framework to monitor smartphones running Symbian OS and Windows Mobile OS to extract system features for detecting anomalous apps. The proposed framework is based on tracking clients that run on mobile devices, collect data describing the system state, such as the amount of free RAM, the number of running processes, CPU usage, and the number of SMS messages in the sent directory, and send it to the Remote Anomaly Detection System (RADS). The remote server contains a database to store the received features; the detection units access the database and run machine learning algorithms, e.g., AIS or SOM, to distinguish between normal and abnormal behaviors. A

different sizes, reducing the set of features from 70 to 14, thus saving 80% of disk space and significantly reducing computation and communication costs. Consequently, the approach positively influences battery life and has a small impact on true positive detection.

Only the machine learning methods applied to malware detection on the Android system will be discussed in this dissertation. The next chapter will detail the analysis to get the behaviors or features by static, dynamic, and hybrid methods.

<i><b>1.2.3 Android Malware Classification Evaluation Metrics</b></i>

In recognition and classification problems, some commonly used measures are <i>Accuracy (Acc), Precision, Recall, F1-score, confusion matrix, ROC curve, Area Under the Curve (AUC)</i>, etc. For classification problems with multiple outputs, there are slight differences in the use of measures.

</div><span class="text_page_counter">Trang 32</span><div class="page_container" data-page="32">

<i>1.2.3.1 Metrics for the Binary Classification Problem</i>

In the detection problem, the output has only two labels, commonly called Positive and Negative, where Positive indicates an app is malware, and Negative alludes to the opposite. Hence, there are four definitions provided:

<i>• TP (True Positive): apps correctly classified as malware.</i>

<i>• FP (False Positive): apps mistakenly classified as malware.</i>

<i>• TN (True Negative): apps correctly classified as benign.</i>

<i>• FN (False Negative): apps mistakenly classified as benign.</i>

While evaluating, the ratio (rate – R) of these four measures is considered:

<i>Acc</i> is often used with problems where the numbers of positive and negative samples are equal. As for problems with a large deviation between the number of positive and

<i>• Recall is defined as the ratio of TP points to those that are actually positive (TP+FN). The formula for calculating Recall is shown as Equation 1.3.</i>
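As a worked illustration of these measures, the sketch below computes them from the four counts. Recall follows the definition given in the text, TP/(TP+FN); the accuracy, precision, and F1 formulas are the standard textbook definitions, supplied here as an assumption because the original equations are not present in this excerpt. The counts in the example are made up.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute Acc, Precision, Recall and F1 from the four confusion counts."""
    total = tp + fp + tn + fn
    acc = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"acc": acc, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical test set: 110 malware apps and 890 benign apps.
metrics = binary_metrics(tp=90, fp=10, tn=880, fn=20)
```

With these imbalanced counts, accuracy stays high (0.97) while recall is noticeably lower (about 0.82), which is why accuracy alone is usually reserved for balanced sets.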

</div><span class="text_page_counter">Trang 33</span><div class="page_container" data-page="33">

<i>1.2.3.2 Metrics for Multi-labelled Classification Problem</i>

When there are multiple labels as output in the classification problem, it can be reduced to a detection problem for each class, considering the data belonging to the class under consideration to be positive and all the remaining data labels to be negative. Thus, there will be a pair of precision and recall for each class. The concepts of

<i>where TP<small>c</small>, FP<small>c</small>, and FN<small>c</small> are respectively the TP, FP, and FN of class c.</i>

<i>Macro-average precision</i> is the average of precisions by class, similar to <i>Micro-average recall</i> (called Recall: average actual classification of each class of malware and benign), given in Equation 1.6.
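The per-class reduction can be sketched as follows. The macro and micro averages below are the standard textbook definitions, used as an assumption because the dissertation's own equations are not present in this excerpt; the labels and predictions are invented for illustration.

```python
def per_class_counts(y_true, y_pred):
    """Count TP, FP and FN for each class, treating that class as positive."""
    classes = sorted(set(y_true) | set(y_pred))
    counts = {c: {"tp": 0, "fp": 0, "fn": 0} for c in classes}
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            counts[truth]["tp"] += 1
        else:
            counts[guess]["fp"] += 1
            counts[truth]["fn"] += 1
    return counts


def ratio(numerator, denominator):
    return numerator / denominator if denominator else 0.0


def macro_micro(y_true, y_pred):
    counts = per_class_counts(y_true, y_pred)
    precisions = [ratio(c["tp"], c["tp"] + c["fp"]) for c in counts.values()]
    recalls = [ratio(c["tp"], c["tp"] + c["fn"]) for c in counts.values()]
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    return {
        "macro_precision": sum(precisions) / len(precisions),
        "macro_recall": sum(recalls) / len(recalls),
        "micro_precision": ratio(tp, tp + fp),
        "micro_recall": ratio(tp, tp + fn),
    }


y_true = ["adware", "adware", "sms", "sms", "benign", "benign"]
y_pred = ["adware", "sms", "sms", "sms", "benign", "adware"]
scores = macro_micro(y_true, y_pred)
```

Macro averaging weights every family equally, so small families influence the score as much as large ones; micro averaging pools the counts first.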

Of the measures mentioned above, Acc and Recall are used for the classification problem in the experiments.

<i><b>1.2.4 Android Malware Dataset</b></i>

Many datasets have been published for the research community as follows:

• Contagio mobile: released in 2010 and last updated in 2010. It consists only of

</div><span class="text_page_counter">Trang 34</span><div class="page_container" data-page="34">

of files in each family is not balanced. Some families have only one or a few (fewer than 10) malware files, while others may have more than 1000 files. Furthermore, Drebin also provides 123,453 benign samples in the form of extracted features.

• PRAGuard: released in 2015, PRAGuard consists of 10,479 malware samples without malware family labels. PRAGuard was created by mixing MalGenome and Contagio Minidump data with seven different mixing techniques. In April 2021, this dataset was decommissioned.

• Androzoo: Androzoo was created in 2016 and is still being updated. It provides both malware and benign apps in large quantities. However, Androzoo only provides apps, and they haven’t been classified into families. So far, the number of files

• CICMalDroid 2020: samples collected in 2018 and published in 2020 with a size of 13,077 files in 5 categories (Adware, Banking Malware, SMS Malware, Mobile Riskware, Benign).

• InvesAndMal (CIC MalDroid2019): samples collected in 2017 and published in 2019 with 5491 files. This dataset is divided into four categories (Adware, Ransomware, Scareware, SMS malware) and consists of 42 families within the above categories. Benign apps account for most of the dataset, about 5000 samples. It is currently still public.

• MalNet2020: the dataset was published in December 2020 with 1,262,024 samples. This dataset is essentially downloaded from Androzoo but has extracted features from FCG (Function Call Graph) and Image. The dataset is divided into 696 malware families and 47 malware

downloaded from MalNet’s homepage ( provides SHA256 to download from Androzoo.

In the experiments in the doctoral dissertation (including experiments in journal articles and conferences), the following datasets were used:

• Virusshare: in the conference paper FAIR [Pub.4], a small set of 500 samples (250 malware and 250 benign) was used. Since the number of malware and benign programs is balanced, the only measure applied is accuracy.

</div><span class="text_page_counter">Trang 35</span><div class="page_container" data-page="35">

• Drebin: This is a well-known dataset used in many papers by local and foreign authors. During this research work, the Drebin dataset was used repeatedly, for example:

<b>– [Pub.1]: this research experimented on the entire Drebin dataset (including</b>

both benign and malware provided). The article showed that using a CNN model had advantages over the original Drebin SVM model. Because the

<b>– [Pub.2, Pub.6]: those journals utilized the entire Drebin malware combined</b>

with 7,140 benign samples from a different source. Multiple measurements were performed to evaluate the feature selection in the paper.

• AMD: similar to Drebin, this dataset is widely used by researchers due to the large quantity and variety of samples.

<b>– In [Pub.10], the 65 families with the most samples were appropriate for the</b>

research. The [Pub.11] study used the AMD dataset with families with at

</div><span class="text_page_counter">Trang 36</span><div class="page_container" data-page="36">

<b>Standard of dataset evaluation:</b>

To evaluate the quality of a dataset, the dissertation uses several criteria: the number of samples, the number of labels, the distribution of samples among classes, and the level of updating of the datasets. These criteria can help ensure that the dataset is comprehensive, well-labeled, balanced, and up-to-date, which can increase the reliability and generalization of the research results.

<b>The quality of classification depends on the dataset:</b>

Based on the above datasets, some datasets are suitable for malware detection tasks (those that only provide malware and are not divided into many malware families) and others for classification tasks (those in which the malware is divided into many families). They also need to be combined with a separate benign set (which can be downloaded from sources such as Androzoo, Google Play, etc.). With the same machine learning and deep learning algorithm, the adaptation to each dataset gives different results. This is because the features extracted in each dataset (a set of samples) are different. Assuming that all datasets have good quality, there is still a clear difference between each dataset due to the different years of publication. Each year, Google provides new versions with many changes, so the features extracted in each set are different. Some datasets have specific

</div><span class="text_page_counter">Trang 37</span><div class="page_container" data-page="37">

features, such as datasets containing C++ code instead of just Java code, datasets containing scrambled code that is not as readable as regular code, encrypted datasets, or datasets that have code rearranged in different positions, etc. From the above, it can be seen that the quality of each dataset significantly affects the classification quality.

<b>Modification and advancement of the dataset:</b>

The investigation conducted in the dissertation indicates that the labeled datasets exhibit a discrepancy among distinct malware families. The Virusshare and Androzoo datasets, which furnish APK files, exhibit a partiality towards specific labels when subjected to labeling software despite their lack of inherent labeling. Consequently, this research has incorporated multiple supplementary evaluation metrics to furnish a more all-encompassing appraisal of the correlation among diverse families with varying quantities, including but not limited to <i>precision, recall, and F1-score</i>.

<b>1.3 Machine Learning-based Method for Android Malware Classification</b>

The problem of malware classification on the Android platform is described in Fig. 1.5. In general, there are four steps involved in Android malware classification.

An APK file is a compressed file containing other files such as <i>AndroidManifest.xml</i> (later called the XML file), <i>classes.dex</i> (later called the DEX file), etc. Features extracted from APK files form a dataset and serve as input to training models. Features are critical to a model and the key components for the model to make true or false decisions. Arrays of features can be collected via static analysis, dynamic analysis, or hybrid

set. For example, index classes can be transformed into image features, groups of dynamic features such as permissions, API calls, intents, etc. can be collected, or the file code can be transformed into a <i>“smali”</i> file. A set of extracted features could be defined as a <i>“raw feature dataset.”</i>
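Since an APK is a ZIP-format archive, its entries can be listed with a standard ZIP reader. The sketch below builds a tiny in-memory stand-in archive so that it runs without a real app; the helper names and the placeholder file contents are assumptions for illustration only.

```python
import io
import zipfile


def list_apk_entries(apk_bytes):
    """Return the names of the files stored inside an APK (a ZIP archive)."""
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as apk:
        return apk.namelist()


def build_toy_apk():
    """Create a stand-in archive with the two entries named in the text."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as apk:
        apk.writestr("AndroidManifest.xml", b"<manifest/>")  # placeholder content
        apk.writestr("classes.dex", b"dex\n035\x00")          # placeholder content
    return buffer.getvalue()


entries = list_apk_entries(build_toy_apk())
has_manifest = "AndroidManifest.xml" in entries
has_dex = any(n.startswith("classes") and n.endswith(".dex") for n in entries)
```

For a real app, the same `list_apk_entries` call would be given the bytes read from the `.apk` file on disk.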

Features of the original dataset (original feature dataset) are transformed into binary form, and these binary values can then be represented in different ways:

<b>• Images:</b> transformed from text to binary or hex code for image point collection.

<b>• Frequency:</b> the frequency with which attributes (permissions, API calls, etc.) occur in an APK file can be transformed into features.

<i><b>• Binary encoding: if the defined behavior takes place, pass “1”, else </b></i>

</div><span class="text_page_counter">Trang 38</span><div class="page_container" data-page="38">

Figure 1.5: Overview of the problem of detecting malware on the Android

<b>• Relationship weight:</b> apply a mathematical model to retrieve the relationships between features and assign weights to the newly acquired relationships (for example, the relationship between APIs).
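Two of the representations listed above, frequency and binary encoding, can be sketched as follows. The app names and feature strings are invented, and using 0 for an absent feature in the binary encoding is an assumption of this sketch.

```python
from collections import Counter


def build_vocabulary(apps):
    """Collect every distinct feature seen in any app, in a fixed order."""
    return sorted({feature for feats in apps.values() for feature in feats})


def frequency_vector(features, vocabulary):
    """Frequency encoding: how many times each feature occurs in the app."""
    counts = Counter(features)
    return [counts[f] for f in vocabulary]


def binary_vector(features, vocabulary):
    """Binary encoding: 1 if the feature occurs, 0 otherwise (assumed)."""
    present = set(features)
    return [1 if f in present else 0 for f in vocabulary]


# Hypothetical raw features extracted from two apps.
apps = {
    "app_a": ["INTERNET", "SEND_SMS", "SEND_SMS"],
    "app_b": ["INTERNET", "getDeviceId"],
}
vocab = build_vocabulary(apps)
freq_a = frequency_vector(apps["app_a"], vocab)
bin_b = binary_vector(apps["app_b"], vocab)
```

Fixing the vocabulary order first ensures that position i means the same feature in every app's vector, which is what a training model expects.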

Mathematical formulas such as the TF-IDF algorithm (Term Frequency – Inverse Document Frequency), IG (Information Gain), PSO (Particle Swarm Optimization), GA (Genetic Algorithm), etc. could be used to assign weights to each feature. Besides, the above algorithms can be used to evaluate the importance of each feature in the dataset. Many studies didn’t assign a new weight to each feature but used the original dataset directly.
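As an illustration of weight assignment, the sketch below applies the plain TF-IDF variant, tf × log(N/df), treating each app as a document and each feature as a term. Many TF-IDF variants exist, and which one a given study uses is not stated here; the app data and function name are assumptions.

```python
import math
from collections import Counter


def tf_idf(apps):
    """Weight each feature of each app by term frequency times log(N / df)."""
    n_apps = len(apps)
    doc_freq = Counter()
    for features in apps.values():
        doc_freq.update(set(features))  # count each feature once per app
    weights = {}
    for name, features in apps.items():
        counts = Counter(features)
        total = len(features)
        weights[name] = {
            f: (c / total) * math.log(n_apps / doc_freq[f])
            for f, c in counts.items()
        }
    return weights


# Hypothetical feature lists for three apps.
apps = {
    "a": ["INTERNET", "SEND_SMS", "SEND_SMS"],
    "b": ["INTERNET", "READ_CONTACTS"],
    "c": ["INTERNET"],
}
weights = tf_idf(apps)
```

A feature that appears in every app, such as `INTERNET` here, receives weight 0, while a feature concentrated in one app receives a high weight.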

The dataset can contain hundreds or tens of thousands of features. Of course, the more features you put into the training model, the longer the training time. On the other hand, many features are not necessarily suitable for the classification problem. Therefore, feature selection is also a problem studied in classification in general and malware classification in particular. Which features to choose and

</div><span class="text_page_counter">Trang 39</span><div class="page_container" data-page="39">

which features to remove depend on the criteria given by each researcher; for example, it is possible to set a threshold on accuracy, recall, etc., at which feature removal stops, or to rely on the weight of each feature in the original dataset and remove the features whose weight falls below a threshold set by the researcher.
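Weight-based selection with a threshold can be sketched using Information Gain as the weight. The binary feature matrix, labels, and threshold value are made up for illustration, and IG is only one of the possible weighting choices.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    result = 0.0
    for label in set(labels):
        p = labels.count(label) / n
        result -= p * math.log2(p)
    return result


def information_gain(column, labels):
    """Entropy reduction obtained by splitting the samples on one feature."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(column):
        subset = [l for v, l in zip(column, labels) if v == value]
        gain -= len(subset) / n * entropy(subset)
    return gain


def select_features(rows, labels, names, threshold):
    """Keep only the features whose information gain reaches the threshold."""
    columns = list(zip(*rows))
    gains = {n: information_gain(c, labels) for n, c in zip(names, columns)}
    kept = [n for n in names if gains[n] >= threshold]
    return kept, gains


names = ["SEND_SMS", "INTERNET", "READ_CONTACTS"]
rows = [[1, 1, 0], [1, 1, 1], [0, 1, 0], [0, 1, 1]]  # one row per app
labels = [1, 1, 0, 0]                                  # 1 = malware, 0 = benign
kept, gains = select_features(rows, labels, names, threshold=0.1)
```

In this toy data `SEND_SMS` separates the classes perfectly, whereas `INTERNET` is constant and `READ_CONTACTS` is uninformative, so only the first is kept.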

<b>3. Feature augmentation from the available feature set</b>

The features used in classifying Android malware are primarily discrete. They will be related to each other in the application. Features will often come in groups (for example, the ACCESS_FINE_LOCATION permission will come together with the getFromLocation() system call). Finding the relationship between features in each such application is complex. Therefore, data engineers have developed many methods to enhance the input dataset. The augmentation approach could generate more features based on their correlation or hybridization or reduce the number of features used as input data. Some feature generation methods, such as Apriori, K-means, FP-growth, etc., can be mentioned. On the other hand, dimensionality reduction techniques often used are low-variance filtering or generalized discriminant analysis.
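Generating new features from co-occurring ones can be illustrated with a simplified pair-support count. This is only in the spirit of the first passes of Apriori, not a full Apriori or FP-growth implementation; the app feature lists, the support threshold, and the `A&B` naming of combined features are assumptions.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(apps, min_support):
    """Return feature pairs that co-occur in at least min_support of the apps."""
    n = len(apps)
    pair_counts = Counter()
    for features in apps:
        for pair in combinations(sorted(set(features)), 2):
            pair_counts[pair] += 1
    return {p: c / n for p, c in pair_counts.items() if c / n >= min_support}


def add_pair_features(features, pairs):
    """Augment an app's features with one combined feature per matching pair."""
    present = set(features)
    combined = [f"{a}&{b}" for a, b in sorted(pairs) if a in present and b in present]
    return sorted(present) + combined


# Hypothetical per-app feature lists.
apps = [
    ["ACCESS_FINE_LOCATION", "getFromLocation", "INTERNET"],
    ["ACCESS_FINE_LOCATION", "getFromLocation"],
    ["INTERNET"],
    ["ACCESS_FINE_LOCATION", "getFromLocation", "SEND_SMS"],
]
pairs = frequent_pairs(apps, min_support=0.5)
augmented = add_pair_features(apps[1], pairs)
```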

Applying new models in training always piques interest in data classification problems. Researchers have applied many machine learning models to improve classification quality, especially image classification. New methods and models are developed based on the existing studies, and deep learning emerges as an evolution of traditional machine learning. In deep learning, the typical model used is CNN. Along with CNN, there are many variations of CNN, such as VGG-16, VGG-19, ResNet, etc. It can be seen that developing or applying a new model to a classifier is of great significance. Many research groups have applied the models to other problems, including the Android malware detection problem.

<b>1.4 Related Works</b>

<i><b>1.4.1 Related Works on Feature Extraction</b></i>

<i>1.4.1.1 Features Extraction Methods</i>

The overview model of feature extraction is described in Fig. 1.6. Current research follows these feature extraction methods:

<b>1. Static features extraction:</b> analyzing source code (via reverse engineering) to get the features as strings (string type) from the file.

<b>2. Dynamic features extraction:</b> take each APK file and run it in an isolated environment (e.g., a sandbox independent of the operating system environment),

</div><span class="text_page_counter">Trang 40</span><div class="page_container" data-page="40">

Figure 1.6: General model of feature extraction methods

just like installing an app and running each module in that app. The desired features can be extracted during execution.

<b>3. Hybrid features extraction:</b> combining static method (1) and dynamic method (2).

<b>4. Image conversion:</b> usually, an APK file, DEX file, and XML file can be transformed into a sequence of bytes, and then an image can be created from the binary sources. The image here can be a grayscale image or a color RGB image. This image conversion method can also be considered a static method; however, the features are not necessarily analyzed thoroughly to produce internal strings like in the static method, so in this study, image conversion analysis will be categorized separately.
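The image conversion in method 4 can be sketched as mapping each byte to one grayscale pixel (0 to 255) and wrapping the byte sequence into rows of a fixed width. The width, the zero padding of the last row, and the function name are assumptions of this sketch; actual studies choose their own layout.

```python
def bytes_to_grayscale(data, width):
    """Arrange raw bytes into rows of `width` pixels, zero-padding the last row."""
    if width <= 0:
        raise ValueError("width must be positive")
    pixels = list(data)  # each byte becomes one intensity value in 0..255
    remainder = len(pixels) % width
    if remainder:
        pixels.extend([0] * (width - remainder))
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]


# Ten placeholder bytes standing in for the start of a DEX file.
image = bytes_to_grayscale(b"dex\n035\x00\x12\x34", width=4)
```

The resulting list of rows can be passed to an imaging library to save or display the picture; for a color image, three consecutive bytes would be grouped into one pixel instead.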

<b>a) Static Extraction Method</b>

The static analysis method decompiles the APK package and analyzes its internal characteristics, thereby collecting suspicious characteristics in the decompiled code files; those “suspicious attributes” in the form of strings are called features. The static method has many advantages, such as:

• Fast analysis.

• Can detect malware whose behavior is not directly visible to the outside.
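As a small illustration of static string features, the sketch below collects the permissions declared in a manifest. It assumes the manifest has already been decoded to plain-text XML (inside an APK it is stored in a binary XML format, so a decoding tool is needed first); the sample manifest and the function name are invented.

```python
import xml.etree.ElementTree as ET

ANDROID_NS = "http://schemas.android.com/apk/res/android"


def extract_permissions(manifest_xml):
    """Return the sorted permission names declared via <uses-permission>."""
    root = ET.fromstring(manifest_xml)
    name_attr = f"{{{ANDROID_NS}}}name"  # namespaced android:name attribute
    return sorted(
        node.attrib[name_attr]
        for node in root.iter("uses-permission")
        if name_attr in node.attrib
    )


# Hypothetical decoded manifest.
SAMPLE_MANIFEST = """<manifest
    xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.example.app">
    <uses-permission android:name="android.permission.SEND_SMS"/>
    <uses-permission android:name="android.permission.INTERNET"/>
    <application android:label="demo"/>
</manifest>"""

permissions = extract_permissions(SAMPLE_MANIFEST)
```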

</div>