String obfuscation detection using machine learning
Title: String obfuscation detection using machine learning
SNIC Project: LiU-2019-37
Project Type: LiU Compute
Principal Investigator: Alireza Mohammadinodooshan
Affiliation: Linköpings universitet
Duration: 2019-10-29 – 2020-11-01
Classification: 10201


String obfuscation is a technique that malware authors use as a countermeasure against malware detection. This project focuses on detecting obfuscated strings in Android applications (APKs) using machine learning. In a string-encrypted app, string material is stored in scrambled form within the application, and additional logic is inserted into the app's program code to reconstruct the strings from the encrypted content on the fly at run time. The use of obfuscated strings can make malware analysis harder or even impossible.
Several techniques for detecting string encryption in APKs have been proposed in the literature, for example in [1,2]. What these works have in common (along with similar works on detecting obfuscated strings in other domains) is that they all use the n-grams or the entropy of strings as feature sets for detecting string obfuscation. To evaluate these feature sets, they use a diverse set of algorithms and tools. Some of these algorithms are computationally light, while others require considerably more computing resources. For example, one of the tools used in [1] is the ATM package [3], an automated machine learning library that performs model selection and tuning automatically: for each training set, it searches through different classification algorithms and their hyperparameters to find the best choice for that dataset. This makes running the algorithm computationally heavy, especially on large datasets.
In this project, we initially focused on showing that the feature sets used in previous works do not generalize well to all types of strings (e.g., short strings). We also proposed our own generalizable string obfuscation detection method. Moreover, we have identified a methodological problem in the empirical evaluation of the string encryption detection capabilities of [1].
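The two feature families mentioned above, character n-grams and string entropy, can be sketched roughly as follows. This is a minimal illustration only: the function names and the sample strings are hypothetical, and the tools in [1,2] may compute their features differently.

```python
import math
from collections import Counter


def shannon_entropy(s: str) -> float:
    """Shannon entropy of the character distribution of s, in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    counts = Counter(s)
    # Sum of p * log2(1/p) over the distinct characters, with p = count / n.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def char_ngrams(s: str, n: int = 2) -> Counter:
    """Counts of the overlapping character n-grams in s (empty if len(s) < n)."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def extract_features(s: str, n: int = 2) -> dict:
    """Hypothetical feature dictionary for one string: length, entropy, n-gram counts."""
    features = {"length": len(s), "entropy": shannon_entropy(s)}
    for gram, count in char_ngrams(s, n).items():
        features["ngram:" + gram] = count
    return features


if __name__ == "__main__":
    # Made-up sample strings, chosen only to show the feature values.
    for sample in ["http://example.com/login", "x9Qz7KpL2mVt8RbW", "ok"]:
        feats = extract_features(sample)
        print(sample, feats["length"], round(feats["entropy"], 3))
```

One property worth noting when reading such features: the entropy of a string of length n cannot exceed log2(n) bits, so very short strings have a narrow range of possible entropy values regardless of whether they are obfuscated. This is offered as an observation about the sketch, not as the argument made in this project.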
Due to our limited computing resources, all of these contributions have so far been evaluated only in small-scale experiments. To validate them and prepare them for publication, we need to repeat the study on larger datasets and with more cross-validation folds. We therefore need access to the HPC cluster to run these experiments.
[1] O. Mirzaei et al., AndrODet: An adaptive Android obfuscation detector. Future Generation Computer Systems 90 (2019), 240–261.
[2] S. Dong et al., Understanding Android Obfuscation Techniques: A Large-Scale Investigation in the Wild. In Security and Privacy in Communication Networks. Springer International Publishing, 172–192.
[3] T. Swearingen et al., ATM: A distributed, collaborative, scalable system for automated machine learning. In 2017 IEEE International Conference on Big Data (BigData 2017), Boston, MA, USA, December 11-14, 2017.