Authors:
(1) Constantinos Patsakis, Department of Informatics, University of Piraeus, 80 Karaoli & Dimitriou str., 18534 Piraeus, Greece and Information Management Systems Institute of Athena Research Centre, Greece;
(2) Fran Casino, Information Management Systems Institute of Athena Research Centre, Greece and Department of Computer Engineering and Mathematics, Universitat Rovira i Virgili;
(3) Nikolaos Lykousas, Data Centric, Romania.
Table of Links
2 Related work
2.1 Malware analysis and countermeasures
4 Setting up the experiment and the dataset
5 Experimental results and discussion
6 Integration with existing pipelines
7 Conclusions, Acknowledgements, and References
6 Integration with existing pipelines
As discussed, while LLMs are not mature enough to fully replace traditional deobfuscators, they can efficiently complement them whenever the latter fail. This happens frequently during malware campaigns, where threat actors may push changes to their droppers and payloads, so information that is important for, e.g., takedowns or isolation of the threat can be missed. In the case of Emotet, which we use for reference, some of the compromised domains that were used to download Emotet's executable were missed after changes in the dropper, and the deobfuscator required manual patches.
To address such issues, one could consider a pipeline such as the one illustrated in Figure 7. More precisely, we consider that the input is a malicious MS Office document which, depending on the analysis environment, is sent either to a sandbox or to an emulator. Either of them opens the file and logs the obfuscated malicious payload through its usual mechanisms. The payload is then sent both to the traditional deobfuscators and to an LLM. Each of them tries to extract the configuration, e.g., dropper URLs and C2 servers, while the LLM additionally tries to provide a summary of the code and from that derive the MITRE ATT&CK techniques[6]. This way, the LLM can fill in the gaps of the deobfuscator in case it fails, but also provide a brief analysis of what the malicious script does and an easy-to-digest output for further correlations (MITRE ATT&CK techniques). In our experiments, the prompt in Figure 8 performed this task quite efficiently with OpenAI's GPT-4. For instance, the script of Figure 2 returns the JSON of Figure 9, which is quite accurate.
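To make the pipeline of Figure 7 more concrete, the following Python sketch outlines how the LLM branch could sit next to a traditional deobfuscator. It is a minimal illustration, not the implementation used in our experiments: the prompt text only paraphrases the idea of Figure 8, the `run_traditional_deobfuscator` callable is a hypothetical placeholder for any existing extractor, and the merging logic is deliberately simplistic. It assumes the official `openai` Python client and an `OPENAI_API_KEY` set in the environment.

```python
import json
from openai import OpenAI  # assumes the official OpenAI Python client is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical prompt that only paraphrases the idea of Figure 8.
PROMPT_TEMPLATE = (
    "You are a malware analyst. The following script is obfuscated.\n"
    "Return only JSON with the keys 'urls' (all URLs or domains contacted), "
    "'summary' (a short description of the behaviour) and "
    "'mitre_attack' (a list of MITRE ATT&CK technique IDs).\n\n{payload}"
)

def llm_extract(payload: str) -> dict:
    """Ask the LLM to deobfuscate the payload and report its configuration as JSON."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(payload=payload)}],
    )
    # In a production pipeline the returned JSON should be validated before use.
    return json.loads(response.choices[0].message.content)

def analyse(payload: str, run_traditional_deobfuscator) -> dict:
    """Run the traditional deobfuscator and the LLM side by side, as in Figure 7."""
    report = {"traditional": None, "llm": None}
    try:
        report["traditional"] = run_traditional_deobfuscator(payload)
    except Exception:
        pass  # the rule-based tool may fail after the dropper is updated
    report["llm"] = llm_extract(payload)  # fills the gap and adds summary + ATT&CK mapping
    return report
```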
7 Conclusions
According to our outcomes, LLMs can effectively automate a substantial portion of the deobfuscation process. The latter implies that, even though the advent of LLMs is still in its infancy, they exhibit a remarkable potential to improve malware analysis and to be integrated into real-world pipelines, as we discuss in the following paragraphs.
First, cutting-edge LLMs do not simply generate code or superficially understand its context. Our extensive results clearly show their ability to process it, identify the relevant parts, and operate on them. Notably, this capability is showcased on code that is deliberately written to prevent exactly that. While this is very relevant for LLMs provided as cloud-based services, the same does not apply to local LLMs. Indeed, the disparity between the two flavours of LLMs is so pronounced that local LLMs could be considered inefficient for this task. Yet, despite having fewer parameters than proprietary models, local LLMs can be fine-tuned to optimise their performance on specific tasks, as their weights are made publicly available, which we will explore in future work. The latter includes exploring smaller LLMs to provide resource-efficient solutions for code deobfuscation and analysis, fostering the adoption of LLMs in constrained environments.
The above implies that we do not foresee LLMs replacing traditional unpackers; rather, they will operate within existing pipelines to enrich traditional malware analysis and threat intelligence platforms. To this end, in deobfuscating malicious code, the pressing needs can be summarised in three key areas: minimising hallucinations, expanding the input size for queries, and enhancing training methodologies. More concretely, our experiments have uncovered that even for the best-performing LLM, there are numerous hallucinated domains. This raises a significant concern, as such processes are used to automatically create rules, making the risk of raising false flags very high. Since the hallucinated domains are most likely to originate from the training dataset due to, e.g., high representation and reputation, they would be benign. Thus, the automatically generated rules for the hallucinated domains may not only permit malicious traffic but also inadvertently block legitimate traffic. Furthermore, we should consider that if LLMs hallucinate domains, they may also hallucinate functionality when they do not understand some code snippet, leading to false claims and possibly false attribution. Moreover, the occasional misinterpretation of prompts due to the alignment of some LLMs can also impede such tasks in automated pipelines. Finally, hallucinations are a way for LLMs to fill in the gaps in their responses. Therefore, in the context of this task, and possibly for other cybersecurity tasks, LLMs should respond differently instead of hallucinating; for instance, stating that the task cannot be performed would be preferable to returning a wrong result.
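Since such rules may be generated automatically, one possible guardrail, which we did not evaluate in this work and sketch here only under stated assumptions, is to sanity-check every LLM-extracted domain before it reaches the rule generator: a domain that cannot be located in the payload even after stripping common string-splitting characters, or that belongs to a list of highly popular (and therefore almost certainly benign) domains, is routed to manual review instead. The function names and heuristics below are illustrative, not part of our evaluated pipeline.

```python
OBFUSCATION_CHARS = "\"'+&_ \t\r\n()"  # characters typically used to split strings in droppers

def looks_hallucinated(domain: str, payload: str, popular_domains: set) -> bool:
    """Heuristically flag an LLM-extracted domain as a possible hallucination."""
    # Re-join the payload after removing common string-splitting characters,
    # so that a domain assembled by concatenation can still be located.
    cleaned = "".join(ch for ch in payload.lower() if ch not in OBFUSCATION_CHARS)
    appears_in_payload = domain.lower() in cleaned
    is_popular = domain.lower() in popular_domains  # e.g., loaded from a top-sites list
    return (not appears_in_payload) or is_popular

def split_for_rules(llm_domains: list, payload: str, popular_domains: set):
    """Separate domains that are safe to turn into rules from those needing manual review."""
    keep, review = [], []
    for domain in llm_domains:
        target = review if looks_hallucinated(domain, payload, popular_domains) else keep
        target.append(domain)
    return keep, review
```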
While the scripts in our current dataset are relatively short, malicious code, in general, is significantly longer and would not fit within a single prompt. Hence, expanding the input size becomes an absolute necessity. However, the length of the code is not the sole challenge. Since this code is purposefully obfuscated to evade analysis, even by humans, training LLMs with adequately labelled and properly annotated malicious code is crucial [18]. By fine-tuning LLMs, we anticipate a substantial improvement in the models' accuracy, as well as in their ability to handle more complex tasks, especially in understanding malicious code and artefacts. To achieve this, we plan to use sets of obfuscated and gradually deobfuscated code to train LLMs on how to deobfuscate code. We prioritise VBA, PHP, Python, and JavaScript, since malware analysts often encounter such obfuscated code in malicious documents and webshells. Malware written in other languages is most likely distributed as compiled executables, which cannot be directly and accurately reversed to source code.
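As a purely illustrative assumption of how such data could be organised, each training record in the sketch below maps one obfuscation stage of a script to the next, slightly clearer stage, so that deobfuscation is learned as a sequence of small, verifiable steps; the field names and instruction text are placeholders rather than a prescribed fine-tuning format.

```python
import json

def write_training_pairs(stages: list, path: str) -> None:
    """Serialise gradually deobfuscated variants of one script as JSONL training pairs.

    stages[0] is the fully obfuscated script and stages[-1] the clean one;
    consecutive stages form (input, output) pairs for supervised fine-tuning.
    """
    with open(path, "a", encoding="utf-8") as fh:
        for obfuscated, clearer in zip(stages, stages[1:]):
            record = {
                "instruction": "Deobfuscate the following code by one step.",
                "input": obfuscated,
                "output": clearer,
            }
            fh.write(json.dumps(record) + "\n")
```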
In this regard, and as stated in Section 2, there is a need for greater transparency and ethical considerations in AI development, which are, in turn, necessary to comply with regulations such as the EU Artificial Intelligence Act. One of the main concerns is the unclear origins of training data in most models and the opacity surrounding the refinement of these models, a step that usually requires human interaction. The latter is particularly relevant for evaluating the capabilities of LLMs since biased outcomes and overfitting can only be avoided if a sound methodology is used to define the training and testing procedures, highlighting the relevance of the corresponding datasets.
Acknowledgements
This work was supported by the European Commission under the Horizon Europe Programme as part of the projects LAZARUS (https://lazarus-he.eu/) (Grant Agreement no. 101070303), CyberSecPro (https://www.cybersecpro-project.eu/) (Grant Agreement no. 101083594), and CYMEDSEC. This research is supported by the Ministerio de Ciencia, Innovación y Universidades, Gobierno de España (Agencia Estatal de Investigación, Fondo Europeo de Desarrollo Regional -FEDER-, European Union) under the research grant PID2021-127409OB-C33 CONDOR. Fran Casino was supported by the Government of Catalonia with the Beatriu de Pinós programme (Grant No. 2020 BP 00035), and by AGAUR with the project ASCLEPIUS (2021-SGR-00111).
The content of this article does not reflect the official opinion of the European Union. Responsibility for the information and views expressed therein lies entirely with the authors.
References
[1] Ehab Alkhateeb, Ali Ghorbani, and Arash Habibi Lashkari. A survey on run-time packers and mitigation techniques. International Journal of Information Security, 23(2):887–913, 2024.
[2] Rishi Bommasani, Kevin Klyman, Shayne Longpre, Sayash Kapoor, Nestor Maslej, Betty Xiong, Daniel Zhang, and Percy Liang. The foundation model transparency index. arXiv preprint arXiv:2310.12941, 2023.
[3] Joan Calvet, Fanny Lalonde Lévesque, Jose M Fernandez, J Marion, E Traourouder, and F Menet. Waveatlas: surfing through the landscape of current malware packers. In Virus Bulletin Conference, 2015.
[4] Fran Casino, Nikolaos Lykousas, Ivan Homoliak, Constantinos Patsakis, and Julio Hernandez-Castro. Intercepting hail hydra: real-time detection of algorithmically generated domains. Journal of Network and Computer Applications, 190:103135, 2021.
[5] Anargyros Chrysanthou, Yorgos Pantis, and Constantinos Patsakis. The anatomy of deception: Measuring technical and human factors of a large-scale phishing campaign. Computers & Security, 140:103780, 2024.
[6] Gelei Deng, Yi Liu, Víctor Mayoral-Vilches, et al. Pentestgpt: An llm-empowered automatic penetration testing tool. arXiv preprint arXiv:2308.06782, 2023.
[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics, 2019.
[8] Emmanuel Dupoux. Cognitive science in the era of artificial intelligence: A roadmap for reverseengineering the infant language-learner. Cognition, 173:43–59, 2018.
[9] Europol. World’s most dangerous malware EMOTET disrupted through global action. https://www.europol.europa.eu/media-press/newsroom/news/worlds-most-dangerous-malware-emotet-disrupted-through-global-action, 2021. [Accessed 24-04-2024].
[10] Mohamed Amine Ferrag, Mthandazo Ndhlovu, Norbert Tihanyi, et al. Revolutionizing cyber threat detection with large language models. CoRR, abs/2306.14263, 2023.
[11] Jiaxuan Geng, Junfeng Wang, Zhiyang Fang, Yingjie Zhou, Di Wu, and Wenhan Ge. A survey of strategy-driven evasion methods for pe malware: Transformation, concealment, and attack. Computers & Security, 137:103595, 2024.
[12] Dimitris Gritzalis, Kim-Kwang Raymond Choo, and Constantinos Patsakis. Malware - Handbook of Prevention and Detection. Springer, 2024.
[13] Maanak Gupta, Charankumar Akiri, Kshitiz Aryal, Eli Parker, and Lopamudra Praharaj. From chatgpt to threatgpt: Impact of generative ai in cybersecurity and privacy. IEEE Access, 11:80218– 80245, 2023.
[14] Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
[15] Takashi Koide, Naoki Fukushi, Hiroki Nakano, and Daiki Chiba. Detecting phishing sites using chatgpt. CoRR, abs/2306.05816, 2023.
[16] Jesse D. Kornblum. Identifying almost identical files using context triggered piecewise hashing. Digit. Investig., 3(Supplement):91–97, 2006.
[17] Vasilios Koutsokostas, Nikolaos Lykousas, Theodoros Apostolopoulos, Gabriele Orazi, Amrita Ghosal, Fran Casino, Mauro Conti, and Constantinos Patsakis. Invoice# 31415 attached: Automated analysis of malicious microsoft office documents. Computers & Security, 114:102582, 2022.
[18] Marie-Anne Lachaux, Baptiste Roziere, Marc Szafraniec, and Guillaume Lample. Dobf: A deobfuscation pre-training objective for programming languages. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 14967–14979. Curran Associates, Inc., 2021.
[19] Shijia Li, Jiang Ming, Pengda Qiu, Qiyuan Chen, Lanqing Liu, Huaifeng Bao, Qiang Wang, and Chunfu Jia. Packgenome: Automatically generating robust YARA rules for accurate malware packer detection. In Weizhi Meng, Christian Damsgaard Jensen, Cas Cremers, and Engin Kirda, editors, Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, CCS 2023, Copenhagen, Denmark, November 26-30, 2023, pages 3078–3092. ACM, 2023.
[20] Mandiant. Tracking Malware with Import Hashing. https://cloud.google.com/blog/topics/threat-intelligence/tracking-malware-import-hashing/, 2014. [Accessed 24-04-2024].
[21] Timothy McIntosh, Tong Liu, Teo Susnjak, Hooman Alavizadeh, Alex Ng, Raza Nowrozy, and Paul Watters. Harnessing gpt-4 for generation of cybersecurity grc policies: A focus on ransomware attack mitigation. Computers & Security, 134:103424, 2023.
[22] Eduardo Mosqueira-Rey, Elena Hernández-Pereira, David Alonso-Ríos, José Bobes-Bascarán, and Ángel Fernández-Leal. Human-in-the-loop machine learning: a state of the art. Artificial Intelligence Review, 56(4):3005–3054, 2023.
[23] Trivikram Muralidharan, Aviad Cohen, Noa Gerson, and Nir Nissim. File packing from the malware perspective: Techniques, analysis approaches, and directions for enhancements. ACM Comput. Surv., 55(5), dec 2022.
[24] Jonathan Oliver, Chun Cheng, and Yanggui Chen. Tlsh–a locality sensitive hash. In 2013 Fourth Cybercrime and Trustworthy Computing Workshop, pages 7–13. IEEE, 2013.
[25] Yin Minn Pa Pa, Shunsuke Tanizaki, Tetsui Kou, et al. An attacker’s dream? exploring the capabilities of chatgpt for developing malware. In Proceedings of the 16th Cyber Security Experimentation and Test Workshop, CSET ’23, page 10–18, New York, NY, USA, 2023. Association for Computing Machinery.
[26] Constantinos Patsakis and Anargyros Chrysanthou. Analysing the fall 2020 emotet campaign. arXiv preprint arXiv:2011.06479, 2020.
[27] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training, 2018.
[28] Kevin A. Roundy and Barton P. Miller. Binary-code obfuscations in prevalent packer tools. ACM Comput. Surv., 46(1), jul 2013.
[29] Sayak Saha Roy, Poojitha Thota, Krishna Vamsi Naragam, and Shirin Nilizadeh. From chatbots to phishbots?–preventing phishing scams created using chatgpt, google bard and claude. arXiv preprint arXiv:2310.19181, 2023.
[30] Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
[31] Miuyin Yong Wong, Matthew Landen, Manos Antonakakis, Douglas M. Blough, Elissa M. Redmiles, and Mustaque Ahamad. An inside look into the practice of malware analysis. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, CCS ’21, page 3053–3069, New York, NY, USA, 2021. Association for Computing Machinery.
[32] Geunha You, Gyoosik Kim, Seong-je Cho, and Hyoil Han. A comparative study on optimization, obfuscation, and deobfuscation tools in android. Journal of Internet Services and Information Security, 11(1):2–15, 2021.
[33] Alexandros Zacharis and Constantinos Patsakis. Aicef: an ai-assisted cyber exercise content generation framework using named entity recognition. International Journal of Information Security, 22(5):1333–1354, Oct 2023.
This paper is available on arxiv under CC BY-NC-SA 4.0 by Deed (Attribution-Noncommercial-Sharealike 4.0 International) license.
[6] https://attack.mitre.org/