Authors:
(1) Constantinos Patsakis, Department of Informatics, University of Piraeus, 80 Karaoli & Dimitriou str., 18534 Piraeus, Greece and Information Management Systems Institute of Athena Research Centre, Greece;
(2) Fran Casino, Information Management Systems Institute of Athena Research Centre, Greece and Department of Computer Engineering and Mathematics, Universitat Rovira i Virgili;
(3) Nikolaos Lykousas, Data Centric, Romania.
3 Problem setting
In what follows, we assume that we have a piece of malware from which we want to extract actionable intelligence. Moreover, we assume that the malware is packed to protect its malicious payload. Therefore, the goal of this research is not detection but the extraction of the payload. Since we consider the case of cyber threat intelligence, the assumption that we already know a file is malicious is a weak (that is, not a restrictive) one, as there are many ways to establish this. For instance, the file may be very similar at the byte level to other known malware files (e.g., according to ssdeep [16] or TLSH [24]), its imphash may be known to be malicious [20], or YARA rules may have identified a packer that is known to be malicious [19].
Knowing that the file is malicious, we want to extract the malicious payload to understand what the malware does. Although one may claim that this can be achieved via dynamic analysis, this is not always the case. For instance, as in the case of Emotet, which we discuss later, the malware may use multiple sites to drop the payload or have various command and control (C2) servers. To take down the malware's infrastructure and prosecute the perpetrators, one needs to collect all these domains and IPs, yet all this information is stored in the payload of the malware. The problem is that collecting this information is not straightforward, as malware authors are well aware of its value. Thus, during a malware campaign, the payload, obfuscation mechanisms, and packers can change, leading to the loss of crucial information or requiring considerable manual effort to recover it. In this context, we explore how LLMs can help identify obfuscated information and extract it in an automated way.
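To make the target of the extraction concrete, the following is a minimal sketch of the final collection step only, assuming the payload has already been deobfuscated into plain text. The function name `extract_iocs`, the regular expressions, and the handling of defanged notation (`hxxp`, `[.]`) are illustrative assumptions and not part of the method described in this paper.

```python
import ipaddress
import re

# Hypothetical patterns for illustration; real pipelines need stricter validation.
DOMAIN_RE = re.compile(
    r"\b(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,24}\b",
    re.IGNORECASE,
)
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def extract_iocs(text):
    """Collect candidate domains and IPv4 addresses from deobfuscated text."""
    # Assumption: the input may use defanged notation, so normalise it first.
    cleaned = text.replace("[.]", ".").replace("hxxp", "http")

    ips = set()
    for candidate in IPV4_RE.findall(cleaned):
        try:
            # Reject strings such as 999.1.1.1 that only look like addresses.
            ips.add(str(ipaddress.IPv4Address(candidate)))
        except ValueError:
            pass

    # The final label must be alphabetic, so dotted numbers are not matched here.
    domains = {d.lower() for d in DOMAIN_RE.findall(cleaned)}
    return {"domains": sorted(domains), "ips": sorted(ips)}


sample = (
    "C2: hxxp://update.example[.]com:8080 and 203.0.113.7; "
    "fallback 999.1.1.1 and cdn.example.net"
)
result = extract_iocs(sample)
```

The domain pattern is deliberately loose and would also match file names such as `payload.exe`, so its output should be treated as a list of candidates to be verified rather than as confirmed infrastructure.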
This paper is available on arXiv under the CC BY-NC-SA 4.0 Deed (Attribution-NonCommercial-ShareAlike 4.0 International) license.