Model Performance and Pitfalls in Automated Malware Deobfuscation

by Deobfuscate, April 22nd, 2025

Too Long; Didn't Read

Testing four LLMs on Emotet scripts, GPT-4 led in deobfuscation, but all models struggled with hallucinations and prompt limitations.


Authors:

(1) Constantinos Patsakis, Department of Informatics, University of Piraeus, 80 Karaoli & Dimitriou str., 18534 Piraeus, Greece and Information Management Systems Institute of Athena Research Centre, Greece;

(2) Fran Casino, Information Management Systems Institute of Athena Research Centre, Greece and Department of Computer Engineering and Mathematics, Universitat Rovira i Virgili;

(3) Nikolaos Lykousas, Data Centric, Romania.

Abstract and 1 Introduction

2 Related work

2.1 Malware analysis and countermeasures

2.2 LLMs in cybersecurity

3 Problem setting

4 Setting up the experiment and the dataset

5 Experimental results and discussion

6 Integration with existing pipelines

7 Conclusions, Acknowledgements, and References

5 Experimental results and discussion

For our experiments, we opted to use four state-of-the-art LLMs. More precisely, we used two cloud-based LLMs offered as services and two local ones. For the LLMs provided as cloud services, we opted for OpenAI’s GPT-4 (gpt-4-1106), considered the reference LLM, and the “Pro” variant of Google’s recently introduced LLM, Gemini. For both of them, we used the official APIs, as provided by OpenAI and Google, respectively. In terms of locally deployed LLMs, we considered Meta’s Code Llama Instruct (with 34B parameters) [30], which, as its name implies, is based on Llama 2 with additional fine-tuning on 500 billion tokens of source code data, and Mistral AI’s Mixtral 8x7B Instruct model [14]. For both models, the quantisation level was set to 8 bits (from 16 bits in the original model weights) for all tensors, and they were deployed on NVIDIA A100 GPUs. The above setting ensures adequate diversity and representativeness of LLMs. It should be noted that, due to the criticality and sensitivity of the underlying data, malware analysts are expected to opt for local models to minimise the disclosure of such information to third parties.
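As a rough illustration of how such a local deployment could look, the following Python sketch loads one of the local models with 8-bit quantisation via the Hugging Face transformers library and bitsandbytes. The model identifiers, serving stack, and generation settings are assumptions for illustration; the paper does not specify the exact tooling used.

```python
# Minimal sketch of serving a local instruct model with 8-bit quantisation
# (illustrative only; the exact serving stack is not stated in the paper).
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "codellama/CodeLlama-34b-Instruct-hf"  # or "mistralai/Mixtral-8x7B-Instruct-v0.1"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit weights
    device_map="auto",  # spread layers across the available A100 GPUs
)

def complete(prompt: str, max_new_tokens: int = 1024) -> str:
    """Greedy (deterministic) completion for a single prompt."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # return only the newly generated tokens, not the echoed prompt
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```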


Figure 3: The environment used for the analysis of the malicious documents.


Using the dataset described above, we decided to assess the capabilities of LLMs to deobfuscate the corresponding scripts. To determine whether the LLMs managed to achieve the task, we argue that if an LLM can correctly collect the URLs of a given script, then it has performed the deobfuscation to an acceptable degree and understood the context of the code sufficiently. While we understand that this scope is relatively advanced for an LLM and that a code summary might be far more straightforward, we believe the latter would not be enough. First, the experiment’s goal is not to determine whether a script is malicious or benign. Conceptually, the chances of an obfuscated PowerShell script launched from MS Word being harmless are slim [17]. Thus, we assume the script is malicious, but we want to extract actionable intelligence from it. Moreover, tools like PSDecode that perform some deobfuscation in their summaries may provide some insight into what the code does, e.g., that it downloads and executes some content from the Internet. Note that for such tools to deobfuscate the code, they resort to intercepting calls to functions and logging them. Therefore, the code has to be executed, which requires a sandboxed environment. From the logged information, such tools perform keyword matches on the logs to identify specific actions. As such, while helpful, they do not understand the code and its content, which is what we would expect from an LLM.


Crafting prompts for novel tasks requires trial-and-error experimentation, and different prompt templates and wording choices lead to significant differences in accuracy. The final prompts (see Figure 4 for the OpenAI GPT-4 and Code Llama tasks used to deobfuscate the scripts of our dataset) were selected after several iterations, as they empirically yielded the best results and, in particular, reduced hallucinations in the local models. LLMs allow the control of the randomness and creativity of the responses they generate through a parameter referred to as temperature. In our experiments, we set the temperature of each LLM to zero so that the results are focused, deterministic, exhibit the fewest possible hallucinations, and allow for reproducibility. Then, for each script, we extracted the URLs from the LLM’s output and compared them to our ground truth (i.e., determined whether the URLs were correctly extracted) to assess the LLMs’ accuracy in deobfuscation.
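For concreteness, the following Python sketch shows how such a deobfuscation request with temperature set to zero could be issued to the OpenAI API. The prompt wording here is a paraphrase rather than the exact prompt of Figure 4, and the model identifier corresponds to the gpt-4-1106 snapshot mentioned above.

```python
# Illustrative call to the OpenAI chat completions API with temperature 0.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def deobfuscate(script: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",  # API name of the gpt-4-1106 snapshot
        temperature=0,               # deterministic, least "creative" output
        messages=[
            {"role": "system", "content": "You are a malware analyst."},
            {"role": "user", "content": "Deobfuscate the following PowerShell script "
                                        "and list all URLs it contacts:\n\n" + script},
        ],
    )
    return response.choices[0].message.content
```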


Our assessment shows that OpenAI’s GPT-4 clearly outperforms the other LLMs, correctly identifying 69.56% of the URLs, followed by Google’s Gemini Pro with roughly half that accuracy (36.84%). The two local LLMs scored very low, with Code Llama achieving only 22.13% and Mixtral 11.59%. Although these figures show the prevalence of OpenAI’s GPT-4, there are further things to note. For instance, the deobfuscation might be only partially successful, yet the URL’s domain can still be extracted; in our experiments, this was the result of substitutions or string splits that were not resolved. Therefore, we relaxed the task, requesting the extraction of the correct domain. In this simplified case, the results improved significantly, with each LLM gaining between 13.33% (Code Llama) and 19.16% (GPT-4) in accuracy.
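The two scoring modes can be illustrated with a small Python sketch: an exact match on full URLs and the relaxed match on domains only. The regular expression and helper functions are our own reading of the evaluation described above, not the authors’ code.

```python
# Sketch of the two scoring modes: exact URL match vs. relaxed domain-only match.
import re
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s'\"<>)]+")

def extract_urls(text: str) -> set[str]:
    """Pull anything that looks like an http(s) URL out of free-form LLM output."""
    return set(URL_RE.findall(text))

def score(llm_output: str, ground_truth: set[str]) -> tuple[bool, bool]:
    predicted = extract_urls(llm_output)
    url_hit = bool(predicted & ground_truth)                     # strict: full URL
    pred_domains = {urlparse(u).hostname for u in predicted}
    true_domains = {urlparse(u).hostname for u in ground_truth}
    domain_hit = bool(pred_domains & true_domains)               # relaxed: domain only
    return url_hit, domain_hit
```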


Figure 4: Structure of the task prompts used in OpenAI’s GPT (top) and Code Llama (bottom).


Beyond the poor performance of the two local LLMs, we also witnessed many hallucinations in the models: outputs that conform to the prompt’s request, in that they are URLs, but are factually incorrect and make little sense. The extent of these hallucinations is illustrated in Figure 5c. By far the most hallucinations are generated by the lowest-performing LLM, Mixtral, which produced almost 70% more than the next most hallucination-prone LLM, Code Llama. Interestingly, there is no hallucinated domain common to the top 20 of all LLMs. There is only one hallucinated domain shared by GPT-4 and Gemini Pro, namely blueyellows.com. It is worth noting that this domain resembles blueyellowshop.com, which would be the correctly deobfuscated URL, indicating a partial deobfuscation by the LLM and an attempt to fill in the identified gaps. The most hallucinated domains are admins.com and blog.com, both from Mixtral. We attribute both hallucinations to the presence of the literals admin and blog in several URLs of our dataset and the corresponding scripts. Code Llama often returned bogus domains of the form www.example?.com, where ? is either NULL (i.e., absent), 1, or 2. In this case, there were several WordPress URLs, e.g., http://www.example.com/wp-content/uploads/2019/07/image.png, probably stemming from the fact that Code Llama recognised something that looks like a WordPress URL from its training data and substituted it. Gemini Pro introduced several obvious hallucinations, e.g., youtube.com and facebook.com. Overall, we can claim that most hallucinated domains follow the pattern of the blueyellows.com case discussed above: domains that the LLM processed incorrectly.
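For illustration, near-miss hallucinations such as blueyellows.com versus blueyellowshop.com could be flagged with a simple similarity check, as in the hypothetical Python sketch below; this analysis step is not described in the paper and is purely our own.

```python
# Hypothetical near-miss check: find the ground-truth domain most similar to a
# hallucinated one (e.g., blueyellows.com vs. blueyellowshop.com).
from difflib import SequenceMatcher

def closest_true_domain(hallucinated: str, true_domains: set[str]) -> tuple[str, float]:
    best, best_ratio = "", 0.0
    for domain in true_domains:
        ratio = SequenceMatcher(None, hallucinated, domain).ratio()
        if ratio > best_ratio:
            best, best_ratio = domain, ratio
    return best, best_ratio

print(closest_true_domain("blueyellows.com", {"blueyellowshop.com", "admins.com"}))
# -> ('blueyellowshop.com', ~0.91), i.e., a likely partially deobfuscated domain
```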


Finally, we should note that there were a couple of instances where the GPT-4 and Gemini Pro models refused to perform the task requested in the prompt, producing replies such as “I’m sorry, I cannot extract URLs from this Powershell code as it appears to be obfuscated and possibly malicious. As a language model, I prioritise ethical and safe utilisation of technology.” (GPT-4), and “I’m designed solely to process and generate text, so I’m unable to assist you with that.” (Gemini Pro). This can be attributed to their stringent alignment, given the narrow scope of the task. It was not observed in the locally deployed models.


Beyond merely using the obfuscated PowerShell scripts from Emotet, we wanted to assess whether this approach would work with other obfuscators; note that Emotet seems to use Bohannon’s Invoke-Obfuscation[4] [26]. Therefore, we used Chimera[5], which obfuscates PowerShell scripts and has managed to bypass detection by many antivirus solutions. Having the deobfuscated scripts from the original dataset, we obfuscated them with Chimera. Using the deobfuscated script of Figure 2b as a reference, we obfuscated it with Chimera, and a snapshot of the obfuscated script is illustrated in Figure 6. For the sake of brevity, we have removed from this snapshot the random comments that Chimera adds, since they can easily be removed with a script, a CyberChef recipe, or an LLM (a simple sketch of such a preprocessing step is shown below). As can be observed, Chimera uses very long variable names with random characters; nevertheless, even in its so-called paranoid mode, the strings are not obfuscated enough: see Figure 6, where the string with the domains is more than obvious and easy to extract. To obfuscate a string, one has to specify it manually, after which it is simply split into smaller strings that are assigned to variables so that their concatenation reconstructs the original string.
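The following Python sketch illustrates one way the random comments could be stripped with a script before handing the code to an LLM. It is a naive regex-based illustration, not the authors’ preprocessing, and it does not handle “#” characters that appear inside string literals.

```python
# Naive removal of PowerShell comments (Chimera pads scripts with random ones).
import re

def strip_comments(powershell_source: str) -> str:
    # remove <# ... #> block comments
    without_blocks = re.sub(r"<#.*?#>", "", powershell_source, flags=re.DOTALL)
    # remove single-line comments starting with '#'
    lines = [re.sub(r"#.*$", "", line) for line in without_blocks.splitlines()]
    # drop lines that became empty after stripping
    return "\n".join(line for line in lines if line.strip())
```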


Due to budget constraints, and since we had to perform multiple tasks for each script, we could not run the experiments on all scripts. However, in all our experiments, all LLMs managed to deobfuscate the strings generated by Chimera. Yet, this could not be done in a single task: more prompts were required because the input size exceeded the APIs’ constraints. Hence, one prompt was needed to remove the comments, one to replace the variable names with shorter and more readable ones, and another for the substitution and concatenation of the variables. Nonetheless, LLMs proved to be effective in this task, even when the adversaries used different tooling.
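This multi-prompt workflow could be sketched as follows, assuming a generic ask() helper that wraps whichever LLM is being used; the prompt wording is illustrative only and does not reproduce the authors’ prompts.

```python
# Sketch of the three-step prompting used for Chimera-obfuscated scripts,
# assuming `ask(prompt)` sends a single prompt to the chosen LLM and returns text.
def deobfuscate_chimera(script: str, ask) -> str:
    step1 = ask("Remove all comments from this PowerShell script and return "
                "only the code:\n\n" + script)
    step2 = ask("Rename the variables in this PowerShell script to short, "
                "readable names and return only the code:\n\n" + step1)
    step3 = ask("Replace every variable with its value and concatenate the split "
                "string fragments, then return the resulting script:\n\n" + step2)
    return step3
```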


This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED (Attribution-NonCommercial-ShareAlike 4.0 International) license.


[4] https://github.com/danielbohannon/Invoke-Obfuscation


[5] https://github.com/tokyoneon/Chimera
