Hi! I’m Nikolai Khechumov from Avito’s Application Security team. This is the second part of our journey, where we are trying to improve secrets detection. In the previous part, we examined different types of secrets, understood the core problems, and hit a dead end. Now we are going to make a breakthrough and do something useful.
Okay, since we cannot easily use the existing tools, let’s try to understand how they work by reviewing the first two steps any SAST tool takes to build an AST.
A very simplified scheme is shown below:
These two steps are “lexing” and “parsing.”
The lexing stage (also known as ‘tokenization’) receives code as just a stream of characters. It then finds very basic language-specific syntax constructions and outputs them as a set of typed tokens — small strings with a semantic interpretation. For example, the token ‘def’ is not just three characters but a Python keyword reserved for function declarations. We now have valuable insight into the purpose of every token, but the context itself is still missing.
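To make this concrete, here is a tiny illustration (not taken from any SAST tool) using Python’s built-in tokenize module, which turns a character stream into typed tokens:

    import io
    import tokenize

    code = "def main():\n    b = 'hello'\n"

    # Turn the raw character stream into typed tokens.
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        print(tokenize.tok_name[tok.type], repr(tok.string))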
The parsing stage adds the missing context. It combines tokens into higher-level language constructions and outputs them in a form called an “abstract syntax tree” (AST). Now we have more structural detail: groups of tokens have become variables, functions, and so on.
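Again purely for illustration, Python’s built-in ast module shows what the parsing stage produces for the same snippet:

    import ast

    code = "def main():\n    b = 'hello'\n"

    # Combine the tokens into an abstract syntax tree and pretty-print it.
    tree = ast.parse(code)
    print(ast.dump(tree, indent=2))  # indent= requires Python 3.9+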
For us, the question is still the same: how.
One day, while making a presentation for a talk, I had a code snippet that I wanted to format nicely, so I opened a browser and typed “syntax highlighting online.” I opened the first link, pasted my code, pressed “highlight,” and… a lightning strike hit somewhere near me. Without realizing it, I had found a tool that understands the syntax of almost any language and labels every token according to its meaning.
Forget about the presentation — let’s find something similar written in Python. And yes, I found it.
I found Pygments, a fantastic library that solved the most significant problem: language-dependent tokenization. It still uses regexes under the hood, but those regexes are not written by me.
That’s a killer feature!
The library is about syntax highlighting, but it features a RawTokenLexer, so we can output a raw stream of tokens together with their semantic meaning.
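A minimal sketch of what that token stream looks like, using Pygments’ public lex() API (an illustration, not DeepSecrets’ actual code):

    from pygments import lex
    from pygments.lexers import get_lexer_by_name

    code = "def main(arg: str):\n    a = 3\n    b = 'hello'\n    return f'{a}{b}'\n"

    # Every item is a (token_type, value) pair, e.g. (Token.Keyword, 'def').
    for token_type, value in lex(code, get_lexer_by_name("python")):
        print(token_type, repr(value))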
Our first problem — reducing the number of strings to analyze — is solved. Now we understand the type of every token and can simply ignore the useless ones: keywords, punctuation, numbers, operators, etc., leaving only literals and comments. But we are still unable to understand the names and values of variables.
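The filtering step might look something like this (a sketch under my own assumptions, not the library’s API):

    from pygments.token import Comment, String

    def is_interesting(token_type) -> bool:
        # Keep only string literals and comments; drop keywords, punctuation,
        # numbers, operators, whitespace, etc.
        return token_type in String or token_type in Comment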
The lexing problem seems to be solved. Now let's move on to parsing. Building a true AST looks a bit redundant for our problem, so let’s optimize.
Further research showed that a token’s type is more important than its value for variable detection. We need to look for patterns inside a stream of token types; once a pattern is found, we can run additional checks against the values of the tokens inside it.
To create a Token Type Stream (TTS), I took one character from the type name of each token. Thanks to Pygments, token types are common across languages and formats.
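Roughly like this — the exact character mapping below is my assumption for illustration; the real encoding inside DeepSecrets may differ:

    from pygments.token import Comment, Name, Operator, Punctuation, String, Text

    def char_for(token_type) -> str:
        # One character per token: n = name, o = operator, p = punctuation,
        # s = string, c = comment, t = text/whitespace, x = anything else.
        if token_type in String:
            return "s"
        if token_type in Name:
            return "n"
        if token_type in Operator:
            return "o"
        if token_type in Punctuation:
            return "p"
        if token_type in Comment:
            return "c"
        if token_type in Text:
            return "t"
        return "x"

    def to_tts(tokens) -> str:
        # tokens is an iterable of (token_type, value) pairs, e.g. from pygments.lex().
        return "".join(char_for(token_type) for token_type, _ in tokens)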
Eventually, a Variable Detection Rule (VDR) may contain the following (a rough sketch of such a rule as a data structure follows the list):
- a pattern to look for inside a TTS (a regex with match groups);
- MatchRules for match groups that check for expected values inside the pattern (e.g. useful for finding the assignment punctuation);
- MatchSemantics to clarify a group’s semantic purpose.
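One hypothetical way to model such a rule in Python (the field names are my assumptions, not DeepSecrets’ actual classes):

    import re
    from dataclasses import dataclass, field

    @dataclass
    class VariableDetectionRule:
        # Regex applied to the TTS; its match groups mark the interesting tokens.
        pattern: re.Pattern
        # MatchRules: group index -> expected token value (e.g. {2: "="} for assignment).
        match_rules: dict = field(default_factory=dict)
        # MatchSemantics: group index -> semantic role of the token ("name", "value", ...).
        match_semantics: dict = field(default_factory=dict)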
So let’s take our simple code snippet:
def main(arg: str):
    a = 3
    b = 'hello'
    return f'{a}{b}'
Let’s represent it as an array of tokens together with its TTS.
Basically, the variable we are interested in (b) sits between the 20th and the 25th token.
The reference pattern here should be something like this:
(n)t*(o|p)t*(s)(s)(s)t
Let’s apply it to the TTS:
Match groups help us to apply additional verification logic.
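Putting it together, here is a hypothetical sketch of applying this rule to a TTS (the helper name and group layout are my assumptions), given a token list aligned index-for-index with the TTS:

    import re

    PATTERN = re.compile(r"(n)t*(o|p)t*(s)(s)(s)t")

    def find_string_variables(tokens, tts):
        # tokens: list of (token_type, value) pairs; tts: the string built by to_tts(),
        # so character i of the TTS corresponds to tokens[i].
        found = []
        for match in PATTERN.finditer(tts):
            name_idx = match.start(1)    # group 1: the variable's name
            op_idx = match.start(2)      # group 2: the token between name and value
            value_idx = match.start(4)   # group 4: the string content between the quotes

            # MatchRule: the middle group must really be an assignment.
            if tokens[op_idx][1] != "=":
                continue

            # MatchSemantics: group 1 is the name, group 4 is the value.
            found.append((tokens[name_idx][1], tokens[value_idx][1]))
        return found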
And finally, a rule may be supplemented with our understanding of each group’s semantics.
Summarizing everything
The cool part here is that this approach allows you to cover any language with variable detection using only 4–6 rules.
Semantic information about a given file enables us to analyze it more deeply and in the right context. For example, we can now calculate entropy for a variable’s value, knowing for sure that this specific substring really is one.
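A simple Shannon-entropy helper of the kind I mean (an illustration; the actual scoring inside DeepSecrets may differ):

    import math
    from collections import Counter

    def shannon_entropy(value: str) -> float:
        # High entropy suggests a generated secret rather than ordinary text.
        if not value:
            return 0.0
        counts = Counter(value)
        return -sum((n / len(value)) * math.log2(n / len(value)) for n in counts.values())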
Names of variables are also important for analysis. There are two new rules I introduced:
Now we can also use hashed secrets as a source of new rules. A hashed secret rule can contain information about the original secret length to improve performance.
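For illustration, such a rule could look roughly like this (a hypothetical sketch; the real rule format is DeepSecrets’ own):

    import hashlib

    class HashedSecretRule:
        # We never store the secret itself, only its digest and original length.
        def __init__(self, sha256_hex: str, length: int):
            self.sha256_hex = sha256_hex
            self.length = length

        def matches(self, candidate: str) -> bool:
            # Cheap length check first, so we only hash candidates that could match.
            if len(candidate) != self.length:
                return False
            return hashlib.sha256(candidate.encode()).hexdigest() == self.sha256_hex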
This approach became a giant step forward for Avito: we got a noticeable performance boost (up to 60%) and 200% more findings that we could not see before. Of course, the false positive rate has also grown, but that is mostly due to test secrets — the findings themselves are semantically correct.
Our code scanning model requires every scanner to be a web service that supports our internal protocol, so we decided to extract the core functionality and open-source it as a CLI tool called DeepSecrets.
You can find it here: https://github.com/avito-tech/deepsecrets
Release notes are in a separate article here.
Thank you!
The featured image for this piece was generated with Stable Diffusion v2.1.
Prompt: Illustrate a computer screen with the caution symbol.