The Great Fuzzy Hashing Debate


In the first post in this series, we introduced the use of hashing techniques to detect similar functions in reverse engineering scenarios. We described PIC hashing, the hashing technique we use in SEI Pharos, as well as some terminology and metrics to evaluate how well a hashing technique is working. We left off last time after showing that PIC hashing performs poorly in some cases, and wondered aloud if it is possible to do better.

In this post, we will try to answer that question by introducing and experimenting with a very different type of hashing called fuzzy hashing. As with regular hashing, there is a hash function that reads a sequence of bytes and produces a hash. Unlike regular hashing, though, you don’t compare fuzzy hashes with equality. Instead, there is a similarity function that takes two fuzzy hashes as input and returns a number between 0 and 1, where 0 means completely dissimilar and 1 means completely similar.

My colleague, Cory Cohen, and I debated whether there is utility in applying fuzzy hashes to instruction bytes, and our debate motivated this blog post. I thought there would be a benefit, but Cory felt there would not. Hence, these experiments. For this blog post, I’ll be using the Lempel-Ziv Jaccard Distance fuzzy hash (LZJD) because it’s fast, whereas most fuzzy hash algorithms are slow. A fast fuzzy hashing algorithm opens up the possibility of using fuzzy hashes to search for similar functions in a large database and other interesting possibilities.
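To give a feel for how a fuzzy hash and its similarity function fit together, here is a minimal sketch of the idea behind LZJD: build a set of substrings with a Lempel-Ziv style scan, then compare two sets with Jaccard similarity. This is a simplified illustration, not the implementation used in these experiments. As I understand it, real LZJD implementations hash the substrings and keep only a fixed-size sample of the smallest hashes to stay fast. The names lz_set and lzjd_similarity are my own.

```python
def lz_set(data: bytes) -> set:
    """Collect the distinct substrings found by a Lempel-Ziv style scan."""
    seen = set()
    start = 0
    for end in range(1, len(data) + 1):
        chunk = data[start:end]
        if chunk not in seen:
            # First time we see this substring: remember it and start a new one.
            seen.add(chunk)
            start = end
    return seen


def lzjd_similarity(a: bytes, b: bytes) -> float:
    """Jaccard similarity of the two LZ sets: 0 is dissimilar, 1 is identical."""
    set_a, set_b = lz_set(a), lz_set(b)
    union = set_a | set_b
    if not union:
        return 1.0  # two empty inputs are treated as identical
    return len(set_a & set_b) / len(union)
```

Calling lzjd_similarity on the instruction bytes of two functions returns a score between 0 and 1, which then has to be compared against a threshold.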

As a baseline, I’ll also be using Levenshtein distance, which is a measure of how many changes you need to make to one string to transform it into another. For example, the Levenshtein distance between “cat” and “bat” is 1, because you only need to change the first letter. Levenshtein distance allows us to define an optimal notion of similarity at the instruction byte level. The tradeoff is that it’s really slow, so it’s only really useful as a baseline in our experiments.
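For readers who have not seen it before, Levenshtein distance can be computed with a standard dynamic programming table. The sketch below works on strings or on raw bytes. The lev_similarity helper, which scales the distance into the 0 to 1 range by dividing by the longer length, is one common normalization and an assumption on my part, not necessarily the one behind the plots in this post.

```python
def levenshtein(a, b) -> int:
    """Minimum number of single-element insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (item_a != item_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def lev_similarity(a, b) -> float:
    """Assumed normalization: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("cat", "bat"))  # prints 1
```

The nested loops make the cost proportional to the product of the two lengths, which is why this is only practical as a baseline.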

Experiments in Accuracy of PIC Hashing and Fuzzy Hashing

To test the accuracy of PIC hashing and fuzzy hashing under various scenarios, I defined a few experiments. Each experiment takes a similar (or identical) piece of source code and compiles it, sometimes with different compilers or flags.

Experiment 1: openssl version 1.1.1w

In this experiment, I compiled openssl version 1.1.1w in a few different ways. In each case, I examined the resulting openssl executable.

Experiment 1a: openssl 1.1.1w Compiled With Different Compilers

In this first experiment, I compiled openssl 1.1.1w with gcc -O3 -g and clang -O3 -g and compared the results. We’ll start with the confusion matrix for PIC hashing:









                              Hashing says same    Hashing says different
Ground truth says same        23                   301
Ground truth says different   31                   117,635

As we saw earlier, this results in a recall of 0.07, a precision of 0.45, and an F1 score of 0.12. To summarize: pretty bad.
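As a reminder of how these metrics fall out of a confusion matrix, here is a small sketch using the standard definitions. The function name is my own. Plugging in the counts from the matrix above gives a recall of about 0.07 and an F1 score of about 0.12. The precision comes out closer to 0.43 than the 0.45 quoted from the earlier post, so treat the second decimal place with some caution.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Standard definitions; returns 0.0 where a ratio would divide by zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# tp: hashing and ground truth both say same
# fp: hashing says same, ground truth says different
# fn: hashing says different, ground truth says same
precision, recall, f1 = precision_recall_f1(tp=23, fp=31, fn=301)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The true negatives (117,635) do not appear in any of the three formulas, which is why these metrics are more informative than raw accuracy when almost every pair of functions is different.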

How do LZJD and Levenshtein distance do? Well, that’s a bit harder to quantify, because we have to pick a similarity threshold at which we consider the function to be “the same.” For example, at a threshold of 0.8, we’d consider a pair of functions to be the same if they had a similarity score of 0.8 or higher. To communicate this information, we could output a confusion matrix for each possible threshold. Instead of doing this, I’ll plot the results for a range of thresholds shown in Figure 1 below:
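The threshold sweep described above can be sketched as follows. Each threshold turns the similarity scores into same/different predictions, which in turn give one precision and recall point on the plot. The scores and labels here are made up purely for illustration and are not taken from the openssl experiments.

```python
def pr_points(scores, labels, thresholds):
    """Return (threshold, precision, recall) for each candidate threshold."""
    points = []
    for t in thresholds:
        pairs = list(zip(scores, labels))
        tp = sum(1 for s, same in pairs if s >= t and same)
        fp = sum(1 for s, same in pairs if s >= t and not same)
        fn = sum(1 for s, same in pairs if s < t and same)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        points.append((t, precision, recall))
    return points


# Hypothetical similarity scores and ground-truth labels (True = same function).
scores = [0.95, 0.90, 0.70, 0.60, 0.40, 0.20]
labels = [True, True, False, True, False, False]

for t, precision, recall in pr_points(scores, labels, [0.8, 0.5]):
    print(f"threshold={t:.1f} precision={precision:.2f} recall={recall:.2f}")
```

In this toy data, lowering the threshold from 0.8 to 0.5 raises recall but lets in a false positive, which is the same tradeoff the plots show.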

Figure 1: Precision Versus Recall Plot for “openssl GCC vs. Clang”

The red triangle represents the precision and recall of PIC hashing: 0.45 and 0.07 respectively, just like we calculated above. The solid line represents the performance of LZJD, and the dashed line represents the performance of Levenshtein distance (LEV). The color tells us what threshold is being used for LZJD and LEV. On this graph, the ideal result would be at the top right (100 percent recall and precision). So, for LZJD and LEV to have an advantage, they should be above or to the right of PIC hashing. But we can see that both LZJD and LEV go sharply to the left before moving up, which indicates that a substantial decrease in precision is needed to improve recall.

Figure 2 illustrates what I call the violin plot. There are three panels: The leftmost is for LEV, the middle is for PIC hashing, and the rightmost is for LZJD. On each panel, there is a True column, which shows the distribution of similarity scores for equivalent pairs of functions. There is also a False column, which shows the distribution of similarity scores for nonequivalent pairs of functions. Since PIC hashing does not provide a similarity score, we consider every pair to be either equivalent (1.0) or not (0.0). A horizontal dashed line is plotted to show the threshold that has the highest F1 score (i.e., a good combination of both precision and recall). Green points indicate function pairs that are correctly predicted as equivalent or not, whereas red points indicate mistakes.

Figure 2: Violin Plot for “openssl gcc vs clang”

This visualization shows how well each similarity metric differentiates the similarity distributions of equivalent and nonequivalent function pairs. Obviously, the hallmark of a good similarity metric is that the distribution of equivalent functions should be higher than nonequivalent functions. Ideally, the similarity metric should produce distributions that do not overlap at all, so we could draw a line between them. In practice, the distributions usually intersect, and so instead we’re forced to make a tradeoff between precision and recall, as can be seen in Figure 1.

Overall, we can see from the violin plot that LEV and LZJD have a slightly higher F1 score (reported at the bottom of the violin plot), but none of these techniques are doing a great job. This implies that gcc and clang produce code that is quite different syntactically.

Experiment 1b: openssl 1.1.1w Compiled With Different Optimization Levels

The next comparison I did was to compile openssl 1.1.1w with gcc -g and optimization levels -O0, -O1, -O2, -O3.

Comparing Optimization Levels -O0 and -O3

Let’s start with one of the extremes, comparing -O0 and -O3:

Figure 3: Precision vs. Recall Plot for “openssl -O0 vs -O3”

The first thing you might be wondering about in this graph is, Where is PIC hashing? Well, if you look closely, it’s there at (0, 0). The violin plot gives us a little more information about what is going on.

Figure 4: Violin Plot for “openssl -O0 vs -O3”

Here we can see that PIC hashing made no positive predictions. In other words, none of the PIC hashes from the -O0 binary matched any of the PIC hashes from the -O3 binary. I included this experiment because I thought it would be very challenging for PIC hashing, and I was right. But, after some discussion with Cory, we realized something fishy was going on. To achieve a precision of 0.0, PIC hashing can’t match a single pair of functions, and that includes trivially simple functions. If your function is just a ret, there’s not much optimization to do.

Eventually, I guessed that the -O0 binary did not use the -fomit-frame-pointer option, whereas all other optimization levels do. This matters because this option changes the prologue and epilogue of every function, which is why PIC hashing does so poorly here.
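To make the frame pointer effect concrete, here is a small sketch with two hand-assembled x86-64 versions of a function that just returns zero, one with the frame pointer prologue and epilogue and one without. The byte encodings are standard, but the function itself is a made-up example, and SHA-256 over the raw bytes is only a stand-in for an exact-match hash. It is not the PIC hashing implementation in Pharos.

```python
import hashlib

# Standard x86-64 encodings for a few instructions.
PUSH_RBP = bytes([0x55])                 # push rbp
MOV_RBP_RSP = bytes([0x48, 0x89, 0xE5])  # mov rbp, rsp
XOR_EAX_EAX = bytes([0x31, 0xC0])        # xor eax, eax
POP_RBP = bytes([0x5D])                  # pop rbp
RET = bytes([0xC3])                      # ret

# Hypothetical "return 0" function, with and without a frame pointer.
with_frame_pointer = PUSH_RBP + MOV_RBP_RSP + XOR_EAX_EAX + POP_RBP + RET
without_frame_pointer = XOR_EAX_EAX + RET


def exact_hash(code: bytes) -> str:
    """Stand-in for an exact-match hash over instruction bytes."""
    return hashlib.sha256(code).hexdigest()


print(exact_hash(with_frame_pointer) == exact_hash(without_frame_pointer))  # prints False
```

Only five of the eight bytes in the first version are prologue and epilogue, yet any change at all is enough to break an exact-match hash. Since every function in the -O0 binary gets this treatment, nothing matches.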

LEV and LZJD do slightly better again, achieving low (but nonzero) F1 scores. But to be fair, none of the techniques do very well here. It’s a difficult problem.

Comparing Optimization Levels -O2 and -O3

On the much easier extreme, let’s look at -O2 and -O3.

Figure 5: Precision vs. Recall Plot for “openssl -O2 vs -O3”

Figure 6: Violin Plot for “openssl -O2 vs -O3”

PIC hashing does pretty well here, achieving a recall of 0.79 and a precision of 0.78. LEV and LZJD do about the same. However, the precision vs. recall graph (Figure 5) for LEV shows a much more appealing tradeoff line. LZJD’s tradeoff line is not nearly as appealing, as it’s more horizontal.

You can start to see more of a difference between the distributions in the violin plots here in the LEV and LZJD panels. I’ll call this one a three-way “tie.”

Comparing Optimization Levels -O1 and -O2

I would also expect -O1 and -O2 to be fairly similar, but not as similar as -O2 and -O3. Let’s see:

Figure 7: Precision vs. Recall Plot for “openssl -O1 vs -O2”

Figure 8: Violin Plot for “openssl -O1 vs -O2”

The precision vs. recall graph (Figure 7) is quite interesting. PIC hashing starts at a precision of 0.54 and a recall of 0.043. LEV shoots straight up, indicating that by lowering the threshold it is possible to increase recall substantially without losing much precision. A particularly attractive tradeoff might be a precision of 0.43 and a recall of 0.51. This is the type of tradeoff I was hoping to see with fuzzy hashing.

Unfortunately, LZJD’s tradeoff line is again not nearly as appealing, as it curves in the wrong direction.

We’ll say this is a pretty clear win for LEV.

Comparing Optimization Levels -O1 and -O3

Finally, let’s compare -O1 and -O3, which are different, but both have the -fomit-frame-pointer option enabled by default.

Figure 9: Precision vs. Recall Plot for “openssl -O1 vs -O3”

Figure 10: Violin Plot for “openssl -O1 vs -O3”

These graphs look almost identical to the ones comparing -O1 and -O2, which makes sense because the difference between -O2 and -O3 is minor. So, it’s again a win for LEV.

Experiment 2: Different openssl Versions

The final experiment I did was to compare various versions of openssl. Cory suggested this experiment because he thought it was reflective of typical malware reverse engineering scenarios. The idea is that the malware author released Malware 1.0, which you reverse engineer. Later, the malware author changes a few things and releases Malware 1.1, and you want to detect which functions did not change so that you can avoid reverse engineering them again.

I compared a few different versions of openssl: 1.0.2u, 1.1.1, 1.1.1q, 1.1.1v, and 1.1.1w.

I compiled each version using gcc -g -O2.

openssl 1.0 and 1.1 are different minor versions of openssl. As explained here:

Letter releases, such as 1.0.2a, exclusively contain bug and security fixes and no new features.

So, we would expect that openssl 1.0.2u is fairly different from any 1.1.1 version. And, within the same minor version, we would expect 1.1.1 to be similar to 1.1.1q, but more different from 1.1.1w.

Experiment 2a: openssl 1.0.2u vs 1.1.1w

As before, let’s start with the most extreme comparison: 1.0.2u vs 1.1.1w.

Figure 11: Precision vs. Recall Plot for “openssl 1.0.2u vs 1.1.1w”

Figure 12: Violin Plot for “openssl 1.0.2u vs 1.1.1w”

Perhaps not surprisingly, because the two binaries are quite different, all three techniques struggle. We’ll say this is a three-way tie.

Experiment 2b: openssl 1.1.1 vs 1.1.1w

Now, let’s look at the original 1.1.1 release from September 2018 and compare it to the 1.1.1w bugfix release from September 2023. Although a lot of time has passed between the releases, the only differences should be bug and security fixes.

Figure 13: Precision vs. Recall Plot for “openssl 1.1.1 vs 1.1.1w”

Figure 14: Violin Plot for “openssl 1.1.1 vs 1.1.1w”

All three techniques do much better on this experiment, presumably because there are far fewer changes. PIC hashing achieves a precision of 0.75 and a recall of 0.71. LEV and LZJD go almost straight up, indicating an improvement in recall with minimal tradeoff in precision. At roughly the same precision (0.75), LZJD achieves a recall of 0.82 and LEV improves it to 0.89. LEV is the clear winner, with LZJD also showing a clear advantage over PIC.

Experiment 2c: openssl 1.1.1q vs 1.1.1w

Let’s continue looking at more similar releases. Now we’ll compare 1.1.1q from July 2022 to 1.1.1w from September 2023.

Figure 15: Precision vs. Recall Plot for “openssl 1.1.1q vs 1.1.1w”

Figure 16: Violin Plot for “openssl 1.1.1q vs 1.1.1w”

As can be seen in the precision vs. recall graph (Figure 15), PIC hashing starts at an impressive precision of 0.81 and a recall of 0.94. There simply isn’t a lot of room for LZJD or LEV to make an improvement. This results in a three-way tie.

Experiment 2d: openssl 1.1.1v vs 1.1.1w

Finally, we’ll look at 1.1.1v and 1.1.1w, which were released only a month apart.

Figure 17: Precision vs. Recall Plot for “openssl 1.1.1v vs 1.1.1w”

Figure 18: Violin Plot for “openssl 1.1.1v vs 1.1.1w”

Unsurprisingly, PIC hashing does even better here, with a precision of 0.82 and a recall of 1.0 (after rounding). Again, there’s basically no room for LZJD or LEV to improve. This is another three-way tie.

Conclusions: Thresholds in Practice

We saw some scenarios in which LEV and LZJD outperformed PIC hashing. However, it’s important to realize that we are conducting these experiments with ground truth, and we’re using the ground truth to select the optimal threshold. You can see these thresholds listed at the bottom of each violin plot. Unfortunately, if you look carefully, you’ll also notice that the optimal thresholds are not always the same. For example, the optimal threshold for LZJD in the “openssl 1.0.2u vs 1.1.1w” experiment was 0.95, but it was 0.75 in the “openssl 1.1.1q vs 1.1.1w” experiment.

In the real world, to use LZJD or LEV, you need to select a threshold. Unlike in these experiments, you could not select the optimal one, because you would have no way of knowing if your threshold was working well or not. If you choose a poor threshold, you might get substantially worse results than PIC hashing.
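The sketch below illustrates the problem with entirely synthetic scores. The two toy datasets are invented to mimic the situation where the best threshold differs between experiments, and they are not measurements from the openssl binaries. The helper names are my own.

```python
def f1_at(scores, labels, threshold):
    """F1 score when pairs scoring at or above the threshold are called 'same'."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, same in pairs if s >= threshold and same)
    fp = sum(1 for s, same in pairs if s >= threshold and not same)
    fn = sum(1 for s, same in pairs if s < threshold and same)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_threshold(scores, labels, candidates):
    """Pick the candidate with the highest F1. This needs ground-truth labels."""
    return max(candidates, key=lambda t: f1_at(scores, labels, t))


candidates = [0.75, 0.95]

# Synthetic dataset A: nonequivalent pairs still score fairly high.
scores_a = [0.97, 0.96, 0.90, 0.85, 0.80]
labels_a = [True, True, False, False, False]

# Synthetic dataset B: equivalent pairs often score below 0.95.
scores_b = [0.95, 0.85, 0.80, 0.78, 0.50]
labels_b = [True, True, True, True, False]

tuned_on_a = best_threshold(scores_a, labels_a, candidates)
print(tuned_on_a, f1_at(scores_b, labels_b, tuned_on_a))
print(best_threshold(scores_b, labels_b, candidates))
```

The threshold that is optimal for dataset A gives a much lower F1 score on dataset B than B's own optimal threshold would. Without ground truth for B, you would have no way to notice.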

PIC Hashing is Pretty Good

I think we learned that PIC hashing is pretty good. It’s not perfect, but it generally provides excellent precision. In theory, LZJD and LEV can perform better in terms of recall, which is appealing. In practice, however, it is not clear that they would, because you would not know which threshold to use. Also, although we didn’t talk much about computational performance, PIC hashing is very fast. Although LZJD is much faster than LEV, it’s still not nearly as fast as PIC.

Imagine you have a database of a million malware function samples and you have a function that you want to look up in the database. For PIC hashing, this is just a standard database lookup, which can benefit from indexing and other precomputation techniques. For fuzzy hash approaches, we would need to invoke the similarity function a million times each time we wanted to do a database lookup.
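The difference between the two lookup styles can be sketched as follows. The database is a tiny in-memory dictionary with made-up function names, SHA-256 stands in for an exact-match hash, and difflib's SequenceMatcher ratio stands in for a fuzzy similarity function. None of these are the actual PIC, LZJD, or LEV implementations.

```python
import hashlib
from difflib import SequenceMatcher


def exact_hash(code: bytes) -> str:
    """Stand-in for an exact-match hash over instruction bytes."""
    return hashlib.sha256(code).hexdigest()


def build_index(functions: dict) -> dict:
    """Precompute hash -> names once, so each later lookup is a dictionary access."""
    index = {}
    for name, code in functions.items():
        index.setdefault(exact_hash(code), []).append(name)
    return index


def exact_lookup(index: dict, query: bytes) -> list:
    return index.get(exact_hash(query), [])


def fuzzy_lookup(functions: dict, query: bytes, similarity, threshold: float) -> list:
    """Has to call the similarity function once per stored function."""
    return [name for name, code in functions.items()
            if similarity(query, code) >= threshold]


def toy_similarity(a: bytes, b: bytes) -> float:
    # Assumption: any function returning a score in [0, 1] could be used here.
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical database of function bytes.
functions = {
    "zero": b"\x31\xc0\xc3",
    "zero_fp": b"\x55\x48\x89\xe5\x31\xc0\x5d\xc3",
    "nop_ret": b"\x90\xc3",
}
index = build_index(functions)
query = b"\x31\xc0\xc3"

print(exact_lookup(index, query))
print(fuzzy_lookup(functions, query, toy_similarity, threshold=0.5))
```

The exact lookup cost does not grow with the size of the database once the index is built, whereas the fuzzy lookup does work proportional to the number of stored functions on every query.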

There’s a Limit to Syntactic Similarity

Remember that we included LEV to represent the optimal similarity based on the edit distance of instruction bytes. LEV did not significantly outperform PIC, which is quite telling, and suggests that there is a fundamental limit to how well syntactic similarity based on instruction bytes can perform. Surprisingly, PIC hashing appears to be close to that limit. We saw a striking example of this limit when the frame pointer was accidentally omitted and, more generally, all syntactic techniques struggle when the differences become too great.

It’s unclear whether any variants, like computing similarities over assembly code instead of executable code bytes, would perform any better.

Where Do We Go From Here?

There are of course other strategies for comparing similarity, such as incorporating semantic information. Many researchers have studied this. The general downside to semantic techniques is that they are substantially more expensive than syntactic techniques. But, if you’re willing to pay the higher computational price, you can get better results.

Recently, a major new feature called BSim was added to Ghidra. BSim can find structurally similar functions in potentially large collections of binaries or object files. BSim is based on Ghidra’s decompiler and can find matches across different compilers, architectures, and small changes to source code.

Another interesting question is whether we can use neural learning to help compute similarity. For example, we might be able to train a model to understand that omitting the frame pointer does not change the meaning of a function, and so shouldn’t be counted as a difference.


