On binary diffing

Introduction

Finding a tiny change in a large amount of data is a painful task when one has to do it manually.

Finding a tiny change in a Iarge amount of data is a painful task when one has to do it manually.

Well, it looks like it can still be painful with a rather small amount of data…

Comparison is hard to do.

Diffing is the automation of this comparison task. It has various uses such as building version control systems and providing source code patches.
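As a minimal sketch of what that automation looks like in its most naive form, the snippet below compares two byte strings offset by offset. The function name `diff_offsets` and the shortened sample sentences are illustrative assumptions; real diffing tools also cope with insertions and deletions, which this sketch does not.

```python
def diff_offsets(a: bytes, b: bytes) -> list[int]:
    """Return the offsets at which two equal-length byte strings differ."""
    if len(a) != len(b):
        raise ValueError("this naive sketch only handles equal-length inputs")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Two inputs that look identical to a hurried human reader.
first = b"Finding a tiny change in a large amount of data"
second = b"Finding a tiny change in a Iarge amount of data"

offsets = diff_offsets(first, second)
for off in offsets:
    # Show the differing byte values in hexadecimal.
    print(f"offset {off}: {first[off]:#04x} != {second[off]:#04x}")
```

A lowercase "l" and an uppercase "I" are hard to tell apart visually, but they are different byte values, which is exactly what a machine comparison picks up.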

Diffing techniques

There exist various diffing techniques, each one addressing a specific comparison problem, each one with its own strengths and drawbacks.

Using the same method to compare text and image internals doesn't make sense: you wouldn't diff PNGs using Git, nor diff text using Resemble.js or any other CSS visual regression testing tool.

Before comparing anything, you have to deeply understand the structure of your sources to be able to compare them properly.

When it comes to diffing executable binary files, the accuracy of your method depends on the granularity of the properties you can extract from your binaries. A binary can be expressed, with coarse granularity, as a simple checksum, which only tells you whether two binaries differ. To understand how they differ, you have to express them with finer granularity. To do so, you will extract multiple properties depending on your targeted accuracy and goals:

  • file format (PE, ELF, Mach-O)
  • targeted platform
  • instruction set (x86, x86-64)
  • sections
  • functions
  • calling conventions
  • basic blocks
  • instructions
  • metadata (from timestamp to compiler features)

The overall results and accuracy you can expect from your diffing method depend on the amount of knowledge you have about your sources.
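The effect of granularity can be sketched as follows. A whole-file checksum only says that two inputs differ, while hashing smaller units says where. Fixed-size chunks stand in for real units such as sections or functions here, which is a simplifying assumption; the helper names and the 16-byte chunk size are hypothetical.

```python
import hashlib


def file_digest(data: bytes) -> str:
    """Coarse granularity: one checksum for the whole input."""
    return hashlib.sha256(data).hexdigest()


def chunk_digests(data: bytes, size: int = 16) -> list[str]:
    """Finer granularity: one checksum per fixed-size chunk."""
    return [
        hashlib.sha256(data[i:i + size]).hexdigest()
        for i in range(0, len(data), size)
    ]


def changed_chunks(old: bytes, new: bytes, size: int = 16) -> list[int]:
    """Return the indexes of the chunks whose checksum differs."""
    a, b = chunk_digests(old, size), chunk_digests(new, size)
    changed = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    # Chunks present in only one of the two inputs count as changed too.
    changed.extend(range(min(len(a), len(b)), max(len(a), len(b))))
    return changed


old = bytes(range(64))
patched = bytearray(old)
patched[40] ^= 0xFF  # flip a single byte
new = bytes(patched)

print(file_digest(old) == file_digest(new))  # the files differ, but where?
print(changed_chunks(old, new))              # index of the modified chunk
```

Fixed-size chunks break down as soon as bytes are inserted or removed, since everything after the change shifts. That is one reason real tools rely on structural units extracted from the file format instead.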

When those properties are acquired, you can start reasoning about them and construct models you’ll be able to compare. For example:

  • compare function symbols (as in BAM)
  • compare function graphs (as in Diaphora - KoKa tarjan)
  • compare immediate values / code vectors (as in YaDiff)
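To give an idea of what comparing function graphs can mean, here is a deliberately simplified sketch. Each function is modeled as a control flow graph (basic block label mapped to successor labels), reduced to a small structural signature, and functions are paired across two binaries when their signature is unique on both sides. The data model, the signature and all names are assumptions made for illustration; this is not how any of the tools above is implemented.

```python
from collections import defaultdict

# Hypothetical model: basic block label -> labels of its successors.
Cfg = dict[str, list[str]]


def cfg_signature(cfg: Cfg) -> tuple:
    """Summarize a graph by block count, edge count and out-degree sequence."""
    nodes = len(cfg)
    edges = sum(len(successors) for successors in cfg.values())
    out_degrees = tuple(sorted(len(successors) for successors in cfg.values()))
    return (nodes, edges, out_degrees)


def index_by_signature(functions: dict[str, Cfg]) -> dict[tuple, list[str]]:
    """Group function names by the signature of their graph."""
    index = defaultdict(list)
    for name, cfg in functions.items():
        index[cfg_signature(cfg)].append(name)
    return index


def match_by_signature(old: dict[str, Cfg], new: dict[str, Cfg]) -> dict[str, str]:
    """Pair functions whose signature is unique in both binaries."""
    old_index, new_index = index_by_signature(old), index_by_signature(new)
    return {
        old_names[0]: new_index[sig][0]
        for sig, old_names in old_index.items()
        if len(old_names) == 1 and len(new_index.get(sig, [])) == 1
    }


with_symbols = {
    "parse_header": {"entry": ["check", "fail"], "check": ["ok", "fail"],
                     "ok": [], "fail": []},
    "checksum": {"entry": ["loop"], "loop": ["loop", "exit"], "exit": []},
}
stripped = {
    "sub_401000": {"a": ["b", "d"], "b": ["c", "d"], "c": [], "d": []},
    "sub_401080": {"a": ["b"], "b": ["b", "c"], "c": []},
}

matches = match_by_signature(with_symbols, stripped)
print(matches)
```

Block labels and function names play no role in the match; only the shape of the graph does. Such a coarse signature collides quickly on real binaries, so it is usually only a first filter before finer comparisons.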

In the context of reverse engineering, diffing, and especially binary diffing, can be very handy and is often a huge time saver. You can use diffing techniques in various situations, such as patch analysis or symbol propagation.

We’ll explore some of them.

Patch analysis

To perform patch analysis, you apply diffing techniques to a binary and its patched version. If you're dealing with a security patch, chances are the diff will reveal the vulnerability fixed by that patch, letting you locate and (hopefully) understand it. This would help you write your n-day exploit.
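As a rough sketch of that workflow, assume the disassembly of both versions is already available as a mapping from function name to a list of instruction strings. That representation, the sample functions and the helper names are all hypothetical. Listing the functions whose body changed, then diffing only those, narrows the search considerably:

```python
import difflib

# Hypothetical model: function name -> list of disassembled instructions.
Listing = dict[str, list[str]]


def changed_functions(old: Listing, new: Listing) -> list[str]:
    """Names of the functions present in both versions with a different body."""
    return sorted(name for name in old.keys() & new.keys() if old[name] != new[name])


def function_diff(name: str, old: Listing, new: Listing) -> list[str]:
    """Unified diff of one function between the original and patched binary."""
    return list(difflib.unified_diff(
        old[name], new[name],
        fromfile=f"{name} (original)", tofile=f"{name} (patched)",
        lineterm="",
    ))


original = {
    "copy_name": ["push ebp", "mov ebp, esp", "mov ecx, [ebp+0xc]",
                  "call memcpy", "pop ebp", "ret"],
    "get_version": ["mov eax, 0x2", "ret"],
}
patched = {
    "copy_name": ["push ebp", "mov ebp, esp", "mov ecx, [ebp+0xc]",
                  "cmp ecx, 0x40", "ja reject",
                  "call memcpy", "pop ebp", "ret"],
    "get_version": ["mov eax, 0x2", "ret"],
}

for name in changed_functions(original, patched):
    print("\n".join(function_diff(name, original, patched)))
```

In this made-up example the patch adds a length check before a copy, which points straight at the kind of bug that was fixed. On real binaries, addresses and register allocation shift between builds, so a plain textual diff is noisy and the comparison is normally done on normalized or structural properties.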

Patch analysis is a key process in patch diffing. It requires a good understanding of patch format(s). How a patch gets applied is also very important, as a patch can come in various shapes: it can be a full reinstallation of all the components of a piece of software, or a very specific update of a single component.

More than the patch itself, the analyst should focus on the patch application process. This could mean taking an interest in the patch installer itself. The Windows Installer is a good candidate for exploring all these aspects.

Symbol propagation

During reverse engineering, you might want to document your binary using dedicated software like IDA Pro or Ghidra: applying labels to functions and basic blocks, renaming variables, and more.

Diffing can play a role when you start working on a new version of the binary you're studying. Much of the code might not have changed: maybe a few basic blocks differ, maybe there is some rebasing. Binary diffing can help you propagate your documentation from one version to another.

To propagate the documentation of a binary to a new version of that binary, you have to recognize every object you want to track. Such an object can be:

  • a section
  • a function
  • a basic block
  • an instruction
  • some data
  • an entry in an import table

Once that is done, you should be able to remap your database of objects onto an older or newer version of the binary you're studying.
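A minimal sketch of such a remapping, restricted to functions: each function is fingerprinted by hashing its mnemonics only, so that rebased addresses and changed operands do not affect the result, and a documented name is carried over when exactly one documented function shares the fingerprint. The instruction listing format and every name below are assumptions for illustration, not the export format of IDA Pro or Ghidra.

```python
import hashlib

# Hypothetical model: function name or address -> list of instructions.
Listing = dict[str, list[str]]


def fingerprint(instructions: list[str]) -> str:
    """Hash the mnemonics only, ignoring operands and addresses."""
    mnemonics = " ".join(ins.split()[0] for ins in instructions if ins.strip())
    return hashlib.sha256(mnemonics.encode()).hexdigest()


def propagate(documented: Listing, unknown: Listing) -> dict[str, str]:
    """Map functions of the new binary to names from the documented one."""
    by_fingerprint: dict[str, list[str]] = {}
    for name, body in documented.items():
        by_fingerprint.setdefault(fingerprint(body), []).append(name)

    renames = {}
    for address, body in unknown.items():
        candidates = by_fingerprint.get(fingerprint(body), [])
        if len(candidates) == 1:  # skip ambiguous or missing matches
            renames[address] = candidates[0]
    return renames


version_1 = {
    "decrypt_config": ["push ebp", "mov ebp, esp", "call 0x401200",
                       "xor eax, eax", "ret"],
    "log_error": ["push ebp", "mov ebp, esp", "push 0x402000",
                  "call 0x401300", "leave", "ret"],
}
version_2 = {
    "sub_501000": ["push ebp", "mov ebp, esp", "call 0x501450",
                   "xor eax, eax", "ret"],
    "sub_501100": ["push ebp", "mov ebp, esp", "nop", "ret"],
}

renames = propagate(version_1, version_2)
print(renames)
```

Only exact fingerprint matches are propagated here. Functions that were actually modified need fuzzier matching, and short functions collide easily, which is why ambiguous candidates are skipped rather than guessed.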

Your database should be agnostic of the platform, enabling you to propagate symbols across various forms of binaries: let’s say you own a leaked collection of symbols for the desktop version of the app but you want to work on the mobile version of that same app.

Resisting diffing techniques

Oh, you don't want your code to be easily understood and tracked by binary diffing tools? Then you should attack what makes binary diffing possible: the code structure and properties such as function calls.

Control flow graph (CFG) flattening is one example of a technique that disrupts common diffing methods: it rewrites a function so that its basic blocks are driven by a central dispatcher, hiding the original graph structure. Code layout causes similar trouble: the code belonging to a particular function may not be arranged contiguously in the binary. Bits and pieces of the function might be scattered throughout the code section, and chunks of code may even be shared between functions; these are known as overlapping code blocks. This happens a lot in optimized code such as firmware and embedded systems codebases.
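To illustrate the idea behind flattening, here is a toy example in Python rather than in machine code. Both functions compute the same result, but the second one routes every step through a dispatcher loop and a state variable, so its control flow graph no longer resembles the original loop. The function names and the state numbering are made up for this sketch; real obfuscators work on compiled code or intermediate representations and are far more aggressive.

```python
def gcd(a: int, b: int) -> int:
    """Straightforward version: the loop is visible in the control flow."""
    while b:
        a, b = b, a % b
    return a


def gcd_flattened(a: int, b: int) -> int:
    """Same logic, with every block reached through a central dispatcher."""
    state = 0
    while True:
        if state == 0:      # former loop condition
            state = 1 if b else 2
        elif state == 1:    # former loop body
            a, b = b, a % b
            state = 0
        elif state == 2:    # former exit block
            return a


print(gcd(48, 18), gcd_flattened(48, 18))
```

A diffing method relying on graph shape sees a loop with one body in the first function, and a dispatcher fanning out to three blocks in the second, even though the behavior is identical.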

Attribution

Comparisons can help group binaries into families and, in the end, can help with attribution.

Binary diffing can play a major role in binary attribution. It can help detect code redundancy and variants using some correlation factors. Pattern matching can help to find some very unique attributes (e.g. l33t signatures).
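One possible correlation factor, sketched below under simplifying assumptions, is the Jaccard similarity between the sets of byte n-grams of two samples: variants that share code or embedded strings score higher than unrelated files. The sample bytes, the n-gram length and the function names are invented for the example and carry no attribution value by themselves.

```python
def ngrams(data: bytes, n: int = 4) -> set[bytes]:
    """Set of all byte sequences of length n found in the input."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def jaccard(a: bytes, b: bytes, n: int = 4) -> float:
    """Shared n-grams divided by all n-grams seen in either sample."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two inputs too short to have n-grams are treated as equal
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Invented samples: two variants sharing a marker string, one unrelated blob.
sample_a = b"\x55\x8b\xec" + b"l33t-crew-was-here" + b"\xc3" * 8
sample_b = b"\x55\x8b\xec\x83\xec\x10" + b"l33t-crew-was-here" + b"\x90\x90\xc3"
sample_c = bytes(range(64))

print(f"a vs b: {jaccard(sample_a, sample_b):.2f}")
print(f"a vs c: {jaccard(sample_a, sample_c):.2f}")
```

Grouping samples whose score exceeds a chosen threshold gives a first, crude notion of family. Raw byte n-grams are sensitive to recompilation and packing, so in practice the same idea is usually applied to normalized code features.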

See also

hex differ, bindiff, diaphora, bmat, eEye, DarunGrim, yadiff, patch tuesday