Glossary & Concepts

I talk about a lot of things (I talk a lot in general). This page describes terms and problems I discuss.

Forward

Some people have slightly different interpretations of terms. These are the terms as I understand them. If you disagree with my interpretation, or if I am just flat out wrong, which happens, please send me corrections.

Table of Contents

Abstract Interpretation

Borrowing from Wikipedia:

In computer science, abstract interpretation is a theory of sound approximation of the semantics of computer programs, based on monotonic functions over ordered sets, especially lattices.

Let’s begin with the lattice. For the purposes of program analysis, our lattices will be collections of some property with two special values. We call these values top and bottom. We will use to represent top and to represent bottom. We apply a special meaning to these values. ⊤ means, “I have too much information, and can no longer hold more information.” ⊥ means, “I don’t have enough information to make a decision.”

It is very important that we have a ⊤ and ⊥. ⊥ and ⊤ bound our lattices, and therefore our analysis. Without a ⊤, we cannot guarantee our analysis will reach a fixed-point, and we cannot guarantee our analysis will terminate.

The property we are analyzing is determined by the domain of our analysis. Some example domains may be a value range (low, high), or a set of values {0, 2, 4, 8}. Perhaps our domain will consider whether a pointer points towards allocated or freed memory.

How we interpret operators, such as +,-,*,/, etc., will also be determined by our domain, as these operators must now operate over our lattices. It is important that these operators always increase the amount of information each lattice holds, until either that lattice reaches ⊤ or the analysis terminates. A monotonic function is one that either never increases or never decreases, and our operations over lattices must be monotonic to ensure our analysis will terminate.

We also have two functions for these lattices, “Join,” and, “Meet.” Join is like a set union, and meet is like a set intersection, except they take into account our special values Top and Bottom.

Let’s consider the domain of values a variable may hold at a specific point in the program. Once a value holds three numbers, we will go to, ⊤.

Consider the following program:

0: start:
1: var i = 0;
2: while i < (1 << 64) loop
3:     i = i + 1
4: end

We can see this program will run for quite some time. However, we need our analysis of this program to terminate in a reasonable amount of time.

We will now perform a forward data-flow analysis with our abstract domain. Let’s look at what values our variables hold at each step in the program.

step: 1,   instruction: 0,   i = ⊥
step: 2,   instruction: 1,   i = 0
step: 3,   instruction: 2,   i = 0
step: 4,   instruction: 3,   i = {0, 1}
step: 5,   instruction: 2,   i = {0, 1}
step: 6,   instruction: 3,   i = ⊤
step: 7,   instruction: 2,   i = ⊤
step: 8,   instruction: 3,   i = ⊤
step: 9,   instruction: 4,   i = ⊤

At the conclusion of this analysis, we’re left with i = ⊤, a sound approximation of the possible values i can hold at instruction 4.

Angr

BINARY ANALYSIS FRAMEWORK | RELEASED | PYTHON

Angr is the binary analysis framework, and symbolic execution engine, developed by UCSB and Shellphish to compete in the Cyber Grand Challenge.

There are lots of people who will tell you Angr is poorly documented. These people are wrong.

Angr is a working symbolic execution engine for x86 binaries, and can do things out of the box.

Binary Analysis Platform (BAP)

BINARY ANALYSIS FRAMEWORK | RELEASED | OCAML

Binary Analysis Platform (BAP), is a framework for analyzing binary executables. BAP is written in OCAML, and is used by ForAllSecure and the research coming out of CMU. BAP is a framework, and doesn’t really do much out of the box.

Binary Ninja (Binja)

DISASSEMBLER | RELEASED | PYTHON/C++

Binary Ninja is a new disassembler that has generated a lot of attention due to an affordable price point, and a focus towards automating binary analysis.

Binary Ninja works by lifting binaries to its various intermediate languages, making it easy to write high-level analyses.

Conservative Analysis

A sound analysis is one that always triggers when the property it is looking for is present (but may also trigger when the property is not present), I.E. there are no false-negatives.

A precise analysis is one that only triggers when the property it is looking for is present (but may also not trigger when the property is present), I.E. there are no false-positives.

A conservative analysis guarantees soundness, but not precision. A conservative analysis may have false postives, but will not have false negatives.

Sound. Precise. Terminates. Pick two.

Dominator (Graph)

In a directed graph, a vertex A dominates vertex B if all paths to B must go through A. All vertices dominate themselves.

In a directed graph, a vertex A is the immediate dominator of vertex B if A is a dominator of B, A is not B, and there are not other dominators between A and B.

A vertex A post-dominates B if all paths through B must go through A. All vertices post-dominate themselves.

Control-Flow Graphs are directed graphs, and sometimes we are interested in certain properties of a program which can be derived purely through control flow. In other words, sometimes the structure of a program can answer a question without having to understand the semantics of the program.

For example, assume we identify a cycle in a graph. We know we are inside a loop. However, we may wish to identify the, “Head,” of the loop. To do so, we can find the vertex which comes before all other vertices. This vertex will, “Dominate,” all other vertices in the cycle.

Another example may be considering a trace through a program. If a block’s dominator is not visited by a program trace, then we inherintly know said block was not visited.

The classical use of post-dominators is in a popular algorithm for determining the locations of Phi instructions for Single Static Assignment.

Dynamic Analysis

Dynamic analysis is an analysis conducted by running the program to see what it does. Examples of dynamic analysis include:

Dynamic Binary Instrumentation (DBI)

While Dynamic Binary Instrumentation (DBI) may have multiple meanings, when we refer to Dynamic Binary Instrumentation we refer to the process of inserting instrumentation routines into a program at runtime as we natively execute the program. This is done with DBI framework/tool such as pin tools, DynamoRIO, or panda.

DBI frameworks typically work similar to a Just-In-Time compilation engine. As the DBI framework runs the program, each time it encounters instructions it has not yet executed, it give the user an opportunity to insert custom instrumentation routines at various locations. The user can then do things such as trace data through a program, or count each time an instruction is executed.

Hopper

DISASSEMBLER | RELEASED | OBJECTIVE-C

Hopper is a ridiculously affordable, and ridiculously usable disassembler for OSX. If you run OSX, and you want to reverse, but you don’t want to drop >$1k on Ida Pro, drop $99 on Hopper.

Instruction Decoding

Instruction decoding refers to the process of taking a sequence of bytes which represent a valid instruction for an architecture, and creating a usable representation of this instruction. Instruction decoding does not necessarily provide a textual representation of the instruction, but does provide things like the mnemonic of the instruction and its operands.

Intermediate Representation / Intermediate Language

An Intermediate Representation (IR) comes originally from the world of compilers. Compilers often take human-readable/writeable source code, and parse this code into an intermediate representation. Compilers then perform a series of optimizations and transformations over this IR, before eventually transforming the IR into instructions for the target machine. Compilers and framworks may have multiple IRs for multiple stages of their analysis.

An Intermediate Language is the language used to encode the semantics of the program in the IR. These terms are used and understood interchangeably.

Popular IRs for binary analysis, along with some frameworks which use them, include:

The use of an IR can become a religious war, in spirit of, but without the intensity of, the war between vim and emacs.

Lifting

Lifting refers to lifting machine code for a specific architecture an intermediate representation. Every IR used for the analysis of binaries has a mechanism to lift machine code to that IR.

LLVM & LLVM IR

PROJECT | RELEASED | C++

LLVM is a compiler framework, used to implement languages which compile down to machine code. Clang is the C/C++ compiler developed by the LLVM project, and OSX’s default C compiler.

The Intermediate Representation (IR) used by LLVM has gained popularity in the analysis of binary programs, and can be found in tools such as mcsema, panda, and s2e among others.

The main argument for the use of LLVM IR is its extensive ecosystem of tooling, as the IR is used by the compilers/PL community.

Manticore

SYMBOLIC EXECUTION ENGINE | RELEASED | PYTHON

Manticore is a new symbolic execution released by Trail of Bits in 2017. This engine is still very young, and I haven’t played around with it, but it’s generating some buzz and is under active development so it probably warrants mention.

MAYHEM

PROJECT | UNRELEASED

MAYHEM is a system implemented with BAP for finding exploitable bugs in binaries. MAYHEM conducts static symbolic execution (SSE) over a binary program. Once state explosion prevents further exploration, a taint trace is run over the binary to an interesting point found via SSE, the state at that point is snapshotted, and a new SSE is conducting from that point forward.

This technique allows MAYHEM to symbolically explore further into a program than a pure-static solution would allow, while offsetting the issue of Dynamic Symbolic Execution only being capable of exploring one path at a time.

Online / Offline Executor

An online executor is one that runs with the program, a dynamic executor if you will. Triton is an example of an online executor. An online symbolic execution engine is a dynamic symbolic execution engine.

An offline executor is one that does not run with the program in its native execution environment. Angr is an example of an offline executor. An offline sytmbolic execution is a static symbolic execution engine.

Pin Tools

DYNAMIC BINARY INSTRUMENTATION FRAMEWORK | RELEASED | C++

Pin Tools, or often just referred to as, “Pin,” is a dynamic binary instrumentation framework released by Intel. Pin is easy to use, and reliable. Pin uses XED to decode x86 instructions.

PwnTools

CTF FRAMEWORK | RELEASED | PYTHON

PwnTools is a framework for solving CTF challenges, mostly exploitation challenges, in python. If we are CTFing together, and collaborating on a challenge, and you are not using PwnTools, then we are not collaborating on a challenge.

REIL

Reverse Engineering Intermediate Language. This is the IL that was used by BinNavi. REIL does not support floating point.

The main argument for the use of REIL is its simplicity, which simplifies the implmentation of different analyses.

REIL: A platform-independent intermediate representation of disassembled code for static code analysis

RREIL

A revised version of RREIL, created by Alexander Sepp, Bogdan Mihaila, and Axel Simon for use in bindead. RREIL does not support floating point.

The main argument for the use of RREIL is similar to REIL. You will notice a strong leaning of the German RE community towards REIL/RREIL.

Precise Static Analysis of Binaries by Extracting Relational Information

SAGE

OFFLINE CONCOLIC EXECUTOR | UNRELEASED

SAGE was/is an offline concolic executor developed by Microsoft to improved coverage in fuzzing. SAGE was a pretty big deal when the paper was released, and was one of the first practical implementations of constraint-solving against non-academic, real-world applications.

SAGE is discussed in the paper, “Automated Whitebox Fuzz Testing.”

Shellphish

Shellphish is the CTF team from the University of California, Santa Barbra. They created angr to compete in the Cyber Grand Challenge.

Symbolic Execution

Symbolic Execution is a type of constraint-based analysis. Symbolic Execution can be performed statically and dynamically.

Symbolic Execution allows us to determine how inputs influence a computer program by. Instead of inputs holding concrete values, such as 0x12345678, inputs are treated as variables (or symbols).

As we step through a program, we will encounter conditional branches which cause control-flow to branch depending on the value of an input. When this happens, we, “Fork,” execution, creating two identical states, and then asserting the conditional branch true for once branch, and false for the other. Other types of input-influenced branches, such as indirect branches (think switch/case statements, or function pointers), can be handled similarily by forking state and enforcing appropriate constraints over each path taken. This is the constraint-generation phase of our constraint-based analysis.

We continue this process, forking state at conditional branches, until one of the following occurs:

  1. We exhaust resources available for our analysis.
  2. We reach some desired condition and terminate the symbolic execution.
  3. We exhaust all possible paths through the program.

At any point in the program, we can attempt to solve this system of constraints. Solving this system of constraints is equivalent to solving for an input which will cause execution of the program to follow a given path. If there exists a solution to this system of constraints, I.E. the system is satisfiable, then we have reached a valid state in the program and an input is returned. If the constraints or unsatisfiable, we are in an unreachable state in the program. This process is the constraint-solving phase of our constraint-based analysis.

The primary advantage of Symbolic Execution is it allows us to, “Collapse,” multiple concrete inputs into a single symbolic variable. Another way to think of this is we can traverse multiple unique paths, or multiple unique inputs, simultaneously.

Take for example the following function:

1: int test (int c) {
2:     if (c < 100)
3:         return 0;
4:     else if c > 100000)
5:        return 0;
6:     return c;
7: }

Assuming we wanted to test this function programmatically without any human-assisted understanding of its semantics, we would need to send 2^32 inputs to guarantee we exhaust all paths. With symbolic execution, we can send an unconstrained symbolic variable to this program, and observe the outputs.

If we symbolically executed this program with an unconstrained symbolic variable for c, we would receive 3 states in return. Those states may look similar to:

(true (< c 100)) (= result 0)
(true (and (false (< c 100)) (> c 100000))) (= result 0)
(false (and (false (< c 100)) (> c 100000))) (= result c)

With these constraints, we can solve for valid values of c along any path, or continue executing other portions of the program.

While symbolic execution is based explained in trivial examples, it struggles, and often fails, to handle large programs. The, “Path explosion,” problem is the primary problem for which symbolic execution chokes.

Taint Analysis

A traditional taint analysis is a type of data-flow analysis which attempts to determine whether a tainted value can reach a particular point in a program. Tainted normally refers to values which come from a source of user input, and optionally hold some property.

We consider three different functions with taint analysis, SOURCE(), SINK(), and SANITIZE(). SOURCE() provides tainted values with the property we care about, SANITIZE() removes taint, perhaps by removing the property we care about, and SINK() is the target for our analysis. A taint analysis will tell us whether a tainted value can reach a sink.

We also have taint propagation. When a tainted value is assigned to a variable, we propagate taint from the source value to the destination variable.

Take for example the following program:

1: $username = $_GET['username'];
2: if ($some_condition)
3:     $username = mysql_real_escape_string($username);
4: $query = mysql_query("SELECT * FROM users WHERE username='" . $username . "'");

We observe two paths through this program, 1 → 2 → 3 → 4 and 1 → 2 → 4.

Assume the following:

A taint analysis would trigger here along the path 1 → 2 → 4, providing us with insight that perhaps a SQL injection vulnerability exists in this code.

Taint Trace

A taint trace is a program trace where only actions/instructions which change the set of tainted variables in a program are logged.

By obtaining a taint trace from a program execution over a concrete input, we can consider only the instructions which are relevant to the inputs fed to the program. A taint tracer is a key component to scalable offline concolic executors, and has been used to scale other types of constraint-based analysis. For example, in MAYHEM, ForAllSecure pairs a taint tracer to help scale static symbolic execution.

VEX IR

BINARY INTERMEDIATE REPRESENTATION | RELEASED | C

VEX is the Intermediate Representation used by Valgrind. VEX IR is very verbose, and is geared primarily towards fast dynamic binary instrumentation.

The definitive documentation of VEX IR is, surprisingly enough, the file libvex_ir.h

XED

X86 INSTRUCTION DECODER | RELEASED | C

XED is the x86 decoder developed by Intel originally for Pin Tools, and then released as a separate project. Xed is relatively new, but is quickly becoming the defacto x86 decoder.