June 17, 2025

Type Inference for Decompiled Code: From Hidden Semantics to Structured Insights

by Lukas Seidel, Sam L. Thomas

When a program gets compiled to produce a binary, a lot of useful information is lost in the process, such as variable names, variable types, and comments. Types provide a lot of semantic context that can be of immense benefit when trying to understand what a program does. Unfortunately, because this information is lost, decompilers often struggle to recover meaningful types without user intervention. The basic type recovery that decompilers perform is very limited and commonly only distinguishes between primitive types, such as integers and void pointers. While some decompilers can take advantage of types associated with external library functions and propagate them, in the majority of cases more expressive and context-specific types, such as user-defined structs or library types, remain unidentified by basic inference mechanisms. This limits the readability and accuracy of the decompiled output, as illustrated by the examples in img1 and img2. A large body of (mostly academic) work has explored this problem in two directions. The first direction focuses on sophisticated type inference approaches that can recover a type’s layout with higher precision, e.g., by reconstructing a struct type with its fields. The second direction focuses on recovering common type names by matching against a database of known types. In practice, to add as much context as possible, an ideal type inference method should achieve both: recovery of meaningful type and field names as well as the full structure of a variable’s type.

In the Binarly Transparency Platform (BTP), we use decompiled code to visualize our code-based findings so that our users have enough context to effectively triage them. Therefore, the closer our decompiler output is to the original source, the easier the user’s life is when performing triage. Clearly, having meaningful symbol and type information within our decompiled listings is crucial in this case. In this blog, the Binarly REsearch team details how we have extended our analysis framework to recover missing type information. We first provide an overview of our analysis framework and where we currently leverage decompilation. We then explore past work on type inference, explain how we have built on it to integrate type recovery into our framework, and discuss the challenges we have faced. To demonstrate the effectiveness of our implementation, we provide a case study of applying our type recovery to UEFI binaries, and finally discuss practical limitations.

[img1] IDA’s decompilation of a function in the UsbBusDxe module of the EDK2 framework. In the absence of proper type information, extensive pointer arithmetic and casting between IDA’s core types—such as DWORD and _BYTE—become necessary, resulting in output that is often difficult to comprehend.

[img2] Applying type information from the corresponding Program Database (PDB) file, which contains explicit debug information, renders a much more helpful picture. By knowing that ParentIf (formerly v5) is of type _USB_INTERFACE, IDA can now correctly resolve struct field accesses. Additionally, it automatically assigns more expressive variable names based on the type.

Binary Analysis in the BTP

To contextualize our need for better type recovery, we first provide an overview of how binaries are analyzed within the Binarly Transparency Platform, and where and how decompiler output is used.

When a user submits a new input to the BTP for analysis, it passes through a pipeline that serves to generate actionable insights about potential risks, such as vulnerabilities and malicious behaviors.

Our analysis pipeline consists of three phases: normalization, analysis, and reporting. The normalization phase takes heterogeneous inputs (e.g., various firmware formats and Docker containers) and produces a uniform representation consisting of their analyzable components and metadata. Typical components include executables, firmware blobs, and configuration files. We refer to this uniform format as the Binarly Analysis Archive (BA2).

Following normalization, a number of analyses are performed on each of the components in the BA2. These analyses are organized in two groups: core analysis passes and specialized analysis passes. The core analysis passes provide a foundation for the more advanced specialized analyses and serve to produce a uniform representation of each binary; they consist of function recognition, control-flow and call-graph recovery, and variable and cross-reference identification. The specialized analysis passes identify vulnerabilities, potentially malicious behaviors, embedded secrets, and so on, and are built using techniques such as symbolic execution, taint analysis, partial emulation, and abstract interpretation.
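To make the two-stage structure concrete, here is a minimal Rust sketch of how such a pass pipeline could be organized. All trait, type, and pass names are hypothetical and do not reflect the actual BTP internals.

```rust
// Hypothetical sketch of a two-stage pass pipeline; names are illustrative only.
#[derive(Default)]
struct BinaryModel {
    function_starts: Vec<u64>, // produced by core passes
    findings: Vec<String>,     // produced by specialized passes
}

trait AnalysisPass {
    fn run(&self, model: &mut BinaryModel);
}

struct FunctionRecognition;
impl AnalysisPass for FunctionRecognition {
    fn run(&self, model: &mut BinaryModel) {
        // A real pass would scan for prologues, call targets, symbol hints, ...
        model.function_starts.push(0x1000);
    }
}

struct NvramWriteDetector;
impl AnalysisPass for NvramWriteDetector {
    fn run(&self, model: &mut BinaryModel) {
        // Specialized passes build on the uniform representation produced by
        // the core passes, e.g. by iterating over recovered functions.
        if !model.function_starts.is_empty() {
            model.findings.push("arbitrary write via NVRAM variable".into());
        }
    }
}

fn analyze(core: &[Box<dyn AnalysisPass>], specialized: &[Box<dyn AnalysisPass>]) -> BinaryModel {
    let mut model = BinaryModel::default();
    // Core passes run first, then the specialized passes that depend on them.
    for pass in core.iter().chain(specialized.iter()) {
        pass.run(&mut model);
    }
    model
}
```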

Once all analysis passes have completed, the reporting phase produces a structured representation of each of the potential risks identified. It augments each finding with severity information, various prioritization metrics, and, for code-related issues such as unknown vulnerabilities, reachability properties and annotated decompiled code listings describing the potential vulnerability.

While our main framework is developed in-house in Rust, to avoid reinventing the wheel we use a fork of Ghidra's C++-based decompiler to produce code listings. Our fork is exposed via a Rust-based API (via cxx) and has been extended to provide tight integration with our analysis framework. These extensions allow us to communicate type information, locations of global variables and function starts, and symbol information, as well as to inject additional information into the decompiler state via decompiler actions while decompilation is being performed. Most importantly, they also allow us to construct a mapping of addresses to listing ranges, which we use as part of our reporting logic to provide detailed annotations that highlight the parts of each listing pertinent to understanding our findings. We integrate our type prediction mechanism as part of this annotation step in order to produce decompiler output that is closer to the original source and thereby aid downstream tasks such as triage and remediation.
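As an illustration of the address-to-listing mapping mentioned above, the following sketch shows one way such a structure could look. The types, field names, and addresses are hypothetical, not our actual API.

```rust
use std::collections::BTreeMap;

/// Hypothetical sketch: map binary addresses to character ranges in the
/// decompiled listing so findings can be highlighted.
#[derive(Default)]
struct ListingMap {
    // Sorted by address so a lookup finds the covering or nearest entry.
    ranges: BTreeMap<u64, (usize, usize)>, // address -> (start, end) offsets in the listing text
}

impl ListingMap {
    fn record(&mut self, address: u64, start: usize, end: usize) {
        self.ranges.insert(address, (start, end));
    }

    /// Returns the listing range for the instruction at or before `address`.
    fn annotate(&self, address: u64) -> Option<(usize, usize)> {
        self.ranges.range(..=address).next_back().map(|(_, &r)| r)
    }
}

fn main() {
    let mut map = ListingMap::default();
    map.record(0x401000, 120, 158); // e.g. the line containing the vulnerable call
    if let Some((start, end)) = map.annotate(0x401004) {
        println!("highlight listing bytes {start}..{end}");
    }
}
```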

Inferring Types

Predicting the type of a variable based on its surrounding context and on how it is used within that context is not a new idea. Approaches in this area vary significantly in their methodologies.

Constraint-Based Methods

Early efforts employed dynamic analysis to infer more precise variable types. The Howard system executes a program in QEMU to observe memory accesses at runtime. Based on these memory usage patterns, it is then able to infer a variable’s type. For example, observing a static offset on a dereferenced variable could indicate a struct field access. Additionally, system calls are tracked, as their arguments and their arguments’ types are known and can be used to further propagate these high-confidence types. While substantial effort has been put into automating and optimizing dynamic approaches, executing a program is often infeasible in a fully automated, headless analysis setting, and full coverage, which is necessary if we want to recover all types within a binary, is hard to achieve.

Dynamic approaches were mostly superseded by methods relying on deterministic static analysis, which are more applicable as a default pass in a decompiler setting. These analyze the dataflow between different variable locations in order to generate constraints for a typing system. TIE (2011) was the first to fuse binary code analysis with type reconstruction theory. The technique constructs a set of formulas based on variable usage, which is subsequently solved to assign types. For instance, if an arithmetic operation is followed by a conditional on the signed flag, it can be inferred that both operands are signed. The work also highlights a problem with this kind of constraint-based inference: even for statically typed languages, a variable’s type is erased during compilation, and a variable location can be freely reused for values of different types. Retypd addresses this problem through subtyping and solves further challenges in static type inference. The approach uses lattice-based (or push-down) typing algorithms to infer complex types such as recursive structs, pointers, and polymorphic references. These constraint-based methods are still leveraged by decompilers for type reconstruction during the variable recovery pass.

Although thorough and precise, static inference methods tend to be computationally expensive, limiting their scalability and practical speed. The BinSub author acknowledges that existing implementations of type inference systems based on subtyping and polymorphism are too inefficient to achieve adoption in production. Their work introduces algebraic subtyping to the domain, leading to a drastically faster method for binary type reconstruction. At the same time, even the most precise and deterministic methods are not immune to imprecision in practice: failures when disassembling, for example, can lead to unsound constraint generation. Furthermore, it is worth noting that all of the aforementioned methods infer sound C types only. While the advanced approaches can also construct sound complex types such as structs and unions, having `*(var1 + 0x16)` resolved as `var1->field2` is only marginally helpful without type and field names, and these can only be derived by matching against a dataset of known types.
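To illustrate the flavor of constraint-based inference, without any of the subtyping and lattice machinery of TIE or Retypd, here is a deliberately simplified toy sketch: usage facts about variables become constraints, which are then resolved over a tiny type lattice. Everything here is an illustrative assumption, not any paper's actual algorithm.

```rust
// Toy illustration of constraint-based inference: usage facts about
// variables are collected and resolved over a small type lattice.
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq)]
enum Ty {
    Unknown,
    SignedInt,
    Pointer,
    Conflict,
}

enum Constraint {
    // e.g. the variable feeds a comparison on the signed flag
    UsedSigned(&'static str),
    // e.g. the variable is dereferenced at a constant offset
    Dereferenced(&'static str),
}

fn meet(a: Ty, b: Ty) -> Ty {
    match (a, b) {
        (Ty::Unknown, x) | (x, Ty::Unknown) => x,
        (x, y) if x == y => x,
        // A storage location was reused for values of different types.
        _ => Ty::Conflict,
    }
}

fn solve(constraints: &[Constraint]) -> HashMap<&'static str, Ty> {
    let mut types: HashMap<&'static str, Ty> = HashMap::new();
    for c in constraints {
        let (var, ty) = match c {
            Constraint::UsedSigned(v) => (*v, Ty::SignedInt),
            Constraint::Dereferenced(v) => (*v, Ty::Pointer),
        };
        let cur = types.get(var).copied().unwrap_or(Ty::Unknown);
        types.insert(var, meet(cur, ty));
    }
    types
}

fn main() {
    // `v1` is compared with a signed predicate, `v2` is dereferenced at +0x16.
    let cs = [Constraint::UsedSigned("v1"), Constraint::Dereferenced("v2")];
    println!("{:?}", solve(&cs));
}
```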

Advent of Probabilistic Type Inference

More recently, probabilistic approaches have emerged to complement traditional static analysis. These methods, too, analyze usage patterns, struct accesses, and local semantic contexts, but in a less rigorous way, often providing a significant speed-up. While early probabilistic techniques fell short of identifying user-defined types (e.g., DEBIN), training a machine learning model on a corpus of binaries with debug and type information offers the prospect of recovering named types, e.g., from common libraries.

Graph Neural Networks

Larger datasets in combination with deep learning approaches such as Graph Neural Networks (GNNs) led to further advancements, predicting library and user-defined types from a vast type corpus for the first time. TYGR constructs a graph-based representation of dataflow from an intermediate representation extracted with the angr binary analysis framework. The GNN-based approach is then trained on a dataset of binary programs to learn the data access and usage patterns of variables and their types. While TYGR is able to infer a struct’s layout and the types of its fields, it does not recover semantic information: as the model is not trained to map a classification against a known corpus of types beyond primitive C types, it cannot predict the name of a struct type or its fields.

LLMs: Modeling the Human Approach

With the rise of the Transformer architecture and Large Language Models (LLMs), the field experienced an outright renaissance. Intuitively, the use case makes sense: when we as humans reverse engineer a binary, we see the decompiled code, we see pointer arithmetic corresponding to struct field accesses, and we see variables being passed to other (maybe already reversed) functions. When we make an assumption about a variable’s type or a struct’s layout, it is based on these observations, conditioned by the knowledge and experience we have gathered while reading and writing code and analyzing binaries. In this sense, LLMs appear to be the perfect candidate to base a probabilistic type inference model on: untyped decompiled code + a knowledge base (dataset) of decompiled code with type information => newly typed decompiled code. Indeed, first implementations, such as DIRTY [https://cmustrudel.github.io/papers/ChenDIRTY2022.pdf], showed impressive predictive capabilities. However, as models grow in parameter size, their practical applicability is hindered by considerable computational overhead. DIRTY requires over one second per processed function when running on an NVIDIA L4 GPU, and more than 8x that when running on a consumer-grade CPU. If we want to analyze dozens of binaries with hundreds or thousands of functions each, these runtimes quickly become unacceptable. For instance, an average UEFI firmware image consists of around 300 to 500 modules, and scanning it with the Binarly Transparency Platform will typically analyze more than 50,000 functions.

Towards a Production-Ready Model

Many available solutions have practical limitations that severely constrain their usefulness in a production environment. At online inference time, we are usually limited to representations we already have access to. Lifting assembly code to specialized intermediate representations or constructing graph representations for each function to use as model input is often too expensive when performing analysis at scale. LLMs require immense resources, and adding the proposed models to production pipelines would introduce an unjustifiable runtime overhead in many cases. Finally, falling back to predicting only primitive types without additional semantic information does not add much useful context, given that the decompiler already provides basic type suggestions.

Consequently, when a new method was published in July 2024 that runs on decompiled code in text form, claims state-of-the-art performance, and is drastically faster than an LLM, it piqued our interest.

N-Grams: A Simpler Approach to Type Recognition

Before Transformer-based models dominated the field of Natural Language Processing, and even before LSTMs (a form of recurrent neural network), n-gram approaches were commonplace. An n-gram model captures local contextual patterns by considering sequences of n adjacent tokens and matching them against a database of previously recorded sequences. This approach is surprisingly effective at modeling language structure and computationally lightweight at the same time. The STRIDE paper introduces an n-gram based approach to type inference: for every variable in a decompiler’s output, we consider n tokens to its left and n tokens to its right as the variable’s context. This is illustrated in [img3]. The intuition is similar to LLMs: by treating decompiled code as language, n-gram methods can efficiently model the immediate syntactic and semantic patterns around variable usage, struct field accesses, and function calls. In contrast to running expensive matrix multiplications over hundreds of millions of parameters, the type inference problem is reduced to a series of hash comparisons.

[img3] STRIDE finds the largest n-grams matching the areas next to instances of the variable we want to predict, v3. It looks up the top 5 variable names to then sum the scores for each occurrence of the variable across the function. Taken from [https://arxiv.org/pdf/2407.02733]. (Note: Besides types, STRIDE also supports predicting names. The underlying process of prediction remains the same.)
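The context extraction step can be sketched roughly as follows. The tokenization, hashing, and database layout here are simplified assumptions rather than STRIDE's actual implementation.

```rust
// Sketch of n-gram context extraction, assuming the function has already
// been tokenized; hashing and lookup details are simplified.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn hash_tokens(tokens: &[&str]) -> u64 {
    let mut h = DefaultHasher::new();
    tokens.hash(&mut h);
    h.finish()
}

/// Left and right n-gram context around the variable occurrence at `idx`.
fn context(tokens: &[&str], idx: usize, n: usize) -> (u64, u64) {
    let left_start = idx.saturating_sub(n);
    let left = &tokens[left_start..idx];
    let right_end = (idx + 1 + n).min(tokens.len());
    let right = &tokens[idx + 1..right_end];
    (hash_tokens(left), hash_tokens(right))
}

fn main() {
    // "v3 = *(param2 + 8);"
    let tokens = ["v3", "=", "*", "(", "param2", "+", "8", ")", ";"];
    let var_idx = 4; // occurrence of `param2`
    // Query databases from the largest n down to the smallest.
    for n in [5usize, 4, 3, 2] {
        let (l, r) = context(&tokens, var_idx, n);
        println!("n = {n}: left {l:#x}, right {r:#x}");
        // ... look up (l, r) in the n-gram database for this n ...
    }
}
```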

Consensus

The final prediction mechanism combines three primary factors to score the matches from the n-gram database. First, STRIDE always queries n-gram databases for multiple sizes of n; in the current implementation these are {60, 30, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2}. Matching against a larger n-gram is rated higher, as it indicates increased confidence that we have seen this particular type in exactly this context. Second, the more occurrences of a variable match n-grams of the same type, the further our confidence increases. Third, the less diverse an n-gram is, the higher its impact; for instance, if an n-gram only ever occurs with a single type, it is a distinctive match. Finally, the algorithm aggregates the predictions for a variable’s occurrences over a function: the local scores are summed and a single prediction is made to ensure that a variable does not change its type throughout a function.
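A minimal sketch of this consensus step is shown below. The concrete weighting is our own simplification of the three factors described above, not STRIDE's exact formula.

```rust
// Simplified consensus scoring over all occurrences of one variable.
use std::collections::HashMap;

struct Match {
    predicted_type: String,
    ngram_size: usize,     // larger n-grams indicate higher confidence
    distinct_types: usize, // how many different types this n-gram was seen with
}

fn consensus(
    matches_per_occurrence: &[Vec<Match>],
    type_frequency: &HashMap<String, u64>,
) -> Option<String> {
    let mut scores: HashMap<&str, f64> = HashMap::new();
    for occurrence in matches_per_occurrence {
        for m in occurrence {
            // Factor 1: longer n-grams weigh more. Factor 3: less diverse
            // n-grams weigh more. Factor 2 (repeated matching occurrences)
            // falls out of summing over all occurrences of the variable.
            let score = m.ngram_size as f64 / m.distinct_types.max(1) as f64;
            *scores.entry(m.predicted_type.as_str()).or_default() += score;
        }
    }
    scores
        .into_iter()
        .max_by(|a, b| {
            a.1.partial_cmp(&b.1)
                .unwrap()
                // Tie-break by how common the type is overall.
                .then_with(|| type_frequency.get(a.0).cmp(&type_frequency.get(b.0)))
        })
        .map(|(ty, _)| ty.to_string())
}
```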

Implementation Challenges

For a thorough evaluation of the method, including integration into our internal tooling, we re-implemented the open-source Python tool in Rust. The resulting implementation fully reproduces the results of the original paper. At 0.11ms per function, we achieve vastly higher prediction speeds than the original Python implementation (8ms) or DIRTY (200ms in the fastest (!) case, i.e., low-accuracy settings on a GPU). Integrating the whole approach top-to-bottom into our analysis pipeline presented us with a couple of practical challenges.

Over-Representation of Primitive Types

The DIRTY authors note that in their dataset of decompiled functions, only 1.8% of all variables have a struct type. We observe similar numbers, with predictions (and ground-truth types) heavily dominated by primitives. The basic integer types int32, uint32, int64 and uint64, for example, make up more than 30% of all types in our testing dataset. This omnipresence of primitive types leads to a problem: as we have vastly more recorded observations and diverse contexts for such primitives, we will commonly find an n-gram match even where it is not the correct prediction. Moreover, when two types score on par, i.e., the n-grams in our database for the current variable match contexts for two different types equally well, we tie-break by type frequency. Stochastically, this is the right choice, but it leads to a general bias towards non-struct types. While testing, we would see cases where the correct struct type was ranked second or third among the predictions, but was overshadowed by an integer type. Yet these struct types are primarily what we are looking for.

A Heuristic for Better Struct Recovery

Commonly, recovering struct types leads to vastly more readable decompiler pseudocode, as struct field accesses get resolved correctly and the types of further variables can be inferred based on the struct fields’ subtypes. As established above, even minor differences between the context we are running inference on and the contexts we have in our n-gram databases can lead to a semantic struct type being overshadowed by the prediction of a primitive type.

To address this problem, we introduce a new heuristic into the STRIDE inference pipeline: we perform additional checks on the variable usage in the current context, identifying pointer arithmetic that accesses static offsets. For any given function context we want to recover types for, we scan the function’s tokens for pointer offset accesses of the form *(param3 + 0x16). In cases where we observe multiple such accesses, the likelihood of the variable having a struct type is high. Consequently, if our top prediction for the variable’s context is still a primitive, this signals a likely false positive. We then scan the top-5 (configurable) predictions for a more fitting struct type. For every struct candidate, we look up the full type representation from our internal TypeDB and match the struct’s fields against the offsets observed being accessed inside our context. During our testing, this procedure increased the probability of inferring the correct struct type instead of a meaningless, and incorrect, primitive type.
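The following sketch captures the gist of this heuristic. The helper types and the text-based offset scan are illustrative; the production version operates on decompiler tokens and our internal TypeDB.

```rust
// Sketch of the struct re-ranking heuristic (illustrative helpers only).
use std::collections::BTreeSet;

struct StructField { offset: u64 /* name, type, size, ... */ }
struct StructType { fields: Vec<StructField> }

/// Offsets accessed via `*(var + 0x..)` patterns for a given variable.
fn observed_offsets(pseudocode: &str, var: &str) -> BTreeSet<u64> {
    let needle = format!("*({var} + 0x");
    let mut offsets = BTreeSet::new();
    let mut rest = pseudocode;
    while let Some(pos) = rest.find(&needle) {
        let hex = &rest[pos + needle.len()..];
        let end = hex.find(')').unwrap_or(0);
        if let Ok(off) = u64::from_str_radix(&hex[..end], 16) {
            offsets.insert(off);
        }
        rest = &rest[pos + needle.len()..];
    }
    offsets
}

/// If the top prediction is a primitive but the variable looks like a struct
/// pointer, promote the best-matching struct candidate from the top-k list.
fn rerank<'a>(
    top_k: &'a [(String, Option<&'a StructType>)], // (type name, struct layout if any)
    offsets: &BTreeSet<u64>,
) -> Option<&'a str> {
    if offsets.len() < 2 {
        return None; // not enough evidence for a struct access pattern
    }
    top_k.iter().find_map(|(name, layout)| {
        let layout = layout.as_ref()?;
        let known: BTreeSet<u64> = layout.fields.iter().map(|f| f.offset).collect();
        // Accept the candidate only if every observed offset is a known field.
        offsets.is_subset(&known).then(|| name.as_str())
    })
}
```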

From Inferred Value to Internal Representation

The majority of academic work in this area tends to overlook one important aspect of type inference: what do we do once we get an answer from our prediction oracle? Depending on the model and how the training dataset is constructed, we now have a more or less expressive string at hand. In the worst case, the model outputs only a type’s name, e.g., timeval. While this might be enough to match against a ground truth and calculate the model’s accuracy, applying this type in a decompiler is infeasible without knowing the type’s inner layout. Although in this case we are dealing with a relatively common type and could look it up in system headers, for more custom types this approach will not work. DIRTY and STRIDE (when trained on DIRTY’s dataset) will produce predictions for structs of the form: "{"T": 6, "n": "timeval", "l": [{"T": 4, "n": "tv_sec", "t": "__time_t", "s": 8}, {"T": 4, "n": "tv_usec", "t": "__suseconds_t", "s": 8}]}". This is better and tells us a lot about the type’s structure. However, we cannot directly apply it as a type to a variable in our decompiler if we do not know about __time_t. For our implementation, we rely on our internal type representations. Inside our analysis framework, we store available type definitions in a Type Database (TypeDB). As we use our internal tooling during dataset creation, we can keep track of every type in our vocabulary, i.e., the space we can later predict from, store it in a TypeDB, and persist it to disk to be loaded again later. The internal representation of a type specifies it completely, including all of its fields’ (for structs) or variants’ (for unions) subtypes, their names, offsets, sizes, and so on, recursively. This allows us to construct a Ghidra-internal Datatype from these representations via our Rust API.
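To give an idea of what such a self-contained representation can look like, here is a simplified Rust sketch; our actual TypeDB schema differs in detail, and the timeval example merely mirrors the prediction shown above.

```rust
// Simplified, self-contained type representation: every nested type is
// fully specified, so no external lookup (e.g. for __time_t) is needed.
enum TypeDef {
    Primitive { name: String, size: u64, signed: bool },
    Pointer { pointee: Box<TypeDef> },
    Struct { name: String, fields: Vec<Field> },
    Union { name: String, variants: Vec<Field> },
    Array { element: Box<TypeDef>, count: u64 },
}

struct Field {
    name: String,
    offset: u64,
    ty: TypeDef,
}

fn timeval() -> TypeDef {
    TypeDef::Struct {
        name: "timeval".into(),
        fields: vec![
            Field {
                name: "tv_sec".into(),
                offset: 0,
                ty: TypeDef::Primitive { name: "__time_t".into(), size: 8, signed: true },
            },
            Field {
                name: "tv_usec".into(),
                offset: 8,
                ty: TypeDef::Primitive { name: "__suseconds_t".into(), size: 8, signed: true },
            },
        ],
    }
}
```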

Finally, we need to apply these types in the decompiler. Changing a variable’s type in Ghidra’s, IDA’s, or Binary Ninja’s UI is easy once the type’s definition is available, either created manually or imported from a header file. However, for Ghidra’s C++-based decompiler, there are multiple levels of abstraction (VarNodes, Symbols, HighVariables) for dealing with variables throughout the decompiler API, and updating the types associated with them is subject to various constraints as well as to the decompilation phase in which those changes are introduced. Ergo, not every mechanism for changing a variable’s type will yield a change in the decompiler’s output when the same function is decompiled again. Ultimately, we found a reliable way to update a variable’s type by modifying type locks manually and using the retypeSymbol interface in the local scope. Importantly, updated types persist between multiple invocations of decompiling a function, because they are stored in the local symbol table. With that, we were able to move forward.

Case Study: EDK2

N-gram based approaches are not expected to generalize particularly well beyond what they were trained on; after all, they match tokens verbatim. Still, the initial publication sounded promising: STRIDE achieved 65% accuracy on the type recovery task even for functions that were not in the training set, 10 percentage points better than the previous state of the art. Given these limitations, we looked for a suitable first domain to apply our STRIDE implementation to, ideally one where we could train on a substantial corpus of decompiled functions capturing code patterns that were likely to occur in other binaries, too. Since we at Binarly deal with UEFI firmware on a regular basis, we chose the EFI Development Kit II (EDK2) for a first case study. It is widely used throughout UEFI implementations, and even vendor-specific firmware commonly leverages EDK2 library APIs and types, making it an excellent dataset for exploring generalization capabilities.

The Dataset

To construct the n-gram database, we used our EDK2 dataset. It includes builds from various release versions, different compilers, for different architectures, and with different optimization targets (DEBUG, RELEASE and NOOPT). For each version, we load, analyze, and decompile the various EDK2 modules, such as MdeModulePkg, NetworkPkg or UsbBusDxe, and their functions twice: once with a subset of EDK-specific types and function signatures we already make use of in our analyses, and once with additional type and debug information loaded and applied from binary-specific PDBs. Subsequently, we use the information from the second decompilation to annotate variables in the original output with better type information. We do this for every function, build our training dataset from it, and compute and store the n-gram contexts. As mentioned before, all encountered types are tracked in a TypeDB so that they are reproducible and assignable in the decompiler at inference time.
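Conceptually, the labeling step looks roughly like the sketch below, in which variables from the PDB-typed decompilation are matched back to the stripped output; all types and helpers are illustrative assumptions, and in practice the matching logic is more involved than a simple per-storage-location lookup.

```rust
// Sketch of annotating the stripped decompilation with PDB-derived types.
use std::collections::HashMap;

struct Variable { storage: String, name: String, type_name: String }
struct Decompiled { tokens: Vec<String>, variables: Vec<Variable> }

struct LabeledVariable { name: String, ground_truth_type: String }
struct TrainingSample { tokens: Vec<String>, labels: Vec<LabeledVariable> }

fn label_function(stripped: Decompiled, with_pdb: &Decompiled) -> TrainingSample {
    // Index PDB-derived types by storage location (register / stack offset).
    let truth: HashMap<&str, &str> = with_pdb
        .variables
        .iter()
        .map(|v| (v.storage.as_str(), v.type_name.as_str()))
        .collect();

    // Carry the ground-truth type over to the variables of the stripped output.
    let labels = stripped
        .variables
        .iter()
        .filter_map(|v| {
            truth.get(v.storage.as_str()).map(|ty| LabeledVariable {
                name: v.name.clone(),
                ground_truth_type: (*ty).to_string(),
            })
        })
        .collect();

    TrainingSample { tokens: stripped.tokens, labels }
}
```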

The final dataset consists of 453,155 token-unique functions with type-labeled variables. Token-unique means that the same function is allowed to appear multiple times as long as it is syntactically different, e.g., due to compiler optimizations.
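De-duplication by exact token sequence can be sketched as follows; this is a simplified assumption of how token-uniqueness could be enforced, not necessarily how our dataset builder does it.

```rust
// Keep a function only if its exact token sequence has not been seen before.
use std::collections::HashSet;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn dedup_token_unique(functions: Vec<Vec<String>>) -> Vec<Vec<String>> {
    let mut seen = HashSet::new();
    functions
        .into_iter()
        .filter(|tokens| {
            let mut h = DefaultHasher::new();
            tokens.hash(&mut h);
            seen.insert(h.finish()) // true only for the first occurrence
        })
        .collect()
}
```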

Applying Types to Findings

The most common place where our customers encounter decompiled code in the Binarly Transparency Platform is in the context of a finding. For instance, when our analyses identify an arbitrary write vulnerability via an NVRAM variable in a UEFI firmware module, the vulnerable function is decompiled and presented as part of the finding. The decompilation is meant to aid the security team and the developers in triaging and fixing the bug. Thus, we wanted to see if STRIDE trained on the EDK2 corpus could add helpful context to these findings.

For the reasons discussed in Over-Representation of Primitive Types, the cited 65% accuracy on type recovery is inflated by primitive types. In practice, we observed that for struct types, even minor deviations in variable usage from the training set could skew the results and lead to incorrect predictions. This improved greatly when enabling the struct heuristic introduced above. Below, we show output from our analysis framework for EFI vulnerabilities for an artificial sample. The two functions/findings are presented once as they appeared originally, and once after running them through our STRIDE pipeline. As can be seen, we correctly predict the types of multiple variables, again thanks to our improved heuristics, which leads to much cleaner output. We now know more about the context of the finding, namely that the function uses the UsbBusDxe module to deal with USB protocols. Moreover, we can cleanly distinguish struct field accesses, and even a struct-dependent function call can be resolved. The added semantics facilitate understanding and triaging of the finding.

Limitations

While testing STRIDE with our internal tooling, we found multiple restrictions that limit the approach’s usefulness in a production setting.

Type Confusions

An aspect that can never be ignored in probabilistic prediction systems is the probability of a misclassification. Multiple scenarios or data characteristics can lead to such cases, although the high specificity of n-gram matching is generally well suited to minimizing them. During our testing, the most common non-primitive type confusion was assigning the VA_LIST type. The type is used throughout EDK2 for variable-length arguments (see img5), and in many implementations is just a `typedef char *VA_LIST;`. Due to its generic layout and ubiquitous use (VA_LIST accounts for one percent of all variables’ types in our EDK2 corpus), the stored contexts in the n-gram databases make for a rather generic fit.

[img5] The VA_LIST typedef is heavily used throughout EDK2. As it is not very task- or context-specific, inferring it with precision is difficult.

Most commonly, rather than confusing two syntactically similar struct types for each other, STRIDE will output a generic integer type. To address this, we employ a heuristic that never overwrites an existing primitive type with an inferred primitive such as int32_t or Ghidra’s generic undefined4, as decompilers are usually effective at getting these correct themselves. Thereby, we avoid regressions. Theoretically, it is of course still possible that very small, highly similar contexts, e.g., ones that just wrap a handle of a different type, could lead to a type confusion; the possibility of a misclassification can never be fully eradicated when using probabilistic models. At the same time, automating the detection of such false positives is challenging and under-discussed in academic work. When two types in our vocabulary of predictable outcomes have struct members at the same offsets, this can further foster a type confusion. Restricting the current vocabulary by narrowing a prediction’s scope is an effective countermeasure to reduce such cases; for example, we train and maintain separate n-gram and type databases for 32- and 64-bit programs.
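A minimal sketch of this no-regression check might look like the following; the list of primitive names and the helper functions are illustrative, not our exact implementation.

```rust
// Sketch: an inferred primitive never replaces the decompiler's own primitive.
fn is_primitive(type_name: &str) -> bool {
    matches!(
        type_name,
        "int8_t" | "uint8_t" | "int16_t" | "uint16_t" | "int32_t" | "uint32_t"
            | "int64_t" | "uint64_t" | "undefined4" | "undefined8" | "char"
    )
}

/// Returns the type to apply, if any.
fn accept_prediction<'a>(current: &str, predicted: &'a str) -> Option<&'a str> {
    if is_primitive(predicted) && is_primitive(current) {
        // Keep the decompiler's primitive; only semantic (non-primitive)
        // predictions are allowed to overwrite it.
        return None;
    }
    Some(predicted)
}
```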

Imprecise Training Data

One interesting failure case we encountered was types incorrectly inferred by the decompiler during training dataset construction. Applying function signatures and type information from a PDB to a binary and its decompilation does not only introduce types from that debug information with 100% accuracy: based on the augmented context, the decompiler continues to infer the types of further dependent variables, which can lead to false positives. For example, we observed decompiler failures resulting in “negative array accesses”: certain global variables implicitly feature special values at a negative offset right in front of themselves in memory. These structs, typically holding metadata or handles to global interfaces, naturally have a different type, but as the decompiler reconstructed that memory access as global_var[-1], the wrong type gets propagated (see img4).

These additionally inferred types, while often correct, add a hard-to-control-for factor during dataset creation, as they end up in the n-gram lookup database. This, in turn, can lead to false positives during online prediction: not because STRIDE misclassifies a variable, but because it classifies it correctly yet has the wrong type on file, thanks to the decompiler. Strict filtering of such failure-induced mislabelings is required to avoid poisoning the training data.

[img4] The decompiler incorrectly resolves a negative offset access on a variable. As the variable assignment and the decompiler-inferred type of UsbIoProtocol are incorrect, subsequent struct field accesses cannot be resolved correctly.

Lack of Useful Predictions

The factor that turned out to limit practical applicability the most was the simple absence of meaningful predictions. As explained earlier, generalization beyond what is in the training corpus is marginal for n-gram-based techniques. While testing STRIDE with our EFI analysis framework on firmware from various manufacturers, we observed that the scopes of our EFI findings are narrower than we had initially assumed. Findings are surrounded by OEM-specific code more often than by code that makes heavy use of EDK2 APIs. Consequently, we only rarely observe complex semantic types that are represented in our databases. The good news is that erroneous and misleading predictions appear to be rare: the most common answer in case of uncertainty is the type with the highest overall likelihood of occurring, so a failed prediction will often just recommend an int32_t or similar, which, while not helpful, is also not misleading in a semantic sense. Still, this constrains the usefulness of the approach, which we also need to weigh against its runtime overhead and added code complexity.

Conclusion

As the type inference pipeline in its current state does not add as much semantic context as we had hoped for, we aim to extend our datasets and broaden our n-gram databases. This will enrich them with further types and corresponding contexts, potentially leading to useful predictions in more diverse environments. The Binarly REsearch team will also explore applications of the approach in other products, as both the premise of probabilistic type inference and our early results remain promising. One main task will be to develop further heuristics that make predictions more stable in a production context, adding new insights to facilitate code understanding.

What's lurking in your firmware?