Cryptographic Algorithms Identification in Java Bytecode

By Claudiu-Vlad Ursache, Yegor Vasilenko, Sam Thomas

Cryptographic algorithms protect critical properties of modern software. With the potential danger posed by the advent of quantum computers, it has become more important for companies to identify which algorithms are present in the systems they use or ship to customers.

While mature solutions for detecting algorithm usage in JVM languages are readily available at the source code level (e.g. CodeQL, Semgrep), solutions that work at the bytecode level are less prevalent, outdated, or lack comprehensive coverage.

Today we’re excited to announce that our cryptographic algorithm identification capabilities have been extended to the Java ecosystem. Our new analyzer can scan Java archives (JARs) both as standalone files and embedded within other formats such as Docker containers and firmware images. Support for Android packages is in development and is planned for release later this year. We have found our approach to be faster on average than similar projects in the industry, while maintaining a very high level of accuracy. In addition to providing a complete overview of the algorithms used, we use dataflow analysis to ensure a near-zero false positive rate, and we report call graph reachability for each detected algorithm usage to help prioritize the most relevant issues. And just like our solution for identifying cryptographic algorithms in native binaries, we report NIST IR 8547 compliance information for all encountered algorithms to determine the Post-Quantum Cryptography readiness of software artefacts.

How it works

Let’s consider a simple example from a real-world project:
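The original listing is not included here. As a stand-in, the following is a minimal sketch of what such a class could look like. Only the names `AESCBC`, `cipher` and `javax.crypto.Cipher` come from the text; the signature, the CBC transformation string and the error handling are assumptions made for illustration.

```java
import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical reconstruction: everything except the class name, the method
// name and the use of javax.crypto.Cipher is assumed.
class AESCBC {
    // Runs AES in CBC mode over `input`.
    // `mode` is Cipher.ENCRYPT_MODE or Cipher.DECRYPT_MODE.
    static byte[] cipher(int mode, byte[] key, byte[] iv, byte[] input) {
        try {
            // Instantiation: the algorithm is selected here.
            Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
            c.init(mode, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
            // Call that actually triggers the cryptographic operation.
            return c.doFinal(input);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("AES/CBC operation failed", e);
        }
    }
}
```

The two commented calls are the points of interest for the analysis: one fixes the algorithm, the other performs the operation.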

In this particular piece of code, we see that the method `cipher` of the class `AESCBC` uses the `javax.crypto.Cipher` class from the JDK to perform a cryptographic operation using the AES algorithm. In order to identify this particular instance of a cryptographic operation in a JAR containing this class, we do several things behind the scenes:

1. We generate a Code Property Graph from the JAR

2. We tag nodes in the graph relevant to cryptographic operations (sources and sinks) using a set of tagging rules written in a simple YAML format

3. We run a final dataflow analysis between sources and sinks using a set of conclusion rules

The reason for doing all of this is precision. We have observed multiple instances of cryptographic object instantiations in real-world projects which require interprocedural dataflow tracking (i.e. spanning multiple functions/methods) in order for the instantiations to be identified, and Code Property Graphs provide a solid foundation to deliver that functionality.
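To illustrate why interprocedural tracking matters, here is a small hypothetical example (all names are invented for illustration) in which the algorithm name, the instantiation and the call that triggers the operation each live in a different method. Looking at any one method in isolation is not enough to conclude that SHA-256 is actually used.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class; not taken from a real project.
class DigestHelper {
    // The algorithm name originates in one method...
    static String algorithmName() {
        return "SHA-256";
    }

    // ...the object is instantiated in a second method...
    static MessageDigest newDigest(String name) {
        try {
            return MessageDigest.getInstance(name);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("Unknown digest: " + name, e);
        }
    }

    // ...and the operation is triggered in a third.
    static byte[] finish(MessageDigest md, byte[] data) {
        return md.digest(data);
    }

    // Entry point that wires the three methods together.
    static byte[] hash(byte[] data) {
        return finish(newDigest(algorithmName()), data);
    }
}
```

Connecting the string in `algorithmName` to the `digest` call in `finish` requires following values across method boundaries.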

Code Property Graphs & Joern

The core detection logic of our analyzer is powered by the open-source project Joern, which works with Code Property Graphs (CPGs). A CPG is "a language-agnostic intermediate graph representation of code designed for code querying". The paper that introduced this data structure won the IEEE Test-of-Time award in 2024. Joern provides CPG frontends for multiple input languages, together with a solid API for querying them. We were able to quickly build on top of it, integrate it into our Rust-based analysis pipeline via the Rust-JVM FFI provided by j4rs, and release a solution that is strong in both performance and accuracy.

To give an idea of a CPG, here are the nodes and edges found in the CPG of a simple code snippet written in C:

Figure: Code Property Graph of a sample C code snippet. The snippet itself:

```c
void foo() {
  int x = source();
  if (x < MAX) {
    int y = 2 * x;
    sink(y);
  }
}
```
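As a rough illustration of the idea, the sketch below hand-builds a heavily simplified graph for the `foo()` snippet, with edges labelled as syntax tree, control flow or data dependence, and then asks whether `sink(y)` is data-dependent on `source()`. The class `ToyCpg`, its node labels and its edge kinds are assumptions made for this example; they are not Joern's schema or API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy stand-in for a CPG: one node set, several kinds of edges.
class ToyCpg {
    enum EdgeKind { AST, CFG, DATA_DEP }

    static final class Edge {
        final String from, to;
        final EdgeKind kind;
        Edge(String from, String to, EdgeKind kind) {
            this.from = from; this.to = to; this.kind = kind;
        }
    }

    private final List<Edge> edges = new ArrayList<>();

    void edge(String from, String to, EdgeKind kind) {
        edges.add(new Edge(from, to, kind));
    }

    // Breadth-first search that follows only edges of the given kind.
    boolean reaches(String start, String target, EdgeKind kind) {
        Deque<String> work = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        work.add(start);
        while (!work.isEmpty()) {
            String node = work.poll();
            if (node.equals(target)) return true;
            if (!seen.add(node)) continue;
            for (Edge e : edges) {
                if (e.kind == kind && e.from.equals(node)) work.add(e.to);
            }
        }
        return false;
    }

    // Hand-built, simplified graph for the foo() snippet.
    static ToyCpg forFoo() {
        ToyCpg g = new ToyCpg();
        g.edge("foo", "x = source()", EdgeKind.AST);
        g.edge("foo", "if (x < MAX)", EdgeKind.AST);
        g.edge("if (x < MAX)", "y = 2 * x", EdgeKind.AST);
        g.edge("if (x < MAX)", "sink(y)", EdgeKind.AST);
        g.edge("x = source()", "if (x < MAX)", EdgeKind.CFG);
        g.edge("if (x < MAX)", "y = 2 * x", EdgeKind.CFG);
        g.edge("y = 2 * x", "sink(y)", EdgeKind.CFG);
        g.edge("x = source()", "if (x < MAX)", EdgeKind.DATA_DEP); // via x
        g.edge("x = source()", "y = 2 * x", EdgeKind.DATA_DEP);    // via x
        g.edge("y = 2 * x", "sink(y)", EdgeKind.DATA_DEP);         // via y
        return g;
    }
}
```

Calling `ToyCpg.forFoo().reaches("x = source()", "sink(y)", ToyCpg.EdgeKind.DATA_DEP)` follows the data-dependence edges through `y = 2 * x`, which is the kind of question a dataflow query over a real CPG answers at scale.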

For the specific case of analyzing Java bytecode using CPGs, the underlying representation is Soot IR, which, for our example, looks like this:

Rules & Tagging

In order to detect cryptographic algorithm usage accurately, we tag nodes in the CPGs generated from Java bytecode and run a number of dataflow analyses on the graph, all based on rules we defined for various popular cryptographic libraries. The general pattern is: tag the source (the call which instantiates the object performing the cryptographic operation), tag the sink (the instance of the object performing the cryptographic operation at the call site which triggers the operation), and finally find a dataflow between the source and the sink (i.e. the conclusion).
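The tag-source, tag-sink, conclusion pattern can be sketched in a few lines. The model below is a deliberate simplification: it handles a single straight-line method body, hard-codes one source rule and one sink rule, and tracks flow by variable name only. The class `ToyTagger` and its matching logic are assumptions for illustration and do not reflect the analyzer's actual rule format or Joern's dataflow engine.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of tagging sources and sinks and concluding a finding.
class ToyTagger {
    // One call of the form `result = receiver.method(arg)`; unused parts are null.
    static final class Call {
        final String result, receiver, method, arg;
        Call(String result, String receiver, String method, String arg) {
            this.result = result; this.receiver = receiver;
            this.method = method; this.arg = arg;
        }
    }

    // Returns one finding per sink whose receiver carries a source tag.
    static List<String> analyze(List<Call> body) {
        Map<String, String> sourceTags = new HashMap<>(); // variable -> tag
        List<String> findings = new ArrayList<>();
        for (Call c : body) {
            // Source rule: Cipher.getInstance with a transformation starting with "AES".
            if (c.method.equals("javax.crypto.Cipher.getInstance")
                    && c.result != null && c.arg != null && c.arg.startsWith("AES")) {
                sourceTags.put(c.result, "encryption/aes");
            }
            // Sink rule plus conclusion: doFinal on a variable tagged by a source.
            if (c.method.equals("javax.crypto.Cipher.doFinal")
                    && c.receiver != null && sourceTags.containsKey(c.receiver)) {
                findings.add(sourceTags.get(c.receiver));
            }
        }
        return findings;
    }
}
```

A body that instantiates an AES cipher but never calls `doFinal` on it yields no finding in this model, which mirrors the motivation for requiring a dataflow between source and sink before reporting.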

The rule which tags the instantiation of the object which will perform the cryptographic operation (the source) in the example is the following:

Which leads to a TAG node in the graph:

The rule that tags the instance of the object performing the cryptographic operation (the sink) looks like this:

Which leads to an additional TAG node in the graph:

Finally, the "conclusion" rule specifies the final dataflow analysis to be run which validates that the object which has been instantiated to use a specific cryptographic algorithm actually performs the cryptographic operation:

Which, in our example, will lead to the “encryption/aes” finding:

We have identified 6 code patterns used in cryptographic operations implemented in the most popular cryptographic libraries for the JVM, and our rule format allows us to detect every instance and permutation of these cryptographic APIs. Some examples of the patterns used in the instantiation of the objects performing cryptographic operations are:

// instantiation with string argument
String algorithm = "AES/CBC/PKCS7Padding";
Cipher cipherAes = Cipher.getInstance(algorithm);

// instantiation with object of a specific type
MD4Digest md4Digest = new MD4Digest();

// instantiation with object of a specific type with an argument of a specific type
DESEngine engine = new DESEngine();
EAXBlockCipher eaxDesCipher = new EAXBlockCipher(engine);

// instantiation with object of a specific type with an argument that is a field access
// on an object of another type with a field of a specific name
KeysetHandle ed25519KeysetHandle = KeysetHandle.generateNew(PredefinedSignatureParameters.ED25519);
PublicKeySign ed25519Signer = ed25519KeysetHandle.getPrimitive(RegistryConfiguration.get(), PublicKeySign.class);
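For the first pattern (instantiation with a string argument), the string itself carries the algorithm. JCA transformation strings take the form `algorithm/mode/padding` or just `algorithm`, so once dataflow has resolved the string, the algorithm name can be read off directly. The helper below is a hypothetical sketch of that last step; the class `Transformation` is invented for this example and is not part of the analyzer.

```java
// Hypothetical helper: splits a JCA transformation string into its parts.
class Transformation {
    final String algorithm; // e.g. "AES"
    final String mode;      // e.g. "CBC", or null when omitted
    final String padding;   // e.g. "PKCS7Padding", or null when omitted

    private Transformation(String algorithm, String mode, String padding) {
        this.algorithm = algorithm;
        this.mode = mode;
        this.padding = padding;
    }

    // Accepts "algorithm" or "algorithm/mode/padding"; rejects anything else.
    static Transformation parse(String transformation) {
        String[] parts = transformation.trim().split("/");
        if (parts[0].isEmpty() || (parts.length != 1 && parts.length != 3)) {
            throw new IllegalArgumentException("Unexpected transformation: " + transformation);
        }
        if (parts.length == 1) {
            return new Transformation(parts[0], null, null);
        }
        return new Transformation(parts[0], parts[1], parts[2]);
    }
}
```

For the string in the first snippet, `Transformation.parse("AES/CBC/PKCS7Padding").algorithm` is `"AES"`.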

Support, Performance, Accuracy & Limitations

Currently we support the most popular and widely adopted cryptographic libraries, such as Bouncy Castle, Apache Commons, Google Tink and Google Guava. Support is production-level for JDK8+ bytecode generated from Java code, beta-level for bytecode generated from Kotlin code, and experimental for all other JVM languages.

In terms of performance, we have compared our analyzer with existing tools on the market and have found it to have a shorter running time on average, while reporting all of the algorithms used rather than only issues with some of them.

In terms of accuracy, our analyzer detects 100% of the algorithms found in the popular cryptographic API benchmark CamBench with near-zero false positive and false negative rates. We found similar results in the real-world projects we tested.

We compared our analyzer against three projects. Against the Cryptoscope paper, we found our tool to be 5x faster on average while providing slightly better results for the real-world targets presented in the paper. Against CryptoGuard, we are slightly slower, but CryptoGuard does not support targets which require more modern versions of the JDK (8+). Against CryptoAnalysis, we had issues scanning most of the real-world targets we chose (various exceptions were thrown).

When it comes to limitations, the use of code optimizers and obfuscators does not eliminate findings from our analyzer in the most commonly encountered type of JARs (i.e. non-fat-JARs), but it does reduce the detail of the findings: in common optimization/obfuscation configurations, the line number information for the calls that trigger the cryptographic operations is no longer available. Additionally, we have found Joern to have some rough edges when it comes to dataflow tracking through specific language constructs like lambdas in Java-bytecode-backed CPGs. Based on our experience and extensive testing of real-world applications, we do not expect this to be problematic, because algorithm usage code tends to follow fairly simple patterns.
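As a concrete picture of the lambda case, consider the hypothetical snippet below (names invented for illustration). The instantiation sits inside a lambda body, which `javac` compiles into a synthetic method invoked through `invokedynamic`, so the flow from `getInstance` to `digest` passes through a `Supplier` rather than a direct call.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.function.Supplier;

// Hypothetical example of instantiation hidden behind a lambda.
class LambdaDigest {
    static byte[] hash(byte[] data) {
        // The digest object is created inside the lambda body, not in hash() itself.
        Supplier<MessageDigest> factory = () -> {
            try {
                return MessageDigest.getInstance("SHA-256");
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalStateException("SHA-256 unavailable", e);
            }
        };
        // The operation is triggered on whatever the supplier returns.
        return factory.get().digest(data);
    }
}
```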

Conclusion & Future Work

Precise cryptographic algorithm identification in Java bytecode can be achieved when performing analyses backed by dataflow tracking. We have built such a precise analyzer and we can’t wait for you to try it out. In the near future, we plan to refine our detection rules, support Android applications and improve support for alternative JVM languages.

The feature is currently part of the Binarly Transparency Platform and will be included in the upcoming platform release (v3.5).
