MD5 / SHA hashing of strings in Scala
If you’re using Scala (or another JVM language) and want to compute MD5 or SHA string hashes as hexadecimal strings (e.g. md5("foo") = "acbd18db4cc2f85cedef654fccc4a4d8"), consider using Guava’s Hashing or Apache Commons Codec’s DigestUtils instead of writing your own helper function: these libraries can be significantly faster than many of the DIY implementations floating around the web.
Recommended solutions
First I’ll list my recommended solutions. The examples here all return lowercase strings.
With Guava:
import com.google.common.hash.Hashing
import com.google.common.io.BaseEncoding
import java.nio.charset.StandardCharsets
Hashing.sha256().hashString("foo", StandardCharsets.UTF_8).toString
// returns "2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae"
Hashing.md5().hashString("foo", StandardCharsets.UTF_8).toString
// returns "acbd18db4cc2f85cedef654fccc4a4d8"
BaseEncoding.base16().lowerCase().encode(Array[Byte](-34, -83, -34, -83))
// returns "deaddead"
With Apache Commons:
import org.apache.commons.codec.binary.Hex
import org.apache.commons.codec.digest.DigestUtils
import java.nio.charset.StandardCharsets
DigestUtils.sha256Hex("foo".getBytes(StandardCharsets.UTF_8))
// returns "2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae"
DigestUtils.md5Hex("foo".getBytes(StandardCharsets.UTF_8))
// returns "acbd18db4cc2f85cedef654fccc4a4d8"
Hex.encodeHexString(Array[Byte](-34, -83, -34, -83))
// returns "deaddead"
These libraries also have functions for hashing streams, files, and other input types; see their API docs for details.
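For example, here’s a sketch of hashing a file’s contents with Guava (the file path is just a placeholder):

import java.io.File
import com.google.common.hash.Hashing
import com.google.common.io.Files

// Streams the file through the hash function instead of loading it all into memory:
val fileSha256: String =
  Files.asByteSource(new File("/tmp/example.bin")).hash(Hashing.sha256()).toString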
Comparing performance against a common DIY solution
There are many implementations online that look something like this:
import java.security.MessageDigest

def diyMD5(str: String): String = {
  MessageDigest
    .getInstance("MD5")
    .digest(str.getBytes("UTF-8"))
    .map("%02x".format(_)) // formats each output byte with a separate String.format call
    .mkString("")
}
To compare performance against the Guava and Apache Commons implementations, let’s run a simple benchmark: rather than using JMH, I’m using a bare-bones benchmarking helper function and ignoring several factors (such as the choice of input string). That said, this quick-and-dirty benchmark is sufficient to uncover a large performance gap:
def timeAndRecordAllocations(
  numWarmups: Int = 1000 * 1000, // illustrative defaults, so the calls below can use empty parens
  numTrials: Int = 1000 * 1000,
)(functionToBenchmark: => Unit): Unit = {
  import java.lang.management.ManagementFactory
  import com.sun.management.ThreadMXBean
  // com.sun.management.ThreadMXBean exposes per-thread allocation counts:
  val threadMxBean = ManagementFactory.getThreadMXBean.asInstanceOf[ThreadMXBean]
  val threadId = Thread.currentThread.getId
  // Warmup:
  {
    var i = 0
    while (i < numWarmups) {
      functionToBenchmark
      i += 1
    }
  }
  // Benchmark:
  System.gc()
  val bytesBefore = threadMxBean.getThreadAllocatedBytes(threadId)
  val cpuTimeBefore = threadMxBean.getThreadCpuTime(threadId)
  {
    var i = 0
    while (i < numTrials) {
      functionToBenchmark
      i += 1
    }
  }
  val cpuTimeAfter = threadMxBean.getThreadCpuTime(threadId)
  System.gc()
  val bytesAfter = threadMxBean.getThreadAllocatedBytes(threadId)
  println(s"Estimated bytes/op: ${Math.round(1.0 * (bytesAfter - bytesBefore) / numTrials)}")
  println(s"CPU time nanos/op: ${Math.round(1.0 * (cpuTimeAfter - cpuTimeBefore) / numTrials)}")
}
Running this on my laptop in an Ammonite shell with Scala 2.12.8 and OpenJDK 1.8.0_191:
// Setup dependencies:
@ import $ivy.`com.google.guava:guava:23.0`
@ import $ivy.`commons-codec:commons-codec:1.14`
// [...] define benchmarking helper functions as above [...]
// Run benchmarks:
@ timeAndRecordAllocations() { diyMD5("foo") }
Estimated bytes/op: 19912
CPU time nanos/op: 17393
@ timeAndRecordAllocations() { Hashing.md5().hashString("foo", StandardCharsets.UTF_8).toString }
Estimated bytes/op: 776
CPU time nanos/op: 520
@ timeAndRecordAllocations() { DigestUtils.md5Hex("foo".getBytes(StandardCharsets.UTF_8)) }
Estimated bytes/op: 785
CPU time nanos/op: 690
Eyeballing this, it looks like the Guava and Apache Commons solutions are ~25-33x faster and allocate ~25x less memory than the DIY implementation (at least for this small input string). There are other DIY implementations which are faster (e.g. by using a BigInteger for formatting), but they’re generally still slower than these library functions.
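For reference, the BigInteger trick looks something like the following sketch (the helper name is mine; note the %032x padding, which avoids dropping leading zero bytes):

import java.math.BigInteger
import java.security.MessageDigest

def bigIntegerMD5(str: String): String = {
  val digest = MessageDigest.getInstance("MD5").digest(str.getBytes("UTF-8"))
  // Treat the 16 digest bytes as one unsigned number and print it in base 16;
  // %032x zero-pads so digests starting with 0x00 still produce 32 characters.
  String.format("%032x", new BigInteger(1, digest))
}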
A few notes:
- If you do decide to write your own implementation (e.g. to cut out certain library overheads) then you should be more scientific and careful in your benchmarking than I’ve been here. My goal here is to dissuade folks from copying the slow implementations that you can find on Google, not to micro-optimize.
- With small input strings the runtime is dominated by the cost of converting the MessageDigest’s output into a string of hexadecimal characters. If you’re hashing large files or byte streams then you may see less speedup from switching to the Guava or Apache implementations.
- It may be possible to speed up the MessageDigest operations by using an alternative Java crypto provider. For example, see benchmark results from the Amazon Corretto Crypto Provider (disclaimer: I haven’t had a chance to try this myself yet).
- If you’re trying to replace slow DIY implementations in your own codebase, make sure to double-check the case of the character strings being generated: %02x and %02X will generate lowercase and uppercase strings, respectively, and accidentally mixing them up could have bad consequences if you’re, say, using the checksums as database keys (a quick demo follows this list).
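A quick demo of the case difference, using two sample bytes (0xDE, 0xAD):

Array[Byte](-34, -83).map("%02x".format(_)).mkString // "dead"
Array[Byte](-34, -83).map("%02X".format(_)).mkString // "DEAD"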
Performance optimizations in the Guava and Apache Commons libraries
Here are some performance tricks used in third-party libraries’ implementations:
- Avoiding a synchronization bottleneck: MessageDigest instances are not thread-safe, so it’s fairly common to create fresh instances whenever they’re needed. Under the hood, MessageDigest.getInstance() ends up calling java.security.Provider.getService(), and that method can be prone to synchronization bottlenecks in older JDK versions (including many versions of Java 8).
  Guava works around this problem by making a static prototype which holds a MessageDigest instance and clones it instead of calling MessageDigest.getInstance(). Several other projects use this trick, including AtlasDB and the DataStax Java driver for Apache Cassandra.
  An alternative approach is to store MessageDigest instances in thread-locals; this approach is used by the AWS Java SDK (see also: an interesting discussion on GitHub about the trade-offs between thread locals and cloning). Apache Commons Codec implements neither of these optimizations. (Both approaches are sketched after this list.)
- Bytes to hex conversion: both Guava and Apache Commons Codec convert checksums’ bytes to hexadecimal strings by using the bytes to index into a char[] lookup array (also sketched below).
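Here’s a minimal sketch of the prototype-and-clone trick and the thread-local alternative. The names are mine, not the libraries’ actual internals, and the clone() call assumes the provider’s MD5 implementation is cloneable (the built-in JDK one is):

import java.security.MessageDigest

// Pay the synchronized provider lookup once, at initialization:
val md5Prototype: MessageDigest = MessageDigest.getInstance("MD5")

// Cloning hands each caller a fresh digest without touching the provider registry:
def newMd5ViaClone(): MessageDigest =
  md5Prototype.clone().asInstanceOf[MessageDigest]

// The thread-local alternative: one instance per thread, reused across calls
// (digest() resets the instance after each use):
val md5PerThread: ThreadLocal[MessageDigest] =
  ThreadLocal.withInitial(() => MessageDigest.getInstance("MD5"))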
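And a sketch of the lookup-table hex conversion (again, illustrative names rather than either library’s actual code):

// Index into a precomputed table instead of calling String.format per byte:
val hexDigits: Array[Char] = "0123456789abcdef".toCharArray

def bytesToHex(bytes: Array[Byte]): String = {
  val out = new Array[Char](bytes.length * 2)
  var i = 0
  while (i < bytes.length) {
    val b = bytes(i) & 0xff              // mask to the unsigned value 0-255
    out(2 * i) = hexDigits(b >>> 4)      // high nibble
    out(2 * i + 1) = hexDigits(b & 0x0f) // low nibble
    i += 1
  }
  new String(out)
}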
For Apache Spark
If you’re using Apache Spark’s DataFrame, Dataset, or SQL APIs then you should consider using its built-in md5 and sha2 functions: Spark’s built-in functions will generally be faster than UDFs because they can avoid conversions between Spark’s internal and external data representations. Using the built-ins will also allow your program to benefit from future improvements to Spark’s function implementations.
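For example, a sketch assuming a DataFrame df with a string column named value:

import org.apache.spark.sql.functions.{col, md5, sha2}

// md5 and sha2 operate on binary input; string columns are implicitly cast:
val withHashes = df
  .withColumn("value_md5", md5(col("value")))
  .withColumn("value_sha256", sha2(col("value"), 256))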
Under the hood, Spark 2.4.6’s implementation of these functions uses Apache Commons’ DigestUtils.