MD5 / SHA hashing of strings in Scala
If you’re using Scala (or another JVM language) and want to compute MD5 or SHA string hashes as hexadecimal strings (e.g. md5("foo") = "acbd18db4cc2f85cedef654fccc4a4d8"), consider using Guava’s Hashing or Apache Commons Codec’s DigestUtils instead of writing your own helper function: these libraries can be significantly faster than many of the DIY implementations floating around the web.
Recommended solutions
First I’ll list my recommended solutions. The examples here all return lowercase strings.
With Guava:
import com.google.common.hash.Hashing
import com.google.common.io.BaseEncoding
import java.nio.charset.StandardCharsets
Hashing.sha256().hashString("foo", StandardCharsets.UTF_8).toString
// returns "2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae"
Hashing.md5().hashString("foo", StandardCharsets.UTF_8).toString
// returns "acbd18db4cc2f85cedef654fccc4a4d8"
BaseEncoding.base16().lowerCase().encode(Array[Byte](-34, -83, -34, -83))
// returns "deaddead"
With Apache Commons:
import org.apache.commons.codec.binary.Hex
import org.apache.commons.codec.digest.DigestUtils
import java.nio.charset.StandardCharsets
DigestUtils.sha256Hex("foo".getBytes(StandardCharsets.UTF_8))
// returns "2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae"
DigestUtils.md5Hex("foo".getBytes(StandardCharsets.UTF_8))
// returns "acbd18db4cc2f85cedef654fccc4a4d8"
Hex.encodeHexString(Array[Byte](-34, -83, -34, -83))
// returns "deaddead"
These libraries also have functions for hashing streams, files, and other input types; see their API docs for details.
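For example, here’s a sketch of hashing a file’s contents with Guava (the file path is just a placeholder):

import java.io.File
import com.google.common.hash.Hashing
import com.google.common.io.Files

// Streams the file through the hash function instead of loading it all into memory:
val fileSha256: String =
  Files.asByteSource(new File("/tmp/example.bin")).hash(Hashing.sha256()).toString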
Comparing performance against a common DIY solution
There are many implementations online that look something like this:
import java.security.MessageDigest

def diyMD5(str: String): String = {
  MessageDigest
    .getInstance("MD5")
    .digest(str.getBytes("UTF-8"))
    .map("%02x".format(_)) // formats each output byte with a separate String.format call
    .mkString("")
}
To compare performance against the Guava and Apache Commons implementations, let’s run a simple benchmark: rather than using JMH, I’m using a bare-bones benchmarking helper function and ignoring several factors (such as the choice of input string). That said, this quick-and-dirty benchmark is sufficient to uncover a large performance gap:
def timeAndRecordAllocations(
  numWarmups: Int = 1000 * 1000, // illustrative defaults, so the calls below can use empty parens
  numTrials: Int = 1000 * 1000,
)(functionToBenchmark: => Unit): Unit = {
  import java.lang.management.ManagementFactory
  import com.sun.management.ThreadMXBean
  // com.sun.management.ThreadMXBean exposes per-thread allocation counts:
  val threadMxBean = ManagementFactory.getThreadMXBean.asInstanceOf[ThreadMXBean]
  val threadId = Thread.currentThread.getId
  // Warmup:
  {
    var i = 0
    while (i < numWarmups) {
      functionToBenchmark
      i += 1
    }
  }
  // Benchmark:
  System.gc()
  val bytesBefore = threadMxBean.getThreadAllocatedBytes(threadId)
  val cpuTimeBefore = threadMxBean.getThreadCpuTime(threadId)
  {
    var i = 0
    while (i < numTrials) {
      functionToBenchmark
      i += 1
    }
  }
  val cpuTimeAfter = threadMxBean.getThreadCpuTime(threadId)
  System.gc()
  val bytesAfter = threadMxBean.getThreadAllocatedBytes(threadId)
  println(s"Estimated bytes/op: ${Math.round(1.0 * (bytesAfter - bytesBefore) / numTrials)}")
  println(s"CPU time nanos/op: ${Math.round(1.0 * (cpuTimeAfter - cpuTimeBefore) / numTrials)}")
}
Running this on my laptop in an Ammonite shell with Scala 2.12.8 and OpenJDK 1.8.0_191:
// Setup dependencies:
@ import $ivy.`com.google.guava:guava:23.0`
@ import $ivy.`commons-codec:commons-codec:1.14`
// [...] define benchmarking helper functions as above [...]
// Run benchmarks:
@ timeAndRecordAllocations() { diyMD5("foo") }
Estimated bytes/op: 19912
CPU time nanos/op: 17393
@ timeAndRecordAllocations() { Hashing.md5().hashString("foo", StandardCharsets.UTF_8).toString }
Estimated bytes/op: 776
CPU time nanos/op: 520
@ timeAndRecordAllocations() { DigestUtils.md5Hex("foo".getBytes(StandardCharsets.UTF_8)) }
Estimated bytes/op: 785
CPU time nanos/op: 690
Eyeballing this, it looks like the Guava and Apache Commons solutions are ~25-33x faster and allocate ~25x less memory than the DIY implementation (at least for this small input string). There are other DIY implementations which are faster (e.g. by using a BigInteger for formatting), but they’re generally still slower than these library functions.
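For reference, the BigInteger trick looks something like the following sketch (the helper name is mine; note the %032x padding, which avoids dropping leading zero bytes):

import java.math.BigInteger
import java.security.MessageDigest

def bigIntegerMD5(str: String): String = {
  val digest = MessageDigest.getInstance("MD5").digest(str.getBytes("UTF-8"))
  // Treat the 16 digest bytes as one unsigned number and print it in base 16;
  // %032x zero-pads so digests starting with 0x00 still produce 32 characters.
  String.format("%032x", new BigInteger(1, digest))
}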
A few notes:
- If you do decide to write your own implementation (e.g. to cut out certain library overheads) then you should be more scientific and careful in your benchmarking than I’ve been here. My goal here is to dissuade folks from copying the slow implementations that you can find on Google, not to micro-optimize.
- With small input strings the runtime is dominated by the cost of converting the MessageDigest’s output into a string of hexadecimal characters. If you’re hashing large files or byte streams then you may see less speedup from switching to the Guava or Apache implementations.
- It may be possible to speed up the MessageDigest operations by using an alternative Java crypto provider. For example, see benchmark results from the Amazon Corretto Crypto Provider (disclaimer: I haven’t had a chance to try this myself yet).
- If you’re trying to replace slow DIY implementations in your own codebase, make sure to double-check the case of the character strings being generated: %02x and %02X will generate lowercase and uppercase strings, respectively, and accidentally mixing them up could have bad consequences if you’re, say, using the checksums as database keys (a quick demo follows this list).
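A quick demo of the case difference, using two sample bytes (0xDE, 0xAD):

Array[Byte](-34, -83).map("%02x".format(_)).mkString // "dead"
Array[Byte](-34, -83).map("%02X".format(_)).mkString // "DEAD"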
Performance optimizations in the Guava and Apache Commons libraries
Here are some performance tricks used in third-party libraries’ implementations:
- Avoiding a synchronization bottleneck: MessageDigest instances are not thread-safe, so it’s fairly common to create fresh instances whenever they’re needed. Under the hood, MessageDigest.getInstance() ends up calling java.security.Provider.getService(), and that method can be prone to synchronization bottlenecks in older JDK versions (including many versions of Java 8).
  Guava works around this problem by making a static prototype which holds a MessageDigest instance and clones it instead of calling MessageDigest.getInstance(). Several other projects use this trick, including AtlasDB and the DataStax Java driver for Apache Cassandra.
  An alternative approach is to store MessageDigest instances in thread-locals; this approach is used by the AWS Java SDK (see also: an interesting discussion on GitHub about the trade-offs between thread locals and cloning). Apache Commons Codec implements neither of these optimizations. (Both approaches are sketched after this list.)
- Bytes to hex conversion: both Guava and Apache Commons Codec convert checksums’ bytes to hexadecimal strings by using the bytes to index into a char[] lookup array (also sketched below).
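Here’s a minimal sketch of the prototype-and-clone trick and the thread-local alternative. The names are mine, not the libraries’ actual internals, and the clone() call assumes the provider’s MD5 implementation is cloneable (the built-in JDK one is):

import java.security.MessageDigest

// Pay the synchronized provider lookup once, at initialization:
val md5Prototype: MessageDigest = MessageDigest.getInstance("MD5")

// Cloning hands each caller a fresh digest without touching the provider registry:
def newMd5ViaClone(): MessageDigest =
  md5Prototype.clone().asInstanceOf[MessageDigest]

// The thread-local alternative: one instance per thread, reused across calls
// (digest() resets the instance after each use):
val md5PerThread: ThreadLocal[MessageDigest] =
  ThreadLocal.withInitial(() => MessageDigest.getInstance("MD5"))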
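And a sketch of the lookup-table hex conversion (again, illustrative names rather than either library’s actual code):

// Index into a precomputed table instead of calling String.format per byte:
val hexDigits: Array[Char] = "0123456789abcdef".toCharArray

def bytesToHex(bytes: Array[Byte]): String = {
  val out = new Array[Char](bytes.length * 2)
  var i = 0
  while (i < bytes.length) {
    val b = bytes(i) & 0xff              // mask to the unsigned value 0-255
    out(2 * i) = hexDigits(b >>> 4)      // high nibble
    out(2 * i + 1) = hexDigits(b & 0x0f) // low nibble
    i += 1
  }
  new String(out)
}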
For Apache Spark
If you’re using Apache Spark’s DataFrame, Dataset, or SQL APIs then you should consider using its built-in md5 and sha2 functions: Spark’s built-in functions will generally be faster than UDFs because they can avoid conversions between Spark’s internal and external data representations. Using the built-ins will also allow your program to benefit from future improvements to Spark’s function implementations.
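For example, a sketch assuming a DataFrame df with a string column named value:

import org.apache.spark.sql.functions.{col, md5, sha2}

// md5 and sha2 operate on binary input; string columns are implicitly cast:
val withHashes = df
  .withColumn("value_md5", md5(col("value")))
  .withColumn("value_sha256", sha2(col("value"), 256))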
Under the hood, Spark 2.4.6’s implementation of these functions uses Apache Commons’ DigestUtils.