+ "markdown": "---\ntitle: Hashing Performance\ndate: 2025-01-26\nauthor: Carlos\ncategories:\n - performance \n - TypeScript\nfilters:\n - ../drop-knitr-stderr.lua\n---\n\n::: {.cell}\n\n:::\n\n\n\nIn the course of 1.7's perf work, we are going to introduce a number of persistent caches\nfor Quarto projects. This will require knowing which hashing functions perform well\nunder what settings. I'm using [this file](./deno-hash-bench.ts) to measure the results.\n\n::: {.cell}\n::: {.cell-output .cell-output-stderr}\n\n```\nRows: 17 Columns: 5\n── Column specification ────────────────────────────────────────────────────────\nDelimiter: \",\"\ndbl (5): size_log2, sha256, md5, djb2, blueimp-md5\n\nℹ Use `spec()` to retrieve the full column specification for this data.\nℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n```\n\n\n:::\n\n::: {.cell-output-display}\n{#fig-runtimes width=672}\n:::\n:::\n\nImportant features:\n\n| algorithm | sync | quality |\n|---------------|------|------------|\n| `djb2` | yes | non-crypto |\n| `blueimp-md5` | yes | meh |\n| `md5` | no | meh |\n| `sha256` | no | good |\n\n### djb2\n\nI've spent some time trying to write a faster version of `djb2` and couldn't make meaningful progress.\nI tried:\n\n- unrolling the loop directly (not enough of a win)\n- operating on 32 bits at a time by converting the string to a buffer first\n\n## Takeaways\n\n- `blueimp-md5` only makes sense if `sync` MD5 calls are necessary: it's slower than `md5` at every size.\n- `md5` only makes sense if the quality improvement over `djb2` is needed but `sha256` is not required:\n - `djb2` gives ~32 bits of hash space, so birthday-paradox collisions start appearing at around 2^16 items, while MD5 gives 128 bits.\n - in adversarial settings, `md5` is trivially breakable\n\n- `sha256` is async and has a large startup cost, but it is the fastest for strings of size ~2^14 = 16k and larger, faster even than `djb2`.\n\n## A design for a general-purpose 
cache?\n\nIf we need cryptographically-safe hashes, then we need to use SHA-256 everywhere. Unfortunately, that incurs ~15ms of overhead per call, independent of the size of the string. That's a lot.\n\nIf `djb2` is good enough in terms of quality, then we still need to worry about hash space size. `djb2` has 32 bits of hash space. By the birthday paradox, if we want at most a 1-in-a-million chance of a hash collision, then the cache can hold at most [~100](https://en.wikipedia.org/wiki/Birthday_problem#Probability_table) entries.\n\nHonestly, this number is small enough that I'm wary about using `djb2` at all in Quarto as a substitute for string equality.\n\nIf we could create a 64-bit version of `djb2`, that would likely suffice for Quarto documents: the critical size at which such a cache reaches a 1-in-a-million chance of catastrophic failure is ~6 million entries.\n\n`md5` has 128 bits, and in non-adversarial settings that's plenty.\n\nThe penalty for using `md5` is about 50%, plus the requirement to use async:\n\n::: {.cell}\n::: {.cell-output-display}\n{width=672}\n:::\n:::\n\nThat's a completely acceptable tradeoff.\n\nSo, I think our general-purpose cache is:\n\n- use `md5` or `sha256`, whichever is faster. The breakpoint where `sha256` clearly wins is at string sizes of around 16k or larger.\n\n- this cache will necessarily be async.\n",