README.md | 57
1 file changed, 27 insertions(+), 30 deletions(-)
diff --git a/README.md b/README.md
index 545c73b..5c3e012 100644
--- a/README.md
+++ b/README.md
@@ -94,36 +94,33 @@ int main(void) {
## Benchmarks
-1 000 000 u64→u64 entries: insert, lookup (hit), lookup (miss), remove.
-All implementations use the same murmur-inspired mixing hash where the language allows it.
-Best time across multiple runs is reported — see notes on V-Cache below.
-
-**System:** AMD Ryzen 9 7950X3D, Clang 22.1.1 / GCC 15.2.1 / Rust 1.93.0, Linux, performance governor.
-
-```
-Implementation Best (ms)
----------------------------------------------
-abseil flat_hash_map (C++) 34 [1]
-cheesemap SSE2 (Clang) 45
-Java HashMap ~32 [2]
-Rust HashMap (hashbrown) 48
-cheesemap scalar (Clang) 56
-Go map 1.26 (Swiss table) 68
-.NET Dictionary 70
-Chromium JS Map 102
-std::unordered_map + mimalloc ~110 [3]
-cheesemap SSE2 (GCC) 58 [4]
-Node.js Map 153
-std::unordered_map (Clang) 176
-Python dict 193
-```
-
-[1] abseil is the reference Swiss table implementation — its numbers are expected to be strong.
-[2] Java's best run benefits from C2 JIT reaching peak tier during the measured phase; average is ~67 ms.
-[3] std::unordered_map is a chained hash map; mimalloc recovers ~38% over the system allocator by reducing per-node allocation overhead. Without mimalloc the best observed time was ~176 ms.
-[4] GCC's codegen for this workload is ~25% slower than Clang due to a failure to inline the hot lookup path.
-
-**Note on AMD 3D V-Cache:** The Ryzen 9 7950X3D has two CCDs — one with 96 MB L3 (3D V-Cache) and one with 32 MB L3. The benchmark's working set (~32 MB) sits right on the boundary of the smaller CCD. The OS scheduler assigns the process to a CCD at random, producing a bimodal result distribution: runs landing on the V-Cache CCD are significantly faster. Best times above reflect V-Cache-lucky runs. Average times are roughly 1.5–2× higher on this machine. To pin to the V-Cache CCD for reproducible results: `numactl --cpunodebind=0 ./bench`.
+1,000,000 `u64 -> u64` entries: insert, lookup hit, lookup miss, remove. All implementations use the same murmur-inspired mixing hash where the language allows it. Best time across multiple runs is shown.
+
+**Platform:** x86-64, Linux, performance governor.
+
+| Implementation | Best (ms) | Notes |
+|---|---|---|
+| Java HashMap | ~32 | [1] |
+| abseil flat_hash_map | ~34 | [2] |
+| cheesemap SSE2 (Clang) | ~45 | |
+| Rust HashMap (hashbrown) | ~48 | |
+| cheesemap scalar (GCC) | ~53 | |
+| cheesemap scalar (Clang) | ~56 | |
+| cheesemap SSE2 (GCC) | ~58 | [3] |
+| Go map (Swiss table) | ~68 | |
+| .NET Dictionary | ~70 | |
+| Chromium JS Map | ~102 | |
+| std::unordered_map + mimalloc | ~110 | [4] |
+| Node.js Map | ~153 | |
+| std::unordered_map | ~176 | |
+| Python dict | ~193 | |
+
+> **Note on extended L3 cache:** The test machine has a split cache topology where the benchmark's ~32 MB working set fits comfortably in one cache domain but thrashes in the other. Best times above reflect lucky scheduling onto the larger cache. Average times are roughly 1.5–2× higher. Pin with `numactl --cpunodebind=0` for reproducible results.
+
+**[1]** The JVM's C2 JIT reaches its peak compilation tier mid-run, flattering the best time; the average is ~67 ms.
+**[2]** abseil is the reference Swiss table implementation.
+**[3]** GCC is ~25% slower than Clang here due to a failure to inline the hot lookup path.
+**[4]** std::unordered_map is chained; mimalloc recovers ~38% by reducing per-node allocation cost. Without it: ~176 ms.
---