Lock-Free in Java: Scenario 07 - Per-Core Sharded Processing
Part 1: The 2AM Crisis That Changed Everything
Thursday, 2:24 AM. I was deep into one of those late-night deployments that you promise yourself will never happen again, yet somehow always do. The monitoring dashboard lit up like a Christmas tree, but not the kind you want to see. Our transaction processing system - the backbone of our fintech platform - was choking under load.
The numbers told a story that made my stomach drop. We had promised our clients 60,000 transactions per second. The system was processing approximately 3,000. A full 20x shortfall. The business team's emails were already flooding in, each one more urgent than the last. Our biggest client, a major e-commerce platform running their Black Friday equivalent, was losing money with every passing second.
I pulled up the flame graph and there it was - a massive red block sitting at the top of the call stack: java.util.concurrent.locks.ReentrantLock.lock(). Our threads were spending 73% of their time doing absolutely nothing except waiting. Not processing transactions, not computing results, not serving customers - just waiting for their turn to access a shared buffer.
The architecture seemed reasonable at design time. We had a single RingBuffer that all processing threads would write to, and a downstream consumer that would batch-process the results. Simple, clean, easy to reason about. Except now, with 64 cores all hammering the same lock, simple had become our enemy.
I checked the CPU utilization: 8%. Eight percent on a 64-core machine that should have been maxed out processing transactions. The cores weren't running our code - they were running the kernel's scheduler, bouncing threads between parked and runnable states in an endless dance of futility.
The profiler told me something else too. Our L3 cache miss rate was through the roof. Every time a thread acquired the lock, it had to fetch the lock state from another core's cache - a process that took 40-100 nanoseconds depending on which core held it. Then, when it released the lock, the next thread had to do the same. Cache lines were bouncing across the CPU like ping-pong balls.
At 3:47 AM, I pushed a desperate hack to production - reducing our thread pool from 64 threads to 8. Throughput actually improved. We went from 3,000 to 9,000 transactions per second. Still nowhere near our target, but it stopped the immediate bleeding. The reduced contention meant threads spent more time working and less time waiting.
The next morning, bleary-eyed and running on caffeine, I gathered the team for an emergency architecture review. "We have a fundamental design problem," I said, pulling up the flame graphs. "Locks don't scale. On high-core-count machines, they become the bottleneck, not the solution. We need to think differently."
That began a two-week intensive deep-dive that would fundamentally change how I approach concurrent system design. What we discovered was an elegant pattern that took our system from 3,000 TPS to over 60,000 TPS - a 20x improvement achieved not through more hardware or clever algorithms, but through understanding how modern CPUs actually work.
This is the story of per-core sharding: how we eliminated contention by giving each core its own private playground to work in, and the journey that taught me that sometimes the best lock is no lock at all.
Part 2: Why Contention Kills Performance
Before diving into the solution, we need to understand the problem at a fundamental level. Contention isn't just "threads waiting" - it's a cascade of performance-destroying effects that compound under load.
The Anatomy of Lock Contention
When multiple threads compete for a shared resource protected by a lock, several things happen:
1. Thread Parking and Context Switches
When a thread tries to acquire a lock that another thread holds, it eventually "parks" - tells the operating system to stop scheduling it until the lock is available. This parking operation isn't free:
The actual lock operation (compare-and-swap) takes about 50 nanoseconds. But the context switch overhead is 2,000-6,000 nanoseconds - 40-120x more expensive than the operation we're trying to protect.
2. Lock Convoy Effect
Here's where it gets really insidious. Consider four threads all trying to access the same lock:
Each thread pays the full context switch penalty. They form a "convoy" - processing sequentially with the worst possible overhead. Four threads that should provide 4x parallelism instead provide less throughput than a single thread would, because they spend all their time in scheduling overhead.
3. Cache Line Bouncing
Modern CPUs have per-core caches organized in a hierarchy (L1, L2, L3). When data is modified, the cache line containing that data must be invalidated in all other cores. The protocol that enforces this is MESI (Modified, Exclusive, Shared, Invalid).
A lock's internal state (whether it's held, by whom, who's waiting) lives in memory. When Thread-1 acquires the lock, it modifies this state. That modification invalidates the cache line in all other cores. When Thread-2 tries to acquire the lock, it must fetch the cache line from Thread-1's cache (or main memory) - an operation costing 40-100+ nanoseconds depending on CPU topology.
Each bounce costs 40-100ns. With 64 cores competing, the lock state might bounce dozens of times per lock acquisition. The CPU's interconnect becomes saturated with coherency traffic rather than useful work.
The CAS Retry Storm
Lock-free algorithms use Compare-And-Swap (CAS) operations instead of locks. But they're not immune to contention:
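A minimal sketch of the pattern (illustrative, not the original listing): a counter incremented with a CAS retry loop, the shape most lock-free hot paths take.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of a CAS-based increment (equivalent to AtomicLong.incrementAndGet()).
public final class CasCounter {
    private final AtomicLong value = new AtomicLong();

    public long increment() {
        while (true) {
            long current = value.get();               // read the current value
            long next = current + 1;
            if (value.compareAndSet(current, next)) {
                return next;                          // success: one CAS, done
            }
            // Failure: another thread won the race; reload and retry.
            // Under heavy contention this loop is where the retry storm lives.
        }
    }
}
```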
Under low contention, this works beautifully - one CAS, done. Under high contention, something nasty happens: every thread that loses the race reloads the value and tries again, and most of those retries lose as well.
This is a CAS retry storm. Instead of O(N) operations for N increments, we get O(N^2) in the worst case. Worse, each failed CAS still bounces cache lines, generating massive memory traffic with zero progress.
Quantifying the Damage
In our transaction processing system, I measured the following before we fixed the problem:
The system was spending 73% of its cycles on locking overhead and only 27% on actual work. The 8% CPU utilization showed that most cores were parked, waiting. The L3 cache miss rate of 38% indicated massive cache-line bouncing.
This is what contention looks like at scale. And the only way to fix it is to eliminate the contention itself.
Part 3: The Insight - Eliminating Contention Through Sharding
The night after our emergency meeting, I couldn't sleep. I kept turning the problem over in my mind. We had tried:
- Reducing lock hold time (already minimized)
- Using StampedLock for optimistic reads (didn't help - we were write-heavy)
- Using lock striping (helped somewhat, but not enough)
- Going lock-free with CAS (CAS retry storms under high contention)
None of these approaches addressed the fundamental issue: all threads were fighting over the same resource. Whether that resource was protected by a lock, a CAS variable, or anything else, the contention remained.
Then it hit me. What if we didn't have one shared resource? What if we had 64 shared resources - one per core?
The idea was simple: instead of a single buffer that all threads write to, create multiple buffers. Assign each thread to a specific buffer based on some deterministic mapping. Now threads on different cores never compete with each other.
Before: Single Shared Buffer (contention!)
After: Per-Core Sharded Buffers (zero contention!)
This is sharding - partitioning a shared resource into independent pieces that can be accessed without coordination. It's the same principle that makes distributed databases scale: instead of one big lock, many small locks (or better, no locks at all).
The Key Insight: Thread Affinity
For sharding to work well, we need stable thread-to-shard assignments. If threads randomly pick shards, we're back to contention. The insight is to use the thread's identity to deterministically select a shard.
In Java, every thread has a unique ID accessible via Thread.currentThread().getId(). This ID is stable for the thread's lifetime. We can use it to map threads to shards:
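A minimal sketch of the mapping, assuming a shardCount that is a power of 2 (ensured below); names are illustrative.

```java
// Illustrative: deterministic thread-to-shard mapping.
public final class ShardSelector {
    private final int shardCount; // must be a power of 2

    public ShardSelector(int shardCount) {
        if (Integer.bitCount(shardCount) != 1) {
            throw new IllegalArgumentException("shardCount must be a power of 2");
        }
        this.shardCount = shardCount;
    }

    /** Maps the calling thread to a shard index in [0, shardCount). */
    public int shardIndex() {
        long threadId = Thread.currentThread().getId(); // stable for the thread's lifetime
        return (int) (threadId & (shardCount - 1));     // equivalent to modulo for powers of 2
    }
}
```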
The bitwise AND with (shardCount - 1) works because shardCount is a power of 2 (we'll ensure this). It's equivalent to modulo but much faster.
This mapping ensures:
- The same thread always writes to the same shard
- Different threads write to potentially different shards
- The mapping is O(1) with no memory access required
Choosing the Shard Count
How many shards should we have? There's a trade-off:
Too few shards: Some shards will be shared by multiple threads, reintroducing contention.
Too many shards: Memory overhead increases, and the consumer has more work round-robining between shards.
The sweet spot is typically:
- Minimum: Number of cores (so each core can have its own shard)
- Maximum: Number of producer threads (so each thread has its own shard)
For a 64-core machine with 64 producer threads, 64 shards is ideal. Each thread gets exclusive access to its shard - zero contention.
But what if we have 64 threads and only 8 shards? Then on average, 8 threads share each shard. Contention is reduced 8x compared to a single buffer, but not eliminated. We call this the "contention factor."
Round-Robin Consumption
With multiple shards, the consumer must check all of them. A simple approach is round-robin:
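A minimal sketch of the consumer loop, assuming a hypothetical Shard type whose poll() returns null when that shard is empty.

```java
// Illustrative round-robin drain over all shards (hypothetical Shard contract).
public final class RoundRobinConsumer<T> {

    /** Minimal shard contract assumed by this sketch. */
    public interface Shard<T> {
        T poll();
    }

    private final Shard<T>[] shards;
    private int consumerShardIndex;            // consumer-only state, no synchronization needed

    public RoundRobinConsumer(Shard<T>[] shards) {
        this.shards = shards;
    }

    /** Polls each shard at most once, starting where the last call left off. */
    public T poll() {
        for (int i = 0; i < shards.length; i++) {
            Shard<T> shard = shards[consumerShardIndex];
            consumerShardIndex = (consumerShardIndex + 1) % shards.length;
            T item = shard.poll();             // null means this shard is currently empty
            if (item != null) {
                return item;
            }
        }
        return null;                           // all shards empty right now
    }
}
```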
This has two nice properties:
- Fairness: No shard gets starved
- Batching opportunity: The consumer can drain multiple items from a shard before moving on
The trade-off is that consumption isn't strictly FIFO across the entire system - items might be consumed out of arrival order if they landed in different shards. For most systems, this is acceptable.
Part 4: The Single Shared Buffer - Our Baseline
Before building the sharded solution, let's examine exactly what we're replacing. Understanding the baseline in detail reveals the specific costs we're trying to eliminate.
Implementation
Here's the single shared buffer implementation we started with:
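The original listing isn't reproduced here; the sketch below is a minimal lock-based ring buffer with the characteristics discussed in the analysis that follows (a single ReentrantLock, bounded capacity, FIFO order, consumed slots nulled for the GC). Names and details are illustrative.

```java
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch of the lock-based baseline, not the original listing.
public final class LockedRingBuffer<T> {
    private final Object[] slots;
    private final int capacity;
    private final ReentrantLock lock = new ReentrantLock();
    private int head;   // next slot to write
    private int tail;   // next slot to read
    private int count;  // number of occupied slots

    public LockedRingBuffer(int capacity) {
        this.capacity = capacity;
        this.slots = new Object[capacity];
    }

    /** Returns false instead of blocking when the buffer is full. */
    public boolean offer(T item) {
        lock.lock();                      // every producer serializes here
        try {
            if (count == capacity) {
                return false;             // full: reject rather than overwrite
            }
            slots[head] = item;
            head = (head + 1) % capacity;
            count++;
            return true;
        } finally {
            lock.unlock();
        }
    }

    @SuppressWarnings("unchecked")
    public T poll() {
        lock.lock();                      // the consumer competes for the same lock
        try {
            if (count == 0) {
                return null;
            }
            T item = (T) slots[tail];
            slots[tail] = null;           // help the GC by clearing the consumed slot
            tail = (tail + 1) % capacity;
            count--;
            return item;
        } finally {
            lock.unlock();
        }
    }
}
```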
Analysis
This implementation is correct. It provides thread safety through a ReentrantLock, prevents buffer overflows, maintains FIFO ordering, and helps the garbage collector by nulling consumed slots.
But let's trace through what happens when 4 threads try to offer() simultaneously:
We achieved 3.4% efficiency. The other 96.6% was overhead.
Memory Layout Analysis
Let's examine the object layout using JOL (Java Object Layout):
Notice that head, tail, count, and lock are all within 32 bytes of each other - they fit on a single 64-byte cache line. This means:
- When a producer updates head, it invalidates the consumer's cached tail
- When the consumer updates tail, it invalidates the producer's cached head
- Every lock acquisition invalidates the cache lines for all competing threads
This is false sharing at its worst. Fields that should be independent are sharing cache lines, causing unnecessary invalidation traffic.
Benchmark Results
Using JMH with 64 producer threads on a 64-core machine:
The median latency (298ns) is acceptable. But look at the tail:
- p99: 2.8 microseconds (10x median)
- p99.9: 18 microseconds (60x median)
- p99.99: 67 microseconds (225x median)
This tail latency is the lock convoy effect manifesting. Some unlucky threads wait through multiple context switch cycles before they can proceed.
Part 5: The Per-Core Sharded Buffer - Our Solution
Now let's build the sharded solution. The design goals are:
- Eliminate contention between producer threads
- Maintain lock-free (or very-low-contention) access paths
- Keep the implementation simple and debuggable
- Provide O(1) shard selection
Architecture
Implementation
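The original implementation isn't reproduced here either; the sketch below is a minimal version that follows the design decisions listed next: power-of-2 shard count, thread-ID-to-shard mapping, per-slot sequence numbers with CAS slot claiming, and a single round-robin consumer. Cache-line padding is omitted for brevity (a padding example appears under Key Design Decisions), and volatile AtomicLongArray accesses stand in for the VarHandle setRelease/getAcquire calls discussed later.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;

// Illustrative sketch of a per-core sharded MPSC buffer, not the original listing.
public final class ShardedRingBuffer<T> {

    private static final class Shard<T> {
        final Object[] slots;
        final AtomicLongArray sequences;              // per-slot sequence numbers
        final AtomicLong head = new AtomicLong();     // producer claim counter
        long tail;                                    // consumer-only, no synchronization needed
        final int mask;

        Shard(int capacity) {
            this.slots = new Object[capacity];
            this.sequences = new AtomicLongArray(capacity);
            this.mask = capacity - 1;
            for (int i = 0; i < capacity; i++) {
                sequences.set(i, i);                  // slot i is free for turn i
            }
        }

        boolean offer(T item) {
            while (true) {
                long h = head.get();
                int slot = (int) (h & mask);
                long diff = sequences.get(slot) - h;
                if (diff == 0) {
                    if (head.compareAndSet(h, h + 1)) {   // claim the slot
                        slots[slot] = item;               // plain write of the payload
                        sequences.set(slot, h + 1);       // publish (volatile store)
                        return true;
                    }
                    // lost the CAS to another producer sharing this shard; retry
                } else if (diff < 0) {
                    return false;                         // shard is full
                }
                // diff > 0: our head read was stale; retry
            }
        }

        @SuppressWarnings("unchecked")
        T poll() {
            int slot = (int) (tail & mask);
            if (sequences.get(slot) != tail + 1) {        // volatile load; not yet published
                return null;
            }
            T item = (T) slots[slot];
            slots[slot] = null;                           // help the GC
            sequences.set(slot, tail + mask + 1);         // free the slot for the next lap
            tail++;
            return item;
        }
    }

    private final Shard<T>[] shards;
    private final int shardMask;
    private int consumerShardIndex;                       // single-consumer round-robin state

    @SuppressWarnings("unchecked")
    public ShardedRingBuffer(int shardCount, int capacityPerShard) {
        if (Integer.bitCount(shardCount) != 1 || Integer.bitCount(capacityPerShard) != 1) {
            throw new IllegalArgumentException("shardCount and capacity must be powers of 2");
        }
        this.shards = new Shard[shardCount];
        this.shardMask = shardCount - 1;
        for (int i = 0; i < shardCount; i++) {
            shards[i] = new Shard<>(capacityPerShard);
        }
    }

    /** Producer path: the calling thread always writes to "its" shard. */
    public boolean offer(T item) {
        int index = (int) (Thread.currentThread().getId() & shardMask);
        return shards[index].offer(item);
    }

    /** Consumer path: single thread only; round-robins across shards. */
    public T poll() {
        for (int i = 0; i < shards.length; i++) {
            Shard<T> shard = shards[consumerShardIndex];
            consumerShardIndex = (consumerShardIndex + 1) & shardMask;
            T item = shard.poll();
            if (item != null) {
                return item;
            }
        }
        return null;
    }
}
```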
Key Design Decisions
1. Power-of-2 Shard Count
This enables fast shard selection using bitwise AND instead of modulo:
On x86-64, modulo requires an IDIV instruction (~40 cycles), while bitwise AND is a single AND instruction (~1 cycle).
2. Thread ID Hashing
Thread IDs in Java are sequential starting from 1. For most applications, consecutive threads will map to consecutive shards, providing good distribution. If you need better distribution (e.g., thread pool recycling), consider using a hash:
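A minimal sketch using a Murmur3-style 64-bit finalizer to spread the low-entropy sequential IDs before masking; treating this as the article's exact hash is an assumption.

```java
// Illustrative: mix thread IDs before masking to avoid clustering.
public final class HashedShardSelector {
    private final int mask;

    public HashedShardSelector(int shardCount) {  // shardCount must be a power of 2
        this.mask = shardCount - 1;
    }

    public int shardIndex() {
        long z = Thread.currentThread().getId();
        z = (z ^ (z >>> 33)) * 0xff51afd7ed558ccdL;   // Murmur3 fmix64 constants
        z = (z ^ (z >>> 33)) * 0xc4ceb9fe1a85ec53L;
        z = z ^ (z >>> 33);
        return (int) (z & mask);
    }
}
```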
3. Cache Line Padding
Each shard has padding fields (p01-p07, etc.) to ensure that hot fields of adjacent shards don't share cache lines:
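A minimal sketch of manual padding, assuming 64-byte cache lines; the long fields exist purely to keep one shard's hot counter on its own line, and the class names are illustrative.

```java
// Illustrative: keep each shard's hot field on its own 64-byte cache line.
abstract class LhsPadding {
    long p01, p02, p03, p04, p05, p06, p07;      // padding before the hot field
}

abstract class HotFields extends LhsPadding {
    long head;                                    // frequently written by this shard's producers
}

public final class PaddedShardHeader extends HotFields {
    long p11, p12, p13, p14, p15, p16, p17;      // padding after, so a neighbor can't share the line
}
```

The superclass/subclass split matters: the JVM may reorder fields within a class, but it lays out superclass fields before subclass fields, so the padding stays on either side of the hot field.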
Without padding, updates to Shard[0]'s head would invalidate Shard[1]'s cached data - defeating the purpose of sharding.
4. Per-Slot Sequence Numbers
The shard implementation uses the same per-slot sequence pattern we developed in earlier articles. This enables:
- Lock-free producer path (CAS only for slot claiming)
- Safe publication (sequence update after data write)
- Consumer doesn't need synchronization
5. Round-Robin Consumption
The consumer maintains state (consumerShardIndex) to remember where it left off. This ensures fairness - no shard gets starved even if another shard is very active.
Part 6: Technical Deep Dive - Why Sharding Works
Let's analyze exactly why sharding provides such dramatic performance improvements by examining the CPU-level behavior.
Memory Access Patterns
Single Shared Buffer:
With 64 cores, each write causes 63 cache line invalidations. The CPU's coherency protocol becomes the bottleneck.
Sharded Buffer:
With proper shard count (one per thread), cache lines never bounce. Each core operates on its own data in its own cache.
Cache Line Utilization Visualization
CAS Retry Analysis
Single Buffer with 64 Threads:
When all 64 threads CAS the same head variable:
- Expected successful CAS operations: 64 (one per item)
- Actual CAS operations (with retries): 64 + 63 + 62 + ... + 1 = 2,080
That's 32x more atomic operations than necessary!
Sharded Buffer with 64 Shards:
Each thread CAS-es its own shard's head:
- Expected successful CAS operations: 64
- Actual CAS operations: 64 (no retries when one thread per shard)
Throughput Scaling Model
For a single shared resource accessed by N threads, throughput is divided by a contention term, where ContentionFactor(N) represents the overhead from contention (cache misses, retries, context switches) and grows with the thread count. For sharded resources, that term disappears and throughput scales with N.
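The article's exact formulas aren't reproduced above; the following is a plausible reconstruction (an assumption, not the original model), with R1 denoting single-thread throughput:

```latex
\mathrm{Throughput}_{\mathrm{shared}}(N) \approx \frac{N \cdot R_1}{1 + \mathrm{ContentionFactor}(N)},
\qquad
\mathrm{Throughput}_{\mathrm{sharded}}(N) \approx N \cdot R_1
```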
This is the fundamental difference. Shared resources follow Amdahl's Law - performance is limited by the serial portion (contention). Sharded resources follow Gustafson's Law - performance scales with parallelism.
Experimental Validation
I ran experiments comparing throughput vs. thread count:
Notice:
- Single buffer performance actually decreases beyond 4 threads due to contention
- Sharded buffer scales nearly linearly up to 64 threads
- At 64 threads, sharding provides 51.57x better throughput
This is the power of eliminating contention. The sharded buffer lets all 64 cores work at full speed, while the single buffer serializes them through a chokepoint.
Part 7: Benchmarks and Results
Benchmark Setup
Hardware:
- CPU: AMD EPYC 7742 (64 cores, 128 threads, 2.25 GHz base)
- RAM: 512 GB DDR4-3200
- OS: Ubuntu 22.04, Linux 5.15
- JVM: OpenJDK 21, ZGC
Benchmark configuration:
- Buffer capacity: 1024 per shard
- Shard count: 64 (one per core)
- Producer threads: Variable (1-64)
- Consumer: Single thread, continuous drain
- Duration: 60 seconds per configuration
- Warmup: 30 seconds
Latency Results
Offer Latency (64 Producer Threads):
| Metric | Single Buffer | Sharded Buffer | Improvement |
|---|---|---|---|
| Mean | 512ns | 27ns | 19.0x |
| p50 | 298ns | 22ns | 13.5x |
| p90 | 756ns | 38ns | 19.9x |
| p99 | 2,890ns | 67ns | 43.1x |
| p99.9 | 18,234ns | 134ns | 136.1x |
| p99.99 | 67,234ns | 287ns | 234.2x |
The tail latency improvement is dramatic. At p99.99, sharding is 234x better - turning 67 microsecond worst-case latency into sub-microsecond latency.
Throughput Results
| Producers | Single Buffer | Sharded Buffer | Improvement |
|---|---|---|---|
| 1 | 2.1M/s | 2.0M/s | 0.95x |
| 4 | 3.1M/s | 7.8M/s | 2.5x |
| 8 | 3.0M/s | 15.2M/s | 5.1x |
| 16 | 2.8M/s | 29.8M/s | 10.6x |
| 32 | 2.5M/s | 57.1M/s | 22.8x |
| 64 | 2.1M/s | 108.3M/s | 51.6x |
At 64 threads, the sharded buffer achieves over 100 million operations per second. The single buffer is throttled to 2.1 million - no better than it achieved with a single thread!
Latency Distribution
The sharded buffer's distribution is tightly clustered in the sub-30ns range, while the single buffer has a long tail extending into tens of microseconds.
CPU Utilization
Single Buffer (64 threads):
- User CPU: 8%
- System CPU: 12%
- Idle: 80%
Sharded Buffer (64 threads):
- User CPU: 89%
- System CPU: 3%
- Idle: 8%
The single buffer wastes 80% of CPU capacity on contention overhead. The sharded buffer actually utilizes the available compute resources.
Cache Analysis (via perf)
The sharded buffer has:
- 89% fewer L1 cache misses
- 93% fewer LLC misses
- 13x fewer cycles per operation
This directly maps to the performance improvement - fewer cache misses means faster execution.
GC Behavior
Both implementations have similar allocation patterns (same data, same buffer sizes), so GC behavior is comparable. The key difference is that the sharded buffer doesn't create lock wait queue nodes, eliminating ~3MB/sec of allocation pressure that the locked implementation generates.
Part 8: Trade-offs and When to Use
When Per-Core Sharding Excels
1. High-Core-Count Servers
Modern servers have 32, 64, or even 128+ cores. Traditional synchronization patterns that worked fine on 4-8 core machines fall apart at this scale. Sharding is designed for this environment.
2. Many-Producer, Single-Consumer Patterns
Examples:
- Log aggregation (many app threads write logs, one thread flushes to disk)
- Metrics collection (many threads emit metrics, one thread aggregates)
- Event sourcing (many threads emit events, one thread persists)
The MPSC (Multi-Producer Single-Consumer) pattern is a natural fit for sharding.
3. Bursty Workloads
When work arrives in bursts (market open, flash sales, etc.), contention spikes dramatically. Sharding maintains consistent performance regardless of load pattern.
4. Latency-Sensitive Systems
For trading, gaming, or real-time systems where tail latency matters, the 234x improvement at p99.99 is transformative.
When to Avoid Sharding
1. Low Thread Counts
With 1-4 threads, the overhead of managing multiple shards may exceed the contention cost. The single buffer is simpler and nearly as fast.
2. Strict Ordering Requirements
Sharding relaxes FIFO ordering - items in different shards may be consumed out of arrival order. If strict ordering is required, sharding won't work without additional coordination (which reintroduces contention).
3. Memory-Constrained Environments
64 shards each with 1024 slots means 65,536 buffer slots instead of 1,024. For embedded systems or containers with tight memory limits, this overhead may be unacceptable.
4. Simple Applications
If you're not hitting performance limits, the added complexity of sharding isn't justified. Premature optimization is the root of all evil.
Choosing Shard Count
The optimal shard count depends on your workload:
Shard Count = Number of Producer Threads:
- Zero contention (ideal)
- Maximum memory usage
- Best for performance-critical paths
Shard Count = Number of Cores:
- Near-zero contention for most workloads
- Reasonable memory usage
- Good default choice
Shard Count = Number of NUMA Nodes:
- Minimizes cross-NUMA traffic
- Lower memory usage
- Good for memory-constrained systems
Monitoring Recommendations
Track these metrics in production:
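A minimal sketch of the counters implied here, using LongAdder so the metrics themselves don't become a contention point; the metric names match the discussion below, everything else is illustrative.

```java
import java.util.concurrent.atomic.LongAdder;

// Illustrative production metrics for a sharded buffer.
public final class ShardedBufferMetrics {
    public final LongAdder offerSuccess = new LongAdder();   // "offer.success"
    public final LongAdder offerFull = new LongAdder();      // "offer.full" - shard was full
    public final LongAdder casRetries = new LongAdder();     // "cas.retries" - lost CAS races
    private final LongAdder[] itemsPerShard;                 // per-shard utilization

    public ShardedBufferMetrics(int shardCount) {
        this.itemsPerShard = new LongAdder[shardCount];
        for (int i = 0; i < shardCount; i++) {
            itemsPerShard[i] = new LongAdder();
        }
    }

    public void recordOffer(int shardIndex, boolean accepted, int casRetryCount) {
        (accepted ? offerSuccess : offerFull).increment();
        casRetries.add(casRetryCount);
        if (accepted) {
            itemsPerShard[shardIndex].increment();
        }
    }

    /** Ratio of least-used to most-used shard; values near 1.0 mean a balanced mapping. */
    public double shardBalanceRatio() {
        long min = Long.MAX_VALUE, max = 0;
        for (LongAdder adder : itemsPerShard) {
            long v = adder.sum();
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        return max == 0 ? 1.0 : (double) min / max;
    }
}
```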
If cas.retries is consistently high, you may need more shards. If shard utilization is very uneven, your thread-to-shard mapping may need adjustment.
Part 9: Advanced Optimizations
Optimization 1: NUMA-Aware Shard Placement
On multi-socket systems, accessing memory from a remote NUMA node costs 2-3x more than local access. We can optimize by aligning shards with NUMA topology:
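A minimal sketch of the idea. It assumes each producer thread is pinned to a core at startup and told which NUMA node that core belongs to; Java has no standard NUMA API, so the node id must come from the deployment (JNI, an affinity library, or configuration). Names are illustrative.

```java
// Illustrative NUMA-aware shard selection: pick a shard from the local node's partition.
public final class NumaShardSelector {
    private static final ThreadLocal<Integer> NODE = ThreadLocal.withInitial(() -> 0);
    private final int shardsPerNode;                       // power of 2

    public NumaShardSelector(int shardsPerNode) {
        this.shardsPerNode = shardsPerNode;
    }

    /** Called once per producer thread after it has been pinned to a core. */
    public static void registerCurrentThreadNode(int numaNode) {
        NODE.set(numaNode);
    }

    /** Shards [node * shardsPerNode, (node + 1) * shardsPerNode) belong to that node. */
    public int shardIndex() {
        int node = NODE.get();
        int local = (int) (Thread.currentThread().getId() & (shardsPerNode - 1));
        return node * shardsPerNode + local;
        // For this to help, each shard's backing arrays should also be allocated
        // (first touched) by a thread running on that node.
    }
}
```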
This keeps threads accessing local memory, reducing cross-node traffic.
Optimization 2: Adaptive Shard Selection
If thread IDs cluster badly (e.g., all map to the same few shards), use adaptive selection:
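A minimal sketch using the "power of two choices" trick: hash to two candidate shards and pick the less loaded one. The load counters and their bookkeeping are illustrative.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Illustrative adaptive selection: choose the less loaded of two candidate shards.
public final class AdaptiveShardSelector {
    private final AtomicLongArray pendingPerShard;   // approximate items waiting per shard
    private final int mask;

    public AdaptiveShardSelector(int shardCount) {   // shardCount must be a power of 2
        this.pendingPerShard = new AtomicLongArray(shardCount);
        this.mask = shardCount - 1;
    }

    public int selectShard() {
        long id = Thread.currentThread().getId();
        int first = (int) (id & mask);
        int second = (int) ((id * 0x9E3779B97F4A7C15L >>> 32) & mask);  // independent second candidate
        // Reading two counters costs a little, but avoids pathological clustering.
        return pendingPerShard.get(first) <= pendingPerShard.get(second) ? first : second;
    }

    public void onOffer(int shard)   { pendingPerShard.incrementAndGet(shard); }
    public void onConsume(int shard) { pendingPerShard.decrementAndGet(shard); }
}
```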
This adds overhead (reading load counters) but ensures better distribution under pathological thread ID patterns.
Optimization 3: Batch Operations
For very high throughput, batch multiple items per shard access:
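A minimal sketch of batched draining on the consumer side, assuming the shard exposes a single-item poll(); the Shard interface and drainBatch signature are illustrative.

```java
import java.util.function.Consumer;

// Illustrative batched drain: amortize shard selection across many items.
public final class BatchingConsumer<T> {
    public interface Shard<T> { T poll(); }

    private final Shard<T>[] shards;
    private int consumerShardIndex;

    public BatchingConsumer(Shard<T>[] shards) {
        this.shards = shards;
    }

    /** Drains up to maxBatch items from the current shard before moving to the next. */
    public int drainBatch(int maxBatch, Consumer<T> handler) {
        Shard<T> shard = shards[consumerShardIndex];
        consumerShardIndex = (consumerShardIndex + 1) % shards.length;

        int drained = 0;
        T item;
        while (drained < maxBatch && (item = shard.poll()) != null) {
            handler.accept(item);          // process while the shard's cache lines are still hot
            drained++;
        }
        return drained;
    }
}
```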
Batching amortizes the overhead of shard selection and cache-line access across multiple items.
Optimization 4: Work Stealing for Uneven Loads
If some shards fill faster than others, the consumer can use work stealing:
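A minimal sketch: the consumer drains its preferred shard first and, only when that comes up empty, scans the remaining shards for work. The Shard interface is the same illustrative single-method contract as above.

```java
// Illustrative work-stealing drain: fall back to other shards when the preferred one is empty.
public final class StealingConsumer<T> {
    public interface Shard<T> { T poll(); }

    private final Shard<T>[] shards;
    private int preferred;                       // current "home" shard

    public StealingConsumer(Shard<T>[] shards) {
        this.shards = shards;
    }

    public T poll() {
        T item = shards[preferred].poll();       // try the home shard first
        if (item != null) {
            return item;
        }
        for (int i = 1; i < shards.length; i++) {     // scan the other shards for available work
            int victim = (preferred + i) % shards.length;
            item = shards[victim].poll();
            if (item != null) {
                preferred = victim;              // stay where the work is
                return item;
            }
        }
        preferred = (preferred + 1) % shards.length;  // nothing anywhere; rotate and report empty
        return null;
    }
}
```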
This ensures the consumer always works on available data, reducing idle time.
Part 10: Real-World Application
Let me share how we applied these concepts to solve our original problem - the transaction processing system that was failing at 3,000 TPS when we needed 60,000.
The Architecture Before
All 64 handler threads fought for the single buffer. Lock convoys, cache-line bouncing, and CAS retry storms killed our performance.
The Architecture After
Each handler gets its own shard. Zero contention, maximum parallelism.
The Results
We exceeded our 60,000 TPS target with room to spare. More importantly, our tail latencies dropped from milliseconds to microseconds, meeting our SLA requirements with margin.
Lessons Learned
1. Profile Before Optimizing
We could have guessed that "locks are slow" and tried many optimizations. Instead, we profiled and discovered exactly where time was going (73% in lock wait). This directed us to the right solution.
2. Understand Hardware
The fix wasn't algorithmic cleverness - it was understanding CPU caches, coherency protocols, and NUMA topology. Software engineering is hardware engineering at this level.
3. Measure After Optimizing
We validated every change with benchmarks. Some "optimizations" (like adding more spin iterations) actually made things worse. Data beats intuition.
4. Simple Solutions Often Best
Sharding isn't complex. It's essentially "have more buffers instead of one." The insight was recognizing that our problem was contention, not algorithm efficiency.
Part 11: Conclusion
That Thursday night crisis taught me something fundamental: at scale, coordination is the enemy of performance. We had 64 cores capable of processing 100 million operations per second, reduced to 3 million by a single lock.
The journey from 3,000 TPS to 62,000 TPS wasn't about clever algorithms or exotic data structures. It was about one key insight: eliminate contention by eliminating sharing.
Per-core sharding embodies this principle:
- Instead of one buffer, many buffers
- Instead of threads competing, threads cooperating (by staying out of each other's way)
- Instead of cache lines bouncing, cache lines staying put
The results speak for themselves:
- 20x throughput improvement
- 136x tail latency improvement
- 70x reduction in context switches
- Linear scalability with core count
But sharding isn't magic. It requires:
- Power-of-2 shard counts for efficient selection
- Careful cache-line padding to prevent false sharing
- Thread-to-shard affinity for stable assignment
- Single-consumer design for simple consumption
When you find your high-core-count system underperforming despite apparent CPU headroom, look for contention. Profile for lock wait time, cache misses, and context switches. If you find a hot lock or CAS variable being hammered by many threads, consider sharding.
The pattern applies beyond ring buffers:
- Connection pools can be sharded
- Statistics counters can be sharded (see LongAdder)
- Thread-local storage is extreme sharding
- Database sharding follows the same principle
Remember: the fastest synchronization is no synchronization. The best lock is no lock. When you can partition your problem so threads never need to coordinate, you unlock the full parallelism potential of modern hardware.
And remember - measure, understand, optimize. In that order.
Appendix A: Quick Reference
Algorithm Summary
Producer Protocol:
- Get thread ID
- Hash to shard index: threadId & (shardCount - 1)
- Write to that shard (lock-free CAS)
Consumer Protocol:
- Check shard[roundRobinIndex]
- If empty, try next shard
- Advance roundRobinIndex
Key Invariants:
- shardCount is always a power of 2
- The same thread always maps to the same shard
- Shards are cache-line padded to prevent false sharing
Performance Comparison
| Metric | Single Buffer | Sharded (64) | Improvement |
|---|---|---|---|
| Mean Latency | ~500ns | ~27ns | 18.5x |
| p99.9 Latency | ~18us | ~134ns | 134x |
| Throughput (64T) | ~2M ops/s | ~108M ops/s | 54x |
| CPU Utilization | 8% | 89% | 11x |
| Cache Miss Rate | 38% | 4% | 9.5x reduction |
When to Use
Do Use:
- High-core-count systems (16+ cores)
- Many-producer, single-consumer patterns
- Latency-sensitive applications
- Bursty workloads
Don't Use:
- Low thread counts (1-4)
- Strict ordering requirements
- Memory-constrained environments
- Simple applications (KISS)
Tuning Guidelines
Shard Count:
- Start with: Number of producer threads
- Minimum: Number of cores
- Maximum: 2x number of producers
Shard Capacity:
- Start with: 1024 per shard
- Increase if: shards frequently fill
- Decrease if: memory constrained
Consumer Strategy:
- Round-robin for fairness
- Work-stealing for uneven loads
- Batched drain for throughput
Appendix B: Common Pitfalls
Pitfall 1: Non-Power-of-2 Shard Count
Always round up shard count to the nearest power of 2.
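A minimal sketch of the rounding, using Integer.highestOneBit; the class name is illustrative.

```java
// Illustrative: round a requested shard count up to the next power of 2.
public final class ShardCounts {
    public static int nextPowerOfTwo(int requested) {
        if (requested <= 1) {
            return 1;
        }
        return Integer.highestOneBit(requested - 1) << 1;   // e.g. 48 -> 64, 64 -> 64
    }

    public static void main(String[] args) {
        System.out.println(nextPowerOfTwo(48));   // 64
        System.out.println(nextPowerOfTwo(64));   // 64
    }
}
```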
Pitfall 2: Missing Cache Line Padding
Pitfall 3: Thread ID Collisions
Thread IDs in Java are sequential, but thread pools recycle threads with the same ID. Consider using thread-local state or a registration pattern for long-running systems:
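A minimal sketch of a registration pattern: each thread grabs a shard index once, round-robin, the first time it publishes, and keeps it in a ThreadLocal thereafter. Names are illustrative.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative registration pattern: explicit round-robin assignment instead of relying on IDs.
public final class RegisteredShardSelector {
    private final AtomicInteger nextShard = new AtomicInteger();
    private final int mask;                                  // shardCount - 1, power of 2
    private final ThreadLocal<Integer> assigned =
            ThreadLocal.withInitial(() -> nextShard.getAndIncrement() & mask);

    public RegisteredShardSelector(int shardCount) {
        this.mask = shardCount - 1;
    }

    /** Stable for the thread's lifetime, evenly distributed regardless of thread IDs. */
    public int shardIndex() {
        return assigned.get();
    }
}
```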
Pitfall 4: Consumer Thread Affinity
If the consumer runs on a different NUMA node than the shards it reads, performance degrades. Pin the consumer to the same node as the majority of shards:
Pitfall 5: Over-Sharding
More shards isn't always better. Each shard has memory overhead and the consumer must check all of them. For 64 threads, 64-128 shards is plenty. 1024 shards would waste memory and slow consumption.
Appendix C: Alternative Implementations
JCTools SpscArrayQueue
For single-producer single-consumer scenarios, JCTools provides optimized queues:
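A minimal usage sketch, assuming the org.jctools:jctools-core dependency is on the classpath; one SpscArrayQueue per producer thread gives an SPSC shard with no CAS at all.

```java
import org.jctools.queues.SpscArrayQueue;

// Illustrative SPSC usage: one producer, one consumer, bounded queue.
public final class JcToolsExample {
    public static void main(String[] args) throws InterruptedException {
        SpscArrayQueue<String> queue = new SpscArrayQueue<>(1024);

        Thread producer = new Thread(() -> {
            for (int i = 0; i < 10; i++) {
                while (!queue.offer("event-" + i)) {
                    Thread.onSpinWait();         // queue full: spin briefly
                }
            }
        });

        Thread consumer = new Thread(() -> {
            int received = 0;
            while (received < 10) {
                String item = queue.poll();      // null when empty
                if (item != null) {
                    System.out.println(item);
                    received++;
                } else {
                    Thread.onSpinWait();
                }
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
    }
}
```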
Disruptor with Multiple Ring Buffers
The LMAX Disruptor supports multiple producers with sophisticated sequencing:
Agrona's ManyToOneRingBuffer
Agrona (used by Aeron) provides IPC-capable ring buffers:
Chronicle Queue
For persistent sharded queues with disk-backed storage:
Appendix D: Debugging and Troubleshooting
Diagnosing Contention Issues
Before implementing sharding, you need to confirm that contention is actually your problem. Here's a systematic approach:
Step 1: Profile for Lock Wait Time
Use Java Flight Recorder (JFR) to capture lock contention events:
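A recording can also be started programmatically with the jdk.jfr API (JDK 11+); a minimal sketch that captures monitor contention and thread-park events, with the workload method as a placeholder.

```java
import jdk.jfr.Recording;
import java.nio.file.Path;
import java.time.Duration;

// Illustrative: programmatic JFR recording focused on lock contention.
public final class ContentionRecording {
    public static void main(String[] args) throws Exception {
        try (Recording recording = new Recording()) {
            // Contended synchronized blocks show up as JavaMonitorEnter;
            // ReentrantLock waiters show up as ThreadPark events.
            recording.enable("jdk.JavaMonitorEnter").withThreshold(Duration.ofMillis(1));
            recording.enable("jdk.ThreadPark").withThreshold(Duration.ofMillis(1));
            recording.start();

            runWorkload();                                  // the code under investigation

            recording.stop();
            recording.dump(Path.of("contention.jfr"));      // open this file in JMC
        }
    }

    private static void runWorkload() throws InterruptedException {
        Thread.sleep(1_000);                                // placeholder for the real workload
    }
}
```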
Then analyze with JMC (Java Mission Control):
- Look for "Java Lock Wait" events
- Check "Thread Contention" view
- Identify hot locks by contention count
Step 2: Measure Cache Behavior
On Linux, use perf to capture cache statistics:
Key metrics to watch:
- L1 miss rate > 5%: Possible cache-line bouncing
- LLC miss rate > 10%: Severe cache pressure
- Cache-misses growing linearly with threads: Contention signature
Step 3: Track Context Switches
High context switch rate (> 10,000/sec) combined with low CPU utilization suggests lock convoy effects.
Common Debugging Scenarios
Scenario: Throughput Drops When Adding Threads
This is the classic contention signature. If going from 4 to 8 threads decreases throughput, you have a serialization bottleneck.
Diagnosis:
Scenario: Uneven Shard Utilization
If some shards are always full while others are empty, your thread-to-shard mapping is broken.
Diagnosis:
Solutions:
- Check thread ID distribution
- Use a hash function instead of direct modulo
- Increase shard count
Scenario: Consumer Can't Keep Up
If shards fill up and producers start failing, the consumer is the bottleneck.
Diagnosis:
Solutions:
- Use batched drain operations
- Add more consumer threads (requires multi-consumer design)
- Increase shard capacity
- Reduce producer rate
Performance Tuning Checklist
Before deploying sharded buffers to production, verify:
- Shard count is power of 2
- Shard count >= number of producer threads
- Cache line padding is present between hot fields
- Consumer is pinned to appropriate CPU core
- JVM is using appropriate GC (ZGC or Shenandoah for low latency)
- Monitoring is in place for shard utilization
- Backpressure strategy is defined for full shards
- Thread affinity is configured if NUMA is present
Appendix E: Memory Ordering Deep Dive
Understanding memory ordering is crucial for implementing correct lock-free algorithms. Let me explain the specific ordering requirements in our sharded buffer.
The Publication Problem
The core challenge in any producer-consumer buffer is ensuring that when the consumer sees "data is available," the data is actually there. Consider this sequence:
1. The producer writes the payload into a slot
2. The producer publishes the slot (for example, by advancing its sequence number)
3. The consumer observes the published sequence
4. The consumer reads the payload from the slot
Without proper memory ordering, step 4 might read stale data because the CPU or compiler could reorder steps 1 and 2.
Release-Acquire Semantics
We solve this with release-acquire ordering:
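A minimal sketch of the publication handshake using VarHandle array access; the slot and sequence names mirror the buffer discussion, but the class itself is illustrative.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Illustrative release/acquire publication of one slot.
public final class PublicationExample {
    private static final VarHandle SEQ =
            MethodHandles.arrayElementVarHandle(long[].class);

    private final Object[] data = new Object[1024];
    private final long[] sequence = new long[1024];

    // Producer side
    void publish(int slot, Object payload, long turn) {
        data[slot] = payload;                       // 1. plain write of the payload
        SEQ.setRelease(sequence, slot, turn);       // 2. release store: payload visible first
    }

    // Consumer side
    Object tryConsume(int slot, long expectedTurn) {
        long seq = (long) SEQ.getAcquire(sequence, slot);   // 3. acquire load of the sequence
        if (seq != expectedTurn) {
            return null;                            // not published yet
        }
        return data[slot];                          // 4. plain read, made safe by the 2 -> 3 edge
    }
}
```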
The guarantees:
- Release store: All prior writes become visible before this store is visible
- Acquire load: All subsequent loads see writes that happened before the corresponding release
This creates a "synchronizes-with" relationship: if the consumer's acquire load sees the value written by the producer's release store, all writes before the producer's release are visible to all reads after the consumer's acquire.
VarHandle Access Modes
Java's VarHandle provides several access modes with different ordering guarantees:
For our buffer:
- Producer writes use setRelease (sufficient for publication)
- Consumer reads use getAcquire (sufficient for consumption)
- CAS operations use volatile semantics by default
x86 vs ARM Ordering
Memory ordering requirements vary by CPU architecture:
x86-64 (Intel/AMD):
- Strong memory model
- Stores are never reordered with other stores
- Loads are never reordered with other loads
- Release/acquire are essentially free (no extra barriers)
ARM/Apple Silicon:
- Weak memory model
- Loads and stores can be freely reordered
- Release requires a DMB ST barrier
- Acquire requires a DMB LD barrier
On x86, our code performs well because the memory model provides most guarantees for free. On ARM, the explicit release-acquire ordering ensures correctness but may have slight overhead from memory barriers.
Testing Memory Ordering
Memory ordering bugs are notoriously hard to find. They may only manifest under specific timing conditions or on specific hardware. Here's how to stress-test:
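A minimal sketch of such a stress test: a ping-pong handshake where the consumer must never observe a published flag without also observing the payload written before it. The iteration count and class name are illustrative.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Illustrative memory-ordering stress test: if release/acquire is correct, the consumer
// can never see the flag for iteration i without also seeing payload == i.
public final class OrderingStressTest {
    private static final VarHandle FLAG;
    static {
        try {
            FLAG = MethodHandles.lookup()
                    .findVarHandle(OrderingStressTest.class, "flag", long.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    private long payload;   // plain field: ordering comes from the flag handshake
    private long flag;      // accessed only via FLAG.setRelease / FLAG.getAcquire

    public static void main(String[] args) throws InterruptedException {
        final long iterations = 1_000_000;
        OrderingStressTest t = new OrderingStressTest();

        Thread producer = new Thread(() -> {
            for (long i = 1; i <= iterations; i++) {
                t.payload = i;                      // write data first
                FLAG.setRelease(t, i);              // then publish
                while ((long) FLAG.getAcquire(t) == i) {
                    Thread.onSpinWait();            // wait for the consumer to acknowledge
                }
            }
        });

        Thread consumer = new Thread(() -> {
            for (long i = 1; i <= iterations; i++) {
                while ((long) FLAG.getAcquire(t) != i) {
                    Thread.onSpinWait();            // wait for iteration i to be published
                }
                if (t.payload != i) {
                    throw new AssertionError("reordering detected at iteration " + i);
                }
                FLAG.setRelease(t, -i);             // acknowledge so the producer can continue
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
        System.out.println("no reordering observed in " + iterations + " iterations");
    }
}
```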
Run this test thousands of times, ideally on both x86 and ARM machines. Any memory ordering bug will eventually cause the assertion to fail.
Appendix F: Production Deployment Guide
Capacity Planning
To size your sharded buffer appropriately:
1. Estimate Peak Throughput
Measure or estimate your peak message rate:
Typical burst factors:
- Web applications: 3-5x
- Trading systems: 10-20x
- IoT systems: 5-10x
2. Calculate Buffer Size
The buffer must hold messages long enough for the consumer to process them:
3. Determine Shard Count
Match producer thread count:
4. Calculate Memory Usage
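The individual formulas above aren't reproduced; the following is a combined back-of-the-envelope sketch of all four steps, using assumed example numbers (100k msgs/s average, 10x burst, 50 ms worst-case consumer pause, 64 producers, ~256-byte messages).

```java
// Illustrative capacity-planning arithmetic with assumed example inputs.
public final class CapacityPlan {
    public static void main(String[] args) {
        // 1. Peak throughput = average rate x burst factor
        double averageRatePerSec = 100_000;
        double burstFactor = 10;                       // e.g. a trading-style burst
        double peakRatePerSec = averageRatePerSec * burstFactor;

        // 2. Buffer size = peak rate x longest consumer pause we must absorb
        double maxConsumerPauseSec = 0.050;            // 50 ms (GC pause, hiccup, etc.)
        long totalSlotsNeeded = (long) Math.ceil(peakRatePerSec * maxConsumerPauseSec);

        // 3. Shard count = producer thread count, rounded up to a power of 2
        int producerThreads = 64;
        int shardCount = Integer.highestOneBit(producerThreads - 1) << 1;
        long slotsPerShard = Long.highestOneBit(Math.max(1, totalSlotsNeeded / shardCount - 1)) << 1;

        // 4. Memory = shards x slots x (slot reference + assumed payload size)
        long bytesPerSlot = 8 + 256;                   // 8-byte reference + ~256-byte message
        long totalBytes = (long) shardCount * slotsPerShard * bytesPerSlot;

        System.out.printf("peak=%.0f/s, slots=%d, shards=%d, slots/shard=%d, memory=%.1f MB%n",
                peakRatePerSec, totalSlotsNeeded, shardCount, slotsPerShard, totalBytes / 1e6);
    }
}
```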
JVM Configuration
Recommended JVM flags for low-latency operation:
Monitoring in Production
Essential metrics to track:
Alert Thresholds:
- Offer success rate < 99%: Consumer falling behind
- Shard balance ratio < 0.5: Poor thread-to-shard distribution
- Any shard at 90% capacity: Risk of data loss
Load Testing Strategy
Before production deployment, conduct thorough load testing:
Phase 1: Baseline Establishment
Phase 2: Scalability Validation
Expected results: Each doubling of threads should roughly double throughput, up to your shard count.
Phase 3: Sustained Load Test
Run at 80% of maximum throughput for extended periods (hours to days):
- Monitor for memory leaks
- Watch for GC behavior degradation
- Check for shard imbalance accumulation
Phase 4: Burst Testing
Simulate realistic burst patterns:
Graceful Shutdown
Ensure clean shutdown to avoid data loss:
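A minimal sketch of the shutdown sequence: stop accepting new work, let producers finish in-flight offers, then let the single consumer drain everything before it exits. The ShardedRingBuffer type refers to the sketch from Part 5; names and timeouts are illustrative.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative graceful shutdown: stop producers, then let the single consumer drain and exit.
public final class GracefulShutdown {
    private final AtomicBoolean running = new AtomicBoolean(true);

    /** Consumer loop: keeps draining until asked to stop AND all shards are empty. */
    public void consumerLoop(ShardedRingBuffer<String> buffer) {
        while (true) {
            String item = buffer.poll();
            if (item != null) {
                process(item);
            } else if (!running.get()) {
                return;                               // stop requested and nothing left to drain
            } else {
                Thread.onSpinWait();
            }
        }
    }

    public void shutdown(ExecutorService producers, Thread consumerThread) throws InterruptedException {
        // 1. Stop accepting new work; wait for in-flight offers to complete.
        producers.shutdown();
        if (!producers.awaitTermination(10, TimeUnit.SECONDS)) {
            producers.shutdownNow();                  // best effort after the grace period
        }
        // 2. Tell the consumer to finish draining, then wait for it.
        running.set(false);
        consumerThread.join(10_000);
    }

    private void process(String item) {
        // flush to downstream storage, metrics, etc.
    }
}
```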
Appendix G: Further Reading and Resources
Essential Books
For those who want to go deeper into the concepts covered in this article:
- "The Art of Multiprocessor Programming" by Maurice Herlihy and Nir Shavit - The definitive reference for lock-free algorithms. Covers the theoretical foundations including linearizability, wait-freedom, and lock-freedom. Essential for understanding correctness proofs.
- "Java Concurrency in Practice" by Brian Goetz et al. - While somewhat dated, this remains the best introduction to Java concurrency. Chapters on visibility and atomicity are particularly relevant.
- "C++ Concurrency in Action" by Anthony Williams - Even if you're a Java developer, this book provides excellent explanations of memory ordering that translate directly to VarHandle operations.
- "Is Parallel Programming Hard?" by Paul McKenney - Free online book that dives deep into memory barriers, cache coherency, and the hardware foundations of concurrent programming.
Online Resources
- Mechanical Sympathy Blog (https://mechanical-sympathy.blogspot.com/) - Martin Thompson's blog on low-latency programming. Essential reading for understanding hardware-software interactions.
- JCTools Wiki (https://github.com/JCTools/JCTools/wiki) - Documentation for the JCTools library, with excellent explanations of different queue types and their use cases.
- Disruptor Technical Paper (https://lmax-exchange.github.io/disruptor/disruptor.html) - The original paper explaining the LMAX Disruptor. Many concepts in this article derive from this work.
- Lock-Free Programming Considered Harmful - A cautionary perspective on when lock-free is (and isn't) appropriate. Good for developing a balanced view.
Related Patterns
Patterns that complement or build upon per-core sharding:
Striped Counters (LongAdder)
Java's LongAdder uses a similar sharding concept for high-contention counters:
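A minimal illustration: LongAdder keeps per-thread cells internally and only sums them on read, which is exactly the shard-then-aggregate idea.

```java
import java.util.concurrent.atomic.LongAdder;

// Illustrative: LongAdder shards the counter internally, so threads don't fight over one CAS.
public final class StripedCounterExample {
    public static void main(String[] args) throws InterruptedException {
        LongAdder counter = new LongAdder();

        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 1_000_000; j++) {
                    counter.increment();        // hits a per-thread cell, not a shared variable
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }

        System.out.println(counter.sum());      // aggregation happens only here: 8,000,000
    }
}
```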
Thread-Local Storage
The extreme form of sharding - each thread has its own copy.
Work Stealing
Complements sharding by allowing idle consumers to steal from busy shards.
Hierarchical Aggregation
For very large systems, multi-level sharding.
Acknowledgments
The techniques in this article build on decades of research and engineering. Special thanks to:
- The LMAX team for the Disruptor pattern
- Martin Thompson for Mechanical Sympathy insights
- The JCTools maintainers for battle-tested implementations
- Doug Lea for java.util.concurrent
This article is part of the Lock-Free in Java series. The complete source code and benchmarks are available at https://github.com/techishthoughts-org/off_heap_algorithms
