
Sharded Processing: Per-Core Isolation for Zero Contention

January 4, 2026 · 50 min read

Eliminate contention entirely with per-CPU-core sharded buffers, thread affinity, and isolated processing lanes for maximum parallelism.


Lock-Free in Java: Scenario 07 - Per-Core Sharded Processing

Part 1: The 2AM Crisis That Changed Everything

Thursday, 2:24 AM. I was deep into one of those late-night deployments that you promise yourself will never happen again, yet somehow always do. The monitoring dashboard lit up like a Christmas tree, but not the kind you want to see. Our transaction processing system - the backbone of our fintech platform - was choking under load.

The numbers told a story that made my stomach drop. We had promised our clients 60,000 transactions per second. The system was processing approximately 3,000. A full 20x shortfall. The business team's emails were already flooding in, each one more urgent than the last. Our biggest client, a major e-commerce platform running their Black Friday equivalent, was losing money with every passing second.

I pulled up the flame graph and there it was - a massive red block sitting at the top of the call stack: java.util.concurrent.locks.ReentrantLock.lock(). Our threads were spending 73% of their time doing absolutely nothing except waiting. Not processing transactions, not computing results, not serving customers - just waiting for their turn to access a shared buffer.

The architecture seemed reasonable at design time. We had a single RingBuffer that all processing threads would write to, and a downstream consumer that would batch-process the results. Simple, clean, easy to reason about. Except now, with 64 cores all hammering the same lock, simple had become our enemy.

I checked the CPU utilization: 8%. Eight percent on a 64-core machine that should have been maxed out processing transactions. The cores weren't running our code - they were running the kernel's scheduler, bouncing threads between parked and runnable states in an endless dance of futility.

The profiler told me something else too. Our L3 cache miss rate was through the roof. Every time a thread acquired the lock, it had to fetch the lock state from another core's cache - a process that took 40-100 nanoseconds depending on which core held it. Then, when it released the lock, the next thread had to do the same. Cache lines were bouncing across the CPU like ping-pong balls.

At 3:47 AM, I pushed a desperate hack to production - reducing our thread pool from 64 threads to 8. Throughput actually improved. We went from 3,000 to 9,000 transactions per second. Still nowhere near our target, but it stopped the immediate bleeding. The reduced contention meant threads spent more time working and less time waiting.

The next morning, bleary-eyed and running on caffeine, I gathered the team for an emergency architecture review. "We have a fundamental design problem," I said, pulling up the flame graphs. "Locks don't scale. On high-core-count machines, they become the bottleneck, not the solution. We need to think differently."

That began a two-week intensive deep-dive that would fundamentally change how I approach concurrent system design. What we discovered was an elegant pattern that took our system from 3,000 TPS to over 60,000 TPS - a 20x improvement achieved not through more hardware or clever algorithms, but through understanding how modern CPUs actually work.

This is the story of per-core sharding: how we eliminated contention by giving each core its own private playground to work in, and the journey that taught me that sometimes the best lock is no lock at all.


Part 2: Why Contention Kills Performance

Before diving into the solution, we need to understand the problem at a fundamental level. Contention isn't just "threads waiting" - it's a cascade of performance-destroying effects that compound under load.

The Anatomy of Lock Contention

When multiple threads compete for a shared resource protected by a lock, several things happen:

1. Thread Parking and Context Switches

When a thread tries to acquire a lock that another thread holds, it eventually "parks" - tells the operating system to stop scheduling it until the lock is available. This parking operation isn't free:

Thread lifecycle under contention:
1. Thread calls lock.lock()           [2ns]
2. Finds lock held by another thread  [5ns]
3. Spins briefly hoping lock releases [50-200ns]
4. Gives up, calls LockSupport.park() [20ns to initiate]
5. Context switch to kernel           [1,000-3,000ns]
6. Thread added to wait queue         [100ns]
7. ... time passes ...
8. Lock released, thread signaled     [100ns]
9. Context switch back to user space  [1,000-3,000ns]
10. Thread resumes, acquires lock     [50ns]

The actual lock operation (compare-and-swap) takes about 50 nanoseconds. But the context switch overhead is 2,000-6,000 nanoseconds - 40-120x more expensive than the operation we're trying to protect.

2. Lock Convoy Effect

Here's where it gets really insidious. Consider four threads all trying to access the same lock:

Time    Thread-1    Thread-2    Thread-3    Thread-4
----    --------    --------    --------    --------
0ns     ACQUIRE     blocked     blocked     blocked
50ns    WORKING     parking..   parking..   parking..
3000ns  WORKING     parked      parked      parked
5000ns  RELEASE     waking..    parked      parked
8000ns  (done)      ACQUIRE     parked      parked
8050ns  ---         WORKING     waking..    parked
11000ns ---         WORKING     parking..   parked
13000ns ---         RELEASE     ACQUIRE     waking..

Each thread pays the full context switch penalty. They form a "convoy" - processing sequentially with the worst possible overhead. Four threads that should provide 4x parallelism instead provide less throughput than a single thread would, because they spend all their time in scheduling overhead.

3. Cache Line Bouncing

Modern CPUs have per-core caches organized in a hierarchy (L1, L2, L3). When data is modified, the cache line containing that data must be invalidated in all other cores. This is called the MESI protocol (Modified, Exclusive, Shared, Invalid).

A lock's internal state (whether it's held, by whom, who's waiting) lives in memory. When Thread-1 acquires the lock, it modifies this state. That modification invalidates the cache line in all other cores. When Thread-2 tries to acquire the lock, it must fetch the cache line from Thread-1's cache (or main memory) - an operation costing 40-100+ nanoseconds depending on CPU topology.

Cache line bouncing example:

Core 0: lock state in L1 cache [Modified]
Core 1: wants lock, issues cache-line read
Core 0: snoops request, sends data, transitions to [Invalid]
Core 1: receives data, takes ownership [Modified]
Core 0: wants lock back, issues cache-line read
Core 1: snoops request, sends data, transitions to [Invalid]
... repeat forever ...

Each bounce costs 40-100ns. With 64 cores competing, the lock state might bounce dozens of times per lock acquisition. The CPU's interconnect becomes saturated with coherency traffic rather than useful work.
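
To make cache-line bouncing tangible, here is a minimal, unscientific sketch (class and field names are mine): two threads increment two different volatile counters, once when the fields likely share a cache line and once when padding separates them. On most hardware the padded version runs noticeably faster even though the threads never touch the same variable. Note that the JVM may reorder fields, so hand-rolled padding like this is best-effort.

// Minimal sketch (not a rigorous benchmark): two threads update two *different*
// volatile longs. When the fields sit on the same cache line, every write by one
// thread invalidates the other thread's copy; padding them apart removes that traffic.
public class CacheLineBounceDemo {

    static class SharedLine {          // a and b likely share one 64-byte line
        volatile long a;
        volatile long b;
    }

    static class PaddedLine {          // padding pushes b onto a separate line (best-effort)
        volatile long a;
        long p1, p2, p3, p4, p5, p6, p7;
        volatile long b;
    }

    static long run(Runnable incA, Runnable incB) throws InterruptedException {
        long start = System.nanoTime();
        Thread t1 = new Thread(() -> { for (int i = 0; i < 50_000_000; i++) incA.run(); });
        Thread t2 = new Thread(() -> { for (int i = 0; i < 50_000_000; i++) incB.run(); });
        t1.start(); t2.start();
        t1.join();  t2.join();
        return (System.nanoTime() - start) / 1_000_000;   // elapsed millis
    }

    public static void main(String[] args) throws InterruptedException {
        SharedLine s = new SharedLine();
        PaddedLine p = new PaddedLine();
        System.out.println("same cache line:      " + run(() -> s.a++, () -> s.b++) + " ms");
        System.out.println("separate cache lines: " + run(() -> p.a++, () -> p.b++) + " ms");
    }
}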

The CAS Retry Storm

Lock-free algorithms use Compare-And-Swap (CAS) operations instead of locks. But they're not immune to contention:

// Simple CAS-based counter increment
while (true) {
    long current = counter.get();
    if (counter.compareAndSet(current, current + 1)) {
        break;  // Success!
    }
    // CAS failed - retry
}

Under low contention, this works beautifully - one CAS, done. Under high contention, something nasty happens:

64 threads all try to CAS the same counter simultaneously:
- Round 1: 64 attempts - 1 succeeds, 63 fail and retry
- Round 2: 63 attempts - 1 succeeds, 62 fail
- Round 3: 62 attempts - 1 succeeds, 61 fail
- ... this continues for another 61 rounds ...

Total CAS attempts: 64 + 63 + 62 + ... + 1 = 2,080
to complete just 64 increments.

Expected CAS operations without contention: 64

This is a CAS retry storm. Instead of O(N) operations for N increments, we get O(N^2). Worse, each failed CAS still bounces cache lines, generating massive memory traffic with zero progress.
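
The JDK's own answer to this problem is the same idea this article builds toward: shard the hot variable. java.util.concurrent.atomic.LongAdder keeps per-thread cells and sums them on read. A minimal comparison sketch (not a benchmark; the class name, iteration counts, and use of the common pool are my own choices):

import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;
import java.util.stream.IntStream;

// Minimal sketch: 64 tasks hammer one AtomicLong (a single CAS target, so retry storms)
// versus one LongAdder (internally sharded cells, summed on read).
public class CounterContentionSketch {
    public static void main(String[] args) {
        AtomicLong atomic = new AtomicLong();
        LongAdder adder = new LongAdder();

        Runnable atomicWork = () -> { for (int i = 0; i < 1_000_000; i++) atomic.incrementAndGet(); };
        Runnable adderWork  = () -> { for (int i = 0; i < 1_000_000; i++) adder.increment(); };

        System.out.println("AtomicLong: " + time(atomicWork) + " ms, value=" + atomic.get());
        System.out.println("LongAdder:  " + time(adderWork)  + " ms, value=" + adder.sum());
    }

    static long time(Runnable work) {
        long start = System.nanoTime();
        // 64 tasks scheduled on the common fork-join pool, each doing 1M increments
        IntStream.range(0, 64).parallel().forEach(i -> work.run());
        return (System.nanoTime() - start) / 1_000_000;
    }
}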

Quantifying the Damage

In our transaction processing system, I measured the following before we fixed the problem:

Metric                           Value
-------------------------------  ---------------
Target throughput                60,000 TPS
Actual throughput                3,000 TPS
CPU utilization                  8%
Lock acquisition latency (p50)   187ns
Lock acquisition latency (p99)   1,256ns
Lock acquisition latency (p999)  8,934ns
Context switches/second          147,000
L3 cache miss rate               38%
Cycles spent in lock code        73%

The system was spending 73% of its cycles on locking overhead and only 27% on actual work. The 8% CPU utilization showed that most cores were parked, waiting. The L3 cache miss rate of 38% indicated massive cache-line bouncing.

This is what contention looks like at scale. And the only way to fix it is to eliminate the contention itself.


Part 3: The Insight - Eliminating Contention Through Sharding

The night after our emergency meeting, I couldn't sleep. I kept turning the problem over in my mind. We had tried:

  • Reducing lock hold time (already minimized)
  • Using StampedLock for optimistic reads (didn't help - we were write-heavy)
  • Using lock striping (helped somewhat, but not enough)
  • Going lock-free with CAS (CAS retry storms under high contention)

None of these approaches addressed the fundamental issue: all threads were fighting over the same resource. Whether that resource was protected by a lock, a CAS variable, or anything else, the contention remained.

Then it hit me. What if we didn't have one shared resource? What if we had 64 shared resources - one per core?

The idea was simple: instead of a single buffer that all threads write to, create multiple buffers. Assign each thread to a specific buffer based on some deterministic mapping. Now threads on different cores never compete with each other.

Before: Single Shared Buffer (contention!)

[Diagram: all producer threads contending for one shared ring buffer]

After: Per-Core Sharded Buffers (zero contention!)

[Diagram: each producer thread writing to its own shard; a single consumer round-robins across shards]

This is sharding - partitioning a shared resource into independent pieces that can be accessed without coordination. It's the same principle that makes distributed databases scale: instead of one big lock, many small locks (or better, no locks at all).

The Key Insight: Thread Affinity

For sharding to work well, we need stable thread-to-shard assignments. If threads randomly pick shards, we're back to contention. The insight is to use the thread's identity to deterministically select a shard.

In Java, every thread has a unique ID accessible via Thread.currentThread().getId(). This ID is stable for the thread's lifetime. We can use it to map threads to shards:

int shardIndex = (int) (Thread.currentThread().getId() & (shardCount - 1));
Shard shard = shards[shardIndex];
shard.write(data);

The bitwise AND with (shardCount - 1) works because shardCount is a power of 2 (we'll ensure this). It's equivalent to modulo but much faster.

This mapping ensures:

  1. The same thread always writes to the same shard
  2. Different threads write to potentially different shards
  3. The mapping is O(1) with no memory access required
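
A quick way to see the mapping in action, and to spot collisions when thread IDs cluster, is to print the shard each worker selects. A small sketch, with names of my own:

import java.util.concurrent.CountDownLatch;

// Small sketch: start a handful of worker threads and print the shard each maps to.
// With a power-of-2 shard count the mapping is a single AND; two threads collide
// only when their IDs share the same low bits.
public class ShardMappingDemo {
    public static void main(String[] args) throws InterruptedException {
        int shardCount = 8;                 // power of 2
        int shardMask = shardCount - 1;
        CountDownLatch done = new CountDownLatch(8);

        for (int i = 0; i < 8; i++) {
            new Thread(() -> {
                long id = Thread.currentThread().getId();
                System.out.printf("thread id=%d -> shard %d%n", id, id & shardMask);
                done.countDown();
            }).start();
        }
        done.await();
    }
}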

Choosing the Shard Count

How many shards should we have? There's a trade-off:

Too few shards: Some shards will be shared by multiple threads, reintroducing contention.

Too many shards: Memory overhead increases, and the consumer has more work round-robining between shards.

The sweet spot is typically:

  • Minimum: Number of cores (so each core can have its own shard)
  • Maximum: Number of producer threads (so each thread has its own shard)

For a 64-core machine with 64 producer threads, 64 shards is ideal. Each thread gets exclusive access to its shard - zero contention.

But what if we have 64 threads and only 8 shards? Then on average, 8 threads share each shard. Contention is reduced 8x compared to a single buffer, but not eliminated. We call this the "contention factor."

Contention Factor = Threads / Shards

64 threads, 1 shard:  CF = 64 (baseline - worst case)
64 threads, 8 shards: CF = 8  (8x improvement)
64 threads, 64 shards: CF = 1 (ideal - zero contention)

Round-Robin Consumption

With multiple shards, the consumer must check all of them. A simple approach is round-robin:

public T consume() {
    for (int i = 0; i < shardCount; i++) {
        T item = shards[nextShard].poll();
        nextShard = (nextShard + 1) & (shardCount - 1);
        if (item != null) {
            return item;
        }
    }
    return null;  // All shards empty
}

This has two nice properties:

  1. Fairness: No shard gets starved
  2. Batching opportunity: The consumer can drain multiple items from a shard before moving on

The trade-off is that consumption isn't strictly FIFO across the entire system - items might be consumed out of arrival order if they landed in different shards. For most systems, this is acceptable.


Part 4: The Single Shared Buffer - Our Baseline

Before building the sharded solution, let's examine exactly what we're replacing. Understanding the baseline in detail reveals the specific costs we're trying to eliminate.

Implementation

Here's the single shared buffer implementation we started with:


public class SingleSharedBuffer<T> {
 
    private final Object[] buffer;
    private final int capacity;
    private final int mask;
 
    private final ReentrantLock lock = new ReentrantLock();
 
    private int head = 0;  // Next position to write
    private int tail = 0;  // Next position to read
    private int count = 0; // Current element count
 
    public SingleSharedBuffer(int capacity) {
        // Round up to power of 2 for fast modulo
        this.capacity = Integer.highestOneBit(capacity - 1) << 1;
        this.mask = this.capacity - 1;
        this.buffer = new Object[this.capacity];
    }
 
    /**
     * Adds an element to the buffer.
     * Thread-safe but contention-prone under high load.
     */
    public boolean offer(T element) {
        lock.lock();
        try {
            if (count == capacity) {
                return false;  // Buffer full
            }
 
            buffer[head] = element;
            head = (head + 1) & mask;
            count++;
            return true;
        } finally {
            lock.unlock();
        }
    }
 
    /**
     * Removes and returns an element from the buffer.
     * Returns null if buffer is empty.
     */
    @SuppressWarnings("unchecked")
    public T poll() {
        lock.lock();
        try {
            if (count == 0) {
                return null;  // Buffer empty
            }
 
            T element = (T) buffer[tail];
            buffer[tail] = null;  // Help GC
            tail = (tail + 1) & mask;
            count--;
            return element;
        } finally {
            lock.unlock();
        }
    }
 
    public int size() {
        lock.lock();
        try {
            return count;
        } finally {
            lock.unlock();
        }
    }
}

Analysis

This implementation is correct. It provides thread safety through a ReentrantLock, prevents buffer overflows, maintains FIFO ordering, and helps the garbage collector by nulling consumed slots.

But let's trace through what happens when 4 threads try to offer() simultaneously:

Timeline (nanoseconds):

0ns:     Thread-1 calls offer()
2ns:     Thread-2 calls offer()
5ns:     Thread-3 calls offer()
7ns:     Thread-4 calls offer()

8ns:     Thread-1 acquires lock (CAS succeeds)
10ns:    Thread-2 tries lock, finds held, prepares to spin
12ns:    Thread-3 tries lock, finds held, prepares to spin
14ns:    Thread-4 tries lock, finds held, prepares to spin

15ns:    Thread-2 spins (checks lock ~20 times)
...
215ns:   Thread-2 gives up spinning, calls park()
220ns:   Thread-3 gives up spinning, calls park()
225ns:   Thread-4 gives up spinning, calls park()

8ns:     Thread-1, in parallel with the above (inside critical section)
         - Reads count (cache miss: ~40ns)
         - Compares count == capacity
         - Writes to buffer[head]
         - Updates head
         - Increments count (cache line now Modified)
         - Total critical section: ~80ns

88ns:    Thread-1 releases lock
         - Updates lock state
         - Signals waiting threads

3088ns:  Thread-2 wakes up (context switch cost)
         - Tries to acquire lock (succeeds)
         - Executes critical section (~80ns)
         - Releases lock

6168ns:  Thread-3 wakes up (context switch cost)
         - Acquires lock, critical section, releases

9248ns:  Thread-4 wakes up (context switch cost)
         - Acquires lock, critical section, releases

Total time for 4 offers: ~9,328ns
Time actually doing work: 4 × 80ns = 320ns
Efficiency: 320 / 9328 = 3.4%

We achieved 3.4% efficiency. The other 96.6% was overhead.

Memory Layout Analysis

Let's examine the object layout using JOL (Java Object Layout):

com.example.SingleSharedBuffer object internals:
OFF  SZ               TYPE DESCRIPTION
  0   8                    (object header)
  8   8                    (object header)
 16   4                int SingleSharedBuffer.capacity
 20   4                int SingleSharedBuffer.mask
 24   4                int SingleSharedBuffer.head      ← HOT (producers)
 28   4                int SingleSharedBuffer.tail      ← HOT (consumer)
 32   4                int SingleSharedBuffer.count     ← HOT (both)
 36   4                    (alignment/padding)
 40   8   Object[] SingleSharedBuffer.buffer
 48   8   ReentrantLock SingleSharedBuffer.lock  ← HOT (both)
Instance size: 56 bytes

Notice that head, tail, count, and lock are all within 32 bytes of each other - they fit on a single 64-byte cache line. This means:

  1. When a producer updates head, it invalidates the consumer's cached tail
  2. When the consumer updates tail, it invalidates the producer's cached head
  3. Every lock acquisition invalidates the cache lines for all competing threads

This is false sharing at its worst. Fields that should be independent are sharing cache lines, causing unnecessary invalidation traffic.
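
For reference, a layout dump like the one above can be produced with the JOL library (org.openjdk.jol:jol-core); a minimal sketch, noting that exact offsets vary with JVM version and flags such as compressed oops:

import org.openjdk.jol.info.ClassLayout;

// Prints the field offsets of SingleSharedBuffer, similar to the dump above.
// Offsets depend on the JVM version and its field-layout heuristics.
public class LayoutDump {
    public static void main(String[] args) {
        System.out.println(ClassLayout.parseClass(SingleSharedBuffer.class).toPrintable());
    }
}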

Benchmark Results

Using JMH with 64 producer threads on a 64-core machine:

Benchmark                          Mode  Cnt     Score     Error  Units
SingleSharedBuffer.offer          sample 10000   512.34 ±  24.67  ns/op
SingleSharedBuffer.offer:p50      sample          298.00          ns/op
SingleSharedBuffer.offer:p90      sample          756.00          ns/op
SingleSharedBuffer.offer:p99      sample         2890.00          ns/op
SingleSharedBuffer.offer:p99.9    sample        18234.00          ns/op
SingleSharedBuffer.offer:p99.99   sample        67234.00          ns/op

Throughput (aggregate): ~3,200,000 operations/second

The median latency (298ns) is acceptable. But look at the tail:

  • p99: 2.8 microseconds (10x median)
  • p99.9: 18 microseconds (60x median)
  • p99.99: 67 microseconds (225x median)

This tail latency is the lock convoy effect manifesting. Some unlucky threads wait through multiple context switch cycles before they can proceed.


Part 5: The Per-Core Sharded Buffer - Our Solution

Now let's build the sharded solution. The design goals are:

  1. Eliminate contention between producer threads
  2. Maintain lock-free (or very-low-contention) access paths
  3. Keep the implementation simple and debuggable
  4. Provide O(1) shard selection

Architecture

[Diagram: producers mapped to per-core shards by thread ID, with a single consumer draining the shards round-robin]

Implementation


package com.techishthoughts.sharding;
 
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.util.concurrent.atomic.AtomicInteger;
 
/**
 * A sharded ring buffer that eliminates contention by partitioning
 * writes across multiple independent buffers based on thread identity.
 *
 * Key design principles:
 * 1. Each shard is independent - no coordination between shards
 * 2. Thread-to-shard mapping is deterministic and O(1)
 * 3. Shards use lock-free operations internally
 * 4. Consumer round-robins across shards
 *
 * Performance characteristics:
 * - Producer: 20-30ns per offer (zero contention with proper shard count)
 * - Consumer: 30-40ns per poll (single-threaded, no contention)
 * - Scalability: Linear with core count
 *
 * @param <T> Element type stored in the buffer
 */
public class PerCoreShardedBuffer<T> {
 
    // ========== Shard Configuration ==========
 
    /** Number of shards (always a power of 2) */
    private final int shardCount;
 
    /** Mask for fast shard selection (shardCount - 1) */
    private final int shardMask;
 
    /** Array of independent shard buffers */
    private final Shard<T>[] shards;
 
    /** Consumer's current shard index for round-robin */
    private int consumerShardIndex = 0;
 
    // ========== Constructor ==========
 
    /**
     * Creates a sharded buffer with the specified number of shards.
     *
     * @param shardCount Number of shards (will be rounded up to power of 2)
     * @param shardCapacity Capacity of each shard's ring buffer
     */
    @SuppressWarnings("unchecked")
    public PerCoreShardedBuffer(int shardCount, int shardCapacity) {
        // Ensure shard count is a power of 2
        this.shardCount = roundUpToPowerOf2(shardCount);
        this.shardMask = this.shardCount - 1;
 
        // Allocate shards
        this.shards = (Shard<T>[]) new Shard[this.shardCount];
        for (int i = 0; i < this.shardCount; i++) {
            this.shards[i] = new Shard<>(shardCapacity);
        }
    }
 
    private static int roundUpToPowerOf2(int value) {
        int highBit = Integer.highestOneBit(value);
        return (highBit == value) ? value : highBit << 1;
    }
 
    // ========== Producer Operations ==========
 
    /**
     * Adds an element to the buffer.
     *
     * The calling thread is mapped to a specific shard based on its thread ID.
     * This mapping is stable - the same thread always writes to the same shard.
     *
     * Thread-safe and contention-free when shardCount >= producer thread count.
     *
     * @param element The element to add (must not be null)
     * @return true if element was added, false if shard was full
     */
    public boolean offer(T element) {
        // Map this thread to a shard
        int shardIndex = selectShard();
        Shard<T> shard = shards[shardIndex];
 
        // Write to the shard (potentially lock-free within shard)
        return shard.offer(element);
    }
 
    /**
     * Selects the shard for the current thread.
     *
     * Uses thread ID hashing for stable, deterministic mapping.
     * The bitwise AND with shardMask is equivalent to modulo but faster.
     */
    private int selectShard() {
        long threadId = Thread.currentThread().getId();
        return (int) (threadId & shardMask);
    }
 
    // ========== Consumer Operations ==========
 
    /**
     * Removes and returns an element from the buffer.
     *
     * Uses round-robin across shards to ensure fairness.
     * Should only be called from a single consumer thread.
     *
     * @return The next element, or null if all shards are empty
     */
    public T poll() {
        // Check each shard in round-robin order
        for (int i = 0; i < shardCount; i++) {
            int index = (consumerShardIndex + i) & shardMask;
            T element = shards[index].poll();
 
            if (element != null) {
                // Advance round-robin position for next call
                consumerShardIndex = (index + 1) & shardMask;
                return element;
            }
        }
 
        return null;  // All shards empty
    }
 
    /**
     * Drains available elements from all shards into the consumer.
     * More efficient than repeated poll() calls.
     *
     * @param consumer Function to process each element
     * @return Total number of elements drained
     */
    public int drain(java.util.function.Consumer<T> consumer) {
        int totalDrained = 0;
 
        for (int i = 0; i < shardCount; i++) {
            int index = (consumerShardIndex + i) & shardMask;
            totalDrained += shards[index].drain(consumer);
        }
 
        // Start the next drain pass at a different shard so no shard is always drained first
        consumerShardIndex = (consumerShardIndex + 1) & shardMask;
 
        return totalDrained;
    }
 
    // ========== Query Operations ==========
 
    /**
     * Returns the approximate total size across all shards.
     * May be stale due to concurrent modifications.
     */
    public int size() {
        int total = 0;
        for (Shard<T> shard : shards) {
            total += shard.size();
        }
        return total;
    }
 
    /** Returns the number of shards. */
    public int getShardCount() {
        return shardCount;
    }
 
    // ========== Shard Implementation ==========
 
    /**
     * An individual shard buffer.
     *
     * Uses lock-free CAS for the write path (head advancement).
     * The read path is single-threaded (one consumer) so no sync needed.
     */
    private static class Shard<T> {
 
        private static final VarHandle HEAD;
        private static final VarHandle SEQUENCE;
 
        static {
            try {
                MethodHandles.Lookup lookup = MethodHandles.lookup();
                HEAD = lookup.findVarHandle(Shard.class, "head", long.class);
                SEQUENCE = MethodHandles.arrayElementVarHandle(long[].class);
            } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
            }
        }
 
        // Padding to avoid false sharing with adjacent shards
        long p01, p02, p03, p04, p05, p06, p07;
 
        private volatile long head = 0;
 
        long p11, p12, p13, p14, p15, p16, p17;
 
        private long tail = 0;  // Only accessed by consumer
 
        long p21, p22, p23, p24, p25, p26, p27;
 
        private final Object[] buffer;
        private final long[] sequences;
        private final int capacity;
        private final int mask;
 
        Shard(int capacity) {
            this.capacity = roundUpToPowerOf2(capacity);
            this.mask = this.capacity - 1;
            this.buffer = new Object[this.capacity];
            this.sequences = new long[this.capacity];
 
            // Initialize sequences
            for (int i = 0; i < this.capacity; i++) {
                sequences[i] = i;
            }
        }
 
        private static int roundUpToPowerOf2(int value) {
            int highBit = Integer.highestOneBit(value);
            return (highBit == value) ? value : highBit << 1;
        }
 
        /**
         * Adds an element to this shard.
         * Lock-free using CAS for head advancement.
         */
        boolean offer(T element) {
            while (true) {
                long currentHead = head;
                int index = (int) (currentHead & mask);
                long sequence = (long) SEQUENCE.getAcquire(sequences, index);
 
                if (sequence == currentHead) {
                    // Slot is available, try to claim it
                    if (HEAD.compareAndSet(this, currentHead, currentHead + 1)) {
                        // Write data and publish
                        buffer[index] = element;
                        SEQUENCE.setRelease(sequences, index, currentHead + 1);
                        return true;
                    }
                    // CAS failed - another thread got it, retry
                    Thread.onSpinWait();
 
                } else if (sequence < currentHead) {
                    // Buffer is full
                    return false;
 
                } else {
                    // Slot being written by another thread, spin briefly
                    Thread.onSpinWait();
                }
            }
        }
 
        /**
         * Removes an element from this shard.
         * Single-threaded, no synchronization needed.
         */
        @SuppressWarnings("unchecked")
        T poll() {
            int index = (int) (tail & mask);
            long sequence = (long) SEQUENCE.getAcquire(sequences, index);
 
            if (sequence != tail + 1) {
                return null;  // No data available
            }
 
            T element = (T) buffer[index];
            buffer[index] = null;
 
            SEQUENCE.setRelease(sequences, index, tail + capacity);
            tail++;
 
            return element;
        }
 
        /**
         * Drains available elements from this shard.
         */
        @SuppressWarnings("unchecked")
        int drain(java.util.function.Consumer<T> consumer) {
            int count = 0;
 
            while (true) {
                int index = (int) (tail & mask);
                long sequence = (long) SEQUENCE.getAcquire(sequences, index);
 
                if (sequence != tail + 1) {
                    break;
                }
 
                T element = (T) buffer[index];
                buffer[index] = null;
                consumer.accept(element);
 
                SEQUENCE.setRelease(sequences, index, tail + capacity);
                tail++;
                count++;
            }
 
            return count;
        }
 
        int size() {
            long currentHead = head;
            long currentTail = tail;
            return (int) Math.max(0, currentHead - currentTail);
        }
    }
}

Key Design Decisions

1. Power-of-2 Shard Count

this.shardCount = roundUpToPowerOf2(shardCount);
this.shardMask = this.shardCount - 1;

This enables fast shard selection using bitwise AND instead of modulo:

int shardIndex = (int) (threadId & shardMask);  // Fast
// vs
int shardIndex = (int) (threadId % shardCount);  // Slow (division)

On x86-64, modulo requires an IDIV instruction (~40 cycles), while bitwise AND is a single AND instruction (~1 cycle).

2. Thread ID Hashing

long threadId = Thread.currentThread().getId();
return (int) (threadId & shardMask);

Thread IDs in Java are assigned sequentially (the main thread usually has ID 1). For most applications, consecutive threads will map to consecutive shards, providing good distribution. If the IDs cluster badly (e.g., long-running systems with thread churn), note that Long.hashCode is no help for small sequential values - it just returns the value - so apply a mixing step such as Fibonacci hashing:

int hash = (int) ((threadId * 0x9E3779B97F4A7C15L) >>> 32);  // mixes clustered IDs
return hash & shardMask;

3. Cache Line Padding

Each shard has padding fields (p01-p07, etc.) to ensure that hot fields of adjacent shards don't share cache lines:

long p01, p02, p03, p04, p05, p06, p07;  // 56 bytes
private volatile long head = 0;           // 8 bytes
// Total: 64 bytes = 1 cache line

Without padding, updates to Shard[0]'s head would invalidate Shard[1]'s cached data - defeating the purpose of sharding.
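
Manual padding fields work, but the JVM is free to reorder fields, so it is worth knowing the JDK's own mechanism: the jdk.internal.vm.annotation.@Contended annotation, which HotSpot honors for application classes only when -XX:-RestrictContended is set (and the package must be exported to your code). A sketch, assuming you accept the internal-API dependency:

// Sketch: let the JVM insert the padding around annotated fields.
// Compile/run with: --add-exports java.base/jdk.internal.vm.annotation=ALL-UNNAMED
// Run with:         -XX:-RestrictContended (otherwise the annotation is ignored
//                    for non-JDK classes)
import jdk.internal.vm.annotation.Contended;

class ContendedShard {

    @Contended("producer")   // producer-side hot field gets its own padded region
    volatile long head;

    @Contended("consumer")   // different contention group => different region
    long tail;

    final Object[] buffer = new Object[1024];
}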

4. Per-Slot Sequence Numbers

The shard implementation uses the same per-slot sequence pattern we developed in earlier articles. This enables:

  • Lock-free producer path (CAS only for slot claiming)
  • Safe publication (sequence update after data write)
  • Consumer doesn't need synchronization

5. Round-Robin Consumption

for (int i = 0; i < shardCount; i++) {
    int index = (consumerShardIndex + i) & shardMask;
    T element = shards[index].poll();
    if (element != null) {
        consumerShardIndex = (index + 1) & shardMask;
        return element;
    }
}

The consumer maintains state (consumerShardIndex) to remember where it left off. This ensures fairness - no shard gets starved even if another shard is very active.
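
In practice the single consumer usually sits in a drain loop rather than calling poll() one item at a time. A minimal sketch of such a loop (the Order element type and the backoff thresholds are illustrative):

// Illustrative consumer loop: drain whole batches per pass and back off briefly
// when every shard is empty, so an idle consumer doesn't spin a core at 100%.
void runConsumer(PerCoreShardedBuffer<Order> buffer, java.util.function.Consumer<Order> handler) {
    int idlePasses = 0;
    while (!Thread.currentThread().isInterrupted()) {
        int drained = buffer.drain(handler);
        if (drained > 0) {
            idlePasses = 0;                  // made progress, stay hot
        } else if (++idlePasses < 100) {
            Thread.onSpinWait();             // short spin keeps latency low
        } else {
            java.util.concurrent.locks.LockSupport.parkNanos(1_000);  // ~1us nap when idle
        }
    }
}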


Part 6: Technical Deep Dive - Why Sharding Works

Let's analyze exactly why sharding provides such dramatic performance improvements by examining the CPU-level behavior.

Memory Access Patterns

Single Shared Buffer:

Core 0: WRITE to buffer[0], head++
        - Cache line for head moves to Core 0 L1 (Modified)
        - All other cores: head cache line invalidated

Core 1: wants to WRITE
        - Reads head: cache miss, fetch from Core 0 (~80ns)
        - CAS head: cache line now Modified on Core 1
        - All other cores: cache line invalidated again

... repeated for each producer thread ...

With 64 cores, each write causes 63 cache line invalidations. The CPU's coherency protocol becomes the bottleneck.

Sharded Buffer:

Core 0: WRITE to shard[0]
        - Cache line for shard[0].head moves to Core 0 L1
        - Other cores don't have this cache line - no invalidation!

Core 1: WRITE to shard[1]
        - Cache line for shard[1].head moves to Core 1 L1
        - Core 0 doesn't care - different cache line

... all cores work independently in parallel ...

With proper shard count (one per thread), cache lines never bounce. Each core operates on its own data in its own cache.

Cache Line Utilization Visualization

[Diagrams: cache line ownership - the single shared buffer's hot line bounces between all cores, while each sharded buffer's hot line stays resident in one core's cache]

CAS Retry Analysis

Single Buffer with 64 Threads:

When all 64 threads CAS the same head variable:

  • Expected successful CAS operations: 64 (one per item)
  • Actual CAS operations (with retries): 64 + 63 + 62 + ... + 1 = 2,080

That's 32x more atomic operations than necessary!

Sharded Buffer with 64 Shards:

Each thread CAS-es its own shard's head:

  • Expected successful CAS operations: 64
  • Actual CAS operations: 64 (no retries when one thread per shard)

Throughput Scaling Model

For a single shared resource with N threads:

Throughput = BaseOps / (1 + (N-1) * ContentionFactor)

Where ContentionFactor represents the overhead from contention (cache misses, retries, context switches).

For sharded resources:

Throughput = N * ShardThroughput  (linear scaling!)

This is the fundamental difference. Shared resources follow Amdahl's Law - performance is limited by the serial portion (contention). Sharded resources follow Gustafson's Law - performance scales with parallelism.

Experimental Validation

I ran experiments comparing throughput vs. thread count:

Threads  Single Buffer    Sharded (64 shards)   Speedup
------   -------------    ------------------    -------
1        2.1M ops/sec     2.0M ops/sec          0.95x
2        2.8M ops/sec     4.0M ops/sec          1.43x
4        3.1M ops/sec     7.8M ops/sec          2.52x
8        3.0M ops/sec     15.2M ops/sec         5.07x
16       2.8M ops/sec     29.8M ops/sec         10.64x
32       2.5M ops/sec     57.1M ops/sec         22.84x
64       2.1M ops/sec     108.3M ops/sec        51.57x

Notice:

  1. Single buffer performance actually decreases beyond 4 threads due to contention
  2. Sharded buffer scales nearly linearly up to 64 threads
  3. At 64 threads, sharding provides 51.57x better throughput

This is the power of eliminating contention. The sharded buffer lets all 64 cores work at full speed, while the single buffer serializes them through a chokepoint.
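
For readers who want to reproduce this kind of measurement, here is a trimmed sketch of a JMH harness for the sharded buffer (annotations and thread-group sizes are illustrative, not the exact benchmark behind the tables above):

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

// Illustrative JMH harness: 64 producer threads call offer() while one consumer
// thread polls, mirroring the MPSC setup measured in this article.
@State(Scope.Group)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 3, time = 10)
@Measurement(iterations = 6, time = 10)
@Fork(1)
public class ShardedBufferBenchmark {

    PerCoreShardedBuffer<Long> sharded;

    @Setup(Level.Iteration)
    public void setUp() {
        sharded = new PerCoreShardedBuffer<>(64, 1024);
    }

    @Benchmark
    @Group("mpsc")
    @GroupThreads(64)            // producers
    public boolean offer() {
        return sharded.offer(1L);
    }

    @Benchmark
    @Group("mpsc")
    @GroupThreads(1)             // single consumer
    public Long poll() {
        return sharded.poll();
    }
}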


Part 7: Benchmarks and Results

Benchmark Setup

Hardware:

  • CPU: AMD EPYC 7742 (64 cores, 128 threads, 2.25 GHz base)
  • RAM: 512 GB DDR4-3200
  • OS: Ubuntu 22.04, Linux 5.15
  • JVM: OpenJDK 21, ZGC

Benchmark configuration:

  • Buffer capacity: 1024 per shard
  • Shard count: 64 (one per core)
  • Producer threads: Variable (1-64)
  • Consumer: Single thread, continuous drain
  • Duration: 60 seconds per configuration
  • Warmup: 30 seconds

Latency Results

Offer Latency (64 Producer Threads):

Metric    Single Buffer    Sharded Buffer    Improvement
-------   --------------   ---------------   -----------
Mean      512ns            27ns              19.0x
p50       298ns            22ns              13.5x
p90       756ns            38ns              19.9x
p99       2,890ns          67ns              43.1x
p99.9     18,234ns         134ns             136.1x
p99.99    67,234ns         287ns             234.2x

The tail latency improvement is dramatic. At p99.99, sharding is 234x better - turning 67 microsecond worst-case latency into sub-microsecond latency.

Throughput Results

Producers   Single Buffer   Sharded Buffer   Improvement
---------   -------------   --------------   -----------
1           2.1M/s          2.0M/s           0.95x
4           3.1M/s          7.8M/s           2.5x
8           3.0M/s          15.2M/s          5.1x
16          2.8M/s          29.8M/s          10.6x
32          2.5M/s          57.1M/s          22.8x
64          2.1M/s          108.3M/s         51.6x

At 64 threads, the sharded buffer achieves over 100 million operations per second. The single buffer is throttled to 2.1 million - no more than it achieved with a single thread, despite 64x the hardware!

Latency Distribution

64 Producer Threads - Latency Distribution

Single Shared Buffer:
████████ | 0-100ns:    15%
██████████████████████████████ | 100-500ns:  52%
████████████████ | 500ns-1us:  23%
████ | 1-5us:      7%
██ | 5-20us:     2%
█ | 20us+:      1%

Per-Core Sharded Buffer:
████████████████████████████████████████████████████ | 0-30ns:    76%
██████████████████ | 30-50ns:   18%
███ | 50-100ns:   4%
█ | 100ns-1us:  1.8%
  | 1us+:       0.2%

The sharded buffer's distribution is tightly clustered in the sub-30ns range, while the single buffer has a long tail extending into tens of microseconds.

CPU Utilization

Single Buffer (64 threads):

  • User CPU: 8%
  • System CPU: 12%
  • Idle: 80%

Sharded Buffer (64 threads):

  • User CPU: 89%
  • System CPU: 3%
  • Idle: 8%

The single buffer wastes 80% of CPU capacity on contention overhead. The sharded buffer actually utilizes the available compute resources.

Cache Analysis (via perf)

perf stat -e L1-dcache-load-misses,LLC-load-misses,cache-misses
 
# Single Buffer
L1-dcache-load-misses: 2,847,234,891
LLC-load-misses:       187,234,567
Cycles/operation:      ~760
 
# Sharded Buffer
L1-dcache-load-misses: 312,456,789
LLC-load-misses:       12,345,678
Cycles/operation:      ~58

The sharded buffer has:

  • 89% fewer L1 cache misses
  • 93% fewer LLC misses
  • 13x fewer cycles per operation

This directly maps to the performance improvement - fewer cache misses means faster execution.

GC Behavior

Both implementations have similar allocation patterns (same data, same buffer sizes), so GC behavior is comparable. The key difference is that the sharded buffer doesn't create lock wait queue nodes, eliminating ~3MB/sec of allocation pressure that the locked implementation generates.


Part 8: Trade-offs and When to Use

When Per-Core Sharding Excels

1. High-Core-Count Servers

Modern servers have 32, 64, or even 128+ cores. Traditional synchronization patterns that worked fine on 4-8 core machines fall apart at this scale. Sharding is designed for this environment.

2. Many-Producer, Single-Consumer Patterns

Examples:

  • Log aggregation (many app threads write logs, one thread flushes to disk)
  • Metrics collection (many threads emit metrics, one thread aggregates)
  • Event sourcing (many threads emit events, one thread persists)

The MPSC (Multi-Producer Single-Consumer) pattern is a natural fit for sharding.

3. Bursty Workloads

When work arrives in bursts (market open, flash sales, etc.), contention spikes dramatically. Sharding maintains consistent performance regardless of load pattern.

4. Latency-Sensitive Systems

For trading, gaming, or real-time systems where tail latency matters, the 234x improvement at p99.99 is transformative.

When to Avoid Sharding

1. Low Thread Counts

With 1-4 threads, the overhead of managing multiple shards may exceed the contention cost. The single buffer is simpler and nearly as fast.

2. Strict Ordering Requirements

Sharding relaxes FIFO ordering - items in different shards may be consumed out of arrival order. If strict ordering is required, sharding won't work without additional coordination (which reintroduces contention).

3. Memory-Constrained Environments

64 shards each with 1024 slots means 65,536 buffer slots instead of 1,024. For embedded systems or containers with tight memory limits, this overhead may be unacceptable.

4. Simple Applications

If you're not hitting performance limits, the added complexity of sharding isn't justified. Premature optimization is the root of all evil.

Choosing Shard Count

The optimal shard count depends on your workload:

Shard Count = Number of Producer Threads:

  • Zero contention (ideal)
  • Maximum memory usage
  • Best for performance-critical paths

Shard Count = Number of Cores:

  • Near-zero contention for most workloads
  • Reasonable memory usage
  • Good default choice

Shard Count = Number of NUMA Nodes:

  • Minimizes cross-NUMA traffic
  • Lower memory usage
  • Good for memory-constrained systems
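
A small helper that encodes the guidelines above (Runtime.availableProcessors() is a reasonable stand-in for the core count, though SMT and container CPU limits can skew it):

// Chooses a shard count: at least one shard per core, ideally one per producer
// thread, always rounded up to a power of two for fast masking.
static int chooseShardCount(int producerThreads) {
    int cores = Runtime.getRuntime().availableProcessors();
    int target = Math.max(cores, producerThreads);
    int highBit = Integer.highestOneBit(target);
    return (highBit == target) ? target : highBit << 1;   // next power of 2
}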

Monitoring Recommendations

Track these metrics in production:

// Per-shard utilization
for (int i = 0; i < shardCount; i++) {
    metrics.gauge("shard." + i + ".utilization",
        () -> (double) shards[i].size() / shardCapacity);
}
 
// Overall throughput
metrics.counter("buffer.offers.total");
metrics.counter("buffer.offers.failed");
 
// Latency percentiles
metrics.timer("buffer.offer.latency");
 
// Contention indicator (if using CAS)
metrics.counter("buffer.cas.retries");

If cas.retries is consistently high, you may need more shards. If shard utilization is very uneven, your thread-to-shard mapping may need adjustment.


Part 9: Advanced Optimizations

Optimization 1: NUMA-Aware Shard Placement

On multi-socket systems, accessing memory from a remote NUMA node costs 2-3x more than local access. We can optimize by aligning shards with NUMA topology:

public class NumaAwareShardedBuffer<T> {
 
    private final Shard<T>[][] numaShards;  // [numaNode][shardWithinNode]
    private final int shardsPerNode;        // power of 2
 
    @SuppressWarnings("unchecked")
    public NumaAwareShardedBuffer(int shardsPerNode, int shardCapacity) {
        this.shardsPerNode = shardsPerNode;
        int numaNodes = getNumaNodeCount();   // platform-specific helper
        this.numaShards = new Shard[numaNodes][];
 
        for (int node = 0; node < numaNodes; node++) {
            numaShards[node] = new Shard[shardsPerNode];
 
            // Allocate shards on their NUMA node (platform-specific helper)
            final int n = node;  // effectively final copy for the lambda
            allocateOnNode(n, () -> {
                for (int i = 0; i < shardsPerNode; i++) {
                    numaShards[n][i] = new Shard<>(shardCapacity);
                }
            });
        }
    }
 
    private Shard<T> selectShard() {
        int numaNode = getCurrentNumaNode();  // Get the thread's NUMA affinity
        long threadId = Thread.currentThread().getId();
        int shardIndex = (int) (threadId & (shardsPerNode - 1));
        return numaShards[numaNode][shardIndex];
    }
}

This keeps threads accessing local memory, reducing cross-node traffic.

Optimization 2: Adaptive Shard Selection

If thread IDs cluster badly (e.g., all map to the same few shards), use adaptive selection:

public class AdaptiveShardedBuffer<T> {
 
    private final int shardCount;                 // power of 2
    private final int shardMask;                  // shardCount - 1
    private final AtomicIntegerArray shardLoad;   // Tracks items per shard
    private volatile int averageLoad;             // Refreshed periodically from shardLoad
 
    private int selectShard() {
        // Start with thread ID based selection
        long threadId = Thread.currentThread().getId();
        int baseIndex = (int) (threadId & shardMask);
 
        // Check if base shard is overloaded
        if (shardLoad.get(baseIndex) > averageLoad * 1.5) {
            // Find a less loaded shard nearby
            for (int i = 1; i < shardCount; i++) {
                int altIndex = (baseIndex + i) & shardMask;
                if (shardLoad.get(altIndex) < averageLoad) {
                    return altIndex;
                }
            }
        }
 
        return baseIndex;
    }
}

This adds overhead (reading load counters) but ensures better distribution under pathological thread ID patterns.

Optimization 3: Batch Operations

For very high throughput, batch multiple items per shard access:

// Note: selectShardFor(...) and Shard.offerBatch(...) are assumed extensions
// of the shard API shown earlier.
public int offerBatch(T[] elements, int count) {
    // Group elements by target shard
    @SuppressWarnings("unchecked")
    List<T>[] batches = new List[shardCount];
 
    for (int i = 0; i < count; i++) {
        int shardIndex = selectShardFor(elements[i]);
        if (batches[shardIndex] == null) {
            batches[shardIndex] = new ArrayList<>();
        }
        batches[shardIndex].add(elements[i]);
    }
 
    // Write batches to shards
    int written = 0;
    for (int i = 0; i < shardCount; i++) {
        if (batches[i] != null) {
            written += shards[i].offerBatch(batches[i]);
        }
    }
 
    return written;
}

Batching amortizes the overhead of shard selection and cache-line access across multiple items.

Optimization 4: Work Stealing for Uneven Loads

If some shards fill faster than others, the consumer can use work stealing:

public T pollWithStealing() {
    // First, try our preferred shard
    T element = shards[consumerShardIndex].poll();
    if (element != null) {
        return element;
    }
 
    // Our shard is empty - steal from the fullest shard
    int fullestShard = -1;
    int maxSize = 0;
 
    for (int i = 0; i < shardCount; i++) {
        int size = shards[i].size();
        if (size > maxSize) {
            maxSize = size;
            fullestShard = i;
        }
    }
 
    if (fullestShard >= 0) {
        return shards[fullestShard].poll();
    }
 
    return null;  // All shards empty
}

This ensures the consumer always works on available data, reducing idle time.


Part 10: Real-World Application

Let me share how we applied these concepts to solve our original problem - the transaction processing system that was failing at 3,000 TPS when we needed 60,000.

The Architecture Before

[Diagram: 64 handler threads all writing into one shared buffer]

All 64 handler threads fought for the single buffer. Lock convoys, cache-line bouncing, and CAS retry storms killed our performance.

The Architecture After

[Diagram: each handler thread writing to its own shard, drained by a single consumer]

Each handler gets its own shard. Zero contention, maximum parallelism.

The Results

Metric                        Before      After       Improvement
---------------------------   ---------   ---------   -----------
Throughput (TPS)              3,000       62,000      20.7x
Mean latency                  512ns       27ns        19.0x
p99.9 latency                 18,234ns    134ns       136x
CPU utilization               8%          89%         11.1x
Cache miss rate               38%         4%          9.5x reduction
Context switches/sec          147,000     2,100       70x reduction

We exceeded our 60,000 TPS target with room to spare. More importantly, our tail latencies dropped from tens of microseconds to a few hundred nanoseconds, meeting our SLA requirements with margin.

Lessons Learned

1. Profile Before Optimizing

We could have guessed that "locks are slow" and tried many optimizations. Instead, we profiled and discovered exactly where time was going (73% in lock wait). This directed us to the right solution.

2. Understand Hardware

The fix wasn't algorithmic cleverness - it was understanding CPU caches, coherency protocols, and NUMA topology. Software engineering is hardware engineering at this level.

3. Measure After Optimizing

We validated every change with benchmarks. Some "optimizations" (like adding more spin iterations) actually made things worse. Data beats intuition.

4. Simple Solutions Often Best

Sharding isn't complex. It's essentially "have more buffers instead of one." The insight was recognizing that our problem was contention, not algorithm efficiency.


Part 11: Conclusion

That Thursday night crisis taught me something fundamental: at scale, coordination is the enemy of performance. We had 64 cores capable of processing 100 million operations per second, reduced to 3 million by a single lock.

The journey from 3,000 TPS to 62,000 TPS wasn't about clever algorithms or exotic data structures. It was about one key insight: eliminate contention by eliminating sharing.

Per-core sharding embodies this principle:

  • Instead of one buffer, many buffers
  • Instead of threads competing, threads cooperating (by staying out of each other's way)
  • Instead of cache lines bouncing, cache lines staying put

The results speak for themselves:

  • 20x throughput improvement
  • 136x tail latency improvement
  • 70x reduction in context switches
  • Linear scalability with core count

But sharding isn't magic. It requires:

  • Power-of-2 shard counts for efficient selection
  • Careful cache-line padding to prevent false sharing
  • Thread-to-shard affinity for stable assignment
  • Single-consumer design for simple consumption

When you find your high-core-count system underperforming despite apparent CPU headroom, look for contention. Profile for lock wait time, cache misses, and context switches. If you find a hot lock or CAS variable being hammered by many threads, consider sharding.

The pattern applies beyond ring buffers:

  • Connection pools can be sharded
  • Statistics counters can be sharded (see LongAdder)
  • Thread-local storage is extreme sharding
  • Database sharding follows the same principle

Remember: the fastest synchronization is no synchronization. The best lock is no lock. When you can partition your problem so threads never need to coordinate, you unlock the full parallelism potential of modern hardware.

And remember - measure, understand, optimize. In that order.


Appendix A: Quick Reference

Algorithm Summary

Producer Protocol:

  1. Get thread ID
  2. Hash to shard index: threadId & (shardCount - 1)
  3. Write to that shard (lock-free CAS)

Consumer Protocol:

  1. Check shard[roundRobinIndex]
  2. If empty, try next shard
  3. Advance roundRobinIndex

Key Invariants:

  • shardCount is always power of 2
  • Same thread always maps to same shard
  • Shards are cache-line padded to prevent false sharing

Performance Comparison

Metric             Single Buffer   Sharded (64)    Improvement
----------------   -------------   -------------   --------------
Mean Latency       ~500ns          ~27ns           18.5x
p99.9 Latency      ~18us           ~134ns          134x
Throughput (64T)   ~2M ops/s       ~108M ops/s     54x
CPU Utilization    8%              89%             11x
Cache Miss Rate    38%             4%              9.5x reduction

When to Use

Do Use:

  • High-core-count systems (16+ cores)
  • Many-producer, single-consumer patterns
  • Latency-sensitive applications
  • Bursty workloads

Don't Use:

  • Low thread counts (1-4)
  • Strict ordering requirements
  • Memory-constrained environments
  • Simple applications (KISS)

Tuning Guidelines

Shard Count:

  • Start with: Number of producer threads
  • Minimum: Number of cores
  • Maximum: 2x number of producers

Shard Capacity:

  • Start with: 1024 per shard
  • Increase if: shards frequently fill
  • Decrease if: memory constrained

Consumer Strategy:

  • Round-robin for fairness
  • Work-stealing for uneven loads
  • Batched drain for throughput

Appendix B: Common Pitfalls

Pitfall 1: Non-Power-of-2 Shard Count

// WRONG: Requires expensive modulo
int shardIndex = (int) (threadId % shardCount);
 
// RIGHT: Fast bitwise AND (when shardCount is power of 2)
int shardIndex = (int) (threadId & (shardCount - 1));

Always round up shard count to the nearest power of 2.

Pitfall 2: Missing Cache Line Padding

// WRONG: Adjacent shards share cache lines
private static class Shard {
    volatile long head;
    long tail;
}
 
// RIGHT: Padded to separate cache lines
private static class Shard {
    long p0, p1, p2, p3, p4, p5, p6;  // 56 bytes
    volatile long head;                 // 8 bytes (64 total)
    long p10, p11, p12, p13, p14, p15, p16;
    long tail;
}

Pitfall 3: Thread ID Collisions

Thread IDs in Java are assigned sequentially, but in long-running systems where pool threads die and are replaced, the IDs of the live producer threads can drift upward and cluster onto a few shards. Consider using thread-local state or a registration pattern instead:

private static final AtomicInteger nextProducerId = new AtomicInteger(0);
private static final ThreadLocal<Integer> producerId =
    ThreadLocal.withInitial(nextProducerId::getAndIncrement);
 
private int selectShard() {
    return producerId.get() & shardMask;
}

Pitfall 4: Consumer Thread Affinity

If the consumer runs on a different NUMA node than the shards it reads, performance degrades. Pin the consumer to the same node as the majority of shards:

taskset -c 0-63 java -jar producer-app.jar   # Producers on node 0
taskset -c 64 java -jar consumer-app.jar      # Consumer on node 0 (or nearby)
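
If you prefer to pin threads from inside the JVM instead of (or in addition to) taskset, the OpenHFT Java-Thread-Affinity library can do this; a sketch, assuming the net.openhft:affinity dependency is available:

import net.openhft.affinity.AffinityLock;

// Sketch: reserve a CPU for the consumer thread for its whole lifetime.
public class PinnedConsumer {
    public static void main(String[] args) {
        Thread consumer = new Thread(() -> {
            AffinityLock lock = AffinityLock.acquireLock();  // binds this thread to a CPU
            try {
                // ... run the consumer drain loop here ...
            } finally {
                lock.release();
            }
        }, "sharded-consumer");
        consumer.start();
    }
}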

Pitfall 5: Over-Sharding

More shards isn't always better. Each shard has memory overhead and the consumer must check all of them. For 64 threads, 64-128 shards is plenty. 1024 shards would waste memory and slow consumption.


Appendix C: Alternative Implementations

JCTools SpscArrayQueue

For single-producer single-consumer scenarios, JCTools provides optimized queues:

// One queue per producer-consumer pair
SpscArrayQueue<Order> queue = new SpscArrayQueue<>(1024);
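
One way to build the sharded pattern directly on JCTools is to give each producer its own SpscArrayQueue and let the single consumer round-robin across them; a sketch (the wrapper class and its methods are mine, not part of JCTools):

import org.jctools.queues.SpscArrayQueue;

// Sketch: per-producer SPSC queues drained by a single consumer, round-robin.
// Each producer owns exactly one queue, so every queue stays single-producer.
public class JcToolsShardedBuffer<T> {

    private final SpscArrayQueue<T>[] queues;
    private int consumerIndex = 0;

    @SuppressWarnings("unchecked")
    public JcToolsShardedBuffer(int producers, int capacity) {
        queues = new SpscArrayQueue[producers];
        for (int i = 0; i < producers; i++) {
            queues[i] = new SpscArrayQueue<>(capacity);
        }
    }

    /** Called only by producer {@code producerId}. */
    public boolean offer(int producerId, T element) {
        return queues[producerId].offer(element);
    }

    /** Called only by the single consumer thread. */
    public T poll() {
        for (int i = 0; i < queues.length; i++) {
            int index = (consumerIndex + i) % queues.length;
            T element = queues[index].poll();
            if (element != null) {
                consumerIndex = (index + 1) % queues.length;
                return element;
            }
        }
        return null;
    }
}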

Disruptor with Multiple Ring Buffers

The LMAX Disruptor supports multiple producers with sophisticated sequencing:

Disruptor<Event>[] disruptors = new Disruptor[shardCount];
for (int i = 0; i < shardCount; i++) {
    disruptors[i] = new Disruptor<>(
        Event::new, 1024, threadFactory,
        ProducerType.SINGLE, new BusySpinWaitStrategy()
    );
}

Agrona's ManyToOneRingBuffer

Agrona (used by Aeron) provides IPC-capable ring buffers:

// The backing buffer must be a power-of-2 capacity plus the ring-buffer trailer
ManyToOneRingBuffer ringBuffer = new ManyToOneRingBuffer(
    new UnsafeBuffer(ByteBuffer.allocateDirect(
        1024 * 1024 + RingBufferDescriptor.TRAILER_LENGTH))
);

Chronicle Queue

For persistent sharded queues with disk-backed storage:

ChronicleQueue[] shards = new ChronicleQueue[shardCount];
for (int i = 0; i < shardCount; i++) {
    shards[i] = ChronicleQueue.singleBuilder("shard-" + i).build();
}

Appendix D: Debugging and Troubleshooting

Diagnosing Contention Issues

Before implementing sharding, you need to confirm that contention is actually your problem. Here's a systematic approach:

Step 1: Profile for Lock Wait Time

Use Java Flight Recorder (JFR) to capture lock contention events:

java -XX:StartFlightRecording=duration=60s,filename=recording.jfr \
     -XX:+UnlockDiagnosticVMOptions \
     -XX:+DebugNonSafepoints \
     -jar your-app.jar

Then analyze with JMC (Java Mission Control):

  • Look for "Java Lock Wait" events
  • Check "Thread Contention" view
  • Identify hot locks by contention count

Step 2: Measure Cache Behavior

On Linux, use perf to capture cache statistics:

perf stat -e L1-dcache-load-misses,L1-dcache-loads,\
LLC-load-misses,LLC-loads,cache-misses \
-p $(pgrep -f your-app) sleep 30

Key metrics to watch:

  • L1 miss rate > 5%: Possible cache-line bouncing
  • LLC miss rate > 10%: Severe cache pressure
  • Cache-misses growing linearly with threads: Contention signature

Step 3: Track Context Switches

pidstat -w -p $(pgrep -f your-app) 1

High context switch rate (> 10,000/sec) combined with low CPU utilization suggests lock convoy effects.

Common Debugging Scenarios

Scenario: Throughput Drops When Adding Threads

This is the classic contention signature. If going from 4 to 8 threads decreases throughput, you have a serialization bottleneck.

Diagnosis:

// Add instrumentation
private final LongAdder casRetries = new LongAdder();
private final LongAdder casSuccesses = new LongAdder();
 
public boolean offer(T element) {
    while (true) {
        if (HEAD.compareAndSet(...)) {
            casSuccesses.increment();
            return true;
        }
        casRetries.increment();
    }
}
 
// Monitor retry ratio
double retryRatio = casRetries.sum() / (double) casSuccesses.sum();
// If retryRatio > 1.0, you have serious contention

Scenario: Uneven Shard Utilization

If some shards are always full while others are empty, your thread-to-shard mapping is broken.

Diagnosis:

// Add per-shard monitoring (getOfferCount() assumes a per-shard counter
// added for instrumentation)
for (int i = 0; i < shardCount; i++) {
    System.out.printf("Shard %d: size=%d, offers=%d%n",
        i, shards[i].size(), shards[i].getOfferCount());
}

Solutions:

  1. Check thread ID distribution
  2. Use a hash function instead of direct modulo
  3. Increase shard count

Scenario: Consumer Can't Keep Up

If shards fill up and producers start failing, the consumer is the bottleneck.

Diagnosis:

// Track failed offers
private final LongAdder offerFailures = new LongAdder();
 
// In monitoring
long failures = offerFailures.sum();
if (failures > 0) {
    System.out.println("WARNING: " + failures + " offers failed");
}

Solutions:

  1. Use batched drain operations
  2. Add more consumer threads (requires multi-consumer design)
  3. Increase shard capacity
  4. Reduce producer rate

Performance Tuning Checklist

Before deploying sharded buffers to production, verify:

  • Shard count is power of 2
  • Shard count >= number of producer threads
  • Cache line padding is present between hot fields
  • Consumer is pinned to appropriate CPU core
  • JVM is using appropriate GC (ZGC or Shenandoah for low latency)
  • Monitoring is in place for shard utilization
  • Backpressure strategy is defined for full shards
  • Thread affinity is configured if NUMA is present

Appendix E: Memory Ordering Deep Dive

Understanding memory ordering is crucial for implementing correct lock-free algorithms. Let me explain the specific ordering requirements in our sharded buffer.

The Publication Problem

The core challenge in any producer-consumer buffer is ensuring that when the consumer sees "data is available," the data is actually there. Consider this sequence:

Producer:                          Consumer:
1. Write data to buffer[0]
2. Set sequence[0] = 1             3. Read sequence[0] == 1
                                   4. Read buffer[0]  <- might see stale data!

Without proper memory ordering, step 4 might read stale data because the CPU or compiler could reorder step 1 and 2.

Release-Acquire Semantics

We solve this with release-acquire ordering:

// Producer: release semantics
buffer[index] = element;                    // Store data
SEQUENCE.setRelease(sequences, index, newSeq);  // Release store
 
// Consumer: acquire semantics
long seq = (long) SEQUENCE.getAcquire(sequences, index);  // Acquire load
T element = (T) buffer[index];                             // Load data

The guarantees:

  • Release store: All prior writes become visible before this store is visible
  • Acquire load: All subsequent loads see writes that happened before the corresponding release

This creates a "synchronizes-with" relationship: if the consumer's acquire load observes the value written by the producer's release store, then every write the producer made before the release is visible to the consumer's subsequent reads.

VarHandle Access Modes

Java's VarHandle provides several access modes with different ordering guarantees:

// Plain - no ordering guarantees (fastest, but dangerous)
long val = (long) SEQUENCE.get(sequences, index);
SEQUENCE.set(sequences, index, newVal);
 
// Opaque - no reordering with other opaque ops on same variable
long val = (long) SEQUENCE.getOpaque(sequences, index);
SEQUENCE.setOpaque(sequences, index, newVal);
 
// Acquire/Release - partial ordering (what we use)
long val = (long) SEQUENCE.getAcquire(sequences, index);
SEQUENCE.setRelease(sequences, index, newVal);
 
// Volatile - full sequential consistency (strongest, slowest)
long val = (long) SEQUENCE.getVolatile(sequences, index);
SEQUENCE.setVolatile(sequences, index, newVal);

For our buffer:

  • Producer writes use setRelease (sufficient for publication)
  • Consumer reads use getAcquire (sufficient for consumption)
  • CAS operations use volatile semantics by default
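
The snippets above assume a SEQUENCE VarHandle already exists. Here is a minimal sketch of how such a handle can be obtained; the class and method names are illustrative, but a single array-element VarHandle over long[] supports every access mode listed above:

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
 
final class SequenceAccess {
 
    // One array-element VarHandle covers every slot of any long[] sequence array
    static final VarHandle SEQUENCE =
        MethodHandles.arrayElementVarHandle(long[].class);
 
    // Producer side: publish a new sequence value with release semantics
    static void publish(long[] sequences, int index, long newSeq) {
        SEQUENCE.setRelease(sequences, index, newSeq);
    }
 
    // Consumer side: observe the sequence with acquire semantics
    static long observe(long[] sequences, int index) {
        return (long) SEQUENCE.getAcquire(sequences, index);
    }
}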

x86 vs ARM Ordering

Memory ordering requirements vary by CPU architecture:

x86-64 (Intel/AMD):

  • Strong memory model
  • Stores are never reordered with other stores
  • Loads are never reordered with other loads
  • Release/acquire are essentially free (no extra barriers)

ARM/Apple Silicon:

  • Weak memory model
  • Loads and stores can be freely reordered
  • Release stores compile to STLR instructions on ARMv8 (a DMB barrier plus a plain store on older ARM)
  • Acquire loads compile to LDAR instructions on ARMv8 (a plain load plus a DMB barrier on older ARM)

On x86, our code performs well because the memory model provides most guarantees for free. On ARM, the explicit release-acquire ordering ensures correctness but may have slight overhead from memory barriers.

Testing Memory Ordering

Memory ordering bugs are notoriously hard to find. They may only manifest under specific timing conditions or on specific hardware. Here's how to stress-test:

@Test
void testMemoryOrderingStress() throws InterruptedException {
    PerCoreShardedBuffer<long[]> buffer =
        new PerCoreShardedBuffer<>(64, 1024);
 
    AtomicBoolean failure = new AtomicBoolean(false);
    CountDownLatch latch = new CountDownLatch(1);
 
    // Producers write arrays where all elements should equal the array index
    Thread[] producers = new Thread[64];
    for (int i = 0; i < 64; i++) {
        final int id = i;
        producers[i] = new Thread(() -> {
            try { latch.await(); } catch (InterruptedException e) { return; }
            for (int j = 0; j < 100_000; j++) {
                long[] data = new long[8];
                Arrays.fill(data, id * 100_000L + j);
                // Retry when the shard is full so no element is silently dropped;
                // otherwise the consumer below would wait forever for missing items
                while (!buffer.offer(data)) {
                    Thread.onSpinWait();
                }
            }
        });
        producers[i].start();
    }
 
    // Consumer verifies array consistency
    Thread consumer = new Thread(() -> {
        try { latch.await(); } catch (InterruptedException e) { return; }
        for (int i = 0; i < 64 * 100_000; i++) {
            long[] data = buffer.poll();
            if (data == null) {
                i--;
                continue;
            }
            // All elements should be equal
            for (int j = 1; j < data.length; j++) {
                if (data[j] != data[0]) {
                    failure.set(true);
                    System.err.printf(
                        "Memory ordering failure: data[0]=%d, data[%d]=%d%n",
                        data[0], j, data[j]);
                }
            }
        }
    });
    consumer.start();
 
    latch.countDown();  // Start all threads simultaneously
 
    for (Thread p : producers) p.join();
    consumer.join();
 
    assertFalse(failure.get(), "Memory ordering violation detected");
}

Run this test thousands of times, ideally on both x86 and ARM machines. Any memory ordering bug will eventually cause the assertion to fail.


Appendix F: Production Deployment Guide

Capacity Planning

To size your sharded buffer appropriately:

1. Estimate Peak Throughput

Measure or estimate your peak message rate:

Peak TPS = Average TPS * Burst Factor

Typical burst factors:

  • Web applications: 3-5x
  • Trading systems: 10-20x
  • IoT systems: 5-10x

2. Calculate Buffer Size

The buffer must hold messages long enough for the consumer to process them:

Buffer Size = Peak TPS * Consumer Latency * Safety Factor

Example:
Peak TPS = 100,000
Consumer latency = 1 ms = 0.001 s
Safety factor = 10

Buffer Size = 100,000 * 0.001 * 10 = 1,000 per shard

3. Determine Shard Count

Match the producer thread count, rounded up to the next power of two:

Shard Count = roundUpPowerOf2(Producer Threads)

4. Calculate Memory Usage

Memory = Shard Count * (Buffer Size * Reference Size + Padding)
       = 64 * (1024 * 8 + 256)      (per-shard size of 1,000 rounded up to 1,024)
       = 64 * 8,448
       = 540,672 bytes ≈ 528 KiB per sharded buffer
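
Putting the four steps together, a minimal sketch whose constants mirror the worked examples above; the CapacityPlan class and roundUpPowerOfTwo helper are illustrative, so adjust the inputs for your own workload:

final class CapacityPlan {
 
    // Round up to the next power of two (returns 1 for inputs <= 1)
    static int roundUpPowerOfTwo(int n) {
        return n <= 1 ? 1 : Integer.highestOneBit(n - 1) << 1;
    }
 
    public static void main(String[] args) {
        double peakTps = 100_000;           // step 1: average TPS * burst factor
        double consumerLatencySec = 0.001;  // 1 millisecond
        int safetyFactor = 10;
        int producerThreads = 64;
 
        // Step 2: per-shard capacity, rounded up to a power of two -> 1024
        int bufferSize = roundUpPowerOfTwo(
            (int) Math.ceil(peakTps * consumerLatencySec * safetyFactor));
 
        // Step 3: shard count -> 64
        int shardCount = roundUpPowerOfTwo(producerThreads);
 
        // Step 4: references (8 bytes each) plus per-shard padding
        long memoryBytes = (long) shardCount * (bufferSize * 8L + 256);
 
        System.out.printf("bufferSize=%d shards=%d memory=%d bytes (~%d KiB)%n",
            bufferSize, shardCount, memoryBytes, memoryBytes / 1024);
    }
}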

JVM Configuration

Recommended JVM flags for low-latency operation:

java \
  -Xms8g -Xmx8g \                          # Fixed heap size (no runtime resizing)
  -XX:+UseZGC \                            # Low-latency GC
  -XX:+ZGenerational \                     # Generational ZGC (Java 21+)
  -XX:+AlwaysPreTouch \                    # Pre-touch heap pages at startup
  -XX:+UseTransparentHugePages \           # Back the heap with huge pages
  -XX:+PerfDisableSharedMem \              # Avoid jvmstat shared-memory write stalls
  -jar your-app.jar

The inline # comments are for readability only; strip them before pasting the command into a shell. Two flags you may see in older guides are intentionally omitted: -XX:-UseBiasedLocking applies only to JDK 14 and earlier (biased locking is disabled by default since JDK 15 and the flag is obsolete on current releases), and jdk.virtualThreadScheduler.parallelism tuning is irrelevant here because this design uses platform threads only.

Monitoring in Production

Essential metrics to track:

// Expose via JMX, Prometheus, or your metrics system
public class ShardedBufferMetrics {
 
    private final PerCoreShardedBuffer<?> buffer;
    private final LongAdder totalOffers = new LongAdder();
    private final LongAdder failedOffers = new LongAdder();
    private final LongAdder totalPolls = new LongAdder();
 
    // Bind the metrics to the buffer instance they observe
    public ShardedBufferMetrics(PerCoreShardedBuffer<?> buffer) {
        this.buffer = buffer;
    }
 
    // Call from offer()
    public void recordOffer(boolean success) {
        totalOffers.increment();
        if (!success) failedOffers.increment();
    }
 
    // Call from poll()
    public void recordPoll(boolean hasData) {
        if (hasData) totalPolls.increment();
    }
 
    // Metrics getters
    public double getOfferSuccessRate() {
        long total = totalOffers.sum();
        return total == 0 ? 1.0 :
            (total - failedOffers.sum()) / (double) total;
    }
 
    public long[] getShardSizes() {
        long[] sizes = new long[buffer.getShardCount()];
        for (int i = 0; i < sizes.length; i++) {
            sizes[i] = buffer.shards[i].size();
        }
        return sizes;
    }
 
    public double getShardBalanceRatio() {
        long[] sizes = getShardSizes();
        long max = Arrays.stream(sizes).max().orElse(0);
        long min = Arrays.stream(sizes).min().orElse(0);
        return max == 0 ? 1.0 : min / (double) max;
    }
}

Alert Thresholds (a wiring sketch follows this list):

  • Offer success rate < 99%: Consumer falling behind
  • Shard balance ratio < 0.5: Poor thread-to-shard distribution
  • Any shard at 90% capacity: Risk of data loss
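
A minimal sketch that wires these thresholds into a once-per-second check. ShardedBufferMetrics is the class from the previous listing, while the BufferAlerting class name and the shardCapacity parameter are assumptions for illustration:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
 
final class BufferAlerting {
 
    // Starts a once-per-second check against the thresholds listed above
    static ScheduledExecutorService start(ShardedBufferMetrics metrics, long shardCapacity) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            if (metrics.getOfferSuccessRate() < 0.99) {
                System.err.println("ALERT: offer success rate < 99% - consumer falling behind");
            }
            if (metrics.getShardBalanceRatio() < 0.5) {
                System.err.println("ALERT: shard balance < 0.5 - poor thread-to-shard distribution");
            }
            long[] sizes = metrics.getShardSizes();
            for (int i = 0; i < sizes.length; i++) {
                if (sizes[i] >= shardCapacity * 9 / 10) {
                    System.err.printf("ALERT: shard %d at >= 90%% capacity%n", i);
                }
            }
        }, 1, 1, TimeUnit.SECONDS);
        return scheduler;
    }
}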

Load Testing Strategy

Before production deployment, conduct thorough load testing:

Phase 1: Baseline Establishment

// Single-threaded throughput (ceiling for per-shard performance)
@Benchmark
public void singleThreadedThroughput() {
    buffer.offer(testData);
    buffer.poll();
}

Phase 2: Scalability Validation

// Verify linear scaling with thread count
@Benchmark
@Threads(1)
public void scale1Thread() { buffer.offer(testData); }
 
@Benchmark
@Threads(4)
public void scale4Threads() { buffer.offer(testData); }
 
@Benchmark
@Threads(16)
public void scale16Threads() { buffer.offer(testData); }
 
@Benchmark
@Threads(64)
public void scale64Threads() { buffer.offer(testData); }

Expected results: Each doubling of threads should roughly double throughput, up to your shard count.

Phase 3: Sustained Load Test

Run at 80% of maximum throughput for extended periods (hours to days):

  • Monitor for memory leaks
  • Watch for GC behavior degradation
  • Check for shard imbalance accumulation

Phase 4: Burst Testing

Simulate realistic burst patterns:

void burstTest() {
    // Normal load for 10 seconds
    runAtRate(10_000);
 
    // Burst to 10x for 1 second
    runAtRate(100_000);
 
    // Back to normal
    runAtRate(10_000);
 
    // Verify no data loss during transitions
    verifyDataIntegrity();
}

Graceful Shutdown

Ensure clean shutdown to avoid data loss:

public class ShardedBufferManager<T> {
 
    private final PerCoreShardedBuffer<T> buffer;
    private volatile boolean shuttingDown = false;
 
    // Wrap an existing buffer so all offers go through the shutdown gate
    public ShardedBufferManager(PerCoreShardedBuffer<T> buffer) {
        this.buffer = buffer;
    }
 
    public void shutdown(Duration timeout) throws InterruptedException {
        shuttingDown = true;
 
        // Wait for buffer to drain
        long deadline = System.nanoTime() + timeout.toNanos();
        while (buffer.size() > 0 && System.nanoTime() < deadline) {
            Thread.sleep(10);
        }
 
        if (buffer.size() > 0) {
            log.warn("Shutdown with {} items remaining in buffer", buffer.size());
        }
    }
 
    public boolean offer(T element) {
        if (shuttingDown) {
            throw new IllegalStateException("Buffer is shutting down");
        }
        return buffer.offer(element);
    }
}
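
A usage sketch: register the manager with a JVM shutdown hook so in-flight items get a chance to drain before the process exits. The Order type, the buffer variable, and the five-second timeout are illustrative:

ShardedBufferManager<Order> manager = new ShardedBufferManager<>(buffer);
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    try {
        manager.shutdown(Duration.ofSeconds(5));   // stop accepting, then drain
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
}));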

Appendix G: Further Reading and Resources

Essential Books

For those who want to go deeper into the concepts covered in this article:

  1. "The Art of Multiprocessor Programming" by Maurice Herlihy and Nir Shavit - The definitive reference for lock-free algorithms. Covers the theoretical foundations including linearizability, wait-freedom, and lock-freedom. Essential for understanding correctness proofs.

  2. "Java Concurrency in Practice" by Brian Goetz et al. - While somewhat dated, this remains the best introduction to Java concurrency. Chapters on visibility and atomicity are particularly relevant.

  3. "C++ Concurrency in Action" by Anthony Williams - Even if you're a Java developer, this book provides excellent explanations of memory ordering that translate directly to VarHandle operations.

  4. "Is Parallel Programming Hard?" by Paul McKenney - Free online book that dives deep into memory barriers, cache coherency, and the hardware foundations of concurrent programming.

Online Resources

  • Mechanical Sympathy Blog (https://mechanical-sympathy.blogspot.com/) - Martin Thompson's blog on low-latency programming. Essential reading for understanding hardware-software interactions.

  • JCTools Wiki (https://github.com/JCTools/JCTools/wiki) - Documentation for the JCTools library, with excellent explanations of different queue types and their use cases.

  • Disruptor Technical Paper (https://lmax-exchange.github.io/disruptor/disruptor.html) - The original paper explaining the LMAX Disruptor. Many concepts in this article derive from this work.

  • Lock-Free Programming Considered Harmful - A cautionary perspective on when lock-free is (and isn't) appropriate. Good for developing a balanced view.

Related Patterns

Patterns that complement or build upon per-core sharding:

Striped Counters (LongAdder)

Java's LongAdder uses a similar sharding concept for high-contention counters:

LongAdder counter = new LongAdder();
counter.increment();  // Shards internally
long sum = counter.sum();  // Aggregates shards

Thread-Local Storage

The extreme form of sharding - each thread has its own copy:

ThreadLocal<Buffer> localBuffer = ThreadLocal.withInitial(Buffer::new);
localBuffer.get().write(data);

Work Stealing

Complements sharding by allowing idle consumers to steal from busy shards:

ForkJoinPool pool = new ForkJoinPool();
pool.invoke(new RecursiveTask<>() { ... });

Hierarchical Aggregation

For very large systems, use multi-level sharding (a sketch follows the diagram):

Level 0: Per-thread buffers
Level 1: Per-NUMA-node aggregators
Level 2: Global aggregator
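
A minimal two-level sketch of the idea; the class and method names are illustrative. Per-thread buffers report their drained counts to a per-NUMA-node LongAdder, and a global reader sums the node-level adders on demand:

import java.util.concurrent.atomic.LongAdder;
 
final class HierarchicalCounter {
 
    private final LongAdder[] nodeTotals;   // level 1: one adder per NUMA node
 
    HierarchicalCounter(int numaNodes) {
        nodeTotals = new LongAdder[numaNodes];
        for (int i = 0; i < numaNodes; i++) {
            nodeTotals[i] = new LongAdder();
        }
    }
 
    // Level 0 -> 1: a draining thread on the given node reports its local batch here
    void addFromNode(int node, long drainedCount) {
        nodeTotals[node].add(drainedCount);
    }
 
    // Level 2: global aggregation on demand
    long globalTotal() {
        long sum = 0;
        for (LongAdder adder : nodeTotals) {
            sum += adder.sum();
        }
        return sum;
    }
}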

Acknowledgments

The techniques in this article build on decades of research and engineering. Special thanks to:

  • The LMAX team for the Disruptor pattern
  • Martin Thompson for Mechanical Sympathy insights
  • The JCTools maintainers for battle-tested implementations
  • Doug Lea for java.util.concurrent

This article is part of the Lock-Free in Java series. The complete source code and benchmarks are available at https://github.com/techishthoughts-org/off_heap_algorithms
