Skip to content

HDDS-15359. Read chunks into pooled direct buffers#10353

Draft
smengcl wants to merge 1 commit into
apache:masterfrom
smengcl:bufferutils-direct
Draft

HDDS-15359. Read chunks into pooled direct buffers#10353
smengcl wants to merge 1 commit into
apache:masterfrom
smengcl:bufferutils-direct

Conversation

@smengcl
Copy link
Copy Markdown
Contributor

@smengcl smengcl commented May 24, 2026

Generated-by: Claude Code (Opus 4.7)

What changes were proposed in this pull request?

BufferUtils.assignByteBuffers previously allocated heap ByteBuffers for chunk reads. Reading from a FileChannel into a heap buffer forces the JDK to first read into a thread-local DirectByteBuffer (in sun.nio.ch.IOUtil) and then memcpy the chunk-sized payload into the user's heap buffer. So every default-config chunk read paid one extra user-space chunk-sized memcpy plus one chunk-sized heap allocation.

Two prior PRs added opt-in direct-buffer paths to skip exactly this copy:

  • HDDS-7188 (Tsz-Wo Nicholas Sze, Jan 2025): ChunkUtils.readDataNettyChunkedNioFile via Netty PooledByteBufAllocator.DEFAULT
  • HDDS-10488 (Sammi Chen, Sep 2024): MappedBufferManager mmap reads with semaphore-tracked quota

Both deliberately avoided raw ByteBuffer.allocateDirect because it is a known leak pattern under sustained load. Tiny wrapper objects mean GC fires rarely, the Cleaner that frees direct memory never runs promptly, and the native budget exhausts with OutOfMemoryError: Direct buffer memory. Neither PR updated the default assignByteBuffers path, so the heap allocation stayed.

This change uses Hadoop's org.apache.hadoop.util.DirectBufferPool, the same pool already used by BlockOutputStream and KeyValueContainerCheck. Release plumbing lives entirely in ChunkUtils.readData: a release lambda is registered on DispatcherContext.setReleaseMethod, fires once in GrpcXceiverService's finally block after responseObserver.onNext returns (the standard server marshaller is synchronous, so the response bytes are already serialized into the Netty outgoing buffer by then), and returns each buffer to the pool. Paths without a release-supporting context fall back to GC reclaim via the pool's WeakReference layer, so the pool degrades to plain allocateDirect rather than leaking. ChunkBuffer and ChunkBufferImplWithByteBufferList are unchanged from master.

A consumer audit confirmed no .array() usage breaks with direct buffers. ChecksumByteBufferImpl guards .array() with hasArray(), and UnsafeByteOperations.unsafeWrap of a direct ByteBuffer produces a zero-copy NioByteString without calling .array(). One test (TestChunkUtils.concurrentReadWriteOfSameFile) called .array() directly and was updated to use get(byte[]).

Expected impact

Per chunk read:

  • One chunk-sized user-space memcpy eliminated (~270 µs for a 4 MB chunk on a typical x86 server with ~15 GB/s memory bandwidth).
  • One chunk-sized heap allocation eliminated (heap churn moves to bounded direct memory in the pool).

Wall-clock effect depends on storage:

Storage Latency per chunk Effect
Page-cache hit sub-ms total meaningful (memcpy was a large fraction)
NVMe sequential ~1 ms total ~20 percent reduction
SATA SSD ~10 ms total a few percent
HDD tens of ms total invisible at wall-clock

Heap-churn reduction is consistent across storage tiers: at 100 MB/s read throughput a DN sheds roughly 100 MB/s of heap allocation.

Pool sizing and operational note

DirectBufferPool is unbounded by design (per ConcurrentMap<Integer, Queue<WeakReference<ByteBuffer>>>). Steady-state pool memory converges to peak concurrent in-flight buffers per capacity class. For default-config chunk reads (4 MB) on a 64-thread DN, that is roughly 256 MB of direct memory. Operators should ensure -XX:MaxDirectMemorySize accommodates this; the existing BlockOutputStream and KeyValueContainerCheck consumers of DirectBufferPool are already sized this way.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-15359

How was this patch tested?

  • Existing tests

Generated-by: Claude Code (Opus 4.7)
@ChenSammi
Copy link
Copy Markdown
Contributor

@smengcl , thanks for working on this. The idea overall looks good to me. Stream read uses KeyValueHandler#readBlockImpl, it's not covered in the patch.

Further more, we should consider add a limit to the direct buffer pool in case it causes OOM failure. There is an existing example, BoundedElasticByteBufferPool.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants