
DPL: add benchmark for memfd based message passing #15421

Open
ktf wants to merge 1 commit into AliceO2Group:dev from ktf:pr15421

@ktf
Member

@ktf ktf commented May 21, 2026

No description provided.

@ktf ktf requested a review from a team as a code owner May 21, 2026 07:33
@ktf
Member Author

ktf commented May 21, 2026

@davidrohr @shahor02 @rbx, what do you think of the following approach to shared-memory-based message passing?

  • The sender creates a memfd (or a similar mechanism on macOS) per timeframe (TF) and fills it with a bump allocator, packing all the messages of that timeframe together.
  • On send, the sender seals the file descriptor (so no one can write to it anymore) and passes it to the receiver over a Unix domain socket (UDS).
  • On receive, the receiver maps the fd with PROT_READ.
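A minimal sender-side sketch of the first two steps, assuming Linux with glibc 2.27 or later (for the `memfd_create` wrapper); the macOS path is not covered. The class name `TimeframeBuffer`, the 64-byte default alignment and the exact set of seals are illustrative assumptions, not code from this PR. One detail worth knowing: `F_SEAL_WRITE` is refused with `EBUSY` while a writable shared mapping of the memfd still exists, so the sketch unmaps before sealing.

```cpp
#include <sys/mman.h> // memfd_create, mmap, munmap
#include <fcntl.h>    // fcntl, F_ADD_SEALS, F_SEAL_*
#include <unistd.h>   // ftruncate, close
#include <cstddef>
#include <cstring>
#include <stdexcept>

// Hypothetical per-TF buffer: one memfd, filled by a bump allocator.
class TimeframeBuffer
{
 public:
  explicit TimeframeBuffer(std::size_t capacity) : mCapacity(capacity)
  {
    // MFD_ALLOW_SEALING is required, otherwise F_ADD_SEALS fails with EPERM.
    mFd = memfd_create("tf", MFD_CLOEXEC | MFD_ALLOW_SEALING);
    if (mFd < 0) {
      throw std::runtime_error("memfd_create failed");
    }
    void* p = MAP_FAILED;
    if (ftruncate(mFd, static_cast<off_t>(capacity)) == 0) {
      p = mmap(nullptr, capacity, PROT_READ | PROT_WRITE, MAP_SHARED, mFd, 0);
    }
    if (p == MAP_FAILED) {
      close(mFd);
      throw std::runtime_error("cannot size or map memfd");
    }
    mBase = static_cast<char*>(p);
  }
  ~TimeframeBuffer()
  {
    if (mBase != nullptr) {
      munmap(mBase, mCapacity);
    }
    close(mFd);
  }
  TimeframeBuffer(const TimeframeBuffer&) = delete;
  TimeframeBuffer& operator=(const TimeframeBuffer&) = delete;

  // Bump allocation: return the next chunk aligned to `align` (a power of two).
  // Chunks are never freed individually; the whole memfd goes away at once.
  void* allocate(std::size_t size, std::size_t align = 64)
  {
    std::size_t offset = (mUsed + align - 1) & ~(align - 1);
    if (mBase == nullptr || offset + size > mCapacity) {
      return nullptr; // already sealed, or out of space
    }
    mUsed = offset + size;
    return mBase + offset;
  }

  // Convenience: copy one message into the buffer; nullptr if it does not fit.
  void* append(const void* data, std::size_t size, std::size_t align = 64)
  {
    void* dst = allocate(size, align);
    if (dst != nullptr) {
      std::memcpy(dst, data, size);
    }
    return dst;
  }

  std::size_t used() const { return mUsed; }

  // Drop the writable mapping, shrink the file to the used size, then seal.
  // Pointers returned by allocate() are invalid afterwards. Returns -1 on error.
  int seal()
  {
    if (mBase != nullptr) {
      munmap(mBase, mCapacity);
      mBase = nullptr;
    }
    if (ftruncate(mFd, static_cast<off_t>(mUsed)) != 0 ||
        fcntl(mFd, F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE | F_SEAL_SEAL) != 0) {
      return -1;
    }
    return mFd; // ready to be passed to a receiver
  }

 private:
  int mFd = -1;
  char* mBase = nullptr;
  std::size_t mCapacity = 0;
  std::size_t mUsed = 0;
};
```

Shrinking to the used size before sealing means the unused tail of the last page (plus any alignment padding) is the only per-TF waste in this sketch.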

This solves a few issues of the current shared memory backend:

  • No need to preallocate the segment.
  • Receivers cannot overwrite the sender's memory (the receiver maps the fd read-only).
  • Reference counting can be done with a simple dup.
  • No more memory fragmentation (at the cost of one page of overhead per TF).
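A receiver-side sketch of the read-only and reference-counting points, under the same Linux assumptions. `makeSealedMemfd`, `mapReadOnly` and `writableMappingRefused` are hypothetical helpers; the first only stands in for the sender. The mapping uses `MAP_PRIVATE` because some older kernels refuse even a read-only `MAP_SHARED` mapping of a write-sealed memfd; a read-only private mapping still maps the same underlying pages, so nothing is copied. A writable shared mapping is rejected with `EPERM`, which is what prevents receivers from modifying the sender's data. The "reference count" is simply the set of open descriptors and mappings: the memory is released only after the last `dup`'d fd is closed and the last mapping is removed.

```cpp
#include <sys/mman.h> // memfd_create, mmap, munmap
#include <fcntl.h>    // fcntl, F_ADD_SEALS
#include <unistd.h>   // write, close, dup, pread
#include <cerrno>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for the sender: a small memfd holding `data`, fully sealed.
int makeSealedMemfd(const char* data, std::size_t size)
{
  int fd = memfd_create("tf", MFD_CLOEXEC | MFD_ALLOW_SEALING);
  if (fd < 0) {
    return -1;
  }
  if (write(fd, data, size) != static_cast<ssize_t>(size) ||
      fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE | F_SEAL_SEAL) != 0) {
    close(fd);
    return -1;
  }
  return fd;
}

// Receiver side: map the sealed fd read-only (no copy of the payload).
const char* mapReadOnly(int fd, std::size_t size)
{
  void* p = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
  return p == MAP_FAILED ? nullptr : static_cast<const char*>(p);
}

// True if the kernel refuses a writable shared mapping (expected once F_SEAL_WRITE is set).
bool writableMappingRefused(int fd, std::size_t size)
{
  void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  if (p != MAP_FAILED) {
    munmap(p, size);
    return false;
  }
  return errno == EPERM;
}
```

For example, a consumer that needs to keep a TF alive beyond its current owner can take `int ref = dup(fd);` and drop it with `close(ref)` later; the data stays readable as long as any such reference or mapping remains.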

Do you see any particularly obvious drawbacks?

@shahor02
Collaborator

This would indeed minimize fragmentation, but how will forwarding be done?
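The thread does not settle how forwarding would work. One possible mechanism, sketched here as an assumption rather than something proposed in the thread: a descriptor received over a UDS is an ordinary fd, so an intermediate device can pass it on with another `SCM_RIGHTS` message and close its own copy; only the descriptor travels, never the payload. `sendFd`, `recvFd` and `forwardFd` are hypothetical helper names.

```cpp
#include <sys/socket.h> // sendmsg, recvmsg, SCM_RIGHTS, CMSG_*
#include <sys/uio.h>    // iovec
#include <unistd.h>     // close
#include <cstring>

// Send one fd over a connected Unix domain socket. One dummy data byte is
// sent because the ancillary (SCM_RIGHTS) data needs a payload to ride on.
bool sendFd(int sock, int fd)
{
  char byte = 0;
  iovec iov{&byte, 1};
  alignas(cmsghdr) char control[CMSG_SPACE(sizeof(int))] = {};
  msghdr msg{};
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = control;
  msg.msg_controllen = sizeof(control);
  cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
  cmsg->cmsg_level = SOL_SOCKET;
  cmsg->cmsg_type = SCM_RIGHTS;
  cmsg->cmsg_len = CMSG_LEN(sizeof(int));
  std::memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
  return sendmsg(sock, &msg, 0) == 1;
}

// Receive one fd; the kernel installs a new descriptor in this process that
// refers to the same open file. Returns -1 on failure.
int recvFd(int sock)
{
  char byte = 0;
  iovec iov{&byte, 1};
  alignas(cmsghdr) char control[CMSG_SPACE(sizeof(int))] = {};
  msghdr msg{};
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = control;
  msg.msg_controllen = sizeof(control);
  if (recvmsg(sock, &msg, MSG_CMSG_CLOEXEC) != 1) {
    return -1;
  }
  cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
  if (cmsg == nullptr || cmsg->cmsg_level != SOL_SOCKET || cmsg->cmsg_type != SCM_RIGHTS) {
    return -1;
  }
  int fd = -1;
  std::memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
  return fd;
}

// Hypothetical forwarding step: hand a received fd to the next consumer and
// drop the local reference (the in-flight fd keeps the file alive).
bool forwardFd(int fd, int nextSock)
{
  bool ok = sendFd(nextSock, fd);
  close(fd);
  return ok;
}
```

Since the memfd is sealed, fanning the same fd out to several consumers would not let any of them alter what the others see.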

@rbx
Contributor

rbx commented May 21, 2026

The idea sounds simple and elegant; I like it. It fits your use case very well.

A few points from a quick glance at the benchmark:

  • For the FairMQ part you are using the zeromq transport, even though shmem is implied. I'm not sure whether that's intentional and you just want to test metadata transfer; otherwise it should be switched to actual managed or unmanaged shmem, depending on which one you want to compare.
  • You mention sealing, but I can't find it in the code. It would be interesting to see how this works and whether it adds overhead (probably minimal).
  • It's worth keeping in mind that this is one-to-one, while in practice you will have to create more sockets, one set per receiver; again, the cost should be minimal.

But these are just minor points about the benchmark; the idea itself sounds good!

@ktf
Member Author

ktf commented May 21, 2026

Indeed, the benchmark was erroneously using zeromq as the backend. With the current naive implementation, FairMQ is ~10x better.

I guess the overhead comes from the file creation. To reduce it, one could try reusing the memfd, or using one memfd for more than one TF. I will experiment a bit more.
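To check whether file creation dominates, one could time the per-TF setup in isolation. The sketch below (Linux-only; the helper name `averageSetupNs` is hypothetical) measures `memfd_create`, `ftruncate`, `mmap`, one page fault, `munmap`, sealing and `close` per iteration. It is not the PR's benchmark and makes no claim about the numbers it would produce.

```cpp
#include <sys/mman.h> // memfd_create, mmap, munmap
#include <fcntl.h>    // fcntl, F_ADD_SEALS
#include <unistd.h>   // ftruncate, close
#include <chrono>
#include <cstddef>

// Average wall-clock cost, in nanoseconds, of creating, touching, sealing and
// closing one memfd of `size` bytes. Returns a negative value on any failure.
double averageSetupNs(std::size_t size, int iterations)
{
  if (iterations <= 0) {
    return -1.0;
  }
  using clock = std::chrono::steady_clock;
  auto start = clock::now();
  for (int i = 0; i < iterations; ++i) {
    int fd = memfd_create("tf", MFD_CLOEXEC | MFD_ALLOW_SEALING);
    if (fd < 0) {
      return -1.0;
    }
    bool ok = ftruncate(fd, static_cast<off_t>(size)) == 0;
    void* p = ok ? mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0) : MAP_FAILED;
    ok = p != MAP_FAILED;
    if (ok) {
      static_cast<volatile char*>(p)[0] = 1; // touch the first page so it is really allocated
      munmap(p, size);                       // required before F_SEAL_WRITE
      ok = fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE) == 0;
    }
    close(fd);
    if (!ok) {
      return -1.0;
    }
  }
  std::chrono::duration<double, std::nano> elapsed = clock::now() - start;
  return elapsed.count() / iterations;
}
```

Comparing this figure against the cost of writing into one long-lived mapping would separate the creation cost from the page-fault cost.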

@alibuild
Collaborator

Error while checking build/O2/fullCI_slc9 for ec9e706 at 2026-05-21 10:53:

++ [[ /sw/slc9_x86-64/GEANT4/v11.2.0-alice1-18 != '' ]]
+++ /sw/slc9_x86-64/GEANT4/v11.2.0-alice1-18/bin/geant4-config --datasets
+++ sed 's/[^ ]* //'
+++ sed 's/G4/export G4/'
+++ sed 's/DATA /DATA=/'
++ export G4NEUTRONHPDATA=/sw/slc9_x86-64/GEANT4/v11.2.0-alice1-18/share/Geant4/data/G4NDL4.7 export G4LEDATA=/sw/slc9_x86-64/GEANT4/v11.2.0-alice1-18/share/Geant4/data/G4EMLOW8.5 export G4LEVELGAMMADATA=/sw/slc9_x86-64/GEANT4/v11.2.0-alice1-18/share/Geant4/data/PhotonEvaporation5.7 export G4RADIOACTIVEDATA=/sw/slc9_x86-64/GEANT4/v11.2.0-alice1-18/share/Geant4/data/RadioactiveDecay5.6 export G4PARTICLEXSDATA=/sw/slc9_x86-64/GEANT4/v11.2.0-alice1-18/share/Geant4/data/G4PARTICLEXS4.0 export G4PIIDATA=/sw/slc9_x86-64/GEANT4/v11.2.0-alice1-18/share/Geant4/data/G4PII1.3 export G4REALSURFACEDATA=/sw/slc9_x86-64/GEANT4/v11.2.0-alice1-18/share/Geant4/data/RealSurface2.2 export G4SAIDXSDATA=/sw/slc9_x86-64/GEANT4/v11.2.0-alice1-18/share/Geant4/data/G4SAIDDATA2.0 export G4ABLADATA=/sw/slc9_x86-64/GEANT4/v11.2.0-alice1-18/share/Geant4/data/G4ABLA3.3 export G4INCLDATA=/sw/slc9_x86-64/GEANT4/v11.2.0-alice1-18/share/Geant4/data/G4INCL1.2 export G4ENSDFSTATEDATA=/sw/slc9_x86-64/GEANT4/v11.2.0-alice1-18/share/Geant4/data/G4ENSDFSTATE2.3
++ G4NEUTRONHPDATA=/sw/slc9_x86-64/GEANT4/v11.2.0-alice1-18/share/Geant4/data/G4NDL4.7
++ G4LEDATA=/sw/slc9_x86-64/GEANT4/v11.2.0-alice1-18/share/Geant4/data/G4EMLOW8.5
++ G4LEVELGAMMADATA=/sw/slc9_x86-64/GEANT4/v11.2.0-alice1-18/share/Geant4/data/PhotonEvaporation5.7
++ G4RADIOACTIVEDATA=/sw/slc9_x86-64/GEANT4/v11.2.0-alice1-18/share/Geant4/data/RadioactiveDecay5.6
++ G4PARTICLEXSDATA=/sw/slc9_x86-64/GEANT4/v11.2.0-alice1-18/share/Geant4/data/G4PARTICLEXS4.0
++ G4PIIDATA=/sw/slc9_x86-64/GEANT4/v11.2.0-alice1-18/share/Geant4/data/G4PII1.3
++ G4REALSURFACEDATA=/sw/slc9_x86-64/GEANT4/v11.2.0-alice1-18/share/Geant4/data/RealSurface2.2
++ G4SAIDXSDATA=/sw/slc9_x86-64/GEANT4/v11.2.0-alice1-18/share/Geant4/data/G4SAIDDATA2.0
++ G4ABLADATA=/sw/slc9_x86-64/GEANT4/v11.2.0-alice1-18/share/Geant4/data/G4ABLA3.3
++ G4INCLDATA=/sw/slc9_x86-64/GEANT4/v11.2.0-alice1-18/share/Geant4/data/G4INCL1.2
++ G4ENSDFSTATEDATA=/sw/slc9_x86-64/GEANT4/v11.2.0-alice1-18/share/Geant4/data/G4ENSDFSTATE2.3
++ rm -Rf /sw/BUILD/496fdad70b8e767abc55be1a932baa99209888a2/O2-RTC-test/rtc-test
++ mkdir /sw/BUILD/496fdad70b8e767abc55be1a932baa99209888a2/O2-RTC-test/rtc-test
++ pushd /sw/BUILD/496fdad70b8e767abc55be1a932baa99209888a2/O2-RTC-test/rtc-test
/sw/BUILD/496fdad70b8e767abc55be1a932baa99209888a2/O2-RTC-test/rtc-test /sw/BUILD/496fdad70b8e767abc55be1a932baa99209888a2/O2-RTC-test
++ type /sw/slc9_x86-64/O2/slc9_x86-64-slc9_x86-64-local14/lib/libO2GPUTrackingCUDA.so
/sw/slc9_x86-64/O2/slc9_x86-64-slc9_x86-64-local14/lib/libO2GPUTrackingCUDA.so is /sw/slc9_x86-64/O2/slc9_x86-64-slc9_x86-64-local14/lib/libO2GPUTrackingCUDA.so
++ type /sw/slc9_x86-64/O2/slc9_x86-64-slc9_x86-64-local14/lib/libO2GPUTrackingHIP.so
/sw/slc9_x86-64/O2/slc9_x86-64-slc9_x86-64-local14/lib/libO2GPUTrackingHIP.so is /sw/slc9_x86-64/O2/slc9_x86-64-slc9_x86-64-local14/lib/libO2GPUTrackingHIP.so
++ type /sw/slc9_x86-64/O2/slc9_x86-64-slc9_x86-64-local14/lib/libO2GPUTrackingOCL.so
/sw/slc9_x86-64/O2/slc9_x86-64-slc9_x86-64-local14/lib/libO2GPUTrackingOCL.so is /sw/slc9_x86-64/O2/slc9_x86-64-slc9_x86-64-local14/lib/libO2GPUTrackingOCL.so
+++ find /usr/local/cuda /usr/local/cuda-13 /usr/local/cuda-13.0 /usr/local/cuda-13.1 -type d -name stubs -prune -false -o '(' -type f -o -type l ')' -name libcuda.so -printf :%h -quit
++ LD_LIBRARY_PATH=/sw/slc9_x86-64/O2/slc9_x86-64-slc9_x86-64-local14/lib:/sw/slc9_x86-64/O2-customization/v1.0.0-7/lib:/sw/slc9_x86-64/googlebenchmark/1.9.5-2/lib:/sw/slc9_x86-64/GBL/V03-01-04-16/lib:/sw/slc9_x86-64/bookkeeping-api/v1.9.2-19/lib:/sw/slc9_x86-64/grpc/v1.71.0-20/lib:/sw/slc9_x86-64/c-ares/1.18.1-31/lib:/sw/slc9_x86-64/MLModels/20220530-13/lib:/sw/slc9_x86-64/ONNXRuntime/v1.22.0-81/lib:/sw/slc9_x86-64/pytorch_cpuinfo/alice1-17/lib:/sw/slc9_x86-64/safe_int/v3.0.28a-1/lib:/sw/slc9_x86-64/date/v3.0.3-16/lib:/sw/slc9_x86-64/gpu-system/cuda_13.1.115_arch@80_real#86_real#89_real#120_real#75_virtual@_home_F52XG4RPNRXWGYLMF5RXKZDBBI000000-rocm_6.3.42134_arch@gfx906#gfx908@_home_F5XXA5BPOJXWG3IK-opencl-miopen-migraphx-cudnn-tensorrt-2/lib:/sw/slc9_x86-64/onnx/v1.17.0-alice2-29/lib:/sw/slc9_x86-64/Eigen3/3.4.0-onnx1-16/lib:/sw/slc9_x86-64/VecGeom/v1.2.6-37/lib:/sw/slc9_x86-64/libjalienO2/0.2.3-9/lib:/sw/slc9_x86-64/fastjet/v3.4.1_1.052-alice3-18/lib:/sw/slc9_x86-64/cgal/6.1.1-6/lib:/sw/slc9_x86-64/MPFR/v3.1.3-25/lib:/sw/slc9_x86-64/GMP/v6.2.1-17/lib:/sw/slc9_x86-64/JAliEn-ROOT/0.7.17-6/lib:/sw/slc9_x86-64/Alice-GRID-Utils/0.0.7-4/lib:/sw/slc9_x86-64/json-c/v0.18.0-5/lib:/sw/slc9_x86-64/libwebsockets/v4.3.4-1/lib:/sw/slc9_x86-64/xjalienfs/1.7.0-14/lib:/sw/slc9_x86-64/DebugGUI/v0.8.0-41/lib:/sw/slc9_x86-64/libuv/v1.52.0-5/lib:/sw/slc9_x86-64/GLFW/3.4-5/lib:/sw/slc9_x86-64/MCStepLogger/v0.6.2-5/lib:/sw/slc9_x86-64/FairMQ/v1.10.1-13/lib:/sw/slc9_x86-64/FairCMakeModules/v1.0.0-36/lib:/sw/slc9_x86-64/ZeroMQ/v4.3.5-36/lib:/sw/slc9_x86-64/ms_gsl/4.2.1-12/lib:/sw/slc9_x86-64/Monitoring/v3.19.12-1/lib:/sw/slc9_x86-64/Configuration/v2.8.0-66/lib:/sw/slc9_x86-64/Ppconsul/v0.2.3-alice3-15/lib:/sw/slc9_x86-64/Common-O2/v1.6.4-13/lib:/sw/slc9_x86-64/libInfoLogger/v2.10.1-5/lib:/sw/slc9_x86-64/HepMC3/3.3.1-29/lib:/sw/slc9_x86-64/FairRoot/v18.4.9-alice3-154/lib:/sw/slc9_x86-64/FairLogger/v2.3.1-12/lib:/sw/slc9_x86-64/fmt/11.1.2-21/lib:/sw/slc9_x86-64/simulation/v1.0-137/lib:/sw/slc9_x86-64/GEANT3/v4-5-27/lib:/sw/slc9_x86-64/GEANT3/v4-5-27/lib64:/sw/slc9_x86-64/GEANT4_VMC/v6-6-update1-p3-46/lib:/sw/slc9_x86-64/vgm/v5-3-110/lib:/sw/slc9_x86-64/GEANT4/v11.2.0-alice1-18/lib:/sw/slc9_x86-64/xercesc/Xerces-C_3_2_5-27/lib:/sw/slc9_x86-64/VMC/v2-1-29/lib:/sw/slc9_x86-64/ROOT/v6-36-10-alice1-1/lib:/sw/slc9_x86-64/nlohmann_json/v3.11.3-19/lib:/sw/slc9_x86-64/Vc/1.4.5-19/lib:/sw/slc9_x86-64/FFTW3/v3.3.9-38/lib:/sw/slc9_x86-64/TBB/v2022.3.0-11/lib:/sw/slc9_x86-64/XRootD/v5.8.4-17/lib:/sw/slc9_x86-64/GSL/v2.8-8/lib:/sw/slc9_x86-64/generators/v1.0-71/lib:/sw/slc9_x86-64/pythia6/428-alice4-11/lib:/sw/slc9_x86-64/ninja-fortran/fortran-v1.11.1.g9-10/lib:/sw/slc9_x86-64/pythia/v8315-alice1-22/lib:/sw/slc9_x86-64/HepMC/HEPMC_02_06_10-29/lib:/sw/slc9_x86-64/lhapdf/v6.5.2-46/lib:/sw/slc9_x86-64/arrow/v20.0.0-alice1-33/lib:/sw/slc9_x86-64/re2/2024-07-02-17/lib:/sw/slc9_x86-64/double-conversion/v3.4.0-5/lib:/sw/slc9_x86-64/RapidJSON/v1.1.0-alice2-37/lib:/sw/slc9_x86-64/flatbuffers/v24.3.25-24/lib:/sw/slc9_x86-64/xsimd/14.0.0-14/lib:/sw/slc9_x86-64/utf8proc/v2.11.2-5/lib:/sw/slc9_x86-64/protobuf/v29.3-20/lib:/sw/slc9_x86-64/Clang/v20.1.7-20/lib:/sw/slc9_x86-64/lz4/v1.10.0-5/lib:/sw/slc9_x86-64/boost/v1.90.0-alice1-2/lib:/sw/slc9_x86-64/bz2/1.0.8-23/lib:/sw/slc9_x86-64/lzma/v5.2.3-14/lib:/sw/slc9_x86-64/Python-modules/1.0-76/lib:/sw/slc9_x86-64/Python-modules-list/1.0-40/lib:/sw/slc9_x86-64/hdf5/1.14.6-15/lib:/sw/slc9_x86-64/Python/v3.10.19-13/lib:/sw/slc9_x86-64/libffi/v3.2.1-alice1-10/lib:/sw/slc9_x86-64/libffi/v3.2.1-alice1-10/lib64:/sw/slc9_x86-64/sqlite/v3.47.2-10/lib:/sw/slc9_x86-64/libpng/v1.6.47-16/lib:/sw/slc9_x86-64/FreeType/v2.10.1-25/lib:/sw/slc9_x86-64/AliEn-Runtime/v2-19-le-25/lib:/sw/slc9_x86-64/UUID/v2.27.1-15/lib:/sw/slc9_x86-64/AliEn-CAs/v1-14/lib:/sw/slc9_x86-64/libxml2/v2.9.3-23/lib:/sw/slc9_x86-64/abseil/20240722.0-17/lib:/sw/slc9_x86-64/ninja/fortran-v1.11.1.g9-25/lib:/sw/slc9_x86-64/CMake/v4.1.4-2/lib:/sw/slc9_x86-64/curl/7.70.0-24/lib:/sw/slc9_x86-64/OpenSSL/v1.1.1m-15/lib:/sw/slc9_x86-64/zlib/v1.3.1-6/lib:/sw/slc9_x86-64/alibuild-recipe-tools/v0.3.0-1/lib:/sw/slc9_x86-64/GCC-Toolchain/v14.2.0-alice2-1/lib:/sw/slc9_x86-64/GCC-Toolchain/v14.2.0-alice2-1/lib64:/sw/slc9_x86-64/defaults-release/v1-8/lib:/usr/local/cuda-13.0/compat
++ o2-gpu-standalone-benchmark --noEvents -g --gpuType CUDA --RTCenable 1 --RTCcacheOutput 0 --RTCoptConstexpr 1 --RTCcompilePerKernel 1 --RTCTECHrunTest 2
GPU processing enabled
[INFO] GPU Tracker library loaded and GPU tracker object created sucessfully
[INFO] Created GPUReconstruction instance for device type CUDA (2)
Using default event settings, no event dir loaded (solenoidBz: -5.006680, constBz 0, maxTimeBin -2)
Standalone Test Framework for CA Tracker - Using GPU
[INFO] Starting CUDA RTC Compilation
[INFO] RTC Compilation finished (335.669815 seconds)
++ o2-gpu-standalone-benchmark --noEvents -g --gpuType HIP --RTCenable 1 --RTCcacheOutput 0 --RTCoptConstexpr 1 --RTCcompilePerKernel 1 --RTCTECHrunTest 2
GPU processing enabled
[INFO] GPU Tracker library loaded and GPU tracker object created sucessfully
[INFO] Created GPUReconstruction instance for device type HIP (3)
Using default event settings, no event dir loaded (solenoidBz: -5.006680, constBz 0, maxTimeBin -2)
Standalone Test Framework for CA Tracker - Using GPU
[INFO] Starting HIP RTC Compilation
[INFO] RTC Compilation finished (83.933545 seconds)
++ popd
/sw/BUILD/496fdad70b8e767abc55be1a932baa99209888a2/O2-RTC-test
++ rm -Rf /sw/BUILD/496fdad70b8e767abc55be1a932baa99209888a2/O2-RTC-test/rtc-test
++ mkdir -p /sw/INSTALLROOT/496fdad70b8e767abc55be1a932baa99209888a2/slc9_x86-64/O2-RTC-test/1.0-local790/etc/modulefiles
++ cat

Full log here.

@ktf
Member Author

ktf commented May 21, 2026

On Linux the overhead is actually just 2x, so it might be worth investigating further. I am also trying MFD_HUGETLB, which should improve things further.
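A sketch of trying `MFD_HUGETLB` with a fallback to a regular memfd, assuming Linux, glibc 2.27 or later, and 2 MiB huge pages (the x86-64 default; the constant `kHugePageSize` is an assumption, not queried from the system). Hugetlb-backed files must be sized in whole huge pages, so each TF would be rounded up to 2 MiB, which raises the per-TF overhead from at most one regular page to at most one huge page. Older kernels may also reject `MFD_HUGETLB` combined with `MFD_ALLOW_SEALING`, and the shared `mmap` fails when no huge pages are available; in both cases the sketch falls back. `TfMapping`, `tryMap` and `createTfMapping` are hypothetical names.

```cpp
#include <sys/mman.h> // memfd_create, MFD_HUGETLB, mmap
#include <unistd.h>   // ftruncate, close, sysconf
#include <cstddef>

struct TfMapping {
  int fd = -1;
  void* data = nullptr;
  std::size_t size = 0;
  bool hugePages = false;
};

// Assumption: 2 MiB huge pages; a real implementation should query the size.
constexpr std::size_t kHugePageSize = 2u << 20;

// Round n up to a multiple of `unit` (unit must be a power of two).
std::size_t roundUp(std::size_t n, std::size_t unit) { return (n + unit - 1) & ~(unit - 1); }

// Create, size and map a memfd with the extra `flags`; on failure leave m unset.
bool tryMap(TfMapping& m, unsigned int flags, std::size_t size)
{
  int fd = memfd_create("tf", MFD_CLOEXEC | MFD_ALLOW_SEALING | flags);
  if (fd < 0) {
    return false; // e.g. no hugetlb support, or hugetlb + sealing not supported
  }
  if (ftruncate(fd, static_cast<off_t>(size)) == 0) {
    void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p != MAP_FAILED) { // for hugetlb this fails if no huge pages can be reserved
      m.fd = fd;
      m.data = p;
      m.size = size;
      return true;
    }
  }
  close(fd);
  return false;
}

// Prefer huge pages; otherwise use a regular memfd rounded to the base page size.
TfMapping createTfMapping(std::size_t bytes)
{
  TfMapping m;
  if (tryMap(m, MFD_HUGETLB, roundUp(bytes, kHugePageSize))) {
    m.hugePages = true;
    return m;
  }
  auto page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
  tryMap(m, 0, roundUp(bytes, page));
  return m;
}
```

Checking `hugePages` after the call tells the benchmark which path was actually taken, so results from machines without reserved huge pages are not mistaken for hugetlb numbers.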

@alibuild
Collaborator

alibuild commented May 21, 2026

Error while checking build/O2/fullCI_slc9 for c4477d5 at 2026-05-22 03:13:

## sw/BUILD/o2checkcode-latest/log
--
========== List of errors found ==========
++ GRERR=0
++ grep -v clang-diagnostic-error error-log.txt
++ grep ' error:'
/sw/SOURCES/O2/slc9_x86-64-slc9_x86-64/0/Framework/Core/test/benchmark_ShmemVsMemfd.cxx:29:10: error: inclusion of deprecated C++ header 'signal.h'; consider using 'csignal' instead [modernize-deprecated-headers]
++ [[ 0 == 0 ]]
++ exit 1
--

Full log here.
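The clang-tidy check (`modernize-deprecated-headers`) only asks for the C++ header `<csignal>` in place of `<signal.h>` at line 29 of `benchmark_ShmemVsMemfd.cxx`. That file is not shown here, so the snippet is a hypothetical stand-in for whatever signal handling the benchmark does; it only illustrates the include together with the `std::`-qualified names. `installStopHandler` and `stopRequested` are invented for the example.

```cpp
#include <csignal> // C++ header; replaces the deprecated <signal.h>

namespace
{
// Written from the signal handler; volatile std::sig_atomic_t is the type the
// standard allows a handler to assign to.
volatile std::sig_atomic_t gStopRequested = 0;

void onTerminate(int /*signum*/) { gStopRequested = 1; }
} // namespace

// Hypothetical helper: install a SIGTERM handler so a benchmark loop can stop cleanly.
bool installStopHandler()
{
  return std::signal(SIGTERM, onTerminate) != SIG_ERR;
}

bool stopRequested() { return gStopRequested != 0; }
```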

@alibuild
Collaborator

Error while checking build/O2/fullCI_slc9 for 8a07a00 at 2026-05-22 11:35:

## sw/BUILD/o2checkcode-latest/log
--
========== List of errors found ==========
++ GRERR=0
++ grep -v clang-diagnostic-error error-log.txt
++ grep ' error:'
/sw/SOURCES/O2/slc9_x86-64-slc9_x86-64/0/Framework/Core/test/benchmark_ShmemVsMemfd.cxx:29:10: error: inclusion of deprecated C++ header 'signal.h'; consider using 'csignal' instead [modernize-deprecated-headers]
++ [[ 0 == 0 ]]
++ exit 1
--

Full log here.


Labels

None yet


4 participants