CUDA
NVIDIA Tools Summary
Global Memory
Efficient Global Memory Access
L1/Tex Cache + Shared Memory
64 KB/SM in most architectures
192 KB/SM in A100 versus 128 KB/SM in V100
Vectors
Cooperative Groups
Reductions
Faster parallel reductions on Kepler
Warp Intrinsics
Warp Shuffle and Warp Vote Intrinsics
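As a minimal sketch of the two intrinsic families, the example below sums one value per lane with `__shfl_down_sync` and collects a per-lane predicate with `__ballot_sync`. The names `warpReduceSum` and `warpDemo` are made up for illustration, and the code assumes CUDA 9 or later (the `_sync` variants) and a single full warp of 32 threads.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: sum `val` across the 32 lanes of a warp.
// After the loop, only lane 0 is guaranteed to hold the full sum.
__inline__ __device__ int warpReduceSum(int val) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffffu, val, offset);  // full-warp mask
    return val;
}

// Hypothetical demo kernel: launched with exactly one warp.
__global__ void warpDemo(const int *in, int *sum, unsigned *oddMask) {
    int lane = threadIdx.x;
    int v = in[lane];
    // Warp vote: bit i of the result is set if lane i's predicate is non-zero.
    unsigned mask = __ballot_sync(0xffffffffu, v & 1);
    int total = warpReduceSum(v);
    if (lane == 0) {
        *sum = total;
        *oddMask = mask;
    }
}

int main() {
    int h_in[32];
    for (int i = 0; i < 32; ++i) h_in[i] = i;  // 0, 1, ..., 31

    int *d_in, *d_sum;
    unsigned *d_mask;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_sum, sizeof(int));
    cudaMalloc(&d_mask, sizeof(unsigned));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    warpDemo<<<1, 32>>>(d_in, d_sum, d_mask);

    int h_sum = 0;
    unsigned h_mask = 0;
    cudaMemcpy(&h_sum, d_sum, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&h_mask, d_mask, sizeof(unsigned), cudaMemcpyDeviceToHost);

    // Prints: sum = 496, odd-lane mask = 0xaaaaaaaa
    printf("sum = %d, odd-lane mask = 0x%08x\n", h_sum, h_mask);

    cudaFree(d_in);
    cudaFree(d_sum);
    cudaFree(d_mask);
    return 0;
}
```

A block-wide reduction would typically run this per warp, write each warp's partial sum to shared memory, and reduce those partials with one more warp.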
CUB
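For comparison with a hand-written reduction, here is a small sketch of a device-wide sum using `cub::DeviceReduce::Sum`. It assumes the CUB headers that ship with the CUDA Toolkit; the input size and values are arbitrary choices for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cub/cub.cuh>

int main() {
    const int n = 1024;
    int h_in[n];
    for (int i = 0; i < n; ++i) h_in[i] = 1;  // illustrative input: all ones

    int *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    // First call with a null workspace only reports the scratch size needed.
    void *d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n);

    // Second call with the allocated workspace performs the reduction.
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n);

    int h_out = 0;
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("sum = %d\n", h_out);  // prints: sum = 1024

    cudaFree(d_temp);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

The two-call pattern (size query, then execution) is how CUB's device-wide algorithms obtain their temporary storage.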
CUDA Streams
By default, all CUDA kernels launch on stream 0 (the default stream).
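A minimal sketch of the difference between a default-stream launch and launches on explicitly created streams follows. The kernel `scale` and the buffer names are hypothetical; error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel used only to have something to launch.
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    const int block = 256;
    const int grid = (n + block - 1) / block;

    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMemset(a, 0, n * sizeof(float));
    cudaMemset(b, 0, n * sizeof(float));

    // No stream argument: the launch goes to stream 0 (the default stream).
    scale<<<grid, block>>>(a, 2.0f, n);

    // Explicit streams: the fourth launch parameter selects the stream
    // (the third is the dynamic shared memory size in bytes).
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    scale<<<grid, block, 0, s1>>>(a, 2.0f, n);
    scale<<<grid, block, 0, s2>>>(b, 3.0f, n);  // may overlap with the s1 launch

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

With the legacy default stream, work on stream 0 implicitly synchronizes with other blocking streams, so mixing the two styles can serialize work that was meant to overlap. Compiling with `nvcc --default-stream per-thread` gives each host thread its own default stream instead.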
In an MPI-like setup, where several processes share one GPU, something like NVIDIA MPS (Multi-Process Service) is needed to balance the load.
NVIDIA MPS CUDA streams pitfalls
CUDA Floating Point
https://docs.nvidia.com/cuda/floating-point/index.html