cuda #
CUDA Compute Backend for VSL π₯οΈ
vsl.cuda is a high-performance GPU compute backend for VSL backed by NVIDIA CUDA (cuBLAS + cuDNN).
π Status
β οΈ Phase B β Infrastructure ready; cuBLAS/cuDNN bindings are pending.
All operations currently fall back to CPU. Once CUDA Toolkit is available on
the build machine, theTODO(#238)markers can be replaced with actual GPU
kernels.
| Operation | Status | cuBLAS/cuDNN |
|---|---|---|
gemm |
β Stub (CPU fallback) | cublasDgemm |
gemv |
β Stub (CPU fallback) | cublasDgvm |
relu |
β Stub (CPU fallback) | cuDNNReLU |
sigmoid |
β Stub (CPU fallback) | cuDNNSigmoid |
tanh |
β Stub (CPU fallback) | cuDNNTanh |
add_vec |
β Stub (CPU fallback) | custom kernel |
mul_vec |
β Stub (CPU fallback) | custom kernel |
add_scalar |
β Stub (CPU fallback) | custom kernel |
mul_scalar |
β Stub (CPU fallback) | cublasDscal |
softmax |
β Stub (CPU fallback) | cuDNNSoftmaxForward |
layernorm |
β Stub (CPU fallback) | cuDNNLayerNorm |
conv2d |
β Stub (CPU fallback) | cuDNNConvolutionForward |
See issue #238 for the cuBLAS/cuDNN implementation tracker.
π Architecture
vsl.cuda
βββ backend.v # CUDABackend (ComputeBackend interface impl)
βββ compute/
β βββ elementwise.v # Activation functions (relu, sigmoid, tanh, ...)
β βββ gemm.v # Public GEMM wrapper (rowβcol conversion)
β βββ gemm_impl.v # Internal GEMM/GEMV stubs + CPU fallbacks
βββ v.mod
Memory layout: cuBLAS is column-major (same as VCL/Vulkan). The CUDABackend.to_internal() / from_internal() methods handle rowβcolumn-major conversion at the dispatch boundary.
π― Quick Start
import vsl.compute
// Use CUDA backend automatically when available
ctx := compute.new_context(.cuda)
// All compute operations dispatch to CUDA when the backend is set
a := []f64{len: 6}
b := []f64{len: 6}
// ... fill a, b ...
result := compute.add_vec(ctx, a, b)!
Or directly via the backend:
import vsl.cuda
mut dev := cuda.get_default_device()!
mut backend := cuda.new_cuda_backend()
result := backend.relu(my_data)!
π§ Requirements
Runtime (at runtime)
- NVIDIA GPU with compute capability β₯ 5.0 (Maxwell or newer)
- NVIDIA Driver β₯ 525.60 (for CUDA 12.x)
- CUDA Toolkit β₯ 11.8
- cuDNN β₯ 8.0
Build time (for CUDA GPU builds)
nvcc(NVIDIA C compiler) in$PATH- CUDA Toolkit headers (
cuda.h,cublas.h,cudnn.h) - cuDNN headers
π¦ Installation
Arch Linux
##sudo pacman -S nvidia nvidia-utils
##sudo pacman -S cuda
##sudo pacman -S cudnn
##nvcc --version ##nvidia-smi ##
Ubuntu / Debian
##sudo apt-get install nvidia-driver-535
##wget https://developer.downloads.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-11all.deb
sudo apt-get update
sudo apt-get install cuda-toolkit-12-2
##sudo apt-get install libcudnn8 libcudnn8-dev
macOS
CUDA Toolkit is no longer supported on macOS arm64 (Apple Silicon). For GPU acceleration on Apple Silicon, use the VCL/OpenCL backend instead.
ποΈ Compiling with CUDA
VSL uses conditional compilation ($if cuda ?) to include CUDA code:
##v -d cuda run your_app.v
##v -d cuda test .
##v -d cuda vet .
Set CUDNN_PATH if cuDNN is not in a standard location:
CUDNN_PATH=/opt/cuda v -d cuda run your_app.v
β‘ Performance Notes
- GEMM (matrix-matrix) β the highest-impact primitive. cuBLAS
dgemmis heavily optimized for NVIDIA Tensor Cores on Ampere+ GPUs. - cuDNN activations β fused kernels for ReLU, Sigmoid, Tanh provide ~2-5x speedup over CPU for large tensors.
- Memory layout β cuBLAS uses column-major; VSL uses row-major. The
to_internal()/from_internal()conversion adds overhead for small tensors but is negligible for large matrices (β₯ 256Γ256).
π§ͺ Examples
See the cuda/examples/ directory:
##v -d cuda run cuda/examples/relu_example.v
πΊοΈ Roadmap
| Phase | Description |
|---|---|
| β Phase B | Infrastructure ready; stub implementations with CPU fallbacks |
| π― Phase C | Replace stubs with actual cuBLAS/cuDNN kernels (issue #238) |
| ποΈ Phase D | Device discovery, multi-GPU support |
| π§ Phase E | GPU memory management (avoid CPUβGPU copies) |
| π§ͺ Phase F | Test suite, numerical validation vs reference |
π Resources
- VSL Documentation
- cuBLAS Documentation
- cuDNN Documentation
- CUDA Toolkit Download
- VSL ADR-001: Multi-backend GPU compute
Accelerate your scientific computing with NVIDIA GPUs! π
Constants #
const cublas_status_success = 0
const cublas_op_n = 0 // Non-transpose
CublasOperation values.
const cublas_op_t = 1 // Transpose
const cublas_op_c = 2 // Conjugate transpose
const cudnn_status_success = 0
const cudnn_activation_relu = 1
CudnnActivationMode values (cuDNN 9.x deprecated legacy API). Matches cudnnActivationMode_t enum in cudnn_graph_v9.h: CUDNN_ACTIVATION_SIGMOID=0, RELU=1, TANH=2, CLIPPED_RELU=3, ELU=4, ...
const cudnn_activation_sigmoid = 0
const cudnn_activation_tanh = 2
const cudnn_softmax_fast = 0
CudnnSoftmaxAlgorithm values.
const cudnn_softmax_mode_instance = 0 // apply softmax per image (NCHW)
CudnnSoftmaxMode values.
const cudnn_softmax_mode_channel = 1 // apply softmax per spatial location (CHANNELS)
const cudnn_tensor_nhwc = 0 // NCHW format (row-major in VSL)
CudnnTensorFormat values.
const cudnn_tensor_nchw = 1 // NCHW format
const cudnn_data_type_double = 1 // f64
CudnnDataType values.
const cudnn_data_type_float = 0 // f32
fn col_to_row_major #
fn col_to_row_major(data []f64, rows int, cols int) []f64
col_to_row_major converts column-major flat array back to row-major. For matrix [rows x cols], element (r, c) in column-major flat at index r + crows becomes element (r, c) in row-major flat at index rcols+c.
fn cublas_error #
fn cublas_error(status CublasStatus) string
cublas_error returns a string description of a cuBLAS status code.
fn cudnn_error #
fn cudnn_error(status CudnnStatus) string
cudnn_error returns a string description of a cuDNN status code.
fn get_default_device #
fn get_default_device() !&CudaDevice
get_default_device returns the first available CUDA device. Convenience wrapper; equivalent to get_device(0).
fn get_device #
fn get_device(index int) !&CudaDevice
get_device returns the CUDA device with the given index (0-based).
fn get_device_count #
fn get_device_count() !int
get_device_count returns the number of available CUDA devices. Calls cuInit internally if not yet initialized.
fn row_to_col_major #
fn row_to_col_major(data []f64, rows int, cols int) []f64
row_to_col_major converts row-major flat array to column-major. For matrix [rows x cols], element (r, c) in row-major flat at index rcols+c becomes element (r, c) in column-major flat at index r + crows.
type CudnnActivationDescriptor #
type CudnnActivationDescriptor = voidptr
cudnnActivationDescriptor_t is a handle to an activation descriptor.
type CudnnActivationMode #
type CudnnActivationMode = int
cudnnActivationMode_t specifies the activation mode.
type CudnnConvolutionDescriptor #
type CudnnConvolutionDescriptor = voidptr
cudnnConvolutionDescriptor_t is a handle to a convolution descriptor.
type CudnnConvolutionFwdAlgo #
type CudnnConvolutionFwdAlgo = int
cudnnConvolutionFwdAlgo_t specifies the convolution forward algorithm.
type CudnnDataType #
type CudnnDataType = int
cudnnDataType_t specifies the data type of a tensor.
type CudnnFilterDescriptor #
type CudnnFilterDescriptor = voidptr
cudnnFilterDescriptor_t is a handle to a filter descriptor.
type CudnnSoftmaxAlgorithm #
type CudnnSoftmaxAlgorithm = int
cudnnSoftmaxAlgorithm_t specifies the softmax algorithm.
type CudnnSoftmaxMode #
type CudnnSoftmaxMode = int
cudnnSoftmaxMode_t specifies the softmax mode.
type CudnnTensorDescriptor #
type CudnnTensorDescriptor = voidptr
cudnnTensorDescriptor_t is a handle to a tensor descriptor.
type CudnnTensorFormat #
type CudnnTensorFormat = int
cudnnTensorFormat_t specifies the memory layout of a tensor.
type Device #
type Device = CudaDevice
Device is an alias for CudaDevice exposed at the top-level module boundary. Users that only need the device type can import vsl.cuda and use cuda.Device.
struct CudaDevice #
struct CudaDevice {
pub mut:
// device_id is the CUDA device ordinal (0, 1, ...).
device_id int
// name is the human-readable device name (e.g. "NVIDIA GeForce RTX 4060").
name string
// handle is the CUDA device handle (CUdevice for driver API).
handle voidptr
// ctx is the CUDA context for this device (driver API), if created.
ctx CudaContext
// cublas is the cuBLAS context handle.
cublas CublasHandle
// cudnn is the cuDNN context handle.
cudnn CudnnHandle
// stream is the CUDA stream used for operations.
stream CudaStream
}
CudaDevice represents a CUDA device with its cuBLAS and cuDNN handles. It wraps all GPU resource management for the CUDA backend.
fn (CudaDevice) release #
fn (mut d CudaDevice) release() !
release releases all CUDA resources (context, cuBLAS, cuDNN, stream).
fn (CudaDevice) init #
fn (mut d CudaDevice) init() !
init initializes cuBLAS and cuDNN handles for this device.
- README
- Constants
- fn col_to_row_major
- fn cublas_error
- fn cudnn_error
- fn get_default_device
- fn get_device
- fn get_device_count
- fn row_to_col_major
- type CudnnActivationDescriptor
- type CudnnActivationMode
- type CudnnConvolutionDescriptor
- type CudnnConvolutionFwdAlgo
- type CudnnDataType
- type CudnnFilterDescriptor
- type CudnnSoftmaxAlgorithm
- type CudnnSoftmaxMode
- type CudnnTensorDescriptor
- type CudnnTensorFormat
- type Device
- struct CudaDevice