cuda #

CUDA Compute Backend for VSL 🖥️

vsl.cuda is a high-performance GPU compute backend for VSL backed by NVIDIA CUDA (cuBLAS + cuDNN).

🚀 Status

CUDA backend — cuBLAS/cuDNN bindings active when CUDA Toolkit + cuDNN are
available at build time (-d cuda). Operations use GPU kernels where
implemented; CPU fallback when CUDA/cuDNN is unavailable.

VTL integration (opt-in CUDA Linear/Conv2D): merged in
vlang/vtl#93 (issues #89–#91 closed).

Operation	GPU (CUDA+cuDNN)	Fallback	Tracker
`gemm`	✅ `cublasDgemm`	CPU col-major	—
`gemv`	✅ `cublasDgemv`	CPU	—
`relu` / `sigmoid` / `tanh`	✅ cuDNN activation	CPU	—
`add_vec`	✅ `cublasDaxpy`	CPU	—
`mul_vec`	✅ `cublasDdgmm`	CPU	—
`add_scalar` / `mul_scalar`	✅ cuBLAS/cuDNN path	CPU	—
`softmax`	✅ `cudnnSoftmaxForward`	CPU	—
`layernorm`	optional `-d cudnn_layernorm`	CPU	—
`conv2d`	✅ `cudnnConvolutionForward`	CPU	—
`conv2d_backward`	✅ `cudnnConvolutionBackward*`	CPU	—

mul_vec uses legacy cublasDdgmm (SIDE_RIGHT, 1×n row layout; cublasDdgmm_v2 absent on some distros). Layer norm GPU: build with -d cudnn_layernorm when libcudnn exports cudnnLayerNormForward (9.1+). Numerical parity tests: cuda/compute/numerical_validation_test.v (#281).

📁 Architecture

vsl.cuda
├── backend.v           # CUDABackend (ComputeBackend interface impl)
├── compute/
│   ├── elementwise.v   # Activation functions (relu, sigmoid, tanh, ...)
│   ├── gemm.v          # Public GEMM wrapper (row↔col conversion)
│   └── gemm_impl.v     # Internal GEMM/GEMV implementation + CPU fallbacks
└── v.mod

Memory layout: cuBLAS is column-major (same as VCL/Vulkan). The CUDABackend.to_internal() / from_internal() methods handle row↔column-major conversion at the dispatch boundary.

🎯 Quick Start

import vsl.compute

// Use CUDA backend automatically when available
ctx := compute.new_context(.cuda)

// All compute operations dispatch to CUDA when the backend is set
a := []f64{len: 6}
b := []f64{len: 6}
// ... fill a, b ...

result := compute.add_vec(ctx, a, b)!

Or directly via the backend:

import vsl.cuda

mut dev := cuda.get_default_device()!
mut backend := cuda.new_cuda_backend()

result := backend.relu(my_data)!

🔧 Requirements

Runtime (at runtime)

NVIDIA GPU with compute capability ≥ 5.0 (Maxwell or newer)
NVIDIA Driver ≥ 525.60 (for CUDA 12.x)
CUDA Toolkit ≥ 11.8
cuDNN ≥ 8.0

Build time (for CUDA GPU builds)

nvcc (NVIDIA C compiler) in $PATH
CUDA Toolkit headers (cuda.h, cublas.h, cudnn.h)
cuDNN headers

📦 Installation

Arch Linux

##sudo pacman -S nvidia nvidia-utils

##sudo pacman -S cuda

##sudo pacman -S cudnn

##nvcc --version        ##nvidia-smi            ##

Ubuntu / Debian

##sudo apt-get install nvidia-driver-535

##wget https://developer.downloads.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-11all.deb
sudo apt-get update
sudo apt-get install cuda-toolkit-12-2

##sudo apt-get install libcudnn8 libcudnn8-dev

macOS

CUDA Toolkit is no longer supported on macOS arm64 (Apple Silicon). For GPU acceleration on Apple Silicon, use the VCL/OpenCL backend instead.

🏗️ Compiling with CUDA

VSL uses conditional compilation ($if cuda ?) to include CUDA code:

##v -d cuda run your_app.v

##v -d cuda test .

##v -d cuda vet .

Set CUDNN_PATH if cuDNN is not in a standard location:

CUDNN_PATH=/opt/cuda v -d cuda run your_app.v

⚡ Performance Notes

GEMM (matrix-matrix) — the highest-impact primitive. cuBLAS dgemm is heavily optimized for NVIDIA Tensor Cores on Ampere+ GPUs.
cuDNN activations — fused kernels for ReLU, Sigmoid, Tanh provide ~2-5x speedup over CPU for large tensors.
Memory layout — cuBLAS uses column-major; VSL uses row-major. The to_internal() / from_internal() conversion adds overhead for small tensors but is negligible for large matrices (≥ 256×256).

🧪 Examples

See the cuda/examples/ directory:

##v -d cuda run cuda/examples/relu_example.v

🗺️ Roadmap

Phase	Description
✅ Phase B	Infrastructure ready; CPU fallback path
✅ Phase C	cuBLAS/cuDNN kernels for GEMM/GEMV, activations, softmax, Conv2D, LayerNorm
🏗️ Phase D	Device discovery, multi-GPU support
🧠 Phase E	GPU memory management (avoid CPU↔GPU copies)
✅ Phase F	Numerical validation vs reference for key kernels

🔗 Resources

Accelerate your scientific computing with NVIDIA GPUs! 🚀

Constants #

const cuda_memcpy_host_to_device = 2

const cuda_memcpy_device_to_host = 3

const cuda_memcpy_device_to_device = 4

const cublas_status_success = 0

const cublas_op_n = 0 // Non-transpose

CublasOperation values.

const cublas_op_t = 1 // Transpose

const cublas_op_c = 2 // Conjugate transpose

const cudnn_status_success = 0

const cudnn_activation_relu = 1

CudnnActivationMode values (cuDNN 9.x deprecated legacy API). Matches cudnnActivationMode_t enum in cudnn_graph_v9.h: CUDNN_ACTIVATION_SIGMOID=0, RELU=1, TANH=2, CLIPPED_RELU=3, ELU=4, ...

const cudnn_activation_sigmoid = 0

const cudnn_activation_tanh = 2

const cudnn_softmax_fast = 0

CudnnSoftmaxAlgorithm values.

const cudnn_softmax_mode_instance = 0 // apply softmax per image (NCHW)

CudnnSoftmaxMode values.

const cudnn_softmax_mode_channel = 1 // apply softmax per spatial location (CHANNELS)

const cudnn_tensor_nhwc = 0 // NCHW format (row-major in VSL)

CudnnTensorFormat values.

const cudnn_tensor_nchw = 1 // NCHW format

const cudnn_data_type_double = 1 // f64

CudnnDataType values.

const cudnn_data_type_float = 0 // f32

fn col_to_row_major #

fn col_to_row_major(data []f64, rows int, cols int) []f64

col_to_row_major converts column-major flat array back to row-major. For matrix [rows x cols], element (r, c) in column-major flat at index r + crows becomes element (r, c) in row-major flat at index rcols+c.

fn cublas_error #

fn cublas_error(status CublasStatus) string

cublas_error returns a string description of a cuBLAS status code.

fn cudnn_error #

fn cudnn_error(status CudnnStatus) string

cudnn_error returns a string description of a cuDNN status code.

fn get_default_device #

fn get_default_device() !&CudaDevice

get_default_device returns the first available CUDA device. Convenience wrapper; equivalent to get_device(0).

fn get_device #

fn get_device(index int) !&CudaDevice

get_device returns the CUDA device with the given index (0-based).

fn get_device_count #

fn get_device_count() !int

get_device_count returns the number of available CUDA devices. Calls cuInit internally if not yet initialized.

fn row_to_col_major #

fn row_to_col_major(data []f64, rows int, cols int) []f64

row_to_col_major converts row-major flat array to column-major. For matrix [rows x cols], element (r, c) in row-major flat at index rcols+c becomes element (r, c) in column-major flat at index r + crows.

type CudnnActivationDescriptor #

type CudnnActivationDescriptor = voidptr

cudnnActivationDescriptor_t is a handle to an activation descriptor.

type CudnnActivationMode #

type CudnnActivationMode = int

cudnnActivationMode_t specifies the activation mode.

type CudnnConvolutionBwdDataAlgo #

type CudnnConvolutionBwdDataAlgo = int

cudnnConvolutionBwdDataAlgo_t / BwdFilterAlgo_t — backward convolution algorithms.

type CudnnConvolutionBwdFilterAlgo #

type CudnnConvolutionBwdFilterAlgo = int

type CudnnConvolutionDescriptor #

type CudnnConvolutionDescriptor = voidptr

cudnnConvolutionDescriptor_t is a handle to a convolution descriptor.

type CudnnConvolutionFwdAlgo #

type CudnnConvolutionFwdAlgo = int

cudnnConvolutionFwdAlgo_t specifies the convolution forward algorithm.

type CudnnDataType #

type CudnnDataType = int

cudnnDataType_t specifies the data type of a tensor.

type CudnnFilterDescriptor #

type CudnnFilterDescriptor = voidptr

cudnnFilterDescriptor_t is a handle to a filter descriptor.

type CudnnSoftmaxAlgorithm #

type CudnnSoftmaxAlgorithm = int

cudnnSoftmaxAlgorithm_t specifies the softmax algorithm.

type CudnnSoftmaxMode #

type CudnnSoftmaxMode = int

cudnnSoftmaxMode_t specifies the softmax mode.

type CudnnTensorDescriptor #

type CudnnTensorDescriptor = voidptr

cudnnTensorDescriptor_t is a handle to a tensor descriptor.

type CudnnTensorFormat #

type CudnnTensorFormat = int

cudnnTensorFormat_t specifies the memory layout of a tensor.

type Device #

type Device = CudaDevice

Device is an alias for CudaDevice exposed at the top-level module boundary. Users that only need the device type can import vsl.cuda and use cuda.Device.

struct CudaDevice #

@[heap]

struct CudaDevice {
pub mut:
	// device_id is the CUDA device ordinal (0, 1, ...).
	device_id int
	// name is the human-readable device name (e.g. "NVIDIA GeForce RTX 4060").
	name string
	// handle is the CUDA device handle (CUdevice for driver API).
	handle voidptr
	// ctx is the CUDA context for this device (driver API), if created.
	ctx CudaContext
	// cublas is the cuBLAS context handle.
	cublas CublasHandle
	// cudnn is the cuDNN context handle.
	cudnn CudnnHandle
	// stream is the CUDA stream used for operations.
	stream CudaStream
}

CudaDevice represents a CUDA device with its cuBLAS and cuDNN handles. It wraps all GPU resource management for the CUDA backend.

fn (CudaDevice) release #

fn (mut d CudaDevice) release() !

release releases all CUDA resources (context, cuBLAS, cuDNN, stream).

fn (CudaDevice) init #

fn (mut d CudaDevice) init() !

init initializes cuBLAS and cuDNN handles for this device.