Skip to content

cuda #

CUDA Compute Backend for VSL πŸ–₯️

vsl.cuda is a high-performance GPU compute backend for VSL backed by NVIDIA CUDA (cuBLAS + cuDNN).

πŸš€ Status

CUDA backend β€” cuBLAS/cuDNN bindings active when CUDA Toolkit + cuDNN are
available at build time (-d cuda). Operations use GPU kernels where
implemented; CPU fallback when CUDA/cuDNN is unavailable.

VTL integration (opt-in CUDA Linear/Conv2D): merged in
vlang/vtl#93 (issues #89–#91 closed).

Operation GPU (CUDA+cuDNN) Fallback Tracker
gemm βœ… cublasDgemm CPU col-major β€”
gemv βœ… cublasDgemv CPU β€”
relu / sigmoid / tanh βœ… cuDNN activation CPU β€”
add_vec βœ… cublasDaxpy CPU β€”
mul_vec βœ… cublasDdgmm CPU β€”
add_scalar / mul_scalar βœ… cuBLAS/cuDNN path CPU β€”
softmax βœ… cudnnSoftmaxForward CPU β€”
layernorm optional -d cudnn_layernorm CPU β€”
conv2d βœ… cudnnConvolutionForward CPU β€”
conv2d_backward βœ… cudnnConvolutionBackward* CPU β€”

mul_vec uses legacy cublasDdgmm (SIDE_RIGHT, 1Γ—n row layout; cublasDdgmm_v2 absent on some distros). Layer norm GPU: build with -d cudnn_layernorm when libcudnn exports cudnnLayerNormForward (9.1+). Numerical parity tests: cuda/compute/numerical_validation_test.v (#281).

πŸ“ Architecture

vsl.cuda
β”œβ”€β”€ backend.v           # CUDABackend (ComputeBackend interface impl)
β”œβ”€β”€ compute/
β”‚   β”œβ”€β”€ elementwise.v   # Activation functions (relu, sigmoid, tanh, ...)
β”‚   β”œβ”€β”€ gemm.v          # Public GEMM wrapper (row↔col conversion)
β”‚   └── gemm_impl.v     # Internal GEMM/GEMV implementation + CPU fallbacks
└── v.mod

Memory layout: cuBLAS is column-major (same as VCL/Vulkan). The CUDABackend.to_internal() / from_internal() methods handle row↔column-major conversion at the dispatch boundary.

🎯 Quick Start

import vsl.compute

// Use CUDA backend automatically when available
ctx := compute.new_context(.cuda)

// All compute operations dispatch to CUDA when the backend is set
a := []f64{len: 6}
b := []f64{len: 6}
// ... fill a, b ...

result := compute.add_vec(ctx, a, b)!

Or directly via the backend:

import vsl.cuda

mut dev := cuda.get_default_device()!
mut backend := cuda.new_cuda_backend()

result := backend.relu(my_data)!

πŸ”§ Requirements

Runtime (at runtime)

  • NVIDIA GPU with compute capability β‰₯ 5.0 (Maxwell or newer)
  • NVIDIA Driver β‰₯ 525.60 (for CUDA 12.x)
  • CUDA Toolkit β‰₯ 11.8
  • cuDNN β‰₯ 8.0

Build time (for CUDA GPU builds)

  • nvcc (NVIDIA C compiler) in $PATH
  • CUDA Toolkit headers (cuda.h, cublas.h, cudnn.h)
  • cuDNN headers

πŸ“¦ Installation

Arch Linux

##sudo pacman -S nvidia nvidia-utils

##sudo pacman -S cuda

##sudo pacman -S cudnn

##nvcc --version        ##nvidia-smi            ##

Ubuntu / Debian

##sudo apt-get install nvidia-driver-535

##wget https://developer.downloads.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-11all.deb
sudo apt-get update
sudo apt-get install cuda-toolkit-12-2

##sudo apt-get install libcudnn8 libcudnn8-dev

macOS

CUDA Toolkit is no longer supported on macOS arm64 (Apple Silicon). For GPU acceleration on Apple Silicon, use the VCL/OpenCL backend instead.

πŸ—οΈ Compiling with CUDA

VSL uses conditional compilation ($if cuda ?) to include CUDA code:

##v -d cuda run your_app.v

##v -d cuda test .

##v -d cuda vet .

Set CUDNN_PATH if cuDNN is not in a standard location:

CUDNN_PATH=/opt/cuda v -d cuda run your_app.v

⚑ Performance Notes

  • GEMM (matrix-matrix) β€” the highest-impact primitive. cuBLAS dgemm is heavily optimized for NVIDIA Tensor Cores on Ampere+ GPUs.
  • cuDNN activations β€” fused kernels for ReLU, Sigmoid, Tanh provide ~2-5x speedup over CPU for large tensors.
  • Memory layout β€” cuBLAS uses column-major; VSL uses row-major. The to_internal() / from_internal() conversion adds overhead for small tensors but is negligible for large matrices (β‰₯ 256Γ—256).

πŸ§ͺ Examples

See the cuda/examples/ directory:

##v -d cuda run cuda/examples/relu_example.v

πŸ—ΊοΈ Roadmap

Phase Description
βœ… Phase B Infrastructure ready; CPU fallback path
βœ… Phase C cuBLAS/cuDNN kernels for GEMM/GEMV, activations, softmax, Conv2D, LayerNorm
πŸ—οΈ Phase D Device discovery, multi-GPU support
🧠 Phase E GPU memory management (avoid CPU↔GPU copies)
βœ… Phase F Numerical validation vs reference for key kernels

πŸ”— Resources


Accelerate your scientific computing with NVIDIA GPUs! πŸš€

Constants #

const cuda_memcpy_host_to_device = 2
const cuda_memcpy_device_to_host = 3
const cuda_memcpy_device_to_device = 4
const cublas_status_success = 0
const cublas_op_n = 0 // Non-transpose

CublasOperation values.

const cublas_op_t = 1 // Transpose
const cublas_op_c = 2 // Conjugate transpose
const cudnn_status_success = 0
const cudnn_activation_relu = 1

CudnnActivationMode values (cuDNN 9.x deprecated legacy API). Matches cudnnActivationMode_t enum in cudnn_graph_v9.h: CUDNN_ACTIVATION_SIGMOID=0, RELU=1, TANH=2, CLIPPED_RELU=3, ELU=4, ...

const cudnn_activation_sigmoid = 0
const cudnn_activation_tanh = 2
const cudnn_softmax_fast = 0

CudnnSoftmaxAlgorithm values.

const cudnn_softmax_mode_instance = 0 // apply softmax per image (NCHW)

CudnnSoftmaxMode values.

const cudnn_softmax_mode_channel = 1 // apply softmax per spatial location (CHANNELS)
const cudnn_tensor_nhwc = 0 // NCHW format (row-major in VSL)

CudnnTensorFormat values.

const cudnn_tensor_nchw = 1 // NCHW format
const cudnn_data_type_double = 1 // f64

CudnnDataType values.

const cudnn_data_type_float = 0 // f32

fn col_to_row_major #

fn col_to_row_major(data []f64, rows int, cols int) []f64

col_to_row_major converts column-major flat array back to row-major. For matrix [rows x cols], element (r, c) in column-major flat at index r + crows becomes element (r, c) in row-major flat at index rcols+c.

fn cublas_error #

fn cublas_error(status CublasStatus) string

cublas_error returns a string description of a cuBLAS status code.

fn cudnn_error #

fn cudnn_error(status CudnnStatus) string

cudnn_error returns a string description of a cuDNN status code.

fn get_default_device #

fn get_default_device() !&CudaDevice

get_default_device returns the first available CUDA device. Convenience wrapper; equivalent to get_device(0).

fn get_device #

fn get_device(index int) !&CudaDevice

get_device returns the CUDA device with the given index (0-based).

fn get_device_count #

fn get_device_count() !int

get_device_count returns the number of available CUDA devices. Calls cuInit internally if not yet initialized.

fn row_to_col_major #

fn row_to_col_major(data []f64, rows int, cols int) []f64

row_to_col_major converts row-major flat array to column-major. For matrix [rows x cols], element (r, c) in row-major flat at index rcols+c becomes element (r, c) in column-major flat at index r + crows.

type CudnnActivationDescriptor #

type CudnnActivationDescriptor = voidptr

cudnnActivationDescriptor_t is a handle to an activation descriptor.

type CudnnActivationMode #

type CudnnActivationMode = int

cudnnActivationMode_t specifies the activation mode.

type CudnnConvolutionBwdDataAlgo #

type CudnnConvolutionBwdDataAlgo = int

cudnnConvolutionBwdDataAlgo_t / BwdFilterAlgo_t β€” backward convolution algorithms.

type CudnnConvolutionBwdFilterAlgo #

type CudnnConvolutionBwdFilterAlgo = int

type CudnnConvolutionDescriptor #

type CudnnConvolutionDescriptor = voidptr

cudnnConvolutionDescriptor_t is a handle to a convolution descriptor.

type CudnnConvolutionFwdAlgo #

type CudnnConvolutionFwdAlgo = int

cudnnConvolutionFwdAlgo_t specifies the convolution forward algorithm.

type CudnnDataType #

type CudnnDataType = int

cudnnDataType_t specifies the data type of a tensor.

type CudnnFilterDescriptor #

type CudnnFilterDescriptor = voidptr

cudnnFilterDescriptor_t is a handle to a filter descriptor.

type CudnnSoftmaxAlgorithm #

type CudnnSoftmaxAlgorithm = int

cudnnSoftmaxAlgorithm_t specifies the softmax algorithm.

type CudnnSoftmaxMode #

type CudnnSoftmaxMode = int

cudnnSoftmaxMode_t specifies the softmax mode.

type CudnnTensorDescriptor #

type CudnnTensorDescriptor = voidptr

cudnnTensorDescriptor_t is a handle to a tensor descriptor.

type CudnnTensorFormat #

type CudnnTensorFormat = int

cudnnTensorFormat_t specifies the memory layout of a tensor.

type Device #

type Device = CudaDevice

Device is an alias for CudaDevice exposed at the top-level module boundary. Users that only need the device type can import vsl.cuda and use cuda.Device.

struct CudaDevice #

@[heap]
struct CudaDevice {
pub mut:
	// device_id is the CUDA device ordinal (0, 1, ...).
	device_id int
	// name is the human-readable device name (e.g. "NVIDIA GeForce RTX 4060").
	name string
	// handle is the CUDA device handle (CUdevice for driver API).
	handle voidptr
	// ctx is the CUDA context for this device (driver API), if created.
	ctx CudaContext
	// cublas is the cuBLAS context handle.
	cublas CublasHandle
	// cudnn is the cuDNN context handle.
	cudnn CudnnHandle
	// stream is the CUDA stream used for operations.
	stream CudaStream
}

CudaDevice represents a CUDA device with its cuBLAS and cuDNN handles. It wraps all GPU resource management for the CUDA backend.

fn (CudaDevice) release #

fn (mut d CudaDevice) release() !

release releases all CUDA resources (context, cuBLAS, cuDNN, stream).

fn (CudaDevice) init #

fn (mut d CudaDevice) init() !

init initializes cuBLAS and cuDNN handles for this device.