Skip to content

cuda #

CUDA Compute Backend for VSL πŸ–₯️

vsl.cuda is a high-performance GPU compute backend for VSL backed by NVIDIA CUDA (cuBLAS + cuDNN).

πŸš€ Status

⚠️ Phase B β€” Infrastructure ready; cuBLAS/cuDNN bindings are pending.
All operations currently fall back to CPU. Once CUDA Toolkit is available on
the build machine, the TODO(#238) markers can be replaced with actual GPU
kernels.

Operation Status cuBLAS/cuDNN
gemm βœ… Stub (CPU fallback) cublasDgemm
gemv βœ… Stub (CPU fallback) cublasDgvm
relu βœ… Stub (CPU fallback) cuDNNReLU
sigmoid βœ… Stub (CPU fallback) cuDNNSigmoid
tanh βœ… Stub (CPU fallback) cuDNNTanh
add_vec βœ… Stub (CPU fallback) custom kernel
mul_vec βœ… Stub (CPU fallback) custom kernel
add_scalar βœ… Stub (CPU fallback) custom kernel
mul_scalar βœ… Stub (CPU fallback) cublasDscal
softmax βœ… Stub (CPU fallback) cuDNNSoftmaxForward
layernorm βœ… Stub (CPU fallback) cuDNNLayerNorm
conv2d βœ… Stub (CPU fallback) cuDNNConvolutionForward

See issue #238 for the cuBLAS/cuDNN implementation tracker.

πŸ“ Architecture

vsl.cuda
β”œβ”€β”€ backend.v           # CUDABackend (ComputeBackend interface impl)
β”œβ”€β”€ compute/
β”‚   β”œβ”€β”€ elementwise.v   # Activation functions (relu, sigmoid, tanh, ...)
β”‚   β”œβ”€β”€ gemm.v          # Public GEMM wrapper (row↔col conversion)
β”‚   └── gemm_impl.v     # Internal GEMM/GEMV stubs + CPU fallbacks
└── v.mod

Memory layout: cuBLAS is column-major (same as VCL/Vulkan). The CUDABackend.to_internal() / from_internal() methods handle row↔column-major conversion at the dispatch boundary.

🎯 Quick Start

import vsl.compute

// Use CUDA backend automatically when available
ctx := compute.new_context(.cuda)

// All compute operations dispatch to CUDA when the backend is set
a := []f64{len: 6}
b := []f64{len: 6}
// ... fill a, b ...

result := compute.add_vec(ctx, a, b)!

Or directly via the backend:

import vsl.cuda

mut dev := cuda.get_default_device()!
mut backend := cuda.new_cuda_backend()

result := backend.relu(my_data)!

πŸ”§ Requirements

Runtime (at runtime)

  • NVIDIA GPU with compute capability β‰₯ 5.0 (Maxwell or newer)
  • NVIDIA Driver β‰₯ 525.60 (for CUDA 12.x)
  • CUDA Toolkit β‰₯ 11.8
  • cuDNN β‰₯ 8.0

Build time (for CUDA GPU builds)

  • nvcc (NVIDIA C compiler) in $PATH
  • CUDA Toolkit headers (cuda.h, cublas.h, cudnn.h)
  • cuDNN headers

πŸ“¦ Installation

Arch Linux

##sudo pacman -S nvidia nvidia-utils

##sudo pacman -S cuda

##sudo pacman -S cudnn

##nvcc --version        ##nvidia-smi            ##

Ubuntu / Debian

##sudo apt-get install nvidia-driver-535

##wget https://developer.downloads.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-11all.deb
sudo apt-get update
sudo apt-get install cuda-toolkit-12-2

##sudo apt-get install libcudnn8 libcudnn8-dev

macOS

CUDA Toolkit is no longer supported on macOS arm64 (Apple Silicon). For GPU acceleration on Apple Silicon, use the VCL/OpenCL backend instead.

πŸ—οΈ Compiling with CUDA

VSL uses conditional compilation ($if cuda ?) to include CUDA code:

##v -d cuda run your_app.v

##v -d cuda test .

##v -d cuda vet .

Set CUDNN_PATH if cuDNN is not in a standard location:

CUDNN_PATH=/opt/cuda v -d cuda run your_app.v

⚑ Performance Notes

  • GEMM (matrix-matrix) β€” the highest-impact primitive. cuBLAS dgemm is heavily optimized for NVIDIA Tensor Cores on Ampere+ GPUs.
  • cuDNN activations β€” fused kernels for ReLU, Sigmoid, Tanh provide ~2-5x speedup over CPU for large tensors.
  • Memory layout β€” cuBLAS uses column-major; VSL uses row-major. The to_internal() / from_internal() conversion adds overhead for small tensors but is negligible for large matrices (β‰₯ 256Γ—256).

πŸ§ͺ Examples

See the cuda/examples/ directory:

##v -d cuda run cuda/examples/relu_example.v

πŸ—ΊοΈ Roadmap

Phase Description
βœ… Phase B Infrastructure ready; stub implementations with CPU fallbacks
🎯 Phase C Replace stubs with actual cuBLAS/cuDNN kernels (issue #238)
πŸ—οΈ Phase D Device discovery, multi-GPU support
🧠 Phase E GPU memory management (avoid CPU↔GPU copies)
πŸ§ͺ Phase F Test suite, numerical validation vs reference

πŸ”— Resources


Accelerate your scientific computing with NVIDIA GPUs! πŸš€

Constants #

const cublas_status_success = 0
const cublas_op_n = 0 // Non-transpose

CublasOperation values.

const cublas_op_t = 1 // Transpose
const cublas_op_c = 2 // Conjugate transpose
const cudnn_status_success = 0
const cudnn_activation_relu = 1

CudnnActivationMode values (cuDNN 9.x deprecated legacy API). Matches cudnnActivationMode_t enum in cudnn_graph_v9.h: CUDNN_ACTIVATION_SIGMOID=0, RELU=1, TANH=2, CLIPPED_RELU=3, ELU=4, ...

const cudnn_activation_sigmoid = 0
const cudnn_activation_tanh = 2
const cudnn_softmax_fast = 0

CudnnSoftmaxAlgorithm values.

const cudnn_softmax_mode_instance = 0 // apply softmax per image (NCHW)

CudnnSoftmaxMode values.

const cudnn_softmax_mode_channel = 1 // apply softmax per spatial location (CHANNELS)
const cudnn_tensor_nhwc = 0 // NCHW format (row-major in VSL)

CudnnTensorFormat values.

const cudnn_tensor_nchw = 1 // NCHW format
const cudnn_data_type_double = 1 // f64

CudnnDataType values.

const cudnn_data_type_float = 0 // f32

fn col_to_row_major #

fn col_to_row_major(data []f64, rows int, cols int) []f64

col_to_row_major converts column-major flat array back to row-major. For matrix [rows x cols], element (r, c) in column-major flat at index r + crows becomes element (r, c) in row-major flat at index rcols+c.

fn cublas_error #

fn cublas_error(status CublasStatus) string

cublas_error returns a string description of a cuBLAS status code.

fn cudnn_error #

fn cudnn_error(status CudnnStatus) string

cudnn_error returns a string description of a cuDNN status code.

fn get_default_device #

fn get_default_device() !&CudaDevice

get_default_device returns the first available CUDA device. Convenience wrapper; equivalent to get_device(0).

fn get_device #

fn get_device(index int) !&CudaDevice

get_device returns the CUDA device with the given index (0-based).

fn get_device_count #

fn get_device_count() !int

get_device_count returns the number of available CUDA devices. Calls cuInit internally if not yet initialized.

fn row_to_col_major #

fn row_to_col_major(data []f64, rows int, cols int) []f64

row_to_col_major converts row-major flat array to column-major. For matrix [rows x cols], element (r, c) in row-major flat at index rcols+c becomes element (r, c) in column-major flat at index r + crows.

type CudnnActivationDescriptor #

type CudnnActivationDescriptor = voidptr

cudnnActivationDescriptor_t is a handle to an activation descriptor.

type CudnnActivationMode #

type CudnnActivationMode = int

cudnnActivationMode_t specifies the activation mode.

type CudnnConvolutionDescriptor #

type CudnnConvolutionDescriptor = voidptr

cudnnConvolutionDescriptor_t is a handle to a convolution descriptor.

type CudnnConvolutionFwdAlgo #

type CudnnConvolutionFwdAlgo = int

cudnnConvolutionFwdAlgo_t specifies the convolution forward algorithm.

type CudnnDataType #

type CudnnDataType = int

cudnnDataType_t specifies the data type of a tensor.

type CudnnFilterDescriptor #

type CudnnFilterDescriptor = voidptr

cudnnFilterDescriptor_t is a handle to a filter descriptor.

type CudnnSoftmaxAlgorithm #

type CudnnSoftmaxAlgorithm = int

cudnnSoftmaxAlgorithm_t specifies the softmax algorithm.

type CudnnSoftmaxMode #

type CudnnSoftmaxMode = int

cudnnSoftmaxMode_t specifies the softmax mode.

type CudnnTensorDescriptor #

type CudnnTensorDescriptor = voidptr

cudnnTensorDescriptor_t is a handle to a tensor descriptor.

type CudnnTensorFormat #

type CudnnTensorFormat = int

cudnnTensorFormat_t specifies the memory layout of a tensor.

type Device #

type Device = CudaDevice

Device is an alias for CudaDevice exposed at the top-level module boundary. Users that only need the device type can import vsl.cuda and use cuda.Device.

struct CudaDevice #

@[heap]
struct CudaDevice {
pub mut:
	// device_id is the CUDA device ordinal (0, 1, ...).
	device_id int
	// name is the human-readable device name (e.g. "NVIDIA GeForce RTX 4060").
	name string
	// handle is the CUDA device handle (CUdevice for driver API).
	handle voidptr
	// ctx is the CUDA context for this device (driver API), if created.
	ctx CudaContext
	// cublas is the cuBLAS context handle.
	cublas CublasHandle
	// cudnn is the cuDNN context handle.
	cudnn CudnnHandle
	// stream is the CUDA stream used for operations.
	stream CudaStream
}

CudaDevice represents a CUDA device with its cuBLAS and cuDNN handles. It wraps all GPU resource management for the CUDA backend.

fn (CudaDevice) release #

fn (mut d CudaDevice) release() !

release releases all CUDA resources (context, cuBLAS, cuDNN, stream).

fn (CudaDevice) init #

fn (mut d CudaDevice) init() !

init initializes cuBLAS and cuDNN handles for this device.