
model_selection #

VSL Model Selection Module

The vsl.model_selection module provides utilities for splitting data and performing cross-validation for machine learning model evaluation.

Features

Train-Test Splitting

  • train_test_split: Split arrays into random train and test subsets
  • train_val_test_split: Three-way split into train, validation, and test
  • Stratified splitting: Preserve class proportions in splits

Cross-Validation

  • KFold: K-Fold cross-validation iterator
  • StratifiedKFold: Stratified K-Fold (preserves class balance)
  • LeaveOneOut: Leave-one-out cross-validation
  • ShuffleSplit: Random permutation cross-validator

Quick Start

Basic Train-Test Split

import vsl.model_selection

x := [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
y := [0.0, 1.0, 0.0, 1.0, 0.0]

result := model_selection.train_test_split(x, y, model_selection.TrainTestSplitConfig{
    test_size:   0.2
    shuffle:     true
    random_seed: 42
})!

println('Train size: ${result.x_train.len}')
println('Test size: ${result.x_test.len}')

Stratified Split (Preserve Class Proportions)

import vsl.model_selection

x := [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0],
    [9.0], [10.0]]
y := [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0]

result := model_selection.train_test_split(x, y, model_selection.TrainTestSplitConfig{
    test_size: 0.3
    stratify:  true // Maintains 50/50 class balance in both sets
})!
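To confirm the stratification, a quick sanity check that counts the labels on each side of the split (a sketch continuing from the `result` returned above):

```v
// Count class-1 labels on each side (class 0 is the remainder)
mut train_ones := 0
for label in result.y_train {
	if label == 1.0 {
		train_ones++
	}
}
mut test_ones := 0
for label in result.y_test {
	if label == 1.0 {
		test_ones++
	}
}
println('train: ${train_ones}/${result.y_train.len} class-1')
println('test: ${test_ones}/${result.y_test.len} class-1')
```

With `test_size: 0.3` on this 10-sample, 50/50 dataset, both subsets should keep the even class balance.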

Train-Validation-Test Split

import vsl.model_selection

// x and y as defined in the stratified split example above
result := model_selection.train_val_test_split(x, y, model_selection.TrainValTestConfig{
    val_size:  0.1 // 10% validation
    test_size: 0.2 // 20% test (the remaining 70% is train)
    stratify:  true
})!

println('Train: ${result.x_train.len}')
println('Val: ${result.x_val.len}')
println('Test: ${result.x_test.len}')

K-Fold Cross-Validation

import vsl.model_selection

kf := model_selection.KFold.new(5, true, 42)
folds := kf.split(100)! // 100 samples

for i, fold in folds {
    println('Fold ${i + 1}:')
    println('  Train indices: ${fold.train_indices.len}')
    println('  Test indices: ${fold.test_indices.len}')

    // Get the actual data for this fold (x and y are assumed defined, with 100 rows)
    x_train, y_train, x_test, y_test := model_selection.get_fold_data(x, y, fold)

    // Train and evaluate your model on x_train/y_train and x_test/y_test
}

Stratified K-Fold

import vsl.model_selection

y := [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0]

skf := model_selection.StratifiedKFold.new(5, true, 42)
folds := skf.split(y)!

// Each fold preserves the 50/50 class proportions of y
// (note: n_splits must not exceed the smallest class count)

Leave-One-Out Cross-Validation

import vsl.model_selection

loo := model_selection.LeaveOneOut.new()
folds := loo.split(10) // returns 10 folds, each with 1 test sample

for fold in folds {
    // fold.test_indices has exactly 1 element
    // fold.train_indices has n - 1 elements
    println('held-out sample index: ${fold.test_indices[0]}')
}

Shuffle Split (Multiple Random Splits)

import vsl.model_selection

ss := model_selection.ShuffleSplit.new(10, 0.2, 42) // 10 splits, 20% test
folds := ss.split(100)!

// Creates 10 different random splits
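Unlike KFold, ShuffleSplit draws each split independently, so a sample may land in several test sets. A sketch illustrating this with the API shown above:

```v
import vsl.model_selection

ss := model_selection.ShuffleSplit.new(10, 0.2, 42)
folds := ss.split(100)!

// Tally how often each sample index appears in a test set
mut test_counts := []int{len: 100}
for fold in folds {
	for idx in fold.test_indices {
		test_counts[idx]++
	}
}
// With 5-fold KFold every count would be exactly 1; here,
// independent shuffles let a sample recur across test sets
mut max_count := 0
for c in test_counts {
	if c > max_count {
		max_count = c
	}
}
println('max test-set appearances across 10 splits: ${max_count}')
```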

API Reference

TrainTestSplitConfig

Field        Type  Default  Description
test_size    f64   0.25     Proportion of data for the test set (0.0-1.0)
shuffle      bool  true     Whether to shuffle before splitting
stratify     bool  false    Whether to preserve class proportions
random_seed  u32   42       Random seed for reproducibility

TrainValTestConfig

Field        Type  Default  Description
val_size     f64   0.1      Proportion for the validation set
test_size    f64   0.2      Proportion for the test set
shuffle      bool  true     Whether to shuffle before splitting
stratify     bool  false    Whether to preserve class proportions
random_seed  u32   42       Random seed for reproducibility

Cross-Validators

Class            Description
KFold            Standard K-Fold CV
StratifiedKFold  Stratified K-Fold (for classification)
LeaveOneOut      Leave-one-out CV (n folds for n samples)
ShuffleSplit     Random permutation splits

Fold Structure

pub struct Fold {
pub mut:
	train_indices []int // indices for training
	test_indices  []int // indices for testing
}

Helper Functions

Function                   Description
get_fold_data(x, y, fold)  Extract the actual data arrays from fold indices

Integration with ML Pipeline

import vsl.model_selection
import vsl.metrics
import vsl.ml

// Prepare data (replace with your actual dataset)
x := your_features // [][]f64
y := your_labels // []f64

// Cross-validation
kf := model_selection.KFold.new(5, true, 42)
folds := kf.split(x.len)!

mut scores := []f64{}

for fold in folds {
    x_train, y_train, x_test, y_test := model_selection.get_fold_data(x, y, fold)

    // Create ML Data
    mut train_data := ml.Data.from_raw_xy_sep(x_train, y_train)!

    // Train model
    mut model := ml.LogReg.new(mut train_data, 'cv_model')
    model.train(ml.LogRegTrainConfig{})

    // Predict
    mut predictions := []f64{}
    for row in x_test {
        predictions << model.predict(row)
    }

    // Evaluate
    acc := metrics.accuracy_score(y_test, predictions)!
    scores << acc
}

// Report average score
mut avg := 0.0
for s in scores {
    avg += s
}
avg /= f64(scores.len)
println('Average CV Accuracy: ${avg}')

Best Practices

  1. Use stratified splitting for classification - it keeps the class distribution representative in every split
  2. Set random_seed for reproducibility - the same seed always produces the same splits
  3. Use K=5 or K=10 for cross-validation - common choices that balance bias and variance
  4. Use LeaveOneOut for small datasets - it maximizes the training data in each fold
  5. Use ShuffleSplit for large datasets - a fixed number of random splits is cheaper than full K-Fold
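Point 2 can be checked directly: calling the splitter twice with the same seed should yield identical subsets. A small sketch using the API documented above (seed value 7 chosen arbitrarily):

```v
import vsl.model_selection

x := [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0], [9.0], [10.0]]
y := [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]

// Two splits with the same seed should produce identical subsets
a := model_selection.train_test_split(x, y, model_selection.TrainTestSplitConfig{
	test_size:   0.3
	random_seed: 7
})!
b := model_selection.train_test_split(x, y, model_selection.TrainTestSplitConfig{
	test_size:   0.3
	random_seed: 7
})!
assert a.y_train == b.y_train
assert a.y_test == b.y_test
println('same seed, same splits')
```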

Notes

  • Stratified methods expect integer-valued class labels (encoded as f64, e.g. 0.0, 1.0)
  • Cross-validators return index folds; use get_fold_data to obtain the actual data arrays
  • Train and test indices are always disjoint within each fold
  • The random seed ensures reproducible splits across runs

fn get_fold_data #

fn get_fold_data(x [][]f64, y []f64, fold Fold) ([][]f64, []f64, [][]f64, []f64)

get_fold_data extracts the actual data for a fold

fn train_test_split #

fn train_test_split(x [][]f64, y []f64, config TrainTestSplitConfig) !TrainTestResult

train_test_split splits arrays into random train and test subsets

fn train_val_test_split #

fn train_val_test_split(x [][]f64, y []f64, config TrainValTestConfig) !TrainValTestResult

train_val_test_split splits data into train, validation, and test sets

fn KFold.new #

fn KFold.new(n_splits int, shuffle bool, random_seed u32) &KFold

KFold.new creates a new KFold cross-validator

fn LeaveOneOut.new #

fn LeaveOneOut.new() &LeaveOneOut

LeaveOneOut.new creates a new LeaveOneOut cross-validator

fn ShuffleSplit.new #

fn ShuffleSplit.new(n_splits int, test_size f64, random_seed u32) &ShuffleSplit

ShuffleSplit.new creates a new ShuffleSplit cross-validator

fn StratifiedKFold.new #

fn StratifiedKFold.new(n_splits int, shuffle bool, random_seed u32) &StratifiedKFold

StratifiedKFold.new creates a new StratifiedKFold cross-validator

struct Fold #

struct Fold {
pub mut:
	train_indices []int
	test_indices  []int
}

Fold represents a single fold in cross-validation

struct KFold #

@[heap]
struct KFold {
pub:
	n_splits    int  = 5
	shuffle     bool = true
	random_seed u32  = 42
}

KFold generates train/test indices for k-fold cross-validation

fn (KFold) split #

fn (kf &KFold) split(n_samples int) ![]Fold

split generates indices to split data into training and test sets

struct LeaveOneOut #

@[heap]
struct LeaveOneOut {
}

LeaveOneOut implements leave-one-out cross-validation

fn (LeaveOneOut) split #

fn (loo &LeaveOneOut) split(n_samples int) []Fold

split generates indices for leave-one-out CV

struct ShuffleSplit #

@[heap]
struct ShuffleSplit {
pub:
	n_splits    int = 5
	test_size   f64 = 0.2
	random_seed u32 = 42
}

ShuffleSplit generates random permutation cross-validator

fn (ShuffleSplit) split #

fn (ss &ShuffleSplit) split(n_samples int) ![]Fold

split generates random train/test indices

struct StratifiedKFold #

@[heap]
struct StratifiedKFold {
pub:
	n_splits    int  = 5
	shuffle     bool = true
	random_seed u32  = 42
}

StratifiedKFold generates stratified train/test indices

fn (StratifiedKFold) split #

fn (skf &StratifiedKFold) split(y []f64) ![]Fold

split generates stratified indices to split data

struct TrainTestResult #

struct TrainTestResult {
pub:
	x_train [][]f64
	x_test  [][]f64
	y_train []f64
	y_test  []f64
}

TrainTestResult holds the result of a train-test split

struct TrainTestSplitConfig #

struct TrainTestSplitConfig {
pub:
	test_size   f64  = 0.25 // proportion of data for test set
	shuffle     bool = true // whether to shuffle before splitting
	stratify    bool // whether to preserve class proportions
	random_seed u32 = 42 // random seed for reproducibility
}

TrainTestSplitConfig holds configuration for train-test split

struct TrainValTestConfig #

struct TrainValTestConfig {
pub:
	val_size    f64  = 0.1 // proportion for validation
	test_size   f64  = 0.2 // proportion for test
	shuffle     bool = true
	stratify    bool
	random_seed u32 = 42
}

TrainValTestConfig holds configuration for train-val-test split

struct TrainValTestResult #

struct TrainValTestResult {
pub:
	x_train [][]f64
	x_val   [][]f64
	x_test  [][]f64
	y_train []f64
	y_val   []f64
	y_test  []f64
}

TrainValTestResult holds the result of a train-val-test split