model_selection #
VSL Model Selection Module
The vsl.model_selection module provides utilities for splitting data and performing cross-validation for machine learning model evaluation.
Features
Train-Test Splitting
- train_test_split: Split arrays into random train and test subsets
- train_val_test_split: Three-way split into train, validation, and test
- Stratified splitting: Preserve class proportions in splits
Cross-Validation
- KFold: K-Fold cross-validation iterator
- StratifiedKFold: Stratified K-Fold (preserves class balance)
- LeaveOneOut: Leave-one-out cross-validation
- ShuffleSplit: Random permutation cross-validator
Quick Start
Basic Train-Test Split
```v
import vsl.model_selection

x := [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
y := [0.0, 1.0, 0.0, 1.0, 0.0]

result := model_selection.train_test_split(x, y, model_selection.TrainTestSplitConfig{
	test_size:   0.2
	shuffle:     true
	random_seed: 42
})!

println('Train size: ${result.x_train.len}')
println('Test size: ${result.x_test.len}')
```
Stratified Split (Preserve Class Proportions)
```v
import vsl.model_selection

x := [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0], [9.0], [10.0]]
y := [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0]

result := model_selection.train_test_split(x, y, model_selection.TrainTestSplitConfig{
	test_size: 0.3
	stratify:  true // maintains the 50/50 class balance in both sets
})!
```
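To see what `stratify` buys you, count the class labels in each split: both the train and test sets should keep the original 50/50 balance. This is a minimal sketch assuming only the example data above and the documented `TrainTestResult` fields:

```v
import vsl.model_selection

x := [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0], [9.0], [10.0]]
y := [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0]

result := model_selection.train_test_split(x, y, model_selection.TrainTestSplitConfig{
	test_size: 0.3
	stratify:  true
})!

// count class-1 labels in each split
mut train_ones := 0
for label in result.y_train {
	if label == 1.0 {
		train_ones++
	}
}
mut test_ones := 0
for label in result.y_test {
	if label == 1.0 {
		test_ones++
	}
}

// with stratify: true, roughly half the labels in each split are class 1
println('train: ${train_ones}/${result.y_train.len} class-1')
println('test:  ${test_ones}/${result.y_test.len} class-1')
```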
Train-Validation-Test Split
```v
import vsl.model_selection

result := model_selection.train_val_test_split(x, y, model_selection.TrainValTestConfig{
	val_size:  0.1 // 10% validation
	test_size: 0.2 // 20% test (70% train)
	stratify:  true
})!

println('Train: ${result.x_train.len}')
println('Val: ${result.x_val.len}')
println('Test: ${result.x_test.len}')
```
K-Fold Cross-Validation
```v
import vsl.model_selection

kf := model_selection.KFold.new(5, true, 42)
folds := kf.split(100)! // 100 samples

for i, fold in folds {
	println('Fold ${i + 1}:')
	println('  Train indices: ${fold.train_indices.len}')
	println('  Test indices: ${fold.test_indices.len}')

	// get the actual data for this fold
	x_train, y_train, x_test, y_test := model_selection.get_fold_data(x, y, fold)
	// train and evaluate your model here
}
```
Stratified K-Fold
```v
import vsl.model_selection

y := [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

skf := model_selection.StratifiedKFold.new(5, true, 42)
folds := skf.split(y)!
// each fold maintains the class proportions of y
```
Leave-One-Out Cross-Validation
```v
import vsl.model_selection

loo := model_selection.LeaveOneOut.new()
folds := loo.split(10) // returns 10 folds, each with 1 test sample

for fold in folds {
	// fold.test_indices has exactly 1 element
	// fold.train_indices has n - 1 elements
}
```
Shuffle Split (Multiple Random Splits)
```v
import vsl.model_selection

ss := model_selection.ShuffleSplit.new(10, 0.2, 42) // 10 splits, 20% test
folds := ss.split(100)!
// creates 10 different random splits
```
API Reference
TrainTestSplitConfig
| Field | Type | Default | Description |
|---|---|---|---|
| `test_size` | `f64` | `0.25` | Proportion for the test set (0.0–1.0) |
| `shuffle` | `bool` | `true` | Whether to shuffle before splitting |
| `stratify` | `bool` | `false` | Preserve class proportions |
| `random_seed` | `u32` | `42` | Random seed for reproducibility |
TrainValTestConfig
| Field | Type | Default | Description |
|---|---|---|---|
| `val_size` | `f64` | `0.1` | Proportion for the validation set |
| `test_size` | `f64` | `0.2` | Proportion for the test set |
| `shuffle` | `bool` | `true` | Whether to shuffle |
| `stratify` | `bool` | `false` | Preserve class proportions |
| `random_seed` | `u32` | `42` | Random seed |
Cross-Validators
| Class | Description |
|---|---|
| `KFold` | Standard K-Fold CV |
| `StratifiedKFold` | Stratified K-Fold (for classification) |
| `LeaveOneOut` | Leave-one-out CV (n folds for n samples) |
| `ShuffleSplit` | Random permutation splits |
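A quick way to compare the validators is by how many folds each produces for the same dataset. This sketch uses only the constructors and `split` signatures documented below, assuming 100 samples:

```v
import vsl.model_selection

n_samples := 100

// K-Fold: one fold per k
kf := model_selection.KFold.new(5, true, 42)
println(kf.split(n_samples)!.len) // 5 folds

// Leave-one-out: one fold per sample
loo := model_selection.LeaveOneOut.new()
println(loo.split(n_samples).len) // 100 folds

// ShuffleSplit: as many folds as you ask for, each an independent random split
ss := model_selection.ShuffleSplit.new(10, 0.2, 42)
println(ss.split(n_samples)!.len) // 10 folds
```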
Fold Structure
```v
pub struct Fold {
pub mut:
	train_indices []int // indices for training
	test_indices  []int // indices for testing
}
```
Helper Functions
| Function | Description |
|---|---|
| `get_fold_data(x, y, fold)` | Extract actual data arrays from fold indices |
Integration with ML Pipeline
```v
import vsl.model_selection
import vsl.metrics
import vsl.ml

// prepare data
x := your_features
y := your_labels

// cross-validation
kf := model_selection.KFold.new(5, true, 42)
folds := kf.split(x.len)!

mut scores := []f64{}
for fold in folds {
	x_train, y_train, x_test, y_test := model_selection.get_fold_data(x, y, fold)

	// wrap the fold's training data for the ML API
	mut train_data := ml.Data.from_raw_xy_sep(x_train, y_train)!

	// train the model
	mut model := ml.LogReg.new(mut train_data, 'cv_model')
	model.train(ml.LogRegTrainConfig{})

	// predict on the held-out fold
	mut predictions := []f64{}
	for row in x_test {
		predictions << model.predict(row)
	}

	// evaluate
	acc := metrics.accuracy_score(y_test, predictions)!
	scores << acc
}

// report the average score
mut avg := 0.0
for s in scores {
	avg += s
}
avg /= f64(scores.len)
println('Average CV Accuracy: ${avg}')
```
Best Practices
- Use stratified splitting for classification - Ensures representative class distribution
- Set random_seed for reproducibility - Same seed = same splits
- K=5 or K=10 for cross-validation - Common choices balancing bias-variance
- Use LeaveOneOut for small datasets - Maximizes training data
- Use ShuffleSplit for large datasets - Faster than full K-Fold
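The reproducibility point can be checked directly: two validators constructed with the same seed and parameters should produce identical folds. A minimal sketch using only the documented `KFold` API:

```v
import vsl.model_selection

a := model_selection.KFold.new(5, true, 42)
b := model_selection.KFold.new(5, true, 42)

folds_a := a.split(50)!
folds_b := b.split(50)!

// same seed, same parameters => the index arrays match fold for fold
for i in 0 .. folds_a.len {
	assert folds_a[i].train_indices == folds_b[i].train_indices
	assert folds_a[i].test_indices == folds_b[i].test_indices
}
println('splits are reproducible')
```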
Notes
- Stratified methods require labels to be integer-valued class labels (e.g. `0.0`, `1.0`)
- Cross-validation returns indices; use `get_fold_data` to extract the actual data arrays
- Train and test indices are always disjoint within each fold
- The random seed ensures reproducibility across runs
fn get_fold_data #
```v
fn get_fold_data(x [][]f64, y []f64, fold Fold) ([][]f64, []f64, [][]f64, []f64)
```
get_fold_data extracts the actual data for a fold
fn train_test_split #
```v
fn train_test_split(x [][]f64, y []f64, config TrainTestSplitConfig) !TrainTestResult
```
train_test_split splits arrays into random train and test subsets
fn train_val_test_split #
```v
fn train_val_test_split(x [][]f64, y []f64, config TrainValTestConfig) !TrainValTestResult
```
train_val_test_split splits data into train, validation, and test sets
fn KFold.new #
```v
fn KFold.new(n_splits int, shuffle bool, random_seed u32) &KFold
```
KFold.new creates a new KFold cross-validator
fn LeaveOneOut.new #
```v
fn LeaveOneOut.new() &LeaveOneOut
```
LeaveOneOut.new creates a new LeaveOneOut cross-validator
fn ShuffleSplit.new #
```v
fn ShuffleSplit.new(n_splits int, test_size f64, random_seed u32) &ShuffleSplit
```
ShuffleSplit.new creates a new ShuffleSplit cross-validator
fn StratifiedKFold.new #
```v
fn StratifiedKFold.new(n_splits int, shuffle bool, random_seed u32) &StratifiedKFold
```
StratifiedKFold.new creates a new StratifiedKFold cross-validator
struct Fold #
```v
struct Fold {
pub mut:
	train_indices []int
	test_indices  []int
}
```
Fold represents a single fold in cross-validation
struct KFold #
```v
struct KFold {
pub:
	n_splits    int  = 5
	shuffle     bool = true
	random_seed u32  = 42
}
```
KFold generates train/test indices for k-fold cross-validation
fn (KFold) split #
```v
fn (kf &KFold) split(n_samples int) ![]Fold
```
split generates indices to split data into training and test sets
struct LeaveOneOut #
```v
struct LeaveOneOut {}
```
LeaveOneOut implements leave-one-out cross-validation
fn (LeaveOneOut) split #
```v
fn (loo &LeaveOneOut) split(n_samples int) []Fold
```
split generates indices for leave-one-out CV
struct ShuffleSplit #
```v
struct ShuffleSplit {
pub:
	n_splits    int = 5
	test_size   f64 = 0.2
	random_seed u32 = 42
}
```
ShuffleSplit generates random permutation cross-validator
fn (ShuffleSplit) split #
```v
fn (ss &ShuffleSplit) split(n_samples int) ![]Fold
```
split generates random train/test indices
struct StratifiedKFold #
```v
struct StratifiedKFold {
pub:
	n_splits    int  = 5
	shuffle     bool = true
	random_seed u32  = 42
}
```
StratifiedKFold generates stratified train/test indices
fn (StratifiedKFold) split #
```v
fn (skf &StratifiedKFold) split(y []f64) ![]Fold
```
split generates stratified indices to split data
struct TrainTestResult #
```v
struct TrainTestResult {
pub:
	x_train [][]f64
	x_test  [][]f64
	y_train []f64
	y_test  []f64
}
```
TrainTestResult holds the result of a train-test split
struct TrainTestSplitConfig #
```v
struct TrainTestSplitConfig {
pub:
	test_size   f64  = 0.25 // proportion of data for the test set
	shuffle     bool = true // whether to shuffle before splitting
	stratify    bool // whether to preserve class proportions
	random_seed u32 = 42 // random seed for reproducibility
}
```
TrainTestSplitConfig holds configuration for train-test split
struct TrainValTestConfig #
```v
struct TrainValTestConfig {
pub:
	val_size    f64  = 0.1 // proportion for validation
	test_size   f64  = 0.2 // proportion for test
	shuffle     bool = true
	stratify    bool
	random_seed u32 = 42
}
```
TrainValTestConfig holds configuration for train-val-test split
struct TrainValTestResult #
```v
struct TrainValTestResult {
pub:
	x_train [][]f64
	x_val   [][]f64
	x_test  [][]f64
	y_train []f64
	y_val   []f64
	y_test  []f64
}
```
TrainValTestResult holds the result of a train-val-test split