preprocessing #

VSL Preprocessing Module

The vsl.preprocessing module provides utilities for data preprocessing commonly used in machine learning pipelines.

Features

Scalers

  • StandardScaler: Standardizes features by removing the mean and scaling to unit variance (Z-score normalization)
  • MinMaxScaler: Transforms features by scaling to a given range (default [0, 1])
  • RobustScaler: Scales features using statistics robust to outliers (median and IQR)

Encoders

  • LabelEncoder: Encodes categorical labels as integers
  • OneHotEncoder: Encodes categorical features as one-hot numeric arrays
  • OrdinalEncoder: Encodes categorical features as ordinal integers

Binning

  • cut: Bins values into discrete intervals using equal-width binning
  • qcut: Bins values using quantile-based (equal-frequency) binning
  • Binner: Fitted binning transformer for consistent bin assignment

Quick Start

Standard Scaling (Z-Score Normalization)

import vsl.preprocessing

data := [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

mut scaler := preprocessing.StandardScaler.new()
scaled := scaler.fit_transform(data)!

// Inverse transform to recover original values
recovered := scaler.inverse_transform(scaled)!

Min-Max Scaling

import vsl.preprocessing

data := [[0.0], [50.0], [100.0]]

// Scale to [0, 1] range
mut scaler := preprocessing.MinMaxScaler.new(0.0, 1.0)
scaled := scaler.fit_transform(data)!
// Result: [[0.0], [0.5], [1.0]]

// Or scale to custom range [-1, 1]
mut scaler2 := preprocessing.MinMaxScaler.new(-1.0, 1.0)
scaled2 := scaler2.fit_transform(data)!
// Result: [[-1.0], [0.0], [1.0]]

Robust Scaling

import vsl.preprocessing

// Data with outliers
data := [[1.0], [2.0], [3.0], [4.0], [5.0], [100.0]]

mut scaler := preprocessing.RobustScaler.new()
scaled := scaler.fit_transform(data)!
// Uses median and IQR, so the outlier has less effect

Label Encoding

import vsl.preprocessing

labels := ['cat', 'dog', 'bird', 'cat', 'dog']

mut encoder := preprocessing.LabelEncoder.new()
encoded := encoder.fit_transform(labels)!
// Result: [0, 1, 2, 0, 1] (integers assigned in order of first occurrence)

// Inverse transform
decoded := encoder.inverse_transform(encoded)!
// Result: ['cat', 'dog', 'bird', 'cat', 'dog']

One-Hot Encoding

import vsl.preprocessing

data := [['red'], ['blue'], ['green'], ['red']]

mut encoder := preprocessing.OneHotEncoder.new(false)
encoded := encoder.fit_transform(data)!
// Result: [[1,0,0], [0,1,0], [0,0,1], [1,0,0]]

// Get feature names
names := encoder.get_feature_names(['color'])!
// Result: ['color_red', 'color_blue', 'color_green']

// Dummy encoding (drop first category)
mut dummy_encoder := preprocessing.OneHotEncoder.new(true)
dummy := dummy_encoder.fit_transform(data)!
// Result: [[0,0], [1,0], [0,1], [0,0]]

Binning

import vsl.preprocessing

values := [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]

// Equal-width binning
result := preprocessing.cut(values, 2, ['low', 'high'])!

// Quantile binning (equal-frequency)
result2 := preprocessing.qcut(values, 4, ['Q1', 'Q2', 'Q3', 'Q4'])!

// Using Binner for consistent transformation
mut binner := preprocessing.Binner.new(3, .uniform, ['small', 'medium', 'large'])
binner.fit(values)!
binned := binner.transform(values)!

API Reference

StandardScaler

| Method | Description |
| --- | --- |
| new() | Creates a new StandardScaler |
| fit(x) | Computes mean and std from data |
| transform(x) | Applies standardization |
| fit_transform(x) | Fits and transforms in one step |
| inverse_transform(x) | Reverses standardization |

Attributes after fitting:

  • mean_: Mean of each feature
  • std_: Standard deviation of each feature

MinMaxScaler

| Method | Description |
| --- | --- |
| new(feature_min, feature_max) | Creates scaler with desired range |
| fit(x) | Computes min and max from data |
| transform(x) | Applies min-max scaling |
| fit_transform(x) | Fits and transforms in one step |
| inverse_transform(x) | Reverses scaling |

RobustScaler

| Method | Description |
| --- | --- |
| new() | Creates a new RobustScaler |
| fit(x) | Computes median and IQR from data |
| transform(x) | Applies robust scaling |
| fit_transform(x) | Fits and transforms in one step |

LabelEncoder

| Method | Description |
| --- | --- |
| new() | Creates a new LabelEncoder |
| fit(y) | Learns unique classes |
| transform(y) | Encodes labels to integers |
| fit_transform(y) | Fits and transforms in one step |
| inverse_transform(y) | Converts integers back to labels |

OneHotEncoder

| Method | Description |
| --- | --- |
| new(drop_first) | Creates encoder; drop_first for dummy encoding |
| fit(x) | Learns categories for each feature |
| transform(x) | Applies one-hot encoding |
| fit_transform(x) | Fits and transforms in one step |
| get_feature_names(input_names) | Returns output feature names |

Binner

| Method | Description |
| --- | --- |
| new(n_bins, strategy, labels) | Creates binner with specified strategy |
| fit(values) | Computes bin edges |
| transform(values) | Assigns values to bins |
| fit_transform(values) | Fits and transforms in one step |
| transform_to_indices(values) | Returns bin indices instead of labels |

BinningStrategy:

  • .uniform: Equal-width bins
  • .quantile: Equal-frequency bins

Integration with ML Pipeline

import vsl.preprocessing
import vsl.inout.csv
import vsl.ml

// Load data
csv_data := csv.read_csv('data.csv', csv.CsvReadConfig{ skip_header: true })!

// Scale features
mut scaler := preprocessing.StandardScaler.new()
scaled_data := scaler.fit_transform(csv_data.data)!

// Create ML Data object
mut data := ml.Data.from_raw_xy(scaled_data)!

// Train model
mut model := ml.LinReg.new(mut data, 'my_model')
model.train()

// For new predictions, use the same scaler
new_data := [[1.0, 2.0, 3.0]]
scaled_new := scaler.transform(new_data)!
prediction := model.predict(scaled_new[0])

fn cut #

fn cut(values []f64, n_bins int, labels []string) ![]string

cut bins values into discrete intervals using equal-width binning. Returns the bin labels for each value.

fn digitize #

fn digitize(values []f64, bins []f64) []int

digitize returns the indices of the bins to which each value belongs, similar to numpy.digitize.
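Unlike cut and qcut, digitize takes explicit bin edges rather than a bin count. A minimal sketch, assuming the boundary semantics follow numpy.digitize (a value below the first edge maps to index 0, and each edge starts a new index):

```v
import vsl.preprocessing

values := [0.5, 1.5, 2.5, 3.5]
bins := [1.0, 2.0, 3.0]

// Each value is mapped to the index of the interval it falls into,
// so values below bins[0] share index 0 and values above the last
// edge share the highest index.
indices := preprocessing.digitize(values, bins)
println(indices)
```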

fn qcut #

fn qcut(values []f64, n_bins int, labels []string) ![]string

qcut bins values into discrete intervals using quantile-based binning. Returns the bin labels for each value.

fn Binner.new #

fn Binner.new(n_bins int, strategy BinningStrategy, labels []string) &Binner

Binner.new creates a new Binner

fn BinningStrategy.from #

fn BinningStrategy.from[W](input W) !BinningStrategy

fn LabelEncoder.new #

fn LabelEncoder.new() &LabelEncoder

LabelEncoder.new creates a new LabelEncoder

fn MinMaxScaler.new #

fn MinMaxScaler.new(feature_min f64, feature_max f64) &MinMaxScaler

MinMaxScaler.new creates a new MinMaxScaler with the given output range

fn OneHotEncoder.new #

fn OneHotEncoder.new(drop_first bool) &OneHotEncoder

OneHotEncoder.new creates a new OneHotEncoder

fn OrdinalEncoder.new #

fn OrdinalEncoder.new() &OrdinalEncoder

OrdinalEncoder.new creates a new OrdinalEncoder
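OrdinalEncoder is not covered in the Quick Start above. A minimal sketch, assuming each category is assigned a float code per feature column in the order fit encounters it (mirroring LabelEncoder's first-occurrence ordering):

```v
import vsl.preprocessing

data := [['small'], ['large'], ['medium'], ['small']]

mut encoder := preprocessing.OrdinalEncoder.new()
// Each category string is replaced by a numeric code,
// one column of codes per input feature.
encoded := encoder.fit_transform(data)!
println(encoded)
```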

fn RobustScaler.new #

fn RobustScaler.new() &RobustScaler

RobustScaler.new creates a new RobustScaler

fn StandardScaler.new #

fn StandardScaler.new() &StandardScaler

StandardScaler.new creates a new StandardScaler

enum BinningStrategy #

enum BinningStrategy {
	uniform  // equal-width bins
	quantile // equal-frequency bins
}

BinningStrategy specifies how to create bin edges
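The Quick Start above uses .uniform; with .quantile, the Binner instead places roughly equal numbers of values in each bin, which matters for skewed data. A minimal sketch:

```v
import vsl.preprocessing

// Skewed data: equal-width bins would leave the upper bins nearly empty
values := [1.0, 1.2, 1.4, 1.6, 2.0, 50.0]

mut binner := preprocessing.Binner.new(2, .quantile, ['lower half', 'upper half'])
// Quantile edges split the data by count, not by value range
binned := binner.fit_transform(values)!
println(binned)
```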

struct Binner #

@[heap]
struct Binner {
mut:
	fitted bool
pub mut:
	edges_   []f64    // bin edges
	labels_  []string // bin labels
	n_bins   int
	strategy BinningStrategy
}

Binner provides a fitted binning transformer

fn (Binner) fit #

fn (mut b Binner) fit(values []f64) !

fit computes the bin edges based on the data

fn (Binner) transform #

fn (b &Binner) transform(values []f64) ![]string

transform applies the binning to the data

fn (Binner) fit_transform #

fn (mut b Binner) fit_transform(values []f64) ![]string

fit_transform fits and transforms in one step

fn (Binner) transform_to_indices #

fn (b &Binner) transform_to_indices(values []f64) ![]int

transform_to_indices returns bin indices instead of labels
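Indices are useful when the binned feature feeds a numeric model rather than a report. A minimal sketch, reusing the Quick Start setup:

```v
import vsl.preprocessing

values := [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

mut binner := preprocessing.Binner.new(3, .uniform, ['small', 'medium', 'large'])
binner.fit(values)!
// 0-based bin indices instead of the string labels
indices := binner.transform_to_indices(values)!
println(indices)
```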

struct LabelEncoder #

@[heap]
struct LabelEncoder {
mut:
	fitted bool
pub mut:
	classes_     []string       // unique classes in order
	class_to_idx map[string]int // mapping from class to index
}

LabelEncoder encodes categorical labels with values between 0 and n_classes-1.

fn (LabelEncoder) fit #

fn (mut e LabelEncoder) fit(y []string) !

fit learns the unique classes from the data

fn (LabelEncoder) transform #

fn (e &LabelEncoder) transform(y []string) ![]int

transform encodes labels to integers

fn (LabelEncoder) fit_transform #

fn (mut e LabelEncoder) fit_transform(y []string) ![]int

fit_transform fits and transforms in one step

fn (LabelEncoder) inverse_transform #

fn (e &LabelEncoder) inverse_transform(y []int) ![]string

inverse_transform converts integer labels back to original labels

struct MinMaxScaler #

@[heap]
struct MinMaxScaler {
mut:
	fitted bool
pub mut:
	min_        []f64 // minimum of each feature
	max_        []f64 // maximum of each feature
	data_range_ []f64 // max - min for each feature
	n_features  int
	feature_min f64 = 0.0 // desired minimum of transformed feature
	feature_max f64 = 1.0 // desired maximum of transformed feature
}

MinMaxScaler transforms features by scaling each feature to a given range. Formula: x_scaled = (x - min) / (max - min) * (feature_max - feature_min) + feature_min

fn (MinMaxScaler) fit #

fn (mut s MinMaxScaler) fit(x [][]f64) !

fit computes the min and max to be used for later scaling

fn (MinMaxScaler) transform #

fn (s &MinMaxScaler) transform(x [][]f64) ![][]f64

transform applies the min-max scaling to the data

fn (MinMaxScaler) fit_transform #

fn (mut s MinMaxScaler) fit_transform(x [][]f64) ![][]f64

fit_transform fits and transforms in one step

fn (MinMaxScaler) inverse_transform #

fn (s &MinMaxScaler) inverse_transform(x [][]f64) ![][]f64

inverse_transform reverses the min-max scaling

struct OneHotEncoder #

@[heap]
struct OneHotEncoder {
mut:
	fitted bool
pub mut:
	categories_ [][]string // categories for each feature
	n_features  int
	drop_first  bool // whether to drop the first category (for dummy encoding)
}

OneHotEncoder encodes categorical features as a one-hot (or dummy) numeric array.

fn (OneHotEncoder) fit #

fn (mut e OneHotEncoder) fit(x [][]string) !

fit learns the categories for each feature

fn (OneHotEncoder) transform #

fn (e &OneHotEncoder) transform(x [][]string) ![][]f64

transform applies one-hot encoding to the data

fn (OneHotEncoder) fit_transform #

fn (mut e OneHotEncoder) fit_transform(x [][]string) ![][]f64

fit_transform fits and transforms in one step

fn (OneHotEncoder) get_feature_names #

fn (e &OneHotEncoder) get_feature_names(input_names []string) ![]string

get_feature_names returns the names of the output features

struct OrdinalEncoder #

@[heap]
struct OrdinalEncoder {
mut:
	fitted bool
pub mut:
	categories_ [][]string // categories for each feature
	n_features  int
}

OrdinalEncoder encodes categorical features as ordinal integers.

fn (OrdinalEncoder) fit #

fn (mut e OrdinalEncoder) fit(x [][]string) !

fit learns the categories for each feature

fn (OrdinalEncoder) transform #

fn (e &OrdinalEncoder) transform(x [][]string) ![][]f64

transform applies ordinal encoding to the data

fn (OrdinalEncoder) fit_transform #

fn (mut e OrdinalEncoder) fit_transform(x [][]string) ![][]f64

fit_transform fits and transforms in one step

struct RobustScaler #

@[heap]
struct RobustScaler {
mut:
	fitted bool
pub mut:
	median_    []f64 // median of each feature
	iqr_       []f64 // interquartile range of each feature
	n_features int
}

RobustScaler scales features using statistics that are robust to outliers. Uses median and interquartile range (IQR) instead of mean and std.

fn (RobustScaler) fit #

fn (mut s RobustScaler) fit(x [][]f64) !

fit computes the median and IQR to be used for later scaling

fn (RobustScaler) transform #

fn (s &RobustScaler) transform(x [][]f64) ![][]f64

transform applies the robust scaling to the data

fn (RobustScaler) fit_transform #

fn (mut s RobustScaler) fit_transform(x [][]f64) ![][]f64

fit_transform fits and transforms in one step

struct StandardScaler #

@[heap]
struct StandardScaler {
mut:
	fitted bool
pub mut:
	mean_      []f64 // mean of each feature
	std_       []f64 // standard deviation of each feature
	n_features int
}

StandardScaler standardizes features by removing the mean and scaling to unit variance. Formula: z = (x - mean) / std

fn (StandardScaler) fit #

fn (mut s StandardScaler) fit(x [][]f64) !

fit computes the mean and std to be used for later scaling

fn (StandardScaler) transform #

fn (s &StandardScaler) transform(x [][]f64) ![][]f64

transform applies the standardization to the data

fn (StandardScaler) fit_transform #

fn (mut s StandardScaler) fit_transform(x [][]f64) ![][]f64

fit_transform fits and transforms in one step

fn (StandardScaler) inverse_transform #

fn (s &StandardScaler) inverse_transform(x [][]f64) ![][]f64

inverse_transform reverses the standardization