preprocessing #

VSL Preprocessing Module

The vsl.preprocessing module provides utilities for data preprocessing commonly used in machine learning pipelines.

Features

Scalers

  • StandardScaler: Standardizes features by removing the mean and scaling to unit variance (Z-score normalization)
  • MinMaxScaler: Transforms features by scaling to a given range (default [0, 1])
  • RobustScaler: Scales features using statistics robust to outliers (median and IQR)

Encoders

  • LabelEncoder: Encodes categorical labels as integers
  • OneHotEncoder: Encodes categorical features as one-hot numeric arrays
  • OrdinalEncoder: Encodes categorical features as ordinal integers

Binning

  • cut: Bins values into discrete intervals using equal-width binning
  • qcut: Bins values using quantile-based (equal-frequency) binning
  • Binner: Fitted binning transformer for consistent bin assignment

Quick Start

Standard Scaling (Z-Score Normalization)

import vsl.preprocessing

data := [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

mut scaler := preprocessing.StandardScaler.new()
scaled := scaler.fit_transform(data)!

// Inverse transform to recover original values
recovered := scaler.inverse_transform(scaled)!

Min-Max Scaling

import vsl.preprocessing

data := [[0.0], [50.0], [100.0]]

// Scale to [0, 1] range
mut scaler := preprocessing.MinMaxScaler.new(0.0, 1.0)
scaled := scaler.fit_transform(data)!
// Result: [[0.0], [0.5], [1.0]]

// Or scale to custom range [-1, 1]
mut scaler2 := preprocessing.MinMaxScaler.new(-1.0, 1.0)
scaled2 := scaler2.fit_transform(data)!
// Result: [[-1.0], [0.0], [1.0]]

Robust Scaling

import vsl.preprocessing

// Data with outliers
data := [[1.0], [2.0], [3.0], [4.0], [5.0], [100.0]]

mut scaler := preprocessing.RobustScaler.new()
scaled := scaler.fit_transform(data)!
// Uses median and IQR, so the outlier has less effect

Label Encoding

import vsl.preprocessing

labels := ['cat', 'dog', 'bird', 'cat', 'dog']

mut encoder := preprocessing.LabelEncoder.new()
encoded := encoder.fit_transform(labels)!
// Result: [0, 1, 2, 0, 1] (integers assigned in order of first occurrence)

// Inverse transform
decoded := encoder.inverse_transform(encoded)!
// Result: ['cat', 'dog', 'bird', 'cat', 'dog']

One-Hot Encoding

import vsl.preprocessing

data := [['red'], ['blue'], ['green'], ['red']]

mut encoder := preprocessing.OneHotEncoder.new(false)
encoded := encoder.fit_transform(data)!
// Result: [[1,0,0], [0,1,0], [0,0,1], [1,0,0]]

// Get feature names
names := encoder.get_feature_names(['color'])!
// Result: ['color_red', 'color_blue', 'color_green']

// Dummy encoding (drop first category)
mut dummy_encoder := preprocessing.OneHotEncoder.new(true)
dummy := dummy_encoder.fit_transform(data)!
// Result: [[0,0], [1,0], [0,1], [0,0]]

Binning

import vsl.preprocessing

values := [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]

// Equal-width binning
result := preprocessing.cut(values, 2, ['low', 'high'])!

// Quantile binning (equal-frequency)
result2 := preprocessing.qcut(values, 4, ['Q1', 'Q2', 'Q3', 'Q4'])!

// Using Binner for consistent transformation
mut binner := preprocessing.Binner.new(3, .uniform, ['small', 'medium', 'large'])
binner.fit(values)!
binned := binner.transform(values)!

API Reference

StandardScaler

| Method | Description |
| --- | --- |
| new() | Creates a new StandardScaler |
| fit(x) | Computes mean and std from data |
| transform(x) | Applies standardization |
| fit_transform(x) | Fits and transforms in one step |
| inverse_transform(x) | Reverses standardization |

Attributes after fitting:

  • mean_: Mean of each feature
  • std_: Standard deviation of each feature

MinMaxScaler

| Method | Description |
| --- | --- |
| new(feature_min, feature_max) | Creates scaler with desired range |
| fit(x) | Computes min and max from data |
| transform(x) | Applies min-max scaling |
| fit_transform(x) | Fits and transforms in one step |
| inverse_transform(x) | Reverses scaling |

RobustScaler

| Method | Description |
| --- | --- |
| new() | Creates a new RobustScaler |
| fit(x) | Computes median and IQR from data |
| transform(x) | Applies robust scaling |
| fit_transform(x) | Fits and transforms in one step |

LabelEncoder

| Method | Description |
| --- | --- |
| new() | Creates a new LabelEncoder |
| fit(y) | Learns unique classes |
| transform(y) | Encodes labels to integers |
| fit_transform(y) | Fits and transforms in one step |
| inverse_transform(y) | Converts integers back to labels |

OneHotEncoder

| Method | Description |
| --- | --- |
| new(drop_first) | Creates encoder; drop_first for dummy encoding |
| fit(x) | Learns categories for each feature |
| transform(x) | Applies one-hot encoding |
| fit_transform(x) | Fits and transforms in one step |
| get_feature_names(input_names) | Returns output feature names |

Binner

| Method | Description |
| --- | --- |
| new(n_bins, strategy, labels) | Creates binner with specified strategy |
| fit(values) | Computes bin edges |
| transform(values) | Assigns values to bins |
| fit_transform(values) | Fits and transforms in one step |
| transform_to_indices(values) | Returns bin indices instead of labels |

BinningStrategy:

  • .uniform: Equal-width bins
  • .quantile: Equal-frequency bins

Integration with ML Pipeline

import vsl.preprocessing
import vsl.inout.csv
import vsl.ml

// Load data
csv_data := csv.read_csv('data.csv', csv.CsvReadConfig{ skip_header: true })!

// Scale features
mut scaler := preprocessing.StandardScaler.new()
scaled_data := scaler.fit_transform(csv_data.data)!

// Create ML Data object
mut data := ml.Data.from_raw_xy(scaled_data)!

// Train model
mut model := ml.LinReg.new(mut data, 'my_model')
model.train()

// For new predictions, use the same scaler
new_data := [[1.0, 2.0, 3.0]]
scaled_new := scaler.transform(new_data)!
prediction := model.predict(scaled_new[0])

fn cut #

fn cut(values []f64, n_bins int, labels []string) ![]string

cut bins values into discrete intervals using equal-width binning. Returns the bin labels for each value.

fn digitize #

fn digitize(values []f64, bins []f64) []int

digitize returns the indices of the bins to which each value belongs, similar to numpy.digitize.
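Unlike cut and qcut, digitize takes explicit bin edges rather than a bin count. A minimal sketch, assuming the boundary semantics follow numpy.digitize (a value below the first edge maps to index 0, and each edge starts a new index):

```v
import vsl.preprocessing

values := [0.5, 1.5, 2.5, 3.5]
bins := [1.0, 2.0, 3.0]

// Each value is mapped to the index of the interval it falls into,
// so values below bins[0] share index 0 and values above the last
// edge share the highest index.
indices := preprocessing.digitize(values, bins)
println(indices)
```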

fn qcut #

fn qcut(values []f64, n_bins int, labels []string) ![]string

qcut bins values into discrete intervals using quantile-based binning. Returns the bin labels for each value.

fn Binner.new #

fn Binner.new(n_bins int, strategy BinningStrategy, labels []string) &Binner

Binner.new creates a new Binner

fn BinningStrategy.from #

fn BinningStrategy.from[W](input W) !BinningStrategy

fn LabelEncoder.new #

fn LabelEncoder.new() &LabelEncoder

LabelEncoder.new creates a new LabelEncoder

fn MinMaxScaler.new #

fn MinMaxScaler.new(feature_min f64, feature_max f64) &MinMaxScaler

MinMaxScaler.new creates a new MinMaxScaler with the given output range

fn OneHotEncoder.new #

fn OneHotEncoder.new(drop_first bool) &OneHotEncoder

OneHotEncoder.new creates a new OneHotEncoder

fn OrdinalEncoder.new #

fn OrdinalEncoder.new() &OrdinalEncoder

OrdinalEncoder.new creates a new OrdinalEncoder
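OrdinalEncoder is not covered in the Quick Start above. A minimal sketch, assuming each category is assigned a float code per feature column in the order fit encounters it (mirroring LabelEncoder's first-occurrence ordering):

```v
import vsl.preprocessing

data := [['small'], ['large'], ['medium'], ['small']]

mut encoder := preprocessing.OrdinalEncoder.new()
// Each category string is replaced by a numeric code,
// one column of codes per input feature.
encoded := encoder.fit_transform(data)!
println(encoded)
```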

fn RobustScaler.new #

fn RobustScaler.new() &RobustScaler

RobustScaler.new creates a new RobustScaler

fn StandardScaler.new #

fn StandardScaler.new() &StandardScaler

StandardScaler.new creates a new StandardScaler

enum BinningStrategy #

enum BinningStrategy {
	uniform  // equal-width bins
	quantile // equal-frequency bins
}

BinningStrategy specifies how to create bin edges
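The Quick Start above uses .uniform; with .quantile, the Binner instead places roughly equal numbers of values in each bin, which matters for skewed data. A minimal sketch:

```v
import vsl.preprocessing

// Skewed data: equal-width bins would leave the upper bins nearly empty
values := [1.0, 1.2, 1.4, 1.6, 2.0, 50.0]

mut binner := preprocessing.Binner.new(2, .quantile, ['lower half', 'upper half'])
// Quantile edges split the data by count, not by value range
binned := binner.fit_transform(values)!
println(binned)
```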

struct Binner #

@[heap]
struct Binner {
mut:
	fitted bool
pub mut:
	edges_   []f64    // bin edges
	labels_  []string // bin labels
	n_bins   int
	strategy BinningStrategy
}

Binner provides a fitted binning transformer

fn (Binner) fit #

fn (mut b Binner) fit(values []f64) !

fit computes the bin edges based on the data

fn (Binner) transform #

fn (b &Binner) transform(values []f64) ![]string

transform applies the binning to the data

fn (Binner) fit_transform #

fn (mut b Binner) fit_transform(values []f64) ![]string

fit_transform fits and transforms in one step

fn (Binner) transform_to_indices #

fn (b &Binner) transform_to_indices(values []f64) ![]int

transform_to_indices returns bin indices instead of labels
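Indices are useful when the binned feature feeds a numeric model rather than a report. A minimal sketch, reusing the Quick Start setup:

```v
import vsl.preprocessing

values := [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

mut binner := preprocessing.Binner.new(3, .uniform, ['small', 'medium', 'large'])
binner.fit(values)!
// 0-based bin indices instead of the string labels
indices := binner.transform_to_indices(values)!
println(indices)
```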

struct LabelEncoder #

@[heap]
struct LabelEncoder {
mut:
	fitted bool
pub mut:
	classes_     []string       // unique classes in order
	class_to_idx map[string]int // mapping from class to index
}

LabelEncoder encodes categorical labels with values between 0 and n_classes-1.

fn (LabelEncoder) fit #

fn (mut e LabelEncoder) fit(y []string) !

fit learns the unique classes from the data

fn (LabelEncoder) transform #

fn (e &LabelEncoder) transform(y []string) ![]int

transform encodes labels to integers

fn (LabelEncoder) fit_transform #

fn (mut e LabelEncoder) fit_transform(y []string) ![]int

fit_transform fits and transforms in one step

fn (LabelEncoder) inverse_transform #

fn (e &LabelEncoder) inverse_transform(y []int) ![]string

inverse_transform converts integer labels back to original labels

struct MinMaxScaler #

@[heap]
struct MinMaxScaler {
mut:
	fitted bool
pub mut:
	min_        []f64 // minimum of each feature
	max_        []f64 // maximum of each feature
	data_range_ []f64 // max - min for each feature
	n_features  int
	feature_min f64 = 0.0 // desired minimum of transformed feature
	feature_max f64 = 1.0 // desired maximum of transformed feature
}

MinMaxScaler transforms features by scaling each feature to a given range. Formula: x_scaled = (x - min) / (max - min) * (feature_max - feature_min) + feature_min

fn (MinMaxScaler) fit #

fn (mut s MinMaxScaler) fit(x [][]f64) !

fit computes the min and max to be used for later scaling

fn (MinMaxScaler) transform #

fn (s &MinMaxScaler) transform(x [][]f64) ![][]f64

transform applies the min-max scaling to the data

fn (MinMaxScaler) fit_transform #

fn (mut s MinMaxScaler) fit_transform(x [][]f64) ![][]f64

fit_transform fits and transforms in one step

fn (MinMaxScaler) inverse_transform #

fn (s &MinMaxScaler) inverse_transform(x [][]f64) ![][]f64

inverse_transform reverses the min-max scaling

struct OneHotEncoder #

@[heap]
struct OneHotEncoder {
mut:
	fitted bool
pub mut:
	categories_ [][]string // categories for each feature
	n_features  int
	drop_first  bool // whether to drop the first category (for dummy encoding)
}

OneHotEncoder encodes categorical features as a one-hot (or dummy) numeric array.

fn (OneHotEncoder) fit #

fn (mut e OneHotEncoder) fit(x [][]string) !

fit learns the categories for each feature

fn (OneHotEncoder) transform #

fn (e &OneHotEncoder) transform(x [][]string) ![][]f64

transform applies one-hot encoding to the data

fn (OneHotEncoder) fit_transform #

fn (mut e OneHotEncoder) fit_transform(x [][]string) ![][]f64

fit_transform fits and transforms in one step

fn (OneHotEncoder) get_feature_names #

fn (e &OneHotEncoder) get_feature_names(input_names []string) ![]string

get_feature_names returns the names of the output features

struct OrdinalEncoder #

@[heap]
struct OrdinalEncoder {
mut:
	fitted bool
pub mut:
	categories_ [][]string // categories for each feature
	n_features  int
}

OrdinalEncoder encodes categorical features as ordinal integers.

fn (OrdinalEncoder) fit #

fn (mut e OrdinalEncoder) fit(x [][]string) !

fit learns the categories for each feature

fn (OrdinalEncoder) transform #

fn (e &OrdinalEncoder) transform(x [][]string) ![][]f64

transform applies ordinal encoding to the data

fn (OrdinalEncoder) fit_transform #

fn (mut e OrdinalEncoder) fit_transform(x [][]string) ![][]f64

fit_transform fits and transforms in one step

struct RobustScaler #

@[heap]
struct RobustScaler {
mut:
	fitted bool
pub mut:
	median_    []f64 // median of each feature
	iqr_       []f64 // interquartile range of each feature
	n_features int
}

RobustScaler scales features using statistics that are robust to outliers. Uses median and interquartile range (IQR) instead of mean and std.

fn (RobustScaler) fit #

fn (mut s RobustScaler) fit(x [][]f64) !

fit computes the median and IQR to be used for later scaling

fn (RobustScaler) transform #

fn (s &RobustScaler) transform(x [][]f64) ![][]f64

transform applies the robust scaling to the data

fn (RobustScaler) fit_transform #

fn (mut s RobustScaler) fit_transform(x [][]f64) ![][]f64

fit_transform fits and transforms in one step

struct StandardScaler #

@[heap]
struct StandardScaler {
mut:
	fitted bool
pub mut:
	mean_      []f64 // mean of each feature
	std_       []f64 // standard deviation of each feature
	n_features int
}

StandardScaler standardizes features by removing the mean and scaling to unit variance. Formula: z = (x - mean) / std

fn (StandardScaler) fit #

fn (mut s StandardScaler) fit(x [][]f64) !

fit computes the mean and std to be used for later scaling

fn (StandardScaler) transform #

fn (s &StandardScaler) transform(x [][]f64) ![][]f64

transform applies the standardization to the data

fn (StandardScaler) fit_transform #

fn (mut s StandardScaler) fit_transform(x [][]f64) ![][]f64

fit_transform fits and transforms in one step

fn (StandardScaler) inverse_transform #

fn (s &StandardScaler) inverse_transform(x [][]f64) ![][]f64

inverse_transform reverses the standardization