# preprocessing

## VSL Preprocessing Module

The `vsl.preprocessing` module provides utilities for data preprocessing commonly used in machine learning pipelines.
### Features

#### Scalers

- `StandardScaler`: Standardizes features by removing the mean and scaling to unit variance (Z-score normalization)
- `MinMaxScaler`: Transforms features by scaling each feature to a given range (default `[0, 1]`)
- `RobustScaler`: Scales features using statistics robust to outliers (median and IQR)

#### Encoders

- `LabelEncoder`: Encodes categorical labels as integers
- `OneHotEncoder`: Encodes categorical features as one-hot numeric arrays
- `OrdinalEncoder`: Encodes categorical features as ordinal integers

#### Binning

- `cut`: Bins values into discrete intervals using equal-width binning
- `qcut`: Bins values using quantile-based (equal-frequency) binning
- `Binner`: Fitted binning transformer for consistent bin assignment
### Quick Start

#### Standard Scaling (Z-Score Normalization)

```v
import vsl.preprocessing

data := [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

mut scaler := preprocessing.StandardScaler.new()
scaled := scaler.fit_transform(data)!

// Inverse transform to recover the original values
recovered := scaler.inverse_transform(scaled)!
```
#### Min-Max Scaling

```v
import vsl.preprocessing

data := [[0.0], [50.0], [100.0]]

// Scale to the [0, 1] range
mut scaler := preprocessing.MinMaxScaler.new(0.0, 1.0)
scaled := scaler.fit_transform(data)!
// Result: [[0.0], [0.5], [1.0]]

// Or scale to a custom range [-1, 1]
mut scaler2 := preprocessing.MinMaxScaler.new(-1.0, 1.0)
scaled2 := scaler2.fit_transform(data)!
// Result: [[-1.0], [0.0], [1.0]]
```
#### Robust Scaling

```v
import vsl.preprocessing

// Data with an outlier
data := [[1.0], [2.0], [3.0], [4.0], [5.0], [100.0]]

mut scaler := preprocessing.RobustScaler.new()
scaled := scaler.fit_transform(data)!
// Uses median and IQR, so the outlier has less effect on the scaling
```
#### Label Encoding

```v
import vsl.preprocessing

labels := ['cat', 'dog', 'bird', 'cat', 'dog']

mut encoder := preprocessing.LabelEncoder.new()
encoded := encoder.fit_transform(labels)!
// Result: [0, 1, 2, 0, 1] (order depends on first occurrence)

// Inverse transform
decoded := encoder.inverse_transform(encoded)!
// Result: ['cat', 'dog', 'bird', 'cat', 'dog']
```
#### One-Hot Encoding

```v
import vsl.preprocessing

data := [['red'], ['blue'], ['green'], ['red']]

mut encoder := preprocessing.OneHotEncoder.new(false)
encoded := encoder.fit_transform(data)!
// Result: [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]

// Get feature names
names := encoder.get_feature_names(['color'])!
// Result: ['color_red', 'color_blue', 'color_green']

// Dummy encoding (drop the first category)
mut dummy_encoder := preprocessing.OneHotEncoder.new(true)
dummy := dummy_encoder.fit_transform(data)!
// Result: [[0, 0], [1, 0], [0, 1], [0, 0]]
```
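#### Ordinal Encoding

`OrdinalEncoder` appears in the feature list but has no example above; here is a minimal sketch following the same fit/transform pattern as the other encoders. The input data is illustrative, and the resulting category-to-integer mapping is not verified output.

```v
import vsl.preprocessing

data := [['small'], ['large'], ['medium'], ['small']]

mut encoder := preprocessing.OrdinalEncoder.new()
encoded := encoder.fit_transform(data)!
// Each category in each feature column is mapped to an integer
```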
#### Binning

```v
import vsl.preprocessing

values := [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]

// Equal-width binning
result := preprocessing.cut(values, 2, ['low', 'high'])!

// Quantile binning (equal-frequency)
result2 := preprocessing.qcut(values, 4, ['Q1', 'Q2', 'Q3', 'Q4'])!

// Using Binner for consistent bin assignment
mut binner := preprocessing.Binner.new(3, .uniform, ['small', 'medium', 'large'])
binner.fit(values)!
binned := binner.transform(values)!
```
### API Reference

#### StandardScaler

| Method | Description |
|---|---|
| `new()` | Creates a new `StandardScaler` |
| `fit(x)` | Computes mean and std from data |
| `transform(x)` | Applies standardization |
| `fit_transform(x)` | Fits and transforms in one step |
| `inverse_transform(x)` | Reverses standardization |

Attributes after fitting:

- `mean_`: Mean of each feature
- `std_`: Standard deviation of each feature
#### MinMaxScaler

| Method | Description |
|---|---|
| `new(feature_min, feature_max)` | Creates a scaler with the desired output range |
| `fit(x)` | Computes min and max from data |
| `transform(x)` | Applies min-max scaling |
| `fit_transform(x)` | Fits and transforms in one step |
| `inverse_transform(x)` | Reverses scaling |
#### RobustScaler

| Method | Description |
|---|---|
| `new()` | Creates a new `RobustScaler` |
| `fit(x)` | Computes median and IQR from data |
| `transform(x)` | Applies robust scaling |
| `fit_transform(x)` | Fits and transforms in one step |
#### LabelEncoder

| Method | Description |
|---|---|
| `new()` | Creates a new `LabelEncoder` |
| `fit(y)` | Learns the unique classes |
| `transform(y)` | Encodes labels to integers |
| `fit_transform(y)` | Fits and transforms in one step |
| `inverse_transform(y)` | Converts integers back to labels |
#### OneHotEncoder

| Method | Description |
|---|---|
| `new(drop_first)` | Creates an encoder; `drop_first` enables dummy encoding |
| `fit(x)` | Learns the categories for each feature |
| `transform(x)` | Applies one-hot encoding |
| `fit_transform(x)` | Fits and transforms in one step |
| `get_feature_names(input_names)` | Returns the output feature names |
#### Binner

| Method | Description |
|---|---|
| `new(n_bins, strategy, labels)` | Creates a binner with the specified strategy |
| `fit(values)` | Computes bin edges |
| `transform(values)` | Assigns values to bins |
| `fit_transform(values)` | Fits and transforms in one step |
| `transform_to_indices(values)` | Returns bin indices instead of labels |

`BinningStrategy`:

- `.uniform`: Equal-width bins
- `.quantile`: Equal-frequency bins
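`transform_to_indices` is not shown in the Quick Start; here is a minimal sketch, assuming the binner has been fitted first. The values and labels are illustrative.

```v
import vsl.preprocessing

values := [1.0, 4.0, 9.0]

mut binner := preprocessing.Binner.new(3, .uniform, ['small', 'medium', 'large'])
binner.fit(values)!

// Bin indices (0-based) for each value, rather than labels
indices := binner.transform_to_indices(values)!
```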
### Integration with ML Pipeline

```v
import vsl.preprocessing
import vsl.inout.csv
import vsl.ml

// Load data
csv_data := csv.read_csv('data.csv', csv.CsvReadConfig{ skip_header: true })!

// Scale features
mut scaler := preprocessing.StandardScaler.new()
scaled_data := scaler.fit_transform(csv_data.data)!

// Create ML Data object
mut data := ml.Data.from_raw_xy(scaled_data)!

// Train model
mut model := ml.LinReg.new(mut data, 'my_model')
model.train()

// For new predictions, reuse the same fitted scaler
new_data := [[1.0, 2.0, 3.0]]
scaled_new := scaler.transform(new_data)!
prediction := model.predict(scaled_new[0])
```
## fn cut

```v
fn cut(values []f64, n_bins int, labels []string) ![]string
```

`cut` bins values into discrete intervals using equal-width binning. Returns the bin label for each value.
## fn digitize

```v
fn digitize(values []f64, bins []f64) []int
```

`digitize` returns the index of the bin to which each value belongs. Similar to `numpy.digitize`.
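`digitize` has no example in the Quick Start above; here is a minimal sketch based on the signature. The bin edges are illustrative, and the exact index convention (e.g. handling of values on an edge) is not verified here.

```v
import vsl.preprocessing

bins := [0.0, 2.5, 5.0, 7.5]
values := [1.0, 3.0, 6.0, 9.0]

// One bin index per input value
indices := preprocessing.digitize(values, bins)
println(indices)
```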
## fn qcut

```v
fn qcut(values []f64, n_bins int, labels []string) ![]string
```

`qcut` bins values into discrete intervals using quantile-based (equal-frequency) binning. Returns the bin label for each value.

## fn Binner.new

```v
fn Binner.new(n_bins int, strategy BinningStrategy, labels []string) &Binner
```

`Binner.new` creates a new `Binner`.

## fn BinningStrategy.from

```v
fn BinningStrategy.from[W](input W) !BinningStrategy
```

## fn LabelEncoder.new

```v
fn LabelEncoder.new() &LabelEncoder
```

`LabelEncoder.new` creates a new `LabelEncoder`.

## fn MinMaxScaler.new

```v
fn MinMaxScaler.new(feature_min f64, feature_max f64) &MinMaxScaler
```

`MinMaxScaler.new` creates a new `MinMaxScaler` with the given output range.

## fn OneHotEncoder.new

```v
fn OneHotEncoder.new(drop_first bool) &OneHotEncoder
```

`OneHotEncoder.new` creates a new `OneHotEncoder`.

## fn OrdinalEncoder.new

```v
fn OrdinalEncoder.new() &OrdinalEncoder
```

`OrdinalEncoder.new` creates a new `OrdinalEncoder`.

## fn RobustScaler.new

```v
fn RobustScaler.new() &RobustScaler
```

`RobustScaler.new` creates a new `RobustScaler`.

## fn StandardScaler.new

```v
fn StandardScaler.new() &StandardScaler
```

`StandardScaler.new` creates a new `StandardScaler`.

## enum BinningStrategy

```v
enum BinningStrategy {
	uniform  // equal-width bins
	quantile // equal-frequency bins
}
```

`BinningStrategy` specifies how to create bin edges.
## struct Binner

```v
struct Binner {
mut:
	fitted bool
pub mut:
	edges_   []f64    // bin edges
	labels_  []string // bin labels
	n_bins   int
	strategy BinningStrategy
}
```

`Binner` provides a fitted binning transformer.

### fn (Binner) fit

```v
fn (mut b Binner) fit(values []f64) !
```

`fit` computes the bin edges based on the data.

### fn (Binner) transform

```v
fn (b &Binner) transform(values []f64) ![]string
```

`transform` applies the binning to the data.

### fn (Binner) fit_transform

```v
fn (mut b Binner) fit_transform(values []f64) ![]string
```

`fit_transform` fits and transforms in one step.

### fn (Binner) transform_to_indices

```v
fn (b &Binner) transform_to_indices(values []f64) ![]int
```

`transform_to_indices` returns bin indices instead of labels.
## struct LabelEncoder

```v
struct LabelEncoder {
mut:
	fitted bool
pub mut:
	classes_     []string       // unique classes in order
	class_to_idx map[string]int // mapping from class to index
}
```

`LabelEncoder` encodes categorical labels with values between 0 and `n_classes - 1`.

### fn (LabelEncoder) fit

```v
fn (mut e LabelEncoder) fit(y []string) !
```

`fit` learns the unique classes from the data.

### fn (LabelEncoder) transform

```v
fn (e &LabelEncoder) transform(y []string) ![]int
```

`transform` encodes labels to integers.

### fn (LabelEncoder) fit_transform

```v
fn (mut e LabelEncoder) fit_transform(y []string) ![]int
```

`fit_transform` fits and transforms in one step.

### fn (LabelEncoder) inverse_transform

```v
fn (e &LabelEncoder) inverse_transform(y []int) ![]string
```

`inverse_transform` converts integer labels back to the original labels.
## struct MinMaxScaler

```v
struct MinMaxScaler {
mut:
	fitted bool
pub mut:
	min_        []f64 // minimum of each feature
	max_        []f64 // maximum of each feature
	data_range_ []f64 // max - min for each feature
	n_features  int
	feature_min f64 = 0.0 // desired minimum of transformed feature
	feature_max f64 = 1.0 // desired maximum of transformed feature
}
```

`MinMaxScaler` transforms features by scaling each feature to a given range.

Formula: `x_scaled = (x - min) / (max - min) * (feature_max - feature_min) + feature_min`

### fn (MinMaxScaler) fit

```v
fn (mut s MinMaxScaler) fit(x [][]f64) !
```

`fit` computes the min and max to be used for later scaling.

### fn (MinMaxScaler) transform

```v
fn (s &MinMaxScaler) transform(x [][]f64) ![][]f64
```

`transform` applies the min-max scaling to the data.

### fn (MinMaxScaler) fit_transform

```v
fn (mut s MinMaxScaler) fit_transform(x [][]f64) ![][]f64
```

`fit_transform` fits and transforms in one step.

### fn (MinMaxScaler) inverse_transform

```v
fn (s &MinMaxScaler) inverse_transform(x [][]f64) ![][]f64
```

`inverse_transform` reverses the min-max scaling.
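As with `StandardScaler`, fitting and scaling round-trip through `inverse_transform`; a minimal sketch (recovery is expected to match the input up to floating-point rounding):

```v
import vsl.preprocessing

data := [[0.0], [50.0], [100.0]]

mut scaler := preprocessing.MinMaxScaler.new(0.0, 1.0)
scaled := scaler.fit_transform(data)!

// Map the scaled values back to the original range
original := scaler.inverse_transform(scaled)!
```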
## struct OneHotEncoder

```v
struct OneHotEncoder {
mut:
	fitted bool
pub mut:
	categories_ [][]string // categories for each feature
	n_features  int
	drop_first  bool // whether to drop the first category (for dummy encoding)
}
```

`OneHotEncoder` encodes categorical features as a one-hot (or dummy) numeric array.

### fn (OneHotEncoder) fit

```v
fn (mut e OneHotEncoder) fit(x [][]string) !
```

`fit` learns the categories for each feature.

### fn (OneHotEncoder) transform

```v
fn (e &OneHotEncoder) transform(x [][]string) ![][]f64
```

`transform` applies one-hot encoding to the data.

### fn (OneHotEncoder) fit_transform

```v
fn (mut e OneHotEncoder) fit_transform(x [][]string) ![][]f64
```

`fit_transform` fits and transforms in one step.

### fn (OneHotEncoder) get_feature_names

```v
fn (e &OneHotEncoder) get_feature_names(input_names []string) ![]string
```

`get_feature_names` returns the names of the output features.
## struct OrdinalEncoder

```v
struct OrdinalEncoder {
mut:
	fitted bool
pub mut:
	categories_ [][]string // categories for each feature
	n_features  int
}
```

`OrdinalEncoder` encodes categorical features as ordinal integers.

### fn (OrdinalEncoder) fit

```v
fn (mut e OrdinalEncoder) fit(x [][]string) !
```

`fit` learns the categories for each feature.

### fn (OrdinalEncoder) transform

```v
fn (e &OrdinalEncoder) transform(x [][]string) ![][]f64
```

`transform` applies ordinal encoding to the data.

### fn (OrdinalEncoder) fit_transform

```v
fn (mut e OrdinalEncoder) fit_transform(x [][]string) ![][]f64
```

`fit_transform` fits and transforms in one step.
## struct RobustScaler

```v
struct RobustScaler {
mut:
	fitted bool
pub mut:
	median_    []f64 // median of each feature
	iqr_       []f64 // interquartile range of each feature
	n_features int
}
```

`RobustScaler` scales features using statistics that are robust to outliers. It uses the median and interquartile range (IQR) instead of the mean and std.

### fn (RobustScaler) fit

```v
fn (mut s RobustScaler) fit(x [][]f64) !
```

`fit` computes the median and IQR to be used for later scaling.

### fn (RobustScaler) transform

```v
fn (s &RobustScaler) transform(x [][]f64) ![][]f64
```

`transform` applies the robust scaling to the data.

### fn (RobustScaler) fit_transform

```v
fn (mut s RobustScaler) fit_transform(x [][]f64) ![][]f64
```

`fit_transform` fits and transforms in one step.
## struct StandardScaler

```v
struct StandardScaler {
mut:
	fitted bool
pub mut:
	mean_      []f64 // mean of each feature
	std_       []f64 // standard deviation of each feature
	n_features int
}
```

`StandardScaler` standardizes features by removing the mean and scaling to unit variance.

Formula: `z = (x - mean) / std`

### fn (StandardScaler) fit

```v
fn (mut s StandardScaler) fit(x [][]f64) !
```

`fit` computes the mean and std to be used for later scaling.

### fn (StandardScaler) transform

```v
fn (s &StandardScaler) transform(x [][]f64) ![][]f64
```

`transform` applies the standardization to the data.

### fn (StandardScaler) fit_transform

```v
fn (mut s StandardScaler) fit_transform(x [][]f64) ![][]f64
```

`fit_transform` fits and transforms in one step.

### fn (StandardScaler) inverse_transform

```v
fn (s &StandardScaler) inverse_transform(x [][]f64) ![][]f64
```

`inverse_transform` reverses the standardization.