ml.nlp #
Natural Language Processing Tools
This submodule offers tools for Natural Language Processing.
Examples
A full set of examples is available in the examples directory of the VSL repository.
Constants #
const ngram_sep = 'NGRAMSEP'
fn count_vectorize #
fn count_vectorize(ngrams [][]string, most_frequent [][]string) []int
count_vectorize returns an array with the number of occurrences, in ngrams, of each ngram in most_frequent. For example, assume ng := [['hello'], ['hello'], ['hi']]. Then nlp.count_vectorize(ng, nlp.most_frequent_ngrams(ng, 0)) should return [2, 1]. See most_frequent_ngrams for more details on how it works.
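The example above as a runnable program (assuming VSL is installed and importable as vsl.ml.nlp):

```v
import vsl.ml.nlp

fn main() {
	ng := [['hello'], ['hello'], ['hi']]
	// With n_features <= 0, most_frequent_ngrams keeps every unique ngram.
	mf := nlp.most_frequent_ngrams(ng, 0) or { panic(err) }
	println(nlp.count_vectorize(ng, mf)) // [2, 1]
}
```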
fn inverse_document_frequencies #
fn inverse_document_frequencies(document [][][]string) !map[string]f64
inverse_document_frequencies will return the IDF of each unique ngram in document by calling term_idf on each one. Check term_idf for more details about the document parameter.
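A minimal sketch of how a document is shaped and passed in (the two-sentence document here is illustrative only):

```v
import vsl.ml.nlp

fn main() {
	// A document is an array of sentences; each sentence is an array of ngrams.
	doc := [
		[['hello'], ['world']],
		[['hello'], ['again']],
	]
	idf := nlp.inverse_document_frequencies(doc) or { panic(err) }
	// Keys are ngrams joined by nlp.ngram_sep (plain words for 1-grams).
	println(idf)
}
```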
fn most_frequent_ngrams #
fn most_frequent_ngrams(ngrams [][]string, n_features int) ![][]string
most_frequent_ngrams returns an array with up to n_features elements denoting the most frequent ngrams in ngrams. Since V does not support maps keyed by arrays, the ngrams are joined internally by the constant ngram_sep ('NGRAMSEP'). If n_features is <= 0, it will be set to ngrams.len.
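A small example; the expected output assumes results are ordered by descending frequency:

```v
import vsl.ml.nlp

fn main() {
	ng := [['the', 'quick'], ['the', 'quick'], ['brown', 'fox']]
	// Keep only the single most frequent ngram.
	top := nlp.most_frequent_ngrams(ng, 1) or { panic(err) }
	println(top) // expected: [['the', 'quick']]
}
```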
fn ngrams #
fn ngrams(tokens []string, n int) ![][]string
ngrams will return an array of grams containing n elements each, taken from tokens. Example: ngrams('the apple is red'.split(' '), 3) will return [['the', 'apple', 'is'], ['apple', 'is', 'red']].
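The same example as a runnable program:

```v
import vsl.ml.nlp

fn main() {
	tokens := 'the apple is red'.split(' ')
	trigrams := nlp.ngrams(tokens, 3) or { panic(err) }
	println(trigrams) // [['the', 'apple', 'is'], ['apple', 'is', 'red']]
}
```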
fn remove_punctuation #
fn remove_punctuation(x string) string
remove_punctuation will remove the following characters from the string: ,.[]()[]-=_*;:+><\\"´~^!?@#$%¨&/|'`
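A short illustrative call (the expected output assumes only the listed characters are stripped):

```v
import vsl.ml.nlp

fn main() {
	// ',' and '!' are in the removed set; letters and spaces are kept.
	println(nlp.remove_punctuation('hello, world!')) // expected: 'hello world'
}
```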
fn remove_stopwords #
fn remove_stopwords(tokens []string, stopwords []string, ignore_case bool) []string
remove_stopwords will remove all tokens included in stopwords. If ignore_case is true, "THIS" will be considered the same as "this".
fn remove_stopwords_en #
fn remove_stopwords_en(tokens []string, ignore_case bool) []string
remove_stopwords_en is a wrapper for remove_stopwords, passing a default array of common English stopwords.
fn term_frequencies #
fn term_frequencies(ngrams_sentence [][]string) !map[string]f64
term_frequencies will return the frequency of each term in a sentence. However, since the VSL NLP tools deal with ngrams instead of regular "words" (1-gram tokens), our sentence is actually a collection of ngrams, and our words/terms are the ngrams themselves. Keep in mind that, since V does not support maps keyed by arrays, ngrams are joined by the constant ngram_sep, which can be found in ml/tokenizer.v. This is why it returns a map[string]f64 and not a map[[]string]f64.
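An illustrative call; the exact values depend on how frequencies are normalized, so they are not asserted here:

```v
import vsl.ml.nlp

fn main() {
	// A sentence is an array of ngrams; here, three 1-grams.
	sentence := [['hello'], ['hello'], ['world']]
	tf := nlp.term_frequencies(sentence) or { panic(err) }
	// Keys are ngrams joined by nlp.ngram_sep; 'hello' occurs twice, 'world' once.
	println(tf)
}
```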
fn term_idf #
fn term_idf(term []string, document [][][]string) !f64
term_idf will return the IDF of the single term term. However, since the VSL NLP tools deal with ngrams instead of regular "words" (1-gram tokens), our document is actually a collection of ngrams, and our words/terms are the ngrams themselves. This means that a document (a collection of sentences) is actually a [][][]string (an array of sentences, which are themselves arrays of ngrams, which are arrays of strings). Keep in mind that, since V does not support maps keyed by arrays, ngrams are joined internally by the constant ngram_sep, which can be found in ml/tokenizer.v.
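A minimal sketch using an illustrative two-sentence document:

```v
import vsl.ml.nlp

fn main() {
	doc := [
		[['hello'], ['world']],
		[['hello'], ['again']],
	]
	// 'world' appears in only one of the two sentences.
	idf := nlp.term_idf(['world'], doc) or { panic(err) }
	println(idf)
}
```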
fn tf_idf #
fn tf_idf(ngram []string, sentence [][]string, document [][][]string) !f64
tf_idf will return the TF * IDF for any given ngram, in a sentence, in a document.
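Putting the pieces together, a hedged end-to-end sketch:

```v
import vsl.ml.nlp

fn main() {
	doc := [
		[['hello'], ['world']],
		[['hello'], ['again']],
	]
	// Score 'world' within the first sentence, relative to the whole document.
	sentence := doc[0]
	score := nlp.tf_idf(['world'], sentence, doc) or { panic(err) }
	println(score)
}
```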
fn tokenize #
fn tokenize(x string) []string
tokenize will return an array of tokens from the string x.
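A minimal call (the exact tokenization rules are not asserted here):

```v
import vsl.ml.nlp

fn main() {
	tokens := nlp.tokenize('The quick brown fox')
	println(tokens)
}
```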
fn LancasterStemmer.new #
fn LancasterStemmer.new(strip_prefix bool) LancasterStemmer
LancasterStemmer.new returns a LancasterStemmer struct with a predefined set of stemming rules.
struct LancasterStemmer #
struct LancasterStemmer {
mut:
rule_map map[string][]string
pub mut:
strip_prefix bool
rules []string = [
'ai*2.',
'a*1.',
'bb1.',
'city3s.',
'ci2>',
'cn1t>',
'dd1.',
'dei3y>',
'deec2ss.',
'dee1.',
'de2>',
'dooh4>',
'e1>',
'feil1v.',
'fi2>',
'gni3>',
'gai3y.',
'ga2>',
'gg1.',
'ht*2.',
'hsiug5ct.',
'hsi3>',
'i*1.',
'i1y>',
'ji1d.',
'juf1s.',
'ju1d.',
'jo1d.',
'jeh1r.',
'jrev1t.',
'jsim2t.',
'jn1d.',
'j1s.',
'lbaifi6.',
'lbai4y.',
'lba3>',
'lbi3.',
'lib2l>',
'lc1.',
'lufi4y.',
'luf3>',
'lu2.',
'lai3>',
'lau3>',
'la2>',
'll1.',
'mui3.',
'mu*2.',
'msi3>',
'mm1.',
'nois4j>',
'noix4ct.',
'noi3>',
'nai3>',
'na2>',
'nee0.',
'ne2>',
'nn1.',
'pihs4>',
'pp1.',
're2>',
'rae0.',
'ra2.',
'ro2>',
'ru2>',
'rr1.',
'rt1>',
'rei3y>',
'sei3y>',
'sis2.',
'si2>',
'ssen4>',
'ss0.',
'suo3>',
'su*2.',
's*1>',
's0.',
'tacilp4y.',
'ta2>',
'tnem4>',
'tne3>',
'tna3>',
'tpir2b.',
'tpro2b.',
'tcud1.',
'tpmus2.',
'tpec2iv.',
'tulo2v.',
'tsis0.',
'tsi3>',
'tt1.',
'uqi3.',
'ugo1.',
'vis3j>',
'vie0.',
'vi2>',
'ylb1>',
'yli3y>',
'ylp0.',
'yl2>',
'ygo1.',
'yhp1.',
'ymo1.',
'ypo1.',
'yti3>',
'yte3>',
'ytl2.',
'yrtsi5.',
'yra3>',
'yro3>',
'yfi3.',
'ycn2t>',
'yca3>',
'zi2>',
'zy1s.',
]
}
All credit goes to the respective authors of NLTK's LancasterStemmer implementation.
fn (LancasterStemmer) set_rules #
fn (mut stemmer LancasterStemmer) set_rules(rules []string) !
set_rules redefines the rules of the stemmer and parses them.
fn (LancasterStemmer) stem #
fn (mut stemmer LancasterStemmer) stem(word string) !string
stem serves as a wrapper for do_stemming, which is private.
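A usage sketch of the stemmer; the expected stem follows NLTK's LancasterStemmer behavior, on which these rules are based:

```v
import vsl.ml.nlp

fn main() {
	// stem has a mutable receiver, so the stemmer must be declared mut.
	mut stemmer := nlp.LancasterStemmer.new(false)
	stemmed := stemmer.stem('maximum') or { panic(err) }
	println(stemmed) // e.g. 'maxim'
}
```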