ml.nlp #

Natural Language Processing Tools

This submodule offers tools for Natural Language Processing.

Examples

Here you can see a full set of examples.

Constants #

const ngram_sep = 'NGRAMSEP'

fn count_vectorize #

fn count_vectorize(ngrams [][]string, most_frequent [][]string) []int

count_vectorize counts, for each ngram in most_frequent, the number of times it occurs in ngrams. For example, given ng := [['hello'], ['hello'], ['hi']], nlp.count_vectorize(ng, nlp.most_frequent_ngrams(ng, 0)) returns [2, 1]. See most_frequent_ngrams for more details on how it works.
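The example above can be sketched as a small program. This is a minimal, unverified sketch; it assumes the module is imported as vsl.ml.nlp (adjust the import path to your installation):

```v
import vsl.ml.nlp

fn main() {
	ng := [['hello'], ['hello'], ['hi']]
	most_frequent := nlp.most_frequent_ngrams(ng, 0) or { panic(err) }
	counts := nlp.count_vectorize(ng, most_frequent)
	println(counts) // [2, 1], as described above
}
```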

fn inverse_document_frequencies #

fn inverse_document_frequencies(document [][][]string) !map[string]f64

inverse_document_frequencies will return the IDF of each term by calling term_idf on each unique ngram. Check term_idf for more details about the parameter document.
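As a sketch of the expected shapes (assuming the import path vsl.ml.nlp; the corpus below is purely illustrative):

```v
import vsl.ml.nlp

fn main() {
	// a document is an array of sentences; each sentence is an array of ngrams
	doc := [
		[[' hello'.trim_space()], ['world']],
		[['hello'], ['again']],
	]
	idf := nlp.inverse_document_frequencies(doc) or { panic(err) }
	// keys are ngrams joined by ngram_sep; 'hello' appears in both
	// sentences, so its IDF should be lower than that of 'world' or 'again'
	println(idf)
}
```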

fn most_frequent_ngrams #

fn most_frequent_ngrams(ngrams [][]string, n_features int) ![][]string

most_frequent_ngrams returns an array with up to n_features elements denoting the most frequent ngrams in ngrams. Since V does not support maps of arrays, the ngrams are joined internally by the constant ngram_sep ('NGRAMSEP'). If n_features is <= 0, it will be set to ngrams.len.
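A hedged usage sketch (import path vsl.ml.nlp assumed; the expected output follows from the counts in the toy input):

```v
import vsl.ml.nlp

fn main() {
	// ['the', 'quick'] occurs twice, the others once
	ng := [['the', 'quick'], ['quick', 'fox'], ['the', 'quick']]
	top := nlp.most_frequent_ngrams(ng, 1) or { panic(err) }
	println(top) // expected: [['the', 'quick']]
}
```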

fn ngrams #

fn ngrams(tokens []string, n int) ![][]string

ngrams returns the sequence of n-grams (contiguous runs of n elements) from tokens. Example: ngrams('the apple is red'.split(' '), 3) will return [['the', 'apple', 'is'], ['apple', 'is', 'red']]
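The same example as a runnable sketch (import path vsl.ml.nlp assumed):

```v
import vsl.ml.nlp

fn main() {
	tokens := 'the apple is red'.split(' ')
	grams := nlp.ngrams(tokens, 3) or { panic(err) }
	println(grams) // [['the', 'apple', 'is'], ['apple', 'is', 'red']]
}
```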

fn remove_punctuation #

fn remove_punctuation(x string) string

remove_punctuation will remove the following characters from the string: ,.[]()[]-=_*;:+><\\"´~^!?@#$%¨&/|'`
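For instance (a sketch, assuming the import path vsl.ml.nlp; the comment shows the output implied by the character list above):

```v
import vsl.ml.nlp

fn main() {
	clean := nlp.remove_punctuation('Hello, world! (test)')
	println(clean) // 'Hello world test' once ',', '!', '(' and ')' are stripped
}
```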

fn remove_stopwords #

fn remove_stopwords(tokens []string, stopwords []string, ignore_case bool) []string

remove_stopwords will remove all tokens included in stopwords. If ignore_case is true, matching is case-insensitive, so "THIS" is treated as "this".
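A minimal sketch of the case-insensitive behaviour (import path vsl.ml.nlp assumed):

```v
import vsl.ml.nlp

fn main() {
	tokens := ['THIS', 'is', 'a', 'simple', 'test']
	stop := ['this', 'is', 'a']
	filtered := nlp.remove_stopwords(tokens, stop, true)
	// 'THIS' matches 'this' because ignore_case is true
	println(filtered) // ['simple', 'test']
}
```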

fn remove_stopwords_en #

fn remove_stopwords_en(tokens []string, ignore_case bool) []string

remove_stopwords_en is a wrapper for remove_stopwords, passing a default array of some English stopwords.

fn term_frequencies #

fn term_frequencies(ngrams_sentence [][]string) !map[string]f64

term_frequencies will return the frequency of each term in a sentence. However, since in VSL NLP tools we deal with ngrams instead of regular "words" (1-gram tokens), our sentence is actually a collection of ngrams and our words/terms are the ngrams themselves. Keep in mind that, since V does NOT support maps of arrays, ngrams are joined by the constant ngram_sep, which can be found in ml/tokenizer.v. This is why it returns a map[string]f64 and not a map[[]string]f64.
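A sketch of the input shape (import path vsl.ml.nlp assumed; the toy sentence uses 1-grams for brevity):

```v
import vsl.ml.nlp

fn main() {
	// a sentence is a collection of ngrams; here, three 1-grams
	sentence := [['hello'], ['hello'], ['world']]
	tf := nlp.term_frequencies(sentence) or { panic(err) }
	// 'hello' occurs in 2 of 3 positions, 'world' in 1 of 3,
	// so the map keys (joined by ngram_sep) carry those relative frequencies
	println(tf)
}
```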

fn term_idf #

fn term_idf(term []string, document [][][]string) !f64

term_idf will return the IDF of a single term term. However, since in VSL NLP tools we deal with ngrams instead of regular "words" (1-gram tokens), our document is actually a collection of ngrams and our words/terms are the ngrams themselves. This means that a document (a collection of sentences) is actually a [][][]string (array of sentences, which by themselves are arrays of ngrams, which are arrays of strings). Keep in mind that, since V does NOT support maps of arrays, ngrams are joined internally by the constant ngram_sep, which can be found in ml/tokenizer.v.

fn tf_idf #

fn tf_idf(ngram []string, sentence [][]string, document [][][]string) !f64

tf_idf will return the TF * IDF for any given ngram, in a sentence, in a document.
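Putting the pieces together, a hedged end-to-end sketch (import path vsl.ml.nlp and the toy corpus are assumptions):

```v
import vsl.ml.nlp

fn main() {
	// two sentences of 1-grams form the document
	doc := [
		[['hello'], ['world']],
		[['hello'], ['again']],
	]
	sentence := doc[0]
	// 'world' occurs in only one sentence, so its TF-IDF in that
	// sentence should exceed that of the ubiquitous 'hello'
	score := nlp.tf_idf(['world'], sentence, doc) or { panic(err) }
	println(score)
}
```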

fn tokenize #

fn tokenize(x string) []string

tokenize will return an array of tokens from the string x.
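tokenize is typically combined with remove_punctuation before building ngrams, e.g. (a sketch, import path vsl.ml.nlp assumed):

```v
import vsl.ml.nlp

fn main() {
	raw := 'The apple, it is red!'
	tokens := nlp.tokenize(nlp.remove_punctuation(raw))
	println(tokens)
}
```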

fn LancasterStemmer.new #

fn LancasterStemmer.new(strip_prefix bool) LancasterStemmer

LancasterStemmer.new returns a LancasterStemmer struct with a predefined set of stemming rules.

struct LancasterStemmer #

struct LancasterStemmer {
mut:
	rule_map map[string][]string
pub mut:
	strip_prefix bool
	rules        []string = [
		'ai*2.',
		'a*1.',
		'bb1.',
		'city3s.',
		'ci2>',
		'cn1t>',
		'dd1.',
		'dei3y>',
		'deec2ss.',
		'dee1.',
		'de2>',
		'dooh4>',
		'e1>',
		'feil1v.',
		'fi2>',
		'gni3>',
		'gai3y.',
		'ga2>',
		'gg1.',
		'ht*2.',
		'hsiug5ct.',
		'hsi3>',
		'i*1.',
		'i1y>',
		'ji1d.',
		'juf1s.',
		'ju1d.',
		'jo1d.',
		'jeh1r.',
		'jrev1t.',
		'jsim2t.',
		'jn1d.',
		'j1s.',
		'lbaifi6.',
		'lbai4y.',
		'lba3>',
		'lbi3.',
		'lib2l>',
		'lc1.',
		'lufi4y.',
		'luf3>',
		'lu2.',
		'lai3>',
		'lau3>',
		'la2>',
		'll1.',
		'mui3.',
		'mu*2.',
		'msi3>',
		'mm1.',
		'nois4j>',
		'noix4ct.',
		'noi3>',
		'nai3>',
		'na2>',
		'nee0.',
		'ne2>',
		'nn1.',
		'pihs4>',
		'pp1.',
		're2>',
		'rae0.',
		'ra2.',
		'ro2>',
		'ru2>',
		'rr1.',
		'rt1>',
		'rei3y>',
		'sei3y>',
		'sis2.',
		'si2>',
		'ssen4>',
		'ss0.',
		'suo3>',
		'su*2.',
		's*1>',
		's0.',
		'tacilp4y.',
		'ta2>',
		'tnem4>',
		'tne3>',
		'tna3>',
		'tpir2b.',
		'tpro2b.',
		'tcud1.',
		'tpmus2.',
		'tpec2iv.',
		'tulo2v.',
		'tsis0.',
		'tsi3>',
		'tt1.',
		'uqi3.',
		'ugo1.',
		'vis3j>',
		'vie0.',
		'vi2>',
		'ylb1>',
		'yli3y>',
		'ylp0.',
		'yl2>',
		'ygo1.',
		'yhp1.',
		'ymo1.',
		'ypo1.',
		'yti3>',
		'yte3>',
		'ytl2.',
		'yrtsi5.',
		'yra3>',
		'yro3>',
		'yfi3.',
		'ycn2t>',
		'yca3>',
		'zi2>',
		'zy1s.',
	]
}

All credits go to the respective authors of NLTK's LancasterStemmer implementation.

fn (LancasterStemmer) set_rules #

fn (mut stemmer LancasterStemmer) set_rules(rules []string) !

set_rules replaces the rules of stemmer and parses them.

fn (LancasterStemmer) stem #

fn (mut stemmer LancasterStemmer) stem(word string) !string

stem serves as a wrapper for do_stemming, which is private.
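A usage sketch (import path vsl.ml.nlp assumed; the expected stem follows NLTK's Lancaster rules, on which this implementation is based):

```v
import vsl.ml.nlp

fn main() {
	mut stemmer := nlp.LancasterStemmer.new(false)
	stemmed := stemmer.stem('maximum') or { panic(err) }
	println(stemmed) // 'maxim' under the default Lancaster rules
}
```

Note that stem takes `mut stemmer`, so the stemmer variable must be declared mutable.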