ml.nlp #
Natural Language Processing Tools
This submodule offers tools for Natural Language Processing.
Examples
A full set of examples is available in the examples directory of the VSL repository.
Constants #
const ngram_sep = 'NGRAMSEP'
fn count_vectorize #
fn count_vectorize(ngrams [][]string, most_frequent [][]string) []int
count_vectorize returns an array with the number of occurrences, in ngrams, of each ngram in most_frequent. For example, assume ng := [['hello'], ['hello'], ['hi']]. Then nlp.count_vectorize(ng, nlp.most_frequent_ngrams(ng, 0)) should return [2, 1]. See most_frequent_ngrams for more details on how it works.
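The example above as a runnable program (assuming VSL is installed and importable as vsl.ml.nlp):

```v
import vsl.ml.nlp

fn main() {
	ng := [['hello'], ['hello'], ['hi']]
	// With n_features <= 0, most_frequent_ngrams keeps every unique ngram.
	mf := nlp.most_frequent_ngrams(ng, 0) or { panic(err) }
	println(nlp.count_vectorize(ng, mf)) // [2, 1]
}
```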
fn inverse_document_frequencies #
fn inverse_document_frequencies(document [][][]string) !map[string]f64
inverse_document_frequencies will return the IDF of each unique ngram in document by calling term_idf on each one. Check term_idf for more details about the document parameter.
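A minimal sketch of how a document is shaped and passed in (the two-sentence document here is illustrative only):

```v
import vsl.ml.nlp

fn main() {
	// A document is an array of sentences; each sentence is an array of ngrams.
	doc := [
		[['hello'], ['world']],
		[['hello'], ['again']],
	]
	idf := nlp.inverse_document_frequencies(doc) or { panic(err) }
	// Keys are ngrams joined by nlp.ngram_sep (plain words for 1-grams).
	println(idf)
}
```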
fn most_frequent_ngrams #
fn most_frequent_ngrams(ngrams [][]string, n_features int) ![][]string
most_frequent_ngrams returns an array with up to n_features elements denoting the most frequent ngrams in ngrams. Since V does not support maps keyed by arrays, the ngrams are joined internally by the constant ngram_sep ('NGRAMSEP'). If n_features is <= 0, it will be set to ngrams.len.
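A small example; the expected output assumes results are ordered by descending frequency:

```v
import vsl.ml.nlp

fn main() {
	ng := [['the', 'quick'], ['the', 'quick'], ['brown', 'fox']]
	// Keep only the single most frequent ngram.
	top := nlp.most_frequent_ngrams(ng, 1) or { panic(err) }
	println(top) // expected: [['the', 'quick']]
}
```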
fn ngrams #
fn ngrams(tokens []string, n int) ![][]string
ngrams will return an array of grams containing n elements each, taken from tokens. Example: ngrams('the apple is red'.split(' '), 3) will return [['the', 'apple', 'is'], ['apple', 'is', 'red']].
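The same example as a runnable program:

```v
import vsl.ml.nlp

fn main() {
	tokens := 'the apple is red'.split(' ')
	trigrams := nlp.ngrams(tokens, 3) or { panic(err) }
	println(trigrams) // [['the', 'apple', 'is'], ['apple', 'is', 'red']]
}
```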
fn remove_punctuation #
fn remove_punctuation(x string) string
remove_punctuation will remove the following characters from the string: ,.[]()[]-=_*;:+><\\"´~^!?@#$%¨&/|'`
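A short illustrative call (the expected output assumes only the listed characters are stripped):

```v
import vsl.ml.nlp

fn main() {
	// ',' and '!' are in the removed set; letters and spaces are kept.
	println(nlp.remove_punctuation('hello, world!')) // expected: 'hello world'
}
```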
fn remove_stopwords #
fn remove_stopwords(tokens []string, stopwords []string, ignore_case bool) []string
remove_stopwords will remove all tokens included in stopwords. If ignore_case is true, "THIS" will be considered the same as "this".
fn remove_stopwords_en #
fn remove_stopwords_en(tokens []string, ignore_case bool) []string
remove_stopwords_en is a wrapper for remove_stopwords, passing a default array of common English stopwords.
fn term_frequencies #
fn term_frequencies(ngrams_sentence [][]string) !map[string]f64
term_frequencies will return the frequency of each term in a sentence. However, since the VSL NLP tools deal with ngrams instead of regular "words" (1-gram tokens), our sentence is actually a collection of ngrams, and our words/terms are the ngrams themselves. Keep in mind that, since V does not support maps keyed by arrays, ngrams are joined by the constant ngram_sep, which can be found in ml/tokenizer.v. This is why it returns a map[string]f64 and not a map[[]string]f64.
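An illustrative call; the exact values depend on how frequencies are normalized, so they are not asserted here:

```v
import vsl.ml.nlp

fn main() {
	// A sentence is an array of ngrams; here, three 1-grams.
	sentence := [['hello'], ['hello'], ['world']]
	tf := nlp.term_frequencies(sentence) or { panic(err) }
	// Keys are ngrams joined by nlp.ngram_sep; 'hello' occurs twice, 'world' once.
	println(tf)
}
```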
fn term_idf #
fn term_idf(term []string, document [][][]string) !f64
term_idf will return the IDF of the single term term. However, since the VSL NLP tools deal with ngrams instead of regular "words" (1-gram tokens), our document is actually a collection of ngrams, and our words/terms are the ngrams themselves. This means that a document (a collection of sentences) is actually a [][][]string (an array of sentences, which are themselves arrays of ngrams, which are arrays of strings). Keep in mind that, since V does not support maps keyed by arrays, ngrams are joined internally by the constant ngram_sep, which can be found in ml/tokenizer.v.
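A minimal sketch using an illustrative two-sentence document:

```v
import vsl.ml.nlp

fn main() {
	doc := [
		[['hello'], ['world']],
		[['hello'], ['again']],
	]
	// 'world' appears in only one of the two sentences.
	idf := nlp.term_idf(['world'], doc) or { panic(err) }
	println(idf)
}
```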
fn tf_idf #
fn tf_idf(ngram []string, sentence [][]string, document [][][]string) !f64
tf_idf will return the TF * IDF for any given ngram, in a sentence, in a document.
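Putting the pieces together, a hedged end-to-end sketch:

```v
import vsl.ml.nlp

fn main() {
	doc := [
		[['hello'], ['world']],
		[['hello'], ['again']],
	]
	// Score 'world' within the first sentence, relative to the whole document.
	sentence := doc[0]
	score := nlp.tf_idf(['world'], sentence, doc) or { panic(err) }
	println(score)
}
```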
fn tokenize #
fn tokenize(x string) []string
tokenize will return an array of tokens from the string x.
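A minimal call (the exact tokenization rules are not asserted here):

```v
import vsl.ml.nlp

fn main() {
	tokens := nlp.tokenize('The quick brown fox')
	println(tokens)
}
```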
fn LancasterStemmer.new #
fn LancasterStemmer.new(strip_prefix bool) LancasterStemmer
LancasterStemmer.new returns a LancasterStemmer struct with a predefined set of stemming rules.
struct LancasterStemmer #
struct LancasterStemmer {
mut:
rule_map map[string][]string
pub mut:
strip_prefix bool
rules []string = [
'ai*2.',
'a*1.',
'bb1.',
'city3s.',
'ci2>',
'cn1t>',
'dd1.',
'dei3y>',
'deec2ss.',
'dee1.',
'de2>',
'dooh4>',
'e1>',
'feil1v.',
'fi2>',
'gni3>',
'gai3y.',
'ga2>',
'gg1.',
'ht*2.',
'hsiug5ct.',
'hsi3>',
'i*1.',
'i1y>',
'ji1d.',
'juf1s.',
'ju1d.',
'jo1d.',
'jeh1r.',
'jrev1t.',
'jsim2t.',
'jn1d.',
'j1s.',
'lbaifi6.',
'lbai4y.',
'lba3>',
'lbi3.',
'lib2l>',
'lc1.',
'lufi4y.',
'luf3>',
'lu2.',
'lai3>',
'lau3>',
'la2>',
'll1.',
'mui3.',
'mu*2.',
'msi3>',
'mm1.',
'nois4j>',
'noix4ct.',
'noi3>',
'nai3>',
'na2>',
'nee0.',
'ne2>',
'nn1.',
'pihs4>',
'pp1.',
're2>',
'rae0.',
'ra2.',
'ro2>',
'ru2>',
'rr1.',
'rt1>',
'rei3y>',
'sei3y>',
'sis2.',
'si2>',
'ssen4>',
'ss0.',
'suo3>',
'su*2.',
's*1>',
's0.',
'tacilp4y.',
'ta2>',
'tnem4>',
'tne3>',
'tna3>',
'tpir2b.',
'tpro2b.',
'tcud1.',
'tpmus2.',
'tpec2iv.',
'tulo2v.',
'tsis0.',
'tsi3>',
'tt1.',
'uqi3.',
'ugo1.',
'vis3j>',
'vie0.',
'vi2>',
'ylb1>',
'yli3y>',
'ylp0.',
'yl2>',
'ygo1.',
'yhp1.',
'ymo1.',
'ypo1.',
'yti3>',
'yte3>',
'ytl2.',
'yrtsi5.',
'yra3>',
'yro3>',
'yfi3.',
'ycn2t>',
'yca3>',
'zi2>',
'zy1s.',
]
}
All credit goes to the respective authors of NLTK's LancasterStemmer implementation.
fn (LancasterStemmer) set_rules #
fn (mut stemmer LancasterStemmer) set_rules(rules []string) !
set_rules redefines the rules of the stemmer and parses them.
fn (LancasterStemmer) stem #
fn (mut stemmer LancasterStemmer) stem(word string) !string
stem serves as a wrapper for do_stemming, which is private.
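A usage sketch of the stemmer; the expected stem follows NLTK's LancasterStemmer behavior, on which these rules are based:

```v
import vsl.ml.nlp

fn main() {
	// stem has a mutable receiver, so the stemmer must be declared mut.
	mut stemmer := nlp.LancasterStemmer.new(false)
	stemmed := stemmer.stem('maximum') or { panic(err) }
	println(stemmed) // e.g. 'maxim'
}
```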