Natural Language Processing Tools

This submodule offers tools for Natural Language Processing.


Here you can see a full set of examples.

Constants #

const ngram_sep = 'NGRAMSEP'

fn count_vectorize #

fn count_vectorize(ngrams [][]string, most_frequent [][]string) []int

count_vectorize returns an array with the number of times each ngram in most_frequent occurs in ngrams. Assume ng := [['hello'], ['hello'], ['hi']]. nlp.count_vectorize(ng, nlp.most_frequent_ngrams(ng, 0)!) should return [2, 1]. See most_frequent_ngrams for more details on how it works.
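The example from the description can be written out as a small program. This is a sketch that assumes the module is importable as vsl.nlp (the import path is not stated on this page); the variable name vocabulary is illustrative.

```v
import vsl.nlp // assumed import path

fn main() {
	// Three unigrams: 'hello' appears twice, 'hi' once.
	ng := [['hello'], ['hello'], ['hi']]

	// most_frequent_ngrams returns a result type, so it is unwrapped with `!`.
	vocabulary := nlp.most_frequent_ngrams(ng, 0)!

	// One count per entry of `vocabulary`.
	counts := nlp.count_vectorize(ng, vocabulary)
	println(counts) // [2, 1] according to the description above
}
```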

fn inverse_document_frequencies #

fn inverse_document_frequencies(document [][][]string) !map[string]f64

inverse_document_frequencies will return the IDF of each term by calling term_idf on each unique ngram. Check term_idf for more details about the parameter document.
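A possible way to build the document argument and inspect the result is sketched below. It assumes the vsl.nlp import path and uses unigrams produced with ngrams; the two sentences are made-up sample data.

```v
import vsl.nlp // assumed import path

fn main() {
	// A document is an array of sentences; each sentence is an array of ngrams.
	// Unigrams (n = 1) keep the example small.
	document := [
		nlp.ngrams('the apple is red'.split(' '), 1)!,
		nlp.ngrams('the sky is blue'.split(' '), 1)!,
	]

	// One map entry per unique ngram in the document.
	idf := nlp.inverse_document_frequencies(document)!
	for term, value in idf {
		println('${term}: ${value}')
	}
}
```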

fn most_frequent_ngrams #

fn most_frequent_ngrams(ngrams [][]string, n_features int) ![][]string

most_frequent_ngrams returns an array with up to n_features elements denoting the most frequent ngrams in ngrams. Since V does not support maps keyed by arrays, the ngrams are joined by ngram_sep ("NGRAMSEP"). If n_features is <= 0, it will be set to ngrams.len.
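The effect of n_features can be illustrated with a short sketch. It assumes the vsl.nlp import path; the names top and everything are illustrative.

```v
import vsl.nlp // assumed import path

fn main() {
	ng := [['hello'], ['hello'], ['hi']]

	// Ask for at most one ngram.
	top := nlp.most_frequent_ngrams(ng, 1)!
	println(top)

	// n_features <= 0 means the limit becomes ngrams.len.
	everything := nlp.most_frequent_ngrams(ng, 0)!
	println(everything)
}
```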

fn ngrams #

fn ngrams(tokens []string, n int) ![][]string

ngrams will return an array of grams containing n elements from tokens. Example: ngrams('the apple is red'.split(' '), 3) will return [['the', 'apple', 'is'], ['apple', 'is', 'red']]
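The same example as a runnable sketch, assuming the module is importable as vsl.nlp. Since ngrams returns a result type, the call is unwrapped with `!`.

```v
import vsl.nlp // assumed import path

fn main() {
	tokens := 'the apple is red'.split(' ')

	// Sliding window of 3 tokens over the 4-token input.
	trigrams := nlp.ngrams(tokens, 3)!
	println(trigrams) // [['the', 'apple', 'is'], ['apple', 'is', 'red']] per the description above
}
```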

fn remove_punctuation #

fn remove_punctuation(x string) string

remove_punctuation will remove the following characters from the string: ,.[]()[]-=_*;:+><\\"´~^!?@#$%¨&/|'`
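A minimal usage sketch, assuming the vsl.nlp import path. The input string is made up.

```v
import vsl.nlp // assumed import path

fn main() {
	// The comma and the exclamation mark are both in the list above,
	// so neither should appear in the result.
	cleaned := nlp.remove_punctuation('Hello, world!')
	println(cleaned)
}
```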

fn remove_stopwords #

fn remove_stopwords(tokens []string, stopwords []string, ignore_case bool) []string

remove_stopwords will remove all tokens included in stopwords. If ignore_case is true, "THIS" will be considered as "this".
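The ignore_case flag can be seen in a short sketch like the following. It assumes the vsl.nlp import path; the token and stopword lists are sample data.

```v
import vsl.nlp // assumed import path

fn main() {
	tokens := ['THIS', 'is', 'a', 'red', 'apple']
	stopwords := ['this', 'is', 'a']

	// With ignore_case = true, 'THIS' is treated as 'this' and dropped as well,
	// so going by the description above only 'red' and 'apple' should remain.
	filtered := nlp.remove_stopwords(tokens, stopwords, true)
	println(filtered)
}
```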

fn remove_stopwords_en #

fn remove_stopwords_en(tokens []string, ignore_case bool) []string

remove_stopwords_en is a wrapper for remove_stopwords, passing a default array of some English stopwords.

fn term_frequencies #

fn term_frequencies(ngrams_sentence [][]string) !map[string]f64

term_frequencies will return the frequency of each term in a sentence. However, since in VSL NLP tools we deal with ngrams instead of regular "words" (1gram tokens), our sentence is actually a collection of ngrams and our words/terms are the ngrams themselves. Keep in mind that, since V does NOT support maps of arrays, ngrams are joined by the constant ngram_sep, which can be found in ml/tokenizer.v. This is why it returns a map[string]f64 and not a map[[]string]f64.
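To look up one ngram in the returned map, the key has to be built the same way, by joining the ngram's tokens with ngram_sep. The sketch below assumes the vsl.nlp import path; the sentence is sample data.

```v
import vsl.nlp // assumed import path

fn main() {
	// Bigrams of a sentence in which ['to', 'be'] occurs twice.
	sentence := nlp.ngrams('to be or not to be'.split(' '), 2)!
	tf := nlp.term_frequencies(sentence)!

	// Map keys are the ngram's tokens joined by nlp.ngram_sep.
	key := ['to', 'be'].join(nlp.ngram_sep)
	println('${key}: ${tf[key]}')
}
```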

fn term_idf #

fn term_idf(term []string, document [][][]string) !f64

term_idf will return the IDF of the single ngram term. However, since in VSL NLP tools we deal with ngrams instead of regular "words" (1gram tokens), our document is actually a collection of ngrams and our words/terms are the ngrams themselves. This means that a document (a collection of sentences) is actually a [][][]string: an array of sentences, each of which is an array of ngrams, each of which is an array of strings. Keep in mind that, since V does NOT support maps of arrays, ngrams are joined by the constant ngram_sep, which can be found in ml/tokenizer.v.
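The nesting of the document parameter is easier to see in code. This sketch assumes the vsl.nlp import path and uses two made-up sentences split into unigrams.

```v
import vsl.nlp // assumed import path

fn main() {
	// [][][]string: sentences -> ngrams -> tokens.
	document := [
		nlp.ngrams('the apple is red'.split(' '), 1)!,
		nlp.ngrams('the sky is blue'.split(' '), 1)!,
	]

	// The term is itself an ngram, so a unigram is a one-element array.
	idf_apple := nlp.term_idf(['apple'], document)!
	idf_the := nlp.term_idf(['the'], document)!
	println('apple: ${idf_apple}, the: ${idf_the}')
}
```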

fn tf_idf #

fn tf_idf(ngram []string, sentence [][]string, document [][][]string) !f64

tf_idf returns TF * IDF for the given ngram: its term frequency within sentence multiplied by its inverse document frequency across document.
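A sketch of a call with all three arguments, assuming the vsl.nlp import path. The sentence passed in is the first sentence of the sample document.

```v
import vsl.nlp // assumed import path

fn main() {
	document := [
		nlp.ngrams('the apple is red'.split(' '), 1)!,
		nlp.ngrams('the sky is blue'.split(' '), 1)!,
	]

	// Score the unigram ['apple'] within the first sentence of the document.
	score := nlp.tf_idf(['apple'], document[0], document)!
	println(score)
}
```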

fn tokenize #

fn tokenize(x string) []string

tokenize will return an array of tokens from the string x.
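tokenize is typically the first step before the other functions in this module. The sketch below chains it with remove_punctuation and remove_stopwords_en; it assumes the vsl.nlp import path, and the exact tokens produced depend on the library's tokenization rules, which are not described here.

```v
import vsl.nlp // assumed import path

fn main() {
	text := 'The apple is red, and the sky is blue!'

	// Strip punctuation first, then split into tokens.
	tokens := nlp.tokenize(nlp.remove_punctuation(text))
	println(tokens)

	// Drop English stopwords, ignoring case.
	content_words := nlp.remove_stopwords_en(tokens, true)
	println(content_words)
}
```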

fn #

fn bool) LancasterStemmer

Returns a LancasterStemmer struct with a predefined set of stemming rules.

struct LancasterStemmer #

struct LancasterStemmer {
	rule_map map[string][]string
pub mut:
	strip_prefix bool
	rules        []string = [

All credits go to the respective authors of NLTK's LancasterStemmer implementation.

fn (LancasterStemmer) set_rules #

fn (mut stemmer LancasterStemmer) set_rules(rules []string) !

set_rules redefines the rules of stemmer and parses them.

fn (LancasterStemmer) stem #

fn (mut stemmer LancasterStemmer) stem(word string) !string

stem serves as a wrapper for do_stemming, which is private.