dstk.models.predict package
Submodules
dstk.models.predict.word2vec module
This module provides predictive distributional semantic models for learning word embeddings from linguistically preprocessed corpora.
It implements neural network-based distributional semantic models that learn dense vector representations of words by predicting lexical contexts rather than counting co-occurrences. Documents can be optionally normalized before training through lowercasing, lemmatization or stemming, part-of-speech filtering, stop-word removal, and punctuation removal. The trained models can then be used to compute semantic similarity, identify nearest neighbors, and export learned embeddings for downstream analyses.
Core functionalities include:
* Training Skip-Gram with Negative Sampling (SGNS) word embedding models
* Training FastText word embedding models with subword information
* Applying optional linguistic preprocessing before model training
* Converting trained embedding models into tabular representations
* Computing cosine similarity between word vectors
* Retrieving exact and approximate nearest semantic neighbors
* Returning intermediate workflow outputs for inspection or reuse
The module is intended for lexical semantic analysis in digital humanities and corpus linguistics workflows.
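To make the similarity and nearest-neighbor functionality listed above concrete, here is a minimal sketch in plain Python that does not depend on dstk. The names `cosine_similarity`, `nearest_neighbors`, and `toy_embeddings` are assumptions introduced for illustration and are not part of the module's API. The vectors are invented, and the search is exact (brute force) rather than approximate.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen in this sketch for zero vectors
    return dot / (norm_u * norm_v)


def nearest_neighbors(word, embeddings, k=2):
    """Rank every other word by cosine similarity to `word` (exact search)."""
    target = embeddings[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in embeddings.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors, invented for illustration only.
toy_embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.75, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

neighbors = nearest_neighbors("king", toy_embeddings, k=2)
print(neighbors[0][0])  # queen
```

Exact search compares the query against every vector, which is fine for small vocabularies. Approximate methods trade some accuracy for speed on large ones.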
- dstk.models.predict.word2vec.Fasttext(document: Document, *, lowercase: bool = True, base_form: Literal['lemma', 'stem'] | None = 'lemma', allowed_pos: set[str] | None = None, remove_stop_words: bool = True, language: str | None = None, custom_stop_words: list[str] | None = None, window_size: int = 5, n_dimensions: int = 100, n_negative_samples: int = 5, subsampling: float = 0.0001, frequency_threshold: int = 5, min_ngram_size: int = 3, max_ngram_size: int = 6, model: str = 'skipgram', return_parameters: None = None, return_all: Literal[False] = False, **kwargs: dict[str, object]) → DistanceMeasurements
- dstk.models.predict.word2vec.Fasttext(document: Document, *, lowercase: bool = True, base_form: Literal['lemma', 'stem'] | None = 'lemma', allowed_pos: set[str] | None = None, remove_stop_words: bool = True, language: str | None = None, custom_stop_words: list[str] | None = None, window_size: int = 5, n_dimensions: int = 100, n_negative_samples: int = 5, subsampling: float = 0.0001, frequency_threshold: int = 5, min_ngram_size: int = 3, max_ngram_size: int = 6, model: str = 'skipgram', return_parameters: list[str], return_all: Literal[False] = False, **kwargs: dict[str, object]) → Generator[Any, None, None]
- dstk.models.predict.word2vec.Fasttext(document: Document, *, lowercase: bool = True, base_form: Literal['lemma', 'stem'] | None = 'lemma', allowed_pos: set[str] | None = None, remove_stop_words: bool = True, language: str | None = None, custom_stop_words: list[str] | None = None, window_size: int = 5, n_dimensions: int = 100, n_negative_samples: int = 5, subsampling: float = 0.0001, frequency_threshold: int = 5, min_ngram_size: int = 3, max_ngram_size: int = 6, model: str = 'skipgram', return_parameters: None = None, return_all: Literal[True], **kwargs: dict[str, object]) → Generator[ParameterResult, None, None]
Generates word embeddings using FastText, as described by Lenci & Sahlgren (164-165).
The document is optionally normalized (lowercasing, lemmatization/stemming, POS filtering, stop-word removal) before training a FastText model. Subword information is used to improve representations of rare and out-of-vocabulary words.
- Parameters:
document (Document) – Input document.
lowercase (bool) – Convert words to lowercase before training.
base_form (Literal["lemma", "stem"] | None) – Use lemmas or stems instead of surface forms.
allowed_pos (set[str] | None) – Keep only words with these POS tags.
remove_stop_words (bool) – Remove stop words before training.
window_size (int) – Context window size.
n_dimensions (int) – Embedding dimensionality.
min_ngram_size (int) – Minimum character n-gram length.
max_ngram_size (int) – Maximum character n-gram length.
model (str) – Training algorithm ("skipgram" or "cbow").
return_parameters (list[str] | None) –
Return only the specified workflow results.
Available values:
* context.selection.unit
* context.selection.lexical
* sentences_to_string
* save_sentences
* trained_model
* embeddings_dataframe
* vector_similarity.geometric_measures.similarity
return_all (bool) – Return all workflow results.
kwargs – Additional arguments passed to FastText.
For more information check: https://fasttext.cc/docs/en/python-module.html
- Returns:
Semantic distance measurements or workflow results.
- Return type:
DistanceMeasurements | ReturnParameterGenerator | ReturnAllGenerator
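The subword information mentioned for FastText comes from character n-grams. The sketch below shows how such n-grams can be enumerated for lengths between `min_ngram_size` and `max_ngram_size`. The helper `char_ngrams` is a hypothetical name introduced here and is not part of dstk or of the fastText library. The `<` and `>` word-boundary markers follow the convention used by FastText. This is a simplified illustration: it omits the hashing of n-grams into buckets and the extra whole-word vector that the real model uses.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return all character n-grams of the boundary-marked word."""
    marked = f"<{word}>"  # boundary markers distinguish prefixes and suffixes
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of length n across the marked word.
        for start in range(len(marked) - n + 1):
            ngrams.append(marked[start:start + n])
    return ngrams


trigrams = char_ngrams("where", min_n=3, max_n=3)
print(trigrams)  # ['<wh', 'whe', 'her', 'ere', 're>']
```

Because an unseen word still shares n-grams with known words, a vector can be composed for it from those shared pieces. This is why subword models handle rare and out-of-vocabulary words better than word-level models.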
- dstk.models.predict.word2vec.SGNS(document: Document, *, lowercase: bool = True, base_form: Literal['lemma', 'stem'] | None = 'lemma', allowed_pos: set[str] | None = None, remove_stop_words: bool = True, language: str | None = None, custom_stop_words: list[str] | None = None, window_size: int = 5, n_dimensions: int = 300, n_negative_samples: int = 5, word_probability_distribution: float = 0.75, subsampling: float = 1e-05, frequency_threshold: int = 5, return_parameters: None = None, return_all: Literal[False] = False, **kwargs: dict[str, object]) → DistanceMeasurements
- dstk.models.predict.word2vec.SGNS(document: Document, *, lowercase: bool = True, base_form: Literal['lemma', 'stem'] | None = 'lemma', allowed_pos: set[str] | None = None, remove_stop_words: bool = True, language: str | None = None, custom_stop_words: list[str] | None = None, window_size: int = 5, n_dimensions: int = 300, n_negative_samples: int = 5, word_probability_distribution: float = 0.75, subsampling: float = 1e-05, frequency_threshold: int = 5, return_parameters: list[str], return_all: Literal[False] = False, **kwargs: dict[str, object]) → Generator[Any, None, None]
- dstk.models.predict.word2vec.SGNS(document: Document, *, lowercase: bool = True, base_form: Literal['lemma', 'stem'] | None = 'lemma', allowed_pos: set[str] | None = None, remove_stop_words: bool = True, language: str | None = None, custom_stop_words: list[str] | None = None, window_size: int = 5, n_dimensions: int = 300, n_negative_samples: int = 5, word_probability_distribution: float = 0.75, subsampling: float = 1e-05, frequency_threshold: int = 5, return_parameters: None = None, return_all: Literal[True], **kwargs: dict[str, object]) → Generator[ParameterResult, None, None]
Generates word embeddings using Skip-Gram with Negative Sampling (SGNS), as described by Lenci & Sahlgren (162-163).
The document is optionally normalized (lowercasing, lemmatization/stemming, POS filtering, stop-word removal) before training a Word2Vec model. The resulting embeddings can be explored through cosine similarity and nearest-neighbor methods.
- Parameters:
document (Document) – Input document.
lowercase (bool) – Convert words to lowercase before training.
base_form (Literal["lemma", "stem"] | None) – Use lemmas or stems instead of surface forms.
allowed_pos (set[str] | None) – Keep only words with these POS tags.
remove_stop_words (bool) – Remove stop words before training.
window_size (int) – Context window size.
n_dimensions (int) – Embedding dimensionality.
frequency_threshold (int) – Minimum word frequency.
return_parameters (list[str] | None) –
Return only the specified workflow results.
Available values:
* context.selection.unit
* context.selection.lexical
* sentences_to_string
* trained_model
* embeddings_dataframe
* vector_similarity.geometric_measures.similarity
return_all (bool) – Return all workflow results.
kwargs – Additional arguments passed to gensim.models.Word2Vec.
For more information check: https://radimrehurek.com/gensim/models/word2vec.html
- Returns:
Semantic distance measurements or workflow results.
- Return type:
DistanceMeasurements | ReturnParameterGenerator | ReturnAllGenerator
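As a rough illustration of what SGNS trains on, the sketch below generates (target, context) pairs from a fixed window and builds the smoothed unigram distribution from which negative samples are usually drawn. The helpers `skipgram_pairs` and `noise_distribution` are hypothetical names and are not part of dstk or gensim. Treating `word_probability_distribution` as the smoothing exponent (0.75 by default) is an assumption based on the parameter's default value, not something the documentation above states. Real implementations may also shrink the window at random and subsample frequent words, which this sketch omits.

```python
from collections import Counter


def skipgram_pairs(tokens, window_size=2):
    """Yield (target, context) pairs for every token inside the window."""
    for i, target in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                yield target, tokens[j]


def noise_distribution(tokens, exponent=0.75):
    """Unigram counts raised to `exponent`, normalized to sum to 1."""
    counts = Counter(tokens)
    weights = {word: count ** exponent for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}


tokens = ["the", "cat", "sat", "on", "the", "mat"]
pairs = list(skipgram_pairs(tokens, window_size=1))
noise = noise_distribution(tokens)

print(pairs[:3])  # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat')]
```

An exponent below 1 flattens the distribution, so frequent words such as "the" are drawn as negatives somewhat less often than their raw frequency would suggest.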