dstk.models.predict package

Submodules

dstk.models.predict.word2vec module

This module provides predictive distributional semantic models for learning word embeddings from linguistically preprocessed corpora.

It implements neural network-based distributional semantic models that learn dense vector representations of words by predicting lexical contexts rather than counting co-occurrences. Documents can be optionally normalized before training through lowercasing, lemmatization or stemming, part-of-speech filtering, stop-word removal, and punctuation removal. The trained models can then be used to compute semantic similarity, identify nearest neighbors, and export learned embeddings for downstream analyses.
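The optional normalization steps described above can be illustrated with a small, self-contained sketch. It does not use dstk: the `Token` class, the `normalize` function, and the order in which the steps are applied are all assumptions made for illustration, not the module's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for an annotated token (not a dstk type)."""
    text: str
    lemma: str
    pos: str
    is_punct: bool = False


def normalize(tokens, *, lowercase=True, use_lemma=True,
              allowed_pos=None, stop_words=frozenset()):
    """Apply the optional normalization steps in a fixed, illustrative order."""
    result = []
    for tok in tokens:
        if tok.is_punct:
            continue  # punctuation removal
        if allowed_pos is not None and tok.pos not in allowed_pos:
            continue  # part-of-speech filtering
        form = tok.lemma if use_lemma else tok.text  # base form selection
        if lowercase:
            form = form.lower()
        if form.lower() in stop_words:
            continue  # stop-word removal
        result.append(form)
    return result


sentence = [
    Token("The", "the", "DET"),
    Token("Cats", "cat", "NOUN"),
    Token("were", "be", "AUX"),
    Token("sleeping", "sleep", "VERB"),
    Token(".", ".", "PUNCT", is_punct=True),
]

words = normalize(sentence, allowed_pos={"NOUN", "VERB"}, stop_words={"the"})
print(words)  # ['cat', 'sleep']
```

Each step is independent, which mirrors how the keyword arguments of the functions documented on this page can be switched on or off separately.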

Core functionalities include:

- Training Skip-Gram with Negative Sampling (SGNS) word embedding models
- Training FastText word embedding models with subword information
- Applying optional linguistic preprocessing before model training
- Converting trained embedding models into tabular representations
- Computing cosine similarity between word vectors
- Retrieving exact and approximate nearest semantic neighbors
- Returning intermediate workflow outputs for inspection or reuse
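Cosine similarity and exact nearest-neighbor retrieval can be sketched in plain Python as follows. This is an illustration of the underlying computation only: the function names and the toy vectors are hypothetical, and the approximate nearest-neighbor search mentioned above is not covered.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def nearest_neighbors(word, embeddings, k=2):
    """Exact (brute-force) search: score every other word, keep the top k."""
    target = embeddings[word]
    scored = [
        (other, cosine_similarity(target, vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy three-dimensional vectors, invented for the example.
toy_embeddings = {
    "king": [0.9, 0.1, 0.0],
    "queen": [0.8, 0.2, 0.1],
    "apple": [0.0, 0.1, 0.9],
}

for neighbor, score in nearest_neighbors("king", toy_embeddings):
    print(f"{neighbor}: {score:.3f}")
```

Brute-force search compares the target against every vocabulary item, so it is exact but scales linearly with vocabulary size; approximate methods trade some accuracy for speed.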

The module is intended for lexical semantic analysis in digital humanities and corpus linguistics workflows.

dstk.models.predict.word2vec.Fasttext(document: Document, *, lowercase: bool = True, base_form: Literal['lemma', 'stem'] | None = 'lemma', allowed_pos: set[str] | None = None, remove_stop_words: bool = True, language: str | None = None, custom_stop_words: list[str] | None = None, window_size: int = 5, n_dimensions: int = 100, n_negative_samples: int = 5, subsampling: float = 0.0001, frequency_threshold: int = 5, min_ngram_size: int = 3, max_ngram_size: int = 6, model: str = 'skipgram', return_parameters: None = None, return_all: Literal[False] = False, **kwargs: dict[str, object]) → DistanceMeasurements

dstk.models.predict.word2vec.Fasttext(document: Document, *, lowercase: bool = True, base_form: Literal['lemma', 'stem'] | None = 'lemma', allowed_pos: set[str] | None = None, remove_stop_words: bool = True, language: str | None = None, custom_stop_words: list[str] | None = None, window_size: int = 5, n_dimensions: int = 100, n_negative_samples: int = 5, subsampling: float = 0.0001, frequency_threshold: int = 5, min_ngram_size: int = 3, max_ngram_size: int = 6, model: str = 'skipgram', return_parameters: list[str], return_all: Literal[False] = False, **kwargs: dict[str, object]) → Generator[Any, None, None]

dstk.models.predict.word2vec.Fasttext(document: Document, *, lowercase: bool = True, base_form: Literal['lemma', 'stem'] | None = 'lemma', allowed_pos: set[str] | None = None, remove_stop_words: bool = True, language: str | None = None, custom_stop_words: list[str] | None = None, window_size: int = 5, n_dimensions: int = 100, n_negative_samples: int = 5, subsampling: float = 0.0001, frequency_threshold: int = 5, min_ngram_size: int = 3, max_ngram_size: int = 6, model: str = 'skipgram', return_parameters: None = None, return_all: Literal[True], **kwargs: dict[str, object]) → Generator[ParameterResult, None, None]

Generates word embeddings using FastText, as described by Lenci & Sahlgren (pp. 164-165).

The document is optionally normalized (lowercasing, lemmatization/stemming, POS filtering, stop-word removal) before training a FastText model. Subword information is used to improve representations of rare and out-of-vocabulary words.
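To make the role of `min_ngram_size` and `max_ngram_size` concrete, the sketch below extracts character n-grams the way the FastText approach is usually described: the word is wrapped in `<` and `>` boundary markers and every substring of each allowed length is collected. The helper name `char_ngrams` is hypothetical, and the real library additionally hashes n-grams into buckets, which is omitted here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return all character n-grams of the boundary-wrapped word."""
    wrapped = f"<{word}>"  # boundary markers distinguish prefixes and suffixes
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


print(char_ngrams("where", 3, 3))  # ['<wh', 'whe', 'her', 'ere', 're>']
print(len(char_ngrams("where")))   # 14
```

Because a word's vector is built from the vectors of its n-grams, a word never seen during training can still receive a representation from n-grams it shares with known words.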

Parameters:

document (Document) – Input document.
lowercase (bool) – Convert words to lowercase before training.
base_form (Literal["lemma", "stem"] | None) – Use lemmas or stems instead of surface forms.
allowed_pos (set[str] | None) – Keep only words with these POS tags.
remove_stop_words (bool) – Remove stop words before training.
window_size (int) – Context window size.
n_dimensions (int) – Embedding dimensionality.
min_ngram_size (int) – Minimum character n-gram length.
max_ngram_size (int) – Maximum character n-gram length.
model (str) – Training algorithm ("skipgram" or "cbow").
return_parameters (list[str] | None) – Return only the specified workflow results.

Available values:
- context.selection.unit
- context.selection.lexical
- sentences_to_string
- save_sentences
- trained_model
- embeddings_dataframe
- vector_similarity.geometric_measures.similarity
return_all (bool) – Return all workflow results.
kwargs – Additional arguments passed to FastText.

For more information, see https://fasttext.cc/docs/en/python-module.html

Returns: Semantic distance measurements or workflow results.
Return type: DistanceMeasurements | ReturnParameterGenerator | ReturnAllGenerator
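The three Fasttext overloads differ only in what they return: a single final result by default, a generator of selected intermediate results when `return_parameters` is given, and a generator over all results when `return_all=True`. The sketch below mimics that calling convention with a hypothetical `run_workflow` function and placeholder steps. It is not dstk code, and representing each result as a `(name, result)` pair is an assumption, since the structure of `ParameterResult` is not documented on this page.

```python
# Hypothetical three-step workflow; step names and outputs are placeholders.
STEPS = [
    ("tokens", lambda text: text.split()),
    ("lowercased", lambda tokens: [t.lower() for t in tokens]),
    ("vocabulary", lambda tokens: sorted(set(tokens))),
]


def _iter_steps(data):
    """Run each step on the previous step's output, yielding (name, result)."""
    for name, step in STEPS:
        data = step(data)
        yield name, data


def run_workflow(data, *, return_parameters=None, return_all=False):
    """Dispatch on the two flags, mirroring the overload pattern."""
    if return_all:
        return _iter_steps(data)  # generator over every (name, result) pair
    if return_parameters is not None:
        wanted = set(return_parameters)
        # generator over the requested intermediate results only
        return (result for name, result in _iter_steps(data) if name in wanted)
    final = None
    for _, final in _iter_steps(data):
        pass  # exhaust the workflow and keep only the last result
    return final


text = "The cat the Cat"
print(run_workflow(text))                                      # final result
print(list(run_workflow(text, return_parameters=["tokens"])))  # selected step
print([name for name, _ in run_workflow(text, return_all=True)])
```

Generators are consumed lazily, so a caller who only needs an early intermediate result can stop iterating before later, more expensive steps run.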

dstk.models.predict.word2vec.SGNS(document: Document, *, lowercase: bool = True, base_form: Literal['lemma', 'stem'] | None = 'lemma', allowed_pos: set[str] | None = None, remove_stop_words: bool = True, language: str | None = None, custom_stop_words: list[str] | None = None, window_size: int = 5, n_dimensions: int = 300, n_negative_samples: int = 5, word_probability_distribution: float = 0.75, subsampling: float = 1e-05, frequency_threshold: int = 5, return_parameters: None = None, return_all: Literal[False] = False, **kwargs: dict[str, object]) → DistanceMeasurements

dstk.models.predict.word2vec.SGNS(document: Document, *, lowercase: bool = True, base_form: Literal['lemma', 'stem'] | None = 'lemma', allowed_pos: set[str] | None = None, remove_stop_words: bool = True, language: str | None = None, custom_stop_words: list[str] | None = None, window_size: int = 5, n_dimensions: int = 300, n_negative_samples: int = 5, word_probability_distribution: float = 0.75, subsampling: float = 1e-05, frequency_threshold: int = 5, return_parameters: list[str], return_all: Literal[False] = False, **kwargs: dict[str, object]) → Generator[Any, None, None]

dstk.models.predict.word2vec.SGNS(document: Document, *, lowercase: bool = True, base_form: Literal['lemma', 'stem'] | None = 'lemma', allowed_pos: set[str] | None = None, remove_stop_words: bool = True, language: str | None = None, custom_stop_words: list[str] | None = None, window_size: int = 5, n_dimensions: int = 300, n_negative_samples: int = 5, word_probability_distribution: float = 0.75, subsampling: float = 1e-05, frequency_threshold: int = 5, return_parameters: None = None, return_all: Literal[True], **kwargs: dict[str, object]) → Generator[ParameterResult, None, None]

Generates word embeddings with Skip-Gram with Negative Sampling (SGNS), as described by Lenci & Sahlgren (pp. 162-163).

The document is optionally normalized (lowercasing, lemmatization/stemming, POS filtering, stop-word removal) before training a Word2Vec model. The resulting embeddings can be explored through cosine similarity and nearest-neighbor methods.
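In negative sampling, noise words are commonly drawn from the unigram distribution raised to a power, with 0.75 as the customary exponent. The `word_probability_distribution` default of 0.75 in the signature presumably plays this role, but that mapping is an assumption and is not confirmed on this page. The hypothetical helper below shows the effect of the exponent on a toy vocabulary.

```python
def negative_sampling_distribution(counts, exponent=0.75):
    """Normalize count**exponent into a probability distribution over words."""
    weights = {word: count ** exponent for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}


# Toy counts, invented for the example.
counts = {"the": 1000, "cat": 10}

raw = negative_sampling_distribution(counts, exponent=1.0)       # plain unigram
smoothed = negative_sampling_distribution(counts, exponent=0.75)

for word in counts:
    print(f"{word}: raw={raw[word]:.4f} smoothed={smoothed[word]:.4f}")
```

An exponent below 1 flattens the distribution, so rare words are sampled as negatives more often than their raw frequency would suggest.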

Parameters:

document (Document) – Input document.
lowercase (bool) – Convert words to lowercase before training.
base_form (Literal["lemma", "stem"] | None) – Use lemmas or stems instead of surface forms.
allowed_pos (set[str] | None) – Keep only words with these POS tags.
remove_stop_words (bool) – Remove stop words before training.
window_size (int) – Context window size.
n_dimensions (int) – Embedding dimensionality.
frequency_threshold (int) – Minimum word frequency.
return_parameters (list[str] | None) – Return only the specified workflow results.

Available values:
- context.selection.unit
- context.selection.lexical
- sentences_to_string
- trained_model
- embeddings_dataframe
- vector_similarity.geometric_measures.similarity
return_all (bool) – Return all workflow results.
kwargs – Additional arguments passed to gensim.models.Word2Vec.

For more information, see https://radimrehurek.com/gensim/models/word2vec.html

Returns: Semantic distance measurements or workflow results.
Return type: DistanceMeasurements | ReturnParameterGenerator | ReturnAllGenerator
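The `subsampling` and `frequency_threshold` arguments both control which tokens reach training. A minimal sketch, assuming the subsampling rule from the original word2vec paper (discard a token with probability 1 - sqrt(t / f), where t is the threshold and f the word's relative frequency) and a simple minimum-count filter. Both helper names are hypothetical, and actual implementations may use a variant of the formula.

```python
import math
from collections import Counter


def discard_probability(relative_frequency, threshold=1e-5):
    """Probability of dropping one occurrence of a word during subsampling."""
    if relative_frequency <= threshold:
        return 0.0  # words at or below the threshold are always kept
    return 1.0 - math.sqrt(threshold / relative_frequency)


def apply_frequency_threshold(tokens, min_count=5):
    """Drop every token whose word occurs fewer than min_count times."""
    counts = Counter(tokens)
    return [token for token in tokens if counts[token] >= min_count]


# A word making up 0.1% of the corpus is dropped about 90% of the time.
print(round(discard_probability(1e-3), 3))

tokens = ["a"] * 5 + ["b"] * 2
print(apply_frequency_threshold(tokens))
```

The two filters act at opposite ends of the frequency range: subsampling thins out very frequent words, while the frequency threshold removes words too rare to learn reliable vectors for.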
