dstk.models.predict package

Submodules

dstk.models.predict.word2vec module

This module provides predictive distributional semantic models for learning word embeddings from linguistically preprocessed corpora.

It implements neural network-based distributional semantic models that learn dense vector representations of words by predicting lexical contexts rather than counting co-occurrences. Documents can be optionally normalized before training through lowercasing, lemmatization or stemming, part-of-speech filtering, stop-word removal, and punctuation removal. The trained models can then be used to compute semantic similarity, identify nearest neighbors, and export learned embeddings for downstream analyses.
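The optional normalization steps listed above can be pictured with a small, library-independent sketch. The `Token` record and the `normalize` helper are hypothetical names introduced only for illustration; dstk's own `Document` type and its internal preprocessing may work differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical token record; dstk's own Document type may differ."""
    text: str
    lemma: str
    pos: str
    is_punct: bool = False


def normalize(tokens, *, lowercase=True, use_lemma=True,
              allowed_pos=None, stop_words=frozenset()):
    """Apply the optional normalization steps and return plain word strings."""
    words = []
    for tok in tokens:
        if tok.is_punct:
            continue  # punctuation removal
        if allowed_pos is not None and tok.pos not in allowed_pos:
            continue  # part-of-speech filtering
        word = tok.lemma if use_lemma else tok.text  # lemma or surface form
        if lowercase:
            word = word.lower()
        if word in stop_words:
            continue  # stop-word removal
        words.append(word)
    return words


sentence = [
    Token("The", "the", "DET"),
    Token("cats", "cat", "NOUN"),
    Token("were", "be", "AUX"),
    Token("sleeping", "sleep", "VERB"),
    Token(".", ".", "PUNCT", is_punct=True),
]

print(normalize(sentence, stop_words={"the", "be"}))
# ['cat', 'sleep']
```

Each step is independent, so disabling one (for example `use_lemma=False`) leaves the others unchanged.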

Core functionalities include:

  • Training Skip-Gram with Negative Sampling (SGNS) word embedding models

  • Training FastText word embedding models with subword information

  • Applying optional linguistic preprocessing before model training

  • Converting trained embedding models into tabular representations

  • Computing cosine similarity between word vectors

  • Retrieving exact and approximate nearest semantic neighbors

  • Returning intermediate workflow outputs for inspection or reuse

The module is intended for lexical semantic analysis in digital humanities and corpus linguistics workflows.
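As a rough illustration of the similarity and neighbor functionality described above, the sketch below computes cosine similarity and an exact (brute-force) nearest-neighbor ranking over a toy embedding table. The function names and toy vectors are assumptions made for this example and are not part of the dstk API; approximate neighbor search is not covered here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def nearest_neighbors(word, embeddings, k=2):
    """Rank every other word by cosine similarity to `word` (exact search)."""
    target = embeddings[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in embeddings.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy two-dimensional embeddings, invented for the example.
toy_embeddings = {
    "cat": [1.0, 0.0],
    "dog": [0.8, 0.6],
    "car": [0.0, 1.0],
}

for neighbor, score in nearest_neighbors("cat", toy_embeddings):
    print(f"{neighbor}: {score:.2f}")
```

Exact search compares the query against every vector, which is fine for small vocabularies; approximate methods trade some accuracy for speed on large ones.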

dstk.models.predict.word2vec.Fasttext(document: Document, *, lowercase: bool = True, base_form: Literal['lemma', 'stem'] | None = 'lemma', allowed_pos: set[str] | None = None, remove_stop_words: bool = True, language: str | None = None, custom_stop_words: list[str] | None = None, window_size: int = 5, n_dimensions: int = 100, n_negative_samples: int = 5, subsampling: float = 0.0001, frequency_threshold: int = 5, min_ngram_size: int = 3, max_ngram_size: int = 6, model: str = 'skipgram', return_parameters: None = None, return_all: Literal[False] = False, **kwargs: dict[str, object]) → DistanceMeasurements
dstk.models.predict.word2vec.Fasttext(document: Document, *, lowercase: bool = True, base_form: Literal['lemma', 'stem'] | None = 'lemma', allowed_pos: set[str] | None = None, remove_stop_words: bool = True, language: str | None = None, custom_stop_words: list[str] | None = None, window_size: int = 5, n_dimensions: int = 100, n_negative_samples: int = 5, subsampling: float = 0.0001, frequency_threshold: int = 5, min_ngram_size: int = 3, max_ngram_size: int = 6, model: str = 'skipgram', return_parameters: list[str], return_all: Literal[False] = False, **kwargs: dict[str, object]) → Generator[Any, None, None]
dstk.models.predict.word2vec.Fasttext(document: Document, *, lowercase: bool = True, base_form: Literal['lemma', 'stem'] | None = 'lemma', allowed_pos: set[str] | None = None, remove_stop_words: bool = True, language: str | None = None, custom_stop_words: list[str] | None = None, window_size: int = 5, n_dimensions: int = 100, n_negative_samples: int = 5, subsampling: float = 0.0001, frequency_threshold: int = 5, min_ngram_size: int = 3, max_ngram_size: int = 6, model: str = 'skipgram', return_parameters: None = None, return_all: Literal[True], **kwargs: dict[str, object]) → Generator[ParameterResult, None, None]

Generates word embeddings using FastText, as described in Lenci & Sahlgren (164-165).

The document is optionally normalized (lowercasing, lemmatization/stemming, POS filtering, stop-word removal) before training a FastText model. Subword information is used to improve representations of rare and out-of-vocabulary words.
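To make the role of `min_ngram_size` and `max_ngram_size` concrete, the sketch below enumerates the character n-grams of a word in the FastText style, wrapping the word in `<` and `>` boundary markers. The helper name `character_ngrams` is hypothetical, and the real library additionally hashes n-grams into a fixed number of buckets and keeps the whole word as its own unit, which this sketch omits.

```python
def character_ngrams(word, min_n=3, max_n=6):
    """List all character n-grams of `word` with lengths min_n..max_n.

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    are distinguishable from word-internal sequences.
    """
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


print(character_ngrams("where", min_n=3, max_n=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Because a word's vector is built from its n-gram vectors, an unseen word that shares n-grams with known words can still receive a representation.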

Parameters:
  • document (Document) – Input document.

  • lowercase (bool) – Convert words to lowercase before training.

  • base_form (Literal["lemma", "stem"] | None) – Use lemmas or stems instead of surface forms.

  • allowed_pos (set[str] | None) – Keep only words with these POS tags.

  • remove_stop_words (bool) – Remove stop words before training.

  • window_size (int) – Context window size.

  • n_dimensions (int) – Embedding dimensionality.

  • min_ngram_size (int) – Minimum character n-gram length.

  • max_ngram_size (int) – Maximum character n-gram length.

  • model (str) – Training algorithm ("skipgram" or "cbow").

  • return_parameters (list[str] | None) –

    Return only the specified workflow results.

    Available values:

    • context.selection.unit

    • context.selection.lexical

    • sentences_to_string

    • save_sentences

    • trained_model

    • embeddings_dataframe

    • vector_similarity.geometric_measures.similarity

  • return_all (bool) – Return all workflow results.

  • kwargs – Additional arguments passed to FastText.

For more information, see https://fasttext.cc/docs/en/python-module.html

Returns:

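Semantic distance measurements by default; a generator over the requested workflow results when return_parameters is given; a generator over all workflow results when return_all is True.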

Return type:

DistanceMeasurements | ReturnParameterGenerator | ReturnAllGenerator

dstk.models.predict.word2vec.SGNS(document: Document, *, lowercase: bool = True, base_form: Literal['lemma', 'stem'] | None = 'lemma', allowed_pos: set[str] | None = None, remove_stop_words: bool = True, language: str | None = None, custom_stop_words: list[str] | None = None, window_size: int = 5, n_dimensions: int = 300, n_negative_samples: int = 5, word_probability_distribution: float = 0.75, subsampling: float = 1e-05, frequency_threshold: int = 5, return_parameters: None = None, return_all: Literal[False] = False, **kwargs: dict[str, object]) → DistanceMeasurements
dstk.models.predict.word2vec.SGNS(document: Document, *, lowercase: bool = True, base_form: Literal['lemma', 'stem'] | None = 'lemma', allowed_pos: set[str] | None = None, remove_stop_words: bool = True, language: str | None = None, custom_stop_words: list[str] | None = None, window_size: int = 5, n_dimensions: int = 300, n_negative_samples: int = 5, word_probability_distribution: float = 0.75, subsampling: float = 1e-05, frequency_threshold: int = 5, return_parameters: list[str], return_all: Literal[False] = False, **kwargs: dict[str, object]) → Generator[Any, None, None]
dstk.models.predict.word2vec.SGNS(document: Document, *, lowercase: bool = True, base_form: Literal['lemma', 'stem'] | None = 'lemma', allowed_pos: set[str] | None = None, remove_stop_words: bool = True, language: str | None = None, custom_stop_words: list[str] | None = None, window_size: int = 5, n_dimensions: int = 300, n_negative_samples: int = 5, word_probability_distribution: float = 0.75, subsampling: float = 1e-05, frequency_threshold: int = 5, return_parameters: None = None, return_all: Literal[True], **kwargs: dict[str, object]) → Generator[ParameterResult, None, None]

Generates word embeddings using Skip-Gram with Negative Sampling (SGNS), as described in Lenci & Sahlgren (162-163).

The document is optionally normalized (lowercasing, lemmatization/stemming, POS filtering, stop-word removal) before training a Word2Vec model. The resulting embeddings can be explored through cosine similarity and nearest-neighbor methods.
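The skip-gram objective trains on (target, context) pairs drawn from a window around each word. The sketch below shows how `window_size` determines those pairs, using a hypothetical helper `skipgram_pairs` and a fixed symmetric window. Word2Vec implementations typically sample a smaller effective window for each target, so this is a simplification rather than the exact training procedure.

```python
def skipgram_pairs(tokens, window_size=5):
    """Return (target, context) pairs within a fixed symmetric window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


tokens = ["the", "quick", "brown", "fox"]
for pair in skipgram_pairs(tokens, window_size=1):
    print(pair)
# ('the', 'quick')
# ('quick', 'the')
# ('quick', 'brown')
# ('brown', 'quick')
# ('brown', 'fox')
# ('fox', 'brown')
```

A larger window yields more pairs per target and tends to capture broader topical associations, while a smaller one emphasizes close syntactic neighbors.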

Parameters:
  • document (Document) – Input document.

  • lowercase (bool) – Convert words to lowercase before training.

  • base_form (Literal["lemma", "stem"] | None) – Use lemmas or stems instead of surface forms.

  • allowed_pos (set[str] | None) – Keep only words with these POS tags.

  • remove_stop_words (bool) – Remove stop words before training.

  • window_size (int) – Context window size.

  • n_dimensions (int) – Embedding dimensionality.

  • frequency_threshold (int) – Minimum word frequency.

  • return_parameters (list[str] | None) –

    Return only the specified workflow results.

    Available values:

    • context.selection.unit

    • context.selection.lexical

    • sentences_to_string

    • trained_model

    • embeddings_dataframe

    • vector_similarity.geometric_measures.similarity

  • return_all (bool) – Return all workflow results.

  • kwargs – Additional arguments passed to gensim.models.Word2Vec.

For more information, see https://radimrehurek.com/gensim/models/word2vec.html

Returns:

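Semantic distance measurements by default; a generator over the requested workflow results when return_parameters is given; a generator over all workflow results when return_all is True.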

Return type:

DistanceMeasurements | ReturnParameterGenerator | ReturnAllGenerator
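The SGNS signature above includes `word_probability_distribution = 0.75` and `frequency_threshold = 5`, of which only the latter is described in the parameter list. Assuming the former is the exponent that word2vec applies to unigram counts when drawing negative samples (an inference from its default value, not something this page confirms), the sketch below shows the effect of both settings on a toy corpus. The helper `noise_distribution` is hypothetical.

```python
from collections import Counter


def noise_distribution(tokens, exponent=0.75, frequency_threshold=1):
    """Build a smoothed unigram distribution for negative sampling.

    Words occurring fewer than `frequency_threshold` times are dropped;
    the remaining counts are raised to `exponent` and normalized.
    """
    counts = Counter(tokens)
    weights = {
        word: count ** exponent
        for word, count in counts.items()
        if count >= frequency_threshold
    }
    total = sum(weights.values())
    if total == 0:
        return {}  # no word survived the frequency threshold
    return {word: weight / total for word, weight in weights.items()}


# Toy corpus: "a" occurs 16 times, "b" once.
corpus = ["a"] * 16 + ["b"]

raw = noise_distribution(corpus, exponent=1.0)
smoothed = noise_distribution(corpus, exponent=0.75)
print(f"raw P(b) = {raw['b']:.3f}, smoothed P(b) = {smoothed['b']:.3f}")
```

With an exponent below 1, rare words are drawn as negatives somewhat more often than their raw frequency would suggest, and raising `frequency_threshold` removes them from the vocabulary altogether.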

Module contents