dstk.models.count.matrix package

Submodules

dstk.models.count.matrix.classical module

This module provides classical matrix-based distributional semantic models for generating word embeddings. As part of the ‘count’ model category, these methods represent word meaning by counting co-occurrences in a corpus and collecting the counts in large matrices. Documents can optionally be normalized before training through lowercasing, lemmatization or stemming, part-of-speech filtering, stop-word removal, and punctuation removal. The trained models can then be used to compute semantic similarity, identify nearest neighbors, and export the learned embeddings for downstream analyses.

Core functionalities include:

- Implementing the Standard Model (Lenci & Sahlgren) using co-occurrence matrices and PPMI weighting.
- Implementing Latent Semantic Analysis (LSA) using word-document matrices and TF-IDF weighting.
- Integrated preprocessing pipelines, including case normalization, lemmatization or stemming, part-of-speech filtering, and stop-word removal.
- Dimensionality reduction using Singular Value Decomposition (SVD) to map words into a lower-dimensional semantic space.
- Computing geometric similarity measures such as cosine similarity, and identifying nearest neighbors.

The module is designed to provide foundational methods for extracting semantic relationships from text based on traditional distributional semantics.
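To make the similarity step concrete, here is a minimal NumPy sketch of cosine similarity and nearest-neighbor lookup over an embedding matrix. The function names, the toy vocabulary, and the vectors are illustrative assumptions and are not part of the dstk API; the library's own implementation may differ.

```python
import numpy as np


def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0


def nearest_neighbors(word, vocab, embeddings, k=3):
    """Return the k words most similar to `word`, excluding the word itself."""
    index = vocab.index(word)
    # Normalize rows so that dot products equal cosine similarities.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.where(norms == 0, 1.0, norms)
    scores = unit @ unit[index]
    order = np.argsort(-scores)  # indices sorted by descending similarity
    return [(vocab[i], float(scores[i])) for i in order if i != index][:k]


# Hypothetical toy embeddings, one row per vocabulary word.
vocab = ["cat", "dog", "car"]
embeddings = np.array([
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.1],
    [0.0, 0.1, 1.0],
])

neighbors = nearest_neighbors("cat", vocab, embeddings, k=2)
cat_dog = cosine_similarity(embeddings[0], embeddings[1])
```

In this toy space "dog" ranks above "car" as a neighbor of "cat", because its vector points in nearly the same direction.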

dstk.models.count.matrix.classical.LatentSemanticAnalysis(document_index: dict[str, Document], *, lowercase: bool = True, base_form: Literal['lemma', 'stem'] | None = 'lemma', allowed_pos: set[str] | None = None, remove_stop_words: bool = True, language: str | None = None, custom_stop_words: list[str] | None = None, frequency_threshold: int = 50, n_dimensions: int = 300, return_parameters: None, return_all: Literal[False] = False) → DistanceMeasurements

dstk.models.count.matrix.classical.LatentSemanticAnalysis(document_index: dict[str, Document], *, lowercase: bool = True, base_form: Literal['lemma', 'stem'] | None = 'lemma', allowed_pos: set[str] | None = None, remove_stop_words: bool = True, language: str | None = None, custom_stop_words: list[str] | None = None, frequency_threshold: int = 50, n_dimensions: int = 300, return_parameters: list[str], return_all: Literal[False] = False) → Generator[Any, None, None]

dstk.models.count.matrix.classical.LatentSemanticAnalysis(document_index: dict[str, Document], *, lowercase: bool = True, base_form: Literal['lemma', 'stem'] | None = 'lemma', allowed_pos: set[str] | None = None, remove_stop_words: bool = True, language: str | None = None, custom_stop_words: list[str] | None = None, frequency_threshold: int = 50, n_dimensions: int = 300, return_parameters: None = None, return_all: Literal[True]) → Generator[ParameterResult, None, None]

Generate word embeddings using Latent Semantic Analysis (LSA) as described by Lenci & Sahlgren (pp. 100-103).

The model builds a word-document matrix, applies TF-IDF weighting, reduces dimensionality with SVD, and provides cosine-based similarity measures.
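The pipeline described above can be sketched with plain NumPy. This is an illustration under stated assumptions, not the library's implementation: the toy corpus is hypothetical and already tokenized, the TF-IDF formula is one common variant, vocabulary filtering by frequency is omitted, and scaling the left singular vectors by the singular values is one of several conventions for deriving embeddings.

```python
import numpy as np

# Hypothetical toy corpus: document name -> already-preprocessed tokens.
docs = {
    "d1": ["cat", "dog", "pet"],
    "d2": ["dog", "pet", "leash"],
    "d3": ["car", "road", "engine"],
    "d4": ["car", "engine", "fuel", "road"],
}

vocab = sorted({token for tokens in docs.values() for token in tokens})
row = {word: i for i, word in enumerate(vocab)}

# Word-document matrix: rows are words, columns are documents.
counts = np.zeros((len(vocab), len(docs)))
for j, tokens in enumerate(docs.values()):
    for token in tokens:
        counts[row[token], j] += 1

# TF-IDF weighting: raw term frequency times log inverse document frequency.
# (Assumed variant; the exact formula used by the library may differ.)
doc_freq = np.count_nonzero(counts, axis=1)
idf = np.log(len(docs) / doc_freq)
tfidf = counts * idf[:, None]

# Truncated SVD: keep the top-k singular directions as word embeddings.
k = 2
U, S, _ = np.linalg.svd(tfidf, full_matrices=False)
embeddings = U[:, :k] * S[:k]  # one k-dimensional vector per vocabulary word
```

Words that occur in the same documents, such as "dog" and "pet", end up with near-identical vectors, while words from unrelated documents land on different axes of the reduced space.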

Parameters:

document_index (DocumentIndex) – Mapping of document names to documents.
frequency_threshold (int) – Minimum frequency required for a word to be included in the vocabulary.
n_dimensions (int) – Number of dimensions in the reduced semantic space.
return_parameters (list[str] | None) – Names of workflow steps to return instead of the final model output.

Available values:
- context.selection.unit
- context.selection.lexical
- co_matrix.creation.document
- co_matrix.weighting.relevance_measures
- dimensionality_reduction
- vector_similarity.geometric_measures.similarity
return_all (bool) – Return all workflow steps and their outputs.

Returns:

A semantic similarity model, a generator of selected workflow results, or a generator of all workflow results.

Return type:

DistanceMeasurements | ReturnParameterGenerator | ReturnAllGenerator

dstk.models.count.matrix.classical.StandardModel(document: Document, *, lowercase: bool = True, base_form: Literal['lemma', 'stem'] | None = 'lemma', allowed_pos: set[str] | None = None, remove_stop_words: bool = True, language: str | None = None, custom_stop_words: list[str] | None = None, frequency_threshold: int = 50, window_size: int = 3, n_dimensions: int = 300, return_parameters: None = None, return_all: Literal[False] = False) → DistanceMeasurements

dstk.models.count.matrix.classical.StandardModel(document: Document, *, lowercase: bool = True, base_form: Literal['lemma', 'stem'] | None = 'lemma', allowed_pos: set[str] | None = None, remove_stop_words: bool = True, language: str | None = None, custom_stop_words: list[str] | None = None, frequency_threshold: int = 50, window_size: int = 3, n_dimensions: int = 300, return_parameters: list[str], return_all: Literal[False] = False) → Generator[Any, None, None]

dstk.models.count.matrix.classical.StandardModel(document: Document, *, lowercase: bool = True, base_form: Literal['lemma', 'stem'] | None = 'lemma', allowed_pos: set[str] | None = None, remove_stop_words: bool = True, language: str | None = None, custom_stop_words: list[str] | None = None, frequency_threshold: int = 50, window_size: int = 3, n_dimensions: int = 300, return_parameters: None = None, return_all: Literal[True]) → Generator[ParameterResult, None, None]

Generate distributional word embeddings from a single document using the Standard Model as described by Lenci & Sahlgren (pp. 97-99).

The model extracts word co-occurrences within a context window, weights the matrix using positive PMI (PPMI), reduces its dimensionality with SVD, and provides cosine-based similarity measures.
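As a rough illustration of these steps, the following NumPy sketch counts co-occurrences in a symmetric window, applies PPMI weighting, and reduces the result with SVD. The token list is a hypothetical toy input, and several details are assumptions rather than documented library behavior: the window is taken to be symmetric, PMI uses a base-2 logarithm, and embeddings are the left singular vectors scaled by their singular values.

```python
from collections import Counter

import numpy as np

# Hypothetical, already-preprocessed token sequence.
tokens = "the cat chased the mouse and the dog chased the cat".split()
window_size = 2  # words considered on each side of the target (assumed symmetric)

vocab = sorted(set(tokens))
idx = {word: i for i, word in enumerate(vocab)}

# Count co-occurrences of each target word with its window neighbors.
pair_counts = Counter()
for i, target in enumerate(tokens):
    lo = max(0, i - window_size)
    hi = min(len(tokens), i + window_size + 1)
    for j in range(lo, hi):
        if j != i:
            pair_counts[(target, tokens[j])] += 1

co = np.zeros((len(vocab), len(vocab)))
for (word, context), n in pair_counts.items():
    co[idx[word], idx[context]] = n

# PPMI: max(0, log2(P(w, c) / (P(w) * P(c)))).
total = co.sum()
p_wc = co / total
p_w = co.sum(axis=1, keepdims=True) / total
p_c = co.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore"):  # log2(0) = -inf for unseen pairs
    pmi = np.log2(p_wc / (p_w * p_c))
ppmi = np.maximum(pmi, 0.0)  # clips negative and -inf values to 0

# Truncated SVD: keep the top-k singular directions as word embeddings.
k = 2
U, S, _ = np.linalg.svd(ppmi, full_matrices=False)
embeddings = U[:, :k] * S[:k]
```

Clipping at zero discards negative associations, which are unreliable for sparse counts; this is the usual motivation for using PPMI rather than raw PMI.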