dstk.models package
Subpackages
Submodules
dstk.models.tools module
Utilities for building, executing, and extending modular linguistic workflows.
This module provides the infrastructure used to define and run workflow-based processing pipelines. A workflow consists of a sequence of parameters, where each parameter is associated with one or more processing methods or hooks. The module dynamically loads and executes these methods, passing the output of one step as the input to the next.
Core functionalities include:
Building reusable workflow execution models with ModelBuilder
Running parameterized linguistic processing pipelines step by step
Returning intermediate results from selected workflow stages
Executing custom hooks alongside standard workflow parameters
Dynamically exposing workflow methods through wrapper objects
Adapting sentence and token sequence collections for uniform processing
Defining a protocol for semantic similarity and nearest-neighbor operations on word embeddings
The module is intended for constructing flexible and reusable text-processing pipelines, allowing researchers and digital humanities practitioners to combine linguistic operations into configurable workflows.
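The step-by-step execution described for this module, where each step's output becomes the next step's input and selected stages can be returned, can be sketched in a few lines. This is a minimal illustration, not the module's implementation: `run_workflow`, the `Step` alias, and the `return_stages` argument are hypothetical names, and the steps are plain string functions rather than DSTK methods.

```python
from typing import Any, Callable, Optional

# Hypothetical step type: a stage name plus a callable applied to the running value.
Step = tuple[str, Callable[[Any], Any]]


def run_workflow(
    steps: list[Step],
    data: Any,
    return_stages: Optional[set[str]] = None,
) -> Any:
    """Run the steps in order, feeding each output into the next step.

    If return_stages is given, return a dict with the outputs of those
    named stages instead of only the final result.
    """
    kept: dict[str, Any] = {}
    for name, func in steps:
        data = func(data)  # output of this step is the input of the next
        if return_stages and name in return_stages:
            kept[name] = data
    return kept if return_stages else data


# Illustrative steps only; a real workflow would call linguistic methods here.
steps: list[Step] = [
    ("lowercase", str.lower),
    ("tokenize", str.split),
    ("count", len),
]

final = run_workflow(steps, "The quick brown Fox")
partial = run_workflow(steps, "The quick brown Fox", return_stages={"tokenize"})
print(final)    # 4
print(partial)  # {'tokenize': ['the', 'quick', 'brown', 'fox']}
```

Keeping each stage as a named entry is what makes it cheap to return intermediate results: the runner only needs to remember the outputs whose names were requested.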
- class dstk.models.tools.DistanceMeasurements(*args, **kwargs)
Bases: Protocol
Interface for semantic similarity methods based on word embeddings.
This protocol represents any object that implements methods for computing cosine similarity and retrieving nearest neighbors.
- Methods:
- cos_similarity(first_word, second_word):
Computes the cosine similarity between two words. Equivalent to
dstk.modules.geometric_distance.cos_similarity.
- nearest_neighbors(word, metric, n_words, **kwargs):
Returns the nearest neighbors to a word using a specified metric. Equivalent to
dstk.modules.geometric_distance.nearest_neighbors.
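Because the class is a Protocol, any object that provides these methods satisfies it without inheriting from it. The sketch below illustrates that idea with a stand-in protocol; `SimilarityLike` and `ToyEmbeddings` are hypothetical names, and the parameter and return types shown are assumptions rather than the signatures DSTK uses.

```python
from math import sqrt
from typing import Protocol, runtime_checkable


@runtime_checkable
class SimilarityLike(Protocol):
    """Hypothetical stand-in for a protocol such as DistanceMeasurements."""

    def cos_similarity(self, first_word: str, second_word: str) -> float: ...

    def nearest_neighbors(
        self, word: str, metric: str, n_words: int
    ) -> list[tuple[str, float]]: ...


class ToyEmbeddings:
    """Toy implementation; note that it does not inherit from the protocol."""

    def __init__(self, vectors: dict[str, list[float]]) -> None:
        self.vectors = vectors

    def cos_similarity(self, first_word: str, second_word: str) -> float:
        a, b = self.vectors[first_word], self.vectors[second_word]
        dot = sum(x * y for x, y in zip(a, b))
        norms = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
        return dot / norms

    def nearest_neighbors(
        self, word: str, metric: str = "cosine", n_words: int = 5
    ) -> list[tuple[str, float]]:
        # This toy ignores `metric` and always ranks by cosine similarity.
        scores = [
            (other, self.cos_similarity(word, other))
            for other in self.vectors
            if other != word
        ]
        return sorted(scores, key=lambda pair: pair[1], reverse=True)[:n_words]


toy = ToyEmbeddings({"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0]})
print(isinstance(toy, SimilarityLike))  # True: the required methods are present
print(toy.nearest_neighbors("cat", n_words=1)[0][0])  # dog
```

The `isinstance` check works only because of `@runtime_checkable`, and it verifies that the methods exist, not that their signatures match; a static type checker is what enforces the signatures.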
- approximate_nearest_neighbors(word: str, metric: str = 'ivf', n_words: int = 5, n_centroids: int = 100, clusters_to_search: int = 10, n_connections: int = 16, search_depth: int = 8, construction_depth: int = 64) -> list[Neighbor]
Find words with similar embeddings using a fast, memory-efficient approximate search.
This method returns the closest words to a target word without comparing the target against every word in the vocabulary. Instead, it uses index structures that return nearly the same neighbors much faster than an exact search, especially on large embedding sets.
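The parameter names `n_centroids` and `clusters_to_search` suggest an inverted-file (IVF) style of search, in which vectors are grouped around centroids and only the clusters nearest the query are scanned. The following is a minimal pure-Python sketch under that assumption; it is not DSTK's implementation, and `build_index`, `approximate_neighbors`, the Euclidean distance, and the toy vectors are all illustrative choices.

```python
from math import dist


def build_index(vectors: dict[str, list[float]], n_centroids: int, n_iter: int = 5):
    """Partition words into clusters with a few k-means (Lloyd) iterations."""
    words = list(vectors)
    # Deterministic initialisation: the first n_centroids vectors.
    centroids = [list(vectors[w]) for w in words[:n_centroids]]
    clusters: list[list[str]] = []
    for _ in range(n_iter):
        clusters = [[] for _ in centroids]
        for w in words:
            nearest = min(
                range(len(centroids)), key=lambda i: dist(vectors[w], centroids[i])
            )
            clusters[nearest].append(w)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                dim = len(centroids[i])
                centroids[i] = [
                    sum(vectors[m][d] for m in members) / len(members)
                    for d in range(dim)
                ]
    return centroids, clusters


def approximate_neighbors(
    word, vectors, centroids, clusters, n_words=5, clusters_to_search=1
):
    """Rank only the words stored in the clusters closest to the query."""
    query = vectors[word]
    order = sorted(range(len(centroids)), key=lambda i: dist(query, centroids[i]))
    candidates = [
        w for i in order[:clusters_to_search] for w in clusters[i] if w != word
    ]
    ranked = sorted(candidates, key=lambda w: dist(query, vectors[w]))
    return [(w, dist(query, vectors[w])) for w in ranked[:n_words]]


vectors = {
    "cat": [1.0, 0.0], "car": [0.0, 1.0], "dog": [0.9, 0.1],
    "kitten": [0.95, 0.05], "truck": [0.1, 0.9], "bus": [0.05, 0.95],
}
centroids, clusters = build_index(vectors, n_centroids=2)
result = approximate_neighbors("cat", vectors, centroids, clusters, n_words=5)
print([w for w, _ in result])  # ['kitten', 'dog']
```

With `clusters_to_search=1` the query for "cat" never looks at the vehicle cluster, so fewer than `n_words` results come back. That is the trade-off of an approximate search: raising `clusters_to_search` scans more candidates and approaches the exact answer at a higher cost.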