dstk.models package

Subpackages

Submodules

dstk.models.tools module

Utilities for building, executing, and extending modular linguistic workflows.

This module provides the infrastructure used to define and run workflow-based processing pipelines. A workflow consists of a sequence of parameters, where each parameter is associated with one or more processing methods or hooks. The module dynamically loads and executes these methods, passing the output of one step as the input to the next.
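As a rough illustration of the chaining described above, the sketch below runs a small workflow in plain Python. It mirrors the shape of the `workflow` argument in the `ModelBuilder` signature (a mapping from a name to a list of `{method_name: kwargs}` steps), but it is not dstk's implementation: the `REGISTRY` lookup, the `run_workflow` helper, its `keep` argument, and the three step functions are all hypothetical stand-ins for the dynamic loading the module performs.

```python
from typing import Any, Callable

# Hypothetical step functions; a real workflow would load its methods dynamically.
def lowercase(text: str) -> str:
    return text.lower()

def tokenize(text: str, sep: str = " ") -> list[str]:
    return text.split(sep)

def drop_short(tokens: list[str], min_len: int = 3) -> list[str]:
    return [t for t in tokens if len(t) >= min_len]

# Assumed registry standing in for dynamic method lookup.
REGISTRY: dict[str, Callable[..., Any]] = {
    "lowercase": lowercase,
    "tokenize": tokenize,
    "drop_short": drop_short,
}

Workflow = dict[str, list[dict[str, dict[str, Any]]]]

def run_workflow(
    workflow: Workflow, data: Any, keep: tuple[str, ...] = ()
) -> tuple[Any, dict[str, Any]]:
    """Run every step in order, feeding each output into the next step."""
    intermediate: dict[str, Any] = {}
    for stage, steps in workflow.items():
        for step in steps:
            for method_name, kwargs in step.items():
                data = REGISTRY[method_name](data, **kwargs)
        if stage in keep:
            # Record the output of selected stages as intermediate results.
            intermediate[stage] = data
    return data, intermediate

workflow: Workflow = {
    "normalize": [{"lowercase": {}}],
    "segment": [{"tokenize": {"sep": " "}}, {"drop_short": {"min_len": 3}}],
}

result, kept = run_workflow(workflow, "The Cat sat on a Mat", keep=("normalize",))
print(result)  # ['the', 'cat', 'sat', 'mat']
print(kept)    # {'normalize': 'the cat sat on a mat'}
```

Keeping each step as a name plus keyword arguments means the whole pipeline stays plain data, so it can be stored, compared, or reconfigured without touching the step functions.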

Core functionalities include:

  • Building reusable workflow execution models with ModelBuilder

  • Running parameterized linguistic processing pipelines step by step

  • Returning intermediate results from selected workflow stages

  • Executing custom hooks alongside standard workflow parameters

  • Dynamically exposing workflow methods through wrapper objects

  • Adapting sentence and token sequence collections for uniform processing

  • Defining a protocol for semantic similarity and nearest-neighbor operations on word embeddings

The module is intended for constructing flexible and reusable text-processing pipelines, allowing researchers and digital humanities practitioners to combine linguistic operations into configurable workflows.

class dstk.models.tools.DistanceMeasurements(*args, **kwargs)

Bases: Protocol

Interface for semantic similarity methods based on word embeddings.

This protocol represents any object that implements methods for computing cosine similarity and retrieving nearest neighbors.

Methods:

  • cos_similarity(first_word, second_word): Computes the cosine similarity between two words. Equivalent to dstk.modules.geometric_distance.cos_similarity.

  • nearest_neighbors(word, metric, n_words, **kwargs): Returns the nearest neighbors to a word using a specified metric. Equivalent to dstk.modules.geometric_distance.nearest_neighbors.
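Because the interface is a `Protocol`, any object with matching methods satisfies it without inheriting from it. The sketch below shows that idea with a hypothetical, trimmed-down protocol named `SimilarityLike` and a toy class backed by a precomputed similarity table. Neither name exists in dstk, and the `(word, score)` tuples are an assumed stand-in for the `Neighbor` type, whose definition is not shown here.

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class SimilarityLike(Protocol):
    """Hypothetical, trimmed-down stand-in for the protocol described above."""

    def cos_similarity(self, first_word: str, second_word: str) -> float: ...

    def nearest_neighbors(
        self, word: str, metric: str = "cosine", n_words: int = 5
    ) -> list[tuple[str, float]]: ...

class TableSimilarity:
    """Toy implementation: looks scores up in a table instead of using embeddings."""

    def __init__(self, table: dict[frozenset[str], float]) -> None:
        self._table = table

    def cos_similarity(self, first_word: str, second_word: str) -> float:
        if first_word == second_word:
            return 1.0
        return self._table[frozenset((first_word, second_word))]

    def nearest_neighbors(
        self, word: str, metric: str = "cosine", n_words: int = 5
    ) -> list[tuple[str, float]]:
        # Collect every pair that contains the word, keeping the other member.
        scored = [
            (next(iter(pair - {word})), score)
            for pair, score in self._table.items()
            if word in pair
        ]
        return sorted(scored, key=lambda item: item[1], reverse=True)[:n_words]

def most_similar(model: SimilarityLike, word: str) -> str:
    """Works with any object that structurally matches SimilarityLike."""
    return model.nearest_neighbors(word, n_words=1)[0][0]

table = {
    frozenset(("king", "queen")): 0.8,
    frozenset(("king", "apple")): 0.1,
    frozenset(("queen", "apple")): 0.15,
}
model = TableSimilarity(table)
print(isinstance(model, SimilarityLike))  # True
print(most_similar(model, "king"))        # queen
```

Note that `TableSimilarity` never subclasses `SimilarityLike`; the `isinstance` check passes only because the required method names are present.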

approximate_nearest_neighbors(word: str, metric: str = 'ivf', n_words: int = 5, n_centroids: int = 100, clusters_to_search: int = 10, n_connections: int = 16, search_depth: int = 8, construction_depth: int = 64) -> list[Neighbor]

Find words with similar embeddings using a fast, memory-efficient approximate search.

This method returns the words closest to a target word without comparing the target against every word in the vocabulary. Instead, it uses index structures that return very close results much faster than an exact search, especially on large embedding sets.
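The parameter names `n_centroids` and `clusters_to_search` suggest an inverted-file (IVF) style index, though that is an inference from the signature rather than something stated here. The sketch below shows the general idea under simplifying assumptions: vectors are grouped into cells by their nearest centroid, and a query scans only the closest cells. The centroids are simply the first few vectors rather than the result of k-means training, and `build_index` and `approx_neighbors` are hypothetical helpers, not dstk functions.

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def build_index(vectors: dict[str, list[float]], n_centroids: int):
    # Assumption: take the first n_centroids vectors as centroids (no k-means).
    centroids = list(vectors.values())[:n_centroids]
    cells: list[list[str]] = [[] for _ in centroids]
    for word, vec in vectors.items():
        # Assign each word to the cell of its most similar centroid.
        best = max(range(len(centroids)), key=lambda i: cosine(vec, centroids[i]))
        cells[best].append(word)
    return centroids, cells

def approx_neighbors(word, vectors, index, n_words=5, clusters_to_search=1):
    centroids, cells = index
    query = vectors[word]
    # Rank cells by centroid similarity and scan only the top few.
    ranked = sorted(
        range(len(centroids)),
        key=lambda i: cosine(query, centroids[i]),
        reverse=True,
    )
    candidates = [
        w for i in ranked[:clusters_to_search] for w in cells[i] if w != word
    ]
    scored = [(w, cosine(query, vectors[w])) for w in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:n_words]

vectors = {
    "cat": [1.0, 0.1],
    "car": [0.1, 1.0],
    "dog": [0.9, 0.2],
    "kitten": [0.95, 0.05],
    "truck": [0.2, 0.9],
    "bus": [0.15, 0.95],
}
index = build_index(vectors, n_centroids=2)
neighbors = approx_neighbors("cat", vectors, index, n_words=2, clusters_to_search=1)
print([w for w, _ in neighbors])  # ['kitten', 'dog']
```

With `clusters_to_search=1` the query never looks at the cell holding "car", "truck", and "bus". Raising that value trades speed for recall, since more cells are scanned.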

cos_similarity(first_word: str, second_word: str) -> float

Return the cosine similarity between two words.

nearest_neighbors(word: str, metric: str = 'cosine', n_words: int = 5, **kwargs) -> list[Neighbor]

Return the top-N nearest neighbors to a word using a given metric.
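For reference, the two exact operations above can be sketched in a few lines of standard-library Python. This is only a minimal illustration of cosine similarity and an exhaustive top-N search over a toy embedding table. The `embeddings` dictionary, the extra `embeddings` argument, and the `(word, score)` return shape are assumptions, not the library's own data model.

```python
import heapq
import math

def cos_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest_neighbors(
    word: str, embeddings: dict[str, list[float]], n_words: int = 5
) -> list[tuple[str, float]]:
    """Exhaustive search: score every other word, then keep the n_words best."""
    query = embeddings[word]
    scored = (
        (cos_similarity(query, vec), other)
        for other, vec in embeddings.items()
        if other != word
    )
    return [(other, score) for score, other in heapq.nlargest(n_words, scored)]

embeddings = {
    "north": [0.0, 1.0],
    "south": [0.0, -1.0],
    "east": [1.0, 0.0],
    "northeast": [1.0, 1.0],
}
top = nearest_neighbors("north", embeddings, n_words=2)
print([w for w, _ in top])  # ['northeast', 'east']
```

Every word is compared against the query, so the cost grows linearly with vocabulary size. That cost is what the approximate search is meant to avoid.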

class dstk.models.tools.ModelBuilder(workflow: dict[str, list[dict[str, dict[str, Any]]] | Hook], wrapper: bool = False, name: str | None = None)

Bases: Generic[W]

class dstk.models.tools.Wrapper(input_data: Any)

Bases: object

add_method(method: Callable[[Concatenate[Any, P]], R])
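The `Concatenate[Any, P]` annotation suggests that a registered method takes the wrapped data as its first argument, followed by its own parameters. The sketch below shows one way such a pattern could work. `MiniWrapper` and `count_tokens` are hypothetical, and whether the real `Wrapper` binds, stores, or returns results this way is an assumption.

```python
from functools import wraps
from typing import Any, Callable

class MiniWrapper:
    """Hypothetical wrapper that exposes functions as methods on stored data."""

    def __init__(self, input_data: Any) -> None:
        self.input_data = input_data

    def add_method(self, method: Callable[..., Any]) -> None:
        @wraps(method)
        def bound(*args: Any, **kwargs: Any) -> Any:
            # Supply the stored data as the first argument automatically.
            return method(self.input_data, *args, **kwargs)

        # Expose the function under its own name on this instance.
        setattr(self, method.__name__, bound)

def count_tokens(tokens: list[str], min_len: int = 1) -> int:
    return sum(1 for t in tokens if len(t) >= min_len)

wrapped = MiniWrapper(["a", "quick", "brown", "fox"])
wrapped.add_method(count_tokens)
print(wrapped.count_tokens(min_len=3))  # 3
```

The caller never passes the data explicitly, so the same functions can be reused across wrappers holding different inputs.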

Module contents