Distributional Semantics Toolkit documentation#
This library is based on the book Distributional Semantics by Alessandro Lenci and Magnus Sahlgren. It implements some of the algorithms described in the book that are commonly used in distributional semantics.
Installation#
To install it just run the command:
pip install dstklib
DSTK requires a Python version below 3.14.
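If you want to check the version constraint before installing, a minimal sketch could look like the following. The helper name `is_supported` is hypothetical and not part of DSTK; it only compares the major and minor version numbers against 3.14.

```python
import sys


def is_supported(version_info=sys.version_info):
    """Return True when the given interpreter version is below 3.14."""
    # Only the (major, minor) pair matters for this constraint.
    return tuple(version_info[:2]) < (3, 14)


print(is_supported())
```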
Usage#
You can find the (temporary) basic documentation here. A demonstration of how to use the library is also available on Google Colab (in Spanish) here.
The library can be used in two primary modes: individual methods or automated pipelines constructed via ModelBuilder.
Individual methods#
You can use the included methods individually to do linguistic analysis. Just import the method you want to use from its respective module:
from dstk.corpus.annotation import annotate_corpus
from dstk.parameters.context.selection.unit import get_words
from dstk.parameters.context.selection.lexical import remove_stop_words, to_base_form, to_lower
from dstk.parameters.context.extraction.linguistic.word.window import extract_ngrams
from dstk.parameters.co_matrix.creation.linguistic.word.window import create_word_by_word_matrix
from dstk.parameters.co_matrix.weighting.associative_measures import pmi
from dstk.parameters.dimensionality_reduction import svd
from dstk.parameters.vector_similarity.geometric_measures.similarity import cos_similarity
corpus = ["El rápido zorro marrón salta sobre el perro perezoso. ¿Por qué el zorro no corre más rápido?"]
documents = annotate_corpus(corpus=corpus, language_model="es")
words = get_words(document=documents["document_0"])
filtered_words = remove_stop_words(words=words, language="es")
lemmas = to_base_form(words=filtered_words)
lowered = to_lower(words=lemmas)
contexts = extract_ngrams(words=lowered, window_size=3)
co_matrix = create_word_by_word_matrix(contexts=contexts)
weighted_matrix = pmi(word_by_word_matrix=co_matrix, positive=True)
embeddings = svd(matrix=weighted_matrix, n_dimensions=2)
similarity = cos_similarity(embeddings=embeddings, first_word="zorro", second_word="marrón")
print(similarity)
# Output: 0.22279958362756935
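For readers unfamiliar with window-based context extraction, the sketch below shows one common reading of a window size of 3: every run of three consecutive tokens becomes one context. It is a plain-Python illustration, not DSTK's `extract_ngrams`; the helper name `sliding_ngrams` is hypothetical, and the library may treat window boundaries differently.

```python
def sliding_ngrams(tokens, window_size):
    """Return every run of `window_size` consecutive tokens as a tuple."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    # A sequence shorter than the window yields no n-grams at all.
    return [
        tuple(tokens[i:i + window_size])
        for i in range(len(tokens) - window_size + 1)
    ]


tokens = ["rápido", "zorro", "marrón", "saltar", "perro"]
ngrams = sliding_ngrams(tokens, window_size=3)
print(ngrams)
```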
Models#
DSTK includes predefined models that cover the most frequent tasks in distributional semantics:
StandardModel: This model generates word embeddings using the Standard Model as defined by (Lenci & Sahlgren 97-99). It extracts word co-occurrences within a context window, weights the matrix using positive PMI (PPMI), reduces its dimensionality with SVD, and provides cosine-based similarity measures.
LatentSemanticAnalysis: This model generates word embeddings using Latent Semantic Analysis (LSA) as defined by (Lenci & Sahlgren 100-103). The model builds a word-document matrix, applies TF-IDF weighting, reduces dimensionality with SVD, and provides cosine-based similarity measures.
SGNS: This model generates word embeddings using Skip-Gram with Negative Sampling (SGNS) as defined by (Lenci & Sahlgren 162-163). The document is optionally normalized (lowercasing, lemmatization/stemming, POS filtering, stop-word removal) before training a Word2Vec model. The resulting embeddings can be explored through cosine similarity and nearest-neighbor methods.
Fasttext: This model generates word embeddings using FastText as defined by (Lenci & Sahlgren 164-165). The document is optionally normalized (lowercasing, lemmatization/stemming, POS filtering, stop-word removal) before training a FastText model. Subword information is used to improve representations of rare and out-of-vocabulary words.
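To make the count-based steps behind StandardModel more concrete, here is a minimal pure-Python sketch of window co-occurrence counting, positive PMI weighting, and cosine similarity. It is an illustration under simplifying assumptions, not DSTK's implementation: the function names are hypothetical, the window is symmetric, and the SVD step is left out to keep the example dependency-free.

```python
import math
from collections import Counter


def cooccurrence_counts(tokens, window=2):
    """Count (word, context) pairs within `window` tokens on either side."""
    counts = Counter()
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts


def ppmi(counts):
    """Weight each pair with max(0, log2(p(w, c) / (p(w) * p(c))))."""
    total = sum(counts.values())
    word_totals = Counter()
    context_totals = Counter()
    for (word, context), n in counts.items():
        word_totals[word] += n
        context_totals[context] += n
    weights = {}
    for (word, context), n in counts.items():
        pmi = math.log2(n * total / (word_totals[word] * context_totals[context]))
        weights[(word, context)] = max(0.0, pmi)
    return weights


def cosine(weights, first, second):
    """Cosine similarity between the sparse context vectors of two words."""
    first_vec = {c: x for (w, c), x in weights.items() if w == first}
    second_vec = {c: x for (w, c), x in weights.items() if w == second}
    shared = first_vec.keys() & second_vec.keys()
    dot = sum(first_vec[c] * second_vec[c] for c in shared)
    norm = math.sqrt(sum(x * x for x in first_vec.values())) * math.sqrt(
        sum(x * x for x in second_vec.values())
    )
    return dot / norm if norm else 0.0


tokens = "the cat sat on the mat the dog sat on the rug".split()
counts = cooccurrence_counts(tokens, window=2)
weights = ppmi(counts)
print(round(cosine(weights, "cat", "dog"), 3))
```

Clipping negative PMI values to zero is what distinguishes PPMI from plain PMI; it keeps the weighted matrix sparse and non-negative.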
To use a predefined model, import it and pass an annotated document as input:
from dstk.corpus.annotation import annotate_corpus
from dstk.models.count.matrix.classical import StandardModel
from dstk.parameters.vector_similarity.geometric_measures import similarity
corpus = ["El rápido zorro marrón salta sobre el perro perezoso. ¿Por qué el zorro no corre más rápido?"]
documents = annotate_corpus(corpus=corpus, language_model="es")
distance = StandardModel(
    document=documents["document_0"],
    language="es",
    frequency_threshold=1,
    n_dimensions=2
)
similarity = distance.cos_similarity(first_word="zorro", second_word="marrón")
print(similarity)
# Output: 0.22279958362756935
Building your own models#
You can build your own models with the ModelBuilder class by passing it a custom execution workflow. The workflow is a dictionary that maps the name of each dstk.parameters module you want to use to a list of dictionaries. Each of those dictionaries names one method and gives its arguments, and the methods run in the order in which they are listed:
from dstk.corpus.annotation import annotate_corpus
from dstk.models.tools import ModelBuilder
corpus = ["El rápido zorro marrón salta sobre el perro perezoso. ¿Por qué el zorro no corre más rápido?"]
documents = annotate_corpus(corpus=corpus, language_model="es")
MyModel = ModelBuilder(
    workflow={
        "context.selection.unit": [
            {"get_words": {}},  # IMPORTANT: The first input is passed automatically; only specify remaining args.
        ],
        "context.selection.lexical": [
            {"to_lower": {}},
            {"to_base_form": {"base_form": "lemma"}},
            {"remove_stop_words": {"language": "es"}},
        ],
        "context.extraction.linguistic.word.window": [
            {"extract_ngrams": {"window_size": 3}}
        ],
        "co_matrix.creation.linguistic.word.window": [
            {"create_word_by_word_matrix": {}}
        ],
        "co_matrix.weighting.associative_measures": [{"pmi": {"positive": True}}],
        "dimensionality_reduction": [{"svd": {"n_dimensions": 2}}],
        "vector_similarity.geometric_measures.similarity": [
            {"cos_similarity": {"first_word": "zorro", "second_word": "marrón"}},
        ],
    }
)
similarity = MyModel(input_data=documents["document_0"])
print(similarity)
# Output: 0.22279958362756935
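The idea of a workflow in which the first input is passed automatically can be sketched in a few lines of plain Python. This is a toy illustration only, not how ModelBuilder is implemented: `run_workflow`, the `registry` of stand-in functions, and the module names "selection" and "counting" are all hypothetical.

```python
def run_workflow(workflow, registry, input_data):
    """Apply each configured step in order, feeding every result to the next step."""
    result = input_data
    for module_name, steps in workflow.items():
        for step in steps:
            for method_name, kwargs in step.items():
                method = registry[module_name][method_name]
                # The running result is always passed as the first argument;
                # the workflow only supplies the remaining keyword arguments.
                result = method(result, **kwargs)
    return result


# Hypothetical stand-ins for library methods, grouped by module name.
registry = {
    "selection": {
        "split_words": lambda text: text.split(),
        "to_lower": lambda words: [w.lower() for w in words],
        "drop": lambda words, banned: [w for w in words if w not in banned],
    },
    "counting": {"count": lambda words, target: words.count(target)},
}

workflow = {
    "selection": [
        {"split_words": {}},
        {"to_lower": {}},
        {"drop": {"banned": {"el"}}},
    ],
    "counting": [{"count": {"target": "zorro"}}],
}

result = run_workflow(workflow, registry, "El zorro ve el Zorro")
print(result)
# Output: 2
```

Because Python dictionaries preserve insertion order, the order of the keys in the workflow is also the execution order in this sketch.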
You can also extract specific intermediate results from a model run, or all of them, by using return_parameters or return_all:
contexts, similarity = MyModel(
    input_data=documents["document_0"],
    return_parameters=[
        "context.selection.lexical",
        "vector_similarity.geometric_measures.similarity"
    ]
)
print([word.text for word in contexts])
print(similarity)
# Output:
# ['rápido', 'zorro', 'marrón', 'saltar', 'perro', 'perezoso', 'zorro', 'correr', 'rápido']
# 0.22279958362756935
If you choose to return all of the results, the model returns a generator that yields one item per step, each containing that step's details:
results = MyModel(input_data=documents["document_0"], return_all=True)
first_param = next(results)
print(first_param.name)
print([word.text for word in first_param.result])
# Output:
# context.selection.unit
# ['El', 'rápido', 'zorro', 'marrón', 'salta', 'sobre', 'el', 'perro', 'perezoso', 'Por', 'qué', 'el', 'zorro', 'no', 'corre', 'más', 'rápido']
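Returning a generator means that each step can be inspected as soon as it has run, without collecting every intermediate result first. The sketch below shows the general pattern in plain Python. `StepResult` and `iter_steps` are hypothetical names; only the `name` and `result` fields are taken from the example above, and nothing here reflects DSTK's internals.

```python
from collections import namedtuple

# Hypothetical record type with the two attributes used in the example above.
StepResult = namedtuple("StepResult", ["name", "result"])


def iter_steps(steps, input_data):
    """Lazily run (name, function) pairs, yielding each intermediate result."""
    result = input_data
    for name, function in steps:
        result = function(result)
        yield StepResult(name=name, result=result)


steps = [
    ("split", str.split),
    ("lower", lambda words: [w.lower() for w in words]),
    ("count", len),
]

results = iter_steps(steps, "El zorro Marrón")
first = next(results)  # Only the first step has run at this point.
print(first.name, first.result)
remaining = list(results)  # Consuming the generator runs the other steps.
print([step.name for step in remaining])
```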
Using the Wrapper Pattern#
You can configure a model to return a Wrapper class whose methods can be called repeatedly on the final computed state (such as the computed embeddings). To activate this, pass wrapper=True:
MyModel = ModelBuilder(
    workflow={
        "context.selection.unit": [
            {"get_words": {}},
        ],
        "context.selection.lexical": [
            {"to_lower": {}},
            {"to_base_form": {"base_form": "lemma"}},
            {"remove_stop_words": {"language": "es"}},
        ],
        "context.extraction.linguistic.word.window": [
            {"extract_ngrams": {"window_size": 3}}
        ],
        "co_matrix.creation.linguistic.word.window": [
            {"create_word_by_word_matrix": {}}
        ],
        "co_matrix.weighting.associative_measures": [{"pmi": {"positive": True}}],
        "dimensionality_reduction": [{"svd": {"n_dimensions": 2}}],
        "vector_similarity.geometric_measures.similarity": [
            {"cos_similarity": {}}  # The method should NOT have static args here
        ]
    },
    wrapper=True
)
distance = MyModel(input_data=documents["document_0"])
# Note: The original functions normally require the input data as their first argument.
# This wrapper class stores that input internally,
# so when calling methods on the wrapper instance, you only need to provide the additional parameters.
# This pattern works with any method from any module and any type of input data,
# allowing convenient repeated use without passing the main input every time.
print(distance.cos_similarity(first_word="zorro", second_word="marrón"))
print(distance.cos_similarity(first_word="perro", second_word="perezoso"))
# Output:
# 0.22279958362756935
# 0.9688325516828031
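The idea described in the comments above, storing the main input once and supplying only the remaining arguments on each call, can be sketched with `functools.partial`. This is an assumption about the general pattern, not a description of DSTK's Wrapper class; the `Wrapper` and `dot_product` definitions below are hypothetical stand-ins.

```python
from functools import partial


class Wrapper:
    """Store one input and expose functions with that input pre-bound."""

    def __init__(self, data, methods):
        self._data = data
        for name, function in methods.items():
            # partial() fixes `data` as the first positional argument.
            setattr(self, name, partial(function, data))


def dot_product(vectors, first_word, second_word):
    """Toy measure over a dict mapping words to equal-length vectors."""
    return sum(a * b for a, b in zip(vectors[first_word], vectors[second_word]))


vectors = {"zorro": [1.0, 2.0], "perro": [3.0, 0.5], "marrón": [0.0, 1.0]}
wrapped = Wrapper(vectors, {"dot_product": dot_product})

# The stored vectors are reused on every call.
print(wrapped.dot_product(first_word="zorro", second_word="perro"))
print(wrapped.dot_product(first_word="zorro", second_word="marrón"))
# Output:
# 4.0
# 2.0
```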
Hooks#
You can add hooks (functions containing custom transform logic) directly into your model workflow. You must follow two simple rules:
The function must accept only one positional argument and return a single result.
The input data type must match the output type of the previous step, and its output type must match the expected input of the next step.
from dstk.corpus.annotation import annotate_corpus
from dstk.models.tools import ModelBuilder
from dstk.hooks.tools import Hook
corpus = ["El rápido zorro marrón salta sobre el perro perezoso. ¿Por qué el zorro no corre más rápido?"]
documents = annotate_corpus(corpus=corpus, language_model="es")
def cool_method(words):
    for word in words:
        word.text = "cool"
    return words
CoolHook = Hook(method=cool_method)
MyModel = ModelBuilder(
    workflow={
        "context.selection.unit": [
            {"get_words": {}},
        ],
        "context.selection.lexical": [
            {"to_lower": {}},
            {"to_base_form": {"base_form": "lemma"}},
            {"remove_stop_words": {"language": "es"}},
        ],
        "cool_hook": CoolHook
    }
)
words = MyModel(input_data=documents["document_0"])
print([word.text for word in words])
# Output: ['cool', 'cool', 'cool', 'cool', 'cool', 'cool', 'cool', 'cool', 'cool']
If your custom hook relies on additional keyword arguments, you can set their defaults before building the model:
def cool_method(words, text="cool"):
    for word in words:
        word.text = text
    return words
CoolHook = Hook(method=cool_method)
CoolHook.set_default_args({"text": "custom_text"})
MyModel = ModelBuilder(
    workflow={
        "context.selection.unit": [
            {"get_words": {}},
        ],
        "context.selection.lexical": [
            {"to_lower": {}},
            {"to_base_form": {"base_form": "lemma"}},
            {"remove_stop_words": {"language": "es"}},
        ],
        "cool_hook": CoolHook
    }
)
words = MyModel(input_data=documents["document_0"])
print([word.text for word in words])
# Output: ['custom_text', 'custom_text', 'custom_text', 'custom_text', 'custom_text', 'custom_text', 'custom_text', 'custom_text', 'custom_text']
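Both hook examples above satisfy the first rule if it is read as "exactly one required positional argument", with any extra parameters carrying defaults. Under that assumption, you could check a function before wrapping it in a Hook with a small helper like the one below. `required_positional_count` is hypothetical and not part of DSTK; it relies only on the standard `inspect` module.

```python
import inspect


def required_positional_count(function):
    """Count positional parameters that have no default value."""
    positional_kinds = (
        inspect.Parameter.POSITIONAL_ONLY,
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
    )
    return sum(
        1
        for parameter in inspect.signature(function).parameters.values()
        if parameter.kind in positional_kinds
        and parameter.default is inspect.Parameter.empty
    )


def cool_method(words, text="cool"):
    # `text` has a default, so only `words` is required.
    return [text for _ in words]


def two_inputs(words, other):
    # Two required positional arguments: not usable as a hook under this reading.
    return words


print(required_positional_count(cool_method))
print(required_positional_count(two_inputs))
# Output:
# 1
# 2
```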
Status#
This release (v3.0.2) is mostly stable, but still considered beta.
Most algorithms are stable and rely on well-tested libraries. However, some custom implementations might need verification.
Performance may be limited for large datasets.
All models should deliver correct results. However, external verification is strongly encouraged.
This library is open-source, and your verification matters! If you use this library:
Try the models on your data.
Compare results with expected behavior.
Report any inconsistencies, crashes, or performance bottlenecks.
Even small confirmations help a lot. Your feedback will help make the next stable release solid and reliable.
Contributing#
I welcome contributions to improve this toolkit. If you have ideas or fixes, feel free to fork the repository and submit a pull request. Here are some ways you can help:
Report bugs or issues.
Suggest new features or algorithms to add.
License#
This project is licensed under the GPL-3 License - see the LICENSE file for details.
You can contact me by :doc:`clicking here <contact>`.