dstk.utilities package#
Submodules#
dstk.utilities.clustering module#
This module provides tools for organizing and grouping word embeddings into meaningful clusters.
For researchers in linguistics and digital humanities, ‘word embeddings’ are mathematical representations of words where words with similar meanings are located close together in a high-dimensional space. However, because these spaces are often too complex to analyze directly, this module uses two powerful algorithms:
UMAP: Reduces the complexity (dimensionality) of the data while preserving the relationships between words.
HDBSCAN: Identifies “dense” regions in that simplified space to group similar words together into clusters.
The resulting output identifies groups of related terms and labels outliers as ‘Noise’.
- dstk.utilities.clustering.cluster_embeddings(embeddings: DataFrame, cluster_dimensions: int = 5, n_neighbors: int = 15, min_dist: float = 0.1, compression_metric: str = 'cosine', approximate: int | None = None, min_cluster_size: int = 5, min_samples: int | None = None, cluster_metric: str = 'euclidean') DataFrame[source]#
Reduces dimensions and clusters word embeddings using UMAP and HDBSCAN.
- Parameters:
embeddings (DataFrame) – DataFrame containing the word embeddings.
cluster_dimensions (int) – Number of dimensions for UMAP reduction.
n_neighbors (int) – Number of neighbors to consider for local structure.
min_dist (float) – Minimum distance between points in the reduced space.
compression_metric (str) – Metric used by UMAP (e.g., “cosine”).
approximate (int | None) – If set, a random sample of data is used for speed.
min_cluster_size (int) – Minimum size required to form a cluster in HDBSCAN.
min_samples (int | None) – Number of samples in a neighborhood to consider as core points.
cluster_metric (str) – Metric used by HDBSCAN (e.g., “euclidean”).
- Returns:
A DataFrame with an additional ‘cluster’ column.
- Return type:
DataFrame
dstk.utilities.context_extraction module#
This module provides supplementary utilities for analyzing and summarizing context extraction results. It complements the main context extraction module by offering helper functions for computing descriptive statistics, transforming extracted contexts into tabular representations, and performing other auxiliary operations on collocation data.
Core functionalities include:
Counting the frequency of words appearing in extracted collocates.
Converting context extraction results into pandas DataFrames for analysis.
Supporting exploratory analysis of collocation and context data.
Providing miscellaneous helper functions related to context extraction outputs.
The module is intended to complement the context extraction workflow by providing reusable utilities for post-processing and analyzing extracted linguistic contexts.
- dstk.utilities.context_extraction.collocate_frequency(collocates: list[tuple[Word, ...]]) DataFrame[source]#
Counts the frequency of words in a list of collocations and returns the result as a DataFrame.
- Parameters:
collocates (list[Collocates]) – A list of collocations, where each collocation is a tuple of words.
- Returns:
A DataFrame with two columns: “Word” and “Frequency”, sorted by frequency.
- Return type:
DataFrame
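As a rough illustration of the counting step, the sketch below tallies word forms across collocation tuples using only the standard library. The `Word` dataclass is a hypothetical stand-in for a Stanza Word, and counting by the `text` attribute is an assumption; the function name `count_collocates` is also illustrative and not part of dstk.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """Hypothetical stand-in for a Stanza Word; only `text` is used here."""
    text: str


def count_collocates(collocates: list) -> list:
    """Count how often each word form occurs across all collocation tuples."""
    counts = Counter(word.text for collocation in collocates for word in collocation)
    # most_common() returns (word, frequency) pairs sorted by descending frequency.
    return counts.most_common()


collocates = [
    (Word("strong"), Word("tea")),
    (Word("strong"), Word("coffee")),
    (Word("green"), Word("tea")),
]
rows = count_collocates(collocates)
```

The resulting pairs could be passed to `pandas.DataFrame(rows, columns=["Word", "Frequency"])` to obtain a table shaped like the one described above.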
dstk.utilities.data_conversion module#
This module provides utility functions for converting linguistic data between different formats and representations common in computational linguistics and natural language processing (NLP). It facilitates the movement of data between raw text, neural model outputs, and structured tabular formats (pandas DataFrames), as well as standard annotation formats like CoNLL-U.
Core functionalities include:
Converting sequences of lexical items into space-separated strings for text processing.
Transforming Word2Vec and FastText embeddings into pandas DataFrames for easier analysis.
Parsing CoNLL-U files into DataFrames to allow for programmatic manipulation of linguistic features and metadata.
Exporting processed DataFrames back into the standard CoNLL-U format for sharing or storage.
The module is intended to streamline the workflow of converting data between various stages of a linguistic pipeline, ensuring compatibility between different tools and data storage formats.
- dstk.utilities.data_conversion.conllu_to_df(path: str) DataFrame[source]#
Parses a CoNLL-U file and converts it into a pandas DataFrame.
- Parameters:
path (str) – The system path to the .conllu file.
- Returns:
A DataFrame containing tokens, metadata, and features.
- Return type:
DataFrame
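To make the CoNLL-U layout concrete, here is a minimal standard-library sketch of turning token lines into row dictionaries. It is not the dstk implementation: the function name `parse_conllu` is illustrative, and it makes the simplifying assumption that comment lines (sentence metadata) are skipped rather than kept. The ten field names follow the CoNLL-U format.

```python
# The ten tab-separated fields of a CoNLL-U token line, in order.
CONLLU_FIELDS = [
    "ID", "FORM", "LEMMA", "UPOS", "XPOS",
    "FEATS", "HEAD", "DEPREL", "DEPS", "MISC",
]


def parse_conllu(text: str) -> list:
    """Return one dict per token line; blank and comment lines are skipped."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            # Blank lines separate sentences; '#' lines carry metadata.
            continue
        rows.append(dict(zip(CONLLU_FIELDS, line.split("\t"))))
    return rows


sample = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\tVBP\tMood=Ind\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_\n"
)
rows = parse_conllu(sample)
```

A list of dictionaries like `rows` can be handed directly to `pandas.DataFrame(rows)` to get one column per CoNLL-U field.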
- dstk.utilities.data_conversion.df_to_conllu(dataframe: DataFrame, path: str | None) str[source]#
Converts a DataFrame of linguistic data into a CoNLL-U formatted string.
- Parameters:
dataframe (DataFrame) – A pandas DataFrame containing token information and metadata.
path (str | None) – Optional file path to save the generated CoNLL-U content.
- Returns:
The resulting CoNLL-U formatted string.
- Return type:
str
- dstk.utilities.data_conversion.neural_model_to_dataframe(model: Word2Vec | _FastText) DataFrame[source]#
Converts a trained Word2Vec or FastText model into a DataFrame of word embeddings.
- Parameters:
model (NeuralModels) – A trained Word2Vec or FastText model.
- Returns:
A DataFrame containing the word embeddings and their associated labels.
- Return type:
DataFrame
- dstk.utilities.data_conversion.sequence_to_string(items: Sequence[Word] | Sequence[Token]) str[source]#
Joins a sequence of words or tokens into a single space-separated string.
- Parameters:
items (LexicalItemSequence) – A sequence of word or token objects.
- Returns:
A single string formed by joining the text of each item.
- Return type:
str
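The joining behavior described above can be sketched in a few lines. The `Item` class is a hypothetical stand-in for a Word or Token object, assumed to expose its surface form as `text`; `join_items` is an illustrative name, not the library function.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Item:
    """Hypothetical stand-in for a Word or Token; only `text` is used."""
    text: str


def join_items(items: Sequence[Item]) -> str:
    """Join the text of each item with single spaces."""
    return " ".join(item.text for item in items)


sentence = join_items([Item("The"), Item("cat"), Item("sleeps"), Item(".")])
```

Because every item is separated by a space, punctuation ends up detached from the preceding word in this sketch (`"The cat sleeps ."`).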
dstk.utilities.dataframe_manipulation module#
This module provides helper functions to simplify the extraction of data from pandas DataFrames, a common format used for organizing and storing linguistic datasets. It converts table-based structures into standard Python lists, making it easier to pass data into NLP pipelines or downstream analysis tools.
Core functionalities include:
Extracting a specific row as a list by either its numerical position or its label name
Extracting a specific column as a list for easy iteration and processing
Checking if a DataFrame is stored in a sparse format (useful for managing memory when dealing with large datasets)
The module is intended to provide a simplified interface for data retrieval, ensuring that data types are consistent when moving from tabular structures into standard Python lists.
- dstk.utilities.dataframe_manipulation.get_column(dataframe: DataFrame, column: int | str) list[source]#
Returns the specified column from a dataframe as a list of values.
- Parameters:
dataframe (DataFrame) – The dataframe from which to extract the column.
column (int | str) – The index of the column to be extracted or its label. Extraction by label works only when the dataframe has no duplicate column labels; otherwise a ValueError is raised.
- Raises:
ValueError – If the provided dataframe contains more than one column with the same name.
- dstk.utilities.dataframe_manipulation.get_row(dataframe: DataFrame, row: int | str) list[source]#
Returns the specified row from a dataframe as a list of values.
- Parameters:
dataframe (DataFrame) – The dataframe from which to extract the row.
row (int | str) – The index of the row to be extracted or its label. Extraction by label works only when the dataframe has no duplicate row labels; otherwise a ValueError is raised.
- Raises:
ValueError – If the provided dataframe contains more than one row with the same name.
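The position-versus-label rule shared by get_column and get_row can be sketched without pandas. The code below uses plain lists as a stand-in for a DataFrame, and the helper name `get_by_position_or_label` is hypothetical. It assumes, following the description above, that any duplicate label blocks label-based access while positional access stays available.

```python
from typing import Union


def get_by_position_or_label(labels: list, values: list, key: Union[int, str]) -> list:
    """Return one column (or row) of a small table as a plain list.

    `labels` holds the column (or row) names and `values` the matching
    lists of cells; both are stand-ins for a pandas DataFrame.
    """
    if isinstance(key, int):
        # Positional access never depends on the labels being unique.
        return list(values[key])
    if len(set(labels)) != len(labels):
        raise ValueError("Duplicate labels found; select by position instead.")
    return list(values[labels.index(key)])


unique_labels = ["word", "freq"]
unique_columns = [["tea", "coffee"], [12, 7]]
by_label = get_by_position_or_label(unique_labels, unique_columns, "freq")

duplicate_labels = ["word", "freq", "freq"]
duplicate_columns = [["tea", "coffee"], [12, 7], [3, 9]]
by_position = get_by_position_or_label(duplicate_labels, duplicate_columns, 2)

try:
    get_by_position_or_label(duplicate_labels, duplicate_columns, "freq")
    raised = False
except ValueError:
    raised = True
```

When labels repeat, falling back to an integer position is the only unambiguous way to pick a single column or row in this sketch.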
dstk.utilities.matrix_manipulation module#
This module provides a collection of helper functions for manipulating numerical data structures, specifically focusing on NumPy arrays. It is designed to simplify common data processing tasks such as normalization, scaling, and transforming raw numerical outputs into formats suitable for analysis or visualization.
Core functionalities include:
Scaling and standardizing matrices to ensure uniform variance across features.
General utility functions for handling array dimensions and miscellaneous data transformations.
The module is intended as a toolkit for researchers who need to perform preprocessing tasks on numerical datasets within a Python environment.
- dstk.utilities.matrix_manipulation.scale_matrix(matrix: DataFrame, **kwargs) DataFrame[source]#
Scales the input matrix to have zero mean and unit variance for each feature.
This function applies standardization using scikit-learn’s StandardScaler, which transforms the data such that each column (feature) has a mean of 0 and a standard deviation of 1.
- Parameters:
matrix (DataFrame) – The input data to scale.
kwargs – Additional keyword arguments to pass to sklearn’s StandardScaler.
For more information check: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
- Returns:
A scaled matrix.
- Return type:
DataFrame
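The arithmetic behind standardization can be shown with a small standard-library sketch: each value has its column mean subtracted and is divided by the column's standard deviation. This is not the library's code, and `standardize_columns` is an illustrative name. It assumes the population standard deviation (the biased estimator that StandardScaler documents) and leaves constant columns unscaled to avoid division by zero.

```python
from statistics import fmean, pstdev


def standardize_columns(rows: list) -> list:
    """Scale each column of a row-major matrix to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [fmean(column) for column in columns]
    # Population standard deviation; a constant column gets a scale of 1.0.
    scales = [pstdev(column) or 1.0 for column in columns]
    return [
        [(value - mean) / scale for value, mean, scale in zip(row, means, scales)]
        for row in rows
    ]


matrix = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
scaled = standardize_columns(matrix)
```

Both columns of `matrix` end up with identical scaled values, since standardization removes differences in magnitude between features.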
dstk.utilities.typeguards module#
This module provides type guard functions to verify the internal structure and data types of linguistic objects within the library. These guards act as safety checks, ensuring that data—such as collections of words, sentences, or document indices—matches the expected format before it is processed by downstream functions.
Core functionalities include:
Validating document structures (e.g., checking for valid dictionaries or sequences of Stanza Documents)
Verifying linguistic units (validating lists of sentences, words, or tokens)
Checking complex data types like collocations and mixed linguistic sequences
Validating workflow configurations to ensure they adhere to the required schema
By using these guards, the library ensures that errors are caught early when processing large datasets, providing more reliable results for linguistic analysis.
- dstk.utilities.typeguards.is_collocates(collocates: Any) TypeGuard[list[tuple[Word, ...]]][source]#
Checks if the input is a non-empty list of tuples, where each tuple contains Word objects.
- Parameters:
collocates (Any) – The object to check.
- Returns:
True if collocates matches a list of Collocates, otherwise False.
- Return type:
bool
- dstk.utilities.typeguards.is_document_index(index: Any) TypeGuard[dict[str, Document]][source]#
Checks if the input is a non-empty dictionary mapping strings to Stanza Documents.
- Parameters:
index (Any) – The object to check.
- Returns:
True if index matches DocumentIndex, otherwise False.
- Return type:
bool
- dstk.utilities.typeguards.is_documents(documents: Any) TypeGuard[Sequence[Document]][source]#
Checks if the input is a non-empty sequence of Stanza Documents.
- Parameters:
documents (Any) – The object to check.
- Returns:
True if documents matches Documents, otherwise False.
- Return type:
bool
- dstk.utilities.typeguards.is_sentences(sentences: Any) TypeGuard[Sequence[Sentence]][source]#
Checks if the input is a non-empty sequence of Sentence objects.
- Parameters:
sentences (Any) – The object to check.
- Returns:
True if sentences matches Sequence[Sentence], otherwise False.
- Return type:
bool
- dstk.utilities.typeguards.is_sequences(sequences: Any) TypeGuard[Sequence[Sentence] | Sequence[Sequence[Word]] | Sequence[Sequence[Token]]][source]#
Checks if the input is a non-empty sequence of linguistic items (Sentences, Words, or Tokens).
- Parameters:
sequences (Any) – The object to check.
- Returns:
True if sequences matches LinguisticSequences, otherwise False.
- Return type:
bool
- dstk.utilities.typeguards.is_tokens(tokens: Any) TypeGuard[Sequence[Token]][source]#
Checks if the input is a non-empty sequence of Token objects.
- Parameters:
tokens (Any) – The object to check.
- Returns:
True if tokens matches Sequence[Token], otherwise False.
- Return type:
bool
- dstk.utilities.typeguards.is_words(words: Any) TypeGuard[Sequence[Word]][source]#
Checks if the input is a non-empty sequence of Word objects.
- Parameters:
words (Any) – The object to check.
- Returns:
True if words matches Sequence[Word], otherwise False.
- Return type:
bool
- dstk.utilities.typeguards.is_workflow(workflow: Any) TypeGuard[dict[str, list[dict[str, dict[str, Any]]] | Hook]][source]#
Checks if the input is a workflow structure, i.e., a non-empty list of dictionaries where each dictionary maps string method names to argument dictionaries with string keys.
- Parameters:
workflow (Any) – The object to check.
- Returns:
True if workflow matches the workflow structure, otherwise False.
- Return type:
bool
dstk.utilities.word2vec module#
This module provides utility functions for managing neural word embedding models, specifically supporting Word2Vec and FastText formats. It acts as a unified interface to handle the loading and saving of these models, abstracting away the differences between the underlying gensim and fasttext libraries.
Core functionalities include:
Loading Word2Vec models from files with the .model extension.
Loading FastText models from files with the .bin extension.
Automatic detection of model types based on file extensions during loading.
Saving both Word2Vec and FastText models to specified paths while automatically applying the correct file format.
Providing a consistent interface for handling pre-trained word vectors in linguistic workflows.
The module is intended to simplify the integration of vector space models into computational linguistics and digital humanities projects.
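The extension-based detection mentioned above can be sketched as a small dispatcher. The function name `detect_model_type` and the error raised for unknown extensions are assumptions for illustration. The comments name the gensim and fasttext loaders that would typically handle each case, without calling them.

```python
from pathlib import Path


def detect_model_type(path: str) -> str:
    """Map a model file path to the kind of model its extension implies."""
    suffix = Path(path).suffix.lower()
    if suffix == ".model":
        # Typically loaded with gensim.models.Word2Vec.load(path).
        return "word2vec"
    if suffix == ".bin":
        # Typically loaded with fasttext.load_model(path).
        return "fasttext"
    raise ValueError(f"Unsupported model extension: {suffix!r}")


kind_a = detect_model_type("models/corpus.model")
kind_b = detect_model_type("models/corpus.bin")
```

Keeping the extension check in one place means callers only need to pass a path, and the format decision stays consistent between loading and saving.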
- dstk.utilities.word2vec.load_neural_model(path: str) Word2Vec | _FastText[source]#
Loads the trained embeddings in .model (Word2Vec) or .bin (FastText) format, depending on the algorithm used.
- Parameters:
path (str) – Path to the saved model file.
- Returns:
An instance of gensim’s Word2Vec or fasttext’s FastText.
- Return type:
Word2Vec | _FastText
- dstk.utilities.word2vec.save_neural_model(model: Word2Vec | _FastText, path: str) str[source]#
Saves the trained embeddings in .model (Word2Vec) or .bin (FastText) format, depending on the algorithm used.
- Parameters:
model (NeuralModels) – A trained Word2Vec or FastText model.
path (str) – The path (without extension) at which to save the model.
- Returns:
A string; judging by the signature, the path of the saved model file.
- Return type:
str