dstk.parameters package

Subpackages

dstk.parameters.co_matrix package
- Subpackages
  - dstk.parameters.co_matrix.creation package
  - dstk.parameters.co_matrix.weighting package
- Module contents
dstk.parameters.context package
- Subpackages
  - dstk.parameters.context.extraction package
    - Subpackages
    - Module contents
  - dstk.parameters.context.selection package
- Module contents
dstk.parameters.vector_similarity package
- Subpackages
  - dstk.parameters.vector_similarity.geometric_measures package
- Module contents

Submodules

dstk.parameters.dimensionality_reduction module

This module offers functionality to transform and reduce high-dimensional text data represented as matrices, enabling more effective downstream analysis and modeling.

Key features include:

- Scaling input matrices to zero mean and unit variance using standardization.
- Generating low-dimensional word embeddings from co-occurrence matrices using dimensionality reduction techniques:
  - Truncated Singular Value Decomposition (SVD)
  - Principal Component Analysis (PCA)

These techniques help distill semantic information from sparse and high-dimensional co-occurrence data, facilitating tasks such as clustering, visualization, and feature extraction in natural language processing pipelines.

All functions return their results as pandas DataFrames, so the output can be passed directly to other DataFrame-based steps.
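The standardization step mentioned above can be illustrated with scikit-learn directly. This is a minimal sketch, not the module's own implementation: the helper name `standardize_sketch` and the toy matrix are hypothetical, and it assumes the scaling is equivalent to sklearn's `StandardScaler` applied column by column.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy matrix: rows are target words, columns are context features.
toy_matrix = pd.DataFrame(
    [[4.0, 0.0, 1.0], [2.0, 3.0, 0.0], [0.0, 5.0, 2.0], [1.0, 1.0, 6.0]],
    index=["cat", "dog", "car", "bus"],
    columns=["ctx_a", "ctx_b", "ctx_c"],
)


def standardize_sketch(matrix: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: scale each column to zero mean and unit variance."""
    scaled = StandardScaler().fit_transform(matrix.to_numpy())
    # Keep the original row and column labels so the result stays a labeled DataFrame.
    return pd.DataFrame(scaled, index=matrix.index, columns=matrix.columns)


scaled_matrix = standardize_sketch(toy_matrix)
```

Each column of `scaled_matrix` has mean 0 and (population) standard deviation 1, while the word labels in the index are preserved.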

dstk.parameters.dimensionality_reduction.pca(matrix: DataFrame, n_dimensions: int | float = 300, **kwargs) → DataFrame

Generates word embeddings using Principal Component Analysis (PCA).

Parameters:

matrix (DataFrame) – A co-occurrence matrix from which embeddings will be generated.
n_dimensions (int or float) – If an integer, the number of dimensions to reduce the word embeddings to. If a float between 0 and 1, the proportion of variance to preserve. Defaults to 300.
kwargs – Additional keyword arguments to pass to sklearn’s PCA.

For more information, see: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

Returns: A DataFrame of word embeddings generated by PCA.
Return type: DataFrame
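The two meanings of n_dimensions can be seen by calling sklearn's PCA directly. The following is a rough sketch under stated assumptions: `pca_sketch` is a hypothetical stand-in, not the library's `pca`, and it assumes n_dimensions maps onto sklearn's `n_components`. The toy matrix is tiny, so a small value is used instead of the default 300, because sklearn requires `n_components` to be at most min(n_samples, n_features).

```python
import pandas as pd
from sklearn.decomposition import PCA

# Toy co-occurrence matrix: rows are target words, columns are context words.
words = ["cat", "dog", "car", "bus", "tree"]
co_matrix = pd.DataFrame(
    [
        [0, 8, 1, 0, 3],
        [8, 0, 2, 1, 4],
        [1, 2, 0, 9, 1],
        [0, 1, 9, 0, 2],
        [3, 4, 1, 2, 0],
    ],
    index=words,
    columns=words,
    dtype=float,
)


def pca_sketch(matrix: pd.DataFrame, n_dimensions=2, **kwargs) -> pd.DataFrame:
    """Hypothetical stand-in: reduce a co-occurrence matrix with sklearn's PCA."""
    model = PCA(n_components=n_dimensions, **kwargs)
    reduced = model.fit_transform(matrix.to_numpy())
    # One row per word; columns are the principal components.
    return pd.DataFrame(reduced, index=matrix.index)


# Integer: keep exactly two components.
fixed = pca_sketch(co_matrix, n_dimensions=2)

# Float in (0, 1): keep as many components as needed to explain 90% of the variance.
# svd_solver is forwarded to sklearn through **kwargs.
by_variance = pca_sketch(co_matrix, n_dimensions=0.9, svd_solver="full")
```

With an integer the output width is known in advance; with a float the number of columns depends on the data, so downstream code should read it from the returned DataFrame.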

dstk.parameters.dimensionality_reduction.svd(matrix: DataFrame, n_dimensions: int = 300, **kwargs) → DataFrame

Generates word embeddings using truncated Singular Value Decomposition (SVD).

Parameters:

matrix (DataFrame) – A co-occurrence matrix from which embeddings will be generated.
n_dimensions (int) – The number of dimensions to reduce the word embeddings to. Defaults to 300.
kwargs – Additional keyword arguments to pass to sklearn’s TruncatedSVD.

For more information, see: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html

Returns: A DataFrame of word embeddings generated by SVD.
Return type: DataFrame
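For comparison, the same kind of reduction can be sketched with sklearn's TruncatedSVD. As before, `svd_sketch` is a hypothetical stand-in rather than the library's `svd`, and it assumes n_dimensions is passed through as `n_components`. Unlike PCA, TruncatedSVD does not center the data first. The component count is kept below the number of columns of the toy matrix.

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD

# Toy co-occurrence matrix: rows are target words, columns are context words.
words = ["cat", "dog", "car", "bus", "tree"]
co_matrix = pd.DataFrame(
    [
        [0, 8, 1, 0, 3],
        [8, 0, 2, 1, 4],
        [1, 2, 0, 9, 1],
        [0, 1, 9, 0, 2],
        [3, 4, 1, 2, 0],
    ],
    index=words,
    columns=words,
    dtype=float,
)


def svd_sketch(matrix: pd.DataFrame, n_dimensions: int = 2, **kwargs) -> pd.DataFrame:
    """Hypothetical stand-in: reduce a co-occurrence matrix with TruncatedSVD."""
    model = TruncatedSVD(n_components=n_dimensions, **kwargs)
    reduced = model.fit_transform(matrix.to_numpy())
    # One row per word; columns are the retained singular directions.
    return pd.DataFrame(reduced, index=matrix.index)


# random_state is forwarded to sklearn so the randomized solver is reproducible.
embeddings = svd_sketch(co_matrix, n_dimensions=2, random_state=0)
```

Each row of `embeddings` is a two-dimensional vector for the corresponding word, which can then be fed to clustering or similarity measures.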
