dstk.parameters package

Subpackages

Submodules

dstk.parameters.dimensionality_reduction module

This module offers functionality to transform and reduce high-dimensional text data represented as matrices, enabling more effective downstream analysis and modeling.

Key features include:

  • Scaling input matrices to zero mean and unit variance using standardization.

  • Generating low-dimensional word embeddings from co-occurrence matrices using dimensionality reduction techniques:

      • Truncated Singular Value Decomposition (SVD)

      • Principal Component Analysis (PCA)

These techniques help distill semantic information from sparse and high-dimensional co-occurrence data, facilitating tasks such as clustering, visualization, and feature extraction in natural language processing pipelines.

All functions return their results as pandas DataFrames, so the output can be passed directly to other DataFrame-based tools.
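The standardization step listed above can be sketched as follows. This is a minimal illustration that assumes scikit-learn's StandardScaler; the helper name `standardize_sketch` and the toy matrix are hypothetical and are not part of dstk's documented API.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def standardize_sketch(matrix: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: scale each column to zero mean and unit variance."""
    scaled = StandardScaler().fit_transform(matrix)
    # Re-wrap the NumPy array so row and column labels are preserved.
    return pd.DataFrame(scaled, index=matrix.index, columns=matrix.columns)


# Toy symmetric co-occurrence matrix (made-up counts, for illustration only).
words = ["cat", "dog", "fish"]
cooc = pd.DataFrame(
    [[0.0, 2.0, 1.0], [2.0, 0.0, 3.0], [1.0, 3.0, 0.0]],
    index=words,
    columns=words,
)
scaled = standardize_sketch(cooc)
```

StandardScaler divides by the population standard deviation (`ddof=0`), so checking the result with pandas requires `scaled.std(ddof=0)` rather than the default `scaled.std()`.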

dstk.parameters.dimensionality_reduction.pca(matrix: DataFrame, n_dimensions: int | float = 300, **kwargs) → DataFrame

Generates word embeddings using Principal Component Analysis (PCA).

Parameters:
  • matrix (DataFrame) – A co-occurrence matrix from which embeddings will be generated.

  • n_dimensions (int or float) – If an integer, the number of dimensions to reduce the word embeddings to. If a float between 0 and 1, specifies the proportion of variance to preserve. Defaults to 300.

  • kwargs – Additional keyword arguments to pass to sklearn’s PCA.

For more information, see: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

Returns:

A DataFrame of word embeddings generated by PCA.

Return type:

DataFrame
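The behaviour described for `pca` can be approximated with scikit-learn directly. The sketch below is not dstk's implementation: `pca_sketch` is a hypothetical stand-in that assumes `n_dimensions` is forwarded to PCA's `n_components`, and the co-occurrence counts are randomly generated.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA


def pca_sketch(matrix: pd.DataFrame, n_dimensions=300, **kwargs) -> pd.DataFrame:
    """Hypothetical stand-in for pca(): reduce a co-occurrence matrix with PCA."""
    # Assumption: n_dimensions maps onto sklearn's n_components, which accepts
    # an int (number of components) or a float in (0, 1) (variance to keep).
    model = PCA(n_components=n_dimensions, **kwargs)
    embeddings = model.fit_transform(matrix)
    # Keep the word labels as the index of the embedding table.
    return pd.DataFrame(embeddings, index=matrix.index)


# Random symmetric toy matrix standing in for real co-occurrence counts.
rng = np.random.default_rng(0)
words = ["cat", "dog", "fish", "bird", "tree", "leaf", "rock", "sand"]
counts = rng.integers(0, 20, size=(8, 8))
cooc = pd.DataFrame(counts + counts.T, index=words, columns=words, dtype=float)

emb_fixed = pca_sketch(cooc, n_dimensions=3)       # exactly 3 dimensions
emb_variance = pca_sketch(cooc, n_dimensions=0.9)  # enough to keep 90% of variance
```

With scikit-learn's full solver, an integer `n_components` may not exceed `min(n_samples, n_features)`, so a value as large as 300 only suits matrices with at least 300 rows and columns. The toy example passes smaller values for that reason.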

dstk.parameters.dimensionality_reduction.svd(matrix: DataFrame, n_dimensions: int = 300, **kwargs) → DataFrame

Generates word embeddings using truncated Singular Value Decomposition (SVD).

Parameters:
  • matrix (DataFrame) – A co-occurrence matrix from which embeddings will be generated.

  • n_dimensions (int) – The number of dimensions to reduce the word embeddings to. Defaults to 300.

  • kwargs – Additional keyword arguments to pass to sklearn’s TruncatedSVD.

For more information, see: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html

Returns:

A DataFrame of word embeddings generated by SVD.

Return type:

DataFrame
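A comparable sketch for `svd`, again written against scikit-learn rather than dstk itself. `svd_sketch` is a hypothetical stand-in that assumes `n_dimensions` is forwarded to TruncatedSVD's `n_components`; the matrix values are invented for illustration.

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD


def svd_sketch(matrix: pd.DataFrame, n_dimensions: int = 300, **kwargs) -> pd.DataFrame:
    """Hypothetical stand-in for svd(): reduce a co-occurrence matrix with truncated SVD."""
    # Assumption: n_dimensions maps onto sklearn's n_components.
    model = TruncatedSVD(n_components=n_dimensions, **kwargs)
    embeddings = model.fit_transform(matrix)
    # Keep the word labels as the index of the embedding table.
    return pd.DataFrame(embeddings, index=matrix.index)


# Toy symmetric co-occurrence matrix (made-up counts, for illustration only).
words = ["cat", "dog", "fish", "bird", "tree", "leaf"]
cooc = pd.DataFrame(
    [
        [0, 4, 1, 2, 0, 0],
        [4, 0, 1, 1, 1, 0],
        [1, 1, 0, 0, 0, 1],
        [2, 1, 0, 0, 3, 2],
        [0, 1, 0, 3, 0, 5],
        [0, 0, 1, 2, 5, 0],
    ],
    index=words,
    columns=words,
    dtype=float,
)

# random_state is passed through **kwargs to make the randomized solver repeatable.
embeddings = svd_sketch(cooc, n_dimensions=2, random_state=0)
```

Unlike PCA, TruncatedSVD does not center the data before decomposing it, which is why it is commonly applied to sparse count matrices.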

Module contents