dstk.parameters.co_matrix.creation package#

Subpackages#

dstk.parameters.co_matrix.creation.linguistic package
- Subpackages
  - dstk.parameters.co_matrix.creation.linguistic.word package
- Module contents

Submodules#

dstk.parameters.co_matrix.creation.document module#

This module provides tools for generating a Word-by-Document matrix within the framework of distributional semantics. It treats “documents” as the primary category of context: word occurrences are analyzed by their presence within larger units such as articles, books, or individual records.

The module facilitates the transition from raw linguistic sequences to structured numerical representations, allowing researchers to analyze how words appear across different documents.

Core functionalities include:

- Converting collections of lexical item sequences (words/tokens) into a matrix format suitable for distributional analysis.
- Integrating with scikit-learn’s CountVectorizer to handle n-grams and stop-word filtering during the matrix construction process.
- Generating a sparse DataFrame where rows represent unique terms and columns represent distinct documents (Word x Document matrix).
- Mapping internal data structures into standard pandas DataFrames for easier manipulation in downstream analysis.

This module serves as a foundational step in creating co-occurrence matrices based on document-level context rather than purely local linguistic units.
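To make the target structure concrete, here is a minimal standard-library sketch of a word x document count table. It does not use dstk or scikit-learn; the document names and token lists are made-up illustrative data, and plain strings stand in for the library's Word objects.

```python
from collections import Counter

# Two toy "documents", each already split into tokens (illustrative data only).
documents = {
    "doc_a": ["the", "cat", "sat", "on", "the", "mat"],
    "doc_b": ["the", "dog", "sat"],
}

# Count how often each term occurs in each document.
per_document = {name: Counter(tokens) for name, tokens in documents.items()}

# Rows are the sorted vocabulary; each row holds one count per document,
# in the order the documents were declared.
vocabulary = sorted(set().union(*documents.values()))
matrix = {
    term: [per_document[name][term] for name in documents]
    for term in vocabulary
}

for term, row in matrix.items():
    print(f"{term:>4}", row)
```

Most cells in such a table are zero once the vocabulary grows, which is why a sparse representation is the natural choice for real corpora.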

dstk.parameters.co_matrix.creation.document.create_word_by_document_matrix(documents_words: Sequence[Sequence[Word]], document_names: list[str] | None = None, **kwargs) → DataFrame[source]#

Creates a Word By Document Matrix.

Parameters:

documents_words (Sequence[Sequence[Word]]) – A sequence of word or token object sequences, one per document.
document_names (list[str] | None) – Optional list of document names used as column labels.
kwargs – Additional arguments passed to scikit-learn’s CountVectorizer (e.g., stop_words, ngram_range).

For more information check: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

Returns: Sparse co-occurrence matrix (word x document).
Return type: DataFrame
