dstk.parameters.co_matrix.creation package
Subpackages
Submodules
dstk.parameters.co_matrix.creation.document module
This module provides tools for generating a word-by-document matrix within the framework of distributional semantics. It treats “Documents” as the primary category of context: a word's occurrences are analyzed according to the larger structures in which they appear, such as articles, books, or specific records.
The module facilitates the transition from raw linguistic sequences to structured numerical representations, allowing researchers to analyze how words appear across different documents.
Core functionalities include:
* Converting collections of lexical item sequences (words/tokens) into a matrix format suitable for distributional analysis.
* Integrating with scikit-learn’s CountVectorizer to handle n-grams and stop-word filtering during the matrix construction process.
* Generating a sparse DataFrame where rows represent unique terms and columns represent distinct documents (Word x Document matrix).
* Mapping internal data structures into standard pandas DataFrames for easier manipulation in downstream analysis.
This module serves as a foundational step in creating co-occurrence matrices based on document-level context rather than purely local linguistic units.
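To make the idea of a word-by-document matrix concrete, here is a minimal standard-library sketch. It is not the module's implementation: the helper name `word_by_document_counts` is hypothetical, and tokens are assumed to be plain strings rather than the library's own word objects.

```python
from collections import Counter


def word_by_document_counts(documents, names=None):
    """Hypothetical helper: count each word's occurrences per document.

    Returns the column names and a dict mapping each word (row) to a list
    of counts, one entry per document (column).
    """
    if names is None:
        # Assumed fallback naming; the real library may behave differently.
        names = [f"doc_{i}" for i in range(len(documents))]
    if len(names) != len(documents):
        raise ValueError("names and documents must have the same length")

    counts = [Counter(doc) for doc in documents]
    vocabulary = sorted(set().union(*documents))
    # A Counter returns 0 for words that are absent from a document.
    matrix = {word: [c[word] for c in counts] for word in vocabulary}
    return names, matrix


docs = [["the", "cat", "sat"], ["the", "dog", "sat", "sat"]]
names, matrix = word_by_document_counts(docs, ["a.txt", "b.txt"])
for word, row in matrix.items():
    print(word, dict(zip(names, row)))
```

Each row is a word and each column a document, so two words can be compared by comparing their rows of counts.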
- dstk.parameters.co_matrix.creation.document.create_word_by_document_matrix(documents_words: Sequence[Sequence[Word]], document_names: list[str] | None = None, **kwargs) -> DataFrame
Creates a Word By Document Matrix.
- Parameters:
documents_words (Sequence[Sequence[Word]]) – A sequence of documents, each given as a sequence of word or token objects.
document_names (list[str] | None) – Optional list of document names used as the column labels.
kwargs – Additional arguments passed to scikit-learn’s CountVectorizer (e.g., stop_words, ngram_range).
For more information, see: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
- Returns:
Sparse word-by-document matrix (rows are words, columns are documents).
- Return type:
DataFrame