dstk.parameters.context.selection package#

Submodules#

dstk.parameters.context.selection.lexical module#

This module provides a suite of preprocessing and filtering tools for linguistic data. It is designed to help researchers and students in the digital humanities clean, normalize, and filter sequences of words (Word objects) or tokens based on common linguistic criteria such as part-of-speech tags, frequency thresholds, and lexical normalization.

Core functionalities include:

  • Removing stop words using NLTK’s multi-language support or custom word lists.

  • Filtering text by Part-of-Speech (POS) tags (e.g., retaining only nouns or verbs).

  • Normalizing sequences into base forms through lemmatization or stemming.

  • Cleaning text by removing punctuation and converting characters to lowercase.

  • Filtering words based on minimum frequency thresholds.

  • Isolating specific Named Entities (NER) from token sequences.

The module is intended to simplify the preparation of raw text for more advanced computational linguistic analysis and visualizations.

dstk.parameters.context.selection.lexical.filter_by_frequency(words: Sequence[Word], threshold: int = 50) list[Word][source]#

Filters a sequence of words, keeping only those whose occurrence count meets the minimum frequency threshold.

Parameters:
  • words (Sequence[Word]) – A sequence of Stanza Word objects.

  • threshold (int) – The minimum occurrence count required to keep a word. Defaults to 50.

Returns:

A filtered list of Stanza Word objects.

Return type:

list[Word]
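As an illustration of frequency filtering, here is a minimal sketch that does not import dstk or Stanza. `FakeWord` and `filter_by_frequency_sketch` are hypothetical names, and two details are assumptions rather than documented behaviour: words are counted by their surface `text`, and a word is kept when its count is at least `threshold`.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Sequence


@dataclass
class FakeWord:
    """Hypothetical stand-in for a Stanza Word; only `text` is modelled."""
    text: str


def filter_by_frequency_sketch(
    words: Sequence[FakeWord], threshold: int = 50
) -> list[FakeWord]:
    # Count how often each surface form occurs in the whole sequence.
    counts = Counter(word.text for word in words)
    # Assumption: "minimum" is inclusive, so count == threshold is kept.
    return [word for word in words if counts[word.text] >= threshold]


sample = [FakeWord(t) for t in ["the", "cat", "the", "dog", "the", "cat"]]
kept_texts = [w.text for w in filter_by_frequency_sketch(sample, threshold=2)]
```

With a small corpus the default of 50 is likely to remove nearly everything, so an explicit, lower `threshold` is usually worth passing.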

dstk.parameters.context.selection.lexical.filter_by_ner(tokens: Sequence[Token], allowed_ner: set[str]) list[Token][source]#

Filters a sequence of tokens, keeping only those whose Named Entity Recognition (NER) tag is in the allowed set.

Parameters:
  • tokens (Sequence[Token]) – A sequence of Stanza Token objects.

  • allowed_ner (set[str]) – A set of NER tags to keep (e.g., {'PERSON', 'LOC'}).

Returns:

A filtered list of Stanza Token objects.

Return type:

list[Token]
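The following sketch shows one way NER filtering could work. `FakeToken` and `filter_by_ner_sketch` are hypothetical names. Stanza reports token-level entities with a BIOES prefix (for example `S-PERSON` or `B-ORG`, with `O` for no entity); the sketch assumes the prefix is stripped before comparing against `allowed_ner`, which may not be how dstk does it.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class FakeToken:
    """Hypothetical stand-in for a Stanza Token (only `text` and `ner`)."""
    text: str
    ner: str


def filter_by_ner_sketch(
    tokens: Sequence[FakeToken], allowed_ner: set[str]
) -> list[FakeToken]:
    kept = []
    for token in tokens:
        # Assumption: drop the BIOES prefix, so "B-PERSON" becomes "PERSON".
        # A bare "O" (no entity) is left unchanged by the split.
        label = token.ner.split("-", 1)[-1]
        if label in allowed_ner:
            kept.append(token)
    return kept


tokens = [
    FakeToken("Marie", "B-PERSON"),
    FakeToken("Curie", "E-PERSON"),
    FakeToken("lived", "O"),
    FakeToken("in", "O"),
    FakeToken("Paris", "S-GPE"),
]
people = [t.text for t in filter_by_ner_sketch(tokens, {"PERSON"})]
```

The tag inventory depends on the NER model that produced the tokens, so the labels in `allowed_ner` should be checked against that model's label set.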

dstk.parameters.context.selection.lexical.filter_by_pos(words: Sequence[Word], allowed_pos: set[str]) list[Word][source]#

Filters a sequence of words, keeping only those whose part-of-speech (POS) tag is in the allowed set.

Parameters:
  • words (Sequence[Word]) – A sequence of Stanza Word objects.

  • allowed_pos (set[str]) – A set of universal POS tags to keep (e.g., {'NOUN', 'VERB'}).

Returns:

A filtered list of Stanza Word objects.

Return type:

list[Word]
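A minimal sketch of POS filtering, written without dstk or Stanza. `FakeWord` and `filter_by_pos_sketch` are hypothetical; the sketch assumes the comparison is made against the word's universal POS tag (the `upos` attribute on Stanza Word objects).

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class FakeWord:
    """Hypothetical stand-in for a Stanza Word (only `text` and `upos`)."""
    text: str
    upos: str


def filter_by_pos_sketch(
    words: Sequence[FakeWord], allowed_pos: set[str]
) -> list[FakeWord]:
    # Keep a word only when its universal POS tag is in the allowed set.
    return [word for word in words if word.upos in allowed_pos]


sentence = [
    FakeWord("The", "DET"),
    FakeWord("archive", "NOUN"),
    FakeWord("holds", "VERB"),
    FakeWord("old", "ADJ"),
    FakeWord("letters", "NOUN"),
]
content_words = [w.text for w in filter_by_pos_sketch(sentence, {"NOUN", "VERB"})]
```

Order is preserved, so the filtered list can still be read as a reduced version of the original sentence.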

dstk.parameters.context.selection.lexical.remove_punctuation(words: Sequence[Word]) list[Word][source]#

Removes punctuation marks from a sequence of words.

Parameters:

words (Sequence[Word]) – A sequence of Stanza Word objects.

Returns:

A list of Stanza Word objects excluding punctuation.

Return type:

list[Word]
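One plausible way to drop punctuation is to rely on the POS tag rather than on the characters themselves. The sketch below assumes that approach (universal tag `PUNCT`); the source does not say which test dstk applies, and `FakeWord` and `remove_punctuation_sketch` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class FakeWord:
    """Hypothetical stand-in for a Stanza Word (only `text` and `upos`)."""
    text: str
    upos: str


def remove_punctuation_sketch(words: Sequence[FakeWord]) -> list[FakeWord]:
    # Assumption: punctuation is identified by the universal tag "PUNCT".
    return [word for word in words if word.upos != "PUNCT"]


sample = [
    FakeWord("Well", "INTJ"),
    FakeWord(",", "PUNCT"),
    FakeWord("yes", "INTJ"),
    FakeWord("!", "PUNCT"),
]
without_punct = [w.text for w in remove_punctuation_sketch(sample)]
```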

dstk.parameters.context.selection.lexical.remove_stop_words(words: Sequence[Word], *, language: str, custom_stop_words: None = None) list[Word][source]#
dstk.parameters.context.selection.lexical.remove_stop_words(words: Sequence[Word], *, language: None = None, custom_stop_words: list[str]) list[Word]
dstk.parameters.context.selection.lexical.remove_stop_words(words: Sequence[Word], *, language: str, custom_stop_words: list[str]) list[Word]

Filters out stop words from a sequence of words using a language's stop-word list, a custom list, or both.

Parameters:
  • words (Sequence[Word]) – A sequence of Stanza Word objects.

  • language (str or None) – The two-letter ISO language code (e.g., 'en', 'es'). Defaults to None.

  • custom_stop_words (list[str] or None) – A user-defined list of stop words to filter out. Defaults to None.

Returns:

A filtered list of Stanza Word objects.

Return type:

list[Word]
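The three overloads correspond to three call styles: a language only, a custom list only, or both. The sketch below mirrors that shape under several assumptions: `TOY_STOP_WORDS` is a hypothetical miniature table standing in for NLTK's lists (NLTK is not called here), matching is case-insensitive, both sources are merged when both are given, and omitting both raises an error.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class FakeWord:
    """Hypothetical stand-in for a Stanza Word; only `text` is modelled."""
    text: str


# Hypothetical miniature table; a real run would draw on NLTK's stop-word lists.
TOY_STOP_WORDS = {
    "en": {"the", "a", "of", "and"},
    "es": {"el", "la", "de", "y"},
}


def remove_stop_words_sketch(
    words: Sequence[FakeWord],
    *,
    language: Optional[str] = None,
    custom_stop_words: Optional[list[str]] = None,
) -> list[FakeWord]:
    # Assumption: at least one source of stop words must be supplied.
    if language is None and custom_stop_words is None:
        raise ValueError("Provide a language, custom stop words, or both.")
    stop_words: set[str] = set()
    if language is not None:
        stop_words |= TOY_STOP_WORDS[language]
    if custom_stop_words is not None:
        stop_words |= {entry.lower() for entry in custom_stop_words}
    # Assumption: comparison ignores case.
    return [word for word in words if word.text.lower() not in stop_words]


title = [FakeWord(t) for t in "The history of the printed book".split()]
by_language = [w.text for w in remove_stop_words_sketch(title, language="en")]
by_both = [
    w.text
    for w in remove_stop_words_sketch(
        title, language="en", custom_stop_words=["book"]
    )
]
```

Passing both arguments is a convenient way to extend a general-purpose list with corpus-specific terms.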

dstk.parameters.context.selection.lexical.to_base_form(words: Sequence[Word], base_form: Literal['lemma', 'stem'] = 'lemma') list[Word][source]#

Normalizes a sequence of words by replacing their text with their lemma or stem.

Parameters:
  • words (Sequence[Word]) – A sequence of Stanza Word objects.

  • base_form (Literal["lemma", "stem"]) – The normalization strategy, either 'lemma' or 'stem'. Defaults to 'lemma'.

Returns:

A list of normalized Stanza Word objects.

Return type:

list[Word]
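To make the difference between the two strategies concrete, here is a rough sketch. Everything in it is illustrative: `FakeWord` is a hypothetical stand-in, `naive_stem` is a toy suffix stripper used only as a placeholder for a real stemmer (the source does not say which one dstk uses), and returning modified copies rather than mutating the input is an assumption.

```python
from dataclasses import dataclass, replace
from typing import Literal, Sequence


@dataclass
class FakeWord:
    """Hypothetical stand-in for a Stanza Word (only `text` and `lemma`)."""
    text: str
    lemma: str


def naive_stem(text: str) -> str:
    # Toy placeholder for a real stemmer: strip one common English suffix.
    for suffix in ("ing", "ed", "s"):
        if text.endswith(suffix) and len(text) > len(suffix) + 2:
            return text[: -len(suffix)]
    return text


def to_base_form_sketch(
    words: Sequence[FakeWord], base_form: Literal["lemma", "stem"] = "lemma"
) -> list[FakeWord]:
    if base_form == "lemma":
        # Lemmas come from the annotation already stored on each word.
        return [replace(word, text=word.lemma) for word in words]
    if base_form == "stem":
        # Stems are computed from the surface form.
        return [replace(word, text=naive_stem(word.text)) for word in words]
    raise ValueError(f"Unsupported base_form: {base_form!r}")


sample = [FakeWord("mice", "mouse"), FakeWord("running", "run"), FakeWord("cats", "cat")]
lemmas = [w.text for w in to_base_form_sketch(sample)]
stems = [w.text for w in to_base_form_sketch(sample, base_form="stem")]
```

The example shows the usual trade-off: lemmatization maps irregular forms such as "mice" to a dictionary form, while rule-based stemming only trims suffixes and can leave non-words such as "runn".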

dstk.parameters.context.selection.lexical.to_lower(words: Sequence[Word]) list[Word][source]#

Converts the text of all words in a sequence to lowercase.

Parameters:

words (Sequence[Word]) – A sequence of Stanza Word objects.

Returns:

A list of lowercased Stanza Word objects.

Return type:

list[Word]
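Lowercasing can be sketched in the same style. `FakeWord` and `to_lower_sketch` are hypothetical names, and the choice to return modified copies instead of changing the input words in place is an assumption.

```python
from dataclasses import dataclass, replace
from typing import Sequence


@dataclass
class FakeWord:
    """Hypothetical stand-in for a Stanza Word; only `text` is modelled."""
    text: str


def to_lower_sketch(words: Sequence[FakeWord]) -> list[FakeWord]:
    # Build new objects so the caller's original words stay untouched.
    return [replace(word, text=word.text.lower()) for word in words]


original = [FakeWord("Digital"), FakeWord("HUMANITIES")]
lowered = [w.text for w in to_lower_sketch(original)]
```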

dstk.parameters.context.selection.unit module#

This module provides utility functions for extracting structured linguistic data from Stanza Documents. It simplifies the process of converting processed documents into manageable Python lists of sentences, tokens, and words to facilitate further analysis.

Core functionalities include:

  • Extracting a sequence of sentences from a Stanza Document.

  • Retrieving all tokens (including punctuation) as a list of Token objects.

  • Isolating words by extracting Word objects and filtering out punctuation marks.

  • Providing copies of linguistic objects to ensure data integrity during processing.

The module is designed to streamline the transition between raw NLP output and structured text analysis for linguistics-focused workflows.

dstk.parameters.context.selection.unit.get_sentences(document: Document) list[Sentence][source]#

Extracts a list of sentences from a Stanza Document.

Parameters:

document (Document) – The Stanza Document object.

Returns:

A list of sentence objects.

Return type:

list[Sentence]

dstk.parameters.context.selection.unit.get_tokens(document: Document) list[Token][source]#

Extracts all tokens from a Stanza Document as a list.

Parameters:

document (Document) – The Stanza Document object.

Returns:

A list of token objects.

Return type:

list[Token]

dstk.parameters.context.selection.unit.get_words(document: Document) list[Word][source]#

Extracts words from a Stanza Document, excluding punctuation marks.

Parameters:

document (Document) – The Stanza Document object.

Returns:

A list of word objects.

Return type:

list[Word]
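The three extraction helpers above can be pictured with a small self-contained model. A Stanza Document exposes `sentences`, and each sentence holds `tokens` that in turn hold `words`; the classes below imitate only that nesting. All `Fake*` classes and `*_sketch` functions are hypothetical, and both the shallow copying and the `PUNCT` test are assumptions about how dstk might behave.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class FakeWord:
    text: str
    upos: str


@dataclass
class FakeToken:
    text: str
    words: list[FakeWord] = field(default_factory=list)


@dataclass
class FakeSentence:
    tokens: list[FakeToken] = field(default_factory=list)


@dataclass
class FakeDocument:
    sentences: list[FakeSentence] = field(default_factory=list)


def get_sentences_sketch(document: FakeDocument) -> list[FakeSentence]:
    # Return copies so later edits do not touch the document itself.
    return [copy.copy(sentence) for sentence in document.sentences]


def get_tokens_sketch(document: FakeDocument) -> list[FakeToken]:
    # Flatten sentence -> token, keeping punctuation tokens.
    return [copy.copy(t) for s in document.sentences for t in s.tokens]


def get_words_sketch(document: FakeDocument) -> list[FakeWord]:
    # Flatten sentence -> token -> word, skipping punctuation (assumed: upos "PUNCT").
    return [
        copy.copy(w)
        for s in document.sentences
        for t in s.tokens
        for w in t.words
        if w.upos != "PUNCT"
    ]


def make_token(text: str, upos: str) -> FakeToken:
    # Helper for the demo: one token wrapping exactly one word.
    return FakeToken(text, [FakeWord(text, upos)])


doc = FakeDocument([
    FakeSentence([make_token("Hello", "INTJ"), make_token(",", "PUNCT"), make_token("world", "NOUN")]),
    FakeSentence([make_token("Bye", "INTJ"), make_token(".", "PUNCT")]),
])
token_texts = [t.text for t in get_tokens_sketch(doc)]
word_texts = [w.text for w in get_words_sketch(doc)]
```

Here `get_tokens_sketch` keeps the comma and full stop while `get_words_sketch` drops them, which matches the distinction the module draws between tokens and words.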

Module contents#