dstk.parameters.context.extraction.linguistic.word package

Submodules

dstk.parameters.context.extraction.linguistic.word.window module

This module provides tools for extracting contexts based on linguistic units by identifying window-based collocates. Within the framework of distributional semantics, it focuses on lexeme contexts: the words that appear close to a target word within a defined window.

Core functionalities include:

Extracting collocates from specified left and right windows around a target word.
Filtering context windows based on part-of-speech (POS) tags to refine linguistic data.
Generating directed bigrams, which specify the directional relationship between a context word and a target word (e.g., Left or Right).
Extracting undirected bigrams where only physical proximity is considered.
Generating n-grams from sequences of lexical items for fixed-length proximity analysis.

The module is specifically tailored for Stanza Word objects and sequences of Lexical Items.

dstk.parameters.context.extraction.linguistic.word.window.extract_collocates(words: list[Word], target_word: str, window_size: tuple[int, int], allowed_pos: set[str] | None = None) → list[tuple[Word, ...]]

Extracts context words around a target word as flat tuples.

Parameters:

words (list[Word]) – A list of Stanza Word objects.
target_word (str) – The word to find within the list.
window_size (tuple[int, int]) – A tuple representing the left and right window sizes.
allowed_pos (set[str] | None) – Optional set of POS tags to filter context words, defaults to None.

Returns:

A list of word tuples matching the window constraints.

Return type:

list[Collocates]
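The window and POS-filtering behaviour described above can be sketched in plain Python. This is a minimal illustration, not the dstk implementation: `Tok` is a hypothetical stand-in for a Stanza Word (only `text` and `upos`), `window_collocates` is a hypothetical name, and several details are assumptions (matching the target on surface text, applying the POS filter after the window is cut, and returning one tuple per occurrence of the target).

```python
from typing import NamedTuple


class Tok(NamedTuple):
    """Hypothetical stand-in for a Stanza Word (text and universal POS only)."""
    text: str
    upos: str


def window_collocates(words, target_word, window_size, allowed_pos=None):
    """Collect the context words around each occurrence of target_word."""
    left, right = window_size
    results = []
    for i, word in enumerate(words):
        if word.text != target_word:  # assumption: match on surface text
            continue
        # Cut the left and right windows, clamping at the start of the list.
        context = words[max(0, i - left):i] + words[i + 1:i + 1 + right]
        if allowed_pos is not None:
            # Assumption: the POS filter is applied after windowing.
            context = [w for w in context if w.upos in allowed_pos]
        if context:
            results.append(tuple(context))
    return results


sentence = [
    Tok("the", "DET"), Tok("old", "ADJ"), Tok("dog", "NOUN"),
    Tok("chased", "VERB"), Tok("a", "DET"), Tok("cat", "NOUN"),
]
collocates = window_collocates(sentence, "dog", (2, 2))
filtered = window_collocates(sentence, "dog", (2, 2), allowed_pos={"ADJ", "VERB"})
```

With a window of `(2, 2)`, the sketch gathers two words on each side of "dog"; passing `allowed_pos` narrows that window to adjectives and verbs only.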

dstk.parameters.context.extraction.linguistic.word.window.extract_directed_bigrams(words: list[Word], target_word: str, window_size: tuple[int, int], allowed_pos: set[str] | None = None) → list[tuple[Word, tuple[str, str]]]

Extracts directed bigrams (tagged with context direction) around a target word.

Collects bigrams in the form:

Left bigrams: (context_word, ("L", target_word))
Right bigrams: (context_word, ("R", target_word))

Parameters:

words (list[Word]) – A list of Stanza Word objects.
target_word (str) – The word to search for.
window_size (tuple[int, int]) – A tuple representing the left and right window sizes.
allowed_pos (set[str] | None) – Optional set of POS tags to filter context words, defaults to None.

Returns:

A list of directed collocate tuples.

Return type:

list[DirectedCollocates]
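To make the directed bigram shape concrete, here is a rough sketch over plain strings rather than Stanza Word objects. The function name `directed_bigrams` is hypothetical, POS filtering is left out, and the reading that "L" marks a context word found in the left window (and "R" one found in the right window) is an assumption drawn from the description above, not from the dstk source.

```python
def directed_bigrams(tokens, target_word, window_size):
    """Pair each context word with a direction tag and the target word."""
    left, right = window_size
    bigrams = []
    for i, token in enumerate(tokens):
        if token != target_word:
            continue
        # Context words taken from the left window are tagged "L".
        for ctx in tokens[max(0, i - left):i]:
            bigrams.append((ctx, ("L", target_word)))
        # Context words taken from the right window are tagged "R".
        for ctx in tokens[i + 1:i + 1 + right]:
            bigrams.append((ctx, ("R", target_word)))
    return bigrams


tokens = ["she", "drinks", "strong", "coffee", "every", "morning"]
pairs = directed_bigrams(tokens, "coffee", (2, 1))
```

Keeping the direction inside the second element means the same context word yields distinct entries depending on which side of the target it occurred.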

dstk.parameters.context.extraction.linguistic.word.window.extract_ngrams(words: Sequence[Token | Word], window_size: int, **kwargs) → list[tuple[Word, ...]]

Splits lexical items into groups of sequential n-grams.

Parameters:

words (Sequence[LexicalItem]) – A sequence of Stanza Word or Token objects.
window_size (int) – The size of the n-gram window.
kwargs – Additional keyword arguments passed to nltk.util.ngrams (e.g., pad_left, pad_right). For more information, see https://www.nltk.org/api/nltk.util.html#nltk.util.ngrams

Returns:

A list of word tuples representing the consecutive n-grams.

Return type:

list[Collocates]
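As an illustration of what sequential n-grams look like, the sketch below builds them with a plain sliding window. It does not call NLTK or dstk; `sliding_ngrams` is a hypothetical helper that mirrors only the unpadded case (no pad_left or pad_right), and it works on strings rather than Stanza objects.

```python
def sliding_ngrams(items, n):
    """Return every run of n consecutive items as a tuple, without padding."""
    if n < 1:
        raise ValueError("n must be at least 1")
    items = list(items)
    # A sequence shorter than n produces an empty range, hence no n-grams.
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


tokens = ["to", "be", "or", "not"]
trigrams = sliding_ngrams(tokens, 3)
```

Each tuple overlaps the previous one by `n - 1` items, so a sequence of length `m` produces `m - n + 1` n-grams when no padding is used.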

dstk.parameters.context.extraction.linguistic.word.window.extract_undirected_bigrams(words: list[Word], target_word: str, window_size: tuple[int, int], allowed_pos: set[str] | None = None) → list[Bigram]

Extracts undirected Bigram namedtuples surrounding a target word.

Parameters:

words (list[Word]) – A list of Stanza Word objects.
target_word (str) – The word to search for.
window_size (tuple[int, int]) – A tuple representing the left and right window sizes.
allowed_pos (set[str] | None) – Optional set of POS tags to filter context words, defaults to None.

Returns:

A list of Bigram objects containing the context word and the target word.

Return type:

list[Bigram]
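For comparison with the directed variant, an undirected bigram can be sketched as a simple namedtuple pairing. This is an assumption-laden illustration: the `Bigram` defined here is a local stand-in whose field names (`collocate`, `target`) are hypothetical and may differ from dstk's own Bigram, and the input is plain strings with no POS filtering.

```python
from collections import namedtuple

# Local stand-in; the field names are hypothetical, not taken from dstk.
Bigram = namedtuple("Bigram", ["collocate", "target"])


def undirected_bigrams(tokens, target_word, window_size):
    """Pair every context word in the window with the target, ignoring side."""
    left, right = window_size
    bigrams = []
    for i, token in enumerate(tokens):
        if token != target_word:
            continue
        # Merge both windows: only proximity matters, not direction.
        context = tokens[max(0, i - left):i] + tokens[i + 1:i + 1 + right]
        bigrams.extend(Bigram(ctx, target_word) for ctx in context)
    return bigrams


tokens = ["she", "drinks", "strong", "coffee", "every", "morning"]
pairs = undirected_bigrams(tokens, "coffee", (1, 1))
```

Because no direction tag is stored, a word seen once to the left and once to the right of the target produces two identical entries, which is convenient when counting co-occurrences.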
