dstk.parameters.context.extraction.linguistic.word package
Submodules
dstk.parameters.context.extraction.linguistic.word.window module
This module provides tools for extracting contexts based on linguistic units by identifying window-based collocates. Within the framework of distributional semantics, it focuses on lexeme contexts: the words that appear in close proximity to a target word within a defined window.
Core functionalities include:
Extracting collocates from specified left and right windows around a target word.
Filtering context windows based on part-of-speech (POS) tags to refine linguistic data.
Generating directed bigrams, which record whether a context word occurs to the left ("L") or to the right ("R") of the target word.
Extracting undirected bigrams where only physical proximity is considered.
Generating n-grams from sequences of lexical items for fixed-length proximity analysis.
The module is specifically tailored for Stanza Word objects and sequences of Lexical Items.
- dstk.parameters.context.extraction.linguistic.word.window.extract_collocates(words: list[Word], target_word: str, window_size: tuple[int, int], allowed_pos: set[str] | None = None) → list[tuple[Word, ...]]
Extracts context words around a target word as flat tuples.
- Parameters:
words (list[Word]) – A list of Stanza Word objects.
target_word (str) – The word to find within the list.
window_size (tuple[int, int]) – A tuple representing the left and right window sizes.
allowed_pos (set[str] | None) – Optional set of POS tags to filter context words, defaults to None.
- Returns:
A list of word tuples matching the window constraints.
- Return type:
list[Collocates]
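The windowing and POS filtering described above can be illustrated with a minimal sketch. This is not the library's implementation: the `Word` class is a hypothetical stand-in for a Stanza Word (only `text` and `upos` are modelled), `window_collocates` is an illustrative name, and matching the target on its surface form and returning one tuple per occurrence are both assumptions.

```python
from typing import NamedTuple, Optional


class Word(NamedTuple):
    # Hypothetical stand-in for a Stanza Word; only `text` and `upos` are modelled.
    text: str
    upos: str


def window_collocates(
    words: list[Word],
    target_word: str,
    window_size: tuple[int, int],
    allowed_pos: Optional[set[str]] = None,
) -> list[tuple[Word, ...]]:
    """Collect the words inside a (left, right) window around each target occurrence."""
    left, right = window_size
    results = []
    for i, word in enumerate(words):
        if word.text != target_word:  # assumption: match on the surface form
            continue
        # Left context (clipped at the start of the list) plus right context.
        context = words[max(0, i - left):i] + words[i + 1:i + 1 + right]
        if allowed_pos is not None:
            context = [w for w in context if w.upos in allowed_pos]
        results.append(tuple(context))  # assumption: one flat tuple per occurrence
    return results


sentence = [
    Word("the", "DET"), Word("quick", "ADJ"), Word("brown", "ADJ"),
    Word("fox", "NOUN"), Word("jumps", "VERB"), Word("over", "ADP"),
]
all_context = window_collocates(sentence, "fox", (2, 1))
adjectives_only = window_collocates(sentence, "fox", (2, 1), allowed_pos={"ADJ"})
```

With a window of `(2, 1)`, the sketch keeps two words to the left of "fox" and one to the right; passing `allowed_pos={"ADJ"}` then drops the verb from that context.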
- dstk.parameters.context.extraction.linguistic.word.window.extract_directed_bigrams(words: list[Word], target_word: str, window_size: tuple[int, int], allowed_pos: set[str] | None = None) → list[tuple[Word, tuple[str, str]]]
Extracts directed bigrams (tagged with context direction) around a target word.
Collects bigrams in the form:
Left bigrams: (context_word, ("L", target_word))
Right bigrams: (context_word, ("R", target_word))
- Parameters:
words (list[Word]) – A list of Stanza Word objects.
target_word (str) – The word to search for.
window_size (tuple[int, int]) – A tuple representing the left and right window sizes.
allowed_pos (set[str] | None) – Optional set of POS tags to filter context words, defaults to None.
- Returns:
A list of directed collocate tuples.
- Return type:
list[DirectedCollocates]
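To make the `(context_word, ("L" | "R", target_word))` shape concrete, here is a minimal sketch under stated assumptions. The `Word` class is a hypothetical stand-in for a Stanza Word, `directed_bigrams` is an illustrative name, and the ordering of the output (left context first, then right) is an assumption rather than documented behaviour.

```python
from typing import NamedTuple, Optional


class Word(NamedTuple):
    # Hypothetical stand-in for a Stanza Word; only `text` and `upos` are modelled.
    text: str
    upos: str


def directed_bigrams(
    words: list[Word],
    target_word: str,
    window_size: tuple[int, int],
    allowed_pos: Optional[set[str]] = None,
) -> list[tuple[Word, tuple[str, str]]]:
    """Pair each context word with a direction tag and the target word."""
    left, right = window_size
    bigrams = []
    for i, word in enumerate(words):
        if word.text != target_word:  # assumption: match on the surface form
            continue
        left_context = words[max(0, i - left):i]
        right_context = words[i + 1:i + 1 + right]
        for direction, context in (("L", left_context), ("R", right_context)):
            for context_word in context:
                if allowed_pos is None or context_word.upos in allowed_pos:
                    bigrams.append((context_word, (direction, target_word)))
    return bigrams


sentence = [
    Word("she", "PRON"), Word("drinks", "VERB"), Word("hot", "ADJ"),
    Word("coffee", "NOUN"), Word("slowly", "ADV"),
]
pairs = directed_bigrams(sentence, "coffee", (2, 1))
```

Keeping the direction inside the second element means that "hot" before "coffee" and "hot" after "coffee" would count as two different contexts, which is the point of a directed representation.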
- dstk.parameters.context.extraction.linguistic.word.window.extract_ngrams(words: Sequence[Token | Word], window_size: int, **kwargs) → list[tuple[Word, ...]]
Splits lexical items into groups of sequential n-grams.
- Parameters:
words (Sequence[LexicalItem]) – A sequence of Stanza Word or Token objects.
window_size (int) – The size of the n-gram window.
kwargs – Additional keyword arguments passed to nltk.util.ngrams (e.g., pad_left, pad_right). For more information, see: https://www.nltk.org/api/nltk.util.html#nltk.util.ngrams
- Returns:
A list of word tuples representing the consecutive n-grams.
- Return type:
list[Collocates]
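As a rough illustration of what "sequential n-grams" means here, the sketch below slides a fixed-size window over a sequence. It is written in plain Python to stay dependency-free and is only meant to mirror the unpadded behaviour of `nltk.util.ngrams`; the name `sliding_ngrams` is hypothetical, and padding options are not modelled.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def sliding_ngrams(items: Sequence[T], n: int) -> list[tuple[T, ...]]:
    """Return every run of n consecutive items, in order, without padding."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    items = list(items)
    # A sequence shorter than n yields no n-grams at all.
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


tokens = ["distributional", "semantics", "models", "meaning"]
bigrams = sliding_ngrams(tokens, 2)
trigrams = sliding_ngrams(tokens, 3)
```

A sequence of length k produces k - n + 1 n-grams without padding, so the four tokens above give three bigrams and two trigrams.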
- dstk.parameters.context.extraction.linguistic.word.window.extract_undirected_bigrams(words: list[Word], target_word: str, window_size: tuple[int, int], allowed_pos: set[str] | None = None) → list[Bigram]
Extracts undirected Bigram namedtuples surrounding a target word.
- Parameters:
words (list[Word]) – A list of Stanza Word objects.
target_word (str) – The word to search for.
window_size (tuple[int, int]) – A tuple representing the left and right window sizes.
allowed_pos (set[str] | None) – Optional set of POS tags to filter context words, defaults to None.
- Returns:
A list of Bigram objects containing the context word and the target word.
- Return type:
list[Bigram]
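The undirected case can be sketched in the same way, dropping the direction tag. Everything named here is an assumption for illustration: the `Word` class is a hypothetical stand-in for a Stanza Word, and the `Bigram` namedtuple below, with its `collocate` and `target` fields holding plain strings, is not necessarily how the library defines its own `Bigram`.

```python
from typing import NamedTuple, Optional


class Word(NamedTuple):
    # Hypothetical stand-in for a Stanza Word; only `text` and `upos` are modelled.
    text: str
    upos: str


class Bigram(NamedTuple):
    # Hypothetical field names; the library's Bigram may be defined differently.
    collocate: str
    target: str


def undirected_bigrams(
    words: list[Word],
    target_word: str,
    window_size: tuple[int, int],
    allowed_pos: Optional[set[str]] = None,
) -> list[Bigram]:
    """Pair each context word with the target, ignoring which side it came from."""
    left, right = window_size
    bigrams = []
    for i, word in enumerate(words):
        if word.text != target_word:  # assumption: match on the surface form
            continue
        context = words[max(0, i - left):i] + words[i + 1:i + 1 + right]
        for context_word in context:
            if allowed_pos is None or context_word.upos in allowed_pos:
                bigrams.append(Bigram(context_word.text, target_word))
    return bigrams


phrase = [
    Word("dark", "ADJ"), Word("roast", "NOUN"),
    Word("coffee", "NOUN"), Word("beans", "NOUN"),
]
noun_pairs = undirected_bigrams(phrase, "coffee", (2, 1), allowed_pos={"NOUN"})
```

Because only proximity is recorded, "roast" (left of the target) and "beans" (right of the target) end up in bigrams of identical shape.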