Ngrams
NGramCounter
NGramCounter()
A stateful n-gram counter that accumulates counts across multiple sequences.
Source code in amads/algorithms/ngrams.py
Attributes
yules_k
property
yules_k: float
Calculate Yule's K statistic [1] for the n-gram counts.
Yule's K is a measure of the rate at which tokens are repeated in a sequence. It is calculated according to the formula: $$ K = \frac{1}{|n|} \cdot 1000 \cdot \frac{\sum_m V(m, N) \, m^2 - N}{N^2} $$ where:
- $|n|$ is the number of different n-gram lengths
- $V(m,N)$ is the number of types occurring with frequency $m$ in a sample of $N$ tokens
- $N$ is the total number of tokens in the sequence
- $m$ is the index for the frequency class in the frequency distribution
Interpretation:
- Higher K values indicate more repetitive sequences
- Lower K values indicate more diverse sequences with less repetition
- K is scaled by 1000 to make values more readable
[1] Yule, G. U. 1944. The Statistical Study of Literary Vocabulary. Cambridge University Press.
Returns:
- float – Mean Yule's K statistic across all n-gram lengths.
Raises:
- ValueError – for empty input.
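For a single n-gram length ($|n| = 1$), the formula above can be sketched as a standalone function built on a frequency spectrum. This is an illustrative sketch, not the library's implementation, and the helper name `yules_k` is hypothetical:

```python
from collections import Counter

def yules_k(tokens):
    """Yule's K for a single n-gram length (|n| = 1), scaled by 1000."""
    counts = Counter(tokens)                  # frequency of each type
    N = sum(counts.values())                  # total number of tokens
    spectrum = Counter(counts.values())       # m -> V(m, N)
    s = sum(V * m * m for m, V in spectrum.items())
    return 1000 * (s - N) / (N * N)
```

A fully repetitive sequence such as `['a', 'a', 'a', 'a']` yields a higher K than a fully diverse one such as `['a', 'b', 'c', 'd']`, matching the interpretation above.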
simpsons_d
property
simpsons_d: float
Compute mean Simpson's D [1] diversity index over n-grams.
This is closely related mathematically to the definition of Yule's K. Simpson's D measures the probability that two tokens drawn at random without replacement belong to the same type: $$ D = \frac{\sum_i n_i (n_i - 1)}{N (N - 1)} $$ where:
- $n_i$ is the frequency of the i-th type,
- $N$ is the total number of tokens in the sequence, and
- $D$ is the Simpson's D diversity index.
Interpretation:
- Higher D values indicate more repetitive sequences where tokens are often repeated
- Lower D values indicate more diverse sequences with many different tokens
[1] Simpson, E. H. 1949. Measurement of diversity. Nature, 163:688
Returns:
- float – Mean Simpson's D value across n-gram lengths.
Raises:
- ValueError – for empty input.
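For a single n-gram length, the formula reduces to a few lines over a frequency table. The following is a minimal sketch of the statistic, not the library's source; the name `simpsons_d` is hypothetical:

```python
from collections import Counter

def simpsons_d(tokens):
    """Simpson's D for one n-gram length: probability that two tokens
    drawn without replacement are of the same type."""
    counts = Counter(tokens)
    N = sum(counts.values())
    return sum(n_i * (n_i - 1) for n_i in counts.values()) / (N * (N - 1))
```

A sequence of identical tokens gives D = 1 (maximally repetitive), while a sequence of all-distinct tokens gives D = 0.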
sichels_s
property
sichels_s: float
Compute Sichel's S statistic over n-grams.
Sichel's S is a measure of the proportion of token types that occur exactly twice in a sequence. It is defined as: $$ S = \frac{V(2, N)}{|n| \cdot V(N)} $$ where:
- $V(2,N)$ is the number of types that occur exactly twice in the sequence
- $V(N)$ is the total number of types in the sequence
- $|n|$ is the number of n-gram lengths considered
Interpretation:
- Higher S values indicate more types occurring exactly twice
- Lower S values indicate fewer types occurring exactly twice
Returns:
- float – Mean Sichel's S value across n-gram lengths.
Raises:
- ValueError – for empty input.
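For a single n-gram length ($|n| = 1$), Sichel's S is simply the fraction of types whose count is exactly two. A standalone sketch (hypothetical helper name, not the library's implementation):

```python
from collections import Counter

def sichels_s(tokens):
    """Proportion of types occurring exactly twice (single n-gram length)."""
    counts = Counter(tokens)
    dis_legomena = sum(1 for c in counts.values() if c == 2)  # V(2, N)
    return dis_legomena / len(counts)                          # V(N) types
```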
mean_entropy
property
mean_entropy: float
Compute the mean normalized entropy of the n-gram distribution.
For each n-gram length n, the Shannon entropy of the n-gram distribution is calculated and divided by the maximum entropy for that length ($\log_2 N$). The mean is then taken over these relative entropy values across all lengths.
Shannon entropy is defined as: $$ H = -\sum_{i} p(x_i) \log_2 p(x_i) $$ where:
- $H$ is the Shannon entropy
- $N$ is the total number of tokens in the sequence
- $p(x_i)$ is the probability of the i-th type
Interpretation:
- Higher entropy indicates more random/unpredictable sequences
- Lower entropy indicates more predictable/structured sequences
Returns:
- float – Mean Shannon entropy value across n-gram lengths.
Raises:
- ValueError – for empty input.
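The entropy-and-normalization step for one n-gram length can be sketched as follows (an illustrative standalone function assuming $N > 1$; the name `relative_entropy` is hypothetical and this is not the library's source):

```python
import math
from collections import Counter

def relative_entropy(tokens):
    """Shannon entropy of the token distribution, normalized by the
    maximum entropy log2(N). Assumes more than one token."""
    counts = Counter(tokens)
    N = sum(counts.values())
    H = -sum((c / N) * math.log2(c / N) for c in counts.values())
    return H / math.log2(N)
```

For `['a', 'a', 'b', 'b']`, the entropy is 1 bit against a maximum of $\log_2 4 = 2$ bits, so the relative entropy is 0.5.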
mean_productivity
property
mean_productivity: float
Compute mean productivity of n-gram distribution.
Mean productivity is defined as the mean, across n-gram lengths, of the number of types occurring only once divided by the total number of tokens. The types occurring only once in a sequence are known as hapax legomena. $$ \text{mean productivity} = \frac{1}{|n|} \sum \frac{V_1(N)}{N} $$ where:
- $V1(N)$ is the number of types occurring once
- $|n|$ is the number of n-gram lengths
Interpretation:
- Higher productivity indicates more diverse/generative sequences with many unique tokens
- Lower productivity indicates more repetitive sequences with fewer unique tokens
Returns:
- float – Mean productivity value across n-gram lengths.
Raises:
- ValueError – for empty input.
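For a single n-gram length, productivity is the hapax count divided by the token count. A minimal sketch (hypothetical helper name, not the library's implementation):

```python
from collections import Counter

def productivity(tokens):
    """Proportion of tokens whose type occurs only once (hapax legomena)."""
    counts = Counter(tokens)
    N = sum(counts.values())
    hapax = sum(1 for c in counts.values() if c == 1)  # V1(N)
    return hapax / N
```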
Functions
count_ngrams
count_ngrams(tokens: list, n: Union[int, list, None] = None) -> None
Update n-gram counts from a sequence of tokens.
Parameters:
- tokens (list) – List of tokens to process.
- n (int | list | None, default: None) – If int, count n-grams of that specific length. If list, count n-grams of the specified lengths. If None, count n-grams of all possible lengths.
Examples:
>>> tresillo = [3, 3, 2]
>>> ten_tresillo = tresillo * 10
>>> ngc = NGramCounter()
>>> ngc.count_ngrams(tokens=ten_tresillo, n=3)
>>> ngc.ngram_counts
{('3', '3', '2'): 10, ('3', '2', '3'): 9, ('2', '3', '3'): 9}
Source code in amads/algorithms/ngrams.py
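The sliding-window extraction behind this kind of counting can be sketched in a few lines. This is an illustrative standalone helper, not the library's source (note that the example output above suggests the library converts tokens to strings before counting; this sketch leaves them as-is):

```python
from collections import Counter

def extract_ngrams(tokens, n):
    """All contiguous n-grams of length n, as tuples (sliding window)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Counting the 3-grams of ten repetitions of the tresillo pattern:
counts = Counter(extract_ngrams([3, 3, 2] * 10, 3))
```

The resulting counts (10, 9, 9 for the three rotations of the pattern) match the example above.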
get_counts
get_counts(n: Optional[int] = None) -> Dict
Get the current n-gram counts.
Parameters:
- n (int | None, default: None) – If provided, only return counts for n-grams of this length. If None, return counts for all n-gram lengths.
Returns:
- dict – Dictionary mapping each n-gram to its count.
Source code in amads/algorithms/ngrams.py
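Filtering a count dictionary by n-gram length, as this method does when `n` is given, amounts to a comprehension over the keys. A minimal sketch under the assumption that n-grams are stored as tuples (the helper name `filter_counts` is hypothetical):

```python
def filter_counts(ngram_counts, n=None):
    """Restrict an n-gram count dict to n-grams of a given length."""
    if n is None:
        return dict(ngram_counts)
    return {gram: c for gram, c in ngram_counts.items() if len(gram) == n}
```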