
Ngrams

NGramCounter

NGramCounter()

A stateful n-gram counter that accumulates counts across multiple sequences.

Source code in amads/algorithms/ngrams.py
def __init__(self):
    """Initialize an empty n-gram counter."""
    self.ngram_counts = {}  # maps each n-gram tuple to its count

Attributes

yules_k property

yules_k: float

Calculate Yule's K statistic [1] for the n-gram counts.

Yule's K is a measure of the rate at which tokens are repeated in a sequence. It is calculated according to the formula: $$ K = \frac{1}{|n|} \cdot 1000 \cdot \frac{\sum_m V(m, N)\, m^2 - N}{N^2} $$ where:

  • $|n|$ is the number of different n-gram lengths
  • $V(m, N)$ is the number of types with frequency $m$ in a sample of $N$ tokens
  • $N$ is the total number of tokens in the sequence
  • $m$ indexes the frequency classes in the frequency distribution

Interpretation:

  • Higher K values indicate more repetitive sequences
  • Lower K values indicate more diverse sequences with less repetition
  • K is scaled by 1000 to make values more readable

[1] Yule, G. U. 1944. The Statistical Study of Literary Vocabulary. Cambridge University Press.

Returns:

  • float

    Mean Yule's K statistic across all n-gram lengths.

Raises:

  • ValueError

    If the input is empty (no n-grams have been counted).
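For a single n-gram length (so $|n| = 1$), the computation can be sketched from a plain counts dictionary. `yules_k_single` below is a hypothetical standalone helper for illustration, not the library API:

```python
from collections import Counter

def yules_k_single(ngram_counts: dict) -> float:
    """Yule's K for one n-gram length, computed from a counts dict."""
    if not ngram_counts:
        raise ValueError("empty input")
    N = sum(ngram_counts.values())              # total number of tokens
    freq_classes = Counter(ngram_counts.values())  # m -> V(m, N)
    s = sum(v * m * m for m, v in freq_classes.items())
    return 1000 * (s - N) / (N * N)

# The trigram counts from the tresillo example below
counts = {('3', '3', '2'): 10, ('3', '2', '3'): 9, ('2', '3', '3'): 9}
print(yules_k_single(counts))
```

Here $V(10, 28) = 1$ and $V(9, 28) = 2$, so $K = 1000 \times (262 - 28) / 28^2 \approx 298.5$.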

simpsons_d property

simpsons_d: float

Compute mean Simpson's D [1] diversity index over n-grams.

This is closely related mathematically to the definition of Yule's K. Simpson's D is a measure of diversity in a sequence: $$ D = \frac{\sum_i n_i (n_i - 1)}{N (N - 1)} $$ where:

  • $n_i$ is the frequency of the $i$-th type,
  • $N$ is the total number of tokens in the sequence, and
  • $D$ is the resulting Simpson's D diversity index.

Interpretation:

  • Higher D values indicate more repetitive sequences where tokens are often repeated
  • Lower D values indicate more diverse sequences with many different tokens

[1] Simpson, E. H. 1949. Measurement of diversity. Nature, 163:688.

Returns:

  • float

    Mean Simpson's D value across n-gram lengths.

Raises:

  • ValueError

    If the input is empty (no n-grams have been counted).
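For a single n-gram length this can be sketched as follows, following Simpson's (1949) definition as the probability that two tokens drawn at random without replacement belong to the same type (`simpsons_d_single` is a hypothetical helper, not the library API):

```python
def simpsons_d_single(ngram_counts: dict) -> float:
    """Simpson's D for one n-gram length."""
    N = sum(ngram_counts.values())
    if N < 2:
        raise ValueError("empty input")
    # Probability that two tokens drawn without replacement share a type
    return sum(n_i * (n_i - 1) for n_i in ngram_counts.values()) / (N * (N - 1))

counts = {('3', '3', '2'): 10, ('3', '2', '3'): 9, ('2', '3', '3'): 9}
print(simpsons_d_single(counts))  # 234 / 756 ≈ 0.3095
```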

sichels_s property

sichels_s: float

Compute Sichel's S statistic over n-grams.

Sichel's S is a measure of the proportion of token types that occur exactly twice in a sequence. It is defined as: $$ S = \frac{V(2, N)}{|n| \cdot V(N)} $$ where:

  • $V(2,N)$ is the number of types that occur exactly twice in the sequence
  • $V(N)$ is the total number of types in the sequence
  • $|n|$ is the number of n-gram lengths considered

Interpretation:

  • Higher S values indicate more types occurring exactly twice
  • Lower S values indicate fewer types occurring exactly twice

Returns:

  • float

    Mean Sichel's S value across n-gram lengths.

Raises:

  • ValueError

    If the input is empty (no n-grams have been counted).
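For one n-gram length the statistic reduces to the share of dis legomena (types seen exactly twice) among all types; `sichels_s_single` is a hypothetical helper for illustration, not the library API:

```python
def sichels_s_single(ngram_counts: dict) -> float:
    """Sichel's S for one n-gram length (|n| = 1)."""
    if not ngram_counts:
        raise ValueError("empty input")
    # V(2, N): number of types occurring exactly twice
    dis_legomena = sum(1 for count in ngram_counts.values() if count == 2)
    # len(ngram_counts) is V(N), the total number of types
    return dis_legomena / len(ngram_counts)

print(sichels_s_single({('a',): 2, ('b',): 1, ('c',): 2}))  # 2 of 3 types occur twice
```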

mean_entropy property

mean_entropy: float

Compute entropy of n-gram distribution.

For each n-gram length $n$, the Shannon entropy of the n-gram distribution is calculated and divided by the maximum entropy for that length ($\log_2 N$). The mean is then taken over these relative entropy values across all lengths.

The Shannon entropy is defined as: $$ H = -\sum_{i=1}^{N} p(x_i) \log_2 p(x_i) $$ where:

  • $H$ is the Shannon entropy
  • $N$ is the total number of tokens in the sequence
  • $p(x_i)$ is the probability of the $i$-th type

Interpretation:

  • Higher entropy indicates more random/unpredictable sequences
  • Lower entropy indicates more predictable/structured sequences

Returns:

  • float

    Mean Shannon entropy value across n-gram lengths.

Raises:

  • ValueError

    If the input is empty (no n-grams have been counted).
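For a single n-gram length, the normalised (relative) entropy described above can be sketched as follows; `relative_entropy_single` is a hypothetical helper, not the library API:

```python
import math

def relative_entropy_single(ngram_counts: dict) -> float:
    """Shannon entropy of one n-gram distribution, divided by log2 N."""
    N = sum(ngram_counts.values())
    if N < 2:
        raise ValueError("empty input")
    probs = [count / N for count in ngram_counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs)
    return entropy / math.log2(N)  # normalise by the maximum entropy

print(relative_entropy_single({('a',): 1, ('b',): 1}))  # maximally diverse -> 1.0
```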

mean_productivity property

mean_productivity: float

Compute mean productivity of n-gram distribution.

Mean productivity is defined as the mean, across n-gram lengths, of the number of types occurring only once divided by the total number of tokens. Types that occur only once in a sequence are known as hapax legomena. $$ \text{mean productivity} = \frac{1}{|n|} \sum_n \frac{V_1(N)}{N} $$ where:

  • $V_1(N)$ is the number of types occurring exactly once
  • $N$ is the total number of tokens for that n-gram length
  • $|n|$ is the number of n-gram lengths

Interpretation:

  • Higher productivity indicates more diverse/generative sequences with many unique tokens
  • Lower productivity indicates more repetitive sequences with fewer unique tokens

Returns:

  • float

    Mean productivity value across n-gram lengths.

Raises:

  • ValueError

    If the input is empty (no n-grams have been counted).
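A sketch for a single n-gram length, where the statistic is simply the hapax proportion (`productivity_single` is a hypothetical helper, not the library API):

```python
def productivity_single(ngram_counts: dict) -> float:
    """Proportion of hapax legomena (types seen exactly once) among all tokens."""
    N = sum(ngram_counts.values())
    if N == 0:
        raise ValueError("empty input")
    hapax = sum(1 for count in ngram_counts.values() if count == 1)  # V1(N)
    return hapax / N

print(productivity_single({('a',): 1, ('b',): 2}))  # 1 hapax out of 3 tokens
```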

Functions

count_ngrams

count_ngrams(tokens: list, n: Union[int, list, None] = None) -> None

Update n-gram counts from a sequence of tokens.

Parameters:

  • tokens (list) –

    List of tokens to process

  • n (int | list | None, default: None ) –

    If int, count n-grams of that specific length. If list, count n-grams of the specified lengths. If None, count n-grams of all possible lengths.

Examples:

>>> tresillo = [3, 3, 2]
>>> ten_tresillo = tresillo * 10
>>> ngc = NGramCounter()
>>> ngc.count_ngrams(tokens=ten_tresillo, n=3)
>>> ngc.ngram_counts
{('3', '3', '2'): 10, ('3', '2', '3'): 9, ('2', '3', '3'): 9}
Source code in amads/algorithms/ngrams.py
def count_ngrams(
    self, tokens: list, n: Union[int, list, None] = None
) -> None:
    """Update n-gram counts from a sequence of tokens.

    Parameters
    ----------
    tokens : list
        List of tokens to process
    n : int | list | None
        If int, count n-grams of that specific length.
        If list, count n-grams of the specified lengths.
        If None, count n-grams of all possible lengths.

    Examples
    --------
    >>> tresillo = [3, 3, 2]
    >>> ten_tresillo = tresillo * 10
    >>> ngc = NGramCounter()
    >>> ngc.count_ngrams(tokens=ten_tresillo, n=3)
    >>> ngc.ngram_counts
    {('3', '3', '2'): 10, ('3', '2', '3'): 9, ('2', '3', '3'): 9}

    """
    # Determine n-gram lengths to count
    if n is None:
        n_values = range(1, len(tokens) + 1)
    elif isinstance(n, int):
        if n < 1:
            raise ValueError(f"n-gram length {n} is less than 1")
        if n > len(tokens):
            raise ValueError(
                f"n-gram length {n} is larger than token sequence "
                f"length {len(tokens)}"
            )
        n_values = [n]
    else:
        # n is a list
        for val in n:
            if not isinstance(val, int):
                raise TypeError(
                    f"n-gram lengths must be integers, got {type(val)}"
                )
            if val < 1:
                raise ValueError(f"n-gram length {val} is less than 1")
            if val > len(tokens):
                raise ValueError(
                    f"n-gram length {val} is larger than token sequence "
                    f"length {len(tokens)}"
                )
        n_values = n

    # Count n-grams and update the counter
    for length in n_values:
        for i in range(len(tokens) - length + 1):
            # Create a hashable n-gram (tokens converted to strings)
            ngram = tuple(str(token) for token in tokens[i : i + length])
            # Update the count in the dictionary
            self.ngram_counts[ngram] = self.ngram_counts.get(ngram, 0) + 1
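Because the counter is stateful, successive calls accumulate counts across sequences. The behaviour can be sketched with a simplified standalone stand-in (`count_ngrams_into` is hypothetical, not the library API):

```python
def count_ngrams_into(counts: dict, tokens: list, n: int) -> None:
    """Simplified stand-in: add n-gram counts for one sequence into `counts`."""
    for i in range(len(tokens) - n + 1):
        ngram = tuple(str(token) for token in tokens[i : i + n])
        counts[ngram] = counts.get(ngram, 0) + 1

counts = {}
count_ngrams_into(counts, [3, 3, 2], n=2)  # first sequence
count_ngrams_into(counts, [3, 2, 3], n=2)  # second sequence accumulates
print(counts)
# → {('3', '3'): 1, ('3', '2'): 2, ('2', '3'): 1}
```

Note that the shared bigram `('3', '2')` ends up with a count of 2, combining both sequences.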

get_counts

get_counts(n: Optional[int] = None) -> Dict

Get the current n-gram counts.

Parameters:

  • n (int | None, default: None ) –

    If provided, only return counts for n-grams of this length. If None, return counts for all n-gram lengths.

Returns:

  • dict

    Dictionary mapping each n-gram to its count

Source code in amads/algorithms/ngrams.py
def get_counts(self, n: Optional[int] = None) -> Dict:
    """Get the current n-gram counts.

    Parameters
    ----------
    n : int, optional
        If provided, only return counts for n-grams of this length.
        If None, return counts for all n-gram lengths.

    Returns
    -------
    dict
        Dictionary mapping each n-gram to its count
    """
    if n is None:
        return self.ngram_counts.copy()
    return {k: v for k, v in self.ngram_counts.items() if len(k) == n}