Ngrams
NGramCounter
NGramCounter()
A stateful n-gram counter that accumulates counts across multiple sequences.
Source code in amads/algorithms/ngrams.py
Attributes
yules_k
property
yules_k: float
Calculate Yule's K statistic [1] for the n-gram counts.
Yule's K is a measure of the rate at which tokens are repeated in a sequence. It is calculated according to the formula: $$ K = \frac{1}{|n|} \cdot 1000 \cdot \frac{\sum_m V(m, N) \, m^2 - N}{N^2} $$ where:
- $|n|$ is the number of different n-gram lengths
- $V(m,N)$ is the number of types occurring with frequency $m$ in a sample of $N$ tokens
- $N$ is the total number of tokens in the sequence
- $m$ is the index for the frequency class in the frequency distribution
Interpretation:
- Higher K values indicate more repetitive sequences
- Lower K values indicate more diverse sequences with less repetition
- K is scaled by 1000 to make values more readable
[1] Yule, G. U. 1944. The Statistical Study of Literary Vocabulary. Cambridge University Press.
Returns:
- float – Mean Yule's K statistic across all n-gram lengths.
Raises:
- ValueError – for empty input.
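For a single n-gram length ($|n| = 1$), the formula above can be sketched as a standalone function built on a frequency spectrum. This is an illustrative sketch, not the library's implementation, and the helper name `yules_k` is hypothetical:

```python
from collections import Counter

def yules_k(tokens):
    """Yule's K for a single n-gram length (|n| = 1), scaled by 1000."""
    counts = Counter(tokens)                  # frequency of each type
    N = sum(counts.values())                  # total number of tokens
    spectrum = Counter(counts.values())       # m -> V(m, N)
    s = sum(V * m * m for m, V in spectrum.items())
    return 1000 * (s - N) / (N * N)
```

A fully repetitive sequence such as `['a', 'a', 'a', 'a']` yields a higher K than a fully diverse one such as `['a', 'b', 'c', 'd']`, matching the interpretation above.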
simpsons_d
property
simpsons_d: float
Compute mean Simpson's D [1] diversity index over n-grams.
This is closely related mathematically to the definition of Yule's K. Simpson's D measures the probability that two tokens drawn at random without replacement belong to the same type: $$ D = \frac{\sum_i n_i (n_i - 1)}{N (N - 1)} $$ where:
- $n_i$ is the frequency of the i-th type,
- $N$ is the total number of tokens in the sequence, and
- $D$ is the Simpson's D diversity index.
Interpretation:
- Higher D values indicate more repetitive sequences where tokens are often repeated
- Lower D values indicate more diverse sequences with many different tokens
[1] Simpson, E. H. 1949. Measurement of diversity. Nature, 163:688
Returns:
- float – Mean Simpson's D value across n-gram lengths.
Raises:
- ValueError – for empty input.
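For a single n-gram length, the formula reduces to a few lines over a frequency table. The following is a minimal sketch of the statistic, not the library's source; the name `simpsons_d` is hypothetical:

```python
from collections import Counter

def simpsons_d(tokens):
    """Simpson's D for one n-gram length: probability that two tokens
    drawn without replacement are of the same type."""
    counts = Counter(tokens)
    N = sum(counts.values())
    return sum(n_i * (n_i - 1) for n_i in counts.values()) / (N * (N - 1))
```

A sequence of identical tokens gives D = 1 (maximally repetitive), while a sequence of all-distinct tokens gives D = 0.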
sichels_s
property
sichels_s: float
Compute Sichel's S statistic over n-grams.
Sichel's S is a measure of the proportion of token types that occur exactly twice in a sequence. It is defined as: $$ S = \frac{V(2, N)}{|n| \cdot V(N)} $$ where:
- $V(2,N)$ is the number of types that occur exactly twice in the sequence
- $V(N)$ is the total number of types in the sequence
- $|n|$ is the number of n-gram lengths considered
Interpretation:
- Higher S values indicate more types occurring exactly twice
- Lower S values indicate fewer types occurring exactly twice
Returns:
- float – Mean Sichel's S value across n-gram lengths.
Raises:
- ValueError – for empty input.
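For a single n-gram length ($|n| = 1$), Sichel's S is simply the fraction of types whose count is exactly two. A standalone sketch (hypothetical helper name, not the library's implementation):

```python
from collections import Counter

def sichels_s(tokens):
    """Proportion of types occurring exactly twice (single n-gram length)."""
    counts = Counter(tokens)
    dis_legomena = sum(1 for c in counts.values() if c == 2)  # V(2, N)
    return dis_legomena / len(counts)                          # V(N) types
```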
mean_entropy
property
mean_entropy: float
Compute the mean normalized entropy of the n-gram distribution.
For each n-gram length n, the Shannon entropy of the n-gram distribution is calculated and divided by the maximum entropy for that length ($\log_2 N$). The mean is then taken over these relative entropy values across all lengths.
Shannon entropy is defined as: $$ H = -\sum_{i} p(x_i) \log_2 p(x_i) $$ where:
- $H$ is the Shannon entropy
- $N$ is the total number of tokens in the sequence
- $p(x_i)$ is the probability of the i-th type
Interpretation:
- Higher entropy indicates more random/unpredictable sequences
- Lower entropy indicates more predictable/structured sequences
Returns:
- float – Mean Shannon entropy value across n-gram lengths.
Raises:
- ValueError – for empty input.
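The entropy-and-normalization step for one n-gram length can be sketched as follows (an illustrative standalone function assuming $N > 1$; the name `relative_entropy` is hypothetical and this is not the library's source):

```python
import math
from collections import Counter

def relative_entropy(tokens):
    """Shannon entropy of the token distribution, normalized by the
    maximum entropy log2(N). Assumes more than one token."""
    counts = Counter(tokens)
    N = sum(counts.values())
    H = -sum((c / N) * math.log2(c / N) for c in counts.values())
    return H / math.log2(N)
```

For `['a', 'a', 'b', 'b']`, the entropy is 1 bit against a maximum of $\log_2 4 = 2$ bits, so the relative entropy is 0.5.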
mean_productivity
property
mean_productivity: float
Compute mean productivity of n-gram distribution.
Mean productivity is defined as the mean, across n-gram lengths, of the number of types occurring only once divided by the total number of tokens. The types occurring only once in a sequence are known as hapax legomena. $$ \text{mean productivity} = \frac{1}{|n|} \sum \frac{V_1(N)}{N} $$ where:
- $V1(N)$ is the number of types occurring once
- $|n|$ is the number of n-gram lengths
Interpretation:
- Higher productivity indicates more diverse/generative sequences with many unique tokens
- Lower productivity indicates more repetitive sequences with fewer unique tokens
Returns:
- float – Mean productivity value across n-gram lengths.
Raises:
- ValueError – for empty input.
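For a single n-gram length, productivity is the hapax count divided by the token count. A minimal sketch (hypothetical helper name, not the library's implementation):

```python
from collections import Counter

def productivity(tokens):
    """Proportion of tokens whose type occurs only once (hapax legomena)."""
    counts = Counter(tokens)
    N = sum(counts.values())
    hapax = sum(1 for c in counts.values() if c == 1)  # V1(N)
    return hapax / N
```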
Functions
count_ngrams
count_ngrams(tokens: list, n: Union[int, list, None] = None) -> None
Update n-gram counts from a sequence of tokens.
Parameters:
- tokens (list) – List of tokens to process.
- n (int | list | None, default: None) – If int, count n-grams of that specific length. If list, count n-grams of the specified lengths. If None, count n-grams of all possible lengths.
Examples:
>>> tresillo = [3, 3, 2]
>>> ten_tresillo = tresillo * 10
>>> ngc = NGramCounter()
>>> ngc.count_ngrams(tokens=ten_tresillo, n=3)
>>> ngc.ngram_counts
{('3', '3', '2'): 10, ('3', '2', '3'): 9, ('2', '3', '3'): 9}
Source code in amads/algorithms/ngrams.py
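The sliding-window extraction behind this kind of counting can be sketched in a few lines. This is an illustrative standalone helper, not the library's source (note that the example output above suggests the library converts tokens to strings before counting; this sketch leaves them as-is):

```python
from collections import Counter

def extract_ngrams(tokens, n):
    """All contiguous n-grams of length n, as tuples (sliding window)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Counting the 3-grams of ten repetitions of the tresillo pattern:
counts = Counter(extract_ngrams([3, 3, 2] * 10, 3))
```

The resulting counts (10, 9, 9 for the three rotations of the pattern) match the example above.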
get_counts
get_counts(n: Optional[int] = None) -> Dict
Get the current n-gram counts.
Parameters:
- n (int | None, default: None) – If provided, only return counts for n-grams of this length. If None, return counts for all n-gram lengths.
Returns:
- dict – Dictionary mapping each n-gram to its count.
Source code in amads/algorithms/ngrams.py
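Filtering a count dictionary by n-gram length, as this method does when `n` is given, amounts to a comprehension over the keys. A minimal sketch under the assumption that n-grams are stored as tuples (the helper name `filter_counts` is hypothetical):

```python
def filter_counts(ngram_counts, n=None):
    """Restrict an n-gram count dict to n-grams of a given length."""
    if n is None:
        return dict(ngram_counts)
    return {gram: c for gram, c in ngram_counts.items() if len(gram) == n}
```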