Skip to content

Complexity

complexity

This module provides functionality for measuring the complexity of discrete sequences using the LZ77 compression algorithm.

Author: Huw Cheston (2025)

References
  • Ziv, J., & Lempel, A. (1977). A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3), 337–343. https://doi.org/10.1109/TIT.1977.1055714

  • Cheston, H., Schlichting, J. L., Cross, I., & Harrison, P. M. C. (2024). Rhythmic qualities of jazz improvisation predict performer identity and style in source-separated audio recordings. Royal Society Open Science, 11(11). https://doi.org/10.1098/rsos.240920


lz77_encode

lz77_encode(input_list: list[Hashable]) -> list

Runs the LZ77 compression algorithm over the input data, generating tuples of (distance, length, symbol)

Parameters:

  • input_list (list[Hashable]) –

    The input sequence to be compressed.

Returns:

  • list

    A list of tuples of (distance, length, symbol) representing the LZ77 encoded data.

Source code in amads/algorithms/complexity.py
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
def lz77_encode(input_list: list[Hashable]) -> list:
    """Runs the LZ77 compression algorithm over the input `data`,
    generating tuples of (distance, length, symbol)

    Parameters
    ----------
    input_list : list[Hashable]
        The input sequence to be compressed.

    Returns
    -------
    list
        A list of tuples of (distance, length, symbol)
        representing the LZ77 encoded data.
    """

    # Catch sequences that have a length of zero
    if len(input_list) == 0:
        raise ValueError("Cannot encode an empty sequence!")
    encoded = []
    i = 0
    while i < len(input_list):
        best_length = 0
        best_distance = 0
        # Search the entire processed portion (positions 0 to i-1)
        for j in range(0, i):
            length = 0
            # Allow overlapping matches: compare while within input bounds.
            while (
                (i + length < len(input_list))
                and (j + length < len(input_list))
                and (input_list[j + length] == input_list[i + length])
            ):
                length += 1
            if length > best_length:
                best_length = length
                best_distance = i - j
        # If a match was found, output a match token.
        if best_length > 0:
            encoded.append((best_distance, best_length, input_list[i]))
            i += best_length
        else:
            # No match: output a literal token with length 1.
            encoded.append((0, 1, input_list[i]))
            i += 1
    return encoded

lz77_decode

lz77_decode(encoded: list[tuple[int, int, Hashable]]) -> list[Hashable]

Decode a list of LZ77 tokens, each of which are 3-tuples with form (distance, length, symbol)

Parameters:

  • encoded (list[tuple[int, int, Hashable]]) –

    A list of LZ77 encoded tokens.

Returns:

  • list[Hashable]

    The decoded sequence.

Source code in amads/algorithms/complexity.py
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
def lz77_decode(encoded: list[tuple[int, int, Hashable]]) -> list[Hashable]:
    """Decode a list of LZ77 tokens, each of which are 3-tuples with form (distance, length, symbol)

    Parameters
    ----------
    encoded : list[tuple[int, int, Hashable]]
        A list of LZ77 encoded tokens.

    Returns
    -------
    list[Hashable]
        The decoded sequence.
    """

    # Catch sequences that have a length of zero
    if len(encoded) == 0:
        raise ValueError("Cannot decode an empty sequence!")
    decoded = []
    for token in encoded:
        distance, length, symbol = token
        # Literal token: simply append the symbol.
        if distance == 0:
            decoded.append(symbol)
        # Match token: copy 'length' characters from the already decoded list,
        # starting at position len(decoded) - distance.
        else:
            start = len(decoded) - distance
            for i in range(length):
                decoded.append(decoded[start + i])
    return decoded

lz77_complexity

lz77_complexity(
    sequence: Iterable[Hashable], normalized: bool = False
) -> Union[int, float]

Compresses a discrete sequence using the LZ77 algorithm and returns the length of the compressed output.

This function applies LZ77 compression to a sequence of hashable elements (e.g., strings, floats, integers). Higher compression lengths indicate greater complexity in the sequence.

Parameters:

  • sequence (Iterable[Hashable]) –

    A discrete sequence, such as pitch classes or inter-onset intervals.

  • normalized (Optional(bool), default: False ) –

    If True, the result is expressed relative to the input size, where 1.0 indicates maximum complexity (i.e., no compression is possible). Defaults to False.

Returns:

  • int | float:

    The length of the compressed sequence, either as a raw integer or a normalized float.

Source code in amads/algorithms/complexity.py
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
def lz77_complexity(
    sequence: Iterable[Hashable], normalized: bool = False
) -> Union[int, float]:
    """
    Compresses a discrete sequence using the LZ77 algorithm and returns the length of the compressed output.

    This function applies LZ77 compression to a sequence of hashable elements (e.g., strings, floats, integers).
    Higher compression lengths indicate greater complexity in the sequence.

    Parameters
    ----------
    sequence : Iterable[Hashable]
        A discrete sequence, such as pitch classes or inter-onset intervals.

    normalized : Optional(bool)
        If True, the result is expressed relative to the input size, where 1.0 indicates
        maximum complexity (i.e., no compression is possible). Defaults to False.

    Returns
    -------
    int | float:
        The length of the compressed sequence, either as a raw integer or a normalized float.
    """

    # Compress the sequence
    sequence = list(sequence)
    compressed = lz77_encode(sequence)
    # Express with relation to length of input string if required: returns a float
    if normalized:
        return len(compressed) / len(sequence)
    # Otherwise, take the length of the compressed string as the complexity of the input: returns an int
    else:
        return len(compressed)