numcodecs_tokenize

TokenizeCodec for the numcodecs buffer compression API.

Modules:

  • typing

    Commonly used type variables.

Classes:

  • TokenizeCodec

    Codec that tokenizes the unique data values and encodes the token indices and token lookup table.

TokenizeCodec

Bases: Codec

Codec that tokenizes the unique data values and encodes the token indices and token lookup table.

Encoding produces a 1D array of unsigned integers, most of which are the token indices. Tokenization can improve compressibility because the indices may fit in a smaller data type and may contain many zero bytes. Applying a byte shuffle codec after tokenization can improve the compression achieved by a subsequent byte-based lossless compressor.
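To illustrate why a byte shuffle can help after tokenization, here is a minimal sketch using only the Python standard library. The helpers `byte_shuffle` and `byte_unshuffle` are hypothetical names for illustration and are not part of numcodecs_tokenize; the sketch also assumes the token indices are stored as little-endian 4-byte unsigned integers, which the codec's actual output may not match.

```python
import struct
import zlib


def byte_shuffle(raw: bytes, itemsize: int) -> bytes:
    """Group byte 0 of every item, then byte 1 of every item, and so on."""
    if len(raw) % itemsize != 0:
        raise ValueError("buffer length must be a multiple of itemsize")
    return b"".join(raw[k::itemsize] for k in range(itemsize))


def byte_unshuffle(shuffled: bytes, itemsize: int) -> bytes:
    """Invert byte_shuffle."""
    if len(shuffled) % itemsize != 0:
        raise ValueError("buffer length must be a multiple of itemsize")
    count = len(shuffled) // itemsize
    out = bytearray(len(shuffled))
    for k in range(itemsize):
        out[k::itemsize] = shuffled[k * count:(k + 1) * count]
    return bytes(out)


# Hypothetical token indices: small values stored in 4-byte unsigned
# integers, so three of every four bytes are zero.
indices = [i % 7 for i in range(1000)]
raw = struct.pack(f"<{len(indices)}I", *indices)

shuffled = byte_shuffle(raw, itemsize=4)
assert byte_unshuffle(shuffled, itemsize=4) == raw

# After shuffling, all zero bytes sit together in one long run, which a
# byte-based compressor can typically exploit. Compare the two sizes:
print(len(zlib.compress(raw)), len(zlib.compress(shuffled)))
```

In practice an existing shuffle codec would be chained after the tokenizer rather than a hand-written helper like this one.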

Methods:

  • encode

    Encode the data in buf by tokenizing its unique values.

  • decode

    Decode the data in buf.

codec_id class-attribute instance-attribute

codec_id: str = 'tokenize'

encode

encode(
    buf: ndarray[S, dtype[T]],
) -> ndarray[tuple[int], dtype[U]]

Encode the data in buf by tokenizing its unique values.

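Parameters:
  • buf (ndarray[S, dtype[T]]) –

    Array to be tokenized.

Returns:
  • ndarray[tuple[int], dtype[U]] –

    Tokenized 1D array with an unsigned integer dtype.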
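The idea behind tokenization can be sketched in plain Python. This is not the library's implementation: the function names `tokenize` and `smallest_unsigned_typecode` are hypothetical, and the sketch keeps the lookup table and the indices as two separate objects, whereas the codec packs everything into a single 1D unsigned integer array whose exact layout is not described here.

```python
from array import array


def tokenize(values):
    """Return (table, indices): the sorted unique values and, for each
    input value, the position of that value in the table."""
    table = sorted(set(values))
    position = {value: i for i, value in enumerate(table)}
    indices = [position[value] for value in values]
    return table, indices


def smallest_unsigned_typecode(max_index):
    """Pick the narrowest unsigned array typecode that can hold max_index."""
    for typecode in ("B", "H", "I", "Q"):
        if max_index < 2 ** (8 * array(typecode).itemsize):
            return typecode
    raise OverflowError("index does not fit in an unsigned 64-bit integer")


data = [20.5, 20.5, 21.0, 20.5, 19.75, 21.0]
table, indices = tokenize(data)
packed = array(smallest_unsigned_typecode(len(table) - 1), indices)

print(table)            # [19.75, 20.5, 21.0]
print(indices)          # [1, 1, 2, 1, 0, 2]
print(packed.itemsize)  # 1
```

Six 8-byte floats with three unique values reduce to six 1-byte indices plus a three-entry table, which is the effect the class description refers to when it says the indices may only need a smaller data type.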

decode

decode(
    buf: ndarray[tuple[int], dtype[U]],
    out: None | ndarray[S, dtype[T]] = None,
) -> ndarray[S, dtype[T]]

Decode the data in buf.

Parameters:
  • buf (ndarray[tuple[int], dtype[U]]) –

    Tokenized 1D array with an unsigned integer dtype.

  • out (None | ndarray[S, dtype[T]], default: None ) –

    Writeable array to store decoded data.

Returns:
  • ndarray[S, dtype[T]] –

    Decoded data.
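The role of the optional out argument can be sketched with a plain-Python stand-in. The function `detokenize` below is hypothetical and takes the lookup table and indices as separate arguments, which is an assumption made for illustration; the real decode receives a single tokenized array. Returning the caller's buffer when one is supplied is also an assumption, following the common convention for codecs with an out parameter.

```python
from array import array


def detokenize(table, indices, out=None):
    """Map each token index back to its value in the lookup table.

    If `out` is given, it must be a writeable sequence with the same
    length as `indices`; the decoded values are written into it and the
    same object is returned.
    """
    if out is None:
        return [table[i] for i in indices]
    if len(out) != len(indices):
        raise ValueError("out has the wrong length")
    for k, i in enumerate(indices):
        out[k] = table[i]
    return out


table = [19.75, 20.5, 21.0]
indices = [1, 1, 2, 1, 0, 2]

# Without `out`, a new container is allocated for the decoded data.
fresh = detokenize(table, indices)
print(fresh)  # [20.5, 20.5, 21.0, 20.5, 19.75, 21.0]

# With `out`, the decoded values land in the caller's preallocated buffer.
preallocated = array("d", [0.0] * len(indices))
result = detokenize(table, indices, out=preallocated)
print(result is preallocated)  # True
```

Reusing a preallocated buffer avoids an extra allocation when many chunks of the same shape are decoded in a loop.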