`numcodecs_tokenize`

TokenizeCodec for the numcodecs buffer compression API.

Modules:

typing –

Commonly used type variables.

Classes:

TokenizeCodec –

Codec that tokenizes the unique data values and encodes the token indices and token lookup table.

TokenizeCodec

Bases: Codec

Codec that tokenizes the unique data values and encodes the token indices and token lookup table.

Encoding produces a 1D array of unsigned integers, most of which are the token indices. Tokenization can improve compressibility because the indices may fit in a smaller data type and may contain many zero bytes. Applying a byte shuffle codec after tokenization can further improve the compression achieved by a byte-based lossless compressor.
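The idea behind tokenization can be sketched with plain NumPy. This is a minimal illustration, not the codec's implementation: the helper name `tokenize` is hypothetical, and how `TokenizeCodec` actually packs the lookup table and indices into its single 1D output is not shown here.

```python
import numpy as np


def tokenize(data: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Hypothetical helper: return (lookup table, token indices) for data."""
    # np.unique returns the sorted unique values; return_inverse gives, for
    # each element, the position of its value in that table.
    table, indices = np.unique(data, return_inverse=True)
    # Flatten the indices so the result is 1D regardless of the NumPy version.
    indices = indices.reshape(-1)
    # Pick the smallest unsigned dtype that can hold the largest index.
    index_dtype = np.min_scalar_type(max(len(table) - 1, 0))
    return table, indices.astype(index_dtype)


# Six float64 values (48 bytes), but only three distinct ones.
data = np.array([[1000.5, 2000.25, 1000.5], [3000.75, 1000.5, 2000.25]])
table, indices = tokenize(data)
```

Here the six 8-byte floats are represented by three table entries plus six 1-byte indices, which is the effect the paragraph above describes.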

Methods:

encode –

Encode the data in buf by tokenizing the unique values in buf.

decode –

Decode the data in buf.

codec_id `class-attribute` `instance-attribute`

codec_id: str = 'tokenize'

encode

encode(
    buf: ndarray[S, dtype[T]],
) -> ndarray[tuple[int], dtype[U]]

Encode the data in buf by tokenizing the unique values in buf.

Parameters:

`buf` (`ndarray[S, dtype[T]]`) – Array to be tokenized.

Returns:

`enc` (`ndarray[tuple[int], dtype[U]]`) – Tokenized 1D array with an unsigned integer dtype.

decode

decode(
    buf: ndarray[tuple[int], dtype[U]],
    out: None | ndarray[S, dtype[T]] = None,
) -> ndarray[S, dtype[T]]

Decode the data in buf.

Parameters:

`buf` (`ndarray[tuple[int], dtype[U]]`) – Tokenized 1D array with an unsigned integer dtype.

`out` (`None | ndarray[S, dtype[T]]`, default: `None`) – Writeable array to store decoded data.

Returns:

`dec` (`ndarray[S, dtype[T]]`) – Un-tokenized array.
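The inverse step, including an optional writeable `out` buffer, could look like the following NumPy sketch. The helper `untokenize` and its separate `table`, `indices`, and `shape` arguments are hypothetical; the real `decode` takes a single encoded array, and whether it returns `out` itself when one is given is an assumption here.

```python
import numpy as np


def untokenize(table, indices, shape, out=None):
    """Hypothetical helper: rebuild an array from a lookup table and indices."""
    # Look up each token index in the table and restore the original shape.
    decoded = table[indices].reshape(shape)
    if out is None:
        return decoded
    # Write the decoded values into the caller's buffer (assumed behaviour).
    np.copyto(out, decoded)
    return out


table = np.array([1000.5, 2000.25, 3000.75])
indices = np.array([0, 1, 0, 2, 0, 1], dtype=np.uint8)
out = np.empty((2, 3), dtype=table.dtype)
result = untokenize(table, indices, (2, 3), out=out)
```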