numcodecs_tokenize
TokenizeCodec for the numcodecs buffer compression API.
Modules:
- typing – Commonly used type variables.
Classes:
- TokenizeCodec – Codec that tokenizes the unique data values and encodes the token indices and token lookup table.
TokenizeCodec
Bases: Codec
Codec that tokenizes the unique data values and encodes the token indices and token lookup table.
Encoding produces a 1D array of unsigned integers, most of which are token indices. Tokenization can improve compressibility because the indices may fit in a smaller data type than the original values and may contain many zero bytes. Applying a byte shuffle codec after tokenization can improve the compression achieved by a byte-based lossless compressor.
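The idea behind tokenization can be illustrated with plain NumPy. The following is a minimal sketch only, not the codec's implementation: the helpers `tokenize` and `detokenize` are hypothetical names, and the actual layout in which TokenizeCodec stores the indices and lookup table in its 1D output is not shown here and may differ.

```python
import numpy as np


def tokenize(data):
    """Hypothetical helper: split an array into a lookup table and token indices."""
    flat = np.asarray(data).ravel()
    # np.unique returns the sorted unique values; with return_inverse=True it
    # also returns, for each element, its position in that table.
    table, indices = np.unique(flat, return_inverse=True)
    # Choose the smallest unsigned integer dtype that can hold the largest index.
    index_dtype = np.min_scalar_type(max(len(table) - 1, 0))
    return table, indices.astype(index_dtype)


def detokenize(table, indices, shape):
    """Hypothetical helper: rebuild the original array from table and indices."""
    return table[indices].reshape(shape)


data = np.array([[3.5, 3.5, 7.25], [7.25, 3.5, 1.0]])
table, indices = tokenize(data)
restored = detokenize(table, indices, data.shape)
```

In this sketch, six 8-byte floats are represented by three table entries plus six 1-byte indices, which shows why data with few unique values can become more compressible after tokenization.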
Methods: