This article has multiple issues. Please help improve it or discuss these issues on the talk page. (Learn how and when to remove these messages)
|
liblzg is a compression library for performing lossless data compression. It implements an algorithm that is a variation of the LZ77 algorithm, called the LZG algorithm, with the primary focus of providing a very simple and fast decoding method. One of the key features of the algorithm is that it requires no memory during decompression. The software library is free software, distributed under the zlib license.
Stable release | 1.0.10
/ November 29, 2018 |
---|---|
Repository | |
Written in | C, Pascal, Lua, Assembly language, JavaScript |
Operating system | Cross-platform |
Type | Data compression |
License | zlib license |
Website | liblzg |
Algorithm
editIf a duplicate series of bytes (a repeated string) is spotted in the uncompressed data stream, then a back-reference is inserted, linking to the previous location of that identical string instead. An encoded match to an earlier string consists of a length (3–128 bytes) and a distance (1–526,341 bytes). The level of compression can be controlled by specifying the maximum distance for which duplicated strings will be searched (this is the size of the sliding window).
Data format
editThe data format consists of a header, followed by the compressed data. The header contains an identifier and house keeping information, such as compressed and decompressed data sizes and a 32-bit checksum (a variant of the Fletcher checksum).
The compressed data starts with four bytes, identifying four unique 8-bit marker symbols (m1, m2, m3 and m4). These are used to separate literal data bytes from various forms of length-distance pair encodings.
Any symbol that is not a marker byte is considered a literal byte, and will be copied as is to the decompressed data buffer. However, if the decoder encounters any of the four marker bytes, it will decode a length-distance pair that is used as a back reference into the previously decompressed data.
The marker bytes are interpreted as follows (% denotes a binary number):
General copy (m1)
editm1 represents the most general form of a copy operation, and it occupies four bytes in the compressed data stream:
m1 | %ooolllll | %mmmmmmmm | %nnnnnnnn |
...where length=DECODELENGTH(%lllll+2), and offset=%ooommmmmmmmnnnnnnnn + 2056.
Medium copy (m2)
editm2 is a shorter form of a copy operation, occupying three bytes in the compressed data stream:
m2 | %ooolllll | %mmmmmmmm |
...where length=DECODELENGTH(%lllll+2), and offset=%ooommmmmmmm + 8.
Short copy (m3)
editm3 requires only two bytes, and is used for short lengths, close to the marker:
m3 | %lloooooo |
...where length=%ll+3, and offset=%oooooo + 8.
Near copy (m4)
editm4 requires only two bytes, and is used for nearby copies (including RLE, when the offset is 1):
m4 | %ooolllll |
...where length=DECODELENGTH(%lllll+2), and offset=%ooo + 1.
Literal copy
editAs a special case, if any of the marker symbols are followed by a zero byte (0), the marker symbol itself is written to the decompressed buffer.
Non-linear length encoding
editThe DECODELENGTH function implements a non-linear mapping of a number in the range 3-33 to a number in the range 3-128, according to the following table:
Length parameter, L (3-33) | Decoded length (3-128) |
---|---|
33 | 128 |
32 | 72 |
31 | 48 |
30 | 35 |
<30 | L |
Worst case data growth
editAs the marker symbols are chosen as the four least common symbols in the uncompressed data stream (with a probability of at most each), and a single occurrence of a marker symbol requires two bytes to encode, the compressed data may grow by at most < 1.6% compared to the decompressed data (worst case).
The liblzg library compensates for this by using a plain 1:1 copy mode if the encoder identifies that the compressed data will be larger than the original uncompressed data. Hence, in practice, the maximum data growth is 0% (plus the size of the data header, which is 16 bytes).
Implementations
editBoth the compression and the decompression algorithms are implemented in an open source library, written in the C programming language. There are also several alternate implementations of the decompression algorithm available (for instance in JavaScript and 8-bit assembly language).