Statistical and Entropic Modeling for Lossless Data Compression in the Internet of Things
Abstract
This paper introduces a novel lossless compression scheme designed specifically for highly constrained environments such as those encountered in the Internet of Things (IoT). The proposed method, termed Chaotic Compression Standard (CCS), employs a hierarchical symbol partitioning strategy coupled with a simplified variant of the range Asymmetric Numeral Systems (rANS), effectively avoiding costly division operations. A key contribution of CCS lies in the dynamic embedding of zone-identifying bits within the compressed data stream, achieved through precise alignment within fixed-size binary packets.
This architecture offers a compelling balance between compression efficiency, low computational complexity, and optimal suitability for embedded systems.
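To illustrate the kind of division-free update CCS builds on, the Python sketch below implements a bare-bones rANS encode/decode loop in which every symbol frequency and the total M are powers of two, so the divide and modulo of the textbook rANS update reduce to shifts and masks. This is one generic way to remove the division, not the CCS coder itself; the frequency table and the names FREQ, CUM and LOG2_M are illustrative assumptions.

# Bare-bones rANS with power-of-two frequencies: the usual x // f and x % f
# become shifts and masks, so no division instruction is needed.
LOG2_M = 4                                   # total frequency M = 16
FREQ   = {'a': 8, 'b': 4, 'c': 4}            # per-symbol counts (powers of two)
SHIFT  = {s: f.bit_length() - 1 for s, f in FREQ.items()}   # log2(freq)
CUM    = {'a': 0, 'b': 8, 'c': 12}           # cumulative counts

def encode(symbols, x=1):
    for s in reversed(symbols):              # rANS consumes symbols in reverse
        # x' = (x // f)*M + cum + (x % f), with // and % done via shift/mask
        x = ((x >> SHIFT[s]) << LOG2_M) + CUM[s] + (x & (FREQ[s] - 1))
    return x

def decode(x, n):
    out = []
    for _ in range(n):
        slot = x & ((1 << LOG2_M) - 1)       # x mod M, again a mask
        s = next(sym for sym in FREQ if CUM[sym] <= slot < CUM[sym] + FREQ[sym])
        x = FREQ[s] * (x >> LOG2_M) + slot - CUM[s]
        out.append(s)
    return out

msg = list("abacab")
assert decode(encode(msg), len(msg)) == msg

A production coder would add renormalization to keep the state within a machine word and emit an output bit stream; the point here is only that restricting frequencies to powers of two is one way to eliminate the per-symbol division.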
Empirical results indicate that CCS achieves slightly better compression ratios (by approximately 1%) than classical methods such as rANS coding, while offering a substantial improvement in encoding speed, reducing processing time by approximately 80%. These results highlight its particular suitability for embedded IoT applications.
Given the exponential growth in data generated by IoT devices, conventional lossless compression methods are increasingly constrained by processing and energy limitations. CCS addresses these challenges by integrating statistical modeling and an entropy coding paradigm capable of outperforming traditional approaches such as Huffman, range coding, and arithmetic coding. Preliminary evaluations underscore the algorithm's effectiveness, particularly on structured datasets, and highlight its strong potential for hardware-level deployment and future entropy-aware optimizations.
Keywords: Lossless data compression, Entropy coding, Symbol frequency analysis, Information theory.
Residual Compressibility Beyond Shannon: Structure and Saturation at Maximum Entropy
Abstract
Shannon entropy is traditionally regarded as the ultimate measure of information and compressibility. However, our experiments on 256-byte binary blocks reveal that even at maximum apparent entropy (H ≈ 8 bits per byte), significant compression gains (up to 12.5%) remain achievable. This phenomenon, which we define as residual compressibility, is observed in both ChaCha20-generated blocks and mathematically controlled distributions, exposing hidden structures undetected by conventional encoders. We propose a new theoretical reading of entropy as a vectorial or topological structure that accounts for dimensions such as symbol emergence dynamics and internal organization. Our experimental results suggest that classical entropy masks fundamental structural distinctions, calling for a redefinition of the informational model to better account for effective compressibility.
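As a point of reference for the H ≈ 8 bits-per-byte figure, the short Python sketch below computes the standard plug-in (empirical) Shannon entropy of a 256-byte block; it is our own illustration of the estimator, not code from the paper.

import math, os
from collections import Counter

def block_entropy(block: bytes) -> float:
    # Plug-in Shannon entropy in bits per byte: H = -sum p(b) * log2 p(b)
    n = len(block)
    return -sum((c / n) * math.log2(c / n) for c in Counter(block).values())

print(block_entropy(os.urandom(256)))        # empirical H of a pseudo-random block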
Keywords: Lossless data compression, Entropy coding, Symbol frequency analysis, Information theory.
From Apparent Chaos to Measurable Complexity: ϕ Entropy and the Compressibility Undecidability Zone
Abstract
Classical information theory assumes that Shannon entropy fully captures the uncertainty—and hence the compressibility—of a finite sequence. Yet empirical evidence shows that two blocks with identical symbol frequencies can exhibit radically different compressibility once the temporal order of arrivals is considered. We introduce ϕ entropy, a structural complement to Shannon’s measure that quantifies the coherence of symbol repetitions through a binary redundancy map. The metric combines three ordinal descriptors—variance of redundant positions, local alternation, and clustering—into a single, parameter-free index ϕ. We define the effective entropy
H_eff = H_Shannon − α
and show experimentally, on 20,000 blocks of 256 bytes drawn from cryptographic (ChaCha20), synthetic, and textual sources, that H_eff predicts the real compression gain (R² = 0.83) obtained by a class of order-based coders.
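The abstract names the ingredients of ϕ (a binary redundancy map and three ordinal descriptors) but not their exact formulas, so the Python sketch below should be read as a hypothetical reconstruction: the normalizations and the final combination are our assumptions, chosen only to yield a single parameter-free index in [0, 1].

import statistics

def redundancy_map(block: bytes) -> list:
    # 1 at position i if block[i] has already occurred earlier in the block
    seen, r = set(), []
    for b in block:
        r.append(1 if b in seen else 0)
        seen.add(b)
    return r

def phi_index(block: bytes) -> float:
    r = redundancy_map(block)
    n = len(r)
    pos = [i for i, bit in enumerate(r) if bit]          # redundant positions
    if len(pos) < 2:
        return 0.0
    # (1) variance of redundant positions, normalized by its maximum (n-1)^2 / 4
    spread = statistics.pvariance(pos) / ((n - 1) ** 2 / 4)
    # (2) local alternation: rate of 0<->1 transitions along the map
    altern = sum(a != b for a, b in zip(r, r[1:])) / (n - 1)
    # (3) clustering: mean run length of consecutive 1s, normalized by n
    runs, cur = [], 0
    for bit in r + [0]:
        if bit:
            cur += 1
        elif cur:
            runs.append(cur)
            cur = 0
    cluster = (sum(runs) / len(runs)) / n
    # single parameter-free index: here simply the mean of the three descriptors
    return (spread + altern + cluster) / 3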
Plotting compression success over the (H, ϕ) plane reveals an unexpected compressibility undecidability zone: for 6.97 ≤ H ≤ 7.50 bits/byte and 0.2 ≤ ϕ ≤ 0.4, neither frequency-based algorithms (Huffman, rANS) nor structural coders can save a single bit. Above this band, blocks that remain nominally “maximally random” (H ≈ 8) become compressible whenever ϕ > 0.45; below it, a strong frequency imbalance alone suffices.
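Purely as a reading aid, the thresholds quoted above can be turned into a small classifier over the (H, ϕ) plane; the function is ours, and only the numeric bounds come from the text.

def classify(h_bits_per_byte: float, phi: float) -> str:
    # Band boundaries taken from the abstract; labels are paraphrases
    if 6.97 <= h_bits_per_byte <= 7.50 and 0.2 <= phi <= 0.4:
        return "undecidability zone: no coder saves a bit"
    if h_bits_per_byte > 7.50 and phi > 0.45:
        return "structurally compressible despite H close to 8"
    if h_bits_per_byte < 6.97:
        return "compressible through frequency imbalance alone"
    return "not characterized in the abstract"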
These results demonstrate that Shannon entropy is incomplete, not incorrect, and that ϕ entropy supplies the missing temporal dimension. The proposed framework unifies frequency and order, explains long-standing anomalies (compression of encrypted data), and opens a quantitative path toward complexity-aware coding, anomaly detection, and randomness profiling.
Keywords: Lossless data compression, Entropy coding, Symbol frequency analysis, Information theory.