`fastchaos` implements the [integer chaos game representation (iCGR) algorithm](https://www.liebertpub.com/doi/abs/10.1089/cmb.2018.0173) for DNA sequence encoding and decoding. `fastchaos` is the first complete implementation of the algorithm in a bioinformatics tool aimed at end users. It also adds to the original algorithm an output file format: a `zst`-compressed JSON file containing the three iCGR integers for each 100 bp subsequence of the supplied sequence. This allows fast encoding and decoding.
13
+
`chaoscoder` implements the [integer chaos game representation (iCGR) algorithm](https://www.liebertpub.com/doi/abs/10.1089/cmb.2018.0173) for DNA sequence encoding and decoding. `chaoscoder` is the first complete implementation of the algorithm in a bioinformatics tool aimed at end users. It also adds to the original algorithm an output file format: a `zst`-compressed JSON file containing the three iCGR integers for each 100 bp subsequence of the supplied sequence. This allows fast encoding and decoding.
14
14
15
-
`fastchaos` also implements the [chaos game representation (CGR) of DNA sequences](https://academic.oup.com/nar/article-abstract/18/8/2163/2383530) in a fast tool that draws the representation of a sequence and can compare CGR images using the [DSSIM algorithm](https://github.com/kornelski/dssim/).
15
+
`chaoscoder` also implements the [chaos game representation (CGR) of DNA sequences](https://academic.oup.com/nar/article-abstract/18/8/2163/2383530) in a fast tool that draws the representation of a sequence and can compare CGR images using the [DSSIM algorithm](https://github.com/kornelski/dssim/).
title: 'fastchaos: block-based integer chaos game representation encoding and decoding of DNA sequences'
2
+
title: 'chaoscoder: block-based integer chaos game representation encoding and decoding of DNA sequences'
3
3
tags:
4
4
- DNA sequence analysis
5
5
- Chaos game representation
@@ -13,42 +13,52 @@ authors:
13
13
orcid: 0000-0002-9078-8844
14
14
affiliation: 1
15
15
affiliations:
16
-
- name: Equipe Bioinformatique et Biostatistique, Laboratoire de Microbiologie, Biotechnologie et Bioinformatique, Institut National Polytechnique Félix Houphouët-Boigny, Côte d'Ivoire
16
+
- name: Equipe Bioinformatique et Biostatistique, Laboratoire de Microbiologie, Biotechnologie et Bioinformatique, Institut National Polytechnique Félix Houphouët-Boigny, BP 1093 Yamoussoukro, Côte d'Ivoire
17
17
index: 1
18
18
date: 15 July 2025
19
19
bibliography: paper.bib
20
20
---
21
21
22
22
# Summary
23
23
24
-
Computational analysis of DNA sequences is fundamental in modern bioinformatics, enabling tasks such as classification, genome comparison, mutation detection, and evolutionary studies. To support these analyses, DNA sequences, represented as strings of nucleotide letters (A, T, C, G), must be converted into numerical formats suitable for mathematical operations and machine learning workflows.
24
+
Computational analysis of DNA sequences underpins numerous bioinformatics applications, including sequence classification, genome comparison, mutation detection, and evolutionary studies. These tasks often require transforming symbolic nucleotide sequences (A, T, C, G) into numerical representations suitable for mathematical processing or machine learning.
25
25
26
-
One widely used encoding method is the Chaos Game Representation (CGR), which maps sequences onto a 2D space, revealing compositional and structural patterns [@jeffrey_chaos_1990; @vinga_pattern_2012]. However, CGR relies on floating-point arithmetic, which introduces rounding errors and limits precision. This is especially problematic for long sequences and for exact sequence reconstruction.
26
+
Chaos Game Representation (CGR) is a well-established method that encodes DNA sequences as points in a 2D space, revealing motifs and structural patterns [@jeffrey_chaos_1990]. However, traditional CGR depends on floating-point arithmetic, leading to rounding errors and imprecision, especially when applied to long sequences or to tasks that require exact sequence reconstruction.
27
27
28
-
To address these limitations, our software implements the Integer Chaos Game Representation (iCGR), a mathematically robust alternative that operates entirely in integer space [@yin_encoding_2018]. This guarantees lossless encoding and decoding. Furthermore, we introduce a block-based iCGR algorithm that enables the encoding of long genomic sequences by processing them in overlapping segments. This makes the method scalable and compatible with high-throughput genome analysis.
28
+
`chaoscoder` implements the Integer Chaos Game Representation (iCGR), a variant that operates entirely in integer space to provide lossless encoding and decoding [@yin_encoding_2018]. Because iCGR coordinates grow exponentially with sequence length, the software introduces a block-based variant that divides sequences into overlapping segments, enabling scalable and parallelizable encoding of genome-length sequences.
29
29
30
-
The software supports efficient encoding, decoding, and standardized storage of iCGR coordinates. It is designed to be fast, precise, and extensible, making it suitable for a wide range of genomic applications where reliability and performance are essential.
30
+
The software provides a command-line interface for encoding, decoding, visualizing CGRs, and comparing sequence structure via image-based SSIM (Structural Similarity Index Measure). It supports standardized storage of encoded data in a custom `.bicgr` file format, designed for efficient downstream use.
31
+
32
+
Written in Rust for performance and reliability, `chaoscoder` is well-suited for researchers and developers working with large-scale genomic datasets where precision, reversibility, and scalability are essential.
31
33
32
34
# Implementation
33
35
34
-
## Encoding and decoding DNA sequences by block-based integer CGR
36
+
## Encoding and decoding DNA sequences by integer CGR
37
+
38
+
`chaoscoder` provides a CLI to encode and decode DNA sequences using the iCGR algorithm proposed by Yin [@yin_encoding_2018]. For sequences shorter than 100 nucleotides, the classic iCGR approach is used, mapping each base to integer coordinates without rounding errors.
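The classic iCGR step described above can be sketched in a few lines of Rust. This is a minimal illustration, not the `chaoscoder` source: the function names `corner`, `encode_icgr`, and `decode_icgr` are hypothetical, the corner assignment (A, T, C, G on the four corners of a square) is assumed from the usual presentation of iCGR, and `i128` is used here only because it is wide enough for the 100 nt limit mentioned in the text.

```rust
/// Corner of the CGR square for one nucleotide.
/// Assumed layout: A=(1,1), T=(-1,1), C=(-1,-1), G=(1,-1).
fn corner(base: u8) -> Option<(i128, i128)> {
    match base.to_ascii_uppercase() {
        b'A' => Some((1, 1)),
        b'T' => Some((-1, 1)),
        b'C' => Some((-1, -1)),
        b'G' => Some((1, -1)),
        _ => None, // ambiguous or invalid symbols are rejected in this sketch
    }
}

/// Encode a short sequence into the tri-integer (x, y, n).
/// Each base at 0-based position i moves the point by 2^i times its corner.
fn encode_icgr(seq: &str) -> Option<(i128, i128, usize)> {
    if seq.len() > 100 {
        return None; // keep coordinates well inside the i128 range
    }
    let (mut x, mut y) = (0i128, 0i128);
    for (i, b) in seq.bytes().enumerate() {
        let (cx, cy) = corner(b)?;
        let weight = 1i128 << i;
        x += weight * cx;
        y += weight * cy;
    }
    Some((x, y, seq.len()))
}

/// Decode (x, y, n) back into the sequence, last base first.
/// The sign of each coordinate identifies the corner of the last base.
fn decode_icgr(x: i128, y: i128, n: usize) -> Option<String> {
    if n > 100 {
        return None;
    }
    let (mut x, mut y) = (x, y);
    let mut out = vec![0u8; n];
    for i in (0..n).rev() {
        let cx: i128 = if x > 0 { 1 } else { -1 };
        let cy: i128 = if y > 0 { 1 } else { -1 };
        out[i] = match (cx, cy) {
            (1, 1) => b'A',
            (-1, 1) => b'T',
            (-1, -1) => b'C',
            _ => b'G',
        };
        let weight = 1i128 << i;
        x -= weight * cx;
        y -= weight * cy;
    }
    if x != 0 || y != 0 {
        return None; // leftover residue: the input was not a valid encoding
    }
    String::from_utf8(out).ok()
}
```

Under these assumptions, `encode_icgr("ACGT")` yields `(-5, 3, 4)`, and `decode_icgr(-5, 3, 4)` recovers `"ACGT"`. Because every step is integer addition or subtraction, no rounding occurs at any point.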
39
+
40
+
## Block-based encoding
35
41
36
-
To solve the problem of exponential scaling that limits the iCGR method to sequence lengths of about 100 nt, we propose a block-based approach that splits sequences into fixed-size blocks (e.g. 50-100 nt) to ensure that the computation remains within hardware limits. The algorithm first splits sequences into overlapping fragments based on input from the user (Figure 1) and then encodes the subsequences into tri-integers using the iCGR algorithm defined by Yin [@yin_encoding_2018].
42
+
Due to the exponential nature of coordinate growth in iCGR, encoding long sequences (e.g., full genomes) directly is computationally infeasible. To mitigate this, `chaoscoder` implements a block-based iCGR approach. Sequences are partitioned into fixed-size, optionally overlapping segments (e.g., 50–100 nt), each of which is independently encoded using the iCGR algorithm (Figure 1).
37
43
38
-
## The block-based integer chaos game representation file format
44
+
The result is a scalable encoding strategy that maintains the reversibility and precision of iCGR while enabling genome-scale processing.
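The partitioning step of the block-based approach can be illustrated as follows. This is a sketch under stated assumptions rather than the tool's actual code: `split_blocks` and `join_blocks` are hypothetical names, and the overlap is assumed to mean that each block starts `block_size - overlap` positions after the previous one. Input is assumed to be ASCII nucleotide text.

```rust
/// Split a sequence into fixed-size blocks, each sharing `overlap`
/// characters with the previous block. The last block may be shorter.
fn split_blocks(seq: &str, block_size: usize, overlap: usize) -> Vec<&str> {
    assert!(block_size > 0 && overlap < block_size);
    let step = block_size - overlap;
    let mut blocks = Vec::new();
    let mut start = 0;
    while start < seq.len() {
        let end = (start + block_size).min(seq.len());
        blocks.push(&seq[start..end]);
        if end == seq.len() {
            break; // the sequence is fully covered
        }
        start += step;
    }
    blocks
}

/// Rebuild the original sequence by dropping the overlapping prefix
/// of every block except the first.
fn join_blocks(blocks: &[&str], overlap: usize) -> String {
    let mut out = String::new();
    for (i, block) in blocks.iter().enumerate() {
        let skip = if i == 0 { 0 } else { overlap.min(block.len()) };
        out.push_str(&block[skip..]);
    }
    out
}
```

With `block_size = 4` and `overlap = 1`, the sequence `ACGTACGTAC` splits into `ACGT`, `TACG`, and `GTAC`; each block would then be encoded independently with iCGR, which is what makes the work easy to spread across threads. Storing the overlap value alongside the blocks is what allows the decoder to rejoin them.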
39
45
40
-
The file structure of a block-based integer chaos game representation file (.bicgr) follows a tab-separated-like format (Figure 2).
46
+
## The `.bicgr` file format
47
+
48
+
The block-based integer Chaos Game Representation (.bicgr) format is a custom tab-separated file structure (Figure 2).
41
49
42
50

43
51
44
-
The BICGR format specifies three mandatory columns and one optional section. The first field is the sequence ID, which is mandatory, while the second field is the sequence description, which is optional. The third field is the overlap argument used by the encoding algorithm. The tri-integers are listed as x, y, and n for each block and are arranged according to the 5' to 3' orientation of the DNA strand, as output by the encoding algorithm.
52
+
It includes the sequence ID (mandatory), the sequence description (optional), the overlap parameter used during encoding, and the iCGR tri-integer coordinates (`x`, `y`, and `n`) for each block, listed in 5' to 3' orientation. This structure keeps the output consistent and easy to parse for integration into downstream pipelines.
45
53
46
54
## Other features
47
55
48
-
`fastchaos` has several other useful features. First, `fastchaos` can draw the traditional CGR of a sequence and compare it with the CGR of another sequence by computing the structural similarity index (SSIM) between the two images. This allows for a more nuanced comparison of sequence similarity beyond simple sequence alignment. Second, `fastchaos` takes advantage of multithreading to speed up the encoding and decoding process.
56
+
`chaoscoder` offers additional functionality to support exploratory and comparative genomics. First, the software can generate 2D CGR images of sequences. Second, users can compute the Structural Similarity Index (SSIM) between CGR images to compare sequence patterns without alignment.
57
+
Finally, encoding and decoding tasks are multithreaded to improve performance on large datasets.
58
+
49
59
50
60
# Installation
51
61
52
-
The `fastchaos` software is programmed in Rust, and is available on GitHub at [https://github.com/Ebedthan/fastchaos](https://github.com/Ebedthan/fastchaos).
62
+
`chaoscoder` is written in Rust and distributed via GitHub at [https://github.com/Ebedthan/chaoscoder](https://github.com/Ebedthan/chaoscoder).