Commit cceeb8f

feat(math): add χ² probability and convert EntropyReport to RandomnessReport
Introduce another randomness measure based on Chi Square probability by using unblob-native's chi_square_probability function. This function returns the Chi Square distribution probability.

Chi-square tests are effective for distinguishing compressed from encrypted data because they evaluate the uniformity of byte distributions more rigorously than Shannon entropy.

In compressed files, bytes often cluster around certain values due to patterns that still exist (albeit less detectable), resulting in a non-uniform distribution. Encrypted data, by contrast, exhibits nearly perfect uniformity, as each byte value from 0–255 is expected to appear with almost equal frequency, making it harder to detect any discernible patterns.

The chi-square distribution is calculated for the stream of bytes in the chunk and expressed as an absolute number and a percentage which indicates how frequently a truly random sequence would exceed the value calculated. The percentage is the only value of interest from unblob's perspective, so it is the only value we return.

According to the ent documentation [0]:

> We [can] interpret the percentage as the degree to which the
> sequence tested is suspected of being non-random. If the percentage is
> greater than 99% or less than 1%, the sequence is almost certainly not
> random. If the percentage is between 99% and 95% or between 1% and 5%,
> the sequence is suspect. Percentages between 90% and 95% and 5% and 10%
> indicate the sequence is “almost suspect”.

[0] - https://www.fourmilab.ch/random/

This randomness measure is introduced by modifying the EntropyReport class so that it contains two RandomnessMeasurements:

- shannon: for Shannon entropy, which was already there
- chi_square: for Chi Square probability, which we introduce

EntropyReport is renamed to RandomnessReport to reflect that not all measurements are entropy.

The format_entropy_plot function has been adjusted to display two lines within the entropy graph: one for Shannon entropy, the other for Chi Square probability.

This commit breaks the previous API by converting entropy_depth and entropy_plot to randomness_depth and randomness_plot in ExtractionConfig. The '--entropy-depth' CLI option is replaced by '--randomness-depth'.
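
The χ² probability described in the commit message can be illustrated with a minimal, standard-library-only sketch. This is not unblob-native's implementation: the function names below are hypothetical, and the tail probability uses the Wilson-Hilferty normal approximation with 255 degrees of freedom instead of an exact χ² distribution. The `classify` helper encodes the thresholds quoted from the ent documentation.

```python
import math
from collections import Counter

DEGREES_OF_FREEDOM = 255  # 256 possible byte values minus one


def chi_square_statistic(data: bytes) -> float:
    """Pearson's chi-square statistic of byte counts against a uniform distribution."""
    expected = len(data) / 256
    counts = Counter(data)
    return sum(
        (counts.get(value, 0) - expected) ** 2 / expected for value in range(256)
    )


def chi_square_probability(data: bytes) -> float:
    """Approximate percentage of truly random sequences that would exceed the statistic.

    Assumption: Wilson-Hilferty normal approximation, not an exact computation.
    """
    if not data:
        raise ValueError("cannot measure an empty byte sequence")
    k = DEGREES_OF_FREEDOM
    statistic = chi_square_statistic(data)
    # The cube root of (statistic / k) is approximately normally distributed.
    z = ((statistic / k) ** (1 / 3) - (1 - 2 / (9 * k))) / math.sqrt(2 / (9 * k))
    upper_tail = 0.5 * math.erfc(z / math.sqrt(2))
    return 100.0 * upper_tail


def classify(percentage: float) -> str:
    """Map a percentage to the wording used in the ent documentation."""
    if percentage > 99 or percentage < 1:
        return "almost certainly not random"
    if percentage >= 95 or percentage <= 5:
        return "suspect"
    if percentage >= 90 or percentage <= 10:
        return "almost suspect"
    return "no evidence of non-randomness"
```

Both extremes are flagged under this reading: a sequence that is too uniform (every byte value appearing exactly equally often) lands near 100%, and a heavily skewed one lands near 0%.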
1 parent 60f2d2d commit cceeb8f

11 files changed (+220, -159 lines)

docs/guide.md

Lines changed: 53 additions & 51 deletions

@@ -114,10 +114,10 @@ $ cat alpine-report.json
 ]
 ```

-### Entropy calculation
+### Randomness calculation

 If you are analyzing an unknown file format, it might be useful to know the
-entropy of the contained files, so you can quickly see for example whether the
+randomness of the contained files, so you can quickly see for example whether the
 file is **encrypted** or contains some random content.

 Let's make a file with fully random content at the start and end:
@@ -128,59 +128,61 @@ $ dd if=/dev/random of=random2.bin bs=10M count=1
 $ cat random1.bin alpine-minirootfs-3.16.1-x86_64.tar.gz random2.bin > unknown-file
 ```

-A nice ASCII entropy plot is drawn on verbose level 3:
+A nice ASCII randomness plot is drawn on verbose level 3:

 ```console
 $ unblob -vvv unknown-file | grep -C 15 "Entropy distribution"

-2022-07-30 07:58.16 [debug ] Ended searching for chunks all_chunks=[0xa00000-0xc96196] pid=19803
-2022-07-30 07:58.16 [debug ] Removed inner chunks outer_chunk_count=1 pid=19803 removed_inner_chunk_count=0
-2022-07-30 07:58.16 [warning ] Found unknown Chunks chunks=[0x0-0xa00000, 0xc96196-0x1696196] pid=19803
-2022-07-30 07:58.16 [info ] Extracting unknown chunk chunk=0x0-0xa00000 path=unknown-file_extract/0-10485760.unknown pid=19803
-2022-07-30 07:58.16 [debug ] Carving chunk path=unknown-file_extract/0-10485760.unknown pid=19803
-2022-07-30 07:58.16 [debug ] Calculating entropy for file path=unknown-file_extract/0-10485760.unknown pid=19803 size=0xa00000
-2022-07-30 07:58.16 [debug ] Entropy calculated highest=99.99 lowest=99.98 mean=99.98 pid=19803
-2022-07-30 07:58.16 [warning ] Drawing plot pid=19803
-2022-07-30 07:58.16 [debug ] Entropy chart chart=
-Entropy distribution
-┌---------------------------------------------------------------------------┐
-100┤•••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••│
-90┤ │
-80┤ │
-70┤ │
-60┤ │
-50┤ │
-40┤ │
-30┤ │
-20┤ │
-10┤ │
-0┤ │
-└┬---┬---┬---─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬┘
-1 4 7 12 16 20 24 29 33 37 41 46 50 54 59 63 67 71 76 80
-[y] entropy % [x] mB
-pid=19803
-2022-07-30 07:58.16 [info ] Extracting unknown chunk chunk=0xc96196-0x1696196 path=unknown-file_extract/13197718-23683478.unknown pid=19803
-2022-07-30 07:58.16 [debug ] Carving chunk path=unknown-file_extract/13197718-23683478.unknown pid=19803
-2022-07-30 07:58.16 [debug ] Calculating entropy for file path=unknown-file_extract/13197718-23683478.unknown pid=19803 size=0xa00000
-2022-07-30 07:58.16 [debug ] Entropy calculated highest=99.99 lowest=99.98 mean=99.98 pid=19803
-2022-07-30 07:58.16 [warning ] Drawing plot pid=19803
-2022-07-30 07:58.16 [debug ] Entropy chart chart=
-Entropy distribution
-┌---------------------------------------------------------------------------┐
-100┤•••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••│
-90┤ │
-80┤ │
-70┤ │
-60┤ │
-50┤ │
-40┤ │
-30┤ │
-20┤ │
-10┤ │
-0┤ │
-└┬---┬---┬---─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬--─┬┘
-1 4 7 12 16 20 24 29 33 37 41 46 50 54 59 63 67 71 76 80
-[y] entropy % [x] mB
+2024-10-30 10:52.03 [debug ] Calculating chunk for pattern match handler=arc pid=1963719 real_offset=0x1685f5b start_offset=0x1685f5b
+2024-10-30 10:52.03 [debug ] Header parsed header=<arc_head archive_marker=0x1a, header_type=0x1, name=b'8\xa7i&po\xc77\xd5h\x9a\x9d\xf1', size=0x26d171fa, date=0x1bfd, time=0xe03f, crc=-0x3b95, length=0x349997d5> pid=1963719
+2024-10-30 10:52.03 [debug ] Ended searching for chunks all_chunks=[0xa00000-0xc96196] pid=1963719
+2024-10-30 10:52.03 [debug ] Removed inner chunks outer_chunk_count=1 pid=1963719 removed_inner_chunk_count=0
+2024-10-30 10:52.03 [warning ] Found unknown Chunks chunks=[0x0-0xa00000, 0xc96196-0x1696196] pid=1963719
+2024-10-30 10:52.03 [info ] Extracting unknown chunk chunk=0x0-0xa00000 path=unknown-file_extract/0-10485760.unknown pid=1963719
+2024-10-30 10:52.03 [debug ] Carving chunk path=unknown-file_extract/0-10485760.unknown pid=1963719
+2024-10-30 10:52.03 [debug ] Calculating randomness for file path=unknown-file_extract/0-10485760.unknown pid=1963719 size=0xa00000
+2024-10-30 10:52.03 [debug ] Shannon entropy calculated block_size=0x20000 highest=99.99 lowest=99.98 mean=99.98 path=unknown-file_extract/0-10485760.unknown pid=1963719 size=0xa00000
+2024-10-30 10:52.03 [debug ] Chi square probability calculated block_size=0x20000 highest=97.88 lowest=3.17 mean=52.76 path=unknown-file_extract/0-10485760.unknown pid=1963719 size=0xa00000
+2024-10-30 10:52.03 [debug ] Entropy chart chart=
+Randomness distribution
+┌───────────────────────────────────────────────────────────────────────────┐
+100┤ •• Shannon entropy (%) •••••••••♰••••••••••••••••••••••••••••••••••│
+90┤ ♰♰ Chi square probability (%) ♰ ♰ ♰♰♰♰ ♰ ♰ ♰ │
+80┤♰ ♰ ♰♰ ♰♰ ♰♰ ♰ ♰ ♰♰♰♰♰♰♰♰♰ ♰ ♰♰♰♰♰♰ ♰♰ ♰♰ │
+70┤♰♰♰♰ ♰ ♰ ♰ ♰ ♰♰♰ ♰ ♰ ♰ ♰ ♰♰♰♰♰♰♰♰♰ ♰♰ ♰ ♰ ♰ ♰♰♰ ♰♰♰♰♰♰ │
+60┤♰♰♰♰ ♰♰ ♰♰ ♰ ♰♰♰♰ ♰ ♰♰ ♰ ♰ ♰ ♰♰♰♰♰♰ ♰♰ ♰ ♰ ♰♰♰♰ ♰ ♰♰♰ ♰♰♰♰♰♰♰ │
+50┤ ♰♰♰ ♰♰ ♰♰ ♰♰ ♰♰♰♰ ♰♰ ♰ ♰♰♰ ♰♰♰♰♰♰ ♰ ♰ ♰ ♰♰♰♰♰ ♰ ♰♰♰ ♰ ♰♰♰♰♰ ♰ │
+40┤ ♰♰ ♰♰ ♰ ♰♰ ♰♰♰♰ ♰♰ ♰ ♰♰♰ ♰♰♰♰♰♰ ♰♰ ♰♰ ♰♰♰♰♰♰ ♰ ♰♰♰ ♰ ♰♰♰♰ ♰♰ ♰│
+30┤ ♰ ♰♰ ♰♰ ♰♰♰♰ ♰ ♰♰ ♰♰ ♰♰ ♰ ♰♰ ♰ ♰ ♰♰♰ ♰ ♰ ♰♰ ♰ ♰♰♰ ♰♰ ♰ │
+20┤ ♰♰ ♰♰ ♰♰♰ ♰ ♰♰ ♰ ♰♰ ♰ ♰ ♰ ♰ ♰ ♰ ♰♰ │
+10┤ ♰ ♰ ♰ ♰ ♰ ♰♰ ♰ ♰ ♰♰ │
+0┤ ♰ ♰ │
+└─┬──┬─┬──┬────┬───┬──┬──┬──┬───┬───┬──┬────┬───┬────┬──┬──┬────┬──┬───┬──┬─┘
+0 2 5 7 11 16 20 23 27 30 34 38 42 47 51 56 60 63 68 71 76 79
+131072 bytes
+path=unknown-file_extract/0-10485760.unknown pid=1963719
+2024-10-30 10:52.03 [info ] Extracting unknown chunk chunk=0xc96196-0x1696196 path=unknown-file_extract/13197718-23683478.unknown pid=1963719
+2024-10-30 10:52.03 [debug ] Carving chunk path=unknown-file_extract/13197718-23683478.unknown pid=1963719
+2024-10-30 10:52.03 [debug ] Calculating randomness for file path=unknown-file_extract/13197718-23683478.unknown pid=1963719 size=0xa00000
+2024-10-30 10:52.03 [debug ] Shannon entropy calculated block_size=0x20000 highest=99.99 lowest=99.98 mean=99.98 path=unknown-file_extract/13197718-23683478.unknown pid=1963719 size=0xa00000
+2024-10-30 10:52.03 [debug ] Chi square probability calculated block_size=0x20000 highest=99.03 lowest=0.23 mean=42.62 path=unknown-file_extract/13197718-23683478.unknown pid=1963719 size=0xa00000
+2024-10-30 10:52.03 [debug ] Entropy chart chart=
+Randomness distribution
+┌───────────────────────────────────────────────────────────────────────────┐
+100┤ •• Shannon entropy (%) •••••••••••••••••••••♰••••••••••••••••••••••│
+90┤ ♰♰ Chi square probability (%) ♰ ♰♰ ♰ │
+80┤♰♰ ♰♰ ♰♰ ♰ ♰♰ ♰ ♰♰ ♰ ♰♰ │
+70┤♰ ♰ ♰ ♰ ♰ ♰ ♰ ♰ ♰ ♰ ♰♰ ♰♰ ♰♰♰ ♰ ♰♰ ♰♰ │
+60┤ ♰ ♰♰ ♰ ♰ ♰ ♰ ♰♰♰♰♰ ♰♰ ♰♰ ♰♰ ♰ ♰ ♰♰♰ ♰♰ ♰ ♰ ♰♰ ♰ │
+50┤ ♰ ♰♰♰ ♰ ♰ ♰ ♰ ♰ ♰♰♰♰ ♰ ♰♰ ♰ ♰♰♰ ♰ ♰ ♰ ♰♰♰ ♰♰ ♰ ♰ ♰♰ ♰♰ ♰ │
+40┤ ♰♰♰♰ ♰♰ ♰♰ ♰ ♰ ♰♰ ♰♰♰ ♰♰♰ ♰♰♰ ♰♰ ♰ ♰ ♰ ♰♰ ♰ ♰♰ ♰ ♰ ♰ ♰ ♰♰♰ ♰♰ │
+30┤ ♰♰♰♰ ♰♰ ♰♰ ♰♰ ♰♰ ♰♰ ♰♰♰♰♰ ♰♰ ♰ ♰ ♰ ♰♰ ♰♰♰ ♰ ♰ ♰ ♰ ♰ ♰ ♰ ♰│
+20┤ ♰♰♰ ♰ ♰ ♰♰ ♰♰ ♰♰♰♰ ♰♰ ♰ ♰ ♰ ♰♰ ♰♰ ♰ ♰♰ ♰♰ ♰ ♰ │
+10┤ ♰ ♰ ♰ ♰ ♰ ♰ ♰ ♰♰ ♰ ♰♰ ♰♰ ♰♰ ♰ ♰ ♰ │
+0┤ ♰ ♰ ♰♰ ♰ ♰♰ │
+└─┬──┬─┬──┬────┬───┬──┬──┬──┬───┬───┬──┬────┬───┬────┬──┬──┬────┬──┬───┬──┬─┘
+0 2 5 7 11 16 20 23 27 30 34 38 42 47 51 56 60 63 68 71 76 79
+131072 bytes
 ```

 ### Skip extraction with file magic
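
The `block_size`, `highest`, `lowest` and `mean` values in the new log lines suggest a per-block measurement. The following is a minimal sketch of how such a Shannon measurement could be computed. The dataclass layout and both function names are assumptions made for illustration and are not taken from unblob's source; only the field names visible in the log output and the `0x20000` block size come from the documentation above.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RandomnessMeasurements:
    """Hypothetical container mirroring the fields seen in the log output."""

    percentages: List[float]
    block_size: int
    mean: float

    @property
    def highest(self) -> float:
        return max(self.percentages)

    @property
    def lowest(self) -> float:
        return min(self.percentages)


def shannon_entropy_percent(block: bytes) -> float:
    """Shannon entropy of a block, scaled so that 8 bits per byte equals 100%."""
    if not block:
        return 0.0
    total = len(block)
    entropy = sum(
        (count / total) * math.log2(total / count)
        for count in Counter(block).values()
    )
    return round(entropy / 8 * 100, 2)


def measure_shannon(data: bytes, block_size: int = 0x20000) -> RandomnessMeasurements:
    """Split data into fixed-size blocks and measure each block separately."""
    blocks = [data[i : i + block_size] for i in range(0, len(data), block_size)]
    percentages = [shannon_entropy_percent(block) for block in blocks]
    mean = round(sum(percentages) / len(percentages), 2) if percentages else 0.0
    return RandomnessMeasurements(
        percentages=percentages, block_size=block_size, mean=mean
    )
```

A χ² measurement could reuse the same container with one probability per block, which is consistent with the two series drawn in the plot.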

fuzzing/search_chunks_fuzzer.py

Lines changed: 2 additions & 2 deletions

@@ -40,8 +40,8 @@ def test_search_chunks(data):
     config = ExtractionConfig(
         extract_root=Path("/dev/shm"),  # noqa: S108
         force_extract=True,
-        entropy_depth=0,
-        entropy_plot=False,
+        randomness_depth=0,
+        randomness_plot=False,
         skip_magic=[],
         skip_extension=[],
         skip_extraction=False,

tests/test_cleanup.py

Lines changed: 4 additions & 4 deletions

@@ -50,7 +50,7 @@ def test_remove_extracted_chunks(input_file: Path, output_dir: Path):
     input_file.write_bytes(ZIP_BYTES)
     config = ExtractionConfig(
         extract_root=output_dir,
-        entropy_depth=0,
+        randomness_depth=0,
     )

     all_reports = process_file(config, input_file)
@@ -62,7 +62,7 @@ def test_keep_all_problematic_chunks(input_file: Path, output_dir: Path):
     input_file.write_bytes(DAMAGED_ZIP_BYTES)
     config = ExtractionConfig(
         extract_root=output_dir,
-        entropy_depth=0,
+        randomness_depth=0,
     )

     all_reports = process_file(config, input_file)
@@ -75,7 +75,7 @@ def test_keep_all_unknown_chunks(input_file: Path, output_dir: Path):
     input_file.write_bytes(b"unknown1" + ZIP_BYTES + b"unknown2")
     config = ExtractionConfig(
         extract_root=output_dir,
-        entropy_depth=0,
+        randomness_depth=0,
     )

     all_reports = process_file(config, input_file)
@@ -97,7 +97,7 @@ def test_keep_chunks_with_null_extractor(input_file: Path, output_dir: Path):
     input_file.write_bytes(b"some text")
     config = ExtractionConfig(
         extract_root=output_dir,
-        entropy_depth=0,
+        randomness_depth=0,
         handlers=(_HandlerWithNullExtractor,),
     )
     all_reports = process_file(config, input_file)

tests/test_cli.py

Lines changed: 4 additions & 4 deletions

@@ -184,7 +184,7 @@ def test_dir_for_file(tmp_path: Path):


 @pytest.mark.parametrize(
-    "params, expected_depth, expected_entropy_depth, expected_process_num, expected_verbosity, expected_progress_reporter",
+    "params, expected_depth, expected_randomness_depth, expected_process_num, expected_verbosity, expected_progress_reporter",
     [
         pytest.param(
             [],
@@ -233,7 +233,7 @@ def test_dir_for_file(tmp_path: Path):
 def test_archive_success(
     params,
     expected_depth: int,
-    expected_entropy_depth: int,
+    expected_randomness_depth: int,
     expected_process_num: int,
     expected_verbosity: int,
     expected_progress_reporter: Type[ProgressReporter],
@@ -263,8 +263,8 @@ def test_archive_success(
     config = ExtractionConfig(
         extract_root=tmp_path,
         max_depth=expected_depth,
-        entropy_depth=expected_entropy_depth,
-        entropy_plot=bool(expected_verbosity >= 3),
+        randomness_depth=expected_randomness_depth,
+        randomness_plot=bool(expected_verbosity >= 3),
         process_num=expected_process_num,
         handlers=BUILTIN_HANDLERS,
         verbose=expected_verbosity,
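
The renames applied across these test files amount to a fixed keyword mapping. Callers that still build their `ExtractionConfig` arguments with the old names could translate them with a small shim like the one below. The helper and its name are hypothetical and not part of unblob; only the two renamed options come from this commit.

```python
from typing import Any, Dict

# Old ExtractionConfig keyword -> new keyword, as renamed by this commit.
RENAMED_OPTIONS = {
    "entropy_depth": "randomness_depth",
    "entropy_plot": "randomness_plot",
}


def migrate_config_kwargs(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of kwargs with pre-rename option names translated."""
    migrated: Dict[str, Any] = {}
    for name, value in kwargs.items():
        new_name = RENAMED_OPTIONS.get(name, name)
        if new_name in migrated:
            # Both the old and the new spelling were supplied.
            raise ValueError(f"option {new_name!r} given more than once")
        migrated[new_name] = value
    return migrated
```

The result can then be unpacked into the constructor, for example `ExtractionConfig(**migrate_config_kwargs(old_kwargs))`.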
