This is a library to encode bits into text.
To install from source:
$ git clone git@github.com:fastforwardlabs/steganos.git
$ cd steganos
$ python setup.py install
Or simply:
$ pip install git+https://github.com/fastforwardlabs/steganos.git
To find out how many bits can be encoded into a string:
import steganos
original_text = '"Hello," he said.\n\t"I am 9 years old"'
capacity = steganos.bit_capacity(original_text)
To encode bits into a string:
import steganos
bits = '101'
original_text = '"Hello," he said.\n\t"I am 9 years old"'
encoded_text = steganos.encode(bits, original_text)
Retrieving the bits from a string requires the original text into which the bits were encoded.
If you have the complete encoded text, use the decode_full_text function:
import steganos
bits = '101'
original_text = '"Hello," he said.\n\t"I am 9 years old"'
encoded_text = steganos.encode(bits, original_text)
recovered_bits = steganos.decode_full_text(encoded_text, original_text)
# recovered_bits.startswith('101') == True
If you have only part of the encoded text, use the decode_partial_text function. If you know which indices of the original text the partial encoded text corresponds to, pass them as a tuple (start_index, end_index) in the final parameter. Otherwise, they will be inferred.
import steganos
bits = '101'
original_text = '"Hello," he said.\n\t"I am 9 years old"'
encoded_text = steganos.encode(bits, original_text)
partial_text = encoded_text[:8]
recovered_bits = steganos.decode_partial_text(partial_text, original_text)
# recovered_bits.startswith('1?1') == True
To help send encoded messages, as opposed to just storing bytes, we provide
bytes_to_binary and binary_to_bytes, which convert a message to and from
steganos' binary format.
import steganos
message = b'Hello World!'
original_text = open('text.txt').read()
bits = steganos.bytes_to_binary(message)
encoded_text = steganos.encode(bits, original_text)
recovered_bits = steganos.decode_full_text(encoded_text, original_text)
recovered_msg = steganos.binary_to_bytes(recovered_bits)
# recovered_msg.startswith(b'Hello World!') == True
By default, any decoded message will be the maximum length encodable within the
source document. That is to say, if you have a document that can store 8 bits
and your message is just two bits, the decoded result will be your two bits
repeated four times. To avoid this, pass the message_bits parameter to the
decode function. Besides returning the proper number of bits, this can also
improve accuracy for partial decodings.
import steganos
bits = '101'
original_text = '"Hello," he said.\n\t"I am 9 years old"'
encoded_text = steganos.encode(bits, original_text)
partial_text = encoded_text[14:26]
recovered_bits = steganos.decode_partial_text(partial_text, original_text)
recovered_bits_limit = steganos.decode_partial_text(partial_text, original_text, message_bits=3)
# recovered_bits == '1??101'
# recovered_bits_limit == '101'
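To illustrate why knowing the message length helps, here is a minimal sketch of how repeated copies of a message could be reconciled. It is not steganos' actual decoding logic: collapse_repeats is a hypothetical helper, and the majority-vote rule is an assumption made for illustration.

```python
from collections import Counter


def collapse_repeats(recovered: str, message_bits: int) -> str:
    """Hypothetical helper: fold repeated copies of a message into one copy.

    `recovered` may contain '?' for bits that could not be determined.
    """
    result = []
    for position in range(message_bits):
        # Every message_bits-th character is another copy of the same message bit.
        votes = Counter(bit for bit in recovered[position::message_bits] if bit in '01')
        # Take the most common known value; stay unknown if no copy was recovered.
        result.append(votes.most_common(1)[0][0] if votes else '?')
    return ''.join(result)


print(collapse_repeats('1??101', 3))  # prints 101
```

The second copy fills in the bits missing from the first, which is how a length hint can turn a partial decoding into a complete one.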
Steganos encoding works by generating 'branchpoints' for a given original
text. Each branchpoint represents a change to the text that does not change the
meaning of the text. Each branchpoint is 'executed', which means that the
change it defines is made, according to the bits we are trying to encode. For
example, if we want to encode '10' in a text for which we can generate two
branchpoints, the first of those is executed and the second is not. Note that
if there are more branchpoints available than there are bits to encode, the bits
are repeated to make use of the spare capacity. For example, if we want to
encode '10' in a text with 4 branchpoints, steganos.encode automatically
encodes '1010', improving our ability to retrieve the encoded information from
an incomplete encoded text.
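The repetition rule can be sketched as follows. This is an illustration only, not the library's code: repeat_bits and executed_branchpoints are hypothetical names, and cycling through the bits when the capacity is not a multiple of the message length is an assumption.

```python
from itertools import cycle, islice


def repeat_bits(bits: str, capacity: int) -> str:
    """Repeat `bits` cyclically until all `capacity` branchpoints are covered."""
    if not bits:
        raise ValueError('bits must be non-empty')
    return ''.join(islice(cycle(bits), capacity))


def executed_branchpoints(bits: str, capacity: int) -> list:
    """Indices of the branchpoints that would be executed (those with bit '1')."""
    return [i for i, bit in enumerate(repeat_bits(bits, capacity)) if bit == '1']


print(repeat_bits('10', 4))            # prints 1010
print(executed_branchpoints('10', 4))  # prints [0, 2]
```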
Steganos decoding works by figuring out which branchpoints were executed on a given text. It does this by comparing the encoded text to the original.
Each branchpoint is represented as a list of changes. Each change is a tuple of length three. The first two elements are the start and end indices of the chunk to be removed from the text, and the third element is the text with which it is to be replaced. The end index is non-inclusive. Branchpoints are represented in this way so that they can be easily interleaved.
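As a rough sketch of how this representation can be used, the function below applies the changes of every branchpoint whose bit is '1'. The name execute_branchpoints and the sample branchpoints are hypothetical, and the sketch assumes that no two changes overlap; steganos' own implementation may differ.

```python
def execute_branchpoints(text: str, branchpoints: list, bits: str) -> str:
    """Apply the changes of each branchpoint whose corresponding bit is '1'."""
    changes = [change
               for branchpoint, bit in zip(branchpoints, bits) if bit == '1'
               for change in branchpoint]
    # Work from the end of the text backwards so that the indices of
    # changes earlier in the text remain valid after each replacement.
    for start, end, replacement in sorted(changes, reverse=True):
        text = text[:start] + replacement + text[end:]
    return text


text = 'I am 9 years old'
# Two made-up branchpoints with one change each: spell out the digit,
# and replace the first space with two spaces.
branchpoints = [[(5, 6, 'nine')], [(1, 2, '  ')]]
print(execute_branchpoints(text, branchpoints, '10'))  # prints I am nine years old
```

Because every change carries its own indices into the original text, changes from different branchpoints can be merged into one list and applied in a single pass.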
Adding a new type of branchpoint should only entail changes to src/branchpoints.py and test/branchpoints_test.py. Simply add a function that accepts a string and returns a list of branchpoints represented in the manner described above.
Note that there are functions called unicode_branchpoints, ascii_branchpoints and global_branchpoints in the branchpoints module. Functions that add branchpoints that take advantage of Unicode codepoints should be called from the unicode_branchpoints function. Other local branchpoints should be called from the ascii_branchpoints function.
Some changes to the text only make sense when applied universally (e.g. using Oxford commas). These can be represented as a single branchpoint with many changes. Functions that find global branchpoints should be called from the global_branchpoints function.
The get_all_branchpoints function in that module will then integrate the new branchpoints appropriately, and no further changes will have to be made.
Please note that adding new branchpoints will make it impossible to decode text that had been encoded before those branchpoints were added. As such, we should bump the version every time new branchpoints are added and keep track of which texts were encoded with which version.
Below is an arbitrary example of a function that finds branchpoints with multiple changes each. For every occurrence of the letter 'a', it generates a branchpoint that changes that 'a' to 'x' and changes the letter two positions before it to 'y'. This is of course not a legitimate branchpoint because it alters the semantics of the text.
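def example_branchpoints(text: str):
    # Skip an 'a' in the first two positions: there is no letter two before it.
    a_indices = [index for index, char in enumerate(text) if char == 'a' and index >= 2]
    # Each branchpoint holds two changes of the form (start, end, replacement).
    return [[(index - 2, index - 1, 'y'), (index, index + 1, 'x')] for index in a_indices]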
Get pytest with pip install pytest, then run py.test test/. There are no production dependencies.
- The code contains only sample global, ascii, and unicode branchpoints.
- Enable flag for 'ascii-only' branchpoints.