Skip to content

failpt/word-design-for-DNA-computing

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

55 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Word Design for DNA Computing on Surfaces

This is an experimental project attempting to solve problem 033 on CSPLib.

Given a fixed size $n$, I try to find a set of $n$ strings satisfying the problem constraints. I introduce a cnfencoder package containing DNAEncoder, which reduces the problem of finding a set of a given size to SAT (helps construct a CNF formula to feed the solver and decode the solution). In pursuit of experimentation, I use three different SAT solvers – Glucose, Kissat, and CaDiCaL (paper) – and compare their performance on fixed input sizes.

Running the code

Usage:

word_design.py [-h] [-n NUMBER] [-of OUTPUT] [-s SOLVER] [-q] [-ord]

Options:

  -h, --help            show this help message and exit
  -n NUMBER, --number NUMBER
                        desired set size                       (default: 25)
  -of OUTPUT, --output OUTPUT
                        output DIMACS file for the CNF formula (default: formula.cnf)
  -s SOLVER, --solver SOLVER
                        the SAT solver binary name             (default: bin/glucose-syrup)
  -q, --quiet           suppress solver output
  -ord, --order         makes sure the words are fully ordered (by bit encoding)

If you are not working on a Unix-based system, replace ./script.py with python script.py in all the following code.

Examples

  % ./word_design.py -of examples/example1.cnf -q

Output:

[bin/glucose-syrup] Runtime: 0.0367 seconds
CATCAACC | CGATGAGA | CCCGTAAA | AGAGACGT | TTCCCTTC | ACCCAAAC | CTTGGCAT | TGCCTCTT | CAGGACTT | CGTAGCTA | GGCGTTTA | AATCGAGG | TCCAAGCT | GCCTTAGT | CGACTGAA | GATTACGC | CACACGTA | CCCAATTG | GGACATTC | CTAACAGC | CTTCGTCA | TGTCACGA | ATGACGTC | CCATCCAA | CTACCCTT

  % ./word_design.py -n 54 -of examples/example2.cnf -s bin/kissat -q -ord

Output:

[bin/kissat       ] Runtime: 0.8142 seconds
GAACCATC | CTTCGATC | GGAAATGC | GGTTTAGC | CTAGACTC | CAATCCTG | CCAGGTAA | CTTATGGC | CACGTATC | GTCGCAAT | CCGACTTA | CCTTAGTC | CACCAACA | GCCTGTTA | CACACCAT | GACTTCCT | GTGAGCTA | CATCTTCC | CCACTACT | GTCAAGCT | CTCATCCA | GTGTCACA | GTTACCGT | CCCTCAAA | CGAACAAC | GCGTACAT | GCAACAGA | CCTAACAG | GTTCACCA | GCACCTAT | CCATACCA | CTCTCTCT | GTCCTTGT | CTACCCAA | GACACTCA | CCTCATGA | GCTCTCTT | CCTGCATT | CTCCATAC | ATCTGGCA | AGACCACA | TCTACCCA | ACCACACT | ACCGATCA | ACACACGT | AATCCCCT | AACAACCC | ACTCTGCA | ATCCCGTT | ATAACGCG | AAACGTCG | TTAGCCCT | TTCTACCG | ATACTCCC

  % ./word_design.py -n 69 -of examples/example3.cnf -s bin/glucose-syrup -q 

Output:

[bin/glucose-syrup] Runtime: 6.5772 seconds
GTAGAGCA | GACTTGAG | CCAGTCAT | ATGTGGGT | AGCGCATT | TTGGGTTG | GCGTCAAT | TACCCAGT | CCATGCTA | CGAGAAAG | TCTGTCTG | CGTCGTAT | GGTATGTG | CTTCAGAG | CTAGCTTC | TACTTCGC | ATCATGCG | TAGTCGTC | CATCACTC | GGAGTACT | GCCATTCT | CCGTAATC | TTTGCAGG | GTTGGTGA | AAGGTGGA | TTCCGTCT | CCTGATCT | ACGGATAG | CAAACGTG | TGGACACT | GTCTAAGC | CCTTCGTT | TATGGTCC | TCTCCTGA | TTGTCCAG | GACAAGGT | TTGAGCCA | GCCTCTTA | GTCACATG | ACTTGTGC | TACACTCG | CCACTACA | TGCTACTG | TGTCAAGC | CCGAGTTT | CTGTTGTG | TGTGTGGT | GCTCATTG | AGGTCCTA | CCAAATGC | GAATCGGA | CAGGGAAT | TACGGGTT | TAATGGCG | CTGCTTGA | GAGATTGC | TCGATAGG | CTTACCAC | GTTCCACT | GGCAGAAA | GAGCCTTT | AACAGGAC | TGAAACCC | GTGCATAC | ATGCTCCT | TCCTGGAA | AGGCTTTC | ACTAACCG | ACAGAACC

  % ./word_design.py -n 69 -of examples/example4.cnf -s bin/cadical -q -ord

Output:

[bin/cadical      ] Runtime: 1.3771 seconds
CGGCTAAA | CGGACATT | GCCCTTTT | GATTGGCA | GGTTTGTG | CGGGATTA | GGTCTACT | GCGAACTA | GCTTCCAT | GTTCGTTC | GACTCGTT | CCTGCTTT | GCATTGCT | GTTCAGGA | CAGATGGA | GTGTGCTT | CTTTTGCC | GTGGTGTA | GGGTTTCA | GTCAAGCT | CGTGTCAT | CGACGTTT | GGTATCGA | CCATTCTC | CAAGACAG | CTGAAGTC | CACGTTGT | GAGAGACT | CCGTGTAA | GTCGGAAA | CTACCGTA | GCAAAGAC | CAGTGATC | CTTCACCT | GCTGTATC | CCGAATCT | CTCTGGAT | CCCACAAA | CTTGGATG | GGGTAGAT | GGATGCAA | CTCATTCG | CTCGATAC | GATGATCC | GTAAGAGG | GCACATCA | GAGTCTAC | GTGTCAGA | TCTACGCA | ACTCTAGG | TCCTATGG | ATAGCTCC | ACTTGGTC | TCGCAAAG | AGAGAGGT | TACCTACG | TGACTGCA | TGGACTAG | TCGAGCAT | AGTGCGAA | ACGGTTAG | AACTGCGT | TAGAGGTG | TGTTCAGG | TGGCTTGT | TCAGGGAA | AATCCGCT | ATCCCCTT | TCGGCATA

  % ./word_design.py -n 83 -of examples/example5.cnf -s bin/cadical -q

Output:

[bin/cadical      ] Runtime: 11.3519 seconds
CGATTTGC | TCAACGGA | CCATGGAT | ATCTTCGC | ATTAGGGC | ACTCTCAC | TACGTACG | CTCAGTGA | TCGCGATT | CGTTGAGA | GCTATCCA | AGAAAGGG | GTGGACTA | TATCCCTG | GGAGACAT | GAGCGTAA | GGGTATAG | AGCTTGCA | TGTGAGGA | CAGAACGT | GTGGCTAT | GATACGGT | GATGCAAG | GTCAAGTC | GCTCTTTG | AGCACTGT | GCCTACTT | CGACAAAG | GAAGCGTA | TGACACCA | AGTCAAGC | CGCCATTA | CCAAGCTA | ACCTAGAG | AGACCGAT | ATGCACCT | GTTTTGGG | GGGACAAA | GAATCACC | GACTGTGT | TCCAGTTG | CAAGATGG | ACGAGTAC | AGTGTTGG | TTCTGCAG | CACGTGTT | CCTCCAAA | GGTCCATT | CTGGTAAG | TGTCGTGT | TTTCCGAC | TAAGGCGT | GACACTCA | CGGTAACT | CGAGGTAA | GCGTTAGA | TCAGGAAC | CTCTCCTA | CCTGATCA | AAACCTGC | CCTGTAGT | ACGGATGT | AACAACCC | GGTCTGAA | CCTTTGTC | CTCCTACA | CACGACAA | GAGCTATC | GTAGAAGC | ATGGCAGA | ACTGGGTA | ATGATGCG | AGCTCAAC | CATAAGCG | GGAGTACA | TTGGTGGT | AGGGAATG | GTAAGGCA | CATCTTCC | AACCGAAG | AAGGTCCA | CTGATCTC | TCAGACTG

  % ./word_design.py -n 71 -of examples/example6.cnf -q

Output:

[bin/glucose-syrup] Runtime: 10.3126 seconds
TTTCCGCT | TGAAAGGG | CTGGAAAG | GCCGTAAT | ATACGCGT | TTCGTCGT | TTAGCAGG | TACTACCG | TGCCTGAA | GCTTGCAA | TCGTGCTT | CTATGACC | TGGGTGTT | TAAAGCGC | GGGAGAAT | GAATGGTG | TGACTCCT | CTTCTAGG | GATGGAAC | GACTCGAT | TGGTCATC | AACTGCTC | GCAGATCT | TAAGCGTC | TTCACCTG | GGTACCTT | TCGGAACA | CGCAACAT | GCAGGATA | GGCATTGA | AGCACAGT | AGTGCGTA | CACATGCT | CTACCTAC | AGTTGGGT | TGCTTAGG | GATAGTCG | CGAAATCC | CAGCCAAT | TACCGTAG | GAGAGCTA | CTGAGTTG | ACGTGAGA | GTGTCACA | CCAACCTA | GCATAGGA | AGATGCAG | TGCCGATT | ACCACTTC | AGTAACCG | TCCTCGTA | ATGAGCAC | ACAAGTGG | CATTGCCT | ATCCCTGA | CAGTCTGA | GTCTGAGT | CCTTGTTC | GGTAAAGC | AGAGTGAC | TCGTAGAG | ACCGTCTA | CATACACC | CTTGGTGT | GTCAAGAG | CGAGCTTT | GGATACTC | CCGATTAC | GCTTCTGT | TGTAGGTC | CATGAGTG

You may view the .cnf files for each of the examples inside examples/.

Encoding

Notation: $[n]$ denotes the set {0, ..., n-1}, $w^b$ is $w$ if $b=0$ and $\neg w$ otherwise

We encode a word with two bits per letter: A=10, C=00, G=01, T=11; this bijection is particularly convenient because it breaks the set into {A, T} and {C, G} by the first bit, and we only need to flip one bit to get a complement letter.

We encode a set of size $n$ with two/four sets of variables, depending on the value of eliminate_permutations the user passes to DNAEncoder: $Var(w, p, b)$ representing the $b$-th bit of the $p$-th position of word $w$ indexed from 0, $Match(w_1, w_2, p)$ representing wether words $w_1$ and $w_2$ have the same letter in position $p$ and, if eliminate_permutations=True, $PrevEq(w_1, w_2, p, b)$ representing whether words $w_1$ and $w_2$ are identical up to and excluding the $b$-th bit at $p$-th position, and $CurrEq(w_1, w_2, p, b)$ representing equality up to and including the current bit. We need $PrevEq,CurrEq$ to order every pair of words (by MSB), effectively eliminating permuations.

The last two were made in an optimization attempt via search space reduction, but in practice it ended up slowing down certain solvers, because they had to seed out sets that were technically correct until they were to find an ordered one. That is why I added the --order option instead of hardwiring it into the encoder.

We represent the problem constraints as a conjunction of the following clauses:

  • For every $C\subset [8]$ s.t. $|C|=5$:

    $\bigwedge_{w\in [n]}(\bigvee_{p\in C}Var(w, p, 0))\wedge (\bigvee_{p\in C}\neg Var(w, p, 0))$

    (No 5 letters of any word belong to {A, T} or {C, G})

  • $\forall w_i, w_j: i,j \in [n]$ and $i<j$ (to eliminate redundant clauses), position $p \in [8]$, and bit pattern $b_0, b_1 \in$ {0, 1}:

    • $Var(w_i, p, 0)^{b_0} \vee Var(w_i, p, 1)^{b_1} \vee Var(w_j, p, 0)^{b_0} \vee Var(w_j, p, 1)^{b_1} \vee Match(w_i, w_j, p)$

      I.e., if $w_i, w_j$ have the same letter at position $p$, the $Match(w_i, w_j, p)$ variable must be True.

    • $\forall C\subset [8]$ s.t. $|C|=5$:

      $\bigvee_{p\in C} \neg Match(w_i, w_j, p)$

    (Hamming distance)

  • $\forall w_i, w_j: i,j \in [n]$ and $i\leq j$, position $p \in [8]$, and bit pattern $b_0, b_1 \in$ {0, 1}:

    • $Var(w_i, p, 0)^{b_0} \vee Var(w_i, p, 1)^{b_1 \oplus 1} \vee Var(w_j, 7-p, 0)^{b_0} \vee Var(w_j, 7-p, 1)^{b_1} \vee Match(w_i, w_j, p)$

      We evaluate $w_j$ at $7-p$ to get its reverse; the second bit of $w_i$ is flipped to get its complement.

    • $\forall C\subset [8]$ s.t. $|C|=5$:

      $\bigvee_{p\in C} \neg Match(w_i, w_j, p)$

    (Reverse & Watson-Crick distance)

  • If eliminate_permutations=True:

    $\forall w_i, w_{i+1}: i\in[n-2]$, position $p \in [8]$, and bit $b \in$ {0, 1}:

    • $\neg CurrEq(w_i,w_{i+1},p,b)\vee PrevEq(w_i,w_{i+1},p,b)$

    • $\neg CurrEq(w_i,w_{i+1},p,b)\vee \neg Var(w_i,p,b)\vee Var(w_{i+1},p,b)$

    • $\neg CurrEq(w_i,w_{i+1},p,b)\vee Var(w_i,p,b)\vee \neg Var(w_{i+1},p,b)$

      I.e., $CurrEq$ implies $PrevEq$ and equality of the current bits ( $Var(w_i,p,b) \leftrightarrow Var(w_{i+1},p,b)$ ).

    • $\neg PrevEq(w_i, w_{i+1}, p, b) \vee \neg Var(w_i, p, b) \vee Var(w_{i+1}, p, b)$

      I.e., if $PrevEq$ is true, we forbid the case where $w_i$ has a 1 and $w_{i+1}$ has a 0.

    • $\neg CurrEq(w_{n-2}, w_{n-1}, 7, 1)$

      I.e., words cannot be identical.

    (Total order)

Experiments

Experiments were run on Apple M3 Pro CPU (arm64) with 18 GB of unified memory on macOS 15.4.1 (Sequoia).

To compare the performance of the three SAT solvers, I fed the set sizes {15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 66, 67, 68, 69, 70, 71} to all three solvers and {72, 74, 75, 76, 78, 80, 81, 82, 83} only to Kissat and CaDiCaL, deeming Glucose too slow to continue. I measured the results with hyperfine, which ran each batch three times.

I compared the mean runtimes of the solvers with and without the ordering constraint that eliminated all word permutations.

I conclude that for larger sets, Kissat performs better when the ordering constraint is enforced, while CaDiCaL and Glucose are actually getting significantly slower, and that the current fastest option for finding them is unordered CaDiCaL.

You may find the .cnf for each size in experiments/cnfs/ and the hyperfine reports in experiments/logs/.

If you wish to play with the plotting, you may edit experiments.py and run

% ./experiments.py

from the root directory.

Results

Archive The four biggest sets I found so far are of size 84.

You may find the corresponding formulas in results/order84.cnf and results/norder84.cnf respectively.

  1. A sequence with ordering found by Kissat in 107.3449 seconds:

    GACCATTG | GAATCCTG | CTCCAATC | GGTTTTCG | GAAGTCCT | GTCTGTTC | CCTTTCCT | CAAATCGG | GCAACTCT | CGCTTAAG | CCAGACTT | GGAATGAG | GACTAACC | GGGTCAAT | GCGTAAGA | CAGTTAGC | CACTCTCT | CCAACAAG | CGTAGATG | GATAGTCC | GATACAGG | GTATGACG | TGAGCCAT | TATGCCTC | TACCTTGC | TTCGAGAC | TGTCCGAA | TGACATGG | TCACGACT | ACGTGTAC | ACTCCTTG | ACGGTAAG | AATCCGCT | AAAACCCC | AACGGTAG | AGGCTACT | TACTGAGG | TTATCGGG | TCGTAGAG | TGAGTACG | TTCGCTTG | AACTGCCA | ATGAGGAG | TCAATGGC | TGTGGAAC | AGCTACAC | ATGCTCTG | TGCTGGAT | AGAACGGT | ACGCACAT | TCAGAGCA | ACTCTGAC | AAGCCAAC | TTTTGGCC | TTGGTGGA | TTGACAGC | AGTTCAGC | ACAGCAGA | TAAAGGCG | TGGTGTTG | ACCATCTC | AACACGTG | TCAGGTTC | ATCTTGGC | TGTCTCGT | TATGAGGG | TGCGACTA | TTCTTCCG | ACATTGCG | ACAGGGAT | TCCCTATG | TTGCAACG | TACCGCTT | TGGATCAC | TCGCTTCA | AGTGTGCA | ACAAGTGG | ACACATCC | TCTTCACG | ACGTCGTT | AGACCCTA | AGACGAAG | TCCAATCG | TTACGCAC
    
  2. A sequence without ordering found by Kissat in 158.0229 seconds:

    TGCCATCA | GAAATCCC | CATGTCCT | TCACGTTC | GGTCTGAA | AGTGCCTT | ATGTGTCC | AAGCGAAC | ACCATTCC | AACTGCTC | TCGTTACC | TGTCTAGC | GCATTCGA | CCTCCAAT | CAAACGAG | AGCCGTTT | ATCTCACG | ACTGCTCA | TTCGCTCT | ACGGATAG | TAGCCGAA | CTATAGCG | CACCACTA | TAAGCCAC | CAGATGCA | GGCAAACT | ACCAGAAG | CTTGCCAA | ATGACCCA | TCTGAGGA | ACACGCAA | ACCCTAGT | TGGTCCTA | TGTCAGTG | TATCCTCC | CAACTACG | CCGCTTTA | CCATGATG | GACACCTT | GTCAACAG | CCTATGTC | TCGATGAG | CCAAACCT | AGACAACC | CCTTTCAG | CCCAAGAA | GTACCATC | ACCGAGTT | AGCACTGA | CCTTATCC | ATCCACCT | ACTCAGAC | GCCTTGAT | GCCTGTTA | CGGTAATC | AATCCCGA | CGCTTGTA | AAGTCCAG | ATACCTGG | TTCCTCAC | GCATCACT | ACTTCGTG | CGCATAAC | TCTAGCAC | GCTACTAG | TACCCATG | GCAGAAAC | TCTCTCCA | CATAGAGC | GTGCATAC | CTCAGGTT | ACGTAGCA | TTACGGAG | CAGAGCAT | CGAACACA | AGGAACAC | GACCTTCT | CCACAAGA | TCAGACTG | TTGACGTC | ACAACAGC | GTTGAGTC | TACGTTGC | TACGGGAT
    
  3. A sequence with ordering found by CaDiCaL in 901.8953 seconds:

    CGAACACA | CTCACTAC | GTTATGGC | CAAAGCAC | CGATTTGC | GTGACAAG | CTCATAGG | GAATGTCC | GTCTCATC | CATCATCG | CCAGAATC | CTGTTACC | CTTCAAGC | CGAGATAG | CTACAGTG | GAAGAAGG | CTCCTTCA | CCTAGACT | GCCCATAA | TCGACATC | ACCTATGG | ATAGGACC | TCGGATCA | ACTCTGCA | TAGATCGC | TGTGACTC | TTAACCCC | AACGCAGT | AGCGAAAC | ATCTCGGA | TCATCGTG | TCACTACG | ATGCCACT | TTGGGTAG | TAGGCGAA | TCCAACGA | TGGTGAAC | TGGAATGG | TTTCCGTC | TGCCCAAA | AAGGTGTC | ATCCACTC | TACATGCG | TATCCCCT | TCTCCTGA | ACTTGCGA | ACACAGAC | AGTTCCTG | ATGGATGC | AACCTACC | AAAACCGG | ACGTGTCT | AGGATTCC | TAGCACAG | TAACCAGC | TGCTGTCA | ACTACCAC | TTCTACCG | ACGAAGTG | AATCCGAG | TTTGCAGG | AATAGGCC | TCCCTTTC | ACCGTATG | ACTCGATC | AAGCGTAC | TGGCGTTT | TAAGGCTG | TGACTCAC | TCAGCTAC | ACCACTCA | ACCCTCAT | AAGTGAGG | TACTCCAC | TCTTAGCC | ATCCGAAG | AAGTACCC | TCGAGCAT | TATGGTGC | ACAAGTGC | TCCTCACT | AGAACGTC | AGCAGTTG | AGCTTGAG
    
  4. A sequence without ordering found by CaDiCaL in 78.1115 seconds:

    TTTTGGGG | TTTTCCCC | TTACCGGT | CTTACGCT | TTAAGGCC | TTAGGCGA | TTACACCG | TGTACGAC | TTGATCGG | TTCACGTG | GTTAGCCA | TTGCTGCA | TTGGGACT | TTGGATGC | GGGATTGT | TTCGCCAT | ATCTTGGC | TGTAGACG | TTGCGTTG | TTCCAGAC | CTTCTCTC | TACCTCGA | TAGGCAGA | CCAAAAGG | TACAGGGT | AGTGAAGC | TAGGTCTC | CTGACTAG | TAAGTGGG | TATGAGCC | TACGGAAC | TACTCGCA | TAACCACC | TATCCTGG | GACGATAG | TACTGCTG | TAGTGTCC | TCGCACTT | TGCTACGT | GTTACTGC | TGAACCCT | CATTGAGC | AGATCTGC | GTATCTCG | TCTAACGC | TCTGGTGT | TCCTATCG | TGCATTCC | TCGTCATC | GTACTAGG | ATGTCAGG | TGAGGGAT | ATGAGCAC | TGAGCTTG | ATTGGCTG | ATACGGAG | GTAGAGTG | TCTCTGAG | TCCCTACT | CGATGACT | CACCTATG | TGGTAGAG | TCACTTGC | TCCTTCAC | GGTATCAG | AACCCCTT | ATCGAGCT | GGTTTGTC | ACTTCGTG | GCGAATTC | GATCTTCC | ATCGCATC | AACAGTCG | CGTTACTG | ACATGGGT | TATCGCCT | AGAGTCGT | AAAACCGG | AAGCACAG | AGGATGTG | ATTCGACC | AGCAACTC | AATGCACG | AAACAGGC
    

The biggest set so far contains 85 words and was found by CaDiCaL in 819.2578 seconds:

TCCGATCT | GTATTCCG | CCAATGAC | AACTCCTG | GATATCGC | AGACATCG | GCAGCTTA | GATGGGTA | AATCCGGT | TGATCCTC | TCTGAGTG | GGTGAACT | CATACGTG | TGGATCTG | ATGGGCTA | TGGTCTGT | CGAGATGA | CCTGTCTA | CTGTGAGT | ATTGGTCG | GACCATTG | GACTACGT | ATACGGCA | CTTCTCAC | AGAACCCT | CCGTTAAG | ATGCCATG | GCCTTGTA | TCGATGCT | TACAGCCT | CTGTTTCC | ATCGCTGT | GAAGTGGT | GATTGTGG | TGAAGACG | TGTCATGC | TCGTGCTT | GAACCTCT | ATGGTAGC | TAGCTTCG | CCAGTACT | GACCTACA | TTTCGACC | GGAATGCA | CCTCGTTT | TCCTACAC | AGTTGCGT | ACCTGTGA | AACCTGAC | AGCGAGAA | GGGTGTTA | GTCTGTAC | CGGTAACA | CGTTGATG | GTCAACCA | CTCGATTC | TGGCAAAG | AGGAAGGT | TATCGGAG | AAGAGTCC | GGTGTTAC | TCAGGAAC | CAGCCTTA | ACGAGGAA | ATTGCGTC | TTAGTGCC | ATCGTCAG | AATCGCTC | GGCTAATC | ATGTCCAC | GGTTCGTT | CCTTCTAC | GCAAGTGT | GTTCTGCT | TGCTTAGG | TCCACACA | TCTAGCGA | CCTTTGGT | CTAGCCAA | GCGCTTAT | CCCAACTT | TTATGGGG | TGGGCATA | TAGTGGCA | CACAGTAG

You may find the corresponding formula in results/norder85.cnf

About

Looking for solutions of certain sizes with Glucose, Kissat and CaDiCaL

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages