update

Apoorv Malik · Apoorv Malik · commit d84cc7462d17 · 2024-06-04T01:18:17.000-07:00
diff --git a/eval/rnastralign/README.md b/eval/rnastralign/README.md
@@ -0,0 +1,22 @@
+# RNAstralign Evaluation
+This directory contains the structural distance evaluation script and the RNAstralign dataset.
+
+## Data
+Four families (Group I Intron, tmRNA, tRNA, and 5S rRNA) are used for parameter tuning, and another four (SRP, RNaseP, telomerase, and 16S rRNA) are used for testing. Group I Intron, 5S rRNA, SRP, RNaseP, and 16S rRNA each contain multiple subfamilies, so we chose one subfamily for each of these five families (see the table below for details).
+
+| family     | subfamily           | avg. seq. len. | avg. seq. identity |
+|------------|---------------------|----------------|--------------------|
+| Group I    | IC1                 |          428.5 |               0.31 |
+| tmRNA      | -                   |          367.4 |               0.35 |
+| tRNA       | -                   |           77.1 |               0.48 |
+| 5S rRNA    | Bacteria            |          116.2 |               0.61 |
+|------------|---------------------|----------------|--------------------|
+| SRP        | Protozoan           |          285.8 |               0.35 |
+| RNaseP     | Bacterial           |          360.0 |               0.43 |
+| telomerase | -                   |          444.9 |               0.45 |
+| 16S rRNA   | Alphaproteobacteria |         1419.2 |               0.85 |
+
+The data come in two versions: an aligned version (all homologs in each sample are aligned) and an unaligned version:
+- Aligned Version: [data/aln/](./rnastralign/data/aln/)
+- Unaligned Version: [data/no-aln/](./rnastralign/data/no-aln/)
+
diff --git a/eval/rnastralign/get_sequence_identity.py b/eval/rnastralign/get_sequence_identity.py
@@ -30,14 +30,6 @@ def main(data_path):
             avg_identity = utility.calculate_msa_seq_identity(seqs)
             seq_identities[family].append(avg_identity)
 
-    # Print formatted results
-    # print("Sequence Identity")
-    # print("{:<8}\t{:>10}".format("Family", "Identity"))
-
-    # for family in sorted(seq_identities):
-    #     avg_value = np.mean(seq_identities[family])
-    #     print("{:<8}\t{:>10.2f}".format(family, avg_value))
-
     # Print formatted results
     print("Sequence Identity")
     print("{:<8}\t{:>10}\t{:>10}".format("Family", "Identity", "Length"))