CLOP: Contrastive Language-Omics Pre-training

Project description

CLOP aims to provide a shared embedding space for omics sequences (DNA, RNA, protein) and text descriptions of their functions, which can be used for fast downstream analysis.

It is based on the CLIP architecture, which jointly trains an image transformer and a text transformer to project pictures and captions, respectively, into the same embedding space.

In CLOP, we use Frequency Chaos Game Representation to represent DNA sequences as a "fingerprint" image of fixed dimension.

This transformation allows us to work with sequences of very different lengths without the limitations imposed by a context window.
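
As a rough illustration of the idea, the sketch below builds a frequency matrix of k-mers laid out on a chaos game grid. The function name `fcgr`, the choice of `k`, and the corner assignment in `CORNERS` are assumptions made for this example; the project may use different settings or a dedicated library.

```python
# Assumed corner layout; conventions differ between tools.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(sequence, k=3):
    """Return a 2**k x 2**k matrix of normalized k-mer frequencies."""
    size = 2 ** k
    counts = [[0] * size for _ in range(size)]
    seq = sequence.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(base not in CORNERS for base in kmer):
            continue  # skip k-mers containing ambiguous bases such as N
        x = y = 0
        # As in the chaos game, the last base selects the coarsest
        # quadrant, so bases are read from last to first.
        for base in reversed(kmer):
            bx, by = CORNERS[base]
            x = 2 * x + bx
            y = 2 * y + by
        counts[y][x] += 1
        total += 1
    if total == 0:
        return [[0.0] * size for _ in range(size)]
    return [[c / total for c in row] for row in counts]


short_img = fcgr("ACGTACGTAC", k=3)
long_img = fcgr("ACGGT" * 200, k=3)
# Both images are 8 x 8 regardless of sequence length.
print(len(short_img), len(long_img))
```

The output size depends only on `k`, which is why sequences of any length map to an image of the same shape.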

We directly fine-tune the CLIP transformers using these DNA images and function texts.
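
For readers unfamiliar with the objective behind CLIP-style training, here is a minimal pure-Python sketch of a symmetric contrastive loss over a batch of paired embeddings. It is not the project's training code: the function names, the toy vectors, and the fixed `temperature` value are assumptions for illustration only.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    loss = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        loss += log_sum - row[i]
    return loss / len(logits)


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i is (image_embs[i], text_embs[i])."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    # Cosine similarity of every image with every text, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits)
                  + cross_entropy_diagonal(transposed))


# Toy embeddings: matched pairs point in the same direction.
dna_images = [[1.0, 0.0], [0.0, 1.0]]
function_texts = [[1.0, 0.0], [0.0, 1.0]]
aligned = clip_style_loss(dna_images, function_texts)
shuffled = clip_style_loss(dna_images, function_texts[::-1])
print(aligned < shuffled)
```

The loss is low when each DNA image is most similar to its own function text and high when the pairs are mismatched, which is the signal fine-tuning would optimize.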

Status

Fine-tuning of the model could not be completed in time. Two work-in-progress demos are available:

A Telegram bot that returns the image representation of input DNA sequences: https://t.me/clip_clop_bot
A mock interface on GitHub Pages that suggests functions related to an input sequence: https://baudrly.github.io/clop/

Use cases

The shared embedding can be used directly for various downstream genomic analyses, such as predicting the function of an input sequence, finding closely related sequences with similar functions, or zero-shot classification of DNA sequences (e.g. to detect contaminating sequences).

graph LR

    subgraph func[Function prediction]
        CLOPFUN[CLOP]
    end
    subgraph fuzz[Fuzzy matching]
        CLOPFUZ[CLOP]
        MATCH["🧬🧬🧬"]
    end
    subgraph zero[Zero shot classification]
        CLOPZERO[CLOP]
    end
  AFUN["🧬"] -->|embed| CLOPFUN
  CLOPFUN -->|closest texts| FUN["Antibiotic resistance\nAntibiotic degradation"]
  AFUZ["🧬"] -->|embed| CLOPFUZ
  CLOPFUZ -->|closest dna| MATCH
  AZER["🧬"] -->|embed| CLOPZERO
  DOL["🐬"] -->|embed| CLOPZERO
  BAC["🦠"] -->|embed| CLOPZERO
  CLOPZERO --> |similarity| DOLSIM["🐬, 🧬"]
  CLOPZERO --> |similarity| BACSIM["🦠, 🧬"]
  BACSIM --> MAX
  DOLSIM --> MAX
  MAX --> SELECT["🦠"]
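
The zero-shot branch of the diagram can be sketched as a cosine-similarity comparison followed by an argmax. The embeddings below are made-up toy vectors and the function names are hypothetical; in practice the vectors would come from the trained encoders.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(query_emb, label_embs):
    """Return the label whose embedding is closest to the query, plus all scores."""
    scores = {label: cosine(query_emb, emb) for label, emb in label_embs.items()}
    return max(scores, key=scores.get), scores


# Toy embeddings standing in for encoder outputs (assumed values).
label_embs = {
    "dolphin": [0.9, 0.1, 0.0],
    "bacteria": [0.1, 0.9, 0.2],
}
query_emb = [0.2, 0.8, 0.1]  # embedding of the input DNA sequence

best_label, scores = zero_shot_classify(query_emb, label_embs)
print(best_label)
```

Function prediction and fuzzy matching follow the same pattern, with the candidate set replaced by function-text embeddings or other sequence embeddings.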

Training data

For this demo, we restricted the training set to human transcript sequences (genome assembly GRCh38) and their functional annotations, available for download from https://www.ncbi.nlm.nih.gov/genome/guide/human/

We further subsampled 50,000 sequence-annotation pairs for the fine-tuning experiment.
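
A reproducible subsample of this kind could be drawn as follows. This is only a sketch: the function name, the seed, and the in-memory list of pairs are assumptions, and the source does not describe how the subsampling was actually done.

```python
import random


def subsample_pairs(pairs, n=50_000, seed=0):
    """Draw up to n (sequence, annotation) pairs uniformly without replacement."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    if len(pairs) <= n:
        return list(pairs)
    return rng.sample(pairs, n)


# Placeholder data standing in for real transcript/annotation pairs.
all_pairs = [(f"seq{i}", f"function {i}") for i in range(100)]
subset = subsample_pairs(all_pairs, n=10, seed=1)
print(len(subset))
```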

Acknowledgement

This project originated at the 2023 SDSC-hackathon on Generative AI. It was initiated by the team Swiss-Androsace (see members in the LICENSE copyright notice).
