Merge pull request #24 from yutanagano/develop
Prepare for first stable release
yutanagano authored Jun 10, 2024
2 parents dee6a09 + 663db23 commit 382b2e5
Showing 20 changed files with 288 additions and 348 deletions.
16 changes: 16 additions & 0 deletions .readthedocs.yaml
@@ -0,0 +1,16 @@
version: 2

build:
  os: ubuntu-22.04
  tools:
    python: "3.12"

sphinx:
  configuration: docs/conf.py

python:
  install:
    - method: pip
      path: .
      extra_requirements:
        - docs
21 changes: 21 additions & 0 deletions LICENSE.txt
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2024 Yuta Nagano

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
185 changes: 17 additions & 168 deletions README.md
@@ -1,179 +1,28 @@
# SCEPTR

> [!NOTE]
> The latest version of SCEPTR no longer supports Python versions earlier than 3.9.
**S**imple **C**ontrastive **E**mbedding of the **P**rimary sequence of **T** cell **R**eceptors (**SCEPTR**) is a BERT-like attention model trained on T cell receptor (TCR) data.
It maps TCRs to vector representations, which can be used for downstream TCR and TCR repertoire analysis such as TCR clustering or classification.

## Installation

### From [PyPI](https://pypi.org/project/sceptr/) (Recommended)

```bash
pip install sceptr
```

### From Source

> [!IMPORTANT]
> To install `sceptr` from source, you must have [`git-lfs`](https://git-lfs.com/) installed and set up on your system.
> This is because you must be able to download the trained model weights directly from the Git LFS servers during your install.
<div align="center">

#### Using `pip`

From your Python environment, run the following replacing `<VERSION_TAG>` with the appropriate version specifier (e.g. `v1.0.0-alpha.1`).
The latest release tags can be found by checking the 'releases' section on the github repository page.

```bash
pip install git+https://github.com/yutanagano/sceptr.git@<VERSION_TAG>
```

#### Manual install
# SCEPTR

You can also clone the repository, and from within your Python environment, navigate to the project root directory and run:
[![Latest release](https://img.shields.io/pypi/v/sceptr)](https://pypi.org/p/sceptr)
![Tests](https://github.com/yutanagano/sceptr/actions/workflows/tests.yaml/badge.svg)
[![Documentation Status](https://readthedocs.org/projects/sceptr/badge/?version=latest)](https://sceptr.readthedocs.io)
[![License](https://img.shields.io/badge/license-MIT-blue)](https://github.com/yutanagano/tidytcells?tab=MIT-1-ov-file#readme)

```bash
pip install .
```
### Check out the [documentation page](https://sceptr.readthedocs.io).

Note that even for manual installation, you still need `git-lfs` to properly de-reference the stub files at `git-clone`-ing time.
</div>

#### Troubleshooting
**SCEPTR** (**S**imple **C**ontrastive **E**mbedding of the **P**rimary sequence of **T** cell **R**eceptors) is a small, fast, and accurate TCR representation model that can be used for alignment-free TCR analysis, including for TCR-pMHC interaction prediction and TCR clustering (metaclonotype discovery).
Our [manuscript (coming soon)](about:blank) demonstrates that SCEPTR can be used for few-shot TCR specificity prediction with improved accuracy over previous methods.

A recent security update to `git` has resulted in some difficulties cloning repositories that rely on `git-lfs`.
This can produce an error along the lines of:
SCEPTR is a BERT-like transformer-based neural network implemented in [PyTorch](https://pytorch.org).
With the default model providing best-in-class performance with only 153,108 parameters (typical protein language models have tens or hundreds of millions), SCEPTR runs fast, even on a CPU!
And if your computer does have a [CUDA-enabled GPU](https://en.wikipedia.org/wiki/CUDA), the sceptr package will automatically detect and use it, giving you blazingly fast performance without the hassle.

```
fatal: active `post-checkout` hook found during `git clone`
```
sceptr's API exposes three intuitive functions: `calc_vector_representations`, `calc_cdist_matrix`, and `calc_pdist_vector`, which are all you need to make full use of the SCEPTR models.
What's even better is that they are fully compliant with [pyrepseq](https://pyrepseq.readthedocs.io)'s [tcr_metric](https://pyrepseq.readthedocs.io/en/latest/api.html#pyrepseq.metric.tcr_metric.TcrMetric) API, so sceptr will fit snugly into the rest of your repertoire analysis workflow.

If this happens, you can temporarily set the `GIT_CLONE_PROTECTION_ACTIVE` environment variable to `false` by prepending `GIT_CLONE_PROTECTION_ACTIVE=false` before the install command like below:
## Installation

```bash
GIT_CLONE_PROTECTION_ACTIVE=false pip install git+https://github.com/yutanagano/sceptr.git@<VERSION_TAG>
pip install sceptr
```

This is [a known issue](https://github.com/git-lfs/git-lfs/issues/5749) for `git` version `2.45.1` and [is fixed](https://lore.kernel.org/git/xmqqr0dheuw5.fsf@gitster.g/T/#u) from version `2.45.2`.

## Prescribed data format

> [!IMPORTANT]
> SCEPTR only recognises TCR V/J gene symbols that are IMGT-compliant, and also known to be functional (i.e. known pseudogenes or ORFs are not allowed).
> For easy standardisation of TCR gene nomenclature in your data, as well as filtering your data for functional V/J genes, check out [tidytcells](https://pypi.org/project/tidytcells/).
SCEPTR expects to receive TCR data in the form of [pandas](https://pandas.pydata.org/) [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html?highlight=dataframe#pandas.DataFrame) instances.
Therefore, all TCR data should be represented as a `DataFrame` with the following structure and data types.
The column order is irrelevant.
Each row should represent one TCR.
Incomplete rows are allowed (e.g. only beta chain data available) as long as the SCEPTR variant that is being used has at least some partial information to go on.

| Column name | Column datatype | Column contents |
|---|---|---|
|TRAV|`str`|IMGT symbol for the alpha chain V gene|
|CDR3A|`str`|Amino acid sequence of the alpha chain CDR3, including the first C and last W/F residues, in all caps|
|TRAJ|`str`|IMGT symbol for the alpha chain J gene|
|TRBV|`str`|IMGT symbol for the beta chain V gene|
|CDR3B|`str`|Amino acid sequence of the beta chain CDR3, including the first C and last W/F residues, in all caps|
|TRBJ|`str`|IMGT symbol for the beta chain J gene|
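
The format above can be sanity-checked before any data reaches SCEPTR. The following is a minimal sketch using plain Python dictionaries rather than a `DataFrame`, so that it runs without third-party packages. The helper name `find_format_problems`, the regular expression, and the example gene symbols and CDR3 sequences are illustrative assumptions and are not part of the `sceptr` package.

```python
import re

# Column names taken from the table above.
REQUIRED_COLUMNS = ("TRAV", "CDR3A", "TRAJ", "TRBV", "CDR3B", "TRBJ")

# Assumed check: upper-case amino acid letters, first residue C, last residue W or F.
CDR3_PATTERN = re.compile(r"C[ACDEFGHIKLMNPQRSTVWY]*[WF]")


def find_format_problems(rows):
    """Return a list of readable problems; an empty list means the rows look fine."""
    problems = []
    for index, row in enumerate(rows):
        missing = [col for col in REQUIRED_COLUMNS if col not in row]
        if missing:
            problems.append(f"row {index}: missing columns {missing}")
        for col in ("CDR3A", "CDR3B"):
            value = row.get(col)
            # None marks unavailable data, which the format allows for incomplete rows.
            if value is not None and not CDR3_PATTERN.fullmatch(value):
                problems.append(f"row {index}: {col}={value!r} is not a valid CDR3 string")
    return problems


# Illustrative placeholder values only.
good_rows = [
    # A fully paired TCR.
    {"TRAV": "TRAV12-2*01", "CDR3A": "CAVRDSNYQLIW", "TRAJ": "TRAJ33*01",
     "TRBV": "TRBV7-9*01", "CDR3B": "CASSLGQAYEQYF", "TRBJ": "TRBJ2-7*01"},
    # A beta-chain-only TCR: the alpha chain fields are present but empty.
    {"TRAV": None, "CDR3A": None, "TRAJ": None,
     "TRBV": "TRBV7-9*01", "CDR3B": "CASSLGQAYEQYF", "TRBJ": "TRBJ2-7*01"},
]
bad_rows = [
    # Lower-case CDR3 and a missing TRBJ column.
    {"TRAV": None, "CDR3A": None, "TRAJ": None,
     "TRBV": "TRBV7-9*01", "CDR3B": "casslgqayeqyf"},
]

print(find_format_problems(good_rows))
print(find_format_problems(bad_rows))
```

Rows that pass such a check could then be handed to `pandas.DataFrame(rows)`. Note that this sketch only treats `None` as missing data, and it does not verify that the V/J gene symbols are IMGT-compliant or functional.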

## Usage

### Functional API (`sceptr.sceptr`)

The eponymous `sceptr` submodule is the easiest way to use SCEPTR.
It loads the default SCEPTR variant (currently `ab_sceptr`) and exposes its methods directly as module-level functions.

> [!TIP]
> To use the functional API, import the `sceptr` submodule like so:
> ```
> from sceptr import sceptr
> ```
> Attempting to access the submodule as an attribute of the top level module
> ```
> import sceptr
>
> sceptr.sceptr.calc_vector_representations() #...do something...
> ```
> will result in an error.
---

#### `sceptr.sceptr.calc_vector_representations(instances: DataFrame) -> ndarray`

Map a table of TCRs provided as a pandas `DataFrame` in the above format to a set of vector representations.

Parameters:

- tcrs (`DataFrame`): DataFrame in the prescribed format.

Returns:

A 2D [numpy](https://numpy.org/) [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html) object where every row vector corresponds to a row in the original TCR `DataFrame`.
The returned array will have shape (N, D) where N is the number of TCRs in the input data and D is the dimensionality of the SCEPTR model.

---

#### `sceptr.sceptr.calc_cdist_matrix(anchors: DataFrame, comparisons: DataFrame) -> ndarray`

Generate a cdist matrix between two collections of TCRs.

Parameters:

- anchor_tcrs (`DataFrame`): DataFrame in the prescribed format, representing TCRs from collection A.
- comparison_tcrs (`DataFrame`): DataFrame in the prescribed format, representing TCRs from collection B.

Returns:

A 2D numpy `ndarray` representing a cdist matrix between TCRs from collection A and B.
The returned array will have shape (X, Y) where X is the number of TCRs in collection A and Y is the number of TCRs in collection B.

---

#### `sceptr.sceptr.calc_pdist_vector(instances: DataFrame) -> ndarray`

Generate a pdist set of distances between each pair of TCRs in the input data.

Parameters:

- tcrs (`DataFrame`): DataFrame in the prescribed format.

Returns:

A 1D numpy `ndarray` representing a pdist vector of distances between each pair of TCRs in the input data.
The returned array will have shape (1/2 * N * (N-1),), where N is the number of TCRs in the input data.
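
To make the shape formula concrete, here is a small sketch of how a pair of TCR indices maps to a position in a pdist vector. It assumes the condensed ordering used by SciPy's `pdist`, that is (0,1), (0,2), ..., (0,N-1), (1,2), and so on. The README does not state the ordering explicitly, so treat that as an assumption. The helper names `pdist_length` and `condensed_index` are hypothetical.

```python
def pdist_length(n):
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def condensed_index(i, j, n):
    """Position of the pair (i, j) in a condensed distance vector of n items.

    Assumes SciPy-style ordering: (0,1), (0,2), ..., (0,n-1), (1,2), ...
    """
    if i == j:
        raise ValueError("a condensed vector holds no self-distances")
    if i > j:
        i, j = j, i  # distances are symmetric, so order the pair
    # Skip the entries belonging to rows 0..i-1, then offset within row i.
    return n * i - i * (i + 1) // 2 + (j - i - 1)


n = 4
pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
print(pdist_length(n))
print([condensed_index(i, j, n) for i, j in pairs])
```

For four TCRs this enumerates the six pairs in order, which is why the vector has length N(N-1)/2 rather than N squared.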

---

### Loading specific SCEPTR variants (`sceptr.variant`)

For more curious users, model variants are available to load and use through the `sceptr.variant` submodule.

The module exposes functions, each named after a particular model variant, which when called, will return a `Sceptr` object corresponding to the selected model variant.
This `Sceptr` object will then have the methods: `calc_pdist_vector`, `calc_cdist_matrix`, and `calc_vector_representations` available to use, with function signatures exactly as defined above for the functional API in the `sceptr.sceptr` submodule.

#### Paired-chain variants

|Name|Description|
|---|---|
|`sceptr.variant.default`|default model used by the functional API|
|`sceptr.variant.mlm_only`|default model trained without autocontrastive learning|
|`sceptr.variant.left_aligned`|similar to default model but with learnable token embeddings and a sinusoidal position information embedding method more similar to the original NLP BERT/transformer models|
|`sceptr.variant.cdr3_only`|only uses the CDR3 loops as input|
|`sceptr.variant.cdr3_only_mlm_only`|only uses CDR3 loops as input, and did not receive autocontrastive learning|
|`sceptr.variant.large`|larger variant with model dimensionality 128|
|`sceptr.variant.small`|smaller variant with model dimensionality 32|
|`sceptr.variant.tiny`|smaller variant with model dimensionality 16|
|`sceptr.variant.blosum`|variant using BLOSUM62 embeddings instead of one-hot|
|`sceptr.variant.average_pooling`|variant using the average-pooling method to generate the TCR representation vector|
|`sceptr.variant.shuffled_data`|variant trained on the Tanno et al. dataset with randomised alpha/beta pairing|
|`sceptr.variant.synthetic_data`|variant trained using synthetic TCR sequences generated by OLGA|
|`sceptr.variant.dropout_noise_only`|variant trained without residue/chain dropping during autocontrastive learning|
|`sceptr.variant.finetuned`|variant fine-tuned using supervised contrastive learning for six pMHCs with peptides GILGFVFTL, NLVPMVATV, SPRWYFYYL, TFEYVSQPFLMDLE, TTDPSFLGRY and YLQPRTFLL (from [VDJdb](https://vdjdb.cdr3.net/))|

#### Single-chain variants

|Name|Description
|---|---|
|`sceptr.variant.a_sceptr`|alpha-chain only variant|
|`sceptr.variant.b_sceptr`|beta-chain only variant|
Binary file added docs/about_sceptr.png
5 changes: 3 additions & 2 deletions docs/api.rst
@@ -2,7 +2,8 @@ API reference
=============

.. toctree::
   :maxdepth: 2
   :maxdepth: 1

   sceptr_sceptr
   sceptr
   sceptr_variant
   sceptr_model
29 changes: 15 additions & 14 deletions docs/conf.py
@@ -8,27 +8,28 @@
# -- Project information -----------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information

project = 'sceptr'
copyright = '2024, Yuta Nagano'
author = 'Yuta Nagano'
version = sceptr.VERSION
project = "sceptr"
copyright = "2024, Yuta Nagano"
author = "Yuta Nagano"
release = sceptr.__version__

# -- General configuration ---------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration

extensions = [
    "sphinx.ext.autodoc",
    "sphinx.ext.autosummary",
    "sphinx.ext.napoleon"
]

templates_path = ['_templates']
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
extensions = ["sphinx.ext.autodoc", "sphinx.ext.autosummary", "sphinx.ext.napoleon"]

templates_path = ["_templates"]
exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]


# -- Options for HTML output -------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output

html_theme = 'sphinx_book_theme'
html_static_path = ['_static']
html_theme = "sphinx_book_theme"
html_theme_options = {
    "repository_url": "https://github.com/yutanagano/sceptr",
    "path_to_docs": "docs",
    "use_repository_button": True,
    "use_issues_button": True,
}
html_static_path = ["_static"]
27 changes: 19 additions & 8 deletions docs/index.rst
@@ -1,13 +1,24 @@
.. sceptr documentation master file, created by
sphinx-quickstart on Fri Jun 7 10:32:22 2024.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
SCEPTR
======

SCEPTR: A fast and performant TCR representation model
======================================================
**SCEPTR** (\ **S**\ imple **C**\ ontrastive **E**\ mbedding of the **P**\ rimary sequence of **T** cell **R**\ eceptors) is a small, fast, and performant TCR representation model that can be used for alignment-free downstream TCR and TCR repertoire analysis such as TCR clustering or classification.
Our `manuscript (coming soon) <about:blank>`_ demonstrates SCEPTR's state-of-the-art performance (as of 2024) on downstream TCR specificity prediction.

SCEPTR (Simple Contrastive Embedding of the Primary sequence of T cell Receptors) is a BERT-like attention model trained on T cell receptor (TCR) data.
It maps TCRs to vector representations, which enables alignment-free downstream TCR and TCR repertoire analysis such as TCR clustering or classification.
SCEPTR is a BERT-like transformer-based neural network implemented in `PyTorch <https://pytorch.org>`_.
With the default model providing best-in-class performance with only 153,108 parameters (typical protein language models have tens or hundreds of millions), SCEPTR runs fast, even on a CPU!
And if your computer does have a `CUDA-enabled GPU <https://en.wikipedia.org/wiki/CUDA>`_, the sceptr package will automatically detect and use it, giving you blazingly fast performance without the hassle.

sceptr's :ref:`API <api>` exposes three intuitive functions: :py:func:`~sceptr.calc_vector_representations`, :py:func:`~sceptr.calc_cdist_matrix`, and :py:func:`~sceptr.calc_pdist_vector`, which are all you need to make full use of the SCEPTR models.
What's even better is that they are fully compliant with `pyrepseq <https://pyrepseq.readthedocs.io>`_'s `tcr_metric <https://pyrepseq.readthedocs.io/en/latest/api.html#pyrepseq.metric.tcr_metric.TcrMetric>`_ API, so sceptr will fit snugly into the rest of your repertoire analysis toolkit.

.. figure:: about_sceptr.png
   :width: 700px
   :alt: Schematic diagrams showing a visual introduction to the architecture of SCEPTR, as well as how it was trained-- namely, autocontrastive learning and masked-language modelling.

   A visual introduction to how SCEPTR works, taken from our SCEPTR preprint.
   SCEPTR is a TCR language model (a,b) pre-trained using masked-language modelling and autocontrastive learning (c,d).
   (a) The default model uses the ``<cls>`` pooling method, but there is also a variant that is trained to use average-pooling (see :py:func:`sceptr.variant.average_pooling`).
   Please see the manuscript for more details.

.. toctree::
   :maxdepth: 2
18 changes: 6 additions & 12 deletions docs/installation.rst
@@ -12,44 +12,38 @@ From `Source <https://github.com/yutanagano/sceptr>`_
-----------------------------------------------------

.. important::
To install `sceptr` from source, you must have `git-lfs <https://git-lfs.com/>`_ installed and set up on your system.
To install ``sceptr`` from source, you must have `git-lfs <https://git-lfs.com/>`_ installed and set up on your system.
This is because you must be able to download the trained model weights directly from the Git LFS servers during your install.

Using `pip`
...........

From your Python environment, run the following replacing `<VERSION_TAG>` with the appropriate version specifier (e.g. `v1.0.0-beta.1`).
From your Python environment, run the following replacing ``<VERSION_TAG>`` with the appropriate version specifier (e.g. ``v1.0.0-beta.1``).
The latest release tags can be found by checking the 'releases' section on the github repository page.

.. code-block:: bash

   $ pip install git+https://github.com/yutanagano/sceptr.git@<VERSION_TAG>

Manual install
..............

You can also clone the repository, and from within your Python environment, navigate to the project root directory and run:

.. code-block:: bash

   $ pip install .

Note that even for manual installation, you still need `git-lfs` to properly de-reference the stub files at `git-clone`-ing time.
Note that even for manual installation, you still need ``git-lfs`` to properly de-reference the stub files at ``git-clone``-ing time.

Troubleshooting
...............

A recent security update to `git` has resulted in some difficulties cloning repositories that rely on `git-lfs`.
A recent security update to ``git`` has resulted in some difficulties cloning repositories that rely on ``git-lfs``.
This can produce an error along the lines of:

.. code-block:: bash

   $ fatal: active `post-checkout` hook found during `git clone`

If this happens, you can temporarily set the `GIT_CLONE_PROTECTION_ACTIVE` environment variable to `false` by prepending `GIT_CLONE_PROTECTION_ACTIVE=false` before the install command like below:
If this happens, you can temporarily set the ``GIT_CLONE_PROTECTION_ACTIVE`` environment variable to ``false`` by prepending ``GIT_CLONE_PROTECTION_ACTIVE=false`` before the install command like below:

.. code-block:: bash

   $ GIT_CLONE_PROTECTION_ACTIVE=false pip install git+https://github.com/yutanagano/sceptr.git@<VERSION_TAG>

This is `a known issue <https://github.com/git-lfs/git-lfs/issues/5749>`_ for `git` version `2.45.1` and is fixed from version `2.45.2`.
This is `a known issue <https://github.com/git-lfs/git-lfs/issues/5749>`_ for ``git`` version ``2.45.1`` and is fixed from version ``2.45.2``.
7 changes: 7 additions & 0 deletions docs/sceptr.rst
@@ -0,0 +1,7 @@
.. _api:

``sceptr``
==========

.. automodule:: sceptr
   :members:
5 changes: 5 additions & 0 deletions docs/sceptr_model.rst
@@ -0,0 +1,5 @@
``sceptr.model``
================

.. autoclass:: sceptr.model.Sceptr()
   :members:
5 changes: 0 additions & 5 deletions docs/sceptr_sceptr.rst

This file was deleted.
