PaSiMap maps protein sequences to coordinates based on their pairwise similarities.
To determine the pairwise similarities, PaSiMap first computes a global alignment for each pair of sequences. The similarity of each aligned pair is then quantified as a number, which allows the sequences to be mapped with the multidimensional scaling method cc_analysis.
For more details, please refer to the PaSiMap paper.
The easiest way to use PaSiMap is with our PaSiMap webserver.
Simply submit your query to the webserver and download your results. You can also try out PaSiMap with the example query provided on the webserver.
-
Get PaSiMap from GitHub:

    git clone https://github.com/ksu00/pasimap.git

This results in a directory named pasimap.

⚠️ All of the following steps are run from within this pasimap directory.
-
Set up directory structure:

    # In pasimap directory.
    mkdir src/webserver
    mkdir src/webserver/static
    mkdir src/webserver/static/tmp

The directory for each job (containing the query, interim results and final results) must be located in this tmp directory.
-
Set up Python Virtual Environment:
⚠️ Make sure that the correct Python version (Python 3) is active. PaSiMap was developed with Python 3.6.12.

Install the Python package for virtual environments:

    pip install virtualenv --user

Create a new virtual environment for PaSiMap:

    # In pasimap directory.
    python -m venv venv

Activate the virtual environment for PaSiMap:

    # In pasimap directory.
    source venv/bin/activate

Install the packages required by PaSiMap:

    # In pasimap directory.
    pip install -r requirements.txt
-
Set up needleall (EMBOSS).
Download and configure the EMBOSS software suite (version 6.6.0) from EMBOSS.

Add needleall to PATH, or adjust the following line in run_pipeline.sh (in the pasimap directory) to your needleall path:

    time needleall -asequence $in_file_path \

Adjust the location of the substitution matrix to your path in the following line in run_pipeline.sh (in the pasimap directory):

    substmat_dir_path=/usr/local/EMBOSS-6.6.0/emboss/data;
-
Set up cc_analysis:
Download the cc_analysis binary.

Add cc_analysis to PATH, or adjust the following line in run_pipeline.sh (in the pasimap directory) to your cc_analysis path:

    time cc_analysis -dim $dim \
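
If you chose the PATH option for needleall and cc_analysis, it can help to confirm that both tools are reachable before running a job. The following is a minimal sketch and not part of PaSiMap: check_tool and check_all are hypothetical helpers, and python3 is assumed to be the name of your Python 3 interpreter. If you edited the paths in run_pipeline.sh instead, a "missing" report here is expected and harmless.

```shell
#!/bin/sh
# Hypothetical helper (not part of PaSiMap): report whether a command
# can be found on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Check every tool given as an argument and return non-zero
# if at least one of them is missing.
check_all() {
    status=0
    for tool in "$@"; do
        check_tool "$tool" || status=1
    done
    return "$status"
}

# python3 is an assumed interpreter name; adjust it to your setup.
check_all needleall cc_analysis python3 || echo "Some tools are not on PATH."
```

The check only looks at PATH; it does not verify the EMBOSS version or the location of the substitution matrix.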
-
Create a job directory (e.g. ASDF, but you can name it whatever you want):

    # In pasimap directory.
    mkdir src/webserver/static/tmp/ASDF
-
Prepare query:

    # In pasimap directory.
    mkdir src/webserver/static/tmp/ASDF/0_input
    touch src/webserver/static/tmp/ASDF/0_input/dim.txt
    touch src/webserver/static/tmp/ASDF/0_input/state.txt
    touch src/webserver/static/tmp/ASDF/0_input/input.txt
    touch src/webserver/static/tmp/ASDF/0_input/count.txt
-
Specify the dimensionality of the output in dim.txt, e.g.:

    echo 3 > src/webserver/static/tmp/ASDF/0_input/dim.txt
-
Specify the mode for PaSiMap in state.txt, e.g.:

    echo unaligned > src/webserver/static/tmp/ASDF/0_input/state.txt

The possible options are unaligned for unaligned protein sequences (in FASTA format), aligned for an MSA of protein sequences (in FASTA format), and quantifier for pairwise similarities.
-
Specify the query in input.txt according to state.txt, e.g. unaligned protein sequences (in FASTA format) for unaligned.

Please refer to the Help page of the PaSiMap webserver for details.
-
Specify the number of data points (e.g. sequences) in the query in count.txt. For sequences, you can take advantage of the syntax of the FASTA format:

    grep -c '>' src/webserver/static/tmp/ASDF/0_input/input.txt > src/webserver/static/tmp/ASDF/0_input/count.txt
-
Activate the virtual environment for PaSiMap, if not already active:

    # In pasimap directory.
    source venv/bin/activate
-
Run job:

    # In pasimap directory.
    ./run_pipeline.sh ASDF
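
The job preparation steps above can be collected into a small script. The following is a minimal sketch under stated assumptions, not part of PaSiMap: prepare_job is a hypothetical helper, the requirement that the dimensionality be a positive integer is an assumption, and my_query.fasta in the usage comment is a placeholder. For the FASTA-based modes it counts header lines with the anchored pattern '^>' so that a '>' appearing later in a line is not counted. For the quantifier mode it leaves count.txt empty, since the number of data points has to be determined from your own data.

```shell
#!/bin/sh
# Hypothetical helper (not part of PaSiMap): create the 0_input directory
# of a job and fill in dim.txt, state.txt, input.txt and count.txt.
# Usage: prepare_job TMP_DIR JOB_NAME QUERY_FILE DIM STATE
prepare_job() {
    tmp_dir=$1
    job=$2
    query=$3
    dim=$4
    state=$5

    # Accept only the three modes named in this README.
    case "$state" in
        unaligned|aligned|quantifier) ;;
        *) echo "unknown mode: $state" >&2; return 1 ;;
    esac

    # Assumption: the dimensionality must be a positive integer.
    case "$dim" in
        ''|*[!0-9]*) echo "invalid dimensionality: $dim" >&2; return 1 ;;
    esac
    [ "$dim" -gt 0 ] || { echo "invalid dimensionality: $dim" >&2; return 1; }

    [ -f "$query" ] || { echo "no such file: $query" >&2; return 1; }

    input_dir=$tmp_dir/$job/0_input
    mkdir -p "$input_dir" || return 1

    echo "$dim" > "$input_dir/dim.txt"
    echo "$state" > "$input_dir/state.txt"
    cp "$query" "$input_dir/input.txt" || return 1

    if [ "$state" = "quantifier" ]; then
        # No FASTA headers to count; count.txt must be filled in by hand.
        : > "$input_dir/count.txt"
        echo "note: write the number of data points to $input_dir/count.txt" >&2
    else
        # Count FASTA records by their header lines.
        grep -c '^>' "$input_dir/input.txt" > "$input_dir/count.txt"
    fi
}

# Example call, run from the pasimap directory:
#   prepare_job src/webserver/static/tmp ASDF my_query.fasta 3 unaligned
#   ./run_pipeline.sh ASDF
```

Because the helper uses mkdir -p, it can be run again for an existing job name; it then overwrites the four input files.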