---
title: "Introduction"
author: "Modesto Redrejo Rodríguez"
date: "`r Sys.Date()`"
toc: true
toc_float: true
format:
  html:
    theme: simplex
    toc: true
    toc-location: right
    toc-depth: 4
    number-sections: true
    code-overflow: wrap
    link-external-icon: true
    link-external-newwindow: true
bibliography: references.bib
editor_options:
  markdown:
    wrap: 80
---
```{r wrap-hook, include=FALSE}
#Markdown options
library(knitr)
library(formatR)
opts_chunk$set(tidy.opts=list(width.cutoff=50),tidy=TRUE,fig.cap = "", fig.path = "Plot")
#Loading packages
paquetes <- c("ggplot2","data.table")
invisible(lapply(paquetes, library, character.only = TRUE))
#Determine the output format of the document
outputFormat = opts_knit$get("rmarkdown.pandoc.to")
```
# Goals {data-link="Preface"}
[Structural
Bioinformatics](https://en.wikipedia.org/wiki/Structural_bioinformatics "Wikipedia")
(SB) is a broad discipline that covers structural and computational biology,
from visualization and analysis of the structure of biomacromolecules to protein
modeling and molecular docking. The great promise of SB is predicated on the
belief that high-resolution structural information about biological systems
will allow us to reason precisely about the function of these systems and the
effects of modifications and perturbations.
The goals of SB require at least four different research lines (see Chapter 1 in
@structur):
1. *Visualization* of complex structures with several sources of information:
sequence, structural data, electrostatic fields, location of functional
sites, and areas of variability.
2. *Classification* of the structures. Clustering similar structures together in
a hierarchical classification allows us to identify common origins and
diversification paths. As in other fields of biology, classification is
tedious but required to understand the structural space.
3. *Prediction* of structures remains an area of keen interest and a field of
research in itself. As we will see below, the number of known sequences is
much higher than the number of available structures, which makes prediction
an essential and useful tool.
4. *Simulation.* Experimentally obtained structures are primarily static
structural models (see warning below). However, the properties of these
molecules are often the result of their dynamic motions. The energy
functions that govern the folding of proteins and their subsequent stable
dynamics can be explored with molecular dynamics simulations, although
computational capacity may limit access to biologically relevant
timescales.
Powered by large amounts of data and great technical advances, the field has
experienced a revolution in the last decade. The increase in experimental
capacity to analyze the structure of proteins and other biological molecules
and structures (see @callaway2020) and the development of Artificial
Intelligence (AI)-assisted structure prediction have boosted the capacity of
life-science researchers to address a wide variety of questions regarding
protein diversity, evolution and function. This revolution has accelerated
greatly in the last 2-3 years, and its implications for biology,
biotechnology, and biomedicine cannot yet be fully foreseen.
# Before going forward: Protein Structure 101 {#sec-str}
Although you can do some protein modeling without being an expert in
structural biology, a basic understanding of protein structure is strongly
advisable. In this course there are some students without a background in
biology. Moreover, over the years, I have noticed that graduate students in
biology, biomedicine, and related fields have very different backgrounds in
protein structure. If you want to review and update your background on protein
structure, I recommend reading Chapter 2 of @structur, the great recent
review by @stollar2020, and the
[Wikipedia](https://en.wikipedia.org/wiki/Protein_structure) and
[Proteopedia](https://proteopedia.org/wiki/index.php/Introduction_to_protein_structure)
articles on protein structure, which constituted my main sources for this brief
section (follow picture links).
[{#fig-str
.figure}](https://en.wikipedia.org/wiki/Protein_structure)
Proteins are key components of life, playing central roles in almost every
vital function, either as structural or scaffolding elements or as active
enzymes that catalyze metabolic reactions. Proteins are built as polymers of
amino acids, and the sequence of amino acids of a particular protein is also
called the **primary structure** of the protein. Amino acid chains can
spontaneously fold up into three-dimensional structures, mostly stabilized by
hydrogen bonds between amino acids. The amino acid sequence determines different
layers of 3D structure. Each of the 20 natural amino acids has different
physicochemical properties that affect its preferred conformation. The
first level of folding is called **secondary structure**, which forms common
patterns, as we will see in a moment.
[{#fig-aa
.figure}](https://www.reddit.com/r/chemistry/comments/acyald/venn_diagram_showing_the_properties_of_the_20/)
These stretches of secondary structure patterns can fold in 3D due to
interactions between the side chains of amino acids. This is called protein
**tertiary structure**. Finally, two or more individual peptide chains can form
multisubunit proteins that have the so-called **quaternary structure**.
It should be noted that the peptide bond itself cannot rotate freely, as it has
partial double-bond character. Therefore, rotation can only occur about the bond
between the NH group and the Cα (the phi (φ) angle) and the bond between the Cα
and the C = O group (the psi (ψ) angle). In fact, the polypeptide backbone chain
is composed of a repeating series of two rotatable bonds followed by one
non-rotatable (peptide) bond. However, not all 360° of the psi and phi angles
are possible, as neighboring sidechains can clash due to steric hindrance. For
certain angles and amino acid combinations, the atoms would have to occupy the
same physical place, and this partly explains why some amino acids have a higher
propensity (likelihood) to form different types of secondary structures.
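
To make the definition of these backbone angles more concrete, here is a minimal sketch of how a torsion (dihedral) angle can be computed from four atom positions. For φ the four atoms would be C(i-1), N, Cα and C, and for ψ they would be N, Cα, C and N(i+1). The function name `dihedral` and the coordinates in the usage lines are illustrative assumptions, not taken from any real structure.

```python
import math


def dihedral(p1, p2, p3, p4):
    """Signed torsion angle (degrees) defined by four 3D points.

    0 degrees is the cis arrangement and +/-180 degrees the trans one.
    """
    def sub(a, b):
        return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])

    def dot(a, b):
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

    b1 = sub(p2, p1)          # first bond
    b2 = sub(p3, p2)          # central bond (the rotation axis)
    b3 = sub(p4, p3)          # last bond
    n1 = cross(b1, b2)        # normal of the plane p1-p2-p3
    n2 = cross(b2, b3)        # normal of the plane p2-p3-p4
    b2_len = math.sqrt(dot(b2, b2))
    x = dot(n1, n2)
    y = dot(cross(n1, n2), b2) / b2_len
    return math.degrees(math.atan2(y, x))


# Made-up coordinates: only the last point changes between the examples.
cis = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
quarter_turn = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
trans = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
```

With real data you would read the atom coordinates from a PDB or mmCIF file and call the function once per residue for φ and once for ψ.
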
[{#fig-bond
.figure}](https://portlandpress.com/essaysbiochem/article/64/4/649/226515/Uncovering-protein-structure)
Within these restraints, the two principal local conformations that avoid steric
hindrance and maximize backbone--backbone hydrogen bonding are the **α-helix**
and the **β-sheet** secondary structures. The α-helix was proposed initially as
left-handed by Linus Pauling in 1951, but the crystal structure of myoglobin in
1958 showed that, although both can be found, the right-handed form is the
common one. In the common right-handed helices, the backbone NH group hydrogen
bonds to the backbone C = O group of the amino acid located four residues
earlier along the protein sequence. This results in a polypeptide chain that
twists in a regular coil shape with the R-groups pointing outwards away from the
peptide backbone. It takes approximately 3.6 residues to complete a full turn of
a helix.
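
The two numbers in this paragraph (3.6 residues per turn, and hydrogen bonds to the residue four positions earlier) are enough for a small sketch. It lists the backbone hydrogen-bond pairs of an ideal helix and the angular position of each residue around the helix axis, as in a helical wheel. The function names are hypothetical, and the model assumes a perfectly regular helix.

```python
# 360 degrees per turn / 3.6 residues per turn = 100 degrees per residue
DEGREES_PER_RESIDUE = 100


def helix_backbone_hbonds(n_residues):
    """(donor NH residue, acceptor C=O residue) pairs, 1-based numbering."""
    return [(i, i - 4) for i in range(5, n_residues + 1)]


def helical_wheel_angles(n_residues):
    """Angular position (degrees) of each residue around the helix axis."""
    return [(i * DEGREES_PER_RESIDUE) % 360 for i in range(n_residues)]


hbonds = helix_backbone_hbonds(8)
angles = helical_wheel_angles(5)
print(hbonds)
print(angles)
```

Note that the first four NH groups of the helix have no partner in this scheme, since there is no residue four positions earlier.
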
::: {layout-ncol="1"}
[{#fig-alpha
.figure}](https://en.wikipedia.org/wiki/Alpha_helix)
[{#fig-beta
.figure}](https://en.wikipedia.org/wiki/Beta_sheet)
:::
Different amino-acid sequences have different propensities for forming α-helical
structures. [Methionine](https://en.wikipedia.org/wiki/Methionine "Methionine"),
[alanine](https://en.wikipedia.org/wiki/Alanine "Alanine"),
[leucine](https://en.wikipedia.org/wiki/Leucine "Leucine"),
[glutamate](https://en.wikipedia.org/wiki/Glutamate "Glutamate"), and
[lysine](https://en.wikipedia.org/wiki/Lysine "Lysine") have especially high
helix-forming propensities, whereas
[proline](https://en.wikipedia.org/wiki/Proline "Proline") and
[glycine](https://en.wikipedia.org/wiki/Glycine "Glycine") have poor
helix-forming propensities.
[Proline](https://en.wikipedia.org/wiki/Proline "Proline") either breaks or
kinks a helix, both because it cannot donate an amide [hydrogen
bond](https://en.wikipedia.org/wiki/Hydrogen_bond "Hydrogen bond") (having no
amide hydrogen), and also because its bulky sidechain interferes sterically with
the backbone of the preceding turn. However, proline is often seen as the
*first* residue of a helix, presumably due to its structural rigidity. At
the other extreme, [glycine](https://en.wikipedia.org/wiki/Glycine "Glycine")
also tends to disrupt helices because its high conformational flexibility makes
it entropically expensive to adopt the relatively constrained α-helical
structure.
**β-Sheets** are composed of two or more extended polypeptide chains called
β-strands that run alongside each other. They can be arranged in either a
parallel or antiparallel manner. The residues arrange themselves in a regular
zigzag manner with the adjacent peptide bonds pointing in opposite directions.
In this arrangement, the NH group and the C = O group of each amino acid are
hydrogen-bonded to the C = O group and NH group respectively on the adjacent
strands. Chains can run in opposite directions, forming an antiparallel β-sheet,
or in the same direction, forming a parallel β-sheet. Sidechains from each of
the residues point away from the sheets and alternate in opposite directions
between residues. It is common to see a pattern of alternating hydrophilic and
hydrophobic residues in the primary structure, giving the β-sheets hydrophilic
and hydrophobic faces.
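
Because consecutive side chains point to opposite sides of the sheet, the two faces of a strand can be read directly from the sequence by taking every other residue. The sketch below does exactly that and reports how hydrophobic each face is. The `HYDROPHOBIC` set is a simplifying assumption (different hydrophobicity scales group residues differently), and the function names and the example strand are made up for illustration.

```python
# Assumption: a deliberately simple hydrophobic set; real scales differ.
HYDROPHOBIC = set("AVILMFWC")


def strand_faces(strand):
    """Split a beta-strand sequence into the residues of its two faces."""
    seq = strand.upper()
    return seq[0::2], seq[1::2]


def hydrophobic_fraction(residues):
    """Fraction of residues that belong to the HYDROPHOBIC set."""
    if not residues:
        return 0.0
    return sum(aa in HYDROPHOBIC for aa in residues) / len(residues)


face_a, face_b = strand_faces("VKVTIEV")
print(face_a, hydrophobic_fraction(face_a))
print(face_b, hydrophobic_fraction(face_b))
```

A strand with one fully hydrophobic face and one polar face, as in this made-up example, is what you would expect at the surface of a protein, with one side packed against the core and the other exposed to the solvent.
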
Large aromatic residues
([tyrosine](https://en.wikipedia.org/wiki/Tyrosine "Tyrosine"),
[phenylalanine](https://en.wikipedia.org/wiki/Phenylalanine "Phenylalanine"),
[tryptophan](https://en.wikipedia.org/wiki/Tryptophan "Tryptophan")) and
β-branched amino acids
([threonine](https://en.wikipedia.org/wiki/Threonine "Threonine"),
[valine](https://en.wikipedia.org/wiki/Valine "Valine"),
[isoleucine](https://en.wikipedia.org/wiki/Isoleucine "Isoleucine")) are favored
in β-strands. As in the case of α-helices, β-strands are often ended
by [glycines](https://en.wikipedia.org/wiki/Glycine "Glycine"), which are
especially common in β-turns (the most common connector between strands),
because glycine readily adopts the positive φ angles that some turn positions
require.
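
As a quick way to apply the residue preferences described above, the following sketch counts how much of a sequence is made of the helix formers, helix breakers, and strand-favoring residues named in the text. It is only a composition summary under that assumption, not a secondary structure predictor, and the function name is hypothetical.

```python
# Residue groups taken from the text above (one-letter codes).
HELIX_FORMERS = set("MALEK")     # Met, Ala, Leu, Glu, Lys
HELIX_BREAKERS = set("PG")       # Pro, Gly
STRAND_FAVORING = set("YFWTVI")  # aromatic and beta-branched residues


def composition_summary(sequence):
    """Fraction of the sequence that falls in each residue group."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")

    def fraction(group):
        return sum(aa in group for aa in seq) / len(seq)

    return {
        "helix_formers": fraction(HELIX_FORMERS),
        "helix_breakers": fraction(HELIX_BREAKERS),
        "strand_favoring": fraction(STRAND_FAVORING),
    }


summary = composition_summary("AAAAKKKKVIVI")
print(summary)
```

Real predictors also consider the sequence context of each residue, so treat these fractions only as a first look at a peptide before modeling it.
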
The side chains of amino acids also have their own torsion angles, referred to
as χ1, χ2, χ3, and so on.
[{#fig-chi .figure
width="450"}](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-306)
## Ramachandran Plot {#sec-rama}
As you have probably already figured out, many combinations of φ and ψ angles
are forbidden because of the principle of steric exclusion: two atoms cannot be
in the same place at the same time. This was initially shown by [Gopalasamudram
Ramachandran](https://en.wikipedia.org/wiki/G._N._Ramachandran), who also
devised a plot to visualize the allowed angle values, the so-called Ramachandran
plot. This plot can represent the angles of a particular amino acid, of all the
amino acids in a protein, or of many proteins. Analysis of φ and ψ angles in
known proteins clearly shows that roughly three-quarters of all possible φ, ψ
combinations are excluded.
[{#fig-rama0
.figure}](https://proteopedia.org/wiki/index.php/Ramachandran_Plot)
The core regions in the Ramachandran plot also correspond to common secondary
structures, as usually represented in textbooks.
{#fig-ram .figure}
Functionally and structurally relevant residues are more likely than others to
have torsion angles that fall into the [allowed but disfavored]{.ul} regions of
a Ramachandran plot. The specific geometry of these functionally relevant
residues, while somewhat energetically unfavorable, may be important for the
protein's function, catalytic or otherwise. Such conformations need to be
stabilized by the protein using H-bonds, steric packing, or other means, and
should very seldom occur in highly solvent-exposed residues.
[{#fig-rama2
.figure}](https://proteopedia.org/wiki/index.php/Ramachandran_Plot)
## Protein folds, domains and motifs
The final three-dimensional tertiary structure of a protein is commonly referred
to as its **fold**. Within the overall protein fold, we can recognize distinct
**domains** and **motifs.** Domains are compact sections of the protein that
represent structurally and (usually) functionally independent regions. That
means that a domain maintains its main features even if separated from the
rest of the protein. On the other hand, motifs are small substructures that are
not necessarily independent and consist of only a few secondary structure
stretches. Indeed, motifs can also be referred to as *super-secondary*
structure.
The diversity of protein folds, domains and motifs, and combinations of those,
can be used to classify protein structures hierarchically, as in many
other fields of biology. The first classification was proposed in the 1970s and
consisted of four groups of folds, as shown in the figure below. All-α proteins
are based almost entirely on an α-helical structure, and all-β structures are
based on β-sheets. The α/β structure is based on a mixture of α-helices and
β-sheets, often organized as parallel β-strands connected by α-helices. Finally,
α+β structures consist of discrete α-helix and β-sheet motifs that are not
interwoven (as they are in α/β proteins).
{#fig-chlothia .figure}
As the known fold space has become more and more complex, these types of
classifications have been adjusted and extended such that a complete hierarchy
is created. The most commonly cited approaches to this sort of classification
are those used by the SCOP and CATH databases, as we will see in the [Structural
Databases](ddbb.html#strDDBB) section.
## [**Hands on: Playing with secondary structures**]{style="color:green"}

There are a few online alternatives to model any peptide sequence and quickly
see the effect of amino acid composition on the secondary structure. One of the
best-known is Foldit ([www.fold.it](http://www.fold.it), @miller2020), a gaming
platform for biochemistry and structural biology teaching. It is a highly
recommended alternative for most courses related to protein structure.

In this course we are going to try a more recent proposal, tweeted by
Sergey Ovchinnikov (see
<https://twitter.com/sokrypton/status/1535857255647690753>). It is based on
ColabFold (see <https://github.com/sokrypton/ColabFold> and @mirdita2022), a
free AlphaFold2 (see @jumper2021) notebook that runs in [Google
Colab](https://colab.research.google.com/?hl=en). All you need is a Google
account and the following *cheatsheet*.
[{#fig-single
.figure}](https://twitter.com/sokrypton/status/1535857255647690753)
Now go to ColabFold Single:
<https://colab.research.google.com/github/sokrypton/af_backprop/blob/beta/examples/AlphaFold_single.ipynb>
Construct some small proteins and compare the output. Note that the first model
will take 3-5 min, but the others will be faster. I provide here some
interesting examples (IUPAC one-letter amino acid code):
1. AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
2. KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK
3. PVAVEARENGRLAVRVEGAIAVLIRENGRLVVRVEGG
4. PELEKHREELGEFLKKETGIAVEIRENGRLEVRVEGYTDVKIEGGTERLKRFLEEL
5. ACTWEGNKLTCA
**1. Answer the following questions:**\
**- Why is a poly-K more stable (dark blue) than a poly-A?**\
\
**- Could you predict the structure of a poly-V or a poly-G?**\
\
**- What would happen if you introduced a K5W substitution in sequence number 2?
And in number 4?**\
\
\
**2. Now, try to create peptides with a custom motif, such as:**\
\
**- Two helices.**\
**- A four-stranded beta-sheet.**\
**- Alpha-beta-beta-alpha.**\
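
If you want to prepare the mutated sequences for the questions above without editing them by hand, a small helper like the following may be useful. It is a minimal sketch that assumes the usual point-mutation notation (original residue, 1-based position, new residue, as in K5W). The function name `apply_mutation` is hypothetical.

```python
import re


def apply_mutation(sequence, mutation):
    """Return the sequence with a point mutation such as 'K5W' applied."""
    seq = sequence.upper()
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation.strip().upper())
    if match is None:
        raise ValueError(f"cannot parse mutation: {mutation!r}")
    original, position, replacement = match.group(1), int(match.group(2)), match.group(3)
    if not 1 <= position <= len(seq):
        raise ValueError(f"position {position} is outside the sequence")
    if seq[position - 1] != original:
        raise ValueError(
            f"expected {original} at position {position}, found {seq[position - 1]}"
        )
    # Positions in the notation are 1-based, Python indices are 0-based.
    return seq[:position - 1] + replacement + seq[position:]


# Example 4 from the list above.
example_4 = "PELEKHREELGEFLKKETGIAVEIRENGRLEVRVEGYTDVKIEGGTERLKRFLEEL"
mutant_4 = apply_mutation(example_4, "K5W")
print(mutant_4)
```

The check on the original residue is there to catch off-by-one mistakes: the function raises an error rather than silently mutating the wrong position.
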