Interfacing with ML Libraries
=============================
Overview
--------
Machine learning (ML) is the field of study that gives computers the capability to learn without
being explicitly programmed, and it is well known for its ability to recognize patterns and signals.
Recently, the mass spectrometry community has embraced ML techniques for large-scale data analysis.
Accurate retention time (RT) prediction has been shown to improve identification in bottom-up proteomics.
In this tutorial we will predict RT from amino acid sequence data using simple machine learning methods.

First, we install and import all libraries needed for this tutorial.
.. code-block:: ipython3
    :linenos:

    !pip install seaborn
    !pip install xgboost
    !pip install scikit-learn
.. code-block:: python
    :linenos:

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    import string
    from collections import Counter
    import numpy as np
    from urllib.request import urlretrieve
Once all libraries are imported successfully, we download the dataset and load it into a pandas DataFrame.
.. code-block:: python
    :linenos:

    gh = "https://raw.githubusercontent.com/OpenMS/pyopenms-docs/master"
    urlretrieve(gh + "/src/data/pyOpenMS_ML_Tutorial.tsv", "data.tsv")
    tsv_data = pd.read_csv("data.tsv", sep="\t", skiprows=17)
Here, we have prepared a ``tsv`` file that contains three columns: ``sequence``, ``RT``, and ``charge``.
Note that this table could also easily be created from identification data as produced in previous chapters.
Before we move forward, let's try to understand our data a bit better:

a. Sequence - Chains of amino acids form peptides or proteins.
   The order in which the amino acids occur is referred to as the amino acid sequence.
   The composition and order of amino acids affect the physicochemical properties of the peptide
   and lead to different retention in the column.
b. Retention time (RT) - the time taken for an analyte to pass through a chromatography column.

From the amino acid sequence we can derive additional properties (machine learning features) used to train
our model.

We can easily check the shape of the dataset using the ``tsv_data.shape`` attribute, which returns its dimensions.
.. code-block:: python
    :linenos:

    print(tsv_data.shape)

.. code-block:: output

    (15896, 3)
Explore the top five rows of the dataset by using the ``head()`` method of the pandas DataFrame.
.. code-block:: python
    :linenos:

    tsv_data.head()

.. code-block:: output

             sequence          RT  charge
    0         EEETVAK  399.677766       2
    1  EQEEQQQQEGHNNK  624.555300       3
    2      SHGGHTVISK  625.797960       3
    3        SGTHNMYK  625.982520       2
    4       AARPTRPDK  626.073300       3
As the ``RT`` column is our response variable, we store it separately as ``Y1_test``.
.. code-block:: python
    :linenos:

    Y1_test = tsv_data["RT"]
Preprocessing
-------------
Cleaning the data before applying a machine learning method distills the relevant
information from a potentially massive amount of raw data.
Here we apply some simple preprocessing to extract machine learning features from the amino acid
sequences. The features we derive are:

1. ``{Letter}_count`` - the number of occurrences of each amino acid in the sequence.
2. ``{Letter}_freq`` - the amino acid count divided by the total length of the sequence.
3. ``length`` - the total number of amino acids in the sequence.
.. code-block:: python
    :linenos:

    alphabet_list = list(string.ascii_uppercase)
    column_headers = (
        ["sequence"]
        + [i + "_count" for i in alphabet_list]
        + [i + "_freq" for i in alphabet_list]
        + ["charge", "length"]
    )
    types = (
        ["object"]
        + ["int64" for i in alphabet_list]
        + ["float64" for i in alphabet_list]
        + ["int64", "int64"]
    )
    pdcols = dict(zip(column_headers, types))
Now that we have all the column names, we can populate the DataFrame.
.. code-block:: python
    :linenos:

    df = pd.DataFrame(
        np.zeros((len(tsv_data.index), len(column_headers))), columns=column_headers
    )
    df["sequence"] = tsv_data["sequence"]
    df["charge"] = tsv_data["charge"]

    # Populate the length column
    df["length"] = df["sequence"].str.len()
    df = df.astype(dtype=pdcols)

    # Populate the {Letter}_count columns
    def count(row):
        counts = Counter(row["sequence"])
        for aa, n in counts.items():
            row[aa + "_count"] = int(n)
        return row

    df = df.apply(count, axis=1)
    df.head()
.. code-block:: output

             sequence  A_count  B_count  C_count  D_count  E_count  F_count  G_count  H_count  I_count  ...  S_freq  T_freq  U_freq  V_freq  W_freq  X_freq  Y_freq  Z_freq  charge  length
    0         EEETVAK        1        0        0        0        3        0        0        0        0  ...     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0       2       7
    1  EQEEQQQQEGHNNK        0        0        0        0        4        0        1        1        0  ...     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0       3      14
    2      SHGGHTVISK        0        0        0        0        0        0        2        2        1  ...     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0       3      10
    3        SGTHNMYK        0        0        0        0        0        0        1        1        0  ...     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0       2       8
    4       AARPTRPDK        2        0        0        1        0        0        0        0        0  ...     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0       3       9
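Note that the ``{Letter}_freq`` columns in the output above are still zero: the ``count`` step only fills the count columns. As defined earlier, the frequencies are simply the counts divided by the sequence length; a minimal sketch to populate them (the outputs in the remainder of this tutorial were produced without this step):

.. code-block:: python
    :linenos:

    # Derive the frequency features from the already populated counts and lengths
    for letter in alphabet_list:
        df[letter + "_freq"] = df[letter + "_count"] / df["length"]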
This completes the data preprocessing. We have derived a good amount of information from the amino acid
sequences that may influence retention in the column.
Now we can proceed to building the machine learning model.
Modelling
---------
.. code-block:: python
    :linenos:

    from sklearn.model_selection import train_test_split, ShuffleSplit
    from sklearn.metrics import mean_squared_error
    from xgboost import XGBRegressor
.. code-block:: python
    :linenos:

    test_df = df.copy()
    test_df = test_df.drop("sequence", axis=1)
Now, we create the train and test sets using the ``train_test_split`` function from sklearn's
``model_selection`` module, with ``test_size`` equal to 30% of the data. To keep the results
reproducible, a ``random_state`` is also set.
.. code-block:: python
    :linenos:

    # Split the features and response into training and test sets
    X_train, X_test, Y_train, Y_test = train_test_split(
        test_df, Y1_test, test_size=0.3, random_state=3
    )
We use the ``XGBRegressor`` class because this is clearly a regression problem: the response variable (retention time) is continuous.
.. code-block:: python
    :linenos:

    xg_reg = XGBRegressor(
        n_estimators=300,
        random_state=3,
        max_leaves=5,
        colsample_bytree=0.7,
        max_depth=7,
    )
Fit the regressor to the training set and make predictions on the test set using the familiar ``.fit()`` and ``.predict()`` methods.
.. code-block:: python
    :linenos:

    xg_reg.fit(X_train, Y_train)
    Y_pred = xg_reg.predict(X_test)
Compute the root mean square error (RMSE) using the ``mean_squared_error`` function from sklearn's ``metrics`` module.
.. code-block:: python
    :linenos:

    rmse = np.sqrt(mean_squared_error(Y_test, Y_pred))
    print("RMSE: %f" % (rmse))

.. code-block:: output

    RMSE: 437.017290
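To put this number into perspective (a sanity check that is not part of the original tutorial), we can compare it against a trivial baseline that always predicts the mean training RT; the model should beat this by a wide margin:

.. code-block:: python
    :linenos:

    # Illustrative baseline: always predict the mean retention time of the training set
    baseline_pred = np.full(len(Y_test), Y_train.mean())
    baseline_rmse = np.sqrt(mean_squared_error(Y_test, baseline_pred))
    print("Baseline RMSE: %f" % baseline_rmse)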
Store the **Observed** vs. **Predicted** values in a pandas DataFrame and print it.
.. code-block:: python
    :linenos:

    k = pd.DataFrame(
        {"Observed": Y_test.values.flatten(), "Predicted": Y_pred.flatten()}
    )
    print(k)
.. code-block:: output

            Observed    Predicted
    0     3652.28442  3927.141846
    1     4244.80320  4290.294434
    2     3065.19054  3703.156982
    3      909.50610   762.218567
    4     1982.80902  2628.958740
    ...          ...          ...
    4764  5527.23804  5599.530762
    4765  3388.76430  3272.557617
    4766  3101.35566  3346.364990
    4767  5515.94682  5491.597168
    4768  2257.63092  2258.312988
We will now generate an **Observed** vs. **Predicted** plot that gives a high-level overview of the model performance.
Only a few outliers are visible and most points lie close to the diagonal.
This means that the prediction actually works: observed and predicted values do not differ too much.
.. code-block:: python
    :linenos:

    sns.lmplot(
        x="Observed", y="Predicted", data=k, scatter_kws={"alpha": 0.2, "s": 5}
    )
.. image:: img/ml_tutorial_predicted_vs_observed.png
.. code-block:: python
    :linenos:

    p = sns.kdeplot(data=k["Observed"] - k["Predicted"], fill=True)
    p.set(xlabel="Observed-Predicted (s)")
.. image:: img/ml_tutorial_kdplot.png
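To complement the density plot with numbers (an addition beyond the original tutorial), we can print summary statistics of the prediction residuals:

.. code-block:: python
    :linenos:

    # Summary statistics of the residuals (in seconds)
    residuals = k["Observed"] - k["Predicted"]
    print(residuals.describe())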
In order to build more robust models, it is common to perform k-fold cross validation, in which every entry
of the original training dataset is used for both training and validation, and each entry is used for
validation exactly once. XGBoost ships its own k-fold cross validation via the ``xgboost.cv()`` function,
where the ``nfold`` parameter specifies the number of folds; a minimal sketch is shown below.
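The following sketch of ``xgboost.cv()`` reuses our training data; the hyperparameters are illustrative and not taken from the original tutorial:

.. code-block:: python
    :linenos:

    import xgboost as xgb

    # xgboost's native cross validation operates on its DMatrix data structure
    dtrain = xgb.DMatrix(X_train, label=Y_train)
    cv_results = xgb.cv(
        params={"max_depth": 7, "colsample_bytree": 0.7},
        dtrain=dtrain,
        num_boost_round=300,
        nfold=5,
        metrics="rmse",
        seed=3,
    )
    # One row per boosting round; the last rows hold the final averaged RMSE
    print(cv_results[["train-rmse-mean", "test-rmse-mean"]].tail())

In the remainder of this section we instead perform repeated random train/validation splits with sklearn's ``ShuffleSplit``. Note that, unlike true k-fold, these random splits may overlap, so an entry can appear in more than one validation set: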
.. code-block:: python
    :linenos:

    # Repeated random train/validation splits on the training data
    ss = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
    for fold, (train_index, test_index) in enumerate(ss.split(X_train, Y_train), start=1):
        # ShuffleSplit yields positional indices, so we select rows with .iloc
        X_train_fold = X_train.iloc[train_index].to_numpy()
        X_test_fold = X_train.iloc[test_index].to_numpy()
        y_train_fold = Y_train.iloc[train_index].to_numpy().flatten()
        y_test_fold = Y_train.iloc[test_index].to_numpy().flatten()
        regressor = XGBRegressor()
        regressor.fit(X_train_fold, y_train_fold)
        predictions = regressor.predict(X_test_fold)
        fold_df = pd.DataFrame(
            {"Observed": y_test_fold, "Predicted": predictions.flatten()}
        )
        print("Fold-" + str(fold))
        print("---------------------")
        print(fold_df)
.. code-block:: output

    Fold-1
    ---------------------
            Observed    Predicted
    0     1845.17346  2051.894043
    1     1155.68124  1911.122192
    2     2847.94272  2753.223145
    3     2370.70494  2670.160889
    4     4111.31718  3961.675049
    ...          ...          ...
    1935  3880.18458  3454.832031
    1936  4125.82776  4068.806152
    1937  4586.33838  3829.927002
    1938  2261.99454  3225.578613
    1939  4342.82430  3943.912354

    [1940 rows x 2 columns]

    Fold-2
    ---------------------
            Observed    Predicted
    0     3476.56062  4075.536377
    1     4009.78704  4022.654785
    2     2847.94272  2779.675293
    3     3669.33108  4026.944824
    4     3997.12632  3566.471436
    ...          ...          ...
    1907  2916.91818  2744.992676
    1908  3569.64318  3862.661621
    1909  2118.25278  2221.599854
    1910  1787.61012  1839.471802
    1911  3583.44846  3210.243164

    [1912 rows x 2 columns]

    Fold-3
    ---------------------
            Observed    Predicted
    0     2052.18066  2237.868896
    1     4336.45050  3622.901367
    2     2317.39104  2496.773438
    3     3356.40018  3291.187988
    4     1778.73198  2034.299683
    ...          ...          ...
    1934  3795.23424  2968.955322
    1935  3622.34358  3203.385742
    1936  2261.99454  3115.011475
    1937  4112.62578  3743.435791
    1938  4342.82430  3721.162842

    [1939 rows x 2 columns]

    Fold-4
    ---------------------
            Observed    Predicted
    0     1762.89840  1691.997803
    1     1292.39622  1418.658325
    2     1914.00468  1779.962769
    3     4571.86566  4618.782715
    4     2317.39104  2417.823242
    ...          ...          ...
    1985  2779.37664  2702.244385
    1986  4335.23442  3733.191162
    1987  2916.91818  2609.322021
    1988  4125.82776  3947.512939
    1989  3429.54294  3550.206787

    [1990 rows x 2 columns]

    Fold-5
    ---------------------
            Observed    Predicted
    0     2790.00414  3010.381592
    1     3476.56062  3972.215820
    2     1845.17346  1901.611572
    3     4009.78704  3884.857178
    4     3578.05344  2993.831787
    ...          ...          ...
    1975  3778.69704  4209.392090
    1976  1494.22332  1612.613281
    1977  4125.82776  3902.622559
    1978  4701.03624  4372.867676
    1979  1888.41552  2342.040771

    [1980 rows x 2 columns]
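To condense these per-fold tables into a single score (an extension that is not part of the original tutorial), one could collect an RMSE per fold and average the results:

.. code-block:: python
    :linenos:

    # Recompute the same splits and summarize each fold by its RMSE
    fold_rmses = []
    ss = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
    for train_index, test_index in ss.split(X_train, Y_train):
        reg = XGBRegressor()
        reg.fit(X_train.iloc[train_index], Y_train.iloc[train_index])
        pred = reg.predict(X_train.iloc[test_index])
        fold_rmses.append(np.sqrt(mean_squared_error(Y_train.iloc[test_index], pred)))
    print("Mean RMSE over folds: %.2f +/- %.2f" % (np.mean(fold_rmses), np.std(fold_rmses)))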
That's it, we trained a simple machine learning model to predict peptide retention times from peptide data.
Sophisticated models integrate retention time data from many experiments and add additional properties
of peptides (or even learn them from data) to achieve lower prediction errors.