Question:
Does SequenceFeature.get_df_parts() use 0- or 1-based indexing for tmd_start / tmd_stop?
Description:
I'm trying to determine whether SequenceFeature.get_df_parts() expects the tmd_start and tmd_stop values in the DataFrame to be 1-based or 0-based. This matters because P1 annotations (e.g. from MEROPS) usually refer to residue positions starting from 1.
Concrete example:
If the cleavage site P1 is at position 10 and the TMD should be 10 amino acids long:
Should I write tmd_start = 6 (assuming 1-based indexing)?
Or should I use tmd_start = 5 (assuming 0-based indexing)?
Similarly for tmd_stop:
Does tmd_stop include the amino acid at that position?
Or is it excluded, as is typical in Python slicing?
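To make the two candidate conventions concrete, here is a minimal plain-Python sketch. It does not call aaanalysis; the helper names are hypothetical and exist only for illustration. It uses the 20-residue sequence from the output further down in this issue.

```python
# Hypothetical helpers for illustration only; they are not part of aaanalysis.
SEQ = "LDRYLQRGVRDVHRPCQSVR"  # sequence from the example output in this issue


def slice_one_based_inclusive(seq: str, start: int, stop: int) -> str:
    """Return residues start..stop, counting from 1 and including stop."""
    return seq[start - 1:stop]


def slice_zero_based_half_open(seq: str, start: int, stop: int) -> str:
    """Return residues start..stop-1, counting from 0 (plain Python slicing)."""
    return seq[start:stop]


if __name__ == "__main__":
    # The same 10-residue window expressed in both conventions
    print(slice_one_based_inclusive(SEQ, 6, 15))   # 1-based, stop included
    print(slice_zero_based_half_open(SEQ, 5, 15))  # 0-based, stop excluded
    # Stop value 16 read as 1-based inclusive gives one residue more
    print(slice_one_based_inclusive(SEQ, 6, 16))
```

The first two calls select the same residues, while the third shows how a stop value intended as exclusive becomes one residue too long if the function reads it as inclusive.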
Code to reproduce:
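from Bio import SeqIO
import pandas as pd
import aaanalysis as aa

# input_file (path to a FASTA file) and id_sub (entry name) are defined elsewhere
record = next(SeqIO.parse(input_file, "fasta"))
p1 = 10  # P1 position, 1-based
p1_start_with_0 = p1 - 1  # same position, 0-based
seq = str(record.seq)
p1_amino_acid = seq[p1_start_with_0]
# Expected 10-residue TMD via Python slicing (0-based, stop excluded)
actual_tmd = seq[p1_start_with_0 - 4:p1_start_with_0 + 6]
# Values passed to get_df_parts()
tmd_start_with_1 = p1 - 4
tmd_end_with_1 = p1 + 6
df_amino_acid = pd.DataFrame({"entry": [id_sub], "sequence": [seq], "tmd_start": [tmd_start_with_1], "tmd_stop": [tmd_end_with_1]})
print("df_amino_acid:")
print(df_amino_acid)
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_amino_acid, jmd_c_len=5, jmd_n_len=5)
tmd_from_get_parts = df_parts["tmd"].iloc[0]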
Output:
df_amino_acid:
entry sequence tmd_start tmd_stop
0 A0A0A0VBX4_45 LDRYLQRGVRDVHRPCQSVR 6 16
df_parts:
tmd jmd_n_tmd_n tmd_c_jmd_c
A0A0A0VBX4_45 QRGVRDVHRPC LDRYLQRGVRD VHRPCQSVR-
TMDs:
QRGVRDVHRP
QRGVRDVHRPC (from get_df_parts)
p1 in tmd from seq: R
p1 in tmd from get_df_parts: R
length obtained tmd from seq: 10
length obtained tmd from get_df_parts: 11
length of tmd from get_df_parts behind p1: 6
Conclusion:
Both TMDs start at the same residue (Q at position 6), so tmd_start is not shifted by one.
But get_df_parts() returns 11 residues instead of 10 → it seems to include the residue at tmd_stop.
get_df_parts() appears to interpret tmd_start and tmd_stop as 1-based, matching typical annotation formats like UniProt/MEROPS.
tmd_stop is inclusive: the residue at that position is included in the result. For the 10-residue TMD in this example, tmd_stop would therefore need to be p1 + 5 = 15 rather than p1 + 6 = 16.
Suggestion: It would be helpful if this behavior were clarified in the documentation.
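Assuming the 1-based, inclusive behavior observed in this issue (the documentation does not confirm it here), the TMD boundaries could be derived from P1 as in the sketch below. The helper tmd_bounds_from_p1 and its parameters n_before and n_after are hypothetical names, not part of aaanalysis, and the check uses plain slicing rather than get_df_parts().

```python
from typing import Tuple


def tmd_bounds_from_p1(p1: int, n_before: int = 4, n_after: int = 5) -> Tuple[int, int]:
    """Hypothetical helper: 1-based, inclusive (tmd_start, tmd_stop) around P1.

    The window holds n_before residues before P1, P1 itself, and
    n_after residues after it, i.e. n_before + 1 + n_after residues.
    """
    if p1 - n_before < 1:
        raise ValueError("window starts before the first residue")
    return p1 - n_before, p1 + n_after


seq = "LDRYLQRGVRDVHRPCQSVR"  # sequence from the example output in this issue
tmd_start, tmd_stop = tmd_bounds_from_p1(10)

# Emulate a 1-based, inclusive read with ordinary Python slicing
tmd = seq[tmd_start - 1:tmd_stop]
print(tmd_start, tmd_stop, tmd, len(tmd))
```

With p1 = 10 this gives tmd_start = 6 and tmd_stop = 15, which selects the intended 10 residues under an inclusive stop.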