tmd_start and tmd_stop definition in the Position-based format for SequenceFeature.get_df_parts() #15

@TimErWes

Question:

Does SequenceFeature.get_df_parts() use 0- or 1-based indexing for tmd_start / tmd_stop?

Description:

I'm trying to determine whether the SequenceFeature.get_df_parts() function expects the tmd_start and tmd_stop values in the DataFrame to be 1-based or 0-based indexed. This matters because P1 annotations (e.g. from MEROPS) usually refer to residue positions starting from 1.

Concrete example:
If the cleavage site P1 is at position 10 and the TMD should be 10 amino acids long:

Should I write tmd_start = 6 (assuming 1-based indexing)?
Or should I use tmd_start = 5 (assuming 0-based indexing)?

Similarly for tmd_stop:
Does tmd_stop include the amino acid at that position?
Or is it excluded, as is typical in Python slicing?
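To make the two candidate conventions concrete, here is a minimal pure-Python sketch. It does not call AAanalysis at all; the two helper names are hypothetical and only illustrate how the same 10-residue segment would be addressed under each interpretation, using the 20-residue sequence from the output further down.

```python
def slice_one_based_inclusive(seq: str, start: int, stop: int) -> str:
    """Extract seq[start..stop], counting from 1 and including the stop residue."""
    return seq[start - 1:stop]


def slice_zero_based_exclusive(seq: str, start: int, stop: int) -> str:
    """Extract seq[start:stop] exactly as Python slicing does (stop excluded)."""
    return seq[start:stop]


seq = "LDRYLQRGVRDVHRPCQSVR"

# Same 10-residue segment, addressed under each convention
print(slice_one_based_inclusive(seq, 6, 15))
print(slice_zero_based_exclusive(seq, 5, 15))
```

Under the 1-based inclusive reading, the segment length is stop - start + 1; under Python slicing it is stop - start. That one-residue difference is what the test below looks for.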

Code to reproduce:

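from Bio import SeqIO
import pandas as pd
import aaanalysis as aa

# input_file (path to the FASTA file) and id_sub (entry ID) are not shown in this snippet
record = next(SeqIO.parse(input_file, "fasta"))

p1 = 10                   # P1 position, 1-based
p1_start_with_0 = p1 - 1  # same position, 0-based

seq = str(record.seq)

p1_amino_acid = seq[p1_start_with_0]
# Reference TMD via plain Python slicing: 10 residues, stop excluded
actual_tmd = str(seq[p1_start_with_0-4:p1_start_with_0+6])

# Values passed to get_df_parts(), computed from the 1-based P1 position
tmd_start_with_1 = p1 - 4
tmd_end_with_1 = p1 + 6

df_amino_acid = pd.DataFrame({"entry": [id_sub], "sequence": [seq], "tmd_start": [tmd_start_with_1], "tmd_stop": [tmd_end_with_1]})
print("df_amino_acid:")
print(df_amino_acid)

sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_amino_acid, jmd_c_len=5, jmd_n_len=5)
tmd_from_get_parts = df_parts["tmd"].iloc[0]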

Output:

df_amino_acid:
           entry              sequence  tmd_start  tmd_stop
0  A0A0A0VBX4_45  LDRYLQRGVRDVHRPCQSVR          6        16
df_parts:
                       tmd  jmd_n_tmd_n  tmd_c_jmd_c
A0A0A0VBX4_45  QRGVRDVHRPC  LDRYLQRGVRD   VHRPCQSVR-

TMDs:
QRGVRDVHRP
QRGVRDVHRPC (from get_df_parts)
p1 in tmd from seq: R
p1 in tmd from get_df_parts: R
length of TMD obtained from seq: 10
length of TMD obtained from get_df_parts: 11
length of tmd from get_df_parts behind p1: 6

Conclusion:

Both TMDs start at the same residue, so tmd_start = 6 is read as a 1-based position.
get_df_parts() returns 11 residues instead of 10, so it includes the residue at tmd_stop.
In other words, get_df_parts() appears to interpret tmd_start and tmd_stop as 1-based, matching typical annotation formats like UniProt/MEROPS, and tmd_stop is inclusive: the residue at that position is part of the result.
If that is right, a 10-residue TMD starting at position 6 needs tmd_stop = 15 rather than 16.
Suggestion: It would be helpful if this behavior were stated explicitly in the documentation.
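Assuming the 1-based, inclusive behaviour observed above holds in general (it was only checked on this one sequence, and is not confirmed by the documentation), the TMD bounds could be derived from a P1 position roughly as follows. The helper tmd_bounds_from_p1 and its parameters are hypothetical and not part of AAanalysis; the check uses plain string slicing rather than get_df_parts().

```python
def tmd_bounds_from_p1(p1: int, n_before: int = 4, n_after: int = 5) -> tuple[int, int]:
    """Return (tmd_start, tmd_stop) as 1-based, inclusive positions.

    p1 is the 1-based P1 position; n_before / n_after are the number of
    residues kept on either side of P1, so the TMD length is
    n_before + 1 + n_after.
    """
    tmd_start = p1 - n_before
    tmd_stop = p1 + n_after
    if tmd_start < 1:
        raise ValueError("TMD would start before the first residue")
    return tmd_start, tmd_stop


seq = "LDRYLQRGVRDVHRPCQSVR"
p1 = 10

tmd_start, tmd_stop = tmd_bounds_from_p1(p1)

# Emulate a 1-based inclusive lookup with a Python slice
tmd = seq[tmd_start - 1:tmd_stop]
print(tmd_start, tmd_stop, tmd, len(tmd))
```

With these bounds the stop value is one less than the tmd_stop = 16 used in the reproduction, which is the off-by-one that produced the extra residue.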
