Dataset for benchmark can be found at the following link:
Spectral entropy is a useful property to measure the complexity of a spectrum. It is inspired by the concept of Shannon entropy in information theory. (ref)
Entropy similarity, which measured spectral similarity based on spectral entropy, has been shown to outperform dot product similarity in compound identification. (ref)
The calculation of entropy similarity can be accelerated by using the Flash Entropy Search algorithm. (ref)
Dynamic Entropy Search is built and optimized based on Flash Entropy Search. Besides the excellent search performance, it allows unlimited library spectra with high speed and low memory.
This repository contains the source code to build index, update index, calculate spectral entropy and entropy similarity in python.
You can establish your own library locally as follows:
# Step 1: Import DynamicEntropySearch.
from dynamic_entropy_search.dynamic_entropy_search import DynamicEntropySearch
# Step 2: Assign the path for your library.
entropy_search=DynamicEntropySearch(path_data=path_of_your_library)
# Step 3: Add spectra into the library. This adding operation can be performed multiple times.
entropy_search.add_new_spectra(spectra_list=spectra_1_for_library)
entropy_search.add_new_spectra(spectra_list=spectra_2_for_library)
......
# Step 4: Call build_index() and write() lastly to end adding operation.
entropy_search.build_index()
entropy_search.write()Suppose you have a lot of spectral libraries, you need to format them like this:
import numpy as np
# For each spectral library, it is a list consisting of multiple dictionaries of MS2 spectra.
# For each spectrum, 'precursor_mz' and 'peaks' are necessary.
# 'precursor_mz' should be a float, and 'peaks' should be a 2D np.ndarray like np.ndarray([[m/z, intensity], [m/z, intensity], [m/z, intensity]...], dtype=np.float32).
spectra_1_for_library = [{
"id": "Demo spectrum 1",
"precursor_mz": 150.0,
"peaks": np.array([[100.0, 1.0], [101.0, 1.0], [103.0, 1.0]], dtype=np.float32),
}, {
"id": "Demo spectrum 2",
"precursor_mz": 200.0,
"peaks": np.array([[100.0, 1.0], [101.0, 1.0], [102.0, 1.0]], dtype=np.float32),
"metadata": "ABC"
}, {
"id": "Demo spectrum 3",
"precursor_mz": 250.0,
"peaks": np.array([[200.0, 1.0], [101.0, 1.0], [202.0, 1.0]], dtype=np.float32),
"XXX": "YYY",
}, {
"precursor_mz": 350.0,
"peaks": np.array([[100.0, 1.0], [101.0, 1.0], [302.0, 1.0]], dtype=np.float32),
},
]
spectra_2_for_library ... # Similar to spectra_1_for_library
spectra_3_for_library ... # Similar to spectra_1_for_libraryNote that the precursor_mz and peaks keys are required, the reset of the keys are optional.
The spectra in the spectra library should be cleaned using clean_spectrum() in ms_entropy before passed into the add_new_spectrum().
from ms_entropy import clean_spectrum
precursor_ions_removal_da = 1.6
for spec in spectra_1_for_library:
spec['peaks'] = clean_spectrum(
peaks = spec['peaks'],
max_mz = spec['precursor_mz'] - precursor_ions_removal_da, # Max m/z in peaks.
min_mz: float = -1.0, # Min m/z in peaks.
noise_threshold: float = 0.01, # The minimum intensity to keep. Defaults to 0.01, which will remove peaks with intensity < 0.01 * max_intensity.
min_ms2_difference_in_da: float = 0.05, # The minimum m/z difference between two peaks in the resulting spectrum.
min_ms2_difference_in_ppm: float = -1.0, # The minimum m/z difference between two peaks in the resulting spectrum. Defaults to -1, which will use the min_ms2_difference_in_da instead.
max_peak_num: int = -1, # The maximum number of peaks to keep.
normalize_intensity: bool = True, # Whether to normalize the intensity to sum to 1.
)
for spec in spectra_2_for_library:
spec['peaks'] = clean_spectrum(
peaks = spec['peaks'],
max_mz = spec['precursor_mz'] - precursor_ions_removal_da
) # Other parameters can be set as aforementioned.
for spec in spectra_3_for_library:
spec['peaks'] = clean_spectrum(
peaks = spec['peaks'],
max_mz = spec['precursor_mz'] - precursor_ions_removal_da
) # Other parameters can be set as aforementioned.Note that three parameters (1) min_ms2_difference_in_da in clean_spectrum(), (2) max_ms2_tolerance_in_da in the initialization of class DynamicEntropySearch() and (3) ms2_tolerance_in_da in any search functions of DynamicEntropySearch() should follow this rule: min_ms2_difference_in_da > max_ms2_tolerance_in_da * 2 > ms2_tolerance_in_da * 2. An error will be reported if the condition is not met.
Then you can have your spectra libraries to be added into the library.
# Firstly, import DynamicEntropySearch.
from dynamic_entropy_search.dynamic_entropy_search import DynamicEntropySearch
# Secondly, assign the path for your library.
entropy_search=DynamicEntropySearch(
path_data=path_of_your_library,
max_ms2_tolerance_in_da=0.024, # Maximum MS/MS tolerance (in Daltons) used during spectrum search.
extend_fold=3, # Expansion factor for preallocated storage in each m/z block. Determines ``reserved_len = data_len * extend_fold``.
mass_per_block: float = 0.05, # m/z step size for creating the index blocks.
num_per_group: int = 100_000_000, # Number of spectra assigned to each group.
cache_list_threshold: int = 1_000_000, # Number of spectra to accumulate in memory before writing them to disk.
max_indexed_mz: float = 1500.00005, # Maximum m/z value to index. Ions above this threshold are grouped into a single block.
intensity_weight="entropy", # "entropy" or None.Determines whether intensities are entropy-weighted.
)
# Thirdly, add spectra into the library one by one.
entropy_search.add_new_spectra(spectra_list=spectra_1_for_library)
entropy_search.add_new_spectra(spectra_list=spectra_2_for_library)
entropy_search.add_new_spectra(spectra_list=spectra_3_for_library)
# Lastly, call build_index() and write() to end the adding operation.
entropy_search.build_index()
entropy_search.write()It is necessary to initialize DynamicEntropySearch using a specified path_data, which is the path of your library. The reset of the parameters are optional.
If you only want to build index for open search, you can set index_for_neutral_loss in add_new_spectra() and build_index() to False.
It is necessary to call build_index() and write() lastly after all add_new_spectra() as the end of adding operation.
You can perform identity search, open search, neutral loss search or hybrid search based on your need.
Suppose you have established a library locally under path_of_your_library using the aforementioned method.
Now you can perform search with a query spectrum in correct format like this:
import numpy as np
# For each query spectrum, 'precursor_mz' and 'peaks' are necessary.
# 'precursor_mz' should be a float, and 'peaks' should be a 2D np.ndarray like np.ndarray([[m/z, intensity], [m/z, intensity], [m/z, intensity]...], dtype=np.float32).
query_spectrum = {"precursor_mz": 150.0,
"peaks": np.array([[100.0, 1.0], [101.0, 1.0], [102.0, 1.0]], dtype=np.float32)}If your query spectra is a list consisting of several spectrum:
import numpy as np
# For each query_spectra_list, it is a list consisting of multiple dictionaries of query MS2 spectra.
# For each query spectrum, 'precursor_mz' and 'peaks' are necessary.
# 'precursor_mz' should be a float, and 'peaks' should be a 2D np.ndarray like np.ndarray([[m/z, intensity], [m/z, intensity], [m/z, intensity]...], dtype=np.float32).
query_spectra_list = [{
"precursor_mz": 150.0,
"peaks": np.array([[100.0, 1.0], [101.0, 1.0], [102.0, 1.0]], dtype=np.float32)
},{
"precursor_mz": 250.0,
"peaks": np.array([[108.0, 1.0], [113.0, 1.0], [157.0, 1.0]], dtype=np.float32)
},{
"precursor_mz": 299.0,
"peaks": np.array([[119.0, 1.0], [145.0, 1.0], [157.0, 1.0]], dtype=np.float32)
},
]You can call the DynamicEntropySearch class with corresponding path_data to search the library like this:
from dynamic_entropy_search.dynamic_entropy_search import DynamicEntropySearch
# Assign the path for your library
entropy_search=DynamicEntropySearch(path_data=path_of_your_library)
# Search the library and you can fetch the metadata from the results with the highest scores
result=entropy_search.search_topn_matches(
precursor_mz=query_spectrum['precursor_mz'],
peaks=query_spectrum['peaks'],
ms1_tolerance_in_da=0.01, # You can change ms1_tolerance_in_da as needed.
ms2_tolerance_in_da=0.02, # You can change ms2_tolerance_in_da as needed.
method='open', # or 'neutral_loss' or 'hybrid' or 'identity'.
clean=True, # If you don't want to use the internal clean process in this function, set it to False.
topn=3, # You can change topn as needed.
need_metadata=True, # Set it to True if need metadata.
)
# After that, you can print the result like this:
print(result)If the query spectra is a list, iterate it to perform search.
from dynamic_entropy_search.dynamic_entropy_search import DynamicEntropySearch
# Assign the path for your library
entropy_search=DynamicEntropySearch(path_data=path_of_your_library)
# For query_spectra_list, iterate it to perform search for each elements.
for spec in query_spectra_list:
result=entropy_search.search_topn_matches(
precursor_mz=spec['precursor_mz'],
peaks=spec['peaks'],
ms1_tolerance_in_da=0.01, # You can change ms1_tolerance_in_da as needed.
ms2_tolerance_in_da=0.02, # You can change ms2_tolerance_in_da as needed.
method='open', # or 'neutral_loss' or 'hybrid' or 'identity'.
clean=True, # If you don't want to use the internal clean process in this function, set it to False.
topn=3, # You can change topn as needed.
need_metadata=True, # Set it to True if need metadata.
)
# After that, you can print the result like this:
print(result)Besides search_topn_matches(), You can also perform search using other functions:
from dynamic_entropy_search.dynamic_entropy_search import DynamicEntropySearch
# Assign the path for your library
entropy_search=DynamicEntropySearch(path_data=path_of_your_library)
# For query_spectra_list, iterate it to perform search for each elements.For example:
### Use `search()` and get an array with all entropy similarities ###
for spec in query_spectra_list:
result=entropy_search.search(
precursor_mz=spec['precursor_mz'],
peaks=spec['peaks'],
ms1_tolerance_in_da=0.01, # You can change ms1_tolerance_in_da as needed.
ms2_tolerance_in_da=0.02, # You can change ms2_tolerance_in_da as needed.
method='open', # or 'neutral_loss' or 'hybrid' or 'identity' or 'all'.
clean=True, # If you don't want to use the internal clean process in this function, set it to False.
)
print(result)
### Use `identity_search()` and get an array with all entropy similarities based on identity search ###
for spec in query_spectra_list:
result=entropy_search.identity_search(
precursor_mz=spec['precursor_mz'],
peaks=spec['peaks'],
ms1_tolerance_in_da=0.01, # You can change ms1_tolerance_in_da as needed.
ms2_tolerance_in_da=0.02, # You can change ms2_tolerance_in_da as needed.
)
print(result)
### Use `open_search()` and get an array with all entropy similarities based on open search ###
for spec in query_spectra_list:
result=entropy_search.open_search(
peaks=spec['peaks'],
ms2_tolerance_in_da=0.02, # You can change ms2_tolerance_in_da as needed.
)
print(result)
### Use `neutral_loss_search()` and get an array with all entropy similarities based on neutral loss search ###
for spec in query_spectra_list:
result=entropy_search.neutral_loss_search(
precursor_mz=spec['precursor_mz'],
peaks=spec['peaks'],
ms2_tolerance_in_da=0.02, # You can change ms2_tolerance_in_da as needed.
)
print(result)
### Use `hybrid_search()` and get an array with all entropy similarities based on hybrid search ###
for spec in query_spectra_list:
result=entropy_search.hybrid_search(
precursor_mz=spec['precursor_mz'],
peaks=spec['peaks'],
ms2_tolerance_in_da=0.02, # You can change ms2_tolerance_in_da as needed.
)
print(result)RepositorySearch offers prebuilt indexes for public metabolomics repositories, comprising more than 1.4 billion spectra. As a part of DynamicEntropySearch, users can use RepositorySearch to search against these public metabolomics repositories. We have built the indexes and upload them to (https://huggingface.co/datasets/YuanyueLiZJU/dynamic_entropy_search/tree/main).
Suppose you have downloaded the prebuilt indexes from (https://huggingface.co/datasets/YuanyueLiZJU/dynamic_entropy_search/tree/main) and extracted them to path_repository_indexes on your local machine, you can perform search like this:
Firstly, assign the path of the prebuilt indexes as the path_data of RepositorySearch class.
from dynamic_entropy_search.repository_search import RepositorySearch
search_engine=RepositorySearch(path_data=path_repository_indexes)Prepare query spectrum in correct format (see aforementioned points to prepare the format).
import numpy as np
query_spec={
"charge": 1,
"peaks": np.array([[58.0646, 1894], [86.095, 98105]], dtype=np.float32),
"precursor_mz": 183.987125828,
}
query_spec['peaks']=clean_spectrum(
peaks=query_spec['peaks'],
max_mz = query_spec['precursor_mz'] - precursor_ions_removal_da
)
# Or a list:
query_spec = [{
"precursor_mz": 150.0,
"peaks": np.array([[100.0, 1.0], [101.0, 1.0], [102.0, 1.0]], dtype=np.float32),
"charge": 1
},{
"precursor_mz": 250.0,
"peaks": np.array([[108.0, 1.0], [113.0, 1.0], [157.0, 1.0]], dtype=np.float32),
"charge": -1
},{
"precursor_mz": 299.0,
"peaks": np.array([[119.0, 1.0], [145.0, 1.0], [157.0, 1.0]], dtype=np.float32),
"charge": 1
},
]
# Also need to clean
for spec in query_spec:
spec['peaks']=clean_spectrum(
peaks=spec['peaks'],
max_mz = spec['precursor_mz'] - precursor_ions_removal_da
)Then perform search, and you can get top few results.
# Perform search
search_result = search_engine.search_topn_matches(
charge=query_spec["charge"],
precursor_mz=query_spec["precursor_mz"],
peaks=query_spec["peaks"],
method="open", # or 'hybrid' or 'neutral_loss' or 'identity'
)
# If the query spectra is a list:
for spec in query_spec:
search_result = search_engine.search_topn_matches(
charge=spec["charge"],
precursor_mz=spec["precursor_mz"],
peaks=spec["peaks"],
method="open", # or 'hybrid' or 'neutral_loss' or 'identity'
)If you want to extract any spectrum from the results:
def get_spectrum_data(search_engine: RepositorySearch, charge, spec_idx):
# You can specify the spectrum you want to extract from results by setting spec_idx
spec = search_engine.get_spectrum(charge, spec_idx)
spec.pop("scan", None)
return spec
# For example, set `spec_idx` to 0.
spec_data = get_spectrum_data(search_engine, query_spec["charge"], search_result[0].pop("spec_idx"))
spec_data.update(search_result[0])
print(f"Top match spectrum data: {spec_data}")
# For example, set `spec_idx` to 1.
spec_data = get_spectrum_data(search_engine, query_spec["charge"], search_result[1].pop("spec_idx"))
spec_data.update(search_result[1])
print(f"Match spectrum data: {spec_data}")Here is an example of the result:
Top match spectrum data: {'precursor_mz': 512.233642578125, 'charge': -1, 'rt': 76.76499938964844, 'peaks': array([[200.00693 , 0.74098176],
[202.0056 , 0.2590183 ]], dtype=float32), 'file_name': 'metabolomics_workbench/ST003745/x01997_NEG.mzML.gz', 'scan': np.uint64(1139), 'similarity': np.float64(0.8030592799186707)}