Dataset for benchmark can be found at the following link:

Theoretical Background

Spectral entropy is a useful property to measure the complexity of a spectrum. It is inspired by the concept of Shannon entropy in information theory. (ref)
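As a rough illustration of the definition above, the sketch below computes the Shannon entropy of a spectrum from its peak intensities after normalizing them to sum to 1. The function name `spectral_entropy` is hypothetical and is not part of this package; it only shows the idea.

```python
import math


def spectral_entropy(intensities):
    """Shannon entropy (natural log) of peak intensities normalized to sum to 1."""
    total = sum(intensities)
    if total <= 0:
        return 0.0
    entropy = 0.0
    for intensity in intensities:
        if intensity > 0:
            p = intensity / total
            entropy -= p * math.log(p)
    return entropy


# Three equally intense peaks: the entropy equals ln(3).
print(spectral_entropy([1.0, 1.0, 1.0]))
# A single peak carries no uncertainty, so the entropy is 0.
print(spectral_entropy([5.0]))
```

A spectrum whose intensity is spread evenly over many peaks has high entropy, while one dominated by a single peak has entropy close to zero.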

Entropy similarity, which measures spectral similarity based on spectral entropy, has been shown to outperform dot product similarity in compound identification. (ref)
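A minimal sketch of the unweighted entropy similarity follows, assuming the two spectra have already been aligned so that matching peaks share the same m/z key. It uses the commonly cited form 1 - (2 * S_AB - S_A - S_B) / ln(4), where S_AB is the entropy of the 1:1 merged spectrum. Tolerance-based peak matching and entropy-based intensity weighting are omitted, and the names `normalize`, `entropy` and `unweighted_entropy_similarity` are illustrative only.

```python
import math


def normalize(spectrum):
    """Scale a {m/z: intensity} mapping so that the intensities sum to 1."""
    total = sum(spectrum.values())
    if total <= 0:
        raise ValueError("spectrum needs at least one positive intensity")
    return {mz: value / total for mz, value in spectrum.items()}


def entropy(spectrum):
    """Shannon entropy (natural log) of a normalized {m/z: intensity} mapping."""
    return -sum(p * math.log(p) for p in spectrum.values() if p > 0)


def unweighted_entropy_similarity(spectrum_a, spectrum_b):
    """Entropy similarity of two pre-aligned spectra, without intensity weighting."""
    a = normalize(spectrum_a)
    b = normalize(spectrum_b)
    # Merge the two spectra 1:1; peaks with the same key are treated as matched.
    merged = {mz: 0.5 * a.get(mz, 0.0) + 0.5 * b.get(mz, 0.0) for mz in set(a) | set(b)}
    return 1.0 - (2.0 * entropy(merged) - entropy(a) - entropy(b)) / math.log(4)


identical = unweighted_entropy_similarity({100.0: 1.0, 101.0: 1.0}, {100.0: 1.0, 101.0: 1.0})
disjoint = unweighted_entropy_similarity({100.0: 1.0}, {200.0: 1.0})
print(identical, disjoint)
```

Identical spectra score close to 1 and spectra with no shared peaks score close to 0, which is the range a similarity score is expected to cover.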

The calculation of entropy similarity can be accelerated by using the Flash Entropy Search algorithm. (ref)

Dynamic Entropy Search is built on Flash Entropy Search and further optimized. In addition to its search performance, it places no limit on the number of library spectra while keeping speed high and memory usage low.

How to use this package

This repository contains the Python source code to build an index, update an index, and calculate spectral entropy and entropy similarity.

Usage of library construction (combining initializing and updating process)

You can establish your own library locally as follows:

In brief

# Step 1: Import DynamicEntropySearch.
from dynamic_entropy_search.dynamic_entropy_search import DynamicEntropySearch

# Step 2: Assign the path for your library.
entropy_search=DynamicEntropySearch(path_data=path_of_your_library)

# Step 3: Add spectra into the library. This adding operation can be performed multiple times.
entropy_search.add_new_spectra(spectra_list=spectra_1_for_library)
entropy_search.add_new_spectra(spectra_list=spectra_2_for_library)
# ... (add further spectra lists in the same way)

# Step 4: Finally, call build_index() and write() to finish adding spectra.
entropy_search.build_index()
entropy_search.write()

In details

Step 1: prepare the spectral libraries

If you have several spectral libraries, format each of them like this:

import numpy as np
# Each spectral library is a list of dictionaries, one dictionary per MS2 spectrum.

# For each spectrum, 'precursor_mz' and 'peaks' are required.
# 'precursor_mz' should be a float, and 'peaks' should be a 2D np.ndarray such as np.array([[m/z, intensity], [m/z, intensity], [m/z, intensity]...], dtype=np.float32).


spectra_1_for_library = [{
    "id": "Demo spectrum 1",
    "precursor_mz": 150.0,
    "peaks": np.array([[100.0, 1.0], [101.0, 1.0], [103.0, 1.0]], dtype=np.float32), 
}, {
    "id": "Demo spectrum 2",
    "precursor_mz": 200.0,
    "peaks": np.array([[100.0, 1.0], [101.0, 1.0], [102.0, 1.0]], dtype=np.float32),
    "metadata": "ABC"
}, {
    "id": "Demo spectrum 3",
    "precursor_mz": 250.0,
    "peaks": np.array([[200.0, 1.0], [101.0, 1.0], [202.0, 1.0]], dtype=np.float32),
    "XXX": "YYY",
}, {
    "precursor_mz": 350.0,
    "peaks": np.array([[100.0, 1.0], [101.0, 1.0], [302.0, 1.0]], dtype=np.float32),
},
    ]

# spectra_2_for_library and spectra_3_for_library are built in the same way as spectra_1_for_library.

Note that the precursor_mz and peaks keys are required; the rest of the keys are optional.
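Before adding a large library, it can help to check that every dictionary carries the two required keys. The helper below is only a sketch of such a check; `validate_spectrum` is a hypothetical name and not part of this package. Plain nested lists are used for the peaks here to keep the example dependency-free, and the same loop also works on a 2D array because iterating over it yields rows.

```python
import numbers


def validate_spectrum(spectrum):
    """Return a list of problems found in one spectrum dictionary (empty if none)."""
    problems = []
    for key in ("precursor_mz", "peaks"):
        if key not in spectrum:
            problems.append(f"missing required key: {key}")
    if "precursor_mz" in spectrum and not isinstance(spectrum["precursor_mz"], numbers.Real):
        problems.append("precursor_mz should be a number")
    if "peaks" in spectrum:
        for row in spectrum["peaks"]:
            if len(row) != 2:
                problems.append("each peak should be an [m/z, intensity] pair")
                break
    return problems


good = {"id": "Demo spectrum 1", "precursor_mz": 150.0, "peaks": [[100.0, 1.0], [101.0, 1.0]]}
bad = {"id": "No precursor", "peaks": [[100.0, 1.0, 5.0]]}
print(validate_spectrum(good))
print(validate_spectrum(bad))
```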

The spectra in each spectral library should be cleaned with clean_spectrum() from ms_entropy before being passed to add_new_spectra().

from ms_entropy import clean_spectrum

precursor_ions_removal_da = 1.6

for spec in spectra_1_for_library:
    spec['peaks'] = clean_spectrum(
        peaks=spec['peaks'],
        max_mz=spec['precursor_mz'] - precursor_ions_removal_da,  # Max m/z in peaks.
        min_mz=-1.0,  # Min m/z in peaks.
        noise_threshold=0.01,  # The minimum intensity to keep. Defaults to 0.01, which will remove peaks with intensity < 0.01 * max_intensity.
        min_ms2_difference_in_da=0.05,  # The minimum m/z difference between two peaks in the resulting spectrum.
        min_ms2_difference_in_ppm=-1.0,  # The minimum m/z difference between two peaks in the resulting spectrum. Defaults to -1, which will use the min_ms2_difference_in_da instead.
        max_peak_num=-1,  # The maximum number of peaks to keep.
        normalize_intensity=True,  # Whether to normalize the intensity to sum to 1.
    )

for spec in spectra_2_for_library:
    spec['peaks'] = clean_spectrum(
        peaks = spec['peaks'],
        max_mz = spec['precursor_mz'] - precursor_ions_removal_da
    )   # Other parameters can be set as aforementioned.

for spec in spectra_3_for_library:
    spec['peaks'] = clean_spectrum(
        peaks = spec['peaks'],
        max_mz = spec['precursor_mz'] - precursor_ions_removal_da
    ) # Other parameters can be set as aforementioned.
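To make the effect of the main cleaning parameters more concrete, here is a simplified pure-Python sketch of three of the steps: dropping peaks above max_mz, removing peaks below noise_threshold times the maximum intensity, and normalizing intensities to sum to 1. It is not the ms_entropy implementation. It skips the merging of close peaks (min_ms2_difference_in_da) and the peak-number limit, the boundary handling at max_mz is an assumption, and `simple_clean` is a hypothetical name.

```python
def simple_clean(peaks, max_mz, noise_threshold=0.01):
    """Simplified cleaning of (m/z, intensity) pairs; for illustration only."""
    # Keep peaks at or below max_mz that have a positive intensity.
    kept = [(mz, intensity) for mz, intensity in peaks if mz <= max_mz and intensity > 0]
    if not kept:
        return []
    # Remove peaks weaker than noise_threshold * the most intense remaining peak.
    max_intensity = max(intensity for _, intensity in kept)
    kept = [(mz, intensity) for mz, intensity in kept if intensity >= noise_threshold * max_intensity]
    # Normalize the remaining intensities so that they sum to 1, sorted by m/z.
    total = sum(intensity for _, intensity in kept)
    return sorted((mz, intensity / total) for mz, intensity in kept)


raw_peaks = [(100.0, 80.0), (101.0, 0.5), (120.0, 20.0), (149.5, 50.0)]
cleaned = simple_clean(raw_peaks, max_mz=150.0 - 1.6)
print(cleaned)
```

In this example the peak at 149.5 is dropped because it lies within 1.6 Da of the precursor, and the peak at 101.0 is dropped as noise.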

Note that three parameters, (1) min_ms2_difference_in_da in clean_spectrum(), (2) max_ms2_tolerance_in_da in the initialization of the DynamicEntropySearch class and (3) ms2_tolerance_in_da in any search function of DynamicEntropySearch, must satisfy this rule: min_ms2_difference_in_da > max_ms2_tolerance_in_da * 2 > ms2_tolerance_in_da * 2. An error is reported if the condition is not met.
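The rule can be checked up front before building or searching. The function below is a small sketch of that check; `tolerances_are_consistent` is a hypothetical helper, not a function of this package, and the numbers are the ones used in the examples of this README.

```python
def tolerances_are_consistent(min_ms2_difference_in_da, max_ms2_tolerance_in_da, ms2_tolerance_in_da):
    """Check min_ms2_difference_in_da > max_ms2_tolerance_in_da * 2 > ms2_tolerance_in_da * 2."""
    return min_ms2_difference_in_da > max_ms2_tolerance_in_da * 2 > ms2_tolerance_in_da * 2


# Values used in this README: 0.05 > 0.048 > 0.04, so the rule holds.
print(tolerances_are_consistent(0.05, 0.024, 0.02))
# A larger maximum tolerance breaks the first inequality: 0.05 > 0.06 is false.
print(tolerances_are_consistent(0.05, 0.03, 0.02))
```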

Your spectral libraries are now ready to be added to the library.

Step 2: perform update

# Firstly, import DynamicEntropySearch.
from dynamic_entropy_search.dynamic_entropy_search import DynamicEntropySearch

# Secondly, assign the path for your library.
entropy_search = DynamicEntropySearch(
    path_data=path_of_your_library,
    max_ms2_tolerance_in_da=0.024,  # Maximum MS/MS tolerance (in Daltons) used during spectrum search.
    extend_fold=3,  # Expansion factor for preallocated storage in each m/z block. Determines reserved_len = data_len * extend_fold.
    mass_per_block=0.05,  # m/z step size for creating the index blocks.
    num_per_group=100_000_000,  # Number of spectra assigned to each group.
    cache_list_threshold=1_000_000,  # Number of spectra to accumulate in memory before writing them to disk.
    max_indexed_mz=1500.00005,  # Maximum m/z value to index. Ions above this threshold are grouped into a single block.
    intensity_weight="entropy",  # "entropy" or None. Determines whether intensities are entropy-weighted.
)

# Thirdly, add spectra into the library one by one.
entropy_search.add_new_spectra(spectra_list=spectra_1_for_library)
entropy_search.add_new_spectra(spectra_list=spectra_2_for_library)
entropy_search.add_new_spectra(spectra_list=spectra_3_for_library)

# Lastly, call build_index() and write() to end the adding operation.
entropy_search.build_index()
entropy_search.write()

DynamicEntropySearch must be initialized with a path_data, which is the path of your library. The rest of the parameters are optional.

If you only need an index for open search, set index_for_neutral_loss to False in both add_new_spectra() and build_index().

After all add_new_spectra() calls, you must call build_index() and then write() to finish the adding operation.
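Because add_new_spectra() can be called repeatedly, a large collection can be fed in batches instead of as one very large list. The helper below is a small sketch of that pattern; `iter_batches` is a hypothetical name, the batch size is arbitrary, and the commented-out calls reuse only the methods shown above.

```python
from itertools import islice


def iter_batches(spectra, batch_size):
    """Yield consecutive lists of at most batch_size items from any iterable."""
    iterator = iter(spectra)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Stand-in data; in practice each item would be a cleaned spectrum dictionary.
demo_items = ["spec_a", "spec_b", "spec_c", "spec_d", "spec_e"]
for batch in iter_batches(demo_items, 2):
    print(batch)

# With a real library the loop would look like this:
# for batch in iter_batches(all_spectra, 10_000):
#     entropy_search.add_new_spectra(spectra_list=batch)
# entropy_search.build_index()
# entropy_search.write()
```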

Usage of search

You can perform identity search, open search, neutral loss search or hybrid search based on your need.

In brief

Suppose you have established a library locally under path_of_your_library using the method described above.

You can now search with a correctly formatted query spectrum like this:

import numpy as np
# For each query spectrum, 'precursor_mz' and 'peaks' are required.
# 'precursor_mz' should be a float, and 'peaks' should be a 2D np.ndarray such as np.array([[m/z, intensity], [m/z, intensity], [m/z, intensity]...], dtype=np.float32).

query_spectrum = {"precursor_mz": 150.0,
                  "peaks": np.array([[100.0, 1.0], [101.0, 1.0], [102.0, 1.0]], dtype=np.float32)}

If you have several query spectra, put them in a list:

import numpy as np

# query_spectra_list is a list of dictionaries, one dictionary per query MS2 spectrum.

# For each query spectrum, 'precursor_mz' and 'peaks' are required.
# 'precursor_mz' should be a float, and 'peaks' should be a 2D np.ndarray such as np.array([[m/z, intensity], [m/z, intensity], [m/z, intensity]...], dtype=np.float32).

query_spectra_list = [{
                "precursor_mz": 150.0,
                "peaks": np.array([[100.0, 1.0], [101.0, 1.0], [102.0, 1.0]], dtype=np.float32)
                },{
                "precursor_mz": 250.0,
                "peaks": np.array([[108.0, 1.0], [113.0, 1.0], [157.0, 1.0]], dtype=np.float32)
                },{
                "precursor_mz": 299.0,
                "peaks": np.array([[119.0, 1.0], [145.0, 1.0], [157.0, 1.0]], dtype=np.float32)
                },
                ]

You can call the DynamicEntropySearch class with corresponding path_data to search the library like this:

from dynamic_entropy_search.dynamic_entropy_search import DynamicEntropySearch

# Assign the path for your library
entropy_search=DynamicEntropySearch(path_data=path_of_your_library)

# Search the library and you can fetch the metadata from the results with the highest scores
result=entropy_search.search_topn_matches(
        precursor_mz=query_spectrum['precursor_mz'],
        peaks=query_spectrum['peaks'],
        ms1_tolerance_in_da=0.01, # You can change ms1_tolerance_in_da as needed.
        ms2_tolerance_in_da=0.02, # You can change ms2_tolerance_in_da as needed.
        method='open', # or 'neutral_loss' or 'hybrid' or 'identity'.
        clean=True, # If you don't want to use the internal clean process in this function, set it to False.
        topn=3, # You can change topn as needed.
        need_metadata=True, # Set it to True if need metadata.
)

# After that, you can print the result like this:
print(result)

If the query spectra are in a list, iterate over the list and search with each spectrum.

from dynamic_entropy_search.dynamic_entropy_search import DynamicEntropySearch

# Assign the path for your library
entropy_search=DynamicEntropySearch(path_data=path_of_your_library)

# Iterate over query_spectra_list and search with each element.
for spec in query_spectra_list:
    result=entropy_search.search_topn_matches(
            precursor_mz=spec['precursor_mz'],
            peaks=spec['peaks'],
            ms1_tolerance_in_da=0.01, # You can change ms1_tolerance_in_da as needed.
            ms2_tolerance_in_da=0.02, # You can change ms2_tolerance_in_da as needed.
            method='open', # or 'neutral_loss' or 'hybrid' or 'identity'.
            clean=True, # If you don't want to use the internal clean process in this function, set it to False.
            topn=3, # You can change topn as needed.
            need_metadata=True, # Set it to True if need metadata.
    )
    # After that, you can print the result like this:
    print(result)

Multiple search options

Besides search_topn_matches(), you can also search using other functions:

from dynamic_entropy_search.dynamic_entropy_search import DynamicEntropySearch

# Assign the path for your library
entropy_search=DynamicEntropySearch(path_data=path_of_your_library)

# Iterate over query_spectra_list and search with each element.

For example:

### Use `search()` and get an array with all entropy similarities ###
for spec in query_spectra_list:
    result=entropy_search.search(
            precursor_mz=spec['precursor_mz'],
            peaks=spec['peaks'],
            ms1_tolerance_in_da=0.01, # You can change ms1_tolerance_in_da as needed.
            ms2_tolerance_in_da=0.02, # You can change ms2_tolerance_in_da as needed.
            method='open', # or 'neutral_loss' or 'hybrid' or 'identity' or 'all'.
            clean=True, # If you don't want to use the internal clean process in this function, set it to False.
    )
    print(result)


### Use `identity_search()` and get an array with all entropy similarities based on identity search ###
for spec in query_spectra_list:
    result=entropy_search.identity_search(
            precursor_mz=spec['precursor_mz'],
            peaks=spec['peaks'],
            ms1_tolerance_in_da=0.01, # You can change ms1_tolerance_in_da as needed.
            ms2_tolerance_in_da=0.02, # You can change ms2_tolerance_in_da as needed.
    )
    print(result)


### Use `open_search()` and get an array with all entropy similarities based on open search ###
for spec in query_spectra_list:
    result=entropy_search.open_search(
            peaks=spec['peaks'],
            ms2_tolerance_in_da=0.02, # You can change ms2_tolerance_in_da as needed.
    )
    print(result)


### Use `neutral_loss_search()` and get an array with all entropy similarities based on neutral loss search ###
for spec in query_spectra_list:
    result=entropy_search.neutral_loss_search(
            precursor_mz=spec['precursor_mz'],
            peaks=spec['peaks'],
            ms2_tolerance_in_da=0.02, # You can change ms2_tolerance_in_da as needed.
    )
    print(result)


### Use `hybrid_search()` and get an array with all entropy similarities based on hybrid search ###
for spec in query_spectra_list:
    result=entropy_search.hybrid_search(
            precursor_mz=spec['precursor_mz'],
            peaks=spec['peaks'],
            ms2_tolerance_in_da=0.02, # You can change ms2_tolerance_in_da as needed.
    )
    print(result)

Usage of RepositorySearch

RepositorySearch offers prebuilt indexes for public metabolomics repositories, comprising more than 1.4 billion spectra. As part of DynamicEntropySearch, RepositorySearch lets users search against these public metabolomics repositories. We have built the indexes and uploaded them to (https://huggingface.co/datasets/YuanyueLiZJU/dynamic_entropy_search/tree/main).

Suppose you have downloaded the prebuilt indexes from (https://huggingface.co/datasets/YuanyueLiZJU/dynamic_entropy_search/tree/main) and extracted them to path_repository_indexes on your local machine. You can then search like this:

First, pass the path of the prebuilt indexes as the path_data of the RepositorySearch class.

from dynamic_entropy_search.repository_search import RepositorySearch

search_engine=RepositorySearch(path_data=path_repository_indexes)

Prepare a query spectrum in the correct format (see the format described above).

import numpy as np
from ms_entropy import clean_spectrum

precursor_ions_removal_da = 1.6

query_spec = {
    "charge": 1,
    "peaks": np.array([[58.0646, 1894], [86.095, 98105]], dtype=np.float32),
    "precursor_mz": 183.987125828,
}

# Clean the query spectrum in the same way as the library spectra.
query_spec['peaks'] = clean_spectrum(
    peaks=query_spec['peaks'],
    max_mz=query_spec['precursor_mz'] - precursor_ions_removal_da
)
# Or a list:
query_spec_list = [{
                "precursor_mz": 150.0,
                "peaks": np.array([[100.0, 1.0], [101.0, 1.0], [102.0, 1.0]], dtype=np.float32),
                "charge": 1
                },{
                "precursor_mz": 250.0,
                "peaks": np.array([[108.0, 1.0], [113.0, 1.0], [157.0, 1.0]], dtype=np.float32),
                "charge": -1
                },{
                "precursor_mz": 299.0,
                "peaks": np.array([[119.0, 1.0], [145.0, 1.0], [157.0, 1.0]], dtype=np.float32),
                "charge": 1
                },
                ]

# The spectra in the list also need to be cleaned
for spec in query_spec_list:
    spec['peaks']=clean_spectrum(
            peaks=spec['peaks'],
            max_mz = spec['precursor_mz'] - precursor_ions_removal_da
        )

Then perform the search to get the top results.

# Perform search with a single query spectrum
search_result = search_engine.search_topn_matches(
    charge=query_spec["charge"],
    precursor_mz=query_spec["precursor_mz"],
    peaks=query_spec["peaks"],
    method="open", # or 'hybrid' or 'neutral_loss' or 'identity'
)

# If the query spectra are in a list:
for spec in query_spec_list:
    search_result = search_engine.search_topn_matches(
        charge=spec["charge"],
        precursor_mz=spec["precursor_mz"],
        peaks=spec["peaks"],
        method="open", # or 'hybrid' or 'neutral_loss' or 'identity'
    )

If you want to extract any spectrum from the results:

def get_spectrum_data(search_engine: RepositorySearch, charge, spec_idx):
    # You can specify the spectrum you want to extract from results by setting spec_idx
    spec = search_engine.get_spectrum(charge, spec_idx)
    spec.pop("scan", None)
    return spec
# For example, take the top match (search_result[0]) and use its `spec_idx`.
spec_data = get_spectrum_data(search_engine, query_spec["charge"], search_result[0].pop("spec_idx"))
spec_data.update(search_result[0])
print(f"Top match spectrum data: {spec_data}")

# For example, take the second match (search_result[1]) and use its `spec_idx`.
spec_data = get_spectrum_data(search_engine, query_spec["charge"], search_result[1].pop("spec_idx"))
spec_data.update(search_result[1])
print(f"Match spectrum data: {spec_data}")

Here is an example of the result:

Top match spectrum data: {'precursor_mz': 512.233642578125, 'charge': -1, 'rt': 76.76499938964844, 'peaks': array([[200.00693   ,   0.74098176],
       [202.0056    ,   0.2590183 ]], dtype=float32), 'file_name': 'metabolomics_workbench/ST003745/x01997_NEG.mzML.gz', 'scan': np.uint64(1139), 'similarity': np.float64(0.8030592799186707)}
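The file_name field in the example above appears to have the form repository/study/file. Assuming that layout holds (it is inferred from this single example, not from documented behavior), the source of a match could be split out as sketched below; `split_source` is a hypothetical helper and not part of this package.

```python
def split_source(file_name):
    """Split 'repository/study/file' into its three parts (assumed layout)."""
    repository, study_id, raw_file = file_name.split("/", 2)
    return {"repository": repository, "study_id": study_id, "raw_file": raw_file}


# Only the fields needed here are copied from the example result above.
match = {
    "file_name": "metabolomics_workbench/ST003745/x01997_NEG.mzML.gz",
    "similarity": 0.8030592799186707,
}
source = split_source(match["file_name"])
print(source["repository"], source["study_id"], source["raw_file"])
```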
