Add support for extra features #1


Draft
wants to merge 55 commits into
base: main

iblacksand
Collaborator

@iblacksand iblacksand commented Sep 20, 2024

Adds support for additional features to FunMap. Users can set the extra_feature_file key in their config to point to a TSV file that contains features for gene pairs, on any scale.

New Features

  • Ability to add new features, supplied as gene pairs with corresponding values (see Data Format below)
  • New data import Rust library for performance improvements
  • Use a combination of expression data and extra features, or just extra features
    • Set only_extra_features: true in your config YAML file to ignore expression data

Status

  • Import extra feature file
    • Feature import is tested, but the implementation is not yet final
  • Features used for prediction
  • LLR calculation
  • LLR plot creation
  • Full successful run

Implementation notes

  • Duplicate pairs are removed; only the first occurrence is kept
    • May need to keep the last occurrence instead, since that is how the cohort-level information is handled
  • The all_average curve in the LLR plot does not use the extra features, only the cohort information
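
The duplicate-handling choice above can be illustrated with a minimal Python sketch. This is not FunMap's implementation: the function name `dedupe_pairs` and the `keep` parameter are hypothetical, and treating (A, B) and (B, A) as the same pair is an assumption.

```python
def dedupe_pairs(rows, keep="first"):
    """Drop duplicate gene pairs, keeping the first or last occurrence.

    rows: iterable of (gene_a, gene_b, value) tuples.
    Assumption: (A, B) and (B, A) count as the same pair.
    """
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    kept = {}
    for gene_a, gene_b, value in rows:
        key = tuple(sorted((gene_a, gene_b)))  # order-independent pair key
        if keep == "first" and key in kept:
            continue  # an earlier row already claimed this pair
        kept[key] = value  # with keep="last", later rows overwrite earlier ones
    return kept


# Illustrative rows; the second row duplicates the first with the genes swapped.
rows = [("ABC1", "DEF2", 0.12), ("DEF2", "ABC1", 0.99), ("GHI3", "JKLM4", -0.6)]
first = dedupe_pairs(rows, keep="first")
last = dedupe_pairs(rows, keep="last")
```

Switching between the two policies only changes which value survives for a repeated pair, so the decision can be deferred until the behavior of the cohort-level code is confirmed.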

Other Changes

  • All packages were updated to their latest versions
  • Uses pyproject.toml instead of setup.py
  • New folder layout to support maturin for Rust library integration
  • Builds and publishes the package using GitHub Actions

Data Format

The format for the extra_feature_file is shown below. The first column is the first gene in the pair, and the second column is the second gene. The remaining columns are the feature values. If a feature does not have a value for the specified pair, it should be given as NA.

Columns are tab-separated.

| Gene A | Gene B | Feature_X | Feature_Y |
|--------|--------|-----------|-----------|
| ABC1   | DEF2   | 0.12      | 34.5      |
| GHI3   | JKLM4  | -0.6      | NA        |
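
As a rough illustration of this format, the sketch below parses the sample rows with Python's standard csv module and maps NA to None. The function name `read_extra_features` is hypothetical, and this is not the Rust importer from this PR.

```python
import csv
import io

# The sample rows from the PR description, written with real tab separators.
SAMPLE = (
    "Gene A\tGene B\tFeature_X\tFeature_Y\n"
    "ABC1\tDEF2\t0.12\t34.5\n"
    "GHI3\tJKLM4\t-0.6\tNA\n"
)


def read_extra_features(handle):
    """Yield (gene_a, gene_b, {feature: value or None}) for each data row."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    feature_names = header[2:]  # every column after the two gene columns
    for row in reader:
        gene_a, gene_b, *raw = row
        values = {
            name: None if cell == "NA" else float(cell)  # NA marks a missing value
            for name, cell in zip(feature_names, raw)
        }
        yield gene_a, gene_b, values


records = list(read_extra_features(io.StringIO(SAMPLE)))
```

For a real file, the `io.StringIO` object would be replaced by an open file handle.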

@iblacksand iblacksand added the enhancement New feature or request label Sep 20, 2024
@iblacksand iblacksand self-assigned this Sep 20, 2024
@iblacksand
Collaborator Author

The current error occurs during plotting.

Details

Traceback (most recent call last):
  File "/home/elizarra/extra_feature_test/.venv/bin/funmap", line 8, in <module>
    sys.exit(cli())
             ^^^^^
  File "/home/elizarra/extra_feature_test/.venv/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/elizarra/extra_feature_test/.venv/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/home/elizarra/extra_feature_test/.venv/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/elizarra/extra_feature_test/.venv/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/elizarra/extra_feature_test/.venv/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/elizarra/extra_feature_test/.venv/lib/python3.11/site-packages/funmap/cli.py", line 353, in run
    fig_names = plot_results(
                ^^^^^^^^^^^^^
  File "/home/elizarra/extra_feature_test/.venv/lib/python3.11/site-packages/funmap/plotting.py", line 271, in plot_results
    plot_llr_compare_networks(
  File "/home/elizarra/extra_feature_test/.venv/lib/python3.11/site-packages/funmap/plotting.py", line 668, in plot_llr_compare_networks
    ax2.get_yticklabels()[4].set_color("red")
    ^^^^^^^^^^^^^^^^^^^^^
  File "/home/elizarra/extra_feature_test/.venv/lib/python3.11/site-packages/matplotlib/axes/_base.py", line 74, in wrapper
    return get_method(self)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/elizarra/extra_feature_test/.venv/lib/python3.11/site-packages/matplotlib/axis.py", line 1468, in get_ticklabels
    return self.get_majorticklabels()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/elizarra/extra_feature_test/.venv/lib/python3.11/site-packages/matplotlib/axis.py", line 1425, in get_majorticklabels
    self._update_ticks()
  File "/home/elizarra/extra_feature_test/.venv/lib/python3.11/site-packages/matplotlib/axis.py", line 1282, in _update_ticks
    minor_locs = self.get_minorticklocs()
                 ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/elizarra/extra_feature_test/.venv/lib/python3.11/site-packages/matplotlib/axis.py", line 1501, in get_minorticklocs
    minor_locs = np.asarray(self.minor.locator())
                            ^^^^^^^^^^^^^^^^^^^^
  File "/home/elizarra/extra_feature_test/.venv/lib/python3.11/site-packages/matplotlib/ticker.py", line 2341, in __call__
    return self.tick_values(vmin, vmax)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/elizarra/extra_feature_test/.venv/lib/python3.11/site-packages/matplotlib/ticker.py", line 2358, in tick_values
    raise ValueError(
ValueError: Data has no positive values, and therefore can not be log-scaled.
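
The ValueError above is raised when a log-scaled axis has no positive data to place ticks on. One possible guard, sketched below in plain Python, is to pick the scale from the data before plotting. The helper `choose_yscale` and the variable `llr_values` are hypothetical, and whether a linear fallback suits this plot is an assumption.

```python
import math


def choose_yscale(values):
    """Return "log" only if at least one finite value is strictly positive."""
    has_positive = any(
        isinstance(v, (int, float)) and math.isfinite(v) and v > 0 for v in values
    )
    return "log" if has_positive else "linear"


# Hypothetical use in the plotting code (matplotlib call shown as a comment):
#   ax2.set_yscale(choose_yscale(llr_values))
scale_without_positives = choose_yscale([0.0, -1.5, float("nan")])
scale_with_positives = choose_yscale([0.0, 2.5])
```

This only avoids the exception; it does not explain why the plotted series has no positive values, which may point to a problem earlier in the run.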

@iblacksand
Collaborator Author

Incorporated changes from main and increased the version number to 0.2.0.

@iblacksand
Collaborator Author

One problem with the current implementation is file size. For example, some data sets can reach 6 GB in the current TSV format. We need to optimize how this file is loaded so that memory and I/O usage stay limited.
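
One way to bound memory while reading a large TSV is to process it in fixed-size batches rather than loading it whole. The sketch below uses only the Python standard library; the name `iter_batches` is hypothetical, the third data row is made up for illustration, and this is not the approach the PR settles on.

```python
import csv
import io
from itertools import islice


def iter_batches(handle, batch_size):
    """Yield (header, rows) batches of at most batch_size rows from a TSV."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    while True:
        batch = list(islice(reader, batch_size))  # reads at most batch_size rows
        if not batch:
            return
        yield header, batch


# Small in-memory stand-in for a multi-gigabyte file.
sample = io.StringIO(
    "Gene A\tGene B\tFeature_X\n"
    "ABC1\tDEF2\t0.12\n"
    "GHI3\tJKLM4\t-0.6\n"
    "ABC1\tGHI3\tNA\n"
)
batch_sizes = [len(rows) for _, rows in iter_batches(sample, batch_size=2)]
```

Only one batch is held in memory at a time, so peak usage scales with `batch_size` rather than with the file size.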

@iblacksand
Collaborator Author

For the extra feature support, I am creating a Rust library to process the large files in a memory-optimized manner. The current flow is described in the diagram below.

  1. Get all of the genes across all the data sets in FunMap
  2. Save unified gene order
    • Allows us to have a unified order for gene pairs that the extra features can be aligned to
  3. Reorder the extra feature gene pairs to match the new unified order
    • If a feature does not have a gene pair, it will be represented as NA
  4. Save each feature (each column) as a separate pkl file
  5. Import each column/feature into a Pandas dataframe in the Python part of FunMap

```mermaid
graph TD
    A["Expression Data"]
    B["Extra Feature Files"]
    C["Identify All Unique Genes"]
    D["Save unified gene order to pkl file"]
    E["Realign extra features with unified gene order"]
    F["Save each feature as pkl file"]
    A --> C
    B --> C
    C --> D
    C --> E
    E --> F
```
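
The flow above can be sketched in plain Python to show what "aligning to a unified gene order" means. This is only an illustration: the PR does this work in Rust, the function names are hypothetical, and both the sorted gene order and the enumeration of pairs as unordered combinations are assumptions.

```python
import pickle
from itertools import combinations


def unified_gene_order(*gene_collections):
    """Steps 1-2: a sorted list of every gene seen in any input."""
    genes = set()
    for collection in gene_collections:
        genes.update(collection)
    return sorted(genes)


def align_feature(gene_order, pair_values):
    """Step 3: one value per unordered pair of gene_order, None where missing.

    pair_values maps (gene_a, gene_b) tuples to a float.
    """
    lookup = {frozenset(pair): value for pair, value in pair_values.items()}
    return [lookup.get(frozenset(pair)) for pair in combinations(gene_order, 2)]


# Illustrative inputs: genes from expression data plus one extra feature.
expression_genes = ["ABC1", "DEF2", "GHI3"]
feature_x = {("ABC1", "DEF2"): 0.12, ("GHI3", "JKLM4"): -0.6}
feature_genes = {gene for pair in feature_x for gene in pair}

order = unified_gene_order(expression_genes, feature_genes)
aligned = align_feature(order, feature_x)

# Step 4: each feature column is serialized on its own.
# The bytes object stands in for a .pkl file on disk.
payload = pickle.dumps(aligned)
restored = pickle.loads(payload)
```

Because every feature is aligned to the same pair order, each column can be stored and loaded independently and still line up row by row.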

TODO:

  • Determine if performance of this method is adequate.
    • Saving and reading pkl files may be too slow
    • Initial thought is that the data is too large to hold in memory while building the data frame.
    • Transferring data between languages would be a bottleneck; however, I/O is also slow.
  • Identify other candidate file types
    • pkl was chosen as it is easy to import with Pandas and is well-known

Labels
enhancement New feature or request
1 participant