A Python package to retrieve and prepare gene expression data from Gene Expression Omnibus and Genomic Data Commons.
Archived: This was my earlier attempt to automate common data retrieval and preprocessing tasks for gene expression data. Interested readers might want to check out these other resources instead:
- https://github.com/alexvpickering/crossmeta
- https://maayanlab.cloud/archs4/
- https://jhubiostatistics.shinyapps.io/recount/
The gdc module (to retrieve and prepare data from GDC) is not implemented.
- Python 3.9 or higher.
- To use the normalization or batch correction functions from the preprocess module, install R and the required R packages.
For RMA normalization, you will need to install platform design info packages, such as:
- pd.clariom.d.human
- pd.hg.u133.plus.2
- other packages for different platforms you might encounter
These packages can be installed in an R environment by running the script install_r_packages.R. This install script was written for R 4.3.
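As a quick preflight check for the prerequisites above, the following minimal sketch reports whether the running Python meets the stated 3.9 minimum and whether an `Rscript` executable is on the `PATH`. The helper name `check_prerequisites` is hypothetical and not part of this package, and the sketch does not verify which R packages are installed.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 9)):
    """Report whether the stated prerequisites appear to be met.

    Hypothetical helper for illustration only; it is not part of this package.
    """
    return {
        # True when the running interpreter is at least the required version.
        "python_ok": sys.version_info >= min_python,
        # Full path to the Rscript executable, or None if R is not on PATH.
        "rscript_path": shutil.which("Rscript"),
    }


if __name__ == "__main__":
    status = check_prerequisites()
    print("Python version OK:", status["python_ok"])
    print("Rscript found at:", status["rscript_path"])
```

If `rscript_path` is `None`, R is either not installed or not on the `PATH`, and the normalization and batch correction functions that rely on R are unlikely to work.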
Run the commands below at the command line. Replace YOUR_EMAIL@EXAMPLE.COM with your own email address, which is submitted with your GEO queries to the NCBI API.
git clone https://github.com/fogg-lab/transcriptomic-data-integrator.git
cd transcriptomic-data-integrator
pip install -e .
configure-ncbi-email YOUR_EMAIL@EXAMPLE.COM

For usage, refer to the documentation and Colab notebooks.
- The function tdi.geo.map_probes_to_genes is not guaranteed to work on all microarray platform technologies, because the probe set annotation table is organized differently across platforms.
- Other GEO query functions, such as tdi.geo.get_geo_clinical_characteristics, fail when the study's data on GEO is not organized the way this package expects. This happens more often than not.
If you encounter any problems using the package, please submit an issue to report it.