Groups GEO (Gene Expression Omnibus) samples based on the keywords that the sample descriptions share.
To install the geogrouper package, follow the steps below. Steps 5-8 install python_mcl,
the Python implementation of the MCL clustering algorithm that this package uses.
1. cd <path/to/your/working/directory>
2. git clone https://github.com/mnpatil17/GeoGrouper
3. cd GeoGrouper
4. pip install -e .
5. git clone https://github.com/koteth/python_mcl # do this inside the outer GeoGrouper directory
6. cd python_mcl
7. python setup.py install
8. cd ..
Using the geogrouper package is simple. The primary method is cluster_descriptions_from_file.
To cluster from a file:
from geogrouper import cluster_descriptions_from_file
clusters_for_each_series = cluster_descriptions_from_file(path_to_data_file)
To cluster from a file AND print to terminal as you go:
from geogrouper import cluster_descriptions_from_file
clusters_for_each_series = cluster_descriptions_from_file(path_to_data_file, should_print_output=True)
To cluster from a file AND print to terminal only the series that have at least N samples:
from geogrouper import cluster_descriptions_from_file
clusters_for_each_series = cluster_descriptions_from_file(path_to_data_file, should_print_output=True, print_series_sample_size=N)
To cluster a list of sample descriptions with some additional description text (abstract_text):
from geogrouper import cluster_descriptions
clusters, mcl_matrix = cluster_descriptions(sample_titles_list, abstract_text)
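The second return value above is named mcl_matrix, after the MCL (Markov clustering) algorithm. For readers unfamiliar with MCL, the sketch below shows the general expansion and inflation loop in plain Python. It only illustrates the technique. It is not the python_mcl implementation, and the function names, the default inflation value, and the fixed iteration count are all assumptions made for this example.

```python
def normalize_columns(matrix):
    """Scale each column so it sums to 1 (a column-stochastic matrix)."""
    size = len(matrix)
    col_sums = [sum(matrix[r][c] for r in range(size)) for c in range(size)]
    return [[matrix[r][c] / col_sums[c] if col_sums[c] else 0.0
             for c in range(size)] for r in range(size)]


def expand(matrix):
    """Expansion step: square the matrix, which simulates two-step random walks."""
    size = len(matrix)
    return [[sum(matrix[r][k] * matrix[k][c] for k in range(size))
             for c in range(size)] for r in range(size)]


def inflate(matrix, power):
    """Inflation step: raise every entry to `power`, then renormalize columns."""
    return normalize_columns([[value ** power for value in row] for row in matrix])


def mcl_sketch(adjacency, inflation=2.0, iterations=50):
    """Hypothetical helper: run a fixed number of MCL iterations on a square
    adjacency matrix (list of lists) and return clusters as sorted index lists."""
    size = len(adjacency)
    # Add self-loops, a common MCL preprocessing step, then normalize.
    matrix = [[float(adjacency[r][c]) + (1.0 if r == c else 0.0)
               for c in range(size)] for r in range(size)]
    matrix = normalize_columns(matrix)
    for _ in range(iterations):
        matrix = inflate(expand(matrix), inflation)
    # Each row that still has non-negligible entries marks one cluster:
    # the columns with mass in that row are the cluster's members.
    clusters = set()
    for row in matrix:
        members = frozenset(c for c, value in enumerate(row) if value > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(cluster) for cluster in clusters)


# Samples 0-2 share keywords with each other; samples 3-4 share a different keyword.
adjacency = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
print(mcl_sketch(adjacency))  # [[0, 1, 2], [3, 4]]
```

A real implementation would normally stop once the matrix stops changing rather than after a fixed number of iterations, and on graphs with weaker structure the rows read off at the end can overlap.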
geo_id.py: handles reading from a specified datatable
keywords.py: has multiple methods for finding keywords for a series (not all are used)
geogrouper.py: the primary file, which handles the clustering. cluster_descriptions_from_file is the primary method
utils.py: various utility functions
The file keywords.py contains the logic to find keywords from GEO data. Currently there are two methods:
get_acronyms()
get_common_words()
The quality of the clustering depends directly on which keywords are found. Changing the keyword-finding logic in keywords.py is therefore a good place to start when iterating on the algorithm.
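As a starting point for experimenting, here is a minimal sketch of what acronym-based and common-word-based keyword finders could look like. These are hypothetical stand-ins, not the actual get_acronyms() and get_common_words() in keywords.py. The function names, the stop-word list, the regular expressions, and the min_fraction parameter are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list for this sketch; the real package may filter differently.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "for", "with", "to", "from", "at", "on"}


def find_acronyms_sketch(text):
    """Return acronym-like tokens: two or more characters, starting with an
    uppercase letter and containing only uppercase letters and digits."""
    return set(re.findall(r"\b[A-Z][A-Z0-9]+\b", text))


def find_common_words_sketch(descriptions, min_fraction=0.3):
    """Return lowercase words (starting with a letter) that appear in at least
    `min_fraction` of the descriptions, excluding stop words."""
    counts = Counter()
    for description in descriptions:
        # Use a set so each word is counted once per description.
        words = set(re.findall(r"[a-z][a-z0-9]*", description.lower()))
        counts.update(words - STOP_WORDS)
    threshold = min_fraction * len(descriptions)
    return {word for word, n in counts.items() if n >= threshold}


print(sorted(find_acronyms_sketch("Knockdown of TP53 in HeLa cells, RNA-seq")))
# ['RNA', 'TP53']

samples = ["control liver rep1", "treated liver rep2", "control kidney rep1"]
print(sorted(find_common_words_sketch(samples, min_fraction=0.5)))
# ['control', 'liver', 'rep1']
```

Mixed-case names such as HeLa are skipped by this acronym pattern, which is the kind of trade-off worth checking against real GEO descriptions when tuning a keyword finder.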