Eco-discretizer

Overview

This script uses k-means clustering to discretize climatic datasets. The goal is to convert continuous data into a discrete form suitable for the many comparative methods that require discrete characters, and for exploratory analyses.

The script takes two kinds of input: a species list file, specieslist.csv, in the working directory (one species per line), and files of extracted environmental data for occurrence points (occurrences in rows, environmental data in columns, no headers). Data files should be named like pno1_Species_a.csv, with variable numbering starting at 1, and must be in the same directory as the script. The output is a csv containing the species labels and a numeric character coding from 0 to k - 1. The program tests each value of k from 2 up to the specified number.
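As a rough illustration of the naming convention above, the following sketch builds the file names expected for one species. The helper names read_species_list and expected_files are hypothetical and are not taken from pno_discretization.py; the sketch also assumes that the species label appears in the file name exactly as written in specieslist.csv.

```python
import csv
from pathlib import Path


def read_species_list(path):
    """Return the species labels, one per non-empty line of the list file."""
    with open(path, newline="") as handle:
        return [row[0].strip() for row in csv.reader(handle) if row and row[0].strip()]


def expected_files(species, n_variables, directory="."):
    """Build pno1_<species>.csv ... pnoN_<species>.csv (numbering starts at 1).

    Assumption: the label is used verbatim in the file name.
    """
    return [Path(directory) / f"pno{i}_{species}.csv" for i in range(1, n_variables + 1)]
```

For example, expected_files("Species_a", 2) yields paths ending in pno1_Species_a.csv and pno2_Species_a.csv.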

The distance matrices (raw and normalized) and k-means distortions are also saved for further analysis. One option is to plot the final averaged distance matrix in R with an ordination method such as MDS. Classification surprises often happen when an individual is intermediate between an expected group and some neighbor.

The Python libraries numpy, scipy, and pandas are required.

The script is called as follows:

./pno_discretization.py numberOfVariables numberOfCategories

where numberOfCategories is k, e.g.,

./pno_discretization.py 35 7

Approach

Euclidean distances are calculated between all pairs of points from two species (if the number of possible pairs exceeds 100, then 100 pairs are sampled randomly with replacement). The average of these distances is used to populate a species distance matrix. Finally, k-means clustering is applied to the matrix and a csv with the character coding (numerically from 0 to k - 1) is saved.
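The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in pno_discretization.py: the function names are hypothetical, each row of the distance matrix is assumed to serve as the feature vector for k-means, the diagonal is simply left at zero, and the normalization step is omitted.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq


def mean_pairwise_distance(a, b, max_pairs=100, rng=None):
    """Average Euclidean distance between points of two species.

    a and b are arrays of shape (n_points, n_variables). If more than
    max_pairs pairs are possible, max_pairs pairs are drawn with replacement.
    """
    rng = np.random.default_rng() if rng is None else rng
    if len(a) * len(b) > max_pairs:
        i = rng.integers(0, len(a), size=max_pairs)
        j = rng.integers(0, len(b), size=max_pairs)
        diffs = a[i] - b[j]
    else:
        # All pairs: broadcast to (len(a), len(b), n_variables), then flatten.
        diffs = (a[:, None, :] - b[None, :, :]).reshape(-1, a.shape[1])
    return float(np.linalg.norm(diffs, axis=1).mean())


def species_distance_matrix(points_by_species, rng=None):
    """Fill a symmetric species-by-species matrix of average distances."""
    names = list(points_by_species)
    dist = np.zeros((len(names), len(names)))
    for x in range(len(names)):
        for y in range(x + 1, len(names)):
            d = mean_pairwise_distance(
                points_by_species[names[x]], points_by_species[names[y]], rng=rng
            )
            dist[x, y] = dist[y, x] = d
    return names, dist


def classify(dist, k):
    """Cluster the rows of the distance matrix; codes are integers from 0."""
    centroids, distortion = kmeans(dist, k)
    codes, _ = vq(dist, centroids)
    return codes, distortion
```

Calling classify(dist, k) for each k from 2 to the requested number would give one coding and one distortion value per k. Note that scipy's kmeans can return fewer than k centroids if a cluster ends up empty.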

It is assumed that missing data are coded as -9999; points with missing data in any variable are discarded.
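A minimal sketch of this filtering step, assuming the points for one species are held in a numpy array with one row per occurrence and one column per variable (the name drop_missing is hypothetical):

```python
import numpy as np

MISSING = -9999  # missing-data code described above


def drop_missing(points, missing=MISSING):
    """Discard every row that contains the missing-data code in any column."""
    points = np.asarray(points, dtype=float)
    keep = ~(points == missing).any(axis=1)
    return points[keep]
```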

Explanation of Files

pno_discretization.py—Python script for habitat coding.
final_classification_k_7.csv—Result file for Saxifragales analysis.
final_classification_k_7_biogeobearsformat.csv—Result file for Saxifragales analysis, in a format ready to use for BioGeoBEARS. For the BioGeoBEARS run script, see [https://github.com/ryanafolk/biogeographic_coder] and change file paths appropriately to point to habitat classifications.
ultrametric_occur_matched_forcedultra.habitatclassificationmatched.tre—Tree used for BioGeoBEARS, with sampling matched to habitat classifications.

Possible errors

NameError: name 'process' is not defined -- check for files that exist but are empty with find . -size 0.

FileNotFoundError: [Errno 2] No such file or directory: -- check whether the listed species has files for only some of the variables. specieslist.csv should only contain entries that have data for all variables.
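Both problems can be caught before running the script with a small check such as the one below. It is a hedged sketch: check_inputs is a hypothetical helper, not part of pno_discretization.py, and it assumes the pno<number>_<species>.csv naming described in the Overview.

```python
from pathlib import Path


def check_inputs(species_labels, n_variables, directory="."):
    """Return (missing, empty) lists of expected data file names."""
    missing, empty = [], []
    for species in species_labels:
        for i in range(1, n_variables + 1):
            path = Path(directory) / f"pno{i}_{species}.csv"
            if not path.exists():
                missing.append(path.name)   # would raise FileNotFoundError
            elif path.stat().st_size == 0:
                empty.append(path.name)     # zero-byte file
    return missing, empty
```

Any species that appears in the missing list should either be given its remaining files or be removed from specieslist.csv.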
