mpecchi/gcms_data_analysis

This package provides a structured approach to managing and analyzing GCMS (Gas Chromatography-Mass Spectrometry) qualitative tables. It is designed to automate several key processes in the analysis of GCMS data, focusing on both derivatized and non-derivatized samples. Key features include:

  • Integration of multiple semi-quantitative GCMS data tables into a unified analysis framework.
  • Automatic management of "files" (intended as a single analysis) and "samples" (intended as the average and standard deviation of multiple "files").
  • Construction of a compound database for all identified compounds with detailed properties sourced from PubChemPy.
  • Decomposition of compounds into functional groups via a validated fragmentation algorithm.
  • Implementation of calibration and semi-calibration techniques, utilizing Tanimoto similarity and molecular weight comparisons.
  • Production of detailed reports for individual files and samples, alongside comprehensive and aggregated reports highlighting functional group mass fractions.
  • Advanced plotting functionalities for visual data analysis and reporting.

Naming Convention for Samples

Proper naming of sample files is crucial for automated processing and analysis of the data. To accurately identify and handle replicates of the same sample, filenames must follow a specific convention. Adherence to this convention enables the software to systematically organize, compare, and analyze data across different samples and their replicates.

Required Format

The filenames should be structured as follows: name-of-sample-with-dashes-only_replicatenumber.

  • Use dashes (-) exclusively to separate words within the sample name.
  • Underscore (_) is reserved for separating the sample name from the replicate number.
  • Replicate number should be a numeral, indicating the sequence or replicate of the sample.

Correct Examples

The following examples adhere to the naming convention and are processed correctly by the software:

  • Bio-oil-foodwaste-250C_1 indicates the first replicate of a sample named "Bio-oil-foodwaste-250C".
  • FW_2 signifies the second replicate of a sample named "FW".

Non-Acceptable Examples

Avoid using underscores (_) within the sample name or omitting the underscore before the replicate number. These examples demonstrate incorrect naming that will not be processed correctly:

  • bio_oil_1 uses underscores to separate words in the sample name, which is not acceptable.
  • FW1 omits the underscore between the sample name and the replicate number, leading to potential misinterpretation.
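
The naming rule can be checked programmatically before loading data. The sketch below is not part of the package; parse_sample_filename is a hypothetical helper whose regular expression simply encodes the rule stated above (dashes inside the sample name, a single underscore, then a numeric replicate number):

import re

# Sample name uses dashes only; a single underscore separates it from the replicate number
FILENAME_PATTERN = re.compile(r'^(?P<sample>[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*)_(?P<replicate>\d+)$')

def parse_sample_filename(stem: str) -> tuple[str, int]:
    """Hypothetical helper: split a filename stem into (sample_name, replicate_number)."""
    match = FILENAME_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"'{stem}' does not follow the sample-name-with-dashes_replicatenumber convention")
    return match.group('sample'), int(match.group('replicate'))

parse_sample_filename('Bio-oil-foodwaste-250C_1')  # -> ('Bio-oil-foodwaste-250C', 1)
parse_sample_filename('FW_2')                      # -> ('FW', 2)
# parse_sample_filename('bio_oil_1')               # would raise ValueError (underscores inside the sample name)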

Configuration Parameters

The Project class offers several configuration parameters to tailor the data analysis process to your specific needs. These parameters affect all instances of the class and can be set at the beginning of your project. Here's a breakdown of each parameter and how to use it:

auto_save_to_excel

  • Description: Controls whether analysis results are automatically saved in an Excel file.
  • Type: Boolean
  • Default: True
  • Usage: Project.auto_save_to_excel = False to disable auto-saving to Excel.

plot_font

  • Description: Specifies the font style used in graphical plots generated by the project.
  • Type: String
  • Default: 'DejaVu Sans'
  • Usage: Project.set_plot_font('Arial') to use Arial font in plots.

plot_grid

  • Description: Determines whether grid lines are displayed in plots.
  • Type: Boolean
  • Default: False
  • Usage: Project.set_plot_grid(True) to enable grid lines in plots.

load_delimiter

  • Description: The delimiter character used in the data files to be loaded.
  • Type: String
  • Default: '\t' (tab character)
  • Usage: Project.load_delimiter = ',' to use comma as the delimiter.

load_skiprows

  • Description: Specifies how many initial rows should be skipped during the data loading process.
  • Type: Integer
  • Default: 8
  • Usage: Project.load_skiprows = 5 to skip the first 5 rows of the data files.

columns_to_keep_in_files

  • Description: A list of columns to retain from data files during processing.
  • Type: List of strings
  • Default: ['Ret.Time', 'Height', 'Area', 'Name']
  • Usage: Project.columns_to_keep_in_files = ['Column1', 'Column2'] to specify which columns to keep.

columns_to_rename_in_files

  • Description: A dictionary mapping original column names to their new names during data processing.
  • Type: Dictionary
  • Default: {'Ret.Time': 'retention_time', 'Height': 'height', 'Area': 'area', 'Name': 'comp_name'}
  • Usage: Project.columns_to_rename_in_files = {'OldName': 'NewName'} to rename columns.

compounds_to_rename

  • Description: A dictionary mapping original compound names to their new names as specified by the user.
  • Type: Dictionary
  • Default: {}
  • Usage: Project.compounds_to_rename = {'OldCompound': 'NewCompound'} to rename compounds.

tanimoto_similarity_threshold

  • Description: Sets the threshold for Tanimoto similarity to determine compound matches. Essential for distinguishing compounds in complex mixtures.
  • Type: Float
  • Default: 0.4
  • Usage: Project.set_tanimoto_similarity_threshold(0.5) to adjust the similarity threshold.

delta_mol_weight_threshold

  • Description: Defines the maximum allowed molecular weight difference for considering two compounds as matches. Helps in fine-tuning compound identification accuracy.
  • Type: Integer
  • Default: 100
  • Usage: Project.set_delta_mol_weight_threshold(50) to tighten or loosen the molecular weight matching criterion.
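
To make the two thresholds above concrete, the sketch below computes a Tanimoto similarity between Morgan fingerprints and a molecular weight difference using RDKit. RDKit, the SMILES inputs, and the helper name are illustrative assumptions and are not necessarily how the package implements its matching internally:

from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, Descriptors

def is_semi_calibration_match(smiles_a: str, smiles_b: str,
                              tanimoto_threshold: float = 0.4,
                              delta_mw_threshold: float = 100.0) -> bool:
    """Illustrative check: two compounds match if their fingerprint similarity is at or above
    the Tanimoto threshold and their molecular weight difference is within the allowed delta."""
    mol_a, mol_b = Chem.MolFromSmiles(smiles_a), Chem.MolFromSmiles(smiles_b)
    fp_a = AllChem.GetMorganFingerprintAsBitVect(mol_a, 2, nBits=2048)
    fp_b = AllChem.GetMorganFingerprintAsBitVect(mol_b, 2, nBits=2048)
    similarity = DataStructs.TanimotoSimilarity(fp_a, fp_b)
    delta_mw = abs(Descriptors.MolWt(mol_a) - Descriptors.MolWt(mol_b))
    return similarity >= tanimoto_threshold and delta_mw <= delta_mw_threshold

# Example call with the default thresholds: phenol vs. o-cresol
is_semi_calibration_match('c1ccccc1O', 'Cc1ccccc1O')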

column_to_sort_values_in_samples

  • Description: Determines which column is used for sorting compounds within each sample, affecting data organization and subsequent analysis.
  • Type: String
  • Default: 'retention_time'
  • Usage: Project.set_column_to_sort_values_in_samples('area') to change the sorting criterion.
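
For convenience, the class-level settings documented above can be grouped at the top of a script. The sketch below simply restates the Usage lines from this section with illustrative values (mostly the defaults listed above); adjust them to your own data:

from gcms_data_analysis import Project

# Illustrative values only; every attribute and setter used here is documented above
Project.auto_save_to_excel = True                      # keep automatic export to Excel
Project.load_delimiter = '\t'                          # data files are tab-separated
Project.load_skiprows = 8                              # skip the instrument header rows
Project.columns_to_keep_in_files = ['Ret.Time', 'Height', 'Area', 'Name']
Project.columns_to_rename_in_files = {'Ret.Time': 'retention_time', 'Height': 'height',
                                      'Area': 'area', 'Name': 'comp_name'}
Project.compounds_to_rename = {}                       # no manual compound renaming
Project.set_plot_font('Arial')                         # font for all plots
Project.set_plot_grid(True)                            # show grid lines in plots
Project.set_tanimoto_similarity_threshold(0.4)         # similarity required for a calibration match
Project.set_delta_mol_weight_threshold(100)            # maximum molecular weight difference for a match
Project.set_column_to_sort_values_in_samples('retention_time')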

Example

A comprehensive example is provided in the GitHub repository to show how inputs should be formatted. To test the module, install the gcms_data_analysis package, download the example folder from the repository, and run example_gcms_data_analysis.py, setting folder_path to the location of your data folder.

The example code is shown here for convenience:

# Import necessary libraries
import pathlib as plib  # Used for handling file and directory paths
from gcms_data_analysis import Project  # Import the Project class from the gcms_data_analysis package

# Define the folder path where your data is located. Change this path to where you've stored your data files.
folder_path = plib.Path(plib.Path(__file__).parent, 'data')

# Set global configurations for the Project class.
# These configurations affect all instances of the class.
Project.set_folder_path(folder_path)  # Set the base folder path for the project's data files
Project.set_plot_grid(False)  # Disable grid lines in plots for a cleaner look
Project.set_plot_font('Sans')  # Set the font style for plots to 'Sans'

# Initialize a Project instance to manage and analyze GCMS data
gcms = Project()

# Load metadata from a user-provided 'files_info.xlsx' file, or generate it from .txt GC-MS files if not provided
files_info0 = gcms.load_files_info()

# Load individual GCMS .txt files as pandas DataFrames
files = gcms.load_all_files()

# Load classification codes and mass fractions for functional groups from a provided file
class_code_frac = gcms.load_class_code_frac()

# Load calibration data for standard and derivatized samples, and determine if they are derivatized
calibrations, is_calibr_deriv = gcms.load_calibrations()
c1, c2 = calibrations['calibration'], calibrations['deriv_calibration']

# Generate a comprehensive list of all compounds found across samples
list_of_all_compounds = gcms.create_list_of_all_compounds()

# Similarly, create a list of all derivatized compounds found across samples
list_of_all_deriv_compounds = gcms.create_list_of_all_deriv_compounds()

# Load properties for standard and derivatized compounds from provided files
compounds_properties = gcms.load_compounds_properties()
deriv_compounds_properties = gcms.load_deriv_compounds_properties()

# Flag indicating whether new compounds have been added, triggering a need to regenerate properties data
new_files_with_new_compounds_added = False
if new_files_with_new_compounds_added:
    compounds_properties = gcms.create_compounds_properties()
    deriv_compounds_properties = gcms.create_deriv_compounds_properties()

# Apply calibration data to all loaded files, adjusting compound concentrations based on calibration curves
files, is_files_deriv = gcms.apply_calibration_to_files()

# Extract specific files for detailed analysis or further operations
f11, f22, f33 = files['A_1'], files['Ader_1'], files['B_1']

# Add statistical information to the files_info DataFrame, such as mean, median, and standard deviation for each file
files_info = gcms.add_stats_to_files_info()

# Create a samples_info DataFrame without applying calibration data, for initial analysis
samples_info_0 = gcms.create_samples_info()

# Create samples and their standard deviations from the files, storing the results in dictionaries
samples, samples_std = gcms.create_samples_from_files()
s1, s2, s3 = samples['A'], samples['Ader'], samples['B']
sd1, sd2, sd3 = samples_std['A'], samples_std['Ader'], samples_std['B']

# Add statistical information to the samples_info DataFrame, enhancing the initial analysis with statistical data
samples_info = gcms.add_stats_to_samples_info()

# Generate reports for specific parameters (e.g., concentration, mass fraction) for files and samples
rep_files_conc = gcms.create_files_param_report(param='conc_vial_mg_L')
rep_files_fr = gcms.create_files_param_report(param='fraction_of_sample_fr')
rep_samples_conc, rep_samples_conc_std = gcms.create_samples_param_report(param='conc_vial_mg_L')
rep_samples_fr, rep_samples_fr_std = gcms.create_samples_param_report(param='fraction_of_sample_fr')

# Generate aggregated reports based on functional groups for files and samples, for specific parameters
agg_files_conc = gcms.create_files_param_aggrrep(param='conc_vial_mg_L')
agg_files_fr = gcms.create_files_param_aggrrep(param='fraction_of_sample_fr')
agg_samples_conc, agg_samples_conc_std = gcms.create_samples_param_aggrrep(param='conc_vial_mg_L')
agg_samples_fr, agg_samples_fr_std = gcms.create_samples_param_aggrrep(param='fraction_of_sample_fr')

# Plot results based on the generated reports, allowing for visual comparison of average values and standard deviations
# Plot compound-level results for individual files

gcms.plot_ave_std(param='fraction_of_sample_fr', min_y_thresh=0, files_or_samples='files',
                  legend_location='outside',
                  only_samples_to_plot=['A_1', 'A_2', 'Ader_1', 'Ader_2'],  # y_lim=[0, 5000]
                  )
# Plot results for files based on the aggregated (functional group) report
gcms.plot_ave_std(param='fraction_of_sample_fr', aggr=True, files_or_samples='files',
                  min_y_thresh=0.01,
                  y_lim=[0, .5], color_palette='Set2')

# Plot compound-level results for samples
gcms.plot_ave_std(param='fraction_of_sample_fr', min_y_thresh=0,
                  legend_location='outside', only_samples_to_plot=['A', 'Ader'],  # y_lim=[0, 5000]
                  )
# Plot results for samples based on the aggregated (functional group) report
gcms.plot_ave_std(param='fraction_of_sample_fr', aggr=True, min_y_thresh=0.01,
                  y_lim=[0, .5], color_palette='Set2')