
CWoLa for Hadronic Top/Antitop Tagging (CHEETAH)

The CWoLa (Classification Without Labels) method is applied to discriminate between the hadronic decays of top quarks and antiquarks.

This framework was split off from the analysis framework and, as such, shares many of its attributes.

Follow these instructions to replicate the training procedure with different physics objects, definitions, working points, samples, etc.

Getting started

Cheetah is set up to generate flat ntuples from the C++ framework that can be passed into the python framework using uproot.
The two languages are used because of the tools available in their respective ecosystems: C++ (speed, ROOT libraries), Python (advanced ML tools).

Training Procedure

A CMSSW environment is used to generate samples for training, while a non-CMSSW environment is used to perform the actual training with Keras + TensorFlow / PyTorch.
As such, the following guidelines reflect the author's setup.
Ideally, this setup can be easily extended for any user's purpose.

Ntuple Production

Using the CMSSW and C++ environment, Cheetah prepares ntuples specifically for training. The input ntuples are the flat ntuples produced by the analysis.

Cheetah builds the Event for each entry in the ROOT file. Physics objects (AK8/AK4/leptons/MET) are defined as structs within the framework (interface/physicsObjects.h). The Event object is passed to other classes (histogrammer, eventSelection, etc.) that need information from the event.

DESIGN PHILOSOPHY: Classes that use information from the event to generate new information, e.g., kinematic reconstruction, should be called from the Event class. To achieve this, pass the external class structs of the necessary information and return the new object back to the Event class. Thus, users can access all 'event-level' information from the Event class and do not need to instantiate extra tools in the running macros.
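A minimal sketch of this struct-passing pattern is shown below; the struct and class names are illustrative stand-ins, not the actual definitions from interface/physicsObjects.h or the reconstruction classes:

// Illustrative sketch only: stand-ins for the structs in interface/physicsObjects.h
#include <vector>

struct Jet {                    // stand-in for an AK4/AK8 physics-object struct
    float pt;
    float eta;
    float phi;
};

struct Ttbar {                  // stand-in for the result of kinematic reconstruction
    Jet ak8;
    std::vector<Jet> ak4;
    bool passesQuality;
};

class TtbarRecoTool {           // stand-in for an external class like ttbarReco
public:
    // Receives only the structs it needs and returns a new object
    Ttbar reconstruct(const Jet& ak8, const std::vector<Jet>& ak4) const {
        Ttbar result;
        result.ak8 = ak8;
        result.ak4 = ak4;
        result.passesQuality = (ak8.pt > 350.f);   // placeholder quality criterion
        return result;
    }
};

class Event {                   // stand-in for the Cheetah Event class
public:
    void execute() {
        // The Event passes structs to the external tool and stores the result,
        // so running macros only ever need to talk to the Event
        m_ttbar = m_reco.reconstruct(m_ak8, m_ak4);
    }
    const Ttbar& ttbar() const { return m_ttbar; }
private:
    Jet m_ak8{};
    std::vector<Jet> m_ak4;
    TtbarRecoTool m_reco;
    Ttbar m_ttbar{};
};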

Running macros, which perform the event loop, are stored in the bin/ directory.
These macros outline the basic setup of the configuration, file loop, TTree loop (if necessary), and event loop.
The event selection and information for output files are also declared in this script.

Before running, confirm the options in config/training.txt (or your custom configuration file) are appropriate!
To execute the framework:

$ source setup.csh
$ run_training config/training.txt

Analysis Flow

  1. The steering macro (see bin/) first initializes and sets the configurations
    • Declare settings/objects that are 'global' to all files being processed
  2. File loop
    • Prepare output that is file-specific
      • Initialize output file, cutflow histograms, efficiencies, etc.
  3. TTree loop (for input files that have physics information in multiple TTrees)
    • Declare objects that are 'global' to all events in the tree:
      • Event object
      • output TTree
      • histograms & efficiencies
  4. Event Loop
    • Build the Event object (jets, leptons, extras, e.g., kinematic reconstruction)
    • Apply selection(s), if desired
    • Save information to the output TTree & histograms
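The skeleton below illustrates this flow. The ROOT calls are standard, but the Cheetah-specific pieces (configuration, Event, eventSelection, miniTree) are only sketched in comments, and the file and tree names are placeholders rather than those used in the real macros:

// Skeletal steering macro: ROOT calls are standard, Cheetah classes are sketched in comments
#include "TFile.h"
#include "TTree.h"
#include <memory>
#include <string>
#include <vector>

int main() {
    // 1. Initialization: settings/objects 'global' to all files (normally read from config/training.txt)
    std::vector<std::string> filenames = {"input0.root"};          // placeholder list of input files

    for (const auto& fname : filenames) {                          // 2. File loop
        std::unique_ptr<TFile> file(TFile::Open(fname.c_str()));
        if (!file || file->IsZombie()) continue;
        // ... initialize the output file, cutflow histograms, efficiencies, etc.

        TTree* tree = dynamic_cast<TTree*>(file->Get("events"));   // placeholder tree name
        if (!tree) continue;

        // 3. Objects 'global' to all events in this tree, e.g.
        //    Event event(tree, config);  eventSelection selection(config);  miniTree output(...);

        const Long64_t nEntries = tree->GetEntries();
        for (Long64_t i = 0; i < nEntries; ++i) {                  // 4. Event loop
            tree->GetEntry(i);
            // event.execute(i);                     // build jets, leptons, kinematic reconstruction
            // if (!selection.applySelection(event)) continue;   // apply the selection(s)
            // output.saveEvent(event);              // save information to the TTree & histograms
        }
    }
    return 0;
}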

Different classes are used to achieve this workflow, and each one can be modified or extended by the user (inherit from these classes to build your own!).

Standard classes:

  • configuration: contains all of the configuration information for the job; provides functions that return these basic settings
  • Event: contains all of the information from the event; loads information from the TTree and re-organizes it into structs & functions, calculates weights, etc.
  • eventSelection: applies a custom event selection (defined by the user)
  • histogrammer: generates histograms (interface between TH1/TH2 and Cheetah)
  • miniTree: used exclusively for generating flat ntuples for machine-learning contexts
  • tools: collection of functions for simple tasks common to different aspects of Cheetah
  • truthMatching: determines the matching between truth-level and reconstructed objects
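As noted above, these classes can be inherited from to build your own. The sketch below shows the general pattern for a user-defined selection; the base-class interface here is a stand-in, and the real virtual hooks and constructor should be taken from interface/eventSelection.h:

// Pattern sketch for a user-defined selection; the base-class interface is assumed here,
// the real one is defined in interface/eventSelection.h
struct Event {                              // stand-in for the Cheetah Event class
    float leadingAK8pt = 0.f;
};

class eventSelection {                      // stand-in base class with a virtual hook
public:
    virtual ~eventSelection() = default;
    virtual bool applySelection(const Event&) { return true; }
};

class trainingSelection : public eventSelection {
public:
    // Keep only events with a hard leading AK8 jet (placeholder cut)
    bool applySelection(const Event& event) override {
        return event.leadingAK8pt > 400.f;
    }
};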

Machine learning:

  • deepLearning: handles training and inference for machine-learning tasks; the training aspect only prepares inputs, as it is assumed training is done in a python environment

Kinematic reconstruction:

  • ttbarReco: general reconstruction of the AK8+AK4 system, plus quality criteria

If you add directories to the framework, ensure they will be compiled by checking BuildFile.xml and bin/BuildFile.xml.
If there are issues, it may be necessary to clean the directory and re-compile everything: scram b clean

It is also possible to submit batch jobs using the script python/submitBatchJobs.py with the text file batch.txt. For more information, please see the wiki page for batch jobs.

Training

The actual training of the NN is performed in a python environment outside of CMSSW using the packages Asimov + HEP Plotter.
The uproot package loads information from the ROOT file prepared in the previous step into a pandas DataFrame that is then easily used in the framework.

  • python/runAsimov.py: steering script that determines which options are set and the order in which functions are called
  • python/plotlabels.py: labels (colors and binning) for samples and variables

The relevant Asimov classes are called to perform all the training and plot making (Asimov works as an interface between HEP data and Keras).

Inference Procedure

LWTNN: deep learning in C++ (models generated with the python tools are used in C++)

  • deepLearning.cxx

A std::map<std::string,double> is created where the keys represent the different variables used in the training. For each AK8 jet, the map is filled with new values and the lwtnn tool predicts the DNN score.
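A minimal sketch of this call pattern with lwtnn is shown below; the JSON filename and the input-variable names are placeholders, and the exact setup lives in deepLearning.cxx:

// Minimal lwtnn usage following the pattern above; filenames and variable names are placeholders
#include "lwtnn/LightweightNeuralNetwork.hh"
#include "lwtnn/parse_json.hh"
#include <fstream>
#include <iostream>
#include <map>
#include <string>

int main() {
    // Load the network exported from the python training
    std::ifstream jsonFile("nn_model.json");                    // placeholder filename
    lwt::JSONConfig config = lwt::parse_json(jsonFile);
    lwt::LightweightNeuralNetwork nn(config.inputs, config.layers, config.outputs);

    // One map per AK8 jet: the keys must match the variable names used in training
    std::map<std::string, double> inputs;
    inputs["ak8_pt"]    = 450.0;                                // placeholder variables
    inputs["ak8_mass"]  = 172.0;
    inputs["ak8_tau32"] = 0.45;

    std::map<std::string, double> outputs = nn.compute(inputs);
    std::cout << "DNN score = " << outputs.begin()->second << std::endl;
    return 0;
}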

Questions or Comments

For more information, please submit an issue or PR.