SparkDLTrigger

This repository contains code, notebooks, and datasets accompanying published work on the implementation of an ML pipeline for a particle classifier.

Published work

"Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics", Comput Softw Big Sci 4, 8 (2020).
Related blog entries:
- Machine Learning Pipelines for High Energy Physics Using Apache Spark with BigDL and Analytics Zoo
- Distributed Deep Learning for Physics with TensorFlow and Kubernetes

Authors

Authors and contacts: Matteo.Migliorini@cern.ch, Riccardo.Castellotti@cern.ch, Luca.Canali@cern.ch
Original research article, raw data and neural network models by: T.Q. Nguyen et al., Comput Softw Big Sci (2019) 3: 12
Acknowledgements: Marco Zanetti, Thong Nguyen, Maurizio Pierini, Viktor Khristenko, CERN openlab, members of the Hadoop and Spark service at CERN, CMS Bigdata project, Intel team for BigDL and Analytics Zoo consultancy: Jiao (Jennie) Wang and Sajan Govindan.

Data

The datasets used for this work can be downloaded from this link.

TF-Spawner

TF-Spawner is a custom tool we developed for distributed training with TensorFlow using cloud resources (CPU and GPU nodes).

Notebooks and Python code

Notebooks with data preparation code using Apache Spark
Notebooks with machine learning training
- Distributed DL training with Apache Spark and BigDL/AnalyticsZoo
- Training DL models with TensorFlow (tf.keras):
  - tf.keras on CPU, multiple methods
  - tf.keras with GPU
  - Distributed tf.keras on CPU
  - Distributed tf.keras on CPU and GPU using Kubernetes
  - Saved models
- Other ML training: using Spark ML

Physics Use Case

Event data flows collected from the particle detector (CMS experiment) contain different types of event topologies of interest. A particle classifier built with neural networks can be used as an event filter, improving on the state of the art in accuracy.
This work reproduces the findings of the paper "Topology classification with deep learning to improve real-time event selection at the LHC" using tools from the Big Data ecosystem, notably Apache Spark and BigDL/Analytics Zoo.

Data Pipelines for Deep Learning

Data pipelines are essential to successful machine learning projects: they integrate the multiple components and APIs used for data processing across the entire data chain. A good data pipeline implementation can accelerate the work around the core machine learning tasks and make it more productive. The four steps of the pipeline we built are:

Data Ingestion: where we read data in ROOT format from the CERN-EOS storage system into a Spark DataFrame and save the results as a table stored in Apache Parquet files
Feature Engineering and Event Selection: where the Parquet files containing all the event details processed in Data Ingestion are filtered and datasets with new features are produced
Parameter Tuning: where the best set of hyperparameters for each model architecture is found by performing a grid search
Training: where the best models found in the previous step are trained on the entire dataset.
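
The Parameter Tuning step can be illustrated with a minimal, self-contained sketch of a grid search. This is not the code used in the notebooks: the grid `PARAM_GRID`, its parameter names and values, and the scoring function `validation_score` are hypothetical placeholders, and the toy score stands in for training a model on a data sample and evaluating it on held-out data.

```python
from itertools import product

# Hypothetical hyperparameter grid; names and values are illustrative only
# and are not taken from the notebooks in this repository.
PARAM_GRID = {
    "hidden_units": [20, 50, 100],
    "learning_rate": [0.001, 0.01],
    "dropout": [0.0, 0.5],
}


def validation_score(params):
    """Toy stand-in for 'train a model and return its validation score'.

    A real implementation would fit a model with these hyperparameters
    and evaluate it on held-out data. Here the score is a made-up
    function that peaks at one combination of the grid.
    """
    return (
        1.0
        - abs(params["hidden_units"] - 50) / 100
        - abs(params["learning_rate"] - 0.01)
        - 0.1 * params["dropout"]
    )


def grid_search(grid, score_fn):
    """Score every combination in the grid; return (best_params, best_score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = grid_search(PARAM_GRID, validation_score)
print(best_params, best_score)
```

The hyperparameters returned by such a search would then be used in the Training step to fit the model on the entire dataset. An exhaustive search grows multiplicatively with the number of values per parameter (12 combinations in this toy grid), which is one reason to run the evaluations in parallel.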

Results

The results of the DL model(s) training are satisfactory and match the results of the original research paper.
