This repository contains an exercise to evaluate students interested in applying for the Distributed Big Data Analysis with TDataFrame project, included in the Google Summer of Code (GSoC) program and offered by the EP-SFT group and the IT department at CERN. The detailed description of the project can be found here.
The exercise is divided into two tasks. The first one requires very little programming, but the student will need to combine a set of technologies that are important for the project, namely Python, Spark, ROOT (with TDataFrame) and Jupyter notebooks. The second task does involve programming and aims to test the student's skills with JavaScript.
Please follow the guidelines below to go through the exercise and work at a pace that suits you. We recommend that you start with the first task and move on to the second one only if you successfully complete it. Do not hesitate to ask us, the mentors, any questions you might have.
Once you complete any of the tasks of this exercise, please e-mail the requested deliverables and the answers to the proposed questions to: etejedor@cern.ch, diogo.castro@cern.ch, danilo.piparo@cern.ch, prasanth.kothuri@cern.ch
The objective of this task is to execute a Python Jupyter notebook whose code corresponds to a data analysis that uses ROOT's TDataFrame. The analysis is distributed with Spark thanks to the DistROOT module.
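To give an idea of what such an analysis looks like, below is a minimal sketch of the kind of driver code the notebook contains. It assumes DistROOT exposes a `DistTree` class whose `ProcessAndMerge` method ships a fill function to the Spark executors and reduces the partial results on the driver; the file, tree and column names here are purely illustrative.

```python
import ROOT
from DistROOT import DistTree  # module distributed with this repository

def fill(tdf):
    # Runs on the Spark executors: book results on this partition's TDataFrame
    h = tdf.Filter("px > 0").Histo1D("px")
    return [h.GetValue()]

def merge(histos1, histos2):
    # Pairwise reduction of the partial results produced by the executors
    for h1, h2 in zip(histos1, histos2):
        h1.Add(h2)
    return histos1

# Illustrative names: a DistTree over one file, split into 4 partitions
dTree = DistTree(filelist=["data.root"], treename="events", npartitions=4)
histos = dTree.ProcessAndMerge(fill, merge)
```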
In preparation for this task you will need to:
- Install ROOT v6.12/06, which you can download from here.
- Install Spark on your machine and run a simple test locally (see the sketch after this list).
- Install Jupyter on your machine and launch a local notebook server.
- Download the test notebook and the DistROOT module from this repository.
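As a sanity check for the Spark installation mentioned above, a minimal local test (no cluster required) could look like the following; the `local[*]` master runs the job in-process on all available cores:

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("smoke-test").setMaster("local[*]")
sc = SparkContext(conf=conf)

# Count the even numbers in 0..99; a working setup prints 50
evens = sc.parallelize(range(100)).filter(lambda x: x % 2 == 0).count()
print(evens)

sc.stop()
```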
Once you have set up your environment with all the software listed above, you should be able to run the test notebook. The deliverable of this task is precisely that notebook after being executed and saved, that is, the notebook containing the code cells plus the generated output cells.
In light of the results you obtained, please answer the following questions:
- What is the number of rows (or entries) of the initial dataset?
- How many entries are left after applying the filters?
- Could you add another filter to cell #3 on any of the columns of the dataset? How would you specify it? (The snippet after these questions shows the general filter syntax.) What is the resulting number of entries after adding that filter?
- How many partitions of the dataset are created in the distributed execution with Spark? What are the entry ranges for each of these partitions?
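For reference, counting entries and adding filters with TDataFrame follows the general pattern below (a sketch with illustrative tree, file and column names; in ROOT v6.12 the class lives in the ROOT::Experimental namespace):

```python
import ROOT

tdf = ROOT.ROOT.Experimental.TDataFrame("events", "data.root")

total = tdf.Count()                    # lazy action: entries in the dataset
passed = tdf.Filter("px > 0").Count()  # entries surviving an extra filter

print(total.GetValue(), passed.GetValue())
```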
The objective of this task is to write a Jupyter Nbextension that replaces all the "SWAN" strings located inside markdown cells with the SWAN logo. The image should only be displayed if the extension is active; therefore, only the rendered output of a markdown cell should be replaced, and not the content of the cell itself.
In preparation for this task, you will need to:
- Install Jupyter on your machine and launch a local notebook server.
- Have a basic understanding of notebooks, e.g. the types of cells (markdown, code).
- Understand how to install and run an Nbextension.
- Have a look at examples of notebook extensions here.
The deliverable of this task is a packaged Jupyter Nbextension ready to be installed, which provides the functionality described above.
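A minimal sketch of the core logic, assuming the classic notebook frontend (the logo path is a placeholder): on load, the extension walks over the notebook's cells and rewrites the rendered HTML of markdown cells, leaving the cell source untouched.

```javascript
// main.js
define(['base/js/namespace'], function (Jupyter) {
    'use strict';

    // Replace "SWAN" in the *rendered* output of a markdown cell only;
    // the cell's source text is left unmodified.
    function replaceInCell(cell) {
        if (cell.cell_type !== 'markdown') return;
        var rendered = cell.element.find('.rendered_html');
        rendered.html(rendered.html().replace(
            /SWAN/g,
            '<img src="swan_logo.png" alt="SWAN" style="height:1em;"/>'
        ));
    }

    function load_ipython_extension() {
        Jupyter.notebook.get_cells().forEach(replaceInCell);
    }

    return { load_ipython_extension: load_ipython_extension };
});
```

Once packaged, such an extension is typically installed with `jupyter nbextension install` and activated with `jupyter nbextension enable`.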
The "SWAN" word replacement described in the first part of this task should also happen when a cell is modified, and not only when the notebook is loaded for the first time.
In that sense, we propose here an additional feature for your extension: listen to the markdown cell render event of Jupyter and perform the "SWAN" word replacement when that event is triggered on a particular cell. This will require that you understand how to monitor changes within cells and how to manipulate their contents.
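Continuing the sketch above, the event-based variant could look as follows: the classic notebook fires a `rendered.MarkdownCell` event with the affected cell in its payload, so the same replacement function can be reused.

```javascript
define(['base/js/namespace', 'base/js/events'], function (Jupyter, events) {
    'use strict';

    // replaceInCell: as defined in the previous sketch

    function load_ipython_extension() {
        // Cells already rendered when the notebook loads ...
        Jupyter.notebook.get_cells().forEach(replaceInCell);
        // ... and every markdown cell that gets (re)rendered afterwards
        events.on('rendered.MarkdownCell', function (event, data) {
            replaceInCell(data.cell);
        });
    }

    return { load_ipython_extension: load_ipython_extension };
});
```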
The deliverable of this part is the same packaged Jupyter Nbextension that was requested before, with the additional functionality described above.