Poor man's Run Control for DUNE DAQ applications
If you're already familiar with nanorc and wish to troubleshoot a problem, skip to the FAQ here
This tutorial will guide you through the one-host minidaq example.
This tutorial assumes you run on a linux host with /cvmfs mounted, such as lxplus at CERN.
If you want to run with Kubernetes support, the "Kubernetes support" section is later in this document
First, set up a working area according to the daq-buildtools instructions.
Information can be found in daqconf wiki, or in here, following is to generate a configuration and run nanorc only.
Get the example data file (TODO: asset manager and configuration - likely to change for v4.0.0):
curl -o frames.bin -O https://cernbox.cern.ch/index.php/s/0XzhExSIMQJUsp0/downloadGenerate a hardware map, with the name HardwareMap.txt:
# DRO_SourceID DetLink DetSlot DetCrate DetID DRO_Host DRO_Card DRO_SLR DRO_Link
100 0 4 6 3 localhost 0 0 0
101 1 4 6 3 localhost 0 0 1
Generate a configuration:
daqconf_multiru_gen fake_daqNext (if you want to), you can create a file called top_level.json which contains:
{
"apparatus_id": "fake_daq",
"minidaq": "fake_daq"
}More lines can be added later, each corresponding to a different config. This allows several sets of apps to be run in the same nanorc instance.
Now you're ready to run.
There are 3 nanorc commands:
nanorc: for normal development,nano04rc: for production environment at NP04,nanotimingrc: to run the timing global partition.
To see a list of options you can pass nanorc in order to control things such as the amount of information it prints and the timeouts for transitions, run nanorc -h. We'll skip those for now in the following demo:
nanorc top_level.json partition-name# or "nanorc fake_daq partition-name" if you didn't create the top_level.json
╭──────────────────────────────────────────────────────────────────────────╮
│ Shonky NanoRC │
│ This is an admittedly shonky nano RC to control DUNE-DAQ applications. │
│ Give it a command and it will do your biddings, │
│ but trust it and it will betray you! │
│ Use it with care! │
╰──────────────────────────────────────────────────────────────────────────╯
shonky rc>
To see the commands available use help.
shonky rc> help
Documented commands (type help <topic>):
========================================
boot exclude scrap stop
change_rate expert_command shutdown stop_run
conf include start stop_trigger_sources
disable_triggers ls start_run terminate
drain_dataflow message start_shell wait
enable_triggers pin_threads status
Undocumented commands:
======================
exit help quit
boot will start your applications. In the case of the example, a trigger application to supply triggers, a hardware signal interface (HSI) application, a readout application and a dataflow application which receives the triggers.
shonky rc> boot
# apps started ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:02
dataflow ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:02
hsi ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:02
ruemu0 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:01
trigger ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:02
Apps
┏━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ name ┃ host ┃ alive ┃ pings ┃ last cmd ┃ last succ. cmd ┃
┡━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ dataflow │ mu2edaq13 │ True │ True │ None │ None │
│ hsi │ mu2edaq13 │ True │ True │ None │ None │
│ ruemu0 │ mu2edaq13 │ True │ True │ None │ None │
│ trigger │ mu2edaq13 │ True │ True │ None │ None │
└──────────┴───────────┴───────┴───────┴──────────┴────────────────┘
You can then send the start_runcommand to get things going. start_run requires a run number as argument. It also optionally takes booleans to toggle data storage (--disable-data-storage and --enable-data-storage) and an integer to control trigger separation in ticks (--trigger-interval-ticks <num ticks>).
The Finite State Machine (FSM) is illustrated below. It shows all the transitions available for a normal DAQ application.
The commands produce quite verbose output so that you can see what was sent directly to the applications without digging in the logfiles.
Triggers will not be generated until after a enable_triggers command is issued, and then trigger records with 2 links each at a default of 1 Hz will be generated.
Use 'status' to see what's going on:
shonky rc> status
Apps
┏━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ name ┃ host ┃ alive ┃ pings ┃ last cmd ┃ last succ. cmd ┃
┡━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ dataflow │ mu2edaq13 │ True │ True │ None │ None │
│ hsi │ mu2edaq13 │ True │ True │ None │ None │
│ ruemu0 │ mu2edaq13 │ True │ True │ None │ None │
│ trigger │ mu2edaq13 │ True │ True │ None │ None │
└──────────┴───────────┴───────┴───────┴──────────┴────────────────┘
When you've seen enough use the stop_run or shutdown commands. In case you experience timeout problems booting applications or sending commands, consider changing the hosts values from localhost to the hostname of your machine. This has to do with SSH authentication.
Nanorc commands can be autocompleted with TAB, for example, TAB will autocomplete ter to terminate. Options like --disable-data-storage will be completed with TAB after typing start_run --d.
You can also control nanorc in "batch mode", e.g.:
run_number=999
nanorc fake_daq partition-name boot conf start_run --disable-data-storage $run_number wait 2 shutdown
Notice the ability to control the time via transitions from the command line via the wait argument.
If you want to execute command and be dropped in a shell, you can use start_shell:
run_number=999
nanorc fake_daq partition-name boot conf start_run --disable-data-storage $run_number start_shell
Logs are kept in the working directory at the time you started nanorc, named log_<application name>_<port>.txt, or, if you are running nano04rc in /log/ on the host in which the application is running;..
You can look at the header and the value of attributes in the hdf5 file using:
h5dump-shared -H -A swtest_run000103_0000_*.hdf5(your file will be named something else, of course).
To get the TriggerRecordHeaders and FragmentHeaders:
hdf5_dump.py -p both -f swtest_run000103_0000_*.hdf5It can be instructive to take a closer look at how we can tell nanorc to boot the DAQ's applications. Let's take a look at a relatively simple example file in the nanorc repo, examples/ruemu_conf/boot.json:
{
"apps": {
"dataflow0": {
"exec": "daq_application_ssh",
"host": "dataflow0",
"port": 3338
},
"dfo": {
"exec": "daq_application_ssh",
"host": "dfo",
"port": 3335
},
[...]
},
"env": {
"DUNEDAQ_ERS_DEBUG_LEVEL": "getenv_ifset",
"DUNEDAQ_ERS_ERROR": "erstrace,throttle,lstdout",
"DUNEDAQ_ERS_FATAL": "erstrace,lstdout",
"DUNEDAQ_ERS_INFO": "erstrace,throttle,lstdout",
"DUNEDAQ_ERS_VERBOSITY_LEVEL": "getenv:1",
"DUNEDAQ_ERS_WARNING": "erstrace,throttle,lstdout"
},
"exec": {
"daq_application_ssh": {
"args": [
"--name",
"{APP_NAME}",
"-c",
"{CMD_FAC}",
"-i",
"{INFO_SVC}",
"--configurationService",
"{CONF_LOC}"
],
"cmd": "daq_application",
"comment": "Application profile using PATH variables (lower start time)",
"env": {
"CET_PLUGIN_PATH": "getenv",
"CMD_FAC": "rest://localhost:{APP_PORT}",
"DETCHANNELMAPS_SHARE": "getenv",
"DUNEDAQ_SHARE_PATH": "getenv",
"INFO_SVC": "file://info_{APP_NAME}_{APP_PORT}.json",
"LD_LIBRARY_PATH": "getenv",
"PATH": "getenv",
"TIMING_SHARE": "getenv",
"TRACE_FILE": "getenv:/tmp/trace_buffer_{APP_HOST}_{DUNEDAQ_PARTITION}"
}
}
},
"external_connections": [],
"hosts": {
"dataflow0": "localhost",
"dfo": "localhost",
"dqm0-df": "localhost",
"dqm0-ru": "localhost",
"hsi": "localhost",
"ruemu0": "localhost",
"trigger": "localhost"
},
"response_listener": {
"port": 56789
}
}
...you'll notice a few features about it which are common to boot files. Looking at the highest-level keys:
appscontains the definition of what applications will run, and what sockets they'll be controlled onenvcontains a list of environment variables which can control the applicationshostsis the cheatsheet wherebyappsmaps the labels of hosts to their actual namesexecdefines the exact procedure by which an application will be launchedexternal_connectionsis a list of external connectionsresponse_listenershows on which port of localhost nanorc should expect the responses from the applications when commands are sent.
It should be pointed out that some substitutions are made when nanorc uses a file such as this to boot the processes. Specifically:
"getenv"is replaced with the actual value of the environment variable, throwing a Python exception if it is unset"getenv:<default value>"is replaced with the actual value of the environment variable if it is set, with<default value>used if it is unset"getenv_ifset"is replaced with the actual value of the environment variable if it is set, otherwise nanorc doesn't set this variable- If a host is provided as
localhostor127.0.0.1, the result of the Python callsocket.gethostnameis used in its place {APP_NAME}is replaced by the app key name by nanorc{CMD_FAC}and{INFO_SVC}are replaced by their corresponding environment value{CONF_LOC}is replaced by a path to a local directory containing the configuration data that the application must be able to see.
To access the WebUI, add the --web option when running nanorc. When nanorc starts up, it will display a box like this :
which shows what lxplus node to connect to (in dark blue in the picture above lxplusXXXX.cern.ch).
Before you can connect, a SOCKS proxy must be set up to that node in another terminal window, using ssh -N -D 8080 username@lxplusXXXX.cern.ch and substituting XXXX with whatever number is shown.
Once you have set up your browser to use a SOCKS proxy, connect to the address in the browser, and you should see something like this:
From here, using nanorc is just about the same as in the terminal:
- transitions between FSM states can be done using the State Control Buttons,
- the information that nanorc outputs can be viewed by clicking the expansion triangle under "Last response from nanorc" to see the details of the response.
Note that this information will also be shown as output to the terminal.
To use the TUI (Terminal User Interface), add the --tui option when running nanorc. No proxy is required in this case.
Again, using nanorc is similar to the terminal:
- Transitions between states are done with the buttons in the top right: press I on the keyboard to toggle whether optional inputs are taken.
- Logs are displayed in the bottom right, and can be searched by typing in the box above them.
This page describes the functionality which is supported within Kubernetes in v3.1.0.
Before you go off and try this, you at least need to have:
- A k8s cluster running. To check that this is the case, you can head to the dashboard (on the NP04 cluster, that's here, after you have setup a SOCKS proxy to lxplus).
- All the nodes being able to access
/cvmfs. - The configuration service running in the cluster. To check that, you can head to the configuration service URL (at NP04, it's here) and making sure that you see text "DAQLing Configuration Management Service v1.0.0".
- The dunedaq images on which you want to run available from all the nodes. The simplest way to do that is to have your image on dockerhub (at NP04, we have a repository here, but at the time this is being written, on the 2nd Aug 2022, it doesn't work).
All of this is available on the NP04 cluster.
The Kubernetes prototype process manager in nanorc knows about 2 types of DUNE DAQ applications: applications that require access to FELIX card, and other applications. Applications are booted in the order provided by boot.json. There is no affinity set, if the user wants to run an application on different node, he/she will have to specify during the configuration.
Applications are run in pods with a restart policy of never, so failing applications are not restarted.
Nanorc makes use of a set of microservices either outside the k8s cluster or inside, in addition to the other services running on the NP04 cluster.
Partitions are supported to the same level that they are in the ssh process manager version of nanorc, that is two or more instances of nanorc may be run concurrently.
The k8s version implements a first version of resource management: Felix cards and data storage (?) cannot be claimed by more than one partition. Readout apps start on the hosts specified in the configuration, all other apps have anti-affinity with readout apps, so start on other hosts.
The following nanorc microservices are supported and can be used inside the k8s cluster or outside.
| Service | Notes |
|---|---|
| Run Number | Provides the run number from Oracle DB |
| Run Registry | Archives configuration data (zip) and metadata to Oracle/Postgres |
| eLisa | Interface to electronic logbook at CERN |
| Configuration | Interface to MongoDB service to provide configuration |
| Felix Plugin | Registers detected Felix cards with k8s as a custom resource |
-
The output files are written on folders that are directly mounted in the host.
-
It is currently not possible to address more than one Felix card on the same host (not a problem at NP04)
Up to date documentation on the k8s test cluster at NP04 is here, and the node assignment is here
Current nodes list (on the 2nd Aug 2022):
| node | Purpose/notes |
|---|---|
| np04-srv-001 | Storage |
| np04-srv-002 | Storage |
| np04-srv-003 | Storage |
| np04-srv-004 | Storage |
| np04-srv-010 | Generic DAQ host |
| np04-srv-011 | Generic DAQ host |
| np04-srv-012 | Generic DAQ host |
| np04-srv-013 | Generic DAQ host |
| np04-srv-015 | Control plane |
| np04-srv-016 | Generic DAQ host |
| np04-srv-025 | Generic DAQ host |
| np04-srv-026 | Felix card host |
| np04-srv-028 | Felix card host |
| np04-srv-029 | Felix card host |
| np04-srv-030 | Felix card host |
Log on to the np04 cluster and follow the instructions in here. 2 important notes:
- You do not need to be on
np04-srv-015to use nanorc and K8s. But you will need to have the correctKUBECONFIGproperly set as described in the previous link. - You do not need to create your namespace. That is handled automatically by nanorc.
Using the instructions at this link, set up a work area or a release.
An important point is that if you want to run K8s with custom code inside the sourcecode directory you will need to create a docker image yourself (see instructions at the end of this README How to build a daq_application image and distribute it).
It is worth mentioning that the dbt-workarea-env command will set up spack, which is the DAQ build system. This makes some alterations to a low-level library in LD_LIBRARY_PATH, which can cause some utilities like ssh, nano and htop to not work (you would get a segfault when running them). To fix this, run LD_LIBRARY_PATH=/lib64 [PROGRAM_NAME]: this will manually reset the path to what it was before spack was set up. However, this should not be required in order to run any of the commands on this page.
Create a daqconf file as such (named config.json in the rest of the instructions):
{
"boot": {
"ers_impl": "cern",
"opmon_impl": "cern",
"use_k8s": true,
"image": "np04docker.cern.ch/ghcr/dune-daq/c8-spack:latest"
}
}Next, you need to generate the configuration:
daqconf_multiru_gen -c config.json daq-configAnd upload it on the MongoDB, the first argument is the directory you have just generated with daqconf_multiru_gen and the second is the key in the configuration on the MongoDB, it cannot have underscores in it since it's accessed via HTTP requests:
upload-conf daq-config ${USER}-configuration # name it something better!This will create an entry in the configuration service containing all the configuration data. In this example, it will be called ${USER}-configuration, so you probably want to name it better.
Note that the upload-conf utility will fails if you have proxy.
... after unsetting the proxy.
source ~np04daq/bin/web_proxy.sh -u
nanorc --pm k8s://np04-srv-015:31000 db://username-configuration partition-name
nanorc> [...]
nanorc> boot
nanorc> start_run 12345
nanorc> [...]Hop on http://np04-srv-015:31001/ (after setting up a web SOCKS proxy to lxplus if you are not physically at CERN) to check the status of the cluster. Note you will need to select the partition you fed at boot. You'll be able to see if the pods are running or not, and where.
First, do
$ kubectl get pods -n partition-name
NAME READY STATUS RESTARTS AGE
dataflow0 1/1 Running 0 66s
...
to get a list of pods corresponding to DAQ applications. This should be run from the node that hosts the control plane, which is np04-srv-015. (The control plane is a collection of top level components that manage the whole kubernetes cluster, including the API server and a data store that keeps track of current/desired state.)
Now do:
kubectl exec -n partition-name --stdin --tty dataflow0 -- /bin/bash
to log on the pod which runs the dataflow app. The app runs in /dunedaq/run (which is where the data file will be too in this particular example).
To get the log, open a new terminal window on np04-srv-015, on which k8s is already installed, and do:
$ kubectl get pods -n partition-nameNote: partition-name is given as a parameter to the nanorc command.
$ kubectl get pods -n partition-name
NAME READY STATUS RESTARTS AGE
dataflow0 1/1 Running 0 66s
...And you can use the pod name with the kubectl logs command:
$ kubectl logs dataflow0 -n partition-nameIf by any "chance" something terrible happened with your pod and it core dumped for example, the k8s may either try to restart it, or give up. You can still view the stdout/stderr logs by doing:
$ kubectl logs dataflow0 -n partition-name --previousGo to http://np04-srv-009:3000/ and select your partition on the left.
daq_applicationslive in pods (not deployments), with k8s pod restart policy of "Never".- mounts
cvmfsin the pod (dunedaqanddunedaq-development). - ... Many more to discover...
NOTE: You need to be on np04-srv-{015,016,026} for this to work, if you are not, setup the daqenv above from above in the usual way (the first 4 commented out instructions):
# cd dunedaq-k8s/swdir
# source dbt-env.sh
# dbt-workarea-env
# cd ..
cd pocket/images/daq_application/daq_area_cvmfs/
./build_docker_image username-image-nameYou should change username-image-name to something appropriate containing your username.
If you get an error message that starts with "Got permission denied while trying to connect to the docker daemon socket", then it is likely that your account is not in the docker group (this can be checked with groups). You can ask to be added in #np04-sysadmin, on the DUNE slack workspace.
After that, you should be able to see your image:
docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
...
username-image-name N22-06-27 3e53688480dc 9 hours ago 1.79GB
...Next, login to the local docker image repository (you'll only ever need to do that once):
docker login np04docker.cern.chUsername is np04daq and password is available upon request to Pierre Lasorak or Bonnie King.
You need to then push the image to the NP04 local images repo:
./harbor_push username-image-name:N22-06-27 # that nightly is obviously the nightly that used earlier to setup your daq areaAnd that's it!
If the instructions after docker login didn't work, you can always do it manually:
docker save --output username-image-name-N22-06-27.tar username-image-name:N22-06-27That command will create a tar file with the image. Next, you need to ssh on each of the node of the cluster, cd where the tar file is and do:
docker load --input username-image-name-N22-06-27.tarand check that docker images is correct (you can check that the size and the IMAGE ID are the same as the one on host you generated the image):
docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
username-image-name N22-06-27 3e53688480dc 9 hours ago 1.79GB
...Clone the np04daq-configs and checkout feature/k8s,
git clone ssh://git@gitlab.cern.ch:7999/dune-daq/online/np04daq-configs.git
cd np04daq-configs
git checkout feature/k8s
This branch uses a special image in the configuration (np04docker.cern.ch/dunedaq-local/k8s-v400-rc3:rc-v4.0.0-3-01) which has Eric’s modifications to IPM. Other modifications are:
- Removing the cern.ch postfix from the hostname
- Setting the parameter start_connectivity_service to false
- Adding a use_k8s = true and the name of the image as described above
Run the daq-config bash script:
./recreate_np04_daq_configurations.sh
Upload the new nanorc configuration of your choice
upload-conf [np04_daq_APA_conf] [flx-k8s-test]
Here are some examples of [np04_daq_APA_conf]'s that you can upload:
np02_coldbox_daq_100mHz_conf/np02_coldbox_daq_268ms_conf/np02_coldbox_daq_5Hz/np02_coldbox_daq_conf/np02_coldbox_daq_hma_conf/np02_coldbox_daq_tp_conf/np02_coldbox_flxcard_conf/np02_coldbox_flxcard_WIB12_conf/np02_coldbox_flxcard_WIB34_conf/np02_coldbox_flxcard_WIB56_conf/np02_coldbox_timing_tlu_conf/np02_coldbox_wib_conf/np02_coldbox_wib_pulser_conf/np02_coldbox_wib_WIB12_conf/np02_coldbox_wib_WIB34_conf/np02_coldbox_wib_WIB56_conf/np04_daq_APA1_conf/np04_daq_APA1_tp_conf/np04_daq_APA2_conf/np04_daq_APA2_tp_conf/np04_daq_APA3_conf/np04_daq_APA4_conf/np04_daq_conf/np04_daq_DAPHNE_conf/np04_daq_DAPHNE_test_conf/np04_daq_TPC_conf/np04_flxcard_APA1_conf/np04_flxcard_APA2_conf/np04_flxcard_APA3_conf/np04_flxcard_APA4_conf/np04_flxcard_conf/np04_flxcard_TPC_conf/
For this to work you need to be able to use kubectl which can be is done by,
export KUBECONFIG=/nfs/home/np04daq/np04-kubernetes/config
Remember to inform np04_shifterassistant on slack that you are taking one of the APA’s for a spin and make sure no one else is running anything on it first.
Now you can run the nanorc using the following command:
nanorc --pm k8s://np04-srv-015:31000 db://[flx-k8s-test] [yourname-flx-k8s-test-run]



