[UMBRELLA] Dropped events

## Why this issue?

When Falco is running, a producer (a.k.a. the [driver](https://falco.org/docs/event-sources/drivers/)) continuously forwards events to a consumer (the Falco userspace program) through a buffer that sits in the middle. When, for any reason, the consumer is not able to keep up with the incoming events, an event drop occurs.

Starting from v0.15.0, Falco introduced a mechanism to detect dropped events and take actions, as explained in the [official documentation](https://falco.org/docs/event-sources/dropped-events/). However, event drops are still an issue, as reported by many users.

Since the problem depends on many factors and can be hard to analyze and understand, this issue aims to give users an overview of it and to collect the knowledge acquired so far.

Please note that this document does not try to solve the problem directly, and that some assumptions might be wrong (feel free to refute them!).

*N.B.*
*At the time of writing, the outcome of this issue was not clear, so I just chose to label it as "documentation".*

## Event drop alerts

Currently, when dropped events are detected, [Falco will print out some statistics](https://github.com/falcosecurity/falco/blob/e46641d24d74fbe7921039834fdf5b01c4715009/userspace/falco/event_drops.cpp#L112) that can give users some information about the kind of drops that happened.

An example of an alert regarding dropped events:
```
22:01:56.865201866: Debug Falco internal: syscall event drop. 10334471 system calls dropped in last second. (ebpf_enabled=0 n_drops=10334471 n_drops_buffer_clone_fork_enter=0 n_drops_buffer_clone_fork_exit=0 n_drops_buffer_connect_enter=0 n_drops_buffer_connect_exit=0 n_drops_buffer_dir_file_enter=0 n_drops_buffer_dir_file_exit=0 n_drops_buffer_execve_enter=0 n_drops_buffer_execve_exit=0 n_drops_buffer_open_enter=0 n_drops_buffer_open_exit=0 n_drops_buffer_other_interest_enter=0 n_drops_buffer_other_interest_exit=0 n_drops_buffer_total=10334471 n_drops_bug=0 n_drops_page_faults=0 n_drops_scratch_map=0 n_evts=21356397)
```
Note that the statistics reported above are relative to the timeframe specified (the last second) and not cumulative. Furthermore, note that an event represents just a single syscall received by the driver, regardless of whether a rule is triggered or not.

So, what do those values mean?

- `ebpf_enabled` indicates whether the driver is the eBPF probe (`=1`) or the kernel module (`=0`)
- `n_drops` is the sum of the other `n_drops_*` fields (see the section below) and represents the total number of dropped events
- `n_evts` is the number of events that the driver should send according to its configuration. It also includes `n_drops`, since `n_drops` is the number of events that the driver should have sent to userspace but could not, for various reasons.
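
To make the relation between `n_drops` and `n_evts` concrete, here is a minimal sketch that parses a drop alert line and computes the share of events dropped in that second. It assumes the `key=value` counters always appear in the trailing parentheses, as in the example above; `parse_drop_alert` and `drop_ratio` are hypothetical helper names, not part of Falco, and the sample line is a shortened copy of the alert above.

```python
import re


def parse_drop_alert(line: str) -> dict:
    """Extract the key=value counters from the trailing (...) block of a drop alert."""
    match = re.search(r"\(([^()]*)\)\s*$", line)
    if match is None:
        raise ValueError("no counter block found in alert line")
    counters = {}
    for pair in match.group(1).split():
        key, _, value = pair.partition("=")
        counters[key] = int(value)
    return counters


def drop_ratio(counters: dict) -> float:
    """Fraction of events dropped in the reported timeframe (n_drops / n_evts)."""
    n_evts = counters["n_evts"]
    return counters["n_drops"] / n_evts if n_evts else 0.0


# Shortened version of the alert shown above (per-syscall counters omitted).
sample = (
    "22:01:56.865201866: Debug Falco internal: syscall event drop. "
    "10334471 system calls dropped in last second. "
    "(ebpf_enabled=0 n_drops=10334471 n_drops_buffer_total=10334471 "
    "n_drops_bug=0 n_drops_page_faults=0 n_drops_scratch_map=0 n_evts=21356397)"
)

counters = parse_drop_alert(sample)
print(f"dropped {drop_ratio(counters):.1%} of events in the last second")
```

For the alert above, roughly 48% of the events the driver should have delivered were dropped.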


Also note that in extreme cases, drop alerts may be rate-limited, so consider increasing [those values](https://github.com/falcosecurity/falco/blob/8611af437303241c79996fbba6a2347404c572c3/falco.yaml#L83) in the configuration file, for example:
```
syscall_event_drops:
  actions:
    - alert
  rate: 100
  max_burst: 1000
```
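
To get an intuition for why low `rate` and `max_burst` values can hide drop alerts, here is a minimal sketch of a classic token bucket. It assumes the two settings behave like a standard token bucket (`rate` tokens added per second, at most `max_burst` tokens stored); the `TokenBucket` class is illustrative only and is not Falco's actual implementation.

```python
class TokenBucket:
    """Classic token bucket: refills at `rate` tokens/second, capped at `max_burst`."""

    def __init__(self, rate: float, max_burst: int):
        self.rate = rate
        self.max_burst = max_burst
        self.tokens = float(max_burst)  # start with a full bucket
        self.last = 0.0                 # timestamp of the previous call, in seconds

    def allow(self, now: float) -> bool:
        """Return True if an alert emitted at time `now` would pass the limiter."""
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.max_burst, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Hypothetical settings: 1 alert/second sustained, bursts of up to 2 alerts.
bucket = TokenBucket(rate=1.0, max_burst=2)
decisions = [bucket.allow(t) for t in (0.0, 0.0, 0.0, 1.0)]
print(decisions)
```

With these hypothetical settings, the third alert in the same instant is suppressed, and alerts pass again only after the bucket has refilled.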

## Kind of drops

As you may have noticed, not all drops are the same. Below is an explanation of each kind, ordered from the least frequent to the most frequent.

- `n_drops_bug` is the number of dropped events caused by an invalid condition in the kernel instrumentation; basically, something went wrong. AFAIK, only the eBPF probe [can generate this kind of drop](https://github.com/falcosecurity/libs/search?q=PPM_FAILURE_BUG&unscoped_q=PPM_FAILURE_BUG), and luckily there are no reports of this problem.
- `n_drops_pf` (where `pf` stands for *page fault*; shown as `n_drops_page_faults` in the alert above) is the number of dropped events caused by invalid memory access. It happens whenever the memory page referenced by the syscall disappeared before the driver was able to collect information about the event. We noticed that it occasionally happens on GKE, where it is related to some process that is continuously crashing (see #1309).
- `n_drops_buffer` (shown as `n_drops_buffer_total` in the alert above) is the number of dropped events caused by a full buffer (the buffer that sits between the producer and the consumer). It's the most frequent kind, and it's related to performance. There are also per-category buffer drop counters that help understand which syscalls triggered the drops (e.g. `n_drops_buffer_clone_fork_exit`, `n_drops_buffer_connect_enter`, ...)

> Those fields are defined by the driver in this [struct](https://github.com/falcosecurity/libs/blob/master/userspace/libscap/scap.h#L127-L150)
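
As a rough illustration of how these counters can be read together, the sketch below takes a dictionary of counters (named as they appear in the alert text) and reports which kinds of drops occurred and which buffer categories contributed most. The helper names and the sample values are made up for illustration; they do not come from a real alert.

```python
BUFFER_PREFIX = "n_drops_buffer_"


def drop_kinds(counters: dict) -> dict:
    """Return the non-zero totals for each kind of drop described above."""
    kinds = {
        "bug": counters.get("n_drops_bug", 0),
        "page_faults": counters.get("n_drops_page_faults", 0),
        "buffer": counters.get("n_drops_buffer_total", 0),
    }
    return {kind: count for kind, count in kinds.items() if count > 0}


def top_buffer_categories(counters: dict, limit: int = 3) -> list:
    """Return the largest per-syscall buffer drop counters, biggest first."""
    per_category = {
        key[len(BUFFER_PREFIX):]: value
        for key, value in counters.items()
        if key.startswith(BUFFER_PREFIX)
        and key != "n_drops_buffer_total"
        and value > 0
    }
    ranked = sorted(per_category.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


# Hypothetical counters, for illustration only.
example = {
    "n_drops_bug": 0,
    "n_drops_page_faults": 2,
    "n_drops_buffer_total": 150,
    "n_drops_buffer_open_enter": 90,
    "n_drops_buffer_open_exit": 40,
    "n_drops_buffer_connect_enter": 0,
    "n_drops_buffer_other_interest_exit": 20,
}

print(drop_kinds(example))
print(top_buffer_categories(example, limit=2))
```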

### Performance-related drops (n_drops_buffer)

We experience this kind of event dropping when the consumer is blocked for a while (note that the task that consumes events is single-threaded). That is strictly related to performance and can happen for several reasons. We also added a [benchmark command](https://github.com/falcosecurity/event-generator#benchmark) in the event-generator to experiment with this problem (see https://github.com/falcosecurity/event-generator/pull/36 for more details).

Possible causes:

#### Limited CPU resource 
The consumer hits the maximum CPU resources allocated for it and gets blocked for a while. For example, the official Helm chart comes with a [200m CPU hard limit](https://github.com/falcosecurity/charts/blob/ad6059b79e22bb0ffc808d8c1342e23df95eceb5/falco/values.yaml#L19-L27) that may cause this problem.
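
One way to check whether a CPU limit is actually throttling the Falco container is to look at the cgroup CPU statistics. The sketch below assumes a cgroup v2 `cpu.stat` file (typically readable at `/sys/fs/cgroup/cpu.stat` inside the container) exposing the `nr_periods` and `nr_throttled` counters; the helper names and the sample content are hypothetical.

```python
def parse_cpu_stat(text: str) -> dict:
    """Parse the 'name value' lines of a cgroup cpu.stat file into integers."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_fraction(stats: dict) -> float:
    """Share of CPU enforcement periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0


# Hypothetical file content, for illustration only. On a real system, read it with:
#   open("/sys/fs/cgroup/cpu.stat").read()   (path assumed, cgroup v2)
sample_cpu_stat = """\
usage_usec 5000000
nr_periods 1000
nr_throttled 250
throttled_usec 1200000
"""

stats = parse_cpu_stat(sample_cpu_stat)
print(f"throttled in {throttled_fraction(stats):.0%} of periods")
```

A throttled fraction that keeps growing while drops are reported suggests the CPU limit is a likely contributor.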

#### Large/complex ruleset (high CPU usage)
The larger and more complex the ruleset, the more CPU will be needed. At some point, either with or without resource limitation, high CPU usage can produce event dropping.


#### Fetching metadata from external components (I/O blocking)
In some cases, fetching metadata (e.g., container information, k8s metadata) from an external component can be a blocking operation. 

For example, the `--disable-cri-async` flag is quite explanatory about that:
```
--disable-cri-async           Disable asynchronous CRI metadata fetching.
                               This is useful to let the input event wait for the container metadata fetch
                               to finish before moving forward. Async fetching, in some environments leads
                               to empty fields for container metadata when the fetch is not fast enough to be
                               completed asynchronously. This can have a performance penalty on your environment
                               depending on the number of containers and the frequency at which they are created/started/stopped
```

Another option that might cause problems is:
```
-A                            Monitor all events, including those with EF_DROP_SIMPLE_CONS flag.
```

Slow responses from the Kubernetes API server could cause this problem too. 

> __Please note__: [`k8s.ns.name` and `k8s.pod.*`](https://falco.org/docs/reference/rules/supported-fields/#field-class-container) (i.e., `k8s.pod.name`, `k8s.pod.id`, `k8s.pod.labels`, and `k8s.pod.label.*`) are populated with data fetched from the container runtime, so you don't need to enable the k8s enrichment if you need only these fields.

#### Blocking output (I/O blocking)

Falco's [output mechanism](https://falco.org/docs/alerts/) can also have an impact: a slow output might block event processing for a while, producing drops.


#### The buffer size

If you are not able to solve your drop issues, you can always increase the syscall buffer size (the buffer shared between userspace and kernel that contains all collected data). You can find more information on how to change its size in the [Falco config file](https://github.com/falcosecurity/falco/blob/master/falco.yaml#L173-L224).


## Debugging

When debugging, the first thing to consider is that multiple causes may occur simultaneously. It is worth excluding each cause, one by one.

Once the `n_drops_bug` and `n_drops_pf` cases are excluded, a handy checklist for performance-related drops (i.e. `n_drops_buffer`) is:

- [ ] Make sure you're using the latest Falco version (and the latest Helm chart version, if using this installation method) and that the rule files are updated to match that version
- [ ] Make sure docker (and containerd, if enabled) options are appropriately configured and that Falco can access the socket (the [official documentation](https://falco.org/docs/running/#docker) and the [k8s deployment resource example](https://github.com/falcosecurity/evolution/blob/master/deploy/kubernetes/kernel-and-k8s-audit/daemonset.yaml) can help with that)
- [ ] Increase the [syscall_event_drops](https://github.com/falcosecurity/falco/blob/8611af437303241c79996fbba6a2347404c572c3/falco.yaml#L83) values, so that drop alerts are not rate-limited
- [ ] Remove custom rules, if any
- [ ] Remove any CPU resource limitation (for example [this](https://github.com/falcosecurity/charts/blob/ad6059b79e22bb0ffc808d8c1342e23df95eceb5/falco/values.yaml#L19-L27))
- [ ] Remove the `-A` option, if any
- [ ] Remove the `--disable-cri-async` option, if any
- [ ] Remove the `-U` option, if any
- [ ] Make sure you are using `-k https://$(KUBERNETES_SERVICE_HOST)` (instead of `-k https://kubernetes.default`, see [this comment](https://github.com/falcosecurity/falco/issues/558#issuecomment-473106946))
- [ ] Disable K8s Audit Events by setting `webserver.enabled` to false in the [config file](https://github.com/falcosecurity/falco/blob/8611af437303241c79996fbba6a2347404c572c3/falco.yaml#L139) and removing any other related configuration
- [ ] Completely disable K8s support (by removing `-K`, `-k`, and `-pk` options)
- [ ] Disable any other integration, if any
- [ ] Disable all outputs one by one, including `stdout_output` (event drop alerts still show up)

Finally, some useful links that could help with debugging:

- Interesting issues about drops 
https://github.com/falcosecurity/falco/issues/669, 
https://github.com/falcosecurity/falco/issues/669#issuecomment-570824776,
https://github.com/falcosecurity/falco/issues/669#issuecomment-642000182,
https://github.com/falcosecurity/falco/issues/961,
https://github.com/falcosecurity/falco/issues/1382,
https://github.com/falcosecurity/falco/issues/615,
https://github.com/falcosecurity/falco/issues/1231,
https://github.com/falcosecurity/falco/issues/558

- Issues related to `n_drops_pf` 
https://github.com/falcosecurity/falco/issues/917,
https://github.com/falcosecurity/falco/issues/770,
https://github.com/falcosecurity/falco/issues/669#issuecomment-641466585

- Some threads on our Slack channel
https://kubernetes.slack.com/archives/CMWH3EH32/p1592904527372800, https://kubernetes.slack.com/archives/CMWH3EH32/p1599741275086000 

- Drop related to the K8s support (`-K`, `-k`, and `-pk` options)
https://github.com/falcosecurity/falco/issues/2129
