Description
Motivation
The plugin system allows Falco to open new kinds of event sources that go beyond the historical syscall use case. Recently, this has been leveraged to port the k8s audit log event source to a plugin (see: https://github.com/falcosecurity/plugins/tree/master/plugins/k8saudit, and #1952). One of the core limitations of the plugin system implementation in the libraries is that a given Falco instance is capable of opening only one event source. In the example above, this implies that a single Falco instance is not able to ingest both syscalls and k8s audit logs together. Today this can only be accomplished by deploying two distinct Falco instances, one for each event source.
Feature Requirements
- (R1) A single Falco instance should be able to open more than one event source at once, in parallel
- (R2) There should be feature parity and performance parity between having 2+ sources active in parallel in a single Falco instance and having 2+ single-source Falco instances with the same event sources
Proposed Solution
Release Goals
To be defined. This is out of reach for Falco 0.32.1; the earliest realistic target is Falco 0.33.0.
Terminology
- Capture Mode: A configuration of `sinsp` inspectors that reads events from a trace file
- Live Mode: A configuration of `sinsp` inspectors that reads events from one of the supported live sources (kmod, ebpf, gvisor, plugin)
Design
- (D1) The feature is implemented in Falco only, and mostly only affects the codebase of `falcosecurity/falco`. Both libsinsp and libscap will keep working in single-source mode
- (D2) Falco manages multiple `sinsp` instances, one in each thread
- (D3) Falco manages one or more instances of `sinsp` inspectors
  - If the # of inspectors is 1, everything runs in the main thread just like now
  - If the # of inspectors is 2+, each inspector runs in its own separate thread (see (R1) and the sketch after this list). The whole event data path happens in parallel within each thread (event production, data enrichment, event-rule matching, and output formatting)
- (D4) In capture mode, Falco runs only 1 inspector, configured to read events from a trace file
- (D5) In live mode, Falco runs 1 inspector for each active event source
  - If an event source terminates due to EOF being reached, Falco waits for the other event sources to terminate too
  - If an event source terminates with an error, Falco forces the termination of all the other event sources
- (D6) There is 1 instance of the Falco Rule Engine (just like now), and we leverage/enforce thread-safety guarantees to make sure it is safe and non-blocking for different threads to perform event-rule matching
- (D7) There is 1 instance of the Falco Output Engine (just like now), and we leverage/enforce thread-safety guarantees to make sure it is safe for different threads to send alerts when an event-rule match is found
  - Non-blocking guarantees are less of a concern here, because the number of alerts is orders of magnitude lower than the number of events
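
To make the intended flow of (D3) and (D5) concrete, here is a minimal, self-contained C++ sketch of the per-source scheduling. All names (`inspector`, `rule_engine`, `next_event`, `source_loop`, ...) are illustrative stand-ins, not Falco's actual classes:

```cpp
#include <atomic>
#include <memory>
#include <thread>
#include <vector>

// Toy stand-ins for sinsp concepts (illustrative only).
struct event {};
struct inspector { int remaining = 3; };  // pretends to produce 3 events
struct rule_engine { void process(const event&) { /* event-rule matching */ } };

// Pulls the next event; returns false on EOF or error (this toy version only EOFs).
bool next_event(inspector& insp, event& evt, bool& eof)
{
    if (insp.remaining-- > 0) return true;
    eof = true;
    return false;
}

// The whole data path (production, enrichment, matching, output formatting)
// runs inside this per-source loop (D3).
void source_loop(inspector& insp, rule_engine& engine, std::atomic<bool>& stop)
{
    event evt;
    bool eof = false;
    while (!stop.load())
    {
        if (!next_event(insp, evt, eof))
        {
            // (D5): an error forces every other source to stop too,
            // while EOF lets the remaining sources drain naturally.
            if (!eof) stop.store(true);
            return;
        }
        engine.process(evt);
    }
}

void run_live(std::vector<std::unique_ptr<inspector>>& inspectors, rule_engine& engine)
{
    std::atomic<bool> stop{false};
    if (inspectors.size() == 1)
    {
        // Single source: keep everything in the main thread, just like today.
        source_loop(*inspectors[0], engine, stop);
        return;
    }
    std::vector<std::thread> threads;
    for (auto& owned : inspectors)
    {
        inspector* insp = owned.get();  // one inspector per thread (D2)
        threads.emplace_back([insp, &engine, &stop] { source_loop(*insp, engine, stop); });
    }
    for (auto& t : threads) t.join();   // wait for all sources to terminate
}
```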
Technical Limitations of the Design
- (L1) There cannot be 2+ event sources with the same name active at the same time
  - This would defeat the thread-safety guarantees of the Rule Engine, which are based on the notion of event source partitioning (see the sketch below)
  - Potential Workarounds (for the future, just in case):
    - Have more than one instance of the Rule Engine to handle the increased event source cardinality. For example, the second Rule Engine instance would cover all the second event source replicas, the third Rule Engine instance would handle the third replicas, and so on
    - Make the Rule Engine thread-safe without the event source <-> thread 1-1 mapping assumption. This is hardly achievable, because it would imply making the whole filtercheck system of libsinsp thread-safe too. Another naive solution would be to create one mutex for each event source to protect access to the Rule Engine. In both scenarios, this would be hard to manage and performance would be sub-optimal
    - Have one Rule Engine for each source, which could become harder to manage. For example, rule files would need to be loaded by all the rule engines, which makes the initialization phase and hot-reloading slower too. However, this is something we can consider for the future.
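
  As an illustration of the partitioning assumption (names are hypothetical, not the real Rule Engine): the engine keeps one independent slot of mutable state per event source, and thread-safety relies entirely on the invariant that each source index is only ever touched by the single thread running that source.

  ```cpp
  #include <cstddef>
  #include <string>
  #include <vector>

  // Hypothetical sketch of event source partitioning.
  struct per_source_state
  {
      std::string source_name;
      // rulesets, filterchecks, extraction caches... all mutable and unshared
  };

  class rule_engine
  {
      std::vector<per_source_state> m_state;   // index == source index

  public:
      std::size_t add_source(const std::string& name)
      {
          // (L1): two sources with the same name would mean two threads
          // sharing the same slot, breaking the partitioning invariant.
          m_state.push_back({name});
          return m_state.size() - 1;
      }

      void process(std::size_t source_idx /*, const event& evt */)
      {
          // Lock-free by construction: only one thread ever passes this index.
          auto& state = m_state[source_idx];
          (void)state; // ... match the event against this source's rules ...
      }
  };
  ```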
- (L2) Filterchecks cannot be shared across different event sources, in order to guarantee thread-safety in the Rule Engine. The direct implication is that if a plugin with extractor capability is compatible with 2+ active event sources (e.g. `json` can extract from both `aws_cloudtrail` and `k8s_audit`), we need to create and initialize two different instances of the plugin (1 for each inspector); see the sketch below
  - Practically, this means that a given plugin instance will always extract fields coming from the same event source (a.k.a. subsequent calls to `plugin_extract_fields` will never receive events from two distinct event sources for the same initialized plugin state)
  - This limitation can actually be turned into a by-design feature, because doing the contrary would violate (R2)
  - Potential Workarounds (for the future, just in case):
    - Make field extraction thread-safe (hardly doable, see points in (L1))
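
A minimal sketch of the consequence of (L2), with hypothetical names: the same extractor plugin (think `json`) is instantiated once per inspector, so each initialized state only ever sees events of a single source:

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for one initialized extractor plugin state
// (conceptually, what one ss_plugin_t* handle points to).
struct extractor_instance
{
    std::string bound_source;   // the only source this instance will ever see

    void extract_fields(const std::string& evt_source)
    {
        // (L2) guarantees this invariant: a given instance never receives
        // events from two distinct sources across subsequent calls.
        assert(evt_source == bound_source);
        // ... field extraction for this event ...
    }
};

int main()
{
    // The `json` plugin is compatible with both active sources, so it is
    // loaded/initialized twice: one instance per inspector (per source).
    extractor_instance json_for_cloudtrail{"aws_cloudtrail"};
    extractor_instance json_for_k8saudit{"k8s_audit"};

    json_for_cloudtrail.extract_fields("aws_cloudtrail");   // ok
    json_for_k8saudit.extract_fields("k8s_audit");          // ok
    // json_for_cloudtrail.extract_fields("k8s_audit");     // would violate (L2)
}
```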
Technical Blockers
This is the list of things we must work on to make this initiative happen.
- (B1) The rule engine <-> inspector source index mapping needs to be handled in different ways for capture mode and live mode (see the sketch below)
  - In capture mode, the rule engine source index is the same as the source index in the plugin manager of the single inspector used in capture mode (with the exception of the `syscall` source, which is by convention the last source index after all the plugin ones)
  - In live mode, each rule engine source index should be uniquely assigned to a live-mode inspector running in its own thread
  - update: support running multiple event sources in parallel #2182
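
  A hypothetical sketch of the two mapping strategies (illustrative names only, not the actual code from #2182):

  ```cpp
  #include <cstddef>
  #include <string>
  #include <vector>

  // Capture mode: a single inspector replays a trace file. The rule engine
  // index mirrors the inspector's plugin manager order, with `syscall`
  // conventionally taking the last index after all plugin sources.
  std::vector<std::string> capture_mode_mapping(const std::vector<std::string>& plugin_sources)
  {
      std::vector<std::string> index_to_source = plugin_sources;
      index_to_source.push_back("syscall");   // by convention, last
      return index_to_source;
  }

  // Live mode: each active source gets its own inspector/thread, and each
  // inspector is assigned its own unique rule engine source index.
  struct live_assignment { std::size_t source_index; std::size_t inspector_id; };

  std::vector<live_assignment> live_mode_mapping(std::size_t n_inspectors)
  {
      std::vector<live_assignment> out;
      for (std::size_t i = 0; i < n_inspectors; i++)
          out.push_back({i, i});   // one unique index per per-thread inspector
      return out;
  }
  ```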
- (B2) Plugins can potentially be loaded multiple times in order to be registered with each live-mode inspector (see the sketch below)
  - In live mode, a single plugin with field extraction capability can be registered with all the inspectors configured with an event source it is compatible with
  - Note: This also applies to plugins with both field extraction and event sourcing capabilities. In this case, the plugin is registered and used with both its capabilities only in the inspector in which its event source is active, whereas it is registered for just its field extraction capability in all other event-source-compatible inspectors.
  - update: support running multiple event sources in parallel #2182
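
  As a sketch of the registration rule described above (hypothetical names and structures):

  ```cpp
  #include <string>
  #include <vector>

  // Illustrative only. A loaded plugin may have event sourcing and/or
  // field extraction capabilities.
  struct plugin_caps
  {
      bool sourcing = false;
      bool extraction = false;
      std::string event_source;                     // meaningful if sourcing
      std::vector<std::string> compatible_sources;  // meaningful if extraction
  };

  bool is_compatible(const plugin_caps& p, const std::string& source)
  {
      for (const auto& s : p.compatible_sources)
          if (s == source) return true;
      return false;
  }

  // Decides how a plugin is registered in the inspector running `source`:
  //  - full registration (both capabilities) only where its own source is active
  //  - extraction-only registration in every other compatible inspector
  enum class registration { none, extraction_only, full };

  registration register_in(const plugin_caps& p, const std::string& source)
  {
      if (p.sourcing && p.event_source == source)
          return registration::full;
      if (p.extraction && is_compatible(p, source))
          return registration::extraction_only;
      return registration::none;
  }
  ```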
- (B3) The plugin API and the Plugin SDK Go should be revised to support multi-threaded and concurrent usage (this will likely be the only change needed outside of `falcosecurity/falco`)
  - Plugin API:
    - Most API symbols will need to support being called concurrently, with every concurrent call having a distinct `ss_plugin_t*` (see the sketch below)
    - update(userspace/plugin): bump plugin API version to 2.0.0 libs#547
  - Plugin SDK Go:
    - Tracking the discussion in another issue: [tracking] supporting concurrent consumers plugin-sdk-go#62
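
  A toy sketch of the concurrency contract (simplified, hypothetical types and signatures; deliberately not the real plugin API):

  ```cpp
  #include <thread>

  // Toy opaque state, standing in for what a ss_plugin_t* points to.
  struct toy_plugin_state { long events_seen = 0; };

  // The revised contract: API functions may be called concurrently as long
  // as each concurrent call operates on a *distinct* state handle. Falco
  // never shares one handle across source threads.
  void toy_next_batch(toy_plugin_state* s) { s->events_seen++; }

  int main()
  {
      toy_plugin_state a, b;      // one state per inspector/thread
      std::thread t1([&a] { for (int i = 0; i < 1000; i++) toy_next_batch(&a); });
      std::thread t2([&b] { for (int i = 0; i < 1000; i++) toy_next_batch(&b); });
      t1.join();
      t2.join();
      // No synchronization needed: the two threads never touch the same state.
  }
  ```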
- (B4) Print-only Falco actions (e.g. list fields, list events, etc.) depend on the app state inspector. These need to be made stateless, which can be done by allocating a `sinsp` inspector on-the-fly, because they just access static information
- (B5) The Falco StatsWriter (`-s` option) is not thread-safe
- (B6) The Falco Rule Engine does not provide any thread-safety guarantee
- (B7) Signal-based actions (termination, restart, and output reopening) are not thread-safe
- (B8) The Falco Output framework is not entirely thread-safe (a sketch of the kind of synchronization (B7) and (B8) call for is below)
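
  One possible direction for (B7) and (B8), sketched with standard primitives (illustrative only, not Falco's actual fix):

  ```cpp
  #include <atomic>
  #include <csignal>
  #include <cstdio>
  #include <mutex>

  // (B7): signal handlers may fire on any thread, so they should only flip
  // lock-free atomic flags that all source threads poll.
  std::atomic<bool> g_terminate{false};
  std::atomic<bool> g_reopen_outputs{false};

  extern "C" void on_signal(int sig)
  {
      if (sig == SIGTERM || sig == SIGINT) g_terminate.store(true);
      if (sig == SIGUSR1) g_reopen_outputs.store(true);
  }

  // (B8): alert emission is low-frequency (see (D7)), so a plain mutex
  // around the shared output channel is an acceptable starting point.
  std::mutex g_output_mutex;

  void emit_alert(const char* msg)
  {
      std::lock_guard<std::mutex> guard(g_output_mutex);
      std::puts(msg);
  }

  int main()
  {
      std::signal(SIGTERM, on_signal);
      std::signal(SIGINT, on_signal);
      std::signal(SIGUSR1, on_signal);
      while (!g_terminate.load())
      {
          if (g_reopen_outputs.exchange(false)) { /* reopen files/channels */ }
          emit_alert("example alert");
          break;  // toy: emit once and exit
      }
  }
  ```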
- (B9) Libsinsp and libscap have a number of global and static variables that limit our freedom to run multiple inspectors in parallel (`g_infotables`, `g_logger`, `g_initializer`, `g_filterlist`, `g_decoderlist`, `s_cri_xxx`, `libsinsp::grpc_channel_registry::s_channels`, `s_callback`, `g_event_info`, `g_ppm_events`, `g_chisel_dirs`, `g_chisel_initializer`, `g_syscall_code_routing_table`, `g_syscall_table`, `g_syscall_info_table`, `g_json_error_log`)
  - This may seem like a lot, but we should be good to go as-is, because most of these are read-only tables or objects. The ones that actually bundle some logic are either thread-safe (`g_logger`) or used only by inspectors running the `syscall` event source. Since, due to (L1), we don't allow two inspectors running the `syscall` source at the same time, it is safe to assume that no concurrent access will happen to the syscall-related globals.
- (B10) The whole Falco application logic should be revised to support multiple inspectors and multiple filtercheck factories, and to distinguish the capture-mode and live-mode use cases as defined in the Design section. This will require all previous (BXX) points to be satisfied first
Nice to Have
- (N1) Add a new `--enable-source=xxx` option as a dual to `--disable-source=xxx`. The current design implies that the active event sources are chosen in an opt-out fashion: every loaded source gets activated, with the exception of the disabled ones. The `--enable-source` option will improve the UX by letting users define the only sources they want to activate
- (N2) Improve the regression testing framework `falco_test.py` to support selecting the active event source. Without it, all non-syscall tests will hang or fail, because the syscall event source is implicitly activated along with the one under test and will cause Falco to not terminate (example: k8s audit tests)
- (N3) Reduce the threadiness of the `/healthz` webserver (based on cpp-httplib). The webserver library documents that the default threadiness is 8 or `std::thread::hardware_concurrency()`. This is OK, but since we are moving to a multi-threaded model, we should consider limiting the number of threads spawned by Falco. In a simple test with syscalls and 2 Go plugins loaded on my 8-core setup, Falco spawned 30 threads (luckily, only a few were active). A possible mitigation is sketched below.
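
As a minimal sketch of (N3), using cpp-httplib's documented `new_task_queue` hook and `ThreadPool` class (the endpoint path and response body are illustrative, not Falco's actual webserver code):

```cpp
#include <httplib.h>

int main()
{
    httplib::Server svr;

    // cpp-httplib lets the application replace the default task queue; here
    // we cap the worker pool at 2 threads instead of the default
    // (8 or std::thread::hardware_concurrency()).
    svr.new_task_queue = [] { return new httplib::ThreadPool(2); };

    svr.Get("/healthz", [](const httplib::Request&, httplib::Response& res) {
        res.set_content("{\"status\": \"ok\"}", "application/json");
    });

    svr.listen("0.0.0.0", 8765);
}
```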
Linked Discussions
- https://kubernetes.slack.com/archives/CMWH3EH32/p1655762074442099
- https://kubernetes.slack.com/archives/CMWH3EH32/p1655209558876649
- https://kubernetes.slack.com/archives/CMWH3EH32/p1647969834426869
- https://kubernetes.slack.com/archives/CMWH3EH32/p1646148516502339?thread_ts=1646061690.259599&cid=CMWH3EH32
- https://kubernetes.slack.com/archives/CMWH3EH32/p1645610199932789?thread_ts=1645396079.322259&cid=CMWH3EH32
- https://kubernetes.slack.com/archives/CMWH3EH32/p1645038583559959?thread_ts=1645034669.067289&cid=CMWH3EH32
- k8s_audit_rules can't be loaded #2110