New blog post introducing Data Prepper 1.3.0 and its new features. #729

Merged
8 changes: 8 additions & 0 deletions _authors/ddpowers.markdown
@@ -0,0 +1,8 @@
---
short_name: ddpowers
name: David Powers
photo: '/assets/media/authors/ddpowers.jpeg'
github: dapowers87
---

David Powers is a Software Development Engineer at AWS working in search services. He is a maintainer on the Data Prepper project.
8 changes: 8 additions & 0 deletions _authors/tylgry.markdown
@@ -0,0 +1,8 @@
---
name: Taylor Gray
short_name: tylgry
photo: '/assets/media/authors/tylgry.jpg'
github: graytaylor0
---

Taylor Gray is a Software Development Engineer at AWS with a focus on OpenSearch and observability. He is a maintainer of the Data Prepper project.
@@ -0,0 +1,224 @@
---
layout: post
title: "Introducing Data Prepper 1.3.0 with New Aggregation Processor"
authors:
- tylgry
- ddpowers
- dlv
date: 2022-03-21 10:00:00 -0500
categories:
- technical-post
twittercard:
  description: "Data Prepper 1.3.0 is available for use today. This release provides a log aggregation processor and other new processors."
---

Data Prepper is an open source data collector for trace and log data that can filter,
enrich, and aggregate data for downstream analysis and visualization. Data Prepper now
has more support for log enrichment and aggregation with the new features released in
Data Prepper 1.3.0. This release includes new processors to mutate fields on events,
aggregate distinct log events, drop certain events, and more.

In this post, we’d like to introduce some of the new enrichment processors that are
part of Data Prepper 1.3.0. The
[Release Notes](https://github.com/opensearch-project/data-prepper/blob/main/release/release-notes/data-prepper.release-notes-1.3.0.md)
include the full list of processors as well as links to their documentation.

## Mutate and String Processors

The [mutate event processors](https://github.com/opensearch-project/data-prepper/blob/main/data-prepper-plugins/mutate-event-processors/README.md)
let you rename, copy, add, or even delete entries in events. Using the `add_entries` processor,
for example, a user can add an entry to their data to help debug the flow of data:


```yaml
...
processor:
  - add_entries:
      entries:
        - key: "debug"
          value: true
...
```
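
With this configuration in place, a hypothetical incoming event of `{ "message": "hello world" }` would leave the processor with the new entry added:

```json
{ "message": "hello world", "debug": true }
```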

The mutate string processors offer tools to manipulate strings in the incoming data. Say you need to split
a string into an array. Simply add this processor:


```yaml
...
processor:
  - split_string:
      entries:
        - source: "message"
          delimiter: "&"
...
```

and an entry named `message` with a string such as `"a&b&c"` would transform into `["a", "b", "c"]`. Currently,
there is support for `lowercase`, `uppercase`, `trim`, `substitute`, and `split`.
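
As another illustration, here is a minimal sketch of the `substitute_string` processor, which matches a
key's value against a regular expression and replaces each match; the `message` field and the pattern
here are placeholders:

```yaml
...
processor:
  - substitute_string:
      entries:
        - source: "message"
          from: ":"   # regular expression to match
          to: "-"     # replacement string
...
```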

The [Logstash conversion tool](https://github.com/opensearch-project/data-prepper/blob/main/docs/logstash_migration_guide.md)
has been updated to support conversion from Logstash's `mutate` filter into these new processors.
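
For example, a Logstash `mutate` filter that renames a field maps onto the new `rename_keys` processor.
A minimal sketch, with placeholder field names:

```yaml
...
processor:
  - rename_keys:
      entries:
        - from_key: "message"
          to_key: "log_line"
...
```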

## Filtering with the Drop Processor

Data Prepper now supports a new [drop processor](https://github.com/opensearch-project/data-prepper/tree/main/data-prepper-plugins/drop-events-processor)
which can filter out specified log events. Say you are collecting web request logs and only wish to store
non-successful requests. You could create a pipeline which drops any requests where the response is less than 400
so that only HTTP status codes 400 and above remain. The following example pipeline shows how you could configure this.

```yaml
log-pipeline:
  source:
    http:
      ssl: false
  processor:
    - grok:
        match:
          log: [ "%{COMMONAPACHELOG}" ]
    - drop:
        drop_when: "/response < 400"
  sink:
    - opensearch:
        hosts: [ "https://opensearch:9200" ]
        index: failure_logs
```

In the sample above, one of the processors is `drop`. The `drop_when` property defines a condition to determine
which Events to drop from the pipeline. This condition is `/response < 400`.
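
To illustrate, assuming the `grok` processor has produced a numeric `response` field, the first of these
two hypothetical events would be dropped and the second would continue on to the sink:

```json
{ "response": 200, "request": "/index.html" }
{ "response": 404, "request": "/missing.html" }
```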

## Extracting Key-Value Pairs from Strings

Log data often includes strings of key-value pairs. One common scenario is an HTTP query string. For example, if a
web user requests a pageable URL, the HTTP logs might include the following query string:
`page=3&q=my-search-term`. To perform analysis using the search terms, you may wish to extract the
value of `q` from the query string. Data Prepper’s new
[key-value processor](https://github.com/opensearch-project/data-prepper/blob/main/data-prepper-plugins/key-value-processor/README.md)
provides robust support for extracting keys and values from strings like these.

The following example shows how you could use the new `split_string` processor and `key_value` processor to get query
parameters from an Apache log line.

```yaml
processor:
  - grok:
      match:
        message: [ "%{COMMONAPACHELOG}" ]
  - split_string:
      entries:
        - source: request
          delimiter: "?"
  - key_value:
      source: "/request/1"
      field_split_characters: "&"
      value_split_characters: "="
      destination: query_params
```
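
For a hypothetical request of `/search?page=3&q=my-search-term`, the `split_string` processor splits the
`request` field on `?`, and the `key_value` processor then parses the second element (`/request/1`) into a
map, producing an event along these lines (other grok fields omitted):

```json
{ "request": ["/search", "page=3&q=my-search-term"], "query_params": { "page": "3", "q": "my-search-term" } }
```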

## Setting Timestamps on Events

Data Prepper provides a new
[date processor](https://github.com/opensearch-project/data-prepper/blob/main/data-prepper-plugins/date-processor/README.md)
to allow pipeline authors to configure the timestamp on events. Pipeline authors have a couple of options. The first
option is to parse a field in the current Event according to a date-time pattern. This is useful if your log events
already contain timestamps. You can also configure the timezone in case the timestamps come from another timezone.
The second option is to use the time that Data Prepper receives an event as that event’s timestamp. You may wish to
use this when receiving log data that does not include a timestamp.
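
A minimal sketch of the first option, based on the date processor README; the `timestamp` key, pattern,
and timezone are illustrative:

```yaml
processor:
  - date:
      match:
        - key: "timestamp"
          patterns: ["dd/MMM/yyyy:HH:mm:ss Z"]
      destination: "@timestamp"
      source_timezone: "America/Los_Angeles"
      # For the second option, replace `match` with `from_time_received: true`
      # to stamp events with the time Data Prepper received them.
```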

## Aggregate Processor

Users often want to aggregate data from different Events over a period of time. This is important for reducing
unnecessary log volume and for handling use cases like multi-line logs that arrive as separate Events. The `aggregate`
processor is a stateful processor that groups Events together based on the values for a set of specified
[identification_keys](https://github.com/opensearch-project/data-prepper/tree/main/data-prepper-plugins/aggregate-processor#identification_keys),
and performs a configurable [action](https://github.com/opensearch-project/data-prepper/tree/main/data-prepper-plugins/aggregate-processor#action)
such as `remove_duplicates` or `put_all` on each group. You can use existing actions, or you can create your own actions
as Java code to perform custom aggregations.

State in the aggregate processor is stored in memory. For example, in order to combine four Events into one, Data Prepper
needs to retain pieces of the first three Events. The state of an aggregate group of Events is kept for a configurable
amount of time. Depending on your logs, the aggregate action being used, and the amount of memory available, the
aggregation could take place over a long period of time.

In Data Prepper 1.3, two actions are being released with the `aggregate` processor: `remove_duplicates` and `put_all`.
However, creating custom actions is straightforward. If you are interested in learning more, read the
[aggregate README](https://github.com/opensearch-project/data-prepper/tree/main/data-prepper-plugins/aggregate-processor#creating-new-aggregate-actions).
Additionally, please create a [GitHub issue](https://github.com/opensearch-project/data-prepper/issues/new/choose)
for any actions you would like Data Prepper to support.

At the moment, the aggregate processor is only useful for single-node clusters of Data Prepper. However, utilizing the
[peer forwarder](https://github.com/opensearch-project/data-prepper/issues/700)
to aggregate over multiple-node clusters is planned for a future release of Data Prepper.

The following pipeline configuration extracts fields of `sourceIp`, `destinationIp`, and `port` using the `grok`
processor, and then aggregates on those fields over a period of 30 seconds using the `aggregate` processor and
the `put_all` action. At the end of the 30 seconds, the aggregated log is sent to the OpenSearch sink.

```yaml
aggregate_pipeline:
  source:
    http:
      ssl: false
  processor:
    - grok:
        match:
          log: ["%{IPORHOST:sourceIp} %{IPORHOST:destinationIp} %{NUMBER:port:int}"]
    - aggregate:
        group_duration: "30s"
        identification_keys: ["sourceIp", "destinationIp", "port"]
        action:
          put_all:
  sink:
    - opensearch:
        hosts: ["https://opensearch:9200"]
        index: aggregated_logs
```

Given the following batch of logs:

```json
{ "log": "127.0.0.1 192.168.0.1 80", "status": 200 }
{ "log": "127.0.0.1 192.168.0.1 80", "bytes": 1000 }
{ "log": "127.0.0.1 192.168.0.1 80" "http_verb": "GET" }
```

The grok processor will extract the `identification_keys` to create the following logs:

```json
{ "sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "port": 80, "status": 200 }
{ "sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "port": 80, "bytes": 1000 }
{ "sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "port": 80, "http_verb": "GET" }
```

When the group concludes, 30 seconds after the first log is
received by the `aggregate` processor, the following aggregated log will be shipped to the sink:

```json
{ "sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "port": 80, "status": 200, "bytes": 1000, "http_verb": "GET" }
```

## Other Improvements

In addition to the new features already described, Data Prepper 1.3.0 has a few other improvements.

* Many OpenSearch users set up rolling indexes based on time to help reduce storage costs. You can now configure Data Prepper to use a date and time pattern in your index names for log-based indexes (see the sketch after this list). Data Prepper can also convert index names with date-time patterns from your Logstash configuration files.
* Data Prepper now uses the term “Processor” instead of “Prepper” in pipelines. This disambiguates the Data Prepper product from the processors which provide enrichment and transformation.
* Data Prepper is internally migrating plugins to the new Event model. Once completed, generic processors will be able to work for any Event type including traces. This release includes some work toward that goal allowing some trace sources and sinks to work with Events.
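
A minimal sketch of a date-time pattern in an index name, assuming the `%{...}` pattern syntax from the
OpenSearch sink documentation; the index name itself is a placeholder:

```yaml
sink:
  - opensearch:
      hosts: ["https://opensearch:9200"]
      # Creates daily indexes such as web-logs-2022.03.21
      index: "web-logs-%{yyyy.MM.dd}"
```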

## Looking to the Next Release

This release allows Data Prepper to solve more log use cases for developers and teams.
[Data Prepper 1.4](https://github.com/opensearch-project/data-prepper/milestone/5)
has other important features coming. We’d especially like to highlight the following significant changes.

* Data Prepper will begin to support metrics thanks to a [community contribution](https://github.com/opensearch-project/data-prepper/pull/1154).
* Many users have asked for a way to route different Events to different Sinks. Data Prepper’s [Conditional Routing](https://github.com/opensearch-project/data-prepper/issues/1007) will allow users to route based on our new Data Prepper Expression syntax.

See the [Data Prepper roadmap](https://github.com/opensearch-project/data-prepper/projects/1) for other
upcoming changes. If there are features on the roadmap that interest you, please comment on the
relevant issue to help the team prioritize. You can also request any changes by creating a
[GitHub issue](https://github.com/opensearch-project/data-prepper/issues/new/choose). This project is open source
and we are happy to accept [community contributions](https://github.com/opensearch-project/data-prepper/blob/main/CONTRIBUTING.md).


Binary file added assets/media/authors/ddpowers.jpeg
Binary file added assets/media/authors/tylgry.jpg