Skip to content

[Feature][Connector-V2][Syslog] Add Syslog source connector with RFC 3164 over TCP support#10900

Draft
MartinLam12 wants to merge 1 commit into
apache:devfrom
MartinLam12:feature/connector-syslog-rfc3164
Draft

[Feature][Connector-V2][Syslog] Add Syslog source connector with RFC 3164 over TCP support#10900
MartinLam12 wants to merge 1 commit into
apache:devfrom
MartinLam12:feature/connector-syslog-rfc3164

Conversation

@MartinLam12
Copy link
Copy Markdown

Purpose

Closes #2649

Adds a new Syslog source connector that receives syslog messages over TCP and parses them according to RFC 3164 (BSD syslog format).

What's Included

  • connector-syslog module under seatunnel-connectors-v2/
  • RFC 3164 parser that extracts structured fields from each message
  • Registered in plugin-mapping.properties and seatunnel-dist
  • Unit tests covering standard messages, PID parsing, single-digit day padding, and malformed input
  • English and Chinese documentation

Output Schema

Each syslog message is parsed into 7 columns:

Column Type Description
facility INT Facility code (0–23) derived from PRI field
severity INT Severity level (0–7) derived from PRI field
timestamp STRING Raw timestamp from the message
hostname STRING Originating host
app_name STRING Application/process name from TAG field
proc_id STRING Process ID from TAG field, empty if absent
message STRING Log message content

Scope (Phase 1)

  • Protocol: TCP only
  • Format: RFC 3164 (BSD syslog)
  • Transport: connector acts as a server, listening on a configured port
  • Output: parsed fields (not raw string)

UDP and RFC 5424 support are deferred to a future phase.

Example Config

source {
  Syslog {
    port = 514
    host = "0.0.0.0"
  }
}

Checklist

  • Spotless formatting applied (./mvnw spotless:apply)
  • Module compiles cleanly (./mvnw -q -DskipTests verify)
  • Unit tests written
  • plugin-mapping.properties updated
  • seatunnel-dist/pom.xml updated
  • Documentation added in docs/en and docs/zh
  • Apache license header on all new files
  • No new external dependencies introduced

🤖 Generated with Claude Code

…3164 over TCP support

Implements a new source connector that listens on a TCP port and receives
syslog messages formatted according to RFC 3164 (BSD syslog). Each message
is parsed into structured columns: facility, severity, timestamp, hostname,
app_name, proc_id, and message.

- connector-syslog module with SyslogSource, SyslogSourceReader, SyslogSourceFactory
- RFC 3164 parser with facility/severity extraction via PRI field
- Registers connector in plugin-mapping.properties and seatunnel-dist
- Unit tests covering standard messages, PID field, single-digit day, and error cases
- English and Chinese documentation

Closes apache#2649

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions github-actions Bot added document core SeaTunnel core module connectors-v2 labels May 18, 2026
@dybyte
Copy link
Copy Markdown
Contributor

dybyte commented May 18, 2026

Could you please enable CI following the instructions?

Copy link
Copy Markdown
Contributor

@dybyte dybyte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add a minimal e2e test for this connector?

Copy link
Copy Markdown
Contributor

@DanielLeens DanielLeens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution. I reviewed the full current head from the runtime path instead of only the parser code. The RFC 3164 parsing itself is a nice start, but I found two blocking issues before this connector is ready to merge.

What problem this PR solves

  • User pain point
    Users may want SeaTunnel to receive RFC 3164 syslog messages directly over TCP instead of tailing files or relying on another collector.
  • Fix approach
    This PR adds a passive TCP syslog source, parser tests, and connector docs.
  • One sentence
    The parser direction is fine, but the current source still advertises a bounded/batch contract that the runtime cannot honor, and it does not yet have executable lifecycle coverage for a real network source.

Simple example: if a batch job starts this connector and waits for the input to finish, there is no predefined end-of-input. The source is just a listening socket. Today the code marks it as BOUNDED in batch mode and stops after the first client disconnects, which is not a real bounded-source contract.

1. Code change review

1.1 Core logic analysis

Runtime chain I checked:

job starts
  -> SyslogSource.getBoundedness()
      -> returns BOUNDED when job.mode = BATCH
  -> SyslogSourceReader.open()
      -> create listening ServerSocket
  -> SyslogSourceReader.pollNext()
      -> accept client connection
      -> processConnection(...)
      -> if bounded -> signalNoMoreElement() and return after the first connection

Key findings:

  1. This is a passive listener source, not a finite snapshot source.
  2. In batch mode there is no real completion boundary to know when all syslog input has arrived.
  3. The current implementation ends the source after the first accepted connection finishes.
  4. The docs currently advertise both batch and stream support, so users are told a runtime contract that the code cannot actually guarantee.

1.2 Compatibility impact

This introduces a user-visible contract mismatch rather than an API break. The current batch claim would mislead users into relying on behavior that is not actually implemented.

1.3 Performance / side effects

The main risk is correctness and job lifecycle semantics, not raw CPU or memory. In batch mode the job can either wait forever for the first client or terminate after just one connection while more senders still exist.

1.4 Error handling and logging

Issue 1: the connector reports BOUNDED in batch mode even though the runtime is a passive TCP listener with no real end-of-input contract

  • Location: seatunnel-connectors-v2/connector-syslog/src/main/java/org/apache/seatunnel/connectors/seatunnel/syslog/source/SyslogSource.java:78-82; seatunnel-connectors-v2/connector-syslog/src/main/java/org/apache/seatunnel/connectors/seatunnel/syslog/source/SyslogSourceReader.java:95-103; docs/en/connectors/source/Syslog.md:11-18
  • Problem description:
    SyslogSource.getBoundedness() switches to BOUNDED when job.mode = BATCH. But SyslogSourceReader is implemented as a long-lived listening socket. There is no finite split, no byte/record boundary, and no protocol marker that tells the source when all syslog producers are done. The current bounded path just accepts one client connection, processes it, then calls signalNoMoreElement() and returns.
  • Potential risk:
    This is a blocking correctness issue on the normal runtime path. A batch job can hang waiting for the first connection forever, or stop after the first sender disconnects while later senders still have data.
  • Best improvement suggestion:
    Option A: make this connector streaming-only for now and remove the batch claim from both runtime metadata and docs.
    Option B: if batch support is really required, define an explicit bounded ingestion contract first (for example, a finite source input artifact rather than a passive socket listener).
  • Severity: High

Issue 2: the current test coverage only validates string parsing; it does not execute the real socket lifecycle of a network source

  • Location: seatunnel-connectors-v2/connector-syslog/src/test/java/org/apache/seatunnel/connectors/seatunnel/syslog/source/SyslogSourceReaderTest.java:25-102
  • Problem description:
    The new tests are all parser-level unit tests. They never open a ServerSocket, never verify accept/read/close behavior, and never cover the boundedness/lifecycle branch that is central to this PR.
  • Potential risk:
    The most important behavior for a network source — connection handling and end-of-input semantics — is still unvalidated.
  • Best improvement suggestion:
    Add at least one minimal integration-style test that starts the reader, pushes a few syslog lines through a real socket, and verifies the lifecycle behavior you want to support.
  • Severity: High

2. Code quality evaluation

2.1 Code style

The parser implementation is straightforward and readable. The blockers are semantic/runtime issues rather than style issues.

2.2 Test coverage and stability

The current parser assertions are deterministic and do not introduce a flaky-test pattern. Stability rating: Stable.

The coverage gap is that the real network lifecycle is not tested at all yet.

2.3 Documentation

docs/en/connectors/source/Syslog.md currently marks both batch and stream as supported. If this connector stays listener-based, docs/zh needs the same correction as well.

3. Architecture

3.1 Solution elegance

Adding a passive socket source is a useful direction, but the connector contract should match what a listener can realistically guarantee.

3.2 Maintainability

The runtime is simple, which is good. Making it "batch" by stopping after the first connection adds surprising lifecycle behavior that will be hard to reason about later.

3.3 Extensibility

Once the connector is honest about being streaming-only, it will be much easier to extend safely.

3.4 Historical compatibility

No historical-upgrade blocker here; the issue is the brand-new contract being inaccurate on first introduction.

4. Issue summary

No. Issue Location Severity
Issue 1 batch / bounded support is advertised but the listener runtime has no real end-of-input contract SyslogSource.java:78-82; SyslogSourceReader.java:95-103; Syslog.md:11-18 High
Issue 2 tests only cover parsing, not the real socket lifecycle SyslogSourceReaderTest.java:25-102 High

5. Merge conclusion

Conclusion: can merge after fixes

  1. Blocking items
  • Issue 1: make the connector contract honest by removing the current fake bounded/batch behavior or redesigning batch semantics explicitly.
  • Issue 2: add executable validation for the real socket lifecycle, not only parser assertions.
  1. Suggested but non-blocking follow-ups
  • The current GitHub Build is action_required because the workflow run was not detected on the fork. Please enable Actions / retrigger CI as suggested by the check-run message.

Overall, the parser foundation is promising, but from the runtime perspective this is not merge-ready yet. A passive syslog listener should not claim a bounded batch contract unless the source can actually define and verify the end of input.

@davidzollo davidzollo added the First-time contributor First-time contributor label May 18, 2026
@MartinLam12 MartinLam12 marked this pull request as draft May 18, 2026 19:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

connectors-v2 core SeaTunnel core module document First-time contributor First-time contributor

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature][syslog-connector] can seatunnel support a syslog-connector

4 participants