Unusual crash on corrupted message #435

Closed
@bschofield

Description

Expected & actual behavior

Somehow, I ended up with a corrupted message (or message batch?) on my production Pulsar cluster. I'm unsure of the source of the corruption: it may have been generated by the Pulsar CGo client I was using, or it may have come from elsewhere.

When using the CGo client, the corruption manifested as consumers reading from the bad topic silently hanging and subsequently being disconnected from the broker. Since the CGo client is now unsupported, I bit the bullet and moved over to the pure Go version. (Massive kudos to you all on keeping the interfaces so similar, by the way.)

Following the move to this client, the pure-go consumers began crashing with the following trace:

panic: runtime error: slice bounds out of range [:1890492169] with capacity 324

goroutine 187 [running]:
github.com/apache/pulsar-client-go/pulsar/internal.(*buffer).Read(0xc000983340, 0xc070ae9f05, 0x14685e0, 0xe80560, 0xc000aac000)
	/home/ben/pkg/mod/github.com/apache/pulsar-client-go@v0.3.0/pulsar/internal/buffer.go:113 +0x6b
github.com/apache/pulsar-client-go/pulsar/internal.(*MessageReader).readSingleMessage(0xc000187cd0, 0x144, 0x0, 0x0, 0x2b, 0xc000aac000, 0x0)
	/home/ben/pkg/mod/github.com/apache/pulsar-client-go@v0.3.0/pulsar/internal/commands.go:145 +0x77
github.com/apache/pulsar-client-go/pulsar/internal.(*MessageReader).ReadMessage(0xc000187cd0, 0xc000136008, 0xc000187cb0, 0x1, 0x1, 0x2b, 0x0)
	/home/ben/pkg/mod/github.com/apache/pulsar-client-go@v0.3.0/pulsar/internal/commands.go:129 +0x5a
github.com/apache/pulsar-client-go/pulsar.(*partitionConsumer).MessageReceived(0xc000326840, 0xc001fd2e80, 0xfd4080, 0xc000982100, 0xa4f001, 0xc0001daf70)
	/home/ben/pkg/mod/github.com/apache/pulsar-client-go@v0.3.0/pulsar/consumer_partition.go:494 +0x31a
github.com/apache/pulsar-client-go/pulsar/internal.(*connection).handleMessage(0xc0001daf00, 0xc001fd2e80, 0xfd4080, 0xc000982100)
	/home/ben/pkg/mod/github.com/apache/pulsar-client-go@v0.3.0/pulsar/internal/connection.go:658 +0x115
github.com/apache/pulsar-client-go/pulsar/internal.(*connection).internalReceivedCommand(0xc0001daf00, 0xc0002701c0, 0xfd4080, 0xc000982100)
	/home/ben/pkg/mod/github.com/apache/pulsar-client-go@v0.3.0/pulsar/internal/connection.go:547 +0x27c
github.com/apache/pulsar-client-go/pulsar/internal.(*connection).run(0xc0001daf00)
	/home/ben/pkg/mod/github.com/apache/pulsar-client-go@v0.3.0/pulsar/internal/connection.go:401 +0x365
github.com/apache/pulsar-client-go/pulsar/internal.(*connection).start.func1(0xc0001daf00)
	/home/ben/pkg/mod/github.com/apache/pulsar-client-go@v0.3.0/pulsar/internal/connection.go:235 +0x72
created by github.com/apache/pulsar-client-go/pulsar/internal.(*connection).start
	/home/ben/pkg/mod/github.com/apache/pulsar-client-go@v0.3.0/pulsar/internal/connection.go:231 +0x3f

Steps to reproduce

Unfortunately, I don't think this is reproducible. I needed to keep the cluster up, so I added some debug statements which identified the bad topic and partition, then cleared the backlog.

I'm reporting the bug for two reasons. Firstly, so that anyone who encounters the same issue in the future finds this and can add more info. Secondly, in case you wish to add some more logic for detecting corrupted messages, so that this issue is detected on the client side without a crash.

System configuration

Pulsar broker: 2.6.1
pulsar-client-go: 0.3.0

Same issue observed on Ubuntu and Alpine.
