nsqd: panic when running two instances with same -data-path #583

Merged
merged 1 commit into from
Oct 4, 2015

mreiferson
Member

panic: runtime error: makeslice: len out of range

goroutine 16 [running]:
github.com/bitly/nsq/nsqd.(*diskQueue).readOne(0xc2080d4160, 0x0, 0x0, 0x0, 0x0, 0x0)
    /Users/cheney/Projects/GoProject/src/github.com/bitly/nsq/nsqd/diskqueue.go:264 +0x56a
github.com/bitly/nsq/nsqd.(*diskQueue).ioLoop(0xc2080d4160)
    /Users/cheney/Projects/GoProject/src/github.com/bitly/nsq/nsqd/diskqueue.go:576 +0x706
created by github.com/bitly/nsq/nsqd.newDiskQueue
    /Users/cheney/Projects/GoProject/src/github.com/bitly/nsq/nsqd/diskqueue.go:92 +0x403

The panic above is caused by

readBuf := make([]byte, msgSize)

I assume msgSize is negative here. Is running multiple instances in the same directory supposed to be possible?
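To illustrate the failure mode: in Go, `make([]byte, n)` panics with `makeslice: len out of range` when `n` is negative, so a length prefix read from a file that another process is rewriting can crash the reader. The following is a minimal sketch of validating the size before allocating. It assumes a 4-byte big-endian signed length prefix; `readRecord`, `decodeRecord`, and `maxMsgSize` are hypothetical names for this example, not nsqd's actual code.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// maxMsgSize is a hypothetical upper bound chosen for this sketch.
const maxMsgSize = 1 << 20

// readRecord reads one length-prefixed record: a 4-byte big-endian signed
// size followed by that many bytes. The size is validated before it is
// passed to make, so corrupt input yields an error instead of a panic.
func readRecord(r io.Reader) ([]byte, error) {
	var msgSize int32
	if err := binary.Read(r, binary.BigEndian, &msgSize); err != nil {
		return nil, err
	}
	if msgSize <= 0 || msgSize > maxMsgSize {
		return nil, fmt.Errorf("invalid message size %d", msgSize)
	}
	buf := make([]byte, msgSize)
	if _, err := io.ReadFull(r, buf); err != nil {
		return nil, err
	}
	return buf, nil
}

// decodeRecord is a convenience wrapper for in-memory data.
func decodeRecord(data []byte) ([]byte, error) {
	return readRecord(bytes.NewReader(data))
}
```

A check like this only turns the crash into an error. It does not make sharing a data path safe.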

@mreiferson
Member

The way diskqueue filenames are currently implemented, it's not going to work.

I don't have strong feelings on this either. It sounds like something we should probably just document? Requiring data paths to be unique isn't much to ask.

Thoughts @jehiah?

@jehiah
Member

jehiah commented May 7, 2015

I'm not sure there is a portable way to solve this, but could we open nsq.%d.dat with an exclusive lock similar to setlock?

@cespare
Contributor

cespare commented May 26, 2015

Easy enough on platforms with flock(2):

https://github.com/cespare/kvcache/blob/master/db.go#L376-L391

@jehiah
Member

jehiah commented May 26, 2015

@cespare cool. Would you be interested in contributing that for the platforms where we can support it?

@mreiferson
Member

I don't think it's as simple as locking on the metadata file - the nsqd would be competing trying to deliver + cleanup any persisted message backlog.

To me, the question is: do we actually want you to be able to point two nsqd at the same data path?

@jehiah
Member

jehiah commented May 26, 2015

Oh, @mreiferson, you mean that because we write to a temp file and then replace the metadata file, locking it is somewhat of a fool's errand.

We don't want to allow two nsqd to point at the same data path. I was thinking of this as a way to ensure that you don't accidentally end up with that happening (fail fast, fail early).

@mreiferson
Member

@jehiah got it - maybe we just explicitly add a lock file to the data path that we can rely on to detect and fail fast?

@chzyer
Author

chzyer commented May 27, 2015

@mreiferson @jehiah
Adding a lock file to the data path is an awesome idea!
But if nsqd crashes, or is killed (e.g. by SIGTERM) in a way that leaves it unable to delete the lock file, does that mean it can't start again until the lock file is deleted manually?
I think the key point is to give each nsqd a name when it starts. The name should be unique among the nsqd instances running on the server, and an nsqd started with the same configuration should always get the same name (e.g. {TcpPort}:{HttpPort}).
That way, if the configuration hasn't changed, nsqd can detect that the lock file belongs to it.

@cespare
Contributor

cespare commented May 27, 2015

You create the file if it doesn't exist, and then try to lock it. You don't need to delete it.

When nsqd exits (maybe because it crashes), the flock is released.

@cespare
Contributor

cespare commented May 27, 2015

(Also if you have a dir to put the lock file, you might as well just lock the dir instead.)

@chzyer
Author

chzyer commented May 27, 2015

👍 cool :)

@mreiferson
Member

👍 to locking dir

@mreiferson
Member

RFR @jehiah

Windows isn't implemented - if someone who's running on Windows wants to contribute those code paths that would be great!

}
n.swapOpts(opts)

err := n.dl.Lock()
if err != nil {
    n.logf("FATAL: --data-path=%s in use", dataPath)
Member

I think it would be helpful if this error gave a little bit of a hint that it's in use by another nsqd, or that the resolution is to use a different/unique dirpath.

Thoughts?

Member Author

sure

@mreiferson
Member

updated

jehiah added a commit that referenced this pull request Oct 4, 2015
nsqd: panic when running two instances with same -data-path
@jehiah jehiah merged commit 5a36653 into nsqio:master Oct 4, 2015
@mreiferson mreiferson deleted the data_path_lock_583 branch October 4, 2015 20:59
@pacoguzman

I'm getting:

nsqd --lookupd-tcp-address=localhost:4160 --data-path=/data
[nsqd] 2017/04/21 16:18:59.356983 FATAL: --data-path=/data in use (possibly by another instance of nsqd)

This is on a macOS machine, but that's the only process starting nsqd. Could someone help me set up a directory to store disk-backed messages?

@ploxiln
Member

ploxiln commented Apr 21, 2017

Please ask this kind of question at https://groups.google.com/d/forum/nsq-users


6 participants