This repository has been archived by the owner on Sep 17, 2019. It is now read-only.

Batch inserts #110

Open
rgr78 opened this issue Jan 30, 2018 · 16 comments

Comments

@rgr78

rgr78 commented Jan 30, 2018

Hi.

I noticed the plugin issues one insert statement per event.

Would it be possible for the plugin to implement batch inserts? It would increase insert throughput and reduce server load considerably for large workloads.

Maybe something like this: https://www.tutorialspoint.com/jdbc/jdbc-batch-processing.htm
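
For illustration, this is roughly what batching looks like through the plain JDBC API — a minimal sketch only; the connection string, table, and column names here are made up, not taken from the plugin:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchInsertSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details and schema, purely for illustration.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/logs", "user", "pass")) {
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO events (ts, message) VALUES (?, ?)")) {
                for (int i = 0; i < 1000; i++) {
                    ps.setLong(1, System.currentTimeMillis());
                    ps.setString(2, "event " + i);
                    ps.addBatch();      // queue the insert instead of executing it now
                }
                ps.executeBatch();      // one round trip for the whole batch
                conn.commit();
            }
        }
    }
}
```

The gain comes from sending the whole batch in one round trip (and one commit) instead of one statement per event.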

@theangryangel
Owner

theangryangel commented Jan 30, 2018

The original version of the plugin did exactly this (back in the logstash 1.4 era).

The problem was that if one insert in the batch failed, the entire batch would fail, and there was no way to figure out which statement in the batch was at fault. It turned out that a lot of people weren't able to write statements that would never fail, or they had inputs which weren't always providing the data they expected. This resulted in a loss of events.

Unless I'm mistaken, there's no way to deal with this other than the way the plugin functions now?

@rgr78
Author

rgr78 commented Jan 30, 2018

Yes, but that depends on the use case. In some use cases it may be important to know which event caused the batch to fail; in others, throughput may be more important. Or maybe errors are already being ignored (with something like INSERT IGNORE).

Shouldn't that decision be left to the user? The batch insert feature could be optional and disabled by default.

As for dealing with batch failures and pinpointing the culprit, what about this approach: https://vladmihalcea.com/how-to-find-which-statement-failed-in-a-jdbc-batch-update/
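
As I understand it, that article boils down to inspecting the BatchUpdateException the driver throws — a sketch of the idea, assuming a driver that reports per-statement results:

```java
import java.sql.BatchUpdateException;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;

public class BatchFailureSketch {
    // Executes a queued batch and prints the index of every statement that failed.
    static void executeAndReport(PreparedStatement ps) throws SQLException {
        try {
            ps.executeBatch();
        } catch (BatchUpdateException e) {
            // One entry per statement the driver actually processed.
            int[] counts = e.getUpdateCounts();
            for (int i = 0; i < counts.length; i++) {
                if (counts[i] == Statement.EXECUTE_FAILED) {
                    System.err.println("Statement at batch index " + i + " failed");
                }
            }
        }
    }
}
```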

@theangryangel
Owner

theangryangel commented Jan 30, 2018

Ah it's starting to come back to me now.

I did start to implement something, and I found that the JDBC drivers I was testing with behaved differently when it came to handling failures: some stopped processing immediately with an error, and some carried on. This made the retry logic a pain in the arse to deal with, and I decided to give up at that point, as retrying was far more important at the time.

In theory there's nothing wrong with implementing batch support as a switch/option. For support purposes I'd prefer to leave it off by default, based on my previous experience. It's unlikely I'll get around to implementing this any time soon, as it's not something I need at work (our rate of events in the JDBC output is really low) and my personal time is limited at the moment.

If you'd like to work on a patch I'll happily accept it, after review. Just let me know if you want to start on it. Otherwise it'll likely be at least a few weeks or months before I get around to looking at it.

@rgr78
Author

rgr78 commented Jan 31, 2018

I see what you mean.

For now, our use case is much slower than it could be (about 10M–20M records), but since the total time is still tolerable, we won't start working on this immediately either. If we do, we'll let you know.

Thanks!

@theangryangel
Owner

No problem 👍 I'll keep the issue open, if you don't mind? That way I won't forget about this as a feature to look into in the future.

@rgr78
Author

rgr78 commented Jan 31, 2018

Sure. 👍

@phr0gz

phr0gz commented Apr 17, 2018

Hello, just to let you know that I'm also inserting a lot of data into MS SQL, and I have the same issue. 👍

@rkhapre

rkhapre commented Jan 6, 2019

Need batch inserts for sure. +1

@theangryangel
Owner

Pull requests are welcome. It's still unlikely I'm going to implement this any time soon: I have no need for this feature at work, and I can't justify the time at home as I'm flat out on other things.

@ankitgoel154

Is there any way we can increase the pipeline worker thread count to achieve the same thing until the enhancement is done?

@theangryangel
Owner

Increase the number of Logstash workers (and make sure the connection pool size is the number of workers + 1) and it should increase the output rate, as long as your SQL server can keep up. Or at least that's what Logstash used to do; I haven't needed to tweak this myself and I haven't kept on top of it.
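
For example, something along these lines — a sketch only: pipeline.workers is standard Logstash configuration, and max_pool_size is the pool option this plugin's README describes (adjust the numbers to your hardware):

```
# logstash.yml
pipeline.workers: 8
```

```
# pipeline output config (connection string, table, and columns are placeholders)
output {
  jdbc {
    connection_string => "jdbc:mysql://localhost/logs?user=user&password=pass"
    statement => [ "INSERT INTO events (ts, message) VALUES (?, ?)", "@timestamp", "message" ]
    max_pool_size => 9   # workers + 1
  }
}
```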

@ibspoof

ibspoof commented Apr 12, 2019

I had a need for batching using both safe and unsafe statements, with retry support. The changes have been added to my fork at https://github.com/ibspoof/logstash-output-jdbc/tree/add_batching. If my implementation looks good I can file an MR.

@theangryangel
Owner

theangryangel commented Apr 12, 2019

Edit: thank you for your work so far @ibspoof

Looks good on paper, but I've not tried it out (I'm away from home and the office today). The only thing that's probably missing is some specs, but if you wish I can probably add some. If you want to have a go, feel free.

Whenever you're ready, file the MR. Regardless, I'll try to find some time on Monday to have a quick play and test.

@ibspoof

ibspoof commented Apr 12, 2019

I can add specs this weekend and submit.

@vector4wang

Looking forward to batch inserts ~~~

@theangryangel
Owner

theangryangel commented Jun 28, 2019

The main problem I see with the batch implementation, @ibspoof, is that you're always putting all of the batch_events back on the retry list. I was going to merge it straight in until I noticed this :/

Depending on the errors, you're either going to end up duplicating a bunch of events or, if it's the first statement that fails, potentially not inserting any at all. I believe you need to account for this by dealing with getUpdateCounts and looping through getNextException. From what I recall, from the depths of my memory, when I tried to do this for 1.x I had real problems with JDBC drivers behaving differently, and I just gave up as I didn't need it myself.
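
For anyone following along, a rough sketch of what accounting for that might look like at the JDBC level (the plugin itself is JRuby; the method and variable names here are made up for illustration):

```java
import java.sql.BatchUpdateException;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class BatchRetrySketch {
    // Returns the batch indices that should go back on the retry list,
    // instead of re-queueing every event in the batch.
    static List<Integer> failedIndices(PreparedStatement ps, int batchSize) throws SQLException {
        List<Integer> retry = new ArrayList<>();
        try {
            ps.executeBatch();
        } catch (BatchUpdateException e) {
            int[] counts = e.getUpdateCounts();
            for (int i = 0; i < counts.length; i++) {
                if (counts[i] == Statement.EXECUTE_FAILED) {
                    retry.add(i);                   // this statement definitely failed
                }
            }
            // Drivers that stop at the first failure return a short array:
            // everything past counts.length was never attempted, so retry it too.
            for (int i = counts.length; i < batchSize; i++) {
                retry.add(i);
            }
            // Walk the chained exceptions to log why each statement failed.
            for (SQLException next = e.getNextException(); next != null; next = next.getNextException()) {
                System.err.println("Batch failure: " + next.getMessage());
            }
        }
        return retry;
    }
}
```

The awkward part is exactly the driver difference mentioned above: a short getUpdateCounts() array means the rest of the batch was never attempted, while a full-length array means the driver carried on past the failures.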

If you don't need/want to worry about the above, are you happy if I take your work and do something with it?
