Commit v0.0.1
tagomoris committed Dec 23, 2013
1 parent 5f527d1 commit a16561f
Showing 9 changed files with 806 additions and 80 deletions.
29 changes: 10 additions & 19 deletions LICENSE.txt
```diff
@@ -1,22 +1,13 @@
-Copyright (c) 2013 TAGOMORI Satoshi
+Copyright (c) 2012- TAGOMORI Satoshi
 
-MIT License
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
 
-Permission is hereby granted, free of charge, to any person obtaining
-a copy of this software and associated documentation files (the
-"Software"), to deal in the Software without restriction, including
-without limitation the rights to use, copy, modify, merge, publish,
-distribute, sublicense, and/or sell copies of the Software, and to
-permit persons to whom the Software is furnished to do so, subject to
-the following conditions:
+http://www.apache.org/licenses/LICENSE-2.0
 
-The above copyright notice and this permission notice shall be
-included in all copies or substantial portions of the Software.
-
-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
-EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
-MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
-NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
-LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
-OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
-WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
```
145 changes: 128 additions & 17 deletions README.md
# fluent-plugin-bigquery

Fluentd output plugin to load/insert data into Google BigQuery.
* insert data over streaming inserts
  * for continuous real-time insertions, under many limitations
  * https://developers.google.com/bigquery/streaming-data-into-bigquery#usecases
* (NOT IMPLEMENTED) load data
  * for loading large amounts of data as batch jobs
  * https://developers.google.com/bigquery/loading-data-into-bigquery

The current version of this plugin supports the Google API with Service Account authentication, and does not support OAuth.

## Configuration

### Streaming inserts

For service account authentication, generate a service account private key file and note the account's email address, then upload the private key file onto your server.

Configure insert specifications with the target table schema and your credentials. This is the minimum configuration:
```apache
<match dummy>
  type bigquery
  method insert # default

  email xxxxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxx@developer.gserviceaccount.com
  private_key_path /home/username/.keys/00000000000000000000000000000000-privatekey.p12
  # private_key_passphrase notasecret # default

  project yourproject_id
  dataset yourdataset_id
  table   tablename

  time_format %s
  time_field  time

  field_integer time,status,bytes
  field_string  rhost,vhost,path,method,protocol,agent,referer
  field_float   requestime
  field_boolean bot_access,loginsession
</match>
```
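The `field_*` options declare one type per column of each emitted record. As an illustration only (the record shape and values below are made up, not part of this commit), an event matching this schema would look like:

```ruby
# Hypothetical event matching the schema declared in the <match> above.
# time_format %s / time_field time: the time field carries epoch seconds.
record = {
  "time"         => 1387760400,     # field_integer
  "status"       => 200,            # field_integer
  "bytes"        => 512,            # field_integer
  "rhost"        => "203.0.113.7",  # field_string ...
  "vhost"        => "example.com",
  "path"         => "/index.html",
  "method"       => "GET",
  "protocol"     => "HTTP/1.1",
  "agent"        => "curl/7.30.0",
  "referer"      => "-",
  "requestime"   => 0.013,          # field_float
  "bot_access"   => false,          # field_boolean
  "loginsession" => false,          # field_boolean
}

# Each value should carry the Ruby type matching its declared column type,
# so that BigQuery accepts the row on insert.
puts record.size  # => 13
```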

For high-rate inserts over streaming inserts, you should specify flush intervals and buffer chunk options:
```apache
<match dummy>
  type bigquery
  method insert # default

  flush_interval 1                # flush as frequently as possible
  buffer_chunk_records_limit 300  # default rate limit for users is 100
  buffer_queue_limit 10240        # 1MB * 10240 -> 10GB!
  num_threads 16

  email xxxxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxx@developer.gserviceaccount.com
  private_key_path /home/username/.keys/00000000000000000000000000000000-privatekey.p12
  # private_key_passphrase notasecret # default

  project yourproject_id
  dataset yourdataset_id
  tables  accesslog1,accesslog2,accesslog3

  time_format %s
  time_field  time

  field_integer time,status,bytes
  field_string  rhost,vhost,path,method,protocol,agent,referer
  field_float   requestime
  field_boolean bot_access,loginsession
</match>
```

Important options for high-rate events are:

* `tables`
  * two or more tables can be listed with a ',' separator
  * `out_bigquery` shards inserts across these tables
  * all listed tables must have the same schema
* `buffer_chunk_records_limit`
  * the number of records per streaming-inserts API call is limited to 100 per second, per table, by default
  * the default average rate limit is 100, and the spike rate limit is 1000
  * `out_bigquery` flushes the buffer with 100 records per inserts API call
* `buffer_queue_limit`
  * BigQuery streaming inserts need very small buffer chunks
  * for high-rate events, `buffer_queue_limit` should be configured with a large number
  * up to 1GB of memory may be used under network problems with the default configuration:
    `buffer_chunk_limit (default 1MB)` x `buffer_queue_limit (default 1024)`
* `num_threads`
  * threads for parallel insert API calls
  * specify this option for 100 or more records per second
  * 10 or more threads seem good for inserts over the internet
  * fewer threads may be enough for Google Compute Engine instances (which have low latency to BigQuery)
* `flush_interval`
  * `1` is the lowest value without patches, on Fluentd v0.10.41 or earlier
  * see `patches` below
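The 1GB and 10GB figures above are just the product of the two buffer options; a quick check of the arithmetic:

```ruby
# Upper bound on buffered memory: buffer_chunk_limit * buffer_queue_limit.
MB = 1024 * 1024
GB = 1024 * MB

buffer_chunk_limit = 1 * MB  # default 1MB
buffer_queue_limit = 1024    # default

puts (buffer_chunk_limit * buffer_queue_limit) / GB  # => 1  (GB, defaults)
puts (buffer_chunk_limit * 10240) / GB               # => 10 (GB, high-rate example)
```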

### patches

This plugin depends on `fluent-plugin-buffer-lightening`, which includes a monkey-patch module for the BufferedOutput plugin to realize high-rate and low-latency flushing. With this patch, sub-second flushing is available.

To use this feature, execute fluentd with the `-r fluent/plugin/output_try_flush_interval_patch` option, and configure `flush_interval` and `try_flush_interval` with floating point values:

```apache
<match dummy>
  type bigquery
  method insert # default

  flush_interval 0.2
  try_flush_interval 0.05
  buffer_chunk_records_limit 300  # default rate limit for users is 100
  buffer_queue_limit 10240        # 1MB * 10240 -> 10GB!
  num_threads 16

  # credentials, project/dataset/table and schema specs.
</match>
```

With this configuration, flushing will be done within 0.25 seconds of a record's arrival in the worst case.
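The worst-case figure is the sum of the two intervals in the example above: a chunk becomes flushable up to `flush_interval` seconds after its first record, and may then wait one more `try_flush_interval` tick before being picked up (an assumed reading of these settings, not a statement from the plugin docs):

```ruby
flush_interval     = 0.2   # seconds a chunk waits before it is flushable
try_flush_interval = 0.05  # seconds between flush checks

# Worst case: the chunk becomes flushable just after a check, so it waits
# one extra try_flush_interval before the next check notices it.
worst_case = flush_interval + try_flush_interval
puts worst_case  # => 0.25
```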

## TODO

* support Load API
  * with automatically configured flush/buffer options
* support RECORD fields
  * and support optional data fields
* support NULLABLE/REQUIRED/REPEATED field options
* OAuth installed application credentials support
* Google API discovery expiration
* Error classes
* check row size limits
10 changes: 10 additions & 0 deletions Rakefile
```diff
@@ -1 +1,11 @@
+#!/usr/bin/env rake
 require "bundler/gem_tasks"
+
+require 'rake/testtask'
+Rake::TestTask.new(:test) do |test|
+  test.libs << 'lib' << 'test'
+  test.pattern = 'test/**/test_*.rb'
+  test.verbose = true
+end
+
+task :default => :test
```
23 changes: 15 additions & 8 deletions fluent-plugin-bigquery.gemspec
```diff
@@ -1,12 +1,16 @@
 # coding: utf-8
 lib = File.expand_path('../lib', __FILE__)
 $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
+require 'fluent/plugin/bigquery/version'
 
 Gem::Specification.new do |spec|
   spec.name        = "fluent-plugin-bigquery"
-  spec.version     = "0.0.1"
+  spec.version     = Fluent::BigQueryPlugin::VERSION
   spec.authors     = ["TAGOMORI Satoshi"]
   spec.email       = ["tagomoris@gmail.com"]
-  spec.description = %q{TODO: Write a gem description}
-  spec.summary     = %q{TODO: Write a gem summary}
-  spec.homepage    = ""
+  spec.description = %q{Fluentd plugin to store data on Google BigQuery, by load, or by stream inserts}
+  spec.summary     = %q{Fluentd plugin to store data on Google BigQuery}
+  spec.homepage    = "https://github.com/tagomoris/fluent-plugin-bigquery"
+  spec.license     = "APLv2"
 
   spec.files       = `git ls-files`.split($/)
@@ -15,8 +19,11 @@ Gem::Specification.new do |spec|
   spec.require_paths = ["lib"]
 
   spec.add_development_dependency "rake"
-  gem.add_runtime_dependency "fluentd"
-  gem.add_runtime_dependency "fluent-mixin-plaintextformatter", '>= 0.2.1'
-  gem.add_runtime_dependency "fluent-mixin-config-placeholders", ">= 0.2.0"
-  gem.add_runtime_dependency "google-api-client", "~> 0.6.4"
+  spec.add_runtime_dependency "google-api-client", "~> 0.6.4"
+  spec.add_runtime_dependency "fluentd"
+  spec.add_runtime_dependency "fluent-mixin-plaintextformatter", '>= 0.2.1'
+  spec.add_runtime_dependency "fluent-mixin-config-placeholders", ">= 0.2.0"
+  spec.add_runtime_dependency "fluent-plugin-buffer-lightening"
 
   spec.add_development_dependency "fluent-plugin-dummydata-producer"
 end
```