Commit v0.0.1
tagomoris committed Dec 23, 2013
1 parent 5f527d1 commit a16561f
Showing 9 changed files with 806 additions and 80 deletions.
29 changes: 10 additions & 19 deletions LICENSE.txt
```diff
@@ -1,22 +1,13 @@
-Copyright (c) 2013 TAGOMORI Satoshi
+Copyright (c) 2012- TAGOMORI Satoshi
 
-MIT License
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
 
-Permission is hereby granted, free of charge, to any person obtaining
-a copy of this software and associated documentation files (the
-"Software"), to deal in the Software without restriction, including
-without limitation the rights to use, copy, modify, merge, publish,
-distribute, sublicense, and/or sell copies of the Software, and to
-permit persons to whom the Software is furnished to do so, subject to
-the following conditions:
+http://www.apache.org/licenses/LICENSE-2.0
 
-The above copyright notice and this permission notice shall be
-included in all copies or substantial portions of the Software.
-
-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
-EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
-MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
-NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
-LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
-OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
-WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
```
145 changes: 128 additions & 17 deletions README.md
# fluent-plugin-bigquery

Fluentd output plugin to load/insert data into Google BigQuery.
* insert data over streaming inserts
  * for continuous real-time insertions, under many limitations
  * https://developers.google.com/bigquery/streaming-data-into-bigquery#usecases
* (NOT IMPLEMENTED) load data
  * for loading large amounts of data as batch jobs
  * https://developers.google.com/bigquery/loading-data-into-bigquery

The current version of this plugin supports the Google API with Service Account authentication, and does not support OAuth.

## Configuration

### Streaming inserts

For service account authentication, generate a service account private key file and note the account's email address, then upload the private key file onto your server.

Configure insert specifications with the target table schema and your credentials. This is the minimum configuration:
```apache
<match dummy>
  type bigquery
  method insert # default

  email xxxxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxx@developer.gserviceaccount.com
  private_key_path /home/username/.keys/00000000000000000000000000000000-privatekey.p12
  # private_key_passphrase notasecret # default

  project yourproject_id
  dataset yourdataset_id
  table   tablename

  time_format %s
  time_field  time

  field_integer time,status,bytes
  field_string  rhost,vhost,path,method,protocol,agent,referer
  field_float   requestime
  field_boolean bot_access,loginsession
</match>
```
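The `field_*` options declare one type per column of each emitted record. As an illustration only (the record shape and values below are made up, not part of this commit), an event matching this schema would look like:

```ruby
# Hypothetical event matching the schema declared in the <match> above.
# time_format %s / time_field time: the time field carries epoch seconds.
record = {
  "time"         => 1387760400,     # field_integer
  "status"       => 200,            # field_integer
  "bytes"        => 512,            # field_integer
  "rhost"        => "203.0.113.7",  # field_string ...
  "vhost"        => "example.com",
  "path"         => "/index.html",
  "method"       => "GET",
  "protocol"     => "HTTP/1.1",
  "agent"        => "curl/7.30.0",
  "referer"      => "-",
  "requestime"   => 0.013,          # field_float
  "bot_access"   => false,          # field_boolean
  "loginsession" => false,          # field_boolean
}

# Each value should carry the Ruby type matching its declared column type,
# so that BigQuery accepts the row on insert.
puts record.size  # => 13
```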

For high-rate inserts over streaming inserts, you should specify flush intervals and buffer chunk options:
```apache
<match dummy>
  type bigquery
  method insert # default

  flush_interval 1                # flush as frequently as possible
  buffer_chunk_records_limit 300  # default rate limit for users is 100
  buffer_queue_limit 10240        # 1MB * 10240 -> 10GB!
  num_threads 16

  email xxxxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxx@developer.gserviceaccount.com
  private_key_path /home/username/.keys/00000000000000000000000000000000-privatekey.p12
  # private_key_passphrase notasecret # default

  project yourproject_id
  dataset yourdataset_id
  tables  accesslog1,accesslog2,accesslog3

  time_format %s
  time_field  time

  field_integer time,status,bytes
  field_string  rhost,vhost,path,method,protocol,agent,referer
  field_float   requestime
  field_boolean bot_access,loginsession
</match>
```

Important options for high-rate events are:

* `tables`
  * two or more tables can be listed with a ',' separator
  * `out_bigquery` shards inserts across these tables
  * all listed tables must have the same schema
* `buffer_chunk_records_limit`
  * the number of records per streaming-inserts API call is limited to 100 per second, per table, by default
  * the default average rate limit is 100, and the spike rate limit is 1000
  * `out_bigquery` flushes the buffer with 100 records per inserts API call
* `buffer_queue_limit`
  * BigQuery streaming inserts need very small buffer chunks
  * for high-rate events, `buffer_queue_limit` should be configured with a large number
  * up to 1GB of memory may be used under network problems with the default configuration:
    `buffer_chunk_limit (default 1MB)` x `buffer_queue_limit (default 1024)`
* `num_threads`
  * threads for parallel insert API calls
  * specify this option for 100 or more records per second
  * 10 or more threads seem good for inserts over the internet
  * fewer threads may be enough for Google Compute Engine instances (which have low latency to BigQuery)
* `flush_interval`
  * `1` is the lowest value without patches, on Fluentd v0.10.41 or earlier
  * see `patches` below
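The 1GB and 10GB figures above are just the product of the two buffer options; a quick check of the arithmetic:

```ruby
# Upper bound on buffered memory: buffer_chunk_limit * buffer_queue_limit.
MB = 1024 * 1024
GB = 1024 * MB

buffer_chunk_limit = 1 * MB  # default 1MB
buffer_queue_limit = 1024    # default

puts (buffer_chunk_limit * buffer_queue_limit) / GB  # => 1  (GB, defaults)
puts (buffer_chunk_limit * 10240) / GB               # => 10 (GB, high-rate example)
```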

### patches

This plugin depends on `fluent-plugin-buffer-lightening`, which includes a monkey-patch module for the BufferedOutput plugin to realize high-rate and low-latency flushing. With this patch, sub-second flushing is available.

To use this feature, execute fluentd with the `-r fluent/plugin/output_try_flush_interval_patch` option, and configure `flush_interval` and `try_flush_interval` with floating point values:

```apache
<match dummy>
  type bigquery
  method insert # default

  flush_interval 0.2
  try_flush_interval 0.05
  buffer_chunk_records_limit 300  # default rate limit for users is 100
  buffer_queue_limit 10240        # 1MB * 10240 -> 10GB!
  num_threads 16

  # credentials, project/dataset/table and schema specs.
</match>
```

With this configuration, flushing will be done within 0.25 seconds of a record's arrival in the worst case.
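The worst-case figure is the sum of the two intervals in the example above: a chunk becomes flushable up to `flush_interval` seconds after its first record, and may then wait one more `try_flush_interval` tick before being picked up (an assumed reading of these settings, not a statement from the plugin docs):

```ruby
flush_interval     = 0.2   # seconds a chunk waits before it is flushable
try_flush_interval = 0.05  # seconds between flush checks

# Worst case: the chunk becomes flushable just after a check, so it waits
# one extra try_flush_interval before the next check notices it.
worst_case = flush_interval + try_flush_interval
puts worst_case  # => 0.25
```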

## TODO

* support Load API
  * with automatically configured flush/buffer options
* support RECORD fields
  * and support optional data fields
* support NULLABLE/REQUIRED/REPEATED field options
* OAuth installed application credentials support
* Google API discovery expiration
* Error classes
* check row size limits
10 changes: 10 additions & 0 deletions Rakefile
```diff
@@ -1 +1,11 @@
+#!/usr/bin/env rake
 require "bundler/gem_tasks"
+
+require 'rake/testtask'
+Rake::TestTask.new(:test) do |test|
+  test.libs << 'lib' << 'test'
+  test.pattern = 'test/**/test_*.rb'
+  test.verbose = true
+end
+
+task :default => :test
```
23 changes: 15 additions & 8 deletions fluent-plugin-bigquery.gemspec
```diff
@@ -1,12 +1,16 @@
 # coding: utf-8
 lib = File.expand_path('../lib', __FILE__)
 $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
+require 'fluent/plugin/bigquery/version'
 
 Gem::Specification.new do |spec|
   spec.name        = "fluent-plugin-bigquery"
-  spec.version     = "0.0.1"
+  spec.version     = Fluent::BigQueryPlugin::VERSION
   spec.authors     = ["TAGOMORI Satoshi"]
   spec.email       = ["tagomoris@gmail.com"]
-  spec.description = %q{TODO: Write a gem description}
-  spec.summary     = %q{TODO: Write a gem summary}
-  spec.homepage    = ""
+  spec.description = %q{Fluentd plugin to store data on Google BigQuery, by load, or by stream inserts}
+  spec.summary     = %q{Fluentd plugin to store data on Google BigQuery}
+  spec.homepage    = "https://github.com/tagomoris/fluent-plugin-bigquery"
+  spec.license     = "APLv2"
 
   spec.files       = `git ls-files`.split($/)
@@ -15,8 +19,11 @@ Gem::Specification.new do |spec|
   spec.require_paths = ["lib"]
 
   spec.add_development_dependency "rake"
-  gem.add_runtime_dependency "fluentd"
-  gem.add_runtime_dependency "fluent-mixin-plaintextformatter", '>= 0.2.1'
-  gem.add_runtime_dependency "fluent-mixin-config-placeholders", ">= 0.2.0"
-  gem.add_runtime_dependency "google-api-client", "~> 0.6.4"
+  spec.add_runtime_dependency "google-api-client", "~> 0.6.4"
+  spec.add_runtime_dependency "fluentd"
+  spec.add_runtime_dependency "fluent-mixin-plaintextformatter", '>= 0.2.1'
+  spec.add_runtime_dependency "fluent-mixin-config-placeholders", ">= 0.2.0"
+  spec.add_runtime_dependency "fluent-plugin-buffer-lightening"
 
   spec.add_development_dependency "fluent-plugin-dummydata-producer"
 end
```