title: TiDB-Binlog user guide
category: tool

TiDB-Binlog User Guide

This document describes how to deploy the Kafka version of TiDB-Binlog. If you need to deploy the local version of TiDB-Binlog, see the TiDB-Binlog user guide for the local version.

About TiDB-Binlog

TiDB-Binlog is a tool for enterprise users that collects the binlog generated by TiDB and provides real-time backup and synchronization.

TiDB-Binlog supports the following scenarios:

  • Data synchronization: to synchronize TiDB cluster data to other databases
  • Real-time backup and recovery: to back up TiDB cluster data, and recover in case of cluster outages

TiDB-Binlog architecture

The TiDB-Binlog architecture is as follows:

(Figure: TiDB-Binlog architecture)

The TiDB-Binlog cluster mainly consists of three components:

Pump

Pump is a daemon that runs in the background on each TiDB host. Its main function is to record the binlog generated by TiDB in real time and write it to disk sequentially.

Drainer

Drainer collects binlog data from each Pump node, converts it into SQL statements compatible with the specified target database in the commit order of the TiDB transactions, and then either synchronizes the data to the target database or writes it to files sequentially.

Kafka & Zookeeper

The Kafka cluster stores the binlog data written by Pump and provides the binlog data to Drainer for reading.

Note: In the local version of TiDB-Binlog, the binlog is stored in files, while in the latest version, the binlog is stored using Kafka.

Install TiDB-Binlog

Download Binary for the CentOS 7.3+ platform

# Download the tool package.
wget http://download.pingcap.org/tidb-binlog-latest-linux-amd64.tar.gz
wget http://download.pingcap.org/tidb-binlog-latest-linux-amd64.sha256

# Check the file integrity. If the result is OK, the file is correct. 
sha256sum -c tidb-binlog-latest-linux-amd64.sha256

# Extract the package.
tar -xzf tidb-binlog-latest-linux-amd64.tar.gz
cd tidb-binlog-latest-linux-amd64
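
After extraction, you can optionally confirm that the Pump and Drainer binaries run by printing their version information with the -V flag described later in this document (a minimal sketch; the exact output format may vary between releases):

# Print the version information of Pump and Drainer.
./bin/pump -V
./bin/drainer -V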

Deploy TiDB-Binlog

Note

  • You need to deploy a Pump for each TiDB server in the TiDB cluster. Currently, the TiDB server only supports writing the binlog through a UNIX socket.

  • When you deploy a Pump manually, to start the service, follow the order of Pump -> TiDB; to stop the service, follow the order of TiDB -> Pump.

    Set the TiDB startup parameter binlog-socket to the unix socket file path specified by the corresponding socket parameter of Pump. The final deployment architecture is as follows:

    (Figure: TiDB Pump deployment architecture)

  • Drainer does not support the rename DDL operation on tables in the ignored schemas (schemas in the filter list).

  • To start Drainer on an existing TiDB cluster, you usually need to make a full backup, get the savepoint, import the full backup into the target system, and then start Drainer to synchronize from the savepoint (see the sketch after this list).

    To guarantee the integrity of data, perform the following operations 10 minutes after Pump is started:

    • Use the generate_binlog_position tool to generate the Drainer savepoint file. The tool is included in the tidb-tools project; see its README for usage.

    • Do a full backup. For example, back up TiDB using mydumper.

    • Import the full backup to the target system.

    • Set the file path of the savepoint and start Drainer:

      bin/drainer --config=conf/drainer.toml --data-dir=${drainer_savepoint_dir}
      
  • If Drainer outputs pb files, you need to set the following parameters in the configuration file:

    [syncer]
    db-type = "pb"
    disable-dispatch = true
    
    [syncer.to]
    dir = "/path/pb-dir"
  • Before you deploy TiDB-Binlog, install the Kafka and Zookeeper cluster and pay attention to the following items:

    • Make sure that the Kafka version is 0.9 or later.
    • It is required to set the parameter auto.create.topics.enable=true.
    • It is recommended to deploy Kafka and Zookeeper on 3 to 5 servers.
    • The size of the disk space depends on the business data volume.
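
The following is a minimal sketch of the savepoint and full backup workflow described above, assuming the generate_binlog_position binary has been built from the tidb-tools project and that mydumper and loader (or another import tool) are available. The generate_binlog_position flags and the directory paths shown here are assumptions; check the tool's README and adjust them to your environment.

# 1. Wait at least 10 minutes after Pump is started, then generate the Drainer savepoint file.
#    (The flag names below are assumptions; see the README of generate_binlog_position in tidb-tools.)
./generate_binlog_position -pd-urls="http://127.0.0.1:2379" -data-dir=${drainer_savepoint_dir}

# 2. Make a full backup of the TiDB cluster, for example with mydumper.
mydumper -h 127.0.0.1 -P 4000 -u root -o ./full-backup

# 3. Import the full backup into the target system, for example with loader for a MySQL target.
loader -h ${target_host} -P 3306 -u root -d ./full-backup

# 4. Start Drainer with the savepoint directory so that synchronization starts from the savepoint.
bin/drainer --config=conf/drainer.toml --data-dir=${drainer_savepoint_dir}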

Deploy Pump using TiDB-Ansible

  • If you have not deployed a Kafka cluster, use Kafka-Ansible to deploy one.
  • When you deploy the TiDB cluster using TiDB-Ansible, edit the tidb-ansible/inventory.ini file, set enable_binlog = True, and set the zookeeper_addrs variable to the Zookeeper address of the Kafka cluster. In this way, Pump is deployed while you deploy the TiDB cluster.

Configuration example:

# binlog trigger
enable_binlog = True
# zookeeper address of kafka cluster, example:
# zookeeper_addrs = "192.168.0.11:2181,192.168.0.12:2181,192.168.0.13:2181"
zookeeper_addrs = "192.168.0.11:2181,192.168.0.12:2181,192.168.0.13:2181"
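
After editing inventory.ini, Pump is deployed and started together with the TiDB cluster by the usual TiDB-Ansible workflow; a sketch, assuming the standard playbook names in tidb-ansible:

# Run from the tidb-ansible directory.
ansible-playbook deploy.yml
ansible-playbook start.yml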

Deploy Pump using Binary

  1. Description of Pump command line arguments

    Usage of Pump:
    -L string
        log level: debug, info, warn, error, fatal (default "info")
    -V
        to print Pump version info
    -addr string
        the RPC address that Pump provides service (default "127.0.0.1:8250")
    -advertise-addr string
        the RPC address that Pump provides external service
    -config string
        the file path of the Pump configuration file; if you specify a configuration file, Pump reads it first; if the same option is also set on the command line, Pump uses the command line value to override the one in the configuration file
    -data-dir string
        the path of storing Pump data
    -kafka-addrs string
        the connected Kafka address (default "127.0.0.1:9092")
    -zookeeper-addrs string
        the Zookeeper address; if this option is set, the Kafka address is obtained from Zookeeper; otherwise, the value of kafka-addrs is used
    -gc int
        the maximum number of days that the binlog is retained (default 7); 0 means the binlog is retained permanently
    -heartbeat-interval int
        the interval between heartbeats that Pump sends to PD (unit: second)
    -log-file string
        the path of the log file
    -log-rotate string
        the log file rotating frequency (hour/day)
    -metrics-addr string
        the Prometheus Pushgateway address; leaving it empty disables Prometheus push
    -metrics-interval int
        the frequency of reporting monitoring information (default 15, unit: second)
    -pd-urls string
        the node address of the PD cluster (default "http://127.0.0.1:2379")
    -socket string
        the listening address of the unix socket service (default "unix:///tmp/pump.sock")
    
  2. Pump configuration file

    # Pump configuration.
    # the RPC address that Pump provides service
    addr = "127.0.0.1:8250"
    
    # the RPC address that Pump provides external service
    advertise-addr = ""
    
    # an integer value to control the expiry of the binlog data; it indicates how long (in days) the binlog data is stored
    # (a value of 0 means the binlog data is never removed)
    gc = 7
    
    # the path of storing Pump data
    data-dir = "data.pump"
    
    # the connected Kafka address (default "127.0.0.1:9092")
    kafka-addrs = "127.0.0.1:9092"
    
    # the Zookeeper address; if this option is set, the Kafka address is obtained from Zookeeper; otherwise, the value of kafka-addrs is used
    zookeeper-addrs = "127.0.0.1:2181"
    
    # the interval between heartbeats that Pump sends to PD (unit: second)
    heartbeat-interval = 3
    
    # the node address of the PD cluster (default "http://127.0.0.1:2379")
    pd-urls = "http://127.0.0.1:2379"
    
    # the listening address of the unix socket service (default "unix:///tmp/pump.sock")
    socket = "unix:///tmp/pump.sock"
  3. Startup example

    ./bin/pump -config pump.toml
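
After Pump is started, start the TiDB server with the binlog-socket startup parameter pointing at the socket that Pump listens on, following the Pump -> TiDB startup order noted in the deployment notes above. This is a sketch: the tidb-server binary path and its exact flags depend on your TiDB version and deployment.

# Start Pump first (for example in the background or under a process supervisor),
# then start TiDB with binlog-socket set to Pump's socket path (default unix:///tmp/pump.sock).
./bin/pump -config pump.toml &
./bin/tidb-server --binlog-socket=/tmp/pump.sock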

Deploy Drainer using Binary

  1. Description of Drainer command line arguments

    Usage of Drainer:
    -L string
        log level: debug, info, warn, error, fatal (default "info")
    -V
        to print Drainer version info
    -addr string
        the address that Drainer provides service (default "127.0.0.1:8249")
    -c int
        the number of concurrent threads that synchronize data to the downstream; a bigger value means better throughput performance (default 1)
    -config string
        the file path of the Drainer configuration file; if you specify a configuration file, Drainer reads it first; if the same option is also set on the command line, Drainer uses the command line value to override the one in the configuration file
    -data-dir string
        the path of storing Drainer data (default "data.drainer")
    -kafka-addrs string
        the connected Kafka address (default "127.0.0.1:9092")
    -zookeeper-addrs string
        the Zookeeper address; if this option is set, the Kafka address is obtained from Zookeeper; otherwise, the value of kafka-addrs is used
    -dest-db-type string
        the downstream service type of Drainer (default "mysql")
    -detect-interval int
        the interval of detecting Pump's status from PD (default 10, unit: second)
    -disable-dispatch
        whether to disable dispatching the SQL statements in a single binlog; if set to true, each binlog is restored into a single transaction and synchronized in the order of the binlogs (if the downstream service type is "mysql", set the value to false)
    -gen-savepoint
        if set to true, only the savepoint meta file of Drainer is generated; it can be used together with mydumper
    -ignore-schemas string
        the DB filtering list (default "INFORMATION_SCHEMA,PERFORMANCE_SCHEMA,mysql,test"); the rename DDL operation is not supported on tables in the ignored schemas
    -log-file string
        the path of the log file
    -log-rotate string
        the log file rotating frequency (hour/day)
    -metrics-addr string
        the Prometheus Pushgateway address; leaving it empty disables Prometheus push
    -metrics-interval int
        the frequency of reporting monitoring information (default 15, unit: second)
    -pd-urls string
        the node address of the PD cluster (default "http://127.0.0.1:2379")
    -txn-batch int
        the number of SQL statements in a single transaction that is output to the downstream database (default 1)
    
  2. Drainer configuration file

    # Drainer configuration
    
    # the address that Drainer provides service (default "127.0.0.1:8249")
    addr = "127.0.0.1:8249"
    
    # the interval of detecting Pump's status from PD (default 10, unit: second)
    detect-interval = 10
    
    # the path of storing Drainer data (default "data.drainer")
    data-dir = "data.drainer"
    
    # the connected Kafka address (default "127.0.0.1:9092")
    kafka-addrs = "127.0.0.1:9092"
    
    # the Zookeeper address; if this option is set, the Kafka address is obtained from Zookeeper; otherwise, the value of kafka-addrs is used
    zookeeper-addrs = "127.0.0.1:2181"
    
    # the node address of the PD cluster (default "http://127.0.0.1:2379")
    pd-urls = "http://127.0.0.1:2379"
    
    # the path of the log file
    log-file = "drainer.log"
    
    # Syncer configuration.
    [syncer]
    
    # the DB filtering list (default "INFORMATION_SCHEMA,PERFORMANCE_SCHEMA,mysql,test")
    # the rename DDL operation is not supported on tables in the ignored schemas
    ignore-schemas = "INFORMATION_SCHEMA,PERFORMANCE_SCHEMA,mysql"
    
    # the number of SQL statements in a single transaction that is output to the downstream database (default 1)
    txn-batch = 1
    
    # the number of concurrent threads that synchronize data to the downstream; a bigger value means better throughput performance (default 1)
    worker-count = 1
    
    # whether to disable dispatching the SQL statements in a single binlog;
    # if set to true, each binlog is restored into a single transaction and synchronized in the order of the binlogs (if the downstream service type is "mysql", set the value to false)
    disable-dispatch = false
    
    # the downstream service type of Drainer (default "mysql")
    # valid values: "mysql", "pb"
    db-type = "mysql"
    
    # replicate-do-db has priority over replicate-do-table if they have the same db name.
    # Regular expressions are supported; a regular expression must start with '~'.
    
    # replicate-do-db = ["~^b.*","s1"]
    
    # [[syncer.replicate-do-table]]
    # db-name ="test"
    # tbl-name = "log"
    
    # [[syncer.replicate-do-table]]
    # db-name ="test"
    # tbl-name = "~^a.*"
    
    # server parameters of the downstream database when the db-type is set to "mysql"
    [syncer.to]
    host = "127.0.0.1"
    user = "root"
    password = ""
    port = 3306
    
    # the directory of the binlog file when the db-type is set to "pb"
    # [syncer.to]
    # dir = "data.drainer"
  3. Startup example

    ./bin/drainer -config drainer.toml
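
If you want Drainer to write the binlog to pb files instead of synchronizing to MySQL, the following sketch shows a minimal configuration and startup. It relies on the default Kafka, Zookeeper, and PD addresses listed in the configuration above; the file name drainer-pb.toml and the directory /path/pb-dir are only illustrative.

# Write a minimal Drainer configuration for pb output (see the pb-related parameters above).
cat > drainer-pb.toml << 'EOF'
[syncer]
db-type = "pb"
disable-dispatch = true

[syncer.to]
dir = "/path/pb-dir"
EOF

# Start Drainer with the pb configuration.
./bin/drainer -config drainer-pb.toml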

Monitor TiDB-Binlog

This section introduces how to monitor TiDB-Binlog's status and performance, and display the metrics using Prometheus and Grafana.

Configure Pump/Drainer

For the Pump service deployed using Ansible, set the metrics options in its startup parameters.

When you start Drainer, set the --metrics-addr and --metrics-interval parameters. Set --metrics-addr to the address of the Pushgateway. Set --metrics-interval to the push frequency (default: 15 seconds).
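
A sketch of starting Drainer with the metrics parameters, assuming a Pushgateway is reachable at 127.0.0.1:9091 (the address is only an example):

# Push Drainer metrics to the Pushgateway every 15 seconds.
./bin/drainer -config drainer.toml -metrics-addr 127.0.0.1:9091 -metrics-interval 15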

Configure Grafana

Create a Prometheus data source

  1. Log in to the Grafana Web interface.

    • The default address is: http://localhost:3000

    • The default account name: admin

    • The password for the default account: admin

  2. Click the Grafana logo to open the sidebar menu.

  3. Click "Data Sources" in the sidebar.

  4. Click "Add data source".

  5. Specify the data source information:

    • Specify the name for the data source.
    • For Type, select Prometheus.
    • For Url, specify the Prometheus address.
    • Specify other fields as needed.
  6. Click "Add" to save the new data source.

Create a Grafana dashboard

  1. Click the Grafana logo to open the sidebar menu.

  2. On the sidebar menu, click "Dashboards" -> "Import" to open the "Import Dashboard" window.

  3. Click "Upload .json File" to upload a JSON file (Download TiDB Grafana Config).

  4. Click "Save & Open".

  5. A Prometheus dashboard is created.