Getting Started


Use TiSpark >= 2.5

Take spark-shell as an example. Make sure you have deployed Spark and obtained the TiSpark jar.

Start spark-shell

To use TiSpark in spark-shell:

  1. Add the following configuration to spark-defaults.conf
spark.sql.extensions  org.apache.spark.sql.TiExtensions
spark.tispark.pd.addresses  ${your_pd_address}
spark.sql.catalog.tidb_catalog  org.apache.spark.sql.catalyst.catalog.TiCatalog
spark.sql.catalog.tidb_catalog.pd.addresses  ${your_pd_address}
  2. Start spark-shell with the --jars option
spark-shell --jars tispark-assembly-{version}.jar
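
As an alternative to editing spark-defaults.conf, the same settings can be passed programmatically if you build a standalone Spark application instead of using spark-shell. A minimal sketch, assuming PD listens on 127.0.0.1:2379 (replace with your own PD addresses):

import org.apache.spark.sql.SparkSession

// The configuration keys mirror spark-defaults.conf above; the PD address is an example value
val spark = SparkSession.builder()
  .appName("TiSpark example")
  .config("spark.sql.extensions", "org.apache.spark.sql.TiExtensions")
  .config("spark.tispark.pd.addresses", "127.0.0.1:2379")
  .config("spark.sql.catalog.tidb_catalog", "org.apache.spark.sql.catalyst.catalog.TiCatalog")
  .config("spark.sql.catalog.tidb_catalog.pd.addresses", "127.0.0.1:2379")
  .getOrCreate()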

Get TiSpark version

spark.sql("select ti_version()").collect

Read with TiSpark

You can use Spark SQL to read from TiKV.

spark.sql("use tidb_catalog")
spark.sql("select count(*) from ${database}.${table}").show

Write with TiSpark

You can use the Spark DataSource API to write to TiKV with ACID guarantees (the INSERT statement is not supported yet).

val tidbOptions: Map[String, String] = Map(
  "tidb.addr" -> "127.0.0.1",
  "tidb.password" -> "",
  "tidb.port" -> "4000",
  "tidb.user" -> "root"
)

val customerDF = spark.sql("select * from customer limit 100000")

customerDF.write
  .format("tidb")
  .option("database", "tpch_test")
  .option("table", "cust_test_select")
  .options(tidbOptions)
  .mode("append")
  .save()
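
To check that the write landed, you can read the target table back through the TiDB catalog (a sketch reusing the example database and table names above):

spark.sql("use tidb_catalog")
spark.sql("select count(*) from tpch_test.cust_test_select").show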

See here for more details.

Delete with TiSpark

You can use Spark SQL to delete from TiKV (currently supported on the TiSpark master branch).

spark.sql("use tidb_catalog")
spark.sql("delete from ${database}.${table} where xxx")

See here for more details.

Use with spark_catalog

You can use multiple catalogs to read from different sources.

// read from hive
spark.sql("select * from spark_catalog.default.t").show

// join hive and tidb
spark.sql("select t1.id,t2.id from spark_catalog.default.t t1 left join tidb_catalog.test.t t2").show

Use TiSpark 2.4.x

Take spark-shell as an example.

Start spark-shell

To use TiSpark in spark-shell:

  1. Add the following configuration to spark-defaults.conf
spark.sql.extensions  org.apache.spark.sql.TiExtensions
spark.tispark.pd.addresses  ${your_pd_address}
  2. Start spark-shell with the --jars option
spark-shell --jars tispark-assembly-{version}.jar

Get TiSpark version

spark.sql("select ti_version()").collect

Read with TiSpark

You can use Spark SQL to read from TiKV.

spark.sql("select count(*) from ${database}.${table}").show

Write with TiSpark

You can use the Spark DataSource API to write to TiKV with ACID guarantees (the INSERT statement is not supported yet).

val tidbOptions: Map[String, String] = Map(
  "tidb.addr" -> "127.0.0.1",
  "tidb.password" -> "",
  "tidb.port" -> "4000",
  "tidb.user" -> "root"
)

val customerDF = spark.sql("select * from customer limit 100000")

customerDF.write
  .format("tidb")
  .option("database", "tpch_test")
  .option("table", "cust_test_select")
  .options(tidbOptions)
  .mode("append")
  .save()
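
As with the 2.5 example, you can verify the write by reading the target table back; no catalog switch is needed in 2.4.x (a sketch using the example names above):

spark.sql("select count(*) from tpch_test.cust_test_select").show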

See here for more details.