readme, tispark: update TiSpark and enable sparkR and pyspark (#27)
* update tispark to 1.0

* add TiSparkR

* upgrade tispark version to 1.0.1
tennix authored Aug 15, 2018
1 parent f30d7af commit 592d1de
Showing 6 changed files with 125 additions and 4 deletions.
9 changes: 9 additions & 0 deletions README.md
@@ -184,3 +184,12 @@ scala> spark.sql("select count(*) from lineitem").show
| 60175|
+--------+
```
You can also access Spark with Python or R using the following commands:
```
docker-compose exec tispark-master /opt/spark/bin/pyspark
docker-compose exec tispark-master /opt/spark/bin/sparkR
```
More documentation about TiSpark can be found [here](https://github.com/pingcap/tispark).
23 changes: 19 additions & 4 deletions tispark/Dockerfile
@@ -2,25 +2,40 @@ FROM anapsix/alpine-java:8

 ENV SPARK_VERSION=2.1.1 \
     HADOOP_VERSION=2.7 \
-    TISPARK_VERSION=0.1.0-SNAPSHOT \
+    TISPARK_VERSION=1.0.1 \
+    TISPARK_R_VERSION=1.1 \
+    TISPARK_PYTHON_VERSION=1.0.1 \
     SPARK_HOME=/opt/spark \
     SPARK_NO_DAEMONIZE=true \
     SPARK_MASTER_PORT=7077 \
     SPARK_MASTER_HOST=0.0.0.0 \
     SPARK_MASTER_WEBUI_PORT=8080
 
+ADD R /TiSparkR
+
 # base image only contains busybox version nohup and ps
 # spark scripts needs nohup in coreutils and ps in procps
 # and we can use mysql-client to test tidb connection
-RUN apk --no-cache add coreutils procps mysql-client python py-pip R \
-    && pip install pytispark==1.0.1 pyspark==2.1.2
+RUN apk --no-cache add \
+        coreutils \
+        mysql-client \
+        procps \
+        python \
+        py-pip \
+        R \
+    && pip install --no-cache-dir pytispark==${TISPARK_PYTHON_VERSION} \
+    && R CMD build TiSparkR \
+    && R CMD INSTALL TiSparkR_${TISPARK_R_VERSION}.tar.gz \
+    && rm -rf /TiSparkR_${TISPARK_R_VERSION}.tar.gz /TiSparkR
 
 RUN wget -q https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
     && tar zxf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz -C /opt/ \
     && ln -s /opt/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION} ${SPARK_HOME} \
-    && wget -q http://download.pingcap.org/tispark-${TISPARK_VERSION}-jar-with-dependencies.jar -P ${SPARK_HOME}/jars \
+    && wget -q https://github.com/pingcap/tispark/releases/download/${TISPARK_VERSION}/tispark-core-${TISPARK_VERSION}-jar-with-dependencies.jar -P ${SPARK_HOME}/jars \
     && wget -q http://download.pingcap.org/tispark-sample-data.tar.gz \
     && tar zxf tispark-sample-data.tar.gz -C ${SPARK_HOME}/data/ \
     && rm -rf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz tispark-sample-data.tar.gz
 
+ENV PYTHONPATH=${SPARK_HOME}/python/lib/py4j-0.10.4-src.zip:${SPARK_HOME}/python:$PYTHONPATH
+
WORKDIR ${SPARK_HOME}
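
Reviewer note (not part of this commit): the `PYTHONPATH` line above hard-codes `py4j-0.10.4-src.zip`, so it would need updating whenever `SPARK_VERSION` changes to a release that bundles a different py4j. A minimal sketch of resolving the archive by glob instead, for example from an entrypoint script, might look like the following. The function name `spark_pythonpath` is hypothetical, and the sketch assumes a single `py4j-*-src.zip` exists under `python/lib` in the Spark home.

```shell
#!/bin/sh
# Hypothetical helper: build a PYTHONPATH value for the Spark install at $1
# by locating the bundled py4j source zip instead of hard-coding its version.
spark_pythonpath() {
    spark_home=$1
    # Take the first match; assumes one py4j-*-src.zip is present.
    for zip in "$spark_home"/python/lib/py4j-*-src.zip; do
        [ -e "$zip" ] || return 1   # the glob did not match any file
        printf '%s:%s\n' "$zip" "$spark_home/python"
        return 0
    done
}
```

It could then be used as `export PYTHONPATH="$(spark_pythonpath "$SPARK_HOME"):$PYTHONPATH"`. A Dockerfile `ENV` instruction cannot run shell code, so this only applies if the path is set at container start rather than at build time.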
11 changes: 11 additions & 0 deletions tispark/R/DESCRIPTION
@@ -0,0 +1,11 @@
Package: TiSparkR
Type: Package
Title: TiSpark for R
Version: 1.1
Author: PingCAP
Maintainer: Novemser <novemser@gmail.com>
Description: A shabby thin layer to support TiSpark in R language.
License: Apache 2.0
Copyright: 2017 PingCAP, Inc.
Encoding: UTF-8
LazyData: true
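
Reviewer note (not part of this commit): `R CMD build` names its output `<Package>_<Version>.tar.gz` from the fields in this `DESCRIPTION` file, so `TISPARK_R_VERSION` in the Dockerfile has to stay in sync with `Version:` here. A minimal sketch of deriving the tarball name from `DESCRIPTION` rather than duplicating the version is shown below. The helper name `r_tarball_name` is hypothetical, and the sketch assumes simple single-line `Package:` and `Version:` fields.

```shell
#!/bin/sh
# Hypothetical helper: print the tarball name that `R CMD build` would
# produce for the package whose DESCRIPTION file is given as $1.
# Assumes single-line "Package:" and "Version:" fields.
r_tarball_name() {
    pkg=$(sed -n 's/^Package:[[:space:]]*//p' "$1")
    ver=$(sed -n 's/^Version:[[:space:]]*//p' "$1")
    printf '%s_%s.tar.gz\n' "$pkg" "$ver"
}
```

With this, the install step could read `R CMD INSTALL "$(r_tarball_name TiSparkR/DESCRIPTION)"`, at the cost of a slightly less readable `RUN` line.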
1 change: 1 addition & 0 deletions tispark/R/NAMESPACE
@@ -0,0 +1 @@
exportPattern("^[[:alpha:]]+")
41 changes: 41 additions & 0 deletions tispark/R/R/tisparkR.R
@@ -0,0 +1,41 @@
#
# Copyright 2017 PingCAP, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# See the License for the specific language governing permissions and
# limitations under the License.
#
#

# Title : TiSparkR
# Objective : TiSpark entry for R
# Created by: novemser
# Created on: 17-11-1

# Function:createTiContext
# Create a new TiContext via the spark session passed in
#
# @return A new TiContext created on session
# @param session A Spark Session for TiContext creation
createTiContext <- function(session) {
  sparkR.newJObject("org.apache.spark.sql.TiContext", session)
}

# Function:tidbMapDatabase
# Map the database designated by `dbName` into the TiContext.
#
# @param tiContext TiSpark context
# @param dbName Database name to map
# @param isPrefix Whether to use dbName as a prefix
# @param loadStatistics Whether to use statistics information from TiDB
tidbMapDatabase <- function(tiContext, dbName, isPrefix=FALSE, loadStatistics=TRUE) {
  sparkR.callJMethod(tiContext, "tidbMapDatabase", dbName, isPrefix, loadStatistics)
  paste("Mapping to database:", dbName)
}
44 changes: 44 additions & 0 deletions tispark/R/README.md
@@ -0,0 +1,44 @@
## TiSparkR
TiSparkR is a thin layer built to support the R language with TiSpark.

### Usage
1. Download the TiSparkR source code and build a binary package (run `R CMD build R` in the TiSpark root directory). Install it to your local R library (e.g. via `R CMD INSTALL TiSparkR_1.1.tar.gz`; the file name follows the `Version` field in `DESCRIPTION`).

2. Build or download the TiSpark dependency jar `tispark-core-1.0-RC1-jar-with-dependencies.jar` from [here](https://github.com/pingcap/tispark).

3. `cd` to your Spark home directory, and run:
```
./bin/sparkR --jars /where-ever-it-is/tispark-core-${version}-jar-with-dependencies.jar
```
Note that you should replace the `TiSpark` jar path with your own.

4. Use as below in your R console:
```R
# import tisparkR library
> library(TiSparkR)
# create a TiContext instance
> ti <- createTiContext(spark)
# Map TiContext to database:tpch_test
> tidbMapDatabase(ti, "tpch_test")

# Run a sql query
> customers <- sql("select * from customer")
# Print schema
> printSchema(customers)
root
 |-- c_custkey: long (nullable = true)
 |-- c_name: string (nullable = true)
 |-- c_address: string (nullable = true)
 |-- c_nationkey: long (nullable = true)
 |-- c_phone: string (nullable = true)
 |-- c_acctbal: decimal(15,2) (nullable = true)
 |-- c_mktsegment: string (nullable = true)
 |-- c_comment: string (nullable = true)

# Run a count query
> count <- sql("select count(*) from customer")
# Print count result
> head(count)
  count(1)
1      150
```
