Spark SQL backend (to support Elasticsearch, Cassandra, etc) #241
Comments
+1 |
+1 |
Totally worth doing. There are two paths for it: either creating a SQLAlchemy dialect (might not be possible if Spark SQL is funky), or creating a new datasource and implementing the query interface. For now we have two datasources: sqlalchemy or druid. It's totally doable to add a third one; it just needs to implement something like: Basically you need to receive these parameters and return a pandas dataframe. |
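To make the "receive parameters, return a pandas dataframe" contract concrete, here is a minimal sketch. It is not Caravel's actual interface: the class names `Datasource` and `InMemoryDatasource` and the parameters `groupby`, `metrics`, and `row_limit` are hypothetical, chosen only to illustrate the shape of such a datasource.

```python
from abc import ABC, abstractmethod

import pandas as pd


class Datasource(ABC):
    """Hypothetical base class, NOT Caravel's real datasource interface."""

    @abstractmethod
    def query(self, groupby, metrics, row_limit=None):
        """Run a query and return the result as a pandas DataFrame."""


class InMemoryDatasource(Datasource):
    """Toy backend that answers queries from rows held in memory."""

    def __init__(self, rows):
        self._df = pd.DataFrame(rows)

    def query(self, groupby, metrics, row_limit=None):
        # groupby: list of column names to group on (assumed parameter).
        # metrics: dict mapping a column name to an aggregation name,
        #          e.g. {"clicks": "sum"} (assumed parameter).
        result = self._df.groupby(groupby, as_index=False).agg(metrics)
        if row_limit is not None:
            result = result.head(row_limit)
        return result


rows = [
    {"country": "FR", "clicks": 1},
    {"country": "FR", "clicks": 2},
    {"country": "US", "clicks": 5},
]
frame = InMemoryDatasource(rows).query(["country"], {"clicks": "sum"})
print(frame)
```

A Spark SQL or Elasticsearch backend would presumably replace the in-memory aggregation with a call to the remote engine and convert the response into a DataFrame before returning it.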
We use Spark at Airbnb and have some SparkSql in places, we might have use cases for it internally, but I'm not sure where it fits in the priority list. |
Cool thanks for the pointers! This new connector would surely unlock a wealth of valuable contributions from other businesses which happen to not use Druid or a plain RDBMS. Sounds like a good investment to me :) |
I am really interested in adding Hive support, I may take a crack at it sometime in the next few weeks. Dropbox has a Python/Hive project that I was looking at: https://github.com/dropbox/PyHive |
Does it mean Impala as well? Thanks |
+1 |
+1 for Hive |
Great work guys, but can I load data from Elasticsearch? |
+1 to addition of Elasticsearch support. |
+1 |
+1 |
+1 for Hive |
+1 for Hive and Elasticsearch |
I am working on an Apache Drill SQLAlchemy dialect. I have some basic things working, and have been working with others on the Drill mailing list. There has been talk of plugging Drill into Elasticsearch, which seems a bit convoluted. However, since Elasticsearch doesn't have a SQL interface and Drill works really nicely, if we get a dialect working for Drill, then other storage plugins will (hopefully) just work. Some of the work can be found here: Docker container with pyodbc, unixodbc, Drill ODBC, and caravel all working: https://github.com/JohnOmernik/caraveldrill Drill Dialect (work in progress, feel free to play with it and try it, please report issues as you find them, this is iterative brute force programming at this point!) |
I've taken a different approach and started a native backend. WIP is at https://github.com/sathieu/caravel/tree/elasticsearch (beware: I'll squash commits and force push). Not much is working yet, and I don't have dedicated time on it. We'll see what comes. |
+1 to sparksql |
For what it's worth: Spark 2 will be SQL compliant, so a SQLAlchemy dialect will then be feasible |
+1 for spark SQL. That will get you connected to most data sources these days. |
You can connect it to Spark SQL. If it uses a Hive back-end, you can refer to this documentation page for instructions on how to connect to Spark SQL via a jdbc+hive connector: https://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html#01%20Databricks%20Overview/14%20Third%20Party%20Integrations/05%20Beeline.html. The one I prefer is dropbox/pyhive to connect to Spark SQL in my Python projects. For Scala or Java, jdbc+hive will be preferable. |
+1 for spark sql |
Sweet! Can others confirm that SparkSQL works for them through SQLAlchemy?! |
Giving hints about how to use SparkSQL in the docs: #803 |
@mistercrunch Right now it does. |
I used Spark Thrift Server with PyHive and it almost works (I needed to change one line in the Hive dialect) |
@shkr Hi, I am trying to achieve the same thing with pyHive and have not been able to make it work. What is the URI you are using for setting up Superset data source? I am trying something like jdbc+hive://localhost:10000/, and it gives an error: "Can't load plugin: sqlalchemy.dialects:jdbc.hive". I am sure I must be missing something here.. Thanks in advance for any instructions on this. -- update -- I have another question that is, what do you mean when you say use SparkSql as backend? I am fairly new to this, but AFAIK I can save dataframes in SparkSql to a Hive table, from which I can then create a Superset table/slice using the above connector. But is there more that I can do to make this process better? My overall goal is to be able to create tables/ slices from parquet files on HDFS. |
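One likely reading of the "Can't load plugin: sqlalchemy.dialects:jdbc.hive" error: SQLAlchemy interprets the URI scheme as `dialect+driver`, so `jdbc+hive` asks for a dialect named `jdbc` with a driver named `hive`, which is not something PyHive is known to register (it is generally used with a plain `hive://` scheme). The helper below is a hypothetical illustration using only the standard library; `split_sqlalchemy_scheme` is not part of SQLAlchemy.

```python
from urllib.parse import urlsplit


def split_sqlalchemy_scheme(uri):
    """Split a SQLAlchemy-style URI scheme into (dialect, driver).

    Hypothetical helper for illustration only; SQLAlchemy does its own
    parsing internally.
    """
    scheme = urlsplit(uri).scheme
    dialect, _, driver = scheme.partition("+")
    return dialect, driver or None


# The URI from the comment above: dialect "jdbc", driver "hive".
print(split_sqlalchemy_scheme("jdbc+hive://localhost:10000/"))
# A plain hive:// URI: dialect "hive", no explicit driver.
print(split_sqlalchemy_scheme("hive://localhost:10000/"))
```

Under that reading, switching the scheme to `hive://` (with PyHive installed) is the first thing to try.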
+1 for Elasticsearch support. |
@giaosudau, what is the SQLAlchemy URI I should give in Superset to connect to SparkSQL? |
@santhavathi when you open the Spark UI dashboard, there is an IP printed at the top, which is the hostname of the head of the cluster. You have to use that as the hostname in the Hive URL. Example: hive://<spark-cluster-master>/ |
@shkr, thanks so much for the reply. The error message returned was: I used impala:// and it works now. |
Hello guys, I see in the documentation that SparkSQL is supported : http://airbnb.io/superset/installation.html#database-dependencies. What does this concretely mean ? Which DB can we query then ? Thanks a lot in advance. |
@shkr according to your latest comment, I tried the following URI: hive://172.17.0.2, where 172.17.0.2 is what I got from spark UI. It allows me to add it as a database, so far so good. However when I query against a table in this database, the job tracker shows a MapReduce job. I would expect the job to be a Spark job though, is it true in your case? |
@kaiosama, when you said you are connecting to hive://172.17.0.2, what is the port you used here, and are you directly connecting to spark master without hiveserver running? |
@santhavathi that is the full URI I used, without a port number. I tried using some port numbers from the Spark UI page but none of them worked. It was with a running Hive server. Maybe I am missing something here, but it seems to me that Spark SQL is supposed to be used against Hive, i.e. you always need a running Hive server? Or can the Spark SQL connector be used against other sources? As @cduverne mentioned, it's not very clear to me. And I have not got any replies about how to get "jdbc+hive" to work as described in the document. |
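On the port question, a small sketch of assembling a `hive://` URI with an explicit port. The default of 10000 is the usual HiveServer2 / Spark Thrift Server port (the same one used in the `localhost:10000` attempt earlier in this thread), and `default` is Hive's default database; both are assumptions about the setup, and `hive_uri` and `make_engine` are hypothetical helper names.

```python
def hive_uri(host, port=10000, database="default"):
    """Build a SQLAlchemy URI for a HiveServer2-compatible endpoint.

    Assumes the Thrift endpoint listens on `port`; adjust to your cluster.
    """
    return f"hive://{host}:{port}/{database}"


def make_engine(uri):
    """Create a SQLAlchemy engine; needs sqlalchemy and PyHive installed."""
    from sqlalchemy import create_engine  # imported lazily on purpose

    return create_engine(uri)


# Host taken from the comment above; the port is an assumption.
print(hive_uri("172.17.0.2"))
```

Whether queries then run as Spark jobs or MapReduce jobs depends on which server is listening on that port: a Spark Thrift Server executes them in Spark, while a plain HiveServer2 uses whatever execution engine Hive is configured with.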
+1 for Hbase support :) |
At Airbnb we can do Hbase through Presto with the HBase Presto connector. |
Would you please give me a link so I can follow the install steps? |
Hi, can someone please list the steps to connect to Elasticsearch from Superset? |
@balchandra it would involve using this: |
@kaiosama The hostname directs SQLAlchemy to the SQL endpoint at the given host and port. It's hard to say whether a MapReduce job is the normal behavior to expect without knowing details about your setup of Hive, MapReduce, and Spark. |
@mistercrunch... |
Looks like |
+1 for elasticsearch |
+1 for elasticsearch |
+1 for Hive and Elasticsearch |
Good news about ElasticSearch here! #8441 |
Closing since Superset now works with Elasticsearch! |
I can't resist saying Caravel looks much neater than Kibana, plus the user management doesn't cost money and it's not an afterthought.
It would be amazing to see Caravel replacing my Kibana dashboard, using the data I've got currently in Elasticsearch.
You use a SQL interface to query the data store. Is there any chance Caravel can speak to Elasticsearch through Spark SQL?
Spark has a mature Elasticsearch connector, so it should be OK.
And wait.. If you support Spark SQL, you'll be immediately able to support HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data source!
Is this a path worth exploring for this project? I think it's quite exciting.