Merge pull request #76977 from dagiro/freshness79
freshness79
Ja-Dunn authored May 17, 2019
2 parents 2fa1713 + 2b85f2c commit a10c0d4
Showing 2 changed files with 21 additions and 16 deletions.
37 changes: 21 additions & 16 deletions articles/hdinsight/spark/apache-spark-load-data-run-query.md
@@ -8,37 +8,46 @@ ms.service: hdinsight
ms.custom: hdinsightactive,mvc
ms.topic: tutorial
ms.author: hrasheed
ms.date: 04/03/2019
ms.date: 05/16/2019

#custom intent: As a developer new to Apache Spark and to Apache Spark in Azure HDInsight, I want to learn how to load data into a Spark cluster, so I can run interactive SQL queries against the data.
---

# Tutorial: Load data and run queries on an Apache Spark cluster in Azure HDInsight

In this tutorial, you learn how to create a dataframe from a csv file and how to run interactive Spark SQL queries against an [Apache Spark](https://spark.apache.org/) cluster in Azure HDInsight. In Spark, a dataframe is a distributed collection of data organized into named columns. A dataframe is conceptually equivalent to a table in a relational database or a data frame in R/Python.

In this tutorial, you learn how to:
> [!div class="checklist"]
> * Create a dataframe from a csv file
> * Run queries on the dataframe
## Prerequisites

* Complete [Create an Apache Spark cluster in Azure HDInsight](apache-spark-jupyter-spark-sql.md).
An Apache Spark cluster on HDInsight. See [Create an Apache Spark cluster](./apache-spark-jupyter-spark-sql-use-portal.md).

## Create a dataframe from a csv file
## Create a Jupyter notebook

Applications can create dataframes directly from files or folders on the remote storage such as Azure Storage or Azure Data Lake Storage; from a Hive table; or from other data sources supported by Spark, such as Cosmos DB, Azure SQL DB, DW, etc. The following screenshot shows a snapshot of the HVAC.csv file used in this tutorial. The csv file comes with all HDInsight Spark clusters. The data captures the temperature variations of some buildings.

![Snapshot of data for interactive Spark SQL query](./media/apache-spark-load-data-run-query/hdinsight-spark-sample-data-interactive-spark-sql-query.png "Snapshot of data for interactive Spark SQL query")
Jupyter Notebook is an interactive notebook environment that supports various programming languages. The notebook allows you to interact with your data, combine code with markdown text and perform simple visualizations.

1. Edit the URL `https://SPARKCLUSTER.azurehdinsight.net/jupyter` by replacing `SPARKCLUSTER` with the name of your Spark cluster. Then enter the edited URL in a web browser. If prompted, enter the login credentials for the cluster.

2. From the Jupyter web page, select **New** > **PySpark** to create a notebook.

![Create a Jupyter Notebook to run interactive Spark SQL query](./media/apache-spark-load-data-run-query/hdinsight-spark-create-jupyter-interactive-spark-sql-query.png "Create a Jupyter Notebook to run interactive Spark SQL query")

1. Open the Jupyter notebook that you created in the prerequisites section and create a new notebook with PySpark.
A new notebook is created and opened with the name Untitled (`Untitled.ipynb`).
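
Step 1 above amounts to substituting the cluster name into a fixed URL pattern. As a minimal sketch, the helper below does that substitution; the function name `jupyter_url` and the example cluster name `mysparkcluster` are hypothetical and only illustrate the pattern shown in the step.

```python
def jupyter_url(cluster_name: str) -> str:
    """Return the Jupyter URL for an HDInsight Spark cluster name."""
    name = cluster_name.strip()
    if not name:
        raise ValueError("cluster name must not be empty")
    # Pattern from step 1: https://SPARKCLUSTER.azurehdinsight.net/jupyter
    return f"https://{name}.azurehdinsight.net/jupyter"


# "mysparkcluster" is a placeholder; use the name of your own cluster.
print(jupyter_url("mysparkcluster"))
```

Opening the printed URL in a browser should lead to the same login prompt described in step 1.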

> [!NOTE]
> By using the PySpark kernel to create a notebook, the `spark` session is automatically created for you when you run the first code cell. You do not need to explicitly create the session.
2. Paste the following code in an empty cell of the notebook, and then press **SHIFT + ENTER** to run the code. The code imports the types required for this scenario:
## Create a dataframe from a csv file

Applications can create dataframes directly from files or folders on the remote storage such as Azure Storage or Azure Data Lake Storage; from a Hive table; or from other data sources supported by Spark, such as Cosmos DB, Azure SQL DB, DW, etc. The following screenshot shows a snapshot of the HVAC.csv file used in this tutorial. The csv file comes with all HDInsight Spark clusters. The data captures the temperature variations of some buildings.

![Snapshot of data for interactive Spark SQL query](./media/apache-spark-load-data-run-query/hdinsight-spark-sample-data-interactive-spark-sql-query.png "Snapshot of data for interactive Spark SQL query")
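
The idea of a dataframe as a table with named columns can be tried locally before running anything on the cluster. The sketch below is not Spark code: it uses Python's built-in `sqlite3` module as a stand-in to show what "named columns queried with SQL" looks like. The column names, the three sample rows, and the helper names `load_rows` and `temp_diff_on` are all invented for illustration and are not taken from HVAC.csv; only the table name `hvac` mirrors the temporary table this tutorial creates.

```python
import csv
import io
import sqlite3

# Hypothetical sample in the spirit of HVAC.csv; columns and values are invented.
SAMPLE_CSV = """building_id,reading_date,target_temp,actual_temp
1,6/1/13,66,58
2,6/1/13,69,68
3,6/2/13,70,73
"""


def load_rows(text):
    """Parse CSV text into a list of typed tuples, one per data row."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        (
            int(row["building_id"]),
            row["reading_date"],
            int(row["target_temp"]),
            int(row["actual_temp"]),
        )
        for row in reader
    ]


def temp_diff_on(reading_date, rows):
    """Return (building_id, target - actual) for every row on the given date."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE hvac ("
            "building_id INTEGER, reading_date TEXT, "
            "target_temp INTEGER, actual_temp INTEGER)"
        )
        conn.executemany("INSERT INTO hvac VALUES (?, ?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT building_id, target_temp - actual_temp AS temp_diff "
            "FROM hvac WHERE reading_date = ? ORDER BY building_id",
            (reading_date,),
        )
        return cursor.fetchall()
    finally:
        conn.close()


print(temp_diff_on("6/1/13", load_rows(SAMPLE_CSV)))  # prints [(1, 8), (2, 1)]
```

On the cluster, the dataframe plays the role of the `hvac` table here, with the difference that Spark distributes the rows across the cluster instead of holding them in one local process.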

1. Paste the following code in an empty cell of the Jupyter notebook, and then press **SHIFT + ENTER** to run the code. The code imports the types required for this scenario:

```python
from pyspark.sql import *
@@ -49,7 +58,7 @@ Applications can create dataframes directly from files or folders on the remote

![Status of interactive Spark SQL query](./media/apache-spark-load-data-run-query/hdinsight-spark-interactive-spark-query-status.png "Status of interactive Spark SQL query")

3. Run the following code to create a dataframe and a temporary table (**hvac**).
2. Run the following code to create a dataframe and a temporary table (**hvac**).

```python
# Create a dataframe and table from sample data
@@ -92,11 +101,7 @@ You can also select the resource group name to open the resource group page, and

## Next steps

In this tutorial, you learned how to:
> [!div class="checklist"]
> * Create an Apache Spark dataframe.
> * Run Spark SQL against the dataframe.
In this tutorial, you learned how to create a dataframe from a csv file, and how to run interactive Spark SQL queries against an Apache Spark cluster in Azure HDInsight. Advance to the next article to see how the data you registered in Apache Spark can be pulled into a BI analytics tool such as Power BI.

Advance to the next article to see how the data you registered in Apache Spark can be pulled into a BI analytics tool such as Power BI.
> [!div class="nextstepaction"]
> [Analyze data using BI tools](apache-spark-use-bi-tools.md)
