This repository gathers Spark code examples coming from various websites and books. It also includes several build scripts (Bash scripts, batch files, Make scripts) for experimenting with Spark on a Windows machine.
Ada, Akka, C++, COBOL, Dafny, Dart, Deno, Docker, Erlang, Flix, Golang, GraalVM, Haskell, Kafka, Kotlin, LLVM, Modula-2, Node.js, Rust, Scala 3, Spring, TruffleSqueak, WiX Toolset and Zig are other topics we are continuously monitoring.
☛ Read the document "What is Apache Spark™?" from the Spark documentation to know more about the Spark ecosystem.
This project depends on the following external software for the Microsoft Windows platform:
- Apache Maven 3.9 (requires Java 8+) (release notes)
- Git 2.47 (release notes)
- MSYS2 2024 (changelog)
- sbt 1.10 (requires Java 8) (release notes)
- Scala 2.13 (requires Java 8) (release notes)
- Spark 3.5 [1] (release notes)
- Temurin OpenJDK 11 LTS [1] (release notes, bug fixes)
Optionally one may also install the following software:
- ConEmu 2023 (release notes)
- Gradle 8.10 [1] (requires Java 8+) (release notes)
- Temurin OpenJDK 17 LTS [1] (release notes, bug fixes, Java 17 API)
- Temurin OpenJDK 21 LTS (release notes, Java 21 API)
- Visual Studio Code 1.95 (release notes)
☛ Installation policy
When possible we install software from a Zip archive rather than via a Windows installer. In our case we defined C:\opt\ as the installation directory for optional software tools (similar to the /opt/ directory on Unix).
For instance our development environment looks as follows (November 2024) [2]:
    C:\opt\apache-maven\                          ( 10 MB)
    C:\opt\ConEmu\                                ( 26 MB)
    C:\opt\Git\                                   (391 MB)
    C:\opt\gradle\                                (140 MB)
    C:\opt\jdk-temurin-11.0.25_9\                 (306 MB)
    C:\opt\jdk-temurin-17.0.13_11\                (304 MB)
    C:\opt\jdk-temurin-21.0.5_11\                 (329 MB)
    C:\opt\msys64\                                (2.8 GB)
    C:\opt\sbt\                                   (135 MB)
    C:\opt\scala-2.13.15\                         ( 24 MB)
    C:\opt\spark-3.5.3-bin-hadoop3\               (423 MB)
    C:\opt\spark-3.5.3-bin-hadoop3-scala2.13\     (432 MB)
    C:\opt\VSCode\                                (381 MB)
🔎 Git for Windows provides a Bash emulation used to run git from the command line (as well as over 250 Unix commands like awk, diff, file, grep, more, mv, rmdir, sed and wc).
Directory structure ▴
This project has the following directory structure:
    bin\
    docs\
    examples\{README.md, HelloWorld, etc.}
    README.md
    QUICKREF.md
    RESOURCES.md
    setenv.bat
where
- directory bin\ contains utility batch scripts.
- directory docs\ contains Apache Spark related papers/articles.
- directory examples\ contains Apache Spark code examples.
- file README.md is the Markdown document for this page.
- file QUICKREF.md gathers Spark hints and tips.
- file RESOURCES.md is the Markdown document presenting external resources.
- file setenv.bat is the batch script for setting up our environment.
We also define a virtual drive (e.g. drive K:) in our working environment in order to shorten or hide the real path of our project directory (see article "Windows command prompt limitation" from Microsoft Support).
🔎 We use the Windows external command subst to create virtual drives; for instance:
    > subst K: %USERPROFILE%\workspace\spark-examples
In the next section we give a brief description of the batch files present in this project.
Batch/Bash commands ▴
setenv.bat [3]
We execute command setenv.bat once to set up our development environment; it makes external tools such as mvn.cmd, sbt.bat or sh.exe directly available from the command prompt.
    > setenv
    Tool versions:
       java 11.0.25, sbt 1.10.3, scalac 2.13.15, spark-shell 3.5.3,
       gradle 8.10.2, mvn 3.9.9, make 4.4.1, git 2.47.0, diff 3.10,
       bash 5.2.37(1)

    > where mvn sbt sh
    C:\opt\apache-maven\bin\mvn
    C:\opt\apache-maven\bin\mvn.cmd
    C:\opt\Git\bin\sh.exe
    C:\opt\Git\usr\bin\sh.exe
    C:\opt\sbt\bin\sbt
    C:\opt\sbt\bin\sbt.bat
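Since Git for Windows provides a Bash emulation, the check performed with `where mvn sbt sh` above can also be sketched in Bash. This is only an illustration and not part of this project: the function name `check_tools` is hypothetical, and it relies solely on the shell builtin `command -v`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of setenv.bat): reports where each tool
# passed as an argument is found on the PATH, similar to the Windows
# command "where". Returns a non-zero status if at least one is missing.
check_tools() {
    local missing=0
    local tool location
    for tool in "$@"; do
        if location="$(command -v "$tool")"; then
            printf '%-8s %s\n' "$tool" "$location"
        else
            printf '%-8s not found\n' "$tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# The three tools mentioned above.
check_tools mvn sbt sh || echo "Some tools are missing; was setenv executed?" >&2
```

Unlike `where`, `command -v` reports only the first match on the PATH, so duplicates such as the two `sh.exe` entries are not listed.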
Footnotes ▴
[1] Scala 2.13 Support ↩
- Spark 3.2.0 and newer add support for Scala 2.13 (see PR#34218).
[2] Downloads ↩
- In our case we downloaded the following installation files (see section 1):
    apache-maven-3.9.9-bin.zip                          ( 10 MB)
    ConEmuPack.230724.7z                                (  5 MB)
    gradle-8.10.2-bin.zip                               (118 MB)
    msys2-x86_64-20240727.exe                           ( 86 MB)
    OpenJDK11U-jdk_x64_windows_hotspot_11.0.25_9.zip    (194 MB)
    OpenJDK17U-jdk_x64_windows_hotspot_17.0.13_11.zip   (191 MB)
    OpenJDK21U-jdk_x64_windows_hotspot_21.0.5_11.zip    (191 MB)
    PortableGit-2.47.0-64-bit.7z.exe                    ( 41 MB)
    sbt-1.10.3.zip                                      ( 17 MB)
    scala-2.13.15.zip                                   ( 21 MB)
    spark-3.5.3-bin-hadoop3.tgz                         (285 MB)
    spark-3.5.3-bin-hadoop3-scala2.13.tgz               (292 MB)
    VSCode-win32-x64-1.95.0.zip                         (131 MB)
    winutils-master.zip                                 ( 24 MB)
Note: If not yet done, our batch file setenv.bat also installs the winutils tools for Windows to avoid the "no native library" and "access0" errors.
    > setenv -verbose
    Assign drive J: to path "%USERPROFILE%\workspace-perso\spark-examples"
    Download Zip file to directory "%TEMP%"
    Uncompress Zip file to directory "%TEMP%"
    Copy files from "%TEMP%\winutils-master\hadoop-3.3.6\bin" to directory "C:\opt\spark-3.5.3-bin-hadoop3-scala2.13\bin"
    Tool versions:
       java 11.0.25, sbt 1.10.3, scalac 2.13.8, spark-shell 3.5.3,
       gradle 8.10.2, mvn 3.9.9, make 4.4.1, git 2.47.0, diff 3.10,
       sh 5.2.37(1)
    Tool paths:
       [...]
[3] setenv.bat usage ↩
- Batch file setenv.bat sets specific environment variables that enable us to use command-line developer tools more easily.
- It is similar to the setup scripts described on the page "Visual Studio Developer Command Prompt and Developer PowerShell" of the Visual Studio online documentation.
- For instance we can quickly check that the two scripts Launch-VsDevShell.ps1 and VsDevCmd.bat are indeed available in our Visual Studio 2019 installation:
    > where /r "C:\Program Files (x86)\Microsoft Visual Studio" *vsdev*
    C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\Common7\Tools\Launch-VsDevShell.ps1
    C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\Common7\Tools\VsDevCmd.bat
    C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\Common7\Tools\vsdevcmd\core\vsdevcmd_end.bat
    C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\Common7\Tools\vsdevcmd\core\vsdevcmd_start.bat
- Concretely, in our GitHub projects which depend on Visual Studio (e.g. michelou/cpp-examples), setenv.bat invokes VsDevCmd.bat (resp. vcvarsall.bat for older Visual Studio versions) to set up the Visual Studio tools on the command prompt.