-
This guide is intended for Arch Linux or its derivatives (Manjaro, Endeavour, Garuda, etc.). The package names, package manager, and directory structure may vary on other Linux distributions. See _misc / arch_package for details.
-
Always double-check every path, especially when writing config files! Even on Arch Linux, package maintainers may change (and a new maintainer may decide to change something), so proceed with caution.
-
Update your system first before attempting anything in this guide (e.g. `sudo pacman -Syyu`).
-
This guide uses a single computer to run a 2-node Hadoop cluster (1 master, 1 worker) in pseudo-distributed mode. If you need more nodes, you may need to modify things that are not covered in this guide. The stack versions used in this guide are Hadoop v3.3.5, Hive v3.1.3, Spark v3.5.0, OpenJDK v8.382.u05, and PostgreSQL v15.4.
-
Install and configure Apache Spark (see below).
If you follow that link until the end, you will also end up installing Apache Hadoop and Apache Hive as prerequisites for Spark. However, if you don't care about the cluster setup or any of those programs, you can skip this step since `pyspark` can also work on its own.
-
Install and configure Metabase (see below).
Note that if you don't install Hive, you don't need to install Metabase, since it (currently) can't connect to Spark SQL without Hive.
-
If you installed everything from the previous steps, make sure to start all the services:
```sh
# sudo systemctl start sshd postgresql
sudo systemctl start hadoop-namenode hadoop-datanode hadoop-secondarynamenode hadoop-resourcemanager hadoop-historyserver
sudo systemctl start metabase
sudo systemctl start hive-metastore hive-hiveserver2
```
Wait about a minute, then check the services' status (some services may not fail immediately):
```sh
# Single command because the output is displayed through "less"
sudo systemctl status hadoop-namenode hadoop-datanode hadoop-secondarynamenode hadoop-resourcemanager hadoop-historyserver metabase hive-metastore hive-hiveserver2
```
Don't judge a service by its status (active/inactive) alone; check the actual logs. Hive services may keep running even after they hit an error! To see the full log of a service, use `sudo journalctl -u service_name`. After fixing the problem, use `sudo systemctl restart service_name` to restart it. Check used ports by running `sudo netstat -plnt` and make sure important ports like `9083`, `10000`, and `9000` are listed there. If you run Python before starting these services, there's a chance that Python may claim those ports!
-
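If you want to script that port check instead of eyeballing the `netstat` output, a small Bash sketch (using Bash's `/dev/tcp` pseudo-device, no root needed) could look like this; "closed" is the expected result before the services are started:

```shell
# Probe the important ports; prints "open" only if something is listening
for port in 9000 9083 10000; do
  if (exec 3<>"/dev/tcp/localhost/$port") 2>/dev/null; then
    echo "port $port: open"
  else
    echo "port $port: closed"
  fi
done
```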
Create a Python virtual environment. You can use `Ctrl + Shift + P` in VS Code, or run `python -m venv .venv` in the project folder directly. By using a virtual environment, all dependencies will be installed in the project folder.
-
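The virtual environment step can be sketched as a short shell session (assuming `python` points to Python 3, as it does on Arch):

```shell
# Create the virtual environment inside the project folder
python -m venv .venv
# Activate it (Bash/Zsh syntax)
. .venv/bin/activate
# The interpreter now resolves from inside .venv
python -c 'import sys; print(sys.prefix)'
```

Deactivate with `deactivate`; deleting the `.venv` folder removes everything the project installed.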
Install all pip requirements (`pip install -r requirements.txt`), which also contains the needed `pyspark` library. PySpark may also be bundled within Apache Spark itself, but you will need to set `PYTHONPATH` manually (so the Python library can be detected). However, if you're installing from the AUR, by default it may not include the `python` folder, so `pyspark` will still be needed. Note that PySpark is actually a standalone package (though it still requires Java); it will work without installing Apache Hadoop, Apache Hive, or even Apache Spark itself. However, without all the prerequisites, PySpark can't be used to its full potential and most software (e.g. Metabase) won't be able to connect to it.
-
Start experimenting using the notebook file and Metabase (`localhost:3000`).
-
If you're done, stop all the services:
```sh
# sudo systemctl stop sshd postgresql
sudo systemctl stop hadoop-namenode hadoop-datanode hadoop-secondarynamenode hadoop-resourcemanager hadoop-historyserver
sudo systemctl stop metabase
sudo systemctl stop hive-metastore hive-hiveserver2
```
Using Apache Spark, we can execute queries and create tables without affecting the real database, which is ideal for production environments. It can also be used for machine learning and for streaming data in real time.
-
Configure Apache Hadoop first (see below). It's needed because Metabase currently only supports Spark SQL connections through Hive. However, if you are feeling lazy, you can just connect Metabase directly to MariaDB (MySQL) or PostgreSQL, but then most of this guide will be pointless.
-
Install Apache Spark from the AUR (`yay -S apache-spark`). Despite its package name, it uses the binary release instead of compiling from source. If you prefer compiling, use `apache-spark-git` instead.
-
Check whether Spark is registered to `PATH` by running `spark-shell --version` or `where spark-shell`. If not, create a new profile file (e.g. `sudo nano /etc/profile.d/apache-spark.sh`) with content similar to this:
```sh
export SPARK_HOME=/opt/apache-spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
```
Load the new change by running `source /etc/profile.d/apache-spark.sh` in your shell (e.g. Bash, Zsh).
-
Done, continue to the next step on the main guide.
Metabase is a business intelligence, dashboard, and data visualization tool. It has many alternatives, such as Power BI, Tableau, Superset, Redash, and Grafana. Some interesting insights made me choose Metabase (see here and here), but please do your own research before deciding!
-
Install Metabase from the AUR (`yay -S metabase`).
-
By default, the Metabase config is located at `/etc/metabase.conf`. Edit it as root using nano (`sudo nano /etc/metabase.conf`) or a similar tool. You can run Metabase without changing any config, but it will complain if you use the default H2 database (unsafe). Metabase example config:
```sh
# https://www.metabase.com/docs/latest/configuring-metabase/environment-variables

# Metabase server URL
MB_JETTY_HOST=127.0.0.1
MB_JETTY_PORT=3000

MB_ANON_TRACKING_ENABLED=false
MB_CHECK_FOR_UPDATES=false

# Possible values: postgres, mysql, h2
# Use mysql for MySQL compatible databases (e.g. MariaDB)
MB_DB_TYPE=postgres
MB_DB_HOST=127.0.0.1
# Example values: 5432 (PostgreSQL), 3306 (MariaDB)
MB_DB_PORT=5432
# Change it based on your setup
MB_DB_USER=postgres
MB_DB_PASS=postgres
MB_DB_DBNAME=metabase
```
Create the database (`MB_DB_DBNAME`) if it doesn't exist yet, and make sure the owner is correct (in this case, `postgres`).
-
After changing the config file, you should not run `metabase` directly. Run it as a service instead (`sudo systemctl start metabase`), otherwise it won't detect the config file and may create some files directly in your home directory. If you want to run it automatically on startup, use `sudo systemctl enable metabase` (it takes about 800 MB of RAM when idle). You can also check the service status using `sudo systemctl status metabase`.
-
Set up the Metabase account, data source (Spark SQL), etc. by going to `localhost:3000` (you can do this later). Make sure you have already set up Hive and Spark correctly, otherwise you won't be able to connect to the data source. The default database for Hive is `default`; it's different because we are using the `hive2` JDBC driver to connect, not the `postgres` or `mysql` one. You can check this from the notebook file. As for the host and port, use `localhost` and `10000`.
-
Done, continue to the next step on the main guide.
Apache Hadoop is a parallel data processing engine where big data can be distributed across several nodes. There are 2 types of nodes: the master (name) node and the worker (data) node.
-
Install and configure a Java environment first (see below).
-
Install Apache Hadoop from the AUR (`yay -S hadoop`). Apache Hadoop is required by Apache Hive, which is required by Metabase. By default, Hadoop will be compiled from source instead of using the ready-made binary. To use the binary version, don't continue immediately when asked about a clean build (in yay), modify the `PKGBUILD` file in `~/.cache/yay/hadoop` (see below), then choose "[N]one" in yay to continue. The modified `PKGBUILD` file should (more or less) look like this:
```sh
# ...

# Modify the source URL only, leave the rest
source=("https://dlcdn.apache.org/hadoop/common/hadoop-$pkgver/hadoop-$pkgver.tar.gz"
        "XXXXXXXXXXXXXXXXXXXX"
        "XXXXXXXXXXXXXXXXXXXX"
        "XXXXXXXXXXXXXXXXXXXX")

# Replace the first line with SKIP (no file corruption check!)
sha256sums=('SKIP'
            'XXXXXXXXXXXXXXXXXXXX'
            'XXXXXXXXXXXXXXXXXXXX'
            'XXXXXXXXXXXXXXXXXXXX')

# Remove build() function and replace it with prepare()
prepare() {
  # Use similar folder structure as if we built it from source
  # See the beginning of package() function for folder structure reference
  mkdir -p hadoop-rel-release-$pkgver/hadoop-dist/target
  mv hadoop-$pkgver hadoop-rel-release-$pkgver/hadoop-dist/target/hadoop-$pkgver
}

# ...
```
-
Check whether the Hadoop env vars are already set by looking at `/etc/profile.d/hadoop.sh` (the file name may vary). If not, create the file with content similar to this:
```sh
export HADOOP_COMMON_LIB_NATIVE_DIR=/usr/lib
export HADOOP_CONF_DIR=/etc/hadoop
export HADOOP_LOG_DIR=/var/log/hadoop
export HADOOP_USERNAME=hadoop
# Hive still doesn't support Java 11 yet
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
```
Load the new env vars by running `source /etc/profile.d/hadoop.sh` in your shell (e.g. Bash, Zsh).

🔴 Important: When running a program as root (e.g. via `sudo`), it will not load user env vars from `/etc/profile.d` by default, so you may need to use `sudo -i` (to invoke the login shell) if Hadoop still complains about missing variables.

Also, other than `/etc/profile.d/hadoop.sh`, a Hadoop env file is also located at `/etc/hadoop/hadoop-env.sh`. The Arch wiki recommends editing `JAVA_HOME` in that file instead, but it will be easier for us to just modify one file.
-
After that, you can also modify the `/etc/conf.d/hadoop` file with content similar to `/etc/profile.d/hadoop.sh`, but remove the `export` keywords and any shell `$` variables (because it's not a shell file). The difference between this file (`/etc/conf.d/hadoop`) and the previous one (`/etc/profile.d/hadoop.sh`) is that this one is loaded when you run Hadoop as a service/daemon (which you may want to do at a later step). Some of you may prefer this to running Hadoop manually in a shell every time you need it. If you want to run Hadoop directly (as a user/root, not as a service), there's no need to modify this file. By default, the service runs under the `hadoop` user (it's not a normal user; it has no password or home directory by default).
-
To test Hadoop, try using a single node first. In this guide, I will be using the pseudo-distributed mode (my main reference).

Modify `/etc/hadoop/core-site.xml`:
```xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
```
Modify `/etc/hadoop/hdfs-site.xml`:
```xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///mnt/hadoop/${user.name}/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///mnt/hadoop/${user.name}/dfs/data</value>
    </property>
</configuration>
```
If you want to know more, here are the default configurations for core-site.xml and hdfs-site.xml. If you take a deeper look at those links, the default locations for `dfs.namenode.name.dir` and `dfs.datanode.data.dir` are inside `hadoop.tmp.dir`, which is cleaned on each startup. That's why we move them to other directories, so the data can be retained. Using `/mnt` seems reasonable in this case.

Also, we should make `/mnt/hadoop` accessible by the user we want to run Hadoop as (e.g. `root`, `hadoop`, or your username) by running these commands:
```sh
# I think /mnt/hadoop should just be owned by root
sudo mkdir /mnt/hadoop
# Create /mnt/hadoop/username (e.g. hadoop), see hdfs-site.xml
sudo mkdir /mnt/hadoop/hadoop
# User and group to own the directory (e.g. hadoop:hadoop)
sudo chown hadoop:hadoop /mnt/hadoop/hadoop
```
🔴 Important: Make sure you consistently use the same username and group (they will be used again at a later step), or just use `hadoop` if you're not sure, because the services/daemons also use that by default. The DFS directory config we created previously is also based on the username, so changing users may cause path and permission issues.
-
Since we are using the pseudo-distributed mode, we may as well set Hadoop to use YARN in case we want to use (real) multiple nodes in the future. However, if you only want to use a single node, you can skip this step.

Modify `/etc/hadoop/mapred-site.xml`:
```xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
```
Modify `/etc/hadoop/yarn-site.xml`:
```xml
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
```
Default config for reference: mapred-default.xml, yarn-default.xml.
-
Format a new Distributed File System (DFS) for the master node by running `sudo -i hdfs namenode -format`. The DFS directory will be created as specified in the `hdfs-site.xml` file.

🔴 Important: Note that by running with `sudo`, your user will be `root`, not `hadoop` or your original user. This may cause inconsistencies (e.g. the DFS directory not being created yet) when you force Hadoop to run as another user at a later step. Running as a non-root user is actually the right way, because running as root is dangerous and `ssh` (needed by Hadoop) disallows root password login by default.

🔵 Information: In this guide, we will try running it as root first because ~~I want you to suffer the same way as I did~~ that's how I troubleshot what was wrong and came up with solutions. I hope my findings may be useful to some of you.
-
Start the master node and worker node(s) by running `sudo -i start-all.sh` (it's the same as running both `start-dfs.sh` and `start-yarn.sh`). Also note that the user will be `root` if we do it this way.

If Hadoop complains about `HDFS_NAMENODE_USER`, `HDFS_DATANODE_USER`, and some other variables (because you ran it as root), you can add all the complained-about variables to `/etc/profile.d/hadoop.sh` like this:
```sh
# Just add these, don't delete other existing variables
export HDFS_NAMENODE_USER=hadoop
export HDFS_DATANODE_USER=hadoop
export HDFS_SECONDARYNAMENODE_USER=hadoop
export YARN_NODEMANAGER_USER=hadoop
export YARN_RESOURCEMANAGER_USER=hadoop
```
Also, if you take a look at the `start-dfs.sh` file, there is another variable called `HDFS_DATANODE_SECURE_USER`. You can also add it to `/etc/profile.d/hadoop.sh` (optional), but with `hdfs` as the value.

If Hadoop still complains (again) because it can't connect to localhost via `ssh` (port 22), start or enable the SSH service (`sshd`). Then, execute these commands in your shell to enable passwordless connections (otherwise Hadoop won't be able to connect to the worker nodes):
```sh
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
# Equal to: ssh-copy-id -i ~/.ssh/id_rsa.pub username@localhost
# Don't execute this multiple times, it will append duplicated keys
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
# To prevent permission denied error for other users
chmod 0600 ~/.ssh/authorized_keys
```
Notice that when you try to run `start-all.sh` again, it may still complain about `Permission denied (publickey,password)`. This is likely because your original user's home directory (`~` or `/home/username`) is not accessible by SSH when running as `hadoop` or `root`, or SSH is looking for key files in the wrong directory. To troubleshoot this issue, try running SSH in debug mode as different user(s):
```sh
# Connect to hadoop as current user
#ssh -v hadoop@localhost
# Connect to hadoop as root
#sudo ssh -v hadoop@localhost
# Connect to hadoop as hadoop
sudo -u hadoop ssh -v hadoop@localhost
```
From the debug log, it turns out that the user `hadoop` actually has a non-standard home directory located at `/var/lib/hadoop`, and SSH is trying to access the key files from that directory instead (`/var/lib/hadoop/.ssh`). In conclusion, we need to set up a passwordless connection like in the previous step, but this time running the SSH tools as `hadoop`:
```sh
sudo -i -u hadoop
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
exit
```
-
If, after all that, Hadoop still complains about missing env vars (probably due to SELinux shenanigans), you have 2 options (and you can do both):

- Add all relevant variables from `/etc/profile.d/hadoop.sh` to `/etc/hadoop/hadoop-env.sh` (just append them at the end). Though you may still need to use `sudo -i` initially to load some Hadoop env vars from `/etc/profile.d/hadoop.sh`, since our config files are stored in a non-standard directory, or
- Use the provided Hadoop services/daemons.

You can do option 1 by yourself. As for option 2, check the available service names by running `ls /usr/lib/systemd/system/hadoop*`. The advantage of using services is that they run as the `hadoop` user by default, and they don't read env vars from `/etc/profile.d` but from `/etc/conf.d/hadoop` instead.

Before running the services, you may want to reformat the DFS on the master node and delete the old one, since we ran it as `root` previously:
```sh
sudo rm -r /mnt/hadoop/root
# Will create /mnt/hadoop/hadoop this time
sudo -i -u hadoop hdfs namenode -format
```
Running `start-all.sh` should be equal to running:
```sh
sudo systemctl start hadoop-namenode hadoop-datanode hadoop-secondarynamenode hadoop-resourcemanager hadoop-historyserver
```
In case some Hadoop processes are already running, stop them all by running `sudo -i stop-all.sh` and:
```sh
sudo systemctl stop hadoop-namenode hadoop-datanode hadoop-secondarynamenode hadoop-resourcemanager hadoop-historyserver
```
Check all services' status by running:
```sh
sudo systemctl status hadoop-namenode hadoop-datanode hadoop-secondarynamenode hadoop-resourcemanager hadoop-historyserver
```
-
Now we can start testing Hadoop. I'm still using this reference.

Start `sshd` and all Hadoop services if they haven't been started already. Access the master node web UI at `http://localhost:9870` to see whether it can be reached. You can also check how many worker nodes are running from here (if it's zero, a solution is provided below, just continue for now).

Make the required user directory on the Hadoop DFS by running `hdfs dfs -mkdir -p /user/hadoop` (it seems that `sudo` is not required to access the DFS). Then, make a new folder named "input" for that user by running `hdfs dfs -mkdir input`.

Try copying files from the real system into the DFS `input` folder by running `hdfs dfs -put /etc/hadoop/*.xml input`. If an error occurs, probably no worker node is running; try checking what's wrong by running the worker node manually (`sudo -i hdfs datanode`) and watching the output. I got an `Incompatible clusterIDs` error; here's what I did:

- Stop all Hadoop services (`sudo -i stop-all.sh`).
- Delete the master and worker node DFS (`sudo rm -r /mnt/hadoop/hadoop/dfs/*`).
- Format a new master node DFS by running `sudo -i hdfs namenode -format`. You can also use `sudo -i hdfs namenode -format -clusterId <cluster_id>` to force the cluster id.
- Run `sudo -i start-all.sh` and repeat all previous steps (making the user directory, etc.).

Run the MapReduce example (it searches the input files for strings matching a regex and counts the matches) by running `sudo -i hadoop jar /usr/share/hadoop/mapreduce/hadoop-mapreduce-examples-X.X.X.jar grep input output 'dfs[a-z.]+'` (where X.X.X is your Hadoop version).

Try getting the `output` files from the DFS to our real system (the `Downloads` folder) by running `hdfs dfs -get output ~/Downloads/output`. Now check the output files by running `cat ~/Downloads/output/*`.

Delete the output files by running `rm -r ~/Downloads/output` (please make sure there's nothing important there first).

🔵 Information: Other than using the CLI, you can also upload and delete files using the web UI: access the "Utilities" menu, then choose "Browse the file system" (or simply use this URL: `http://localhost:9870/explorer.html`). There's also the YARN web UI at `http://localhost:8088`, but we can't upload files there.
-
If everything works correctly, continue by configuring Apache Hive (see below).
Apache Hive is a data warehouse system built on top of Hadoop, providing distributed data query and analysis. If used together with Spark, the warehouse data (containing tables and databases) is saved to the Hadoop DFS.
-
If you haven't installed any database yet, refer to the Arch wiki to set up either PostgreSQL or MariaDB (MySQL). You don't need to create your own database user, because we will create one in the next step.
-
Download the latest version of Apache Hive (it's not available on the AUR yet). Alternatively, you can use my local `PKGBUILD` config; see how to install a local package manually on the Arch wiki. You can skip a few steps ahead if you're using this method.
-
Create a new user called `hive` (or just use `hadoop`; it will also work).
```sh
sudo -i
getent group hive || groupadd hive
getent passwd hive || useradd -g hive -r -d /var/lib/hive hive
mkdir -p /var/{lib,log}/hive
chown -R hive:hive /var/{lib,log}/hive
exit
```
Also create the user in the database. On PostgreSQL, you can do this:
```sh
# Login as user "postgres"
sudo -i -u postgres

# Create a new user on PostgreSQL
# You should limit the new user privileges
createuser --interactive
# Enter name of role to add: hive
# Shall the new role be a superuser? (y/n) n
# Shall the new role be allowed to create databases? (y/n) y
# Shall the new role be allowed to create more new roles? (y/n) n

# Create a password for the new user "hive"
psql -c "ALTER USER hive PASSWORD 'hive';"

exit
```
Also, please create a database named `hive_metastore` (or any other name) to be used by Hive later (I recommend DBeaver if you prefer a GUI tool). If you don't want to use a GUI tool, use these commands (for PostgreSQL):
```sh
# Login as user "hive"
sudo -i -u hive
# Create a new database
createdb hive_metastore
exit
```
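If you prefer a fully non-interactive setup, the user and database creation above can also be done with plain SQL (a sketch; run it as the `postgres` superuser, e.g. via `sudo -i -u postgres psql`, and change the password):

```sql
-- Same result as the interactive createuser/createdb steps above
CREATE ROLE hive LOGIN CREATEDB PASSWORD 'hive';
CREATE DATABASE hive_metastore OWNER hive;
```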
-
Go to the download folder (`cd download-folder`). Extract the downloaded archive to `/opt/apache-hive` by using:
```sh
sudo tar -xvf apache-hive-X.X.X-bin.tar.gz -C /opt/
# Rename the root folder to "apache-hive"
sudo mv /opt/apache-hive-X.X.X-bin /opt/apache-hive
```
-
Create a new `profile.d` env script (`sudo nano /etc/profile.d/apache-hive.sh`) with content like this:
```sh
export HIVE_HOME=/opt/apache-hive
export HIVE_LOG_DIR=/var/log/hive
export PATH=$PATH:$HIVE_HOME/bin
```
Load the new env vars by running `source /etc/profile.d/apache-hive.sh` in your shell (e.g. Bash, Zsh).
-
Create `hive-site.xml`, either by copying the bundled template (`cp /opt/apache-hive/conf/hive-default.xml.template /opt/apache-hive/conf/hive-site.xml`) or by creating the file directly (`sudo nano /opt/apache-hive/conf/hive-site.xml`).
-
Modify `hive-site.xml` to use either PostgreSQL or MariaDB (MySQL):
```xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:postgresql://localhost:5432/hive_metastore</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>org.postgresql.Driver</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hive</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>hive</value>
    </property>
</configuration>
```
Note that `5432` is the port used by PostgreSQL, while `hive_metastore` is the database name to be used. If you use MariaDB, replace `postgresql` with `mysql` and the port with `3306`.
-
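For MariaDB, the two connection properties that change would look roughly like this (a sketch; it assumes MySQL Connector/J 8, whose driver class is `com.mysql.cj.jdbc.Driver` and whose jar must also be placed on Hive's classpath, e.g. in `$HIVE_HOME/lib`):

```xml
<!-- Hypothetical MariaDB/MySQL variant of the properties above -->
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive_metastore</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.cj.jdbc.Driver</value>
</property>
```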
You will need to initialize the database schema first using `schematool -dbType postgres -initSchema` (replace `postgres` with `mysql` if you use MariaDB).
-
Start the needed Hive services (check the available services using `hive --service help`):
```sh
hive --service metastore --hiveconf hive.root.logger=INFO,console
hive --service hiveserver2 --hiveconf hive.root.logger=INFO,console
```
Run both commands in separate terminals if needed. Alternatively, you can run them as services by creating the files below:

- `/usr/lib/systemd/system/hive-hiveserver2.service`
  ```ini
  [Unit]
  Description=Hive Server2
  After=network-online.target

  [Service]
  Type=simple
  User=hive
  Group=hive
  EnvironmentFile=/etc/conf.d/apache-hive
  ExecStart=/opt/apache-hive/bin/hive --service hiveserver2 --hiveconf hive.root.logger=INFO,console

  [Install]
  WantedBy=multi-user.target
  ```
- `/usr/lib/systemd/system/hive-metastore.service`
  ```ini
  [Unit]
  Description=Hive Metastore
  After=network-online.target

  [Service]
  Type=simple
  User=hive
  Group=hive
  EnvironmentFile=/etc/conf.d/apache-hive
  ExecStart=/opt/apache-hive/bin/hive --service metastore --hiveconf hive.root.logger=INFO,console

  [Install]
  WantedBy=multi-user.target
  ```
- `/etc/conf.d/apache-hive` (see `/etc/conf.d/hadoop` for reference)
  ```sh
  HADOOP_COMMON_LIB_NATIVE_DIR=/usr/lib
  HADOOP_CONF_DIR=/etc/hadoop
  HADOOP_LOG_DIR=/var/log/hadoop
  HADOOP_USERNAME=hadoop
  HIVE_HOME=/opt/apache-hive
  HIVE_LOG_DIR=/var/log/hive
  JAVA_HOME=/usr/lib/jvm/java-8-openjdk
  ```

Then, you can start them by using:
```sh
sudo systemctl start hive-metastore
sudo systemctl start hive-hiveserver2
```
-
-
Install `net-tools` and use `sudo netstat -plnt` to see all used ports. This can be used to check whether Hive/Hadoop/Spark is working correctly.

The Hive `metastore` should use port `9083`, while `hiveserver2` should use port `10000`/`10001` by default. You can also access the Hive web UI at `localhost:10002`.

In my case, port `10000` was not listening; it turns out `hiveserver2` will not report errors directly and will keep running even if an error occurred. Check the logs in `/tmp/username/hive.log` and `/var/log/hive`; I got the `AppClassLoader cannot be cast` error. After searching on Google, it seems that Hive is still not designed for Java 11 yet (as of Hive v3.1.3).
-
Try connecting to the `hive2` database by using `beeline -u jdbc:hive2://localhost:10000/default -n hive -p hive`.

If you get the `hadoop is not allowed to impersonate hive` error, modify `/opt/apache-hive/conf/hive-site.xml` (don't delete the existing config):
```xml
<property>
    <name>hive.server2.enable.doAs</name>
    <value>false</value>
</property>
```
Note that this may be dangerous in production, but the other methods I tried didn't work. If no more errors occur when re-running Hive and `beeline`, just `!quit`, since we have nothing more to test.
-
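For reference, the commonly suggested alternative to disabling `doAs` is Hadoop's proxy-user settings in `/etc/hadoop/core-site.xml` (a sketch; this may be among the approaches that didn't work in this setup, so treat it as a reference only):

```xml
<!-- Allow the "hive" user to impersonate other users from any host/group -->
<property>
    <name>hadoop.proxyuser.hive.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.hive.groups</name>
    <value>*</value>
</property>
```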
Done, go back to configuring Apache Spark (see above).
There are 2 types of Java environments: the JDK and the JRE. The JRE is the lightweight version, used only to run Java applications. The JDK is the full version of Java, which is usually used as `JAVA_HOME` and contains the compiler (`javac`).
-
See what Java environments are installed on your machine by using `archlinux-java status`.
-
Check whether a JDK exists by running `where javac`. If not, you may need to install one: use `sudo pacman -S XXX-openjdk`, where `XXX` can be `jdk` (latest Java), `jdk8` (Java 8), `jdk11` (Java 11), and so on. As of Hive v3.1.3, Java 11 is still not fully supported, so it's safer to use Java 8.
-
Recheck the Java environments by running `archlinux-java status` again.
-
Optionally, set the default Java environment by running `sudo archlinux-java set XXX`, where `XXX` is one of the environments listed by the previous command.
- Apache Hadoop documentation.
- Apache Hive getting started (also here and there).
- Apache Spark documentation.
- PySpark getting started.
- Pier Taranti's Hadoop, Hive, Spark setup on Medium.