This is an overview of the AWS SDK for pandas (awswrangler), an open-source Python library that makes it easier to work with data from AWS services.
AWS SDK for pandas, formerly known as AWS Data Wrangler, is built on top of pandas, Apache Arrow, and Boto3. It offers abstracted functions for common ETL tasks such as loading and unloading data from data lakes, data warehouses, and databases using Python. It integrates with AWS services such as Amazon S3, AWS Glue, Amazon Athena, Amazon DynamoDB, Amazon CloudWatch, Amazon Redshift, Amazon Timestream, and Amazon EMR. It can read and write Excel, JSON, CSV, and Parquet files on S3, interact with data and metadata through AWS Glue, and run SQL queries on Amazon Athena.
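As a rough sketch of how the S3 readers line up with the file formats listed above, the snippet below picks a reader by file extension. The wr.s3.read_csv, read_json, read_parquet, and read_excel functions belong to awswrangler; the READERS table and the reader_name_for and read_any helpers are hypothetical names introduced only for this illustration.

```python
from pathlib import PurePosixPath

# Map file extensions to the awswrangler S3 reader that handles them.
READERS = {
    ".csv": "read_csv",
    ".json": "read_json",
    ".parquet": "read_parquet",
    ".xlsx": "read_excel",
}


def reader_name_for(path: str) -> str:
    """Return the name of the wr.s3 reader for an S3 object path."""
    suffix = PurePosixPath(path).suffix.lower()
    try:
        return READERS[suffix]
    except KeyError:
        raise ValueError(f"No reader registered for {suffix!r}") from None


def read_any(path: str):
    """Read one S3 object into a DataFrame (needs awswrangler and AWS credentials)."""
    import awswrangler as wr  # imported lazily so the helpers above work without it

    return getattr(wr.s3, reader_name_for(path))(path=path)
```

With credentials in place, read_any("s3://your_bucket_name/directory_name/file.parquet") would hand the path to wr.s3.read_parquet.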
Note: Before working with AWS Data Wrangler, you need to install and configure the AWS CLI on your Linux machine.
Before installing awswrangler, install Python 3 on the Linux machine with the following commands (prefix them with sudo if you are not running as root):
$ apt update
$ apt install -y python3
After installing Python 3, install pip and the venv module with the following commands:
$ apt install python3-pip
$ pip3 install --upgrade pip
$ apt install -y python3-venv
Create a virtual environment with the command:
$ python3 -m venv my_env_project
The above command creates a directory named my_env_project in the current directory, which contains pip, the interpreter, scripts, and libraries. View its contents with:
$ ls my_env_project/
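The same step can be reproduced from Python with the standard-library venv module. This is a minimal sketch: it builds the environment in a temporary directory so nothing is left behind, and create_env is a hypothetical helper name.

```python
import tempfile
import venv
from pathlib import Path


def create_env(parent: Path, name: str = "my_env_project") -> Path:
    """Create a virtual environment under `parent` and return its path."""
    env_dir = parent / name
    # with_pip=False keeps the sketch fast; `python3 -m venv` installs pip by default.
    venv.create(env_dir, with_pip=False)
    return env_dir


with tempfile.TemporaryDirectory() as tmp:
    env_dir = create_env(Path(tmp))
    entries = sorted(p.name for p in env_dir.iterdir())
    print(entries)  # on Linux this typically includes bin, lib and pyvenv.cfg
```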
You can now activate the virtual environment with the command:
$ source my_env_project/bin/activate
The command prompt changes to show your environment and looks like this:
(my_env_project) ubuntu@DESKTOP-I4BBP24:~$
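Besides the changed prompt, activation can be confirmed from Python itself. The sketch below compares sys.prefix with sys.base_prefix, which differ only inside a virtual environment; in_virtualenv is a hypothetical helper name.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
print("interpreter:", sys.executable)
```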
Now install the awswrangler package into the virtual environment:
(my_env_project)$ pip install awswrangler
If you have not configured the AWS CLI yet, configure it now:
(my_env_project)$ aws configure
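aws configure stores the access keys in a credentials file, by default ~/.aws/credentials on Linux. A small sketch, assuming that default location, lists the profile names found there without printing any secrets; configured_profiles is a hypothetical helper.

```python
import configparser
from pathlib import Path


def configured_profiles(credentials_file: Path) -> list[str]:
    """Return the profile names found in an AWS CLI credentials file."""
    parser = configparser.ConfigParser()
    parser.read(credentials_file)  # yields no sections if the file is missing
    return parser.sections()


# Default location written by `aws configure` on Linux (assumed here).
default_file = Path.home() / ".aws" / "credentials"
print(configured_profiles(default_file))
```

An empty list suggests that aws configure has not been run for the current user.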
Run the python command inside the virtual environment to open the interpreter:
(my_env_project)$ python
Any package you install inside the virtual environment can be imported into your project.
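One way to confirm that a package is visible to the active interpreter is to look it up before importing it. This sketch uses only the standard library; installed_version is a hypothetical helper name.

```python
import importlib.util
from importlib import metadata


def installed_version(package: str):
    """Return the installed version of `package`, or None if it cannot be imported."""
    if importlib.util.find_spec(package) is None:
        return None
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        # Importable but without distribution metadata (for example a stdlib module).
        return "unknown"


print("awswrangler:", installed_version("awswrangler"))
```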
Now let's test awswrangler with an S3 bucket.
(my_env_project) ubuntu@DESKTOP-I4BBP24:~/my_env_project$ python
>>> import awswrangler as wr
>>> s3_bucket_name = 'your_bucket_name'
>>> s3_bucket_file_path = 'directory_name/'
>>> s3_bucket_path = f"s3://{s3_bucket_name}/{s3_bucket_file_path}"
>>> df = wr.s3.read_csv(path=s3_bucket_path, path_suffix=['.csv'])
>>> print(df)
To exit the interpreter, type:
>>> quit()
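The interpreter session above can be wrapped in a reusable function. In this sketch s3_uri and load_csv_prefix are hypothetical helpers; the reader argument exists only so the function can be exercised without AWS access, and by default it falls back to wr.s3.read_csv.

```python
def s3_uri(bucket: str, prefix: str = "") -> str:
    """Build an s3:// URI from a bucket name and an optional key prefix."""
    return f"s3://{bucket.strip('/')}/{prefix.lstrip('/')}"


def load_csv_prefix(bucket: str, prefix: str, reader=None):
    """Read the .csv objects under a prefix using awswrangler's CSV reader."""
    if reader is None:
        import awswrangler as wr  # lazy import: only needed for the real call

        reader = wr.s3.read_csv
    return reader(path=s3_uri(bucket, prefix), path_suffix=[".csv"])
```

With credentials configured, df = load_csv_prefix("your_bucket_name", "directory_name/") would mirror the session shown earlier.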
We can also create a Python script and run it from inside the Python 3 virtual environment:
(my_env_project) ubuntu@ubuntu:~$ vim script.py
Copy and paste the following code into the script file:
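import awswrangler as wr

# Placeholders: replace with your own bucket name and key prefix
s3_bucket_name = 'your_bucket_name'
s3_bucket_file_path = 'directory_name/'
s3_bucket_path = f"s3://{s3_bucket_name}/{s3_bucket_file_path}"

# Read the .csv objects under the prefix into a DataFrame
df = wr.s3.read_csv(path=s3_bucket_path, path_suffix=['.csv'])
print(df)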
To execute the script, run:
(my_env_project) ubuntu@ubuntu:~$ python script.py
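If the bucket and prefix should not be hard-coded, a variant of the script can take them from the command line. This is a sketch under that assumption; parse_args and main are hypothetical names, and the positional arguments are a design choice rather than anything awswrangler requires.

```python
import argparse


def parse_args(argv=None):
    """Parse the bucket name and key prefix from the command line."""
    parser = argparse.ArgumentParser(
        description="Print CSV data stored under an S3 prefix."
    )
    parser.add_argument("bucket", help="S3 bucket name")
    parser.add_argument(
        "prefix", nargs="?", default="", help="key prefix, e.g. directory_name/"
    )
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    import awswrangler as wr  # imported here so --help works without the package

    path = f"s3://{args.bucket}/{args.prefix}"
    print(wr.s3.read_csv(path=path, path_suffix=[".csv"]))


# When saved as script.py, finish the file with:
#     if __name__ == "__main__":
#         main()
```

It would then be run as python script.py your_bucket_name directory_name/.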
To leave the virtual environment, run the deactivate command (exit or Ctrl+D closes the whole shell session instead):
(my_env_project) ubuntu@ubuntu:~$ deactivate
The above command does not remove the my_env_project directory. To delete the virtual environment, remove that directory with the rm -r command.
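Deleting the directory can also be scripted with a small safety check. The sketch below refuses to delete anything that lacks a pyvenv.cfg file, which venv and recent virtualenv releases write into every environment; remove_virtualenv is a hypothetical helper name.

```python
import shutil
from pathlib import Path


def remove_virtualenv(env_dir: Path) -> bool:
    """Delete a virtual environment directory; return True if it was removed."""
    # pyvenv.cfg marks a directory as a virtual environment, so its presence
    # guards against deleting an unrelated directory by mistake.
    if not (env_dir / "pyvenv.cfg").is_file():
        return False
    shutil.rmtree(env_dir)
    return True
```

For example, remove_virtualenv(Path("my_env_project")) would delete the environment created earlier once it has been deactivated.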
To use awswrangler from a Jupyter notebook, create a directory and change into it; the virtual environment will be created there:
$ mkdir jupyter_notebook
$ ls jupyter_notebook
$ cd jupyter_notebook
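Now create a Python virtual environment named jupyter_notebook (this uses the virtualenv tool, which must be installed separately, for example with pip3 install virtualenv):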
$ virtualenv jupyter_notebook
Activate the virtual environment to work inside it:
$ source jupyter_notebook/bin/activate
Install Jupyter inside the virtual environment:
(jupyter_notebook) ubuntu@ubuntu:~/jupyter_notebook$ pip3 install jupyter
Create a kernel that Jupyter Notebook can use to run Python code inside the virtual environment:
(jupyter_notebook) ubuntu@ubuntu:~/jupyter_notebook$ ipython kernel install --user --name=python-env
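A kernel is registered as a directory containing a kernel.json file whose argv entry names the interpreter to launch. The sketch below reads that file to check which Python the python-env kernel uses. The per-user path is the usual Linux location and is an assumption here (jupyter kernelspec list prints the actual one); read_kernel_spec is a hypothetical helper.

```python
import json
from pathlib import Path


def read_kernel_spec(kernels_dir: Path, name: str) -> dict:
    """Load kernel.json for the kernel called `name`."""
    spec_file = kernels_dir / name / "kernel.json"
    with spec_file.open(encoding="utf-8") as fh:
        return json.load(fh)


# Usual per-user kernel location on Linux (an assumption, see lead-in).
user_kernels = Path.home() / ".local" / "share" / "jupyter" / "kernels"
if (user_kernels / "python-env" / "kernel.json").is_file():
    spec = read_kernel_spec(user_kernels, "python-env")
    print(spec["display_name"], spec["argv"][0])
```

If the printed interpreter path lies inside the jupyter_notebook directory, the kernel runs in the virtual environment.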
You can launch the web interface from the terminal (the --allow-root flag is only needed when running as the root user):
(jupyter_notebook) ubuntu@ubuntu:~/jupyter_notebook$ jupyter notebook --allow-root
The terminal prints a link that you can open in your browser. In the web interface, click the New drop-down menu on the right side and select the python-env kernel.
Install awswrangler in the virtual environment that backs the python-env kernel with the command:
pip install awswrangler
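From inside a notebook, a plain pip call may target a different interpreter than the one the kernel runs. A common workaround, sketched here, is to invoke pip through sys.executable so the package lands in the kernel's own environment; pip_install_command and install are hypothetical helper names.

```python
import subprocess
import sys


def pip_install_command(package: str) -> list[str]:
    """Build a pip command bound to the interpreter running this code."""
    return [sys.executable, "-m", "pip", "install", package]


def install(package: str) -> None:
    """Install `package` into the current environment (requires network access)."""
    subprocess.check_call(pip_install_command(package))


print(" ".join(pip_install_command("awswrangler")))
```

Calling install("awswrangler") in a notebook cell would then run that command.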
Run the following code to test awswrangler with your S3 bucket and read the data from the .csv files.
import awswrangler as wr

s3_bucket_name = 'your_bucket_name'
s3_bucket_file_path = 'directory_name/'
s3_bucket_path = f"s3://{s3_bucket_name}/{s3_bucket_file_path}"
df = wr.s3.read_csv(path=s3_bucket_path, path_suffix=['.csv'])
print(df)
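Because the path points at a prefix rather than a single object, path_suffix=['.csv'] restricts which objects are read. The idea can be pictured with a plain-Python filter. This illustrates the behaviour only and is not the library's implementation; the keys and the filter_by_suffix helper are hypothetical.

```python
def filter_by_suffix(keys, suffixes):
    """Keep only the keys that end with one of the given suffixes."""
    return [key for key in keys if key.endswith(tuple(suffixes))]


# Hypothetical object keys under the directory_name/ prefix.
keys = [
    "directory_name/sales.csv",
    "directory_name/sales.json",
    "directory_name/returns.csv",
    "directory_name/_SUCCESS",
]
print(filter_by_suffix(keys, [".csv"]))
# ['directory_name/sales.csv', 'directory_name/returns.csv']
```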
When you are done with the project, exit Jupyter from the browser. If you no longer need the kernel, you can uninstall it with the command:
(jupyter_notebook) ubuntu@ubuntu:~/jupyter_notebook$ jupyter-kernelspec uninstall python-env
To leave the virtual environment:
(jupyter_notebook) ubuntu@ubuntu:~/jupyter_notebook$ deactivate
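To reset the virtual environment to a clean state, run the command below. Note that --clear empties and recreates the environment rather than deleting it, so remove the directory with rm -r if you want to delete it completely.
virtualenv --clear /home/ubuntu/jupyter_notebook/jupyter_notebook/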