This ETL (Extract, Transform, Load) project aims to extract metadata from Windows PE files stored in an S3 bucket, transform that data, and load it into a PostgreSQL database.
Project description video: project_description.mp4

The following metadata is extracted from each PE file:
- File path and size
- File type (dll or exe)
- Architecture (x32 or x64)
- Number of imports (integer)
- Number of exports (integer)
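As an illustration only, here is a minimal sketch of how these fields could be pulled from a single file with the pefile library; the function name and return shape are assumptions, not the project's actual code:

```python
import os
import pefile

def extract_metadata(path):
    """Pull the fields listed above out of a single PE file."""
    pe = pefile.PE(path, fast_load=True)
    pe.parse_data_directories()  # needed for the import/export tables

    file_type = "dll" if pe.is_dll() else "exe"
    # 0x8664 is IMAGE_FILE_MACHINE_AMD64; anything else is treated as x32 here
    arch = "x64" if pe.FILE_HEADER.Machine == 0x8664 else "x32"

    num_imports = sum(len(entry.imports)
                      for entry in getattr(pe, "DIRECTORY_ENTRY_IMPORT", []))
    export_dir = getattr(pe, "DIRECTORY_ENTRY_EXPORT", None)
    num_exports = len(export_dir.symbols) if export_dir else 0

    return {
        "path": path,
        "size": os.path.getsize(path),
        "file_type": file_type,
        "architecture": arch,
        "imports": num_imports,
        "exports": num_exports,
    }
```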
Technologies used:
- PySpark: Used for large-scale data processing
- Spark Cluster: Master and worker containers for distributed data processing
- Python pefile library: Extracts metadata from Windows PE files
- PostgreSQL: Database for storing extracted metadata
- Docker: For creating and managing the application and database environments
- Elasticsearch, Logstash, Kibana (ELK stack): For logging and log visualization
Prerequisites:
- Ensure Docker and Docker Compose are installed on your machine to build and run the necessary containers for the application, database, and ELK stack.
- At least 10 GB of free RAM is required to run the application.
- Environment variables: for development, all environment variables are set in the docker-compose.dev.yaml file.
Development setup:
- Build and run the Docker containers:
  - Navigate to the project directory and run the following command:
    docker compose -f docker-compose.dev.yaml up --build
Production setup:
- Environment variables:
  - Create a .env file at the root of the project.
  - Fill the .env file with your specific configuration (see the configuration sketch after this section):
    DB_USER=db_user
    DB_PASSWORD=db_password
    DB_NAME=db_name
    BUCKET_NAME=your_s3_bucket_name
- Build and run the Docker containers:
  - Navigate to the project directory and run the following command:
    docker compose -f docker-compose.prod.yaml up --build
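As a sketch only of how these variables might be consumed inside the application: the variable names match the .env example above, while the database host and everything else here are assumptions.

```python
import os

# Configuration injected by Docker Compose from the .env file.
DB_USER = os.environ["DB_USER"]
DB_PASSWORD = os.environ["DB_PASSWORD"]
DB_NAME = os.environ["DB_NAME"]
BUCKET_NAME = os.environ["BUCKET_NAME"]

# "postgres" as the host name is an assumption (the compose service name).
DATABASE_URL = f"postgresql://{DB_USER}:{DB_PASSWORD}@postgres:5432/{DB_NAME}"
```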
You can modify the number of files to be downloaded by adjusting the command value for the application service in the Docker Compose file, which overrides the Dockerfile's CMD. For instance:
python-app:
  build:
    context: .
    dockerfile: Dockerfile.spark.pythonS
  command: ["1000000"]
In your Dockerfile, you might have an entry similar to:
ENTRYPOINT ["python3", "main.py"]
CMD ["10000"]
This means that, by default, 10000 files will be downloaded unless the command in docker-compose.prod.yaml overrides it (as in the example above, where it is set to 1000000). Attention: the number of files to be downloaded should be less than the number of files in the S3 bucket.
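On the Python side, the value passed through CMD/command typically arrives as the first command-line argument. A hypothetical sketch of how main.py might read it (the actual entrypoint may differ):

```python
import sys

def main() -> None:
    # The CMD/command value from Docker arrives as the first CLI argument;
    # fall back to 10000 to mirror the Dockerfile's default CMD.
    max_files = int(sys.argv[1]) if len(sys.argv) > 1 else 10000
    print(f"Downloading up to {max_files} files from the S3 bucket")

if __name__ == "__main__":
    main()
```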
Once the containers are running, the following dashboards are available:
- Spark Master Dashboard: http://localhost:8080/
- Kibana Dashboard: http://localhost:5601/app/home#/
- Adminer (DB Viewer): http://localhost:8089/
The pipeline performs three steps:
- Extraction: Retrieve Windows PE files from the specified S3 bucket.
- Transformation: Extract the predefined metadata using PySpark and Python's pefile library.
- Load: Store the transformed data in the PostgreSQL database.
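A simplified sketch of this flow, assuming the extract_metadata helper from the earlier pefile example lives in a hypothetical extractor module and that the JDBC connection details match your .env; none of this is the project's actual code:

```python
from pyspark.sql import Row, SparkSession

from extractor import extract_metadata  # hypothetical module holding the earlier sketch

spark = SparkSession.builder.appName("pe-metadata-etl").getOrCreate()

# Extract: local paths of PE files already downloaded from the S3 bucket (placeholders).
paths = ["/data/sample1.exe", "/data/sample2.dll"]

# Transform: run the pefile-based extraction in parallel on the Spark workers.
rows = spark.sparkContext.parallelize(paths).map(lambda p: Row(**extract_metadata(p)))
df = spark.createDataFrame(rows)

# Load: append the metadata to PostgreSQL over JDBC (the table name is an assumption;
# the PostgreSQL JDBC driver jar must be on the Spark classpath).
(df.write
   .format("jdbc")
   .option("url", "jdbc:postgresql://postgres:5432/db_name")
   .option("dbtable", "pe_metadata")
   .option("user", "db_user")
   .option("password", "db_password")
   .option("driver", "org.postgresql.Driver")
   .mode("append")
   .save())
```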
The application uses the ELK stack for logging:
- Elasticsearch: Stores logs
- Logstash: Processes logs
- Kibana: Visualizes logs on a dashboard
- The Kibana dashboard is unprotected and doesn't require any credentials for access. Ensure proper network configurations to secure your data.
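For illustration, a minimal standard-library sketch of how application logs could be shipped to Logstash over TCP; the host name, port, and the assumption of a tcp input with a json_lines codec are not taken from the project's Logstash configuration:

```python
import json
import logging
import socket

class LogstashTcpHandler(logging.Handler):
    """Ships log records as JSON lines over TCP, matching a Logstash
    tcp input configured with the json_lines codec (an assumption)."""

    def __init__(self, host="logstash", port=5000):
        super().__init__()
        self.sock = socket.create_connection((host, port))

    def emit(self, record):
        doc = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        self.sock.sendall((json.dumps(doc) + "\n").encode("utf-8"))

logger = logging.getLogger("etl")
logger.setLevel(logging.INFO)
logger.addHandler(LogstashTcpHandler())
logger.info("Extracted metadata for 42 files")
```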
To interact with LocalStack, you need the AWS CLI installed on your machine. If you haven't installed it yet, you can download it from the AWS website or install it with:
sudo apt install awscli
LocalStack doesn't require real AWS credentials; you can configure the AWS CLI with any placeholder values:
aws configure
When prompted, you can enter the following:
- AWS Access Key ID: test
- AWS Secret Access Key: test
- Default region name: your preferred region (e.g., us-east-1)
- Default output format: json
To create a bucket in LocalStack, use the AWS CLI command with the endpoint URL pointing to your LocalStack instance:
aws --endpoint-url=http://localhost:4566 s3 mb s3://my-bucket
Replace my-bucket with your desired bucket name.
To list all the buckets:
aws --endpoint-url=http://localhost:4566 s3 ls
To list the contents of a bucket:
aws --endpoint-url=http://localhost:4566 s3 ls s3://my-bucket
To recursively list all the files in a bucket:
aws --endpoint-url=http://localhost:4566 s3 ls s3://my-bucket --recursive
To upload a file to the bucket:
aws --endpoint-url=http://localhost:4566 s3 cp /path/to/local/file s3://my-bucket
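Inside the application, the same LocalStack endpoint can be reached with boto3. A hypothetical sketch (the bucket name, key handling, and download location are placeholders):

```python
import boto3

# Point boto3 at LocalStack instead of real AWS; the dummy credentials
# mirror the aws configure values above.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:4566",
    aws_access_key_id="test",
    aws_secret_access_key="test",
    region_name="us-east-1",
)

# List and download up to 10 objects from the bucket.
response = s3.list_objects_v2(Bucket="my-bucket", MaxKeys=10)
for obj in response.get("Contents", []):
    key = obj["Key"]
    local_path = "/tmp/" + key.replace("/", "_")
    s3.download_file("my-bucket", key, local_path)
    print(f"Downloaded {key} -> {local_path}")
```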