cp .env.example .env
- It's recommended to use pyenv and to install Python 3.11 locally inside the app's directory so it doesn't clash with other Python versions on your machine
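If Python 3.11.7 isn't installed via pyenv yet, install it first:
pyenv install 3.11.7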
pyenv local 3.11.7
- Now that Python is available (check with python --version), set up a virtual environment in order to install the requirements
python -m venv .venv && source .venv/bin/activate
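With the virtual environment active, it can be worth upgrading pip before installing (optional):
pip install --upgrade pip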
- Install Python requirements
pip install -r requirements.txt
Airflow pipelines are part of the Knowledge Mining service and are used to create automated data processing pipelines. The main purpose of a pipeline is to create content for Knowledge Assets based on the input file.
Generate the default Airflow config
airflow config list --defaults
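To write these defaults out as your actual config file (assuming AIRFLOW_HOME is your Airflow home directory, ~/airflow by default), you can redirect the output:
airflow config list --defaults > "${AIRFLOW_HOME:-$HOME/airflow}/airflow.cfg"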
Change the following lines in the config (all under the [core] section):
load_examples = False
dags_folder = YOUR_PATH_TO/edge-node-knowledge-mining/dags
parallelism = 32
max_active_tasks_per_dag = 16
max_active_runs_per_dag = 16
enable_xcom_pickling = True
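To verify a setting took effect, you can query it directly, e.g.:
airflow config get-value core dags_folder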
Initialize the Airflow metadata database and create an admin user:
airflow db init
airflow users create --role Admin --username admin --email admin --firstname admin --lastname admin --password admin
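To confirm the user was created:
airflow users list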
In order to have Airflow running, first start the scheduler; it's what picks up new DAGs/jobs:
airflow scheduler
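Once the scheduler is running, you can confirm it discovered the DAGs in your dags_folder:
airflow dags list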
Then unpause the DAGs you want available:
airflow dags unpause exampleDAG
airflow dags unpause pdf_to_jsonld
airflow dags unpause simple_json_to_jsonld
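As a sanity check, you can also trigger a run manually from the CLI:
airflow dags trigger exampleDAG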
To keep track of how your pipelines perform, start the Airflow webserver; the dashboard will be available at http://localhost:8080. Once everything is running, the pipelines should be listed, and unpaused, at http://localhost:8080/home
Start the Airflow webserver on the port where you want the dashboard served:
airflow webserver --port 8080
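Once the webserver is up, Airflow exposes a health endpoint you can poll:
curl http://localhost:8080/health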
Start the API service itself:
python app.py
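Assuming the API listens on port 5005 (as the trigger examples below do), even a GET against the POST-only endpoint confirms the server is listening:
curl -i http://localhost:5005/trigger_pipeline  # an HTTP 405 here still means the service is up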
Create the MySQL database used for API logging (the backticks are required because the name contains hyphens):
CREATE DATABASE `ka-mining-api-logging` CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
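One way to run this from the shell, assuming a local MySQL instance and a root user:
mysql -u root -p -e 'CREATE DATABASE `ka-mining-api-logging` CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;'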
Trigger the pdf_to_jsonld DAG via POST request
curl -X POST http://localhost:5005/trigger_pipeline \
-F "file=@test_pdfs/22pages_eng.pdf" \
-F "pipelineId=pdf_to_jsonld" \
-F "fileFormat=pdf" \
-b "connect.sid=s%3A9XCAe7sos-iY4Z_jIjyVcQYjLaYHVi0H.UeghM8ZRS97nVkZPukbL8Zu%2F%2BbRZSAuOLpq3BMepiD0; Path=/; HttpOnly;"
Trigger the simple_json_to_jsonld DAG via POST request
curl -X POST http://localhost:5005/trigger_pipeline \
-F "file=@test_jsons/entertainment_test.json" \
-F "pipelineId=simple_json_to_jsonld" \
-F "fileFormat=json" \
-b "connect.sid=s%3Aw_26GwYGj1rLvXpGPBQW0M_mQxrfbVMW.jZazIh0iv01R7TiOxmF0WKFjlKTi7rWhZJe1M24E21E; Path=/; HttpOnly"
Trigger the vectorization DAG via POST request
curl -X POST http://localhost:5005/trigger_pipeline \
-F "file=@test_jsonlds/vectorize_test.json" \
-F "pipelineId=vectorize_ka" \
-b "connect.sid=s%3AjLYArFLH7IadiB4dkEDrppgEEQJEqNss.35WzNEW3PySPRIxrDpL5tsRZ%2F%2B%2FNo%2BnZgRPDoRz0y7g; Path=/; HttpOnly;"