Database Management Systems | Project

Abstract

Comma Separated Value (CSV) is one of the most important Open Data formats extensively used in a plethora of domains because of his simplicity and flexibility in storing data. However, data published in this format often does not comply to strict specifications, rising a variety of ambiguities and making the automated data extraction a challenging task. Although the relational data (table) extraction from CSV files has not been studied extensively, there is a novel framework named Pytheas which can effectively manage this challenge. Nevertheless, Pytheas focuses on table discovery which is just an intermediate key step in the CSV processing pipeline. In this work, I deal with the remaining steps, mainly with data type detection one, in order to convert the CSV file in a proper format for data analysis purposes and storing in the relational database management system PostgreSQL. Moreover, I build a dashboard which gives to users a clear data insight and enables them to store and

Installation instructions

The following instructions have been tested on a newly created Windows 10 with Python3.7.12 on a conda environment. Create your conda environment by running conda create --name pytheas-venv python==3.7.12 , and activate it by running conda activate pytheas-venv. Clone the repo to your machine using git git clone https://github.com/JKaraman93/Pytheas.git .

First, you have to setup and run Pytheas, and then the Extra Component.

Pytheas

(Original repo: https://github.com/cchristodoulaki/Pytheas)

Setup

Package the Pytheas project into a wheel, and install it using pip:

cd src
python setup.py sdist bdist_wheel
pip install  --upgrade --force-reinstall dist/pytheas-0.0.1-py3-none-any.whl

Load trained weights and inference

There are pretrained Pytheas rules using a set of 2000 Open Data CSV files from Canadian CKAN portals.

E.g.:

cd pytheas
python ppytheas.py infer -w trained_rules.json -f ../../data/examples/demo6.csv -o inferred_annotation.json

If you want, you can also train Pytheas using your own files and annotations. For more details : https://github.com/JKaraman93/Pytheas/edit/master/src/README.md

Extra Component

Install requirements

cd Extra_Component
pip install -r requirements.txt

Inference

The Component uses the Pytheas output file i.e., inferred_annotation.json. You only need to specify the CSV file as follows:

python run.py -c ../pytheas/data/examples/demo6.csv

Dashboard

To display the Dashboard, follow the link in terminal e.g Dash is running on http://127.0.0.1:8050/ .

Name		Name	Last commit message	Last commit date
Latest commit History 133 Commits
Extra_component		Extra_component
data/examples		data/examples
docs		docs
pytheas/rules		pytheas/rules
src		src
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt
requirements.txt.sphinx		requirements.txt.sphinx

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Database Management Systems | Project

Abstract

Installation instructions

Pytheas

(Original repo: https://github.com/cchristodoulaki/Pytheas)

Setup

Load trained weights and inference

Extra Component

Install requirements

Inference

Dashboard

About

Releases

Packages

Languages

JKaraman93/Pytheas

Folders and files

Latest commit

History

Repository files navigation

Database Management Systems | Project

Abstract

Installation instructions

Pytheas

(Original repo: https://github.com/cchristodoulaki/Pytheas)

Setup

Load trained weights and inference

Extra Component

Install requirements

Inference

Dashboard

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages