# Data Engineering Workshop

A one-day workshop covering Docker, web scraping, regular expressions, PostgreSQL, and Git.

## Prerequisites

##### Any Linux machine/VM with the following packages installed
- Python 3.6 or above
- [docker-ce](https://docs.docker.com/engine/install/ubuntu/)
- [docker-compose](https://docs.docker.com/compose/install/)
- pip3
- git (any recent version)
- PostgreSQL 13
- psycopg2
- bs4
- urllib (part of the Python 3 standard library; it replaces Python 2's urllib2)
- requests
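
The third-party Python packages can be installed with pip3. A typical invocation, using the PyPI package names (`psycopg2-binary` is the prebuilt variant of psycopg2 that avoids compiling against libpq):

    pip3 install beautifulsoup4 requests psycopg2-binary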

##### GitHub account
- Create an account on [GitHub](https://github.com/join) (if you don't already have one)
- Fork [this](https://github.com/UniCourt/DataEngineering-Workshop1) repository and then clone it to your machine
- You can refer to [this](https://docs.github.com/en/get-started/quickstart/fork-a-repo) guide to understand how to fork and clone

## What will you learn by the end of this workshop?
- How to build a Docker image and how to use it.
- How to scrape a website using urllib/requests and Beautifulsoup.
- How regular expressions work and how to apply them.
- The key features of PostgreSQL.
- How to dockerize your project.

## Schedule
| Time          | Topics |
| ------------- | ------ |
| 09:00 - 11:00 | [`Introduction to Docker`](#Introduction-to-Docker) |
| 11:00 - 01:00 | [`Introduction to Web Scraping`](#Introduction-to-Web-Scraping) |
| 01:00 - 02:00 | `Break` |
| 02:00 - 03:00 | [`Introduction to PostgreSQL`](#Introduction-to-PostgreSQL) |
| 03:30 - 04:00 | [`Dockerizing a project`](#Web-Scraping-with-Docker) |
| 04:00 - 04:30 | [`Introduction to GitHub`](#Introduction-to-GitHub) |
| 04:30 - 05:00 | `Q & A and Wrapping Up` |

## Workshop 1 Agenda

1. ### Introduction to Docker

   - Building the Docker image for Worker using python:3.10.2-alpine3.15

   2) Go to the directory where you created the Dockerfile

          docker build ./ -t simple_python
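
   To sanity-check the result, you can list the image and start a short-lived container from it (a quick check, assuming the `simple_python` tag built above):

          docker images simple_python
          docker run --rm -it simple_python sh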

2. ### Introduction to Web Scraping
   - **Beautifulsoup**
     - *Introduction*

       Beautiful Soup is a Python package which allows us to pull data out of HTML and XML documents.
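
       As a quick illustration, here is a minimal sketch of pulling data out of a page (the URL and the `h2` tag are only example choices, not part of the workshop materials):

           # Fetch a page and extract its title and <h2> headings with Beautiful Soup.
           import requests
           from bs4 import BeautifulSoup

           html = requests.get("https://www.lipsum.com/").text
           soup = BeautifulSoup(html, "html.parser")

           print(soup.title.string)                 # page title
           for heading in soup.find_all("h2"):      # every <h2> on the page
               print(heading.get_text(strip=True))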
|
|
3. ### Introduction to PostgreSQL
   - **Key Features of PostgreSQL**
     - Free to download
     - Compatible with Data Integrity
|
|
   Go to the directory where you created the Dockerfile

       docker build ./ -t simple_python

4. ### Introduction to GitHub
   - **Setting up GitHub**

     Make a repository in GitHub
|
     The git config command is used first to configure user.name and user.email. These values determine which username and email address are recorded for commits made from the local repository.
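
     For example (the name and email below are placeholders):

         git config user.name "Your Name"
         git config user.email "you@example.com"
         git config --list   # verify the configured values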

5. ### Web Scraping with Docker
   - Create a new Dockerfile.

         FROM python:3.10.2-alpine3.15
         # Create directories
         RUN mkdir -p /root/workspace/src
         COPY ./web_scraping_sample.py /root/workspace/src
         # Switch to project directory
         WORKDIR /root/workspace/src

   - Create a docker-compose file.

         version: "3"
         services:
           python_service:
             build:
               context: ./
               dockerfile: Dockerfile
             image: workshop1
             container_name: workshop_python_container
             stdin_open: true   # keep STDIN open so you can docker attach
             tty: true
             ports:
               - "8000:8000"
             volumes:
               - .:/app

   - Get the containers up.

         docker-compose up -d

   - Log in to the container (using the container name set above).

         docker exec -it workshop_python_container sh

   - Run the web scraping script inside the container.

         python web_scraping_sample.py
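
   - When you're done, stop and remove the containers.

         docker-compose down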

6. ### Workshop 1 Home Work

   Submit a PR in which data is scraped from the [Lorem Ipsum - All the facts - Lipsum generator](https://www.lipsum.com/) website and each section of that page is saved to the database.
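
   As a starting point, here is a hypothetical sketch of the storage half of the task using psycopg2 (the connection details and the table/column names are placeholders, not a prescribed schema):

       # Store one scraped section in PostgreSQL; adapt names and credentials.
       import psycopg2

       conn = psycopg2.connect(host="localhost", dbname="workshop",
                               user="postgres", password="postgres")
       with conn, conn.cursor() as cur:
           # Create a table for the scraped sections if it doesn't exist yet.
           cur.execute("""
               CREATE TABLE IF NOT EXISTS sections (
                   id SERIAL PRIMARY KEY,
                   title TEXT,
                   body TEXT
               )
           """)
           # Insert one section; parameters are passed safely as a tuple.
           cur.execute("INSERT INTO sections (title, body) VALUES (%s, %s)",
                       ("What is Lorem Ipsum?", "section text scraped from the page"))
       conn.close()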
|