
Commit

Merge pull request #127 from pycontw/remove-python-fb-page-insights-client

Remove python fb page insights client
henry410213028 authored May 12, 2024
2 parents 523dd7b + d66bae8 commit ee0e333
Showing 8 changed files with 1,956 additions and 1,016 deletions.
36 changes: 17 additions & 19 deletions .github/workflows/python.yml

```diff
@@ -2,38 +2,36 @@ name: Python CI
 
 on:
   push:
-    branches: [ master ]
+    branches: [master]
   pull_request:
-    branches: [ master ]
+    branches: [master]
 env:
   POETRY_VIRTUALENVS_CREATE: false
   AIRFLOW_TEST_MODE: true
 jobs:
   build:
     runs-on: ubuntu-latest
     timeout-minutes: 10
     steps:
-      - uses: actions/checkout@v2
-      - name: Set up Python 3.8
-        uses: actions/setup-python@v1
-        with:
-          python-version: 3.8
-      - name: Install dependencies
-        run: |
-          pip install poetry==1.1.14
-          poetry config experimental.new-installer false
-          poetry install
-      - name: Run linters
-        run: make lint
-      - name: Run test
-        run: make test
-      - name: Coverage
-        run: make coverage
+      - uses: actions/checkout@v2
+
+      - name: Set up Python 3.8
+        uses: actions/setup-python@v1
+        with:
+          python-version: 3.8
+
+      - name: Install dependencies
+        run: |
+          pip install -U poetry==1.6.1
+          poetry install
+
+      - name: Run linters
+        run: make lint
+
+      - name: Run test
+        run: make test
+
+      - name: Coverage
+        run: make coverage
 
       # CD part
       # - name: Push dags to GCS
```
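
For reviewers who want to reproduce the updated CI job locally, the workflow above boils down to the following shell session (a sketch; it assumes the `lint`, `test`, and `coverage` targets exist in the repository's Makefile, since the workflow invokes them via `make`):

```bash
# Mirror the CI steps locally. Poetry 1.6.1 replaces 1.1.14, so the old
# `poetry config experimental.new-installer false` workaround is dropped.
pip install -U poetry==1.6.1

# CI sets POETRY_VIRTUALENVS_CREATE=false to install into the current env.
export POETRY_VIRTUALENVS_CREATE=false
poetry install

make lint      # linters
make test      # test suite
make coverage  # coverage report
```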
16 changes: 10 additions & 6 deletions Dockerfile

```diff
@@ -1,24 +1,28 @@
-FROM puckel/docker-airflow:1.10.9
+FROM apache/airflow:1.10.13-python3.8
+USER root
-ENV POETRY_VIRTUALENVS_CREATE=false \
-    POETRY_CACHE_DIR='/var/cache/pypoetry' \
+ENV POETRY_CACHE_DIR='/var/cache/pypoetry' \
     GOOGLE_APPLICATION_CREDENTIALS='/usr/local/airflow/service-account.json'
 
-RUN apt-get update \
+RUN apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 467B942D3A79BD29 \
+    && apt-key adv --keyserver keyserver.ubuntu.com --recv-keys B7B3B788A8D3785C \
+    && apt-get update \
     && apt-get install -y --no-install-recommends git \
     # 1. if you don't need postgres, remember to remove postgresql-dev and sqlalchemy
     # 2. libglib2.0-0 libsm6 libxext6 libxrender-dev libgl1-mesa-dev are required by opencv
     # 3. git is required by pip install git+https
-    && pip install --no-cache-dir poetry==1.1.7 \
+    && pip install --no-cache-dir -U poetry==1.6.1 \
     && apt-get clean \
     && rm -rf /var/lib/apt/lists/*
 
 COPY pyproject.toml pyproject.toml
+COPY poetry.toml poetry.toml
 COPY poetry.lock poetry.lock
 
 RUN python -m poetry install --no-interaction --no-ansi --no-dev \
     # Cleaning poetry installation's cache for production:
     && rm -rf "$POETRY_CACHE_DIR" \
     && pip uninstall -yq poetry
 
+USER airflow
 COPY dags /usr/local/airflow/dags
-COPY airflow.cfg airflow.cfg
+COPY airflow.cfg airflow.cfg
```
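
To verify the new base image locally, a minimal build-and-check sequence could look like the following (a sketch; the `davidtnfsh/pycon_etl:prod` tag matches the image name used in the README's run commands below):

```bash
# Build the image from the updated Dockerfile (base switched from
# puckel/docker-airflow:1.10.9 to apache/airflow:1.10.13-python3.8).
docker build -t davidtnfsh/pycon_etl:prod .

# The Dockerfile now ends with `USER airflow`, so the container should not
# run as root; override the entrypoint to check.
docker run --rm --entrypoint whoami davidtnfsh/pycon_etl:prod   # expect: airflow
```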
43 changes: 19 additions & 24 deletions README.md

```diff
@@ -4,6 +4,9 @@
 
 Using Airflow to implement our ETL pipelines
 
+
+[TOC]
+
 ## Year to Year Jobs
 
 When to turn these three jobs on needs manual confirmation (we'll have to trouble that year's team lead); in theory, we test them and turn them on before ticket sales begin.
@@ -19,38 +22,34 @@ Using Airflow to implement our ETL pipelines
 
 * For the DAG naming rules, see this article: [阿里巴巴大數據實戰](https://medium.com/@davidtnfsh/%E5%A4%A7%E6%95%B0%E6%8D%AE%E4%B9%8B%E8%B7%AF-%E9%98%BF%E9%87%8C%E5%B7%B4%E5%B7%B4%E5%A4%A7%E6%95%B0%E6%8D%AE%E5%AE%9E%E8%B7%B5-%E8%AE%80%E6%9B%B8%E5%BF%83%E5%BE%97-54e795c2b8c)
 * Please refer to [this article](https://medium.com/@davidtnfsh/%E5%A4%A7%E6%95%B0%E6%8D%AE%E4%B9%8B%E8%B7%AF-%E9%98%BF%E9%87%8C%E5%B7%B4%E5%B7%B4%E5%A4%A7%E6%95%B0%E6%8D%AE%E5%AE%9E%E8%B7%B5-%E8%AE%80%E6%9B%B8%E5%BF%83%E5%BE%97-54e795c2b8c) for naming guidelines
 
-1. ods/opening_crawler: Crawlers written by @Rain. Those openings can be used for recuitment board, which was implemented by @tai271828 and @stacy.
-2. ods/survey_cake: A manually triggered uploader which would upload questionnaire to bigquery. The uploader should be invoked after we recieved the surveycake questionnaire.
+* examples
+    1. `ods/opening_crawler`: Crawlers written by @Rain. Those openings can be used for the recruitment board, which was implemented by @tai271828 and @stacy.
+    2. `ods/survey_cake`: A manually triggered uploader that would upload questionnaires to bigquery. The uploader should be invoked after we receive the surveycake questionnaire.
 
 ## Prerequisites
 
 1. [Install Python 3.8+](https://www.python.org/downloads/release/python-3811/)
 2. [Get Docker](https://docs.docker.com/get-docker/)
 3. [Install Git](https://git-scm.com/book/zh-tw/v2/%E9%96%8B%E5%A7%8B-Git-%E5%AE%89%E8%A3%9D%E6%95%99%E5%AD%B8)
 4. [Get npm](https://www.npmjs.com/get-npm)
 
 ## Install
 
-1. `docker pull puckel/docker-airflow:1.10.9`
+1. `docker pull docker.io/apache/airflow:1.10.13-python3.8`
 2. Python dependencies:
     1. `virtualenv venv`
        * `. venv/bin/activate`
     2. `pip install poetry`
     3. `poetry install`
-3. Npm dependencies, for linter, formatter and commit linter (optional):
+3. Npm dependencies for linter, formatter, and commit linter (optional):
     1. `brew install npm`
     2. `npm ci`
 
 ## Commit
 
 1. `git add <files>`
 2. `npm run check`: Apply all the linters and formatters
 3. `npm run commit`
 
 ## PR
 
-Please use Gitlab Flow, otherwise you cannot pass dockerhub CI
+Please use Gitlab Flow; otherwise, you cannot pass the Docker Hub CI
@@ -62,7 +61,6 @@ Please use Gitlab Flow; otherwise, you cannot pass the Docker Hub CI
 
 4. Check its command in [contrib/README.md](contrib/README.md)
 5. `python xxx.py`
 
-
 ### Local environment Docker
 
 > Find @davidtnfsh if you don't have those secrets.
@@ -76,12 +74,12 @@ Please use Gitlab Flow; otherwise, you cannot pass the Docker Hub CI
 
     * Build dev/test image (for dev/test): `docker build -t davidtnfsh/pycon_etl:test --cache-from davidtnfsh/pycon_etl:prod -f Dockerfile.test .`
 2. Fill in some secrets:
     1. `cp .env.template .env.staging` for dev/test. `cp .env.template .env.production` instead if you are going to start a production instance.
-    2. Follow the instruction in `.env.<staging|production>` and fill in your secrets.
-       If you are just running the staging instance for development as a sandbox, and not going to access any specific thrid-party service, leave the `.env.staging` as-is should be fine.
+    2. Follow the instructions in `.env.<staging|production>` and fill in your secrets.
+       If you are running the staging instance for development as a sandbox and are not going to access any specific third-party service, leaving the `.env.staging` as-is should be fine.
 3. Start the Airflow server:
     * production: `docker run --log-opt max-size=1m -p 8080:8080 --name airflow -v $(pwd)/dags:/usr/local/airflow/dags -v $(pwd)/service-account.json:/usr/local/airflow/service-account.json --env-file=./.env.production davidtnfsh/pycon_etl:prod webserver`
     * dev/test: `docker run -p 8080:8080 --name airflow -v $(pwd)/dags:/usr/local/airflow/dags -v $(pwd)/service-account.json:/usr/local/airflow/service-account.json --env-file=./.env.staging davidtnfsh/pycon_etl:test webserver`
-    * Note the difference are just the env file name and the image cache.
+    * Note the difference is just the env file name and the image cache.
 4. Port-forward the compute instance to your local machine and then navigate to <http://localhost:8080/admin/>:
     1. `gcloud beta compute ssh --zone "asia-east1-b" "data-team" --project "pycontw-225217" -- -NL 8080:localhost:8080`
    2. If port 8080 is already in use, you need to stop the service occupying port 8080 on your local machine first.
@@ -90,9 +88,8 @@ Please use Gitlab Flow; otherwise, you cannot pass the Docker Hub CI
 
 5. Setup Airflow's Variables and Connections:
     * Youtube: ![img](docs/youtube-connection.png)
 
-
-### Local environment Docker(windows)
-> Do not use Windows Powershell, please use Comman Prompt instead.
+### Local environment Docker (Windows)
+> Do not use Windows PowerShell; please use Command Prompt instead.
 > Find @davidtnfsh if you don't have those secrets.
@@ -105,8 +102,8 @@ Please use Gitlab Flow; otherwise, you cannot pass the Docker Hub CI
 
     * Build dev/test image (for dev/test): `docker build -t davidtnfsh/pycon_etl:test --cache-from davidtnfsh/pycon_etl:prod -f Dockerfile.test .`
 2. Fill in some secrets:
     1. `copy .env.template .env.staging` for dev/test. `copy .env.template .env.production` instead if you are going to start a production instance.
-    2. Follow the instruction in `.env.<staging|production>` and fill in your secrets.
-       If you are just running the staging instance for development as a sandbox, and not going to access any specific thrid-party service, leave the `.env.staging` as-is should be fine.
+    2. Follow the instructions in `.env.<staging|production>` and fill in your secrets.
+       If you are running the staging instance for development as a sandbox and are not going to access any specific third-party service, leaving the `.env.staging` as-is should be fine.
 3. Start the Airflow server:
     * production: `docker run -p 8080:8080 --name airflow -v "/$(pwd)"/dags:/usr/local/airflow/dags -v "/$(pwd)"/service-account.json:/usr/local/airflow/service-account.json --env-file=./.env.production davidtnfsh/pycon_etl:prod webserver`
     * dev/test: `docker run -p 8080:8080 --name airflow -v "/$(pwd)"/dags:/usr/local/airflow/dags -v "/$(pwd)"/service-account.json:/usr/local/airflow/service-account.json --env-file=./.env.staging davidtnfsh/pycon_etl:test webserver`
@@ -115,17 +112,16 @@ Please use Gitlab Flow; otherwise, you cannot pass the Docker Hub CI
 
    1. `gcloud beta compute ssh --zone "asia-east1-b" "data-team" --project "pycontw-225217" -- -N -L 8080:localhost:8080`
    2. If port 8080 is already in use, you need to stop the service occupying port 8080 on your local machine first.
 
-
 ![image](./docs/airflow.png)
 
 #### BigQuery (Optional)
 1. Setup the Authentication of GCP: <https://googleapis.dev/python/google-api-core/latest/auth.html>
     * After invoking `gcloud auth application-default login`, you'll get a credentials JSON residing in `$HOME/.config/gcloud/application_default_credentials.json`. Invoke `export GOOGLE_APPLICATION_CREDENTIALS="/path/to/keyfile.json"` if you have it.
-    * service-account.json: Please contact @david30907d using email, telegram or discord. No worry about this json if you are just running the sandbox staging instance for development.
+    * service-account.json: Please contact @david30907d via email, telegram, or discord. Don't worry about this JSON if you are running the sandbox staging instance for development.
 2. Give [Toy-Examples](#Toy-Examples) a try
 
 ## Deployment & Setting Up Credentials/Env
 
-1. Login to data team's server:
+1. Log in to the data team's server:
    1. `gcloud compute ssh --zone "asia-east1-b" "data-team" --project "pycontw-225217"`
    2. service:
       * ETL: `/home/zhangtaiwei/pycon-etl`
@@ -144,7 +140,6 @@ Please use Gitlab Flow; otherwise, you cannot pass the Docker Hub CI
 
    * kktix_events_endpoint: url path of kktix's `hosting_events`, ask @gtb for details!
 
 ### CI/CD
-
 Please check [.github/workflows](.github/workflows) for details
 
 ## Tutorials
```
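
As a quick sanity check for the BigQuery authentication step described in the README above, something like this can be run before starting the webserver (a sketch; it assumes the gcloud SDK is installed and that you have access to the `pycontw-225217` project):

```bash
# Obtain application-default credentials; the JSON lands in
# $HOME/.config/gcloud/application_default_credentials.json as noted above.
gcloud auth application-default login

# Alternatively, point client libraries at the service-account key file.
export GOOGLE_APPLICATION_CREDENTIALS="$(pwd)/service-account.json"

# If the credentials resolve, an access token is printed.
gcloud auth application-default print-access-token > /dev/null && echo "credentials OK"
```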
50 changes: 0 additions & 50 deletions dags/ods/fb_page_insights/dags/fb_page_insights_2_bigquery.py

This file was deleted.

