Content Blocker Bot #653
Open
kelvinkipruto wants to merge 25 commits into main from ft/midiadata-init
Commits
75f73f8 Minimal working setup (kelvinkipruto)
b92ef3e Working version with DB (kelvinkipruto)
a4541de Cleanup (kelvinkipruto)
aec1820 Run time improvements (kelvinkipruto)
8c0b06f Remove unused imports (kelvinkipruto)
95dae7f Merge branch 'main' of https://github.com/CodeForAfrica/api into ft/m… (kelvinkipruto)
9e17c89 Docker files (kelvinkipruto)
1469485 validate robots.txt (kelvinkipruto)
1e1c00d Improve script to capture extra required fields (kelvinkipruto)
3140ecb Rename to content_access_bot (kelvinkipruto)
906ba75 use case insensitivity when matching crawlers (kelvinkipruto)
e1dd2e4 Improve url redirects check (kelvinkipruto)
f74769b Update list of crawlers (kelvinkipruto)
73a0031 use environs instead of dotenv (kelvinkipruto)
d8981e1 Misc improvements (kelvinkipruto)
883a8ab Code changes (kelvinkipruto)
b551b3e Working Update (kelvinkipruto)
09bc272 Refactor database imports to use sqliteDB module (kelvinkipruto)
f13a25c Improve script reliability (kelvinkipruto)
782b921 Fix SQL table definition to allow NULL values for archived robots fields (kelvinkipruto)
a2761a5 Simplified working scrapper (kelvinkipruto)
a1d7374 Update interpreter constraints to include Python 3.10 (kelvinkipruto)
df6e7a3 Enhance database connection timeout and improve robots fetching logic (kelvinkipruto)
b3352ff refactor(db): implement site checks tracking system (kelvinkipruto)
7ab4278 Merge branch 'main' into ft/midiadata-init (kelvinkipruto)
@@ -168,3 +168,6 @@ cython_debug/
# Custom gitignore
*.db
# End of custom ignore

*.csv
*.xlsx
@@ -0,0 +1,5 @@
AIRTABLE_BASE_ID=
AIRTABLE_API_KEY=
AIRTABLE_ORGANISATION_TABLE=
AIRTABLE_CONTENT_TABLE=
DB_FILE=content_access_bot.db
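The four `AIRTABLE_*` variables in this template are the ones the Airtable module checks before creating its client. The following is a minimal sketch, not the PR's code, of a check that names exactly which variables are missing. The helpers `missing_settings` and `require_settings` are hypothetical names introduced for illustration; they accept any mapping, so they can be exercised without touching the real environment.

```python
from typing import Mapping

# Variable names taken from the template above.
REQUIRED_SETTINGS = (
    "AIRTABLE_API_KEY",
    "AIRTABLE_BASE_ID",
    "AIRTABLE_ORGANISATION_TABLE",
    "AIRTABLE_CONTENT_TABLE",
)


def missing_settings(environ: Mapping[str, str]) -> list[str]:
    """Return the required names that are absent or empty in `environ`."""
    return [name for name in REQUIRED_SETTINGS if not environ.get(name)]


def require_settings(environ: Mapping[str, str]) -> dict[str, str]:
    """Hypothetical helper: return the required settings or raise ValueError."""
    missing = missing_settings(environ)
    if missing:
        raise ValueError("Missing required settings: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_SETTINGS}


if __name__ == "__main__":
    import os

    # Reports every missing variable in one message instead of a generic error.
    print(missing_settings(os.environ))
```

Passing `os.environ` after the `.env` file has been loaded would give the same check against the live process environment.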
@@ -0,0 +1,33 @@
python_sources()
docker_image(
    name="content_access_bot-deps",
    image_tags=["deps"],
    build_platform=["linux/amd64", "linux/arm64"],
    registries=["content_access_bot"],
    repository="app",
    skip_push=True,
    source="Dockerfile.deps",
)

file(name="app.json", source="app.json")

docker_image(
    name="content_access_bot-srcs",
    image_tags=["srcs"],
    build_platform=["linux/amd64", "linux/arm64"],
    registries=["content_access_bot"],
    repository="app",
    skip_push=True,
    source="Dockerfile.srcs",
)

docker_image(
    name="content_access_bot",
    build_platform=["linux/amd64", "linux/arm64"],
    dependencies=[":content_access_bot-srcs", ":content_access_bot-deps", ":app.json"],
    image_tags=[
        "{build_args.VERSION}",
        "latest",
    ],
    source="Dockerfile",
)
@@ -0,0 +1,11 @@
FROM python:3.11-slim-bookworm AS python-base
FROM content_access_bot/app:deps AS app-deps
FROM content_access_bot/app:srcs AS app-srcs
FROM python-base AS python-app

WORKDIR /app
COPY content_access_bot/docker/app.json ./
COPY --from=app-deps /app ./
COPY --from=app-srcs /app ./

CMD ["tail", "-f", "/dev/null"]
@@ -0,0 +1,4 @@
FROM python:3.11-slim-bookworm

COPY content_access_bot.py/content_access_bot-deps@environment=linux.pex /content_access_bot-deps.pex
RUN PEX_TOOLS=1 python /content_access_bot-deps.pex venv --scope=deps --compile /app
@@ -0,0 +1,4 @@
FROM python:3.11-slim-bookworm

COPY content_access_bot.py/content_access_bot-srcs@environment=linux.pex /content_access_bot-srcs.pex
RUN PEX_TOOLS=1 python /content_access_bot-srcs.pex venv --scope=srcs --compile /app
@@ -0,0 +1,9 @@
{
  "name": "content_access_bot",
  "cron": [
    {
      "command": "./pex",
      "schedule": "@daily"
    }
  ]
}
@@ -0,0 +1,46 @@
python_sources(
    name="lib",
    dependencies=[
        "3rdparty/py:requirements-all#aiohttp",
        "3rdparty/py:requirements-all#backoff",
        "3rdparty/py:requirements-all#environs",
        "3rdparty/py:requirements-all#pyairtable",
        "3rdparty/py:requirements-all#scrapy",
        "3rdparty/py:requirements-all#openpyxl",
        "3rdparty/py:requirements-all#pandas",
        "content_access_bot/py/pipeline.py:lib"
    ],
)

pex_binary(
    name="content_access_bot-deps",
    environment=parametrize("__local__", "linux"),
    dependencies=[
        ":lib",
    ],
    entry_point="main.py",
    include_sources=False,
    include_tools=True,
    layout="packed",
)

pex_binary(
    name="content_access_bot-srcs",
    environment=parametrize("__local__", "linux"),
    dependencies=[
        ":lib",
    ],
    entry_point="main.py",
    include_requirements=False,
    include_tools=True,
    layout="packed",
)


pex_binary(
    name="content_access_bot",
    dependencies=[
        ":lib",
    ],
    entry_point="main.py",
)
@@ -0,0 +1 @@
0.0.1
@@ -0,0 +1,85 @@
from pyairtable import Api
from utils import validate_url, clean_url
import os
import logging
import re
from environs import Env
env = Env()
dotenv_path = os.path.join(os.path.dirname(__file__), '..', '.env')

env.read_env(dotenv_path)


logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')

api_key = os.getenv('AIRTABLE_API_KEY')
base_id = os.getenv('AIRTABLE_BASE_ID')
organisations_table = os.getenv('AIRTABLE_ORGANISATION_TABLE')
content_table = os.getenv('AIRTABLE_CONTENT_TABLE')

if not api_key or not base_id or not organisations_table or not content_table:
    raise ValueError('API key, base ID, Organisation table and Content table are required')

at = Api(api_key)


def get_table_data(table_name, formula=None, fields=None):
    if not base_id:
        logging.error("AIRTABLE_BASE_ID Not Provided")
        return
    table = at.table(base_id, table_name)
    return table.all(formula=formula, fields=fields)


def get_formula(allowed_countries=None):
    base_formula = 'AND(NOT({Organisation Name} = ""), NOT({Website} = ""), NOT({HQ Country} = ""))'
    if allowed_countries:
        countries_formula = ', '.join(
            [f'({{HQ Country}} = "{country}")' for country in allowed_countries])
        formula = f'AND({base_formula}, OR({countries_formula}))'
    else:
        formula = base_formula
    return formula


def process_records(data):
    organizations = []
    for record in data:
        website = validate_url(record['fields'].get('Website', None))
        name = record['fields'].get('Organisation Name', None)
        country = record['fields'].get('HQ Country', None)
        id: str = record['id']
        if website:
            org = {}
            org['id'] = id
            org['name'] = re.sub(
                r'[\\/*?:"<>|]', '-', name) if name else None
            org['url'] = clean_url(website)
            org['country'] = country

            organizations.append(org)
    return organizations


def get_organizations(allowed_countries=None):
    logging.info('Fetching organizations from Airtable')
    formula = get_formula(allowed_countries)
    fields = ['Organisation Name', 'Website', 'HQ Country']
    data = get_table_data(organisations_table, formula, fields)
    organizations = process_records(data)
    logging.info(f'Fetched {len(organizations)} organizations')
    return organizations


async def batch_upsert_organizations(data):
    logging.info('Upserting organizations in Airtable')
    try:
        if not base_id or not content_table:
            logging.error("AIRTABLE_BASE_ID or AIRTABLE_CONTENT_TABLE Not Provided")
            return
        table = at.table(base_id, content_table)
        table.batch_upsert(records=data, key_fields=['id',])
        logging.info('Organizations upserted successfully')
    except Exception as e:
        logging.error(f'Error upserting organization: {e}')
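To make the filter and name handling above easier to review, here is a standalone sketch that mirrors `get_formula` and the `re.sub` call in `process_records`, with no Airtable dependency. The names `BASE_FORMULA`, `build_formula` and `sanitize_name` are hypothetical and exist only for this illustration; the formula text and the character class are copied from the diff.

```python
import re

# Base filter copied from get_formula: all three fields must be non-empty.
BASE_FORMULA = (
    'AND(NOT({Organisation Name} = ""), '
    'NOT({Website} = ""), NOT({HQ Country} = ""))'
)


def build_formula(allowed_countries=None):
    """Return the base filter, optionally narrowed to a set of HQ countries."""
    if not allowed_countries:
        return BASE_FORMULA
    country_checks = ", ".join(
        f'({{HQ Country}} = "{country}")' for country in allowed_countries
    )
    return f"AND({BASE_FORMULA}, OR({country_checks}))"


def sanitize_name(name):
    """Replace the characters \\ / * ? : " < > | with '-' (same regex as the diff)."""
    return re.sub(r'[\\/*?:"<>|]', "-", name) if name else None


if __name__ == "__main__":
    print(build_formula(["Kenya", "Nigeria"]))
    print(sanitize_name('News/Desk: "Daily"?'))  # prints News-Desk- -Daily--
```

Country names are interpolated into the formula unescaped, as in the diff, so a value containing a double quote would produce a malformed formula. Whether that can occur depends on the data and is not something the diff settles.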
[P0] Import pyairtable Api before use
The module instantiates `Api(api_key)` without importing the class, so importing `airtable.py` raises `NameError: name 'Api' is not defined` before any functionality can run. Add `from pyairtable import Api` (or the appropriate module) near the other imports.
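The failure mode described in this comment can be reproduced without pyairtable. The sketch below is an illustration under stated assumptions, not the project's code: `name_error_on_import` is a hypothetical helper that executes module-level source in a fresh namespace, and the stub `Api` class only stands in for the real import.

```python
from typing import Optional


def name_error_on_import(module_source: str) -> Optional[str]:
    """Run module-level source in an empty namespace; return the NameError text, if any."""
    try:
        exec(compile(module_source, "<sketch>", "exec"), {})
    except NameError as exc:
        return str(exc)
    return None


# Module-level use of Api with nothing defining it, as the comment describes.
WITHOUT_IMPORT = "at = Api('dummy-key')\n"

# Same statement, with a stub class standing in for `from pyairtable import Api`.
WITH_STUB = (
    "class Api:\n"
    "    def __init__(self, key):\n"
    "        self.key = key\n"
    "\n"
    "at = Api('dummy-key')\n"
)

if __name__ == "__main__":
    print(name_error_on_import(WITHOUT_IMPORT))  # name 'Api' is not defined
    print(name_error_on_import(WITH_STUB))       # None
```

Because the statement sits at module level, the error occurs as soon as the module is loaded, which is why the comment rates it as blocking.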
(or the appropriate module) near the other imports.Useful? React with 👍 / 👎.