Content Blocker Bot #653
Conversation
Signed-off-by: Kipruto <43873157+kelvinkipruto@users.noreply.github.com>
…idiadata-init
👍
Initial thoughts LGTM.
- Consistency in logging: in one or more files you've set up logging, yet you're still using print() for logging/debugging in multiple places.
- Interesting choice to use asyncio; from my earlier understanding I thought a synchronous approach would have been more straightforward. It would help to add docs explaining the process flow and the motivation. A minimal sketch of the logging point follows this list.
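A minimal sketch of the logging suggestion, assuming a module-level logger; the function and message below are illustrative, not the PR's actual code:

import logging

# Configure once, at application startup.
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

def check_site(url: str) -> None:
    # Instead of print(f"Checking {url}"):
    logger.info("Checking %s", url)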
- Replace direct status updates with new site_checks table for historical tracking
- Add SiteCheck dataclass and related database methods (sketch below)
- Update all status and robots operations to use new system
- Fix typo in get_combined_data method name
- Improve database schema and error handling
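A hypothetical sketch of what this commit describes, assuming a simple record mirrored by the site_checks table; all field names are illustrative, not the PR's actual definitions:

from dataclasses import dataclass

@dataclass
class SiteCheck:
    airtable_id: str   # site being checked (hypothetical field)
    status: str        # outcome of the check, e.g. "blocked" (hypothetical)
    robots_url: str    # robots.txt URL inspected (hypothetical)
    checked_at: str    # timestamp, so history accumulates per check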
@codex Review this PR
Codex Review: Here are some suggestions.
process = CrawlerProcess(settings={
    'ITEM_PIPELINES': {
        'pipeline.ArchivedURLsDatabasePipeline': 2
    },
}, install_root_handler=False)
process.crawl(ArchivedURLsSpider, urls=urls, target_date=target_date)
process.start()
[P0] Avoid restarting Scrapy reactor for each crawl
fetch_internet_archive_snapshots creates a fresh CrawlerProcess and calls start() even though fetch_robots already started and stopped a reactor earlier in the run. Twisted’s reactor is not restartable, so this second start() (and the third in fetch_archived_robots) will raise ReactorNotRestartable and stop the workflow after the first spider. Reuse one process/runner and run the reactor once to avoid the crash.
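A minimal sketch of the suggested fix: CrawlerProcess lets you schedule several crawls and start the reactor exactly once. RobotsSpider and ArchivedRobotsSpider are hypothetical names standing in for the PR's other spiders.

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    'ITEM_PIPELINES': {'pipeline.ArchivedURLsDatabasePipeline': 2},
}, install_root_handler=False)

# crawl() only schedules a spider; nothing runs until start().
process.crawl(RobotsSpider, urls=urls)                                  # hypothetical spider
process.crawl(ArchivedURLsSpider, urls=urls, target_date=target_date)
process.crawl(ArchivedRobotsSpider, urls=urls)                          # hypothetical spider
process.start()  # runs the Twisted reactor once, blocks until all crawls finish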
from utils import validate_url, clean_url
import os
import logging
import re
from environs import Env

env = Env()
dotenv_path = os.path.join(os.path.dirname(__file__), '..', '.env')

env.read_env(dotenv_path)

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')

api_key = os.getenv('AIRTABLE_API_KEY')
base_id = os.getenv('AIRTABLE_BASE_ID')
organisations_table = os.getenv('AIRTABLE_ORGANISATION_TABLE')
content_table = os.getenv('AIRTABLE_CONTENT_TABLE')

if not api_key or not base_id or not organisations_table or not content_table:
    raise ValueError('API key, base ID and Organisation table are required')

at = Api(api_key)
[P0] Import pyairtable Api before use
The module instantiates Api(api_key) without importing the class, so importing airtable.py raises NameError: Api is not defined before any functionality can run. Add from pyairtable import Api (or the appropriate module) near the other imports.
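A minimal sketch of the fix, assuming the project uses pyairtable's top-level Api class:

from pyairtable import Api

at = Api(api_key)  # now resolves; Api was previously undefined at import time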
"cohere-ai" | ||
"DataForSeoBot", | ||
"Diffbot", |
[P1] Fix missing commas in AI crawler list
The entries for cohere-ai and PerplexityBot are missing trailing commas, so Python concatenates them with the following literals, producing strings like "cohere-aiDataForSeoBot". As a result, those crawler names can never be detected in diff_robot_content, skewing the report about blocked AI crawlers.
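The fix is a one-character change per entry; the list name and its neighbours below are illustrative:

AI_CRAWLERS = [        # hypothetical list name
    "cohere-ai",       # trailing comma added
    "DataForSeoBot",
    "Diffbot",
    "PerplexityBot",   # trailing comma added
]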
CREATE TABLE IF NOT EXISTS internet_archive_snapshots(
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    airtable_id TEXT NOT NULL,
    url TEXT NOT NULL,
    archive_date TEXT NOT NULL UNIQUE,
    archive_robots_url TEXT,
[P2] Allow identical archive timestamps for different sites
The internet_archive_snapshots table declares archive_date as globally UNIQUE. If two media sites happen to have snapshots at the same timestamp (precision is only one second), inserting the second record fails with a unique-constraint error and the snapshot is lost. Making the uniqueness per (airtable_id, archive_date), or removing the constraint, avoids rejected inserts.
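A sketch of the suggested schema change, shown via sqlite3 for illustration (the database path is hypothetical): uniqueness is scoped to the (airtable_id, archive_date) pair rather than the timestamp alone.

import sqlite3

conn = sqlite3.connect("media_data.db")  # hypothetical path
conn.execute("""
    CREATE TABLE IF NOT EXISTS internet_archive_snapshots(
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        airtable_id TEXT NOT NULL,
        url TEXT NOT NULL,
        archive_date TEXT NOT NULL,
        archive_robots_url TEXT,
        UNIQUE(airtable_id, archive_date)  -- two sites may share a timestamp
    )
""")
conn.commit()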
row.update({
    "Archive URL": closest_snapshot.get("url"),
    "Archive Date": format_db_date(closest_snapshot.get("archive_date")),
    "Archive Robots URL": closest_snapshot.get("archive_robots_url"),
    "Archive Robot Content": (
        "''" if closest_snapshot.get("archive_robots_url") == "" else closest_snapshot.get("archive_robots_url")
    ),
[P2] Report shows robots URL where content is expected
When building the report row, the "Archive Robot Content" field is populated from archive_robots_url instead of archived_content. The generated spreadsheet therefore shows the URL twice and never includes the archived robots.txt content. Use archived_content for this column.
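A sketch of the corrected mapping, assuming the snapshot dict exposes the archived_content key named in the comment above (the other columns are unchanged and abbreviated here):

row.update({
    # ...other columns unchanged...
    "Archive Robot Content": (
        "''" if closest_snapshot.get("archived_content") == ""
        else closest_snapshot.get("archived_content")
    ),
})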
Description
This PR introduces a new app to check which media houses in the MediaData database block AI crawlers.