
Conversation

kelvinkipruto

Description

This PR introduces a new app to check which media houses in the MediaData database block AI crawlers.
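
A minimal sketch of the core idea (not the app's actual implementation, which uses Scrapy spiders and a local database; the bot names below are an illustrative subset): fetch a site's robots.txt and check whether known AI crawler user agents are disallowed.

# Illustrative sketch only: check whether a site's robots.txt blocks a few AI crawlers.
from urllib import robotparser

AI_CRAWLERS = ["GPTBot", "CCBot", "anthropic-ai"]  # illustrative subset, not the app's full list

def blocked_ai_crawlers(site_url: str) -> list[str]:
    parser = robotparser.RobotFileParser()
    parser.set_url(site_url.rstrip("/") + "/robots.txt")
    parser.read()  # fetches and parses the live robots.txt
    return [bot for bot in AI_CRAWLERS if not parser.can_fetch(bot, site_url)]

print(blocked_ai_crawlers("https://example.com"))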

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation

@kelvinkipruto kelvinkipruto marked this pull request as ready for review May 7, 2024 06:44
@kilemensi
Member

mediadata_ai_blocklist is a terrible name, Mr. @kelvinkipruto ... for one, we're 100% going to use this app for more than media data. My vote is for a name that reflects checking whether website content is accessible to bots, e.g. accessbot, etc.

Contributor

@koechkevin koechkevin left a comment

👍

Contributor

@thepsalmist thepsalmist left a comment

Initial thoughts: LGTM.

  1. Consistency in logging: in one or more files you've set up logging but are still using print() for logging/debugging in multiple places; prefer the configured logger throughout (see the sketch after this list).
  2. Interesting choice to use asyncio; my earlier understanding was that a synchronous approach would have been more straightforward. It would help to add docs explaining the process flow and the motivation.
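
A minimal sketch of the logging point (names and messages are illustrative, not from the PR): route everything through the configured logger instead of mixing in print().

import logging

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

url = "https://example.com"  # placeholder value for illustration
logger.info("Fetched robots.txt for %s", url)   # instead of print(f"Fetched {url}")
logger.debug("Parsed %d disallow rules", 0)     # debug detail stays out of INFO output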

@kelvinkipruto kelvinkipruto changed the title from MediaData AI Crawler Blocker Checker to Content Blocker Bot on May 15, 2024
- Replace direct status updates with new site_checks table for historical tracking
- Add SiteCheck dataclass and related database methods
- Update all status and robots operations to use new system
- Fix typo in get_combined_data method name
- Improve database schema and error handling
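
A purely illustrative sketch of what the historical-tracking pieces described above might look like (the actual field names and schema are defined in the PR, not here):

from dataclasses import dataclass
from typing import Optional

@dataclass
class SiteCheck:
    airtable_id: str
    url: str
    checked_at: str                  # timestamp of this check
    robots_content: Optional[str]    # robots.txt body captured at check time
    blocks_ai_crawlers: bool

# Each check is inserted as a new site_checks row instead of overwriting a status
# column, so the history of a site's robots.txt behaviour is preserved.
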
@kelvinkipruto
Author

@codex Review this PR


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Codex Review: Here are some suggestions.


Comment on lines +116 to +122
process = CrawlerProcess(settings={
    'ITEM_PIPELINES': {
        'pipeline.ArchivedURLsDatabasePipeline': 2
    },
}, install_root_handler=False)
process.crawl(ArchivedURLsSpider, urls=urls, target_date=target_date)
process.start()

[P0] Avoid restarting Scrapy reactor for each crawl

fetch_internet_archive_snapshots creates a fresh CrawlerProcess and calls start() even though fetch_robots already started and stopped a reactor earlier in the run. Twisted’s reactor is not restartable, so this second start() (and the third in fetch_archived_robots) will raise ReactorNotRestartable and stop the workflow after the first spider. Reuse one process/runner and run the reactor once to avoid the crash.
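
A minimal sketch of the suggested fix (RobotsSpider and ArchivedRobotsSpider are assumed names for the other spiders; ArchivedURLsSpider, the pipeline, urls and target_date come from the diff): schedule every spider on a single CrawlerProcess and start the Twisted reactor exactly once.

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    'ITEM_PIPELINES': {
        'pipeline.ArchivedURLsDatabasePipeline': 2,
    },
}, install_root_handler=False)

# crawl() only schedules the spiders; nothing runs until start() is called.
process.crawl(RobotsSpider, urls=urls)
process.crawl(ArchivedURLsSpider, urls=urls, target_date=target_date)
process.crawl(ArchivedRobotsSpider, urls=urls)

process.start()  # starts the reactor once and blocks until every scheduled crawl finishes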


Comment on lines +2 to +24
from utils import validate_url, clean_url
import os
import logging
import re
from environs import Env
env = Env()
dotenv_path = os.path.join(os.path.dirname(__file__), '..', '.env')

env.read_env(dotenv_path)


logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')

api_key = os.getenv('AIRTABLE_API_KEY')
base_id = os.getenv('AIRTABLE_BASE_ID')
organisations_table = os.getenv('AIRTABLE_ORGANISATION_TABLE')
content_table = os.getenv('AIRTABLE_CONTENT_TABLE')

if not api_key or not base_id or not organisations_table or not content_table:
    raise ValueError('API key, base ID and Organisation table are required')

at = Api(api_key)

[P0] Import pyairtable Api before use

The module instantiates Api(api_key) without importing the class, so importing airtable.py raises NameError: Api is not defined before any functionality can run. Add from pyairtable import Api (or the appropriate module) near the other imports.
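
Sketch of the one-line fix, using pyairtable's top-level Api class as the review suggests:

from pyairtable import Api

at = Api(api_key)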


Comment on lines +14 to +16
"cohere-ai"
"DataForSeoBot",
"Diffbot",

[P1] Fix missing commas in AI crawler list

The entries for cohere-ai and PerplexityBot are missing trailing commas, so Python concatenates them with the following literals, producing strings like "cohere-aiDataForSeoBot". As a result, those crawler names can never be detected in diff_robot_content, skewing the report about blocked AI crawlers.
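
Sketch of the corrected literal list (the list name is illustrative; the entries come from the diff and the review): trailing commas keep each crawler name a separate string instead of being implicitly concatenated.

AI_CRAWLERS = [
    "cohere-ai",
    "DataForSeoBot",
    "Diffbot",
    "PerplexityBot",
    # ...
]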


Comment on lines +78 to +83
CREATE TABLE IF NOT EXISTS internet_archive_snapshots(
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    airtable_id TEXT NOT NULL,
    url TEXT NOT NULL,
    archive_date TEXT NOT NULL UNIQUE,
    archive_robots_url TEXT,

[P2] Allow identical archive timestamps for different sites

The internet_archive_snapshots table declares archive_date as globally UNIQUE. If two media sites happen to have snapshots at the same timestamp (only second-level precision), inserting the second record will fail with a unique-constraint error and the snapshot is lost. Making the uniqueness per (airtable_id, archive_date) or removing the uniqueness constraint avoids rejected inserts.
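
A sketch of the suggested schema change, assuming the table lives in SQLite (the AUTOINCREMENT syntax suggests so); the global UNIQUE on archive_date is dropped in favour of a per-site constraint.

import sqlite3

conn = sqlite3.connect("mediadata.db")  # hypothetical database path
conn.execute("""
    CREATE TABLE IF NOT EXISTS internet_archive_snapshots(
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        airtable_id TEXT NOT NULL,
        url TEXT NOT NULL,
        archive_date TEXT NOT NULL,
        archive_robots_url TEXT,
        -- ... remaining columns as defined in the PR ...
        UNIQUE(airtable_id, archive_date)
    )
""")
conn.commit()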


Comment on lines +195 to +201
row.update({
    "Archive URL": closest_snapshot.get("url"),
    "Archive Date": format_db_date(closest_snapshot.get("archive_date")),
    "Archive Robots URL": closest_snapshot.get("archive_robots_url"),
    "Archive Robot Content": (
        "''" if closest_snapshot.get("archive_robots_url") == "" else closest_snapshot.get("archive_robots_url")
    ),

[P2] Report shows robots URL where content is expected

When building the report row, the "Archive Robot Content" field is populated from archive_robots_url instead of archived_content. The generated spreadsheet therefore shows the URL twice and never includes the archived robots.txt content. Use archived_content for this column.
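
Sketch of the suggested fix (field and key names come from the diff and the review; archived_content is the key the review refers to): populate "Archive Robot Content" from the archived robots.txt body rather than repeating the URL.

row.update({
    "Archive URL": closest_snapshot.get("url"),
    "Archive Date": format_db_date(closest_snapshot.get("archive_date")),
    "Archive Robots URL": closest_snapshot.get("archive_robots_url"),
    "Archive Robot Content": (
        "''" if closest_snapshot.get("archived_content") == "" else closest_snapshot.get("archived_content")
    ),
})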

