DG1001/hn-gems

HN Hidden Gems Finder

A tool that discovers high-quality Hacker News posts from low-karma accounts that would otherwise be overlooked.

Overview

The HN Hidden Gems Finder helps surface excellent content from new or low-karma Hacker News users that often gets buried despite being valuable. This addresses the problem where great "Show HN" posts and discussions get no traction simply because the author doesn't have established karma.

⚠️ Experimental Status

Please note: This project is experimental in nature. The AI-generated content analysis, scoring, and podcast features are in alpha status and should not be taken too seriously. All AI-generated text, evaluations, and audio content represent algorithmic interpretations and may contain inaccuracies, biases, or subjective opinions. This is primarily a research project exploring automated content discovery and AI-powered analysis techniques.

Features

  • Real-time Hidden Gems Feed: Continuously discovers overlooked quality posts (every 5 minutes)
  • Automated Background Services:
    • Post Collection: Automatic discovery and analysis with configurable intervals (no Redis required)
    • Hall of Fame Monitoring: Tracks gems that achieve success and automatically promotes them (every 6 hours)
    • Super Gems Analysis: AI-powered deep analysis of top gems using Google Gemini with user-friendly visual scoring (every 6 hours)
    • Podcast Generation: Automatic conversion of Super Gems analysis to professional podcast audio using AI script generation and Google Cloud Text-to-Speech (alpha - experimental feature)
  • Hall of Fame: Automated tracking of discovered gems that later became popular (≥100 points)
  • Success Metrics: Real-time monitoring of discovery accuracy and timing
  • Quality Analysis: AI-powered content analysis to identify technical depth and originality
  • Anti-spam Protection: Advanced filtering to maintain high quality
  • Duplicate Detection: Intelligent duplicate post detection and filtering to prevent spam and content recycling
  • Visual Scoring System: User-friendly star ratings and professional dot indicators instead of intimidating numerical scores
  • Knowledge-Aware AI: Smart evaluation system that avoids penalizing posts for recent technology releases
  • Time-based Collection: Intelligent collection that only processes posts from specified time windows
  • Podcast Player: Built-in HTML5 audio player with streaming and download capabilities for Super Gems podcasts

Quick Start

  1. Install Dependencies

    pip install -r requirements.txt
  2. Configure Environment (Optional)

    cp .env.sample .env
    # Edit .env to customize settings
  3. Initialize Database

    python -m hn_hidden_gems.models.init_db
  4. Configure Collection Service (Optional)

    # Set collection interval (default: 5 minutes, 0 to disable)
    export POST_COLLECTION_INTERVAL_MINUTES=5
  5. Run Application

    python app.py

    The application automatically starts all background services:

    • Post Collection: Discovers new gems every 5 minutes
    • Hall of Fame Monitoring: Checks for gem success every 6 hours
    • Super Gems Analysis: Deep AI analysis of top gems every 6 hours
    • Podcast Generation: Automatically creates audio podcasts after Super Gems analysis completes (when enabled)

Architecture

The system uses the official Hacker News API for all data collection:

  • HN Firebase API: Real-time updates, currently with no rate limit

Key components:

  • Data Collection: Automatic background collection of HN new posts
  • Quality Analysis: AI-powered content evaluation with spam detection
  • Duplicate Detection: Advanced duplicate post filtering using URL normalization, content similarity analysis, and same-author detection
  • Super Gems Analysis: Advanced AI evaluation with visual scoring and knowledge-aware assessments
  • Storage: SQLite for development, PostgreSQL for production
  • Web Interface: Flask application with real-time updates
  • Background Service: APScheduler-based in-process collection (no Redis required)
  • Time-based Processing: Collects only posts from specified time windows
  • Podcast System: Automated script generation with Gemini 2.5 Flash-Lite and high-quality audio synthesis with Google Cloud TTS
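
As a rough illustration of the time-based collection described above, the sketch below reads new story IDs from the public HN Firebase API and keeps only items posted inside a time window. The `newstories` and `item/<id>` routes are the documented HN API endpoints; the names `fetch_json` and `collect_recent`, and the injectable `fetch` parameter, are hypothetical and are not taken from this repository.

```python
import json
import time
from urllib.request import urlopen

HN_API = "https://hacker-news.firebaseio.com/v0"


def fetch_json(path):
    """Fetch one JSON document from the HN Firebase API."""
    with urlopen(f"{HN_API}/{path}.json", timeout=10) as resp:
        return json.load(resp)


def collect_recent(minutes, max_stories=500, fetch=fetch_json, now=None):
    """Return new stories posted within the last `minutes` minutes.

    `fetch` is injectable so the function can be exercised without network access.
    """
    now = time.time() if now is None else now
    cutoff = now - minutes * 60
    recent = []
    for story_id in fetch("newstories")[:max_stories]:
        item = fetch(f"item/{story_id}")
        # Missing items come back as null (None); skip those and anything too old.
        if not item or item.get("time", 0) < cutoff:
            continue
        recent.append(item)
    return recent
```

Calling `collect_recent(60)` would fetch stories from the last hour. The real service presumably also looks up author karma and scores each post, which this sketch leaves out.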

Documentation

For detailed technical documentation on the algorithms:

📖 Hidden Gems Detection Algorithm - Complete documentation of the initial quality analysis system that identifies hidden gems from low-karma authors, including scoring methodology, spam detection, and Live Feed generation.

🔬 Super Gems Analysis Algorithm - In-depth explanation of the advanced LLM-powered analysis system, including prompt engineering, scoring criteria, and quality assurance measures.

Super Gems Analysis System

The Super Gems feature provides comprehensive AI-powered analysis of top hidden gems:

Visual Scoring System:

  • ⭐⭐⭐⭐⭐ Star ratings for overall quality (instead of intimidating numerical scores)
  • Professional dot indicators for detailed metrics:
    • ●●●● Exceptional (91-100%)
    • ●●● Excellent (76-90%)
    • ●● Good (51-75%)
    • ● Basic (0-50%)
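
The percentage bands above translate directly into a lookup. This is a minimal sketch, assuming scores arrive as a 0-100 number and that fractional values fall into the band below the next integer boundary; `dot_indicator` is a hypothetical name.

```python
def dot_indicator(percent):
    """Map a 0-100 percentage to the dot indicator and label listed above."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    if percent >= 91:
        return "●●●●", "Exceptional"
    if percent >= 76:
        return "●●●", "Excellent"
    if percent >= 51:
        return "●●", "Good"
    return "●", "Basic"
```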

Smart AI Analysis:

  • Knowledge-aware evaluation that doesn't penalize posts for recent technology releases
  • Technical merit focus over factual verification for emerging technologies
  • Automatic bias correction for outdated knowledge assumptions
  • Comprehensive GitHub integration for code quality assessment

Analysis Dimensions:

  • Technical Innovation
  • Problem Significance
  • Implementation Quality
  • Community Value
  • Uniqueness Score

Output Formats:

  • super-gems.html: Clean, public-friendly version without ratings (maintains ranking order)
  • super-gems-ratings.html: Internal version with full visual scoring system
  • super-gems.json: JSON API data with all analysis details

Podcast Feature 🎧

The HN Hidden Gems Podcast automatically converts Super Gems analysis into professional-quality audio content for on-the-go listening.

Features

  • AI Script Generation: Uses Gemini 2.5 Flash-Lite to create natural, engaging podcast scripts from Super Gems analysis
  • Professional Audio: High-quality MP3 generation with Google Cloud Neural2 voices
  • Automatic Integration: Seamlessly integrates with existing Super Gems analysis workflow
  • Web Player: Built-in HTML5 audio player with streaming, progress control, and download options
  • Text Optimization: Intelligently converts technical content for speech synthesis (URLs, acronyms, ratings)
  • Cost-Efficient: Uses Gemini 2.5 Flash-Lite model ($0.10/1M input, $0.40/1M output tokens)

Setup

  1. Enable Podcast Generation:

    # In your .env file
    AUDIO_GENERATION_ENABLED=true
    GEMINI_API_KEY=your-gemini-api-key-here
  2. Configure Google Cloud TTS (optional - for audio generation):

    # Option 1: Service Account (recommended)
    GOOGLE_TTS_CREDENTIALS_PATH=path/to/service-account.json
    
    # Option 2: API Key
    GOOGLE_CLOUD_API_KEY=your-google-cloud-api-key-here
    
    # TTS Configuration
    TTS_LANGUAGE_CODE=en-US
    TTS_VOICE_NAME=en-US-Neural2-J
    TTS_AUDIO_ENCODING=MP3
  3. Audio Storage Configuration:

    AUDIO_STORAGE_PATH=static/audio
    AUDIO_CLEANUP_DAYS=30
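
To make the retention setting concrete, here is one way a cleanup pass over `AUDIO_STORAGE_PATH` could look. It is a sketch under assumptions: it only considers `.mp3` files and uses the file modification time as the file's age. The function name `cleanup_old_audio` is hypothetical.

```python
import time
from pathlib import Path


def cleanup_old_audio(storage_path="static/audio", cleanup_days=30, now=None):
    """Delete .mp3 files whose modification time is older than `cleanup_days` days."""
    now = time.time() if now is None else now
    cutoff = now - cleanup_days * 86400  # seconds per day
    removed = []
    for path in Path(storage_path).glob("*.mp3"):
        if path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```

The function returns the names of the deleted files so a caller can log them.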

How It Works

  1. Automatic Trigger: Podcast generation automatically starts after Super Gems analysis completes (ensuring fresh data)
  2. Script Generation: The system generates natural, personalized podcast scripts focusing on:
    • Factual GitHub metrics (stars, contributors, repository health)
    • Qualitative AI analysis (specific to each project)
    • Real community data (open source status, working demos)
    • No algorithmic scores - avoids confusing numerical ratings
  3. Text Optimization: Converts technical content for speech (e.g., "github.com/user/repo" → "github repository by user")
  4. Audio Synthesis: Converts optimized script to high-quality MP3 audio using Google Cloud TTS
  5. Web Integration: Audio player automatically appears on Super Gems pages with streaming and download options
  6. File Management: Automatic cleanup of old files based on retention policy
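
The text optimization in step 3 can be approximated with a regular expression. This sketch covers only the GitHub URL example given above; the pattern and the name `optimize_for_speech` are assumptions, and the project's actual rules for acronyms and ratings are not shown here.

```python
import re

# Optional scheme, then github.com/<user>/<repo>; the repo must end in a word character
# so that sentence punctuation after the URL is left in place.
_GITHUB_URL = re.compile(r"(?:https?://)?github\.com/([\w-]+)/([\w.-]*\w)")


def optimize_for_speech(text):
    """Replace GitHub repository URLs with a phrase that reads well aloud."""
    return _GITHUB_URL.sub(lambda m: f"github repository by {m.group(1)}", text)
```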

Audio Player

  • Auto-Detection: Automatically appears on Super Gems pages
  • Professional Controls: Play/pause, progress bar, volume control, download button
  • Mobile Responsive: Works seamlessly on all devices
  • Keyboard Shortcuts: Spacebar (play/pause), arrow keys (seek ±10 seconds)
  • Metadata Display: Shows gem count, duration, generation date, file size

API Endpoints

  • GET /api/audio/super-gems/latest: Get latest podcast metadata
  • GET /api/audio/super-gems/<date>: Get podcast for specific date
  • POST /api/audio/generate: Manually trigger audio generation
  • GET /api/audio/list: List available podcast files
  • GET /audio/<filename>: Stream or download audio files

Duplicate Detection System

The application includes a comprehensive duplicate detection system to maintain content quality:

Detection Methods:

  • URL-based: Detects identical URLs after normalization (removes tracking parameters, trailing slashes, etc.)
  • Content similarity: Uses sequence matching to identify posts with similar titles and content (configurable thresholds)
  • Same-author detection: Identifies users posting similar content multiple times (spam behavior)
  • Title normalization: Removes common HN prefixes ("Ask HN:", "Show HN:") and punctuation for better matching
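
The URL and title normalization steps might be sketched as follows. The set of tracking parameters is an assumed example, not the project's actual list, and `normalize_url` and `normalize_title` are hypothetical names.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PREFIXES = ("utm_",)                 # assumed tracking-parameter prefixes
TRACKING_PARAMS = {"fbclid", "gclid", "ref"}  # assumed tracking parameters
HN_PREFIXES = re.compile(r"^(ask|show) hn:\s*", re.IGNORECASE)


def normalize_url(url):
    """Lowercase scheme and host; drop tracking parameters, fragment, and trailing slash."""
    parts = urlsplit(url.strip())
    query = [
        (k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in TRACKING_PARAMS and not k.startswith(TRACKING_PREFIXES)
    ]
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), ""))


def normalize_title(title):
    """Strip "Ask HN:" / "Show HN:" prefixes and punctuation, then collapse whitespace."""
    title = HN_PREFIXES.sub("", title.strip())
    title = re.sub(r"[^\w\s]", "", title)
    return re.sub(r"\s+", " ", title).strip().lower()
```

Two posts whose normalized URLs are equal can then be treated as URL duplicates, and the normalized titles can be fed into a similarity comparison.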

Integration Points:

  • Post Collection: Automatically checks for duplicates before saving new posts
  • Super Gems Analysis: Filters out duplicates before expensive LLM analysis
  • Database Management: Provides tools to find, mark, and clean up duplicate posts

Quality Preservation:

  • Keeps the earlier post (lower HN ID) or the higher-quality post (better gem score)
  • Marks duplicates as spam rather than deleting them
  • Maintains traceability of duplicate detection decisions
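
The list above does not say how the two criteria are combined when they disagree. The sketch below assumes one possible rule: prefer the better gem score and fall back to the lower HN ID on a tie. Both the rule and the field names `gem_score` and `hn_id` are assumptions made for illustration.

```python
def choose_post_to_keep(a, b):
    """Pick the post to keep; the other would be marked as spam, not deleted.

    Assumed rule: prefer the higher gem score, and fall back to the lower
    HN ID (the earlier post) when the scores are equal.
    """
    # Sort key: higher score first, then lower HN ID.
    return min(a, b, key=lambda p: (-p["gem_score"], p["hn_id"]))
```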

Performance Impact:

  • Cleaned up 278 existing duplicate posts across 112 URLs in initial deployment
  • Significantly reduces noise and improves content quality in hidden gems feed
  • Prevents spam behavior where users post the same content multiple times

Configuration

Configure the application using environment variables:

Core Settings

  • FLASK_ENV: development/production
  • DATABASE_URL: Database connection string
  • SECRET_KEY: Flask secret key for security
  • HOST: Server host (default: 127.0.0.1)
  • PORT: Server port (default: 5000)

Background Services

  • POST_COLLECTION_INTERVAL_MINUTES=5: Minutes between post collections (0 to disable)
  • POST_COLLECTION_BATCH_SIZE=25: Posts to commit per batch
  • POST_COLLECTION_MAX_STORIES=500: Max story IDs to fetch per run
  • HALL_OF_FAME_INTERVAL_HOURS=6: Hours between Hall of Fame monitoring (0 to disable)
  • SUPER_GEMS_INTERVAL_HOURS=6: Hours between super gems analysis (0 to disable)
  • SUPER_GEMS_ANALYSIS_HOURS=48: Hours back to analyze for super gems
  • SUPER_GEMS_TOP_N=5: Number of top gems to analyze per run
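
The "0 to disable" convention used by the interval settings can be read like this. It is a small sketch, not the project's loader: `interval_from_env` is a hypothetical helper, and `None` is used here to mean "do not schedule this service".

```python
import os


def interval_from_env(name, default, environ=None):
    """Read a non-negative integer interval; return None when it is 0 (disabled)."""
    environ = os.environ if environ is None else environ
    value = int(environ.get(name, default))
    if value < 0:
        raise ValueError(f"{name} must be >= 0, got {value}")
    return value or None
```

For example, `interval_from_env("POST_COLLECTION_INTERVAL_MINUTES", 5)` yields `5` when the variable is unset and `None` when it is set to `0`.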

Podcast Generation Settings

  • AUDIO_GENERATION_ENABLED=false: Enable automatic podcast audio generation
  • AUDIO_STORAGE_PATH=static/audio: Directory to store generated audio files
  • AUDIO_CLEANUP_DAYS=30: Delete audio files older than N days
  • GOOGLE_TTS_CREDENTIALS_PATH=path/to/service-account.json: Path to Google Cloud service account JSON
  • TTS_LANGUAGE_CODE=en-US: Language code for TTS (en-US, de-DE, etc.)
  • TTS_VOICE_NAME=en-US-Neural2-J: Voice name for audio generation
  • TTS_AUDIO_ENCODING=MP3: Audio format (MP3, OGG_OPUS, LINEAR16)

Quality Thresholds

  • KARMA_THRESHOLD=100: Max author karma for gems
  • MIN_INTEREST_SCORE=0.3: Min quality score for gems
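
Taken together, the two thresholds act as a simple filter. The sketch below assumes both bounds are inclusive, which the settings above do not state; `is_gem_candidate` is a hypothetical name.

```python
KARMA_THRESHOLD = 100      # documented default
MIN_INTEREST_SCORE = 0.3   # documented default


def is_gem_candidate(author_karma, interest_score,
                     karma_threshold=KARMA_THRESHOLD,
                     min_interest_score=MIN_INTEREST_SCORE):
    """Return True for a low-karma author whose post meets the quality bar."""
    return author_karma <= karma_threshold and interest_score >= min_interest_score
```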

Duplicate Detection Settings

  • URL_SIMILARITY_THRESHOLD=0.95: Minimum similarity score for URL matching (0.0-1.0)
  • TITLE_SIMILARITY_THRESHOLD=0.85: Minimum similarity score for title matching (0.0-1.0)
  • CONTENT_SIMILARITY_THRESHOLD=0.8: Minimum similarity score for content matching (0.0-1.0)
  • SAME_AUTHOR_THRESHOLD=0.7: Lower threshold when posts are by the same author (spam detection)
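
The interaction between the title threshold and the lower same-author threshold could look like the following. Using `difflib.SequenceMatcher` is an assumption based on the "sequence matching" wording earlier in this document, and the function names and post fields are hypothetical.

```python
from difflib import SequenceMatcher

TITLE_SIMILARITY_THRESHOLD = 0.85  # documented default
SAME_AUTHOR_THRESHOLD = 0.7        # documented default


def title_similarity(a, b):
    """Similarity ratio in [0.0, 1.0] between two titles, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_duplicate_title(post, other):
    """Apply the lower threshold when both posts share an author."""
    same_author = post["author"] == other["author"]
    threshold = SAME_AUTHOR_THRESHOLD if same_author else TITLE_SIMILARITY_THRESHOLD
    return title_similarity(post["title"], other["title"]) >= threshold
```

With these defaults, two titles that are 80% similar count as duplicates only when they come from the same author.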

Super Gems Analysis

  • GEMINI_API_KEY: Google Gemini API key for super gems analysis (required for super gems feature)
  • Enhanced GitHub Analysis: Uses 6 GitHub API calls per repository for detailed metrics
  • Factual Implementation Quality: Based on measurable GitHub metrics (stars, commits, structure)
  • Factual Community Value: Based on measurable community engagement (stars, forks, contributors)
  • No AI Speculation: LLM only assesses technical innovation, problem significance, and uniqueness
  • Graceful Failure: Returns no analysis rather than creating dummy/fake data when LLM parsing fails
  • Podcast Generation: Avoids algorithmic scores, focuses on factual data and qualitative analysis
  • The system automatically applies knowledge-aware evaluation to avoid penalizing recent technology releases
  • Uses temperature=0.1 for consistent, focused AI responses
  • Generates both HTML and JSON output for comprehensive analysis results
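
The graceful-failure behavior listed above can be sketched as a parser that returns nothing instead of inventing values. The JSON key names are hypothetical stand-ins for the three LLM-assessed dimensions; the real response schema is not documented here.

```python
import json

# Hypothetical keys for the three dimensions the LLM is said to assess.
REQUIRED_KEYS = {"technical_innovation", "problem_significance", "uniqueness"}


def parse_llm_analysis(raw_text):
    """Return the parsed analysis dict, or None if the reply is unusable."""
    try:
        data = json.loads(raw_text)
    except (TypeError, ValueError):  # JSONDecodeError is a ValueError subclass
        return None
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return None
    return data
```

A caller would skip the gem when the result is `None` rather than substitute default scores.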

Logging Settings

  • LOG_LEVEL: Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
  • LOG_FILE: Log file path (default: logs/app.log)

Development

# Run tests (when implemented)
pytest

# Check service status
flask collection-status

# View configuration
flask config-collection

Background Collection Service

The application includes an automatic background service for collecting new HN posts:

# Check service status
python scripts/manage_collector_simple.py status

# Manually trigger collection
python scripts/manage_collector_simple.py collect --minutes 60

# Flask CLI commands
flask config-collection            # Show configuration for both services
flask start-collector              # Start both services manually
flask stop-collector               # Stop both services manually
flask collect-now                  # Manually trigger post collection
flask monitor-gems                 # Manually trigger Hall of Fame monitoring
flask analyze-super-gems           # Manually trigger super gems analysis
flask collection-status            # Check status of all services

# Podcast Generation Management
flask generate-podcast             # Manually trigger podcast generation
flask podcast-status               # Check podcast generation status

# Duplicate Detection Management
flask find-duplicates              # Find and report duplicate posts
flask clean-duplicates             # Automatically clean up duplicate posts
flask check-post-duplicates        # Interactive duplicate checking for specific posts
flask cleanup-existing-duplicates  # Bulk cleanup of existing duplicates in database

Service Features

  • Four Background Services: Post collection + Hall of Fame monitoring + Super gems analysis + Podcast generation
  • No External Task Queue: No Redis or Celery required
  • Auto-start/stop: All services start with Flask app, stop when app stops
  • Configurable Intervals:
    • Post collection: Default 5 minutes (set to 0 to disable)
    • Hall of Fame monitoring: Default 6 hours (set to 0 to disable)
    • Super gems analysis: Default 6 hours (set to 0 to disable)
    • Podcast generation: Triggered after Super Gems analysis (when enabled)
  • Time-based Collection: Only processes posts from specified time windows
  • Automated Success Tracking: Promotes gems to Hall of Fame when they reach ≥100 points
  • Thread-safe: Prevents overlapping collection runs
  • Progress Tracking: Built-in statistics and status reporting for all services
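
One common way to prevent overlapping runs in an in-process scheduler is a non-blocking lock. The sketch below shows that general pattern with the standard library; it is not taken from the project's code, and `run_collection_once` is a hypothetical name.

```python
import threading

_collection_lock = threading.Lock()


def run_collection_once(collect):
    """Run `collect()` unless another run is already in progress.

    Returns True if the run happened and False if it was skipped.
    """
    if not _collection_lock.acquire(blocking=False):
        return False  # a previous run is still active; skip this tick
    try:
        collect()
        return True
    finally:
        _collection_lock.release()
```

If a scheduled tick fires while a slow collection is still running, the new tick returns `False` immediately instead of queuing behind it.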

API Endpoints

Core Endpoints

  • GET /api/gems: Latest hidden gems with filtering
  • GET /api/gems/hall-of-fame: Hall of fame entries
  • GET /super-gems: AI-curated super gems analysis page with visual scoring
  • GET /super-gems.html: Clean super gems analysis (no ratings, public-friendly)
  • GET /super-gems-ratings.html: Super gems analysis with star ratings and indicators
  • GET /super-gems.json: JSON API for super gems analysis data
  • GET /api/stats: Success metrics and statistics
  • GET /api/posts/<hn_id>: Get specific post by HN ID
  • GET /api/users/<username>: Get user information
  • GET /api/search?q=<query>: Search posts by title/content
  • GET /feed.xml: RSS feed of hidden gems

Collection Service Endpoints

  • GET /api/collection/status: Service status and statistics
  • POST /api/collection/trigger: Manually trigger collection
  • GET /api/collection/config: Current configuration

Podcast Service Endpoints

  • GET /api/audio/super-gems/latest: Get latest podcast metadata and URLs
  • GET /api/audio/super-gems/<date>: Get podcast for specific date (YYYY-MM-DD)
  • POST /api/audio/generate: Manually trigger podcast generation
  • GET /api/audio/list: List available podcast files
  • GET /api/podcast/scripts/latest: Get latest generated podcast script
  • GET /audio/<filename>: Stream or download audio files

Utility Endpoints

  • GET /api/health: Health check endpoint
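
A small client for the endpoints above might look like this. It assumes the server runs on the documented default host and port; `api_url` and `get_json` are hypothetical helpers, and no response shape is assumed because none is documented here.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:5000"  # matches the documented HOST and PORT defaults


def api_url(path, **params):
    """Build an absolute URL for one of the endpoints listed above."""
    url = BASE_URL + quote(path)
    return f"{url}?{urlencode(params)}" if params else url


def get_json(path, **params):
    """GET an endpoint and decode the JSON body (requires a running server)."""
    with urlopen(api_url(path, **params), timeout=10) as resp:
        return json.load(resp)
```

With the application running, `get_json("/api/health")` checks the service and `get_json("/api/search", q="compiler")` searches posts.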

Development Tools

This project was developed using:

  • XaresAICoder - Open-source browser IDE with integrated AI coding assistants
  • Claude Code - AI-powered development assistant for code analysis and implementation

License

MIT License - see LICENSE file for details.

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request

Support

For issues and feature requests, please use the GitHub issue tracker.
