HN Hidden Gems Finder

A tool that discovers high-quality Hacker News posts from low-karma accounts that would otherwise be overlooked.

Overview

The HN Hidden Gems Finder helps surface excellent content from new or low-karma Hacker News users that often gets buried despite being valuable. This addresses the problem where great "Show HN" posts and discussions get no traction simply because the author doesn't have established karma.

⚠️ Experimental Status

Please note: This project is experimental in nature. The AI-generated content analysis, scoring, and podcast features are in alpha status and should not be taken too seriously. All AI-generated text, evaluations, and audio content represent algorithmic interpretations and may contain inaccuracies, biases, or subjective opinions. This is primarily a research project exploring automated content discovery and AI-powered analysis techniques.

Features

- Real-time Hidden Gems Feed: Continuously discovers overlooked quality posts (every 5 minutes)
- Automated Background Services:
  - Post Collection: Automatic discovery and analysis with configurable intervals (no Redis required)
  - Hall of Fame Monitoring: Tracks gems that achieve success and automatically promotes them (every 6 hours)
  - Super Gems Analysis: AI-powered deep analysis of top gems using Google Gemini with user-friendly visual scoring (every 6 hours)
  - Podcast Generation: Automatic conversion of Super Gems analysis to professional podcast audio using AI script generation and Google Cloud Text-to-Speech (alpha - experimental feature)
- Hall of Fame: Automated tracking of discovered gems that later became popular (≥100 points)
- Success Metrics: Real-time monitoring of discovery accuracy and timing
- Quality Analysis: AI-powered content analysis to identify technical depth and originality
- Anti-spam Protection: Advanced filtering to maintain high quality
- Duplicate Detection: Intelligent duplicate post detection and filtering to prevent spam and content recycling
- Visual Scoring System: User-friendly star ratings and professional dot indicators instead of intimidating numerical scores
- Knowledge-Aware AI: Smart evaluation system that avoids penalizing posts for recent technology releases
- Time-based Collection: Intelligent collection that only processes posts from specified time windows
- Podcast Player: Built-in HTML5 audio player with streaming and download capabilities for Super Gems podcasts

Quick Start

Install Dependencies
```
pip install -r requirements.txt
```

Configure Environment (optional):

cp .env.sample .env
# Edit .env to customize settings

Initialize Database
```
python -m hn_hidden_gems.models.init_db
```

Configure Collection Service (Optional)

# Set collection interval (default: 5 minutes, 0 to disable)
export POST_COLLECTION_INTERVAL_MINUTES=5

Run Application
```
python app.py
```
The application automatically starts all background services:
- Post Collection: Discovers new gems every 5 minutes
- Hall of Fame Monitoring: Checks for gem success every 6 hours
- Super Gems Analysis: Deep AI analysis of top gems every 6 hours
- Podcast Generation: Automatically creates audio podcasts after Super Gems analysis completes (when enabled)

Architecture

The system uses the official Hacker News API for all data collection:

HN Firebase API: Real-time updates; the API currently enforces no rate limit

Key components:

Data Collection: Automatic background collection of HN new posts
Quality Analysis: AI-powered content evaluation with spam detection
Duplicate Detection: Advanced duplicate post filtering using URL normalization, content similarity analysis, and same-author detection
Super Gems Analysis: Advanced AI evaluation with visual scoring and knowledge-aware assessments
Storage: SQLite for development, PostgreSQL for production
Web Interface: Flask application with real-time updates
Background Service: APScheduler-based in-process collection (no Redis required)
Time-based Processing: Collects only posts from specified time windows
Podcast System: Automated script generation with Gemini 2.5 Flash-Lite and high-quality audio synthesis with Google Cloud TTS
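
The data collection and time-based processing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual collector: the function names `fetch_json` and `recent_low_karma_posts` are hypothetical, and treating the karma threshold as inclusive is an assumption. The `newstories`, `item/<id>` and `user/<id>` resources are the ones exposed by the official HN Firebase API.

```python
import json
import time
from typing import Callable, Optional
from urllib.request import urlopen

HN_API = "https://hacker-news.firebaseio.com/v0"


def fetch_json(path: str):
    """Fetch one resource (e.g. 'newstories', 'item/123') from the HN API."""
    with urlopen(f"{HN_API}/{path}.json", timeout=10) as resp:
        return json.load(resp)


def recent_low_karma_posts(
    window_minutes: int,
    karma_threshold: int = 100,   # mirrors KARMA_THRESHOLD (inclusive: assumption)
    max_stories: int = 500,       # mirrors POST_COLLECTION_MAX_STORIES
    fetch: Callable[[str], object] = fetch_json,
    now: Optional[float] = None,
) -> list:
    """Return new stories inside the time window posted by low-karma authors."""
    now = time.time() if now is None else now
    cutoff = now - window_minutes * 60
    posts = []
    for story_id in (fetch("newstories") or [])[:max_stories]:
        item = fetch(f"item/{story_id}")
        # Missing, deleted or dead items are skipped.
        if not item or item.get("deleted") or item.get("dead") or "by" not in item:
            continue
        # Only keep posts created inside the requested time window.
        if item.get("time", 0) < cutoff:
            continue
        user = fetch(f"user/{item['by']}")
        if user and user.get("karma", 0) <= karma_threshold:
            posts.append(item)
    return posts
```

Passing `fetch` as a parameter keeps the filtering logic testable without network access; in real use the default `fetch_json` would be called once per story and once per author.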

Documentation

For detailed technical documentation on the algorithms:

📖 Hidden Gems Detection Algorithm - Complete documentation of the initial quality analysis system that identifies hidden gems from low-karma authors, including scoring methodology, spam detection, and Live Feed generation.

🔬 Super Gems Analysis Algorithm - In-depth explanation of the advanced LLM-powered analysis system, including prompt engineering, scoring criteria, and quality assurance measures.

Super Gems Analysis System

The Super Gems feature provides comprehensive AI-powered analysis of top hidden gems:

Visual Scoring System:

⭐⭐⭐⭐⭐ Star ratings for overall quality (instead of intimidating numerical scores)
Professional dot indicators for detailed metrics:
- ●●●● Exceptional (91-100%)
- ●●● Excellent (76-90%)
- ●● Good (51-75%)
- ● Basic (0-50%)
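
The dot bands above map directly to a small lookup. The sketch below assumes scores arrive as percentages between 0 and 100 and that fractional values fall into the band whose lower bound they have reached; the function name `dot_indicator` is hypothetical.

```python
def dot_indicator(percent: float) -> str:
    """Map a 0-100 score to the dot indicator bands listed above."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    if percent >= 91:
        return "●●●● Exceptional"
    if percent >= 76:
        return "●●● Excellent"
    if percent >= 51:
        return "●● Good"
    return "● Basic"
```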

Smart AI Analysis:

Knowledge-aware evaluation that doesn't penalize posts for recent technology releases
Technical merit focus over factual verification for emerging technologies
Automatic bias correction for outdated knowledge assumptions
Comprehensive GitHub integration for code quality assessment

Analysis Dimensions:

Technical Innovation
Problem Significance
Implementation Quality
Community Value
Uniqueness Score

Output Formats:

super-gems.html: Clean, public-friendly version without ratings (maintains ranking order)
super-gems-ratings.html: Internal version with full visual scoring system
super-gems.json: JSON API data with all analysis details

Podcast Feature 🎧

The HN Hidden Gems Podcast automatically converts Super Gems analysis into professional-quality audio content for on-the-go listening.

Features

AI Script Generation: Uses Gemini 2.5 Flash-Lite to create natural, engaging podcast scripts from Super Gems analysis
Professional Audio: High-quality MP3 generation with Google Cloud Neural2 voices
Automatic Integration: Seamlessly integrates with existing Super Gems analysis workflow
Web Player: Built-in HTML5 audio player with streaming, progress control, and download options
Text Optimization: Intelligently converts technical content for speech synthesis (URLs, acronyms, ratings)
Cost-Efficient: Uses Gemini 2.5 Flash-Lite model ($0.10/1M input, $0.40/1M output tokens)

Setup

Enable Podcast Generation:

# In your .env file
AUDIO_GENERATION_ENABLED=true
GEMINI_API_KEY=your-gemini-api-key-here

Configure Google Cloud TTS (optional - for audio generation):

# Option 1: Service Account (recommended)
GOOGLE_TTS_CREDENTIALS_PATH=path/to/service-account.json

# Option 2: API Key
GOOGLE_CLOUD_API_KEY=your-google-cloud-api-key-here

# TTS Configuration
TTS_LANGUAGE_CODE=en-US
TTS_VOICE_NAME=en-US-Neural2-J
TTS_AUDIO_ENCODING=MP3

Audio Storage Configuration:

AUDIO_STORAGE_PATH=static/audio
AUDIO_CLEANUP_DAYS=30

How It Works

- Automatic Trigger: Podcast generation automatically starts after Super Gems analysis completes (ensuring fresh data)
- Script Generation: The system generates natural, personalized podcast scripts focusing on:
  - Factual GitHub metrics (stars, contributors, repository health)
  - Qualitative AI analysis (specific to each project)
  - Real community data (open source status, working demos)
  - No algorithmic scores - avoids confusing numerical ratings
- Text Optimization: Converts technical content for speech (e.g., "github.com/user/repo" → "github repository by user")
- Audio Synthesis: Converts optimized script to high-quality MP3 audio using Google Cloud TTS
- Web Integration: Audio player automatically appears on Super Gems pages with streaming and download options
- File Management: Automatic cleanup of old files based on retention policy
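
The text optimization step can be illustrated with the GitHub URL example given above. This is only a sketch of that single rewrite rule: the regular expression and the function name `optimize_for_speech` are assumptions, and the real system presumably also handles acronyms and ratings.

```python
import re

# Matches github.com/<user>/<repo>, with or without a scheme or "www." prefix.
# The repository part may contain dots but may not end with one, so sentence
# punctuation after the URL is left untouched.
_GITHUB_URL = re.compile(
    r"(?:https?://)?(?:www\.)?github\.com/([\w.-]+)/([\w-]+(?:\.[\w-]+)*)"
)


def optimize_for_speech(text: str) -> str:
    """Replace GitHub repository URLs with a phrase that reads well aloud."""
    return _GITHUB_URL.sub(lambda m: f"github repository by {m.group(1)}", text)
```

For example, `optimize_for_speech("See github.com/user/repo for details")` yields `"See github repository by user for details"`.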

Audio Player

Auto-Detection: Automatically appears on Super Gems pages
Professional Controls: Play/pause, progress bar, volume control, download button
Mobile Responsive: Works seamlessly on all devices
Keyboard Shortcuts: Spacebar (play/pause), arrow keys (seek ±10 seconds)
Metadata Display: Shows gem count, duration, generation date, file size

API Endpoints

GET /api/audio/super-gems/latest: Get latest podcast metadata
GET /api/audio/super-gems/<date>: Get podcast for specific date
POST /api/audio/generate: Manually trigger audio generation
GET /api/audio/list: List available podcast files
GET /audio/<filename>: Stream or download audio files

Duplicate Detection System

The application includes a comprehensive duplicate detection system to maintain content quality:

Detection Methods:

URL-based: Detects identical URLs after normalization (removes tracking parameters, trailing slashes, etc.)
Content similarity: Uses sequence matching to identify posts with similar titles and content (configurable thresholds)
Same-author detection: Identifies users posting similar content multiple times (spam behavior)
Title normalization: Removes common HN prefixes ("Ask HN:", "Show HN:") and punctuation for better matching
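
A rough sketch of the normalization and matching steps described above follows. It assumes Python's standard `difflib.SequenceMatcher` for the sequence matching, and the list of tracking parameters is illustrative only; the project's actual parameter list and helper names are not documented here.

```python
import re
from difflib import SequenceMatcher
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Illustrative tracking parameters; the real list is an assumption.
TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"ref", "fbclid", "gclid"}

_HN_PREFIX = re.compile(r"^(ask|show)\s+hn\s*:\s*", re.IGNORECASE)


def normalize_url(url: str) -> str:
    """Lowercase scheme/host, drop tracking parameters, fragment and trailing slash."""
    parts = urlsplit(url.strip())
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
        and key.lower() not in TRACKING_KEYS
    ]
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(),
         parts.path.rstrip("/"), urlencode(query), "")
    )


def normalize_title(title: str) -> str:
    """Strip 'Ask HN:' / 'Show HN:' prefixes, punctuation and extra whitespace."""
    title = _HN_PREFIX.sub("", title.strip())
    title = re.sub(r"[^\w\s]", "", title.lower())
    return " ".join(title.split())


def title_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0.0, 1.0] between two normalized titles."""
    return SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()
```

Two URLs count as identical when their normalized forms compare equal, while titles are compared against a configurable similarity threshold.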

Integration Points:

Post Collection: Automatically checks for duplicates before saving new posts
Super Gems Analysis: Filters out duplicates before expensive LLM analysis
Database Management: Provides tools to find, mark, and clean up duplicate posts

Quality Preservation:

Keeps either the earlier post (lower HN ID) or the higher-quality post (better gem score)
Marks duplicates as spam rather than deleting them
Maintains traceability of duplicate detection decisions

Performance Impact:

Cleaned up 278 existing duplicate posts across 112 URLs in initial deployment
Significantly reduces noise and improves content quality in hidden gems feed
Prevents spam behavior where users post the same content multiple times

Configuration

Configure the application using environment variables:

Core Settings

FLASK_ENV: development/production
DATABASE_URL: Database connection string
SECRET_KEY: Flask secret key for security
HOST: Server host (default: 127.0.0.1)
PORT: Server port (default: 5000)

Background Services

POST_COLLECTION_INTERVAL_MINUTES=5: Minutes between post collections (0 to disable)
POST_COLLECTION_BATCH_SIZE=25: Posts to commit per batch
POST_COLLECTION_MAX_STORIES=500: Max story IDs to fetch per run
HALL_OF_FAME_INTERVAL_HOURS=6: Hours between Hall of Fame monitoring (0 to disable)
SUPER_GEMS_INTERVAL_HOURS=6: Hours between super gems analysis (0 to disable)
SUPER_GEMS_ANALYSIS_HOURS=48: Hours back to analyze for super gems
SUPER_GEMS_TOP_N=5: Number of top gems to analyze per run
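
The "0 to disable" convention for the interval settings could be read from the environment along these lines. The helper name `interval_seconds` is hypothetical, and rejecting negative values is an assumption, since the settings above only define the meaning of 0.

```python
import os
from typing import Mapping, Optional


def interval_seconds(
    name: str,
    default: int,
    unit_seconds: int,
    env: Mapping[str, str] = os.environ,
) -> Optional[int]:
    """Return the interval in seconds, or None when the service is disabled (0)."""
    raw = env.get(name)
    value = default if raw is None or raw.strip() == "" else int(raw)
    if value < 0:
        raise ValueError(f"{name} must be 0 (disabled) or a positive integer")
    if value == 0:
        return None  # 0 disables the corresponding background service
    return value * unit_seconds


# Example: minutes for post collection, hours for the other services.
collection = interval_seconds("POST_COLLECTION_INTERVAL_MINUTES", 5, 60)
hall_of_fame = interval_seconds("HALL_OF_FAME_INTERVAL_HOURS", 6, 3600)
```

A scheduler would then register a job only when the returned value is not `None`.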

Podcast Generation Settings

AUDIO_GENERATION_ENABLED=false: Enable automatic podcast audio generation
AUDIO_STORAGE_PATH=static/audio: Directory to store generated audio files
AUDIO_CLEANUP_DAYS=30: Delete audio files older than N days
GOOGLE_TTS_CREDENTIALS_PATH=path/to/service-account.json: Path to Google Cloud service account JSON
TTS_LANGUAGE_CODE=en-US: Language code for TTS (en-US, de-DE, etc.)
TTS_VOICE_NAME=en-US-Neural2-J: Voice name for audio generation
TTS_AUDIO_ENCODING=MP3: Audio format (MP3, OGG_OPUS, LINEAR16)
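
The retention policy behind `AUDIO_CLEANUP_DAYS` might look roughly like the sketch below. It assumes file age is judged by modification time and that generated files use the extensions shown; both are assumptions, as is the function name `cleanup_old_audio`.

```python
import time
from pathlib import Path
from typing import List, Optional

# Extensions assumed for the MP3, OGG_OPUS and LINEAR16 encodings.
AUDIO_SUFFIXES = {".mp3", ".ogg", ".wav"}


def cleanup_old_audio(
    storage_path: str, cleanup_days: int, now: Optional[float] = None
) -> List[Path]:
    """Delete audio files older than cleanup_days and return the removed paths."""
    now = time.time() if now is None else now
    cutoff = now - cleanup_days * 86400
    root = Path(storage_path)
    removed: List[Path] = []
    if not root.is_dir():
        return removed
    for path in sorted(root.iterdir()):
        if not path.is_file() or path.suffix.lower() not in AUDIO_SUFFIXES:
            continue
        if path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```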

Quality Thresholds

KARMA_THRESHOLD=100: Max author karma for gems
MIN_INTEREST_SCORE=0.3: Min quality score for gems
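
Taken together, the two thresholds act as a simple gate. The sketch below assumes both comparisons are inclusive, which the settings above do not state explicitly; `is_hidden_gem` is a hypothetical name.

```python
def is_hidden_gem(
    author_karma: int,
    interest_score: float,
    karma_threshold: int = 100,       # KARMA_THRESHOLD
    min_interest_score: float = 0.3,  # MIN_INTEREST_SCORE
) -> bool:
    """A post qualifies when its author is low-karma and its score is high enough."""
    return author_karma <= karma_threshold and interest_score >= min_interest_score
```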

Duplicate Detection Settings

URL_SIMILARITY_THRESHOLD=0.95: Minimum similarity score for URL matching (0.0-1.0)
TITLE_SIMILARITY_THRESHOLD=0.85: Minimum similarity score for title matching (0.0-1.0)
CONTENT_SIMILARITY_THRESHOLD=0.8: Minimum similarity score for content matching (0.0-1.0)
SAME_AUTHOR_THRESHOLD=0.7: Lower threshold when posts are by the same author (spam detection)
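
One plausible way these thresholds combine is sketched below. How the project actually combines them is not documented here, so the decision logic (URL match first, then title or content, with the lower same-author cutoff applied to both) is an assumption, as are the names `DuplicateThresholds` and `is_duplicate`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DuplicateThresholds:
    url: float = 0.95          # URL_SIMILARITY_THRESHOLD
    title: float = 0.85        # TITLE_SIMILARITY_THRESHOLD
    content: float = 0.8       # CONTENT_SIMILARITY_THRESHOLD
    same_author: float = 0.7   # SAME_AUTHOR_THRESHOLD


def is_duplicate(
    url_sim: float,
    title_sim: float,
    content_sim: float,
    same_author: bool,
    t: DuplicateThresholds = DuplicateThresholds(),
) -> bool:
    """Decide whether two posts are duplicates from their similarity scores."""
    if url_sim >= t.url:
        return True
    # Posts by the same author are held to the lower (stricter) cutoff.
    title_cut = min(t.title, t.same_author) if same_author else t.title
    content_cut = min(t.content, t.same_author) if same_author else t.content
    return title_sim >= title_cut or content_sim >= content_cut
```

With the defaults, a title similarity of 0.75 flags a duplicate only when both posts come from the same author.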

Super Gems Analysis

GEMINI_API_KEY: Google Gemini API key for super gems analysis (required for super gems feature)
Enhanced GitHub Analysis: Uses 6 GitHub API calls per repository for detailed metrics
Factual Implementation Quality: Based on measurable GitHub metrics (stars, commits, structure)
Factual Community Value: Based on measurable community engagement (stars, forks, contributors)
No AI Speculation: LLM only assesses technical innovation, problem significance, and uniqueness
Graceful Failure: Returns no analysis rather than creating dummy/fake data when LLM parsing fails
Podcast Generation: Avoids algorithmic scores, focuses on factual data and qualitative analysis
The system automatically applies knowledge-aware evaluation to avoid penalizing recent technology releases
Uses temperature=0.1 for consistent, focused AI responses
Generates both HTML and JSON output for comprehensive analysis results
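
The graceful-failure behaviour described above (no analysis rather than dummy data) can be sketched as a strict parser. The JSON field names below are hypothetical stand-ins for the three LLM-assessed dimensions; the real response schema is not shown in this README.

```python
import json
from typing import Dict, Optional

# Hypothetical keys for the three dimensions the LLM assesses.
LLM_DIMENSIONS = ("technical_innovation", "problem_significance", "uniqueness")


def parse_llm_analysis(raw: str) -> Optional[Dict[str, float]]:
    """Return parsed scores, or None if the response is not usable.

    No placeholder values are ever substituted for missing fields.
    """
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        return None
    if not isinstance(data, dict):
        return None
    scores: Dict[str, float] = {}
    for key in LLM_DIMENSIONS:
        value = data.get(key)
        # bool is a subclass of int, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return None
        scores[key] = float(value)
    return scores
```

Callers then skip a post whose analysis is `None` instead of publishing fabricated scores.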

Logging Settings

LOG_LEVEL: Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
LOG_FILE: Log file path (default: logs/app.log)

Development

# Run tests (when implemented)
pytest

# Check service status
flask collection-status

# View configuration
flask config-collection

Background Collection Service

The application includes an automatic background service for collecting new HN posts:

# Check service status
python scripts/manage_collector_simple.py status

# Manually trigger collection
python scripts/manage_collector_simple.py collect --minutes 60

# Flask CLI commands
flask config-collection          # Show configuration for both services
flask start-collector           # Start both services manually
flask stop-collector            # Stop both services manually
flask collect-now               # Manually trigger post collection
flask monitor-gems              # Manually trigger Hall of Fame monitoring
flask analyze-super-gems        # Manually trigger super gems analysis
flask collection-status         # Check status of all services

# Podcast Generation Management
flask generate-podcast           # Manually trigger podcast generation
flask podcast-status            # Check podcast generation status

# Duplicate Detection Management
flask find-duplicates           # Find and report duplicate posts
flask clean-duplicates          # Automatically clean up duplicate posts
flask check-post-duplicates     # Interactive duplicate checking for specific posts
flask cleanup-existing-duplicates # Bulk cleanup of existing duplicates in database

Service Features

- Four Background Services: Post collection + Hall of Fame monitoring + Super gems analysis + Podcast generation
- No External Dependencies: No Redis or Celery required
- Auto-start/stop: All services start with Flask app, stop when app stops
- Configurable Intervals:
  - Post collection: Default 5 minutes (set to 0 to disable)
  - Hall of Fame monitoring: Default 6 hours (set to 0 to disable)
  - Super gems analysis: Default 6 hours (set to 0 to disable)
  - Podcast generation: Triggered after Super Gems analysis (when enabled)
- Time-based Collection: Only processes posts from specified time windows
- Automated Success Tracking: Promotes gems to Hall of Fame when they reach ≥100 points
- Thread-safe: Prevents overlapping collection runs
- Progress Tracking: Built-in statistics and status reporting for all services
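
Preventing overlapping runs is commonly done with a non-blocking lock. The sketch below shows that general technique under the assumption that a run arriving while another is in progress should simply be skipped; it is not taken from the project's code, and `NonOverlappingJob` is a hypothetical name.

```python
import threading


class NonOverlappingJob:
    """Wrap a callable so that a new run is skipped while one is in progress."""

    def __init__(self, func):
        self._func = func
        self._lock = threading.Lock()

    def __call__(self, *args, **kwargs):
        # Try to take the lock without waiting; if it is held, skip this run.
        if not self._lock.acquire(blocking=False):
            return None
        try:
            return self._func(*args, **kwargs)
        finally:
            self._lock.release()
```

A scheduler can call the wrapped job on every tick: a tick that fires during a slow collection returns `None` immediately instead of starting a second run.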

API Endpoints

Core Endpoints

GET /api/gems: Latest hidden gems with filtering
GET /api/gems/hall-of-fame: Hall of fame entries
GET /super-gems: AI-curated super gems analysis page with visual scoring
GET /super-gems.html: Clean super gems analysis (no ratings, public-friendly)
GET /super-gems-ratings.html: Super gems analysis with star ratings and indicators
GET /super-gems.json: JSON API for super gems analysis data
GET /api/stats: Success metrics and statistics
GET /api/posts/<hn_id>: Get specific post by HN ID
GET /api/users/<username>: Get user information
GET /api/search?q=<query>: Search posts by title/content
GET /feed.xml: RSS feed of hidden gems

Collection Service Endpoints

GET /api/collection/status: Service status and statistics
POST /api/collection/trigger: Manually trigger collection
GET /api/collection/config: Current configuration

Podcast Service Endpoints

GET /api/audio/super-gems/latest: Get latest podcast metadata and URLs
GET /api/audio/super-gems/<date>: Get podcast for specific date (YYYY-MM-DD)
POST /api/audio/generate: Manually trigger podcast generation
GET /api/audio/list: List available podcast files
GET /api/podcast/scripts/latest: Get latest generated podcast script
GET /audio/<filename>: Stream or download audio files

Utility Endpoints

GET /api/health: Health check endpoint
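
A minimal client for the endpoints listed above might be built with the standard library only. The sketch assumes the default host and port from Core Settings and that the `/api/...` endpoints return JSON; whether the POST endpoints require a body or authentication is not specified here. The helper names are hypothetical.

```python
import json
from typing import Mapping, Optional
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "http://127.0.0.1:5000"  # default HOST and PORT


def build_request(
    path: str,
    params: Optional[Mapping[str, str]] = None,
    method: str = "GET",
    base_url: str = BASE_URL,
) -> Request:
    """Build a request for one of the documented endpoints."""
    url = base_url.rstrip("/") + path
    if params:
        url += "?" + urlencode(params)
    return Request(url, method=method, headers={"Accept": "application/json"})


def call(request: Request):
    """Send the request and decode the (assumed) JSON response."""
    with urlopen(request, timeout=10) as resp:
        return json.load(resp)


# Example usage against a locally running instance:
# gems = call(build_request("/api/gems"))
# hits = call(build_request("/api/search", {"q": "rust parser"}))
# call(build_request("/api/collection/trigger", method="POST"))
```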

Development Tools

This project was developed using:

XaresAICoder - Open-source browser IDE with integrated AI coding assistants
Claude Code - AI-powered development assistant for code analysis and implementation

License

MIT License - see LICENSE file for details.

Contributing

Fork the repository
Create a feature branch
Make your changes
Add tests
Submit a pull request

Support

For issues and feature requests, please use the GitHub issue tracker.
