HN Hidden Gems Finder
A tool that discovers high-quality Hacker News posts from low-karma accounts that would otherwise be overlooked.
The HN Hidden Gems Finder surfaces excellent content from new or low-karma Hacker News users, which often gets buried despite being valuable. It addresses the problem of strong "Show HN" posts and discussions getting no traction simply because the author lacks established karma.
Please note: This project is experimental in nature. The AI-generated content analysis, scoring, and podcast features are in alpha status and should not be taken too seriously. All AI-generated text, evaluations, and audio content represent algorithmic interpretations and may contain inaccuracies, biases, or subjective opinions. This is primarily a research project exploring automated content discovery and AI-powered analysis techniques.
- Real-time Hidden Gems Feed: Continuously discovers overlooked quality posts (every 5 minutes)
- Automated Background Services:
  - Post Collection: Automatic discovery and analysis with configurable intervals (no Redis required)
  - Hall of Fame Monitoring: Tracks gems that achieve success and automatically promotes them (every 6 hours)
  - Super Gems Analysis: AI-powered deep analysis of top gems using Google Gemini with user-friendly visual scoring (every 6 hours)
  - Podcast Generation: Automatic conversion of Super Gems analysis to professional podcast audio using AI script generation and Google Cloud Text-to-Speech (alpha - experimental feature)
- Hall of Fame: Automated tracking of discovered gems that later became popular (≥100 points)
- Success Metrics: Real-time monitoring of discovery accuracy and timing
- Quality Analysis: AI-powered content analysis to identify technical depth and originality
- Anti-spam Protection: Advanced filtering to maintain high quality
- Duplicate Detection: Intelligent duplicate post detection and filtering to prevent spam and content recycling
- Visual Scoring System: User-friendly star ratings and professional dot indicators instead of intimidating numerical scores
- Knowledge-Aware AI: Smart evaluation system that avoids penalizing posts for recent technology releases
- Time-based Collection: Intelligent collection that only processes posts from specified time windows
- Podcast Player: Built-in HTML5 audio player with streaming and download capabilities for Super Gems podcasts
- Install Dependencies
    pip install -r requirements.txt
- Configure Environment (optional):
    cp .env.sample .env
    # Edit .env to customize settings
- Initialize Database
    python -m hn_hidden_gems.models.init_db
- Configure Collection Service (Optional)
    # Set collection interval (default: 5 minutes, 0 to disable)
    export POST_COLLECTION_INTERVAL_MINUTES=5
- Run Application
    python app.py
The application automatically starts all background services:
- Post Collection: Discovers new gems every 5 minutes
- Hall of Fame Monitoring: Checks for gem success every 6 hours
- Super Gems Analysis: Deep AI analysis of top gems every 6 hours
- Podcast Generation: Automatically creates audio podcasts after Super Gems analysis completes (when enabled)
The system uses the official Hacker News API for all data collection:
- HN Firebase API: Real-time updates with no rate limits
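As a rough illustration of the data collection step, the sketch below reads the newest story IDs and then loads a story together with its author record from the public HN Firebase API. The helper names (`fetch_json`, `newest_story_ids`, `story_with_author`) are hypothetical and are not taken from this project's code; only the endpoint paths come from the official API.

```python
import json
from urllib.request import urlopen

# Base URL of the official Hacker News Firebase API.
HN_API = "https://hacker-news.firebaseio.com/v0"


def fetch_json(url, timeout=10):
    """Download a URL and decode its JSON body."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def newest_story_ids(limit=500, fetch=fetch_json):
    """Return up to `limit` of the most recent story IDs."""
    return fetch(f"{HN_API}/newstories.json")[:limit]


def story_with_author(story_id, fetch=fetch_json):
    """Return (story, author) dicts, or None if the story is unavailable."""
    story = fetch(f"{HN_API}/item/{story_id}.json")
    # Deleted or dead items may come back as null or without an author.
    if not story or story.get("deleted") or story.get("dead") or "by" not in story:
        return None
    author = fetch(f"{HN_API}/user/{story['by']}.json")
    return story, author
```

The `fetch` parameter is only there so the functions can be exercised without network access; the author record's `karma` field is what a low-karma filter would inspect.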
Key components:
- Data Collection: Automatic background collection of HN new posts
- Quality Analysis: AI-powered content evaluation with spam detection
- Duplicate Detection: Advanced duplicate post filtering using URL normalization, content similarity analysis, and same-author detection
- Super Gems Analysis: Advanced AI evaluation with visual scoring and knowledge-aware assessments
- Storage: SQLite for development, PostgreSQL for production
- Web Interface: Flask application with real-time updates
- Background Service: APScheduler-based in-process collection (no Redis required)
- Time-based Processing: Collects only posts from specified time windows
- Podcast System: Automated script generation with Gemini 2.5 Flash-Lite and high-quality audio synthesis with Google Cloud TTS
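Time-based processing can be pictured as a simple filter on each post's creation timestamp. The following is a minimal sketch, assuming posts are dicts with the `time` field (Unix seconds) that the HN API returns; the function names are illustrative and not the project's own.

```python
import time


def within_window(post, minutes, now=None):
    """True if the post was created within the last `minutes` minutes."""
    now = time.time() if now is None else now
    return now - minutes * 60 <= post["time"] <= now


def filter_recent(posts, minutes, now=None):
    """Keep only posts that fall inside the collection window."""
    now = time.time() if now is None else now
    return [p for p in posts if within_window(p, minutes, now)]
```

Passing `now` explicitly keeps one collection run consistent, since every post is compared against the same reference time.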
For detailed technical documentation on the algorithms:
📖 Hidden Gems Detection Algorithm - Complete documentation of the initial quality analysis system that identifies hidden gems from low-karma authors, including scoring methodology, spam detection, and Live Feed generation.
🔬 Super Gems Analysis Algorithm - In-depth explanation of the advanced LLM-powered analysis system, including prompt engineering, scoring criteria, and quality assurance measures.
The Super Gems feature provides comprehensive AI-powered analysis of top hidden gems:
Visual Scoring System:
- ⭐⭐⭐⭐⭐ Star ratings for overall quality (instead of intimidating numerical scores)
- Professional dot indicators for detailed metrics:
  - ●●●● Exceptional (91-100%)
  - ●●● Excellent (76-90%)
  - ●● Good (51-75%)
  - ● Basic (0-50%)
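The dot bands above translate directly into a small lookup. This is a sketch of that mapping only; the function name is hypothetical, and treating fractional values such as 90.5 as belonging to the lower band is an assumption.

```python
def dot_indicator(percent):
    """Map a 0-100 metric to the dot indicator for its band."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    if percent >= 91:
        return "●●●●"  # Exceptional
    if percent >= 76:
        return "●●●"   # Excellent
    if percent >= 51:
        return "●●"    # Good
    return "●"         # Basic
```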
Smart AI Analysis:
- Knowledge-aware evaluation that doesn't penalize posts for recent technology releases
- Technical merit focus over factual verification for emerging technologies
- Automatic bias correction for outdated knowledge assumptions
- Comprehensive GitHub integration for code quality assessment
Analysis Dimensions:
- Technical Innovation
- Problem Significance
- Implementation Quality
- Community Value
- Uniqueness Score
Output Formats:
- super-gems.html: Clean, public-friendly version without ratings (maintains ranking order)
- super-gems-ratings.html: Internal version with full visual scoring system
- super-gems.json: JSON API data with all analysis details
The HN Hidden Gems Podcast automatically converts Super Gems analysis into professional-quality audio content for on-the-go listening.
- AI Script Generation: Uses Gemini 2.5 Flash-Lite to create natural, engaging podcast scripts from Super Gems analysis
- Professional Audio: High-quality MP3 generation with Google Cloud Neural2 voices
- Automatic Integration: Seamlessly integrates with existing Super Gems analysis workflow
- Web Player: Built-in HTML5 audio player with streaming, progress control, and download options
- Text Optimization: Intelligently converts technical content for speech synthesis (URLs, acronyms, ratings)
- Cost-Efficient: Uses Gemini 2.5 Flash-Lite model ($0.10/1M input, $0.40/1M output tokens)
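To get a feel for the quoted pricing, the arithmetic can be sketched as follows. The token counts in the usage note are made-up illustrative numbers, not measurements from this project.

```python
# Prices quoted above for Gemini 2.5 Flash-Lite, in USD per million tokens.
INPUT_USD_PER_MILLION = 0.10
OUTPUT_USD_PER_MILLION = 0.40


def estimate_cost_usd(input_tokens, output_tokens):
    """Estimate the cost of one script generation call in USD."""
    total = (input_tokens * INPUT_USD_PER_MILLION
             + output_tokens * OUTPUT_USD_PER_MILLION)
    return total / 1_000_000
```

For a hypothetical run with 20,000 input tokens and 3,000 output tokens, `estimate_cost_usd(20_000, 3_000)` comes to roughly $0.0032.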
- Enable Podcast Generation:
    # In your .env file
    AUDIO_GENERATION_ENABLED=true
    GEMINI_API_KEY=your-gemini-api-key-here
- Configure Google Cloud TTS (optional - for audio generation):
    # Option 1: Service Account (recommended)
    GOOGLE_TTS_CREDENTIALS_PATH=path/to/service-account.json
    # Option 2: API Key
    GOOGLE_CLOUD_API_KEY=your-google-cloud-api-key-here
    # TTS Configuration
    TTS_LANGUAGE_CODE=en-US
    TTS_VOICE_NAME=en-US-Neural2-J
    TTS_AUDIO_ENCODING=MP3
- Audio Storage Configuration:
    AUDIO_STORAGE_PATH=static/audio
    AUDIO_CLEANUP_DAYS=30
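The retention setting can be understood as an age check on each stored file. Below is a minimal sketch of such a cleanup pass, assuming that only `.mp3` files are managed and that file modification time is used as the file's age; the project's actual cleanup logic may differ.

```python
import time
from pathlib import Path


def cleanup_old_audio(storage_dir, max_age_days, now=None):
    """Delete .mp3 files older than `max_age_days`; return the removed paths."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    removed = []
    # Materialize the listing first so deletion does not disturb iteration.
    for path in sorted(Path(storage_dir).glob("*.mp3")):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```

With the defaults shown above, this would be called as `cleanup_old_audio("static/audio", 30)`.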
- Automatic Trigger: Podcast generation automatically starts after Super Gems analysis completes (ensuring fresh data)
- Script Generation: The system generates natural, personalized podcast scripts focusing on:
  - Factual GitHub metrics (stars, contributors, repository health)
  - Qualitative AI analysis (specific to each project)
  - Real community data (open source status, working demos)
  - No algorithmic scores - avoids confusing numerical ratings
- Text Optimization: Converts technical content for speech (e.g., "github.com/user/repo" → "github repository by user")
- Audio Synthesis: Converts optimized script to high-quality MP3 audio using Google Cloud TTS
- Web Integration: Audio player automatically appears on Super Gems pages with streaming and download options
- File Management: Automatic cleanup of old files based on retention policy
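The text optimization step can be illustrated with the GitHub example given above. This sketch covers only that one rewrite, using a regular expression chosen for illustration; handling of acronyms, ratings, and other URL types is not shown, and the function name is hypothetical.

```python
import re

# Matches github.com/<user>/<repo>, with an optional scheme and "www." prefix.
GITHUB_URL = re.compile(
    r"(?:https?://)?(?:www\.)?github\.com/([\w-]+)/([\w-]+(?:\.[\w-]+)*)"
)


def speakable(text):
    """Replace GitHub repository URLs with a phrase that reads well aloud."""
    return GITHUB_URL.sub(lambda m: f"github repository by {m.group(1)}", text)
```

For instance, `speakable("github.com/user/repo")` yields `"github repository by user"`. Deeper paths such as `/tree/main` are left in place by this pattern and would need extra handling.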
- Auto-Detection: Automatically appears on Super Gems pages
- Professional Controls: Play/pause, progress bar, volume control, download button
- Mobile Responsive: Works seamlessly on all devices
- Keyboard Shortcuts: Spacebar (play/pause), arrow keys (seek ±10 seconds)
- Metadata Display: Shows gem count, duration, generation date, file size
- GET /api/audio/super-gems/latest: Get latest podcast metadata
- GET /api/audio/super-gems/<date>: Get podcast for specific date
- POST /api/audio/generate: Manually trigger audio generation
- GET /api/audio/list: List available podcast files
- GET /audio/<filename>: Stream or download audio files
The application includes a comprehensive duplicate detection system to maintain content quality:
Detection Methods:
- URL-based: Detects identical URLs after normalization (removes tracking parameters, trailing slashes, etc.)
- Content similarity: Uses sequence matching to identify posts with similar titles and content (configurable thresholds)
- Same-author detection: Identifies users posting similar content multiple times (spam behavior)
- Title normalization: Removes common HN prefixes ("Ask HN:", "Show HN:") and punctuation for better matching
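The detection methods above can be sketched with the standard library alone. The following is an illustrative approximation, not the project's implementation: the list of tracking parameters is an assumption, URL matching here uses exact equality after normalization rather than a similarity score, and the default thresholds are the documented defaults for TITLE_SIMILARITY_THRESHOLD and SAME_AUTHOR_THRESHOLD.

```python
import re
from difflib import SequenceMatcher
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed set of tracking parameters; the real list is not documented here.
TRACKING_PARAMS = {"ref", "fbclid", "gclid"}
HN_PREFIX = re.compile(r"^(ask|show)\s+hn\s*:\s*", re.IGNORECASE)


def normalize_url(url):
    """Lowercase scheme/host, drop tracking params, fragment, trailing slash."""
    parts = urlsplit(url.strip())
    query = [
        (k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k.lower() not in TRACKING_PARAMS and not k.lower().startswith("utm_")
    ]
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path.rstrip("/"), urlencode(sorted(query)), ""))


def normalize_title(title):
    """Strip 'Ask HN:'/'Show HN:' prefixes, punctuation, case, extra spaces."""
    title = HN_PREFIX.sub("", title.strip())
    title = re.sub(r"[^\w\s]", "", title)
    return " ".join(title.lower().split())


def title_similarity(a, b):
    """Sequence-matching ratio between two normalized titles (0.0-1.0)."""
    return SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()


def is_duplicate(a, b, threshold=0.85, same_author_threshold=0.7):
    """Compare two post dicts with 'title', 'by', and optional 'url' keys."""
    if a.get("url") and b.get("url") and normalize_url(a["url"]) == normalize_url(b["url"]):
        return True
    # Posts by the same author are held to a lower similarity bar.
    limit = same_author_threshold if a["by"] == b["by"] else threshold
    return title_similarity(a["title"], b["title"]) >= limit
```

Sorting the remaining query parameters makes two URLs that differ only in parameter order compare equal.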
Integration Points:
- Post Collection: Automatically checks for duplicates before saving new posts
- Super Gems Analysis: Filters out duplicates before expensive LLM analysis
- Database Management: Provides tools to find, mark, and clean up duplicate posts
Quality Preservation:
- Always keeps the earlier post (lower HN ID) or higher quality post (better gem score)
- Marks duplicates as spam rather than deleting them
- Maintains traceability of duplicate detection decisions
Performance Impact:
- Cleaned up 278 existing duplicate posts across 112 URLs in initial deployment
- Significantly reduces noise and improves content quality in hidden gems feed
- Prevents spam behavior where users post the same content multiple times
Configure the application using environment variables:
- FLASK_ENV: development/production
- DATABASE_URL: Database connection string
- SECRET_KEY: Flask secret key for security
- HOST: Server host (default: 127.0.0.1)
- PORT: Server port (default: 5000)
- POST_COLLECTION_INTERVAL_MINUTES=5: Minutes between post collections (0 to disable)
- POST_COLLECTION_BATCH_SIZE=25: Posts to commit per batch
- POST_COLLECTION_MAX_STORIES=500: Max story IDs to fetch per run
- HALL_OF_FAME_INTERVAL_HOURS=6: Hours between Hall of Fame monitoring (0 to disable)
- SUPER_GEMS_INTERVAL_HOURS=6: Hours between super gems analysis (0 to disable)
- SUPER_GEMS_ANALYSIS_HOURS=48: Hours back to analyze for super gems
- SUPER_GEMS_TOP_N=5: Number of top gems to analyze per run
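The "0 to disable" convention for the interval settings can be handled with a small parser. This is a sketch under the assumption that the values are plain integers; the helper name `read_interval` is hypothetical.

```python
import os


def read_interval(name, default, env=os.environ):
    """Read an interval setting; return None when it is set to 0 (disabled)."""
    raw = env.get(name, "").strip()
    value = int(raw) if raw else default
    if value < 0:
        raise ValueError(f"{name} must not be negative")
    return value or None
```

A scheduler would then register the job only when the result is not `None`, for example `read_interval("POST_COLLECTION_INTERVAL_MINUTES", 5)`.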
- AUDIO_GENERATION_ENABLED=false: Enable automatic podcast audio generation
- AUDIO_STORAGE_PATH=static/audio: Directory to store generated audio files
- AUDIO_CLEANUP_DAYS=30: Delete audio files older than N days
- GOOGLE_TTS_CREDENTIALS_PATH=path/to/service-account.json: Path to Google Cloud service account JSON
- TTS_LANGUAGE_CODE=en-US: Language code for TTS (en-US, de-DE, etc.)
- TTS_VOICE_NAME=en-US-Neural2-J: Voice name for audio generation
- TTS_AUDIO_ENCODING=MP3: Audio format (MP3, OGG_OPUS, LINEAR16)
- KARMA_THRESHOLD=100: Max author karma for gems
- MIN_INTEREST_SCORE=0.3: Min quality score for gems
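Taken together, these two settings describe a simple eligibility test. The sketch below assumes both bounds are inclusive, which the descriptions suggest but do not state explicitly; the function name is illustrative.

```python
def is_hidden_gem(author_karma, interest_score,
                  karma_threshold=100, min_interest_score=0.3):
    """True if a post is from a low-karma author and scores high enough."""
    return (author_karma <= karma_threshold
            and interest_score >= min_interest_score)
```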
- URL_SIMILARITY_THRESHOLD=0.95: Minimum similarity score for URL matching (0.0-1.0)
- TITLE_SIMILARITY_THRESHOLD=0.85: Minimum similarity score for title matching (0.0-1.0)
- CONTENT_SIMILARITY_THRESHOLD=0.8: Minimum similarity score for content matching (0.0-1.0)
- SAME_AUTHOR_THRESHOLD=0.7: Lower threshold when posts are by the same author (spam detection)
- GEMINI_API_KEY: Google Gemini API key for super gems analysis (required for super gems feature)

- Enhanced GitHub Analysis: Uses 6 GitHub API calls per repository for detailed metrics
- Factual Implementation Quality: Based on measurable GitHub metrics (stars, commits, structure)
- Factual Community Value: Based on measurable community engagement (stars, forks, contributors)
- No AI Speculation: LLM only assesses technical innovation, problem significance, and uniqueness
- Graceful Failure: Returns no analysis rather than creating dummy/fake data when LLM parsing fails
- Podcast Generation: Avoids algorithmic scores, focuses on factual data and qualitative analysis
- The system automatically applies knowledge-aware evaluation to avoid penalizing recent technology releases
- Uses temperature=0.1 for consistent, focused AI responses
- Generates both HTML and JSON output for comprehensive analysis results
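The graceful-failure behavior described above can be sketched as a parser that returns nothing instead of inventing placeholder values. The required field names below are hypothetical stand-ins for the three LLM-assessed dimensions; the real response schema is not documented here.

```python
import json

# Hypothetical keys for the dimensions the LLM is said to assess.
REQUIRED_KEYS = ("technical_innovation", "problem_significance", "uniqueness")


def parse_analysis(raw_text):
    """Return the parsed analysis dict, or None if the LLM output is unusable."""
    try:
        data = json.loads(raw_text)
    except (TypeError, ValueError):
        # Malformed or missing output: report no analysis, never dummy data.
        return None
    if not isinstance(data, dict) or any(k not in data for k in REQUIRED_KEYS):
        return None
    return data
```

Callers can then skip a post whose analysis is `None` rather than publish made-up scores.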
- LOG_LEVEL: Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
- LOG_FILE: Log file path (default: logs/app.log)
# Run tests (when implemented)
pytest
# Check service status
flask collection-status
# View configuration
flask config-collection
The application includes an automatic background service for collecting new HN posts:
# Check service status
python scripts/manage_collector_simple.py status
# Manually trigger collection
python scripts/manage_collector_simple.py collect --minutes 60
# Flask CLI commands
flask config-collection # Show configuration for both services
flask start-collector # Start both services manually
flask stop-collector # Stop both services manually
flask collect-now # Manually trigger post collection
flask monitor-gems # Manually trigger Hall of Fame monitoring
flask analyze-super-gems # Manually trigger super gems analysis
flask collection-status # Check status of all services
# Podcast Generation Management
flask generate-podcast # Manually trigger podcast generation
flask podcast-status # Check podcast generation status
# Duplicate Detection Management
flask find-duplicates # Find and report duplicate posts
flask clean-duplicates # Automatically clean up duplicate posts
flask check-post-duplicates # Interactive duplicate checking for specific posts
flask cleanup-existing-duplicates # Bulk cleanup of existing duplicates in database
- Four Background Services: Post collection + Hall of Fame monitoring + Super gems analysis + Podcast generation
- No External Dependencies: No Redis or Celery required
- Auto-start/stop: All services start with Flask app, stop when app stops
- Configurable Intervals:
  - Post collection: Default 5 minutes (set to 0 to disable)
  - Hall of Fame monitoring: Default 6 hours (set to 0 to disable)
  - Super gems analysis: Default 6 hours (set to 0 to disable)
  - Podcast generation: Triggered after Super Gems analysis (when enabled)
- Time-based Collection: Only processes posts from specified time windows
- Automated Success Tracking: Promotes gems to Hall of Fame when they reach ≥100 points
- Thread-safe: Prevents overlapping collection runs
- Progress Tracking: Built-in statistics and status reporting for all services
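One common way to prevent overlapping runs in an in-process scheduler is a non-blocking lock around the job. The following is a minimal sketch of that idea using only the standard library; it is an assumption about how such a guard might look, and the class name is hypothetical.

```python
import threading


class NonOverlappingJob:
    """Wrap a callable so a new run is skipped while one is still in progress."""

    def __init__(self, func):
        self._func = func
        self._lock = threading.Lock()
        self.skipped = 0  # simple counter for status reporting

    def __call__(self):
        # Try to take the lock without waiting; give up if a run is active.
        if not self._lock.acquire(blocking=False):
            self.skipped += 1
            return False
        try:
            self._func()
            return True
        finally:
            self._lock.release()
```

A scheduler would call the wrapped job on each tick; a tick that arrives during a long collection run returns `False` immediately instead of starting a second run.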
- GET /api/gems: Latest hidden gems with filtering
- GET /api/gems/hall-of-fame: Hall of fame entries
- GET /super-gems: AI-curated super gems analysis page with visual scoring
- GET /super-gems.html: Clean super gems analysis (no ratings, public-friendly)
- GET /super-gems-ratings.html: Super gems analysis with star ratings and indicators
- GET /super-gems.json: JSON API for super gems analysis data
- GET /api/stats: Success metrics and statistics
- GET /api/posts/<hn_id>: Get specific post by HN ID
- GET /api/users/<username>: Get user information
- GET /api/search?q=<query>: Search posts by title/content
- GET /feed.xml: RSS feed of hidden gems
- GET /api/collection/status: Service status and statistics
- POST /api/collection/trigger: Manually trigger collection
- GET /api/collection/config: Current configuration
- GET /api/audio/super-gems/latest: Get latest podcast metadata and URLs
- GET /api/audio/super-gems/<date>: Get podcast for specific date (YYYY-MM-DD)
- POST /api/audio/generate: Manually trigger podcast generation
- GET /api/audio/list: List available podcast files
- GET /api/podcast/scripts/latest: Get latest generated podcast script
- GET /audio/<filename>: Stream or download audio files
- GET /api/health: Health check endpoint
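The endpoints listed above can be called from any HTTP client. Here is a small sketch using the standard library, assuming the server runs on the default host and port (127.0.0.1:5000) and that the JSON endpoints return JSON bodies; the response shapes are not documented here, so the helper returns whatever the server sends.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Default HOST and PORT from the configuration section.
BASE_URL = "http://127.0.0.1:5000"


def endpoint(path, base_url=BASE_URL, **params):
    """Build a full URL for an API path, with optional query parameters."""
    url = base_url.rstrip("/") + path
    return f"{url}?{urlencode(params)}" if params else url


def get_json(path, **params):
    """GET an API path and decode the JSON response."""
    with urlopen(endpoint(path, **params), timeout=10) as resp:
        return json.load(resp)
```

For example, `get_json("/api/search", q="rust parser")` requests `/api/search?q=rust+parser`, and `get_json("/api/health")` can serve as a quick liveness check once the app is running.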
This project was developed using:
- XaresAICoder - Open-source browser IDE with integrated AI coding assistants
- Claude Code - AI-powered development assistant for code analysis and implementation
MIT License - see LICENSE file for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
For issues and feature requests, please use the GitHub issue tracker.