Data Engineering - AltSchool (Tinyuka) 2024 Second Semester Project Exam
A comprehensive MongoDB database system for an online e-learning platform, demonstrating advanced NoSQL database concepts, operations, and optimization techniques.
EduHub is a complete database backend for an online learning platform that supports:
- User management (students and instructors)
- Course creation and management
- Student enrollment and progress tracking
- Assignment submission and grading system
- Analytics and reporting capabilities
- Advanced search and filtering functionality
- MongoDB: v8.0 or higher (MongoDB Atlas Cloud or Local)
- Python: 3.8 or higher
- Environment: MongoDB Compass (GUI) and MongoDB Shell (CLI)
- Libraries:
  - pymongo: MongoDB Python driver
  - pandas: Data manipulation and analysis
  - datetime: Date and time operations
  - json: Data serialization
This project uses MongoDB Atlas (cloud database) for production deployment with fallback to local MongoDB for development.
# Cloud MongoDB Atlas (Primary)
CONNECTION_STRING = "mongodb+srv://username:password@cluster.mongodb.net/"
# Local MongoDB (Development)
CONNECTION_STRING = "mongodb://localhost:27017/"

The users collection stores information about students and instructors:
{
"userId": "ST_001",
"email": "student@example.com",
"firstName": "John",
"lastName": "Doe",
"role": "student", // "student" or "instructor"
"dateJoined": "2024-01-15T10:00:00Z",
"profile": {
"bio": "Passionate learner",
"avatar": "avatar_url",
"skills": ["Python", "JavaScript"]
},
"isActive": true
}

The courses collection contains course information and metadata:
{
"courseId": "CO_001",
"title": "Introduction to Python",
"description": "Learn Python programming fundamentals",
"instructorId": "IN_001",
"category": "Programming",
"level": "beginner", // "beginner", "intermediate", "advanced"
"duration": 40,
"price": 99.99,
"tags": ["python", "programming", "basics"],
"createdAt": "2024-01-10T09:00:00Z",
"updatedAt": "2024-01-15T14:30:00Z",
"isPublished": true,
"rating": 4.5
}

The enrollments collection tracks student course enrollments:
{
"enrollmentId": "EN_001",
"studentId": "ST_001",
"courseId": "CO_001",
"enrollmentDate": "2024-02-01T10:00:00Z",
"status": "active", // "active", "completed", "dropped"
"progress": 65,
"completionDate": null
}

The lessons collection stores individual lessons within courses:
{
"lessonId": "LE_001",
"courseId": "CO_001",
"title": "Variables and Data Types",
"content": "In this lesson, we'll explore...",
"duration": 30,
"order": 1,
"videoUrl": "https://example.com/video1",
"materials": ["slides.pdf", "code_examples.py"],
"createdAt": "2024-01-12T11:00:00Z"
}

The assignments collection contains course assignments:
{
"assignmentId": "AS_001",
"courseId": "CO_001",
"title": "Python Basics Quiz",
"description": "Test your understanding of Python basics",
"dueDate": "2024-02-15T23:59:59Z",
"maxPoints": 100,
"createdAt": "2024-01-20T09:00:00Z",
"instructions": "Complete all questions and submit your code"
}

The submissions collection tracks student assignment submissions:
{
"submissionId": "SU_001",
"assignmentId": "AS_001",
"studentId": "ST_001",
"submissionDate": "2024-02-14T18:30:00Z",
"content": "My solution to the assignment...",
"attachments": ["solution.py"],
"grade": 85,
"feedback": "Great work! Consider optimizing your loops.",
"gradedDate": "2024-02-16T10:00:00Z"
}
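
The documents above link to one another through ID fields (`instructorId`, `studentId`, `courseId`, `assignmentId`) rather than by embedding. As a rough sketch of how such references might be resolved, the pipeline below joins one student's enrollments to their courses with `$lookup`. The collection names `enrollments` and `courses` and the helper name `enrollments_with_courses_pipeline` are assumptions for illustration and are not taken from the project code.

```python
from typing import Any, Dict, List


def enrollments_with_courses_pipeline(student_id: str) -> List[Dict[str, Any]]:
    """Build an aggregation pipeline joining a student's enrollments to courses.

    Hypothetical helper: the collection names and output field names are assumed.
    """
    return [
        # Keep only the enrollments that belong to this student.
        {"$match": {"studentId": student_id}},
        # Resolve the courseId reference against the (assumed) courses collection.
        {
            "$lookup": {
                "from": "courses",
                "localField": "courseId",
                "foreignField": "courseId",
                "as": "course",
            }
        },
        # $lookup yields an array; flatten it to one course per enrollment.
        {"$unwind": "$course"},
        # Return a compact view of the joined data.
        {
            "$project": {
                "_id": 0,
                "enrollmentId": 1,
                "status": 1,
                "progress": 1,
                "courseTitle": "$course.title",
            }
        },
    ]


if __name__ == "__main__":
    for stage in enrollments_with_courses_pipeline("ST_001"):
        print(stage)
```

With a PyMongo database handle, the pipeline could be passed to `database["enrollments"].aggregate(...)`.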
- MongoDB Atlas Account (Recommended):
  - Sign up at MongoDB Atlas
  - Create a free cluster
  - Get your connection string
- Alternative - Local MongoDB Installation:
    # On macOS using Homebrew
    brew tap mongodb/brew
    brew install mongodb-community@8.0
    # On Ubuntu/Debian (after adding MongoDB's official apt repository)
    sudo apt-get install mongodb-org
    # On Windows
    # Download from https://www.mongodb.com/try/download/community
- Install Python Dependencies:
    pip install pymongo pandas jupyter notebook
- Configure Connection:
  - Create a config.json file in the notebooks/ directory:
      { "cloud_db_url": "mongodb+srv://username:password@cluster.mongodb.net/" }
- Clone and Navigate to Project:
    git clone <repository-url>
    cd mongodb-eduhub-project
- Initialize Database:
    python src/eduhub_queries.py
- Run Jupyter Notebook:
    jupyter notebook notebooks/eduhub_mongodb_project.ipynb
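
The setup steps describe a `config.json` holding `cloud_db_url`, with local MongoDB as the development fallback. A minimal sketch of how that choice could be made is shown here; the function name `resolve_connection_string` and the default path are assumptions, and the project's own loader may work differently.

```python
import json
from pathlib import Path

# Local development URL, as given in the connection examples.
LOCAL_URL = "mongodb://localhost:27017/"


def resolve_connection_string(config_path: str = "notebooks/config.json") -> str:
    """Return the Atlas URL from config.json if present, else the local URL.

    Hypothetical helper: only the "cloud_db_url" key comes from the README.
    """
    path = Path(config_path)
    if path.is_file():
        with path.open(encoding="utf-8") as handle:
            config = json.load(handle)
        cloud_url = config.get("cloud_db_url")
        if cloud_url:
            return cloud_url
    # No config file, or no cloud URL in it: fall back to local MongoDB.
    return LOCAL_URL


if __name__ == "__main__":
    print(resolve_connection_string())
```

Keeping the Atlas URL in a file that is listed in `.gitignore` avoids committing credentials to the repository.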
- Realistic Educational Data: Removed faker dependency in favor of curated educational content
- Professional Profiles: Real instructor backgrounds and student interests
- Industry-Standard Courses: Actual course titles and descriptions from tech industry
- Academic Authenticity: Proper assignment structures and grading patterns
- Cloud-First Architecture: Designed for MongoDB Atlas deployment
- Scalable Infrastructure: Ready for production workloads
- Connection Flexibility: Supports both cloud and local development
- Performance Optimized: Cloud-optimized indexing strategies
- Create: Add users, courses, enrollments, lessons, assignments
- Read: Advanced queries with filtering, sorting, and aggregation
- Update: Profile updates, grade assignments, course modifications
- Delete: Soft delete users, remove enrollments and lessons
- Course Enrollment Statistics: Enrollment counts and ratings by category
- Student Performance Analysis: Average grades and completion rates
- Instructor Analytics: Revenue and student metrics per instructor
- Trend Analysis: Monthly enrollment trends and engagement metrics
- Indexing Strategy: Optimized indexes for common query patterns
- Query Optimization: Efficient text search and date range queries
- Performance Monitoring: Query analysis and execution time tracking
- Schema Validation: JSON schema validation for all collections
- Unique Constraints: Prevent duplicate enrollments and users
- Error Handling: Comprehensive error handling for edge cases
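
To make the schema validation item more concrete, here is a sketch of what a `$jsonSchema` validator for user documents might look like. The field list follows the sample user document; the choice of required fields, the email pattern, and the name `USER_VALIDATOR` are assumptions, and the project's actual rules in `data/schema_validation.json` may differ.

```python
# Hypothetical validator for user documents, expressed as a MongoDB
# $jsonSchema document. Date fields are left out because their stored
# type (BSON date vs. ISO string) is not stated in the README.
USER_VALIDATOR = {
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["userId", "email", "firstName", "lastName", "role"],
        "properties": {
            "userId": {"bsonType": "string"},
            "email": {
                "bsonType": "string",
                # Assumed pattern: something@something.something, no spaces.
                "pattern": r"^[^@\s]+@[^@\s]+\.[^@\s]+$",
            },
            "firstName": {"bsonType": "string"},
            "lastName": {"bsonType": "string"},
            "role": {"enum": ["student", "instructor"]},
            "isActive": {"bsonType": "bool"},
            "profile": {
                "bsonType": "object",
                "properties": {
                    "bio": {"bsonType": "string"},
                    "avatar": {"bsonType": "string"},
                    "skills": {
                        "bsonType": "array",
                        "items": {"bsonType": "string"},
                    },
                },
            },
        },
    }
}

if __name__ == "__main__":
    print(sorted(USER_VALIDATOR["$jsonSchema"]["properties"]))
```

With PyMongo, a validator like this can be attached when the collection is created, for example `database.create_collection("users", validator=USER_VALIDATOR)`, where `database` is a PyMongo `Database` object and `users` is an assumed collection name.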
# Find all active students
active_students = db.find_all_active_students()
# Get courses in a specific category
programming_courses = db.get_courses_by_category("Programming")
# Search courses by title
search_results = db.search_courses_by_title("Python")

# Get course enrollment statistics
stats = db.get_course_enrollment_statistics()
# Analyze student performance
performance = db.get_student_performance_analysis()
# Generate instructor analytics
analytics = db.get_instructor_analytics()

# Find courses in price range
courses = db.find_courses_by_price_range(50, 200)
# Get recent users (last 6 months)
recent_users = db.get_recent_users(6)
# Find assignments due next week
upcoming = db.get_assignments_due_next_week()

- Primary Indexes: Unique constraints on all identifier fields
- Secondary Indexes: Optimized for common query patterns
- Compound Indexes: Multi-field indexes for complex queries
- Text Indexes: Full-text search on course titles and descriptions
- Basic lookups: ~2ms with proper indexing
- Complex aggregations: ~50ms for 1000+ documents
- Text searches: ~8ms with text indexes
- Date range queries: ~12ms with date indexes
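
The indexing strategy listed above could be expressed as a small table of index specifications. This is only a sketch: the collection names, the specific fields chosen, and the helper `apply_indexes` are assumptions and do not reflect the project's actual index list. Key directions use `1` (ascending) and `"text"`, the same values PyMongo exposes as `pymongo.ASCENDING` and `pymongo.TEXT`.

```python
from typing import Any, Dict, List, Tuple

# (collection name, index keys, create_index options)
IndexSpec = Tuple[str, List[Tuple[str, Any]], Dict[str, Any]]

# Assumed index set covering the four categories in the list above.
INDEX_SPECS: List[IndexSpec] = [
    # Unique identifier indexes.
    ("users", [("userId", 1)], {"unique": True}),
    ("users", [("email", 1)], {"unique": True}),
    # Compound unique index: one enrollment per student per course.
    ("enrollments", [("studentId", 1), ("courseId", 1)], {"unique": True}),
    # Secondary compound index for category/level filtering.
    ("courses", [("category", 1), ("level", 1)], {}),
    # Text index for full-text search on titles and descriptions.
    ("courses", [("title", "text"), ("description", "text")], {}),
    # Date index for due-date range queries.
    ("assignments", [("dueDate", 1)], {}),
]


def apply_indexes(database: Any) -> List[str]:
    """Create every index in INDEX_SPECS and return the index names.

    `database` is expected to behave like a PyMongo Database object:
    database[name] returns a collection with a create_index method.
    """
    created = []
    for collection_name, keys, options in INDEX_SPECS:
        created.append(database[collection_name].create_index(keys, **options))
    return created
```

Calling `apply_indexes(client["eduhub"])` on a PyMongo client would create the indexes; the database name `eduhub` is likewise an assumption.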
mongodb-eduhub-project/
├── README.md                           # Project documentation
├── .gitignore                          # Git ignore file
├── notebooks/
│   └── eduhub_mongodb_project.ipynb    # Interactive Jupyter notebook
├── src/
│   └── eduhub_queries.py               # Main Python implementation
├── data/
│   ├── sample_data.json                # Exported sample data
│   └── schema_validation.json          # Schema validation rules
└── docs/
    ├── performance_analysis.md         # Performance optimization docs
    └── presentation.pptx               # Project presentation
✅ Database Design: Proper schema design with relationships and constraints
✅ CRUD Operations: Complete implementation of Create, Read, Update, Delete operations
✅ Advanced Queries: Complex filtering, sorting, and aggregation pipelines
✅ Performance Optimization: Strategic indexing and query optimization
✅ Data Validation: Schema validation and error handling
✅ Analytics: Business intelligence queries and reporting
from src.eduhub_queries import EduHubDatabase
# Initialize with MongoDB Atlas
CONNECTION_STRING = "mongodb+srv://username:password@cluster.mongodb.net/"
db = EduHubDatabase(CONNECTION_STRING)
# Populate with sample data
db.populate_sample_data()

# Add a new student
student_id = db.add_new_student(
email="john.doe@email.com",
first_name="John",
last_name="Doe",
bio="Aspiring data scientist"
)
# Create a new course
course_id = db.create_new_course(
title="Advanced Python Programming",
description="Deep dive into Python advanced concepts",
instructor_id="IN_001",
category="Programming",
level="advanced",
duration=60,
price=199.99
)

# Get comprehensive analytics
analytics = db.get_advanced_analytics()
print(f"Monthly trends: {analytics['monthly_trends']}")
print(f"Popular categories: {analytics['popular_categories']}")

# Test email validation
valid_email = db.validate_email_format("test@example.com") # Returns True
invalid_email = db.validate_email_format("invalid-email") # Returns False
# Test user creation with validation
user_data = {
"userId": "ST_100",
"email": "test@example.com",
"firstName": "Test",
"lastName": "User",
"role": "student"
}
result = db.validate_and_insert_user(user_data)

# Analyze query performance
performance = db.analyze_query_performance(
"courses",
{"category": "Programming"}
)
print(f"Execution time: {performance['executionStats']['executionTimeMillis']}ms")
# Optimize slow queries
db.optimize_slow_queries()

The database is populated with realistic educational data:
- 20 Users: 15 students and 5 instructors with professional profiles
- 8 Courses: Comprehensive courses across different tech categories
- 25 Lessons: Educational content with proper learning progression
- 10 Assignments: Real-world project assignments
- 15 Enrollments: Student-course relationships with progress tracking
- 10+ Submissions: Assignment submissions with realistic grades and feedback
- Professional Instructors: Former Google, Facebook, Stanford backgrounds
- Comprehensive Courses: Python, Data Science, AI/ML, Web Development, Cloud Computing
- Educational Content: Proper lesson structures with video URLs and materials
- Industry Projects: Real-world assignment descriptions and requirements
- Academic Grading: Realistic grade distribution and professional feedback
- Revenue Analysis: Calculate total revenue per instructor
- Engagement Metrics: Track student completion rates
- Trend Analysis: Monthly enrollment and completion trends
- Performance Metrics: Average grades and top performers
- User Management: Registration, profile updates, soft deletion
- Course Management: Creation, publishing, content organization
- Enrollment Tracking: Student progress and completion status
- Assignment Grading: Submission handling and feedback
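
As an illustration of the revenue analysis mentioned above, the sketch below builds an aggregation pipeline that totals revenue per instructor. It assumes revenue is simply the course `price` summed over every enrollment (no discounts or refunds) and that the collections are named `enrollments` and `courses`; the helper name `instructor_revenue_pipeline` is hypothetical.

```python
from typing import Any, Dict, List


def instructor_revenue_pipeline() -> List[Dict[str, Any]]:
    """Build a pipeline computing total revenue and enrollments per instructor.

    Assumption: each enrollment contributes the full course price.
    """
    return [
        # Attach the course document to each enrollment.
        {
            "$lookup": {
                "from": "courses",
                "localField": "courseId",
                "foreignField": "courseId",
                "as": "course",
            }
        },
        {"$unwind": "$course"},
        # Group by the instructor who owns the course.
        {
            "$group": {
                "_id": "$course.instructorId",
                "totalRevenue": {"$sum": "$course.price"},
                "enrollmentCount": {"$sum": 1},
            }
        },
        # Highest-earning instructors first.
        {"$sort": {"totalRevenue": -1}},
    ]


if __name__ == "__main__":
    for stage in instructor_revenue_pipeline():
        print(stage)
```

Run against the enrollments collection, each result document would carry an instructor ID in `_id` alongside `totalRevenue` and `enrollmentCount`.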
The system implements comprehensive error handling:
- Duplicate Key Errors: Prevent duplicate users and enrollments
- Validation Errors: Ensure data integrity and format compliance
- Missing Data: Handle incomplete or invalid data gracefully
- Connection Errors: Robust database connection management
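
The validation and missing-data cases could be handled with an application-side check before insertion, along the lines of the sketch below. The function names `is_valid_email` and `user_validation_errors`, the email pattern, and the required-field list are assumptions; they stand in for, and may not match, the project's `validate_email_format` and `validate_and_insert_user` methods.

```python
import re
from typing import Any, Dict, List

# Assumed pattern: one "@", no whitespace, and a dot in the domain part.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
REQUIRED_USER_FIELDS = ("userId", "email", "firstName", "lastName", "role")
ALLOWED_ROLES = ("student", "instructor")


def is_valid_email(email: Any) -> bool:
    """Return True if email is a string matching the assumed pattern."""
    return isinstance(email, str) and EMAIL_PATTERN.match(email) is not None


def user_validation_errors(user: Dict[str, Any]) -> List[str]:
    """Collect human-readable problems with a user document.

    An empty list means the document passed every check.
    """
    errors = [
        f"missing field: {field}"
        for field in REQUIRED_USER_FIELDS
        if not user.get(field)
    ]
    if user.get("email") and not is_valid_email(user["email"]):
        errors.append("invalid email format")
    if user.get("role") and user["role"] not in ALLOWED_ROLES:
        errors.append("role must be 'student' or 'instructor'")
    return errors


if __name__ == "__main__":
    print(user_validation_errors({"userId": "ST_100", "email": "invalid-email"}))
```

Returning a list of errors rather than raising on the first problem lets a caller report every issue at once. Duplicate keys are a separate case: with unique indexes in place, PyMongo raises `pymongo.errors.DuplicateKeyError` at insert time.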
The Jupyter notebook includes data visualization examples:
- Enrollment trends over time
- Course popularity by category
- Student performance distributions
- Instructor analytics dashboards
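
Before a chart such as enrollment trends over time can be drawn, the enrollments need to be grouped by month. The sketch below shows one way to do that in plain Python, assuming `enrollmentDate` is an ISO 8601 string as in the sample document; the function name `monthly_enrollment_counts` is hypothetical, and the notebook may instead use pandas or an aggregation pipeline.

```python
from collections import Counter
from datetime import datetime
from typing import Any, Dict, Iterable


def monthly_enrollment_counts(enrollments: Iterable[Dict[str, Any]]) -> Dict[str, int]:
    """Count enrollments per "YYYY-MM" month, sorted chronologically."""
    counts: Counter = Counter()
    for enrollment in enrollments:
        # datetime.fromisoformat rejects a trailing "Z" before Python 3.11,
        # so rewrite it as an explicit UTC offset first.
        raw = enrollment["enrollmentDate"].replace("Z", "+00:00")
        month = datetime.fromisoformat(raw).strftime("%Y-%m")
        counts[month] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    sample = [
        {"enrollmentId": "EN_001", "enrollmentDate": "2024-02-01T10:00:00Z"},
        {"enrollmentId": "EN_002", "enrollmentDate": "2024-02-20T08:15:00Z"},
        {"enrollmentId": "EN_003", "enrollmentDate": "2024-03-05T12:00:00Z"},
    ]
    print(monthly_enrollment_counts(sample))
```

The resulting mapping can be handed to a plotting library as the x and y series of a line or bar chart.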
# Get collection statistics
stats = db.get_collection_statistics()
# Export data for backup
db.export_sample_data("backup_data.json")
# Analyze database performance
info = db.get_database_info()

This project is part of the AltSchool Data Engineering curriculum. For academic purposes:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
Goodrich Okoro
- Data Engineering - AltSchool (Tinyuka) 2024
- Email: okogoodrich@gmail.com
- GitHub: https://github.com/TelRich
✅ Database Design and Setup - Complete
✅ Data Population - Complete
✅ CRUD Operations - Complete
✅ Advanced Queries - Complete
✅ Performance Optimization - Complete
✅ Documentation - Complete
This project demonstrates comprehensive understanding of MongoDB database concepts, operations, and best practices for modern web applications.





