A powerful and friendly tool for downloading and organizing MIT OpenCourseWare (OCW) course materials. MITCrawlerX helps you access high-quality educational content from MIT's extensive course catalog and makes it easy to download and organize course materials for offline use, for training LLMs, or for building educational RAG applications.
- Single Course Download: Download all materials from a specific MIT OCW course
- Multi-Course Download: Download materials from multiple courses across different subjects
- Smart Organization: Organizes downloaded content by course and subject
- Content Extraction: Extracts text content from various file formats (PDF, DOCX, PY)
- Progress Tracking: Shows progress and provides detailed download summaries
- Resume Capability: Can resume interrupted downloads and skip already downloaded content
- Python 3.6 or higher
- Internet connection
- Required Python packages (automatically installed):
  - requests
  - beautifulsoup4
  - PyMuPDF (fitz)
  - python-docx
- Clone the repository:
git clone https://github.com/Ashad001/MITCrawlerX.git
cd MITCrawlerX
- Install required packages:
pip install -r requirements.txt
Here are some common and helpful commands to run the script:
- Download a single course with default settings:
python main.py --single --course-url "https://ocw.mit.edu/courses/6-0001-introduction-to-computer-science-and-programming-in-python-fall-2016/"
- Download a single course to a custom directory:
python main.py --single --course-url "https://ocw.mit.edu/courses/your-course-url/" --download-dir "my_courses"
- Download multiple courses from specific subjects:
python main.py --multi --subject-urls "https://ocw.mit.edu/search/?d=Computer%20Science" "https://ocw.mit.edu/search/?d=Mathematics"
- Limit the number of courses per subject:
python main.py --multi --subject-urls "https://ocw.mit.edu/search/?d=Computer%20Science" --max-courses-per-subject 3
- Set a maximum total number of courses:
python main.py --multi --subject-urls "https://ocw.mit.edu/search/?d=Computer%20Science" "https://ocw.mit.edu/search/?d=Mathematics" --max-total-courses 5
- Use a custom search query:
python main.py --multi --query-url "https://ocw.mit.edu/search/?q=python"
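The subject and query URLs in the commands above share a simple pattern: a d parameter for a subject filter and a q parameter for a free-text search, with spaces percent-encoded. As a minimal sketch, assuming that pattern also holds for other subjects, the URLs can be built with the standard library instead of being typed by hand. The helpers subject_url and query_url are hypothetical and are not part of MITCrawlerX.

```python
from urllib.parse import quote, urlencode

# Base search URL taken from the example commands above.
OCW_SEARCH = "https://ocw.mit.edu/search/"


def subject_url(subject: str) -> str:
    """Build a subject-filter URL (hypothetical helper, not part of the tool)."""
    # quote_via=quote encodes spaces as %20, matching the examples above.
    return OCW_SEARCH + "?" + urlencode({"d": subject}, quote_via=quote)


def query_url(query: str) -> str:
    """Build a free-text search URL (hypothetical helper, not part of the tool)."""
    return OCW_SEARCH + "?" + urlencode({"q": query}, quote_via=quote)


print(subject_url("Computer Science"))
print(query_url("python"))
```

The printed strings can be passed to --subject-urls and --query-url respectively.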
- --download-dir: Specify download directory (default: "downloads")
- --single: Enable single course mode
- --course-url: URL of the course to download
- --multi: Enable multiple courses mode
- --subject-urls: List of subject URLs to scrape
- --query-url: Search query URL
- --max-courses-per-subject: Maximum courses per subject
- --max-total-courses: Maximum total courses to download
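To show how these options fit together, here is a hypothetical argparse declaration that mirrors the documented flags and defaults. It is a sketch for illustration only; the actual main.py may declare them differently. Treating --single and --multi as mutually exclusive, and the two limits as integers, are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options."""
    parser = argparse.ArgumentParser(description="Download MIT OCW course materials")
    parser.add_argument("--download-dir", default="downloads",
                        help="Download directory")

    # Assumption: exactly one of the two modes must be chosen.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--single", action="store_true", help="Single course mode")
    mode.add_argument("--multi", action="store_true", help="Multiple courses mode")

    parser.add_argument("--course-url", help="URL of the course to download")
    parser.add_argument("--subject-urls", nargs="+", default=[],
                        help="Subject URLs to scrape")
    parser.add_argument("--query-url", help="Search query URL")
    # Assumption: both limits are integers with no default cap.
    parser.add_argument("--max-courses-per-subject", type=int,
                        help="Maximum courses per subject")
    parser.add_argument("--max-total-courses", type=int,
                        help="Maximum total courses to download")
    return parser


# Parse an explicit argument list so the sketch runs without a command line.
args = build_parser().parse_args([
    "--multi",
    "--subject-urls",
    "https://ocw.mit.edu/search/?d=Computer%20Science",
    "https://ocw.mit.edu/search/?d=Mathematics",
    "--max-total-courses", "5",
])
print(args.multi, len(args.subject_urls), args.max_total_courses, args.download_dir)
```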
The downloaded content is organized as follows:
downloads/
├── Computer Science/
│   ├── Course1.json
│   └── Course2.json
├── Mathematics/
│   ├── Course1.json
│   └── Course2.json
├── scraped_content.json
└── scraping_summary.json
Each course is saved as a JSON file containing:
- Course metadata (name, description, topics)
- Course materials (lectures, assignments, exams)
- Extracted text content from various file formats
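Given the layout above, the per-course files can be collected for downstream use (for example, feeding a RAG index) with a short script. This is a minimal sketch that assumes one folder per subject containing one JSON file per course, as shown in the tree. The function name load_courses_by_subject is hypothetical, and the keys inside each course file are not inspected here.

```python
import json
from pathlib import Path


def load_courses_by_subject(download_dir="downloads"):
    """Return {subject: [course dict, ...]} read from the per-subject folders."""
    courses = {}
    root = Path(download_dir)
    if not root.is_dir():
        return courses  # nothing downloaded yet
    # Only sub-directories are subjects; top-level JSON files are skipped.
    for subject_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        loaded = []
        for course_file in sorted(subject_dir.glob("*.json")):
            with course_file.open(encoding="utf-8") as fh:
                loaded.append(json.load(fh))
        courses[subject_dir.name] = loaded
    return courses


for subject, items in load_courses_by_subject().items():
    print(f"{subject}: {len(items)} course file(s)")
```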
The scraper generates two main JSON files:
- scraping_summary.json: Contains overall statistics and course list
{
  "timestamp": "2025-06-01 12:43:58",
  "total_courses_found": 15,
  "total_courses_processed": 10,
  "total_courses_failed": 0,
  "courses_processed": [...]
}
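As a small illustration of reading the summary, the sketch below turns the counters from the example above into a one-line report. The sample data copies the example values, with the elided course list left empty. The function describe_summary is a hypothetical helper, not part of the tool.

```python
import json

# Sample mirroring the example above; the elided course list is left empty.
sample_summary = json.loads("""
{
  "timestamp": "2025-06-01 12:43:58",
  "total_courses_found": 15,
  "total_courses_processed": 10,
  "total_courses_failed": 0,
  "courses_processed": []
}
""")


def describe_summary(summary: dict) -> str:
    """Format the summary counters as a single line (hypothetical helper)."""
    found = summary.get("total_courses_found", 0)
    processed = summary.get("total_courses_processed", 0)
    failed = summary.get("total_courses_failed", 0)
    timestamp = summary.get("timestamp", "unknown")
    return f"{processed} of {found} courses processed, {failed} failed (run at {timestamp})"


print(describe_summary(sample_summary))
```

For a real run, replace the sample with the parsed contents of scraping_summary.json from your download directory.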
- scraped_content.json: Contains detailed course content
{
  "metadata": {...},
  "courses": [
    {
      "course_name": "Introduction to Computer Science",
      "course_description": "...",
      "topics": ["Computer Science", "Programming"],
      "files": [...]
    }
  ]
}
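The course records can be regrouped once loaded, for example by topic. The sketch below relies only on the course_name and topics keys shown in the example above; the sample data fills the elided metadata and files fields with empty placeholders, and index_by_topic is a hypothetical helper.

```python
from collections import defaultdict

# Sample mirroring the example above; elided fields are empty placeholders.
sample_content = {
    "metadata": {},
    "courses": [
        {
            "course_name": "Introduction to Computer Science",
            "course_description": "...",
            "topics": ["Computer Science", "Programming"],
            "files": [],
        }
    ],
}


def index_by_topic(content: dict) -> dict:
    """Map each topic to the names of the courses tagged with it."""
    index = defaultdict(list)
    for course in content.get("courses", []):
        for topic in course.get("topics", []):
            index[topic].append(course.get("course_name", "(unnamed)"))
    return dict(index)


print(index_by_topic(sample_content))
```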
Contributions are welcome! Please feel free to submit a Pull Request.
This tool is for educational purposes only. Please respect MIT's terms of use and copyright policies when using downloaded materials.