A subtitle grabber for the TsinghuaX MOOC platform.
The Chinese version of this README follows the English version.
Run main.py to launch the command-line interface.
Windows users can also download the .exe file from the release page.
In the current version, simulated login is not available, so you log in by supplying a browser cookie instead.
To get the cookie:
- Manually log in at http://tsinghua.xuetangx.com in your browser.
- Open the developer tools and switch to the console page.
- Enter the command document.cookie.
The browser console will then return the cookie.
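If you want to reuse the copied cookie string in your own script, it may help to normalise it first. The following is a minimal sketch, not the tool's own code: the function names parse_cookie and cookie_header are hypothetical, and the cookie names in the sample are made up.

```python
def parse_cookie(raw: str) -> dict:
    """Split a document.cookie string into a name -> value mapping."""
    # Browser consoles usually print the string wrapped in quotes; drop them.
    raw = raw.strip().strip("'\"")
    cookies = {}
    for pair in raw.split(";"):
        name, sep, value = pair.strip().partition("=")
        if sep and name:  # skip empty or malformed fragments
            cookies[name] = value
    return cookies


def cookie_header(raw: str) -> dict:
    """Build a request header dict that carries the cleaned cookie."""
    cleaned = "; ".join(f"{k}={v}" for k, v in parse_cookie(raw).items())
    return {"Cookie": cleaned}


if __name__ == "__main__":
    sample = '"sessionid=abc123; csrftoken=xyz"'  # made-up values
    print(cookie_header(sample))
```

The resulting dict can be passed as request headers to whichever HTTP client you use; how the tool itself sends the cookie is not shown here.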
The commands currently available are listed below:
gt
- Get the term list.
gc ${term_id}
- Get the course list of the specified term.
gl ${course_id}
- Get the lesson list of the specified course.
get ${param_fragment...} or get .
- Get the subtitles described by the param_fragment list, or get all the subtitles of the specified course.
cookie ${data}
- Reset the cookie data.
s ${target}
- Search for the target string in the subtitles.
search
- Enter or exit the "search" mode.
cd ${directory}
- [EXPERIMENTAL] Change the searching directory.
Detailed descriptions can be obtained at runtime by entering help ${command_name}.
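The Chinese section of this README shows get 0, 2-4, 6 fetching the subtitles of lessons 0, 2, 3, 4 and 6. A minimal sketch of how such a fragment list could be expanded is given below. The function name parse_fragments is hypothetical, the real parser may differ, and the special get . form is not handled here.

```python
def parse_fragments(text: str) -> list:
    """Expand a string such as "0, 2-4, 6" into a list of lesson indices."""
    indices = []
    for fragment in text.split(","):
        fragment = fragment.strip()
        if not fragment:
            continue  # tolerate stray commas
        if "-" in fragment:
            start, end = (int(part) for part in fragment.split("-", 1))
            indices.extend(range(start, end + 1))  # ranges are inclusive
        else:
            indices.append(int(fragment))
    return indices


if __name__ == "__main__":
    print(parse_fragments("0, 2-4, 6"))
```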
The easiest way to use the core library is to import the helper package into your project.
I may upload the package to PyPI later.
Currently, it only contains version data.
The module related to grabbing data from TsinghuaX.
class User:
    # Context
    terms    # Current term list
    term     # Current term
    courses  # Current course list
    course   # Current course
    lessons  # Current lesson list
    # Login-related
    cookie   # Defined in __init__
    # Public methods
    get_terms(self) -> list
    get_courses(self, term_id: int) -> list
    get_lessons(self, course_id: int) -> list
    get_subtitle(self, r: list, on_beg=None, on_end=None, on_err=None)
    # on_beg, on_end and on_err serve as callback functions
    # Private method
    __connect(self, url: str) -> str
Documentation is available in the source file.
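The callback parameters of get_subtitle can be illustrated with a stand-in class. This is a sketch under stated assumptions: FakeUser is hypothetical and performs no network access, and the assumption that each callback receives the lesson index as its only argument is mine; check the source file for the real callback signatures.

```python
class FakeUser:
    """Stand-in that mirrors the get_subtitle signature (no network access)."""

    def __init__(self, lessons):
        # Maps a lesson index to its subtitle text, or None to simulate a failure.
        self.lessons = lessons

    def get_subtitle(self, r, on_beg=None, on_end=None, on_err=None):
        for index in r:
            if on_beg:
                on_beg(index)  # assumed: called before each download
            if self.lessons.get(index) is None:
                if on_err:
                    on_err(index)  # assumed: called when a download fails
                continue
            if on_end:
                on_end(index)  # assumed: called after a successful download


events = []
user = FakeUser({0: "hello", 1: None})
user.get_subtitle(
    [0, 1],
    on_beg=lambda i: events.append(("begin", i)),
    on_end=lambda i: events.append(("done", i)),
    on_err=lambda i: events.append(("failed", i)),
)
print(events)
```

Passing callbacks like this keeps progress reporting out of the download loop, so a command-line front end and a script can each report progress in their own way.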
The module related to the interactive user interface.
Each Interface instance wraps a logged-in user and a local searcher.
You do not need to import this module in your own project.
class Interface:
    user  # Defined in __init__
    exec(self, command: str)
    gt(self)
    gc(self, term_id: int)
    gl(self, course_id: int)
    get(self, r: list = range(0, 1000))
    cookie(self, cookie: str)
    s(self, s: str)
    cd(self, directory: str)
    @staticmethod
    h(command: str)
instruct() -> Interface
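One plausible way for exec to route a command line to the matching method is a lookup table, sketched below. MiniInterface is a hypothetical stand-in that only records calls; the boolean return value and the conversion of arguments to int are assumptions, not the behaviour of the real Interface class.

```python
class MiniInterface:
    """Hypothetical stand-in that only records which handler was called."""

    def __init__(self):
        self.calls = []
        # Command name -> (handler, converter for its single argument or None)
        self.commands = {
            "gt": (self.gt, None),
            "gc": (self.gc, int),
            "gl": (self.gl, int),
        }

    def gt(self):
        self.calls.append(("gt",))

    def gc(self, term_id: int):
        self.calls.append(("gc", term_id))

    def gl(self, course_id: int):
        self.calls.append(("gl", course_id))

    def exec(self, command: str) -> bool:
        """Run one command line; return False if it cannot be dispatched."""
        parts = command.split(maxsplit=1)
        if not parts or parts[0] not in self.commands:
            return False
        handler, convert = self.commands[parts[0]]
        if convert is None:
            handler()
            return True
        if len(parts) < 2:
            return False  # the command needs an argument
        try:
            handler(convert(parts[1]))
        except ValueError:
            return False  # the argument could not be converted
        return True


ui = MiniInterface()
print(ui.exec("gc 1156"), ui.calls)
```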
The module related to I/O, serving as an internal utility module.
There are currently two functions:
save(path: str, walk_id: int, name: str, data: str)
- Creates a file at path with the filename "%03d. %s.txt" % (walk_id, name) and writes data into it.
search(path: str, s: str, on_success, on_error)
- Searches for s in the given path (sub-directories included).
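The two functions described above could look roughly like the following sketch. Several details are assumptions rather than facts about the project: creating missing directories, returning the file path from save, UTF-8 encoding, and the arguments passed to on_success (file path, line number, line) and on_error (file path, exception).

```python
import os


def save(path: str, walk_id: int, name: str, data: str) -> str:
    """Write data to "<path>/NNN. <name>.txt" and return the file path."""
    os.makedirs(path, exist_ok=True)  # assumption: create the directory if needed
    file_path = os.path.join(path, "%03d. %s.txt" % (walk_id, name))
    with open(file_path, "w", encoding="utf-8") as f:
        f.write(data)
    return file_path


def search(path: str, s: str, on_success, on_error):
    """Report every line containing s under path, sub-directories included."""
    for root, _dirs, files in os.walk(path):
        for filename in sorted(files):
            file_path = os.path.join(root, filename)
            try:
                with open(file_path, encoding="utf-8") as f:
                    for line_no, line in enumerate(f, start=1):
                        if s in line:
                            on_success(file_path, line_no, line.rstrip("\n"))
            except (OSError, UnicodeDecodeError) as exc:
                on_error(file_path, exc)  # unreadable file: report and move on
```

A caller might pass print-like functions as the two callbacks, for example search(".", "keyword", lambda p, n, l: print(p, n, l), lambda p, e: print("error:", p, e)).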
The module providing methods related to local search.
class Searcher:
    path  # Current directory
    search(self, s: str, on_success, on_error)
    cd(self, directory: str) -> bool
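The boolean result of cd suggests that it reports whether the directory change succeeded. The sketch below shows one way that could work. MiniSearcher is hypothetical, it covers only the path attribute and cd, and resolving relative paths against the current directory is an assumption.

```python
import os


class MiniSearcher:
    """Hypothetical stand-in covering only the path attribute and cd."""

    def __init__(self, path: str = "."):
        self.path = os.path.abspath(path)  # current search directory

    def cd(self, directory: str) -> bool:
        """Switch to directory (relative to the current path) if it exists."""
        target = os.path.abspath(os.path.join(self.path, directory))
        if not os.path.isdir(target):
            return False  # keep the old path on failure
        self.path = target
        return True
```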
- beautifulsoup4
- v1.0.1
- Add multi-platform support.
- v1.0.0
- Add local search.
- v0.1.0
- First release.
- Support the core function of grabbing the subtitles, along with term-list and course-list query.
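This is Get-TsinghuaX, a MOOC subtitle grabbing assistant.
In this guide, I describe how to use the interactive command-line interface. For how to use the core library in your own project, please refer to the English README.
Currently, only Windows is supported.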
Run main.py to launch the interactive user interface.
You may need to install the third-party library beautifulsoup4 first.
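Since I have not yet worked out simulated login, for now you have to make do with setting the cookie manually.
The steps to get the cookie are as follows:
- Go to http://tsinghua.xuetangx.com and log in to the MOOC site;
- Open the developer tools and go to the Console page;
- Enter the command document.cookie to get the cookie value (without the leading and trailing quotes).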
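The following are all the commands currently available:
gt: view the term list
gc: view the course list of the specified term
gl: view the video list of the specified course
get: batch-download the subtitles of the specified videos
cookie: change the cookie
s: search for the specified string in the subtitle files
search: enter/exit search mode
You can enter help followed by a command name at runtime to see the detailed usage of that command.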
>>> gt
1641 2020春
1156 2019秋
796 2019春
371 2018秋
153 2018春
24 2017秋
>>> gc 1156
8716 思想道德修养与法律基础 (2019秋)
>>> gl 8716
绪论
0 开篇的话
1 0.1 认识大学生活特点,提高独立生活能力
2 0.2 树立新的学习理念,养成优良的学风
3 0.3 确立成才目标,塑造新的形象
第一章 人生的青春之问
4 第一节 树立正确的人生观
...
>>> get 0, 2-4, 6
# Download the subtitles of videos 0, 2, 3, 4 and 6
>>> get .
# Download the subtitles of all videos
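>>> search
已进入搜索模式。再次输入命令search即可退出搜索模式。
# Entered search mode. Enter the search command again to exit search mode.
>>> 思想道德
正在当前目录下进行搜索...
# Searching in the current directory...
...
>>> search
已退出搜索模式。
# Exited search mode.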