-
-
Notifications
You must be signed in to change notification settings - Fork 1
feat(wistia): add subtitle download support #3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Summary of ChangesHello @kovyrin, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a significant new feature: the ability to download subtitles and captions for Wistia-hosted videos. It integrates subtitle discovery and queuing directly into the download process, ensures proper configuration through a new environment variable, and intelligently manages the download resume cache to accommodate this new content type. The changes enhance the utility of the downloader by providing a more complete content acquisition experience. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a great new feature for downloading Wistia video subtitles. The implementation is thorough, covering configuration, subtitle discovery, and download queuing. I appreciate the addition of cache invalidation to handle courses that were analyzed before this feature was added. My review focuses on improving exception handling to be more specific and suggesting a refactor for a large new function to enhance maintainability by reducing code duplication.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
/gemini review |
|
@gemini-code-assist review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces subtitle download support for Wistia videos, a valuable addition. The implementation includes a new configuration flag, comprehensive logic for discovering subtitle tracks from the Wistia API, and intelligent cache invalidation to ensure subtitles are queued for existing partial downloads. The documentation has also been updated accordingly. My review focuses on improving code clarity, maintainability, and error handling for even greater robustness.
thinkific_downloader/downloader.py
Outdated
| filtered_tasks: List[Dict[str, Any]] = [] | ||
| skipped_count = 0 | ||
| for task in saved_tasks: | ||
| content_type = (task.get('content_type') or 'video').lower() | ||
| if content_type == 'subtitle': | ||
| skipped_count += 1 | ||
| continue | ||
| filtered_tasks.append(task) | ||
| restored_tasks = filtered_tasks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This block of code for filtering out subtitle tasks can be made more concise and Pythonic by using a list comprehension. This improves readability and reduces the lines of code.
restored_tasks = [
task for task in saved_tasks
if (task.get('content_type') or 'video').lower() != 'subtitle'
]
skipped_count = len(saved_tasks) - len(restored_tasks)| language_part = track.get('language') or track.get('label') or '' | ||
| if isinstance(language_part, (list, dict)): | ||
| language_part = '' | ||
| language_part = str(language_part or '') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This block of code to safely get and convert language_part to a string can be simplified. You can achieve the same goal more concisely by checking the type and using a conditional expression, which also handles edge cases like None or non-string types more cleanly.
language_part_raw = track.get('language') or track.get('label')
language_part = str(language_part_raw) if isinstance(language_part_raw, str) else ''There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request adds a great new feature: downloading subtitles for Wistia videos. The implementation is thorough, covering discovery of subtitles from various fields in the Wistia API response, adding a feature flag to control it, and even invalidating the resume cache to ensure subtitles are picked up on existing partial downloads. The documentation and configuration changes are also well-executed.
My review includes a few suggestions to improve maintainability and code structure. I've pointed out a circular dependency between modules and suggested a refactoring to decouple them. I've also recommended simplifying a complex function by breaking it down and making a piece of logic more concise. These changes should make the new code even more robust and easier to work with in the future.
| def queue_wistia_subtitle_downloads(media: Dict[str, Any], dest_dir: Path, video_base_name: str): | ||
| """Queue subtitle download tasks for a Wistia media object.""" | ||
| from .downloader import SETTINGS, add_download_task, init_settings | ||
|
|
||
| if not isinstance(dest_dir, Path): | ||
| dest_dir = Path(dest_dir) | ||
|
|
||
| init_settings() | ||
| settings = SETTINGS | ||
| if settings and hasattr(settings, 'subtitle_download_enabled') and not settings.subtitle_download_enabled: | ||
| return | ||
|
|
||
| tracks = extract_wistia_subtitle_tracks(media) | ||
| if not tracks: | ||
| return | ||
|
|
||
| base_name = Path(video_base_name).stem | ||
| if not base_name: | ||
| fallback_name = media.get('name') or media.get('hashedId') or 'captions' | ||
| base_name = filter_filename(str(fallback_name)) | ||
| else: | ||
| base_name = filter_filename(base_name) | ||
|
|
||
| if not base_name: | ||
| base_name = "captions" | ||
|
|
||
| counter = 1 | ||
| for track in tracks: | ||
| url = track.get('url') | ||
| if not url: | ||
| continue | ||
|
|
||
| ext = (track.get('ext') or _infer_track_extension(url)).lstrip('.').lower() or DEFAULT_SUBTITLE_EXTENSION | ||
| language_part = track.get('language') or track.get('label') or '' | ||
| if isinstance(language_part, (list, dict)): | ||
| language_part = '' | ||
| language_part = str(language_part or '') | ||
| language_part = _LANGUAGE_SANITIZE_PATTERN.sub('-', language_part).strip('-') | ||
|
|
||
| if not language_part: | ||
| language_part = 'captions' if counter == 1 else f"captions-{counter}" | ||
|
|
||
| subtitle_filename = filter_filename(f"{base_name}.{language_part}.{ext}") | ||
| if not subtitle_filename: | ||
| subtitle_filename = filter_filename(f"{base_name}.captions-{counter}.{ext}") | ||
|
|
||
| print(f" [Subs] Queued subtitles: {subtitle_filename}") | ||
| add_download_task(url, dest_dir / subtitle_filename, "subtitle") | ||
| counter += 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function imports from downloader.py, which in turn imports from wistia_downloader.py, creating a circular dependency. While using local imports makes this work at runtime, it's a code smell that makes the modules tightly coupled and harder to maintain and test.
A better approach would be to refactor this function to not have side effects in another module. Instead of calling add_download_task directly, it could return a list of subtitle download tasks. The caller in downloader.py would then be responsible for adding them to the download queue. This is an example of Inversion of Control.
This would involve:
- Renaming
queue_wistia_subtitle_downloadsto something likeget_wistia_subtitle_tasksand making it return a list of task dictionaries. - Passing the
settingsobject to it to checksubtitle_download_enabled. - Updating the call sites in
downloader.pyandwistia_downloader.pyto handle the returned list of tasks.
This change would break the dependency from wistia_downloader.py to downloader.py, improving the overall module architecture.
| def extract_wistia_subtitle_tracks(media: Dict[str, Any]) -> List[Dict[str, Optional[str]]]: | ||
| """Extract subtitle/caption track metadata from Wistia media JSON.""" | ||
| if not isinstance(media, dict): | ||
| return [] | ||
|
|
||
| hashed_id = media.get('hashedId') or media.get('hashed_id') | ||
| tracks: List[Dict[str, Optional[str]]] = [] | ||
|
|
||
| def add_track(url: Optional[str], language: Optional[str], label: Optional[str], ext: Optional[str]): | ||
| normalized = _normalize_wistia_track_url(url) | ||
| if not normalized and hashed_id and language: | ||
| normalized = _build_caption_url(hashed_id, language, ext) | ||
| if not normalized: | ||
| return | ||
| tracks.append({ | ||
| 'url': normalized, | ||
| 'language': language, | ||
| 'label': label, | ||
| 'ext': (ext or '').lstrip('.') or None | ||
| }) | ||
|
|
||
| for track in media.get('captions') or []: | ||
| if isinstance(track, dict): | ||
| add_track( | ||
| track.get('url') or track.get('src'), | ||
| track.get('language') or track.get('lang'), | ||
| track.get('languageName') or track.get('label') or track.get('name'), | ||
| track.get('ext') | ||
| ) | ||
|
|
||
| def iter_track_dicts(collection: Optional[Iterable[Dict[str, Any]]]): | ||
| for item in collection or []: | ||
| if isinstance(item, dict): | ||
| yield item | ||
|
|
||
| def extract_label(track_dict: Dict[str, Any], label_keys: Iterable[str]) -> Optional[str]: | ||
| for key in label_keys: | ||
| value = track_dict.get(key) | ||
| if value: | ||
| return value | ||
| return None | ||
|
|
||
| def iter_track_sources(track_dict: Dict[str, Any]): | ||
| sources = track_dict.get('sources') or [] | ||
| if not sources: | ||
| yield track_dict.get('url') or track_dict.get('src'), track_dict.get('ext') | ||
| return | ||
| for source in sources: | ||
| if isinstance(source, dict): | ||
| yield source.get('url') or source.get('src'), source.get('ext') or track_dict.get('ext') | ||
|
|
||
| def process_track_collection(collection: Optional[Iterable[Dict[str, Any]]], label_keys: Iterable[str]): | ||
| for track in iter_track_dicts(collection): | ||
| language = track.get('language') or track.get('lang') | ||
| label = extract_label(track, label_keys) | ||
| for source_url, source_ext in iter_track_sources(track): | ||
| add_track(source_url, language, label, source_ext) | ||
|
|
||
| process_track_collection(media.get('text_tracks'), ('name', 'label')) | ||
| process_track_collection(media.get('textTracks'), ('name', 'label', 'title')) | ||
|
|
||
| for asset in media.get('assets') or []: | ||
| if isinstance(asset, dict): | ||
| asset_type = (asset.get('type') or '').lower() | ||
| asset_kind = (asset.get('kind') or '').lower() | ||
| if asset_type in ('caption', 'captions', 'subtitle', 'subtitles') or asset_kind in ('caption', 'captions', 'subtitle', 'subtitles'): | ||
| add_track( | ||
| asset.get('url') or asset.get('src'), | ||
| asset.get('language') or asset.get('lang'), | ||
| asset.get('display_name') or asset.get('name'), | ||
| asset.get('ext') | ||
| ) | ||
|
|
||
| available_transcripts = media.get('availableTranscripts') or [] | ||
| if hashed_id and available_transcripts: | ||
| for transcript in available_transcripts: | ||
| if not isinstance(transcript, dict) or not transcript.get('hasCaptions'): | ||
| continue | ||
| language = transcript.get('language') or transcript.get('wistiaLanguageCode') or transcript.get('bcp47LanguageTag') | ||
| if not language: | ||
| continue | ||
| add_track( | ||
| _build_caption_url(hashed_id, language, DEFAULT_SUBTITLE_EXTENSION), | ||
| language, | ||
| transcript.get('name') or transcript.get('familyName') or language, | ||
| DEFAULT_SUBTITLE_EXTENSION | ||
| ) | ||
|
|
||
| unique_tracks: Dict[str, Dict[str, Optional[str]]] = {} | ||
| for track in tracks: | ||
| url = track['url'] | ||
| if not url: | ||
| continue | ||
| if url not in unique_tracks: | ||
| unique_tracks[url] = track | ||
| else: | ||
| existing = unique_tracks[url] | ||
| # Prefer track data that includes language/label/ext | ||
| if not existing.get('language') and track.get('language'): | ||
| existing['language'] = track['language'] | ||
| if not existing.get('label') and track.get('label'): | ||
| existing['label'] = track['label'] | ||
| if not existing.get('ext') and track.get('ext'): | ||
| existing['ext'] = track['ext'] | ||
|
|
||
| return list(unique_tracks.values()) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is quite long and complex, making it difficult to read and maintain. It handles extracting subtitle tracks from several different structures within the Wistia media JSON.
To improve readability and maintainability, consider breaking this function down into several smaller helper functions, each responsible for extracting tracks from a specific part of the media object (e.g., captions, text_tracks, assets). The main function would then orchestrate calls to these helpers.
Example structure:
def _extract_from_captions(media_captions: list, add_track: callable):
# ... logic for 'captions' key
def _extract_from_text_tracks(text_tracks: list, add_track: callable):
# ... logic for 'text_tracks' key
# ... other helpers for 'assets', 'availableTranscripts'
def extract_wistia_subtitle_tracks(media: Dict[str, Any]) -> List[Dict[str, Optional[str]]]:
if not isinstance(media, dict):
return []
hashed_id = media.get('hashedId') or media.get('hashed_id')
tracks: List[Dict[str, Optional[str]]] = []
def add_track(url: Optional[str], language: Optional[str], label: Optional[str], ext: Optional[str]):
# ... (current add_track logic)
tracks.append(...)
_extract_from_captions(media.get('captions') or [], add_track)
_extract_from_text_tracks(media.get('text_tracks') or [], add_track)
# ... call other helpers
# De-duplication logic remains here
unique_tracks: Dict[str, Dict[str, Optional[str]]] = {}
# ...
return list(unique_tracks.values())This refactoring would make the code more modular and easier to test and debug.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
@kovyrin what is missing for this to be merged? |
Verification