
Fix gdrive_sync video processing issues and improve multipart upload reliability #2569


Open · wants to merge 9 commits into base: master

Conversation

@ibrahimjaved12 (Contributor) commented May 15, 2025

What are the relevant tickets?

https://github.com/mitodl/hq/issues/7086

Description (What does it do?)

There were two issues reported for different courses where the gdrive_sync process for videos failed, stalled, or only partially completed. In both cases, some of the videos were also very large (~6 GB+).

These are the tickets:
https://github.com/mitodl/hq/issues/7114
https://github.com/mitodl/hq/issues/6818

The error reported by Celery was drive_file.resource returned None in update_youtube_statuses.
In one course, the WebsiteContent objects were never created; in the other, they were not updated with a youtube_id.
Celery also reported 4 unfinished process_drive_file tasks, spawned by 1 unfinished import_website_files task.

This PR attempts to fix these problems with several changes:

  • Refactoring so that VideoFileStatus is not set to COMPLETE while the WebsiteContent does not yet exist, allowing the task to be retried until it is created
  • Adding retry logic with backoff to process_drive_file
  • Using TransferConfig with a custom configuration (64 MB chunks) for more reliable multipart uploads of large videos in the stream_to_s3 function (see the sketch after this list)
  • Making the overall video process more robust, e.g., hitting the YouTube API only when needed
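
As a rough sketch, the TransferConfig change could look like the following. Only the 64 MB chunk size comes from this PR; the function signature, threshold, and concurrency values here are illustrative assumptions, not the repo's actual code:

    import boto3
    from boto3.s3.transfer import TransferConfig

    MB = 1024 * 1024

    # 64 MB parts as described above; threshold/concurrency are assumptions.
    S3_TRANSFER_CONFIG = TransferConfig(
        multipart_threshold=64 * MB,
        multipart_chunksize=64 * MB,
        max_concurrency=4,
        use_threads=True,
    )

    def stream_to_s3(fileobj, bucket: str, key: str) -> None:
        """Upload a large file-like object to S3 using multipart transfer settings."""
        boto3.client("s3").upload_fileobj(fileobj, bucket, key, Config=S3_TRANSFER_CONFIG)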

How can this be tested?

Make sure the video process works as expected. Try uploading large videos (1 GB+) to Google Drive and syncing. Make sure all the relevant objects (WebsiteContent, Video, DriveFile) are created.
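
For a quick check that the objects were created, something like the following in a Django shell may help (the import paths are assumptions about this repo's layout; adjust as needed):

    # python manage.py shell -- import paths below are assumptions.
    from gdrive_sync.models import DriveFile
    from videos.models import Video
    from websites.models import WebsiteContent

    print("DriveFiles:", DriveFile.objects.count())
    print("Videos:", Video.objects.count())
    print("WebsiteContent:", WebsiteContent.objects.count())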

@gemini-code-assist (bot) left a comment

Hello @ibrahimjaved12, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello team, gemini-code-assist here to provide a summary of this pull request. This PR aims to address issues related to processing large video uploads, specifically those originating from Google Drive and being uploaded to YouTube. The core changes involve making the background task responsible for processing Drive files more resilient by adding retry logic and refining the task that updates YouTube video statuses to be more efficient and correctly handle video completion.

Highlights

  • Retry Logic for Drive File Processing: The process_drive_file Celery task in gdrive_sync/tasks.py has been updated to include retry logic. This means if an error occurs during the processing of a Google Drive file (which includes initiating video transcoding), the task will automatically retry up to 3 times with a 60-second backoff, making the process more robust against transient issues (see the sketch after this list).
  • Improved YouTube Status Updates: The update_youtube_statuses Celery task in videos/tasks.py has been refined. It now avoids fetching the YouTube status for videos that are already marked as PROCESSED. Additionally, the logic for marking a video as COMPLETE, triggering the transcript job, and sending the success email is now more tightly coupled to the confirmation that the video's YouTube status is indeed PROCESSED.
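
For readers unfamiliar with the pattern, the decorator and retry call described above combine roughly like this. This is a simplified, self-contained sketch, not the repo's exact code; do_processing is a placeholder for the real work:

    from celery import Celery

    app = Celery("example")

    def do_processing(drive_file_id: str) -> None:
        """Placeholder for the real Drive-file processing (transcode kickoff, etc.)."""

    @app.task(bind=True, acks_late=True, retry_backoff=60, max_retries=3)
    def process_drive_file(self, drive_file_id: str):
        try:
            do_processing(drive_file_id)
        except Exception as exc:  # retried with backoff, then given up on
            try:
                raise self.retry(exc=exc) from exc
            except self.MaxRetriesExceededError:
                return None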

Changelog

Click here to see the changelog
  • gdrive_sync/tasks.py
    • Added bind=True, acks_late=True, retry_backoff=60, and max_retries=3 to the @app.task decorator for process_drive_file (L37).
    • Updated process_drive_file signature to accept self (L38).
    • Changed bare except to except Exception as exc (L52).
    • Implemented retry logic within the exception handler for process_drive_file (L54-L57).
  • videos/constants.py
    • Minor docstring update for VideoStatus class (L19).
  • videos/tasks.py
    • Removed # noqa: C901 from update_youtube_statuses function signature (L152).
    • Added a check if video_file.destination_status != YouTubeStatus.PROCESSED before fetching the YouTube status to avoid redundant API calls (L168).
    • Adjusted the logic flow to ensure video_file.status is set to COMPLETE, the transcript job is added, and the success email is sent only when the YouTube status is confirmed as PROCESSED (L174-L178, L191-L193).
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature | Command | Description
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help | /gemini help | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


Task runs in the queue,
Sometimes fails, what to do?
Retry, wait, then try,
Until the work is done, oh my!

Footnotes

  [1] Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (bot) left a comment

Code Review

The pull request addresses issues with large video uploads, which is great. The changes include adding retry logic to the process_drive_file task and updating YouTube video statuses. Overall, the changes seem reasonable, but there are a few areas that could be improved.

Summary of Findings

  • Exception Handling: The pull request catches broad exceptions in gdrive_sync/tasks.py, which is generally discouraged. It's better to catch specific exceptions or use celery.exceptions.Retry.
  • Task Acknowledgement: The pull request uses acks_late=True for the process_drive_file task, which could lead to duplicate executions in certain failure scenarios. Consider the implications of this choice.
  • Video Status Update: The pull request updates YouTube video statuses, but it might be missing an update to video_file.status after updating destination_status.

Merge Readiness

The pull request addresses an important issue and includes some good improvements. However, the exception handling and task acknowledgement aspects should be carefully reviewed before merging. Additionally, the video status update logic could be improved. I am unable to directly approve this pull request, and recommend that these comments be addressed before merging. It is also recommended that others review and approve this code before merging.

Comment on lines +52 to +57
        except Exception as exc:  # pylint: disable=broad-exception-caught
            log.exception("Error processing DriveFile %s", drive_file_id)
            try:
                raise self.retry(exc=exc) from exc
            except self.MaxRetriesExceededError:
                return None


Severity: high

Catching Exception is generally discouraged because it can mask unexpected errors. It's better to catch specific exceptions that you expect and can handle. In this case, what exceptions are you anticipating? If you want to retry on all exceptions, consider using celery.exceptions.Retry directly. Also, consider adding a delay before retrying, in case the underlying issue is transient.
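
If narrowing the handler is desired, one way to follow this suggestion is Celery's autoretry_for, which retries only the exception types expected to be transient. A sketch; the exception tuple here is an assumption, not from this PR:

    from requests.exceptions import ConnectionError as RequestsConnectionError

    @app.task(
        bind=True,
        acks_late=True,
        autoretry_for=(TimeoutError, RequestsConnectionError),  # assumed transient errors
        retry_backoff=60,
        max_retries=3,
    )
    def process_drive_file(self, drive_file_id: str):
        # Unexpected exception types now propagate and fail fast instead of retrying.
        do_processing(drive_file_id)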

@@ -34,8 +34,8 @@
 log = logging.getLogger(__name__)


-@app.task()
-def process_drive_file(drive_file_id: str):
+@app.task(bind=True, acks_late=True, retry_backoff=60, max_retries=3)
+def process_drive_file(self, drive_file_id: str):


Severity: medium

Adding acks_late=True is good for ensuring the task is executed even if the worker crashes, but it's important to understand the implications. Since the task is acknowledged after it's completed, there's a possibility of duplicate executions if the worker crashes after processing but before sending the acknowledgement. Is this acceptable for this task?
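
A common mitigation is to make the task idempotent so a duplicate delivery is harmless. A sketch, where the status field and the DriveFileStatus.COMPLETE value are assumptions about the models, not confirmed from this repo:

    @app.task(bind=True, acks_late=True, retry_backoff=60, max_retries=3)
    def process_drive_file(self, drive_file_id: str):
        drive_file = DriveFile.objects.get(id=drive_file_id)
        # Idempotency guard: if a previous delivery already finished the work,
        # a duplicate run exits without side effects. (Field/value names assumed.)
        if drive_file.status == DriveFileStatus.COMPLETE:
            return None
        do_processing(drive_file_id)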

Comment on lines +168 to +172
        if video_file.destination_status != YouTubeStatus.PROCESSED:
            video_file.destination_status = youtube.video_status(
                video_file.destination_id
            )
            video_file.save()


Severity: medium

It's good to check video_file.destination_status before calling youtube.video_status to avoid unnecessary API calls. However, after updating destination_status, you should also update the video_file.status to VideoFileStatus.COMPLETE if destination_status is YouTubeStatus.PROCESSED.
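
Concretely, the reviewer's suggestion would look roughly like this (a sketch extending the excerpt above; the surrounding code is not shown in this PR excerpt):

        if video_file.destination_status != YouTubeStatus.PROCESSED:
            video_file.destination_status = youtube.video_status(
                video_file.destination_id
            )
            video_file.save()

        if video_file.destination_status == YouTubeStatus.PROCESSED:
            video_file.status = VideoFileStatus.COMPLETE
            video_file.save()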

@pdpinch (Member) commented May 16, 2025

@ibrahimjaved12 please write a more descriptive PR title and commit message. What are you fixing?

@ibrahimjaved12 changed the title from "Fix video upload issues" to "Fix gdrive_sync video processing issues and improve multipart upload reliability" on May 16, 2025
@ibrahimjaved12 (Contributor, Author) commented

@pdpinch Updated the description and PR title

Comment on lines 64 to 66
logging.basicConfig(level=logging.INFO)
logging.getLogger("boto3").setLevel(logging.DEBUG)
logging.getLogger("botocore").setLevel(logging.DEBUG)
@ibrahimjaved12 (Contributor, Author) commented

This is just for debugging; will be removed soon

@mbertrand (Member) commented

LGTM so far (did not run locally, though). One potential way to replicate part of the problem on the main branch might be to add a long delay in the gdrive_sync.api.stream_to_s3 function:

import time

@retry_on_failure
def stream_to_s3(drive_file: DriveFile):
    """Stream a Google Drive file to S3"""
    time.sleep(300)  # 5 minute delay
    try:
        ...  # etc.

@umar8hassan (Contributor) left a comment
I have tested the upload with two ~1.7 GB files simultaneously. Both files uploaded, with timeout errors in a few parts followed by a retry.
However, one nitpick: a video with failed parts, although uploaded completely on the next try, still has a Celery task running that tries to upload it again. This status is misleading and may result in a duplicate transcoding job or other unexpected behavior.
[Five screenshots attached, captured 2025-05-29 between 5:28 PM and 5:41 PM]
