
Fix gdrive_sync video processing issues and improve multipart upload reliability #2569


Open · wants to merge 9 commits into base: master

Conversation

@ibrahimjaved12 (Contributor) commented May 15, 2025

What are the relevant tickets?

https://github.com/mitodl/hq/issues/7086

Description (What does it do?)

There were two issues reported for different courses where the gdrive_sync process for videos failed, stalled, or only partially completed. In both cases, some of the videos were also very large (~6 GB+).

These are the tickets:
https://github.com/mitodl/hq/issues/7114
https://github.com/mitodl/hq/issues/6818

The error reported by Celery was drive_file.resource returned None in update_youtube_statuses.
In one course, the WebsiteContent objects were never created; in the other, they were not updated with a youtube_id.
Celery also reported 4 unfinished process_drive_file tasks, spawned by 1 unfinished import_website_files task.

This PR attempts to fix these problems with several changes:

  • Refactoring so that VideoFileStatus is not set to COMPLETE while the WebsiteContent does not yet exist, allowing the task to be retried until it is created
  • Adding retry logic with backoff to process_drive_file
  • Using TransferConfig with a custom configuration (64 MB chunks) for more reliable multipart uploads of large videos in the stream_to_s3 function (see the sketch after this list)
  • Making the overall video process more robust, e.g., hitting the YouTube API only when needed
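
As a rough sketch, the TransferConfig change could look like the following. Only the 64 MB chunk size comes from this PR; the function signature, threshold, and concurrency values here are illustrative assumptions, not the repo's actual code:

    import boto3
    from boto3.s3.transfer import TransferConfig

    MB = 1024 * 1024

    # 64 MB parts as described above; threshold/concurrency are assumptions.
    S3_TRANSFER_CONFIG = TransferConfig(
        multipart_threshold=64 * MB,
        multipart_chunksize=64 * MB,
        max_concurrency=4,
        use_threads=True,
    )

    def stream_to_s3(fileobj, bucket: str, key: str) -> None:
        """Upload a large file-like object to S3 using multipart transfer settings."""
        boto3.client("s3").upload_fileobj(fileobj, bucket, key, Config=S3_TRANSFER_CONFIG)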

How can this be tested?

Make sure the video process works as expected. Try uploading large videos (1 GB+) to Google Drive and syncing. Make sure all the relevant objects (WebsiteContent, Video, DriveFile) are created.
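
For a quick check that the objects were created, something like the following in a Django shell may help (the import paths are assumptions about this repo's layout; adjust as needed):

    # python manage.py shell -- import paths below are assumptions.
    from gdrive_sync.models import DriveFile
    from videos.models import Video
    from websites.models import WebsiteContent

    print("DriveFiles:", DriveFile.objects.count())
    print("Videos:", Video.objects.count())
    print("WebsiteContent:", WebsiteContent.objects.count())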

@gemini-code-assist (bot) left a comment

Hello @ibrahimjaved12, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello team, gemini-code-assist here to provide a summary of this pull request. This PR aims to address issues related to processing large video uploads, specifically those originating from Google Drive and being uploaded to YouTube. The core changes involve making the background task responsible for processing Drive files more resilient by adding retry logic and refining the task that updates YouTube video statuses to be more efficient and correctly handle video completion.

Highlights

  • Retry Logic for Drive File Processing: The process_drive_file Celery task in gdrive_sync/tasks.py has been updated to include retry logic. This means if an error occurs during the processing of a Google Drive file (which includes initiating video transcoding), the task will automatically retry up to 3 times with a 60-second backoff, making the process more robust against transient issues (see the sketch after this list).
  • Improved YouTube Status Updates: The update_youtube_statuses Celery task in videos/tasks.py has been refined. It now avoids fetching the YouTube status for videos that are already marked as PROCESSED. Additionally, the logic for marking a video as COMPLETE, triggering the transcript job, and sending the success email is now more tightly coupled to the confirmation that the video's YouTube status is indeed PROCESSED.
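
For readers unfamiliar with the pattern, the decorator and retry call described above combine roughly like this. This is a simplified, self-contained sketch, not the repo's exact code; do_processing is a placeholder for the real work:

    from celery import Celery

    app = Celery("example")

    def do_processing(drive_file_id: str) -> None:
        """Placeholder for the real Drive-file processing (transcode kickoff, etc.)."""

    @app.task(bind=True, acks_late=True, retry_backoff=60, max_retries=3)
    def process_drive_file(self, drive_file_id: str):
        try:
            do_processing(drive_file_id)
        except Exception as exc:  # retried with backoff, then given up on
            try:
                raise self.retry(exc=exc) from exc
            except self.MaxRetriesExceededError:
                return None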

Changelog

Click here to see the changelog
  • gdrive_sync/tasks.py
    • Added bind=True, acks_late=True, retry_backoff=60, and max_retries=3 to the @app.task decorator for process_drive_file (L37).
    • Updated process_drive_file signature to accept self (L38).
    • Changed bare except to except Exception as exc (L52).
    • Implemented retry logic within the exception handler for process_drive_file (L54-L57).
  • videos/constants.py
    • Minor docstring update for VideoStatus class (L19).
  • videos/tasks.py
    • Removed # noqa: C901 from update_youtube_statuses function signature (L152).
    • Added a check if video_file.destination_status != YouTubeStatus.PROCESSED before fetching the YouTube status to avoid redundant API calls (L168).
    • Adjusted the logic flow to ensure video_file.status is set to COMPLETE, the transcript job is added, and the success email is sent only when the YouTube status is confirmed as PROCESSED (L174-L178, L191-L193).
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature | Command | Description
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help | /gemini help | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


Task runs in the queue,
Sometimes fails, what to do?
Retry, wait, then try,
Until the work is done, oh my!

Footnotes

  [1] Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (bot) left a comment

Code Review

The pull request addresses issues with large video uploads, which is great. The changes include adding retry logic to the process_drive_file task and updating YouTube video statuses. Overall, the changes seem reasonable, but there are a few areas that could be improved.

Summary of Findings

  • Exception Handling: The pull request catches broad exceptions in gdrive_sync/tasks.py, which is generally discouraged. It's better to catch specific exceptions or use celery.exceptions.Retry.
  • Task Acknowledgement: The pull request uses acks_late=True for the process_drive_file task, which could lead to duplicate executions in certain failure scenarios. Consider the implications of this choice.
  • Video Status Update: The pull request updates YouTube video statuses, but it might be missing an update to video_file.status after updating destination_status.

Merge Readiness

The pull request addresses an important issue and includes some good improvements. However, the exception handling and task acknowledgement aspects should be carefully reviewed before merging. Additionally, the video status update logic could be improved. I am unable to directly approve this pull request, and recommend that these comments be addressed before merging. It is also recommended that others review and approve this code before merging.

Comment on lines +52 to +57
        except Exception as exc:  # pylint: disable=broad-exception-caught
            log.exception("Error processing DriveFile %s", drive_file_id)
            try:
                raise self.retry(exc=exc) from exc
            except self.MaxRetriesExceededError:
                return None


Severity: high

Catching Exception is generally discouraged because it can mask unexpected errors. It's better to catch specific exceptions that you expect and can handle. In this case, what exceptions are you anticipating? If you want to retry on all exceptions, consider using celery.exceptions.Retry directly. Also, consider adding a delay before retrying, in case the underlying issue is transient.
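
If narrowing the handler is desired, one way to follow this suggestion is Celery's autoretry_for, which retries only the exception types expected to be transient. A sketch; the exception tuple here is an assumption, not from this PR:

    from requests.exceptions import ConnectionError as RequestsConnectionError

    @app.task(
        bind=True,
        acks_late=True,
        autoretry_for=(TimeoutError, RequestsConnectionError),  # assumed transient errors
        retry_backoff=60,
        max_retries=3,
    )
    def process_drive_file(self, drive_file_id: str):
        # Unexpected exception types now propagate and fail fast instead of retrying.
        do_processing(drive_file_id)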

@@ -34,8 +34,8 @@
 log = logging.getLogger(__name__)


-@app.task()
-def process_drive_file(drive_file_id: str):
+@app.task(bind=True, acks_late=True, retry_backoff=60, max_retries=3)
+def process_drive_file(self, drive_file_id: str):


Severity: medium

Adding acks_late=True is good for ensuring the task is executed even if the worker crashes, but it's important to understand the implications. Since the task is acknowledged after it's completed, there's a possibility of duplicate executions if the worker crashes after processing but before sending the acknowledgement. Is this acceptable for this task?
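
A common mitigation is to make the task idempotent so a duplicate delivery is harmless. A sketch, where the status field and the DriveFileStatus.COMPLETE value are assumptions about the models, not confirmed from this repo:

    @app.task(bind=True, acks_late=True, retry_backoff=60, max_retries=3)
    def process_drive_file(self, drive_file_id: str):
        drive_file = DriveFile.objects.get(id=drive_file_id)
        # Idempotency guard: if a previous delivery already finished the work,
        # a duplicate run exits without side effects. (Field/value names assumed.)
        if drive_file.status == DriveFileStatus.COMPLETE:
            return None
        do_processing(drive_file_id)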

Comment on lines +168 to +172
        if video_file.destination_status != YouTubeStatus.PROCESSED:
            video_file.destination_status = youtube.video_status(
                video_file.destination_id
            )
            video_file.save()


Severity: medium

It's good to check video_file.destination_status before calling youtube.video_status to avoid unnecessary API calls. However, after updating destination_status, you should also update the video_file.status to VideoFileStatus.COMPLETE if destination_status is YouTubeStatus.PROCESSED.
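
Concretely, the reviewer's suggestion would look roughly like this (a sketch extending the excerpt above; the surrounding code is not shown in this PR excerpt):

        if video_file.destination_status != YouTubeStatus.PROCESSED:
            video_file.destination_status = youtube.video_status(
                video_file.destination_id
            )
            video_file.save()

        if video_file.destination_status == YouTubeStatus.PROCESSED:
            video_file.status = VideoFileStatus.COMPLETE
            video_file.save()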

@pdpinch (Member) commented May 16, 2025

@ibrahimjaved12 please write a more descriptive PR title and commit message. What are you fixing?

@ibrahimjaved12 changed the title from "Fix video upload issues" to "Fix gdrive_sync video processing issues and improve multipart upload reliability" on May 16, 2025
@ibrahimjaved12 (Contributor, Author) commented

@pdpinch Updated the description and PR title

Comment on lines 64 to 66
logging.basicConfig(level=logging.INFO)
logging.getLogger("boto3").setLevel(logging.DEBUG)
logging.getLogger("botocore").setLevel(logging.DEBUG)
@ibrahimjaved12 (Contributor, Author) commented

This is just for debugging; will be removed soon

@mbertrand (Member) commented

LGTM so far (did not run locally, though). One potential way to replicate part of the problem on the main branch might be to add a long delay in the gdrive_sync.api.stream_to_s3 function:

import time

@retry_on_failure
def stream_to_s3(drive_file: DriveFile):
    """Stream a Google Drive file to S3"""
    time.sleep(300)  # 5 minute delay
    try:
        ...  # etc.

@umar8hassan (Contributor) left a comment
I have tested the upload with two ~1.7 GB files simultaneously. Both files uploaded, with timeout errors in a few parts followed by a retry.
However, one nitpick: a video with failed parts, although uploaded completely on the next try, still has a Celery task running that tries to upload it again. This status is misleading and may result in a duplicate transcoding job or other unexpected behavior.
[Five screenshots attached, captured 2025-05-29 between 5:28 PM and 5:41 PM]
