
supprt openai batch api #423

Merged
merged 11 commits into yihong0618:main on Aug 20, 2024
Conversation

@mkXultra (Collaborator) commented on Aug 16, 2024

Improving EPUB Translation Efficiency with ChatGPT Batch API Implementation

This PR implements functionality that significantly improves the efficiency of the EPUB translation process by utilizing ChatGPT's batch API. The main changes are as follows:

Major Feature Additions

  1. Implementation of Batch Translation Feature

    • --batch option: Batch process translations using ChatGPT's batch API
    • --batch-use option: Create files using pre-generated batch translation results
  2. Batch Processing Workflow

    • Creation and management of batch requests
    • Execution of batch jobs and status checking
    • Retrieval and processing of batch results

Batch Processing Mechanism

Pattern for Creating Batches (--batch option)

  1. Initialization

    • Initial setup for batch processing with the batch_init method
    • Sanitization of book names
  2. Creation of Translation Queue

    • Add texts to be translated to the queue using the add_to_batch_translate_queue method
  3. Generation of Batch Files

    • Creation of batch requests
      • Create translation requests
      • (OPTIONAL) Create context
        • Control update frequency using BATCH_CONTEXT_UPDATE_INTERVAL
        • Efficient context updates with the create_batch_context_messages method
    • Convert batch requests to JSONL files using the create_batch_files method (one request object per line; see the sketch after this list)
    • Limit of 40,000 lines per file
  4. Batch Execution

    • Control the entire batch process with the batch method
    • Upload files, execute batch jobs, save metadata
  5. Saving Batch Information

    • Save batch processing results and metadata as JSON files
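
For concreteness, here is a minimal sketch of how one JSONL request line could be produced. It follows the documented OpenAI Batch API request format; the helper name make_request_line and the example values are illustrative, not the PR's actual code.

import json
import os

def make_request_line(custom_id, model, messages):
    # One Batch API request object per JSONL line; custom_id lets the
    # result be matched back to its source paragraph later.
    return json.dumps(
        {
            "custom_id": custom_id,
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": model, "messages": messages},
        },
        ensure_ascii=False,
    )

os.makedirs("batch_files/example_book", exist_ok=True)
with open("batch_files/example_book/1.jsonl", "a", encoding="utf-8") as f:
    f.write(
        make_request_line(
            "paragraph-0",
            "gpt-4o-mini",
            [{"role": "user", "content": "Translate into Japanese: ..."}],
        )
        + "\n"
    )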

Pattern for Using Batches (--batch-use option)

  1. Checking Batch Results

    • Confirm completion of batch processing with the is_completed_batch method
  2. Retrieval and Processing of Results

    • Retrieve translation results with the batch_translate method (sketched below)
    • Manage results efficiently using a cache
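
A rough sketch of what the completion check and result retrieval amount to, using the openai Python SDK. The function names here are illustrative stand-ins for the PR's is_completed_batch and batch_translate methods.

import json
from openai import OpenAI

client = OpenAI()

def is_completed(batch_id):
    # The Batch API reports a lifecycle status; only "completed"
    # batches have an output file to download.
    return client.batches.retrieve(batch_id).status == "completed"

def fetch_results(batch_id):
    batch = client.batches.retrieve(batch_id)
    raw = client.files.content(batch.output_file_id).text
    results = {}
    for line in raw.splitlines():
        obj = json.loads(line)
        # custom_id maps each result back to its source paragraph
        results[obj["custom_id"]] = (
            obj["response"]["body"]["choices"][0]["message"]["content"]
        )
    return results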

Context Generation for Batch Processing

Batch processing implements a different context management method compared to normal sequential processing:

  1. Setting Context Update Interval

    • Update context every BATCH_CONTEXT_UPDATE_INTERVAL (default 50)
    • Improve processing efficiency by avoiding frequent updates
  2. Specificity of Context Generation

    • Normal processing: Use and cache the immediately preceding translation result as context
    • Batch processing: Generate context via the API, since translation results are not immediately available
  3. Context Generation Logic

    • Collect up to context_paragraph_limit paragraphs of 100+ words, tracing back from the index of the text to be translated
    • Generate context by translating the collected paragraphs via the API (see the sketch after this list)
  4. Efficient Updates and Caching

    • Manage updates with the create_batch_context_messages method
    • Cache generated contexts and reuse until the next update
  5. Differences from Existing Behavior

    • Normal processing: Directly use translation results as context
    • Batch processing: Use API-translated original text as context
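
The backtracking step in point 3 might look roughly like this; paragraphs, index, and limit are illustrative names standing in for the PR's internal state.

def collect_context_paragraphs(paragraphs, index, limit, min_words=100):
    # Walk backwards from the paragraph being translated and keep up to
    # `limit` earlier paragraphs that contain 100+ words.
    collected = []
    for i in range(index - 1, -1, -1):
        if len(paragraphs[i].split()) >= min_words:
            collected.append(paragraphs[i])
            if len(collected) == limit:
                break
    collected.reverse()  # restore original document order
    return collected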

This method optimizes processing efficiency while maintaining appropriate context information during batch processing. However, due to the nature of batch processing, there may be slight differences in translation quality compared to normal sequential processing as the context generation method differs.

Files and Directories Created by Batch Processing

Batch processing creates the following directory structure and files:

  1. Root Directory

    • Location: Current working directory (os.getcwd())
    • Structure:
      current_working_directory/
      └── batch_files/
          ├── {book_name}_info.json
          └── {book_name}/
              ├── 1.jsonl
              ├── 2.jsonl
              ├── 3.jsonl
              └── ...
      
  2. batch_files Directory

    • Purpose: Store all files related to batch processing
    • Location: {current_working_directory}/batch_files/
  3. {book_name}_info.json

    • Purpose: Save metadata for batch processing
    • Location: {current_working_directory}/batch_files/{book_name}_info.json
    • Contents:
      • Book ID
      • Batch processing date and time
      • Information for each batch file (input file ID, batch ID, start index, end index); a hypothetical example appears after the notes below
  4. {book_name} Directory

    • Purpose: Store batch files related to a specific book
    • Location: {current_working_directory}/batch_files/{book_name}/
  5. Batch Request Files ({number}.jsonl)

    • Purpose: Contain batch requests for ChatGPT API
    • Location: {current_working_directory}/batch_files/{book_name}/{number}.jsonl
    • Format: JSONL (JSON Lines)
    • Contents: One request object per line (custom ID, method, URL, body)

Notes:

  • {book_name} is a safe directory name generated from the original file name (special characters replaced with '_')
  • Each JSONL file contains at most 40,000 lines (requests) and may not exceed 100 MB
  • Batch processing results are temporarily stored on OpenAI's servers and retrieved later
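
As a concrete illustration, the metadata file might be written along these lines. The key names and ID values below are hypothetical, based only on the fields listed above; the actual keys in the PR may differ.

import datetime
import json
import os

def batch_info_path(book_name):
    # os.path.join keeps the path portable across platforms
    return os.path.join(os.getcwd(), "batch_files", f"{book_name}_info.json")

path = batch_info_path("animal_farm")
os.makedirs(os.path.dirname(path), exist_ok=True)

info = {
    "book_id": "animal_farm",
    "created_at": datetime.datetime.now().isoformat(),
    "batch_files": [
        {
            "input_file_id": "file-abc123",  # placeholder OpenAI file ID
            "batch_id": "batch_xyz789",      # placeholder batch ID
            "start_index": 0,
            "end_index": 39999,
        }
    ],
}
with open(path, "w", encoding="utf-8") as f:
    json.dump(info, f, indent=2)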

Technical Improvements

  • Extension of ChatGPTAPI class: Addition of batch processing-related methods
  • Enhanced file processing: Generation and management of JSONL files for batches
  • Addition of error handling and timeout processing (a polling sketch follows this list)
  • Introduction of caching mechanism for improved efficiency
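
The timeout handling could be as simple as a polling loop like this sketch; the function name, timeout, and interval are assumptions, not the PR's exact code.

import time
from openai import OpenAI

client = OpenAI()

def wait_for_batch(batch_id, timeout=24 * 3600, interval=60):
    # Poll until the batch reaches a terminal status or the timeout elapses.
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = client.batches.retrieve(batch_id).status
        if status in ("completed", "failed", "expired", "cancelled"):
            return status
        time.sleep(interval)
    raise TimeoutError(f"batch {batch_id} did not finish within {timeout}s")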

Expected Effects

  • Reduced processing time for large-scale translation tasks
  • Cost reduction through optimization of API requests
  • Improved flexibility through reuse of batch processing results

Important Notes

  • When using batch processing, you must first execute batch translation with the --batch option, then use the results with the --batch-use option.
  • Batch processing results are temporarily saved to the file system, so adequate disk space is required.
  • As the frequency of context updates is adjusted during batch processing, there may be slight differences in translation quality compared to normal sequential processing.

Usage Examples

1a. Executing batch translation (without using context):

python3 make_book.py --book_name test_books/animal_farm.epub \
--model gpt4omini \
--language ja \
--batch

1b. Executing batch translation (using context):

python3 make_book.py --book_name test_books/animal_farm.epub \
--model gpt4omini \
--language ja \
--use_context \
--batch

2. Generating a bilingual EPUB from batch translation results:

python3 make_book.py --book_name test_books/animal_farm.epub \
--model gpt4omini \
--language ja \
--batch-use

This implementation significantly improves processing efficiency, especially for EPUB files containing large amounts of text. Batch processing lets large-scale translation tasks run with optimized API requests, and reusable batch results give users more flexibility in managing the translation process.

@mkXultra self-assigned this on Aug 16, 2024
@mkXultra marked this pull request as ready for review on Aug 18, 2024, 11:41
@mkXultra changed the title from "supprt batch api" to "supprt openai batch api" on Aug 18, 2024
Comment on lines 42 to 43
CONTEXT_PARAGRAPH_LIMIT = 3
BATCH_CONTEXT_UPDATE_INTERVAL = 50
yihong0618 (Owner):

let's make a config.py file for these?

mkXultra (Collaborator, Author):

@yihong0618
I created config.py and modified the code to use it. Since this is my first time creating config.py, I'm not sure if I'm using it correctly. Could you please review it?

fd92f7a
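
For reference, a config.py extracted from those two constants could be as small as the following sketch; the actual file committed in fd92f7a may be organized differently.

# config.py (sketch)
CONTEXT_PARAGRAPH_LIMIT = 3
BATCH_CONTEXT_UPDATE_INTERVAL = 50

The translator module would then import them instead of defining them inline:

from config import BATCH_CONTEXT_UPDATE_INTERVAL, CONTEXT_PARAGRAPH_LIMIT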

Comment on lines +418 to +421
# Replace any characters that are not alphanumeric, underscore, hyphen, or dot with an underscore
sanitized_book_name = re.sub(r"[^\w\-_\.]", "_", book_name)
# Remove leading and trailing underscores and dots
sanitized_book_name = sanitized_book_name.strip("._")
yihong0618 (Owner):

nice

@@ -388,3 +407,224 @@ def set_model_list(self, model_list):
model_list = list(set(model_list))
print(f"Using model list {model_list}")
self.model_list = cycle(model_list)

def batch_init(self, book_name):
self.book_name = self.sanitize_book_name(book_name)
yihong0618 (Owner):

I think this name does not support Windows. Can we use pathlib or os.path?

mkXultra (Collaborator, Author):

I've fixed it
#423 (comment)

Comment on lines 425 to 428
return f"{os.getcwd()}/batch_files/{self.book_name}_info.json"

def batch_dir(self):
return f"{os.getcwd()}/batch_files/{self.book_name}"
yihong0618 (Owner):

ditto, this does not seem to support Windows file names

mkXultra (Collaborator, Author):

I fixed it
f1e78eb
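
A portable rewrite of those helpers, assuming os.path as suggested; the first method's real name is not visible in the excerpt, so batch_metadata_file_path is a stand-in, and the actual fix in f1e78eb may differ.

import os

def batch_metadata_file_path(self):  # stand-in name; not shown in the excerpt
    return os.path.join(os.getcwd(), "batch_files", f"{self.book_name}_info.json")

def batch_dir(self):
    return os.path.join(os.getcwd(), "batch_files", self.book_name)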

@yihong0618 merged commit 9e4e7b5 into yihong0618:main on Aug 20, 2024
2 checks passed