Refactor and Optimize DeepSeek Coder Training Script by Imadnajam · Pull Request #610 · deepseek-ai/DeepSeek-Coder

Imadnajam · 2025-01-29T10:17:21Z

Key Changes

1. Improved Tokenization and Data Preprocessing

We introduced a modular tokenization function, _tokenize_fn, which tokenizes each input string separately, truncating it to tokenizer.model_max_length and applying padding="longest".

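from typing import Dict, Sequence

import transformers


def _tokenize_fn(strings: Sequence[str], tokenizer: transformers.PreTrainedTokenizer) -> Dict:
    # Tokenize each string on its own, truncating to the tokenizer's maximum length.
    tokenized_list = [
        tokenizer(
            text,
            return_tensors="pt",
            padding="longest",
            max_length=tokenizer.model_max_length,
            truncation=True,
        )
        for text in strings
    ]
    # ... (the rest of the function is cut off in the PR description)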
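
To make the padding and truncation behaviour concrete, here is a minimal pure-Python sketch of "truncate to a maximum length, then pad to the longest sequence in the batch". It is not code from this PR: pad_and_truncate, its parameters, and the pad_id default of 0 are hypothetical names chosen for illustration. Note that in the Hugging Face tokenizers, padding="longest" pads to the longest sequence passed in a single call, so when each string is tokenized on its own, as in the excerpt above, no padding tokens are actually added at that step.

```python
from typing import List, Sequence


def pad_and_truncate(
    token_ids: Sequence[Sequence[int]],
    max_length: int,
    pad_id: int = 0,  # hypothetical default; real tokenizers define their own pad token id
) -> List[List[int]]:
    """Truncate each sequence to max_length, then right-pad all to the longest one."""
    # Step 1: truncation, applied to every sequence independently.
    truncated = [list(ids[:max_length]) for ids in token_ids]
    # Step 2: find the longest sequence remaining in this batch.
    longest = max((len(ids) for ids in truncated), default=0)
    # Step 3: right-pad shorter sequences so the batch is rectangular.
    return [ids + [pad_id] * (longest - len(ids)) for ids in truncated]


batch = pad_and_truncate([[5, 6, 7, 8, 9], [1, 2]], max_length=4)
print(batch)  # [[5, 6, 7, 8], [1, 2, 0, 0]]
```

With a single-sequence batch, for example pad_and_truncate([[1, 2, 3]], max_length=10), the output is unchanged, which mirrors why per-string tokenization with padding="longest" produces unpadded tensors.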

Imadnajam added 4 commits January 29, 2025 11:07

DeepSeek Coder - AI Programming Assistant Training Script

9d2b0e2

Delete finetune_Programming_Assistant_Training_Script.py

012d197

Create Programming Assistant Training Script

1b6f010

Rename Programming Assistant Training Script to Programming_Assistant_Training.py

46a00fd

vaibhavcybermeru approved these changes Jan 29, 2025

Imadnajam wants to merge 4 commits into deepseek-ai:main from Imadnajam:main
