SmartX-Team/skrueue

SKRueue - Kubernetes RL-based Scheduler

SKRueue is a system that applies reinforcement learning (RL) to Kueue, the Kubernetes job queueing scheduler, to optimize job scheduling.

A public dataset for RL training is available at the link below (availability guaranteed until 2025.06.30): https://drive.google.com/file/d/1Hiv4E8SJtf5m0xzbIgdDhXY-Tt0QfqJq/view?usp=sharing

🚀 Key Features

  • 67-dimensional state space: a rich state representation covering cluster state and job queue information
  • Multiple RL algorithms: DQN, PPO, and A2C are supported
  • Real-time data collection: cluster metrics and job execution data are collected automatically
  • Realistic workload simulation: job generation that reflects time-of-day patterns
  • Performance benchmark: built-in tooling for comparing performance against existing schedulers

📋 Requirements

  • Python 3.8+
  • Kubernetes cluster (1.20+)
  • Kueue installed (optional)
  • kubectl configured

🛠 Installation

1. Clone the project

git clone https://github.com/SmartX-Team/skrueue.git
cd skrueue

2. Set up the Python environment

python -m venv venv
source venv/bin/activate  # Linux/Mac
# or
venv\Scripts\activate  # Windows

pip install -r requirements.txt

3. Set up the environment

python main.py setup

This command does the following:

  • Verifies the Kubernetes connection
  • Creates the required namespaces
  • Configures RBAC permissions
  • Creates the directory structure
  • Generates the default configuration file

📁 Project Structure

skrueue/
├── core/                  # Core modules
│   ├── environment.py     # RL environment (67-dimensional state space)
│   ├── agent.py           # RL agent (DQN/PPO/A2C)
│   └── interface.py       # Kueue interface
├── utils/                 # Shared utilities
│   ├── k8s_utils.py       # Kubernetes helpers
│   ├── resource_parser.py # Resource parsing
│   └── logger.py          # Logging
├── data/                  # Data management
│   ├── collector.py       # Real-time collection
│   ├── database.py        # SQLite management
│   └── exporter.py        # CSV export
├── workload/              # Workload generation
│   ├── generator.py       # Unified generator
│   ├── templates.py       # Job templates
│   └── strategies.py      # Generation strategies
├── test/                  # Tests and benchmarks
│   ├── integration.py     # Integration tests
│   ├── benchmark.py       # Performance benchmark
│   └── monitor.py         # Monitoring
├── scripts/               # Run scripts
│   ├── setup.py           # Environment setup
│   ├── run_experiment.py  # Experiment runner
│   └── collect_data.py    # Data collection
├── config/                # Configuration files
│   └── settings.py        # Central configuration management
└── main.py                # Main entry point

🎯 Usage

1. Collect data

# Collect data for 24 hours using the realistic workload pattern
python main.py collect --duration 24 --strategy realistic

# Collect data from the existing cluster only, without generating workloads
python main.py collect --no-workload
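
To illustrate what a time-of-day workload pattern such as the one behind --strategy realistic might look like, here is a minimal sketch. The hourly multipliers, the base rate, and both function names are hypothetical; the actual logic in workload/strategies.py may differ. Submissions are sampled as a Poisson process whose rate depends on the hour.

```python
import random


def arrival_rate(hour: int, base_rate: float = 2.0) -> float:
    """Expected job submissions per hour for a given hour of day (0-23).

    The multipliers below are illustrative assumptions, not project values.
    """
    if 9 <= hour < 18:
        return base_rate * 3.0    # daytime peak
    if 18 <= hour < 23:
        return base_rate * 1.5    # evening batch jobs
    return base_rate * 0.5        # overnight lull


def sample_arrivals(hour: int, rng: random.Random) -> list:
    """Sample submission offsets (in minutes) within one hour.

    Exponential inter-arrival times give a Poisson arrival process.
    """
    rate_per_minute = arrival_rate(hour) / 60.0
    offsets = []
    t = rng.expovariate(rate_per_minute)
    while t < 60.0:
        offsets.append(t)
        t += rng.expovariate(rate_per_minute)
    return offsets


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed for reproducible runs
    for hour in (3, 10, 20):
        print(hour, len(sample_arrivals(hour, rng)), "jobs")
```

A seeded random.Random instance keeps generated workloads reproducible across benchmark runs.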

2. Train an RL model

# Train a DQN model for 50,000 steps (the default is 20,000 steps)
python main.py train --algorithm DQN --timesteps 50000

# Train a PPO model
python main.py train --algorithm PPO --timesteps 100000 --name my_experiment

3. Run inference

# Schedule jobs in real time with a trained model
python main.py inference --model models/skrueue_dqn_model --namespace skrueue-test

4. Run performance benchmarks

# Compare all algorithms for 30 minutes
python main.py benchmark --duration 30 --algorithms DQN PPO A2C

# Test a single algorithm only
python main.py benchmark --duration 15 --algorithms DQN --name quick_test

5. Run tests

# Run the integration tests
python main.py test

⚙️ Configuration

Edit config/skrueue.yaml to change the settings:

# Kueue settings
kueue:
    namespaces: [skrueue-test, default]
    max_queue_size: 10 # number of job queue slots in the state space

# RL settings
rl:
    algorithm: DQN
    learning_rate: 0.0001
    training_steps: 20000
    reward_weights:
        throughput: 0.4 # weight of the throughput reward
        utilization: 0.3 # weight of the resource utilization reward
        wait_penalty: 0.2 # weight of the wait-time penalty
        failure_penalty: 0.1 # weight of the failure penalty

# Data collection settings
data:
    collection_interval: 10 # in seconds
    db_path: data/skrueue_training_data.db
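
Before starting a long training run it can help to sanity-check these values. The sketch below validates a plain dict that mirrors the YAML above. The function name validate_config and the rule that reward weights should sum to 1 (as the defaults do) are assumptions, not part of the project.

```python
import math

SUPPORTED_ALGORITHMS = {"DQN", "PPO", "A2C"}

# Mirrors the default config/skrueue.yaml shown in this README.
DEFAULT_CONFIG = {
    "kueue": {"namespaces": ["skrueue-test", "default"], "max_queue_size": 10},
    "rl": {
        "algorithm": "DQN",
        "learning_rate": 0.0001,
        "training_steps": 20000,
        "reward_weights": {
            "throughput": 0.4,
            "utilization": 0.3,
            "wait_penalty": 0.2,
            "failure_penalty": 0.1,
        },
    },
    "data": {"collection_interval": 10,
             "db_path": "data/skrueue_training_data.db"},
}


def validate_config(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    rl = config.get("rl", {})
    if rl.get("algorithm") not in SUPPORTED_ALGORITHMS:
        problems.append(f"unsupported algorithm: {rl.get('algorithm')!r}")
    if not rl.get("learning_rate", 0) > 0:
        problems.append("learning_rate must be positive")
    total = sum(rl.get("reward_weights", {}).values())
    # Assumption: the weights are meant to sum to 1, as the defaults do.
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        problems.append(f"reward_weights sum to {total}, expected 1.0")
    if config.get("data", {}).get("collection_interval", 0) <= 0:
        problems.append("collection_interval must be a positive number of seconds")
    return problems


if __name__ == "__main__":
    print(validate_config(DEFAULT_CONFIG) or "config looks consistent")
```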

📊 State Space Structure

SKRueue uses a 67-dimensional state space:

  • Dimensions 0-3: cluster resource information
    • CPU availability, memory availability, CPU utilization, memory utilization
  • Dimensions 4-6: cluster history
    • Number of running jobs, CPU utilization (history), recent OOM rate
  • Dimensions 7-66: job queue information (up to 10 jobs × 6 dimensions)
    • Per job: CPU request, memory request, priority, wait time, estimated runtime, job type
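
The layout above (4 + 3 + 10 × 6 = 67) can be sketched as a flat vector. This is only an illustration: the function name build_state, the dictionary keys, and the zero-padding of unused queue slots are assumptions and are not taken from core/environment.py.

```python
from typing import Dict, List

MAX_QUEUE_SIZE = 10      # matches kueue.max_queue_size in the config
FEATURES_PER_JOB = 6     # CPU req, memory req, priority, wait, runtime, type
STATE_DIM = 4 + 3 + MAX_QUEUE_SIZE * FEATURES_PER_JOB  # 67

# Hypothetical key names for the six per-job features.
JOB_KEYS = ["cpu_request", "memory_request", "priority",
            "wait_time", "estimated_runtime", "job_type"]


def build_state(resources: List[float], history: List[float],
                queue: List[Dict[str, float]]) -> List[float]:
    """Flatten cluster and queue information into a fixed-length vector."""
    if len(resources) != 4 or len(history) != 3:
        raise ValueError("expected 4 resource values and 3 history values")
    state = list(resources) + list(history)          # dimensions 0-6
    for job in queue[:MAX_QUEUE_SIZE]:               # dimensions 7-66
        state.extend(float(job[key]) for key in JOB_KEYS)
    # Assumption: unused queue slots are zero-padded.
    state.extend([0.0] * (STATE_DIM - len(state)))
    return state


if __name__ == "__main__":
    job = {"cpu_request": 0.2, "memory_request": 0.1, "priority": 0.7,
           "wait_time": 0.0, "estimated_runtime": 0.3, "job_type": 1}
    state = build_state([0.5, 0.5, 0.5, 0.5], [3, 0.4, 0.0], [job])
    print(len(state))
```

A fixed-length vector keeps the observation shape constant regardless of how many jobs are actually waiting, which is what most RL libraries expect.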

📈 Performance Metrics

Key metrics measured by the benchmark:

  • Throughput: number of jobs completed per hour
  • Average wait time: time from job submission to the start of execution
  • Success rate: proportion of jobs that completed successfully
  • Resource utilization: average CPU/memory utilization
  • OOM rate: number of jobs that failed due to Out-of-Memory errors
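
As a rough sketch of how most of these metrics could be derived from per-job records (resource utilization is omitted because it needs cluster samples rather than job records): the JobRecord fields and the summarize function are hypothetical and do not describe the schema used by test/benchmark.py.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class JobRecord:
    submitted: float            # seconds since benchmark start
    started: Optional[float]    # None if the job never started
    succeeded: bool
    oom_killed: bool = False


def summarize(jobs: List[JobRecord], duration_hours: float) -> dict:
    """Compute throughput, average wait, success rate and OOM count."""
    completed = [j for j in jobs if j.succeeded]
    waits = [j.started - j.submitted for j in jobs if j.started is not None]
    return {
        "throughput_per_hour": len(completed) / duration_hours,
        "avg_wait_seconds": sum(waits) / len(waits) if waits else 0.0,
        "success_rate": len(completed) / len(jobs) if jobs else 0.0,
        "oom_failures": sum(j.oom_killed for j in jobs),
    }


if __name__ == "__main__":
    records = [
        JobRecord(submitted=0, started=10, succeeded=True),
        JobRecord(submitted=5, started=25, succeeded=False, oom_killed=True),
        JobRecord(submitted=10, started=None, succeeded=False),
    ]
    print(summarize(records, duration_hours=0.5))
```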

🔧 Advanced Usage

Custom workload templates

You can add new job templates in workload/templates.py:

JobTemplate(
    name="custom-task",
    category="custom",
    cpu_request="2000m",
    memory_request="8Gi",
    estimated_duration=30,
    priority=7,
    command=["python", "my_script.py"]
)
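
The cpu_request and memory_request values use Kubernetes quantity strings. A minimal sketch of converting them to plain numbers follows; the function names are hypothetical (utils/resource_parser.py may expose a different interface), and only the common suffixes listed in the table are handled.

```python
# Binary and decimal suffixes from the Kubernetes quantity format.
_MEMORY_UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_cpu(quantity: str) -> float:
    """Return CPU cores; '2000m' (millicores) becomes 2.0."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def parse_memory(quantity: str) -> int:
    """Return bytes; '8Gi' becomes 8 * 2**30."""
    # Check two-character suffixes before one-character ones.
    for suffix in sorted(_MEMORY_UNITS, key=len, reverse=True):
        if quantity.endswith(suffix):
            number = float(quantity[:-len(suffix)])
            return int(number * _MEMORY_UNITS[suffix])
    return int(quantity)  # plain byte count


if __name__ == "__main__":
    print(parse_cpu("2000m"))    # 2.0
    print(parse_memory("8Gi"))   # 8589934592
```

With these helpers, the template above requests 2 CPU cores and 8 GiB of memory.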

Custom reward function

Modify the _calculate_reward() method in core/environment.py to customize the reward function.
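
One plausible shape for such a reward is a weighted sum of the four reward_weights from the configuration. The sketch below is an assumption about the general form, not the project's actual _calculate_reward(): it assumes every input is normalized to [0, 1] and that the two penalty terms are subtracted.

```python
# Default weights taken from the reward_weights section of the config.
REWARD_WEIGHTS = {
    "throughput": 0.4,
    "utilization": 0.3,
    "wait_penalty": 0.2,
    "failure_penalty": 0.1,
}


def calculate_reward(throughput: float, utilization: float,
                     wait: float, failure: float,
                     weights: dict = REWARD_WEIGHTS) -> float:
    """Weighted sum; all inputs are assumed to be normalized to [0, 1]."""
    return (weights["throughput"] * throughput
            + weights["utilization"] * utilization
            - weights["wait_penalty"] * wait        # longer waits lower the reward
            - weights["failure_penalty"] * failure) # failures lower the reward


if __name__ == "__main__":
    # Good step: high throughput and utilization, little waiting, no failures.
    print(calculate_reward(0.9, 0.8, 0.1, 0.0))
```

Under these assumptions the reward stays within [-0.3, 0.7] for the default weights, which keeps its scale predictable when the weights are tuned.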

Changing the network architecture

Modify the _get_network_architecture() method in core/agent.py to change the neural network structure.
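
As an illustration only, such a method could return per-algorithm hidden layer sizes. The layer sizes below are hypothetical, and the net_arch key follows the policy_kwargs convention of Stable-Baselines3; whether SKRueue is built on that library is an assumption.

```python
# Hypothetical layer sizes; the real values live in core/agent.py.
_DEFAULT_ARCH = {
    "DQN": [256, 256],
    "PPO": [128, 128],
    "A2C": [64, 64],
}


def get_network_architecture(algorithm: str, hidden_layers=None) -> dict:
    """Return keyword arguments describing the policy network."""
    if algorithm not in _DEFAULT_ARCH:
        raise ValueError(f"unsupported algorithm: {algorithm}")
    # A caller-supplied list overrides the per-algorithm default.
    return {"net_arch": list(hidden_layers or _DEFAULT_ARCH[algorithm])}


if __name__ == "__main__":
    print(get_network_architecture("DQN"))
    print(get_network_architecture("PPO", hidden_layers=[512, 256, 128]))
```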

🐛 Troubleshooting

kubectl connection errors

# Check the kubeconfig
kubectl cluster-info

# Check the current context
kubectl config current-context

Verifying the Kueue installation

# Check the CRDs
kubectl get crd | grep kueue

# Install Kueue (if needed)
kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.6.2/manifests.yaml

Out of memory

  • Reduce the config.rl.buffer_size value
  • Adjust the batch size
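
To get a feel for how buffer_size affects memory, here is a back-of-the-envelope estimate for an off-policy replay buffer such as the one DQN uses. It assumes float32 storage and one (observation, next observation, action, reward, done) tuple per transition; the real footprint depends on the RL library and is not stated in this README.

```python
STATE_DIM = 67          # state space size from this README
BYTES_PER_FLOAT32 = 4   # assumption: values are stored as float32


def replay_buffer_bytes(buffer_size: int, state_dim: int = STATE_DIM) -> int:
    """Estimate memory for (obs, next_obs, action, reward, done) tuples."""
    # Two observations plus three scalars per stored transition.
    per_transition = (2 * state_dim + 3) * BYTES_PER_FLOAT32
    return buffer_size * per_transition


if __name__ == "__main__":
    for size in (1_000_000, 100_000, 10_000):
        mib = replay_buffer_bytes(size) / 2**20
        print(f"{size:>9,} transitions: about {mib:.1f} MiB")
```

Under these assumptions each transition takes 548 bytes, so the estimate scales linearly with buffer_size.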

🤝 Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📄 License

This project is distributed under the MIT License.

✨ Roadmap

  • Distributed training support
  • More RL algorithms (SAC, TD3)
  • Web-based dashboard
  • Prometheus metrics integration
  • Multi-cluster support

Note: This project was developed for research purposes. Test it thoroughly before using it in a production environment.
