
๐ŸŽต ๋™์˜์ƒ ์ฝ˜ํ…์ธ ์—์„œ ๋งž์ถคํ˜• ๋ฐฐ๊ฒฝ์Œ์•…์„ ์ƒ์„ฑํ•˜๋Š” ์„œ๋น„์Šค


seokhee516/BGM-Generate-by-Riffusion



✨Team.ET✨

boostcamp 4th NLP Final Project :
์˜์ƒ ์ฝ˜ํ…์ธ  ๋งž์ถคํ˜• BGM ์ƒ์„ฑ

1. Team


김건우

백단익

손용찬

이재덕

정석희

Contribution

김건우 : Model training, pipeline design, Riffusion
백단익 : Model design and analysis, Whisper
손용찬 : Model design and analysis, sentiment classification
이재덕 : Frontend, Backend, architecture, Riffusion
정석희 : Backend, architecture

2. About

๊ธฐํš์˜๋„

1์ธ ๋ฏธ๋””์–ด ์‹œ์žฅ ๊ทœ๋ชจ๊ฐ€ ์„ฑ์žฅํ•˜๊ณ  ์žˆ์œผ๋ฉฐ, ๋™์˜์ƒ ์ฝ˜ํ…์ธ  ์ œ์ž‘์˜ ๋น„์ค‘์ด ๋Œ€๋ถ€๋ถ„์ด๊ณ  ์ด์— ๋”ฐ๋ผ BGM ์ˆ˜์š” ๋˜ํ•œ ์ฆ๊ฐ€ํ•˜๊ณ  ์žˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜, ๋Š˜์–ด๋‚˜๋Š” ๋™์˜์ƒ ์ˆ˜์š”์™€๋Š” ๋‹ฌ๋ฆฌ, ์˜์ƒ์ œ์ž‘์— ํ™œ์šฉ ๊ฐ€๋Šฅํ•œ BGM ์˜ ๊ฒฝ์šฐ ์ œํ•œ์‚ฌํ•ญ(์ €์ž‘๊ถŒ ๋ถ„์Ÿ๊ณผ ๋กœ์—ดํ‹ฐ ๋น„์šฉ ๋“ฑ)์ด ๋งŽ์ด ์กด์žฌํ•˜๋ฉฐ ์ด ๋ถ€๋ถ„์„ ํ•ด๊ฒฐํ•˜๊ณ ์ž AI ๊ธฐ๋ฐ˜ ์Œ์•… ์ƒ์„ฑ ๋ชจ๋ธ์„ ํ™œ์šฉํ•˜์—ฌ ์ €์ž‘๊ถŒ ์—†๋Š” BGM์„ ์ œ๊ณตํ•˜๊ณ ์ž ํ•œ๋‹ค.

Development goal

Given an input video, we extract its content, run sentiment analysis on it, classify the sentiment that fits the content, and use that result to generate BGM with the Riffusion model.

3. Model

FlowChart

Step1 : Understanding the video content

Speech-to-Text

  • Extract all spoken content as text using OpenAI's Whisper model.

Why Whisper:

  • Strong performance: on average a 55.2% lower error rate than wav2vec 2.0, which is used as SOTA in speech recognition.
  • It offers any-to-English speech translation as a multitask capability, so STT and translation can be used together. This makes it possible to use English datasets in later steps, which is why we chose it.
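
The speech-to-text step above can be sketched as follows. This is a minimal illustration, not the project's own code: it assumes the `openai-whisper` package is installed, the model size `"base"` is an arbitrary choice, and the helper names `transcribe_to_english` and `to_timed_phrases` are hypothetical. `task="translate"` requests English output regardless of the spoken language.

```python
from typing import Any, Dict, List, Tuple


def transcribe_to_english(video_path: str, model_name: str = "base") -> Dict[str, Any]:
    """Run Whisper on a media file and return its result dictionary."""
    # Imported lazily because openai-whisper is a heavy dependency.
    import whisper

    model = whisper.load_model(model_name)
    # task="translate" asks Whisper to emit English text for any source language.
    return model.transcribe(video_path, task="translate")


def to_timed_phrases(result: Dict[str, Any]) -> List[Tuple[float, float, str]]:
    """Flatten Whisper's "segments" into (start_sec, end_sec, text) tuples."""
    phrases = []
    for seg in result.get("segments", []):
        text = seg["text"].strip()
        if text:  # skip segments that contain only whitespace
            phrases.append((float(seg["start"]), float(seg["end"]), text))
    return phrases
```

The timed phrases keep the start and end of each segment, which is what a per-phrase sentiment classifier needs in order to place emotions on the video timeline.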

Step2 : Classifying the video's sentiment

Sentiment Classifier

  • To cover the whole text while preserving the character of each part, we classify sentiment per phrase.
  • Each phrase of the full text is classified into one of 7 emotions: happiness, sadness, disgust, anger, surprise, fear, and neutral.

    https://huggingface.co/j-hartmann/emotion-english-distilroberta-base

  • After per-phrase classification, a post-processing step ignores any emotion that lasts for less than a threshold duration. If the emotions immediately before and after an ignored one are the same, the ignored span is treated as a continuation of them and replaced with that emotion.
  • This gave stable emotions along the timeline, so we adopted the Sentiment Classifier approach.
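
The post-processing rule above can be sketched like this. It is one possible reading, not the project's implementation: the 5-second threshold, the `(start, end, emotion)` tuple layout, and the function names are assumptions, and short runs are judged in a single pass against the original neighbouring runs.

```python
from typing import List, Tuple

Segment = Tuple[float, float, str]  # (start_sec, end_sec, emotion)


def merge_runs(segments: List[Segment]) -> List[Segment]:
    """Merge consecutive list entries that carry the same emotion."""
    runs: List[Segment] = []
    for start, end, emotion in segments:
        if runs and runs[-1][2] == emotion:
            runs[-1] = (runs[-1][0], end, emotion)
        else:
            runs.append((start, end, emotion))
    return runs


def smooth_emotions(segments: List[Segment], min_duration: float = 5.0) -> List[Segment]:
    """Drop short-lived emotions; absorb them when both neighbours agree."""
    runs = merge_runs(segments)
    kept: List[Segment] = []
    for i, (start, end, emotion) in enumerate(runs):
        if end - start >= min_duration:
            kept.append((start, end, emotion))
            continue
        prev_emotion = runs[i - 1][2] if i > 0 else None
        next_emotion = runs[i + 1][2] if i + 1 < len(runs) else None
        if prev_emotion is not None and prev_emotion == next_emotion:
            # Same emotion on both sides: treat the short run as part of it.
            kept.append((start, end, prev_emotion))
        # Otherwise the short run is ignored and leaves a gap.
    # Join neighbours with the same emotion, bridging any dropped gap between them.
    return merge_runs(kept)
```

For example, a 2-second "anger" run between two "joy" runs becomes part of a single "joy" span, while a 2-second "fear" run between "joy" and "sadness" is simply dropped.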

Step3 : Generating BGM that matches the sentiment

Using and training the Riffusion model

  • Riffusion is a diffusion model trained on spectrograms, a tool for visualizing sound and other waves.
  • The sentiment classification result from Step2 is used as the prompt, and a spectrogram with the same sentiment is used as the seed image.
  • For user convenience, music and noise other than speech are removed from the original video, and the generated BGM is mixed in to produce the final output.
  • Model: JD97/Riffusion_sentiment_LoRA(huggingface)

    https://huggingface.co/JD97/Riffusion_sentiment_LoRA

4. Dataset

Building the train dataset for additional Riffusion training

Input (source data) → data extraction → downsampling → segmentation → preprocessing → Output (spectrogram with caption)

Dataset

(1) Select labels similar to the sentiment classifier's from the source data and extract the matching items (6,680 items)

  • Source_data: Chr0my/Epidemic_music(huggingface)

    https://huggingface.co/datasets/Chr0my/Epidemic_music

  • The 7 similar labels: angry, fear, funny, happy, quirky, sad, weird

(2) Downsample the extracted music files (22.05 kHz → 8 kHz)

(3) Split the downsampled music files into 10-second segments (with random sampling)

  • The 10-second length was chosen to produce samples similar to the Riffusion model's training data
  • Random sampling was used to train a model that is robust to where a segment starts
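
The random 10-second segmentation can be sketched as below. This is an illustrative sketch only: the function name, the number of segments per file, and the choice to allow overlapping segments are assumptions, and the downsampling to 8 kHz is assumed to have been done beforehand with an audio library.

```python
import random
from typing import List, Optional, Tuple


def random_segments(
    num_samples: int,
    sample_rate: int = 8000,
    segment_sec: float = 10.0,
    count: int = 3,
    seed: Optional[int] = None,
) -> List[Tuple[int, int]]:
    """Pick `count` random (start, end) sample ranges of fixed length.

    Ranges may overlap. Returns an empty list if the clip is shorter
    than one segment.
    """
    seg_len = int(round(segment_sec * sample_rate))
    if num_samples < seg_len:
        return []
    rng = random.Random(seed)  # a seed makes the sampling reproducible
    max_start = num_samples - seg_len
    starts = sorted(rng.randint(0, max_start) for _ in range(count))
    return [(start, start + seg_len) for start in starts]
```

Each returned pair can be used to slice the downsampled waveform, for example `waveform[start:end]`, before it is converted to a spectrogram.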

(4) Preprocessing

  • STFT (short-time Fourier transform) → Griffin-Lim → Mel scale
  • Captions are written from the source data's metadataTags and moods fields
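
To illustrate the mel-scale step, the sketch below converts between Hz and mel and lists band centres for 8 kHz audio (Nyquist 4 kHz). It assumes the common HTK-style formula; the README does not state which mel variant or how many bands the project used, so `n_mels` and the function names here are hypothetical.

```python
import math
from typing import List


def hz_to_mel(freq_hz: float) -> float:
    """HTK-style Hz to mel conversion."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_mels: int, sample_rate: int = 8000) -> List[float]:
    """Centre frequencies (Hz) of n_mels bands spaced evenly on the mel axis.

    The bands span 0 Hz up to the Nyquist frequency (sample_rate / 2).
    """
    max_mel = hz_to_mel(sample_rate / 2.0)
    step = max_mel / (n_mels + 1)
    return [mel_to_hz(step * (i + 1)) for i in range(n_mels)]
```

Because the bands are evenly spaced in mel rather than in Hz, they are narrow at low frequencies and wider toward 4 kHz, which is the effect the mel scale has on the spectrogram's frequency axis.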

(5) Final dataset

  • gwkim22/spectro_caption_dataset(huggingface)

    https://huggingface.co/datasets/gwkim22/spectro_caption_dataset

5. Architecture

FlowChart

6. How to Use

File Directory

.
|-- LoRA
|   |-- README.md
|   |-- text_to_image_lora.py
|   `-- train.sh
|-- MLOPS
|   |-- README.md
|   |-- front
|   |-- kubernetes
|   `-- serving
|-- dataset
|-- model
|   |-- README.md
|   |-- _interpolation.py
|   |-- _sum_by_sent.py
|   |-- oneway_pipeline.py
|   |-- pre_to_stt.py
|   |-- pretrained_models
|   |-- stt_to_rif.py
|   `-- utils.py
|-- project_requirements.txt
|-- riffusion
|-- whisper
`-- README.md

Environment

Ubuntu 18.04.5 LTS
CPU : Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz x 8
GPU : Tesla V100-PCIE-32GB
Python Version 3.9

Prerequisite

# Install project_requirements.txt
$ pip install -r project_requirements.txt

# Install the following additional packages:
$ sudo apt-get update
$ sudo apt-get install -y ffmpeg
$ conda install pyworld -c conda-forge
$ sudo apt-get install -y libsndfile1-dev
$ pip install git+https://github.com/openai/whisper.git
$ pip install git+https://github.com/huggingface/diffusers
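
After installation, a quick check like the following can confirm that the main prerequisites are visible from Python. This is a hypothetical helper, not part of the repository; it uses only the standard library and assumes the import names `whisper`, `diffusers`, and `pyworld` for the packages installed above.

```python
import importlib.util
import shutil
from typing import Dict, Sequence


def check_prerequisites(
    commands: Sequence[str] = ("ffmpeg",),
    modules: Sequence[str] = ("whisper", "diffusers", "pyworld"),
) -> Dict[str, bool]:
    """Report which command-line tools and top-level Python modules are available."""
    status: Dict[str, bool] = {}
    for cmd in commands:
        # shutil.which returns None when the executable is not on PATH.
        status[f"command:{cmd}"] = shutil.which(cmd) is not None
    for mod in modules:
        # find_spec returns None when a top-level module cannot be found.
        status[f"module:{mod}"] = importlib.util.find_spec(mod) is not None
    return status


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```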

Reference

Paper

  • whisper: Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356.
  • LoRA: Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., ... & Chen, W. (2021). Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.

Open Source
