Will this book talk about RLHF? #226
-
Great book! I read all the notebooks in this repo and have a question. I heard that RLHF (Reinforcement Learning from Human Feedback) is the core technique behind ChatGPT. Does this book cover it? I see there will be extra material on DPO for preference finetuning. Is it equivalent to RLHF? What is the popular practice in industry today after instruction finetuning? Thanks!
-
Thanks! Regarding DPO, I've actually implemented it for Chapter 7, but then removed it for two reasons: it's a nice and relatively simple technique for preference finetuning, but it didn't quite meet the bar of being a fundamental, established technique that works well, and I'm currently busy finishing up the work on the book itself over the next few weeks. Once the book is done, I plan to polish up the DPO part and either share it here or on my blog, and then do the same for RLHF with a dedicated reward model. In the meantime, you might like my two articles here:
…
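For anyone curious about what "preference finetuning with DPO" boils down to, here is a minimal sketch of the DPO objective. This is illustrative code only, not the removed Chapter 7 implementation; the function name `dpo_loss` and the `beta=0.1` default are assumptions for the example.

```python
# Minimal sketch of the DPO loss (illustrative, not the book's Chapter 7 code).
# Inputs are summed log-probabilities of the chosen/rejected responses under
# the policy model being finetuned and under a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-ratio of policy vs. reference model for each response
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # DPO objective: push the chosen response ahead of the rejected one
    # by a margin in log-ratio space
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards)
    return loss.mean()

# Toy usage with random numbers standing in for real log-probabilities
torch.manual_seed(0)
pc, pr = torch.randn(4), torch.randn(4)
rc, rr = torch.randn(4), torch.randn(4)
print(dpo_loss(pc, pr, rc, rr))
```

The model is only rewarded for preferring the chosen response relative to the frozen reference model, which is what lets DPO skip the separate reward model and RL loop that full RLHF uses.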