Question regarding the reward of sales promotion training dataset #10
Open
Description
Hi,
In the sales promotion environment the reward is denoted by rew = (d_total_gmv - d_total_cost)/self.num_users which means the operator observes one single reward signal over all users. However, in the offline training dataset the reward is different for each user across 50 days. For example refer to the below user orders and reward graph
as per my understanding the reward should be same each day for the three users and gradually increase over 50 days with increase in sales. Could you kindly let me know how the reward in the training dataset was calculated.
Metadata
Assignees
Labels
No labels