
Delayed reward computation


The reward does not need to be logged explicitly by the Client Library. Instead, it can be computed later by the Decision Service.

Default reward

When no information is logged about the reward or "outcome" of a given experimental unit, the Join Service substitutes a configurable "default reward". For example, the absence of a click is typically not logged as an explicit "no-click" outcome with its own reward; the "default reward" is used instead.

The default value for the "default reward" is 0. To change it, open the ASA join query in your Decision Service deployment (see this Wiki page). Look for the line

WHEN observation IS NULL THEN 0 -- default cost

and change 0 to the appropriate number.
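For instance, to make the default reward 0.5 instead of 0, the relevant part of the CASE expression would read roughly as follows (a sketch; the exact surrounding SELECT clause depends on the template query in your deployment):

CASE
    WHEN observation IS NULL THEN 0.5 -- default cost, changed from 0
    ELSE -observation.v
END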

Complex outcomes

The "outcome" of an experimental unit can consist of one or more "fragments", and the reward can be defined as a function of these fragments. The exact reward definition may be not fully known in advance, or may change in the future. Then it is possible to log all pertinent outcome fragments (see this Wiki page for details and limitations), and compute the reward later in the Join Service.

To compute the reward, open the ASA join query in your Decision Service deployment and look for the line

ELSE -observation.v -- insert custom reward function using ASA SQL, JavaScript or AzureML

and change -observation.v to a valid expression in the ASA query language. For a concrete example, suppose the logged outcome includes numerical properties A and B, and the reward is defined as A+B. Then -observation.v should be changed to observation.v.A + observation.v.B.
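Putting this into the join query, the edited CASE expression would then read roughly as follows (a sketch; the property names A and B are illustrative, and the surrounding SELECT clause follows the template query in your deployment):

CASE
    WHEN observation IS NULL THEN 0 -- default cost
    ELSE observation.v.A + observation.v.B -- reward defined as A+B
END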

Not all reward definitions are expressible in the ASA query language. Please contact us if you run into this limitation; there are ways to sidestep it, and we may be able to help.

Reward can depend on context and/or action

The reward can also depend on the context and/or the chosen action. Often neither is available when the reward is logged. Instead, the pertinent outcome can be logged, and the reward can be computed later by the Join Service, much like in the previous example. For concreteness, suppose the context includes a numerical feature F, the chosen action is just a number N, the logged outcome is a tuple that includes a numerical property P, and the reward is defined as F+N+P. Then -observation.v should be changed to observation.v.P + interaction.a + interaction.c.F.
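In terms of the join query, the edited CASE expression would then read roughly like this (a sketch; interaction.c and interaction.a refer to the logged context and chosen action as above, and the exact schema may differ in your deployment):

CASE
    WHEN observation IS NULL THEN 0 -- default cost
    ELSE observation.v.P + interaction.a + interaction.c.F -- reward defined as F+N+P
END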