|
| 1 | +--- |
| 2 | +title: A Regularization Proof |
| 3 | +author: Brian Zhang |
| 4 | +date: '2022-09-19' |
| 5 | +slug: a-regularization-proof |
| 6 | +categories: [] |
| 7 | +tags: [] |
| 8 | +description: 'Investigating behavior of a function minimum as we add regularization.' |
| 9 | +--- |
| 10 | + |
| 11 | + |
| 12 | + |
| 13 | +<p>Say we have a loss function <span class="math inline">\(l(w)\)</span>. Without regularization, suppose it attains its minimum at <span class="math inline">\(w = w_0\)</span>. Now consider the setting with regularization: |
| 14 | +<span class="math display">\[ |
| 15 | +f_\lambda(w) = l(w) + \lambda R(w), |
| 16 | +\]</span> |
| 17 | +where <span class="math inline">\(R(w) \geq 0\)</span> is some regularization function and <span class="math inline">\(\lambda \geq 0\)</span>. What can we say about the minimizers <span class="math inline">\(w_1\)</span> of <span class="math inline">\(f_{\lambda_1}(w)\)</span> and <span class="math inline">\(w_2\)</span> of <span class="math inline">\(f_{\lambda_2}(w)\)</span>, assuming both exist, when <span class="math inline">\(0 \leq \lambda_1 < \lambda_2\)</span>? |
| 18 | +<span class="math display">\[ |
| 19 | +w_1 = \operatorname*{argmin}_w \left[ l(w) + \lambda_1 R(w) \right],\\ |
| 20 | +w_2 = \operatorname*{argmin}_w \left[ l(w) + \lambda_2 R(w) \right]. |
| 21 | +\]</span></p> |
| 22 | +<p>Intuitively, as we increase <span class="math inline">\(\lambda\)</span> from <span class="math inline">\(\lambda_1\)</span> to <span class="math inline">\(\lambda_2\)</span>, the function <span class="math inline">\(f_\lambda(w)\)</span> places more importance on the regularization term <span class="math inline">\(R(w)\)</span>. We should expect <span class="math inline">\(l(w)\)</span> evaluated at the optimum <span class="math inline">\(w\)</span> to increase, and the regularization term <span class="math inline">\(R(w)\)</span> evaluated at the optimum <span class="math inline">\(w\)</span> to decrease.</p> |
| 23 | +<p>By the properties of the optimum, we have |
| 24 | +<span class="math display">\[\begin{gather} |
| 25 | +l(w_1) + \lambda_1 R(w_1) \leq l(w_2) + \lambda_1 R(w_2), \quad (1)\\ |
| 26 | +l(w_2) + \lambda_2 R(w_2) \leq l(w_1) + \lambda_2 R(w_1). \quad (2) |
| 27 | +\end{gather}\]</span> |
| 28 | +The only other information we have relating these terms is that <span class="math inline">\(R(w) \geq 0\)</span> (for all <span class="math inline">\(w\)</span>) and <span class="math inline">\(0 \leq \lambda_1 < \lambda_2\)</span>. So we work with what we have. First, leveraging <span class="math inline">\((1)\)</span>, |
| 29 | +<span class="math display">\[\begin{align*} |
| 30 | +f_{\lambda_1}(w_1) &= l(w_1) + \lambda_1 R(w_1)\\ |
| 31 | +&\leq l(w_2) + \lambda_1 R(w_2)\\ |
| 32 | +&\leq l(w_2) + \lambda_2 R(w_2)\\ |
| 33 | +&= f_{\lambda_2}(w_2), |
| 34 | +\end{align*}\]</span> |
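| 35 | +where the second inequality uses <span class="math inline">\(R(w_2) \geq 0\)</span> and <span class="math inline">\(\lambda_1 < \lambda_2\)</span>. So the minimum value of the regularized objective increases (or stays the same) as we increase <span class="math inline">\(\lambda\)</span>. The same conclusion follows directly from the pointwise bound <span class="math inline">\(f_{\lambda_2}(w) \geq f_{\lambda_1}(w)\)</span> for all <span class="math inline">\(w\)</span>, since <span class="math inline">\(f_{\lambda_2}(w_2) \geq f_{\lambda_1}(w_2) \geq f_{\lambda_1}(w_1)\)</span>.</p> |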
| 36 | +<p>The other inequalities are trickier. Observe (starting with <span class="math inline">\((2)\)</span>): |
| 37 | +<span class="math display">\[\begin{align*} |
| 38 | +l(w_1) + \lambda_2 R(w_1) &\geq l(w_2) + \lambda_2 R(w_2)\\ |
| 39 | +&= l(w_2) + (\lambda_1 + \lambda_2 - \lambda_1) R(w_2)\\ |
| 40 | +&= \left[l(w_2) + \lambda_1 R(w_2)\right] + (\lambda_2 - \lambda_1) R(w_2)\\ |
| 41 | +&\geq \left[l(w_1) + \lambda_1 R(w_1)\right] + (\lambda_2 - \lambda_1) R(w_2). |
| 42 | +\end{align*}\]</span> |
| 43 | +The last step uses <span class="math inline">\((1)\)</span>. Subtracting <span class="math inline">\(l(w_1) + \lambda_1 R(w_1)\)</span> from the first and last expressions in this chain, we have |
| 44 | +<span class="math display">\[ |
| 45 | +(\lambda_2 - \lambda_1) R(w_1) \geq (\lambda_2 - \lambda_1) R(w_2). |
| 46 | +\]</span> |
| 47 | +Since <span class="math inline">\(\lambda_2 - \lambda_1 > 0\)</span>, we can divide both sides by it to obtain |
| 48 | +<span class="math display">\[ |
| 49 | +R(w_1) \geq R(w_2). |
| 50 | +\]</span> |
| 51 | +In words, the value of the regularization term at the optimum (not including the factor of <span class="math inline">\(\lambda\)</span>) decreases (or stays the same) as we increase <span class="math inline">\(\lambda\)</span>.<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a></p> |
| 52 | +<p>Starting with <span class="math inline">\((1)\)</span> and leveraging this fact, we additionally have |
| 53 | +<span class="math display">\[\begin{align*} |
| 54 | +l(w_1) + \lambda_1 R(w_1) &\leq l(w_2) + \lambda_1 R(w_2)\\ |
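| 55 | +&\leq l(w_2) + \lambda_1 R(w_1), |
| 56 | +\end{align*}\]</span> |
| 57 | +where the second step uses <span class="math inline">\(R(w_2) \leq R(w_1)\)</span> and <span class="math inline">\(\lambda_1 \geq 0\)</span>. Subtracting <span class="math inline">\(\lambda_1 R(w_1)\)</span> from both sides, we obtain |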
| 58 | +<span class="math display">\[ |
| 59 | +l(w_1) \leq l(w_2). |
| 60 | +\]</span> |
| 61 | +In words, the value of the loss term at the optimum increases (or stays the same) as we increase <span class="math inline">\(\lambda\)</span>.<a href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a></p> |
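|  | +<p>As a concrete check of the three results, here is a small worked example. It rests on assumptions that are not part of the argument above: a one-dimensional quadratic loss <span class="math inline">\(l(w) = (w - a)^2\)</span> with a hypothetical constant <span class="math inline">\(a\)</span>, and the ridge-style penalty <span class="math inline">\(R(w) = w^2\)</span>. In this case <span class="math inline">\(f_\lambda\)</span> is convex, so setting its derivative <span class="math inline">\(2(w - a) + 2\lambda w\)</span> to zero gives the minimizer <span class="math inline">\(w_\lambda\)</span> in closed form: |
|  | +<span class="math display">\[\begin{align*} |
|  | +w_\lambda &= \frac{a}{1 + \lambda},\\ |
|  | +R(w_\lambda) &= \frac{a^2}{(1 + \lambda)^2},\\ |
|  | +l(w_\lambda) &= \frac{a^2 \lambda^2}{(1 + \lambda)^2},\\ |
|  | +f_\lambda(w_\lambda) &= \frac{a^2 \lambda}{1 + \lambda}. |
|  | +\end{align*}\]</span> |
|  | +For <span class="math inline">\(\lambda \geq 0\)</span>, the ratio <span class="math inline">\(\lambda / (1 + \lambda) = 1 - 1/(1 + \lambda)\)</span> is increasing in <span class="math inline">\(\lambda\)</span>, so <span class="math inline">\(f_\lambda(w_\lambda)\)</span> and <span class="math inline">\(l(w_\lambda)\)</span> are non-decreasing while <span class="math inline">\(R(w_\lambda)\)</span> is non-increasing, in line with the three inequalities. When <span class="math inline">\(a = 0\)</span> all three quantities are constant, which is the "stays the same" case.</p> |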
| 62 | +<div class="footnotes"> |
| 63 | +<hr /> |
| 64 | +<ol> |
| 65 | +<li id="fn1"><p>An alternate proof, obtained by adding <span class="math inline">\((1)\)</span> and <span class="math inline">\((2)\)</span>: |
| 66 | +<span class="math display">\[ |
| 67 | +l(w_1) + l(w_2) + \lambda_1 R(w_1) + \lambda_2 R(w_2) \leq l(w_1) + l(w_2) + \lambda_1 R(w_2) + \lambda_2 R(w_1),\\ |
| 68 | +\lambda_1 R(w_1) + \lambda_2 R(w_2) \leq \lambda_1 R(w_2) + \lambda_2 R(w_1),\\ |
| 69 | +(\lambda_2 - \lambda_1) R(w_2) \leq (\lambda_2 - \lambda_1) R(w_1),\\ |
| 70 | +R(w_2) \leq R(w_1). |
| 71 | +\]</span><a href="#fnref1" class="footnote-back">↩︎</a></p></li> |
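| 72 | +<li id="fn2"><p>An alternate proof, valid only when <span class="math inline">\(\lambda_1 > 0\)</span>, obtained by adding <span class="math inline">\(1/\lambda_1\)</span> times <span class="math inline">\((1)\)</span> to <span class="math inline">\(1/\lambda_2\)</span> times <span class="math inline">\((2)\)</span> and then dividing by <span class="math inline">\(1/\lambda_1 - 1/\lambda_2 > 0\)</span>: |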
| 73 | +<span class="math display">\[ |
| 74 | +\frac{l(w_1)}{\lambda_1} + \frac{l(w_2)}{\lambda_2} + R(w_1) + R(w_2) \leq \frac{l(w_2)}{\lambda_1} + \frac{l(w_1)}{\lambda_2} + R(w_2) + R(w_1),\\ |
| 75 | +\frac{l(w_1)}{\lambda_1} + \frac{l(w_2)}{\lambda_2} \leq \frac{l(w_2)}{\lambda_1} + \frac{l(w_1)}{\lambda_2},\\ |
| 76 | +\left(\frac{1}{\lambda_1} - \frac{1}{\lambda_2}\right) l(w_1) \leq \left(\frac{1}{\lambda_1} - \frac{1}{\lambda_2}\right) l(w_2) ,\\ |
| 77 | +l(w_1) \leq l(w_2). |
| 78 | +\]</span><a href="#fnref2" class="footnote-back">↩︎</a></p></li> |
| 79 | +</ol> |
| 80 | +</div> |