Description
@justinormont found that setting the default number of iterations to 10 in the Averaged Perceptron learner would lead to better results.
From: Justin Ormont
Sent: Monday, April 3, 2017 2:52:13 PM
Subject: Re: Move AveragedPerceptron defaults to iter=10
Greetings folks,
I had a chance to run larger datasets, and I think my conclusion holds.
I did a sweep of the 15GB dataset, and the 2.7TB dataset.
Sweep: 1 to 20 iterations. While the sweep is still running, it has finished most of the experiments and the pattern is already clear.
15GB text (note x-axis is number of iterations, not time; y-axis AUC)
Also run (not shown) was FastTreeBinary; its AUC sits below this graph at 89.1%, and it is much, much slower.
2.7TB numeric (note x-axis is number of iterations, not time; y-axis AUC)
It doesn't appear that I've hit overfitting thus far in either dataset. AUC continues to increase from a low at iter=1 (far left) to a high on the right (iter=15).
How does AP iterations affect time?
Time was a bit odd (not a smooth graph) but generally increasing as the number of iterations increases.
15GB text (note x-axis is iteration count, y-axis is time)
Time was almost constant with added iterations (the noise is due to zooming). There's ~5% runtime difference between the fastest and slowest runs on this graph, with 15 iterations being fastest (likely noise).
For 1 iteration: 14,478 sec (4.0 hours)
For 10 iterations: 14,623 sec (4.1 hours)
That's a very sub-linear 1.01x growth from 1 to 10 iterations.
2.7TB numeric (note x-axis is iteration count, y-axis is time)
Sorry, the GUI cuts off the time labels on the left. Times are given below.
For 1 iteration: 111,367 sec (1.3 days);
For 10 iterations: 317,203 sec (3.7 days).
That's a sub-linear 2.8x growth from 1 to 10 iterations.
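The two growth figures quoted above come from a simple ratio of the 10-iteration runtime to the 1-iteration runtime. A minimal sketch, using the timings reported in this thread:

```python
# Sketch: runtime growth factors from the timings reported above.
# 14,478 / 14,623 sec are the 15GB text runs; 111,367 / 317,203 sec
# are the 2.7TB numeric runs.

def growth_factor(time_1_iter_sec: float, time_10_iter_sec: float) -> float:
    """Ratio of 10-iteration runtime to 1-iteration runtime."""
    return time_10_iter_sec / time_1_iter_sec

# 15GB text dataset: nearly flat runtime.
factor_15gb = growth_factor(14_478, 14_623)     # ~1.01x

# 2.7TB numeric dataset: sub-linear but noticeable growth.
factor_2_7tb = growth_factor(111_367, 317_203)  # ~2.85x

print(f"15GB: {factor_15gb:.2f}x, 2.7TB: {factor_2_7tb:.2f}x")
```

Both factors are well under the 10x that linear scaling would predict, which is what "sub-linear" refers to here.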
I think the 15GB text dataset fits fully in memory, which gives it a near-constant runtime vs. iterations; the runtime is likely dominated by another factor, such as text featurization [wild guess]. The 2.7TB dataset had to have caching turned off, so each iteration had to fetch the data from CT01; data fetch time may have dominated [wild guess].
Presented is AUC, as the datasets are binary classification. The accuracy graphs look similar though noisier, which perhaps indicates we should look at how we're setting the binary decision threshold.
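The threshold remark can be made concrete: rather than a fixed cutoff (e.g. score > 0), the decision threshold could be tuned on held-out data. A hypothetical sketch of that idea follows; the function name and the data are illustrative, not from the experiments above:

```python
# Hypothetical sketch: pick the decision threshold that maximizes
# accuracy on a held-out set of (score, label) pairs, instead of
# using a fixed cutoff.

def best_accuracy_threshold(scores, labels):
    """Return (threshold, accuracy) for the accuracy-maximizing cutoff."""
    best_t, best_acc = 0.0, 0.0
    for t in sorted(set(scores)):
        preds = [1 if s >= t else 0 for s in scores]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# Illustrative toy data: a well-separated binary problem.
scores = [-1.2, -0.3, 0.1, 0.4, 0.9, 1.5]
labels = [0, 0, 0, 1, 1, 1]
print(best_accuracy_threshold(scores, labels))  # → (0.4, 1.0)
```

Since AUC is threshold-free while accuracy is not, a poorly chosen default cutoff would show up exactly as the extra noise described above.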
Memory usage
In both datasets, memory usage appears flat (plus noise) as iteration count increases.
Methodology
Both datasets are binary classification and larger than those used in previous experiments with AveragedPerceptron's iteration count. All experiments were run on HPC, with each experiment taking a full node until finished. Data was stored on CT01.
For the 2.7TB numeric dataset, caching, normalization, and shuffling were turned off; caching was disabled because the dataset (2.7TB) was too large to cache.
Conclusion
For AveragedPerceptron, iterations=10 seems to be a reasonable default for these two larger datasets; the "best" value (in terms of AUC/accuracy) apparently hasn't been reached yet and lies above 15 iterations for these.
For 10 iterations, the added duration on the 15GB dataset was negligible, and the 2.7TB dataset took an additional 1.8x runtime.
The 2.7TB dataset gains ~0.2% AUC with 10 iterations (~7% decrease in relative AUC-loss [i.e., 1-AUC]). The 15GB dataset gains ~0.4% AUC with 10 iterations (~4% decrease in relative AUC-loss).
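The "relative AUC-loss" arithmetic above can be sketched directly: the loss is 1 - AUC, and the quoted percentages are the fraction of that loss removed. The AUC values below are illustrative round numbers, not the exact experiment results:

```python
# Sketch of the relative AUC-loss arithmetic: loss is 1 - AUC, and the
# relative decrease is the fraction of that loss removed by the gain.

def relative_loss_decrease(auc_before: float, auc_after: float) -> float:
    """Fraction of AUC-loss (1 - AUC) removed by the improvement."""
    loss_before = 1.0 - auc_before
    loss_after = 1.0 - auc_after
    return (loss_before - loss_after) / loss_before

# E.g. a +0.4% absolute AUC gain on a 90% baseline removes ~4% of the loss:
print(f"{relative_loss_decrease(0.900, 0.904):.2%}")  # → 4.00%
```

This explains why the 2.7TB dataset's smaller absolute gain (+0.2%) is a larger relative improvement (~7%): its baseline AUC-loss is smaller to begin with.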