--------------- Installation: --------------
brew update
brew install python ruby1.9 phantomjs gnuplot
# Phantomjs must be version >= 1.7
(cd analysis && git clone https://github.com/colin-scott/web-page-replay.git wpr)
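A quick way to confirm the version requirement (Python sketch; not part of the repo):

import subprocess

# Check that the installed phantomjs is >= 1.7.
version = subprocess.check_output(["phantomjs", "--version"]).decode().strip()
major, minor = (int(x) for x in version.split(".")[:2])
assert (major, minor) >= (1, 7), "phantomjs %s is too old; need >= 1.7" % version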
-------------- Workflow: ----------------
./data/csv/extract_urls.rb httparchive*pages.csv > ./data/target_urls
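Roughly what the URL extraction does (Python sketch, assuming the httparchive
pages CSV has a `url` column; the real extract_urls.rb may key off a different field):

import csv, sys

# Print the `url` column of an httparchive pages CSV.
with open(sys.argv[1]) as f:
    for row in csv.DictReader(f):
        print(row["url"])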
cd data
# The error log for the web-page-replay process (most recent run) shows up in wpr_log.txt.
# The error log for each phantomjs run shows up in wpr/<b64>.err,
# and also, unfortunately, as random strings in har/<b64>.har.
sudo ./fetch_urls_sequentially.py
cd ..
./analysis/wpr/modify_wpr_delays.py ./data/wpr
# Sanity check, over just the original fetches and wprs:
# You should see 0 URLs in data/filtered_stats/valids.txt at this point, but most of the URLs
# should *not* show up in any of the other data/filtered_stats/ files.
(cd analysis; ./cleanse_hars.rb ../data/har; ./har_processing/filter_bad_pages.rb ../data/har)
# Optionally: refetch any URLs that didn't fetch properly the first time.
(cd data; sudo ./refetch_urls_sequentially.py)
(cd analysis; ./cleanse_hars.rb ../data/har; ./har_processing/filter_bad_pages.rb ../data/har)
# For a parallel version of replay, see ./replay_vms/README.md
cd data
sudo ./replay_urls_sequentially.py ./wpr
cd ../analysis
./cleanse_hars.rb ../data/replay
# Run the sanity check again, this time over replays as well
# TODO(cs): split filter_bad_pages into two scripts so we don't repeat the
# sanity checks for wprs.
./har_processing/filter_bad_pages.rb ../data/har
# Graphs:
# First, filter down to the valid originals and their replays
cut -d ' ' -f 2 ../data/filtered_stats/valids.txt > valid_originals.desktop.txt
./har_processing/get_replays_from_originals.rb \
valid_originals.desktop.txt > valid_replays.desktop.txt
# Cacheable bytes
./har_processing/extract_cacheable_bytes.rb \
valid_originals.desktop.txt | sort | uniq > cacheable_bytes.desktop.dat
(cd ./graphs/cacheable_bytes/; ./generate_cdf_from_raw_data.sh ../../cacheable_bytes.desktop.dat)
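A rough Python sketch of the cacheable-bytes computation, assuming "cacheable"
means a response whose Cache-Control headers allow reuse (the real
extract_cacheable_bytes.rb may apply a different policy):

import json, sys

def is_cacheable(headers):
    # Heuristic: reusable unless Cache-Control forbids it.
    cc = next((h["value"].lower() for h in headers
               if h["name"].lower() == "cache-control"), "")
    return "no-store" not in cc and "no-cache" not in cc and "max-age=0" not in cc

# Sum the body sizes of cacheable responses in one HAR file.
total = 0
with open(sys.argv[1]) as f:
    har = json.load(f)
for entry in har["log"]["entries"]:
    resp = entry["response"]
    if is_cacheable(resp["headers"]):
        total += max(resp.get("bodySize", 0), 0)
print(total)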
# Total bytes
(cd ./graphs/total_bytes/; ./generate_cdf_from_raw_data.sh ../../cacheable_bytes.desktop.dat)
# Sanity check PLT difference between original and replays
# (sanity check replays that took too long)
./har_processing/extract_plt.rb \
valid_originals.desktop.txt > original_plts.desktop.dat
./har_processing/extract_plt.rb \
valid_replays.desktop.txt > replay_plts.desktop.dat
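The page load time lives in the HAR's pageTimings (per the HAR 1.2 spec linked
below); a minimal Python sketch of the extraction (extract_plt.rb itself may
report things differently):

import json, sys

# Print each page's onLoad time (ms) from a HAR file.
with open(sys.argv[1]) as f:
    har = json.load(f)
for page in har["log"]["pages"]:
    print(page["id"], page["pageTimings"]["onLoad"])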
./graphs/original_vs_replay/compare.rb \
original_plts.desktop.dat replay_plts.desktop.dat > plt_comparison.desktop.dat
(cd ./graphs/plt_comparison/; ./generate_cdf_from_raw_data.sh ../../plt_comparison.desktop.dat)
less plt_comparison.desktop.dat
# Raw PLTs
# (Not currently included in the paper, AFAICT?)
egrep '.pc.1.har ' replay_plts.desktop.dat > perfect_cache_replay_plts.desktop.dat
egrep -v '.pc.1.har ' replay_plts.desktop.dat > unmodified_replay_plts.desktop.dat
(cd ./graphs/plts/; ./generate_cdf_from_raw_data.sh ../../unmodified_replay_plts.desktop.dat ../../perfect_cache_replay_plts.desktop.dat)
# PLT Reduction
./graphs/percent_plt_reduction/compute_median_reduction.rb \
replay_plts.desktop.dat > median_plt_reduction.desktop.dat 2>plt_got_worse.desktop.dat
(cd ./graphs/percent_plt_reduction/; ./generate_cdf_from_raw_data.sh ../../median_plt_reduction.desktop.dat)
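Roughly the math compute_median_reduction.rb is assumed to do (unverified):
percent reduction = (unmodified PLT - perfect-cache PLT) / unmodified PLT * 100,
with medians taken over each page's replay runs. Input lines are assumed to be
"<har name> <plt>", perfect-cache runs named *.pc.1.har (see the egreps above):

import sys
from collections import defaultdict
from statistics import median

# Bucket PLTs by page, separating unmodified from perfect-cache runs.
plain, pc = defaultdict(list), defaultdict(list)
for line in open(sys.argv[1]):
    name, plt = line.split()
    key = name.replace(".pc.1.har", ".har")
    (pc if name.endswith(".pc.1.har") else plain)[key].append(float(plt))

for key in sorted(plain):
    if key in pc and median(plain[key]) > 0:
        print(key, 100.0 * (median(plain[key]) - median(pc[key])) / median(plain[key]))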
# Ratio Cacheable bytes: PLT Reduction
./graphs/ratio_bytes_to_reduction/compute_ratio.rb \
median_plt_reduction.desktop.dat \
cacheable_bytes.desktop.dat > ratio.desktop.dat
(cd ./graphs/ratio_bytes_to_reduction/; ./generate_cdf_from_raw_data.sh ../../ratio.desktop.dat)
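compute_ratio.rb presumably joins the two "<key> <value>" files on key; a
Python sketch, with the direction of the ratio (cacheable bytes per point of
PLT reduction) being a guess:

import sys

def load(path):
    # Parse a "<key> <value>" .dat file into a dict.
    return dict((k, float(v)) for k, v in (line.split() for line in open(path)))

reduction = load(sys.argv[1])  # median_plt_reduction.desktop.dat
cbytes = load(sys.argv[2])     # cacheable_bytes.desktop.dat
for key in reduction:
    if key in cbytes and reduction[key] != 0:
        print(key, cbytes[key] / reduction[key])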
---- To compare desktop vs mobile results ----
# ./intersect_keys.rb median_plt_reduction.desktop.dat median_plt_reduction.dat
(cd ./graphs/percent_plt_reduction/; ./generate_cdf_from_raw_data.sh ../../median_plt_reduction.desktop.dat ../../median_plt_reduction.dat)
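intersect_keys.rb presumably keeps only the entries whose keys appear in both
files; a Python sketch:

import sys

# Print lines of the second file whose first field also appears
# as a first field in the first file.
keys = set(line.split()[0] for line in open(sys.argv[1]))
for line in open(sys.argv[2]):
    if line.split()[0] in keys:
        sys.stdout.write(line)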
---------------- TODO: ------------------
- Even though Jamshed and I had the same target list, the URLs in
valid_urls.txt and all of the .dat files differ. I suspect this is because
of redirects, i.e., one of us is logging the redirect URL rather than the
original fetch URL. Should investigate.
- Figure out why fetches failed, even on c5.
-> 131 fetches caused phantomjs to crash.
-> 160 replays caused phantomjs to crash.
-> A fair number of CannotAllocateMemory errors, due to runaway processes?
- Don't have telemetry write to a single log file for all fetches.
- When fetching, if the wpr archive is empty (4KB) or the HAR fails, raise an error.
- It doesn't make sense that the wpr archive would be non-empty, yet the original
  HAR is empty...
- Balance data
- Increase parallelization
- Understand why replay (broken_wpr_log.txt) took 30 sec, whereas original
took ~1 sec.
- Graph difference between original PLT and replay PLT?
---------------- Questions: ------------------
Would it be advantageous to have PhantomJS emulate a particular browser? (Is
that possible?)
Do we have an estimate of what cache hit ratio prefetch/preconnect achieve in
practice? Obviously not the 100% we're assuming for the analysis.
*********** OLD (httparchive.org) NOTES **************
It's the `requests` table that includes response headers, not the `pages`
table.
The `stats` (and `pages`?) table has the original onLoad time.
Small problem: many of the HAR files aren't fetchable from httparchive.org.
The web server responds with a 200, but no body.
May not be able to get historical data for before 2012:
"Jul 1 2012 - The HTTP
Archive Mobile switched from running on the Blaze.io framework to WebPagetest.
This involved several changes including new iPhone hardware (although both
used iOS5) and a new geo location (from Toronto to Redwood City)."
-------------- Questions: --------------
Does WPR delay DNS requests?
Does WPR insert the original delays by default? -> Yes, but the original
execution isn't that close to the replayed one. Although, when repeatedly
replaying, the results are fairly consistent. Plan: get the original PLT by
replaying, not by mining it directly from the HAR. This also solves the IE
problem.
Does HAR have enough of the response body content to be able to replay? -> Not
sure about Flash/Video.
Are the HTTP Archives complete enough to be able to replay with WPR? The email
response said something about JavaScript injection (deterministic.js).
-------------- Implementation notes: --------------
- Might need to get webkit-headless running. (Although, it was IE that did the
original web page loads?)
http://dsalzr.wordpress.com/2012/08/10/theres-still-no-perfect-headless-browser/
http://stackoverflow.com/questions/9210765/any-way-to-start-google-chrome-in-headless-mode
There's also WPT.
- It is possible to get individual HAR files, by issuing queries to:
http://httparchive.webpagetest.org/export.php?test=$wptid&run=$wptrun&bodies=1&pretty=0
using the wptid and wptrun values in the Pages table (see the Python sketch
after this list).
- Need to join the `time` field of the original request with the `time` field of
the responses to get the relative timings.
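For example (Python sketch; wptid/wptrun below are hypothetical placeholders,
the real values come from the Pages table):

import urllib.request

# Fetch a single HAR from the export endpoint above and save it to disk.
wptid, wptrun = "YOUR_WPTID", 1
url = ("http://httparchive.webpagetest.org/export.php"
       "?test=%s&run=%d&bodies=1&pretty=0" % (wptid, wptrun))
with urllib.request.urlopen(url) as resp:
    open("%s.%d.har" % (wptid, wptrun), "wb").write(resp.read())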
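I.e., subtract the original request's `time` from each response's `time`
(a sketch; field names and units are assumptions):

def relative_timings(request_time, response_times):
    # Offsets of each response relative to the original request.
    return [t - request_time for t in response_times]

print(relative_timings(1000, [1050, 1200, 1800]))  # -> [50, 200, 800]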
--------------- Workflow ---------------------
./data/fetch_har_files.rb /path/to/httparchive_pages.csv
./analysis/modify_all_hars.sh data/har/*har
./analysis/convert_all_hars.sh data/har/*har
for F in data/wpr/*; do
  sudo ./analysis/wpr/replay.py --use_server_delays $F
  # Load webpage
done
-------------- Tools: --------------
WPR:
https://github.com/chromium/web-page-replay
HTTP Archive Downloads:
http://httparchive.org/downloads.php
HAR format:
http://www.softwareishard.com/blog/har-12-spec/
Existing BigQuery queries:
http://bigqueri.es/t/getting-started-with-bigquery-http-archive/2
Query on cache-control policy:
http://bigqueri.es/t/cache-control-response-policy-of-html-documents/310