--------------- Installation: --------------
brew update
brew install python ruby1.9 phantomjs gnuplot
# Phantomjs must be version >= 1.7
(cd analysis && git clone https://github.com/colin-scott/web-page-replay.git wpr)
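A quick way to confirm the version requirement (Python sketch; not part of the repo):

import subprocess

# Check that the installed phantomjs is >= 1.7.
version = subprocess.check_output(["phantomjs", "--version"]).decode().strip()
major, minor = (int(x) for x in version.split(".")[:2])
assert (major, minor) >= (1, 7), "phantomjs %s is too old; need >= 1.7" % version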
-------------- Workflow: ----------------
./data/csv/extract_urls.rb httparchive*pages.csv > ./data/target_urls
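Roughly what the URL extraction does (Python sketch, assuming the httparchive
pages CSV has a `url` column; the real extract_urls.rb may key off a different field):

import csv, sys

# Print the `url` column of an httparchive pages CSV.
with open(sys.argv[1]) as f:
    for row in csv.DictReader(f):
        print(row["url"])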
cd data
# The error log for the web-page-replay process (most recent run) shows up in wpr_log.txt.
# The error log for each phantomjs run shows up in wpr/<b64>.err,
# and also, unfortunately, as random strings in har/<b64>.har.
sudo ./fetch_urls_sequentially.py
cd ..
./analysis/wpr/modify_wpr_delays.py ./data/wpr
# Sanity check, over just the original fetches and wprs:
# You should see 0 URLs in data/filtered_stats/valids.txt at this point, but most of the URLs
# should *not* show up in any of the other data/filtered_stats/ files.
(cd analysis; ./cleanse_hars.rb ../data/har; ./har_processing/filter_bad_pages.rb ../data/har)
# Optionally: refetch any URLs that didn't fetch properly the first time.
(cd data; sudo ./refetch_urls_sequentially.py)
(cd analysis; ./cleanse_hars.rb ../data/har; ./har_processing/filter_bad_pages.rb ../data/har)
# For a parallel version of replay, see ./replay_vms/README.md
cd data
sudo ./replay_urls_sequentially.py ./wpr
cd ../analysis
./cleanse_hars.rb ../data/replay
# Run the sanity check again, this time over replays as well
# TODO(cs): split filter_bad_pages into two scripts so we don't repeat the
# sanity checks for wprs.
./har_processing/filter_bad_pages.rb ../data/har
# Graphs:
# First, filter down to the valid originals and their replays
cut -d ' ' -f 2 ../data/filtered_stats/valids.txt > valid_originals.desktop.txt
./har_processing/get_replays_from_originals.rb \
valid_originals.desktop.txt > valid_replays.desktop.txt
# Cacheable bytes
./har_processing/extract_cacheable_bytes.rb \
valid_originals.desktop.txt | sort | uniq > cacheable_bytes.desktop.dat
(cd ./graphs/cacheable_bytes/; ./generate_cdf_from_raw_data.sh ../../cacheable_bytes.desktop.dat)
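A rough Python sketch of the cacheable-bytes computation, assuming "cacheable"
means a response whose Cache-Control headers allow reuse (the real
extract_cacheable_bytes.rb may apply a different policy):

import json, sys

def is_cacheable(headers):
    # Heuristic: reusable unless Cache-Control forbids it.
    cc = next((h["value"].lower() for h in headers
               if h["name"].lower() == "cache-control"), "")
    return "no-store" not in cc and "no-cache" not in cc and "max-age=0" not in cc

# Sum the body sizes of cacheable responses in one HAR file.
total = 0
with open(sys.argv[1]) as f:
    har = json.load(f)
for entry in har["log"]["entries"]:
    resp = entry["response"]
    if is_cacheable(resp["headers"]):
        total += max(resp.get("bodySize", 0), 0)
print(total)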
# Total bytes
(cd ./graphs/total_bytes/; ./generate_cdf_from_raw_data.sh ../../cacheable_bytes.desktop.dat)
# Sanity check PLT difference between original and replays
# (sanity check replays that took too long)
./har_processing/extract_plt.rb \
valid_originals.desktop.txt > original_plts.desktop.dat
./har_processing/extract_plt.rb \
valid_replays.desktop.txt > replay_plts.desktop.dat
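The page load time lives in the HAR's pageTimings (per the HAR 1.2 spec linked
below); a minimal Python sketch of the extraction (extract_plt.rb itself may
report things differently):

import json, sys

# Print each page's onLoad time (ms) from a HAR file.
with open(sys.argv[1]) as f:
    har = json.load(f)
for page in har["log"]["pages"]:
    print(page["id"], page["pageTimings"]["onLoad"])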
./graphs/original_vs_replay/compare.rb \
original_plts.desktop.dat replay_plts.desktop.dat > plt_comparison.desktop.dat
(cd ./graphs/plt_comparison/; ./generate_cdf_from_raw_data.sh ../../plt_comparison.desktop.dat)
less plt_comparison.desktop.dat
# Raw PLTs
# (Not currently included in the paper, AFAICT?)
egrep '.pc.1.har ' replay_plts.desktop.dat > perfect_cache_replay_plts.desktop.dat
egrep -v '.pc.1.har ' replay_plts.desktop.dat > unmodified_replay_plts.desktop.dat
(cd ./graphs/plts/; ./generate_cdf_from_raw_data.sh ../../unmodified_replay_plts.desktop.dat ../../perfect_cache_replay_plts.desktop.dat)
# PLT Reduction
./graphs/percent_plt_reduction/compute_median_reduction.rb \
replay_plts.desktop.dat > median_plt_reduction.desktop.dat 2>plt_got_worse.desktop.dat
(cd ./graphs/percent_plt_reduction/; ./generate_cdf_from_raw_data.sh ../../median_plt_reduction.desktop.dat)
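Roughly the math compute_median_reduction.rb is assumed to do (unverified):
percent reduction = (unmodified PLT - perfect-cache PLT) / unmodified PLT * 100,
with medians taken over each page's replay runs. Input lines are assumed to be
"<har name> <plt>", perfect-cache runs named *.pc.1.har (see the egreps above):

import sys
from collections import defaultdict
from statistics import median

# Bucket PLTs by page, separating unmodified from perfect-cache runs.
plain, pc = defaultdict(list), defaultdict(list)
for line in open(sys.argv[1]):
    name, plt = line.split()
    key = name.replace(".pc.1.har", ".har")
    (pc if name.endswith(".pc.1.har") else plain)[key].append(float(plt))

for key in sorted(plain):
    if key in pc and median(plain[key]) > 0:
        print(key, 100.0 * (median(plain[key]) - median(pc[key])) / median(plain[key]))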
# Ratio Cacheable bytes: PLT Reduction
./graphs/ratio_bytes_to_reduction/compute_ratio.rb \
median_plt_reduction.desktop.dat \
cacheable_bytes.desktop.dat > ratio.desktop.dat
(cd ./graphs/ratio_bytes_to_reduction/; ./generate_cdf_from_raw_data.sh ../../ratio.desktop.dat)
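compute_ratio.rb presumably joins the two "<key> <value>" files on key; a
Python sketch, with the direction of the ratio (cacheable bytes per point of
PLT reduction) being a guess:

import sys

def load(path):
    # Parse a "<key> <value>" .dat file into a dict.
    return dict((k, float(v)) for k, v in (line.split() for line in open(path)))

reduction = load(sys.argv[1])  # median_plt_reduction.desktop.dat
cbytes = load(sys.argv[2])     # cacheable_bytes.desktop.dat
for key in reduction:
    if key in cbytes and reduction[key] != 0:
        print(key, cbytes[key] / reduction[key])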
---- To compare desktop vs mobile results ----
# ./intersect_keys.rb median_plt_reduction.desktop.dat median_plt_reduction.dat
(cd ./graphs/percent_plt_reduction/; ./generate_cdf_from_raw_data.sh ../../median_plt_reduction.desktop.dat ../../median_plt_reduction.dat)
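intersect_keys.rb presumably keeps only the entries whose keys appear in both
files; a Python sketch:

import sys

# Print lines of the second file whose first field also appears
# as a first field in the first file.
keys = set(line.split()[0] for line in open(sys.argv[1]))
for line in open(sys.argv[2]):
    if line.split()[0] in keys:
        sys.stdout.write(line)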
---------------- TODO: ------------------
- Even though Jamshed and I had the same target list, the URLs in
valid_urls.txt and all of the .dat files differ. I suspect this is because
of redirects, i.e., one of us is logging the redirect URL rather than the
original fetch URL. Should investigate.
- Figure out why fetches failed, even on c5.
-> 131 fetches caused phantomjs to crash.
-> 160 replays caused phantomjs to crash.
-> A fair number of CannotAllocateMemory errors, due to runaway processes?
- Don't have telemetry write to a single log file for all fetches.
- When fetching, if the wpr archive is empty (4KB) or the HAR fails, raise an error.
- It doesn't make sense that the wpr archive would be non-empty, yet the original
  HAR is empty...
- Balance data
- Increase parallelization
- Understand why replay (broken_wpr_log.txt) took 30 sec, whereas original
took ~1 sec.
- Graph difference between original PLT and replay PLT?
---------------- Questions: ------------------
Would it be advantageous to have PhantomJS emulate a particular browser? (Is
that possible?)
Do we have an estimate of what cache hit ratio prefetch/preconnect achieve in
practice? Obviously not the 100% we're assuming for the analysis.
*********** OLD (httparchive.org) NOTES **************
It's the `requests` table that includes response headers, not the `pages`
table.
The `stats` (and `pages`?) table has the original onLoad time.
Small problem: many of the HAR files aren't fetchable from httparchive.org.
The web server responds with a 200, but no body.
May not be able to get historical data for before 2012:
"Jul 1 2012 - The HTTP
Archive Mobile switched from running on the Blaze.io framework to WebPagetest.
This involved several changes including new iPhone hardware (although both
used iOS5) and a new geo location (from Toronto to Redwood City)."
-------------- Questions: --------------
Does WPR delay DNS requests?
Does WPR insert the original delays by default? -> Yes, but the original
execution isn't that close to the replayed one. Although, when repeatedly
replaying, the results are fairly consistent. Plan: get the original PLT by
replaying, not by mining it directly from the HAR. This also solves the IE
problem.
Does HAR have enough of the response body content to be able to replay? -> Not
sure about Flash/Video.
Are the HTTP Archives complete enough to be able to replay with WPR? The email
response said something about JavaScript injection (deterministic.js).
-------------- Implementation notes: --------------
- Might need to get webkit-headless running. (Although, it was IE that did the
original web page loads?)
http://dsalzr.wordpress.com/2012/08/10/theres-still-no-perfect-headless-browser/
http://stackoverflow.com/questions/9210765/any-way-to-start-google-chrome-in-headless-mode
There's also WPT.
- It is possible to get individual HAR files, by issuing queries to:
http://httparchive.webpagetest.org/export.php?test=$wptid&run=$wptrun&bodies=1&pretty=0
using the wptid and wptrun values in the Pages table (see the Python sketch
after this list).
- Need to join the `time` field of the original request with the `time` field of
the responses to get the relative timings.
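For example (Python sketch; wptid/wptrun below are hypothetical placeholders,
the real values come from the Pages table):

import urllib.request

# Fetch a single HAR from the export endpoint above and save it to disk.
wptid, wptrun = "YOUR_WPTID", 1
url = ("http://httparchive.webpagetest.org/export.php"
       "?test=%s&run=%d&bodies=1&pretty=0" % (wptid, wptrun))
with urllib.request.urlopen(url) as resp:
    open("%s.%d.har" % (wptid, wptrun), "wb").write(resp.read())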
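I.e., subtract the original request's `time` from each response's `time`
(a sketch; field names and units are assumptions):

def relative_timings(request_time, response_times):
    # Offsets of each response relative to the original request.
    return [t - request_time for t in response_times]

print(relative_timings(1000, [1050, 1200, 1800]))  # -> [50, 200, 800]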
--------------- Workflow ---------------------
./data/fetch_har_files.rb /path/to/httparchive_pages.csv
./analysis/modify_all_hars.sh data/har/*har
./analysis/convert_all_hars.sh data/har/*har
for F in data/wpr/*; do
  sudo ./analysis/wpr/replay.py --use_server_delays $F
  # Load webpage
done
-------------- Tools: --------------
WPR:
https://github.com/chromium/web-page-replay
HTTP Archive Downloads:
http://httparchive.org/downloads.php
HAR format:
http://www.softwareishard.com/blog/har-12-spec/
Existing BigQuery queries:
http://bigqueri.es/t/getting-started-with-bigquery-http-archive/2
Query on cache-control policy:
http://bigqueri.es/t/cache-control-response-policy-of-html-documents/310