
Verification / evolution of "Internet Jones" paper #26 #93

analyses/2009_03_noahwalugembe__Internet-Jones-evolution.txt
Verification / evolution of "Internet Jones" paper #26


Introduction

Third-party web tracking is the practice by which entities (“trackers”) embedded in webpages re-identify users as they browse the web, collecting information about the websites that they visit. According to Lerner, Simpson, Kohno, and Roesner (2016), web tracking is typically done for the purposes of website analytics, targeted advertising, and other forms of personalization (e.g., social media content). In this work I evaluate the contribution of the "Internet Jones" paper, starting with its insight on TrackingExcavator and its longitudinal measurement study of third-party cookie-based web tracking via the Wayback Machine. I will also show how the third-party web tracking ecosystem has evolved since its beginnings, according to the "Internet Jones" paper.

TrackingExcavator

The Wayback Machine contains archives of full webpages, including JavaScript, stylesheets, and embedded resources, dating back to 1996. To leverage this archive, Lerner, Simpson, Kohno, and Roesner (2016) designed and implemented a retrospective tracking detection and analysis platform called TrackingExcavator, which allowed them to conduct a longitudinal study of third-party tracking from 1996 to the present (2016). TrackingExcavator logs in-browser behaviors related to web tracking, including: third-party requests, cookies attached to requests, cookies programmatically set by JavaScript, and the use of other relevant JavaScript APIs (e.g., HTML5 LocalStorage and APIs used in browser fingerprinting, such as enumerating installed plugins). TrackingExcavator runs on both live and archived versions of websites.
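
To make that logging concrete, here is a minimal sketch (not the authors' implementation) of recording third-party requests, the cookies attached to them, and the cookies set during a visit, using the Playwright browser automation library in Python. The target site and the suffix-based third-party test are illustrative assumptions; a real study would use a proper public-suffix check.

from urllib.parse import urlparse

from playwright.sync_api import sync_playwright

SITE = "https://example.com"  # hypothetical measurement target


def log_tracking_events(site):
    first_party = urlparse(site).hostname or ""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        context = browser.new_context()
        page = context.new_page()

        # Log every third-party request and whether a cookie rode along.
        # (The endswith() test is a naive stand-in for an eTLD+1 comparison.)
        def on_request(request):
            host = urlparse(request.url).hostname or ""
            if host and not host.endswith(first_party):
                cookie = request.headers.get("cookie")
                print("third-party request:", host, "| cookie sent:", bool(cookie))

        page.on("request", on_request)
        page.goto(site)

        # Cookies present after the visit, whether set by headers or by JavaScript.
        for c in context.cookies():
            print("cookie set:", c["domain"], c["name"])
        browser.close()


log_tracking_events(SITE)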

Wayback Machine

Lerner, Simpson, Kohno, and Roesner (2016) reported that "The Wayback Machine provides a unique and comprehensive source of historical web data. However, it was not created for the purpose of studying third-party web tracking and is thus imperfect for that use." Nevertheless, they stated that "the only way to study web tracking prior to explicit measurements targeting it is to leverage materials previously archived for other purposes," which is a sound approach: building on existing materials is better than reinventing everything from scratch. Lerner, Simpson, Kohno, and Roesner (2016) also mention that the "Wayback Machine may fail to archive resources for any number of reasons. For example, the domain serving a certain resource may have been unavailable at the time of the archive, or changes in the Wayback Machine’s crawler may result in different archiving behaviors over time."
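
Such archive gaps can be probed directly. The sketch below (a simple illustration, not part of TrackingExcavator) asks the Wayback Machine's public availability API whether a snapshot of a page exists near a given date; the example URL and timestamp are arbitrary.

import json
from urllib.parse import urlencode
from urllib.request import urlopen


def closest_snapshot(url, timestamp):
    # Return the archived snapshot URL closest to timestamp (YYYYMMDD), if any.
    query = urlencode({"url": url, "timestamp": timestamp})
    with urlopen("https://archive.org/wayback/available?" + query) as resp:
        data = json.load(resp)
    closest = data.get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest and closest.get("available") else None


# A page missing from the archive yields None rather than a snapshot URL.
print(closest_snapshot("http://example.com", "20000101"))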

Contributor comment:

Again, when you are quoting work it is very important to use quotation marks and citations to make it clear that you have used the original authors' words. Alternatively, you can re-write conclusions in your own words. Here are passages from the original paper that are, in my eyes, too close for you to claim as your own words:

Page 8, Sec. 4.1

The Wayback Machine may fail to archive resources for any number of reasons. For example, the domain serving a certain resource may have been unavailable at the time of the archive, or changes in the Wayback Machine’s crawler may result in different archiving behaviors over time. As shown in Table 2, missing archives are rare.

Though the Wayback Machine’s archived pages execute the corresponding archived JavaScript within the browser when TrackingExcavator visits them, the Wayback Machine does not execute JavaScript during its archival crawls of the web. Instead, it attempts to statically extract URLs from HTML and JavaScript to find additional sites to archive. It then modifies the archived JavaScript, rewriting the URLs in the included script to point to the archived copy of the resource. This process may fail, particularly for dynamically generated URLs. As a result, when TrackingExcavator visits archived pages, dynamically generated URLs not properly redirected to their archived versions will cause the page to attempt to make a request to the live web, i.e., “escape” the archive. TrackingExcavator blocks such escapes (see Section 3). As a result, the script never runs on the archived site, never sets a cookie or leaks it, and thus TrackingExcavator does not witness the associated tracking behavior.

As others have documented [10], embedded resources in a webpage archived by the Wayback Machine may occasionally have a timestamp far from the timestamp of the top-level page.

Any of the above failures can lead to cascading failures, in that non-archived responses or blocked requests will result in the omission of any subsequent requests or cookie setting events that would have resulted from the success of the original request. The “wake” of a single failure cannot be measured within an archival dataset, because events following that failure are simply missing. To study the effect of these cascading failures, we must compare an archival run to a live run from the same time; we do so in the next subsection.

Longitudinal measurement study

After evaluating the Wayback Machine’s suitability for the task, Lerner, Simpson, Kohno, and Roesner (2016) explored how the web tracking ecosystem changed over time, including the prevalence of different web tracking behaviors, the identities and scope of popular trackers, and the complexity of relationships within the ecosystem. The "Internet Jones" paper observed the rise and fall of important players in the ecosystem, such as Google Analytics. It also noted that websites contacted an increasing number of third parties over time, and that the top trackers could track users across an increasing percentage of the most popular websites.
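
One way to picture the "top tracker" observation is as the fraction of measured sites on which a given third-party domain appears in each year. The toy sketch below computes that coverage from fabricated illustrative data; it is a reading aid, not the paper's methodology.

from collections import defaultdict

# year -> site -> third-party domains observed on that site (fabricated data)
observations = {
    2000: {"siteA": {"doubleclick.net"}, "siteB": {"doubleclick.net"}},
    2011: {"siteA": {"google-analytics.com", "doubleclick.net"},
           "siteB": {"google-analytics.com"}},
}


def tracker_coverage(obs):
    # For each year, map each tracker to the fraction of sites it appears on.
    coverage = {}
    for year, sites in obs.items():
        counts = defaultdict(int)
        for third_parties in sites.values():
            for tracker in third_parties:
                counts[tracker] += 1
        coverage[year] = {t: n / len(sites) for t, n in counts.items()}
    return coverage


for year, shares in sorted(tracker_coverage(observations).items()):
    top = max(shares, key=shares.get)
    print(year, "top tracker:", top, "({:.0%} of sites)".format(shares[top]))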


Conclusion

All "Internet Jones" paper findings show that third-party web tracking is a rapidly growing practice in an increasingly complex ecosystem— suggesting that users’ and policymakers’ concerns about privacy require sustained, and perhaps increasing, attention. The Internet Jones" paper #26 research results also provide hitherto unavailable historical context for today’s technical and policy discussions. It is also stated The Internet Jones paper notes that "the Wayback Machine provides a unique and comprehensive source of historical web data. However, it was not created for the purpose of studying third-party web-tracking and is thus imperfect for that use."


Reference

Lerner, A., Simpson, A. K., Kohno, T., & Roesner, F. (2016). Internet Jones and the Raiders of the Lost Trackers: An archaeological study of web tracking from 1996 to 2016. In Proceedings of the 25th USENIX Security Symposium (USENIX Security 16). Retrieved from https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/lerner