Proceedings of the Web Conference (WWW)
Web Security · Privacy

Towards Realistic and Reproducible Web Crawl Measurements

Jordan Jueckstock, Shaown Sarker, Peter Snyder, Aidan Beggs, Panagiotis Papadopoulos, Matteo Varvello, Benjamin Livshits, Alexandros Kapravelos
Publication year: 2021
Venue: WWW
Type: Peer-reviewed

Problem

Accurate web measurement is critical for understanding and improving security and privacy online. Implicit in these measurements is the assumption that automated crawls generalize to the experiences of typical web users, despite significant anecdotal evidence to the contrary. Anecdotal evidence suggests that the web behaves differently when approached from well-known measurement endpoints, or with well-known measurement and automation frameworks, for reasons ranging from DDoS detection to hiding malicious behavior to bot detection.

Approach

This work improves the state of web privacy and security by investigating how, and in what ways, privacy and security measurements change when using typical web measurement tools, compared to measurement configurations intentionally designed to match "real" web users. We build a web measurement framework encompassing network endpoints and browser configurations ranging from off-the-shelf defaults commonly used in research studies to configurations more representative of typical web users, and we note the effect of realism factors on security- and privacy-relevant measurements when applied to the Tranco top 25k web domains. We find that web privacy and security measurements are significantly affected by measurement vantage point and browser configuration, and conclude that unless researchers carefully consider if and how their web measurement tools match real-world users, the research community is likely systematically missing important signals.

Results

For example, we find that browser configuration alone can cause shifts in 19% of known ad and tracking domains encountered, and similarly affects the loading frequency of up to 10% of distinct families of JavaScript code units executed. We also find that the choice of measurement network points has similar, though less dramatic, effects on privacy and security measurements. To aid measurement replicability and future web research, we share our dataset and precise measurement configurations.
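
One way to picture a "shift" in known ad and tracking domains between two crawl configurations is a simple set comparison. The sketch below is illustrative only and is not the paper's metric: the function name tracker_shift, the example domains, and the choice of symmetric difference over the union are all assumptions made for this example.

```python
def tracker_shift(default_domains: set[str],
                  realistic_domains: set[str],
                  known_trackers: set[str]) -> float:
    """Fraction of known tracker domains seen in exactly one configuration.

    Hypothetical metric for illustration; the paper may define shifts differently.
    """
    # Restrict each crawl's observed domains to the known tracker list.
    seen_default = default_domains & known_trackers
    seen_realistic = realistic_domains & known_trackers
    seen_any = seen_default | seen_realistic
    if not seen_any:
        return 0.0
    # Symmetric difference: trackers observed by one configuration but not the other.
    return len(seen_default ^ seen_realistic) / len(seen_any)


# Made-up observations from two hypothetical crawl configurations.
default_crawl = {"ads.example", "cdn.example", "track.example"}
realistic_crawl = {"ads.example", "cdn.example", "pixel.example", "beacon.example"}
tracker_list = {"ads.example", "track.example", "pixel.example", "beacon.example"}

shift = tracker_shift(default_crawl, realistic_crawl, tracker_list)
print(f"{shift:.0%} of encountered tracker domains differ between configurations")
```

In this toy input, three of the four tracker domains encountered appear in only one configuration. A real comparison would aggregate over many pages and repeated visits to separate configuration effects from ordinary page-load variance.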

Cite this paper — BibTeX
@InProceedings{jueckstock21crawl,
  title = "Towards Realistic and Reproducible Web Crawl Measurements",
  author = "Jordan Jueckstock and Shaown Sarker and Peter Snyder and Aidan Beggs and Panagiotis Papadopoulos and Matteo Varvello and Benjamin Livshits and Alexandros Kapravelos",
  year = "2021",
  month = apr,
  booktitle = "Proceedings of the Web Conference (WWW)",
}