Abstract
Online evaluation is one of the most common approaches to measuring the effectiveness of an information retrieval system. It involves fielding the information retrieval system to real users and observing these users’ interactions in situ while they engage with the system. This allows actual users with real-world information needs to play an important part in assessing retrieval quality. As such, online evaluation complements the common alternative, offline evaluation approaches, which may provide more easily interpretable outcomes yet are often less realistic when measuring quality and actual user experience.
In this survey, we provide an overview of online evaluation techniques for information retrieval. We show how online evaluation is used for controlled experiments, segmenting them into experiment designs that allow absolute or relative quality assessments. Our presentation of different metrics further partitions online evaluation based on differently sized experimental units commonly of interest: documents, lists, and sessions. Additionally, we include an extensive discussion of recent work on data re-use and on experiment estimation based on historical data.
A substantial part of this work focuses on practical issues: how to run evaluations in practice, how to select experimental parameters, how to take into account the ethical considerations inherent in online evaluations, and the limitations of such evaluations. While most published work on online experimentation today is conducted at large scale in systems with millions of users, we also emphasize that the same techniques can be applied at small scale. To this end, we highlight recent work that makes online evaluation easier to use at smaller scales, and we encourage studying real-world information seeking in a wide range of scenarios. Finally, we present a summary of the most recent work in the area, describe open problems, and postulate future directions.
Details
Published: Jun 22, 2016
Vol/Issue: 10(1)
Pages: 1-117
Cite This Article
Katja Hofmann, Lihong Li, Filip Radlinski (2016). Online Evaluation for Information Retrieval. Foundations and Trends® in Information Retrieval, 10(1), 1-117. https://doi.org/10.1561/1500000051