Lectures

Lecture I: Course Introduction

We will highlight the basic structure and major topics of this course, and go over some logistical issues and course requirements.

  • Day 1: Course Policy (slides, PDF)

  • Day 2: Introduction (slides, PDF)

    • Bush, Vannevar. "As we may think." The Atlantic Monthly 176, no. 1 (1945): 101-108. (PDF)

Lecture II: Search Engine Architecture

We will briefly discuss the basic building blocks of a modern search engine system, including the web crawler, the inverted index, and query processing.

  • Day 1: Basic search engine architecture (slides, PDF)

    • Brin, Sergey, and Lawrence Page. "The anatomy of a large-scale hypertextual Web search engine." Computer networks and ISDN systems 30, no. 1 (1998): 107-117. (HTML)
    • Singhal, Amit. "Modern information retrieval: A brief overview." IEEE Data Eng. Bull. 24, no. 4 (2001): 35-43. (PDF)
    • Broder, Andrei. "A taxonomy of web search." In ACM Sigir forum, vol. 36, no. 2, pp. 3-10. ACM, 2002. (PDF)
  • Day 2: Web crawling and basic text processing techniques (slides, PDF)

    • Olston, Christopher, and Marc Najork. "Web crawling." Foundations and Trends in Information Retrieval 4, no. 3 (2010): 175-246. (PDF)
    • Abiteboul, Serge, Mihai Preda, and Gregory Cobena. "Adaptive on-line page importance computation." In Proceedings of the 12th international conference on World Wide Web, pp. 280-290. ACM, 2003. (PDF)
    • Rendle, Steffen, Christoph Freudenthaler, and Lars Schmidt-Thieme. "Factorizing personalized markov chains for next-basket recommendation." In Proceedings of the 19th international conference on World wide web, pp. 811-820. ACM, 2010. (PDF)
    • Cho, Junghoo, Hector Garcia-Molina, and Lawrence Page. "Efficient crawling through URL ordering." Computer Networks and ISDN Systems 30, no. 1 (1998): 161-172. (HTML)
    • Shkapenyuk, Vladislav, and Torsten Suel. "Design and implementation of a high-performance distributed web crawler." In Data Engineering, 2002. Proceedings. 18th International Conference on, pp. 357-368. IEEE, 2002. (PDF)
    • Chakrabarti, Soumen, Byron Dom, Prabhakar Raghavan, Sridhar Rajagopalan, David Gibson, and Jon Kleinberg. "Automatic resource compilation by analyzing hyperlink structure and associated text." Computer Networks and ISDN Systems 30, no. 1 (1998): 65-74. (HTML)
    • Hull, David A. "Stemming algorithms: A case study for detailed evaluation." JASIS 47, no. 1 (1996): 70-84. (PDF)
    • Xu, Jinxi, and W. Bruce Croft. "Corpus-based stemming using cooccurrence of word variants." ACM Transactions on Information Systems (TOIS) 16, no. 1 (1998): 61-81. (PDF)
  • Day 3: Inverted Index and Query processing (slides, PDF)

    • Cutting, Doug, and Jan Pedersen. "Optimization for dynamic inverted index maintenance." In Proceedings of the 13th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 405-411. ACM, 1989. (PDF)
    • Zobel, Justin, and Alistair Moffat. "Inverted files for text search engines." ACM computing surveys (CSUR) 38, no. 2 (2006): 6. (PDF)
    • Scholer, Falk, Hugh E. Williams, John Yiannis, and Justin Zobel. "Compression of inverted indexes for fast query evaluation." In Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 222-229. ACM, 2002. (PDF)
    • Yan, Hao, Shuai Ding, and Torsten Suel. "Inverted index compression and query processing with optimized document ordering." In Proceedings of the 18th international conference on World wide web, pp. 401-410. ACM, 2009. (PDF)
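
To make the inverted index and query processing topics of Day 3 more concrete, here is a minimal sketch rather than a reference implementation. It assumes a naive whitespace tokenizer and a tiny in-memory collection; the function names (tokenize, build_inverted_index, intersect, boolean_and) and the sample documents are invented for illustration and do not come from the course materials.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to a sorted list of the document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def intersect(p1, p2):
    """Merge two sorted posting lists, keeping IDs present in both."""
    i, j, out = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out


def boolean_and(index, query):
    """Answer a conjunctive (AND) query by intersecting posting lists."""
    lists = [index.get(term, []) for term in tokenize(query)]
    if not lists:
        return []
    # Start from the shortest list so intermediate results stay small.
    lists.sort(key=len)
    result = lists[0]
    for postings in lists[1:]:
        result = intersect(result, postings)
    return result


# Hypothetical three-document collection, for illustration only.
docs = [
    "web crawler fetches pages",
    "inverted index maps terms to pages",
    "query processing uses the inverted index",
]
index = build_inverted_index(docs)
print(boolean_and(index, "inverted index"))  # [1, 2]
```

Production indexes additionally store term frequencies and positions and compress the posting lists, which is the subject of the compression readings above.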

Lecture III: Retrieval Evaluation

Assessing the quality of a deployed system is essential for retrieval system development. Many different measures for evaluating the performance of information retrieval systems have been proposed. We will discuss both classical evaluation metrics, e.g., Mean Average Precision, and modern advances, e.g., interleaving.

  • Day 1: Classic IR evaluations (slides, PDF)

    • Järvelin, Kalervo, and Jaana Kekäläinen. "IR evaluation methods for retrieving highly relevant documents." In Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, pp. 41-48. ACM, 2000. (PDF)
    • Järvelin, Kalervo, and Jaana Kekäläinen. "Cumulated gain-based evaluation of IR techniques." ACM Transactions on Information Systems (TOIS) 20, no. 4 (2002): 422-446. (PDF)
    • Borlund, Pia. "The IIR evaluation model: a framework for evaluation of interactive information retrieval systems." Information research 8, no. 3 (2003). (PDF)
    • Clarke, Charles LA, Maheedhar Kolla, Gordon V. Cormack, Olga Vechtomova, Azin Ashkan, Stefan Büttcher, and Ian MacKinnon. "Novelty and diversity in information retrieval evaluation." In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pp. 659-666. ACM, 2008. (PDF)
    • Smucker, Mark D., James Allan, and Ben Carterette. "A comparison of statistical significance tests for information retrieval evaluation." In Proceedings of the sixteenth ACM conference on information and knowledge management, pp. 623-632. ACM, 2007. (PDF)
    • Buckley, Chris, and Ellen M. Voorhees. "Retrieval evaluation with incomplete information." In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 25-32. ACM, 2004. (PDF)
    • Carterette, Ben, James Allan, and Ramesh Sitaraman. "Minimal test collections for retrieval evaluation." In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 268-275. ACM, 2006. (PDF)
  • Day 2: Modern IR evaluations (slides, PDF)

    • Radlinski, Filip, and Nick Craswell. "Comparing the sensitivity of information retrieval metrics." In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, pp. 667-674. ACM, 2010. (PDF)
    • Ageev, Mikhail, Qi Guo, Dmitry Lagun, and Eugene Agichtein. "Find it if you can: a game for modeling different types of web search success using interaction data." In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pp. 345-354. ACM, 2011. (PDF)
    • Hassan, Ahmed, Yang Song, and Li-wei He. "A task level metric for measuring web search satisfaction and its application on improving relevance estimation." In Proceedings of the 20th ACM international conference on Information and knowledge management, pp. 125-134. ACM, 2011. (PDF)
    • White, Ryen. "Beliefs and biases in web search." In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval, pp. 3-12. ACM, 2013. (PDF)
    • Smucker, Mark D., and Charles LA Clarke. "Time-based calibration of effectiveness measures." In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, pp. 95-104. ACM, 2012. (PDF)
    • Sanderson, Mark, Monica Lestari Paramita, Paul Clough, and Evangelos Kanoulas. "Do user preferences and evaluation measures line up?." In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, pp. 555-562. ACM, 2010. (PDF)
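
As an illustration of the classical metric named in the lecture description, Mean Average Precision, the following minimal sketch assumes binary relevance judgments and the common convention of dividing by the total number of relevant documents for a query. The function names and the toy rankings are hypothetical, not taken from the slides.

```python
def average_precision(ranking, relevant):
    """AP for one query: precision@k summed at each relevant hit,
    divided by the total number of relevant documents."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    # Dividing by len(relevant) penalizes relevant documents never retrieved.
    return total / len(relevant)


def mean_average_precision(runs):
    """MAP: arithmetic mean of AP over (ranking, relevant set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical judgments for two queries.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d5", "d6", "d7"], {"d6"}),              # AP = (1/2) / 1
]
print(round(mean_average_precision(runs), 4))  # 0.6667
```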

Lecture IV: Retrieval Models

The retrieval model, a.k.a. the ranking algorithm, is arguably the most important component of a retrieval system and directly determines search effectiveness. We will discuss classical retrieval models, including Boolean, vector space, probabilistic, and language models. We will also introduce recent developments in learning-based ranking algorithms, i.e., learning-to-rank.

  • Day 1: Boolean and vector space model (slides, PDF)

    • Salton, Gerard, Anita Wong, and Chung-Shu Yang. "A vector space model for automatic indexing." Communications of the ACM 18, no. 11 (1975): 613-620. (PDF)
    • Salton, Gerard, and Christopher Buckley. "Term-weighting approaches in automatic text retrieval." Information processing & management 24, no. 5 (1988): 513-523. (PDF)
    • Raghavan, Vijay V., and SK Michael Wong. "A critical analysis of vector space model for information retrieval." Journal of the American Society for information Science 37, no. 5 (1986): 279-287. (PDF)
    • Singhal, Amit, Chris Buckley, and Mandar Mitra. "Pivoted document length normalization." In Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 21-29. ACM, 1996. (PDF)
    • Turney, Peter D., and Patrick Pantel. "From frequency to meaning: Vector space models of semantics." Journal of artificial intelligence research 37, no. 1 (2010): 141-188. (PDF)
    • Sahlgren, Magnus. "The Word-Space Model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces." (2006). (PDF)
  • Day 2: Probabilistic ranking principle (slides, PDF)

    • Robertson, Stephen E., Steve Walker, Susan Jones, Micheline M. Hancock-Beaulieu, and Mike Gatford. "Okapi at TREC-3." Nist Special Publication Sp 109 (1995): 109. (PDF)
    • Metzler, Donald, and W. Bruce Croft. "A Markov random field model for term dependencies." In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 472-479. ACM, 2005. (PDF)
    • Robertson, Stephen, and Hugo Zaragoza. "The probabilistic relevance framework: BM25 and beyond." Foundations and Trends® in Information Retrieval 3, no. 4 (2009): 333-389. (PDF)
    • Büttcher, Stefan, Charles LA Clarke, and Brad Lushman. "Term proximity scoring for ad-hoc retrieval on very large text collections." In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 621-622. ACM, 2006. (PDF)
    • Lv, Yuanhua, and ChengXiang Zhai. "When documents are very long, BM25 fails!." In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pp. 1103-1104. ACM, 2011. (PDF)
  • Day 3: Language models (slides, PDF)

    • Ponte, Jay M., and W. Bruce Croft. "A language modeling approach to information retrieval." In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pp. 275-281. ACM, 1998. (PDF)
    • Lavrenko, Victor, and W. Bruce Croft. "Relevance based language models." In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 120-127. ACM, 2001. (PDF)
    • Berger, Adam, and John Lafferty. "Information retrieval as statistical translation." In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pp. 222-229. ACM, 1999. (PDF)
    • Zhai, Chengxiang, and John Lafferty. "A study of smoothing methods for language models applied to ad hoc information retrieval." In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 334-342. ACM, 2001. (PDF)
    • Gao, Jianfeng, Jian-Yun Nie, Guangyuan Wu, and Guihong Cao. "Dependence language model for information retrieval." In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 170-177. ACM, 2004. (PDF)
    • Song, Fei, and W. Bruce Croft. "A general language model for information retrieval." Proceedings of the eighth international conference on Information and knowledge management. ACM, 1999. (PDF)
  • Day 4: Learning to rank (slides, PDF)

    • Burges, Chris, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. "Learning to rank using gradient descent." In Proceedings of the 22nd international conference on Machine learning, pp. 89-96. ACM, 2005. (PDF)
    • Yue, Yisong, Thomas Finley, Filip Radlinski, and Thorsten Joachims. "A support vector method for optimizing average precision." In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 271-278. ACM, 2007. (PDF)
    • Cao, Zhe, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. "Learning to rank: from pairwise approach to listwise approach." In Proceedings of the 24th international conference on Machine learning, pp. 129-136. ACM, 2007. (PDF)
    • Xu, Jun, and Hang Li. "Adarank: a boosting algorithm for information retrieval." In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 391-398. ACM, 2007. (PDF)
    • Taylor, Michael, John Guiver, Stephen Robertson, and Tom Minka. "Softrank: optimizing non-smooth rank metrics." In Proceedings of the 2008 International Conference on Web Search and Data Mining, pp. 77-86. ACM, 2008. (PDF)
    • Geng, Xiubo, Tie-Yan Liu, Tao Qin, Andrew Arnold, Hang Li, and Heung-Yeung Shum. "Query dependent ranking using k-nearest neighbor." In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pp. 115-122. ACM, 2008. (PDF)
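
To give a flavor of the vector space model from Day 1 of this lecture, here is a minimal sketch that ranks documents by cosine similarity between TF-IDF vectors. It assumes one particular weighting (raw term frequency times log(N/df)) among the many variants discussed in the readings; the function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict) per document, plus the IDF table."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return (score, doc_id) pairs sorted by decreasing cosine similarity."""
    vectors, idf = tfidf_vectors(docs)
    q_tf = Counter(query.lower().split())
    # Query terms unseen in the collection carry no weight.
    q_vec = {t: tf * idf[t] for t, tf in q_tf.items() if t in idf}
    scores = [(cosine(q_vec, vec), doc_id) for doc_id, vec in enumerate(vectors)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))


# Hypothetical collection, for illustration only.
docs = [
    "language models for retrieval",
    "vector space model for indexing",
    "boolean retrieval with an inverted index",
]
results = rank("vector space", docs)
print(results[0][1])  # 1, the only document containing the query terms
```

Cosine normalization is only one way to handle document length; the pivoted normalization and BM25 readings above describe alternatives.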

Lecture V: Relevance Feedback

User feedback helps a retrieval system evaluate its performance and improve its effectiveness. However, in most practical systems, only implicit feedback, e.g., clicks, can be collected from users, and such feedback is known to be noisy and biased. We will discuss how to properly model implicit user feedback and how to enhance retrieval performance with it.

  • Day 1: Modeling feedback (slides, PDF)

    • Zhai, Chengxiang, and John Lafferty. "Model-based feedback in the language modeling approach to information retrieval." In Proceedings of the tenth international conference on Information and knowledge management, pp. 403-410. ACM, 2001. (PDF)
    • Lv, Yuanhua, and ChengXiang Zhai. "A comparative study of methods for estimating query language models with pseudo feedback." In Proceedings of the 18th ACM conference on Information and knowledge management, pp. 1895-1898. ACM, 2009. (PDF)
    • Lv, Yuanhua, and ChengXiang Zhai. "Positional relevance model for pseudo-relevance feedback." In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, pp. 579-586. ACM, 2010. (PDF)
    • Lee, Kyung Soon, W. Bruce Croft, and James Allan. "A cluster-based resampling method for pseudo-relevance feedback." In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pp. 235-242. ACM, 2008. (PDF)
    • Cao, Guihong, Jian-Yun Nie, Jianfeng Gao, and Stephen Robertson. "Selecting good expansion terms for pseudo-relevance feedback." In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pp. 243-250. ACM, 2008. (PDF)
    • Wang, Xuanhui, Hui Fang, and ChengXiang Zhai. "A study of methods for negative relevance feedback." In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pp. 219-226. ACM, 2008. (PDF)
    • Shen, Xuehua, Bin Tan, and ChengXiang Zhai. "Context-sensitive information retrieval using implicit feedback." In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 43-50. ACM, 2005. (PDF)
  • Day 2: Modeling implicit feedback & Click modeling (slides, PDF)

    • Joachims, Thorsten, Laura Granka, Bing Pan, Helene Hembrooke, and Geri Gay. "Accurately interpreting clickthrough data as implicit feedback." In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 154-161. ACM, 2005. (PDF)
    • Joachims, Thorsten, et al. "Evaluating the accuracy of implicit feedback from clicks and query reformulations in web search." ACM Transactions on Information Systems (TOIS) 25.2 (2007): 7. (PDF)
    • Agichtein, Eugene, Eric Brill, and Susan Dumais. "Improving web search ranking by incorporating user behavior information." Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 2006. (PDF)
    • Agichtein, Eugene, et al. "Learning user interaction models for predicting web search result preferences." Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 2006. (PDF)
    • Guan, Zhiwei, and Edward Cutrell. "An eye tracking study of the effect of target rank on web search." Proceedings of the SIGCHI conference on Human factors in computing systems. ACM, 2007. (PDF)
    • White, Ryen W., and Steven M. Drucker. "Investigating behavioral variability in web search." In Proceedings of the 16th international conference on World Wide Web, pp. 21-30. ACM, 2007. (PDF)
    • Chapelle, Olivier, and Ya Zhang. "A dynamic bayesian network click model for web search ranking." In Proceedings of the 18th international conference on World wide web, pp. 1-10. ACM, 2009. (PDF)
    • Dupret, Georges E., and Benjamin Piwowarski. "A user browsing model to predict search engine click data from past observations." In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pp. 331-338. ACM, 2008. (PDF)
    • Craswell, Nick, et al. "An experimental comparison of click position-bias models." Proceedings of the 2008 International Conference on Web Search and Data Mining. ACM, 2008. (PDF)
    • Zhu, Zeyuan Allen, Weizhu Chen, Tom Minka, Chenguang Zhu, and Zheng Chen. "A novel click model and its applications to online advertising." In Proceedings of the third ACM international conference on Web search and data mining, pp. 321-330. ACM, 2010. (PDF)

Lecture VI: Link Analysis

We will discuss the unique characteristic of the web, its interconnected link structure, and introduce Google's winning algorithm, PageRank. We will also introduce the application of link analysis techniques in a related domain: social network analysis.

  • Day 1: PageRank (slides, PDF)

    • Page, Lawrence, Sergey Brin, Rajeev Motwani, and Terry Winograd. "The PageRank citation ranking: Bringing order to the web." (1999). (PDF)
    • Haveliwala, Taher H. "Topic-sensitive pagerank." In Proceedings of the 11th international conference on World Wide Web, pp. 517-526. ACM, 2002. (PDF)
    • Jeh, Glen, and Jennifer Widom. "Scaling personalized web search." In Proceedings of the 12th international conference on World Wide Web, pp. 271-279. ACM, 2003. (PDF)
    • Jeh, Glen, and Jennifer Widom. "SimRank: a measure of structural-context similarity." In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 538-543. ACM, 2002. (PDF)
    • Erkan, Günes, and Dragomir R. Radev. "LexRank: Graph-based lexical centrality as salience in text summarization." J. Artif. Intell. Res.(JAIR) 22, no. 1 (2004): 457-479. (PDF)
    • Wan, Xiaojun, and Jianwu Yang. "Multi-document summarization using cluster-based link analysis." In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pp. 299-306. ACM, 2008. (PDF)
    • Craswell, Nick, and Martin Szummer. "Random walks on the click graph." Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 2007. (PDF)
  • Day 2: HITS (slides, PDF)

    • Kleinberg, Jon M. "Authoritative sources in a hyperlinked environment." Journal of the ACM (JACM) 46, no. 5 (1999): 604-632. (PDF)
    • Richardson, Matthew, Amit Prakash, and Eric Brill. "Beyond PageRank: machine learning for static ranking." Proceedings of the 15th international conference on World Wide Web. ACM, 2006. (PDF)
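
As a small illustration of the PageRank topic from Day 1, the sketch below runs power iteration on a toy link graph. It assumes a damping factor of 0.85 and spreads the rank of dangling pages (pages with no out-links) uniformly over all pages; both are common conventions rather than details taken from the slides, and the function name and the graph are hypothetical.

```python
def pagerank(links, damping=0.85, tol=1e-8, max_iter=200):
    """Power iteration. links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Rank held by dangling pages is redistributed uniformly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            for t in targets:
                # Each page splits its rank evenly among its out-links.
                new_rank[t] += damping * rank[p] / len(targets)
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical four-page web graph.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # C, which receives links from A, B, and D
```

Page D has no in-links, so its score stays at the teleportation floor (1 - damping) / n, while C accumulates rank from three pages.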