Tutorial Information

WSDM tutorials will be held on February 8th. This year we will have three half-day tutorials.

Click for descriptions:

  • AM Tutorial: Mining, Search and Exploiting Collaboratively Generated Content on the Web (Eugene Agichtein and Evgeniy Gabrilovich)
  • PM Tutorial 1: Machine Learning for Query-Document Matching in Search (Hang Li and Jun Xu)
  • PM Tutorial 2: Collaborative Information Seeking: Understanding Users, Systems and Content (Chirag Shah)

Morning Tutorial: Mining, Search and Exploiting Collaboratively Generated Content on the Web

Eugene Agichtein (Emory University), Evgeniy Gabrilovich (Yahoo! Research)

Ubiquitous access to the Internet enables millions of Web users to collaborate online on a variety of activities. Many of these activities result in the construction of large repositories of knowledge, either as their primary aim (e.g., Wikipedia) or as a by-product (e.g., Yahoo! Answers). In this tutorial, we will discuss organizing and exploiting Collaboratively Generated Content (CGC) for information organization and retrieval. Specifically, we intend to cover two complementary areas of the problem: (1) using such content as a powerful enabling resource for knowledge-enriched, intelligent representations and new information retrieval algorithms, and (2) developing supporting technologies for extracting, filtering, and organizing collaboratively created content.

The unprecedented amounts of information in CGC enable new, knowledge-rich approaches to information access, which are significantly more powerful than the conventional word-based methods. Considerable progress has been made in this direction over the last few years. Examples include explicit manipulation of human-defined concepts and their use to augment the bag of words (cf. Explicit Semantic Analysis), using large-scale taxonomies of topics from Wikipedia or the Open Directory Project to construct additional class-based features, or using Wikipedia for better word sense disambiguation.

However, the quality and comprehensiveness of collaboratively created content vary significantly, and for this resource to be useful, a significant amount of preprocessing, filtering, and organization is necessary. Consequently, new methods for analyzing CGC and the corresponding user interactions are required to effectively harness the resulting knowledge. Thus, not only can the content repositories be used to improve IR methods, but the reverse is also possible: better information extraction methods can be used to automatically collect more knowledge or to verify the contributed content. This natural connection between modeling the generation process of CGC and effectively using the accumulated knowledge suggests covering both areas together in a single tutorial.

The intended audience of the tutorial includes IR researchers and graduate students who would like to learn about recent advances and research opportunities in working with collaboratively generated content. The emphasis of the tutorial is on comparing the existing approaches and presenting practical techniques that IR practitioners can use in their research. We also cover open research challenges and survey available resources (software tools and data) for getting started in this research field.

Afternoon Tutorial 1: Machine Learning for Query-Document Matching in Search

Hang Li (Microsoft Research Asia), Jun Xu (Microsoft Research Asia)

In web search, relevance is one of the most important factors in user satisfaction, and the success of a web search engine depends heavily on its relevance performance. It has been observed that many hard cases in search relevance are due to term mismatch between query and document (e.g., the query "ny times" does not match well with a document containing only "new york times"), and thus it is not an exaggeration to say that dealing with the mismatch between query and document is one of the most critical research problems in web search. Recently, researchers have devoted significant effort to addressing this grand challenge. The major approach is to perform deeper query and document understanding and then better matching between the enriched query and document representations. With the availability of large amounts of log data and advanced machine learning techniques, this has become more feasible, and significant progress has been made.
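The term mismatch described above can be illustrated with a minimal sketch. It assumes plain whitespace tokenization and a tiny hand-written rewrite table; the names `overlap`, `rewrite`, and `REWRITES` are hypothetical and used only for illustration, and the learned methods covered in the tutorial are far more sophisticated than a lookup table.

```python
def tokens(text):
    """Lowercase the text and split it on whitespace into a set of terms."""
    return set(text.lower().split())


def overlap(query, document):
    """Fraction of query terms that literally appear in the document."""
    query_terms = tokens(query)
    if not query_terms:
        return 0.0
    return len(query_terms & tokens(document)) / len(query_terms)


# Hypothetical rewrite table (an assumption for this sketch); a real system
# would learn such mappings, for example from search log data.
REWRITES = {"ny": "new york"}


def rewrite(query):
    """Replace each query term with its expansion when one is known."""
    return " ".join(REWRITES.get(term, term) for term in query.lower().split())


query = "ny times"
document = "new york times"

print(overlap(query, document))           # only "times" matches
print(rewrite(query))                     # expanded form of the query
print(overlap(rewrite(query), document))  # every expanded term matches
```

In this toy case, exact term matching finds only one of the two query terms in the document, while the rewritten query matches completely.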

In this tutorial, we will give a systematic and detailed presentation of newly developed machine learning technologies for query-document matching in search. We will focus on the fundamental problems, as well as the novel solutions for query-document matching at the word form level, word sense level, topic level, and structure level. We will talk about novel technologies for query spelling error correction, query rewriting, query classification, topic modeling of documents, query-document matching, and query document-title translation. The ideas and solutions introduced in this tutorial may motivate industrial practitioners to turn research results into products. The summary of state-of-the-art methods and the discussion of technical issues in this tutorial may stimulate academic researchers to find new research directions and solutions.

Matching between query and document is not limited to search. Similar problems arise in online advertising, recommendation systems, and other applications, where they take the form of matching between objects from two spaces. The technologies we introduce can be generalized into a broader class of machine learning techniques, which we call learning to match.

Afternoon Tutorial 2: Collaborative Information Seeking: Understanding Users, Systems and Content

Chirag Shah (Rutgers University)

The course will introduce theories, methodologies, and tools that focus on information retrieval/seeking in collaboration. Attendees will have an opportunity to learn about the social aspect of IR with a focus on collaborative information seeking (CIS) situations, systems, and evaluation techniques.

Traditionally, IR is considered an individual pursuit, and not surprisingly, the majority of tools, techniques, and models developed for addressing information need, retrieval, and usage have focused on single users. The assumption that information seekers are independent and that the IR problem is an individual one has often been challenged in the recent past. This course will introduce such work to the attendees, with an emphasis on understanding models and systems that support collaborative search or browsing. In addition, the course will provide samples of data collected through several experiments to demonstrate various mining and analysis techniques.

Specifically, the course will (1) outline the research and latest developments in the field of collaborative IR, (2) list the challenges for designing and evaluating collaborative IR systems, and (3) show how traditional single-user IR models and systems could be mapped to those for CIS. This will be achieved through an introduction to the appropriate literature, to algorithms and interfaces that facilitate CIS, and to methodologies for studying and evaluating them. Thus, the course will offer a balance between theoretical and practical elements of CIS.