As more and more text information (e.g., Web
pages, email messages, and scientific literature) is available
online, developing software tools such as Web search engines to
help people manage such information becomes increasingly
important. A major challenge in text information management is to
model the uncertainties associated with making inferences on text
data. For example, given that a user is interested in finding
articles about heart disease, how likely is the user to use the
word "heart" in the query?
Statistical language models (i.e., probabilistic models of text)
have been successfully used to address such questions.
In this talk, I will present some of my research on applying
statistical language models to text retrieval and mining. I will
first present a decision-theoretic framework for text retrieval
and show how this framework naturally allows us to use statistical
language models to solve the text retrieval problem. I will then
present several specific language models involving multinomial
distributions (over words) and mixtures of multinomial
distributions and show how they can improve retrieval accuracy.
Finally, I will present a contextualized multinomial mixture model
that can be used to discover and analyze spatial and temporal
patterns of topic themes from text
collections. I will show some results of using this model to
analyze news articles, scientific literature, and Weblog data.