An improved document distance metric based on topic models

Introduction

In accordance with Allan [1], Topic Detection and Tracking (TDT) research has long been appeared in the relevant pilot study and open evaluations. The primary target of this kind of early survey has been extending techniques from information retrieval and speech recognition into the core TDT domain. TDT has received enormous support from both the US Government’s Defense Advanced Research Projects Agency (DARPA) and the US National Institute of Standards and Technology (NIST). First Story Detection is a cross-field research covering information retrieval, text mining, and text categorization.

In the recent decades, novelty detection or first story detection in social media analytics has been one of the trending research subjects due to the prevalence of large informational networks such as Twitter and Wikipedia. The literature has emphasised the importance of FSD as a variant of TDT. This is certainly true in the case of the recognised TDT-1 to TDT-5 competition and the TREC because a significant quantity of papers dealing with FSD have been published through the aforementioned two research institutions. FSD has emerged as a powerful technique for improving the accuracy and timing of news alert systems. In addition, FSD can play an important role in addressing the issue of Smart City monitoring, amongst which is the identification of the virality timing of an epidemic or hazards. FSD was first applied with a pilot study for monitoring streaming news stories [2]. In this circumstance, the task was to assign each story with either novel or redundant.

The majority of past research in novelty detection has been focused on the time-complexity or the execution time of the system. This is exemplified in the work undertaken by [3]. According to [3], a constant time approach was taken to eliminate the time consuming operation caused by document to document comparisons. In traditional novelty detection frameworks, there has been a lack in the investigation of the changing dynamics of the streaming context. Hence the consequent novelty detection step often does not reflect the updated term weighting information. A possible consequence of using the traditional TF-IDF weighting is that incorrect labelling of novel or redundant terms on a document may occur. This is a major issue in this kind of Big Data research.

Here, we present a new FSD algorithm based on the TF-IDF scheme which takes newly introduced vocabularies into account. Based on such an idea, we project that the new algorithm will outperform the existing baseline novelty detection system. In other words, the incremental TF-IDF weighting approach yields a more accurate identification on the text streams in a novelty detection system when compared to the baseline. The evidence is shown by a standard evaluation applied on a benchmark dataset. The student has published a peer-reviewed paper on ICSSC 2015 titled as ‘an improved system for sentence-level novelty detection in textual streams’.

Literature Review

In this section, we present the background of various techniques and approaches supporting the field of novelty detection in TDT.

We briefly go through the relevant background of TDT evaluation in the field of information retrieval in this part.

TDT programme is split into five crucial themes: namely topic tracking, link detection, topic detection, first story detection, and story segmentation. Topic tracking focuses on detecting stories that describe a target topic. The goal of link detection is to detect whether a pair of stories discuss the same topic. Detecting clusters of stories that discuss the same topic is the goal of topic detection. Whereas for first story detection, which is our primary interest, detect the first story that discuss a topic is the priority. For story segmentation, the task is to detect the boundaries of story.

Latent Dirichlet allocation (LDA) is a probabilistic model or technique which can detect topics within documents by itself. Each document d is modelled as a distribution over K topics, which are themselves characterized by distributions over words. The individual words in a document are generated by repeatedly sampling a topic according to the topic distribution and then sampling a single word from the chosen topic.

It is well known that dependencies happen between terms in a corpora of text, although with the fact that the binary independence model (BIM) has been in prevalence for a many years. For instance, in a SIGIR proceedings, appearances of certain pairs of terms are correlated, such as information and retrieval. Markov random fields (MRF), also noted as undirected graphical models, are commonly used to model the joint distribution over queries and document, in accordance with Metzler and Croft’s demonstration[4]. Empirical results shown that modelling dependencies can significantly improve retrieval effectiveness across a range of corpus. Using TREC dataset as an illustrative example, the average precision rate when using the full dependence variant of MRF is 10% higher than that of term independent frameworks [4].

A statistical language model, or more simply a language model, is a probabilistic mechanism for categorising text. Language modelling approach is generally generative, showing that it can be adopted to generate text at some level. Whereas the discriminative approaches model just the decision boundary. For example, does this document belong to class X or Y is usually an issue addressed by the discriminative programme. The goal of a language model is to estimate a model M from a sample text S.

The ongoing literature suggests that there are two major categories of language models that we could use: unigram or higher-order models and multinomial or multiple-Bernoulli [5]. For concreteness, unigram LM does not consider order of the words or assumes word independence. Therefore, ‘brown dog’ is codified exactly as ‘dog brown’, which cannot capture surface form. Higher-order models such as n-gram model with condition on preceding words, cache-like model with condition on a window, and grammar model with condition on parse tree resolve the problem of word independence, and are very likely to signal prohibitively expensive parameter estimation in practice.

Ranking documents usually deals with issues like ‘what is the probability to generate the given query, given a language model?’ or ‘what is the probability to generate the given document, given a language model?’. As proposed by Manning, Raghavan, and Schütze’s, scoring in language models could generally be split into four forms, namely query-likelihood scoring, document-likelihood, likelihood ratio, and divergence of query and document models or model comparison scoring approach [6].

The standard query-likelihood approach evaluates a language model MD for every document D in the collection and ranks documents by the probability of ‘generating’ the query. The likely drawbacks of the standard query-likelihood approach are fourfold. First, there is no relevance in the model, i.e., everything is random sampling. Second, user feedback or query expansion is not part of the model. Third, examples of relevant documents cannot facilitate us improving the language model itself. Fourth, query-likelihood modelling approach does not directly allow weighted or structured queries [7].

Document-likelihood approach is the flip side of the query-likelihood method. More specifically, it constructs a language model MQ for the query Q and ranks documents D by the likelihood of being a random sample from MQ. In the situation of document- likelihood, MQ is expected to ‘predict’ a typical relevant document. The problems of document- likelihood approach is that the probabilities of different documents are not comparable because of distinct document lengths [8].

Methodology

In this section, the proposed method for solving document dissimilarity is discussed shallowly. Recently, word embedding or distributional word vectors has attracted enormous research attention due to its efficient capturing semantic or syntactic relatedness among language contexts without compromising speed. In relation to word embedding line, word2vec [9] by now is the recongised optimal neural network based corpus learning framework. Once the high quality word vector is available, an instance of word mover's distance introduced by Kusner et al. [9] can be applied to compute the minimal distance has to move from one mass to another as the analogy of document distance or dissimilarity.

Formally, when calculating the distance of two target documents, an exhasitive comparison of each pairwised tuple has to be scanned and sorted in either incremental or decremental fashion, which uninevitably leads to huge computational complexity. Therefore, here we propose a novel count based metric based on topic model aka LDA to facilitate fast word mover's distance computation.

References

J. Allan, Topic detection and tracking: event-based information organization, vol. 12. 2002.
J. Allan, S. Harding, D. Fisher, A. Bolivar, S. Guzman-Lara, and P. Amstutz, “Taking topic detection from evaluation to practice,” Syst. Sci. 2005. HICSS’05. Proc. 38th Annu. Hawaii Int. Conf., p. 101a–101a, 2005.
S. Petrovic, M. Osborne, and V. Lavrenko, “Streaming First Story Detection with application to Twitter,” Comput. Linguist., no. June, pp. 181–189, 2010.
D. Metzler and W. B. Croft, “A Markov random field model for term dependencies,” SIGIR ’05 Proc. 28th Annu. Int. ACM SIGIR Conf. Res. Dev. Inf. Retr., pp. 472–479, 2005.
S. F. Chen and J. Goodman, “An Empirical Study of Smoothing Techniques for Language Modeling,” Comput. Speech Lang., vol. 13, pp. 359–394, 1999.
C. D. Manning, P. Can, H. Schütze, P. Raghavan, and H. Schütze, Introduction to Information Retrieval, vol. 1, no. c. 2008.
R. Nallapati and J. Allan, “Capturing term dependencies using a language model based on sentence trees,” Proc. Elev. Int. Conf. Inf. Knowl. Manag. - CIKM ’02, p. 383, 2002.
X. Li, “Improving the robustness of relevance-based language models,” Work, 2005.
Kusner, Matt, et al. "From Word Embeddings To Document Distances." Proceedings of the 32nd International Conference on Machine Learning (ICML-15). 2015.

Research Highlights

An improved document distance metric based on topic models

Introduction

Literature Review

Methodology

References

Publications