TL;DR. In this article, I would like to summarize what we learned in the first tutorial, "Statistical Language Model", and point to a few papers that can motivate discussion and dot-connecting; on the practical side, I would like to point to some standard benchmark datasets on which we can keep working and compare perplexities with other state-of-the-art methods.
A probabilistic model is a probability distribution over a phenomenon you want to describe. Suppose you have abstracted your domain of interest into a sample space \(\mathcal{U}\) with elements \(x \in \mathcal{U}\); you can then model it with a probability distribution \(P_\theta(x)\) whose parameter \(\theta\) is learned from samples (the training set) \(\mathcal{D} \subset \mathcal{U}\). This is a parametric model of a generative phenomenon, so we often call it a generative model, in contrast to a discriminative model; both kinds can be parametric or non-parametric. This is a form of unsupervised learning, since we do not predict a label for a new \(x\); we only evaluate how likely an arbitrary \(x\) is to be generated by \(P_\theta(x)\). The use of a generative model is to give a descriptive summary of the underlying data distribution, which we sometimes call density estimation.
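To make this concrete, here is a minimal density-estimation sketch (the Gaussian choice and all names below are illustrative, not from the tutorial): fit \(P_\theta(x)\) with \(\theta = (\mu, \sigma)\) from the samples, then score how likely an arbitrary \(x\) is under the fitted model.

```python
import math

# A minimal density-estimation sketch: model samples D with a Gaussian
# P_theta(x), theta = (mu, sigma), then score how likely a new x is.
def fit_gaussian(samples):
    mu = sum(samples) / len(samples)
    var = sum((x - mu) ** 2 for x in samples) / len(samples)
    return mu, math.sqrt(var)

def density(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

D = [4.8, 5.1, 5.0, 4.9, 5.2]   # training set, a subset of the sample space
mu, sigma = fit_gaussian(D)
print(density(5.0, mu, sigma))   # high density: 5.0 is typical of the data
print(density(9.0, mu, sigma))   # near zero: 9.0 is unlikely under P_theta
```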
One way to do unsupervised learning of probabilistic models is maximum likelihood estimation (MLE), a general statistical principle. In SLM, we used the constrained optimization problem posed by MLE to derive the count-based estimator for an n-gram language model.
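As a sketch of where that derivation leads (the function and toy data below are hypothetical, not from the tutorial): for a bigram model, the MLE solution is the ratio of counts, \(P(w_i \mid w_{i-1}) = c(w_{i-1} w_i) / c(w_{i-1})\).

```python
from collections import Counter

def bigram_mle(corpus):
    """Count-based MLE estimator for a bigram language model.

    `corpus` is a list of token lists (sentences); names here are
    illustrative, not from the original tutorial.
    """
    unigram_counts = Counter()
    bigram_counts = Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            unigram_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    # MLE: P(cur | prev) = count(prev, cur) / count(prev)
    return {pair: n / unigram_counts[pair[0]] for pair, n in bigram_counts.items()}

probs = bigram_mle([["the", "cat", "sat"], ["the", "dog", "sat"]])
print(probs[("the", "cat")])  # 0.5: "the" is followed by "cat" half the time
```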
Comment. Why the name smoothing?
My current understanding is that the word describes how these techniques alleviate data sparsity: smoothing makes the estimated probability distribution smoother, i.e., less bumpy, assigning probability to every point of the sample space (here, a combinatorial space), so there are fewer holes where no probability mass is placed.
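As an illustration (a minimal sketch, assuming the count-based bigram estimator above; all names are hypothetical), add-one (Laplace) smoothing fills those holes by pretending every bigram was seen once more than it actually was:

```python
def laplace_bigram_prob(bigram_counts, unigram_counts, vocab_size):
    """Add-one (Laplace) smoothed conditional probability.

    P(cur | prev) = (c(prev, cur) + 1) / (c(prev) + V), so every bigram,
    seen or unseen, receives nonzero probability mass.
    """
    def prob(prev, cur):
        return (bigram_counts.get((prev, cur), 0) + 1) / \
               (unigram_counts.get(prev, 0) + vocab_size)
    return prob

# An unseen bigram now gets small but nonzero probability:
prob = laplace_bigram_prob({("the", "cat"): 1}, {"the": 2}, vocab_size=10)
print(prob("the", "dog"))  # (0 + 1) / (2 + 10) ≈ 0.083, instead of 0
```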
Comment.
Vocabulary management for a specific field is a good starting point for becoming an expert in that field. By vocabulary management I mean getting to know and becoming familiar with a field through its key concepts (e.g., definitions, names of methods, algorithm nicknames, etc.). Then, once you have a seed vocabulary, you can start connecting the dots and clarifying your understanding across many concepts and fields; that is how things start to resonate.
Two decades of statistical language modeling: where do we go from here? (citation count: 714)
(This paper is scheduled for discussion at our reading party.)
Further recommended papers (citation counts: 814, 611, and 518).
Setting goals for reading or learning something new is often a practical necessity in research or study, since seldom can one absorb all the information a paper tries to convey. By concentrating on the most interesting part of a paper and trying to understand it, a) the burden of reading is relieved, and b) that content can be remembered in great detail.
So I would like to list a few goals for you, should you decide to read the papers above.
In this part, I will point to some LM benchmarks so you can test your own language models and compare them with those of other people around the world.
Benchmark.
A benchmark is an experimental setting, comprising a) a dataset, i.e., training/dev/test splits, b) standard state-of-the-art algorithms, c) etc., that helps you test a newly proposed method or model against others.
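For language models, the usual benchmark metric is perplexity on the held-out test split; standard datasets include the Penn Treebank and WikiText-2. Below is a minimal sketch of computing perplexity, \(\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(w_i \mid w_{i-1})\right)\); the `model_logprob` interface and the toy data are assumptions for illustration.

```python
import math

def perplexity(model_logprob, test_sentences):
    """Perplexity of a language model on a held-out test set.

    `model_logprob(prev, cur)` should return log P(cur | prev); this
    interface is an assumption, not a standard API.
    """
    total_logprob, n_tokens = 0.0, 0
    for sentence in test_sentences:
        tokens = ["<s>"] + sentence + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            total_logprob += model_logprob(prev, cur)
            n_tokens += 1
    return math.exp(-total_logprob / n_tokens)  # lower is better

# Sanity check: a uniform model over a 10,000-word vocabulary
# has perplexity exactly 10,000.
uniform = lambda prev, cur: math.log(1 / 10_000)
print(perplexity(uniform, [["hello", "world"]]))  # ≈ 10000.0
```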
This part aims to give some guidance and concrete tactics for reading a research paper. The method is much more general, though, so you can also apply it to a) reading blog posts like this, this, and this; or b) doing a literature review for your own project!
BTW, I highly recommend reading the original paper, which is very approachable at only two pages.
This article outlines a practical 3-pass method for reading research papers.
First pass: 5–10 minutes of reading.
The routine, following the original paper, is roughly: carefully read the title, abstract, and introduction; read the section and sub-section headings but ignore everything else; read the conclusions; and glance over the references, mentally ticking off the ones you have already read.
Questions to be answered by the end of this pass (the "five Cs"):
Category: what type of paper is it? A measurement paper, an analysis of existing systems, a new research prototype, etc.?
Context: try to identify, within the paper, a) the basic math background (Bayes' formula, marginalization, SVD, etc.); b) the background papers (papers meant to be read before this one); and c) the theories (theorems stated and proved in this paper).
Correctness: do the assumptions made in the paper seem valid to you? E.g., is the i.i.d. assumption appropriate? Is the conditional independence assumption proper?
Contributions: what are the paper's main contributions? Do they make sense to you? (Some papers list their contributions explicitly; most papers claim their contributions in the Introduction section.)
Clarity: is the paper well written? You can also pick up words or sentences to improve your own command of the language.
As the author of “How to read a paper” says, “The first pass is adequate for papers that aren’t in your research area.”
Second pass: 60 minutes of reading.
Look carefully at the figures, diagrams, and other illustrations in the paper, such as the architecture of a recursive neural network for sentiment analysis and its demonstrative example; this will help you quickly grasp the overall structure of the model.
Looking at the experimental results is also a VERY IMPORTANT part of this pass: it tells you the actual numbers, which you can compare against papers you have previously read. This will ground your belief in the paper and help you decide whether to try its proposed method!
Mark relevant unread references for further reading.
Third pass: 5–6 hours of reading.