| |||||||||
The minimum description length principle provides a practical formalization of Occam's razor, which states, roughly speaking, that the simplest hypothesis that can explain the data at hand should be preferred over more complicated ones.
To make this notion precise, first note that any set of data can be represented by a string of symbols from a finite (let's say, binary) alphabet. Now, "The fundamental idea behind the MDL Principle is that any regularity in a given set of data can be used to compress the data, i.e. to describe it using fewer symbols than needed to describe the data literally." (Grünwald, 1998. See the link below.)
In order to do this, one has to start by fixing a code for the compressed data. The most general code is to pick some (Turing-complete) computer language and to describe the data at hand in the form of a program in that language that outputs the data; that way, if one has come up with a good hypothesis about the structure of the data, then this can be used to write a shorter program. The length of the shortest program that outputs the data is called the Kolmogorov complexity of the data. This is in fact the idea that is at the heart of Ray Solomonoff's idealized theory of inductive inference. However, while a beautiful mathematical theory, it does not provide a practical way of doing inference; the most important reasons for this are the following:
MDL is an attempt to remedy this, by (1) restricting the set of allowed codes in such a way that it becomes possible (computable) to find the shortest codelength of the data, relative to the allowed set of codes, and (2) stipulating that one should choose the code such that it is reasonably efficient, whatever the data at hand. This second point is somewhat elusive and a lot of research is still going on in this area.
Rather than "programs" in MDL one usually speaks of candidate hypotheses or models. The set of allowed programs is then called the model class. (To confuse matters, some authors refer to the model class as the model.) The model that, together with the data, has the shortest description length, is then selected as the best.
MDL is actually not the first attempt to do hypothesis selection through minimizing description length; as early as 1968 Wallace and Boulton pioneered a related concept called Minimum Message Length. MDL was introduced by Jorma Rissanen in 1978; it differs from MML in several ways, most notably thourgh the extensive use of one-part rather than two-part codes.
Central to the theory of MDL, linking it to statistics, is the one-to-one correspondence between code length functions and probability distributions. (The central lemma involved is the Kraft-McMillan inequality.) For any probability distribution <math>P<math>, it is possible to construct a code <math>C<math> such that the length (in bits) of <math>C(x)<math> is equal to <math>-log_2 P(x)<math>; this code minimizes the expected code length. Vice versa, given a code <math>C<math>, one can construct a probability distribution <math>P<math> such that the same holds. (Rounding issues are ignored here.) In other words, searching for an efficient code equals searching for a good probability distribution, and vice versa!
This has led some researchers to view MDL as being equivalent to Bayesian inference. Where in the MDL framework one is concerned with the code length of the model, and with the code length of model and data together, the corresponding Bayesian concepts are called prior probability and marginal likelihood, respectively. This point of view is expressed for example in David MacKay's Information Theory, Inference, and Learning Algorithms (see link below). However, while Bayesian machinery is often useful as a tool to construct efficient codes from an MDL point of view, in MDL other codes are sometimes used that do not always fit into a Bayesian framework. Examples are the Shtarkov or `normalized maximum likelihood code', which plays a central role in current MDL theory. There is no equivalent of this notion in Bayesian inference. Furthermore, the MDL principle prefers some priors over others. In so-called objective Bayesian analysis the same priors tend to be favoured, but then they are chosen for quite different reasons.