Minimum Description Length (MDL) is an information-theoretic model selection principle. MDL assumes that the simplest, most compact representation of the data is the best and most probable explanation of it.

With MDL, the model selection problem is treated as a communication problem. There is a sender, a receiver, and data to be transmitted. For classification models, the data to be transmitted is a model and the sequence of target class values in the training data. Each model under consideration typically comes from a list of potential candidate models, and the lists vary in size. Each model computes a probability for the target class of each row in the training data, and these predicted probabilities are used to encode the training target values. From Shannon's noiseless coding theorem, it is known that the most compact encoding uses -log2(pi) bits for the target value in the i-th row, where pi is the model's predicted probability of that value; the total encoding cost is the sum of these terms over all rows.
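The encoding cost above can be sketched in a few lines. This is an illustration, not part of any particular product's implementation; the function name and example probabilities are hypothetical.

```python
import math

def encoding_cost_bits(probs):
    """Total bits needed to encode a sequence of target values,
    where probs[i] is the model's predicted probability of the
    actual target value in row i (Shannon code length -log2(pi))."""
    return sum(-math.log2(p) for p in probs)

# Hypothetical example: a model that assigns high probability to the
# correct class in each row encodes the targets more compactly than
# one that assigns 0.5 everywhere.
confident = [0.9, 0.9, 0.8, 0.9]
uninformative = [0.5, 0.5, 0.5, 0.5]
print(encoding_cost_bits(confident))       # fewer bits
print(encoding_cost_bits(uninformative))   # 4.0 bits (1 bit per row)
```

Note that a probability of 1 for the actual value costs 0 bits, while improbable values are expensive to encode.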

The sender and receiver agree on lists of potential candidate models for each model under consideration. A model in a list is represented by its position in the list (for example, model #26). The position is expressed as a binary number; thus, for a list of length m, the number of bits in the binary number is log2(m). The sender and receiver each have a copy of the list. The sender first transmits the model's position. Once the model is known, the sender knows how to encode the targets and the receiver knows how to decode them. The sender then transmits the encoded targets. The total description length of a model is then:

log2(m) - Sumi(log2(pi))

Note that the better the model is at estimating the target probabilities, the shorter the description length. However, the improvement in target probability estimation can come at the expense of a longer list of potential candidates. Thus the description length is a penalized likelihood of the model.
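The tradeoff between list length and fit can be made concrete with a small sketch. The function and the two candidate scenarios below are hypothetical, assuming the description length formula above.

```python
import math

def description_length(list_size, probs):
    """Total description length in bits: log2(list_size) bits to name
    the model in its candidate list, plus the bits needed to encode
    the targets under the model's predicted probabilities."""
    return math.log2(list_size) + sum(-math.log2(p) for p in probs)

# Hypothetical comparison: a model drawn from a longer candidate list
# must improve its target probability estimates enough to pay for the
# extra bits spent naming it.
simple_dl = description_length(4, [0.6] * 10)    # short list, rough fit
complex_dl = description_length(64, [0.9] * 10)  # long list, tight fit
print(simple_dl, complex_dl)
```

Here the more complex model wins because its tighter fit more than repays the 4 extra bits (log2(64) - log2(4)) spent identifying it; with a weaker fit, the simpler model would have the shorter description.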

The attribute importance problem can be put in the MDL framework by considering each attribute as a simple predictive model of the target class. Each value of the predictor (indexed by i) has associated with it a set of ni training examples and a probability distribution pij over the m target class values (indexed by j). Because each pij is estimated from ni training examples, at most ni distinct values of pij are distinguishable in the data. From combinatorics, it is known that the number of ways to distribute ni objects among m distinct bins is ni + m - 1 choose m - 1. Hence the size of the list for a predictor is

Sumi log2((ni + m - 1 choose m - 1))

The total description length for a predictor is then:

Sumi (log2(ni + m - 1 choose m - 1)) - Sumij(log2(pij))

Each predictor's rank is its position in the list of the associated description lengths, ordered smallest first.
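The ranking scheme can be sketched end to end. This is a minimal illustration of the formulas above on hypothetical data; the function name, the toy attributes, and the use of empirical target frequencies as the pij are all assumptions of the sketch.

```python
import math
from collections import Counter

def predictor_description_length(values, targets):
    """MDL description length of one attribute used as a predictor of
    the target: for each attribute value, log2(C(ni + m - 1, m - 1))
    bits to name its target distribution, plus the bits to encode the
    ni targets under the empirical distribution pij."""
    m = len(set(targets))
    total = 0.0
    for v in set(values):
        rows = [t for x, t in zip(values, targets) if x == v]
        ni = len(rows)
        # bits to identify this value's distribution in the list of
        # C(ni + m - 1, m - 1) distinguishable distributions
        total += math.log2(math.comb(ni + m - 1, m - 1))
        # bits to encode the targets using empirical probabilities
        counts = Counter(rows)
        total += sum(-math.log2(counts[t] / ni) for t in rows)
    return total

# Hypothetical data: attribute a tracks the target exactly, attribute b
# is independent of it, so a gets the shorter description length.
targets = [0, 0, 1, 1, 0, 0, 1, 1]
a = ['x', 'x', 'y', 'y', 'x', 'x', 'y', 'y']
b = ['x', 'y', 'x', 'y', 'x', 'y', 'x', 'y']
cols = {'a': a, 'b': b}
ranked = sorted(cols, key=lambda name: predictor_description_length(
    cols[name], targets))
print(ranked)  # ['a', 'b']: a has the smallest description length
```

Attribute a pays only the list-size penalty (its encoding cost is zero because each of its values maps to a single target class), while b pays the same penalty plus one bit per row, so a ranks first.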