Consider the following data compression problem. Suppose we have a large data set $x_1, \dots, x_n$ we wish to transmit. There are too many points to send directly, but luckily the precise values aren't important: slightly different values would work just as well, as long as the data retained its overall character. So how do we avoid transmitting n raw data points?
Here's one method for compressing the data before sending. First define the empirical distribution

$$\hat{p}(x) = \frac{1}{n} \sum_{i=1}^{n} \delta(x - x_i),$$

which is best thought of as a "histogram". The easiest way to see its nature is to divide x-space into small boxes and graph the counts. Next find a distribution $p_\theta$ based on a few parameters $\theta = (\theta_1, \dots, \theta_k)$ such that

$$p_\theta \approx \hat{p}.$$
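To make the histogram picture concrete, here is a minimal sketch, assuming Python with NumPy and made-up normal data, that divides x-space into small boxes and normalizes the counts so they approximate a density:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)  # hypothetical data set

# Divide x-space into small boxes and count points per box,
# normalizing so the histogram integrates to 1 (density=True).
counts, edges = np.histogram(x, bins=30, density=True)
widths = np.diff(edges)

# The normalized histogram integrates to 1, like a distribution.
assert abs(np.sum(counts * widths) - 1.0) < 1e-9
```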
Instead of transmitting n data points, just send the name of this new distribution along with the k parameter values. The receiver can then simulate a new set of $x_i$'s for themselves.
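On the receiving end, reconstruction is just sampling. As a hypothetical example, suppose the transmitted "name" is a normal distribution with k = 2 parameter values:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical received message: the name "normal" plus k = 2 parameters.
mu, sigma = 3.0, 2.0

# The receiver regenerates a stand-in data set of any desired size n.
n = 1000
x_new = rng.normal(loc=mu, scale=sigma, size=n)
assert x_new.shape == (n,)
```

The regenerated points are not the originals, but they share the originals' overall character, which is all the problem statement asked for.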
But how in practice can we find a suitable $p_\theta$?
The answer lies with Entropy. First pick a function $f$ which in some sense captures the nature of the data, and equate expected values:

$$\mathbb{E}_{p_\theta}[f(X)] = \frac{1}{n} \sum_{i=1}^{n} f(x_i). \tag{1}$$
Then find $p_\theta$ by maximizing the entropy $-\int p(x) \log p(x)\,dx$ subject to the constraint in (1). The result will be a distribution of the form

$$p_\theta(x) = \frac{1}{Z(\lambda)} e^{-\lambda f(x)},$$

where $\lambda$ is chosen to ensure (1) holds and $Z(\lambda)$ normalizes the distribution.
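For a concrete (hypothetical) instance: if the data are positive and we take f(x) = x, the maximum-entropy distribution on [0, ∞) matching the sample mean is the exponential, and λ can be solved for in closed form:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.gamma(shape=2.0, scale=1.5, size=5000)  # hypothetical positive data

# Constraint (1) with f(x) = x: the model mean must equal the sample mean.
m = x.mean()

# On [0, inf) the maxent solution p(x) ∝ exp(-lam * x) is the exponential
# distribution, whose mean is 1/lam; matching the mean forces lam = 1/m.
lam = 1.0 / m

# Constraint (1) holds by construction: model mean 1/lam equals m.
assert abs(1.0 / lam - m) < 1e-12
```

With richer constraints λ usually has no closed form and must be found numerically, but the structure is the same: one multiplier per constraint.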
If this $p_\theta$ is close to the empirical distribution then we're done. If not, then choose an additional function $f_2$ and re-maximize subject to two constraints. Each time you introduce a new function you'll bring $p_\theta$ closer to $\hat{p}$, so this process will always work eventually. In practice, though, we'd carefully choose the $f$'s to get $p_\theta \approx \hat{p}$ quickly. The fewer constraints it takes to capture the essence of the data, the better.
What does this have to do with the Nobel Prize in Economics? To see the connection, plug the maximum-entropy form of $p_\theta$ into the constraint equation (1) and note that $\lambda$ satisfies

$$\frac{1}{n} \sum_{i=1}^{n} f(x_i) = \frac{\int f(x)\, e^{-\lambda f(x)}\,dx}{\int e^{-\lambda f(x)}\,dx},$$

which is just the Method of Moments: the sample average of $f$ equals its expectation under the model. A generalization of this procedure helped win the Nobel Prize in Economics this year. The usual "Method of Moments" type procedures don't require a maxent distribution, but ideally you'd use a distribution which does a good job of "modeling" the data, which amounts to the same thing in this case.
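As a sketch of the method-of-moments view: with f₁(x) = x and f₂(x) = x², equating sample moments to model moments for a normal model (which happens to be the maxent distribution on the real line under those two constraints) pins down the parameters directly. The data here are simulated purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=2.0, size=10000)  # hypothetical data

# Method of moments with f1(x) = x and f2(x) = x**2: set the model's
# first two moments equal to the sample's. For a normal model this
# solves for mu and sigma in closed form.
mu_hat = x.mean()
sigma_hat = np.sqrt(np.mean(x**2) - mu_hat**2)

# The fitted normal reproduces both sample moments exactly:
# E[X] = mu_hat and E[X^2] = mu_hat^2 + sigma_hat^2.
assert abs(mu_hat**2 + sigma_hat**2 - np.mean(x**2)) < 1e-9
```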
Notice that if $p_\theta$ has this maximum-entropy form, then the sample average $\frac{1}{n}\sum_{i=1}^{n} f(x_i)$ would be a sufficient statistic for $\lambda$. This has an intuitive meaning in terms of data compression. Only certain functions of the data are required to get the $\lambda$'s, and they are all the receiver needs to reconstruct the data. In other words, those functions are "sufficient" to capture the essence of the data.
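A small illustration of sufficiency, assuming f(x) = x and an exponential model on [0, ∞): any two data sets with the same sample mean produce the identical fit, so the mean alone is all that needs to be transmitted.

```python
import numpy as np

# Two different data sets with the same value of the sufficient
# statistic (the sample mean, since f(x) = x)...
a = np.array([1.0, 2.0, 3.0])
b = np.array([0.5, 2.5, 3.0])
assert abs(a.mean() - b.mean()) < 1e-12

# ...lead to the identical maxent (exponential) fit lam = 1/mean,
# so the receiver reconstructs the same distribution from either.
lam_a = 1.0 / a.mean()
lam_b = 1.0 / b.mean()
assert lam_a == lam_b
```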