
Michael "Shapes Dude" Betancourt


Sep 23

23 tweets

I was just asked an interesting question so why don't we turn it into a little thread? How are we able to do any probabilistic inference on continuous observational spaces when the probability of any single observation is zero?

More formally let's consider a continuous space Y, such as the real numbers. An observational model pi_{theta} is a collection of probability distributions over Y, each indexed by a model configuration, or parameter, theta.

Now any probability distribution on a continuous space will assign zero probability to almost all atomic sets that consist of a single point. Here we'll round up "almost all" to "all" and say that P_{pi_theta} [ y ] = 0 for all y \in Y and all model configurations theta.

This means that the probability of _any_ observation that takes the form of a point tilde{y} \in Y is identically zero, P_{pi_theta} [ tilde{y} ] = 0. All of the data generating processes in our observational model allocate the same probability to every point observation!

If all of the data generating processes allocate the same probability to every point observation then how can a point observation provide any discrimination between them? How do we make any inferences about which model configurations are more compatible with the observations?

Well that's the thing about zero. If P_{pi_theta1} [ tilde{y} ] = 0 and P_{pi_theta2} [ tilde{y} ] = 0 then the _ratio_ of the probabilities is undefined, P_{pi_theta1} [ tilde{y} ] / P_{pi_theta2} [ tilde{y} ] = 0 / 0!

In other words even though a point observation is allocated zero probability by all of the data generating processes in our observational model the _relative_ probabilities might not be the same. Our inferences are hiding in this relative consistency.
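One way to see this relative information numerically is to swap the point for a small interval around it and then shrink the interval. The sketch below is only an illustration under assumptions that are not part of the thread: the observational model is taken to be a unit-scale normal location family, and the names `interval_prob`, `interval_ratio`, `y_obs`, `theta1`, and `theta2` are hypothetical. Both interval probabilities go to zero, but their ratio settles down to the ratio of the two density functions at the point.

```python
import math

def normal_cdf(y, mu, sigma=1.0):
    """CDF of a normal distribution with location mu and scale sigma."""
    return 0.5 * (1.0 + math.erf((y - mu) / (sigma * math.sqrt(2.0))))

def normal_pdf(y, mu, sigma=1.0):
    """Density of the same normal distribution with respect to Lebesgue measure."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def interval_prob(y, eps, mu, sigma=1.0):
    """Probability allocated to the interval [y - eps, y + eps]."""
    return normal_cdf(y + eps, mu, sigma) - normal_cdf(y - eps, mu, sigma)

def interval_ratio(y, eps, theta1, theta2):
    """Ratio of the probabilities that two model configurations give the interval."""
    return interval_prob(y, eps, theta1) / interval_prob(y, eps, theta2)

# Assumed point observation and model configurations, for illustration only.
y_obs = 0.5
theta1, theta2 = 0.0, 2.0

for eps in [1.0, 0.1, 0.01, 0.001]:
    p1 = interval_prob(y_obs, eps, theta1)
    p2 = interval_prob(y_obs, eps, theta2)
    # Each probability shrinks towards zero while the ratio stabilizes.
    print(f"eps={eps:g}  P1={p1:.3e}  P2={p2:.3e}  ratio={p1 / p2:.5f}")

# Ratio of the densities at the point, which the interval ratios approach.
print(f"density ratio = {normal_pdf(y_obs, theta1) / normal_pdf(y_obs, theta2):.5f}")
```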

In order to access this relative information we need to compare two different data generating processes _before_ we try evaluating them on a point observation. Mathematically this comparison is defined through an object known as a _Radon-Nikodym derivative_ or _density_.

Given the probability distributions pi_{theta1} and pi_{theta2} the Radon-Nikodym derivative is a function from the observational space to positive real numbers, d pi_{theta1} / d pi_{theta2} : Y \rightarrow \mathbb{R}^{+}, that _locally_ quantifies the relative behavior of the two distributions.
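For reference, the standard defining property of this function can be written out as below. This is a sketch that assumes pi_{theta1} is absolutely continuous with respect to pi_{theta2}, a condition the thread does not spell out; A stands for any measurable subset of Y.

```latex
\[
P_{\pi_{\theta_1}}[A]
  = \int_{A} \frac{\mathrm{d}\pi_{\theta_1}}{\mathrm{d}\pi_{\theta_2}}(y) \,
    \pi_{\theta_2}(\mathrm{d}y)
  \quad \text{for every measurable } A \subseteq Y .
\]
```

In words, reweighting pi_{theta2} by the derivative recovers the probabilities that pi_{theta1} allocates to every set, which is the sense in which it quantifies relative behavior locally.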

I talk about the mathematical structure of Radon-Nikodym derivatives in more detail in… and at some point next year I'll be writing a lot more about them.

Critically even if P_{pi_theta1} [ tilde{y} ] = 0 and P_{pi_theta2} [ tilde{y} ] = 0 we can have d pi_{theta1} / d pi_{theta2} ( tilde{y} ) \ne 0!

If d pi_{theta1} / d pi_{theta2} ( tilde{y} ) > 1 then pi_theta1 allocates more probability in the neighborhood of tilde{y} _relative_ to pi_theta2, and if d pi_{theta1} / d pi_{theta2} ( tilde{y} ) < 1 then pi_theta2 dominates.

Consequently we can say that the data generating process pi_theta1 is _more_ consistent with the point observation tilde{y} than the data generating process pi_theta2 if and only if d pi_{theta1} / d pi_{theta2} ( tilde{y} ) > 1.

In this case the Radon-Nikodym derivative has another, more common name. It's called a _likelihood ratio_ and forms the basis of likelihood ratio tests. In fact Radon-Nikodym derivatives are also how we construct the infamous likelihood function.

For a convenient reference measure mu we can construct the corresponding Radon-Nikodym derivative for every data generating process, d pi_theta / d mu (y) = L_{y}(theta). This is a function of y and theta. Fixing y to an observation essentially defines a likelihood function.

What's nice about this construction is that it really clarifies exactly what likelihoods are. They are not probabilities but rather a quantification of _relative_ consistency between a fixed point observation tilde{y} and the data generating processes pi_theta.

Once we familiarize ourselves with Radon-Nikodym derivatives then we can immediately derive the typical implementations of likelihood ratios, likelihood functions, and the like.

For example if the data generating processes are implemented as density functions pi(y | theta) then Radon-Nikodym derivatives can be implemented as ratios of density functions, d pi_theta1 / d pi_theta2 (y) = pi(y | theta1) / pi(y | theta2).
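As a minimal sketch of that ratio-of-densities implementation, assume again a unit-scale normal location family (my choice for illustration, not something fixed by the thread). The helper names `normal_logpdf` and `density_ratio` are hypothetical. Working with log densities and exponentiating the difference is just a numerical convenience.

```python
import math

def normal_logpdf(y, mu, sigma=1.0):
    """Log density of a normal distribution with location mu and scale sigma."""
    z = (y - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def density_ratio(y, theta1, theta2, sigma=1.0):
    """pi(y | theta1) / pi(y | theta2), computed on the log scale."""
    return math.exp(normal_logpdf(y, theta1, sigma) - normal_logpdf(y, theta2, sigma))

# Assumed point observation, for illustration only.
y_obs = 0.5

# Greater than 1: theta1 = 0 is more consistent with y_obs than theta2 = 2.
print(density_ratio(y_obs, 0.0, 2.0))

# Swapping the two configurations gives the reciprocal, which is less than 1.
print(density_ratio(y_obs, 2.0, 0.0))
```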

Similarly the likelihood function can be constructed by evaluating the density functions for every data generating process in our observational model at a given point observation, L_{tilde{y}} (theta) \propto pi(tilde{y} | theta).
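Here is a small sketch of that construction, once more assuming a unit-scale normal location family and a hypothetical grid of model configurations. The names `likelihood`, `thetas`, and `scale` are mine. It also checks that multiplying the likelihood by a constant, which is all the \propto allows, leaves every likelihood ratio unchanged.

```python
import math

def normal_pdf(y, mu, sigma=1.0):
    """Density of a normal distribution with location mu and scale sigma."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def likelihood(theta, y_obs, scale=1.0):
    """Density evaluated at a fixed observation, viewed as a function of theta.

    `scale` is an arbitrary positive constant standing in for the freedom
    left by the proportionality sign.
    """
    return scale * normal_pdf(y_obs, theta)

# Assumed point observation and grid of configurations from -2.0 to 3.0.
y_obs = 0.5
thetas = [i / 10.0 for i in range(-20, 31)]

values = [likelihood(theta, y_obs) for theta in thetas]
best = thetas[values.index(max(values))]
print(f"most consistent configuration on the grid: theta = {best}")

# Ratios between configurations do not depend on the arbitrary constant.
ratio_a = likelihood(0.0, y_obs) / likelihood(2.0, y_obs)
ratio_b = likelihood(0.0, y_obs, scale=37.0) / likelihood(2.0, y_obs, scale=37.0)
print(ratio_a, ratio_b)
```

Only the relative values across theta carry information here, which is why the likelihood function is defined up to a constant.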

Finally what happens if we want to relax the assumption that _all_ point observations are allocated zero probability and consider the most general case where only _almost all_ point observations are allocated zero probability?

In this case some point observations, at most countably many of them, can be allocated a non-zero probability, which complicates the Radon-Nikodym derivatives. To construct likelihood functions here we have to be careful to separate out these points from the rest.

This eventually leads to _hurdle models_ where some point observations appear to be _inflated_ relative to the others. Failing to respect the subtle Radon-Nikodym derivative construction here leads to the common confusions/mistakes that are made when implementing these models.
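To make the separation concrete, here is a hedged sketch of one simple case: a single inflated point at zero plus a continuous component on the positive reals. Everything specific is an assumption for illustration, including the exponential component and the names `lam`, `rate`, `hurdle_log_density`, and `hurdle_log_likelihood`. The reference measure is taken to be a point mass at zero plus the Lebesgue measure, so the density equals a probability at zero and an ordinary density everywhere else.

```python
import math

def hurdle_log_density(y, lam, rate):
    """Log density with respect to (point mass at 0) + (Lebesgue measure).

    lam  : probability allocated to the point observation y == 0
    rate : rate of the assumed exponential component on y > 0
    """
    if y == 0.0:
        # The inflated point contributes a genuine probability.
        return math.log(lam)
    if y > 0.0:
        # Every other point contributes a continuous density, weighted by 1 - lam.
        return math.log1p(-lam) + math.log(rate) - rate * y
    # Negative observations are impossible under this assumed model.
    return -math.inf

def hurdle_log_likelihood(ys, lam, rate):
    """Sum of the log densities over independent observations."""
    return sum(hurdle_log_density(y, lam, rate) for y in ys)

# Made-up observations: two exact zeros and three positive values.
ys = [0.0, 0.0, 1.2, 0.3, 2.5]

print(hurdle_log_likelihood(ys, lam=0.4, rate=0.75))
print(hurdle_log_likelihood(ys, lam=0.1, rate=0.75))
```

The zeros and the positive values never share a branch. Treating an exact zero as if it had come from the continuous component is one form the implementation mistakes can take.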

Anytime we consider the probabilistic behavior of points on continuous spaces we rely on Radon-Nikodym derivatives, whether we realize it or not. A little bit of experience with these objects provides a critical foundation that avoids interpretation and implementation mistakes.

