# Talk:Mutual information

WikiProject Statistics (Rated C-class, Mid-importance)

This article is within the scope of WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page or join the discussion.

C: This article has been rated as C-Class on the quality scale.
Mid: This article has been rated as Mid-importance on the importance scale.

WikiProject Mathematics (Rated C-class, Mid-priority)

This article is within the scope of WikiProject Mathematics, a collaborative effort to improve the coverage of Mathematics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.

C: This article has been rated as C-Class on the quality scale.
Mid: This article has been rated as Mid-priority on the priority scale.
Field: Probability and statistics

## Unit of information?

Instead of:

"It should be noted that these definitions are ambiguous because the base of the log function is not specified. To disambiguate, the function I could be parameterized as I(X,Y,b) where b is the base. Alternatively, since the most common unit of measurement of mutual information is the bit, a base of 2 could be specified."

how about:

"The unit of information depends on the base of the log function. Most common are bases of 2, e, and 10, resulting in units of bits, nats and digits, respectively."

Internetexploder 08:15, 29 April 2007 (UTC)

I didn't initiate the notice, but the guidelines state that this notice is internal to Wikipedia and is not really for the casual reader's consumption. Any attention that a qualified contributor can give is welcome. Ancheta Wis 23:55, 23 Oct 2004 (UTC)

Noting Category:Pages needing attention, I would say that, while someone may have thought that a good guideline, it is de facto incorrect (and not policy). I, for one, do not agree with that guideline, because it hides the fact that the article needs attention from all those who can edit it and it disclaims to newbies that we know the article isn't as good as it could be. — 131.230.133.185 5 July 2005 19:23 (UTC)

## A practical example might help

For an audience outside the maths/stats field, an example could increase broad understanding of the idea of MI. To this end, I think some of the text in the Simple English version of the article could be incorporated.

For example, knowing the temperature of a random day of the year will not reveal what month it is, but it will give some hint. The changes in the likelihood when the temperature is known can be explained and measured with mutual information. For instance, to measure mutual information between month and temperature, we would need to know how many days in the year are 10 degrees Celsius, how many days out of the year are March and finally how many days are 10 degrees Celsius in March.

(e.g. emre) 09:23, 5 July 2017 (UTC)

## Simplify eq?

why not just say:

${\displaystyle I(X,Y)=\sum _{x,y}p(x,y)\times \log _{2}{\frac {p(x,y)}{p(x)\,p(y)}}.\!}$

instead of all the confusing talk about what f and g are? Please elaborate if there is a specific reason why it is done this way. -- BAxelrod 02:08, 19 October 2005 (UTC)

The definitions given in the article are correct. They just happen to be highly formal. Less formal definitions are given in the article on information theory (recently added by me, but I called it transinformation). Whether this level of formality is appropriate for this article is a matter for debate. I tend to think not, because in general, someone who is working at that level of formality is not going to be looking in Wikipedia for a definition, but on the other hand, it "simplifies" matters because then one definition suffices for both the discrete and continuous cases. (i.e. integration over the counting measure is simply ordinary discrete summation.) -- 130.94.162.64 22:53, 2 December 2005 (UTC)
O.K. Simplified the formula. -- 130.94.162.64 05:24, 3 December 2005 (UTC)
Another note: ${\displaystyle I(X,Y)\,}$ is incorrect. ${\displaystyle I(X;Y)\,}$ is the accepted usage. Use a semicolon. -- 130.94.162.64 11:35, 4 December 2005 (UTC)
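Since this thread is about the simplified discrete formula, here is a minimal sketch of it in code (the helper name and the two toy distributions are my own, purely for illustration):

```python
import math

def mutual_information(joint, base=2.0):
    """I(X;Y) from a joint pmf given as a dict {(x, y): p(x, y)}."""
    # Marginals p(x) and p(y)
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    # sum over x, y of p(x,y) * log( p(x,y) / (p(x) p(y)) ), skipping p = 0 terms
    return sum(p * math.log(p / (px[x] * py[y]), base)
               for (x, y), p in joint.items() if p > 0)

# Perfectly correlated fair bits: I(X;Y) = H(X) = 1 bit
print(mutual_information({(0, 0): 0.5, (1, 1): 0.5}))  # 1.0

# Independent fair bits: I(X;Y) = 0
print(mutual_information({(x, y): 0.25 for x in (0, 1) for y in (0, 1)}))  # 0.0
```

The `base` parameter also addresses the unit question discussed earlier on this page: base 2 gives bits, `math.e` gives nats.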

## Mutual information between ${\displaystyle m}$ random variables

${\displaystyle I(y_{1};\ldots ;y_{m})=\sum _{i=1}^{m}H(y_{i})-H(\mathbf {y} )}$

(In reply to unsigned comment above:) Apparently there isn't a single well-defined mutual information for three or more random variables. It is sometimes defined recursively:
${\displaystyle I(Y_{1};Y_{2})=H(Y_{1})-H(Y_{1}|Y_{2}),\,}$
${\displaystyle I(Y_{1};\ldots ;Y_{m})=I(Y_{1};\ldots ;Y_{m-1})-I(Y_{1};\ldots ;Y_{m-1}|Y_{m}),\,m\geq 3,}$
where ${\displaystyle I(Y_{1};\ldots ;Y_{m-1}|Y_{m})=\mathbb {E} _{Y_{m}}\{I((Y_{1}|y_{m});\ldots ;(Y_{m-1}|y_{m}))\}.}$
This definition fits more along the lines of the interpretation of the mutual information as the measure of an intersection of sets, but it can become negative as well as positive for three or more random variables (in contrast to the definition in the comment above, which is always non-negative).
--130.94.162.64 23:15, 19 May 2006 (UTC)
This may or may not be useful to those of you wanting to extend mutual information; see Interaction_information. Mouse7mouse9 20:30, 9 May 2016 (UTC) — Preceding unsigned comment added by Mouse7mouse9 (talkcontribs)
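The recursive definition above, and the fact that it can go negative, can be checked numerically. A small sketch (helper functions are my own; XOR is the standard example of a negative three-way term):

```python
import math
from itertools import product

def entropy(pmf):
    """Shannon entropy in bits of a pmf given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def marginal(joint, axes):
    """Marginalize a tuple-keyed joint pmf onto the given axis positions."""
    out = {}
    for k, p in joint.items():
        key = tuple(k[a] for a in axes)
        out[key] = out.get(key, 0.0) + p
    return out

def mi2(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a two-axis joint pmf."""
    return entropy(marginal(joint, (0,))) + entropy(marginal(joint, (1,))) - entropy(joint)

def mi3(joint):
    """I(X;Y;Z) = I(X;Y) - I(X;Y|Z), the recursive definition above."""
    ixy = mi2(marginal(joint, (0, 1)))
    ixy_given_z = 0.0  # I(X;Y|Z) = E_Z[ I(X;Y | Z=z) ]
    for z in {k[2] for k in joint}:
        pz = sum(p for k, p in joint.items() if k[2] == z)
        cond = {(k[0], k[1]): p / pz for k, p in joint.items() if k[2] == z}
        ixy_given_z += pz * mi2(cond)
    return ixy - ixy_given_z

# X, Y independent fair bits, Z = X xor Y: pairwise independent, jointly dependent
xor = {(x, y, x ^ y): 0.25 for x, y in product((0, 1), repeat=2)}
print(mi3(xor))  # -1.0: negative, as noted above for three or more variables
```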

## Source

The formula is from Shannon (1948). This should be stated in the article.
Who coined the term "mutual information"? --Henri de Solages 18:41, 7 November 2005 (UTC)

## Remove irrelevant reference?

The first reference, Cilibrasi and Vitanyi (2005), contains only two mentions of mutual information:

"Another recent offshoot based on our work is hierarchical clustering based on mutual information, [23]."

"[23] A. Kraskov, H. Stögbauer, R.G. Andrzejak, P. Grassberger, Hierarchical clustering based on mutual information, 2003, http://arxiv.org/abs/q-bio/0311039"

I suggest this reference be removed as it's not helpful.

--84.9.75.186 10:57, 3 September 2007 (UTC)

The Kraskov & Stögbauer paper is an interesting one. Is that the one you are referring to? —Dfass 11:25, 3 September 2007 (UTC)

## In-text symbols are inconsistently formatted

Most of the in-text symbols and equations are formatted with italics, but a few (those with subscripts) are formatted in math mode. Shouldn't they all be formatted consistently? The separated equations use math mode, so my preference would be for in-text symbols and equations to be formatted in math mode as well. Jamesmelody (talk) 17:53, 22 February 2009 (UTC)

Ohhh... Apparently, the difference between the appearance of some in-text symbols and others is due to rendering differences in my browser. I was writing in-text equations elsewhere, using the math environment everywhere, and noticed the same type of differences. Perhaps I need to tweak my math rendering preferences...
Jamesmelody (talk) 19:18, 22 February 2009 (UTC)

## Subtleties with Entropy and Mutual Information for Continuous Random Variables

I believe the article requires greater rigor when dealing with continuous random variables. Consider the following example:

Let ${\displaystyle X=N(0,\alpha )}$, a normal distribution, and let ${\displaystyle Y=X^{2}}$, and suppose I want to find the mutual information ${\displaystyle I(X,Y)}$ via ${\displaystyle I(X,Y)=H(X)-H(X|Y)}$. I know that ${\displaystyle H(X)=H[N(0,\alpha )]=0.5\log _{2}(2\pi e\alpha )}$ bits. Additionally, I know that the random variable ${\displaystyle X|Y}$ is discrete with

${\displaystyle \mathrm {Prob} \{X={\sqrt {y}}\,|Y=y\}=1/2\ \mathrm {and} }$
${\displaystyle \mathrm {Prob} \{X=-{\sqrt {y}}\,|Y=y\}=1/2,}$

except when ${\displaystyle y=0}$, in which case ${\displaystyle \mathrm {Prob} \{X=0\,|Y=0\}=1}$. Hence, ${\displaystyle I(X,Y)=0.5\log _{2}(2\pi e\alpha )-1}$ bits.

Problem solved — except that now I can select ${\displaystyle \alpha }$ to make the first term less than one, making the mutual information negative. But the article tells me that mutual information can not be negative. This seems inconsistent.

I believe the inconsistency arises from the fact that ${\displaystyle X}$ and ${\displaystyle Y}$ are not jointly continuous random variables, because the support of the joint probability "density" is the curve ${\displaystyle y=x^{2}}$. More rigorously, the joint cumulative distribution function has a discontinuity on the curve ${\displaystyle y=x^{2}}$. This is evident in the fact that while the individual random variables are continuous, the conditional random variable ${\displaystyle X|Y}$ is discrete. Hence my argument subsequently mixed (discrete) entropy with differential entropy, two definitions that are not consistent.

Perhaps my belief is wrong. There may be a better explanation for the inconsistency, one which enables a fully consistent calculation in this example. (Perhaps the differential entropy of the conditional random variable ${\displaystyle X|Y}$ is ${\displaystyle -\infty }$? This would make the mutual information infinite, regardless of the value of ${\displaystyle \alpha }$, which would be consistent both with the requirement of nonnegativity and with the understanding of mutual information as an indication of degree of dependence.)

I suggest that the article be explicit about the conditions on the joint distribution in the case of continuous mutual information. In addition, I suggest that the section "Relation to Other Quantities" be explicit about when it is appropriate to use differential entropy as opposed to entropy. Finally, I suggest that the article explore the possible inconsistencies and limitations of mutual information for the continuous case. In particular, I should expect in my example that the mutual information indicate complete statistical dependence. Jamesmelody (talk) 19:44, 22 February 2009 (UTC)

The mutual information between two continuous random variables can be somewhat more rigorously explained by discretizing them, placing them in "bins", and then taking the limit as the bin size goes to zero. For example, if X and Y are continuous real-valued random variables, then their mutual information is
${\displaystyle I(X;Y)=\lim _{\delta \rightarrow 0}I\left(\left\lfloor {\frac {X}{\delta }}\right\rfloor ;\left\lfloor {\frac {Y}{\delta }}\right\rfloor \right).}$
Using this definition it can be easily seen that the (bivariate) mutual information can never be negative. In your example where one variable is completely determined by the other, i.e. ${\displaystyle Y=X^{2}}$, it makes sense that the mutual information would be infinite, since we can specify X to arbitrarily many digits of accuracy, and perfectly recover all those digits from the value of Y.
The most general rigorous and widely applicable definition of the mutual information is probably in terms of the Kullback–Leibler divergence of the joint distribution with respect to the product of the marginal distributions, which is defined (and remains finite) if and only if the joint distribution of the two random variables is absolutely continuous (in the sense of measures) with respect to the product of the marginals. Anyways, feel free to be bold and improve the article as you see fit. I'll try to help. Deepmath (talk) 22:48, 22 February 2009 (UTC)
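The binning limit described above can be illustrated for this very example. A rough sketch (the sample size, seed, and bin widths are arbitrary choices of mine, and the plug-in estimate is biased, so only the growth of the estimate as the bin width shrinks is the point):

```python
import math, random

def plugin_mi(pairs):
    """Plug-in MI estimate in bits from a list of discrete (x, y) outcomes."""
    n = len(pairs)
    cxy, cx, cy = {}, {}, {}
    for x, y in pairs:
        cxy[(x, y)] = cxy.get((x, y), 0) + 1
        cx[x] = cx.get(x, 0) + 1
        cy[y] = cy.get(y, 0) + 1
    return sum(c / n * math.log2(c * n / (cx[x] * cy[y])) for (x, y), c in cxy.items())

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(20000)]

# I( floor(X/d) ; floor(Y/d) ) with Y = X**2, for a shrinking bin width d
def binned_mi(d):
    return plugin_mi([(math.floor(x / d), math.floor(x * x / d)) for x in xs])

coarse, fine = binned_mi(1.0), binned_mi(0.125)
print(coarse, fine)  # the estimate keeps growing as d -> 0, consistent with I = +infinity
```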

## Scholarpedia

Possible source: [[1]]. Note that some of Scholarpedia is under copyright, so you can't just copy the content. —Preceding unsigned comment added by Njerseyguy (talkcontribs) 04:45, 7 July 2009 (UTC)

## p(x,y) = 0 ?

It seems to me that the joint probability function p(x,y) can be 0, for example when the values x and y never occur together. Can anyone explain how it is possible to calculate mutual information when p(x,y) = 0 for some x and y?

90.190.231.235 (talk) 12:15, 30 December 2009 (UTC)Siim

Priors or pseudo-counts are often used to iron out wrinkles like this and account for unobserved data. In this case the term p(x,y) log(p(x,y)/(p(x)p(y))) is taken to be 0 when p(x,y) = 0, since t log t → 0 as t → 0+; this is relatively trivial to prove using L'Hôpital's rule. --Paul (talk) 21:49, 30 December 2009 (UTC)
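To make the convention concrete, here is a small sketch with a structural zero in the joint table (the table is my own toy example); the zero cell simply contributes nothing and the MI stays finite:

```python
import math

def mi_bits(joint):
    """I(X;Y) in bits from a joint pmf {(x, y): p}; zero cells are skipped."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# x=0 and y=1 never occur together (a structural zero), yet the MI is finite:
joint = {(0, 0): 0.5, (1, 0): 0.25, (1, 1): 0.25}
print(mi_bits(joint))  # about 0.31 bits

# The convention 0*log(0) = 0 matches the limit t*log(t) -> 0 as t -> 0+:
print(1e-9 * math.log2(1e-9))  # essentially zero
```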

## Bogus Expectations

The conditional mutual information is defined using the expression

${\displaystyle \mathbb {E} _{Z}{\big (}I(X;Y)|Z{\big )}}$

There are two problems with this expression. First, ${\displaystyle I(X;Y)}$ is not a random variable, so taking its expectation is a no-op. Second, the expectation is conditioned on the random variable ${\displaystyle Z}$, which is not defined outside the scope of the expectation. This is how conditional expectation is defined (see e.g. [2])

${\displaystyle \mathbb {E} {\big (}X|Y)=\sum _{x}p(x|Y)x}$

Indeed, this expectation is itself a random variable.

The same problem occurs in the definition of the multivariate mutual entropy. —Preceding unsigned comment added by 128.114.60.41 (talk) 19:50, 9 March 2010 (UTC)
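For what it's worth, the expectation form and the entropy identity for conditional mutual information do agree numerically: the inner quantity I(X;Y|Z=z) is an ordinary function of z, so the outer expectation is well defined. A quick check on an arbitrary made-up joint pmf (all names and numbers here are mine):

```python
import math
from itertools import product

def H(pmf):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def marg(joint, axes):
    """Marginalize a tuple-keyed joint pmf onto the given axis positions."""
    out = {}
    for k, p in joint.items():
        key = tuple(k[a] for a in axes)
        out[key] = out.get(key, 0.0) + p
    return out

def mi2(j2):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a two-axis joint pmf."""
    return H(marg(j2, (0,))) + H(marg(j2, (1,))) - H(j2)

# An arbitrary strictly positive joint pmf over (x, y, z)
weights = [3, 1, 2, 2, 1, 4, 1, 2]
joint = {xyz: w / 16 for xyz, w in zip(product((0, 1), repeat=3), weights)}

# Route 1: E_Z[ I(X;Y | Z=z) ], averaging the per-z mutual information
exp_form = 0.0
for z in (0, 1):
    pz = sum(p for k, p in joint.items() if k[2] == z)
    cond = {(k[0], k[1]): p / pz for k, p in joint.items() if k[2] == z}
    exp_form += pz * mi2(cond)

# Route 2: the entropy identity I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(Z) - H(X,Y,Z)
ent_form = H(marg(joint, (0, 2))) + H(marg(joint, (1, 2))) - H(marg(joint, (2,))) - H(joint)

print(exp_form, ent_form)  # the two routes agree
```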

## Stochastic processes

Would some knowledgeable person please add some material on the mutual information of stochastic processes? Thanx! Rinconsoleao (talk) 14:12, 22 December 2010 (UTC)

## Multivariate mutual information equations

Can anyone verify the equations in the multivariate mutual information section?

This equation

${\displaystyle I(X_{1};\,...\,;X_{n})=I(X_{1};\,...\,;X_{n-1})-I(X_{1};\,...\,;X_{n-1}|X_{n}),}$

does not reduce to any of the equations above for the basic 2-variable mutual information if, for example, ${\displaystyle X_{1}=X}$ and ${\displaystyle X_{n}=X_{2}=Y}$.

Should the equation instead be this?

${\displaystyle I(X_{1};\,...\,;X_{n})=H(X_{1};\,...\,;X_{n-1})-H(X_{1};\,...\,;X_{n-1}|X_{n}),}$

Willkeim (talk) 13:43, 8 April 2012 (UTC) Willkeim

## Distance is "universal"

The claim that the distance D(X,Y) is "universal" is pretty flimsy. Here is the text from the article "An information-based sequence distance and its application to whole mitochondrial genome phylogeny", Ming Li et al., referred to in the Wikipedia article:

Now, consider any computable distance D. In order to exclude degenerate distances such as D(x, y) = 1/2 for all sequences x and y, we limit the number of sequences in a neighborhood of size d. Let us require for each x,
|{y : |y| = n and D(x, y) ≤ d}| ≤ 2^(dn). (2)
Assuming equation (2), we prove the following theorem.
THEOREM 2. For any computable distance D, there is a constant c < 2 such that, with probability 1, for all sequences x and y, d(x, y) ≤ cD(x, y).

In other words, the distance must be "computable" (which I expect is a very specific, application-dependent notion) and it must satisfy another very technical condition (2). So it is not very universal -- at least not without considerable qualification. "Qualified universal" is fairly like "adulterated pregnant" in my view. See also remarks at Talk:Variation_of_information. 129.132.211.9 (talk) 19:57, 22 November 2014 (UTC)

## Figure

The figure giving the mutual information of various scatterings of data looks like nonsense. It gives positive values for distributions that are separable. — Preceding unsigned comment added by 128.40.61.82 (talk) 12:24, 19 February 2016 (UTC)

Furthermore no indication is given as to how the mutual information was calculated from a data set. There are no mentions of estimators for the mutual information in the article. — Preceding unsigned comment added by 128.40.61.82 (talk) 12:30, 19 February 2016 (UTC)

I don't see that it's nonsense. The figures represent normalized probability distributions P[x,y] on presumably the same scale, and the mutual information is calculated directly from that, no "estimators" required. I don't understand the statement " It gives positive values for distributions that are separable." or in what sense that implies a problem. I agree that an explanation would be good, particularly describing the scale, and someone should verify the numbers. PAR (talk) 17:06, 24 February 2016 (UTC)
It's explained in the code included at https://en.wikipedia.org/wiki/File:Mutual_Information_Examples.svg: it uses mi.plugin from the entropy R package, which uses a histogram-based approach to MI calculation. It's the equivalent of the figure used on the correlation page and the distance covariance page. I don't really have any strong attachment to it. I made it to help me understand how MI worked on different datasets that I was already familiar with. naught101 (talk) 06:20, 25 February 2016 (UTC)
Looking at the numbers, the distribution at the top center looks like a normal distribution in 2 dimensions, and that mutual information should be zero, so I think all of the numbers are suspect. I think it is a great idea, a great plot, to compare the different notions of variance and entropy, if only the numbers were right. If you would rework it to fix those numbers, it would be an excellent contribution. PAR (talk) 11:13, 25 February 2016 (UTC)
Probably because n=1000, which is pretty low, and histogram binning is always a bit biased. I'm probably not going to fix it, because that would require writing a new MI estimator in R, and I don't have the time for that. There's an unbiased Python implementation of a k-nearest-neighbours estimator, so it would be possible to rebuild the script in Python around that, but again, time constraints. naught101 (talk) 04:26, 26 February 2016 (UTC)
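For later readers: the upward bias being discussed is easy to reproduce without R. A rough Python sketch of an equal-width histogram ("plug-in") estimator, applied to independent data whose true MI is zero (this is my own simplified stand-in in the spirit of mi.plugin, not its actual code):

```python
import math, random

def hist_mi(xs, ys, bins=10):
    """Equal-width histogram ('plug-in') MI estimate in bits."""
    n = len(xs)
    def bin_indices(vs):
        lo, hi = min(vs), max(vs)
        return [min(int((v - lo) / (hi - lo) * bins), bins - 1) for v in vs]
    bx, by = bin_indices(xs), bin_indices(ys)
    cxy, cx, cy = {}, {}, {}
    for a, b in zip(bx, by):
        cxy[(a, b)] = cxy.get((a, b), 0) + 1
        cx[a] = cx.get(a, 0) + 1
        cy[b] = cy.get(b, 0) + 1
    return sum(c / n * math.log2(c * n / (cx[a] * cy[b])) for (a, b), c in cxy.items())

random.seed(1)
xs = [random.gauss(0, 1) for _ in range(1000)]
ys = [random.gauss(0, 1) for _ in range(1000)]  # independent of xs, so true MI = 0

est = hist_mi(xs, ys)
print(est)  # small but strictly positive: the plug-in estimator is biased upward
```

This matches the complaint above that the figure shows positive values even for independent ("separable") distributions: the empirical counts almost never factorize exactly, and the plug-in estimate is nonnegative by construction.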

## Multivariate mutual information

The quantity introduced here has been called many things in the literature, but no publication I can find has ever referred to it as "multivariate mutual information". Wikipedia should NOT be the primary source defining new terms. This is a real problem that introduces confusion to the field. 130.209.89.69 (talk) 16:21, 25 April 2016 (UTC)

## Creating the maximum possible mutual information

As written, the article takes the nature of the model to be fixed. One consequence of fixing it is that the mutual information cannot be increased, yet increasing the mutual information is the aim of scientific research. This pitfall can be avoided by viewing the model as a procedure for making inferences and by optimizing these inferences information-theoretically. A consequence is that the maximum possible mutual information is created from fixed resources. Details are provided by Ronald Christensen in his "Entropy Minimax Source Book." Terry Oldberg (talk) 22:14, 10 July 2016 (UTC)

## Zero mutual information between two variables

The article states that the mutual information between two variables is 0 iff the two variables are independent and random, however I think there may be another case: when both variables can take only one and the same value (thus they always take that value with probability 1). I'm not sure if that's a reasonable edge case, but if it is, it ought to be noted. — Preceding unsigned comment added by 192.80.95.230 (talk) 21:09, 11 January 2017 (UTC)

The article says "I(X; Y) = 0 if and only if X and Y are independent random variables." This is correct. This is not the same as "... the two variables are independent and random", though "being random" is often interpreted in informal contexts to mean "being a random variable of uniform distribution" (or sometimes simply as "having an uncertain value"). These informal interpretations do not apply here. What is meant by a "random variable" (duly linked in the article) is a quantity that has an associated distribution over its values. The edge case described (fixed value) corresponds to a random variable in which the distribution has a form that constrains the value to a constant. It is nevertheless correct to still call it a "random variable". —Quondum 17:03, 6 June 2017 (UTC)
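The reply above can also be checked directly: a constant is a degenerate random variable that is independent of everything, so the "iff independent" statement already covers the raised edge case. A minimal sketch (the toy tables are mine):

```python
import math

def mi_bits(joint):
    """I(X;Y) in bits from a joint pmf {(x, y): p}; zero cells are skipped."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Independent fair bits: I = 0
indep = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}

# The edge case raised above: both variables always take the same single value.
# The constant variable is independent of everything (including itself), so the
# "I = 0 iff independent" statement already applies.
constant = {(7, 7): 1.0}

print(mi_bits(indep), mi_bits(constant))  # 0.0 0.0
```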

## Removed statement about exact relationship for binary data

In Mutual information § For discrete data, I removed the following passage:

In the special case where the number of states for both row and column variables is 2 (i, j = 1, 2), the degrees of freedom of the Pearson's chi-squared test is 1. Out of the four terms in the summation
${\displaystyle \sum _{i,j}p_{ij}\log {\frac {p_{ij}}{p_{i}p_{j}}}}$
only one is independent. It is the reason that mutual information function has an exact relationship with the correlation function ${\displaystyle p_{X=1,Y=1}-p_{X=1}p_{Y=1}}$ for binary sequences.[1]

My reasons are as follows:

• The source describes a rather specific circumstance: a bit sequence, in which the mutual information between bits of the same sequence separated by a fixed offset is being considered. This may be too esoteric for the article, as illustrated by the amount of information that would be needed to outline the case under consideration.
• An assumption appears to be that the statistics are stationary (i.e. that the joint probability of any pair of bit positions depends only on their separation, not on their absolute position) and symmetric; these are natural assumptions in settings of interest to some, but should not be used to suggest that there is anything special about binary sequences in the general case.
• The way in which the statement was presented, it comes across as implying what is demonstrably untrue: that with an arbitrary joint probability distribution (or contingency table) of two bits, there is a direct relationship between the stated correlation function (and others?) and the mutual information of the bits.
• I found (through calculation) that in the general setting I had to constrain the joint probability (which starts with three degrees of freedom, the only constraint being that the total probability of all four possible sequences is 1) to one degree of freedom artificially, to achieve the stated exact relationship.

Quondum 16:40, 6 June 2017 (UTC)

References

1. ^ Wentian Li (1990). "Mutual information functions versus correlation functions". J. Stat. Phys. 60 (5–6): 823–837. doi:10.1007/BF01025996.

## Jargon: Inconsistent Usage of Prepositions "of" and "from" Needs Explaining (or at least acknowledging with a comment)

I am a novice with information theory (though not with probability), but there seems to be an uncommented, glaring inconsistency in the jargon used across several different Wikipedia pages. Throughout the page Kullback–Leibler_divergence, "of" is attached to the distribution that is immediately to the left of the || in the mathematical notation, and "from" is applied to the distribution that is to the right of the ||. Whereas on the pages Mutual_information#Relation_to_Kullback–Leibler_divergence and Information_gain_in_decision_trees, this assignment of prepositions (used to construct the accompanying jargon terminology) is precisely reversed. Can anyone please explain or comment (or correct if one of these pages is in error)? Is this just another instance of those in the theoretical know making a big effort to maximize the impenetrability of their subject matter to novices, or am I being too cynical? -Thanks in advance. — Preceding unsigned comment added by 138.40.68.45 (talk) 15:47, 27 February 2019 (UTC)
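For concreteness, the asymmetry that makes the choice of preposition matter at all is easy to see numerically. A minimal sketch (the function name and the two distributions are mine):

```python
import math

def kl_bits(p, q):
    """Kullback-Leibler divergence D(p || q) in bits: E_p[ log(p/q) ]."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.5]
Q = [0.9, 0.1]

# The two argument orders genuinely differ, which is why it matters which
# distribution the words "of" and "from" attach to on each page.
print(kl_bits(P, Q))  # about 0.737 bits
print(kl_bits(Q, P))  # about 0.531 bits
```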

## Marginal Entropy or Individual Entropy?

In the section 'Relation to conditional and joint entropy' the description of the terms describes H(X) and H(Y) as 'marginal entropies' ("where H(X) and H(Y) are the marginal entropies"). But in the Venn diagram description, H(X) is described as the 'individual entropy' ("The circle on the left (red and violet) is the individual entropy H(X)"). It is confusing to someone not well versed in probability and statistics to use the two different labels 'marginal entropy' and 'individual entropy', because they won't know that the two terms refer to the same thing. I think H(X) and H(Y) should be referred to as 'marginal entropies' throughout, but a parenthetical such as '(The marginal entropy of X is the preferred name for the entropy of X, H(X), when contrasting it with conditional entropies.)' should be added. If no one objects, I'm happy to add it. Nick (talk) 19:40, 25 June 2019 (UTC)