Talk:Entropy (information theory)/Archive 1
This is an archive of past discussions about Entropy (information theory). Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.
Cross-entropy
The term "cross-entropy" directs to this page, yet there is no discussion of cross-entropy.
- Fixed. There is now a separate article on cross entropy. --MarkSweep 20:13, 13 Apr 2005 (UTC)
Move
It might be better to move this page to Shannon entropy instead of redirecting from there to this page. That way, this page can talk about other formulations of entropy, such as the Rényi entropy, and name and link to those pages from this one.
I agree with Vegalabs on this. --V79 19:33, 3 October 2005 (UTC)
zero'th order, first order, ...
The discussion of "order" is somewhat confusing; after reading Shaannon's paper I first thought the explanation here was incorrect, but now I see that the confusion comes from the difference between a "first-order Markov source" and what Shannon calls "the first-order approximation to the entropy."
Shannon says "The zeroth-order approximation is obtained by choosing all letters with the same probability and independtly. The first-order approximation is obtained by choosing letters independently but each letter having the same probability that it has in the natural language."
Thus using only the letter frequencies (that is, only single characters), the first-order entropy is
 H_1 = -\sum_i p_i \log_2 p_i ,
which is the exact entropy for a zeroth-order Markov source, namely one in which the symbol probabilities don't depend on the previous symbol.
I think the text should make this distinction clear, and will think on a way to edit it appropriately.
- Jim Mahoney 20:00, Apr 11, 2005 (UTC)
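For what it's worth, a minimal sketch of the "first-order" estimate discussed above (assuming a Python environment; the sample string is arbitrary): it uses only single-letter frequencies, so it is exact only for a zeroth-order Markov source.
 import math
 from collections import Counter
 
 def first_order_entropy(text):
     """Entropy estimate in bits/symbol using single-letter frequencies only."""
     counts = Counter(text)
     total = len(text)
     return -sum((c / total) * math.log2(c / total) for c in counts.values())
 
 print(first_order_entropy("the quick brown fox jumps over the lazy dog"))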
second axiom of entropy
Regarding the formula:
2) For all positive integers n, H satisfies
I would like to see some kind of hint that the left H has n arguments, whereas the right H has n+1 arguments. Perhaps an index at the H could do this, i.e.
alternatively I could imagine some kind of \underbrace construction, but I could not make it look right.
Why the term was invented
This page claims that the term "information entropy" was invented because the majority of people don't understand what entropy is, so anyone who used it would always have the advantage in debates. If this is true, it should be included in the article. --Eleassar777 10:41, 14 May 2005 (UTC)
Well, the act of naming the quantity "entropy" wasn't meant to be amusing or to confuse people. The page you link to cites only part of what was said; the full quotation of what Shannon said is the following (Sci. Am. 1971, 225, p. 180):
"My greatest concern was what to call it. I thought of calling it 'information,' but the word was overly used, so I decided to call it 'uncertainty.' When I discussed it with John von Neumann, he had a better idea. Von Neumann told me, 'You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, no one really knows what entropy really is, so in a debate you will always have the advantage.'"
I also disagree with the author of that page when he says that thermodynamic and information entropy are not the same. Indeed, the information theoretic interpretation has shed light on what thermodynamic entropy is. That being said (and what I think is the main point of the page), the information theoretic viewpoint is not always the easiest way to understand all aspects of thermodynamics. --V79 19:19, 3 October 2005 (UTC)
Layman with a question
I remember hearing that the amount of entropy in a message is related to the amount of information it carries. In other words, the higher the entropy of a message, the more information it has relative to the number of bits. (For example, would this mean that in evolution, the entropy of a string of base pairs actually increases over time?) Is there any truth to this? Keep in mind I flunked Calc II three times running, so please keep it simple and use short words. crazyeddie 06:58, 19 Jun 2005 (UTC)
There is quite a lot of truth to that -- when viewed the right way. When I receive a message, each letter gives me some more information. How uncertain I was of the message in the first place (quantified by the entropy) tells me how much information I will gain (on average) when actually receiving the message.
Not knowing the details of DNA and evolution, I assume that each of the four base pairs is equally likely, i.e. probability of 0.25. The entropy is then −4 × 0.25 log 0.25 = 2 bits per base (which is intuitive since there are 4 ways of combining two bits, 00, 01, 10, 11, which can represent A, G, U, C). But this cannot be increased, so evolution cannot increase the entropy of the base pair string. This is because information in the information theoretic sense doesn't say anything about the usefulness of the information. The junk DNA that is thought to be merely random base pairs outside the genes contains as much information per base as the genes themselves. You can also say that while evolution "adds" some information by changing some base pairs it also "removes" the information about what was there before, giving no net change. --V79 20:04, 3 October 2005 (UTC)
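Written out, the uniform-base calculation above is:
 H = -\sum_{i \in \{A,C,G,U\}} \tfrac{1}{4} \log_2 \tfrac{1}{4} = -4 \times \tfrac{1}{4} \times (-2) = 2 \text{ bits per base}.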
-Using this (what I consider sloppy) terminology, a larger genome will imply an increase in uncertainty. Some plants have twice as many base-pairs in their genome as humans, for instance. Eric
- JA: I made some attempt to explain the entropy/uncertainty formula from the ground up at the article on Information theory. Would be interested in any feedback on whether that helps any. Jon Awbrey 21:52, 4 February 2006 (UTC)
Suggestion for Introduction
The following sentence - The entropy rate of a data source means the average number of bits per symbol needed to encode it - which is currently found in the body of the article, really ought to be included in some form in the introductory paragraph. This is, essentially, the layman's definition of the concept, and provides an excellent introduction to what the term actually means; it is also an excellent jumping-off point into more abstract discussion of Shannon's theory.
Graph is wrong
The graph that opens this article is wrong: the max entropy should be ln 2, not 1.0. Let's work it out: for p=1/2, we have
- H = -(p log p + (1-p) * log (1-p)) = -log(1/2) = log 2
linas 04:12, 9 October 2005 (UTC)
- But the logarithm used in the definition of the entropy is base 2, not e. Therefore log 2 = 1. Brona 01:42, 10 October 2005 (UTC)
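For reference, the binary entropy function behind the graph, with its maximum written in both bases:
 H(p) = -p \log_2 p - (1-p) \log_2 (1-p), \qquad H(\tfrac{1}{2}) = \log_2 2 = 1 \text{ bit} = \ln 2 \approx 0.693 \text{ nats}.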
Problem with definition
I am confused by the definition given in the article, and believe that parts of it are wrong:
- Claude E. Shannon defines entropy in terms of a discrete random event x, with possible states 1..n as:
- That is, the entropy of the event x is the sum, over all possible outcomes i of x, of the product of the probability of outcome i times the log of the probability of i (which is also called s's surprisal - the entropy of x is the expected value of its outcome's surprisal). We can also apply this to a general probability distribution, rather than a discrete-valued event.
First, there's a reference to "s's surprisal", but the variable "s" has not been defined. I suspect that it is supposed to be "i", but I'm not familiar enough with the material to make the change.
Second, the way I read the definition, it doesn't matter what the actual outcome is, all that matters is the collection of possible outcomes. I'm pretty sure that this is wrong. I'm probably just confused by the terminology used, but in that case, someone who understands this topic should try to rewrite it in a way that is more understandable to a layman. AdamRetchless 18:10, 21 October 2005 (UTC)
- Indeed, all that matters are the outcomes and their probabilities. The formula above intends to define the information generated by an experiment (for instance taking a coloured ball out of a vase that contains balls with several colours) before the experiment is actually performed. So the specific outcome is unknown. But what we do know is: if the outcome is i (which happens with probability p_i), then the information that we get is -\log p_i. Bob.v.R 00:16, 13 September 2006 (UTC)
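In symbols, the relationship described above (entropy as the expected surprisal of the outcome), with a two-colour vase as a made-up worked example:
 I(i) = -\log_2 p_i , \qquad H(X) = \mathrm{E}[I(X)] = -\sum_{i=1}^{n} p_i \log_2 p_i ;
 \text{e.g. for } p = (0.9,\ 0.1): \quad H \approx 0.9 \times 0.152 + 0.1 \times 3.32 \approx 0.47 \text{ bits}.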
The definition given at MathWorld makes a lot more sense to me.
AdamRetchless 18:17, 21 October 2005 (UTC)
I am confused about the statement in this article that "since the entropy was given as a definition, it does not need to be derived." Surely this implies that it could have been defined differently and still have the same properties - I don't think this is true. The justification given in the article derives it based on H = log(Omega), but the origin of this log(Omega) is not explained. —The preceding unsigned comment was added by 139.184.30.17 (talk • contribs) .
minimum
The statement at the end of the second paragraph is simply not true: "the shortest number of bits necessary to transmit the message is the Shannon entropy in bits/symbol multiplied by the number of symbols in the original message." -- the Shannon entropy in bits/symbol multiplied by the number of symbols does not give the shortest possible length of a particular message! The original should be replaced with something like the "shortest possible representation".—Preceding unsigned comment added by 139.149.31.232 (talk • contribs)
Units and the Continuous Case
The extension to the continuous case has a subtle problem: the distribution f(x) has units of inverse length and the integral contains "log f(x)" in it. Logarithms should be taken on dimensionless quantities (quantities without units). Thus, the logarithm should be of the ratio of f(x) to some characteristic length L. Something like log [ f(x) / L ] would be more proper.
The problem with taking a transcendental function of a quantity with units arises from the way we define arithmetic operations for quantities with units. 5 m + 2 m is defined (5 m + 2 m = 7 m) but 5 m + 2 kg is not defined because the units are different among the quantities to be added. Transcendental functions (such as logarithms), of a variable x with units, present problems for determining the resulting units of the results of the functions of x. This is why scientists and engineers try to form ratios of quantities in which all the units cancel, and then apply transcendental functions to these ratios rather than the original quantities. As an example, in exp[-E/(kT)] the constant k has the proper units for canceling the units of energy E and temperature T so units cancel in the quantity E/(kT). Then the result of the operation, of a typical transcendental function on its dimensionless argument, is also dimensionless.
My suggested solution to the problem with the units raises another question: what choice of length L should be used in the expression log [ f(x) / L ]? I think any choice can work. —The preceding unsigned comment was added by 75.85.88.234 (talk) 18:06, 17 December 2006 (UTC).
- For canceling the inverse unit of length (actually the inverse unit of x), there should appear a product of f(x) and a length L under the logarithm, i.e. log [ f(x) L ]. This would be, indeed, bizarre, as any length L would work - unless we are in the frame of quantum mechanics. In that case, we would simply use the smallest quantumly distinguishable value for L. If x is truly a length, then L could be Planck's length. But this is already too obfuscating for me. I would rather recommend concentrating on the discrete formula of entropy: S = Sum [ p(i) log p(i) ]. Now, in the continuous case, the probability is infinitesimal and it is dP = f(x) dx. Thus, the exact transcription of the above formula with this probability would give S = Sum [ f(x) dx log ( f(x) dx ) ]. Now Sum would become Integral and log ( f(x) dx ) is a functional which must take a form of L(x) dx. The worst problem now is that there are two dx under one integral. This problem appears in the above modified formula for S. This problem must be worked out somehow. Its source is in the product in the initial Shannon entropy.
- If you want to work with continuous variables, you're on much stronger ground if you work with the relative entropy, ie the Kullback-Leibler distance from some prior distribution, rather than the Shannon entropy. This avoids all the problems of the infinities and the physical dimensionality; and often, when you think it through, you'll find that it may make a lot more sense philosophically in the context of your application, too. Jheald 19:32, 7 February 2007 (UTC)
- Of course, the relative entropy is very good for the continuous case, but, unlike Shannon entropy, it is relative, as it needs a second distribution from which to depart. I was thinking of a formula that would give a good absolute entropy, similar to the Shannon entropy, for the continuous case. This is purely speculative, though. —The preceding unsigned comment was added by 193.254.231.71 (talk) 13:52, 8 February 2007 (UTC).
Extending discrete entropy to the continuous case: differential entropy
The last definition of the differential entropy (second last formula) seems to be incorrect. Actually, it should read
h[f] = lim (Delta -> 0) [ H^Delta + log Delta * Sum [ f(xi) Delta ] ]
This would ensure the complete canceling of the second sum in H^Delta. With the current formula, there would remain a non-canceling term:
h[f] = lim (Delta -> 0) [ H^Delta + log Delta ] = -Integral[ f(x) log f(x) dx ] - lim (Delta -> 0) [ log Delta * ( Sum [ f(xi) Delta ] - 1 ) ] .
The last limit does not go to zero. Actually, through a l'Hopital applied to (1-Sum) / (1/log Delta) , it would go to
- lim (Delta -> 0) [ Delta (log Delta)^2 Sum[f(xi)] ],
and, as Delta -> 0, Sum[f(xi)] -> infinity as 1/Delta (since Sum[f(xi) Delta] -> 1), so it would cancel the first Delta in the limit above, and there would be only
- lim (Delta -> 0) [ (log Delta)^2 ] -> - infinity
Thus, the last definition of h[f] could not even be used. I recommend checking with a reliable source on this, and then, maybe, if that formula is wrong, its erasure. Unfortunately, I have no knowledge of the way formulas are written in Wikipedia (yet).
Roulette Example
In the roulette example, the entropy of a combination of numbers hit over P spins is defined as Omega/T, but the entropy is given as lg(Omega), which then calculates to the Shannon definition. Why is lg(Omega) used? (Note: I'm using the notation "lg" to denote "log base 2") 66.151.13.191 20:41, 31 March 2006 (UTC)
moved to talk page because wikipedia is not a textbook
Derivation of Shannon's entropy
Since the entropy was given as a definition, it does not need to be derived. On the other hand, a "derivation" can be given which gives a sense of the motivation for the definition as well as the link to thermodynamic entropy.
Q. Given a roulette with n pockets which are all equally likely to be landed on by the ball, what is the probability of obtaining a distribution (A1, A2, …, An), where Ai is the number of times pocket i was landed on and
 P = \sum_{i=1}^{n} A_i
is the total number of ball-landing events?
A. The probability is a multinomial distribution, viz.
 p = \frac{\Omega}{N},
where
 \Omega = \frac{P!}{A_1! \, A_2! \cdots A_n!}
is the number of possible combinations of outcomes (for the events) which fit the given distribution, and
 N = n^P
is the number of all possible combinations of outcomes for the set of P events.
Q. And what is the entropy?
A. The entropy of the distribution is obtained from the logarithm of Ω:
 \eta = \ln \Omega = \ln P! - \sum_{i=1}^{n} \ln A_i!
The summations can be approximated closely by being replaced with integrals:
 \ln A! = \sum_{j=1}^{A} \ln j \approx \int_1^A \ln x \, dx.
The integral of the logarithm is
 \int_1^A \ln x \, dx = A \ln A - A + 1.
So the entropy is
 \eta \approx \left( P \ln P - P + 1 \right) - \sum_{i=1}^{n} \left( A_i \ln A_i - A_i + 1 \right) = P \ln P - \sum_{i=1}^{n} A_i \ln A_i + (1 - n).
By letting p_i = A_i / P and doing some simple algebra we obtain:
 \eta \approx -P \sum_{i=1}^{n} p_i \ln p_i + (1 - n),
and the term (1 − n) can be dropped since it is a constant, independent of the p_i distribution. The result is
 \eta \approx -P \sum_{i=1}^{n} p_i \ln p_i .
- (Isn't factor of P dropped in the formula above?) —Preceding unsigned comment added by 128.200.203.33 (talk) 22:43, 18 November 2008 (UTC)
- Not sure what you mean. At first glance it looks good to me. CRETOG8(t/c) 23:13, 18 November 2008 (UTC)
Thus, the Shannon entropy is a consequence of the equation
 H = \frac{\ln \Omega}{P},
which relates to Boltzmann's definition,
 S = k \ln \Omega,
of thermodynamic entropy, where k is the Boltzmann constant.
—The preceding unsigned comment was added by MisterSheik (talk • contribs) 17:34, 1 March 2007.
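A quick numerical sanity check of the derivation above (a sketch assuming a Python environment; the 3:1 split is an arbitrary example): the exact value of (log₂ Ω)/P approaches −Σ pᵢ log₂ pᵢ as P grows.
 import math
 
 def log2_multinomial(counts):
     """Exact log2 of the multinomial coefficient P! / (A_1! ... A_n!)."""
     omega = math.factorial(sum(counts))
     for a in counts:
         omega //= math.factorial(a)
     return math.log2(omega)
 
 h = -(0.75 * math.log2(0.75) + 0.25 * math.log2(0.25))  # ~0.811 bits
 for P in (4, 40, 400, 4000):
     counts = [3 * P // 4, P // 4]
     print(P, log2_multinomial(counts) / P, h)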
H(X), H(Ω), and the word 'outcome'
Recent edits to this page now stress the word "outcome" in the opening sentence:
- information entropy is a measure of the average information content associated with the outcome [emphasised] of a random variable.
and have changed formulas like
to
There appears to have been a confusion between two meanings of the word "outcome". Previously, the word was being used on these pages in a loose, informal, everyday sense to mean "the range of the random variable X" -- i.e. the set of values {x1, x2, x3, ...} that might be revealed for X.
But "outcome" also has a technical meaning in probability, meaning the possible states of the universe {ω1, ω2, ω3, ...}, which are then mapped down onto the states {x1, x2, x3, ...} by the random variable X (considered to be a function mapping Ω -> R).
It is important that the mapping X may in general be many-to-one: so H(X) and H(Ω) are not in general the same. In fact we can say definitely that H(X) <= H(Ω), with equality holding only if the mapping is one-to-one over all subsets of Ω with non-zero measure (the "data processing theorem").
The correct equations are therefore
 H(X) = -\sum_{i} p(x_i) \log_2 p(x_i)
or
 H(\Omega) = -\sum_{\omega \in \Omega} p(\omega) \log_2 p(\omega).
But in general the two are not the same. -- Jheald 11:37, 4 March 2007 (UTC).
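A small sketch of the H(X) <= H(Ω) point (assuming a Python environment; the three-element sample space and the mapping are made up for illustration):
 import math
 from collections import defaultdict
 
 def entropy(probs):
     return -sum(p * math.log2(p) for p in probs if p > 0)
 
 # Three equally likely outcomes in the sample space Omega.
 p_omega = {"w1": 1/3, "w2": 1/3, "w3": 1/3}
 
 # A many-to-one random variable X: w1 and w2 map to the same value.
 X = {"w1": 0, "w2": 0, "w3": 1}
 
 p_x = defaultdict(float)
 for w, p in p_omega.items():
     p_x[X[w]] += p
 
 print(entropy(p_omega.values()))  # H(Omega) = log2(3) ~ 1.585 bits
 print(entropy(p_x.values()))      # H(X)     ~ 0.918 bits, strictly smaller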
Sorry, I don't get it
Self-information of an event is a number, right? Not a random variable. Yes?
So how can entropy be the expectation of self-information? I sort-of understand what the formula is coming from, but it doesn't look theoretically sound... Thanks. 83.67.217.254 13:19, 4 March 2007 (UTC)
Ok, maybe I understand. I(omega) is a number, but I(X) is itself a random variable. I have fixed the formula. 83.67.217.254 13:27, 4 March 2007 (UTC)
Uh-oh, what have I done? "Failed to parse (Missing texvc executable; please see math/README to configure.)" Could you please fix? Thank you. 83.67.217.254 13:30, 4 March 2007 (UTC)
Compression of English Text
If I take the text of the book "Uncle Tom's Cabin", http://etext.lib.virginia.edu/etcbin/toccer-new2?id=StoCabi.sgm&images=images/modeng&data=/texts/english/modeng/parsed&tag=public&part=all , it's about a megabyte of text. If I compress it using winzip I get 395K bytes. bzip2: 295KB. paq8l 235KB. This isn't normal English text, but I think you get the idea. Daniel.Cardenas 19:06, 13 May 2007 (UTC)
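A rough version of that rule-of-thumb estimate, as a sketch (assuming a Python environment and a local plain-text copy of the book; the filename is a placeholder):
 import bz2
 
 with open("uncle_toms_cabin.txt", "rb") as f:  # placeholder filename
     data = f.read()
 
 compressed = bz2.compress(data, compresslevel=9)
 # Compressed bits per original character: an upper bound on the entropy rate.
 print(8 * len(compressed) / len(data), "bits per character")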
- Compression software does give a nice rule-of-thumb entropy estimate, but in this case the actual entropy is a lot lower because compression software designed for general-purpose use doesn't have the extensive knowledge of the language that allows humans to see more redundancy in the text. More rigorous experiments usually show lower entropy rates for English, typically between 1.0 and 1.5 bits per character, as described in the reference I've added. 129.97.79.144 19:23, 21 May 2007 (UTC)
- Thanks, that was a good one. :-) Daniel.Cardenas 19:35, 21 May 2007 (UTC)
Entropy of English text
The article currently says "The entropy of English text is between 1.0 and 1.5 bits per letter.". Shouldn't the entropy in question decrease as one discovers more and more patterns in the language, making a text more predictable? If so, I think it would be a good idea to be a little less precise, saying "The entropy of English text can be regarded as being between 1.0 and 1.5 bits per letter." or similar instead. —Bromskloss 11:43, 7 June 2007 (UTC)
- No, that's like saying "The sum of 2 plus 2 can be regarded as 4." Entropy has a precise mathematical definition. It isn't just possible to "regard" it as having an exact value, it actually does have an exact value. At most it can be said that entropy is hard to measure, which (along with differences between receivers and in what's called "English") is the reason a range instead of a single value is given. It's true that knowing more about the language (i.e. having more ability to predict the text) decreases the entropy; the studies on which the referenced statement is based are generally assuming something like the average user of English. Anyway, the statement in the article is what's in the reference and it's not appropriate for us to second-guess it. 216.75.189.154 13:18, 26 June 2007 (UTC)
Boltzmann's lectures on entropy
Since entropy was formally introduced by Ludwig Boltzmann, the article should refer to his work:
Boltzmann, Ludwig (1896, 1898). Vorlesungen über Gastheorie : 2 Volumes - Leipzig 1895/98 UB: O 5262-6. English version: Lectures on gas theory. Translated by Stephen G. Brush (1964) Berkeley: University of California Press; (1995) New York: Dover ISBN 0-486-68455-5
—The preceding unsigned comment was added by Algorithms (talk • contribs) 19:35, 7 June 2007.
log basis
Hmmm, this article seems to assume that logs must always be taken to base 2 - which is not the case. We can define entropy to whatever base we like (in coding it often makes things easier to define it to a base equal to the number of code symbols, which in computer science is typically 2). This leads to different units of measurements: bits vs. nats vs. hartleys.
The article should probably be modified to reflect this HyDeckar 01:16, 13 June 2007 (UTC)
Mistake inside an external reference
Regarding the reference: Information is not entropy, information is not uncertainty ! - a discussion of the use of the terms "information" and "entropy".
The referenced article is mistaken. It refutes the claim that "information is proportional to physical randomness". However, the more random a system is, the more information we need in order to describe it. I suggest we remove this reference.
—The preceding unsigned comment was added by 89.139.67.125 (talk) 07:32, 13 June 2007
- I agree. That reference reads more like a rant than a discussion. Its author appears to lack some basic understanding of thermodynamic vs. information-theoretic entropy. The above comment is absolutely correct in that "the more random a system is the more information we need in order to describe it." 198.145.196.71 16:36, 25 September 2007 (UTC)
Looking for reference
I'm looking for reliable, hard references for the following phrase in the article:
"Shannon's entropy measures the information contained in a message as opposed to the portion of the message that is determined (or predictable). Examples of the latter include redundancy in language structure or statistical properties relating to the occurrence frequencies of letter or word pairs, triplets etc. See Markov chain."
I'm sorry if the above concept is a bit basic and present in basic textbooks. I have not studied the subject formally, but I may have to apply the entropy concept in a small analysis for my master's dissertation.
Units in the continuous case
I think there needs to be some explanation on the matter of units for the continuous case.
f(x) will have the unit 1/x. Unless x is dimensionless, the unit of entropy will include the log of a unit, which is weird. This is a strong reason why it is more useful for the continuous case to use the relative entropy of a distribution, where the general form is the Kullback-Leibler divergence from the distribution to a reference measure m(x). It could be pointed out that a useful special case of the relative entropy is:
which should correspond to a rectangular distribution of m(x) between xmin and xmax. It is the entropy of a general bounded signal, and it gives the entropy in bits.
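Presumably the special case meant above is the (negated) Kullback-Leibler divergence from a uniform reference measure m(x) = 1/(x_max − x_min) on the support, which makes the argument of the logarithm dimensionless and gives bits when the logarithm is base 2 (a sketch of that reading):
 -D_{\mathrm{KL}}(f \,\|\, m) = -\int f(x) \log_2 \frac{f(x)}{m(x)} \, dx = -\int f(x) \log_2\!\big( f(x)\,(x_{\max} - x_{\min}) \big) \, dx .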
Entropy vs Entropy Rate
not sure about the section `Limitations of entropy as information content'.
quote Consider a source that produces the string ABABABABAB... in which A is always followed by B and vice versa. If the probabilistic model considers individual letters as independent, the entropy rate of the sequence is 1 bit per character. But if the sequence is considered as "AB AB AB AB AB..." with symbols as two-character blocks, then the entropy rate is 0 bits per character. endquote
the average number of bits needed to encode this string is zero (asymptotically)
also, treating this as a markov chain (order 1), we can see from the formula in http://en.wiki.x.io/wiki/Entropy_rate and also in this article that the entropy rate is 0
also in the next paragraph quote However, if we use very large blocks, then the estimate of per-character entropy rate may become artificially low. endquote
isn't the `per-character entropy rate' redundant? should be either the `per-character entropy' or the `entropy rate' —Preceding unsigned comment added by 71.137.215.129 (talk) 07:23, 16 January 2008 (UTC)
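A minimal sketch of the model dependence being discussed (assuming a Python environment; the block length is a free parameter): the same ABAB... string gives different per-character estimates depending on how it is chopped into symbols.
 import math
 from collections import Counter
 
 def per_character_entropy(text, block=1):
     """Entropy of the empirical block distribution, divided by the block length."""
     blocks = [text[i:i + block] for i in range(0, len(text) - block + 1, block)]
     counts = Counter(blocks)
     total = len(blocks)
     h = -sum((c / total) * math.log2(c / total) for c in counts.values())
     return h / block
 
 s = "AB" * 1000
 print(per_character_entropy(s, block=1))  # 1.0 bit/char under an independent-letter model
 print(per_character_entropy(s, block=2))  # 0.0 bits/char under a two-character-block model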
Uncertainty
Since "uncertainty" (whatever that may mean) is used as a motivating factor in this article, it might be good to have a brief discussion about what is meant by "uncertainty." Should the reader simply assume the common definition of uncertainty? Or is there a specific technical meaning to this word that should be introduced? —Preceding unsigned comment added by 131.215.7.196 (talk) 19:41, 27 January 2008 (UTC)
- The article states: "Equivalently, the Shannon entropy is a measure of the average information content the recipient is missing when he does not know the value of the random variable." This has also been interpreted as an uncertainty in a system, not a measure of the information.
- This interpretation is valid if we are sending a message from a sender to a receiver along a noisy channel, which may make the message uncertain. But there is an alternative interpretation where information entropy is hardly a measure of uncertainty.
- For instance if we replace a generation with Gaussian distributed quantitative characters of one billion individuals in a large population with a new generation, the situation is quite different. This is like sending one billion different Gaussian distributed messages in parallel from parents to offspring. Every new message is a random – noisy - recombination of messages from two randomly chosen parents, for instance.
- As I see it, there is per definition no uncertainty with respect to the survival of the parents, and a moment matrix of their characters may as well exist. Thus a Gaussian distribution may serve as a good approximation of the region of acceptability, A, determining the possible spread of parents along A. See also the article about "Entropy in thermodynamics ...". --Kjells (talk) 13:30, 8 June 2008 (UTC)
Limitations of entropy as information content
This section needs a major rewrite. It correctly states that Shannon entropy depends crucially on a probabilistic model. Several important points need to be made, though.
- When we are talking about the information content of an individual message, we are talking about its self-information, not entropy. Entropy is a measure of the complexity of the whole probability distribution, not of an individual message. Entropy is the expected self-information of a message, given our probabilistic model.
- The Kolmogorov complexity is, as stated, a measure of the complexity of an individual message, independent of any probability distribution, however it is only defined up to an additive constant, which depends on the specific model of computation chosen.
- Nonetheless the information entropy provides a lower bound on the expected Kolmogorov complexity of a message, i.e.:
 H(X) \le \mathrm{E}[K(X)] + c ,
where c is a constant depending on the chosen model of computation. Such a bound would be extremely difficult to obtain in the case of a single message, due to the halting problem. Deepmath (talk) 21:28, 15 July 2008 (UTC)
The example given about the sequence ABABAB... sounds like utter nonsense to me: a source that always produces the same sequence has entropy 0, regardless of whether the sequence consists of a single symbol or not. For instance, the sequence of integers produced by counting from 0 has entropy 0, even though each symbol (integer) is different. —Preceding unsigned comment added by 99.65.138.158 (talk) 19:30, 21 January 2010 (UTC)
Name change suggestion to alleviate confusion
I suggest renaming this article to either "Entropy (information theory)", or preferably, "Shannon entropy". The term "Information entropy" seems to be rarely used in a serious academic context, and I believe the term is redundant and unnecessarily confusing. Information is entropy in the context of Shannon's theory, and when it is necessary to disambiguate this type of information-theoretic entropy from other concepts such as thermodynamic entropy, topological entropy, Rényi entropy, Tsallis entropy etc., "Shannon entropy" is the term almost universally used. For me, the term "information entropy" is too vague and could easily be interpreted to include such concepts as Rényi entropy and Tsallis entropy, and not just Shannon entropy (which this article exclusively discusses). Most if not all uses of the term "entropy" in some sense quantify the "information", diversity, dissipation, or "mixing up" that is present in a probability distribution, stochastic process, or the microstates of a physical system.
I would do this myself, but this article is rather frequently viewed, so I am seeking some input first. Deepmath (talk) 01:29, 23 August 2008 (UTC)
- Support - Sounds sensible to me. I favour the name "Entropy (information theory)" rather than "Shannon entropy" - because that will make it clearer to newbies that this is where they come to find out what the unqualified word "entropy" means when they come across it in an information-theory context. Very often "entropy" is discussed in the literature without specifying that it's "Shannon entropy", even in many cases where the discussion only applies to Shannon entropy. --mcld (talk) 08:36, 23 August 2008 (UTC)
- Yes, and the article could begin with "In information theory, entropy is a measure of [...]. Several types of entropy can be introduced, the most common one is Shannon entropy, defined as [...]. Other definitions include Rényi entropy, which is a generalization of Shannon entropy, [...]. Throughout this article, the unqualified word entropy will refer to Shannon entropy.", or something similar. --A r m y 1 9 8 7 ! ! 10:59, 23 August 2008 (UTC)
- Thank you both for the input. If no major objections are forthcoming in the next few days, I say we go ahead with the move to "Entropy (information theory)". However, I counted 403 articles that link here, excluding user and talk pages. Is there a bot somewhere we could use to at least pipe those links to the new article title? That would let the people watching those articles about the new title and avoid all those annoying redirects for people who are just browsing Wikipedia. I guess I'm just not familiar with what's generally done in cases like this. Deepmath (talk) 21:05, 23 August 2008 (UTC)
- Moving the page is quite easy, see WP:MV for a howto. A redirect will automatically be put in place. In general there's no need to go around fixing the 403 articles - they will gradually get fixed, either by bots or by users, and most users won't even notice the difference since the redirect thing happens so transparently. Some things will need tweaking I think, but nothing like 403. But that link I gave has all the info. --mcld (talk) 13:40, 25 August 2008 (UTC)
- Entropy (information theory) already has an edit history. Deepmath (talk) 00:46, 27 August 2008 (UTC)
- Still can't do the move, even though I tried to move the old page out of the way. An administrator needs to do this. Deepmath (talk) 06:42, 27 August 2008 (UTC)
- Already having an edit history isn't a valid reason not to move - the edit history would just have to be copied to the talk page to preserve it. Dcoetzee 04:27, 27 August 2008 (UTC)
- I'm not trying to suggest it is. The pre-existing edit history for the target page simply makes it technically more difficult to accomplish the move without an administrator's help. I'll see what I can do. Deepmath (talk) 06:22, 27 August 2008 (UTC)
- Oh, I see now. :-) I'm an admin and I'll do the move once the discussion settles (has it already?) Dcoetzee 07:09, 27 August 2008 (UTC)
- seems pretty settled to me... --mcld (talk) 08:22, 19 September 2008 (UTC)
- The article was moved, indeed. I'm adding a {{resolved}} tag at the top. A r m y 1 9 8 7 ! ! ! 10:00, 19 September 2008 (UTC)
- oops ok, thanks --mcld (talk) 14:27, 19 September 2008 (UTC)
- And by the way, as to Army1987's suggestions, it might be a little confusing for newbies to talk about Rényi entropy right in the intro. I did try to edit the intro a little, but if you feel you can word things there a little more clearly, please go right ahead. Or perhaps a section later in the article about generalizations of Shannon's entropy would be more appropriate for mentioning Rényi entropy. Also by the way, I read Hartley's 1928 paper "Transmission of Information" after somebody posted a link to it in the Information theory article. This guy was apparently the first one to recognize that the amount of information that could be transmitted was proportional to the logarithm of the number of choices available. He did not attempt to analyze the mathematics behind unequal probability distributions like Shannon did, but he basically invented the concept of "bandwidth" as we know it today: that the rate of information that can be transmitted over a continuous channel is proportional to the width of the range of frequencies that one is allowed to use. And the formula for unequal probability distributions, S = -k \sum_i p_i \ln p_i, was already known to Boltzmann and Gibbs from their study of the entropy discovered by Clausius based upon Carnot's work improving the efficiency (and theoretical understanding) of steam engines. Deepmath (talk) 22:03, 23 August 2008 (UTC)
Scientists make simple things complicated
Very many scientists like to make simple things complicated and earn respect for this. Information entropy is a very good example of such an attempt. Actually entropy is only the number of possible permutations, expressed in bits and divided by the length of the message. And the concept is simple as well. For a given statistical distribution of symbols we can calculate the number of possible permutations and enumerate all messages. If we do that, we can send the statistics and the index of the message in the enumeration list instead of the message, and the message can be restored. But the index of the message has a length as well, and it can be very long, so we consider the worst-case scenario and take the longest index, that is, the number of possible permutations. For example, suppose we have a message with symbols A, B, C, 1000 symbols long, with statistics 700, 200 and 100. The number of possible permutations is (1000!) / (700! * 200! * 100!). The approximate bit length of this number divided by the number of symbols is (log(1000!) – log(700!) – log(200!) – log(100!))/1000 = 1.147 bits/symbol, where all logarithms have base 2. If you calculate the entropy it is 1.157. The figures are close and they asymptotically approach each other with the growing size of the message. The limits are explained by Stirling's formula, so there is no trick, just approximation. Obviously, when writing his famous article Claude Shannon did not have an idea of what is going on and could not explain clearly what the entropy is. He simply noticed that in compression by making binary trees similar to the Huffman tree the bit length of a symbol is close to –log(p) but always larger, and introduced entropy as a compression limit without clear understanding. The article was published in 1948 and the Huffman algorithm did not exist, but there were other similar algorithms that produced slightly different binary trees with the same concept as the Huffman tree, so Shannon knew them. What is surprising is not Shannon's entropy but the other scientists who have used obscure and misleading terminology for 60 years. Entropy is a measure of the number of different messages that can possibly be constructed with a constraint given as a frequency for every symbol; that is all, simple and clear. —Preceding unsigned comment added by 63.144.61.175 (talk) 17:47, 24 June 2008 (UTC)
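The figures quoted above do check out; a quick verification sketch (assuming a Python environment), using the log-gamma function for the factorials:
 import math
 
 counts = [700, 200, 100]
 P = sum(counts)
 
 # Bits per symbol needed to index one of the Omega = P!/(700! 200! 100!) permutations.
 log2_omega = (math.lgamma(P + 1) - sum(math.lgamma(a + 1) for a in counts)) / math.log(2)
 print(log2_omega / P)  # ~1.147 bits/symbol
 
 # Shannon entropy of the same symbol frequencies.
 print(-sum((a / P) * math.log2(a / P) for a in counts))  # ~1.157 bits/symbol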
Ok, so you're angry and think Claude Shannon sucks. Even as I type I realize this is a pointless post, but seriously, expressing it unambiguously in mathematical terms that are irrefutable is essential, especially in a subject area such as this. 85.224.240.204 (talk) 02:15, 25 November 2008 (UTC)
This is the way that Kardar introduced the information entropy in his book Statistical Physics of Particles. There is also a wikibook in the external links named An Intuitive Guide to the Concept of Entropy Arising in Various Sectors of Science, to which this kind of opinion might be contributed. Tschijnmotschau (talk) 09:02, 3 December 2010 (UTC)
Missing figure for continuous case
The section about the entropy of a continuous function refers to a figure, but no figure is present. —Preceding unsigned comment added by Halberdo (talk • contribs) 17:10, 22 December 2008 (UTC)
The corresponding text apparently was added almost three years ago, and apparently the figure itself never was added as an image but only as a comment:
<!-- Figure: Discretizing the function $ f$ into bins of width $ \Delta$ \includegraphics[width=\textwidth]{function-with-bins.eps} -->
Furthermore, apparently the text was copied and pasted from PlanetMath (see here) without proper attribution of the authors as I think would be required by the GNU Free Documentation License. This talk page mentions that the article incorporates material from PlanetMath, which is licensed under the GFDL, but I am not sure that is enough? So, should the section be removed as a copyright violation? — Tobias Bergemann (talk) 21:28, 22 December 2008 (UTC)
- I haven't read the article or preceding comments, but AFAIK it is not a copyright violation to copy GFDL-licensed material to Wikipedia as long as it has proper attribution (maybe you need to change the attribution above to more closely reflect the kind of attribution PlanetMath wants, at most). Shreevatsa (talk) 21:38, 22 December 2008 (UTC)
- I think you are right. As far as I understand the history of the article at PlanetMath, all visible versions of that article were authored by Kenneth Shum. — Tobias Bergemann (talk) 22:21, 22 December 2008 (UTC)
entropy explained
This statement is not followed up with something that uses the premise it states: "Since the probability of each event is 1 / n"
log probability
The article Perplexity says that information entropy is "also called" log probability. Is it true that they're the same thing? If so, a mention or brief discussion of this in the article might be appropriate. dbtfztalk 01:34, 20 April 2006 (UTC)
Entropy is *expected* log probability Full Decent (talk) 01:21, 3 December 2009 (UTC)
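In symbols (the relation being pointed at, plus the perplexity connection for context):
 H(X) = -\mathrm{E}[\log_2 p(X)] = -\sum_x p(x) \log_2 p(x), \qquad \mathrm{perplexity}(X) = 2^{H(X)} .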
Interesting properties
Hello, I have posted some (relatively?) interesting properties of the entropy function on my blog. http://fulldecent.blogspot.com/2009/12/interesting-properties-of-entropy.html Normally information from a blog is not authoritative and I wouldn't use this way to post primary information on Wikipedia; but math is math and stands on its own. Full Decent (talk) 01:27, 3 December 2009 (UTC)
I have made a few edits today and merged in some of that blog to http://en.wiki.x.io/wiki/Perplexity I think this article requires some review and attention to mathematical pedantry to maintain B-class level. Full Decent (talk) 16:00, 11 December 2009 (UTC)
Actual information entropy
I was surprised to see Shannon entropy here and not the explicit collapse of information and was even more surprised to see the latter not even linked! I'd like to see this in the otheruses template at the top but am not sure how to phrase it succinctly. Can someone give it a shot? It's a hairy issue since the articles are so tightly related. .froth. (talk) 03:06, 26 March 2009 (UTC)
About the relation between entropy and information
Hi. I've drawn this graphic. I'd like to know your comments about the idea it describes, thank you very much.
--Faustnh (talk) 00:07, 29 March 2009 (UTC)
Also posted at Fluctuation theorem talk page.
It's the same amount of information, but the information in the first one can be described more succinctly. See Kolmogorov complexity .froth. (talk) 01:30, 4 April 2009 (UTC)
I think it's not the same amount of information.
There is more information in the wave frame.
It is true that there is something that remains constant in both graphs, but it is not information:
The bigger quantity of information in the wave case gets compensated by, or correlated to, the smaller quantity of entropy in that wave's case.
So, certainly, there is something that remains constant. But it is not information.
Here : aaa , there is less information than here : abc . But here : aaa , entropy is bigger than here : abc .
Another example:
This universe : abcd - abcd - abcd - abcd , is maximum entropy and minimum information.
This other universe : aaaa - bbbb - cccc - dddd , is minimum entropy and maximum information.
But something remains constant in both universes, because the second universe is a big replica of each of the small particles or sub-universes of the first universe.
--Faustnh (talk) 11:03, 4 April 2009 (UTC)
More information = more entropy. Also, I don't think this is very relevant to the information theory definition of entropy. Full Decent (talk) 15:58, 11 December 2009 (UTC)
- Actually "abcd - abcd - abcd - abcd" is highly ordered. If we read these as sets of four hexadecimal digits, "aaaa - bbbb - cccc - dddd" is different 16 bit characters, while "abcd - abcd - abcd - abcd" is four of the same 16 bit character, and therefore more ordered. Any pattern is order.
- I agree the poster of this image meant well, and it would be a great analogy if it were right. Unfortunately it's wrong, for reasons I get into below. Randall Bart Talk 21:45, 2 December 2010 (UTC)