Talk:Entropy (information theory)/Archive 3
This is an archive of past discussions about Entropy (information theory). Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.
Lead text
I removed a paragraph in the lead which explained entropy as an amount of randomness. It is true that entropy is greater for distributions which are more "random." For example, a uniform distribution has the maximum entropy over all discrete distributions on a given finite set of outcomes, and Gaussian distributions take that role for continuous distributions of a given variance. However, when it comes to messages over, say, the English alphabet, it is not clear whether "aaaaaaaa" has more entropy than the message "alphabet." If it is assumed that all letters of the English language are equally likely then, in fact, these two messages are equally likely and thus have the same entropy. In reality, "a" is probably more likely to occur than most other letters, in which case "aaaaaaaa" would have less entropy. Giving readers a general rule of thumb based on perceived randomness can therefore be terribly misleading for people who do not know the subject well. BourkeM Converse! 05:50, 17 October 2015 (UTC)
Units of entropy
The introduction lists bits, nats and bans as common units of Shannon's entropy, while the definition section uses dits rather than bans. — Preceding unsigned comment added by 129.177.143.236 (talk) 10:56, 18 November 2013 (UTC)
- Shannon entropy units are bits/symbol. For example, each of the 6 "messages" below has a Shannon entropy of log2(2) = 1 because each symbol is used with the same probability and there are 2 symbols. An equal probability distribution gives the maximum entropy for that number of symbols:
- ab, 12, ababababab, 1212121212, aaaaabbbbbccccc, 221122112211221122112211.
- But physical entropy is in nats because it increases as the "length of the message" (number of particles or volume) increases. Intensive entropy is more like Shannon entropy, being nats per mole, kg, volume, or particle. The Joules/Kelvins units of kb are unitless because Kelvins are a measure of kinetic energy (joules) so the units cancel. It's merely the slope needed to make sure there is (classically) zero kinetic energy at zero temperature, which is a truism. If Joules and Kelvins were fundamentally different units there would not be this strict requirement that they are both zero at the same time when kb is used. Quantum mechanically kb is not even needed to get S and it becomes more obvious the units are "nats". Ywaz (talk) 23:02, 30 November 2015 (UTC)
Why do you say that information entropy is in bits per symbol? If each of 2^10 possibilities is equally probable, for a string of zeros and ones that is ten symbols long, then any one of those strings represents 10 bits, not 1 bit, yes? 𝕃eegrc (talk) 14:56, 1 December 2015 (UTC)
- Shannon entropy H measures a disorder rate aka an information rate in a set of symbols. A string with all 1's or all 0's has H = 0 bits/symbol. At the other extreme, half 1's and half 0's is H=1 bit/symbol. Search "entropy calculator" to play with it. If H = 0.5 for a particular set of 10 bits, then it is carrying only H*10 = 5 bits of data, which shows compressibility only in a very restricted sense. H*N like this is total entropy for a set of N symbols which is analogous to physical entropy's N microstates. H is analogous to specific entropy S_o (usually in units of nats per mole or per kg). If each bit of 10 is randomly selected, then all 1's is not as likely as half 0's and half 1's, but it is as likely as any other particular sequence. Shannon entropy and physical entropy do not look at the sequence order. But a random selection of 10 bits is likely to have H close to 1, and H*N=10 for the total string. Ywaz (talk) 18:16, 1 December 2015 (UTC)
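For readers following the numbers in this thread, here is a minimal sketch of what the online "entropy calculators" mentioned above appear to compute, assuming they treat the symbol frequencies observed in a single string as the probabilities. The function names are illustrative and not taken from any particular calculator. Whether the result should be called the entropy of the message is the point in dispute here.

```python
from collections import Counter
from math import log2


def empirical_entropy(message: str) -> float:
    """Bits/symbol computed from the symbol frequencies observed in `message`."""
    n = len(message)
    counts = Counter(message)
    # sum over distinct symbols of (count/N) * log2(N/count)
    return sum((c / n) * log2(n / c) for c in counts.values())


def total_bits(message: str) -> float:
    """N * H for the same message (the 'extensive' quantity in this thread)."""
    return len(message) * empirical_entropy(message)


for msg in ["ab", "ababababab", "1111111111", "aaaaabbbbbccccc"]:
    print(msg, empirical_entropy(msg), total_bits(msg))
```

Note that under this counting method "aaaaabbbbbccccc" uses three distinct symbols, so it comes out at log2(3) ≈ 1.585 bits/symbol rather than 1.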
You may want to try for consensus on that before editing the article. My experience is that information entropy is more general than the per symbol definition I think you mean. 𝕃eegrc (talk) 20:10, 2 December 2015 (UTC)
- The above discussion may make some sort of sense, but I cannot see it. The most glaring problem is that a message *instance* is talked about as if it had entropy. The entropy of "ab" is not defined, the entropy of " ababababab" is not defined. Entropy is not something which a particular message possesses. "A string with all 1's or all 0's has H = 0 bits/symbol" is a nonsense statement. "each of the following 6 "messages" below have a Shannon entropy of log2(2) = 1" is a nonsense statement. Please rewrite the argument in a form that makes sense. PAR (talk) 15:31, 3 December 2015 (UTC)
Shannon's entropy equation H is the definition of information entropy. You can look up "entropy calculator" and find many implementations of it online and you will get the same answers I provided. Both physical and informational entropy are an instantaneous statistical measure ("picture") of the data. The information most people intuitively think about is "extensive" entropy, called shannons, in bits as explained in the article. The "shannons" of ABABAB is H*N = 6. H is "intensive" entropy and is 1 bit/symbol in this case. The first image in the article with coins describes "extensive" (absolute) informational entropy and the second image and all the equations describe "intensive" entropy (bits per symbol, aka Shannon's H function).
Concerning the correct units, see Shannon's book, chapter 1, section 7 which says "H or H' measures the amount of information generated by the source per symbol [H] or per second [H']." A "source" can be any data file being read from memory, an FM radio station, or the position and energies of particles in a gas. See Landauer's principle for this direct connection between physical entropy and information entropy, with experimental observations. Normal physical entropy is an absolute value (extensive).
S = kb*ln(2)*N*H
where S is physical extensive entropy, kb is Boltzmann's constant, N is number of particles, ln(2) is the conversion from bits to nats, and H is Shannon's (intensive) entropy in bits per particle. You assign a symbol to each possible microstate a particle can occupy (100 possible internal energies in 100 possible locations would use 10,000 symbols as the alphabet for the 10,000 possible microstates), then calculate H from the probabilities of the distinct symbols by looking at a large number of particles, or by already knowing the physics of the particles. You could take a picture of the particles, write down the symbol for every one of their energy-volume (phase space, microstate) locations, and consider this a message the system of particles has sent you. The "particles" could be in mole, kg, or m^3 and the symbols would represent the phase space of that block. N would then be in mole, kg, or m^3. Ywaz (talk) 17:55, 3 December 2015 (UTC)
- I have the feeling we are arguing over language, rather than concepts. Just to be sure, what would you say is the entropy (extensive and intensive) of the message "A5B"? PAR (talk) 01:32, 4 December 2015 (UTC)
The Shannon entropy of "A5B" and "AAA555BBB" and "AB5AB5AB5" are all the same: H = log2(3) = 1.58 bits/symbol. This is an intensive type of entropy. The shannons (extensive entropy) of these messages are H*N: 1.58*3 for the first one and 1.58*9 for the other two. H is the sum of p_i*log2(1/p_i) where p_i is the observed frequency of the i-th symbol in the message.
How can you revert the article claiming Shannon entropy is extensive when Shannon himself said it is bits/symbol (intensive)? I have a reference from the man himself and you have nothing but opinion. A reference from the original source of information theory overrides 2 votes based on opinion. Admit your error and let the article be corrected to reflect the facts. Ywaz (talk) 19:34, 4 December 2015 (UTC)
- Is my understanding of your statements correct: You are saying that "A5B" is independent draws of "A", "5", and "B" from a distribution and thus our best estimate of that distribution is that each of "A", "5", and "B" has a 1/3 probability. I agree that the entropy of such a (1/3, 1/3, 1/3) distribution is indeed log(3). I hesitate to call this bits per symbol, though, because it is the counts of occurrences that is being divided by the number of symbols, not some measure of bits that is being divided by the number of symbols. 𝕃eegrc (talk) 21:13, 4 December 2015 (UTC)
- Ok, "A5B" does not possess an entropy. Entropy is defined for the alphabet and their associated probabilities that produced the instance "A5B". What you have done is used the instance to assume an alphabet ("A","5","B") and you have assumed equal probabilities for each letter of the alphabet. These are not assumptions that you can generally make. Because of this, it is simply wrong to say that "A5B" has a particular value of entropy. The symbols imply nothing about the alphabet, nor the probabilities of the letters of that alphabet. I think you know this, but you talk as if "A5B" has entropy, and that is what is bothering me. PAR (talk) 21:51, 4 December 2015 (UTC)
Leegrc, "bits" are the invented name for when log base 2 is used. There is, like you say, no "thing" in the DATA itself you can point to. Pointing to the equation itself to declare a unit is, like you are thinking, suspicious. But physical entropy itself is in "nats" for natural units for the same reason (they use base "e"). The only way to take out this "arbitrary unit" is to make the base of the logarithm equal to the number of symbols. The base would be just another variable to plug a number into. Then the range of the H function would stay between 0 and 1. Then it is a true measure of randomness of the message per symbol. But by sticking with base two, I can look at any set of symbols and know how many bits (in my computing system that can only talk in bits) would be required to convey the same amount of information. If I see a long file of 26 letters having equal probability, then I need H = log2(26) = 4.7 bits to re-code each letter in 1's and 0's. There are H=4.7 bits per letter.
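The change-of-base idea in the paragraph above can be sketched as follows. This is only an illustration under the same counting assumption (observed frequencies stand in for probabilities), and `entropy_in_base` and `normalized_entropy` are hypothetical helper names introduced here.

```python
from collections import Counter
from math import log


def entropy_in_base(message: str, base: float) -> float:
    """Frequency-based entropy per symbol, with the logarithm taken in `base`."""
    n = len(message)
    return sum((c / n) * log(n / c, base) for c in Counter(message).values())


def normalized_entropy(message: str) -> float:
    """Use the number of distinct symbols as the log base, so the result lies in [0, 1]."""
    k = len(set(message))
    if k < 2:
        return 0.0  # a single repeated symbol (or empty input) carries no surprise
    return entropy_in_base(message, k)


alphabet = "abcdefghijklmnopqrstuvwxyz"
print(entropy_in_base(alphabet, 2))   # bits per letter for 26 equally frequent letters
print(normalized_entropy(alphabet))   # close to 1 for equally frequent symbols
print(normalized_entropy("aaaaaaab"))
```

With base 2 the 26-letter case gives log2(26) ≈ 4.70 bits per letter, matching the figure quoted above; with base 26 the same string gives a value of about 1.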
PAR, as far as I know, H should be used blind without knowledge of prior symbol probabilities, especially if looking for a definition of entropy. You are talking about watching a transmitter for a long time to determine probabilities, then looking at a short message and using the H function with the prior probabilities. Let's say experience shows a "1" occurs 90% of the time and 0 occurs 10%. A sequence then comes in: 0011. "H" = 2*(-0.1*log2(0.1) - 0.9*log2(0.9) ) = 6.917. [edit: this 6.917 result is wrong as PAR points out below, so the rest of this paragraph should be ignored] This would be the amount of information in bits conveyed when only 4 bits were sent. Two out of 4 symbols being 0 was a surprise, so it carried more information than usual. I don't know how to incorporate this into a general idea of entropy. It's not Shannon's H because Shannon's H requires the sum of probabilities to be one. To force it in this case would be H = "H" / 4 = 1.72 bits per symbol (bits/bit). So 0.72 bits/bit is the excess information the short message carried beyond what is normally expected. Backing up, I can say the H of this source with 90% 1's is 0.496 bits/bit which is less than the ideal 1 bit/bit that random bits carry from a random source. All that's interesting but I don't see a definition of entropy coming out of it. It seems to be just applying H to discover a certain quantity desired.
Let me give an example of why a blind and simple H can be extremely useful. Let's say there is a file that has 8 bytes in it. One moment it says AAAABBBB and the next moment it says ABCDABCD. I apply H blindly not knowing what the symbols represent. H=1 in the first case and H=2 in the second. H*N went from 8 to 16. Now someone reveals the bytes were representing microstates of 8 gas particles. I know nothing else. Not the size of the box they were in, not if the temperature had been raised, not if a partition had been lifted, and not even if these were the only possible microstates (symbols). But there was a physical entropy change everyone agrees upon from S1=kb*ln(2)*8 to S2=kb*ln(2)*16. So I believe entropy H*N as I've described it is as fundamental in information theory as it is in physics. Prior probabilities and such are useful, but how they are used needs to be defined. H on a per message basis will be the fundamental input to those other ideas, not to be brushed aside or detracted from.
edit: PAR, I found an example of what you're probably thinking about: http://xkcd.com/936/ The little blocks in this comic are accurate, representing the number of bits needed to represent all the possibilities, which uses prior knowledge. This is just N*H shannons as I've described for each grouping of his blocks, and to get a total number of shannons (entropy as he calls it, extensive entropy as I call it) you just add them up like he has done. Actually, he has made a mistake if the words we choose are not evenly distributed. In that case, we calculate an H which will come out lower than the 11 bits per word-symbol he has indicated, which means that if our hacker starts with the most common words, he is more likely to finish sooner, which is like saying there are fewer than 2^44 things we have to search. Ywaz (talk) 01:48, 5 December 2015 (UTC)
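The passphrase point above can be checked with a small sketch. It assumes a word list of 2^11 = 2048 words (inferred from the 11 bits per word quoted above) and a purely hypothetical skewed distribution in which half of the probability mass sits on the 256 most common words. Neither the list size nor the skew comes from the comic itself.

```python
from math import log2


def entropy_bits(probs) -> float:
    """Shannon entropy in bits of a discrete distribution given as probabilities."""
    return sum(p * log2(1 / p) for p in probs if p > 0)


WORDS = 2048        # assumed word-list size (2**11)
COMMON = 256        # hypothetical number of "common" words

uniform = [1 / WORDS] * WORDS
# Hypothetical skew: half the mass on the common words, the rest spread evenly.
skewed = [0.5 / COMMON] * COMMON + [0.5 / (WORDS - COMMON)] * (WORDS - COMMON)

h_uniform = entropy_bits(uniform)
h_skewed = entropy_bits(skewed)

print("uniform:", h_uniform, "bits/word,", 4 * h_uniform, "bits for 4 words")
print("skewed: ", h_skewed, "bits/word,", 4 * h_skewed, "bits for 4 words")
```

Under these assumptions the uniform case gives 11 bits per word (44 bits for four independently chosen words), while the skewed case gives roughly 10.4 bits per word, so the search space is effectively smaller than 2^44.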
- Ywaz- you say:
as far as I know, H should be used blind without knowledge of prior symbol probabilities, especially if looking for a definition of entropy. You are talking about watching a transmitter for a long time to determine probabilities, then looking at a short message and using the H function with the prior probabilities.
- NO. Watching a transmitter will allow us to estimate probabilities, the longer we watch, the better the estimation. Or, we may have a model which gives us the probabilities directly, as in statistical mechanics where each microstate (message) is assumed to have equal probability. Once we have these probabilities, we can calculate the entropy, and only then, not before. In "message" terms, the entropy is the average information carried by a message. This requires knowing the set of all messages and their probabilities, which sum to unity. In micro/macrostate terms, the macrostate is a set of microstates, each with their own probability, the sum of which is unity. The entropy is only defined for the macro state. A microstate has no entropy. It does carry information, however, and I think you are confusing the two. The entropy is the average information carried by a message or microstate. It is averaged over all possible microstates which constitute the macrostate, or alternatively, it is averaged over the set of all possible messages.
Let's say experience shows a "1" occurs 90% of the time and 0 occurs 10%. A sequence then comes in: 0011. "H" = 2*(-0.1*log2(0.1) - 0.9*log2(0.9) ) = 6.917. This would be the amount of information in bits conveyed when only 4 bits were sent. Two out of 4 symbols being 0 was a surprise, so it carried more information than usual. I don't know to incorporate this into a general idea of entropy. It's not Shannon's H because Shannon's H requires the sum of probabilities to be one. To force it in this case would be H = "H" / 4 = 1.72 bits per symbol (bits/bit). So 0.72 bits/bit is the excess information the short message carried than normally expected. Backing up, I can say the H of this source with 90% 1's is 0.496 bits/bit which is less than the ideal 1 bit/bit that random bits carry from a random source. All that's interesting but I don't see a definition of entropy coming out of it. It seems to be just applying H to discover a certain quantity desired.
- First of all, 2*(-0.1*log2(0.1) - 0.9*log2(0.9) ) = 0.937991 bits. This is NOT the amount of information contained in 4 bits, it is the amount of information contained in the specific 4 bit message 0011. The entropy of a 4-bit message is -4(p log2(p)+q log2(q)) where p=0.1 and q=1-p=0.9. That comes out to 1.87598 bits, which is the entropy of the set of four bit messages, given the above probabilities. It is the average amount of information carried by 4 bits, and clearly the information in 0011 (0.937991 bits) is much less than average (1.87598 bits). In macro/microstate terms, the macrostate is represented by any of the 2^4=16 possible microstates, 0011 being one of those microstates. The entropy of the macrostate is 1.87598 bits. To ask for the entropy of a microstate is an improper question.
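The arithmetic in this exchange can be reproduced with a short sketch, assuming independent symbols with P("0") = 0.1 and P("1") = 0.9 as stated above. The helper `self_information` is a name introduced here; it applies I = -log2(p) to the whole message, which is the definition of information content given later in this thread.

```python
from math import log2

p0, p1 = 0.1, 0.9

# Entropy of the source per symbol, and of a 4-symbol message (independent symbols assumed).
H = -(p0 * log2(p0) + p1 * log2(p1))
four_symbol_entropy = 4 * H
two_H = 2 * H  # the quantity 2*(-0.1*log2(0.1) - 0.9*log2(0.9)) discussed above


def self_information(message: str, probs: dict) -> float:
    """-log2 of the message probability, assuming independent symbols."""
    return sum(-log2(probs[s]) for s in message)


info_0011 = self_information("0011", {"0": p0, "1": p1})

print("H per symbol        :", H)
print("2 * H               :", two_H)
print("entropy of 4 symbols:", four_symbol_entropy)
print("I(0011) = -log2 p   :", info_0011)
```

On these assumptions 2*H ≈ 0.937991 and 4*H ≈ 1.87598, matching the figures quoted, while -log2 P(0011) comes to about 6.95 bits. So under the I = -log2(p) definition the 0.937991 figure is twice the per-symbol entropy rather than the information content of the specific message 0011.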
Let me give an example of why a blind and simple H can be extremely useful. Let's say there is a file that has 8 bytes in it. One moment it says AAAABBBB and the next moment it says ABCDABCD. I apply H blindly not knowing what the symbols represent. H=1 in the first case and H=2 in the second. H*N went from 8 to 16. Now someone reveals the bytes were representing microstates of 8 gas particles. I know nothing else. Not the size of the box they were in, not if the temperature had been raised, not if a partition had been lifted, and not even if these were the only possible microstates (symbols).
- You say that a microstate is given by AAAABBBB, but you have no knowledge of the macrostate. Only a macrostate has entropy, you have no macrostate, you cannot calculate entropy. You then presume to know the macrostate by looking at AAAABBBB and saying the macrostate is the set of all possible arrangements of 8 particles in two equally probable energy levels. A totally unwarranted assumption. Given this unwarranted assumption, you then correctly calculate the information in AAAABBBB to be 8 bits. THIS IS NOT THE ENTROPY, it is the amount of information in AAAABBBB after making an unwarranted assumption. Only if your unwarranted assumption happens to be correct will it constitute the entropy.
- In the second case, ABCDABCD, you make another unwarranted assumption; that there are 8 particles, each with four equally probable states. Given this unwarranted assumption, you then correctly calculate the information in ABCDABCD to be 16 bits. THIS IS NOT THE ENTROPY, it is the amount of information in ABCDABCD after making yet another unwarranted assumption. Only if your unwarranted assumption happens to be correct will it constitute the entropy. If that unwarranted assumption is correct then the amount of information in AAAABBBB will also be 16 bits.
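The dependence on the assumed model can be made concrete with a small sketch. The two distributions below are the ones named in the comments above (two equally probable states, and four equally probable states), and symbols are assumed independent. `self_information` is an illustrative helper name.

```python
from math import log2


def self_information(message: str, probs: dict) -> float:
    """Information in bits carried by `message` under an assumed symbol distribution."""
    return sum(-log2(probs[s]) for s in message)


two_state = {"A": 0.5, "B": 0.5}           # assumed: two equally probable states
four_state = {s: 0.25 for s in "ABCD"}     # assumed: four equally probable states

print(self_information("AAAABBBB", two_state))   # 8 bits under the two-state model
print(self_information("AAAABBBB", four_state))  # 16 bits under the four-state model
print(self_information("ABCDABCD", four_state))  # 16 bits under the four-state model
```

The same string AAAABBBB carries 8 or 16 bits depending on which model is assumed, which is the point being argued: the number comes from the assumed distribution, not from the string alone.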
But there was a physical entropy change everyone agrees upon from S1=kb*ln(2)*8 to S2=kb*ln(2)*16. So I believe entropy H*N as I've described it is as fundamental in information theory as it is in physics. Prior probabilities and such are useful but need to be defined how they are used. H on a per message basis will be the fundamental input to those other ideas, not to be brushed aside or detracted from.
- Everyone does NOT agree on your statement of physical entropy change. You are confusing the amount of information in a message (or microstate) with the entropy of the set of all possible messages (the macrostate).
- Again, The bottom line is that a particular message (microstate) may carry varying amounts of information. Entropy can only be defined for a macrostate which is a set of microstates whose individual probabilities are known or estimated, and add to unity. Entropy on a per message basis is nonsense, the information carried by a message is not. Prior probabilities are not just useful, they are mandatory for the calculation of entropy, whether you estimate them from a large number of individual messages (microstates), or you assume them, as is done with statistical mechanics entropy (each microstate being equally probable). PAR (talk) 05:33, 5 December 2015 (UTC)
I agree you can shorten up the H equation by entering the p's directly by theory or by experience. But you're doing the same thing as me when I calculate H for large N, but I do not make any assumption about the symbol probabilities. You and I will get the same entropy H and "extensive" entropy N*H for a SOURCE. Your N*H extensive entropy is N*sum(p*log(p)). The online entropy calculators and I use N*H = N*sum[ count/N*log(count/N) ] ( they usually give H without the N). These are equal for large N if the source and channel do not change. "My" H can immediately detect if a source has deviated from its historical average. "My" H will fluctuate around the historical or theoretical average H for small N. You should see this method is more objective and more general than your declaration it can't be applied to a file or message without knowing prior p's. For example, let a partition be removed to allow particles in a box to get to the other side. You would immediately calculate the N*H entropy for this box from theory. "My" N*H will increase until it reaches your N*H as the particles reach maximum entropy. This is how thermodynamic entropy is calculated and measured. A message or file can have a H entropy that deviates from the expected H value of the source.
The distinct symbols A, B, C, D are distinct microstates at the lowest level. The "byte" POSITION determines WHICH particle (or microvolume if you want) has that microstate: that is the level to which this applies. The entropy of any one of them is "0" by the H function, or "meaningless" as you stated. A sequence of these "bytes" tells the EXACT state of each particle and system, not a particular microstate (because microstate does not care about the order unless it is relevant to its probability). A single MACROstate would be combinations of these distinct states. One example macrostate of this is when the gas might be in any one of these 6 distinct states: AABB, ABAB, BBAA, BABA, ABBA, or BAAB. You can "go to a higher level" than using A and B as microstates, and claim AA, BB, AB, and BA are individual microstates with certain probabilities. But the H*N entropy will come out the same. There was not an error in my AAAABBBB example and I did not make an assumption. It was observed data that "just happened" to have equally likely probabilities (so that my math was simple). I just blindly calculated the standard N*H entropy, and showed how it gives the same result physics gets when a partition is removed and the macrostate H*N entropy went from 8 to 16 as the volume of the box doubled. The normal S increases S2-S1=kb*N*ln(2) as it always should when a mid-box partition is removed.
I can derive your entropy from the way the online calculators and I use Shannon's entropy, but you can't go the opposite way.
Now is the time to think carefully, check the math, and realize I am correct. There are a lot of problems in the article because it does not distinguish between intensive Shannon entropy H in bits/symbol and extensive entropy N*H in bits (or "shannons", to be more precise, to distinguish it from the "bit" count of a file which may not have 1's and 0's of equal probability).
BTW, the entropy of an ideal gas is S~N*log2(u*v) where u and v are internal energy and volume per particle. u*v gives the number of microstates per particle. Quantum mechanics determines that u can take on a very large number of values and v is the requirement that the particles are not occupying the same spot, roughly 1000 different places per particle at standard conditions. The energy levels will have different probabilities. By merely observing large N and counting, H will automatically include the probabilities.
In summary, I am proposing only 3 simple equations. They precisely lay the foundation of all further information entropy considerations. These equations should replace 70% of the existing article. These are not new equations, but a clear definition of them and of how to use them is hard to come across, since there is so much clutter and confusion regarding entropy as a result of people not understanding these statements.
1) Shannon's entropy is "intensive" bits/symbol = H = - sum[ count/N*log2(count/N) ] = sum [count/N*log2(N/count) ] where N is the length of a message and count is for each distinct symbol.
2) Absolute ("extensive") information entropy is in units of bits or shannons = N*H.
3) S = kb*ln(2)*N*H where each N has a distinct microstate which is represented by a symbol. H is calculated directly from these symbols for all N. This works from the macro down to the quantum level.
In homogeneous solids, 3) is not formally correct on a "per atom" or "per molecule" basis because phonons are interacting across the "lattice". The "symbols" would be interacting if they are on the scale of atoms, and are therefore not a random variable with a probability distribution that H can use. Each N needs to be on a larger scale like per kg, per mole, or per cm^3. However, once that is obtained, a simple division by molecules/kg or whatever will allow a false "per molecule" probability distribution and H that would be OK to use as long as it is not used below the bulk level. N can be each molecule in a gas. They bump, but that is not an interaction that messes up the probability distribution.
Ywaz (talk) 17:12, 5 December 2015 (UTC)
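Equation 3) as proposed above can be evaluated numerically with a minimal sketch. It simply plugs numbers into S = kb*ln(2)*N*H for the 8-symbol example from earlier in the thread and does not settle whether that identification is physically justified, which is what is being disputed. `thermo_entropy` is a name introduced here, and the Boltzmann constant is the exact SI value.

```python
from math import log

K_B = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)


def thermo_entropy(n_symbols: int, h_bits_per_symbol: float) -> float:
    """Evaluate S = kb * ln(2) * N * H, with H in bits per symbol."""
    return K_B * log(2) * n_symbols * h_bits_per_symbol


s1 = thermo_entropy(8, 1.0)  # N*H = 8 bits  (the AAAABBBB case)
s2 = thermo_entropy(8, 2.0)  # N*H = 16 bits (the ABCDABCD case)

print("S1 =", s1, "J/K")
print("S2 =", s2, "J/K")
print("S2 - S1 =", s2 - s1, "J/K")  # equals kb * N * ln(2) for N = 8
```

For N = 8 the difference S2 - S1 is kb*8*ln(2), about 7.66e-23 J/K, which is the kb*N*ln(2) expression quoted above.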
- Ywaz - You say:
But you're doing the same thing as me when I calculate H for large N, but I do not make any assumption about the symbol probabilities.
- You DO make an assumption - that the frequencies of a letter in a particular message represent the probabilities of all messages. Totally unwarranted.
A message or file can have a H entropy that deviates from the expected H value of the source.
- Again, a message does not have entropy, it carries a certain amount of information. The amount of information may deviate from the entropy (average amount of information carried by a message), but that amount of information is not called entropy.
There are a lot of problems in the article because it does not distinguish between intensive Shannon entropy H in bits/symbol and extensive entropy N*H in bits (or "shannons", to be more precise, to distinguish it from the "bit" count of a file which may not have 1's and 0's of equal probability).
- Shannon's "bit" is not a 0 or a 1. It is a measure of information. It is a generalization of the old concept of a bit. A single message carries information such that if the probabilities of each of the characters in the message are equal, the amount of information it carries is the length of the message. Shannon generalized the concept of a bit to include cases where the probabilities were not equal. A previous discussion here has determined that the word "bit" as a measure of information, rather than the "shannon", is to be preferred, because that is the most common convention in the literature. As long as we specify units as bits (extensive) or "bits/symbol" there will be no confusion. Can you point out an example in the article where there is confusion by failing to make this distinction?
1) Shannon's entropy is "intensive" bits/symbol = H = sum[ count/N*log2(count/N) ] where N is the length of a message and count is for each distinct symbol.
- NO - "count" is not the count found in a single message, it is the count averaged over an infinite number of messages. In other words it is the expected value of the count.
2) Absolute ("extensive") information entropy is in units of bits or shannons = N*H.
- Fine, no problem.
3) S = kb*ln(2)*N*H where each N has a distinct microstate which is represented by a symbol. H is calculated directly from these symbols for all N. This works from the macro down to the quantum level.
- It's not clear what you are saying. If there are N microstates (or messages) in the macrostate, how is H "calculated directly from these symbols"? You speak of an "entropy" for each microstate, how do you arrive at the total intensive H ? PAR (talk) 06:30, 6 December 2015 (UTC)
As N gets larger, the message more closely exhibits the properties of all future messages from the source, unless the source changes its identity. Changing its identity means it stops acting like it did in the past. You're saying "NO" to this "count" method, but it is the method used by many others and is compatible with Shannon's text. Shannon even DEFINES entropy H for large N. He says "....a long message of N symbols. It will contain with high probability piN occurrences [of each symbol]". For you to say getting pi's from large N is "unwarranted" is so, so wrong. It is the opposite of the truth. There is no other way to get pi's except by theory without observation which is called metaphysics or philosophy instead of science. Entropy is an observation and a measurement. No one can observe an infinite number of symbols from any source. People were measuring entropy in thermodynamics many decades before they had any theory to explain where it came from and had no idea about the p's or the quantum theory from which they came. They were even using it before the molecular theory of gases.
"Information" is the N*H (where I've defined H for all messages of all lengths) if no other definition is provided. This is pretty much standard. You even used this definition of information in the subsequent paragraph. Or are you saying no one can measure the information content of a file without first knowing the p's based on a detail knowledge of the source that generated the file? "Bit" in the bit/symbol unit is correct, but it is not precise in terms of the measure of information because "bit" is normally simply a count of the "bits". Calling information content unit "shannons" may not be formally needed, but it is precise. It describes how the bit count was adjusted and is "minimal bits needed encode this message".
Using "intensive" and "extensive" information entropy may be new. So if it is used, it must be in quotes, with a reference to the thermo article on them to indicate why they are in quotes. But something must be said explicitly to show clearly to a wider audience that H and N*H are both called "entropy" but they have an important difference. Most people can't see the connection to physical entropy because they are looking at H instead of N*H. The other problem is that they do not know kb is merely a units conversion factor from kinetic energy per molecule as measured by Kelvins to joules per molecule (Joules/joules, unitless). In this way physical entropy really is just a statistical measure exactly like Shannon entropy, and they are not even separate concepts in systems operating at the Landauer limit.
The problem throughout the article is that information entropy is not defined, so every time the word "entropy" is used it is not clear if it is N*H entropy or H entropy. And half the time the wording is false. Example: "Shannon's entropy [H, by definition] measures the information contained in a message". No. N*H is the information content, not H. Finally, half way down, it warns "Often it is only clear from context which one is meant" and even complains about Shannon "confusing the matter" in a way that shows the writer does not know H itself is REQUIRED to be in bits/symbol: "Shannon himself used the term in this way". Shannon's entropy is not a term. It's an equation. So the article should not say "Shannon entropy" so many times when it is really referring to N*H.
Where did I speak of entropy of a microstate? I think I said Macrostate. 1 microstate = 1 symbol. H for 1 symbol = 1*log2(1) = 0. This is the "ideal" lowest entropy state, zero information.
Ywaz (talk) 13:18, 6 December 2015 (UTC)
Ywaz: you say:
As N gets larger, the message more closely exhibits the properties of all future messages from the source, unless the source changes its identity. Changing its identity means it stops acting like it did in the past. You're saying "NO" to this "count" method, but it is the method used by many others and is compatible with Shannon's text. Shannon even DEFINES entropy H for large N. He says "....a long message of N symbols. It will contain with high probability piN occurrences [of each symbol]". For you to say getting pi's from large N is "unwarranted" is so, so wrong. It is the opposite of the truth. There is no other way to get pi's except by theory without observation which is called metaphysics or philosophy instead of science. Entropy is an observation and a measurement. No one can observe an infinite number of symbols from any source. People were measuring entropy in thermodynamics many decades before they had any theory to explain where it came from and had no idea about the p's or the quantum theory from which they came. They were even using it before the molecular theory of gases.
- There is a set of probabilities for each symbol in the alphabet (pi for the i-th symbol). You call these pi's "metaphysical" or whatever, but they are not. Yes, their values are something we can estimate by looking at a long message, the longer the message, the more precise the estimation. We can express this by saying that "the error in the estimate of the pi's tends to zero as N approaches infinity". I do not say getting pi's from large N messages is unwarranted. It's a good way, but not the only way. I am saying that getting them from SMALL N messages is unwarranted. Saying that the entropy of AB is 2 bits is not correct. Only when you know the probability of message "AB" can you calculate the information content of "AB" and you cannot get that probability from the message "AB". You cannot say that a message "AB" shows that there are two symbols, each with 50% probability. Therefore you cannot calculate entropy of anything, because you don't have the pi's. Furthermore, in the case of statistical mechanics, we do not look at microstates and calculate their probabilities by counting. Rather we make the assumption that their probabilities are all equal and then we see that this assumption "gives the right answer" when calculating the entropy. This is not a counting method, but it is not "metaphysical" or "philosophical", it is science.
"Information" is the N*H (where I've defined H for all messages of all lengths) if no other definition is provided. This is pretty much standard. You even used this definition of information in the subsequent paragraph.
- This is not standard, this is not the definition of information. The information content of message "AB" is I=-log2(pab) where pab is the probability of the message "AB". Looking at that message "AB" and saying pab=(1/2)(1/2)=1/4 and therefore the entropy is 2 bits is totally unwarranted. You have to get pab from somewhere else, either by estimating them from a long message or many short messages, or by knowing how the messages were generated. The entropy of a 2-symbol message is then the average information in a 2-symbol message.
Or are you saying no one can measure the information content of a file without first knowing the p's based on a detailed knowledge of the source that generated the file?
- I am saying you need the pi's in order to calculate the information content of a file. If the file is large, then yes, you may estimate the pi's from the symbol frequencies (assuming they are independent). If the file is small, you may not, and then you have to determine the pi's in some other way. If you have some outside information, fine, use it. In statistical mechanics, we have huge "files" (microstates) and a huge number of microstates, but we cannot count the frequencies. However, we make the assumption that the probabilities of the microstates are equal and we get results that match reality. So we say we know the pi's by yet another method.
- Don't rely on online entropy calculators for your definition of entropy. For example, the first part of [[1]] correctly calculates the entropy per symbol of a message GIVEN THE PROBABILITIES OF A SYMBOL BEFOREHAND. Good. The second part is crap - it presumes to know the probabilities from the frequencies in a short message, the same mistake you make, and it is generally wrong. [[2]] is likewise crap.
Using "intensive" and "extensive" information entropy may be new. So if it is used, it must be in quotes, with a reference to the thermo article on them to indicate why they are in quotes. But something must be said explicitly to show clearly to a wider audience that H and N*H are both called "entropy" but they have an important difference. Most people can't see the connection to physical entropy because they are looking at H instead of N*H. The other problem is that they do not know kb is merely a units conversion factor from kinetic energy per molecule as measured by Kelvins to joules per molecule (Joules/joules, unitless). In this way physical entropy really is just a statistical measure exactly like Shannon entropy, and they are not even separate concepts in systems operating at the Landauer limit.
- If there are confusions between bits and bits/symbol in the article, then I agree, that needs to be made clear.
Where did I speak of entropy of a microstate? I think I said Macrostate. 1 microstate = 1 symbol. H for 1 symbol = 1*log2(1) = 0. This is the "ideal" lowest entropy state, zero information.
- No - a microstate corresponds to a particular message, e.g. "A5C". A macrostate is a set of microstates, for example, the set of all 3-symbol messages. Entropy applies to a macrostate, not a microstate. "A5C" carries some information equal to -log2(pA5C) where pA5C is the probability of message "A5C". The entropy of the set of 3-symbol messages is the weighted average of the amount of information in a 3-symbol message, averaged over all possible 3-symbol messages. Their probabilities will sum to unity. You cannot count on the short message "A5C" to give you the weights for the average, nor can you count on it to give you the full alphabet of 3-symbol messages. PAR (talk) 17:17, 6 December 2015 (UTC)
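The definitions in the reply above (information content of one message, and entropy as the probability-weighted average over all messages of a given length) can be sketched in Python as follows. The three-symbol alphabet and its probabilities in `symbol_probs` are hypothetical values chosen for illustration, assumed known in advance and independent between positions; they are not taken from the discussion.

```python
import math
from itertools import product

# Hypothetical per-symbol probabilities, assumed known beforehand and independent.
symbol_probs = {"A": 0.5, "B": 0.25, "C": 0.25}

def message_probability(message, probs):
    """Probability of a message when its symbols are independent."""
    return math.prod(probs[s] for s in message)

def information_content(message, probs):
    """I = -log2(p_message), in bits."""
    return -math.log2(message_probability(message, probs))

def block_entropy(length, probs):
    """Weighted average of information content over all messages of this length."""
    total = 0.0
    for symbols in product(probs, repeat=length):
        p = message_probability(symbols, probs)
        total += p * -math.log2(p)
    return total

print(information_content("ABC", symbol_probs))  # information in one specific message
print(block_entropy(3, symbol_probs))            # entropy of the set of 3-symbol messages
```

Under the independence assumption the entropy of the 3-symbol set equals 3 times the per-symbol entropy, but the information content of any single message generally differs from that average.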
PAR writes:
- You cannot say that a message "AB" shows that there are two symbols, each with 50% probability.
When I come across a file whose contents are "AB" and I know nothing about what generated the file or anything about its past or future, I have no choice but to view the data as the complete life of the source. So myself and the online entropy calculators apply the statistical measure H to it.
- In the case of statistical mechanics, we do not look at microstates and calculate their probabilities by counting.
Look up Einstein solid and Debye model. They count oscillators and phonons.
- This is not standard, this is not the definition of information. The information content of message "AB" is I=-log2(pab) where pab is the probability of the message "AB".
If a source sends me ABBABBAABAAB....random stuff for a while, I will apply H to any large portion of it and see H=1. It then sends me "AB" or "BB" and you ask me the information content of it. I say H*N=2, which is the same answer you get from the equation above because you had also observed p(AB)=1/4 in the long sequence. So again you see, my "crappy" 1) and 2) can derive what you consider correct.
- No - a microstate corresponds to a particular message, e.g. "A5C". A macrostate is a set of microstates, for example, the set of all 3-symbol messages. Entropy applies to a macrostate, not a microstate. "A5C" carries some information equal to -log2(pA5C) where pA5C is the probability of message "A5C". The entropy of the set of 3-symbol messages is the weighted average of the amount of information in a 3-symbol message, averaged over all possible 3-symbol messages. Their probabilities will sum to unity. You cannot count on the short message "A5C" to give you the weights for the average, nor can you count on it to give you the full alphabet of 3-symbol messages.
If your implied 36 alphanumerics are equally probable, causing your 3-symbol microstates to be equally probable, then by treating each alphanumeric as a microstate, my entropy calculation for your Macrostate is N*H = 3*log2(36). This is the same entropy you will calculate by -log2(1/36^3). There is a tradeoff between the two methods. My microstate lookup table for the probabilities has 36 entries but yours has 36^3. So you have to carry a larger memory bank. But you will recognize any higher level patterns. This is the basis of most compression schemes. For example, if A5CA5CA5C.... was all the source ever sent, I would happily calculate H=log2(3)=1.585 and 3*H as the entropy for the macrostate. You would calculate H=0 and entropy 1*H=0 (your N = 3 * my N). The higher level microstates can show more "intelligence" (which requires the ability to compress data).
So I do not disagree with your example, but I wanted to show you that by remembering 36^3 probabilities in your lookup table for each microstate, you are actually assigning a SYMBOL (a row number in the lookup table) to each microstate. Again, I can derive your methods from mine.
My 1) above should be simplified to H= - sum( log2(count/N) ) because the other count/N sums to 1. Ywaz (talk) 00:53, 7 December 2015 (UTC) </ref>
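The trade-off described in this comment, counting single symbols versus counting 3-symbol blocks, can be illustrated with a short Python sketch. It estimates probabilities by counting within the message itself, which is the commenter's convention and is disputed elsewhere in the thread. The helper name `empirical_block_entropy` is hypothetical, and the message length is assumed to be a multiple of the block size.

```python
import math
from collections import Counter

def empirical_block_entropy(message, block_size):
    """Entropy in bits per block, with block probabilities estimated by counting."""
    blocks = [message[i:i + block_size] for i in range(0, len(message), block_size)]
    n = len(blocks)
    counts = Counter(blocks)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

message = "A5C" * 4  # a source that only ever repeats A5C

per_symbol = empirical_block_entropy(message, 1)  # three equally frequent symbols
per_block = empirical_block_entropy(message, 3)   # a single repeated block

# With 36 equally probable, independent alphanumerics the two bookkeeping
# methods agree: 3 * log2(36) equals -log2(1 / 36**3).
n_times_h = 3 * math.log2(36)
from_message_probability = -math.log2(1 / 36**3)

print(per_symbol, per_block, n_times_h, from_message_probability)
```

Counting single symbols gives log2(3) bits/symbol for the repeating message, while counting 3-symbol blocks gives zero, because the larger lookup table captures the repetition.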
- Ywaz: You write:
When I come across a file whose contents are "AB" and I know nothing about what generated the file or anything about its past or future, I have no choice but to view the data as the complete life of the source. So myself and the online entropy calculators apply the statistical measure H to it.
- Ok, I see why you call a single symbol a microstate. To my mind you are calling each symbol a microstate, and the entire file (say the file consists of "A5C"), a collection of symbols, the macrostate. You then calculate the frequencies (i.e. probabilities) of the microstates, make the assumption that they are independent, and with these pi's, you calculate entropy. So what do you do with this entropy? Of what use is it?
Look up Einstein solid and Debye model. They count oscillators and phonons.
- Ok, a semantic problem. They count them conceptually, not experimentally; I thought you were insisting they be counted experimentally. No problem.
If a source sends me ABBABBAABAAB....random stuff for a while, I will apply H to any large portion of it and see H=1. It then sends me "AB" or "BB" and you ask me the information content of it. I say H*N=2, which is the same answer you get from the equation above because you had also observed p(AB)=1/4 in the long sequence. So again you see, my "crappy" 1) and 2) can derive what you consider correct.
- Ok, you have used ABBABBAABAAB... to estimate prior pi's before applying them to "AB". But if the source sent you "AAAAAAAAAAB", your H would not give the correct answer. That's my point. Without that prior knowledge of the source, the previously estimated pi's, you cannot calculate the information content of "AB".
If your implied 36 alphanumerics are equally probable, causing your 3-symbol microstates to be equally probable, then by treating each alphanumeric as a microstate, my entropy calculation for your Macrostate is N*H = 3*log2(36). This is the same entropy you will calculate by -log2(1/36^3).
- Yes, yes! If my 36 alphanumerics are equally probable (and independent!), you are right. But if they are not, you are wrong. And that's my point. My point is that you have no reason, no justification, to assume that they are equally probable if all you have is "A5C". Noting that A, 5, and C are equally likely in A5C does not justify this assumption.
Again, I can derive your methods from mine.
- No, you cannot. Not without making some assumptions, basically pulling them out of thin air. You have to make the assumption that the symbol frequencies (probabilities) in the short message correspond to the frequencies of a long message, and that they are independent (i.e. the probability of finding "AB" in a message is pA pB). You can say that, without the long message, you have nothing else to go on, but what, then, is the value of your calculation? PAR (talk) 03:46, 7 December 2015 (UTC)
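The objection above, that probabilities estimated from a short message need not match the source, can be checked with a small Python sketch. It assumes, purely for illustration, that the true source emits A and B independently with probability 1/2 each; `true_probs`, `estimate_probs` and `information_content` are hypothetical names, and the short sample is the string quoted in the reply.

```python
import math
from collections import Counter

def information_content(message, probs):
    """-log2 of the message probability, assuming independent symbols."""
    return -sum(math.log2(probs[s]) for s in message)

def estimate_probs(sample):
    """Symbol probabilities estimated from the sample's own frequencies."""
    n = len(sample)
    return {s: c / n for s, c in Counter(sample).items()}

true_probs = {"A": 0.5, "B": 0.5}   # assumed fair source (illustration only)
short_sample = "AAAAAAAAAAB"        # an unrepresentative short message
estimated = estimate_probs(short_sample)

info_true = information_content("AB", true_probs)
info_est = information_content("AB", estimated)
print(info_true, info_est)
```

With the assumed fair source the information content of "AB" is 2 bits, whereas the probabilities estimated from the short sample give a larger value, so the two calculations agree only when the sample frequencies happen to match the source.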
Continuing in new heading below. Ywaz (talk) 06:56, 7 December 2015 (UTC)