Talk:Entropy (information theory)/Archive 4


Applying H to short messages

This is a continuation of the above to determine if the online entropy calculators are correct in how they use Shannon's H to calculate entropy for short messages.

PAR wrote:

..... make the assumption that they are independent, and with these pi's, you calculate entropy. So what do you do with this entropy. Of what use is it?

That's how molar and specific entropies are calculated. They are useful when they are independent because S=N*So or S=N*H in information theory. If I calculate H on a short file and it is < 1, then I know I can compress it. If I have 13 balls with 1 either over or under weight, how do I use a balance in 3 weighings to determine which ball is over or underweight? I look for solutions that give me the highest N*H which represents 3 weighings*log(3 symbols) where symbols represent which position the balance is in when weighing (Left tilt, Right tilt, no tilt). These are the microstates. Highest number for macrostate gives highest entropy. So I want weigh methods that have an equal probability of triggering 1 of the 3 states of the balance, independently. I do not weigh 2 balls on 1st try because it is unlikely to tilt. 6 balls on each side may or may not tilt, and even keeps the chance of no tilt open, but that last symbol is not likely to occur, so 6 on each side may not be the right starting point. In any event, I am looking to get my highest N*H value on a very short message.
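(A minimal sketch of the first-weighing calculation described above, assuming the odd ball is equally likely to be any of the 13 and equally likely to be heavy or light; the function name and the loop over k are only illustrative. It computes H for the three balance outcomes as a function of how many balls go on each pan.)

```python
from math import log2

def first_weighing_entropy(k, n_balls=13):
    """Entropy in bits of the balance outcome {Left, Right, Even} when k of the
    n_balls go on each pan and exactly one ball is odd (equally likely heavy or light)."""
    p_left = k / n_balls                    # odd ball on left & heavy, or on right & light
    p_right = k / n_balls
    p_even = (n_balls - 2 * k) / n_balls    # odd ball not on the scale
    return -sum(p * log2(p) for p in (p_left, p_right, p_even) if p > 0)

for k in range(1, 7):
    print(k, round(first_weighing_entropy(k), 3))
# k = 4 gives the most information (~1.58 bits, close to log2(3)): outcomes 4/13, 4/13, 5/13
```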

They count them conceptually, not experimentally, I thought you were insisting they be counted experimentally.

The phonons in the Debye model are real, so indirectly (by solid theory based on experiment) they are counted by experiment. But Carnot and Clausius did not need these theories and got the same results in bulk because in bulk there are microstates that are independent even though you have no clue about their source.

Ok, you have used ABBABBAABAAB... to estimate prior pi's before applying them to "AB". But if the source sent you "AAAAAAAAAAB", your H would not give the correct answer.

OK, yeah, my example was not right and just accidentally came out right. You want the information content of AB given that the source normally says AAAAAAAAAAB. You said you would calculate this 2 symbol information content as -log2(p(AB)). I get p(AB) = 1/12 by looking at the sequence. Is this information measure related to entropy?

Yes, yes! If my 36 alphanumerics are equally probable (and independent!), you are right. But if they are not, you are wrong.

I have been talking about mutually-independent symbols, even if they are NOT mutually independent, because you can never know to what EXTENT they are not, unless it is physical entropy with a solid theory or experimental history describing the source. Physics is already the best set of compression algorithms for the data we observe (Occam's razor). But in information theory for stuff running around on our computers, no matter what we think we know about the source, there can always be a more efficient way to assign the symbols for the microstates as more data comes in (a better compression algorithm). What algorithm are you going to use to determine the most efficient number of symbols to call your microstates? In practice, something can be tried and then N*H is checked to see if it got smaller. If it did, then you found a better definition of the microstates. In physics, it would constitute a new and improved law.

You have to make the assumption that the symbol frequencies (probabilities) in the short message correspond to the frequencies of a long message,

No, no, I clarified earlier that the short messages were to be taken "as is" without any assumption about there being a source that could possibly generate anything different, especially since there usually is no other observable data to go on than the short message itself. Let's say I send you a GIF image. How are you going to calculate its "information" or entropy content?

what, then, is the value of your calculation?

What are you calling "my" calculation? Applying H to a short message? Who decides which messages are short? Who decides if a file on my computer is or is not the life-sum total of some "source" that is never to be seen again? Ywaz (talk) 06:56, 7 December 2015 (UTC)

Ywaz - you wrote:

If I calculate H on a short file and it is < 1, then I know I can compress it.

I know what you are saying, but it sounds like you think this compression method has some special significance, and I don't want to lose track of the fact that it doesn't. If I have 8 GIF files of 1Mb each, I can number them 1 thru 8 and compress them as follows: 1,2,3,4,5,6,7,8. That's fine if all I ever deal with is duplicates of one of those 8 files. That's an assumption I have made. If I make another assumption, I get another compression method. If I make your assumption, that all the pixels are equally likely and independent, I get your compression method. My point is that an assumption has to be made, which effectively declares the pi's beforehand, and you cannot completely justify any assumption by just looking at a finite set of data. You can come close with large amounts of data, but for smaller amounts, you lose more and more justification. By saying "I can compress "AAB" because H=(-2/3)log2(2/3)+(-1/3)log2(1/3)=0.918 is less than unity, you are implicitly assuming that "AAB" is not the only file you expect to deal with. If it were the only one, you could just compress it to "1", H=0, and be done with it. You are implicitly assuming that the files you expect to deal with have a 2-symbol alphabet, each of which is equally likely and independent. That assumption, if true, is what makes your compression method useful. If it's not true, then why bother?
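(For reference, a small sketch of the per-symbol H used in the "AAB" example above, taking the short message's own symbol frequencies as the pi's — the same "blind" assumption being discussed here.)

```python
from collections import Counter
from math import log2

def per_symbol_entropy(msg):
    """Shannon H in bits/symbol, taking the message's own symbol frequencies as the p_i."""
    counts = Counter(msg)
    n = len(msg)
    return -sum((c / n) * log2(c / n) for c in counts.values())

print(per_symbol_entropy("AAB"))   # 0.918... bits/symbol, the value used above
```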

If I have 13 balls with 1 either over or under weight, how do I use a balance in 3 weighings to determine which ball is over or underweight? I look for solutions that give me the highest N*H which represents 3 weighings*log(3 symbols) where symbols represent which position the balance is in when weighing (Left tilt, Right tilt, no tilt). These are the microstates. Highest number for macrostate gives highest entropy. So I want weigh methods that have an equal probability of triggering 1 of the 3 states of the balance, independently. I do not weigh 2 balls on 1st try because it is unlikely to tilt. 6 balls on each side may or may not tilt, and even keeps the chance of no tilt open, but that last symbol is not likely to occur, so 6 on each side may not be the right starting point. In any event, I am looking to get my highest N*H value on a very short message.

Ha. That's an interesting thought experiment. Of course we ASSUME independence: the weight of a ball is not affected by the weight or proximity of any other ball. This is "outside information" - prior constraints on our pi's, which is not justified by any data we may gather by weighing them. That's part of my point. We assume that the measurements are not independent and not order dependent. If our first measurement tilts right, and our second measurement is of the same balls in the same cups, it will certainly tilt right. More outside information constraining our pi's, not justified by any measurement we make. Another part of my point. The probability that the odd ball is heavy is 1/2, equal to the probability that the odd ball is light. More outside information, not justified by any measurement. The probability that any given ball is odd is 1/13, same as any other ball. Again, more outside information.
The balls are labelled (A,B,C...M). Our microstates are every possible set of three measurements, where a measurement is done by putting n distinct balls in each cup, n=1 thru 6. You can't put a ball in both cups, or two of the same ball in one cup. More outside information. For example, { {ABCD}{EFGH}, {AGE}{MBF}, {F}{G} } is a microstate. Macrostates are triplets of the three symbols R, L, and E. Each microstate yields a macrostate, but not vice versa. Now, and only now, can we calculate our pi's, none of which are justified by the data. We search for the microstate(s) that give maximum information, or, actually, any microstate with an information content less than or equal to three bits. If one or more exist, they are our solutions. If none have an information content less than or equal to two bits, then three measurements is the best we can do.
Note that the above "outside information" has been implicitly incorporated into your method. My point is that your method, by making these implicit assumptions, is not in any way fundamental. If we come across a situation in which your assumptions are invalid, your method is wrong. Furthermore, the data (the measured microstates) do not fully support your assumptions. The assumptions are "a priori" knowledge that you bring to the problem.

OK, yeah, my example was not right and just accidentally came out right. You want the information content of AB given that the source normally says AAAAAAAAAAB. You said you would calculate this 2 symbol information content as -log2(p(AB)). I get p(AB) = 1/12 by looking at the sequence. Is this information measure related to entropy?

If you ASSUME that AAAAAAAAAAB represents exactly the full alphabet (A,B) and their respective frequencies in any message (not necessarily true), then pA=11/12 and pB=1/12. If the symbol occurrences are ASSUMED to be independent, then the probability of AB is pAB = pA pB = 11/144, and the information content of AB is -log2(pAB)=3.701... bits. The entropy of the set of two-symbol messages is the weighted average of the information in all two-symbol messages: pAA=121/144, pAB=pBA=11/144, pBB=1/144, (extensive) Entropy = H = -[pAA log2(pAA)+etc.] = 0.8276... bits.
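(A sketch of the arithmetic in the paragraph above, assuming pA = 11/12, pB = 1/12 and independent symbols; it reproduces the 3.701-bit figure for "AB" and checks that the entropy of the set of two-symbol messages is twice the per-symbol entropy.)

```python
from math import log2
from itertools import product

pA, pB = 11/12, 1/12
p = {'A': pA, 'B': pB}

# Information content of the particular message "AB", assuming independence
info_AB = -log2(pA * pB)                       # 3.701... bits

# Extensive entropy over all two-symbol messages AA, AB, BA, BB
He = -sum(p[x] * p[y] * log2(p[x] * p[y]) for x, y in product('AB', repeat=2))

# Equals twice the per-symbol (intensive) entropy
H1 = -(pA * log2(pA) + pB * log2(pB))
print(info_AB, He, 2 * H1)                     # 3.701..., 0.8276..., 0.8276...
```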

I have been talking about mutually-independent symbols, even if they are NOT mutually independent, because you can never know to what EXTENT they are not, unless it is physical entropy with a solid theory or experimental history describing the source.

Or outside knowledge you bring to the problem, like the 13-ball problem, in which any two measurements are assumed (or realized to be) not independent. (make a measurement using the same balls but switching cups and if your first measurement is R, then your second will certainly be L).

Physics is already the best set of compression algorithms for the data we observe (Occam's razor). But in information theory for stuff running around on our computers, no matter what we think we know about the source, there can always be a more efficient way to assign the symbols for the microstates as more data comes in (a better compression algorithm). What algorithm are you going to use to determine the most efficient number of symbols to call your microstates? In practice, something can be tried and then N*H is checked to see if it got smaller. If it did, then you found a better definition of the microstates. In physics, it would constitute a new and improved law.

Yes.

Let's say I send you a GIF image. How are you going to calculate its "information" or entropy content?

The information content of a GIF image will be -log2(pGIF) where pGIF is the probability of that GIF image occurring. My first impulse would be to ASSUME equal probability for each pixel, so if there are M possible pixels and N pixels in the image, p=1/M, pGIF=(1/M)^N and extensive H = - N log2(1/M). Then I might think that most GIF images are not random noise, but areas of constant color, so given that one pixel is green, the pixel just after is more than randomly likely to be green. That changes my estimate of pGIF.
For example, if I look at a large number of GIF files, and note that GG occurs with probability p(G,G) which is significantly larger than p^2=(1/M)^2 that I would calculate if the pixel frequencies were independent, and this holds true for any color of pixel, then if I have a 3-pixel image BRG, I would say that the probability of such an image is pGIF = the probability of B given no previous pixel times probability of R given previous was B times probability of G given previous was R. I expect information content that I calculate will be smaller and my compression will be more efficient, as long as the GIFs that I used to estimate these probabilities are rather representative of all GIFs. If I consider my "macrostate" to be all GIF images of the same size, I expect my entropy will be smaller than that which I calculate assuming all pixels are independent.
As always, I have to pre-estimate pGIF, either by assuming p's are independent and equal to 1/M or looking at some other data set and using conditional probabilities.
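(A rough sketch of the two estimates described above, using a toy one-dimensional "image" of colour symbols: a zeroth-order estimate that assumes every one of M pixel values is equally likely and independent, and a first-order estimate that conditions each pixel on the previous one. The training counts below are invented stand-ins for statistics that would in practice be gathered from a representative set of GIFs.)

```python
from collections import Counter
from math import log2

def info_independent(pixels, M):
    """Bits, assuming every one of M pixel values is equally likely and independent."""
    return len(pixels) * log2(M)

def info_first_order(pixels, cond_counts, first_counts):
    """Bits, using first-pixel frequencies and P(next | previous) estimated elsewhere."""
    bits = -log2(first_counts[pixels[0]] / sum(first_counts.values()))
    for prev, cur in zip(pixels, pixels[1:]):
        row = cond_counts[prev]
        bits += -log2(row[cur] / sum(row.values()))
    return bits

# Invented "training" statistics, standing in for counts gathered from many representative GIFs:
first_counts = Counter({'G': 6, 'B': 2, 'R': 2})
cond_counts = {'G': Counter({'G': 8, 'B': 1, 'R': 1}),
               'B': Counter({'B': 8, 'G': 1, 'R': 1}),
               'R': Counter({'R': 8, 'G': 1, 'B': 1})}

image = list("GGGGBB")
print(info_independent(image, M=3))                        # 6 * log2(3) = 9.51 bits
print(info_first_order(image, cond_counts, first_counts))  # about 5.35 bits: runs are cheap
```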

What are you calling "my" calculation? Applying H to a short message? Who decides which messages are short? Who decides if a file on my computer is or is not the life-sum total of some "source" that is never to be seen again?

If you are making the compression algorithm, you do. If you don't know, then fall back on the p=1/M independent pixel idea, but if you have some outside information, use it to improve the algorithm. PAR (talk) 21:41, 7 December 2015 (UTC)

PAR writes:

If you ASSUME that AAAAAAAAAAB represents exactly the full alphabet (A,B) and their respective frequencies in any message (not necessarily true), then pA=11/12 and pB=1/12. If the symbol occurrences are ASSUMED to be independent, then the probability of AB is pAB = pA pB = 11/144, and the information content of AB is -log2(pAB)=3.701... bits. The entropy of the set of two-symbol messages is the weighted average of the information in all two-symbol messages: pAA=121/144, pAB=pBA=11/144, pBB=1/144, (extensive) Entropy = H = -[pAA log2(pAA)+etc.] = 0.8276... bits.

I count 10 A's where I think you've counted 11, so I'll use 11. You created a new symbol set by using pAA etc... in order to calculate an H and you claim this is in "bits". But it is really 0.8276 bits/symbol where your new symbol set has 1 symbol like "AA" where the old set had 2 symbols (the same "AA"). H uses the p's of single symbols, not a sequence of symbols like you've done, unless you allow that it is a new symbol set. If you had stuck with the old set and calculated H bits/symbol and then multiplied by 2 like I've been saying, you would have gotten the same result, 0.8276. We're still getting the same results and I'm still using shorter math equations. I agree 3.7 is the entropy content of pAB in bits, so you've given an exact measure of how much more information it carries than the expected.

I do not see that you have a different measure of entropy than the 1) and 2) I described above. I maintain that it can be (and very often is) applied to short messages without any problem, with the understanding it has been divorced from any possible source and is pretty ignorant about compressibility and completely ignorant about interdependencies. These are NP hard tasks, and since there is no ultimate or clear universally accepted benchmark for measuring compressibility or symbol interdependences, especially across the board on all data, a blind statistic is a great starting point. Indeed, it has always been the de facto starting point and reference. I maintain 1) and 2) are statistical measures like an average and can be used just as blindly without worries. Ywaz (talk) 00:10, 8 December 2015 (UTC)

Yes, sorry about that, I meant 11.
I think we understand each other, and we both come up with the same answers. Our disagreement is about who is coming up with new symbols. We disagree on what is fundamental. I am saying that it is fundamental that each microstate (or message) carries information and has a certain probability of occurrence and the entropy is the expected value of the information per microstate averaged over all microstates (i.e. averaged over the macrostate). This makes no assumptions about the nature of the microstates, whether they are a string of independent symbols, a string of symbols conditionally dependent on each other, or the energy levels of atoms or molecules, or whatever. You say that if a microstate can be broken up into independent pieces (e.g. individual symbols with independent probabilities pA and pB), those pieces are fundamental, if not, then a microstate is represented by one two-letter symbol with probabilities (e.g. pAA, pAB, pBA, pBB). I'm ok for us to agree to disagree, since I don't think we will disagree on our solutions to problems. Or maybe I did not summarize your position correctly?
More explicitly, if we are dealing with 2-symbol messages with an alphabet of 2 symbols, then there are 4 microstates, AA, AB, BA, BB. The (extensive) entropy, let's call it He, is defined only in terms of the probability of those microstates (pAA, pAB, pBA, pBB). Then He = -[pAA log2(pAA) + pAB log2(pAB) + etc.]. It is only when you ASSUME that they are independent (i.e. there exists pA and pB such that pAA=pA pA, pAB=pA pB, pBA=pB pA and pBB=pB pB) can you then say
He = -[pA pA log2(pA pA) + pA pB log2(pA pB) + pB pA log2(pB pA) + pB pB log2(pB pB)] = -2 ( pA log2(pA) + pB log2(pB) )
which is, as you say, twice the intensive entropy you calculate. I say I have not introduced new symbols, but rather it is you who have introduced new symbols pA and pB by assuming that they exist such that pAA=pA pA, pAB=pA pB, etc. But what if pAA=0.1, pAB=0.1, pBA=0.1 and pBB=0.7? Then the micro states will not consist of two independent symbols, there will be no pA and pB that fit the criterion of independence, yet there will still be an amount of information carried by each symbol pair, and an entropy of the set of symbol pairs equal to the expected value of that information.
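(A small numerical check of the point above, using the example joint probabilities pAA = 0.1, pAB = 0.1, pBA = 0.1, pBB = 0.7: the marginals exist, but no independent pA and pB reproduce the joint, while the extensive entropy is still well defined.)

```python
from math import log2

joint = {('A', 'A'): 0.1, ('A', 'B'): 0.1, ('B', 'A'): 0.1, ('B', 'B'): 0.7}

# Marginal probability that the first symbol is A (by symmetry, same for the second symbol)
pA = joint[('A', 'A')] + joint[('A', 'B')]     # 0.2
pB = 1 - pA                                    # 0.8

# Independence would require joint == product of marginals, which fails here:
print(pA * pA, joint[('A', 'A')])              # 0.04 vs 0.1

# The extensive entropy of the pairs is still defined from the joint probabilities alone:
He = -sum(q * log2(q) for q in joint.values())
print(He)                                      # about 1.36 bits per two-symbol message
```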
I think I can modify my position on short messages. If I am faced with a short message "A5C" and no prior information and I don't expect any more messages, then my first impulse is to say that there is no point in worrying about the information content, since I cannot use whatever number I come up with in any meaningful way. I have no past or future messages to ponder. If I expect more messages to come, then I will say that assuming the message has an alphabet of 3 symbols, A, 5, and C, each equally likely and independent is one of the simpler initial assumptions I can make. As you say, "a blind statistic is a great starting point". I will wait for more messages and modify my assumptions and maybe eventually I will be able to more accurately estimate the information content of new messages, and the extensive entropy per message, or, if the symbols exhibit independence, the intensive entropy per symbol, of messages yet to be received. PAR (talk) 03:37, 8 December 2015 (UTC)

Note: If it hasn't already been apparent, I sometimes throw out a -1 from H and I'm letting H=p*log(1/p) instead of H=-p*log(p).

Example: 3 interacting particles with sum total energy 2 and possible individual energies 0,1,2 may have possible energy distributions 011, 110, 101, 200, 020, or 002. I believe the order is not relevant to what is called a microstate, so you have only 2 symbols for 2 microstates, and get the probability for each is 50-50. Maybe there is usually something that skews this towards low energies. I would simply call each one of the 6 "sub-micro states" a microstate and let the count be included in H. Assuming equal p's again, the first case gives log(2)=1 and the 2nd log(6)=2.58. I believe the first one is the physically correct entropy (the approach, that is, not the exact number I gave). If I had let 0,1,2 be the symbols, then it would have 3*1.46 = 4.38 which is wrong.
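(A sketch of the counting in this example: enumerate the ordered assignments of energies 0, 1, 2 to 3 particles summing to 2, group them into unordered occupation patterns, and compare the two equiprobable entropies log2(2) and log2(6).)

```python
from math import log2
from itertools import product

# All ordered assignments of energies 0, 1, 2 to 3 particles with total energy 2
ordered = [s for s in product(range(3), repeat=3) if sum(s) == 2]
print(ordered)                 # 6 arrangements: permutations of (0,1,1) and (0,0,2)

# Unordered occupation patterns (order ignored)
unordered = {tuple(sorted(s)) for s in ordered}
print(unordered)               # 2 patterns

print(log2(len(unordered)))    # 1.0   (indistinguishable particles)
print(log2(len(ordered)))      # 2.585 (distinguishable particles)
```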

Physically, because of the above, when saying S=k*ln(2)*NH, it requires that you look at specific entropy So and make it = k*ln(2)*H, so you'll have the correct H. This back-calculates the correct H. This assumes you are like me and can't derive Boltzmann's thermodynamic H from first (quantum or not) principles. I may be able to do it for an ideal gas. I tried to apply H to Einstein's oscillators (he was not aware of Shannon's entropy at the time) for solids, and I was 25% lower than his multiplicity, which is 25% lower than the more accurate Debye model. So a VERY simplistic approach to entropy with information theory was only 40% lower than experiment and good theory, for the one set of conditions I tried. I assumed the oscillators had only 4 energy states and got S=1.1*kT where Debye via Einstein said S=1.7*kT

My point is this: looking at a source of data and choosing how we group the data into symbols can result in different values for H and NH, [edit: if not independent]. Using no grouping on the original data is no compression and is the only one that does not use an algorithm plus lookup table. Higher grouping on independent data means more memory is required with no benefit to understanding (better understanding=lower NH). People with bad memories are forced to develop better compression methods (lower NH), which is why smart people can sometimes be so clueless about the big picture, reading too much with high NH in their brains and thinking too little, never needing to reduce the NH because they are so smart. Looking for a lower NH by grouping the symbols is the simplest compression algorithm. The next step up is run-length encoding, a variable symbol length. All compression and pattern recognition create some sort of "lookup table" (symbols = weighting factors) to run through an algorithm that may combine symbols to create on-the-fly higher-order symbols in order to find the lowest NH to explain higher original NH. The natural, default non-compressed starting point should be to take the data as it is and apply the H and NH statistics, letting each symbol be a microstate. Perfect compression for generalized data is not a solvable problem, so we can't start from the other direction with an obvious standard.

This lowering of NH is important because compression is 1 of 3 requirements for intelligence. Intelligence is the ability to acquire highest profit divided by noise*log(memory*computation) in the largest number of environments. Memory on a computing device has a potential energy cost and computation has a kinetic energy cost. The combination is internal energy U. Specifically, for devices with a fixed volume, in both production machines and computational machines, profit = Work output/[k*Temp*N*ln(N/U)] = Work/(kTNH). This is Carnot efficiency W/Q, except the work output includes acquisition of energy from the environment so that the ratio can be larger than 1. The thinking machine must power itself from its own work production, so I should write (W-Q)/Q instead. W-Q feeds back to change Q to improve the ratio. The denominator represents a thinking machine plus its body (environment manipulator) that moves particles, ions (in brains), or electrons (in computers) to model much larger objects in the external world to try different scenarios before deciding where to invest W-Q. "Efficient body" means trying to lower k for a given NH. NH is the thinking machine's algorithmic efficiency for a given k. NH has physical structure with U losses, but that should be a conversion factor moved out to be part of the kT so that NH could be a theoretical information construct. The ultimate body is bringing kT down to kb at 0 C. The goal of life and a more complete definition of intelligence is to feed Work back to supply the internal energy U and to build physical structures that hold more and more N operating at lower and lower k*T. A Buddhist might say we only need to stop being greedy and stop trying to raise N (copies of physical self, kT, aka the number of computations) and we could leave k, T, and U alone. This assumes constant volume, otherwise replace N/U with NN/UV. Including volume would mean searching for higher V per N which means more space exploration per "thought". The universe itself increases V/N (Hubble expansion) but it cancels in determining Q because it causes U/N to decrease at the same rate. This keeps entropy and energy CONSTANT on a universal COMOVING basis (ref: Weinberg's famous 1977 book "The First Three Minutes"), which causes entropy to be emitted (not universally "increased" as the laymen's books still say) from gravitational systems like Earth and Galaxies. The least action principle (the most general form of Newton's law, better than Hamiltonian & Lagrangian for developing new theories, see Feynman's red books) appears to me to have an inherent bias against entropy, preferring PE over KE over all time scales, and thereby tries to lower temp and raise the P.E. part of U for each N on Earth. This appears to be the source of evolution and why machines are replacing biology, killing off species 50,000 times faster than the historical rate. The legal requirement of all public companies is to dis-employ workers because they are expensive and to extract as much wealth from society as possible so that the machine can grow. Technology is even replacing the need for shareholders and skill (2 guys started MS, Apple, google, youtube, facebook, and snapchat and you can see the trend of decreasing intelligence and age and increasing random luck needed to get your first $billion). Silicon, carbon-carbon, and metals are higher energy bonds (which least action prefers over kinetic energy) enabling lower N/U and k, and even capturing 20 times more Work energy per m^2 than photosynthesis.
Ions that brains have to model objects with still weigh 100,000 times more than the electrons computers use.

In the case of the balance and 13 balls, we applied the balance like asking a question and organized things to get the most data out of the test. We may seek more NH answers from people or nature than we give in order to profit, but in producing W, we want to spend as little NH as possible.

[edit: I originally backtracked on dependency but corrected it, and I made a lot of errors with my ratios from not letting k be positive for the ln().] Ywaz (talk) 23:09, 8 December 2015 (UTC)

Ywaz - you wrote:

Example: 3 interacting particles with sum total energy 2 and possible individual energies 0,1,2 may have possible energy distributions 011, 110, 101, 200, 020, or 002. I believe the order is not relevant to what is called a microstate, so you have only 2 symbols for 2 microstates, and get the probability for each is 50-50. Maybe there is usually something that skews this towards low energies. I would simply call each one of the 6 "sub-micro states" a microstate and let the count be included in H. Assuming equal p's again, the first case gives log(2)=1 and the 2nd log(6)=2.58. I believe the first one is the physically correct entropy (the approach, that is, not the exact number I gave). If I had let 0,1,2 be the symbols, then it would have 3*1.46 = 4.38 which is wrong.

If the particles are indistinguishable, 2 microstates, entropy 1. If the particles are distinguishable, 6 microstates, entropy 2.58. (See Identical particles.)

My point is this: looking at a source of data and choosing how we group the data into symbols can result in different values for H and NH

Yes. The fact remains that if we define the macrostate, and the microstates with their probabilities, we have specified entropy. Different groupings of symbols is essentially defining different macrostates, so yes, different entropies. If we have a box with a partition and oxygen at the same temperature and pressure on either side and remove the partition - no change in entropy. If we have one isotope of oxygen on one side, another on the other, but cannot experimentally distinguish the two, if we remove the partition, again, no change in entropy. If we can experimentally distinguish the two, we have a different definition of macrostate: we remove the partition, entropy increases. Altering the definition of macrostate for the same situation, we get different entropies.
If we have a GIF file and no other information, we assume independent pixels, we calculate entropy. If we have two types of GIF files, one a photo, the other noise, but we cannot know beforehand which is which, then we assume independent pixels, get an entropy for each. If we can distinguish a photo from noise beforehand, then we can assume some dependence between pixels for the photo, get one entropy, no dependence for the noise and get another entropy. Again, altering the definition of macrostate for the same situation, we get different entropies.
I read the rest of your post, it's interesting, but off-topic, I think. PAR (talk) 04:53, 10 December 2015 (UTC)

Thanks for the link to indistinguishable particles. The clearest explanation seems to be here, the mixing paradox. The idea is this: if we need to know the kTNH energy required (think NH for a given kT) to return to the initial state at the level 010, 100, 001 with correct sequence from a certain final sequence, then we need to do the microstates at that low level. Going the other way, "my" method should be mathematically the same as "yours" if it is required to NOT specify the exact initial and final sequences, since those were implicitly not measured. Measuring the initial state sequences without the final state sequences would be changing the base of the logarithm mid-stream. H is in units of true entropy per symbol when the base of the logarithm is equal to the number of distinct symbols. In this way H always varies from 0 to 1 for all objects and data packets, giving a true disorder (entropy) per symbol (particle). You multiply by ln(2)/ln(n) to change base from 2 to n symbols. Therefore the ultimate objective entropy (disorder or information) in all systems, physical or information, when applied to data that accurately represents the source should be

H = Σi (counti/N) logn(N/counti)

where i=1 to n distinct symbols in data N symbols long. Shannon did not specify which base H uses, so it is a valid H. To convert it to nats of normal physical entropy or entropy in bits, multiply by ln(n) or log2(n). The count/N is inverted to make H positive. In this equation, with the ln(2) conversion factor, this entropy of "data" is physically same as the entropy of "physics" if the symbols are indistinguishable, and we use energy to change the state of our system E=kT*NH where our computer system has a k larger than kb due to inefficiency. Notice that changes in entropy will be the same without regard to k, which seems to explain why ultimately distinguishable states get away with using higher-level microstates definitions that are different with different absolute entropy. For thermo, kb is what appears to have fixed not caring about the deeper states that were ultimately distinguishable.
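(A sketch of the normalized form described above, assuming the reading H = Σi (counti/N) logn(N/counti) with n the number of distinct symbols; multiplying by log2(n) or ln(n) converts it to bits or nats per symbol. The function below is only an illustration of that reading.)

```python
from collections import Counter
from math import log, log2

def normalized_entropy(data):
    """H with the log base set to the number of distinct symbols; ranges from 0 to 1."""
    counts = Counter(data)
    N = len(data)
    n = len(counts)
    if n < 2:
        return 0.0   # a one-symbol message carries no entropy; log base 1 is undefined
    return sum((c / N) * log(N / c, n) for c in counts.values())

data = "AAAAAAAAAAB"
Hn = normalized_entropy(data)
n = len(set(data))
print(Hn)             # about 0.44, between 0 and 1
print(Hn * log2(n))   # the same quantity expressed in bits/symbol
print(Hn * log(n))    # the same quantity expressed in nats/symbol
```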

The best wiki articles are going to be like this: you derive what is the simplest but perfectly accurate view, then find the sources using that conclusion to justify its inclusion.

So if particles (symbols) are distinguishable and we use that level of distinguishability, the count at the 010 level has to be used. Knowing the sequence means knowing EACH particle's energy. The "byte-position" in a sequence of bits represents WHICH particle. This is not mere symbolism because the byte positions on a computer have a physical location in a volume, so that memory core and CPU entropy changes are exactly the physical entropy changes if they are at 100% efficiency (Landauer's principle). (BTW the isotope method won't work better than different molecules because it has more mass. This does not affect temperature, but it affects pressure, which means the count has to be different so that pressure is the same. So if you do not do anything that changes P/n in PV=nRT, using different gases will have no effect on your measured CHANGE in entropy, and you will not know if they mixed or not.)

By using indistinguishable states, physics seems to be using a non-fundamental set of symbols, which allows it to define states that work in terms of energy and volume as long as kb is used. The ultimate, as far as physicists might know, might be phase space (momentum and position) as well as spin, charge, potential energy and whatever else. Momentum and position per particle are 9 more variables because unlike energy momentum is a 6D vector (including angular), and a precise description of the "state" of a system would mean which particle has the quantities matters, not just the total. Thermo gets away with just assigning states based on internal energy and volume, each per particle. I do not see kb in the ultimate quantum description of entropy unless they are trying to bring it back out in terms of thermo. If charge, spin, and particles are made up of even smaller distinguishable things, it might be turtles all the way down, in which case, defining physical entropy as well as information entropy in the base of the number of symbols used (our available knowledge) might be best. Ywaz (talk) 11:48, 10 December 2015 (UTC)

Ywaz - you wrote:

The clearest explanation seems to be here, the mixing paradox. The idea is this: if we need to know the kTNH energy required (think NH for a given kT) to return to the initial state at the level 010, 100, 001 with correct sequence from a certain final sequence, then we need to do the microstates at that low level. Going the other way, "my" method should be mathematically the same as "yours" if it is required to NOT specify the exact initial and final sequences, since those were implicitly not measured. Measuring the initial state sequences without the final state sequences would be changing the base of the logarithm mid-stream. H is in units of true entropy per symbol when the base of the logarithm is equal to the number of distinct symbols. In this way H always varies from 0 to 1 for all objects and data packets, giving a true disorder (entropy) per symbol (particle). You multiply by ln(2)/ln(n) to change base from 2 to n symbols. Therefore the ultimate objective entropy (disorder or information) in all systems, physical or information, when applied to data that accurately represents the source should be

H = Σi (counti/N) logn(N/counti)

where i=1 to n distinct symbols in data N symbols long.

I don't understand what you are saying. What is the meaning of "sequence"? If "sequence" 001 is distinct from 100, then "sequence" means microstate and statements like "Measuring the initial state sequences" are improper, microstates are not measured in thermo, only macrostates are measured. I assume the macrostate is total energy, so that microstates 010, 100, 001 form a macrostate: energy measured to be 1.
Again, what you call "Objective Entropy" is not entropy, it is the amount of information in a particular message (microstate), assuming independence, and assuming that the frequencies of the symbols in the message are a perfect indication of the frequencies of the population (macrostate) from which it is drawn. It's ok to say it's a "best estimate" of the entropy, just as the best estimate of the mean of a normal distribution given one value is that value. But please don't call it entropy. Likewise, what you call entropy/symbol is an estimate of the information per symbol.

The best wiki articles are going to be like this: you derive what is the simplest but perfectly accurate view, then find the sources using that conclusion to justify its inclusion.

Fine, but please don't invent new quantities and call them by names which are universally accepted to be something else. It just creates massive confusion and interferes with communication.

By using indistinguishable states, physics seems to be using a non-fundamental set of symbols, which allows it to define states that work in terms of energy and volume as long as kb is used. The ultimate, as far as physicists might know, might be phase space (momentum and position) as well as spin, charge, potential energy and whatever else. Momentum and position per particle are 9 more variables because unlike energy momentum is a 6D vector (including angular), and a precise description of the "state" of a system would mean which particle has the quantities matters, not just the total.

Physics is not using a non-fundamental set of symbols. With indistinguishable particles, the "ultimate" you describe does not exist. For distinguishable particles 011, 101, 110 are distinct microstates (alphabet of two, message of three), for indistinguishable particles, they are not. (alphabet of one, message of one: 011, 101, 110, are the same symbol.) You cannot use the phrase "which indistinguishable particle", it's nonsense, there is no "which" when it comes to indistinguishable particles.

Thermo gets away with just assigning states based on internal energy and volume, each per particle. I do not see kb in the ultimate quantum description of entropy unless they are trying to bring it back out in terms of thermo.

Thermo doesn't "get away" with anything. It only works with observable, measurable quantities and assigns macrostates. Different observable quantities, different macrostates. Thermo knows nothing about microstates, and doesn't need to in order to calculate thermodynamic entropy (to within a constant). In thermal physics, microstates are unmeasurable. If they were measurable, they would be macrostates, and their entropy would be zero. Please note that there are, for our purposes, two kinds of entropy, thermodynamic entropy and information entropy. The two have nothing to do with each other, until you introduce the statistical mechanics model of entropy. Then the two are basically related by the statistical mechanics constant kB. (Boltzmann's S=kB ln(W)). The physical information entropy is proportional to the amount of information you are missing about the microstate by simply knowing the macrostate (internal energy, volume, or whatever). PAR (talk) 06:42, 11 December 2015 (UTC)

PAR: " what you call "Objective Entropy" is not entropy, it is the amount of information in a particular message (microstate), "

Don't you mean entropy = information in a Macrostate? That is what you should have said.

I didn't make it up. It's normally called normalized entropy, although they usually refer to this H with logn as "normalized entropy" when, according to Shannon, they should say "per symbol" and reserve the word "entropy" for NH. I'm saying there's a serious objectivity to it that I did not realize until reading about indistinguishable states.

I hope you agree "entropy/symbol" is a number that should describe a certain variation in a probability distribution, and that if a set of n symbols were made of continuous p's, then a set of m symbols should have the same continuous distribution. But you can't do that (get the same entropy number) for the exact same "extrapolated" probability distributions if they use a differing number of symbols. You have to let the log base equal the number of symbols. I'll get back to the issue of more symbols having a "higher resolution". The point is that any set of symbols can have the same H and have the same continuous distribution if extrapolated.

If you pick a base like 2, you are throwing in an arbitrary element, and then have to call it (by Shannon's own words) "bits/symbol" instead of "entropy/symbol". Normalized entropy makes sense because of the following

entropy in bits/symbol = log2(2^("avg" entropy variation/symbol))
entropy per symbol = logn(n^("avg" entropy variation/symbol))

The equation I gave above is the normalized entropy that gives this 2nd result.

Previously we showed for a message of N bits, NH=MH' if the bits are converted to bytes and you calculate H' based on the byte symbols using the same log base as the bits, and if the bits were independent. M = number of byte symbols = N/8. This is fine for digital systems that have to use a certain amount of energy per bit. But what if energy is per symbol? We would want NH = M/8*H' because the byte system used 8 fewer symbols. By using log base n, H=H' for any set of symbols with the same probability distribution, and N*H=M/8*H.

Bytes can take on an infinite number of different p distributions for the same H value, whereas bits are restricted to a certain pair of values for p0 and p1 (or p1 and p0) for a certain H, since p1=1-p0. So bytes have more specificity, which could allow for higher compression or describing things like 6-vector momentum instead of just a single scalar for energy, using the same number of SYMBOLS. The normalized entropy allows them to have the same H to get the same kTNH energy without going through contortions. So for N particles let's say bits are being used to describe each one's energy with entropy/particle H, and bytes are used to describe their momentums with entropy/particle H'. Momentums uniquely describe the energy (but not vice versa). NH=NH'. And our independence property does not appear to be needed: H' can take on specific values of p's that satisfy H=H', not some sort of average of those sets. Our previous method of NH=MH' is not as nice, violating Occam's razor. Ywaz (talk) 14:57, 11 December 2015 (UTC)
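(A quick numerical check of the NH = MH' relation referenced above, assuming independent but biased random bits grouped into 8-bit symbols, with both entropies computed in base 2 from empirical frequencies; the names below are only illustrative.)

```python
import random
from collections import Counter
from math import log2

def H_per_symbol(seq):
    """Shannon H in bits/symbol from the empirical symbol frequencies of seq."""
    counts = Counter(seq)
    N = len(seq)
    return -sum((c / N) * log2(c / N) for c in counts.values())

random.seed(0)
bits = [1 if random.random() < 0.3 else 0 for _ in range(800_000)]   # independent, biased bits
bytes_ = [tuple(bits[i:i + 8]) for i in range(0, len(bits), 8)]      # grouped into 8-bit symbols

N, M = len(bits), len(bytes_)
print(N * H_per_symbol(bits))     # total entropy in bits, counted over bit symbols
print(M * H_per_symbol(bytes_))   # approximately the same total, counted over byte symbols
```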

Ywaz - you said:

PAR: " what you call "Objective Entropy" is not entropy, it is the amount of information in a particular message (microstate), " Don't you mean entropy = information in a Macrostate? That is what you should have said.

What I said was "what you call "Objective Entropy" (given in your equation above) is not entropy, it is the amount of information in a particular message (microstate), assuming independence, and assuming that the frequencies of the symbols in the message are a perfect indication of the frequencies of the population (macrostate) from which it is drawn."
Entropy is the AVERAGE amount of MISSING information in a macrostate. It is the AVERAGE amount of information SUPPLIED by knowing a microstate.
Assuming the probability of the i-th microstate is pi, and assuming they are independent, the information supplied by knowing the i-th microstate is -log2(pi) bits. The entropy is the weighted average: It is the sum over all i of -pi log2(pi) (bits). Note that the sum over all i of pi is unity. PAR (talk) 02:56, 14 December 2015 (UTC)

Assessment comment

The comment(s) below were originally left at Talk:Entropy (information theory)/Comments, and are posted here for posterity. Following several discussions in past years, these subpages are now deprecated. The comments may be irrelevant or outdated; if so, please feel free to remove this section.

# As requested by Jheald, some comments on my assessment. As I see it the article is reasonably close to B class. It's mainly failing the first B-class criterion: suitable referencing. The article is almost completely missing inline citations. The other criteria seem mostly Ok'ish. (I'm not much of an expert on the content, so I'll refrain from any statements about comprehensiveness.) (TimothyRias (talk) 12:22, 29 August 2008 (UTC))

Last edited at 12:22, 29 August 2008 (UTC). Substituted at 14:34, 29 April 2016 (UTC)

Rationale

quote question
...For instance, in case of a fair coin toss, heads provides log2(2) = 1 bit of information, which is approximately 0.693 nats or 0.631 trits. Because of additivity, n tosses provide n bits of information, which is approximately 0.693n nats or 0.631n trits.

Now, suppose we have a distribution where event i can happen with probability pi. Suppose we have sampled it N times and outcome i was, accordingly, seen ni = N pi times. The total amount of information we have received is

Σi ni log2(1/pi) = N Σi pi log2(1/pi)

The average amount of information that we receive with every event is therefore

H = Σi pi log2(1/pi)
Will somebody please explain why a single event has entropy = 1 bit, but according to the last formula the average entropy = 0.5 bit?

95.132.143.157 (talk) 11:13, 11 November 2015 (UTC)

The last formula gives 1 bit as desired. The value of log2(1/pi) is 1 bit for a fair coin where pi = 1/2. The summation adds half of one bit to half of one bit to get one bit. 𝕃eegrc (talk) 14:49, 11 November 2015 (UTC)
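(In code form, assuming a fair coin with two possible results:)

```python
from math import log2

p = [0.5, 0.5]                              # the two possible results of a fair coin
H = sum(pi * log2(1 / pi) for pi in p)      # 0.5 bit + 0.5 bit
print(H)                                    # 1.0 bit
```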

Why is summation used for a single toss? 95.132.143.157 (talk) 04:18, 15 November 2015 (UTC)

The summation is over the possible results, not just the observed results. 𝕃eegrc (talk) 13:30, 16 November 2015 (UTC)

The Rationale section reads, in part:

"I(p) is monotonic – increases and decreases in the probability of an event produces increases and decreases in information, respectively."

shouldn't that be "increases and decreases in the probability of an event produces decreases and increases in information, respectively." since less probable events convey more information? In other words, I(p) is monotonic, but isn't it monotonic decreasing, rather than monotonic increasing, as the article implies? 207.165.235.61 (talk) 16:46, 14 September 2016 (UTC) Gabriel Burns

Entropy definition using frequency distributions

I propose an addition to the Definition section of the article:


Using a frequency distribution

A frequency distribution fi is related to its probability mass function pi by the equation:

pi = fi / Σj fj

A definition of entropy, in base b, of random variable X, with frequencies fi, that uses frequency distributions is:

Hb(X) = logb(Σj fj) - (Σi fi logb(fi)) / (Σj fj)

People that need to calculate entropy for a large number of outcomes will appreciate that fi logb(fi) can be pre-calculated and that this definition does not require you to know the total number of observations before starting calculation. The definition using the probability mass function requires that you know all observations (to calculate pi) before you start calculation.
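(A sketch of the incremental calculation this definition allows, assuming the algebraic rearrangement Hb = logb(F) - (Σi fi logb(fi))/F with F = Σj fj: the two running totals can be updated one observation at a time, and the class and method names below are only illustrative.)

```python
from collections import Counter
from math import log

class StreamingEntropy:
    """Keeps H_b = log_b(F) - (sum_i f_i*log_b(f_i))/F up to date as observations arrive."""
    def __init__(self, base=2):
        self.base = base
        self.counts = Counter()
        self.total = 0            # F = sum of all frequencies
        self.sum_f_log_f = 0.0    # running value of sum_i f_i*log_b(f_i)

    def add(self, symbol):
        f = self.counts[symbol]
        if f > 0:
            self.sum_f_log_f -= f * log(f, self.base)   # retire the old f*log_b(f) term
        f += 1
        self.counts[symbol] = f
        self.sum_f_log_f += f * log(f, self.base)       # add the updated term
        self.total += 1

    def entropy(self):
        if self.total == 0:
            return 0.0
        return log(self.total, self.base) - self.sum_f_log_f / self.total

se = StreamingEntropy()
for ch in "AAB":
    se.add(ch)
print(se.entropy())   # 0.918..., matching -sum(p_i*log2(p_i)) with p_i = f_i / F
```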


Disclaimer: I am the author of this research and the supporting link is to a website I control. Additionally, I would move the cross-entropy definition under a subheading. Would any interested parties please share if you believe this is a useful contribution to the article? --Full Decent (talk) 05:02, 12 March 2017 (UTC)

I don't see a problem with it as long as you include something like "According to a 2015/new/ongoing study (reference), entropy may be defined in terms of frequency distributions... "
177.68.225.247 (talk) 21:44, 24 March 2017 (UTC)

Clarifying jargon in lead section

A couple options: Use wikilinking (to offer readers a ready path to clarification and expanded vocabulary) or instead use a more common/less specialized term (one more readily understandable to a general readership, though perhaps not the first choice of those with a particular topical interest).

Regarding stochastic as used in the first sentence, it seems [[stochastic]] might suffice for the former, and random might work for the latter.

Gonna' go with the wikilink for now. Hopefully this will serve as a broadly accommodating 'middle path' (between bare tech jargon and plain speech). I'd offer no objection to someone just substituting random instead though.

A fellow editor, --75.188.199.98 (talk) 14:05, 11 November 2017 (UTC)

As written now (possibly before your revision), I would think that "probabilistic stochastic" is redundant. "Stochastic" (or similar) should be sufficient. Attic Salt (talk) 14:09, 11 November 2017 (UTC)
Thanks for offering input, I took your suggestion and reduced "probabilistic stochastic" to wikilinked "stochastic". Also started to consider adding some more wikilinks ... but then the more I looked at the lead (and indeed the article as a whole) and gave it consideration I started to feel there were general concerns beyond a few terms and added a {{jargon}} tag to the article. Much of it seems pretty opaque to me. Overall seems quite long as well, but hard to judge the value of that without better being able to follow the material in the first place.
Perhaps someone might eventually be found who both follows the topic and has skill at translating such into plain speech. --75.188.199.98 (talk) 14:32, 11 November 2017 (UTC)
The lede is a mess. Attic Salt (talk) 14:34, 11 November 2017 (UTC)