Talk:Perplexity

Latest comment: 4 months ago by TokenByToken in topic Scaling Laws

Perplexity as confusion

I'd humbly suggest that a small amount of space be found to formally recognize that perplexity is, to most English speakers' understanding, a state of confusion or bewilderment. I think it's great to learn details about this math theory, but perhaps someone can find a spot to mention the origin of the word or any kind of clue/reference to the actual non-math meaning of perplexity.

Action taken: I've added a link to the Wiktionary definition. --84.9.73.211 (talk) 10:01, 6 February 2008 (UTC)

Plagiarism

Phrase "Perplexity Per Word"

"Perplexity Per Word" is taken verbatim from the following uncited source: [1] --133.11.168.148 (talk) 05:32, 27 May 2010 (UTC)

It's just perplexity scaled by the number of words. It's a really common term and would be even more common if most papers didn't just say "perplexity" and assume the reader knows it's per word. --130.108.16.157 (talk) 17:44, 20 April 2011 (UTC)
Deriving a "per word" unit from a unit applied to text is an obvious thing to do for anybody in the field. Citing a source for it would be most unusual. There is no plagiarism here. Jojo (talk) 16:42, 15 April 2017 (UTC)
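
For readers following this thread, here is a minimal sketch of what "per word" usually means in practice: the exponential of the average negative log-probability a model assigns to each word. The function name and the probabilities are hypothetical, chosen only for illustration.

```python
import math

def perplexity_per_word(word_probs):
    """Exponential of the mean negative log-probability over the words."""
    if not word_probs:
        raise ValueError("need at least one word probability")
    total_log_prob = sum(math.log(p) for p in word_probs)
    return math.exp(-total_log_prob / len(word_probs))

# Hypothetical probabilities a model assigns to each word of a four-word text.
probs = [0.25, 0.25, 0.25, 0.25]
print(perplexity_per_word(probs))  # approximately 4
```

A model that gives every word probability 1/4 is as uncertain as a uniform choice among four words, which is why the result is about 4.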

Better wording

Why not just write this page as follows:

In information theory, perplexity is a unit of information such that a perplexity of $p$ equals $\log_2 p$ bits, or $\log_{10} p$ hartleys, etc.

The rest of the page is redundant with most information theory articles. MisterSheik (talk) 02:39, 21 April 2011 (UTC)
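
The proposed wording amounts to a change of logarithm base. A small sketch of the conversions, with hypothetical helper names, might look like this:

```python
import math

def perplexity_to_bits(p):
    """Information content, in bits, of a perplexity p."""
    return math.log2(p)

def perplexity_to_hartleys(p):
    """Information content, in hartleys, of a perplexity p."""
    return math.log10(p)

def bits_to_perplexity(bits):
    """Inverse conversion: perplexity corresponding to a number of bits."""
    return 2.0 ** bits

print(perplexity_to_bits(8))        # 3 bits
print(perplexity_to_hartleys(100))  # 2 hartleys
```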

Since when does the fact that similar information can be found in academic articles mean that a Wikipedia page should be removed?

I agree, this could even be merged with Entropy. Full Decent (talk) 22:17, 18 April 2016 (UTC)

By the same token, any other page about a derived concept could be merged. The main purpose of an encyclopedia is to help people to quickly understand a concept. This page explains the perplexity measure. It also links to entropy for those who want to know more about the theory behind it. Making it a subsection of "Entropy (information theory)" would make it much more difficult to find. Jojo (talk) 16:42, 15 April 2017 (UTC)

Seriously out-of-date figures...

Article currently states:

The lowest perplexity that has been published on the Brown Corpus (1 million words of American English of varying topics and genres) as of 1992 is indeed about 247 per word

I'm really not sure why the best we could do 25 years ago is relevant here. I see perplexity of language models in English (although not on the Brown corpus) that are substantially lower than this (e.g. ~100 for recurrent neural net models). If the Brown corpus is the benchmark, what are more recent figures on that corpus? JulesH (talk) 09:17, 3 June 2017 (UTC)

I haven't seen any recent experiments on the Brown corpus, but there's no reason not to replace that with a more common benchmark. The One Billion Word Benchmark [1] is common now, and Exploring the Limits of Language Modeling (2016) [2] gets a perplexity of 23.7 using an ensemble of massive LSTM and skipgram models. I'm not aware of any better results on that set. That whole section needs to be rewritten if we switch the example, though; it refers to the Brown corpus and the 247 ppl result all over the place. 130.108.103.115 (talk) 23:01, 26 September 2017 (UTC)

References

  1. ^ Chelba, Ciprian; Mikolov, Tomas; Schuster, Mike; Ge, Qi; Brants, Thorsten; Koehn, Phillipp; Robinson, Tony (December 2013). "One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling". arXiv:1312.3005.
  2. ^ Jozefowicz, Rafal; Vinyals, Oriol; Schuster, Mike; Shazeer, Noam; Wu, Yonghui (February 2016). "Exploring the Limits of Language Modeling". arXiv:1602.02410.

2020 onwards status

Even more so after BERT and the subsequent advances from late 2018 onwards. -- Preceding unsigned comment added by 93.172.199.165 (talk) 09:56, 11 June 2020 (UTC)

  • Perplexity of 12 bits per word was not unusual by 2021, and the GPT-4 report indicates a test loss below 1.5 bits per word on Python code (but that's difficult to convert to a natural-language loss, let alone perplexity). See https://paperswithcode.com/task/language-modelling for more perplexities and losses on different datasets. Note that a model's designers may calculate perplexity on the training, validation, or test dataset: training data may be memorized by the model; the test/validation set can't be, but it still adds bias through hyperparameter tuning, while the validation/test dataset shows the true performance (see [2], [3] and [4] for details). In some cases the people evaluating the perplexity don't even know whether the model saw their dataset during training, which may make such a number kinda useless. Ain92 (talk) 16:44, 30 March 2023 (UTC)
Would be a big supporter of removing the section on Brown corpus. Who cares? If you've ever looked at the raw source data, it's not fantastic quality relative to modern datasets. A lot of this page is devoted to discussing what is probably esoteric nowadays (2020s+). I also don't think it would necessarily make sense to replace with a new dataset because I don't think exact perplexity numbers are tracked in this fashion anymore, let alone on a standard dataset. Happy to be wrong though. TokenByToken (talk) 22:37, 21 June 2024 (UTC)
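
One reason such conversions are awkward is that modern models report loss per token, not per word. A rough sketch of the bookkeeping, assuming the loss is a mean cross-entropy in nats per token and that the token and word counts of the evaluation text are known (all names and numbers here are hypothetical):

```python
import math

def nats_to_bits(nats):
    """Convert a loss in nats to bits."""
    return nats / math.log(2)

def per_word_perplexity_from_token_loss(nats_per_token, n_tokens, n_words):
    """Spread the total loss over words instead of tokens, then exponentiate."""
    total_nats = nats_per_token * n_tokens
    return math.exp(total_nats / n_words)

# Hypothetical evaluation text: 130 tokens covering 100 words.
loss = 2.0  # hypothetical mean cross-entropy, in nats per token
print(nats_to_bits(loss))
print(per_word_perplexity_from_token_loss(loss, n_tokens=130, n_words=100))
```

The result depends on the tokenizer through the token count, so per-word figures from different models are only comparable when computed on the same text.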

Inverse perplexity

The inverse of the perplexity (which, in the case of the fair k-sided die, represents the probability of guessing correctly), is 1/1.38 = 0.72, not 0.9.

This seems barmy to me. Inverse perplexity seems like a crazy thing. Perplexity is kind of like average list length in an underspecified input system. Inverse list length WTF? I can't make any useful mental picture of this. -- MaxEnt 00:22, 30 March 2018 (UTC)

In point of fact, the "perplexity" in "bits" is the average number of "Yes/No" questions that one would need to ask to guess the correct word assuming that you know the vocabulary but not the word-probabilities.
To the best of my knowledge, the "inverse perplexity" doesn't mean _anything_.
(Possibly the writer was thinking about lg(1/p_i) being the "surprisal" corresponding to word "i" in the vocabulary, and got the two concepts confused. "Entropy" is then the "expected surprisal".) Speaker to Wolves (talk) 17:28, 29 November 2023 (UTC)
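
To make the quantities in this thread concrete, here is a small sketch computing entropy as expected surprisal, perplexity as 2 to the power of the entropy, and the inverse perplexity. The two-outcome distribution with probabilities 0.9 and 0.1 is an assumption inferred from the quoted figures (1.38, 0.72, 0.9), not something stated in this thread.

```python
import math

def entropy_bits(probs):
    """Expected surprisal, where the surprisal of outcome i is log2(1/p_i)."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def perplexity(probs):
    """Perplexity of a distribution: 2 raised to its entropy in bits."""
    return 2.0 ** entropy_bits(probs)

fair_die = [1 / 6] * 6  # fair 6-sided die
skewed = [0.9, 0.1]     # assumed source of the quoted 1.38 and 0.72

for name, dist in [("fair die", fair_die), ("skewed", skewed)]:
    ppl = perplexity(dist)
    print(name, round(ppl, 2), round(1 / ppl, 2), max(dist))
```

For the fair die the inverse perplexity equals the chance of a correct guess (1/6), but for the skewed distribution it is about 0.72 while the best guess is right with probability 0.9, so the two coincide only in the uniform case.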

Removal of Sources Added on 14:22, 31 July 2023 by Intuivo

The addition of the sources from 14:22, 31 July 2023 by Intuivo is being undone for the following reasons:

  • Relevance: The cited COVID-19 article discusses Perplexity in a specific context (LDA models) that doesn't align with the broader concept in the main article.
  • Availability of Authoritative Sources: More suitable references that directly address Perplexity in information theory are likely available.
  • Editorial Integrity: The inclusion of more relevant and direct sources would better adhere to Wikipedia's guidelines on reliable sourcing and clarity.

The undoing aims to maintain the article's focus and ensure that the sources are directly relevant to the subject of Perplexity. -- Preceding unsigned comment added by Ynwaps (talk • contribs) 20:45, 24 August 2023 (UTC)

Scaling Laws

Something that might be interesting to discuss on this page is perplexity as used to guide the so-called "scaling laws" of large language models:

2020, Kaplan et al.: Scaling Laws for Neural Language Models

2022, Hoffmann et al.: Training Compute-Optimal Large Language Models

It's already discussed on the linked page, so perhaps just a link from here to there would suffice. Still, interesting to see perplexity used in this way. TokenByToken (talk) 22:42, 21 June 2024 (UTC)