Wikipedia talk:Abandoned stubs

So the descriptions here seem a little odd

The text of the essay doesn't seem to align very well with the footnotes. Consider:

  • "an obscure topic" = "in the bottom 50% of articles by views"
    • Nothing in the middle quintile should be called "obscure" merely because it's in the center of the normal distribution. Just one page view less than the actual median is not "obscure". That's really not a reasonable measurement. Maybe a different cutoff should be tried (perhaps somewhere around the 20th percentile). Also, page views can vary over time, so what's the time period? How long after creation was it measured?
    • Maybe page views aren't the best measurement. Maybe it would make more sense to compare orphans vs single-link vs non-orphan articles. Obscure topics tend not to have very many incoming links.
  • "Such articles, typically sub-stubs"
    • Really? We officially stopped calling very short articles 'sub-stubs' in 2005 (=18 years ago), but sub-stubs usually had about 10–20 words, zero sources, and no content beyond a dictionary definition ("Airplanes are flying machines"). What's your basis for saying that such extremely short articles were "typically" created in 2018?
    • I just opened 100 consecutive page creations in yesterday's RecentChanges, in the Draft: and main namespaces. 87 of them were redirects. Two were dab pages. None of the 11 actual articles (including one unsourced BLP that I will be glad to see deleted) were sub-stubs.
  • "developed articles" = "over 5000 bytes"
    • Where did this very round number come from? What's its relationship to the median existing article? What's its relationship to standards like the DYK minimum? How long after creation was this measured? Three of the regular DYKs right now were less than 5,000 bytes at the end of the day they were created, and one of them is still under 5,000. That one has 4 paragraphs, five different sources, 15 sentences, and 265 words of readable prose. Is that really "undeveloped"?
    • Is "bytes" the right way to measure this quality? You can fill up a lot of bytes by adding an infobox with a lot of blank parameters, by adding a large number of categories, or by adding a long ==See also== section. You can also reach that by adding a lot of citations without having much readable content.
  • "turned into developed articles by other contributors" = "collectively contributing at least as much as the creator"
    • Is that a reasonable measurement? If Alice adds 2500 bytes, then Bob has to add 2500 bytes to count, but if Alice adds 4500 bytes and Bob adds 2500 bytes, then this counts as a failure to collaborate, even though the article has now reached, through non-trivial collaboration, a state you're calling "developed"?
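
For anyone who wants to test the footnote definitions against concrete numbers, here is a minimal sketch of the two rules as quoted in the list above ("over 5000 bytes" and "collectively contributing at least as much as the creator"). The class and field names are hypothetical, and this is only one reading of the footnotes, not the query that was actually run.

```python
from dataclasses import dataclass

# Cutoff quoted from the essay's footnote: "developed" = over 5000 bytes.
THRESHOLD_BYTES = 5000


@dataclass
class Article:
    creator_bytes: int  # bytes added by the creating editor (hypothetical field)
    others_bytes: int   # bytes added by all other editors combined (hypothetical field)


def is_developed(article: Article) -> bool:
    """True if the total size is over the byte cutoff."""
    return article.creator_bytes + article.others_bytes > THRESHOLD_BYTES


def developed_by_others(article: Article) -> bool:
    """True if the article is developed AND the other editors collectively
    contributed at least as much as the creator."""
    return is_developed(article) and article.others_bytes >= article.creator_bytes


# Alice 4500 + Bob 2500: over the cutoff, yet not counted as collaboration.
lopsided = Article(creator_bytes=4500, others_bytes=2500)
print(is_developed(lopsided), developed_by_others(lopsided))
```

Under this reading, the 4500 + 2500 case is "developed" by size but is not credited to other contributors, which is the asymmetry the last bullet objects to.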

I think we need to spend more time understanding the existing corpus, including doing quick-and-dirty estimates, but I think we should be careful to describe them for what they are. Five thousand bytes isn't a developed article; it's an article that takes up 5K bytes in the database. Also, it'd be ideal to provide links to Quarry queries or other descriptions of how the calculations were done. That would let other editors try to replicate or update it.

More fundamentally, I wonder: Is the goal still the sum of all human knowledge, or has the goal changed, and now we only want that fraction of human knowledge that is given to us in articles that are at least 5,000 bytes in length? WhatamIdoing (talk) 03:20, 16 March 2023 (UTC)

5,000 bytes seems pretty high to me as a cut-off for a well developed article, and has little to do with readable prose. Scaly-naped amazon is exactly 5,000 bytes and seems quite well-developed (if I remove the stub tag, which I think would be warranted, it would fall below 5,000). Yellow penduline tit is 4,999 bytes and much less developed (the references make up a greater proportion of bytes). Phyllastrephus is 5,023 bytes and poorly developed (most of the "prose" is a list of species). These articles started out as stubs created by a bot (i.e. fully automated article creation), and were expanded by other editors. Plantdrew (talk) 22:09, 16 March 2023 (UTC)
That first one has 30 sentences and 441 words of readable prose. The stub tag will be removed the next time someone runs AWB with genfixes on it. ORES suggests Start-class, but C-class is also possible. WhatamIdoing (talk) 00:10, 17 March 2023 (UTC)
I’ll reply in more depth later, but fewer views than the median is, I believe, a reasonable way of defining this; views have a very long tail.
In addition, definitions change. Sub-stubs now refer to the sort of article editors like Lugnuts create.
5000 bytes is a reasonable approximation of an underdeveloped article; we are forced to rely on bytes as the tool I use, Quarry, does not support other assessments of an article's quality. BilledMammal (talk) 22:42, 16 March 2023 (UTC)
  • I do not agree that a single page view less than the actual median is a reasonable way of determining what's obscure. That's like saying that someone with an IQ of 104 has "low IQ", because the median is 105. If you've got the data (or a sample – a few thousand randomly selected articles should be enough), we can figure out what a sensible cutoff is. But whatever cutoff is used, it needs to be described fairly.
  • WP:Substub (marked historical since 2005, after a series of community discussions) gives a definition and examples. Wikipedia:Glossary#Substub gives a simpler definition. When people apply this to "the sort of article editors like Lugnuts create", it's just a pejorative. But my question for you isn't whether you're allowed to use the word; my question for you is whether new substubs are "typically" being created. I didn't find any. How many have you seen?
  • I don't agree that 5000 bytes is a reasonable approximation of an underdeveloped article. Do you think that the articles @Plantdrew linked above are all underdeveloped?
WhatamIdoing (talk) 00:03, 17 March 2023 (UTC)
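
The idea of picking a cutoff from a sample of articles could be sketched roughly as follows. This is only an illustration: the function names are hypothetical, and the view counts are made-up numbers chosen to have a long tail, not real page-view data. It compares the median cutoff with the 20th-percentile cutoff floated earlier in this thread.

```python
import statistics


def obscurity_cutoffs(views):
    """Return candidate 'obscure' cutoffs from a sample of yearly page views."""
    quintiles = statistics.quantiles(views, n=5)  # four cut points: 20th..80th percentile
    return {"p20": quintiles[0], "median": statistics.median(views)}


def share_below(views, cutoff):
    """Fraction of sampled articles with fewer views than the cutoff."""
    return sum(v < cutoff for v in views) / len(views)


# Made-up sample for illustration only (NOT real data).
sample_views = [3, 5, 8, 12, 20, 35, 60, 110, 400, 5000]

cutoffs = obscurity_cutoffs(sample_views)
for name, value in cutoffs.items():
    print(name, value, share_below(sample_views, value))
```

With a real sample of a few thousand articles, the same two functions would show how many articles each candidate cutoff labels "obscure", so the choice can be described for what it is.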
The median IQ is, by definition, 100; 104 is above average. IQ also has a normal distribution; page views do not.
I’ve seen many, and “reasonable approximation” doesn’t mean “perfect determinator”; it will include some results that aren’t underdeveloped, and will exclude some results that are. BilledMammal (talk) 00:13, 17 March 2023 (UTC)
It's not really important to this discussion, but I understand that the median IQ is now around 105 in the US. They don't re-normalize it every year, because they want to do long-term comparisons. Apparently when you do things like ban leaded gasoline and reduce child hunger, then the median slowly creeps up over time.
Back to the real subject: Can you provide examples of several articles that have 4,900 bytes but you think are significantly under-developed? Or for 4,000 bytes? If a majority of the 4,000–4,999-byte-long pages are not significantly under-developed, then they probably shouldn't be included (or this page should be moved to User:BilledMammal/My personal views on unacceptably short articles). WhatamIdoing (talk) 06:41, 17 March 2023 (UTC)
"Significantly underdeveloped" isn't the term I used; I used "underdeveloped". However, even using your term, examples exist: Mikhail Dovgalyuk, Martina De Memme, Cristian Salcedo, David Morgan (swimmer), Ildikó Farkasinszky-Bóbis, Yadinis Amarís, Elizabeth Bravo, and many others.
To answer some of your other questions: page views were the number of views received in the year prior to the analysis, and the length of an article was measured at the time of analysis, so between four and five years after creation. BilledMammal (talk) 11:32, 17 March 2023 (UTC)
I asked for examples that are significantly underdeveloped, because it's both easy and pointless to say that any article below FA isn't fully developed, and therefore it's automatically at least slightly underdeveloped.
The articles you list provide a huge amount of information in the format of an infobox, but little readable prose. I wonder if that suggests that byte size is simply the wrong way to measure article development.
Also, I still want to know what percentage of articles are below 5,000 bytes. I clicked on https://en.wiki.x.io/wiki/Special:Random?action=history ten times, and found six articles that were below your cutoff (plus one that was just 4% over it). Is that what you expect? WhatamIdoing (talk) 19:30, 17 March 2023 (UTC)
Some have infoboxes that provide a decent amount of statistical information, but we are writing articles, not databases; copying a database doesn't prevent an article from being "significantly underdeveloped".
That sounds accurate; 4.3% of articles (excluding dab pages etc) are below 1000 bytes, 15.3% are between 1000 and 2000, 14.4% are between 2000 and 3000, 11.6% are between 3000 and 4000, and 9.0% are between 4000 and 5000. BilledMammal (talk) 00:37, 18 March 2023 (UTC)
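
As a quick check on what these size bands imply, the sketch below adds up the quoted percentages and estimates where the median article size falls. The interpolation assumes that sizes are spread evenly within each 1000-byte band, which is an assumption of this sketch and not something stated in the figures; the function names are hypothetical.

```python
# (lower bound, upper bound, percent of articles) as quoted in the comment above.
SIZE_BANDS = [
    (0, 1000, 4.3),
    (1000, 2000, 15.3),
    (2000, 3000, 14.4),
    (3000, 4000, 11.6),
    (4000, 5000, 9.0),
]


def cumulative_shares(bands):
    """Return (upper bound, cumulative percent of articles below it) per band."""
    total = 0.0
    result = []
    for _low, high, share in bands:
        total += share
        result.append((high, round(total, 1)))
    return result


def estimate_median_size(bands):
    """Estimate the median size by linear interpolation inside the band
    where the cumulative share crosses 50% (assumes an even spread in a band)."""
    below = 0.0
    for low, high, share in bands:
        if below + share >= 50.0:
            return low + (50.0 - below) / share * (high - low)
        below += share
    return None  # the median lies above the last listed band


for upper, cumulative in cumulative_shares(SIZE_BANDS):
    print(f"below {upper} bytes: {cumulative}%")
print(f"estimated median size: about {estimate_median_size(SIZE_BANDS):.0f} bytes")
```

The bands sum to a little over half of all articles, so on these figures the median sits inside the 4000 to 5000 byte band.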
Thanks! Would you please put that into Wikipedia:Size of Wikipedia, for anyone else who's interested?
This essay says "26% were turned into developed articles...", with a footnote that says that means over 5,000 bytes. I wonder if it would be clearer to say something like "The median file size for articles is a bit less than 5,000 bytes. 26% reached this length..." WhatamIdoing (talk) 23:26, 21 March 2023 (UTC)
I've put the full data on the talk page for now; I'll consider how best to include it in the article.
That's a good idea, but I first want to work out why only 33% of articles from 2018 reached the median length; after five years, I would expect that figure to be much closer to the long-term average. I'm wondering if mass creation was unusually high in 2018? BilledMammal (talk) 23:44, 21 March 2023 (UTC)
Well, the way to figure that out would be to run data from a couple of years before/after that. Wikipedia:Size of Wikipedia#Annual growth rate for the English Wikipedia reports a net page creation rate that is lower for 2018 than for either 2017 or 2019, which might suggest that mass creation is probably not a factor. It might be possible to check whether 2013 articles had a similar level of development by 2018. Perhaps average development simply requires time. WhatamIdoing (talk) 00:27, 22 March 2023 (UTC)

Question

"Of the articles created in 2018, 26% were turned into developed articles[c] by the creator." So, this means all articles, right? Including those where the initial revision is over 5,000 bytes? Or is it some subset of articles that were eventually expanded? Can that be clarified? Folly Mox (talk) 13:52, 18 July 2023 (UTC)

If I have understood you correctly: Of all the articles created in 2018, including those whose initial revision was over 5000 bytes, 26% were turned into developed articles. This 26% includes those whose initial revision pushed them into the category of developed articles. BilledMammal (talk) 20:50, 23 July 2023 (UTC)