User talk:BilledMammal/Average articles


Methodology queries

How does your algorithm count Dogtown, St. Louis, which has punctuation errors, or Balkan Rhapsodies: 78 Measures of War, which has multi-sentence quotations? I think I'd count the latter as 6 sentences, and someone else might count it as 11, but it's recorded as 13. I count 12 in Campbell v. Clinton and IEEE 802.11 (legacy mode), but your script says 13 for them. 1640 in paleontology is reported as 1 sentence, but it contains 5. WhatamIdoing (talk) 19:05, 14 August 2024 (UTC)

  1. Campbell v. Clinton, 203 F.3d 19 (D.C. Cir.
  2. 2000), was a case holding that members of Congress could not sue President Bill Clinton for alleged violations of the War Powers Resolution in his handling of the war in Yugoslavia.
  3. The War Powers Resolution requires the President to submit a report within 48 hours "in any case in which United States Armed Forces are introduced … into hostilities or into situations where imminent involvement in hostilities is clearly indicated by the circumstances," and to "terminate any use of United States Armed Forces with respect to which a report was submitted (or required to be submitted), unless the Congress … has declared war or has enacted a specific authorization for such use of United States Armed Forces" within 60 days.
  4. On March 26, 1999, two days after President Clinton announced the commencement of NATO air and cruise missile attacks on Yugoslav targets, he submitted to Congress a report, "consistent with the War Powers Resolution," detailing the circumstances necessitating the use of armed forces, the deployment's scope and expected duration, and asserting that he had "taken these actions pursuant to [his] authority … as Commander in Chief and Chief Executive."
  5. On April 28, Congress voted on four resolutions related to the Yugoslav conflict: It voted down a declaration of war 427 to 2 and an "authorization" of the air strikes 213 to 213, but it also voted against requiring the President to immediately end U.S. participation in the NATO operation and voted to fund that involvement.
  6. The conflict between NATO and Yugoslavia continued for 79 days, ending on June 11 with Yugoslavia's agreement to withdraw its forces from Kosovo and allow deployment of a NATO-led peacekeeping force.
  7. Throughout this period Pentagon, State Department, and NATO spokesmen informed the public on a frequent basis of developments in the fighting.
  8. The lawsuit was filed before the end of hostilities by 31 members of Congress opposed to U.S. involvement in the Kosovo intervention, led by Tom Campbell of California.
  9. The Congressmen sought a declaratory judgment that the President's use of American forces against Yugoslavia was unlawful under both the War Powers Clause of the Constitution and the War Powers Resolution ("the WPR").
  10. Appellants claim that the President did submit a report sufficient to trigger the WPR on March 26, or in any event was required to submit a report by that date, but nonetheless failed to end U.S. involvement in the hostilities after 60 days.
  11. The district court granted the President's motion to dismiss for lack of standing.
  12. The appellate court affirmed.
  13. It held appellants had ample legislative authority it could exercise to stop appellee's war making, and thus, appellants lacked the power to challenge such executive action in court.
  1. IEEE 802.11 (legacy mode) or more correctly IEEE 802.11-1997 or IEEE 802.11-1999 refer to the original version of the IEEE 802.11 wireless networking standard released in 1997 and clarified in 1999.
  2. Most of the protocols described by this early version are rarely used today.
  3. It specified two raw data rates of 1 and 2 megabits per second (Mbit/s) to be transmitted via infrared (IR) signals or by either frequency hopping or direct-sequence spread spectrum (DSSS) in the Industrial Scientific Medical frequency band at 2.4 GHz.
  4. IR remained a part of the standard until IEEE 802.11-2016, but was never implemented.
  5. The original standard also defines carrier sense multiple access with collision avoidance (CSMA/CA) as the medium access method.
  6. A significant percentage of the available raw channel capacity is sacrificed (via the CSMA/CA mechanisms) in order to improve the reliability of data transmissions under diverse and adverse environmental conditions.
  7. IEEE 802.11-1999 also introduced the binary time unit TU defined as 1024 μs.
  8. At least seven different, somewhat-interoperable, commercial products appeared using the original specification, from companies like Alvarion (PRO.11 and BreezeAccess-II), BreezeCom, Digital / Cabletron (RoamAbout), Lucent, Netwave Technologies (AirSurfer Plus and AirSurfer Pro), Symbol Technologies (Spectrum24), and Proxim Wireless (OpenAir and Rangela2).
  9. A weakness of this original specification was that it offered so many choices that interoperability was sometimes challenging to realize.
  10. It is really more of a "beta specification" than a rigid specification, initially allowing individual product vendors the flexibility to differentiate their products but with little to no inter-vendor interoperability.
  11. The DSSS version of legacy 802.11 was rapidly supplemented (and popularized) by the 802.11b amendment in 1999, which increased the bit rate to 11 Mbit/s.
  12. Widespread adoption of 802.11 networks only occurred after the release of 802.11b which resulted in multiple interoperable products becoming available from multiple vendors.
  13. Consequently, comparatively few networks were implemented on the 802.11-1997 standard.
  1. Dogtown is a traditionally Irish section of St. Louis, Missouri.
  2. It is located south of Forest Park, with its southeastern edge abutting the traditionally Italian section of town, The Hill neighborhood.
  3. The neighborhood is anchored by St. James the Greater Catholic Church.
  4. The boundaries of Dogtown are Oakland Avenue on the north, Macklind Avenue on the east, and McCausland Avenue on the west.
  5. Its southern boundary is generally Manchester Avenue, but between Hampton and Dale Avenues, the southern boundary extends to Interstate 44.
  6. Dogtown is not one of the 79 Neighborhoods of St. Louis recognized by the city government.
  7. Rather, it is an area that includes four neighborhoods, and part of a fifth: Clayton-Tamm Franz Park Hi-Pointe Cheltenham eastern portion of Ellendale Dogtown got its name as a small mining community in the mid-1800s.
  8. There was a concentration of small clay and coal mines in the area during that time, and the term "Dogtown" was widely used in the 1800s by miners to describe a group of small shelters around mines.
  9. Although some erroneously think Dogtown was named during the 1904 World's Fair, it actually got its name long before then.
  10. An article published on August 14, 1889 in the Missouri Republican is the earliest known reference to Dogtown in St. Louis.
  11. The 1889 newspaper article describes a lost 5-year-old boy who lived in "the classic precincts of Dogtown, near Cheltenham."
  12. The term 'dog' appears in official mining terminology (dogholes, doghouse, dogtowns, dogmines, etc.
  13. ), and it's quite easy to find places all over the U.S. that were called "Dogtown," whose whole existence was due to mining Dogtown Historical Society Map of Dogtown St. Patrick's Day Parade
  1. Balkan Rhapsodies: 78 Measures of War is a 2007 documentary by Jeff Daniel Silva about the Kosovo War.
  2. Balkan Rhapsodies is a "ruminative experimental mosaic", exploring the Kosovo War through encounters with recovering civilians, observational moments, and Silva's own reflections on his various trips through war-torn Serbia and Kosovo.
  3. Jeff Daniel Silva was the first United States Citizen granted access into the territory that was formerly known as Yugoslavia.
  4. His access was granted just weeks after the North Atlantic Treaty Organization (NATO) launched a military operation that dropped bombs over the Federal Republic of Yugoslavia.
  5. Lisa Stevenson, Professor of Anthropology, McGill University says, "Balkan Rhapsodies: 78 Measures of War deftly juxtaposes the everyday and the exceptional, the sublime and the horrific, into a film that challenges us to reconsider what we thought we knew about war, about humanitarianism and about our own good intentions.
  6. It challenges us as students and scholars of the humanities and social sciences to interrogate our own process of meaning making - how we come to know what we do about war, and especially how we come to know and consume the pain and suffering of others.
  7. Balkan Rhapsodies is a beautiful and disturbing film, a film whose afterimage should provoke us to think about war and humanitarianism in a more sophisticated, and ultimately more compassionate way."
  8. Branka Bogdanov, Director of Film and Video, Institute of Contemporary Art, Boston says, "It's become a cliché that every documentary film/video tells a story, but what and whose story does it tell?
  9. Experimental filmmaker Jeff Daniel Silva, with his documentary Balkan Rhapsodies, exercises an exceptional talent to capture a kaleidoscopy of life in the post-war Serbia, after the NATO bombing in 1999.
  10. With this film Silva strives to crackle beneath the façade of ordinary lives, opening up a whole landscape of disillusionment and hope, humor and pain, and above all else succeeds in vividly capturing the bruised acceptance of a people in the aftermath of war.
  11. With its unique multilayered structure Balkan Rhapsodies remains open to interpretation and most importantly, each of the seventy-eight Rhapsodies is its own reality.
  12. Kudos for a filmmaker who exercises a profound commitment to the proposition that film is a vital manifestation and a document of the political volatile time we live in."
  13. Basil Wright Prize RAI Film Festival, 2009 "Visual Representation of Crisis through Ethnographic Film," EASA Biennial Conference, Ireland, 2010 AAA/Society for Visual Anthropology Film, Video & Multimedia Festival, 2009 11th RAI International Festival of Ethnographic Film Leeds, 2009 Harvard Film Archive, Cambridge, MA, 2008 Göttingen International Ethnographic Film Festival, Göttingen, Germany, 2008 Best Documentary, Festival Cinemateca Uruguaya, Montevideo, Uruguay, 2008 Museum of Modern Art (MoMA), NY - Doc Fortnights, 2008 OVNI Arxius de l'Observatori, Barcelona, Spain, 2008 DocHouse Brussels, Belgium, 2007 ForumDocBH, Belo Horizonte, Brazil, 2007 Valdivia International Film Festival, Chile, 2007 DokumentART Festival, Neubrandenburg, Germany, 2007 Official Website DER Balkan Rhapsodies Harvard Film Archive Article
  1. Robert Plot, the first man to illustrate a dinosaur fossil, is born.

To address your specific questions:

  • It doesn't always correctly parse abnormal punctuation, but that is a very difficult problem and the script is still reasonably effective at doing so.
  • It does not exclude quotes, and I don't think it should.
  • It uses the source code of the page, so while 1640 in paleontology appears to have five sentences, its source has only one; most of the sentences are brought in by templates.

Overall, I think it's sufficient to establish that the median number of sentences is much higher than 4, even though one day I might want to come back and tweak it. BilledMammal (talk) 19:30, 14 August 2024 (UTC)
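The splitting behaviour discussed above can be illustrated with a minimal sketch. This is not the script used for the sample; the function name `split_sentences` and the abbreviation list are hypothetical, chosen only to reproduce the kind of miscount seen in the Campbell v. Clinton listing, where "Cir." is treated as a sentence end.

```python
# Hypothetical abbreviation list (an assumption for illustration only);
# the rules used by the real script are not shown in this discussion.
ABBREVIATIONS = {"v.", "d.c.", "u.s.", "st."}


def split_sentences(text):
    """Split text into sentences at tokens ending in '.', '!' or '?',
    unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        # Drop leading brackets/quotes so "(D.C." is compared as "d.c."
        bare = token.lstrip("(\"'[").lower()
        if token.endswith((".", "!", "?")) and bare not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


text = ("Campbell v. Clinton, 203 F.3d 19 (D.C. Cir. 2000), was a case. "
        "The appellate court affirmed.")
parts = split_sentences(text)
for number, sentence in enumerate(parts, start=1):
    print(number, sentence)
```

Because "Cir." is missing from the assumed abbreviation list, the two-sentence input is reported as three sentences. Any fixed list will have gaps of this kind, which is one reason abnormal punctuation is hard to parse reliably.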

I think this is great. I'd love to be able to provide some basic information, particularly to NPP and AFC reviewers, about what's typical.
 
Relative frequency of article lengths, as measured in sentences (statistical outliers excluded)
For example, in this sample, the mean is 47 sentences, the median is 13, the mode is 2, and the standard deviation is ±27, with a range of 0 to 1,056 sentences. I think the inner ranges tell more about the data set, though. The inner 80% range (1,000 to 9,000) is 2 to 60 sentences. The inner 90% (500 to 9,500) is 2 to 95 sentences. The inner 99% (100 to 9,900) is 1 to 234 sentences. The quartiles fall at 5, 13, and 29 sentences. Anything above 67 sentences is a statistical outlier.
The same numbers for the word count is a mean of 746 words, a median of 338 words, and a standard deviation of ±1400, with a range of 0 to 33,873 words. The inner 80% range is 47 to 1,667 words. The inner 90% is 32 to 2,692 words. The inner 99% is 21 to 6,813 words. Anything above 1,770 words is a statistical outlier.
In other words, if you get more than a couple thousand words, it's an outlier, but even the shortest articles are normal.
In addition to the number of words/sentences, I'm particularly interested in knowing the number of links and the number of sources in typical articles.
Based on prior manual investigations (example), the leads of FAs tend to have 1.5 internal wikilinks per sentence and one per ~16 words. I'd expect the overall articles, especially longer articles, to have a lower link density.
I don't know what to expect for a refs:sentence ratio, but I would not be surprised if the ratio was around one ref for every two or three sentences. It would probably be appropriate to exclude abnormally long articles from such calculations. WhatamIdoing (talk) 23:19, 14 August 2024 (UTC)
About the script: I'm surprised that it did so well. I see that it's picking up the ==External links== section, which might tend to inflate the count, if descriptions are given. Overall, though, I suspect that the undercounts and overcounts approximately balance each other for sentence counts.
For word counts, though, it may be systematically undercounting. Consider 1964 Malaysian state elections, which is almost entirely tables. There are obviously more than 21 words on the page, but actually counting them would be difficult. WhatamIdoing (talk) 23:34, 14 August 2024 (UTC)
I've updated with the number of wikilinks and references. Currently, the reference check is a little simplistic; it looks for reference tags or any of the 4200 citation templates. This means that general references used without reference tags, like at 143rd New York Infantry Regiment, are missed. I'll look into how to address that.
For word counts, that's possible, although in the case of 1964 Malaysian state elections, the low count is because most of the words are inside templates, which are excluded from the count. BilledMammal (talk) 21:47, 15 August 2024 (UTC)
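A minimal sketch of how excluding templates lowers a word count, assuming templates are the usual `{{...}}` wikitext constructs. The helper names and the sample wikitext are hypothetical and are not taken from the script or from the article mentioned above.

```python
import re

# Matches an innermost template: "{{", then no braces, then "}}".
INNERMOST_TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")


def strip_templates(wikitext):
    """Remove {{...}} templates, innermost first, until none remain."""
    previous = None
    while previous != wikitext:
        previous = wikitext
        wikitext = INNERMOST_TEMPLATE.sub("", wikitext)
    return wikitext


def count_words(wikitext):
    """Count whitespace-separated words outside templates."""
    return len(strip_templates(wikitext).split())


# Hypothetical wikitext: a results template followed by one sentence.
sample = ("{{Election results|party1=Alliance|votes1={{formatnum:100}}}} "
          "The elections were held in 1964.")
print(count_words(sample))  # counts only the six words outside the template
```

On a page that is mostly templates, everything a reader sees inside them is invisible to a count like this, so the reported figure can be far below the visible text.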
Thanks. Here's a quick summary of the numbers:
  • Number of refs per article: Range of 0 to 452.  Mean 8.63, median 4, mode 1, standard deviation 17.7.  Quartiles are 2, 4, 9.  Anything above 20 is an outlier (according to https://www.calculatorsoup.com/calculators/statistics/descriptivestatistics.php). The most common numbers were zero detected refs (9.35%), 1 ref (14.45%), 2 refs (14.17%), 3 (10.65%), 4 [the median] (7.58%), 5 (6.28%), and from there everything else (~44%) has less than 5% each.
  • Number of internal links per article: Range of 0 to 4,624 (it's possible that September 2011 in sports should be considered a list article, in which case the range becomes 0 to 1,458), with a mean of 45, a median of 23, and a mode of 8.  The standard deviation is 90.  The quartiles fall at 12, 23, and 46.  Anything at or above 98 is an outlier.  A frequency table is interesting; whereas all the others cluster towards small numbers (e.g., two sentences, one ref...), 95% of articles have 5+ links.
Combining this with the above numbers, the "perfect median" article has
  • 338 words
  • 13 sentences
  • 4 refs
  • 23 links
and the inner quartiles (25th percentile to 75th percentile) – the most middling 50% of Wikipedia's articles, and therefore obviously "typical" content – have these ranges:
  • 123–782 words
  • 5–29 sentences
  • 2–9 refs
  • 12–46 links
WhatamIdoing (talk) 04:40, 18 August 2024 (UTC)
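The outlier thresholds quoted in this thread can be roughly checked from the quartiles. The sketch below assumes the common 1.5 × IQR upper fence; the thread does not state which rule the cited calculator applies, so this is an assumption. The function name `upper_fence` is hypothetical.

```python
def upper_fence(q1, q3, k=1.5):
    """Upper outlier fence: Q3 plus k times the interquartile range."""
    return q3 + k * (q3 - q1)


# (Q1, Q3) pairs as quoted in the discussion above.
quartiles = {
    "words": (123, 782),
    "sentences": (5, 29),
    "refs": (2, 9),
    "links": (12, 46),
}

fences = {name: upper_fence(q1, q3) for name, (q1, q3) in quartiles.items()}
for name, fence in fences.items():
    print(f"{name}: values above {fence} would be flagged")
```

Under that assumption the fences for words (1770.5), refs (19.5), and links (97.0) match the quoted thresholds of 1,770, 20, and 98. The sentence fence comes out at 65.0 rather than the quoted 67, which may reflect unrounded quartiles or a different quartile method.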
Here's another fun fact: Most articles in this sample set are statistical outliers for at least one of those four metrics.
Looking only at the 80% of articles that are "normal" on at least one of the four metrics, there is an average of 4.3 sentences per detectable ref, 18.4 words per wikilink, or one wikilink for every 1.3 sentences. The shorter the article, the greater the density on all elements.
Looking at the 23% of articles that are "normal" on all four of the metrics, there is an average of 3.6 sentences per ref, 15 words per link, and one wikilink for every 1.6 sentences.
The shorter half of the "all normal" articles (using the number of sentences to split them) have 2.6 sentences per ref and 12.8 words per wikilink (or 2 wikilinks per sentence). The longer half of these articles have 4.7 sentences per ref and 18.6 words per wikilink (or 1 wikilink per 1.25 sentences). WhatamIdoing (talk) 22:02, 26 August 2024 (UTC)
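Density figures like these depend on how the average is taken, and the thread does not say which method was used. The sketch below contrasts two plausible readings, a mean of per-article ratios and a ratio of totals. The rows are hypothetical values loosely modelled on the quartile figures, not real articles, and the function names are illustrative.

```python
from statistics import mean

# Hypothetical (sentences, words, refs, links) rows; not real sample data.
articles = [
    (5, 120, 2, 12),
    (13, 338, 4, 23),
    (29, 782, 9, 46),
]
SENTENCES, WORDS, REFS, LINKS = range(4)


def mean_of_ratios(rows, num, den):
    """Average the per-article ratio, skipping rows with a zero denominator."""
    return mean(row[num] / row[den] for row in rows if row[den] > 0)


def ratio_of_totals(rows, num, den):
    """Divide the summed numerator by the summed denominator."""
    return sum(row[num] for row in rows) / sum(row[den] for row in rows)


per_article = mean_of_ratios(articles, SENTENCES, REFS)
pooled = ratio_of_totals(articles, SENTENCES, REFS)
print(f"sentences per ref: mean of ratios {per_article:.2f}, ratio of totals {pooled:.2f}")
```

The ratio of totals gives long articles more weight, so when density falls with length, as reported above, it tends to show more sentences per ref than the mean of per-article ratios.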