Wikipedia:Request a query/Archive 4

Blocks and unblocks

Trying to produce a list of blocks for users over 5,000 edits with a column indicating [effectively] whether or not the block was lifted, expired, or still active and a column for time between block and unblock [if applicable]. I can pull blocks and unblocks, but navigating durations and timing is more challenging. This is part of a larger project for which poking at the block log is just one component. — Rhododendrites talk \\ 21:44, 21 August 2023 (UTC)

I should have time to work on this either tomorrow or (less likely) later today, but if you want to keep poking at it in the meantime, I've taken a glance at what you've tried so far. I wouldn't attempt to parse block_params; instead, look for an unblock in logging with a timestamp later than the latest block. You can use ipblocks_ipindex to see if a user (not just an ip, despite the table name) is currently blocked, which will let you tell if a block without a corresponding unblock has expired. —Cryptic 22:02, 21 August 2023 (UTC)
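In outline, that ipblocks_ipindex check is something like this (a sketch using the replica's table and column names, not necessarily the query Cryptic ended up with):

SELECT user_name, ipb_timestamp, ipb_expiry
FROM ipblocks_ipindex
JOIN user ON user_id = ipb_user   -- rows with a user_id are blocks of registered users, not autoblocks or IPs
WHERE user_editcount >= 5000;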
I'll also have to come back to it tomorrow. My understanding of the various tables (and of SQL) is fairly limited, so I'd appreciate any help you have time for! — Rhododendrites talk \\ 01:57, 22 August 2023 (UTC)
Is it important at all that it be done in a single query? I'm thinking it would be easier to post-process this. —Cryptic 20:10, 22 August 2023 (UTC)
@Cryptic: Not necessarily, no. Certainly "column for time between block and unblock" and a conditional [block date + duration < current date] is something I can add afterwards. Where it'd be difficult for me (not knowing Python, etc.) is being able to take the full list of blocks and unblocks and automatically pair them up, where applicable, if that makes sense. Like this unblock goes with that block. — Rhododendrites talk \\ 20:50, 22 August 2023 (UTC)
Well, that much is easy to do (slowly) - just sort first by the blocked username, then by timestamp; and don't limit log_action to 'block'. Might run into trouble with overlapping blocks and partial blocks or the like, and modified blocks might be an issue, but I expect the overwhelming majority of blocks will have exactly either 1 or 0 unblocks. quarry:query/76016 is the simplest thing that could possibly work. I'll check in on it later to see whether it completes or times out. —Cryptic 21:11, 22 August 2023 (UTC)
Ha - it did while I was typing that out. Running it again so it sorts usernames-with-underscores and usernames-without consistently; it'll be back up in under five minutes. —Cryptic 21:13, 22 August 2023 (UTC)
What do you want your final output to look like, say, for the first four users at that query ($1LENCE D00600D, $uperFan32, (CA)Giacobbe, and *drew) - they should be fairly representative? Do you still want columns for the blocking admin, block reason, and so on? —Cryptic 21:37, 22 August 2023 (UTC)
This is great, thanks! I'll fork it and poke over the next couple days. Would you indulge me by saying in human terms what this is doing:
UNION
SELECT '99999999999999', NULL, 'currently blocked', NULL, user_name, user_editcount, NULL, NULL
FROM ipblocks_ipindex
JOIN user ON user_id = ipb_user
Rhododendrites talk \\ 18:36, 23 August 2023 (UTC)
UNION combines the output of two selects. The later ORDER BY has lower precedence, so it'll sort the combined output.
The second SELECT finds all rows in the ipblocks_ipindex table (currently-active blocks - Special:BlockList is the onwiki interface). JOINing user limits the selected rows to ones with a corresponding row in the user table, ie it eliminates autoblocks and blocks of ips. It also lets us limit the output to users with edit counts of 5000 or more in the WHERE clause.
UNION requires all of the select clauses it's combining to output the same number of columns, so I used NULL where there either wasn't data to match the first query (such as log_id) or where, even when there's an equivalent in ipblocks_ipindex, it would just repeat data from the first query (such as the blocker, actor_name in the first and ipb_by_actor in the second). user_name, on the other hand, is still included since we sort on that column later. user_editcount could've been nulled as redundant, too, I suppose, but we already had it from joining user, while we'd have to join the actor table to get a name for the blocking admin.
'99999999999999' is always output in the timestamp column so that it always sorts after a real timestamp (whether sorting numerically or as a string), and 'currently blocked' was picked for clarity in the log_action column. Selecting a constant value like that instead of a table column name just outputs the constant in that column for every row.
I'm still willing to do the postprocessing if you let me know precisely which data you need. Getting the data is the easy part; dealing with all the different formats that the logging table has used over the years to indicate duration - if you indeed want the initial duration of modified or expired or still-active blocks - is what's actually hard here. —Cryptic 19:40, 23 August 2023 (UTC)
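Putting the pieces together, the overall shape of quarry:query/76016 is roughly this (a sketch with a trimmed column list, not the query verbatim):

SELECT log_timestamp, log_action, comment_text, user_name, user_editcount
FROM logging_logindex
JOIN user ON log_namespace = 2 AND log_title = REPLACE(user_name, ' ', '_')
JOIN comment ON comment_id = log_comment_id
WHERE log_type = 'block' AND user_editcount >= 5000
UNION
-- fake rows for currently-active blocks; constants and NULLs pad out the column list
SELECT '99999999999999', 'currently blocked', NULL, user_name, user_editcount
FROM ipblocks_ipindex
JOIN user ON user_id = ipb_user
WHERE user_editcount >= 5000
ORDER BY user_name, log_timestamp;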
This is really helpful, Cryptic, thanks. Ok, so the thing I'm really trying to do at the moment is find blocks for reasons related to attacks, harassment, civility, etc., then to compare unblock %s for various edit count ranges (e.g. <10, 10-99, 100-999, 1000-9999, and 10000+). I know there's a big limitation in that we don't store the edit count at the time of the block, granted, but that comparison is the most important part. The "nice to have"s would also consider the mean/median starting block duration and mean/median actual block duration (if unblocked), but I appreciate that's harder.
I think I may be able to work with the query you've put together here to get what I'll need. I'll have to think about the kinds of conditional statements to make it work. I started to try to adapt it here, though the results I'm seeing aren't what I expect ("currently blocked" lines without preceding block entries, and missing comment text when I've searched for comment text). Adding: Oops. Accidentally clicked submit again, after it took a while the first time (in case you click the query above shortly after I write this). — Rhododendrites talk \\ 02:28, 26 August 2023 (UTC)
The reason you're seeing "currently blocked" lines without corresponding blocked lines is that, in the first select, you filter out block log entries that don't mention certain terms; the second select shows all currently-blocked users in the given edit count range. (Be aware you're also filtering out the unblocks that don't match those terms.)
There's a field in ipblocks (ipb_reason_id) that works like log_comment_id, so you could, conceivably, filter both halves of the query. Ultimately, though, I don't think that would work very well, since - besides missing the unblocks - for users that were reblocked, it'll be the comment of the last active block, so you'll get cases both where you see the currently-active block and the reblock, but not the original block, and vice-versa. And you're still not going to get the unblock lines - the block log entries for unblocks hardly ever quote the original block. This would only be a little better than what you have now.
What you're really interested in filtering by isn't the comment field of the unblock or reblock entry in the block log; it's the comment field of the original block. There's probably a more elegant way to do that than quarry:query/76110, but this way works. What it does is add a column, original_reason, that shows comment_text when log_action = 'block' (that is, an "Admin blocked Vandal for 3 weeks" log entry, not a reblock or unblock or one of the 'currently blocked' fake entries added from the ipblocks table); otherwise, whatever was in the previous row's original_reason column. (It also shows comment_text instead of the previous row's value if the blocked user's name isn't the same as in the previous row, as a failsafe.)
I did that query for 5000+-edit users instead of 1-100-edit users as in your query, since I wasn't interested in waiting half an hour for it to complete. I also filtered by regex instead of that list of LIKE clauses, which A) is easier, if you happen to already be familiar with regular expressions, and B) lets us make the comparison case-insensitive, so you'd see a block log entry of "Harassing [[User:Innocent]]", which LIKE '%harass%' would omit. —Cryptic 04:29, 26 August 2023 (UTC)
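The regex filter would be along these lines (a sketch, not the linked query verbatim; (?i) makes the PCRE match case-insensitive, and the exact word list is up to you):

SELECT log_timestamp, log_action, comment_text
FROM logging_logindex
JOIN comment ON comment_id = log_comment_id
WHERE log_type = 'block'
  AND comment_text REGEXP '(?i)harass|attack|civil'
LIMIT 100;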
@Cryptic: Queried, combined, and poking away now, and noticed something. Look at the query you linked above and scroll to A3RO. Note an indef, never unblocked, not expired, and no "currently blocked" line. Do you know what would be causing that? — Rhododendrites talk \\ 02:14, 29 August 2023 (UTC)
It's because, when I take the value for original_reason from the previous row instead of the current one, I'm taking that row's comment_text column instead of its original_reason column. And it's nontrivial to take the original_reason column instead. Crapola. I'll try to think of a different way to do it. Selecting all the rows in the query and then filtering by the block reason after post-processing would work for user_editcount >= 5000, but it won't for user_editcount BETWEEN 1 AND 100 - there's just too many results that would get thrown away in the middle to be practical. Probably won't be tonight. —Cryptic 03:00, 29 August 2023 (UTC)
quarry:query/76110 updated. It no longer has the safety check against the previous line's log_title like before, but it seems to work. Running up hard against the limits of my experience here. —Cryptic 04:05, 29 August 2023 (UTC)

File usage for a Commons category

I found this Quarry query which finds per-wiki file usage for a given image: https://quarry.wmcloud.org/query/62636

I was wondering if it is possible to do a query like this, but with cumulative results for a whole Commons category of images (recursively for subcategories, if possible) rather than a single image. I was going to try to use the categorylinks table for this, but did not get very far yet. I am mainly interested in Category:Images from the National Archives and Records Administration for now. Thanks! Dominic·t 20:45, 13 September 2023 (UTC)

There's currently 467285 files directly in that category, and 1549115 in the entire tree. Even if they average global usage on a single wiki each (compared to the 95 that the Messier 87 image has), Quarry can't handle that many results. I can write the query for you, but you're going to have to arrange some other way to run it and get the results, especially if I'm to take that "for now" at face value. —Cryptic 21:40, 13 September 2023 (UTC)
@Cryptic: Thanks, I do appreciate that concern! If you have a good idea about how to do this for a hypothetical category, I can plug in smaller ones or play around with the query on my own, as well. (The "for now" was just meaning, if it's feasible, I could query other categories as well.) Dominic·t 22:28, 13 September 2023 (UTC)
quarry:query/76609 has results for the tree anchored at c:Category:Wormholes.
It's also going to be slow for files with a lot of usage. One of my tests was with c:Category:High-resolution or SVG official Wikimedia logos, since the smallish subcats I picked out in the National Archives tree kept coming up with zero global usage; it turns out there's a separate row in globalimagelinks for every usage of every file on every wiki. Which, really, there has to be, but it hadn't sunk in that files like c:File:Commons-logo.svg have a couple hundred thousand uses in the mainspaces of each of dozens of wikis. It ran for half an hour before I gave up. You're likely to get similar results when running it for categories with large numbers of files, even if they're not in widespread use, or any use at all. —Cryptic 22:43, 13 September 2023 (UTC)
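The general shape of a category-tree query like quarry:query/76609 is roughly this (a sketch against commonswiki_p; MariaDB's recursive CTEs handle the subcategory walk, and the UNION's deduplication keeps category cycles from looping forever):

USE commonswiki_p;
WITH RECURSIVE tree (cat) AS (
    SELECT 'Wormholes'
  UNION
    SELECT page_title                   -- subcategories of anything already in the tree
    FROM tree
    JOIN categorylinks ON cl_to = cat
    JOIN page ON page_id = cl_from AND page_namespace = 14
)
SELECT gil_wiki, COUNT(*) AS global_uses
FROM tree
JOIN categorylinks ON cl_to = cat
JOIN page ON page_id = cl_from AND page_namespace = 6   -- files in those categories
JOIN globalimagelinks ON gil_to = page_title
GROUP BY gil_wiki
ORDER BY global_uses DESC;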

Improperly formatted section links

I'm trying to use a regex query to help me do an AWB run fixing instances of improperly formatted section links in see also sections like this. I came up with the query insource:/\*\s?\[\[[\w\s]*#[\w\s]*\]\]/, which seems in testing to properly find the strings, but it's timing out when I try to use it. Any idea how I can simplify the query enough to get it to run? {{u|Sdkb}}talk 01:03, 21 September 2023 (UTC)

That's not a query, it's a search string. You're asking in the wrong place, though I'm not aware what the right place would be. To start with, you're using entirely the wrong variant of regexes; neither \w nor \s are character classes there, and # is a metacharacter that needs to be escaped. Help:Searching/Regex might be useful. —Cryptic 01:17, 21 September 2023 (UTC)
Any search for just insource:/whatever/ is likely to time out, because it has to search every page in the selected namespace(s), usually several million. Search performs optimally (i.e. completes at all) with at least one "normal" search term, optionally limited further by insource:, e.g. Apple insource:/banana/. Certes (talk) 11:18, 21 September 2023 (UTC)
@Cryptic @Certes, thanks both! I didn't realize that the search flavor of RegEx was so different from the AWB flavor, which is what I was using. I got the search query to work using insource:/\* ?\[\[[A-Za-z0-9 ]+\#[A-Za-z0-9 ]+\]\]/ and am replacing \*( ?)\[\[([A-Za-z0-9 ]+)\#([A-Za-z0-9 ]+)\]\] with *$1{{section link|$2|$3}}. Cheers, {{u|Sdkb}}talk 18:16, 21 September 2023 (UTC)
Yes, it is frustrating that the insource regex language is different from the ones used by AWB, AutoEd, etc. Here's a variant search that also times out but that should keep you busy for a while. It does not search for all possible characters that can appear in titles – see WP:TITLESPECIALCHARACTERS, which might help you construct a "any character but not these characters" regex – but it's a start. – Jonesey95 (talk) 18:20, 21 September 2023 (UTC)
There's a number of especially frustrating things here.
  • The syntax looks like PCRE or POSIX regexes, unlike, say, the flavor used in lua (which is also different, but very visibly so), so it misleads people into googling "how can I make a regex to do X" and thinking it'll work; worse, frequently it does get just enough results to look like it worked, even though the syntax differences mean that what was searched for wasn't what was meant to be searched for.
  • The underlying database engine natively supports PCRE - the de-facto standard variant that everyone expects - but we can't use it in queries because page text isn't copied to the replicas we have access to.
  • The combination of not implementing ^ or $, having no syntax that matches just a newline, and including newlines in . and complemented character classes makes it verbose and error-prone to express a match that has to occur all on the same line, or one that starts at the beginning of a line - and this particular search really needs to do both.
/^\*.*\[\[[^]|]*#/m would do what you want, if we could use a sane regex engine; it would give bad output for unclosed wikilinks, but those are unlikely to survive long in articlespace anyway. Here, the closest equivalent, I think, is /[^ -􏿽]\*[ -􏿽]*\[\[[ -\\^-{}-􏿽]*\#/, which has the additional drawbacks of not matching at the very start of a page (which won't matter) and treating tabs like they're newlines (which might). —Cryptic 20:10, 21 September 2023 (UTC)
Well done. That sounds better, though titles and anchors can contain characters other than alphanumerics and spaces. If I need a one-off search which times out, I follow it with prefix:A then repeat with prefix:B etc., though not all titles are alphabetic and it might be antisocial of me to do that too often. Certes (talk) 18:23, 21 September 2023 (UTC)
@Sdkb: Taking a step back, how necessary is this task? I don't see any guideline or even essay that prefers {{section link}} to [[Foo#bar]]. As a section symbol, § may be more familiar than # to some readers (though not me personally). However, it might make certain searches harder. For example, I occasionally trawl for links to [[C#something]], which link to sections about letter C but were usually meant for C♯ (musical note) or C Sharp (programming language). I never thought to check whether any have been converted to section links. Certes (talk) 20:01, 21 September 2023 (UTC)
The MOS doesn't seem to outright forbid visible hash marks, but it does condemn them as "unsightly". —Cryptic 20:25, 21 September 2023 (UTC)
Yeah, my understanding from that section (and the example of general practice at refined pages) is that § is the preferred form for display to readers (and better display for readers always takes precedence over easier searching for editors). {{u|Sdkb}}talk 21:11, 21 September 2023 (UTC)
Fair enough then (though that section also explicitly recommends using a link and doesn't mention the section link template). We should also consider linking to the section via a redirect, in case the text moves into a stand-alone article or elsewhere, but that's obviously not a task that can be automated. Certes (talk) 22:40, 21 September 2023 (UTC)
Yeah, what I noticed during the run (about 600 pages) was lots of questionable entries in see also sections (including some stuff that should be navboxes, some stuff that should be links within the article, and some stuff that just should not have been linked). The sample of pages that showed up in the query is probably lower-quality on average than a random sample of pages with see also sections would be. Still, there's definitely lots of cleanup to do in the area. Doing a systematic review/cleanup is way beyond the scope of what I want to get into, though.
There were also occasional instances where the query didn't pick up all the section links in a see also section (e.g. here), so a smarter query could potentially pick those up. And there were also a few erroneous entries where someone tried to redlink a title that includes a hashtag, not realizing that such a title cannot exist due to technical restrictions (which I skipped or, if I saved before catching the error, reverted, as here). {{u|Sdkb}}talk 23:58, 21 September 2023 (UTC)
I suspect your "from" AWB regex begins with something other than "\[", meaning that it can include part of any previous adjacent link and prevent that from being amended if it's also to a section. So if it's "(.*)\[\[whatever" → "$1{{section link|whatever" then, as well as being inefficient, the initial .* will grab any previous section link. Certes (talk) 10:44, 22 September 2023 (UTC)

Number of articles under protection

Hi. I'm looking into historical trends in the number of articles under each type of protection. Could anyone please point me to or create a query where for a given date, it says "On this date, x articles were under semi-protection, y articles were under full protection, and z articles were under extended-confirmed protection"?

Note that this is not the number of articles for which protection was *applied* on the date; it's the number of articles for which protection was in effect. Thanks in advance and cheers, Clayoquot (talk | contribs) 17:12, 4 October 2023 (UTC)

This is prohibitively difficult to extract from the live database - we'd have to parse not only the protection logs, but deletion and move as well, support all the data formats that have ever been stored in all three of them (which has always been for the convenience of php-based mediawiki, not sql datamining; getting durations is a particular mess), and trace every page that's been protected through moves until either the protection expires, the protection changes, or the page is deleted. That's difficult enough when you're doing it manually online for just one page.
It's more feasible to pull it from the dumps, though, particularly if you don't need page namespaces or titles. You'd just need the page_restrictions file for that - it's always been relatively small, and parsing it is simple enough that you wouldn't even need to install a sql engine. You'll be limited to what dates you can get dumps with that file for, of course.
If you do need articles in a more strict sense of the word (namespace 0, not a redirect) you're going to need the page file too; that's about 2 gigabytes compressed in the latest dump, and while still possible to parse and match up with the page_restrictions table without a real sql engine, it's a lot less trivial. (If you insist on "articles" in the even stricter sense that {{NUMBEROFARTICLES}} uses - namespace 0, not a redirect, has at least one link - you need the pagelinks file too, and definitely a sql engine, and I daresay nobody's going to lift a finger to help you.) —Cryptic 01:14, 5 October 2023 (UTC)
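For a current snapshot only (no historical dates), the live replica makes the count easy; a sketch:

USE enwiki_p;
SELECT pr_type, pr_level, COUNT(*) AS pages
FROM page_restrictions
JOIN page ON page_id = pr_page
WHERE page_namespace = 0
  AND page_is_redirect = 0
GROUP BY pr_type, pr_level;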
Thank you Cryptic for this thorough explanation! Your detailed answer is very much appreciated. Clearly this is above and beyond the call of duty for volunteers. It might be a good topic for academic research - perhaps Benjamin Mako Hill would know someone interested in the challenge. Cheers, Clayoquot (talk | contribs) 20:55, 5 October 2023 (UTC)
Don't forget that disambiguation pages are generally not regarded as articles, despite being non-redirects in namespace 0. However, it's rare that they need protection. Certes (talk) 21:06, 5 October 2023 (UTC)
As it happens, Mako wrote a paper about this very topic in 2015 together with Aaron Shaw, and they also published a dataset and code to recreate it: https://communitydata.cc/wiki-protection/ (link redirects to a current one that for some reason triggers the spam blacklist.) No guarantees that this code still works, of course. Regards, HaeB (talk) 04:40, 12 October 2023 (UTC)

External links for a domain

This query used to work; now it does not. It would print output like:

http://www.bfi.org.uk/index.html 0 British_Film_Institute

ie. the given URL is in the mainspace page British Film Institute

Is there a new/better way to list external links for a domain? -- GreenC 01:31, 12 October 2023 (UTC)

You now need to use el_to_domain_index and el_to_path. See phab:T312666 * Pppery * it has begun... 01:37, 12 October 2023 (UTC)

Thanks. Got this working:

# adapted from https://quarry.wmcloud.org/query/77092
USE enwiki_p;
SELECT page_title,
       page_namespace,
       el_to_domain_index,  -- the domain with its labels reversed, e.g. 'http://nl.kb.kranten.'
       el_to_path           -- the rest of the URL, e.g. '/view/article/...'
FROM externallinks
JOIN page ON page_id = el_from
-- the reversed-domain index means one prefix match finds every host under kb.nl:
WHERE el_to_domain_index LIKE 'https://nl.kb.%' OR el_to_domain_index LIKE 'http://nl.kb.%';

Which produces:

enwiki Huwen_op_Bevel 0 http://nl.kb.kranten. /view/article/id/ddd:010229261:mpeg21:p005:a0112

Is there a way to combine the last two columns into a single URL? ie. http://kranten.kb.nl/view/article/id/ddd:010229261:mpeg21:p005:a0112 This example suggests a way, but it's dependent on knowing the hostname for the replace() command ("www"). Maybe it's better to post-process the output since it's being generated by a script anyway, and take the load off the SQL server. -- GreenC 04:00, 12 October 2023 (UTC)

Well, not so much to take the load off the sql server, but to make it remotely elegant. "Split this string into parts by this character, reverse them, and join them back together" can't really be expressed succinctly in sql; the least ugly way I know how to do it is like this. —Cryptic 05:12, 12 October 2023 (UTC)
... or, duh, mariadb has a regexp_replace() to play with. quarry:query/77235 is much more practical. —Cryptic 06:58, 12 October 2023 (UTC)
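A cut-down version of the idea, handling exactly three domain labels (quarry:query/77235 generalizes this with optional groups; domains that don't match the pattern pass through unchanged):

SELECT CONCAT(
         REGEXP_REPLACE(el_to_domain_index,
                        '^(https?://)([^.]+)\\.([^.]+)\\.([^.]+)\\.$',
                        '\\1\\4.\\3.\\2'),
         el_to_path) AS url
FROM externallinks
JOIN page ON page_id = el_from
WHERE el_to_domain_index LIKE 'https://nl.kb.%'
   OR el_to_domain_index LIKE 'http://nl.kb.%';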
Fantastic, thank you very much. Nice regex statement; an 8-level-deep domain seems reasonable, which allows 5 or 6 hostnames, eg. uk.co.bus.host1.host2.host3.host4.host5. I'll keep an eye out for exceptions. Yes, I didn't want to push SQL to do complicated string processing with many statements and temporary variables. Good to know regex is available. The execution time is only about 10% slower with the regex, and I won't be doing this query often. -- GreenC 18:43, 12 October 2023 (UTC)
There's plenty of external links that pattern won't match, from [http://www.ncbi.nlm.nih.gov.ezproxy.webfeat.lib.ed.ac.uk/pubmed/6804402] on User:Myxoma to [http://cache.cache.cache.cache.cache.cache.cache.cache.cache.cache.cache.cache.cache.cache.cache.cache.cache.cache.cache.cache.cache.cache.click.down2crazy.com/] on Wikipedia:Reference_desk/Archives/Computing/2010 September 7 to [news:comp.lang.ada] on Talk:Type safety. It wasn't meant to; I tailored it to your existing query, which already requires starting with http(s)://nl.kb., and there aren't any external links that match that and have more than five parts. Matching eight gives some headroom without making the backtracking too ridiculous or needing to resort to the obscure \g{} syntax. I wouldn't infer anything from the relative execution times, either; when the difference is only a second or two, that's going to be dominated by caching, server load, and - particularly when the resultset is large - network transfer of data from the sql replicas to Quarry and inserting the results into Quarry's own database. —Cryptic 20:19, 12 October 2023 (UTC)
Hmm.. well, maybe I'd be better off parsing with my script for unlimited levels, and code for exceptions like WP:PRURL and other short-hand forms that come up. URLs can get pretty messy. Too bad as I like your solution. -- GreenC 21:06, 12 October 2023 (UTC)

Find non-redirects with "(Kanye West song)" in title

The header explains this pretty well. I plan to create versions of those titles with "(Ye song)" as a plausible {{R from alternative disambiguation}}. Thanks. (no need to ping me on reply) CLYDE TALK TO ME/STUFF DONE 02:19, 13 October 2023 (UTC)

Quarry:query/77255. BilledMammal (talk) 02:30, 13 October 2023 (UTC)
Thanks much. CLYDE TALK TO ME/STUFF DONE 02:31, 13 October 2023 (UTC)
Or quarry:query/77254, if you perchance want non-mainspace pages too. —Cryptic 02:32, 13 October 2023 (UTC)
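Either query is mostly a LIKE on page.page_title; a sketch of the mainspace version (titles store spaces as underscores, and the underscores are backslash-escaped so LIKE doesn't treat them as single-character wildcards):

SELECT page_title
FROM page
WHERE page_namespace = 0
  AND page_is_redirect = 0
  AND page_title LIKE '%(Kanye\_West\_song)';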

Active editors at Peer review

Hey there! I'm looking to query pages beginning with Wikipedia:Peer review/ for the users with either the highest number of edits or the highest number of characters added (or both) over the last six months. It would be useful if the query also reported the pages in question for each user. I'd appreciate any help, as I have next to no knowledge of SQL. Thanks! TechnoSquirrel69 (sigh) 02:52, 17 October 2023 (UTC)

quarry:query/77339. —Cryptic 03:10, 17 October 2023 (UTC)
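The shape of such a query, for anyone adapting it (a sketch, not necessarily the linked query verbatim; counting characters added would additionally need each revision's parent, so only edit counts are shown here):

SELECT actor_name, COUNT(*) AS edits
FROM revision
JOIN page ON page_id = rev_page
JOIN actor ON actor_id = rev_actor
WHERE page_namespace = 4                       -- Wikipedia:
  AND page_title LIKE 'Peer\_review/%'
  AND rev_timestamp >= DATE_FORMAT(DATE_SUB(NOW(), INTERVAL 6 MONTH), '%Y%m%d%H%i%s')
GROUP BY actor_id, actor_name
ORDER BY edits DESC
LIMIT 50;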
Thanks a bunch, Cryptic! TechnoSquirrel69 (sigh) 03:15, 17 October 2023 (UTC)

Is there a way to find pages in the file talk space that don't have any templates transcluded on them?

Mostly to find any unsigned but pertinent or vandalistic/test edits. Jo-Jo Eumerus (talk) 11:11, 21 October 2023 (UTC)

@Jo-Jo Eumerus: quarry:query/77451. There's about 32,000 file talk pages that do not transclude any templates, so I also filtered by pages that have no wikilinks on them (which includes filtering out pages with user page/talk page links from signatures). If you'd like, I can add in pages that contain autosigned comments ("preceding unsigned comment by...") but when I did this I was seeing a lot of false positives.
Anyway, there seems to be plenty of vandalism/test and WP:NOTFORUM edits in the result set, with some good faith mixed in as well. FYI, you may need to scroll the results to the right on Quarry to see the additional columns: last revision and page size. The results are sorted by the talk pages with the most recent revision appearing first by default. Uhai (talk) 16:04, 21 October 2023 (UTC)
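The template part of that is a simple anti-join (a sketch; the actual query adds the wikilink filter and the last-revision and page-size columns):

SELECT page_title
FROM page
LEFT JOIN templatelinks ON tl_from = page_id
WHERE page_namespace = 7      -- File talk
  AND tl_from IS NULL;        -- no templatelinks row at all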
Thanks! Aye, there is a lot of vandalism in this namespace that goes undetected - but also a fair amount of germane (if often misplaced) commentary. Thanks for the heads-up on scrolling, too. Jo-Jo Eumerus (talk) 16:32, 21 October 2023 (UTC)

Village pump participation

I'd like to know the total number of registered editors who have ever posted to any of the village pumps. I'd like this to be de-duplicated, so that me posting on WP:VPM and VPIL and VPT, etc. only counts as one editor. Is this possible? WhatamIdoing (talk) 02:15, 23 October 2023 (UTC)

Quarry:query/77480 - 22,963 for registered editors, 31,359 for all editors. BilledMammal (talk) 03:52, 23 October 2023 (UTC)
There's a fair number more than that. Plenty of discussion in village pump subpages, and in previous sub-pumps that have since been marked historical or redirected, and a particular anomaly at WP:Village pump archive 2004-09-26. I count 26102 registered, 36512 total editors. —Cryptic 04:17, 23 October 2023 (UTC)
I didn't think of those, thank you.
Also:
  1. WP:Village_pump_(proposals)/FritzpollBot_creating_thousands_of_new_articles
  2. WP:Village_pump_(proposals)/FritzpollBot_creating_millions_of_new_articles
  3. WP:Village_pump_(proposals)/FritzpollBot_creating_up_to_two_million_new_articles
That escalated quickly. BilledMammal (talk) 04:21, 23 October 2023 (UTC)
Precision is not necessary. With either set of numbers, I'm looking at only 1 in every ~550 editors (counting those who registered an account and actually made at least one edit) having ever posted to any village pump.
@Cryptic, could your approach work for identifying how many people ever posted to an AFD? There are a h-u-g-e number of AFD pages. WhatamIdoing (talk) 04:24, 23 October 2023 (UTC)
Yes, but it's going to be a lot slower. There's 156 non-archive Village pump pages in WP: and WT: space, and about 550,000 starting with 'WP:Articles for deletion' or 'WP:Votes for deletion'. The village pumps have a lot more revisions - the query's going to scale by that number as well as the number of pages - but not nearly enough to make up for it. I'll start it running on toolforge where it's less likely to time out. —Cryptic 04:37, 23 October 2023 (UTC)
quarry:query/77485 - 161875/247267. —Cryptic 05:26, 23 October 2023 (UTC)
"I'm looking at only 1 in every ~550 editors (counting those who registered an account and actually made at least one edit) having ever posted to any village pump." I'd suggest that isn't a very useful ratio; 95% of accounts have less than ten edits and the fact that those editors haven't participated on certain pages isn't particularly informative.
When we look at extended-confirmed editors, we see that 17% of such editors have contributed to a village pump; I think that is a more informative figure. BilledMammal (talk) 04:56, 23 October 2023 (UTC)
So ~60% of the people posting are extended confirmed and 40% aren't. A cutoff that excludes almost half the participants doesn't make me feel good. WhatamIdoing (talk) 05:20, 23 October 2023 (UTC)
The cutoff certainly shouldn't be 1 edit, either, though. There's no reason to think someone would find their way to the project namespace (let alone the Village pump in particular) on their first edit, and it's so implausible that it tends to generate suspicion when it does happen. —Cryptic 05:26, 23 October 2023 (UTC)
Though 889 registered users did get to some variant of WP:VP as their first and only edit, it seems. —Cryptic 05:57, 23 October 2023 (UTC)
Looking through those editors, it seems that most of them are lost - usually looking for the teahouse or helpdesk, sometimes just lost in general. The rest just seem to have something they want to say about/in relation to Wikipedia and aren't interested in contributing further; I suspect that actually suspicious ones are the ones with more than one edit but still a low number. BilledMammal (talk) 06:06, 23 October 2023 (UTC)
This doesn't surprise me. Sometimes it's a request to change an article ("He died. Please update the article"). We used to get complaints about fundraising from new accounts, though that seems less common now (I wrote User:WhatamIdoing/Fundraising years ago [2011?] in response to one of them). WhatamIdoing (talk) 15:38, 23 October 2023 (UTC)
Out of curiosity, I pulled the figures for just 2023:
                    Registered editors   Auto-confirmed editors   Editors with 100 edits   Extended confirmed editors
  AfD                     7910                   7377                    6142                       5061
  Village pumps           1953                   1848                    1738                       1605
BilledMammal (talk) 08:48, 23 October 2023 (UTC)
The ratios are 4, 4, 3.5, and 3. The Community™, no matter what cutoff you use, is more likely to be found (in aggregate) at AFD than at the Village pumps (in aggregate). This is a bit complicated, then, because it's easier to discern The True™ Will of The Community™ in an abstract discussion than by looking at messy individual cases. However, the individual actions (a type of revealed preference) are likely to be more accurate than the stated preferences, and there are many more individuals involved in those individual actions. Obviously, the preferences of people (like me) who self-select for spending a lot of time telling others (like the participants in AFD) what to do are not always going to be the same as the preferences of the people who are doing the work. That's one of the reasons that RFA has long sought evidence that the candidate has at least created a decent article.
Also, I note that the all-time/all-editors ratio is 6, suggesting that either the Village pumps have gotten more popular over time (=my guess) or that AFD has gotten less popular (or a combination of both).
  Thank you both for the fascinating numbers. I'd be happy to hear about anything similar at any time (even years from now). Let me see if there's a page where I can record some of these stats for anyone else who might be interested in the future. WhatamIdoing (talk) 15:32, 23 October 2023 (UTC)

Stubs by Article Size

I was wondering if someone could compile all the articles in Category:Stub-Class Green Bay Packers articles (note, the category is filled by the article talk pages) and create a sortable table under Wikipedia:WikiProject Green Bay Packers/substubs#Stubs by Article Size with the article size in the next column. Something like this, but for all 994 articles in the category:

  Stub                            Article Size (bytes)
  1921 Green Bay Packers season   4482

My goal with this is to identify priority articles for improvement. Thanks for any help that can be provided. « Gonzo fan2007 (talk) @ 22:29, 7 November 2023 (UTC)

quarry:query/77945. Click "Download data" and "Wikitable". —Cryptic 00:04, 8 November 2023 (UTC)
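In outline (a sketch, not the linked query verbatim): the talk page's category membership is matched back to the article of the same title, whose page_len gives the size in bytes, and the CONCAT produces the wikilinked titles asked about below.

SELECT CONCAT('[[', REPLACE(article.page_title, '_', ' '), ']]') AS stub,
       article.page_len AS size_bytes
FROM page AS talk
JOIN categorylinks ON cl_from = talk.page_id
JOIN page AS article
  ON article.page_namespace = 0
 AND article.page_title = talk.page_title
WHERE talk.page_namespace = 1
  AND cl_to = 'Stub-Class_Green_Bay_Packers_articles'
ORDER BY article.page_len;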
Thanks Cryptic, is there any way for the output table to have the article titles wikilinked? « Gonzo fan2007 (talk) @ 13:58, 8 November 2023 (UTC)
I suppose. —Cryptic 14:35, 8 November 2023 (UTC)
Thank you! « Gonzo fan2007 (talk) @ 14:48, 8 November 2023 (UTC)

Number of active high-volume editors

Related to User talk:Terribilis11#Odds and Wikipedia:Village pump (proposals)#Scoring for Wikipedia type Articles Generated by LLM, could someone tell me how many enwiki editors meet all three of these criteria:

  • account is more than 1 year old
  • total of 500+ edits at the English Wikipedia (ever)
  • at least 30 edits in the last 30 days

Thanks,

WhatamIdoing (talk) 01:20, 14 November 2023 (UTC)

@WhatamIdoing: quarry:query/78064. Looks to be 9151 meeting these criteria. I filtered out flagged bots from the results but unflagged ones may still appear. Uhai (talk) 01:55, 14 November 2023 (UTC)
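One way to express those three criteria (a sketch, not necessarily the linked query; note that very old accounts can have a NULL user_registration and would be dropped here):

USE enwiki_p;
SELECT user_name, user_editcount
FROM recentchanges_userindex
JOIN actor ON actor_id = rc_actor
JOIN user ON user_id = actor_user
LEFT JOIN user_groups bot ON bot.ug_user = user_id AND bot.ug_group = 'bot'
WHERE bot.ug_user IS NULL                       -- exclude flagged bots
  AND rc_type IN (0, 1)                         -- edits and page creations only
  AND user_editcount >= 500
  AND user_registration <= DATE_FORMAT(DATE_SUB(NOW(), INTERVAL 1 YEAR), '%Y%m%d%H%i%s')
  AND rc_timestamp >= DATE_FORMAT(DATE_SUB(NOW(), INTERVAL 30 DAY), '%Y%m%d%H%i%s')
GROUP BY user_id, user_name, user_editcount
HAVING COUNT(*) >= 30;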
Thank you so much, @Uhai! WhatamIdoing (talk) 04:17, 14 November 2023 (UTC)

Mainspace pages with one editor (sorted by createdatetime)

I'd like to see a list of the mainspace pages with only one editor, preferably including the creation date (looking for the oldest). (In this case, bots would count among editors.) Naraht (talk) 14:08, 17 November 2023 (UTC)

A full list isn't practical; it would have to look at the majority of the 1.2 billion-row revision table.
Finding even the oldest isn't really possible, either, since we've lost most of the history from 2001. On the other hand, just about all the pages that old have also been edited by User:Conversion script, so they've got at least two editors. Oldest page with only a single known editor is Pythagorean Theorm (history), created on 2002 Jan 29 and edited only by 193.203.83.xxx. —Cryptic 16:47, 17 November 2023 (UTC)
And somehow the only edit after creation changed nothing... Thanx, I see now querying the revision table would be ugly. Naraht (talk) 17:04, 17 November 2023 (UTC)

High traffic talk pages

Apologies if this is already documented in a report somewhere (this is not what I'm looking for), but is there a way to see how many pages in the Talk: namespace average at least one edit per day over some time period?

Thinking about how I'd structure this query in SQL immediately brings up problems of timeframes where every talk page that's ever been edited qualifies for a one-day moving window, and almost every talk page fails when the window is "days since talk page creation".

I'm bringing this up in relation to Wikipedia:Bot requests#Bot to add archiving bots to talk pages, which got me thinking (probably a bad sign), and would welcome some kind of practicable, reasonable metric from someone who understands the database schema, but it's not priority, and I don't have an intended purpose for the information apart from curiosity. Folly Mox (talk) 11:58, 25 October 2023 (UTC)

Are you familiar with SQL? If so then you could do something like checking for 30 edits in 30 days by ordering the revisions by page, time and using LAG to check that the last row but 30 was for the same page as the current row and less than 30 days previous. (PARTITION BY page ORDER BY time might work faster, but I have never tested that feature.) Certes (talk) 13:06, 25 October 2023 (UTC)
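In outline, that LAG idea looks something like this (a sketch; window functions do run on the replicas' MariaDB, but as written this scans every talk-page revision, so expect it to be painfully slow):

SELECT DISTINCT page_title
FROM (
    SELECT rev_page, rev_timestamp,
           LAG(rev_timestamp, 29) OVER
               (PARTITION BY rev_page ORDER BY rev_timestamp) AS ts_29_back
    FROM revision
    JOIN page ON page_id = rev_page
    WHERE page_namespace = 1
) AS w
JOIN page ON page_id = w.rev_page
WHERE w.ts_29_back IS NOT NULL
  -- 30 consecutive edits to the same page within 30 days
  AND TIMESTAMPDIFF(DAY, w.ts_29_back, w.rev_timestamp) < 30;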
I'd think it'd work almost as well to look for talk pages by length (and no auto-archive template). That's trivial. —Cryptic 13:58, 25 October 2023 (UTC)
Actually, no, you're overthinking it. quarry:query/77535 for the last 30 days; quarry:query/77537 can do arbitrary timespans, but is much slower. If you want more than 30 edits in the last 30 days, all you've got is two bot wars. —Cryptic 14:55, 25 October 2023 (UTC)
@Folly Mox, I wonder if you would be more interested in talk pages, without archiving, that exceed some reasonable page size (e.g., 100,000 bytes). A brief burst of activity isn't necessarily a problem, so long as the page size doesn't get out of control. WhatamIdoing (talk) 22:18, 28 October 2023 (UTC)
Only four of those, and that many only if you count World Series/world title's talk page. Not so many longer than 50,000, even. —Cryptic 22:33, 28 October 2023 (UTC)
One of those was a single edit and probably qualifies for CSD. WhatamIdoing (talk) 23:47, 28 October 2023 (UTC)
Not sure how I missed this notification. Work, sleep, or age all present themselves as valid candidates. Pinging @Riposte97: as the BOTREQ initiator. Folly Mox (talk) 14:28, 17 November 2023 (UTC)
Agreed that page activity isn’t as much of a concern as page size, considering size recommendations. Riposte97 (talk) 21:28, 20 November 2023 (UTC)

Stub-class articles not tagged with a stub template

I'd like a list of all articles whose talk pages are in Category:Stub-Class articles (or any of its subcategories) but are not in Category:All stub articles, and vice versa; that is, all articles that are rated as stubs on the content assessment scale but do not have a stub template, and vice versa. Thanks, Andumé (talk) 06:36, 25 November 2023 (UTC)

@I Am Andumé: quarry:query/78304. There's quite a large number of articles meeting these criteria so I limited each result set to 1000 records. The first result set consists of articles that are not tagged as stubs but whose talk pages are members of at least one stub-class category. The second result set consists of articles that are tagged as stubs but whose talk pages are not members of any stub-class category. The former totals 1,074,275 articles while the latter has 178,979. Note that articles with no talk pages aren't included in the results. The size of the latter result set could be reduced if you included start-class categories as well, but I don't know if this would be what you are going for. Uhai (talk) 09:36, 25 November 2023 (UTC)
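The first half of that is essentially an anti-join like this (a sketch, not the linked query verbatim; matching the subcategories by their naming pattern rather than walking the category tree is an approximation):

SELECT article.page_title
FROM page AS talk
JOIN categorylinks AS rated ON rated.cl_from = talk.page_id
JOIN page AS article
  ON article.page_namespace = 0
 AND article.page_title = talk.page_title
LEFT JOIN categorylinks AS tagged
  ON tagged.cl_from = article.page_id
 AND tagged.cl_to = 'All_stub_articles'
WHERE talk.page_namespace = 1
  AND rated.cl_to LIKE 'Stub-Class%articles'   -- assessment subcategories follow this pattern
  AND tagged.cl_from IS NULL                   -- no stub template on the article
LIMIT 1000;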
That's not promising. There's only about 3.2 million talk pages in that tree to start with. (And a bunch in other namespaces, which isn't great either.) —Cryptic 10:41, 25 November 2023 (UTC)
Here's a list of the pages in other namespaces, if anyone wants to knock themselves out. Probably wouldn't be a bad idea for a bot task to fix these, especially considering these are only from the stub-class category and I'm sure the other content assessment categories make plenty of appearances in the wrong namespaces as well. There's also redirect talk pages like Draft talk:2020 in Australia that should either also redirect or have the content assessment removed at the very least. I'm not actually sure what the correct thing to do here is. Uhai (talk) 11:59, 25 November 2023 (UTC)
Things like Special:Diff/1084669075 are interesting to come across too. Uhai (talk) 12:04, 25 November 2023 (UTC)
Thanks for all the help! There's definitely a need for a bot here, especially for a 1,074,275-page backlog. Andumé (talk) 06:24, 26 November 2023 (UTC)

Short descriptions containing Unicode non-ASCII characters

At Wikipedia_talk:Short_description#Hostility_to_basic_literacy someone claimed Unicode characters (that are not ASCII) should be avoided in {{short description}} (and its redirects). Would it be possible to find the offending articles, please? Here's how it'd be done for text files: [1]. Thanks. fgnievinski (talk) 02:27, 12 December 2023 (UTC)

@Fgnievinski quarry:query/78337. Descriptions with only ASCII characters but with en dashes as well make up about 200k of the results, so I created another query that ignores those: quarry:query/78639. Uhai (talk) 04:14, 12 December 2023 (UTC)
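Short descriptions are queryable because they're stored in page_props; the core of such a query is probably something like this (a sketch; the pp_propname value and the printable-ASCII byte-range test are the assumptions here):

USE enwiki_p;
SELECT page_title,
       CONVERT(pp_value USING utf8mb4) AS shortdesc
FROM page_props
JOIN page ON page_id = pp_page
WHERE pp_propname = 'wikibase-shortdesc'
  AND pp_value REGEXP '[^ -~]'     -- any byte outside printable ASCII
LIMIT 100;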
Marvelous, thank you for both queries! fgnievinski (talk) 04:27, 12 December 2023 (UTC)
Diacritics are also fairly common and unavoidable in non-English terms. Any idea how to ignore those as well? I'm looking especially for chemical and mathematical formulas. Thanks! fgnievinski (talk) 04:37, 12 December 2023 (UTC)
That's more difficult but I can put some more time to get at what you're looking for. If anyone else wants to take a shot at it, that's welcome too. Uhai (talk) 04:43, 12 December 2023 (UTC)
@Fgnievinski Okay, this really sucked, but I created a query using many of the character ranges at Mathematical operators and symbols in Unicode: quarry:query/78642. There's likely to be false positives and false negatives but this should hopefully be a good enough start. Note that some apparent false positives like Michael Ende actually aren't because the character between the birth and death dates is a Unicode minus sign rather than a dash or en dash. Let me know what issues you find and I can attempt to continue to refine. Uhai (talk) 07:07, 12 December 2023 (UTC)
Something like quarry:query/78652 sucks less to do, at least to exclude diacritics rather than look specifically for math symbols. (Nonbreaking spaces aren't as frequent as en-dashes, but there's still nearly 30000 of them and they make it a lot harder to figure out what's going wrong when they show up in the results.) —Cryptic 16:01, 12 December 2023 (UTC)
These results have a lot fewer false positives and nicely identify easily fixed quote marks and dashes. It would be great to have something like this as a daily or weekly report, maybe with a separate daily report for nbsp characters. I suspect that we have a few AWB editors who could make passes through the results and clean up the unambiguous problems pretty quickly. – Jonesey95 (talk) 16:58, 12 December 2023 (UTC)
{{Database report}} is your friend. Certes (talk) 17:26, 12 December 2023 (UTC)
quarry:query/78658 has just shortdescs containing nonbreaking spaces and thin spaces, which I think everyone can agree are incorrect.
Are other forbidden unicode characters documented anywhere in any more detail than WP:SDFORMAT's "plain text – without HTML tags, wiki markup, typographic dingbats, emojis, or CSS display manipulation"? That doesn't seem to even mildly discourage the "A number 𝑥, 𝑦, or 𝑧 that can solve the Markov equation 𝑥² + 𝑦² + 𝑧² = 3𝑥𝑦𝑧" example from the talk page or even the unusual spaces here. Certainly no more than it does en-dashes. —Cryptic 17:54, 12 December 2023 (UTC)
That does look like a new friend! I have set up Wikipedia:Database reports/Short descriptions containing invalid space characters and will wait for the bot to populate it, if I did it right. – Jonesey95 (talk) 18:58, 12 December 2023 (UTC)
Interesting. It looks as if the &nbsp; in {{Infobox television/Short description}} may be responsible for a lot of those. Some odd characters don't show up clearly in the output or wikitext: for example, Ethical non-naturalism is a "Meta-ethical view" with a non-obvious non-breaking space between l and v. Certes (talk) 19:14, 12 December 2023 (UTC)
I have fixed {{Infobox television/Short description}} and am working on clearing out those transclusions, which might take a couple hours. That should leave us with a more interesting report. – Jonesey95 (talk) 19:54, 12 December 2023 (UTC)
See also User:Certes/Reports/Short descriptions for improvement, a regular report to catch bad SDs using slightly different criteria. Certes (talk) 10:18, 12 December 2023 (UTC)

I think this discussion can be wrapped up. Thanks to all for these great queries and other tips. I learned a lot today. If you are interested in continuing the discussion, see Wikipedia talk:Short description, where I link to two database reports that I have created. – Jonesey95 (talk) 21:38, 12 December 2023 (UTC)

quarry:query/78673 showing unprintable unicode characters is probably of interest, too. All the hits except Reversed half H have that `UNIQ--...--QINU` pattern, and every shortdesc with that pattern shows up in that query. —Cryptic 22:57, 12 December 2023 (UTC)
I have fixed the four templates (initially thought to be two) that were causing most of those. If you want to add that code to Wikipedia:Database reports/Short descriptions containing possibly invalid characters, that would be fine with me. (Edited to add: After the template fixes, there were just four articles left that needed to be fixed manually. All of these errors appear to have been caused by <ref>...</ref> tags in the short descriptions.) – Jonesey95 (talk) 00:24, 13 December 2023 (UTC)

number of deletions

how could one list the number of article deletions per month over the years, please? bonus points if one can distinguish between the type of deletion (speedy, proposed, or discussed). thanks! PS: wmcharts used to do it, but it seems to be out of date: https://wmcharts.toolforge.org/wmchart0004.php fgnievinski (talk) 04:31, 3 December 2023 (UTC)

quarry:query/78459 for just the counts. Classifying by type of deletion is inexact - you basically have to look at the log comment and guess - but there's an attempt at quarry:query/78460. —Cryptic 05:31, 3 December 2023 (UTC)
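The counts query presumably looks something like this (a sketch, not the linked query verbatim; log_action = 'delete' keeps out restores and revision deletions):

USE enwiki_p;
SELECT LEFT(log_timestamp, 6) AS yyyymm,
       COUNT(*) AS deletions
FROM logging
WHERE log_type = 'delete'
  AND log_action = 'delete'
GROUP BY yyyymm
ORDER BY yyyymm;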
Beautiful, many thanks! How can I run a similar query, just for the counts, in other Wikipedia languages, please? fgnievinski (talk) 05:50, 3 December 2023 (UTC)
Log into Quarry, click the fork button, and change the database name. —Cryptic 05:54, 3 December 2023 (UTC)
Will do, thanks! Would it be easy to select only the deletions of articles older than one year? Most deletions should concern recent article creations, I suspect. fgnievinski (talk) 06:03, 3 December 2023 (UTC)
Possible. Depending on how accurate you want it to be, between not easy ("is there any deleted revision dating from a year before the deletion") to hard ("is there any deleted revision dating from a year before the deletion, but after any previous deletions") and very hard ("was there a revision dating from a year before the deletion, after any previous deletions, that might currently not be deleted anymore, or might have been moved to a different title since then). Either way, it'll be very, very slow. —Cryptic 06:07, 3 December 2023 (UTC)
I was hoping the page creation timestamp could be stored in the pages table or in the page properties table. Looking at the revisions table, page creations could be identified with rev_parent_id = 0. Would that speed things up, when trying to calculate the page age at death? fgnievinski (talk) 06:32, 3 December 2023 (UTC)
They're not in revision, page, or page_props anymore once they've been deleted, though there's a similar ar_parent_id column in archive. Either way, (rev|ar)_parent_id=0 isn't necessarily the first revision - a non-deleted revision could have a *_parent_id=0 without it being first if it's been history merged, for example - and there might not be a visible *_parent_id=0 at all if the history was split or if the creating edit was revdeleted or suppressed. And it wouldn't be appreciably faster than looking at the timestamp. (The creation log is no help, either, since it's only existed since mid-2018, and it doesn't account for moves between creation and deletion.) —Cryptic 06:41, 3 December 2023 (UTC)
I recently came across quarry:query/78759, which achieves something related to my original goal, of "separating the wheat from the chaff" or splitting the number of deletions into probably uncontroversial ones and the rest. They've looked at the time difference between user registration and article creation and concluded half of the pages deleted had been created by new users. So now comes the ask: would it be possible to adapt quarry:query/78459 for counting only the deletions of pages created by users registered more than a week before the page creation -- or, better yet, only by autoconfirmed users (with such status at the time of page creation)? Thanks! fgnievinski (talk) 05:00, 17 December 2023 (UTC)
Thank you Cryptic! Is there a way to filter this query to show the monthly counts for deleted articles rather than all types of pages? For this purpose "article" can be defined as namespace=0 and excluding redirects. I don't mind having disambiguation pages and very short or empty articles included. Clayoquot (talk | contribs) 23:10, 13 December 2023 (UTC)
Namespace 0 yes - add AND log_namespace = 0 to the WHERE clause, if you're comfortable forking the queries yourself - but whether a page was a redirect or not is lost when it's deleted, short of parsing the deleted page's text. Quarry has neither permissions to get at the full information of deleted pages, nor page text even if it hasn't been deleted. It does have access to the length in bytes of each deleted revision, but syncing these log entries up to the correct latest deleted revision is difficult, per my 06:07 3 December comment above. —Cryptic 23:21, 13 December 2023 (UTC)
Ah, of course. Thank you! I made a fork here for "Mainspace deletions by month": https://quarry.wmcloud.org/query/78694. For April 2021 it gives an average of 625 deletions per day, whereas wmcharts for a similar period gives an average of around 250 deletions per day. This seems like a bigger discrepancy than would arise from variations in the definition of "article". Do you have any idea what's going on? Clayoquot (talk | contribs) 00:03, 14 December 2023 (UTC)
13000 articles were deleted on April 8, 2021 per Wikipedia:Administrators' noticeboard/Archive332#Large batch deletion probably needed, which is (barely) outside of the range of that wmchart and bumps the number up. * Pppery * it has begun... 00:20, 14 December 2023 (UTC)
Great find! If I take 13000 out of the April 2021 total, I get 581 deletions per day. Still weird, unfortunately. Clayoquot (talk | contribs) 00:30, 14 December 2023 (UTC)
Check your math. The query reports 18000 deletions in April 2021. 18000-13000=5000. 5000 deletions is nowhere near 581 per day. * Pppery * it has begun... 00:37, 14 December 2023 (UTC)
Doh, I missed a zero in my calculations. Thanks! The correctly-corrected number for April 2021 is ((18737 - 13157)/30) or 186 deletions per day. Clayoquot (talk | contribs) 17:50, 14 December 2023 (UTC)

Editors that might make good admins

I took a stab at a query to see which editors might make good admins. I've got a subquery to the logging table looking for blocks that is making it time out. Any ideas for speeding it up? I tried left join, is still timing out. Thanks! –Novem Linguae (talk) 06:34, 19 December 2023 (UTC)

mw:Manual:Logging table#Indexes. logging is no different from any other table in that log_title isn't indexed without log_namespace. —Cryptic 06:53, 19 December 2023 (UTC)
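Concretely (a hypothetical lookup for a user named Example; block log entries live in the User: namespace, so adding log_namespace lets the (log_namespace, log_title) index do its job):

SELECT log_timestamp, log_action
FROM logging_logindex
WHERE log_type = 'block'
  AND log_namespace = 2          -- User:
  AND log_title = 'Example';     -- without log_namespace this forces a scan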
That fixed it, thanks. Another question. Why is ZéroBot in my results despite the condition AND LOWER(user_name) NOT LIKE '%bot'? –Novem Linguae (talk) 07:42, 19 December 2023 (UTC)
You've found some great candidates there. I don't see any bots; ZéroBot has gone now. Perhaps it didn't write enough GAs to pass RfA. Certes (talk) 10:00, 19 December 2023 (UTC)
Hehehe. Yes, if you remove all the conditions below AND LOWER(user_name) NOT LIKE '%bot', it will come back though. I am clearly misunderstanding something. –Novem Linguae (talk) 10:02, 19 December 2023 (UTC)
Per wikitech:Help:MySQL queries#Alternative Views, I was about to suggest selecting from recentchanges_userindex rather than recentchanges. However, I tried that and it didn't help performance. LOWER is a pig to use because it fails silently on BINARY strings; the workaround is to convert them first. In fact, once you do that, LIKE works as documented and you don't even need LOWER: the check can be AND CONVERT(user_name USING latin1) NOT LIKE '%bot' Certes (talk) 10:56, 19 December 2023 (UTC)
Generally you want to avoid correlated subqueries (subqueries that match based on fields in the parent query). This is because each subquery needs to run for each row in the parent query. In other words, it's very slow. If you are going to use EXISTS with a correlated subquery, then avoid the asterisk and SELECT the table's surrogate key instead, e.g. recentchanges' rc_id. In fact, for performance reasons, it's good practice to avoid the asterisk for all queries once you are past the stage of noodling around.
Other issues are that you duplicated the user group subqueries when you could have done AND ug_group IN ('sysop', 'bot') and that in the recentchanges subquery you JOINed the actor table when you could have done this in the parent query to improve performance. The latter, again, ties back to not using correlated subqueries to begin with.
I refactored the query at quarry:query/78839 to eliminate the correlated subqueries and fix the other issues and it's running much faster. It does use more memory, but c'est la vie. Uhai (talk) 14:35, 19 December 2023 (UTC)
Thank you for the code review. Great tips for next time. Will try to remember them. Happy holidays :) –Novem Linguae (talk) 19:05, 19 December 2023 (UTC)
I am keen to see this list. BD2412 T 20:30, 19 December 2023 (UTC)
It's linked above, twice, at quarry:query/78816 and quarry:query/78839.
This sort of thing comes up every few years - I'm put in mind of Wikipedia talk:Requests for adminship/Archive 246#Poll candidate search, for example. One of the problems is that, just as statistics are a really, really poor way to evaluate a candidate at RFA, using these sorts of statistics to find candidates for candidacy at RFA isn't such a great idea. Get accidentally blocked instead of the user you've listed at AIV and then immediately unblocked? Too bad, you're off the list. 9500 of your >10000 edits were automated and performed in the last month, with most of the rest deleted spam from the month before? That's ok, you're still on it, long as you actually picked out your username two years ago. Have you written great content, maybe even have a few featured articles under your belt, but the only time you bothered with a Good Article evaluation first you got a reviewer who put you off for months and then tried to make you jump through meaningless hoops? Too bad. But if all your blocks and failed rfas were before your namechange, you're still in the running. And it goes on and on. —Cryptic 21:07, 19 December 2023 (UTC)
The other methods of finding RFA candidates are also imperfect, and involve things such as "ooh this person seems to have the right personality, let me click on their userpage and dig around more", missing thousands of other good candidates. Or folks posting at WP:ORCP, missing all the folks who don't post there. Or asking admin friends for ideas of folks to nominate, missing folks the friend doesn't mention. No technique will be perfect. The combination of all these imperfect techniques by multiple nominators is likely to find most of the good candidates though. –Novem Linguae (talk) 21:16, 19 December 2023 (UTC)

Stub types not listed on the Great List of Stub Types

I'd like a list of all stub templates (templates whose name ends with "-stub") and all stub categories (categories whose name ends with "stubs") that are not listed on WikiProject Stub sorting's List of stub types. Thanks, Andumé (talk) 20:25, 21 December 2023 (UTC)

There are tens of thousands. Any other subpages we need to worry about besides Wikipedia:WikiProject Stub sorting/Stub types/Geography? —Cryptic 22:01, 21 December 2023 (UTC)
Never mind, I think I found all of them. That page is miserable to work with (but I suppose that's to be expected; the whole project is redundant makework anyway). quarry:query/78899. —Cryptic 22:28, 21 December 2023 (UTC)

List of editors for a set of articles

Hello, favorite people who have still not told me to learn SQL myself so I'll quit bugging you (or, more realistically, so I'll have more complicated questions for you),

Wikipedia:WikiProject Directory/Description/WikiProject Video games and related pages were compiled and updated by a now-broken bot. But: Could we do this/something like it in a Quarry query, individually, for whichever WikiProject interested us?

The main use case at the moment is to find people who are making multiple edits to articles within a group's scope, and then to invite those people to join the group. WhatamIdoing (talk) 21:14, 13 January 2024 (UTC)

Define "WikiProject-area pages and discussions" and "an article in the WikiProject's subject area"? I guess the latter is probably mainspace pages whose talk is in Category:WikiProject Video games articles. If Reports bot's source is available, it's not immediately obvious where it is. —Cryptic 22:22, 13 January 2024 (UTC)
What I'd like is to find editors who edit articles whose talk pages are in the subcats of Category:WikiProject Medicine articles. For example, could we make a list of editors who have made 10 edits to WPMED-tagged articles in the last 30 days?
For my own purposes, I'd also like to filter the list to remove editors who are blocked (at least if they're indeffed), who probably know about WikiProjects (e.g., those who have already made 10,000 edits) and those who mostly edit other areas (e.g., someone on an AWB run that happened to include some WPMED-tagged articles. WPMED tags a bit less than 1% of articles, so if WPMED-tagged articles are less than 10% of your edits, then you might not be a good candidate for an invitation to WPMED).
How much of that do you think is feasible? WhatamIdoing (talk) 22:43, 13 January 2024 (UTC)
"Subcats" is almost always tricky, both in defining what that means - direct subcats, or any in the category tree? How deep in the tree? Subcats only, or also directly in the category (not relevant here)? - and in the query. In this case we've got Category:All WikiProject Medicine articles, which should be everything. Editcount's easy for users, impractically difficult for ips. Directly-blocked users and ips is easy, but excluding ips that are currently rangeblocked is more trouble than it's worth. Mostly WPMED-tagged, maybe; needs thought. quarry:query/79597 is most of it. —Cryptic 23:08, 13 January 2024 (UTC)
The "All" cat appears to have an accurate number of articles estimated, so let's go with that.
We can simply exclude the IPs, though perhaps that will get easier when m:Temporary accounts are rolled out (still estimated for later this year, AFAIK).
Could the "mostly WPMED-tagged" be estimated as the percentage of edits we're counting, divided by the total of all mainspace edits during the time period? This would require counting how many mainspace edits User:Promising made overall last month, in addition to how many to WPMED-tagged articles specifically. WhatamIdoing (talk) 23:14, 13 January 2024 (UTC)
quarry:query/79597 again. IPs still there, rangeblocks still not. I expect everything involving ips to become unreasonably difficult once Temporary accounts happens. It certainly won't be easier. —Cryptic 23:24, 13 January 2024 (UTC)
For my purposes related to maintaining a WikiProject, I'd like this to look at any subject page that's in the "All ... articles" category of related talk pages, mainspace or not. I want to know about the work being done on any page under our project's inclusion umbrella. Stefen Towers among the rest! GabGruntwerk 23:29, 13 January 2024 (UTC)
quarry:query/79599. —Cryptic 23:42, 13 January 2024 (UTC)
Thanks! I was already playing with a fork, but I was taking out the namespace check altogether, whereas you limited it to subject pages (as I'd stated). Cool. Pretty powerful. Stefen Towers among the rest! GabGruntwerk 23:47, 13 January 2024 (UTC)

Need Help, for Missing Infobox items

Hi, how can I find missing or empty infobox parameters? For example, Template:Infobox Indian constituency has multiple parameters, and "| constituency_no" may be missing entirely or left blank ("| constituency_no = "). Kindly suggest any suitable tools. - IJohnKennady (talk) 17:32, 20 January 2024 (UTC)

That's not queryable. It's not reliably searchable either, but you can come close - this search will find pages transcluding that template where "constituency_no" doesn't appear anywhere in the page source, and this one will find pages transcluding that template where it does appear, but doesn't have any value other than perhaps spaces or control characters (like tabs and newlines). The problem with both is that they can't guarantee that "constituency_no" doesn't appear elsewhere on the page, perhaps as a parameter of another template, or where this template is used more than once. —Cryptic 18:57, 20 January 2024 (UTC)
It's really superb, but there is no option to export CSV/Excel. I have all the data for Indian constituencies. If there's a chance to add an export option, I can pull the relevant data. - IJohnKennady (talk) 09:43, 21 January 2024 (UTC)

Querying Wikipedia articles

I would like to query for Wikidata items that are linked to multiple language editions of Wikipedia, e.g has both Yiddish Wikipedia and English Wikipedia sitelinks.

Another query, for all Yiddish Wikipedia articles that have a link to any other language edition except English Wikipedia. ~ 🦝 Shushugah (he/him • talk) 22:43, 21 January 2024 (UTC)
These are both questions for d:WD:RAQ. —Cryptic 23:11, 21 January 2024 (UTC)
I ended up finding meta:PetScan/en, which fits my needs. Thank you! ~ 🦝 Shushugah (he/him • talk) 23:20, 21 January 2024 (UTC)

Filter by is_redirect / is_dab

I have adapted this query:

SELECT page_namespace, page_title FROM categorylinks
JOIN page ON page_id = cl_from
WHERE cl_to = 'WikiProject_banners_without_banner_shells'
AND page_namespace = 1
ORDER BY RAND()
LIMIT 100000

from quarry:query/79763 (thanks SD0001).
Can this be limited to cases where the corresponding mainspace page is both not a redirect page, and not a dab page? — Qwerfjkltalk 20:31, 31 January 2024 (UTC)

I've forked that as quarry:query/80091. See also Special:RandomInCategory. Certes (talk) 20:57, 31 January 2024 (UTC)
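The fork is linked rather than shown; below is a minimal sketch of one way to express the two extra conditions (the page_props 'disambiguation' property is the usual disambiguation-page test):

SELECT talk.page_namespace, talk.page_title
FROM categorylinks
JOIN page AS talk ON talk.page_id = cl_from
JOIN page AS subject ON subject.page_namespace = 0
    AND subject.page_title = talk.page_title
WHERE cl_to = 'WikiProject_banners_without_banner_shells'
  AND talk.page_namespace = 1
  AND subject.page_is_redirect = 0      -- mainspace page is not a redirect
  AND NOT EXISTS (                      -- ...and is not a disambiguation page
      SELECT pp_page
      FROM page_props
      WHERE pp_page = subject.page_id
        AND pp_propname = 'disambiguation')
ORDER BY RAND()
LIMIT 100000;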
Certes, thanks. And yes, I am aware of RandomInCategory, unfortunately it's missing an api. (And I did use RAND() instead of page_random intentionally, but nevermind.) — Qwerfjkltalk 21:31, 31 January 2024 (UTC)

Pages in category that haven't been edited by specific users

Could someone please write a query to find members of Category:Pages using WikiProject banner shell without a project-independent quality rating that haven't been edited (at all) by Qwerfjkl (bot) or Cewbot? — Qwerfjkltalk 15:01, 5 February 2024 (UTC)

Number of blocks last year

Hello, all.

Related to Wikipedia talk:Student assignments#Stats and in the hope that the information will be generally interesting, I'd like to know the number of registered accounts that were created during 2023 (any time during the year) and made an edit (any page, any namespace, deleted or undeleted, any time during 2023), and then to have that number divided according to whether the account has ever been blocked (any time since account creation, including in January 2024). I'd prefer to only count "full" blocks, not partial ones, but if that's too difficult, that's okay. If an account gets blocked a dozen times, I'd prefer to have that counted as "one blocked account" instead of "a dozen blocks were issued".

My goal is to be able to write a sentence like "About 1% of new accounts get blocked" or "Blocks are fairly common, while sitebans are rare. 12,345 new accounts were blocked in 2023, but only about a handful of people were sitebanned during the year". WhatamIdoing (talk) 22:11, 2 February 2024 (UTC)

quarry:query/80159. —Cryptic 22:56, 2 February 2024 (UTC)
73,528/1,968,140=3.7% of new accounts get blocked.   Thank you! WhatamIdoing (talk) 23:00, 2 February 2024 (UTC)
Wait, no, overlooked 'made an edit'. Hang on. —Cryptic 23:06, 2 February 2024 (UTC)
OK, quarry:query/80159 again. —Cryptic 23:18, 2 February 2024 (UTC)
Thanks. Meanwhile, I've made quarry:query/80160 to count the total number of blocks placed during the year (and posted it to WP:BLOCKBANDIFF); please let me know if I screwed it up. WhatamIdoing (talk) 02:08, 3 February 2024 (UTC)
The new set says 48,682/478,209=10.1% chance of getting blocked. So if you merely create an account and never edit, you have a (73,528-48,682)/(1,968,140-478,209)=1.66% chance of getting blocked (presumably some Checkuser blocks in that group), and if you make at least one edit, you have a 10% chance of getting blocked.
Maybe I didn't run the numbers correctly, but my calculation suggests that the risk of a block during 2023 was actually a bit higher for accounts created before 2023. WhatamIdoing (talk) 02:20, 3 February 2024 (UTC)
Pulling the list of accounts created from the newusers log, as I initially did and you still do in that fork, isn't going to work right unless you're filtering by time, since it didn't exist before September 2005 and hasn't been backfilled. —Cryptic 03:23, 3 February 2024 (UTC)
Which log should I use instead? (I have pretty good cutting and pasting skills, but...) Did I correctly filter for blocks that were placed during 2023, not accounts that were indeffed years ago? WhatamIdoing (talk) 04:36, 3 February 2024 (UTC)
You'd use the user table, if you also cared about when they registered or how many edits they had. But it's huge, and doesn't mix well with the also-huge logging table because it stores spaces in usernames as spaces instead of munging them into underscores, and of course the block logs don't store the user_id (or perhaps directly whether it's a block of an ip or not) like it would in a sane world. So it turns out the only reasonably-efficient way to do this is to look at each block to see if it parses as an ip or an ip range, as in quarry:query/80167. —Cryptic 07:44, 3 February 2024 (UTC)
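A sketch of the parse-as-ip idea (not the actual test used in quarry:query/80167; the patterns here are deliberately crude):

SELECT COUNT(DISTINCT log_title) AS blocked_accounts
FROM logging_logindex
WHERE log_type = 'block'
  AND log_action = 'block'
  AND log_namespace = 2
  AND log_timestamp LIKE '2023%'
  -- IPv4 addresses and ranges look like digits-and-dots
  AND log_title NOT REGEXP '^[0-9]{1,3}(\\.[0-9]{1,3}){3}(/[0-9]{1,2})?$'
  -- colons are invalid in usernames, so anything containing one is IPv6
  AND log_title NOT LIKE '%:%';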
That's... a brilliant workaround for a problem that shouldn't exist. MPopov (WMF), is this a problem for your team as well? I always assume that changing the structure of a log is scary, but this seems like a really odd choice. Maybe it'd be worth it.
Is parsing the user_id the best way to get the block status for a list of ~15K specific accounts (names in https://dashboard.wikiedu.org/campaigns/spring_2023/students.csv, https://dashboard.wikiedu.org/campaigns/summer_2023/students.csv and https://dashboard.wikiedu.org/campaigns/fall_2023/students.csv)? WhatamIdoing (talk) 19:38, 3 February 2024 (UTC)
(name, not id, but I know what you meant). For 15k it's not going to make a difference. For the not-quite-47 million registered users we have, it did.
This is mainly a problem in datamining. In production, it's perfectly reasonable to store the affected user or ip of a historical block into the same field normally used for a page title, and for that matter its details into a string of serialized php data like a:3:{s:11:"5::duration";s:7:"2 weeks";s:8:"6::flags";s:17:"anononly,nocreate";s:8:"sitewide";b:1;} - this isn't information that the wiki actually uses for anything except to display it when someone looks at Special:Log. For active blocks, all of this data is stored in a separate, easily- and quickly-queryable table. —Cryptic 20:05, 3 February 2024 (UTC)
So if I want to find out how many of my ~15K student-newbies got blocked at any point (including expired blocks), will I need to put all 15K names into the query? WhatamIdoing (talk) 06:43, 6 February 2024 (UTC)
Well, you could. Or you could just shoehorn them into the db. Paste them all into a user subpage and link them, and they'll show up in the pagelinks table. —Cryptic 07:26, 6 February 2024 (UTC)
That sounds more sane. Does this mean that I create a page that says something like "* WhatamIdoing * Cryptic", and then the list of names becomes usable in the query? WhatamIdoing (talk) 20:06, 6 February 2024 (UTC)
Yes. —Cryptic 20:09, 6 February 2024 (UTC)
The list is at User:WhatamIdoing/Student accounts in 2023 now. WhatamIdoing (talk) 22:29, 7 February 2024 (UTC)
quarry:query/80240 cover everything you're looking for? —Cryptic 00:25, 8 February 2024 (UTC)
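quarry:query/80240 isn't reproduced here; a sketch of the general shape of the pagelinks trick, assuming the list page above and the pre-migration pl_namespace/pl_title columns:

SELECT pl_title AS username,
       COUNT(log_id) > 0 AS ever_blocked
FROM pagelinks
JOIN page AS listpage ON listpage.page_id = pl_from
LEFT JOIN logging_logindex
    ON log_type = 'block' AND log_action = 'block'
   AND log_namespace = 2 AND log_title = pl_title
WHERE listpage.page_namespace = 2
  AND listpage.page_title = 'WhatamIdoing/Student_accounts_in_2023'
  AND pl_namespace = 2      -- only the User: links on the list page
GROUP BY pl_title;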
Cryptic, can you update a previous query (from a few months ago, probably when we were talking about village pump participation) to tell me how many registered editors made an edit during 2023? I want to compare that against the 82,865 total blocks (although I suppose some of those were blocked before their first edit, which is a bit of an apples-and-oranges comparison). WhatamIdoing (talk) 06:41, 6 February 2024 (UTC)
I don't see anything adaptable in Wikipedia:Request a query/Archive 4#Village pump participation. We were examining users who'd made edits to specific sets of pages, then once we had those users, examining them to see how many edits they'd ever made. That last number's handily stored as a field in the user table, I believe originally to support autoconfirmed. If you want to know how many users - without any other way to narrow down which users to examine - made any edits in a specific time period, there isn't going to be a significantly faster way than to look at every revision in that time period. There were 60 million in 2023. I'll start some queries running and check them in the morning. No promises. —Cryptic 07:26, 6 February 2024 (UTC)
quarry:query/80211. —Cryptic 17:52, 6 February 2024 (UTC)
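A minimal sketch of the counting itself, not the linked query; it's conceptually one line, just slow because every 2023 revision has to be read:

SELECT COUNT(DISTINCT rev_actor) AS editors_2023
FROM revision
JOIN actor_revision ON actor_id = rev_actor
WHERE actor_user IS NOT NULL   -- registered accounts only; drop to include IPs
  AND rev_timestamp >= '20230101000000'
  AND rev_timestamp < '20240101000000';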

Green Bay Packers Hall of Fame Query

Would someone be able to search all of the articles on {{Green Bay Packers Hall of Fame}} to provide a list of those articles that do not have the string https://www.packers.com/history/hof/ somewhere on their page. If you take a look at Green Bay Packers Hall of Fame, Packers historian Cliff Christl wrote an article on almost every Hall of Famer for the Packers (referenced in the table). It is a great source and I would like to add the reference to each players' write-up to their Wikipedia article without having to search through every single article (160+ articles, with many already having their write-up referenced on their page). Thank you! « Gonzo fan2007 (talk) @ 15:44, 16 February 2024 (UTC)

@Gonzo fan2007 This got a little messy because of needing to account for archived pages as well, but this should be correct? quarry:query/80453. It looks like 114 of 167 in the category do not have links matching the string you indicated (note I also included packers.com/history/hall-of-famers/ as this seemed to appear as well). Let me know if you come across issues and I'll try to fix them. FTP! Uhai (talk) 17:58, 16 February 2024 (UTC)
Uhai, this is perfect! Thank you! Is there any way you can format the output with Wikilinks, so when I download it as a wikitable I can have clickable links? « Gonzo fan2007 (talk) @ 19:16, 16 February 2024 (UTC)
@Gonzo fan2007 Updated. Does that work? Uhai (talk) 19:19, 16 February 2024 (UTC)
That is perfect Uhai, thank you so much! « Gonzo fan2007 (talk) @ 19:35, 16 February 2024 (UTC)
No problem! Again, let me know if you encounter any issues. Uhai (talk) 19:41, 16 February 2024 (UTC)

AfC reviewer's 25th review

I would like to find all new AfC reviewers who recently hit some review threshold (see this discussion). I'd operationalize this as something like "All users with their first AfC review under 6 months ago and their 25th review within the last week", but if something with a similar effect is technically easier I'm not attached to the specific formulation. @Novem Linguae directed me here to see if someone could help write a query; they say the typical way to grab AfC reviews, which isn't a formally logged action, is to look at edit summaries, e.g. here. Rusalkii (talk) 00:20, 24 February 2024 (UTC)

To start with, this can't possibly see afc reviews where the underlying page has been deleted (including where the redirect left from draft or, more likely, userspace has been deleted) - edit summaries of deleted revisions aren't exposed to the toolforge replicas. And searching revision, which is necessary if you're looking more than 30 days back, without some way to drastically narrow the search space isn't feasible. "Only users who've had at least one matching edit summary in the last week" may or may not be sufficient; we'll see. These actions should really, really be tagged; go ask at WP:EFR if leaving such edit summaries without using the actual tool is to be counted, or change the tool if not. —Cryptic 02:20, 24 February 2024 (UTC)
The AFC helper script that accepts/declines/rejects drafts makes a variety of edits, including to: drafts, the log page at Wikipedia:Articles for creation/recent, sometimes a userspace log depending on the user's preferences (example: User:Novem Linguae/AfC log), and user talk pages. Which edits would be most helpful to tag? Or do you recommend tagging all of them? Also, tags are different than edit filters right? I could just make a tag at Special:Tags and update AFCH's code, and not need to use an edit filter at all, correct? –Novem Linguae (talk) 02:44, 24 February 2024 (UTC)
I'd tag accepts and declines but not comments, log pages, or user talk pages, probably with a separate tag for accepts and declines/rejects. Yes, they're very different; I mention edit filters because they can add tags even if someone doesn't use the tool. —Cryptic 02:59, 24 February 2024 (UTC)
To give you some idea of the scope of the problem, quarry:query/80638. There's 118 users with at least one vaguely-likely-to-be-an-AFC-review edit in the last week; among them, they have 10.8 million total edits. Each of those edits has to be looked at to see if the edit summary matches. (Well, not quite all - I can exclude edits from before 11 December 2009, the first revision of what became the AFCH script. But most of them.) Even limiting it to one of the sets of users with lower edit count is going to take a long time.
Are the writes to Wikipedia:Articles for creation/recent optional, or do they always happen? Is there always exactly one edit there per afc review? If so, when did that start? Analyzing its history would be much, much faster even than if there'd been tags from the beginning, and would be able to deal with now-deleted reviews too. —Cryptic 04:23, 24 February 2024 (UTC)
And, as tends to happen, my query on toolforge finished just after I wrote that. The public version at quarry:query/80639 should finish in about an hour unless the results happened to cache favorably (I don't think that's likely for this sort of query), but as of a few minutes ago the only such user was Broc. —Cryptic 04:41, 24 February 2024 (UTC)
The /recent page is always written to for accepts, but never for declines and rejects. Filtering by the user talk namespace may be helpful for counting afc actions, because everyone always gets a notification, and user talks are never deleted by g13. –Novem Linguae (talk) 04:43, 24 February 2024 (UTC)
Yeah, just saw that from analyzing Broc's reviews. Looking at user talk notifications would help some with now-deleted reviews, but still be unusably slow with nothing to narrow them except edit summary. And many, many, many IP talk pages have been badly speedied. —Cryptic 04:56, 24 February 2024 (UTC)
I think it's a bit rare to have more than one tag per piece of software. At this point I am leaning towards having AFCH tag all its edits as AFCH. That could hopefully then be used to speed up quarry queries. Feature request.Novem Linguae (talk) 11:12, 24 February 2024 (UTC)
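Once such a tag exists, counting reviews becomes a cheap indexed lookup. A sketch, assuming a hypothetical tag named 'AFCH' (no such tag existed at the time of this discussion):

SELECT actor_name, COUNT(*) AS tagged_edits
FROM change_tag
JOIN change_tag_def ON ctd_id = ct_tag_id
JOIN revision ON rev_id = ct_rev_id
JOIN actor_revision ON actor_id = rev_actor
WHERE ctd_name = 'AFCH'        -- hypothetical tag name
GROUP BY actor_name
ORDER BY tagged_edits DESC;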

Users who meet certain thresholds

Hi all, over at WP:RFA2024/I someone proposed we pull administrators from those who meet the following criteria:

  1. At least 10,000 total edits, including at least 5,000 in main space
  2. At least 1,000 edits in the past year, including at least 500 in main space
  3. Account registered at least three years ago
  4. No sanctions within the past five years
  5. At least one featured article or three good articles
  6. Have never lost adminship under a cloud

Is it possible to generate a list of editors who meet at least 1-4? #6 is rare enough that I can check manually. #5 is a bit tricky. I believe each editor with number of GAs is listed at User:GA_bot/Stats. I believe all users who have brought a featured article through FAC are at Wikipedia:List of Wikipedians by featured article nominations. I think generating this list would help inform the discussion. My guess is that it's a small list. Thanks a million to anyone up to the challenge! Ajpolino (talk) 02:42, 27 February 2024 (UTC)

The short answer is "no".
User:GA bot/Stats is users by the number of reviews performed, not good articles written; the last entry on the list, ヒストリ案, is illustrative. Similarly, the FA page isn't the users who primarily wrote the articles, it's the ones who took them through the FAC bureaucracy (though there's a high amount of correspondence). And even if they were both good data, there isn't a convenient way to extract the number of articles from the GA page, and the FA page includes articles that are no longer featured; those, at least, could be addressed by manually creating similar pages (WP:List of users with three or more Good Articles and WP:List of users with at least one still-Featured Article).
#4 isn't doable even in principle. Sanctions, in general, aren't logged (in the Special:Log sense) unless they happen to be blocks.
I don't think there's a good list anywhere for #6. We have a couple applicable subpages of Wikipedia:Former administrators/reason, but lots are mixed into /resigned.
#2 and the second half of #1 are absurdly slow. There's no remotely efficient way to count edits by time or namespace other than to look at all of them. That's barely doable when we're already looking at individual users, or a small list of users with relatively few edits - say, no more than a million total edits, in either case - but when I poked at this briefly when it was first proposed, I ran the numbers just for 10k total edits and 3 years' registration; there's 10,318 users with 441.2 million total edits. #5 will cut that down, but not nearly by enough. —Cryptic 04:33, 27 February 2024 (UTC)
Understood, thank you for looking into it. Would it be possible to get the list of just the first halves of #1 and #2, plus #3 - i.e. users whose accounts are at least 3 years old, with at least 10,000 total edits, including 1,000 in the past year? I'm guessing even a filter of "have edited in the past year" would cut down your 10,318 users quite a bit, but I'm not sure it's possible or worthwhile to attempt. Thanks again Cryptic. Ajpolino (talk) 13:56, 27 February 2024 (UTC)
Here's the first half of #1 combined with #3: https://quarry.wmcloud.org/query/80710Novem Linguae (talk) 14:51, 27 February 2024 (UTC)
That shows, as of this writing, users who've registered less than three years ago. You'll also want to exclude users in groups sysop or bot. —Cryptic 15:18, 27 February 2024 (UTC)
"1000 in the last year" still means having to look at every edit each of those editors made in the last year, and there's too many. "Any edits in the last year" does help - it cuts the number down from 10319 (congrats, whoever you are) to 6176 (quarry:query/80711)- but the other counts still aren't going to be feasible until we're down to a couple dozen. —Cryptic 15:18, 27 February 2024 (UTC)

Top 500 most redlinked articles?

I would love to see the most red-linked articles. There's a wikispace table for this somewhere, but it has serious problems. Zanahary (talk) 07:41, 28 February 2024 (UTC)

Related:
  • Special:WantedPages - all namespaces, so not very useful. phab:T37758, which would add filtering by namespace to this page, is open and awaiting a patch.
  • Wikipedia:Most-wanted articles - looks like an attempt to list this for mainspace, but the list is outdated by over a year and also appears to be incomplete.
Will let someone more experienced speak about if Quarry/SQL can help here. –Novem Linguae (talk) 07:49, 28 February 2024 (UTC)
Thanks for this! Yeah, Most-wanted is what I was referring to. It's confusing and clearly just not the most-redlinked articles. Zanahary (talk) 07:56, 28 February 2024 (UTC)
To be more specific: Most redlinked articlespace articles, whose redlinks are in articlespace (if that's possible). Zanahary (talk) 07:58, 28 February 2024 (UTC)
If this would exclude transclusions via template! I just want a meaningful result that has red-linked articles by the number of times they were linked to by an editor. Zanahary (talk) 08:06, 28 February 2024 (UTC)
You can't get that. —Cryptic 08:10, 28 February 2024 (UTC)
Why not? Zanahary (talk) 08:11, 28 February 2024 (UTC)
Bad decisions made 22 years ago. phab:T14396 from 2007 is a reasonable place to start looking; it contains a couple indices of the many, many discussions (even then!) begging for this to be made possible. —Cryptic 08:19, 28 February 2024 (UTC)
So it's not possible for the query to exclude template links? Zanahary (talk) 08:23, 28 February 2024 (UTC)
I believe the link documents that pretty thoroughly? I mean, the workarounds to get a single page's incoming nontranscluded links clearly can't be scaled to multiple pages, let alone all articles. —Cryptic 09:39, 28 February 2024 (UTC)
It would still be useful and has been requested repeatedly, most recently in m:Community Wishlist Survey 2023/Larger suggestions/A way to see WhatLinksHere directly, excluding transcluded links. Certes (talk) 16:05, 28 February 2024 (UTC)
Leaving out the transcluded-link question, there's no reason someone can't just run Certes' query, as linked from WP:MWA, again. The hard part's already done, it just takes someone willing to deal with the tedium of running it for each partition of titles and collating the results - AND pl_title < "B", AND pl_title >= "B" AND pl_title < "C", ... AND pl_title >= "Y" AND pl_title < "Z", AND pl_title >="Z". —Cryptic 08:21, 28 February 2024 (UTC)
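Certes' query isn't reproduced here, but the general shape of one such partition looks something like this (the pl_from_namespace filter restricts counting to links from articles):

SELECT pl_title, COUNT(*) AS links
FROM pagelinks
LEFT JOIN page ON page_namespace = pl_namespace AND page_title = pl_title
WHERE pl_namespace = 0
  AND pl_from_namespace = 0
  AND page_id IS NULL          -- no such page: a redlink
  AND pl_title >= 'B' AND pl_title < 'C'
GROUP BY pl_title
ORDER BY links DESC
LIMIT 500;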
If it's going to be used regularly then we can copy the Quarry query to Wikipedia as a {{database report}}. That has the advantage that anyone can edit it on Wikipedia and re-run it without logging in to Quarry. (Only I can edit the Quarry query, though anyone can copy and paste it.) pl_title LIKE "B%" seems as fast as the >= AND < syntax, and may be more readable. Beware that many titles which appear to be commonly redlinked are actually linked once from a widely used template which contains dozens of similar redlinks, e.g. {{Areas of Chennai}}. Certes (talk) 09:40, 28 February 2024 (UTC)
LIKE 'B%' is never going to work for non-alphabetically-starting titles. Granted they're not likely to be widely-linked, but it's not impossible. —Cryptic 09:51, 28 February 2024 (UTC)
True. There are plenty of initials before A (notably 1 and 2 for "2024 in Tiddlywinks" etc.) and after Z. We need to use <= and > for the first and last partitions. We may also be able to get away with something like >="W" for adjacent uncommon initials. Certes (talk) 16:07, 28 February 2024 (UTC)

List of old issues (per year) for articles in a WikiProject

I've been able to create some queries at Quarry but my SQL is a bit too rusty for some things. Using my basic query for listing all mainspace pages in a WikiProject, I'd like to take a category like Category:Clean-up categories from 2013 and two levels of subcategories' worth of pages and do an intersection with the initial query's results. I realize I could probably use PetScan but I'd like to list these and results for other years on a page that a bot fills in with Database Reports. Any assistance appreciated! Stefen Towers among the rest! GabGruntwerk 03:14, 8 March 2024 (UTC)

@StefenTower quarry:query/81027. Change the two WHERE statements in the CTEs as needed and let me know if you encounter any issues. Uhai (talk) 06:42, 8 March 2024 (UTC)
Uhai Wow, it works perfectly - thanks! I only needed to make a couple of minor adjustments for my use. Check out the results here. I'm excited I won't need to manually keep up this data. Now if I could wrap my head around the SQL (I never had to use RECURSIVE before). :) Best, Stefen Towers among the rest! GabGruntwerk 08:28, 8 March 2024 (UTC)
No problem, glad to help! Uhai (talk) 11:16, 8 March 2024 (UTC)

Edit protected-move unprotected pages

Do any exist? Can they be made? Why might they be made? Geardona (talk to me?) 06:57, 15 March 2024 (UTC)

Respectively: Lots. Clearly. I suspect mostly carelessness. —Cryptic 07:16, 15 March 2024 (UTC)
Wow, thanks for the fast response! Geardona (talk to me?) 17:29, 15 March 2024 (UTC)

Category:Green Bay Packers Hall of Fame by size and WikiProject rating

I kindly request a query of Category:Green Bay Packers Hall of Fame to produce a wikitable with three columns:

  • Article name: linked name of the article
  • Article size: by bytes
  • WikiProject rating: stub, start, C, etc

The table would look something like (but for all 160+ articles in the cat):

Name Size (bytes) Rating
Bob Monnett 2,347 Stub

This is to help me prioritize important articles under WP:PACKERS. Thank you! « Gonzo fan2007 (talk) @ 18:39, 21 March 2024 (UTC)

@Gonzo fan2007: Does User:Certes/Reports/Green Bay Packers do what you need? Certes (talk) 19:33, 21 March 2024 (UTC)
Certes, perfect! Thank you so much. « Gonzo fan2007 (talk) @ 19:52, 21 March 2024 (UTC)
There's a link top right to update the report if the situation changes (e.g. you do a lot of reassessments). Certes (talk) 20:06, 21 March 2024 (UTC)
This is great, I appreciate it! « Gonzo fan2007 (talk) @ 20:27, 21 March 2024 (UTC)
Thanks also from me - I can make use of the code for showing the rating. I had to take out "{{!}}^[FG]A-Class_" from the REGEXP string to make it work, though. I'm not sure what that part does. Stefen Towers among the rest! GabGruntwerk 06:12, 22 March 2024 (UTC)
That bit picks up the FA and GA categories which are not biographies, like Talk:Wilner Burke which is only in GA-Class Wisconsin articles. Without it, the rating comes out blank. The {{!}} produces a "|" for regexp alternation; the template prevents it from delimiting the |sql= parameter. If you're copying the SQL into Quarry or some other SQL client that doesn't parse wikitext, you'll need to change {{!}} to | manually. Certes (talk) 10:34, 22 March 2024 (UTC)
<nowiki> should work in {{database report}} now, which would make things clearer. —Cryptic 16:35, 22 March 2024 (UTC)
Thanks. I already knew about {{!}} from all the template work I've done and alternation from all the RegEx I've written. It was the rest that confused me, esp. why there would be interest in just FA/GA outside of bios. At any rate, that part wasn't necessary for my purposes for listing articles that are definitely within one wikiproject. Stefen Towers among the rest! GabGruntwerk 16:39, 22 March 2024 (UTC)

Fetch a subset of log comments from the last year

quarry:query/81887. The part of my query that limits things to the last year (log_id > subquery) is running too slow and is causing it to timeout. Any suggestions to speed it up? The query works fine when I hard-code a log_id like I did in quarry:query/81844. But I'd prefer the query to be more dynamic than just hard-coding a log_id. Thanks. –Novem Linguae (talk) 13:55, 9 April 2024 (UTC)

Why not just do "and log_timestamp > TIMESTAMP" directly? That's what I do at Wikipedia:Database reports/Possibly out-of-process deletions and it seems to work. * Pppery * it has begun... 14:04, 9 April 2024 (UTC)
quarry:query/81893. Good suggestion, thank you. That's good for code readability. But the query is still timing out if anyone has any suggestions. –Novem Linguae (talk) 23:50, 9 April 2024 (UTC)
quarry:query/81896. Using the indexed views logging_logindex and actor_logging sped things up a lot. As a side note, since text fields in the replicas are either VARBINARY or BLOB type, the LIKE clause is in fact case-sensitive. If you first convert the fields to UTF-8 then LIKE works as expected. Also, unless you want something other than oldest-first, there's no reason to sort by log_id ascending, since that's already the clustered index of the table. Uhai (talk) 03:25, 10 April 2024 (UTC)
Bah. I always forget to add those pesky alternative views. Thank you very much for catching that, and for the other tips. –Novem Linguae (talk) 06:10, 10 April 2024 (UTC)
The specialized views helped a little, but what made the real difference here was adding "log_type = 'block'". log_action is a subindex of that; trying to filter by that without log_type is like searching a paper dictionary for words where the second letter is "r". Compare EXPLAIN output without and with the change, and see Database index#Column order. —Cryptic 16:07, 10 April 2024 (UTC)
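A sketch of the fixed shape (the filter values are placeholders, not the query under discussion):

SELECT log_id, log_timestamp, log_title, comment_text
FROM logging_logindex
JOIN comment_logging ON comment_id = log_comment_id
WHERE log_type = 'block'   -- without this, the (log_type, log_action) index can't be used
  AND log_action = 'block'
  AND log_timestamp >= DATE_FORMAT(NOW() - INTERVAL 1 YEAR, '%Y%m%d%H%i%s')
  AND comment_text LIKE '%proxy%'   -- placeholder comment filter
LIMIT 100;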

Women in Red

I'm not exactly requesting a query, but help with one we already have: Wikipedia:WikiProject_Women_in_Red/Redlinks/C2. This version ran in 37 seconds. Some time between 20 and 25 February, it stopped working, instead throwing SQL Error: ER_DATA_TOO_LONG: Data too long for column 'subcat' at row 2. This is because it initially populates a table with one row, a fairly short category name, then attempts to add longer names. This used to work but something in the database or engine must have changed, as the same SQL now fails. I've fixed it by seeding the table with a long dummy value which doesn't affect the results, but that version takes nearly two hours to run. Does anyone know what changed? Is part of the forthcoming categorylinks "upgrade" causing the failure and wrecking performance? (A simple search for a link to details of that change timed out too – perhaps there is a more general temporary performance problem but VPT has nothing, and performance was just as bad last night.) Certes (talk) 15:24, 24 March 2024 (UTC)

I poked at this a little when you first asked it, and some more today, and I wasn't able to come up with a version that avoided a full table scan on pagelinks - taken in isolation, the CTE completes instantly, and I can do something like WITH names AS ( /* the long cte */ ) SELECT COUNT(*) FROM names JOIN pagelinks ON pl_namespace = 0 AND pl_title = name GROUP BY name, but every variant of "pl_title starts with name" I could come up with - name=SUBSTRING_INDEX like you used, pl_title LIKE CONCAT(name, '%'), LEFT(pl_title, LENGTH(name)) = name, and so on - was unindexed. Which is why your query taken as a whole is looking through all of page first, when I suspect it was able to make use of the Cn..D range limit before. Irritatingly, putting the CTE results directly in the query instead like quarry:query/81913 does work, but I don't know a workaround to force use of the index when we don't have access to the real pagelinks table, just a view. (And no, I don't know what changed.)
Something like SELECT CAST('Feminine_given_names' AS VARCHAR(256)), 0 would be cleaner than your long 'This category does not exist but...' dummy value, but obviously that's not the real problem here.
What, precisely, are you trying to do? Maybe we can find another way to do it? —Cryptic 17:50, 10 April 2024 (UTC)
I was listing frequently used redlinks which resemble women's names, so that the editors at Women in Red can consider writing articles on them. Of course, this is heuristic and contains false positives such as Ms May Rebellion, but they found it helpful and have created many wanted articles from it. I split it by initial letter to prevent timeouts by allowing index use, and to avoid overwhelming the authors by requesting hundreds of articles at once. Further details: Wikipedia:WikiProject Women in Red/Redlist index#Alphabetical (most frequently redlinked), User talk:Certes#WIR Redlist problem, User talk:Certes/Archive 9#Frequently redlinked women. Certes (talk) 19:05, 10 April 2024 (UTC)
Is WHERE pl_from_namespace IN (0, 10) ... HAVING MAX(pl_from_namespace) = 0 meant only to exclude titles linked from any template? The query completes quickly if the first half of that is changed to WHERE pl_from_namespace = 0. Explain output for query 81446 as written (sql-optimizer seems to choke on queries with ctes):
+------+-----------------+---------------+--------+------------------------------------------------------------------------------+-----------------+---------+---------------------------------+----------+-----------------------------------------------------------+
| id   | select_type     | table         | type   | possible_keys                                                                | key             | key_len | ref                             | rows     | Extra                                                     |
+------+-----------------+---------------+--------+------------------------------------------------------------------------------+-----------------+---------+---------------------------------+----------+-----------------------------------------------------------+
|    1 | PRIMARY         | page          | index  | PRIMARY,page_name_title                                                      | page_name_title | 261     | NULL                            | 57680411 | Using where; Using index; Using temporary; Using filesort |
|    1 | PRIMARY         | pagelinks     | ref    | PRIMARY,pl_namespace,pl_backlinks_namespace,pl_backlinks_namespace_target_id | PRIMARY         | 8       | enwiki.page.page_id,const       |        4 | Using where                                               |
|    1 | PRIMARY         | <derived5>    | ref    | key0                                                                         | key0            | 258     | func                            |        1 | Using where                                               |
|    1 | PRIMARY         | page          | eq_ref | page_name_title                                                              | page_name_title | 261     | const,enwiki.pagelinks.pl_title |        1 | Using index                                               |
|    5 | DERIVED         | <derived2>    | ALL    | NULL                                                                         | NULL            | NULL    | NULL                            |        2 | Using where; Using temporary                              |
|    5 | DERIVED         | categorylinks | ref    | PRIMARY,cl_timestamp,cl_sortkey                                              | cl_sortkey      | 258     | deepcat.subcat,const            |       19 | Using where; Using index                                  |
|    5 | DERIVED         | page          | eq_ref | PRIMARY,page_name_title                                                      | PRIMARY         | 4       | enwiki.categorylinks.cl_from    |        1 | Using where                                               |
|    2 | DERIVED         | NULL          | NULL   | NULL                                                                         | NULL            | NULL    | NULL                            |     NULL | No tables used                                            |
|    3 | UNION           | NULL          | NULL   | NULL                                                                         | NULL            | NULL    | NULL                            |     NULL | No tables used                                            |
|    4 | RECURSIVE UNION | <derived2>    | ALL    | NULL                                                                         | NULL            | NULL    | NULL                            |        2 | Using where                                               |
|    4 | RECURSIVE UNION | categorylinks | ref    | PRIMARY,cl_timestamp,cl_sortkey                                              | cl_sortkey      | 258     | deepcat.subcat,const            |       19 | Using where; Using index                                  |
|    4 | RECURSIVE UNION | page          | eq_ref | PRIMARY,page_name_title                                                      | PRIMARY         | 4       | enwiki.categorylinks.cl_from    |        1 | Using where                                               |
| NULL | UNION RESULT    | <union2,3,4>  | ALL    | NULL                                                                         | NULL            | NULL    | NULL                            |     NULL |                                                           |
+------+-----------------+---------------+--------+------------------------------------------------------------------------------+-----------------+---------+---------------------------------+----------+-----------------------------------------------------------+

and the same for WHERE pl_from_namespace = 0:

+------+-----------------+---------------+--------+------------------------------------------------------------------------------+------------------------+---------+---------------------------------+----------+-----------------------------------------------------------+
| id   | select_type     | table         | type   | possible_keys                                                                | key                    | key_len | ref                             | rows     | Extra                                                      |
+------+-----------------+---------------+--------+------------------------------------------------------------------------------+------------------------+---------+---------------------------------+----------+-----------------------------------------------------------+
|    1 | PRIMARY         | pagelinks     | range  | PRIMARY,pl_namespace,pl_backlinks_namespace,pl_backlinks_namespace_target_id | pl_backlinks_namespace | 265     | NULL                            | 50732398 | Using where; Using index; Using temporary; Using filesort |
|    1 | PRIMARY         | <derived5>    | ref    | key0                                                                         | key0                   | 258     | func                            |        1 | Using where                                               |
|    1 | PRIMARY         | page          | eq_ref | page_name_title                                                              | page_name_title        | 261     | const,enwiki.pagelinks.pl_title |        1 | Using index                                               |
|    1 | PRIMARY         | page          | eq_ref | PRIMARY,page_name_title                                                      | PRIMARY                | 4       | enwiki.pagelinks.pl_from        |        1 | Using where                                               |
|    5 | DERIVED         | <derived2>    | ALL    | NULL                                                                         | NULL                   | NULL    | NULL                            |        2 | Using where; Using temporary                              |
|    5 | DERIVED         | categorylinks | ref    | PRIMARY,cl_timestamp,cl_sortkey                                              | cl_sortkey             | 258     | deepcat.subcat,const            |       19 | Using where; Using index                                  |
|    5 | DERIVED         | page          | eq_ref | PRIMARY,page_name_title                                                      | PRIMARY                | 4       | enwiki.categorylinks.cl_from    |        1 | Using where                                               |
|    2 | DERIVED         | NULL          | NULL   | NULL                                                                         | NULL                   | NULL    | NULL                            |     NULL | No tables used                                            |
|    3 | UNION           | NULL          | NULL   | NULL                                                                         | NULL                   | NULL    | NULL                            |     NULL | No tables used                                            |
|    4 | RECURSIVE UNION | <derived2>    | ALL    | NULL                                                                         | NULL                   | NULL    | NULL                            |        2 | Using where                                               |
|    4 | RECURSIVE UNION | categorylinks | ref    | PRIMARY,cl_timestamp,cl_sortkey                                              | cl_sortkey             | 258     | deepcat.subcat,const            |       19 | Using where; Using index                                  |
|    4 | RECURSIVE UNION | page          | eq_ref | PRIMARY,page_name_title                                                      | PRIMARY                | 4       | enwiki.categorylinks.cl_from    |        1 | Using where                                               |
| NULL | UNION RESULT    | <union2,3,4>  | ALL    | NULL                                                                         | NULL                   | NULL    | NULL                            |     NULL |                                                           |
+------+-----------------+---------------+--------+------------------------------------------------------------------------------+------------------------+---------+---------------------------------+----------+-----------------------------------------------------------+

The first doesn't look like it'll look through so many more rows, but it's definitely taking forever, and the table order makes less sense. —Cryptic 20:03, 10 April 2024 (UTC)

quarry:query/81916 if my reading of your intention above was correct. —Cryptic 20:13, 10 April 2024 (UTC)
And quarry:query/81918 is a more (ahem) straightforward fix. —Cryptic 20:29, 10 April 2024 (UTC)
  Fixed @Cryptic: Thank you so much. So it needs a STRAIGHT_JOIN rather than a JOIN, to force the tables to be processed in the suggested order. I've put that into C2 and it works, so I'll update the other queries.
I've also neatened the table creation as you suggest. I had something similar originally but replaced it by a simpler kludge, because the CAST was the only code I'd changed when the timing went off and I suspected it was causing the problem, perhaps by making columns to be joined have incompatible types. Certes (talk) 20:54, 10 April 2024 (UTC)
...and yes, HAVING is to exclude people in templates, who might be linked from hundreds of articles just because their name appears in one navbox. Certes (talk) 20:56, 10 April 2024 (UTC)
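A minimal sketch of the STRAIGHT_JOIN fix in isolation (a toy name list; the real queries build the list recursively from categories, and in practice the name column may need a CAST to a binary-compatible type, as discussed above):

WITH names AS (
    SELECT 'Maria' AS name
    UNION ALL
    SELECT 'Sophia'
)
SELECT name, COUNT(*) AS uses
FROM names
-- STRAIGHT_JOIN forces left-to-right join order: read the small name list
-- first, then probe pagelinks, instead of letting the optimizer scan pagelinks
STRAIGHT_JOIN pagelinks
    ON pl_namespace = 0
   AND pl_title = name
GROUP BY name;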
I'd bet money that what changed is just that pagelinks is incrementally bigger. The optimizer's wildly overestimating the cost of scanning it, so that looking at page AS Pf became more attractive. Removing namespace 10 helped not because the database horrifically couldn't use two disjoint ranges anymore, but because it lowered the estimated cost just enough to scan it first. What confused me last month is that I thought it was looking at page AS Pt instead, and that it was primarily the cte that was meant to cut the number of rows down. —Cryptic 21:06, 10 April 2024 (UTC)
That sounds very plausible. I remember from my SQL-writing days decades ago that query plans would suddenly tip over from the desired order to something unhelpful when data sizes changed. I've applied the fix to the similar reports with other initials, rerun the broken ones and everything's now working. Thanks again. Certes (talk) 21:20, 10 April 2024 (UTC)
I've also updated the first chapter of Men in Red, though currently no one seems interested in its output. There's a biologist there with over 100 links (from actual articles, not navboxes) and articles in French, German and Spanish. Certes (talk) 15:21, 11 April 2024 (UTC)

@Cryptic: Just as we got this working, the WMF is about to improve the pagelinks table by removing the useful columns. I've tried updating a query but of course it now times out. Any further help would be very welcome. Certes (talk) 20:45, 18 April 2024 (UTC)

On Toolforge, I got results for quarry:query/82121, a straightforward update from 81918 above, in 4 seconds, and the query from your diff in 4 minutes 37. They had identical query plans, modulo the extra union from 'This category does not exist...', except that the first estimated 1743728 rows from linktarget (index lt_namespace_title) and the second 3057442. If SDZeroBot's timing out, I'd try with a narrower title range. —Cryptic 21:50, 18 April 2024 (UTC)
Thanks for investigating. We may be able to complete this task before the columns we need are dropped. If not then I'll take your advice. Certes (talk) 15:12, 19 April 2024 (UTC)

Redirects connected to a Wikidata item that target a disambiguation page

I attempted this in quarry:query/82243, but I may well have missed things that would make it more efficient. I’m wondering if the time it takes to run is purely a symptom of having to look through two very large categories, or if anyone’s aware of things that might improve it. All the best :) ‍—‍a smart kitten[meow] 18:14, 24 April 2024 (UTC)

The GROUP_CONCAT() followed by FIND_IN_SET() is going to perform a lot worse than just joining the second category - quarry:query/82254. —Cryptic 20:10, 24 April 2024 (UTC)
Ah, thank you! I don't know why that didn't occur to me. {{self-trout}} ‍—‍a smart kitten[meow] 20:48, 24 April 2024 (UTC)

Check for company articles that use both Parent and Owner field.

I'm wondering if it would be possible to check for articles that use both the Parent and Owner field in infobox company. The reason being, I have noticed that many articles incorrectly use the infobox company template to show higher-level ownership. This is not allowed per the infobox documentation. I have fixed this on most pages. However, there are many that I cannot find. WiinterU 22:45, 27 April 2024 (UTC)

@WiinterU: Quarry can't do that, because it doesn't have access to the wikitext. This search should catch most of the offenders but will contain a lot of false positives, such as articles with parent= and owner= in different infoboxes. Certes (talk) 08:35, 28 April 2024 (UTC)
Okay, thank you. WiinterU 12:31, 28 April 2024 (UTC)
(edit conflict) Hah. I typed up almost the same thing as Certes, so will defer to their great answer. –Novem Linguae (talk) 09:05, 28 April 2024 (UTC)

Query to get all articles without inline citations

Can somebody here help me write a query for all articles that don't have <ref> or {{sfn or {{sfnm or {{harvnb in the article's wikicode? Here is the context of the discussion: User_talk:CactiStaccingCrane#Untagged_unreferenced_articles CactiStaccingCrane (talk) 13:41, 4 May 2024 (UTC)

I don't believe occurrences of <ref> are stored in any table to which Quarry has access. Certes (talk) 20:09, 4 May 2024 (UTC)
Agreed. According to my sandbox, ref tags aren't placed into any categories, so we can't search the categorylinks table either. The others that OP mentioned can be queried via the templatelinks table. Or, for new pages, you can use Special:NewPagesFeed, and tick the filter for "have no citations", which should detect most of these but maybe not all (I forget which ones). WP:PETSCAN could also be useful since I think it lets you search by templates and/or categories. OP, what direction would you like to go in with this? –Novem Linguae (talk) 20:30, 4 May 2024 (UTC)
It's also possible to query for use of {{Reflist}} and similar, though many articles use a bare <references /> tag. Certes (talk) 20:43, 4 May 2024 (UTC)
We could generate a list of pages without any external links at all, or no external links except to specific sites like imdb. That's functionally the same thing - there's no real difference between a formally unreferenced page that has its subject's official site properly placed in ==External links==, and one that plugs it into ref tags instead. —Cryptic 22:10, 4 May 2024 (UTC)
Or pages that don't transclude anything in the Category:Citation templates tree. Which will miss bare <ref>[external link]</ref>s, of course; I suppose we could mitigate by also excluding pages in Category:All articles with bare URLs for citations, and tag the remaining false positives as they're noticed. —Cryptic 22:19, 4 May 2024 (UTC)
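A minimal sketch of the "no external links at all" variant (slow over all of mainspace, and subject to the caveat above that an official site in ==External links== counts the same as one in ref tags):

SELECT page_title
FROM page
LEFT JOIN externallinks ON el_from = page_id
WHERE page_namespace = 0
  AND page_is_redirect = 0
  AND el_from IS NULL          -- no externallinks rows at all
LIMIT 100;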
You're looking at quite a lot of pages. For example, there are about 30,000 with titles beginning with A. [2]. Many of them have a References section with plausible looking citations but no inline refs. Certes (talk) 22:56, 4 May 2024 (UTC)
CactiStaccingCrane, I wonder if it would be easier to start with the known 50,000 articles that transclude {{no footnotes}}? Once editors are done with those, come on back and we'll figure out how to find more. – Jonesey95 (talk) 03:39, 5 May 2024 (UTC)
The linked discussion shows the original user was specifically looking for unreferenced pages that aren't already tagged. (I can't imagine why, unless the intent is merely to tag them without actually looking for sources, but whatever.) —Cryptic 07:13, 5 May 2024 (UTC)
Yeah that's exactly the point. Both of us are on WP:WikiProject Unreferenced articles and we want to tag all articles that don't have citations before launching another drive around June-July. CactiStaccingCrane (talk) 09:09, 5 May 2024 (UTC)

ListeriaBot issue

I already have a query, albeit SPARQL, but been running into this issue and I thought somebody could help here. Thanks! Assem Khidhr (talk) 05:11, 12 May 2024 (UTC)

You might have better luck asking at d:WD:RAQ. —Cryptic 12:09, 16 May 2024 (UTC)

Help with expanding the self-transcluded-templates report

Hey. I was wondering if this is even possible and if someone here might be able to help with it. The query in User:Jonesey95/self-transcluded-templates collects templates that either have zero transclusions or are only transcluded on the template itself. I'd like to extend this to also cover the sub-pages of the template.

The new logic would be:

  • Get all templates with number of transclusions less than 6 (5 valid transclusions are: main, talk, /sandbox, /doc, /testcases - of the same template).
  • Keep in the report all templates that have all of their transclusions only on one of the above pages.

Is this possible? Gonnym (talk) 09:17, 29 May 2024 (UTC)

What you ask for in your first paragraph and in your bulleted list aren't the same things, or even terribly similar. Also, we could literally do what you're asking for in your bulleted list - first generate a list of templates with 0-5 transclusions, and then cull that - but it would be complex, quite likely very slow, and possibly wouldn't do quite what you want. Generally it's a better idea to ask for what you actually want, rather than how to get it, and that's especially true with SQL: since it's primarily a descriptive language, describing the result is what the query-writer does too. (At least, until something goes wrong.)
What I'm guessing you're really after is to exclude transclusions on
  1. the template's own talk page, and
  2. either
    1. the template's own "/sandbox", "/doc", or "/testcases" subpages, or
    2. all of the template's own subpages.
Which of B1 or B2 are you after? —Cryptic 17:01, 29 May 2024 (UTC)
"What you ask for in your first paragraph and in your bulleted list aren't the same things" - I know... The first paragraph is what we currently have at User:Jonesey95/self-transcluded-templates. What I asked in the bulleted list is what I hoped we could modify it to.
Regarding your second question, B2. Exclude all of a template's own subpages (and talk page). So the finale result would be a database report with templates that have "zero" transclusions (but might have actual transclusions on their own pages). Gonnym (talk) 17:09, 29 May 2024 (UTC)
If I've understood correctly: as well as the existing condition tl_from <> page_id, tl_from also has to differ from the page ids of all subpages and any talk page. Certes (talk) 17:23, 29 May 2024 (UTC)
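In rough SQL, that condition looks something like the sketch below (not the final report query; it assumes the normalized templatelinks/linktarget layout, uses a correlated subquery for readability despite the performance advice earlier on this page, and LIKE will treat underscores in titles as wildcards):

SELECT tmpl.page_title
FROM page AS tmpl
WHERE tmpl.page_namespace = 10
  AND NOT EXISTS (
      SELECT 1
      FROM linktarget
      JOIN templatelinks ON tl_target_id = lt_id
      JOIN page AS src ON src.page_id = tl_from
      WHERE lt_namespace = 10
        AND lt_title = tmpl.page_title
        -- a transclusion counts unless it comes from the template itself,
        -- its talk page, or one of its own subpages
        AND NOT (src.page_namespace IN (10, 11)
                 AND (src.page_title = tmpl.page_title
                      OR src.page_title LIKE CONCAT(tmpl.page_title, '/%'))));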
My knowledge of SQL is limited so can't answer that. Gonnym (talk) 17:34, 29 May 2024 (UTC)
(I meant the "I'd like to extend this to also the sub-pages of the template." sentence.)
Do these results look right? —Cryptic 17:49, 29 May 2024 (UTC)
No. The transclusion check is meant to reduce the number of valid transclusions to zero. So, for example, if a template is transcluded only on itself, then it should be on the report. Template:Article length bar/L0 was removed, but it shouldn't have been, as it's unused other than on itself. Gonnym (talk) 17:53, 29 May 2024 (UTC)
Yeah, just realized that I was only removing items from the list that already showed no transclusions. —Cryptic 17:55, 29 May 2024 (UTC)
I've created Template:Test SQL report to help test this. It should appear on the report. Gonnym (talk) 17:59, 29 May 2024 (UTC)
Take 2. —Cryptic 20:12, 29 May 2024 (UTC)
Looks great! I've browsed it and everything looks good. I'll take a deeper dive into it, and if I find something I'll let you know, but so far it works as requested. Thank you! Gonnym (talk) 20:14, 29 May 2024 (UTC)

How do the patrol/pagetriage-curation logs really work?

Does anyone really know, definitively, what the distinction between (log_type = 'patrol' AND log_action = 'patrol') and (log_type = 'pagetriage-curation' AND log_action IN ('reviewed', 'reviewed-article', 'reviewed-redirect')) is? In particular: why do you variously get one, the other, or both, even for patrols by the same person within minutes of each other (example: both, pagetriage only, patrol only); exactly which on-wiki actions result in those cases; and is there a good way to distinguish genuine multiple reviews of the same title without doublecounting the various log_type/log_action combinations (perhaps group by log_page)?

(Context is Wikipedia:Requests for adminship/Elli#General comments, starting with User:Hey man im josh's comment at 21:42, 31 May 2024; my best guesses for the answers are in the description of quarry:query/83443.) —Cryptic 00:05, 1 June 2024 (UTC)
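For what it's worth, one crude way to collapse the paired patrol/pagetriage rows while still counting genuine re-reviews of the same title is to group on log_page plus a truncated timestamp - on the unverified assumption that the paired entries land within the same minute, which is exactly the kind of thing the question above is asking about. A sketch ('Example' is a placeholder username):

SELECT COUNT(DISTINCT log_page, LEFT(log_timestamp, 12)) AS reviews
  -- LEFT(..., 12) = YYYYMMDDHHMM, so same-page rows in the same minute count once;
  -- rows with a NULL log_page drop out of the count
FROM logging_userindex
WHERE log_actor = (SELECT actor_id FROM actor
                   WHERE actor_name = 'Example')   -- placeholder
  AND ((log_type = 'patrol' AND log_action = 'patrol')
       OR (log_type = 'pagetriage-curation'
           AND log_action IN ('reviewed', 'reviewed-article', 'reviewed-redirect')));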

Wikipedia:New pages patrol#Patrol versus review is my attempt to document this. It is confusing. I think the native patrol system has some unintuitive behaviors that I haven't quite wrapped my head around yet and that make it not correspond 1:1 to reviewing (maybe autoreviewing in certain cases, maybe being revision-based instead of page-based; I'm not exactly sure, but I sense some differences). Also, the original PageTriage authors tried to keep the patrol and review logs in sync but missed some cases, such as page moves, so there are some bugs: phab:T346215, phab:T337356. Finally, PageTriage reviewing only occurs in the main namespace (and formerly the user namespace), whereas native patrolling with the "[Mark this page as patrolled]" link can occur in any namespace.
When evaluating an NPP's number of reviews, one should use the pagetriage-curation log exclusively because it corresponds to the NPP clicking the reviewed button in the Page Curation toolbar. The patrol log is less accurate. When the Page Curation toolbar is open, as it is for most NPPs unless they close it, the "[Mark this page as patrolled]" link is hidden.
Confusingly, XTools only displays the patrol count, not the reviewed count. quarry:query/70425 is an old query I wrote to get a count of a person's reviews only. I think I'll go file an XTools ticket to use "reviews" instead of "patrols" for enwiki. phab:T366397
One other thing about 'reviewed', 'reviewed-article', and 'reviewed-redirect': it used to be only 'reviewed'; then we got rid of 'reviewed' and split it into 'reviewed-article' and 'reviewed-redirect' for easier querying. This split happened about a year ago. phab:T349048. –Novem Linguae (talk) 05:08, 1 June 2024 (UTC)
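In that spirit, a count from the pagetriage-curation log alone looks something like this (a sketch, not a copy of quarry:query/70425; 'Example' is a placeholder username, and covering all three action names keeps it working across the phab:T349048 split):

SELECT COUNT(*) AS reviews
FROM logging_userindex
WHERE log_type = 'pagetriage-curation'
  AND log_action IN ('reviewed', 'reviewed-article', 'reviewed-redirect')
  AND log_actor = (SELECT actor_id FROM actor
                   WHERE actor_name = 'Example');  -- placeholder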

Page table slow query

Any way to speed up this page table query? CirrusSearch does it quickly, but presumably Elasticsearch has its own indexing or something that makes it fast. –Novem Linguae (talk) 18:08, 15 June 2024 (UTC)

Yes, CirrusSearch has full-text indices on page titles, while the main db only has normal (prefix) ones.
Using the page_name_title index speeds this sort of query up considerably, since titles not matching '%.js' can be filtered out there (even though every title still needs to be looked at, the whole row won't need to be fetched). There's no way to force that, though, since we only have access to views. Sometimes you can fool the optimizer into using the index by saying which namespaces you want instead of the ones you don't, even if the list is long - something like WHERE page_namespace IN (0,1,3,4,5,6,7,9,10,11,12,13,14,15,100,101,118,119,710,711,828,829) - but that doesn't work here. So you're going to have to partition it into several (well, lots of) queries that each look small enough that the optimizer uses the index instead of a full table scan: WHERE page_namespace = 0 AND page_title < 'M', then WHERE page_namespace = 0 AND page_title >= 'M', and so on for the other possible namespaces. —Cryptic 00:38, 22 June 2024 (UTC)
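Concretely, each partition would look something like this (a sketch; it assumes the original query is after titles ending in '.js', and the split at 'M' is arbitrary - cut finer if a partition still triggers a full scan):

SELECT page_namespace, page_title
FROM page
WHERE page_namespace = 0          -- repeat per namespace
  AND page_title < 'M'            -- and again with page_title >= 'M'
  AND page_title LIKE '%.js';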
An alternate approach: since you know that there won't be many titles ending in '.js' except in userspace and mediawiki:, find the titles yourself by grepping enwiki-20240601-all-titles.gz from a dump. Link to all of them from a page in your userspace; then you can use pagelinks to find them in a query and check page_content_model and page_restrictions. —Cryptic 04:19, 22 June 2024 (UTC)
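Once such a link page exists, the lookup itself is cheap. A sketch, assuming User:Example/Js titles as the (placeholder) list page, the pre-pl_target_id pagelinks columns that were still on the replicas at the time, and reading protection from the page_restrictions table:

SELECT p.page_namespace, p.page_title, p.page_content_model,
       pr_type, pr_level          -- NULL if unprotected
FROM pagelinks
JOIN page AS p ON p.page_namespace = pl_namespace
              AND p.page_title = pl_title
LEFT JOIN page_restrictions ON pr_page = p.page_id
WHERE pl_from = (SELECT page_id FROM page
                 WHERE page_namespace = 2                 -- User:
                   AND page_title = 'Example/Js_titles'); -- placeholder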