Wikipedia talk:Dump reports
Background information
This is sort of similar to Wikipedia:WikiProject Check Wikipedia, except the scope is a bit broader. "Dump reports" is modeled after "Database reports"; users can request reports of instances of particular strings in pages here.
Unlike database reports, these reports will almost always be (at least slightly) out of date due to the nature of database dumps and they will be run (at least initially) by hand, without any scheduled refresh.
The dumps being scanned are of the current wikitext of pages on the English Wikipedia. When requesting a report, there are four components you need to specify (a rough sketch of such a scan follows the list):
- what string or strings you're looking for;
- whether the string should be case sensitive or include any other type of regular expression;
- what namespaces you want to search;
- and, whether you want to list all instances of the string (multiple rows per title) or all titles that contain the string (one row per title).
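To make the request format concrete, here is a minimal sketch of what such a dump scan might look like. It is purely illustrative: the dump filename, the use of Python's standard XML parser, the example pattern, and the namespace handling are all assumptions rather than the actual report code.

import re
import xml.etree.ElementTree as ET

# Illustrative settings only; these mirror the four components listed above.
DUMP = 'enwiki-pages-articles.xml'                 # current-wikitext dump (assumed name)
PATTERN = re.compile(r'<center>', re.IGNORECASE)   # the string(s) being looked for
NAMESPACES = {0}                                   # namespaces to search; empty set = all
PER_TITLE = True                                   # one row per title vs. one row per instance

NS = '{http://www.mediawiki.org/xml/export-0.10/}' # schema URI varies by dump version

for _, elem in ET.iterparse(DUMP, events=('end',)):
    if elem.tag != NS + 'page':
        continue
    ns = int(elem.findtext(NS + 'ns', default='0'))   # older dumps may lack <ns>
    title = elem.findtext(NS + 'title')
    text = elem.findtext('%srevision/%stext' % (NS, NS)) or ''
    if not NAMESPACES or ns in NAMESPACES:
        matches = PATTERN.findall(text)
        if matches and PER_TITLE:
            print(title)                              # one row per title
        elif matches:
            for match in matches:
                print('%s\t%s' % (title, match))      # one row per instance
    elem.clear()                                      # keep memory flat on a big dump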
Report ideas
Note: The internal search function is fairly decent for basic text string searches. Dump reports are for strings that are not indexed well by the internal search function. A few of the ideas below are sketched as example regexes after the list.
- <div/>
- <span/>
- <a>
- <hr>
- <center>
- "plainlinks"
- Trailing <br( /)?> (for paragraphs or lists)
- Lines starting with a manually-written number
- == Biography ==
- <ref></ref>
- spaced # and * lists
- articles containing e-mail addresses
- == bold header ==
- broken section anchor redirects
- Template:Spamsearch for spam searching...
- Large HTML comments (<!-- -->) in articles
- template parameters with trailing pipes
- ":''" at the beginning of articles (these should be templated!)
- Old table code at the beginning of articles (like this)
- Bullshit alt text (example)
- Dash issues
- " -- "
- XXXX-YYYY
- December-January
- "<monthname> 2-5" and "<monthname> 2-<monthname> 5"
- italic text and other button noise
- pages containing "•" and related characters
- <span class="wikiEditor-tab"></span>
- [[foo (bar)|]]
- * blah on/in/at YouTube --> can be turned into a template
- YouTube search URLs...
- bold terms that don't have redirects, of course
- ordinal date suffixen!!!
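As a rough illustration of how a few of the ideas above could be expressed, here are some candidate patterns. They are guesses at what the requesters meant rather than agreed definitions, and would need tuning before a real run.

import re

# Candidate patterns for a few of the ideas above; all of these encode assumptions.
IDEAS = {
    'trailing <br>':        re.compile(r'<br\s*/?>\s*$', re.IGNORECASE | re.MULTILINE),
    'manual numbering':     re.compile(r'^\d+[.)]\s', re.MULTILINE),
    'e-mail address':       re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+'),
    'large HTML comment':   re.compile(r'<!--.{500,}?-->', re.DOTALL),
    'bold section header':  re.compile(r"^==+\s*'''.*?'''\s*==+\s*$", re.MULTILINE),
}

def matching_ideas(wikitext):
    """Return the names of the ideas above that the page's wikitext matches."""
    return [name for name, pattern in IDEAS.items() if pattern.search(wikitext)]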
Botched MiszaBot settings
This is very low priority, so at your convenience.
Need to know what talk pages have the following arrangement:
|archiveheader = |
or
|archiveheader = }}
This is because the KingbotK plugin was molesting certain MiszaBot settings and this will fail the archiving. –xenotalk 21:09, 11 January 2010 (UTC)
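If this ever gets run, a single pattern could probably catch both arrangements; this is only a guess at the intent (nothing but spaces between the equals sign and the next pipe or closing braces), not a tested check.

import re

# A guess at the broken arrangement described above: an |archiveheader= whose value
# is empty and is immediately followed by a pipe or by the template's closing braces.
BROKEN_ARCHIVEHEADER = re.compile(r'\|[ \t]*archiveheader[ \t]*=[ \t]*(\||\}\})')

def has_broken_archiveheader(talk_wikitext):
    return BROKEN_ARCHIVEHEADER.search(talk_wikitext) is not None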
- One of the issues that "dump reports" has that "database reports" doesn't is that the report title is often worthless for "dump reports." If you want to find all instances of "</br>" and "<a>", there are a few choices for a title: "Pages containing </br> and <a>," "Code quirks 1," "Report 1," etc. But the issue is that < and > are invalid in page titles, so the first option won't work. A generic name or ID is often difficult for people to remember ("Oh, report 35 is the manually-created ordered lists!"). So I'm thinking about using a theme. Each report would essentially get a code name, which is easier to remember and could possibly even relate to the report contents (maybe). The theme needs to be expandable—Disney princes or the seven Dwarves have pretty fast-approaching limits. But there are plenty of other themes that could be used: Pokémon characters, Muppets, Simpsons characters, major cities, types of cheese, etc. I would also like to see it relate to (American) pop culture, if possible. Though don't let this stifle your creativity! Creativity is a major bonus here. (And you'd win the grand prize if you could tie in a "dump" theme.) Any suggestion for a suitably kick-ass theme is likely to get me working on this project much quicker. :-) --MZMcBride (talk) 21:23, 11 January 2010 (UTC)
- I came up blank =) The Simpsons seems to be a pretty popular target around these parts, though. –xenotalk 22:25, 18 January 2010 (UTC)
- If only you hadn't come up blank, this likely would've been resolved ages ago. The theme for Dump reports will be chemical element names. These have a number of advantages: they have a standardized and internationalized spelling, they do not contain spaces or other odd characters, they have a set order (which provides for a loose chronological indication at a quick glance), and there are enough of them that it will be a while before I run out and am forced to pick a different theme (or start making up element names). WoodwardBot will be coming out of retirement for the occasion. He'll be managing Dump reports while his partner (in the business sense) BernsteinBot manages Database reports. --MZMcBride (talk) 19:20, 14 April 2010 (UTC)
I assume this is still needed. Are there any examples of broken pages (not necessarily broken right now, but broken before March 12 or so)? Which namespaces are we talking about? (I guess I'm going to need a way to catch namespaces = () for 'all'....) --MZMcBride (talk) 15:04, 3 May 2010 (UTC)
- Could be any talk namespace, but I would assume they're mostly in (article) Talk:. Here is an example. I had actually forgotten about this; I would guess that most have been noticed by now, but may as well check just in case. –xenotalk 15:13, 3 May 2010 (UTC)
List of in-use 'fair-use' rationale templates
Asking here rather than at WP:DBR because I expect this to be a one-time request.
Basically, I'd like a list of templates that put preformatted fair-use rationales on File pages (possible search strings: "fair use", "fair-use", "rationale", etc.) so that 'rationale' templates not currently using {{Non-free use rationale}} in some manner can be migrated, or the entries re-worked, to avoid them showing up on a WP:DBR report that attempts to find images that have no rationale at all.
Files using {{logo rationale}} are already mostly migrated to {{Logo fur}} for example. Sfan00 IMG (talk) 11:36, 20 January 2010 (UTC)
- This sounds like what Non-free files missing a rationale (configuration) is. I just need to set up that database report to update regularly. --MZMcBride (talk) 15:12, 3 May 2010 (UTC)
Articles that use Ibid.
Could you please generate a list of titles that match the following regex:
\<\s?ref[^\>]*\>\s*(ibid(\.)?|op\.?\s?cit\.?|loc\.?\s?cit\.?)\s*\<\s?/\s?ref
Thanks!
Tim1357 (talk) 01:59, 31 January 2010 (UTC)
- Never mind, I got it. Tim1357 (talk) 07:42, 27 February 2010 (UTC)
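For anyone re-running this later: the pattern above is presumably meant to be applied case-insensitively (so that "Ibid." and "Op. cit." are caught as well); here is a minimal sketch of how it might be compiled and used, with that flag choice being an assumption.

import re

# The requester's pattern, compiled case-insensitively; the flag is an assumption.
IBID_REF = re.compile(
    r'\<\s?ref[^\>]*\>\s*(ibid(\.)?|op\.?\s?cit\.?|loc\.?\s?cit\.?)\s*\<\s?/\s?ref',
    re.IGNORECASE)

def uses_ibid(wikitext):
    return IBID_REF.search(wikitext) is not None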
Finding deprecated CSS
Could you produce a list of articles that are still using the prettytable and references-2column CSS classes? See MediaWiki talk:Common.css/to do and MediaWiki talk:Common.css/Archive 10#HiddenStructure and Prettytable. — Dispenser 22:01, 7 February 2010 (UTC)
- I'll run it if you give me some regex. Tim1357 (talk) 07:41, 27 February 2010 (UTC)
Looks like we're just talking about re.compile(r'(references-2column|prettytable)') where namespaces = (0)? --MZMcBride (talk) 15:02, 3 May 2010 (UTC)
Pages in Category:Disambiguation pages that have "NRHP" or "National Register of Historic Places" in wikitext
Per Wikipedia:BOTREQ#tagging of dab pages for wikiproject NRHP, and abbreviations. Thanks! –xenotalk 15:49, 4 March 2010 (UTC)
- What we need is that list, but omitting the ones where the talk page is already tagged for WikiProject NRHP / WikiProject National Register of Historic Places. The point is to find ones missing the wikiproject tag. Thanks! --doncram (talk) 20:59, 6 March 2010 (UTC)
- Doing... Tim1357 (talk) 23:22, 6 March 2010 (UTC)
- Ack! I forgot to detach the process from the terminal! In essence, I got half way done, and then closed the computer! That was stoopid. Tim1357 (talk) 02:14, 9 March 2010 (UTC)
- Hey, sorry that happened. There's no rush though, I am just glad to be getting something eventually. Thanks! --doncram (talk) 02:48, 9 March 2010 (UTC)
- Bump? –xenotalk 00:25, 12 March 2010 (UTC)
Is this still needed? --MZMcBride (talk) 15:00, 3 May 2010 (UTC)
- I think so. Archived without completion. –xenotalk 14:13, 4 May 2010 (UTC)
- Damn, I completely forgot. Running it now, on the January dump. You can track its progress here Tim1357 talk 16:57, 16 May 2010 (UTC)
- Done 752 matches found. Tim1357 talk 03:23, 17 May 2010 (UTC)
4shared links
It would be nice to have a way to scan the rest of Wikipedia for 4shared links that he's added, including links hidden behind piped display text, where the reader only sees the rendered label; for example,
[[test|something]]
appears as:
something
We recently enabled a filter to find these links, but it does not work retroactively, so I'm requesting a way to find them, if there are any left, so we can delete them. The search function doesn't cut it.— Dædαlus Contribs 06:58, 13 March 2010 (UTC)
- Can you use http://en.wiki.x.io/w/index.php?title=Special:LinkSearch&target=www.4shared.com ? –xenotalk 18:01, 15 March 2010 (UTC)
Is this still needed? If so, is there a particular string or group of strings you want to search for? --MZMcBride (talk) 15:00, 3 May 2010 (UTC)
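If this is still wanted, a wikitext pattern along these lines might catch the links regardless of how they are labelled; which subdomains and paths matter is an assumption taken from the domain named above.

import re

# Catches bare URLs, bracketed external links and citation parameters pointing at
# 4shared.com; the set of relevant subdomains/paths is assumed, not confirmed.
FOURSHARED = re.compile(r'https?://(?:[\w-]+\.)*4shared\.com[^\s\]|}<]*', re.IGNORECASE)

def fourshared_links(wikitext):
    return FOURSHARED.findall(wikitext)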
Pages with four or more WikiProject banners, no shell, and at least one heading
I would like a list of talk pages with four or more WikiProject banners (criterion 1), no shell (criterion 2), and at least one ==heading== (criterion 3). –xenotalk 18:38, 14 April 2010 (UTC)
Criterion 1
\{\{.*?(WikiProject|WP|\|[ ]class[ ]*=).*?\}\}.*?\{\{.*?(WikiProject|WP|\|[ ]class[ ]*=).*?\}\}.*?\{\{.*?(WikiProject|WP|\|[ ]class[ ]*=).*?\}\}.*?\{\{.*?(WikiProject|WP|\|[ ]class[ ]*=).*?\}\}
Criterion 2
\{\{[ ]*(Template:|)(W(iki|)p(roject|)[ ]*banner[ ]*s(hell|)|(WPBS|WPB|WBS|Shell)[ ]*\|)
Criterion 3
==
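A rough sketch of how the three criteria might be combined in a single scan; the patterns below are lightly regularized versions of the ones above and are an interpretation of the request, not the regexes that were actually run.

import re

# Lightly regularized versions of the three criteria above; treat them as assumptions.
BANNER = r'\{\{[^{}]*?(?:WikiProject|WP|\|\s*class\s*=)[^{}]*?\}\}'
FOUR_OR_MORE_BANNERS = re.compile(r'(?:%s.*?){4}' % BANNER, re.DOTALL | re.IGNORECASE)
SHELL = re.compile(
    r'\{\{\s*(?:Template:)?(?:W(?:iki)?p(?:roject)?\s*banner\s*s(?:hell)?|WPBS|WPB|WBS|Shell)\s*[|}]',
    re.IGNORECASE)
HEADING = re.compile(r'^==.*==\s*$', re.MULTILINE)

def needs_shell(talk_wikitext):
    """True when a talk page has 4+ banners, no shell, and at least one heading."""
    return (FOUR_OR_MORE_BANNERS.search(talk_wikitext) is not None
            and SHELL.search(talk_wikitext) is None
            and HEADING.search(talk_wikitext) is not None)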
- I imagine the first iteration of this will be rather large, so you can just store it somewhere before making it into an auto-refreshing report. –xenotalk 18:39, 14 April 2010 (UTC)
- Sounds like you're posting to the wrong page. --MZMcBride (talk) 18:51, 14 April 2010 (UTC)
- bah. you ignore that page like a red-headed stepchild. –xenotalk 18:55, 14 April 2010 (UTC)
- (cross-posted from WT:DBR) 18:56, 14 April 2010 (UTC)
- See the top-most section about filing requests (including namespaces to search, etc.). --MZMcBride (talk) 19:16, 14 April 2010 (UTC)
- Probably only needs to be in namespaces 1,3,7,11 and 101. –xenotalk 19:21, 14 April 2010 (UTC)
Is this still needed? --MZMcBride (talk) 14:59, 3 May 2010 (UTC)
- Tim gave me about 5000 of them. I haven't started working through them yet. So yes, and no. –xenotalk 14:12, 4 May 2010 (UTC)
Plus, MC, I think the dumps we have access to are only in Namespace:0. Tim1357 talk 17:34, 16 May 2010 (UTC)
- Huh? I don't know who "MC" is, who "we" are, or what the hell you're talking about regarding namespace limitations. --MZMcBride (talk) 17:35, 16 May 2010 (UTC)
- Sorry, MZMcBride: I'm pretty sure that the dumps that are on the toolserver are limited to the main article namespace. This is because when I started to do this dump scan, I did not get any pages that started with 'Talk:'. Tim1357 talk 17:40, 16 May 2010 (UTC)
- The dumps available on the Toolserver are whichever dumps you choose to download. You may be scanning a file with only certain namespaces, but dumps certainly exist that contain all namespaces. E.g., ls /mnt/user-store/dumps/enwiki/ --MZMcBride (talk) 17:42, 16 May 2010 (UTC)
That explains it. The only enwiki-p dump I was aware of was /mnt/user-store/dump/enwiki-20100130-pages-articles.xml. That might be why I didn't find any talk pages in the scan. Tim1357 talk 17:45, 16 May 2010 (UTC)
- Cool, I just changed the input filename on my original script, and it's working pretty well. I used different regexes, but the result seems good. All pages are saved here Tim1357 talk 18:17, 16 May 2010 (UTC)
- Either it hit some error or it finished; either way, 7617 talk pages should keep you busy. Tim1357 talk 23:52, 16 May 2010 (UTC)
- Is it just me or is there some wonky encoding going on there? "Talk:African-American Civil Rights Movement (1955–1968)" ? –xenotalk 15:36, 25 May 2010 (UTC)
- It's you (sort of). Set your character encoding to utf-8. The page isn't printing the appropriate headers to indicate to your browser to use utf-8. In Firefox, you'd set the page encoding by going to View --> Character Encoding --> utf-8. --MZMcBride (talk) 16:47, 25 May 2010 (UTC)
- That did the trick. Thanks =) –xenotalk 17:11, 25 May 2010 (UTC)
Yes, forgot to mention that. Tim1357 talk 00:42, 27 May 2010 (UTC)
Progress update
As some people may have noticed, I took a look at this over the weekend. I've written most of the necessary code for this now. The remaining issue is that the new code is taking considerably more time to run than the old code (140–180 minutes vs. 60–80 minutes), so some profiling needs to be done in order to figure out what the cause of the slowness is. Once that's completed, it should be possible to have fairly regular reports, though they will be manually run for the indefinite future, mostly due to the irregular output of the Wikimedia dumps. --MZMcBride (talk) 14:59, 3 May 2010 (UTC)
More succinctly, to do:
- profile code against old script and figure out where the slowness is;
- if len(namespaces) == 0, assume all titles? (assuming this code isn't what's slowing everything down); a possible shape for this check is sketched after this list;
- test output for xeno's report above, possibly create separate file/report;
- run output for Dispenser's report above;
- run output for spamsearch idea above.
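One possible shape for the empty-namespaces case in the second item, plus a quick way to do the profiling mentioned in the first; both are sketches against an imagined scan function, not the actual report code.

import cProfile

def page_wanted(page_namespace, namespaces):
    """An empty namespaces tuple means 'scan everything' (second item above)."""
    return len(namespaces) == 0 or page_namespace in namespaces

# For the first item, wrapping the existing entry point in cProfile shows where
# the extra 60-100 minutes are going; 'run_scan' is a placeholder name:
#   cProfile.run('run_scan()', sort='cumulative')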
Non-PD images that link to somewhere other than their description page
Would like a report of attribution-required files that are using link= to link somewhere other than their image description page (as this constitutes a breach of license). E.g. –xenotalk 17:23, 7 May 2010 (UTC)
- You want WT:DDR. I don't believe |link= is stored anywhere in the database. --MZMcBride (talk) 17:26, 7 May 2010 (UTC)
- Thanks, moved here. –xenotalk 17:28, 7 May 2010 (UTC)
- I have no idea how I would go about doing this in an efficient way. I think the best way would be to scan for every instance where |link is used with a file. Then, theoretically, I would intersect this with a list of non-free files. I can't really think of a more elegant way of doing this though. Tim1357 talk 23:05, 19 May 2010 (UTC)
- My plan was to separate Category:Wikipedia image copyright templates into free->(subcat)public domain and non-free. –xenotalk 17:36, 20 May 2010 (UTC)
- But how do I parse the XML, then compare that data to the list of non-free files? I can't keep the list in memory (without pissing off some people). Maybe I'll keep a list in a file, then when I find a |link parameter in a page, I'll parse through my file to see if the file exists there. Tim1357 talk 20:29, 9 June 2010 (UTC)
- Look for |link=
- Is that image in /my/file/with/all_non_PD_images ?
Tim1357 talk 20:29, 9 June 2010 (UTC)
- Sounds like a good start. –xenotalk 20:36, 9 June 2010 (UTC)
You want to avoid duplicate effort, and a lot of files will likely have an identical |link= target. So I think the best way to go about this is to get all |link= text from all pages and then remove the duplicates and write the list to a file (or a table, whatever). Then check the names against Commons and remove any matches. Then check the remaining list against either a full list of non-free media or do a direct check in one of the huge categories ("All non-free media" and "All free media" or some such). A direct check being "select 1 from page join categorylinks on cl_from = page_id where cl_to = 'foo' and page_namespace = 6 and page_title = 'bar';". That seems to be the most efficient method of doing this. There are some false positives you'll have to deal with, but (as usual) these are the result of user stupidity and can probably stand to be fixed anyway, so listing them isn't really an issue. --MZMcBride (talk) 19:22, 10 June 2010 (UTC)
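A very rough sketch of the first half of that approach (collect |link= targets from file syntax, then dedupe); the pattern is approximate and the way the survivors get checked against the categories afterwards is only hinted at, so treat everything here as an assumption.

import re

# A guess at file syntax carrying a |link= parameter, e.g.
# [[File:Foo.jpg|thumb|link=Some page|caption]]; the pattern is approximate.
FILE_WITH_LINK = re.compile(
    r'\[\[\s*(?:File|Image)\s*:\s*([^|\]]+)[^\]]*?\|\s*link\s*=\s*([^|\]]*)',
    re.IGNORECASE)

def link_overrides(wikitext):
    """Yield (file name, link target) pairs where link= actually points somewhere."""
    for name, target in FILE_WITH_LINK.findall(wikitext):
        if target.strip():                 # link= with a value, i.e. not just unlinked
            yield name.strip(), target.strip()

# The deduplicated set of pairs would then be checked against Commons and against
# the free/non-free categories with SQL like the query quoted above.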
- Just so it's clear, it's not only non-free media that must link to the description page, somewhat-free media with attribution-required (i.e. some Commons content) also cannot be linked elsewhere than the description page. –xenotalk 19:25, 10 June 2010 (UTC)
- Well, that's more annoying. The beginning part of the methodology I outlined is still correct; the other pieces will obviously need tweaking. Honestly, WP:DDR should probably only be involved in the steps that require actual page text scanning (in this case, a full list of |link= text from all pages, dupes removed). The rest of the work can be done elsewhere and by others. That's my view, at least. --MZMcBride (talk) 19:33, 10 June 2010 (UTC)
- I think what is becoming clear is that this one is probably more trouble than it's worth. –xenotalk 19:38, 10 June 2010 (UTC)
- I've currently invested about a minute, so I'm not so sure about that quite yet. ;-) --MZMcBride (talk) 19:52, 10 June 2010 (UTC)
- As an aside, I'm still waiting on toolserver approval ;p –xenotalk 19:53, 10 June 2010 (UTC)
- Yeah, I went to follow-up on that, and noticed that the ticket had been marked private (your super-secret full name, I presume). Makes it a bit more difficult to poke people about it, but I'll certainly try to remember to later this evening. --MZMcBride (talk) 19:54, 10 June 2010 (UTC)
- Much obliged! –xenotalk 19:56, 10 June 2010 (UTC)
Ideas/Proposals
I have a few ideas that might make this page a bit more usable. The first is moving the request page to somewhere other than the talk page. Perhaps WP:Dump reports/Requests? The second is creating a list of recurring requests, or low-priority requests that can be run together in the same scan. It is becoming increasingly silly to do a full scan for each request. Grouping these requests together will make things go faster. That being said, we must collaborate to make sure that we are not running the same dump scan twice. Tim1357 talk 19:16, 16 May 2010 (UTC)
- I don't see any advantage to moving the requests to a separate page. All requests are the same (low) priority. Allowing people to tell you what's a higher priority is a poor idea. People are selfish and stupid. The idea to combine requests might have some merit. It depends on cost vs. benefit in code complexity and other things. I don't see any "we must collaborate" arguments, though. There's no moral imperative here. --MZMcBride (talk) 23:44, 16 May 2010 (UTC)
Determining population of 'status' parameter in Template:Infobox television
Within the Infobox television template, there is a parameter |status= which contains many different comments (in production, canceled, hiatus, etc.). I would like to request a report of existing comments within the parameter, including frequency of use of each comment. (As an FYI, a brief discussion of this issue can be found on the Template:Infobox television talk page here.) Thanks. --Logical Fuzz (talk) 22:40, 19 May 2010 (UTC)
- Doing... Svick (talk) 22:48, 19 May 2010 (UTC)
- Just saw the data. Thanks so much! --Logical Fuzz (talk) 02:46, 20 May 2010 (UTC)
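For reference, a minimal sketch of how the frequencies could be tallied from wikitext; the parameter-matching regex is simplistic (it ignores nested templates) and is an assumption, not the code that was actually used for the report.

import re
from collections import Counter

# Simplistic: grabs the |status= value up to the next pipe or closing braces, so
# values containing nested templates or piped links will be truncated.
STATUS = re.compile(r'\{\{\s*Infobox television\b.*?\|\s*status\s*=\s*([^|}]*)',
                    re.IGNORECASE | re.DOTALL)

status_counts = Counter()

def tally(wikitext):
    for value in STATUS.findall(wikitext):
        status_counts[value.strip() or '(empty)'] += 1

# After scanning every article, status_counts.most_common() gives the requested
# list of comments with their frequency of use.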
Pages with many alsos
As requested at the village pump, I have created a list of pages that contain the word “also” many times, in case anyone here is interested. Svick (talk) 15:46, 18 July 2010 (UTC)
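For anyone wanting to reproduce or tweak the list, the counting itself is trivial; the threshold below is an arbitrary assumption, not the cut-off that was actually used.

import re

ALSO = re.compile(r'\balso\b', re.IGNORECASE)
THRESHOLD = 20  # arbitrary; the real report's cut-off may differ

def too_many_alsos(wikitext):
    return len(ALSO.findall(wikitext)) >= THRESHOLD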
Trivia sections
I'm going to slap this one on our to-do list. I want to have a periodic scan of all the articles that have a ==Trivia== section. These sections should be removed, or at least tagged with {{Trivia}}. If nobody else here feels like running this scan now, I'll do it in a few weeks. Tim1357 talk 14:01, 21 July 2010 (UTC)
- I actually did this a while ago, in case anybody wanted to take a look. [1]. Tim1357 talk 22:22, 18 September 2010 (UTC)
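A heading pattern along these lines would probably do for the periodic scan; whether variants such as "Trivia and notes" should also count is an open question, so treat this as a guess.

import re

# Matches a ==Trivia== heading at any level; variant section titles are not covered.
TRIVIA_HEADING = re.compile(r'^=+\s*Trivia\s*=+\s*$', re.IGNORECASE | re.MULTILINE)

def has_trivia_section(wikitext):
    return TRIVIA_HEADING.search(wikitext) is not None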
Category Cycles
Can this dump report be re-run or migrated to WP:DBR? Sfan00 IMG (talk) 09:14, 15 May 2011 (UTC)
- Updated. I'm not sure this is a good fit for a database report, because it's not easily translatable into SQL. User<Svick>.Talk(); 18:47, 28 May 2011 (UTC)
Categorylinks dump
I am requesting this dump report on behalf of Bearcat. The dump report needs to use the en.wikipedia SQL dump from 20190401 and uses the page, categorylinks and templatelinks SQL tables. Using a more recent SQL dump will not work. This report is a one-time request. Results can be posted at a subpage of Bearcat's userpage.
- Figure out the page_id of "Template:Infobox Town AT" from the dump
- Run the following query:
use enwiki_p;
select tl_from_namespace, page_title, cl_to from templatelinks
join categorylinks on cl_from = tl_from
join page on tl_from = page_id
where tl_from = -- insert page_id of "Template:Infobox Town AT" here
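One note of caution, offered as an assumption rather than a correction of the request: in the 20190401 schema, templatelinks records transclusions by target title (tl_namespace, tl_title) rather than by a target page_id, and tl_from is the page doing the transcluding. If the goal is the categories of the pages that use the infobox, a variant keyed on the title may be closer to the intent; the Python wrapper and connection details below are purely illustrative.

import pymysql  # illustrative; any MySQL client over the imported dump tables works

# Variant query keyed on the template's title instead of its page_id; this encodes an
# assumption about the intended result (categories of the transcluding pages).
QUERY = """
select tl_from_namespace, page_title, cl_to
from templatelinks
join categorylinks on cl_from = tl_from
join page on tl_from = page_id
where tl_namespace = 10
  and tl_title = 'Infobox_Town_AT'
"""

# Hypothetical connection details for a local import of the 20190401 dump.
conn = pymysql.connect(host='localhost', user='report', password='...',
                       database='enwiki_20190401')
with conn.cursor() as cur:
    cur.execute(QUERY)
    for from_namespace, title, category in cur.fetchall():
        print(from_namespace, title, category)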