Wikipedia:Bots/Requests for approval/H3llBot 2
- The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Approved.
Operator: H3llkn0wz (talk · contribs)
Automatic or Manually assisted: Automatic
Programming language(s): C# (.NET) via the MediaWiki API
Source code available: No (unless requested)
Function overview: Archive dead reference citation external links via Archive.org's Wayback Machine.
Links to relevant discussions (where appropriate): Same as Tim1357's BRFA, the same recent VP link, the WebCiteBOT BRFA, and several older request links.
Edit period(s): Continuous ~11pm-2am GMT, run from my PC (I may inquire with the Toolserver about hosting of .exe's)
Estimated number of pages affected: Given dead URL caching, much less than read capacity; estimating 5 epm
Exclusion compliant (Y/N): Y
Already has a bot flag (Y/N): Y
Function details:
- Select a scheduled page or a page from a selected list[1] for reading
- Read the page and find all external links inside <ref></ref> tags or in citation templates with url= and accessdate=[2]
- Add all newly seen links to the link repository[3] for an immediate check and a repeated check 5 days later
- Find a Wayback Machine url[4] within the last 6-month range for all dead links
- Update the corresponding urls with the found Wayback urls and remove {{deadlink}}s if any; otherwise mark them with {{deadlink}}s
- Schedule the page for future processing if any first-seen 404 links are encountered, then proceed with the next page
[1] FA/GA and Articles with dead links are priorities; otherwise I have classes for category/template/etc. parsing
[2] For finding the revision in which a link was added, WikiBlame is extremely slow, and a proper API revision search is slow to implement but is in progress
[3] A link storage and 404-state checker with schedules for the next 404 check
[4] Retrieve archive.org's query result page for the selected URL and parse it for links with acceptable dates
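The parsing step in footnote [4] can be sketched as follows. This is a minimal Python illustration (the bot itself is written in C#), assuming the query-result page contains standard Wayback snapshot links whose URLs embed a 14-digit YYYYMMDDhhmmss timestamp; `snapshot_dates` is a hypothetical name, not the bot's actual code:

```python
import re
from datetime import datetime

# Wayback snapshot URLs look like:
#   http://web.archive.org/web/20100115123456/http://example.com/page
# where the 14 digits are a YYYYMMDDhhmmss timestamp.
WAYBACK_RE = re.compile(r"web\.archive\.org/web/(\d{14})/")

def snapshot_dates(html):
    """Extract snapshot timestamps from a Wayback query-result page."""
    return [datetime.strptime(ts, "%Y%m%d%H%M%S")
            for ts in WAYBACK_RE.findall(html)]
```

The extracted dates can then be filtered against the 6-month acceptability window described above.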
Dead links are considered to be pages returning a 404 error twice within 5 days (I have not yet encountered 404s caused by a server disk spinning up, as pointed out by Tim1357, but I can set up double checks to see how many servers actually behave like that). The parser builds page tree structures and omits pages it cannot process, so it does not process content inside <!-- --> comments or <nowiki></nowiki> tags.
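The double-check rule above can be sketched like so. This is a minimal Python illustration (the bot itself is written in C#); `LinkRepository` and `record_check` are hypothetical names, and the 5-day interval is taken from the description above:

```python
from datetime import datetime, timedelta

RECHECK_INTERVAL = timedelta(days=5)

class LinkRepository:
    """Tracks 404 results per URL; a link counts as dead only
    after two 404s observed at least 5 days apart."""

    def __init__(self):
        self._first_404 = {}  # url -> datetime of first observed 404

    def record_check(self, url, status_code, when):
        """Record one check result; returns 'ok', 'suspect', or 'dead'."""
        if status_code != 404:
            # Any non-404 response clears the suspect state.
            self._first_404.pop(url, None)
            return "ok"
        first = self._first_404.get(url)
        if first is None:
            self._first_404[url] = when
            return "suspect"  # schedule the 5-day recheck
        if when - first >= RECHECK_INTERVAL:
            return "dead"     # confirmed: two 404s 5 days apart
        return "suspect"
```

A "suspect" result would put the URL back on the repository's recheck schedule rather than triggering any edit.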
Discussion
Regarding the same-task request: the way I see it, the article pool is very large, this is a growing problem, and diminishing returns/redundancy should not arise even with multiple bots. Current progress has almost all of the above functionality. — Hellknowz ▎talk 00:42, 19 May 2010 (UTC)[reply]
- Are you willing and able to publish your source code?
- Approved for trial (5 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Let's see the bot in action on a small sample set. Josh Parris 06:42, 19 May 2010 (UTC)[reply]
- {{OperatorAssistanceNeeded}} Have you undertaken the trial yet? Josh Parris 11:19, 25 May 2010 (UTC)[reply]
- Sorry that this is taking so long. I had a big deadline yesterday, so this has taken longer than anticipated. I am currently coding the revision retrieval functionality. I really do not wish to hurry with this implementation, as any unhandled exceptions may lead to nasty side effects. — Hellknowz ▎talk 11:33, 25 May 2010 (UTC)[reply]
- Just so long as you haven't forgotten about this, all's well. Josh Parris 11:40, 25 May 2010 (UTC)[reply]
- About publishing source code — do you reckon it would be useful to make my custom C# API available? There is currently only one I know of and it lacks many important features. — Hellknowz ▎talk 14:24, 29 May 2010 (UTC)[reply]
- Yes, I encourage you to do so. Feel free to mention it in the appropriate lists. Josh Parris 02:26, 31 May 2010 (UTC)[reply]
Trial complete. Special:Contributions/H3llBot. — Hellknowz ▎talk 16:36, 1 June 2010 (UTC)[reply]
- Question - If there are multiple archived versions of a link in the Wayback Machine, and the text is different on each one, how does the bot know the right one to pick?--Rockfang (talk) 18:33, 1 June 2010 (UTC)[reply]
- It always picks the archived version closest to the original access date (the link-addition date), preferring older versions, up to a 180-day range. — Hellknowz ▎talk 18:57, 1 June 2010 (UTC)[reply]
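For illustration, the selection rule described above might look like this in Python (the bot itself is written in C#). `pick_snapshot` is a hypothetical name, and breaking ties toward the earlier snapshot is an assumption based on "picks the older version closest to":

```python
from datetime import datetime, timedelta

MAX_RANGE = timedelta(days=180)

def pick_snapshot(access_date, snapshot_dates):
    """Pick the snapshot closest to the access date, within 180 days.

    Ties are broken toward the older snapshot; returns None when no
    snapshot falls inside the 180-day window.
    """
    candidates = [d for d in snapshot_dates
                  if abs(d - access_date) <= MAX_RANGE]
    if not candidates:
        return None
    # Sort key: distance from the access date first, then earlier date.
    return min(candidates, key=lambda d: (abs(d - access_date), d))
```

Returning None would leave the citation marked with {{deadlink}} instead of substituting an archive URL.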
- Only five edits; let's do one more, bigger test, because I've found citation setup varies a lot from article to article. Tim1357 talk 21:33, 7 June 2010 (UTC)[reply]
- Approved for trial (25 edits (more if you want more testing)). Please provide a link to the relevant contributions and/or diffs when the trial is complete. (Just make sure you babysit the bot). Tim1357 talk 21:33, 7 June 2010 (UTC)[reply]
Trial complete. Special:Contributions/H3llBot
- One funny issue was this: [1]; [2], where the bot applied a fix to bad template syntax (a vertical pipe before the closing brackets). This comes from a very subtle bug in the page-structure parser. — Hellknowz ▎talk 22:55, 7 June 2010 (UTC)[reply]
- Very well done! Tim1357 talk 23:04, 7 June 2010 (UTC)[reply]
- Approved. Tim1357 talk 23:04, 7 June 2010 (UTC)[reply]
- The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.