Wikipedia:Reference desk/Archives/Computing/2018 January 9

Computing desk
Welcome to the Wikipedia Computing Reference Desk Archives
The page you are currently viewing is an archive page. While you can leave answers for any questions shown below, please ask new questions on one of the current reference desk pages.


January 9

Why is the "I'm not a robot" box presented?

If I understand correctly, when I click inside the "not a robot" box, an algorithm determines whether my mouse movement prior to the click was "human". So why isn't this feature embedded into any given control, making it possible to hide the "not a robot" box entirely? For example, instead of clicking the "not a robot" box and then "Open account", the "Open account" control would be programmed to already contain the "not a robot" functionality. Gil_mo (talk) 13:47, 9 January 2018 (UTC)[reply]

Because it's not really part of the site. It's a third-party service provided by ReCAPTCHA.
ApLundell (talk) 15:20, 9 January 2018 (UTC)[reply]
That's not a complete answer, as Google now offers invisible reCAPTCHA.
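
For what it's worth, whichever variant a site uses (the visible checkbox or the invisible widget bound to a button), the integration pattern is the same: the widget hands the page a token, and the site's own backend passes that token to Google's siteverify endpoint before trusting the click. A minimal Python sketch of that server-side verification step, assuming the requests library; the secret key is a placeholder and error handling is omitted:

# Server-side half of reCAPTCHA (visible or invisible): the browser submits a
# token with the form, and the backend checks it against Google's siteverify
# endpoint. "YOUR_SECRET_KEY" is a placeholder for the site's real secret.
import requests

def token_is_human(token, secret="YOUR_SECRET_KEY"):
    resp = requests.post(
        "https://www.google.com/recaptcha/api/siteverify",
        data={"secret": secret, "response": token},
        timeout=10,
    )
    return resp.json().get("success", False)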

USB Charger

1) Which USB charging cable is currently available, and what is coming soon?

3) Why are newly released smartphones still using USB 2.0?

123.108.246.106 (talk) 16:04, 9 January 2018 (UTC)[reply]

What happened to question number 2? See our article USB#USB_3.0. (((The Quixotic Potato))) (talk) 16:13, 9 January 2018 (UTC)[reply]
Apologies, careless mistake. 119.30.47.241 (talk) 16:21, 11 January 2018 (UTC) [reply]

detecting bot-generated content in Wikipedia

Hello,

I use the raw data of many Wikipedia articles for a machine-learning project. I use Python WikiExtractor for data extraction, but I need only man-made articles (i.e. I don't want to collect articles that were automatically generated by bots).

I have a few questions:

(1) How can I check how many of the articles in a given Wikipedia (specifically the Basque Wikipedia) were generated by bots?

(2) Can I also check the proportion of sentences (rather than articles) that were generated by bots?

(3) Can I use WikiExtractor, or other tools, to extract only articles that were not written by bots?

Thanks! — Preceding unsigned comment added by 77.127.131.255 (talk) 17:13, 9 January 2018 (UTC)[reply]

I'll answer for English Wikipedia, because that's all I know - you'll have to ask on the Basque Wikipedia how they've done things there, but I imagine it's much the same.
You may struggle, depending on how strict you want "man-made articles" to be; depending on what arbitrary definition you choose, there are no articles (on English Wikipedia at least) that are entirely "generated" by bots, or they all are - or some proportion are. All WikiExtractor does is give you access (albeit en masse) to revisions of articles - just the same as you'd get using the MediaWiki API, or just looking at article histories manually using the web interface. You can certainly identify bot accounts, and you can see the diffs that they make. You can find articles that they've created. That's all we store - just text diffs. So you can reject articles that were created, or edited, by a bot account (like the prolific User:rambot). But there are lots of semi-automated tools (like WP:REFILL or WP:TWINKLE) that are neither fully human nor entirely automated. If you feel you have to reject articles edited with them, then you'll essentially reject the entire English Wikipedia (bar some ultra-neglected fringes). But beyond that, it just gets harder. Many (most, now) articles are built from dozens, hundreds, or many thousands of edits, by people, bots, people with scripts, bots reverting vandalism, people partially reverting vandalism, endless edit wars and format tweaks, and lots and lots of individually tiny revisions. And all we store, and all any API will give you, is the diffs. There is no easy means to attribute any non-trivial article to a given editor, or (beyond some fairly basic chunks, like US Census information inserted by bots like rambot) to attribute the content of any sentence to a specific editor, be they bot or not. It's not that this information is gone, it's that every article is the sum of many editors. Frankly, rather than this being the input to your project, analysing a specific article and determining what proportion of it was written by whom (given all those complexities) would itself be an interesting machine-learning or language-analysis problem. So TL;DR: the answer to all your questions is "probably no". -- Finlay McWalter··–·Talk 17:50, 9 January 2018 (UTC)[reply]
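
To make the "identify bot accounts and see the diffs that they make" part concrete, here is a rough sketch against the MediaWiki API (pointed at eu.wikipedia.org here, but any project works the same): it pulls one article's revision history and reports which of its editors currently carry the bot flag. The article title "Bilbo" is just an example; accounts whose bot flag was later removed, and semi-automated tools, will not be caught.

# Rough sketch: list the editors in one article's history and check which of
# them are currently in the "bot" user group. De-flagged bots and
# semi-automated tools will not be caught.
import requests

API = "https://eu.wikipedia.org/w/api.php"
session = requests.Session()

def revision_users(title):
    """Usernames appearing anywhere in the article's revision history."""
    users, cont = set(), {}
    while True:
        data = session.get(API, params={
            "action": "query", "prop": "revisions", "titles": title,
            "rvprop": "user", "rvlimit": "max", "format": "json", **cont}).json()
        for page in data["query"]["pages"].values():
            users |= {rev["user"] for rev in page.get("revisions", []) if "user" in rev}
        if "continue" not in data:
            return users
        cont = data["continue"]

def flagged_bots(usernames):
    """Subset of usernames currently in the 'bot' group (max 50 names per query)."""
    bots, names = set(), list(usernames)
    for i in range(0, len(names), 50):
        data = session.get(API, params={
            "action": "query", "list": "users", "ususers": "|".join(names[i:i + 50]),
            "usprop": "groups", "format": "json"}).json()
        bots |= {u["name"] for u in data["query"].get("users", [])
                 if "bot" in u.get("groups", [])}
    return bots

editors = revision_users("Bilbo")           # any article title will do
print("bot editors:", flagged_bots(editors))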

To actually determine what percentage of articles were made by bots, I recommend against any sort of automatic recognition. There's simply no need; just do normal random sampling, i.e. define some criterion for deciding that an article is bot-generated, apply it to a random sample of maybe 1000 articles, and extrapolate from there.
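
For illustration, a sketch of that sampling approach, again using the eu.wikipedia.org API as an example. The per-article test here is a deliberately crude placeholder ("the first revision was made by an account whose name ends in bot"); substitute whatever criterion you actually settle on.

# Estimate the share of bot-created articles by random sampling rather than
# scanning the whole wiki. is_bot_created() is a crude placeholder criterion.
import requests

API = "https://eu.wikipedia.org/w/api.php"
session = requests.Session()
SAMPLE = 1000   # as suggested above; this makes ~2000 sequential API calls

def random_titles(n):
    """Draw n random mainspace titles (50 per API request)."""
    titles = []
    while len(titles) < n:
        data = session.get(API, params={
            "action": "query", "list": "random", "rnnamespace": 0,
            "rnlimit": min(50, n - len(titles)), "format": "json"}).json()
        titles += [p["title"] for p in data["query"]["random"]]
    return titles

def first_editor(title):
    """Username behind the article's very first revision."""
    data = session.get(API, params={
        "action": "query", "prop": "revisions", "titles": title,
        "rvdir": "newer", "rvlimit": 1, "rvprop": "user",
        "format": "json"}).json()
    page = next(iter(data["query"]["pages"].values()))
    return page.get("revisions", [{}])[0].get("user", "")

def is_bot_created(title):
    return first_editor(title).lower().endswith("bot")   # crude placeholder test

sample = random_titles(SAMPLE)
hits = sum(is_bot_created(t) for t in sample)
print(f"estimate: {hits / len(sample):.1%} of articles look bot-created")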

For the other questions, I don't know much about the Basque Wikipedia, but I know a bunch of the small wikipedias have a very large number of stub articles created by bots. These articles have very little content and AFAIK most of them have never been edited by a non-bot. The Volapük Wikipedia is a classic case study in this, to the extent that it's even mentioned in the article about the language, but there are a bunch more, including non-constructed languages. The Cebuano Wikipedia, for example, has the second-highest number of articles after English. There's a perception this was at least in part an attempt to promote these wikipedias on www.wikipedia and en.wikipedia. Fortunately, while this was dealt with in various ways, no one ever really suggested trying to use a bot to assess these wikipedias, so there may have been no attempt to "evade" such a bot.

In the en.wikipedia case, after this started to happen, a depth requirement (see Meta:Wikipedia article depth) was introduced. See the archives of Template talk:Wikipedia languages and Talk:Main Page for details. Some of these wikipedias then started to show increasing depth. (I never investigated properly how this happened, but I think the bots made a bunch more edits to all the articles and maybe also created support pages.) After this, wikipedias with too many stubs or short articles were simply excluded, without any defined objective criterion. I wasn't entirely happy with this, but I admit that after doing a quick test and seeing how stark the difference generally was, I never made a big fuss about it. (Although the ad hoc way this was done meant some wikipedias that failed were still listed.)
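
(For anyone who wants to eyeball where a wiki sits on that scale: the raw numbers the depth metric is built from are exposed by the API. The sketch below only prints a couple of the underlying ratios rather than the depth value itself; the exact formula is on the Meta page linked above.)

# Pull the site statistics the "article depth" metric is derived from and
# print a couple of the raw ratios. See Meta:Wikipedia article depth for the
# actual formula.
import requests

def site_ratios(host):
    data = requests.get(f"https://{host}/w/api.php", params={
        "action": "query", "meta": "siteinfo", "siprop": "statistics",
        "format": "json"}).json()
    s = data["query"]["statistics"]
    return {"articles": s["articles"],
            "edits_per_article": round(s["edits"] / s["articles"], 1),
            "non_articles_per_article": round((s["pages"] - s["articles"]) / s["articles"], 2)}

for host in ("eu.wikipedia.org", "ceb.wikipedia.org", "en.wikipedia.org"):
    print(host, site_ratios(host))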

www.wikipedia never bothered about this, since they list all wikipedias anyway, although IIRC they did change the criterion for which wikipedias are listed around the globe, probably at least partially because of this. What used to be based on the number of articles was changed to the popularity of the wikipedia, i.e. how many people visit it; see Meta:Talk:Top Ten Wikipedias. (Mysteriously they chose to use Alexa statistics rather than WMF ones, but eventually this too was fixed.)

Anyway, my main point relating to this question is that if this happened on the Basque Wikipedia, a lot depends on how those bot edits were done. If it didn't happen, and Basque followed a more typical editing pattern with some bots, some scripts, and a whole bunch of human editors, then unfortunately none of this helps much, and you're likely to encounter the problems outlined by FMW, albeit I expect to a far lesser extent (given there are a lot fewer editors). Looking at the list of Wikipedias, the number of Basque Wikipedia articles is perhaps a little high compared to the number of active users, but not extremely so, so it's possible this didn't really happen to any real extent. Also, it has not been excluded from the en.wikipedia list, suggesting it isn't extremely stubby, unless it is one of the missed cases.

Anyway, for any case where a very large percentage of the articles have never been touched by a human, if the edits were made from bot accounts, or from normal accounts with a clear tag when editing as a bot, you could programmatically exclude most of these articles without much effort simply by finding all of those bots and tags. I doubt the number of such accounts is more than 1000, and probably far fewer, and I think most of them have edited a lot of articles. I'm also pretty sure that for most of those articles there will be no edits from non-bots. And most of these bots were basically completely automated, so you don't really have to worry about defining what is and isn't a bot. The reason this was often so widely criticised is that these wikipedias had perhaps only tens of active editors, or fewer, yet were being expanded from 10k or fewer articles to 100k or more. You will obviously miss some articles where someone not tagged as a bot did edit but perhaps changed nothing of substance, and maybe some smaller bots which edited a lot less.
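
For illustration, a rough sketch of that exclusion: the bot flag is a user group, so the whole list of flagged accounts can be pulled in one go (list=allusers with augroup=bot), and an article can then be dropped if every revision came from one of them. As noted above, de-flagged bots and untagged accounts will slip through; "Bilbo" is just an example title.

# Exclude articles whose entire history consists of edits by flagged bot
# accounts. De-flagged bots and bot-like edits from ordinary accounts will
# slip through, as discussed above.
import requests

API = "https://eu.wikipedia.org/w/api.php"
session = requests.Session()

def flagged_bot_accounts():
    """All accounts currently in the 'bot' user group."""
    bots, cont = set(), {}
    while True:
        data = session.get(API, params={
            "action": "query", "list": "allusers", "augroup": "bot",
            "aulimit": "max", "format": "json", **cont}).json()
        bots |= {u["name"] for u in data["query"]["allusers"]}
        if "continue" not in data:
            return bots
        cont = data["continue"]

def only_bot_edits(title, bots):
    """True if every revision of the article was made by a flagged bot."""
    cont = {}
    while True:
        data = session.get(API, params={
            "action": "query", "prop": "revisions", "titles": title,
            "rvprop": "user", "rvlimit": "max", "format": "json", **cont}).json()
        for page in data["query"]["pages"].values():
            for rev in page.get("revisions", []):
                if rev.get("user") not in bots:
                    return False
        if "continue" not in data:
            return True
        cont = data["continue"]

bots = flagged_bot_accounts()
print(only_bot_edits("Bilbo", bots))   # any article title will do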

But I suspect even with all that you'll still catch a large percentage of "bot articles". The only question is whether the situation is so bad that, despite excluding what I suspect is 99+% of the wikipedia, you still find you have a large number of "bot articles", and working out a way to exclude them algorithmically is difficult.

OTOH, if these edits were made by normal editor accounts without any tags, and those accounts have also contributed a lot of human work to the wikipedia, it's likely to be fairly difficult. You could perhaps try to use time, since I suspect the bots were often run for a period and then stopped, but I'm not sure how easy it will be to find all the relevant timeframes, whether there is any risk of overlap with real human edits or, if not, how quickly the human edits started after the bot editing stopped (i.e. how accurate your timeframes will need to be).
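
If you do manage to pin down rough timeframes, something like the sketch below could work; the dates in it are made-up placeholders, not real bot-run windows for the Basque Wikipedia.

# Purely illustrative: drop articles whose first revision falls inside a
# suspected bot-run window. The windows below are made-up placeholders.
from datetime import datetime
import requests

API = "https://eu.wikipedia.org/w/api.php"

BOT_RUNS = [  # (start, end) of suspected bot runs -- hypothetical dates
    (datetime(2006, 1, 1), datetime(2006, 3, 31)),
    (datetime(2008, 7, 1), datetime(2008, 8, 15)),
]

def creation_time(title):
    """Timestamp of the article's first revision."""
    data = requests.get(API, params={
        "action": "query", "prop": "revisions", "titles": title,
        "rvdir": "newer", "rvlimit": 1, "rvprop": "timestamp",
        "format": "json"}).json()
    page = next(iter(data["query"]["pages"].values()))
    return datetime.strptime(page["revisions"][0]["timestamp"], "%Y-%m-%dT%H:%M:%SZ")

def created_during_bot_run(title):
    t = creation_time(title)
    return any(start <= t <= end for start, end in BOT_RUNS)

print(created_during_bot_run("Bilbo"))   # any article title will do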

Nil Einne (talk) 04:10, 10 January 2018 (UTC)[reply]

Thank you both! — Preceding unsigned comment added by 2.53.184.234 (talk) 12:07, 10 January 2018 (UTC)[reply]