User:Retro/Character encoding

If my subpages inspire you to carry out an idea yourself or you have any questions or concerns, I'm interested in hearing about it!

This page is not limited to discussion of Unicode code points, but they are currently the dominant focus here.

I was able to painlessly generate a list of Unicode code point redirects using Special:AllPages, and I templated them all with {{R from Unicode code}} in a semi-automated way (I reviewed each edit, and made corrections to the redirect target when appropriate).

  • Many Unicode code point redirects don't exist. These could probably be created in a semi-automated way by utilizing the target of each code point's related character. It may be possible to transition to automated creation, but the main issue is that code point redirects should ideally redirect to a specific section in the article about the character's encoding, and the names for those sections have not been standardized. I would like to do a comprehensive sampling of section names in character articles so I can determine the best path towards standardization (there is one section name that I am a bit against here: "In popular culture" is a bit broad; I currently prefer something like "Encoding").
    • Examples of code points lacking redirects: U+2024, U+2022
    • Some articles don't mention the Unicode code points at all.
    • Here's a related case study; I need to be careful to avoid double redirects if I create new redirects. I should also have high certainty in what the final resulting product should be before I create many redirects.
    • These were my original thoughts about automating mass-creation of such redirects: I could make a bot that looks up where the actual Unicode character's title points on Wikipedia, checks that the page exists, and then uses the final (indirect) target to create the new redirect (to avoid creating a double redirect). There's so much more, though. Section-specific redirects are preferable to me, and that would require standardization of section titles.
  • For character articles I think the Unicode code point is worth a mention in the infobox.
  • Latin-1 Supplement (Unicode block)#Compact table: Why does the character block have "XXX" for some of the acronyms even though they are known and even appear in the tooltip text? This is probably a broader issue than this single article, and should be investigated closely.
  • Non-breaking space: Are code blocks the standard way to include Unicode code points? Another question that can only be effectively answered with some automated searching.
  • What follows is an idea I thought of, but have since rejected: What if all the Unicode code points were linked, and the unassigned ones redirected to the article on unassigned Unicode code points? It might be too annoying, though, and only trivially useful. Instead, the absence of a page should communicate the message that the code point is unassigned.
  • {{R from Unicode code}} still has a few problems:
  • Each character article should have its relevant Unicode block mentioned in the page. Otherwise, one has to go to Unicode block to reverse engineer the parent block. This could probably even be templated, but it would require a bit of cleverness to design elegantly.
    • Example: É
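A minimal sketch of the redirect-creation idea above, in Python. It is not an actual bot: `pages` is a hypothetical in-memory mapping from page title to wikitext standing in for whatever a real bot would fetch, and the helper names, the hop limit, and the "Encoding" section name are all assumptions for illustration. It follows the character's own title to its final target first, so the new code point redirect never points at another redirect.

```python
import re

# Matches the target (and optional section) of a "#REDIRECT [[Target#Section]]" line.
REDIRECT_RE = re.compile(r"#REDIRECT\s*\[\[([^\]\|#]+)(?:#([^\]\|]*))?", re.IGNORECASE)


def code_point_title(char):
    """Return the 'U+XXXX' title for a single character."""
    return f"U+{ord(char):04X}"


def final_target(title, pages, max_hops=5):
    """Follow redirects from `title` until a non-redirect page is reached.

    Returns None if a page is missing, a redirect loop is found,
    or the chain is longer than `max_hops`.
    """
    seen = set()
    while title in pages and title not in seen and len(seen) < max_hops:
        seen.add(title)
        match = REDIRECT_RE.match(pages[title].strip())
        if match is None:
            return title  # a real page, not a redirect
        title = match.group(1).strip().replace("_", " ")
    return None


def build_redirect(char, pages, section=None):
    """Return (title, wikitext) for a new code point redirect, or None.

    Nothing is proposed when the redirect already exists or when the
    character itself has no resolvable page.
    """
    title = code_point_title(char)
    if title in pages:
        return None
    target = final_target(char, pages)
    if target is None:
        return None
    link = f"{target}#{section}" if section else target
    return title, f"#REDIRECT [[{link}]]\n{{{{R from Unicode code}}}}"


if __name__ == "__main__":
    # Hypothetical page contents, for illustration only.
    pages = {
        "\u2022": "#REDIRECT [[Bullet (typography)]]",
        "Bullet (typography)": "A bullet is a typographical symbol.",
    }
    print(build_redirect("\u2022", pages, section="Encoding"))
```

The section argument is optional on purpose: until section names are standardized, the sketch can only link to the top of the target page safely.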
  • I remember a Tom Scott video about how these marks allowed an Android message to crash the messaging application... it hints at the complexities involved, and links to the Unicode Consortium page, a good place for information about these sorts of things.

  • For Bullet (typography), I don't think it should have a subsection in Unicode; most characters have a higher level Unicode description table.
  • Right-to-left mark and Left-to-right mark are probably similar enough to be merged. But the target name should probably not be either of the original article titles.
  • U+00A4 is an exception to the normal redirect templates and uses {{R from systematic}}
  • Ugh, I don't like that Character point redirects to an article about roleplaying games. Perhaps it should be a disambiguation. But maybe I'm the only one who has confused it with "Code point".

Involved people



Miscellaneous notes


I made these notes while I was adding {{R from Unicode code}} to the redirects. They are currently unsorted. While editing, I generally made the decision to preserve the original target in most cases, except if there was clearly a separate page that was a much better target. This decision was both to save time, and to allow different options of section linking or its lack to remain manifest within the redirect template links. I can analyze the links more systematically later.

  • The question here is whether section redirects of the following form are acceptable:
  • For consistency, I should match the previous character (in the U+009x row) I just edited as well.
  • U+009E:
    #REDIRECT [[ANSI_escape_code#PM]]
    {{R from Unicode code}}
  • Yeah, but U+009F redirects here. Is it consistent?
  • List of Unicode characters is too large to really be that useful in finding pages. But I'll leave it for now. (One of the redirects linked to List_of_Unicode_characters#Control_codes)
  • Should the Unicode code point for Yen sign redirect to the 'Code points' section?
  • The Broken Bar section should have a note that 'broken bar symbol' redirects here.
  • Ordinal indicator could redirect to the 'Encoding' section.
  • I suppose ¬ redirection to Not is legit. But it should probably be to a section on the characters used to represent the concept, not just the concept itself.
  • Why is 'soft hyphen' invisible? It's weird, I wonder what the deal is with my browser encoding.
  • Square and cube algebra might need hatnotes; I can imagine other uses (perhaps a section redirect would suffice.)
  • How is a micro sign different from a mu? I think maybe it should redirect to the unit (what do the ambiguous pronouns refer to!), because mu is almost certainly a different letter in the encoding in some Greek section of Unicode's scheme.
    • The pictures displayed for 'mu' and 'micro-' contradict; this should be resolved. I suspect 'micro-' is in the wrong, because it prescribes a difference, but font designers won't necessarily respect that difference.
  • Unicode subscripts and superscripts might be a better target for some of the previous superscripts I ran into (2 and 3).
  • Superscripts and Subscripts: this too? It's redundant and unnecessary if the other page exists.
    • The intended use[2] when these characters were added to Unicode was to allow chemical and algebra formulas and phonetics to be written without markup, but produce true superscripts and subscripts.
    • There ya go Unicode consortium, adding characters with specifications that nobody uses.
  • Number Forms should use a template, instead of doing fractions with raw tags.
  • Wasn't sure about the use of decimal HTML codes in the U+00Cx block, so I left them like Å
  • Could use 'Circumflex in digital character sets' (of course, could remove the 'Circumflex in')
  • Some of these are nonsensically inconsistent. Why doesn't 'Ô' deserve its own page? Seems reasonable enough to me, especially with all the other similar pages I've seen. It shouldn't redirect to Circumflex like I just supported.
  • Multiplication has two encoding sections: 'In computer software' and 'Unicode'
  • U+00D9
    • Not entirely happy that it redirects to grave accent, the same as Ù. They should have their own page, in my opinion. Grave accent is *way* too general.
  • Why does Ø have a separate 'Computers' and 'Encoding' section? I left the 'Computers' section. Ideally, they would both be merged into 'Encoding', since that's the only way characters are related to computers in most cases: the method the character is encoded in the computer.
  • Unicode, U+066D ٭ ARABIC FIVE POINTED STAR (HTML ٭ ·
  • Generating the HTML right after the Unicode code point could be automated by a template; there's no need to type any of it out, except the character's code point.
  • Oh, U+2189 is used in basketball scorekeeping. I'm actually interested in that; perhaps I shouldn't redirect so hastily. But the basketball article doesn't currently mention it, so it needs expansion before being back-targeted.
  • Smiley#In text
  • The people who make individual pages for Braille pattern dots are a bit insane. I can't believe that there's so much Japanese text. It's essentially a data dump.
  • What happened here, Vanissec?:
  • Editor's note: I'm not sure what I was referring to in this parenthetical: (This again?)
  • Braille pattern dots-0#Plus dots 7 and 8 I thought this was inconsistent with the previous braille character articles, but now I think I may have been wrong.
  • -27 -127
  • Why are those both missing from the page, if it's supposed to encompass all the 7 and 8 dots combinations? This looks a bit fishy with the braille redirection consistency, but I gather dots 7 and 8 are less used.
  • I see, but I don't like those as much; it looks inconsistent, and I haven't looked at it carefully, only a general spot check for the braille.
  • Oh, I missed out on adding {{R to section}} where appropriate. Oh well.
  • {{R to section}} could *definitely* have an automated check based on page content.
  • For U+FFFE and U+FFFF, I wonder if there's a page on Unicode non-characters; I could probably find a section somewhere in Unicode that covers it better.
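Several of the notes above can be checked mechanically. The following is a rough Python sketch using only the standard `unicodedata` module; the helper names are my own and the output format of `describe` only imitates the "U+066D ... (HTML ...)" line quoted above, so treat it as an illustration rather than how any existing template works. It builds the descriptive line from nothing but the code point, tests whether a code point is a noncharacter (U+FDD0 to U+FDEF, plus the last two code points of every plane, such as U+FFFE and U+FFFF), and shows that the micro sign is a separate code point whose compatibility form is the Greek letter mu.

```python
import unicodedata


def parse_code_point(label):
    """Turn a label such as 'U+066D' into the integer 0x066D."""
    if not label.upper().startswith("U+"):
        raise ValueError(f"not a code point label: {label!r}")
    return int(label[2:], 16)


def describe(label):
    """Build 'U+XXXX <char> <NAME> (HTML &#NNNN;)' from the code point alone."""
    cp = parse_code_point(label)
    char = chr(cp)
    name = unicodedata.name(char, "<no name>")
    return f"U+{cp:04X} {char} {name} (HTML &#{cp};)"


def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and for code points ending in FFFE or FFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def compatibility_form(char):
    """Return the NFKC (compatibility) normalization of a character."""
    return unicodedata.normalize("NFKC", char)


if __name__ == "__main__":
    print(describe("U+066D"))
    print(is_noncharacter(0xFFFE), is_noncharacter(0xFFFF))
    micro = "\u00b5"
    folded = compatibility_form(micro)
    # Micro sign and Greek mu are distinct code points linked by NFKC.
    print(unicodedata.name(micro), "->", unicodedata.name(folded))
```

The last check bears on the micro sign question: the two characters are encoded separately, but normalization treats one as a compatibility variant of the other, which is a reason fonts may draw them identically.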

I tried to do a comprehensive search to check if any Unicode code point redirects have been RfD'd. The main discussion I found was this, where many participants mentioned the 'U+' redirects specifically and said they're fine.

  • Not sure about U+E for getting the entire block... I've just skipped it entirely. I think it should be RFD'd, but I've still got to get my arguments together.

I will probably RfD it eventually.

Other

  • Wikipedia:Naming conventions (Unicode) (draft): This page is largely historical at this point; we now have specific guidelines about Unicode characters (which should be relatively straightforward), and this no longer serves its purpose as a centralized proposal to organize discussion; there are better places to have this discussion that would get more awareness. I can probably carry this out myself, since justification will be easy to come by.
    • I wonder if a bot could search for pages to flag as historical; obviously the bot can't make the nuanced decision on its own, but at least it could search when the last edit was. If the last edit was more than a year ago, a page is suspect; it should probably just be back-prioritized. But sometimes people randomly edit pages, so it would still have false negatives.
  • Space: I think Space (punctuation) should be mentioned in the hatnote instead of Spacebar. But I guess I'll check out the page views.
  • \=: List of LaTeX escapes (it's used to make a macron)
  • Code point
    • Strictly, these definitions imply that it’s meaningless to say ‘this is character U+265E’. U+265E is a code point, which represents some particular character; in this case, it represents the character ‘BLACK CHESS KNIGHT’, ‘♞’. In informal contexts, this distinction between code points and characters will sometimes be forgotten.
    • Just make sure to cite the specific version of the documentation. But this cites other sources; it's too secondary.
  • Why do we have disambiguating notes in redirect templates? Oh I guess it's because they're mostly intended for editors, not readers.
  • Regarding {{Redirect category}}:
    • I wonder if there's any way to detect this and do it with a bot. (Note I implemented it wrong in the linked diff.)
    • The 'to' and 'from' should probably have consistency with the descriptions on the redirect page; a bot could definitely enforce that (or maybe some crazy transclusion filtering could take place? but that would probably be too expensive.)
  • {{R template index}} is nowhere near comprehensive enough; it doesn't include {{R from code}}
  • Why, when I search for 'U+F8FF', is the only result that shows up 'U+f8ff'? U+F8FF is a page, but it looks like only the lowercase form gets listed. I should test how this works by creating another similar page, but I need to ensure I understand how search works beforehand. I think this can be deleted because the other result will show up in search; having the lowercase form is likely redundant. But I need to work on evidence, not intuition and a philosophy of minimizing the number of pages.
  • The phrasing in search results, with "redirect from" preceding the title, is ambiguous about which title is the redirect. A clearer interface would suffix the link with "redirects here".
  • Maybe CirrusSearch should be noted more prominently in the lead of Help:Searching, instead of being confined to a hatnote.
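The "flag possibly historical pages" idea above comes down to a date comparison. Here is a small Python sketch under stated assumptions: the last-edit timestamps are supplied in a plain dictionary (a real bot would have to fetch them somehow, which is not shown), they use the ISO-style `YYYY-MM-DDTHH:MM:SSZ` form, and the one-year threshold is the one suggested in the note. The function and page names are hypothetical. It only produces a list of suspects; the nuanced decision stays with a human.

```python
from datetime import datetime, timedelta, timezone


def parse_timestamp(ts):
    """Parse a 'YYYY-MM-DDTHH:MM:SSZ' timestamp as an aware UTC datetime."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def stale_pages(last_edits, now, max_age=timedelta(days=365)):
    """Return titles whose last edit is older than `max_age`, sorted by title.

    A recent stray edit hides a page from this list, so the result can
    miss pages that are in fact historical.
    """
    return sorted(
        title
        for title, ts in last_edits.items()
        if now - parse_timestamp(ts) > max_age
    )


if __name__ == "__main__":
    # Hypothetical titles and timestamps, for illustration only.
    last_edits = {
        "Draft proposal A": "2020-05-01T12:00:00Z",
        "Draft proposal B": "2023-11-15T08:30:00Z",
    }
    now = datetime(2024, 1, 1, tzinfo=timezone.utc)
    print(stale_pages(last_edits, now))
```

Passing `now` in explicitly keeps the check reproducible, which matters if the list of suspects is reviewed some time after it was generated.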