CharBusters

10 Unicode myths

काचं शक्नोम्यत्तुम् । नोपहिनस्ति माम् ॥
Μπορῶ νὰ φάω σπασμένα γυαλιὰ χωρὶς νὰ πάθω τίποτα 🇬🇷
ᛁᚳ᛫ᛗᚨᚷ᛫ᚷᛚᚨᛋ᛫ᛖᚩᛏᚪᚾ᛫ᚩᚾᛞ᛫ᚻᛁᛏ᛫ᚾᛖ᛫ᚻᛖᚪᚱᛗᛁᚪᚧ᛫ᛗᛖ᛬
⠊⠀⠉⠁⠝⠀⠑⠁⠞⠀⠛⠇⠁⠎⠎⠀⠁⠝⠙⠀⠊⠞⠀⠙⠕⠑⠎⠝⠞⠀⠓⠥⠗⠞⠀⠍⠑
Я можу їсти скло, і воно мені не зашкодить. 🇺🇦
მინას ვჭამ და არა მტკივა. 🇬🇪
Կրնամ ապակի ուտել և ինծի անհանգիստ չըներ։ 🇦🇲
నేను గాజు తినగలను మరియు అలా చేసినా నాకు ఏమి ఇబ్బంది లేదు 🇮🇳
איך קען עסן גלאָז און עס טוט מיר נישט װײ
ᠪᠢ ᠰᠢᠯᠢ ᠢᠳᠡᠶᠦ ᠴᠢᠳᠠᠨᠠ ᠂ ᠨᠠᠳᠤᠷ ᠬᠣᠤᠷᠠᠳᠠᠢ ᠪᠢᠰᠢ 🇲🇳
我能吞下玻璃而不伤身体。🇨🇳
Jag kan äta glas utan att skada mig 🇸🇪

kermitproject.org/utf8.html

https://madhatters.me.uk/2009/07/16/health-warning-3/smoking-kills/

https://en.wikipedia.org/wiki/List_of_Unicode_characters

https://www.dogancanulker.com/noktali-ve-noktasiz-problemi/

I don’t need to worry about Unicode

Myth 0

https://twitter.com/filipvanlaenen/status/1009397273351131136

History of IT

(or how we estimate badly)

Parkinson's Law

Work expands so as to fill the time available for its completion

Hofstadter's Law

It always takes longer than you expect, even when you take into account Hofstadter's Law

https://twitter.com/HPC_Guru/status/850698874457141248

IPv4

1978

640 KiB

Bill Gates IBM, 1981

Y2K38

512K problem

en.wikipedia.org/wiki/Border_Gateway_Protocol#Routing_table_growth
blog.thousandeyes.com/what-is-768k-day

GPS 2019

www.orolia.com/resources/blog/lisa-perdue/2018/gps-2019-week-rollover-what-you-need-know

ASCII

1963

https://en.wikipedia.org/wiki/ASCII

1 character = 1 byte

Myth 1

Let’s talk about "ą"

ą

ISO-8859-2	`B1`
ISO-8859-13	`E0`
ISO-8859-16	`A2`
Windows-1250	`B9`
CP775	`D0`
CP852	`A5`
Mazovia	`86`
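The byte values above can be reproduced with a short sketch. It assumes the JDK at hand ships the named legacy charsets (Mazovia is not among them, so it is left out); the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;

class OgonekBytes {
    // Encodes U+0105 (a with ogonek) in the named charset and returns the bytes as hex.
    static String hex(String charsetName) {
        byte[] bytes = "\u0105".getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] names = {"ISO-8859-2", "windows-1250", "IBM852", "UTF-8"};
        for (String name : names) {
            // Skip charsets this particular JDK does not provide.
            if (Charset.isSupported(name)) {
                System.out.println(name + "\t" + hex(name));
            }
        }
    }
}
```

The same letter ends up as a different byte in every legacy code page, which is why the `charset` declaration matters.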

`Content-type: text/html; charset=utf-8`

🇵🇱

ą	ć	ę	ł	ń	ó	ś	ź	ż
Ą	Ć	Ę	Ł	Ń	Ó	Ś	Ź	Ż

🇨🇿

🇫🇷

🇷🇺

🇩🇪 🇬🇷 🇪🇸

Unicode 1.0

modern [characters], whose number is undoubtedly far below 2¹⁴ = 16 384

1988

Unicode code points

a = U+0061 = 97
å = U+00E5 = 229
ą = U+0105 = 261
鑫 = U+946B = 37995

1 character = 1 char

Myth 2

[...] undoubtedly far below 2¹⁴ = 16 384

1988

🇨🇳

讓我來! 让我来!

https://www.quora.com/How-do-you-say-hold-my-beer-in-Chinese

CJK

88 thousand characters in Unicode 12.0

𝄞

`U+1D11E` (119 070)

Unicode 2.0

1996

a	`U+0061`
ą	`U+0105`
鑫	`U+946B`
𝄞	`U+1D11E`

Correct Java type for one `character` is...?

byte
char
int
String

`String`


						codePointAt(int)         : int
						codePoints()             : IntStream
						codePointCount(int, int) : int
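A minimal sketch of how these three methods might be used together. The class name and the `describe` helper are illustrative only; the string literal is written with escapes so the example does not depend on the source file encoding.

```java
import java.util.stream.Collectors;

class CodePointsDemo {
    // Lists every code point of a string in U+XXXX notation.
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // a, a with ogonek, the CJK ideograph U+946B, and the musical G clef U+1D11E
        String s = "a\u0105\u946B\uD834\uDD1E";
        System.out.println(describe(s));                      // U+0061 U+0105 U+946B U+1D11E
        System.out.println(s.length());                       // 5 (code units)
        System.out.println(s.codePointCount(0, s.length()));  // 4 (code points)
        System.out.println(s.codePointAt(3));                 // 119070
    }
}
```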

Unicode

vs

UTF-*

UTF-7, UTF-8

UTF-16 [BOM | LE | BE]

UTF-32 [BOM | LE | BE]

a

`U+0061`

UTF-8	`61`
UTF-16	`00 61`
UTF-32	`00 00 00 61`

ą

`U+0105`

UTF-8	`C4 85`
UTF-16	`01 05`
UTF-32	`00 00 01 05`

鑫

`U+946B`

UTF-8	`E9 91 AB`
UTF-16	`94 6B`
UTF-32	`00 00 94 6B`

𝄞

`U+1D11E`

UTF-8	`F0 9D 84 9E`
UTF-16	`D8 34 DD 1E`
UTF-32	`00 01 D1 1E`
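The tables above can be regenerated with a small helper. This is a sketch, assuming the big-endian variants without BOM; `UtfBytes` and `hex` are hypothetical names.

```java
import java.nio.charset.Charset;

class UtfBytes {
    // Encodes a single code point in the named charset and returns space-separated hex bytes.
    static String hex(int codePoint, String charsetName) {
        String s = new String(Character.toChars(codePoint));
        byte[] bytes = s.getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int clef = 0x1D11E; // musical symbol G clef
        System.out.println("UTF-8\t" + hex(clef, "UTF-8"));     // F0 9D 84 9E
        System.out.println("UTF-16\t" + hex(clef, "UTF-16BE")); // D8 34 DD 1E
        System.out.println("UTF-32\t" + hex(clef, "UTF-32BE")); // 00 01 D1 1E
    }
}
```

One code point, three encodings, three different byte counts: Unicode assigns the number, UTF-* decides the bytes.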

🤔

`"𝄞".length() == 2`

`"(𝄞)".substring(0, 2)`

(?
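One way to avoid cutting a surrogate pair in half is to count in code points rather than `char` indexes. A minimal sketch, assuming `n` is not negative; the helper name is made up.

```java
class SafePrefix {
    // Returns the first n code points of s, never splitting a surrogate pair.
    static String firstCodePoints(String s, int n) {
        int available = s.codePointCount(0, s.length());
        int end = s.offsetByCodePoints(0, Math.min(n, available));
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        String s = "(\uD834\uDD1E)"; // G clef in parentheses
        // substring(0, 2) ends on a lone high surrogate, typically printed as '?'
        System.out.println(Character.isHighSurrogate(s.substring(0, 2).charAt(1))); // true
        // The code-point-aware version keeps the pair together: 3 code units
        System.out.println(firstCodePoints(s, 2).length()); // 3
    }
}
```

This still works on code points, not on what a user perceives as one character, so combined emoji can be split.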

Surrogate pairs
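The arithmetic behind a surrogate pair can be sketched by hand and checked against `Character.highSurrogate` and `Character.lowSurrogate`. The class and method names here are illustrative.

```java
class Surrogates {
    // Splits a supplementary code point (U+10000..U+10FFFF) into its UTF-16 surrogate pair.
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;                // 20 bits remain
        char high = (char) (0xD800 + (offset >> 10));    // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));   // bottom 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1D11E);
        System.out.println(String.format("%04X %04X", (int) pair[0], (int) pair[1])); // D834 DD1E
    }
}
```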

`var life = "🏭" + "🏖";`


				        StringBuilder rev = new StringBuilder();
				        for (int i = life.length() - 1; i >= 0; i--)
				            rev.append(life.charAt(i));

`var life = "🏭🏖";`

`"?🏭?"`


new StringBuilder("🏭🏖")
          .reverse()
          .toString()
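The two reversal approaches side by side, as a sketch with made-up method names. `StringBuilder.reverse()` treats surrogate pairs as units; it does not know about combining marks or ZWJ sequences, so those can still come out scrambled.

```java
class ReverseDemo {
    // Reverses char by char, which swaps the halves of every surrogate pair.
    static String naiveReverse(String s) {
        StringBuilder rev = new StringBuilder();
        for (int i = s.length() - 1; i >= 0; i--) {
            rev.append(s.charAt(i));
        }
        return rev.toString();
    }

    // Delegates to StringBuilder.reverse(), which keeps surrogate pairs intact.
    static String pairAwareReverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    public static void main(String[] args) {
        String life = "\uD83C\uDFED\uD83C\uDFD6"; // factory + beach with umbrella
        System.out.println(naiveReverse(life));      // lone surrogates around the factory
        System.out.println(pairAwareReverse(life));  // beach, then factory
    }
}
```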

Java is UTF-16

Myth 3

Java 8


    					private final char value[];

Java 9+


						private final byte[] value;

Java 9+


						public int indexOf(int ch, int fromIndex) {
						    return isLatin1() 
						      ? StringLatin1.indexOf(value, ch, fromIndex) 
						      : StringUTF16.indexOf(value, ch, fromIndex);
						}

`String.getBytes()`

blog.thetaphi.de/2012/07/default-locales-default-charsets-and.html
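A sketch of why the no-argument `getBytes()` is risky: it uses the platform default charset, so the result can differ between machines. Passing the charset explicitly removes the guesswork. Names below are illustrative.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharset {
    // Always encodes as UTF-8, regardless of the platform default.
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Simulates a reader that wrongly assumes ISO-8859-1.
    static String decodeAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println("Default charset here: " + Charset.defaultCharset());
        byte[] bytes = utf8("\u0105");               // a with ogonek
        System.out.println(bytes.length);             // 2
        System.out.println(fromUtf8(bytes).length()); // 1
        // Decoding with the wrong charset yields two unrelated characters.
        System.out.println(decodeAsLatin1(bytes).length()); // 2
    }
}
```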

Unicode is unambiguous

Myth 4

`"ą".equals("ą")`

Tiny tail

ą	Latin Small Letter A with Ogonek	U+0105
a	Latin Small Letter A	U+0061
̨	Combining Ogonek	U+0328


Normalizer.normalize("ą", Form.NFKC)

java.text
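A minimal sketch of comparing text after normalization. It uses NFC rather than NFKC, on the assumption that only canonical equivalence is wanted; NFKC additionally folds compatibility characters such as ligatures. The `sameText` helper is hypothetical.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

class NormalizeDemo {
    // Compares two strings after bringing both to the composed form (NFC).
    static boolean sameText(String a, String b) {
        return Normalizer.normalize(a, Form.NFC)
                .equals(Normalizer.normalize(b, Form.NFC));
    }

    public static void main(String[] args) {
        String composed = "\u0105";     // single code point: a with ogonek
        String decomposed = "a\u0328";  // letter a followed by combining ogonek
        System.out.println(composed.equals(decomposed));   // false
        System.out.println(sameText(composed, decomposed)); // true
        System.out.println(Normalizer.normalize(composed, Form.NFD).length()); // 2
    }
}
```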

1 character ≤ 1 int

Myth 5

Let’s talk about emoji

https://www.imdb.com/title/tt4877122/

https://www.dailymail.co.uk/femail/article-4794964/World-s-emoji-translator-ridiculed-Twitter.html

https://twitter.com/sundarpichai/status/924487551372615680

http://curlicuecal.tumblr.com/post/175362924100/an-entomologist-rates-ant-emojis

🇵🇱

🇵+🇱

🇵🇱

🇵	`00 01 F1 F5`
🇱	`00 01 F1 F1`
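A flag is two regional indicator symbols in a row, one per letter of the country code. The sketch below builds one from a two-letter code; `FlagEmoji.flag` is an illustrative name, and whether the result renders as a flag depends on the font.

```java
import java.util.Locale;

class FlagEmoji {
    private static final int REGIONAL_INDICATOR_A = 0x1F1E6;

    // Maps a two-letter country code such as "PL" to its pair of regional indicators.
    static String flag(String countryCode) {
        String cc = countryCode.toUpperCase(Locale.ROOT);
        if (!cc.matches("[A-Z]{2}")) {
            throw new IllegalArgumentException("expected two ASCII letters");
        }
        int first = REGIONAL_INDICATOR_A + (cc.charAt(0) - 'A');
        int second = REGIONAL_INDICATOR_A + (cc.charAt(1) - 'A');
        return new String(new int[] {first, second}, 0, 2);
    }

    public static void main(String[] args) {
        String pl = flag("PL");
        System.out.println(pl);
        System.out.println(pl.codePointCount(0, pl.length())); // 2
        System.out.println(pl.length());                       // 4
    }
}
```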

👧🏽

👧	`00 01 F4 67`
🏽	`00 01 F3 FD`
👧🏽	`00 01 F4 67 00 01 F3 FD`

1 character ≤ 2 ints

Myth 6

👩🏾‍🚀

👩	`00 01 F4 69`
🏾	`00 01 F3 FE`
ZWJ	`00 00 20 0D`
🚀	`00 01 F6 80`

👨‍👩‍👧‍👦

👨	`00 01 F4 68`
ZWJ	`00 00 20 0D`
👩	`00 01 F4 69`
ZWJ	`00 00 20 0D`
👧	`00 01 F4 67`
ZWJ	`00 00 20 0D`
👦	`00 01 F4 66`

https://twitter.com/relizarov/status/1128347860263669761

`String.length()` is useful

Myth 7

`public int length()`

Returns the length [...] equal to the number of Unicode code units in the string.

https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/String.html#length()

Code unit 😱

The minimal bit combination that can represent a unit of encoded text.
[§3.9, D77]


"a".length()	  // 1
"ą".length()	  // 1
"ą".length()	  // 2
"𝄞".length()	  // 2


"👰".length()    // 2
"🇵🇱".length()    // 4
"👩🏾‍🚀".length()    // 7
"👨‍👩‍👧‍👦".length() 	// 11
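The counts above can be broken down per string. This sketch compares UTF-16 code units, code points and UTF-8 bytes; the class and method names are made up, and the emoji are written as escapes.

```java
import java.nio.charset.StandardCharsets;

class Lengths {
    static int codeUnits(String s) {
        return s.length();
    }

    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // woman + medium-dark skin tone + ZWJ + rocket
        String astronaut = "\uD83D\uDC69\uD83C\uDFFE\u200D\uD83D\uDE80";
        // man + ZWJ + woman + ZWJ + girl + ZWJ + boy
        String family = "\uD83D\uDC68\u200D\uD83D\uDC69\u200D\uD83D\uDC67\u200D\uD83D\uDC66";
        System.out.println(codeUnits(astronaut) + " / " + codePoints(astronaut)); // 7 / 4
        System.out.println(codeUnits(family) + " / " + codePoints(family));       // 11 / 7
        System.out.println(utf8Bytes(astronaut));                                  // 15
    }
}
```

None of the three numbers is 1, which is what a user would call the length of either string.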

"T̢̗̮͉͈̠̣͆͆̎͐̌͒͢ȍ̵͑̾͒͂͛̄̔͢҉̡̦͙͎̱̹͍͎͖̪̮̙̪͔̺͕̞̰̤̯m̍ͩ̓͋ͫ̑҉̵̷͓̦̩̭̗̩̫̺e̵̦̫̭̫̬͉̞̪̹̓̆̈́͊̂̃̀͡ǩ̸̴̢̛̫̦̬̪̘̱̖̼̺͕͇͕̞͓̮̭̯ͣ̌̂̏ͨͤͬ͛̏̋̉̀".length()

119


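import java.text.BreakIterator;
import java.util.Locale;

// Counts user-perceived characters (grapheme clusters) rather than code units.
int graphemes(String str) {
   BreakIterator graphemeCounter = BreakIterator
           .getCharacterInstance(Locale.US);
   graphemeCounter.setText(str);
   int graphemeCount = 0;
   // next() returns the following boundary, or DONE once the text is exhausted.
   while (graphemeCounter.next() != BreakIterator.DONE) {
       graphemeCount++;
   }
   return graphemeCount;
}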
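An alternative sketch uses the regex `\X`, which matches one extended grapheme cluster. It assumes Java 9 or newer. How emoji sequences are grouped depends on the Unicode rules the running JDK implements, so only combining marks and surrogate pairs are shown here. The class name is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GraphemeRegex {
    // \X matches a single extended grapheme cluster (Java 9+).
    private static final Pattern CLUSTER = Pattern.compile("\\X");

    static int count(String s) {
        Matcher m = CLUSTER.matcher(s);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(count("abc"));          // 3
        System.out.println(count("a\u0328"));      // 1: letter a + combining ogonek
        System.out.println(count("\uD834\uDD1E")); // 1: one surrogate pair
    }
}
```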

https://developer.twitter.com/en/docs/basics/counting-characters.html

What is a character?

Code point
Code unit
Grapheme cluster
Glyph

UTF

UTF is an [...] mapping from every Unicode code point [...] to a unique byte sequence

Whitespace is straightforward

Myth 8

How many different types of whitespace are there?

All of them:

Space, tab, enter...


						IntStream.rangeClosed(0, 0x10FFFF)
						         .filter(Character::isDefined)
						         .count();

Java	Unicode	`isDefined`
8	6.2	249 698
9/10	8.0	260 253
11	10.0	276 271
12	11.0	276 956


						IntStream.rangeClosed(0, 0x10FFFF)
						         .filter(Character::isWhitespace)
						         .count();

Java	`isWhitespace`
8	26
9-12	25

`Character.isWhitespace()`

25 characters

`String.trim()`

33 characters

`Pattern.compile("\\s")`

6 characters
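The three definitions disagree on individual characters. A minimal sketch with illustrative helper names, checking one code point at a time; the `UNICODE_CHARACTER_CLASS` variant is included to show that `\s` can be widened.

```java
import java.util.regex.Pattern;

class WhitespaceDemo {
    private static final Pattern ASCII_SPACE = Pattern.compile("\\s");
    private static final Pattern UNICODE_SPACE =
            Pattern.compile("\\s", Pattern.UNICODE_CHARACTER_CLASS);

    static String str(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    static boolean javaWhitespace(int cp) {
        return Character.isWhitespace(cp);
    }

    // True when String.trim() would strip this character.
    static boolean trimmed(int cp) {
        return str(cp).trim().isEmpty();
    }

    static boolean regexSpace(int cp) {
        return ASCII_SPACE.matcher(str(cp)).matches();
    }

    static boolean unicodeRegexSpace(int cp) {
        return UNICODE_SPACE.matcher(str(cp)).matches();
    }

    public static void main(String[] args) {
        int[] samples = {0x0020, 0x0000, 0x00A0, 0x2003}; // space, NUL, no-break space, em space
        for (int cp : samples) {
            System.out.println(String.format("U+%04X isWhitespace=%b trim=%b \\s=%b unicode \\s=%b",
                    cp, javaWhitespace(cp), trimmed(cp), regexSpace(cp), unicodeRegexSpace(cp)));
        }
    }
}
```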

Upper case is simple

Myth 9

🇹🇷

🇹+🇷

`"i".toUpperCase(Locale.forLanguageTag("tr-TR"))`

i → İ
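A sketch of the Turkish casing rules next to a locale-neutral conversion. For identifiers, keys and protocol tokens, passing `Locale.ROOT` is a common way to stay independent of the user's locale; the class name is illustrative.

```java
import java.util.Locale;

class TurkishI {
    static final Locale TURKISH = Locale.forLanguageTag("tr-TR");

    public static void main(String[] args) {
        // Turkish has dotted and dotless i as separate letters.
        System.out.println("i".toUpperCase(TURKISH));         // U+0130, capital I with dot above
        System.out.println("I".toLowerCase(TURKISH));         // U+0131, dotless small i
        System.out.println("TITLE".toLowerCase(TURKISH));     // dotless i in the result
        // Locale.ROOT gives the same answer on every machine.
        System.out.println("title".toUpperCase(Locale.ROOT)); // TITLE
    }
}
```

Calling `toUpperCase()` without an argument uses the default locale, which is how the bugs linked on this slide tend to appear on Turkish systems only.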

jira.atlassian.com/browse/CONFSERVER-7168 (Confluence)
blogs.msdn.microsoft.com/anutthara/2005/12/05/avoiding-the-turkish-i-issue/ (.NET)
bz.apache.org/bugzilla/show_bug.cgi?id=38787 (BCEL)
bugzilla.redhat.com/show_bug.cgi?id=1408950 (Fedora)

Unicode is harmless

Myth 10

https://twitter.com/JenMsft/status/1012586276678086656

effective. Power لُلُصّبُلُلصّبُررً ॣ ॣh ॣ ॣ 冗

https://www.businessinsider.com/iphone-unicode-bug-crashes-messages-forces-devices-to-reboot-arabic-2015-5?IR=T

జ్ఞా

https://serhack.me/articles/crash-iphone-telugu-character-en

<⚫️>👈🏻

https://www.macworld.com/article/3271426/iphone-ipad/black-dot-unicode-bug-can-crash-messagesheres-how-to-fix-it.html

`‏‎`

https://blog.infobytesec.com/2018/05/remember-iphone-unicode-bug-android.html

https://www.dogancanulker.com/noktali-ve-noktasiz-problemi/

https://www.theinquirer.net/inquirer/news/1017243/cellphone-localisation-glitch

Zaten sen sıkışınca konuyu değiştiriyorsun.
Ramazan (24 yo)

Zaten sen sikişince konuyu değiştiriyorsun.
Emine (20 yo)

sıkışınca ≠ sikişince

Anyhow, whenever you can't answer an argument, you change the subject.
Ramazan (24 yo)

Anyhow, whenever they are f***ing you, you change the subject.
Emine (20 yo)

Romanization

Zażółć gęślą jaźń

👇

Zazolc gesla jazn
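A common way to strip diacritics is to decompose the text (NFD) and drop the combining marks. The sketch below shows one catch: `ł` has no decomposition, so it needs its own replacement. The method names are made up, and the approach is lossy by design.

```java
import java.text.Normalizer;

class Romanize {
    // Decomposes to NFD, then removes all combining marks (\p{M}).
    static String stripDiacritics(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    // Adds the special case for the Polish l with stroke, which NFD leaves untouched.
    static String romanizePolish(String s) {
        return stripDiacritics(s)
                .replace('\u0142', 'l')
                .replace('\u0141', 'L');
    }

    public static void main(String[] args) {
        String text = "Za\u017C\u00F3\u0142\u0107 g\u0119\u015Bl\u0105 ja\u017A\u0144";
        System.out.println(stripDiacritics(text)); // l with stroke survives
        System.out.println(romanizePolish(text));  // Zazolc gesla jazn
    }
}
```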

Pangram

The quick brown fox jumps over the lazy dog

Jeżu klątw, spłódź Finom część gry hańb

Conclusions

Ińtërnâtiônàlizætión☃⛄️

https://mathiasbynens.be/notes/javascript-unicode

Which encoding is the best?

It depends

UTF-8

References

Thank you!

nurkiewicz.github.io/talks/charbusters


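    import java.io.UnsupportedEncodingException;

    public class Utf32Dump {

        // Prints the astronaut ZWJ sequence, then its UTF-32BE bytes as hex pairs.
        public static void main(String[] args) throws UnsupportedEncodingException {
            // woman + medium-dark skin tone + ZWJ + rocket
            var s = "\uD83D\uDC69\uD83C\uDFFE\u200D\uD83D\uDE80";
            System.out.println(s);
            final byte[] bytes = s.getBytes("UTF-32BE");
            for (byte b : bytes) {
                System.out.print(toHex(b) + " ");
            }
        }

        // Formats one byte as two upper-case hex digits.
        private static String toHex(byte b) {
            final int unsigned = b & 0xFF;
            final String s = Integer.toHexString(unsigned).toUpperCase();
            return s.length() == 1 ? "0" + s : s;
        }
    }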
