काचं शक्नोम्यत्तुम् । नोपहिनस्ति माम् ॥
Μπορῶ νὰ φάω σπασμένα γυαλιὰ χωρὶς νὰ πάθω τίποτα 🇬🇷
ᛁᚳ᛫ᛗᚨᚷ᛫ᚷᛚᚨᛋ᛫ᛖᚩᛏᚪᚾ᛫ᚩᚾᛞ᛫ᚻᛁᛏ᛫ᚾᛖ᛫ᚻᛖᚪᚱᛗᛁᚪᚧ᛫ᛗᛖ᛬
⠊⠀⠉⠁⠝⠀⠑⠁⠞⠀⠛⠇⠁⠎⠎⠀⠁⠝⠙⠀⠊⠞⠀⠙⠕⠑⠎⠝⠞⠀⠓⠥⠗⠞⠀⠍⠑
Я можу їсти скло, і воно мені не зашкодить. 🇺🇦
მინას ვჭამ და არა მტკივა. 🇬🇪
Կրնամ ապակի ուտել և ինծի անհանգիստ չըներ։ 🇦🇲
నేను గాజు తినగలను మరియు అలా చేసినా నాకు ఏమి ఇబ్బంది లేదు 🇮🇳
איך קען עסן גלאָז און עס טוט מיר נישט װײ
ᠪᠢ ᠰᠢᠯᠢ ᠢᠳᠡᠶᠦ ᠴᠢᠳᠠᠨᠠ ᠂ ᠨᠠᠳᠤᠷ ᠬᠣᠤᠷᠠᠳᠠᠢ ᠪᠢᠰᠢ 🇲🇳
我能吞下玻璃而不伤身体。🇨🇳
Jag kan äta glas utan att skada mig 🇸🇪
kermitproject.org/utf8.html
I don’t need to worry about Unicode
Myth 0
History of IT
(or how we estimate badly)
Parkinson's Law
Work expands so as to fill the time available for its completion
Hofstadter's Law
It always takes longer than you expect, even when you take into account Hofstadter's Law
640 KiB
Bill Gates IBM, 1981
1 character = 1 byte
Myth 1
ą
ISO 8859-2 | B1 |
ISO 8859-13 | E0 |
ISO 8859-16 | A2 |
Windows-1250 | B9 |
CP775 | D0 |
CP852 | A5 |
Mazovia | 86 |
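The table above can be reproduced for the charsets a typical JDK ships with. This is a minimal sketch: the class name `OneCharManyBytes` and the helper `hex` are made up for illustration, and only ISO-8859-2, windows-1250 and UTF-8 are used because the other code pages may not be available on every runtime.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class OneCharManyBytes {
    // Hypothetical helper: encodes text in the given charset and
    // renders the resulting bytes as upper-case hex pairs.
    static String hex(String text, Charset charset) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("%02X", b & 0xFF));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String aOgonek = "\u0105"; // Latin small letter a with ogonek
        System.out.println(hex(aOgonek, Charset.forName("ISO-8859-2")));   // B1
        System.out.println(hex(aOgonek, Charset.forName("windows-1250"))); // B9
        System.out.println(hex(aOgonek, StandardCharsets.UTF_8));          // C4 85
    }
}
```

Same letter, three different byte sequences, and in UTF-8 it is no longer a single byte.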
Content-type: text/html; charset=utf-8
🇵🇱
ą |
ć |
ę |
ł |
ń |
ó |
ś |
ź |
ż |
Ą |
Ć |
Ę |
Ł |
Ń |
Ó |
Ś |
Ź |
Ż |
🇨🇿
á |
é |
í |
ó |
ú |
ý |
č |
ď |
ě |
ň |
ř |
š |
ť |
ž |
ů |
Á |
É |
Í |
Ó |
Ú |
Ý |
Č |
Ď |
Ě |
Ň |
Ř |
Š |
Ť |
Ž |
Ů |
🇫🇷
ù |
û |
ü |
ÿ |
à |
â |
æ |
ç |
é |
è |
ê |
ë |
ï |
î |
ô |
œ |
Ù |
Û |
Ü |
Ÿ |
À |
Â |
Æ |
Ç |
É |
È |
Ê |
Ë |
Ï |
Î |
Ô |
Œ |
🇷🇺
а |
б |
в |
г |
д |
е |
ё |
ж |
з |
и |
й |
к |
л |
м |
н |
о |
п |
А |
Б |
В |
Г |
Д |
Е |
Ё |
Ж |
З |
И |
Й |
К |
Л |
М |
Н |
О |
П |
р |
с |
т |
у |
ф |
х |
ц |
ч |
ш |
щ |
ъ |
ы |
ь |
э |
ю |
я |
Р |
С |
Т |
У |
Ф |
Х |
Ц |
Ч |
Ш |
Щ |
Ъ |
Ы |
Ь |
Э |
Ю |
Я |
Unicode 1.0
modern [characters], whose number is undoubtedly far below 2¹⁴ = 16 384
1988
Unicode code points
a = U+0061 = 97
å = U+00E5 = 229
ą = U+0105 = 261
鑫 = U+946B = 37995
1 character = 1 char
Myth 2
[...] undoubtedly far below 2¹⁴ = 16 384
1988
CJK
88 thousand characters in Unicode 12.0
a | U+0061 |
ą | U+0105 |
鑫 | U+946B |
𝄞 | U+1D11E |
Correct Java type for one character
is...?
byte
char
int
String
String
codePointAt(int) : int
codePoints() : IntStream
codePointCount(int, int) : int
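A short sketch of the three `String` methods listed above, applied to a string with one BMP letter and one supplementary character. The class name `CodePointApi` and the helper `codePointsOf` are illustrative only.

```java
public class CodePointApi {
    // Musical symbol G clef, U+1D11E, written as its surrogate pair.
    static final String CLEF = "\uD834\uDD1E";

    // Hypothetical helper: returns the code points of a string as ints.
    static int[] codePointsOf(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        String s = "a" + CLEF;
        System.out.println(s.length());                      // 3 chars (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length())); // 2 code points
        System.out.printf("U+%04X%n", s.codePointAt(1));     // U+1D11E
        for (int cp : codePointsOf(s)) {
            System.out.printf("U+%04X ", cp);                // U+0061 U+1D11E
        }
        System.out.println();
    }
}
```

Note that the index passed to `codePointAt` is still a `char` index, not a code point index.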
Unicode
vs
UTF-*
UTF-7, UTF-8
UTF-16 [BOM | LE | BE]
UTF-32 [BOM | LE | BE]
a
U+0061
UTF-8 |
61 |
UTF-16 |
00 61 |
UTF-32 |
00 00 00 61 |
ą
U+0105
UTF-8 |
C4 85 |
UTF-16 |
01 05 |
UTF-32 |
00 00 01 05 |
鑫
U+946B
UTF-8 |
E9 91 AB |
UTF-16 |
94 6B |
UTF-32 |
00 00 94 6B |
𝄞
U+1D11E
UTF-8 |
F0 9D 84 9E |
UTF-16 |
D8 34 DD 1E |
UTF-32 |
00 01 D1 1E |
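The byte sequences in these tables can be checked from Java. A minimal sketch, assuming the runtime provides the `UTF-32BE` charset (standard JDKs do); `UtfEncodings` and `hex` are names invented for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class UtfEncodings {
    // Hypothetical helper: hex dump of a string in the given encoding.
    static String hex(String text, Charset charset) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("%02X", b & 0xFF));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String clef = "\uD834\uDD1E"; // U+1D11E
        System.out.println(hex(clef, StandardCharsets.UTF_8));        // F0 9D 84 9E
        System.out.println(hex(clef, StandardCharsets.UTF_16BE));     // D8 34 DD 1E
        System.out.println(hex(clef, Charset.forName("UTF-32BE")));   // 00 01 D1 1E
    }
}
```

The big-endian variants are used so that no byte order mark is prepended to the output.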
// Naive reversal, one char (UTF-16 code unit) at a time
StringBuilder rev = new StringBuilder();
for (int i = life.length() - 1; i >= 0; i--)
    rev.append(life.charAt(i));
new StringBuilder("🏭🏖")
.reverse()
.toString()
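The difference between the two reversals above shows up as soon as the input contains a surrogate pair. The sketch below is illustrative (class and method names are made up): the `char` loop swaps the two halves of the pair, while a code point loop, like `StringBuilder.reverse()`, keeps them together.

```java
public class ReverseDemo {
    // Reverses UTF-16 code units; surrogate pairs come out in the wrong order.
    static String naiveReverse(String s) {
        StringBuilder rev = new StringBuilder();
        for (int i = s.length() - 1; i >= 0; i--) {
            rev.append(s.charAt(i));
        }
        return rev.toString();
    }

    // Reverses whole code points, so each surrogate pair stays intact.
    static String reverseByCodePoint(String s) {
        int[] cps = s.codePoints().toArray();
        StringBuilder rev = new StringBuilder();
        for (int i = cps.length - 1; i >= 0; i--) {
            rev.appendCodePoint(cps[i]);
        }
        return rev.toString();
    }

    public static void main(String[] args) {
        String s = "a\uD834\uDD1E"; // "a" followed by U+1D11E
        System.out.println(naiveReverse(s).equals("\uDD1E\uD834a"));       // true: broken pair
        System.out.println(reverseByCodePoint(s).equals("\uD834\uDD1Ea")); // true
        System.out.println(new StringBuilder(s).reverse().toString()
                .equals(reverseByCodePoint(s)));                           // true
    }
}
```

Neither approach keeps combining marks, flags or ZWJ sequences together; that would require reversing grapheme clusters.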
Java 8
private final char value[];
Java 9+
private final byte[] value;
Java 9+
public int indexOf(int ch, int fromIndex) {
    return isLatin1()
        ? StringLatin1.indexOf(value, ch, fromIndex)
        : StringUTF16.indexOf(value, ch, fromIndex);
}
Unicode is unambiguous
Myth 4
Tiny tail
ą | Latin Small Letter A with Ogonek | U+0105 |
a | Latin Small Letter A | U+0061 |
̨ | Combining Ogonek | U+0328 |
Normalizer.normalize("ą", Form.NFKC)
java.text
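A small sketch of what normalization buys: the precomposed and decomposed spellings of ą are different strings until both are brought to the same normalization form. The class name `NormalizeDemo` and the helper `sameText` are assumptions for this example, and NFC is used here as one reasonable choice of form.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class NormalizeDemo {
    // Hypothetical helper: compares two strings after NFC normalization.
    static boolean sameText(String a, String b) {
        return Normalizer.normalize(a, Form.NFC)
                .equals(Normalizer.normalize(b, Form.NFC));
    }

    public static void main(String[] args) {
        String precomposed = "\u0105"; // single code point
        String decomposed = "a\u0328"; // letter a + combining ogonek
        System.out.println(precomposed.equals(decomposed));   // false
        System.out.println(sameText(precomposed, decomposed)); // true
        // NFD splits the precomposed letter into two code points.
        System.out.println(Normalizer.normalize(precomposed, Form.NFD).length()); // 2
    }
}
```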
1 character ≤ 1 int
Myth 5
🇵🇱
🇵 | 00 01 F1 F5 |
🇱 | 00 01 F1 F1 |
👧 | 00 01 F4 67 |
🏾 | 00 01 F3 FE |
👧🏾 | 00 01 F4 67 00 01 F3 FE |
1 character ≤ 2 ints
Myth 6
👩 | 00 01 F4 69 |
🏾 | 00 01 F3 FE |
ZWJ | 00 00 20 0D |
🚀 | 00 01 F6 80 |
👨 | 00 01 F4 68 |
ZWJ | 00 00 20 0D |
👩 | 00 01 F4 69 |
ZWJ | 00 00 20 0D |
👧 | 00 01 F4 67 |
ZWJ | 00 00 20 0D |
👦 | 00 01 F4 66 |
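The sequences above can be built from their code points and inspected. A sketch with made-up names (`EmojiSequences`, `of`, `describe`); the strings are assembled from numeric code points so that nothing depends on how the source file is encoded.

```java
import java.util.stream.Collectors;

public class EmojiSequences {
    // Hypothetical helper: builds a string from a list of code points.
    static String of(int... codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    // Hypothetical helper: lists the code points of a string as U+XXXX.
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String flag = of(0x1F1F5, 0x1F1F1);                    // regional indicators P + L
        String astronaut = of(0x1F469, 0x1F3FE, 0x200D, 0x1F680);
        String family = of(0x1F468, 0x200D, 0x1F469, 0x200D,
                           0x1F467, 0x200D, 0x1F466);
        for (String s : new String[] {flag, astronaut, family}) {
            System.out.println(describe(s)
                    + " -> " + s.codePointCount(0, s.length()) + " code points, "
                    + s.length() + " chars");
        }
    }
}
```

Each of the three strings is drawn as one symbol on a platform that supports it, yet they hold 2, 4 and 7 code points.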
String.length()
is useful
Myth 7
Code unit 😱
The minimal bit combination that can represent a unit of encoded text.
"a".length() // 1
"ą".length() // 1
"a\u0328".length() // 2 (a + combining ogonek)
"𝄞".length() // 2
"👰".length() // 2
"🇵🇱".length() // 4
"👩🏾🚀".length() // 7
"👨👩👧👦".length() // 11
"T̢̗̮͉͈̠̣͆͆̎͐̌͒͢ȍ̵͑̾͒͂͛̄̔͢҉̡̦͙͎̱̹͍͎͖̪̮̙̪͔̺͕̞̰̤̯m̍ͩ̓͋ͫ̑҉̵̷͓̦̩̭̗̩̫̺e̵̦̫̭̫̬͉̞̪̹̓̆̈́͊̂̃̀͡ǩ̸̴̢̛̫̦̬̪̘̱̖̼̺͕͇͕̞͓̮̭̯ͣ̌̂̏ͨͤͬ͛̏̋̉̀".length()
119
import java.text.BreakIterator;
import java.util.Locale;

// Counts user-perceived characters (grapheme clusters) instead of chars.
int graphemes(String str) {
    BreakIterator graphemeCounter = BreakIterator
            .getCharacterInstance(Locale.US);
    graphemeCounter.setText(str);
    int graphemeCount = 0;
    while (graphemeCounter.next() != BreakIterator.DONE) {
        graphemeCount++;
    }
    return graphemeCount;
}
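An alternative sketch for counting graphemes uses the regex construct `\X` (extended grapheme cluster), which `java.util.regex.Pattern` supports from Java 9 on. The class name `GraphemeRegex` is illustrative, and results for emoji sequences are assumed to depend on the Unicode data bundled with the JDK in use, so only simple cases are shown.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GraphemeRegex {
    // \X matches one extended grapheme cluster (Java 9 and later).
    private static final Pattern GRAPHEME = Pattern.compile("\\X");

    // Counts grapheme clusters by counting successive \X matches.
    static int graphemes(String s) {
        Matcher m = GRAPHEME.matcher(s);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(graphemes("ab"));        // 2
        System.out.println(graphemes("a\u0328"));   // 1: letter plus combining ogonek
        System.out.println("a\u0328".length());     // 2 chars for the same text
    }
}
```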
What is a character?
- Code point
- Code unit
- Grapheme cluster
- Glyph
UTF
UTF is an [...] mapping from every Unicode code point [...] to a unique byte sequence
Whitespace is straightforward
Myth 8
How many different kinds of whitespace are there?
IntStream.rangeClosed(0, 0x10FFFF)
    .filter(Character::isDefined)
    .count();
Java | Unicode | isDefined |
8 | 6.2 | 249 698 |
9/10 | 8.0 | 260 253 |
11 | 10.0 | 276 271 |
12 | 11.0 | 276 956 |
IntStream.rangeClosed(0, 0x10FFFF)
    .filter(Character::isWhitespace)
    .count();
Java | isWhitespace |
8 | 26 |
9-12 | 25 |
Character.isWhitespace()
25 characters
String.trim()
33 characters (U+0000 through U+0020)
Pattern.compile("\\s")
6 characters
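The three definitions can be compared by brute force over every code point. A sketch with invented names (`WhitespaceKinds`, `count`, `removedByTrim`, `matchedByRegex`); it assumes Java 11 or later because of `String.strip()`, and the `isWhitespace` total it prints depends on the JDK's Unicode version.

```java
import java.util.function.IntPredicate;
import java.util.regex.Pattern;
import java.util.stream.IntStream;

public class WhitespaceKinds {
    private static final Pattern REGEX_SPACE = Pattern.compile("\\s");

    // Counts the code points accepted by the given whitespace definition.
    static long count(IntPredicate isSpace) {
        return IntStream.rangeClosed(0, 0x10FFFF).filter(isSpace).count();
    }

    // True if String.trim() strips this code point from the front of a string.
    static boolean removedByTrim(int cp) {
        return (new String(Character.toChars(cp)) + "x").trim().equals("x");
    }

    // True if the default (non-Unicode) \s class matches this code point.
    static boolean matchedByRegex(int cp) {
        return REGEX_SPACE.matcher(new String(Character.toChars(cp))).matches();
    }

    public static void main(String[] args) {
        System.out.println(count(Character::isWhitespace));          // varies with JDK version
        System.out.println(count(WhitespaceKinds::removedByTrim));   // 33
        System.out.println(count(WhitespaceKinds::matchedByRegex));  // 6

        String emSpace = "\u2003";
        System.out.println((emSpace + "x").trim().length());   // 2: trim() leaves the em space
        System.out.println((emSpace + "x").strip().length());  // 1: strip() removes it
        System.out.println(Character.isWhitespace(0x00A0));    // false: no-break space
    }
}
```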
Upper case is simple
Myth 9
"i".toUpperCase(tr_TR)
i → İ
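The `tr_TR` shorthand above corresponds to a Turkish `Locale`. A minimal sketch (the class name `TurkishCase` is made up) comparing it with the locale-neutral `Locale.ROOT`:

```java
import java.util.Locale;

public class TurkishCase {
    static final Locale TURKISH = Locale.forLanguageTag("tr-TR");

    public static void main(String[] args) {
        // Turkish has dotted and dotless i as separate letters.
        System.out.println("i".toUpperCase(TURKISH));      // İ (U+0130)
        System.out.println("i".toUpperCase(Locale.ROOT));  // I
        System.out.println("TITLE".toLowerCase(TURKISH));  // tıtle (with U+0131)
        System.out.println("TITLE".toLowerCase(Locale.ROOT)); // title
    }
}
```

The no-argument `toUpperCase()` and `toLowerCase()` use the default locale, so code that compares identifiers or protocol keywords is usually safer with an explicit `Locale.ROOT`.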
Unicode is harmless
Myth 10
Zaten sen sıkışınca konuyu değiştiriyorsun.
Zaten sen sikişince konuyu değiştiriyorsun.
sıkışınca ≠ sikişince
Anyhow, whenever you can't answer an argument, you change the subject.
Anyhow, whenever they are f***ing you, you change the subject.
Romanization
Zażółć gęślą jaźń
👇
Zazolc gesla jazn
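One common way to get from the first line to the second is to decompose the text and drop the combining marks. This is a sketch, not a complete romanization scheme: `Romanize` and `stripDiacritics` are hypothetical names, and ł needs a special case because it has no canonical decomposition.

```java
import java.text.Normalizer;

public class Romanize {
    // Decomposes to NFD, removes combining marks, then maps the Polish
    // l with stroke by hand since NFD leaves it unchanged.
    static String stripDiacritics(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "")
                .replace('\u0142', 'l')
                .replace('\u0141', 'L');
    }

    public static void main(String[] args) {
        // The Polish pangram from the slide, written with escapes.
        String text = "Za\u017C\u00F3\u0142\u0107 g\u0119\u015Bl\u0105 ja\u017A\u0144";
        System.out.println(stripDiacritics(text)); // Zazolc gesla jazn
    }
}
```

As the Turkish example earlier shows, dropping diacritics can change the meaning, so this is best kept for search keys or file names rather than displayed text.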
Pangram
The quick brown fox jumps over the lazy dog
Jeżu klątw, spłódź Finom część gry hańb
Which encoding is the best?
It depends
UTF-8
import java.io.UnsupportedEncodingException;

public class Utf32Dump {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Woman, medium-dark skin tone, ZWJ, rocket: one glyph, four code points.
        var s = "\uD83D\uDC69\uD83C\uDFFE\u200D\uD83D\uDE80";
        System.out.println(s);
        final byte[] bytes = s.getBytes("UTF-32BE");
        for (byte b : bytes) {
            System.out.print(toHex(b) + " ");
        }
    }

    // Formats one byte as two upper-case hex digits.
    private static String toHex(byte b) {
        final int unsigned = b & 0xFF;
        final String s = Integer.toHexString(unsigned).toUpperCase();
        return s.length() == 1 ? "0" + s : s;
    }
}