Track Awesome Unicode Updates Weekly
😂 👌 A curated list of delightful Unicode tidbits, packages and resources.
😺 jagracey/Awesome-Unicode · ⭐ 809 · 🏷️ Miscellaneous
Jul 04 - Jul 10, 2016
💥 Lowercase Transformation Collisions / Wait a second... what did I just read?
- String length is typically determined by counting code units or code points, not user-perceived characters. In UTF-16 based languages such as JavaScript, this means that a surrogate pair counts as two characters. Multiple combining diacritics may also be stacked over the same base character: a + ̈ = ä increases the length while producing only a single visible character.
- Similarly, reversing strings is often a non-trivial task. Again, surrogate pairs must stay together, and combining diacritics must stay attached to their base characters. ESReverser (⭐859) provides a pretty good solution.
- Upper and lower case mappings are not always one-to-one. They can also be:
  - One-to-many: (ß → SS)
  - Contextual: (…Σ ↔ …ς AND …ΣΤ… ↔ …στ…)
  - Locale-sensitive: (I ↔ ı AND İ ↔ i)
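The length and reversal pitfalls above can be sketched in a few lines of Python. This is a minimal illustration, not a full implementation: the helper names (`utf16_units`, `grapheme_clusters`, `reverse_text`) are made up for this example, and the clustering only attaches combining marks to the preceding character. Complete segmentation (emoji sequences, Hangul, and so on) follows the rules of UAX #29 and needs a dedicated library.

```python
import unicodedata


def utf16_units(s: str) -> int:
    """Number of UTF-16 code units, i.e. what JavaScript's .length reports."""
    return len(s.encode("utf-16-le")) // 2


def grapheme_clusters(s: str) -> list[str]:
    """Approximate clusters: a base character plus any following combining marks."""
    clusters: list[str] = []
    for ch in s:
        # unicodedata.combining() returns a non-zero class for combining marks.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def reverse_text(s: str) -> str:
    """Reverse by cluster so diacritics stay on their base characters."""
    return "".join(reversed(grapheme_clusters(s)))


a_diaeresis = "a\u0308"   # 'a' followed by COMBINING DIAERESIS
joy = "\U0001F602"        # FACE WITH TEARS OF JOY, outside the BMP

print(len(a_diaeresis))                     # 2 code points
print(len(grapheme_clusters(a_diaeresis)))  # 1 visible character
print(len(joy), utf16_units(joy))           # 1 code point, 2 UTF-16 code units

word = "man\u0303ana"                       # "mañana" with a combining tilde
print(word[::-1] == "ana\u0303nam")         # True: naive reversal moves the tilde onto 'a'
print(reverse_text(word) == "anan\u0303am") # True: the tilde stays on 'n'
```

Python's own `len()` counts code points, so the surrogate-pair problem only shows up once the text is measured in UTF-16 code units. The combining-mark problem appears in both cases.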
Unicode Blocks / Wait a second... what did I just read?
- Version 9.0.0 (Latest Version, June 2016 - adds exactly 7,500 characters)
Jun 13 - Jun 19, 2016
One-To-Many Case Mappings / Wait a second... what did I just read?
- python-ftfy (⭐3.3k) - Given Unicode text, make its representation consistent and possibly less broken.
- vim-troll-stopper (⭐166) - Stop Unicode trolls from messing with your code.
Recursive HTML Tag Renaming Script / Wait a second... what did I just read?
May 30 - Jun 05, 2016
Myths of Unicode
- Unicode is simply a 16-bit code - Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. It is the single most common myth about Unicode, so if you thought that, don't feel bad.
- You can use any unassigned codepoint for internal use - No. Eventually that hole will be filled with a different character. Instead use private use or noncharacters.
- Every Unicode code point represents a character - No. There are lots of noncharacters (FFFE, FFFF, 1FFFE,…). There are also surrogate code points, private-use and unassigned code points, and control/format “characters” (RLM, ZWNJ,…).
- Unicode will run out of space - If it were linear, we would run out in 2140 AD. But it isn't linear. See http://www.unicode.org/roadmaps/
- Case mappings are 1-1 - No. They can also be:
  - One-to-many: (ß → SS)
  - Contextual: (…Σ ↔ …ς AND …ΣΤ… ↔ …στ…)
  - Locale-sensitive: (I ↔ ı AND İ ↔ i)
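Several of the myths above can be checked directly from a Python prompt. The snippet below is a small sketch using only the standard library; the variable names are illustrative. Note that Python's `str.upper()` and `str.lower()` are locale-independent, so the Turkish mapping (I ↔ ı) is shown here only by its absence.

```python
import unicodedata

# Myth: Unicode is a 16-bit code.
joy = "\U0001F602"                       # FACE WITH TEARS OF JOY
beyond_16_bits = ord(joy) > 0xFFFF       # True: the code point does not fit in 16 bits
utf16_bytes = len(joy.encode("utf-16-le"))  # 4 bytes, i.e. a surrogate pair

# Myth: every code point represents a character.
# General categories: Cn = unassigned/noncharacter, Cs = surrogate,
# Co = private use, Cf = format control.
categories = {
    f"U+{cp:04X}": unicodedata.category(chr(cp))
    for cp in (0xFFFE, 0xD800, 0xE000, 0x200F, 0x200C)
}

# Myth: case mappings are 1-1.
one_to_many = "ß".upper()                # 'SS'
round_trip = "ß".upper().lower()         # 'ss', the original 'ß' is not recovered
final_sigma = "ΟΔΟΣ".lower()             # word-final Σ becomes ς
medial_sigma = "ΟΣΤΑ".lower()            # Σ inside a word becomes σ
dotted_i = "İ".lower()                   # 'i' + COMBINING DOT ABOVE, two code points
untailored_i = "I".lower()               # 'i', never Turkish 'ı' without locale tailoring

print(beyond_16_bits, utf16_bytes)
print(categories)
print(one_to_many, round_trip, final_sigma, medial_sigma, len(dotted_i), untailored_i)
```

Locale-sensitive mappings such as Turkish I ↔ ı need a library that implements language tailoring; the built-in string methods apply only the default, locale-neutral rules.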
One-To-Many Case Mappings / Wait a second... what did I just read?
- PhantomScript (⭐39) - 👻 🔦 Invisible JavaScript code execution & social engineering
- ESReverser (⭐859) - A Unicode-aware string reverser written in JavaScript.
- mimic (⭐3.7k) - [ab]using Unicode to create tragedy
- Emojipedia - Information about specific emoji, news blog.
- emojitracker - Realtime emoji use on Twitter.
- World Translation Foundation - A way to promote, explore, and translate the written word into the pictorial alphabet of Emoji.
- Can I Emoji? - Displays the current status of native Emoji support across iOS, Android and Windows.
Recursive HTML Tag Renaming Script / Wait a second... what did I just read?
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets - By Joel Spolsky
- Space Yourself - Smashing Magazine's Spacing Guide
- Shapecatcher - Draw the character you're looking for.
Unicode Blocks / Wait a second... what did I just read?
- Universal repertoire - Every writing system ever used shall be respected and represented in the standard
- Logical order - In bidirectional text, the characters are stored in logical order, not in a way that the representation
- Efficiency - The documentation must be efficient and complete.
- Unification - Where different cultures or languages use the same character, it shall be only included once. This point is
- Characters, not glyphs - Only characters, not glyphs shall be encoded. In a nutshell, glyphs are the actual graphical
- Dynamic composition - New characters can be composed of other, already standardized characters. For example, the character “Ä” can be composed of an “A” and a dieresis sign (“ ¨ ”).
- Semantics - Included characters must be well defined and distinguished from others.
- Stability - Once defined, characters shall never be removed or their codepoints reassigned. In the case of an error, a codepoint shall be deprecated.
- Plain Text - Characters in the standard are text and never mark-up or metacharacters.
- Convertibility - Every other used encoding shall be representable in terms of a Unicode encoding.
- Version 5.0.0 (unavailable)
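The "Dynamic composition" principle listed above means the same visible character can have more than one code point sequence. A minimal sketch with Python's standard `unicodedata` module shows how normalization reconciles the two spellings of “Ä”; the variable names are illustrative.

```python
import unicodedata

decomposed = "A\u0308"    # LATIN CAPITAL LETTER A + COMBINING DIAERESIS
precomposed = "\u00C4"    # LATIN CAPITAL LETTER A WITH DIAERESIS

# The two strings render the same but compare unequal code point by code point.
print(decomposed == precomposed)   # False
print(len(decomposed), len(precomposed))   # 2 1

# NFC composes where a precomposed form exists; NFD decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
print(nfc == precomposed)   # True
print(nfd == decomposed)    # True
```

Normalizing both sides to the same form (NFC or NFD) before comparing, hashing, or storing text is the usual way to avoid mismatches between composed and decomposed input.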