Question 1

What is a Unicode codepoint?

Accepted Answer

A codepoint is a number assigned to a character in the Unicode standard. It is written as U+ followed by four to six hexadecimal digits (e.g. U+0041 for "A", U+1F600 for 😀). A codepoint is not the same as a byte: codepoints above U+007F take two to four bytes in UTF-8. JavaScript strings are sequences of UTF-16 code units, so emoji and other characters above U+FFFF are each represented as a surrogate pair (two code units).
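The difference between a codepoint, its UTF-8 bytes, and its UTF-16 code units can be checked with a few lines of Python. This is only a sketch; the helper name `describe` is made up for illustration and is not part of the tool.

```python
def describe(ch: str) -> dict:
    """Return the codepoint and encoded sizes for a single-character string."""
    return {
        "codepoint": f"U+{ord(ch):04X}",
        "utf8_bytes": len(ch.encode("utf-8")),
        # Each UTF-16 code unit is 2 bytes; 2 units means a surrogate pair.
        "utf16_units": len(ch.encode("utf-16-le")) // 2,
    }


# "A", "é" (precomposed), "世", and the grinning-face emoji.
for ch in ("A", "\u00e9", "\u4e16", "\U0001F600"):
    print(describe(ch))
```

Only the last character needs four UTF-8 bytes and two UTF-16 code units, which is why JavaScript reports a length of 2 for it while Python's `len` (which counts codepoints) reports 1.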

Question 2

What are zero-width characters?

Accepted Answer

Zero-width characters are codepoints that have no visible glyph but affect text rendering or processing. Examples: U+200B (zero-width space), U+200C (zero-width non-joiner), U+200D (zero-width joiner, used in emoji sequences), U+FEFF (byte-order mark / zero-width no-break space). They are sometimes used to hide text in phishing attacks or to fingerprint documents.
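A minimal way to spot these characters yourself, assuming you only care about the four codepoints listed above, is a lookup table. The names `ZERO_WIDTH` and `find_zero_width` are hypothetical and chosen for this sketch.

```python
from __future__ import annotations

# Only the four codepoints named in the answer; a real scanner would likely cover more.
ZERO_WIDTH = {
    "\u200b": "ZERO WIDTH SPACE",
    "\u200c": "ZERO WIDTH NON-JOINER",
    "\u200d": "ZERO WIDTH JOINER",
    "\ufeff": "ZERO WIDTH NO-BREAK SPACE (BOM)",
}


def find_zero_width(text: str) -> list[tuple[int, str, str]]:
    """Return (index, codepoint, name) for every zero-width character in text."""
    return [
        (i, f"U+{ord(ch):04X}", ZERO_WIDTH[ch])
        for i, ch in enumerate(text)
        if ch in ZERO_WIDTH
    ]


sample = "pay\u200bpal"  # renders like "paypal" but contains a hidden U+200B
print(find_zero_width(sample))  # [(3, 'U+200B', 'ZERO WIDTH SPACE')]
```

Note that U+200D is legitimate inside emoji sequences, so a hit is a prompt to look closer rather than proof of tampering.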

Question 3

What are confusable characters?

Accepted Answer

Confusable characters are codepoints, typically from non-Latin scripts, that look identical or very similar to ASCII characters. For example, Cyrillic "а" (U+0430) looks like Latin "a" (U+0061), and Cyrillic "о" (U+043E) looks like Latin "o" (U+006F). This is the basis of IDN homograph attacks: an attacker registers a domain like "pаypal.com", with a Cyrillic "а", that looks identical to the real domain.
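One rough way to catch a homograph like this is to check whether a label mixes scripts. The sketch below is a heuristic, not the tool's method: Python's standard library does not expose the Unicode Script property, so it takes the first word of each character's Unicode name ("LATIN", "CYRILLIC", ...) as a stand-in. The function names are invented for the example.

```python
from __future__ import annotations

import unicodedata


def scripts_used(label: str) -> set[str]:
    """Approximate the set of scripts in label from Unicode character names."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            # Heuristic: "CYRILLIC SMALL LETTER A" -> "CYRILLIC".
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def looks_mixed(label: str) -> bool:
    """True when letters from more than one script appear in the same label."""
    return len(scripts_used(label)) > 1


print(looks_mixed("paypal"))       # False
print(looks_mixed("p\u0430ypal"))  # True: second letter is Cyrillic U+0430
```

A label written entirely in one non-Latin script passes this check, so it flags mixing only; it does not consult any confusables table.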

Question 4

What is a bidi override character?

Accepted Answer

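Bidirectional control characters change the direction in which text is rendered. The range U+202A–U+202E covers the embeddings (U+202A, U+202B), the pop directional formatting character (U+202C), and the overrides proper (U+202D left-to-right, U+202E right-to-left). They can be used to make malicious code look like a comment, or to reverse the apparent order of characters in a filename. The "Trojan Source" attack uses these characters, along with the directional isolates U+2066–U+2069, to hide malicious code in source files so that it looks harmless to reviewers.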
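Scanning a source file for these characters is straightforward. The following is a sketch, with hypothetical names, that reports the line and column of each bidi control. Besides U+202A–U+202E it also flags the directional isolates U+2066–U+2069, which Trojan Source-style attacks can use as well.

```python
from __future__ import annotations

# Standard abbreviations for the bidi formatting characters.
BIDI_CONTROLS = {
    "\u202a": "LRE", "\u202b": "RLE", "\u202c": "PDF",
    "\u202d": "LRO", "\u202e": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}


def find_bidi_controls(source: str) -> list[tuple[int, int, str]]:
    """Return (line, column, abbreviation) for each bidi control, 1-indexed."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((lineno, col, BIDI_CONTROLS[ch]))
    return hits


sample = "a\u202eb\nc\u2066d"
print(find_bidi_controls(sample))  # [(1, 2, 'RLO'), (2, 2, 'LRI')]
```

Right-to-left text can contain some of these characters legitimately, so in practice a check like this is usually limited to source code and filenames.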

Question 5

Why does my string have more codepoints than characters?

Accepted Answer

Some visible "characters" are composed of multiple codepoints. A skin tone emoji is a base emoji followed by a modifier (U+1F3FB–U+1F3FF), a flag emoji is usually a pair of regional indicator symbols, and family emoji are several emoji joined by zero-width joiners (U+200D); such sequences commonly run from 2 to 7 codepoints. Accented characters can be either a single precomposed codepoint (NFC) or a base character plus a combining diacritic (NFD). This tool counts codepoints, not grapheme clusters.
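You can reproduce the mismatch with Python's standard `unicodedata` module. This is an illustrative sketch; the `codepoints` helper is not part of the tool.

```python
from __future__ import annotations

import unicodedata


def codepoints(text: str) -> list[str]:
    """List the codepoints of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


# The same visible "é" in both normalization forms.
nfc = unicodedata.normalize("NFC", "e\u0301")
nfd = unicodedata.normalize("NFD", "\u00e9")
print(codepoints(nfc))  # ['U+00E9']
print(codepoints(nfd))  # ['U+0065', 'U+0301']

# One visible glyph each, several codepoints underneath.
family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"  # man, ZWJ, woman, ZWJ, girl
flag = "\U0001F1EF\U0001F1F5"                          # two regional indicators
thumbs = "\U0001F44D\U0001F3FD"                        # thumbs up + skin tone modifier
print(len(family), len(flag), len(thumbs))  # 5 2 2
```

Python's `len` counts codepoints, which matches what this tool reports. Counting grapheme clusters instead needs segmentation rules that the standard library does not provide.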

#	Char	Hex	Block / category	Flags
1	H	U+0048	Basic Latin·Letter
2	e	U+0065	Basic Latin·Letter
3	l	U+006C	Basic Latin·Letter
4	l	U+006C	Basic Latin·Letter
5	o	U+006F	Basic Latin·Letter
6	,	U+002C	Basic Latin·Symbol
7	SPACE	U+0020	Basic Latin·Whitespace
8	世	U+4E16	CJK Unified Ideographs·Letter
9	界	U+754C	CJK Unified Ideographs·Letter
10	SPACE	U+0020	Basic Latin·Whitespace
11	🌍	U+1F30D	Emoji·Symbol
12	SPACE	U+0020	Basic Latin·Whitespace
13	—	U+2014	General Punctuation·Punctuation
14	SPACE	U+0020	Basic Latin·Whitespace
15	р	U+0440	Cyrillic·Letter	confusable
16	a	U+0061	Basic Latin·Letter
17	y	U+0079	Basic Latin·Letter
18	p	U+0070	Basic Latin·Letter
19	a	U+0061	Basic Latin·Letter
20	l	U+006C	Basic Latin·Letter
21	SPACE	U+0020	Basic Latin·Whitespace
22	v	U+0076	Basic Latin·Letter
23	s	U+0073	Basic Latin·Letter
24	SPACE	U+0020	Basic Latin·Whitespace
25	p	U+0070	Basic Latin·Letter
26	a	U+0061	Basic Latin·Letter
27	y	U+0079	Basic Latin·Letter
28	p	U+0070	Basic Latin·Letter
29	a	U+0061	Basic Latin·Letter
30	l	U+006C	Basic Latin·Letter

Unicode Codepoint Inspector
