2026/01/06

Understanding Zero-Width Characters: A Complete Guide

Learn everything about zero-width characters (ZWSP, ZWJ, ZWNJ, WJ) - what they are, how they work, their legitimate uses, and why they appear in AI-generated text. Complete guide with examples and detection methods.

Have you ever copied text from ChatGPT or another AI tool and noticed something strange? Maybe your code didn't work as expected, or a regex pattern failed to match, even though the text looked perfectly normal? You're not alone. I've been there too, and it took me a while to figure out what was going on.

The culprit? Zero-width characters - invisible Unicode characters that don't take up any visual space but can cause all sorts of problems. These characters are officially defined in the Unicode Standard, maintained by the Unicode Consortium, and they serve legitimate purposes in typography, linguistics, and text processing. However, they can also be used for watermarking AI-generated content, which is why you might encounter them in text from AI tools.

What Are Zero-Width Characters?

Zero-width characters are special Unicode characters that have zero visual width - meaning they don't display anything when you look at text, but they still exist in the character sequence. Think of them as invisible markers that can affect how text is processed, displayed, or interpreted by software.

These characters are part of the official Unicode Standard, which is the international standard for text encoding. They were originally designed for legitimate typographic and linguistic purposes, such as:

  • Complex script handling: Languages like Arabic, Persian, and Thai use these characters for proper text rendering
  • Emoji sequences: Combining multiple emoji into complex sequences (like family emoji)
  • Typography control: Preventing unwanted line breaks or controlling text flow
  • Linguistic processing: Handling word boundaries in languages without spaces

However, because they're invisible and can be embedded in text without affecting its appearance, they've also been adopted for other purposes, including watermarking AI-generated content.

Types of Zero-Width Characters

There are several types of zero-width characters, each with its own specific purpose and Unicode code point. Let's break down the most common ones:

Type | Name | Unicode | Description | Common Uses
-----|------|---------|-------------|------------
ZWSP | Zero Width Space | U+200B | An invisible character with zero width, defined in Unicode Standard for word separation in scripts like Thai. Can appear in text through various means. | Word separation in Thai, watermarking, text processing
ZWJ | Zero Width Joiner | U+200D | A non-printing character defined in Unicode Standard that joins adjacent characters, commonly used in complex scripts and emoji sequences (see Unicode Emoji Standard). | Emoji sequences, complex scripts, watermarking
ZWNJ | Zero Width Non-Joiner | U+200C | An invisible character defined in Unicode Standard that prevents the joining of adjacent characters, used in typography for scripts like Persian and Arabic. | Persian/Arabic typography, preventing character joining
WJ | Word Joiner | U+2060 | An invisible character defined in Unicode Standard that prevents line breaks between words, ensuring text stays together. | Preventing line breaks, keeping text together

References: All these characters are officially defined in the Unicode Standard. For detailed technical specifications, see the Unicode Character Database and the Unicode Technical Reports.
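
If you want to confirm these code points yourself, Python's built-in unicodedata module reports each character's official name and general category. This is only a minimal sketch; the ZERO_WIDTH mapping and the describe helper are illustrative names, not part of any library.

```python
import unicodedata

# Illustrative mapping (an assumption of this sketch): the four characters from the table.
ZERO_WIDTH = {
    "ZWSP": "\u200b",
    "ZWNJ": "\u200c",
    "ZWJ": "\u200d",
    "WJ": "\u2060",
}

def describe(char):
    """Return (code point, official Unicode name, general category) for one character."""
    return (f"U+{ord(char):04X}", unicodedata.name(char), unicodedata.category(char))

for abbreviation, char in ZERO_WIDTH.items():
    print(abbreviation, *describe(char))
```

All four report the general category Cf ("format"), which is how Unicode classifies invisible formatting controls.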

Zero Width Space (ZWSP) - U+200B

The Zero Width Space is probably the most commonly encountered zero-width character, especially in AI-generated text. As the name suggests, it's an invisible space character that doesn't take up any visual space.

Legitimate uses:

  • Thai language: Used for word separation in Thai script, which doesn't use spaces between words
  • Text processing: Can be used to mark word boundaries in text processing systems
  • Line breaking: Some systems use it to indicate where line breaks are allowed

Example:

const text = "Hello\u200BWorld";
console.log(text.length); // Returns 11 (includes the invisible space)
console.log(text === "HelloWorld"); // Returns false!

Why it appears in AI text: AI services may insert ZWSP characters as part of watermarking schemes. Since they're invisible, they don't affect the reading experience but can be detected programmatically.

Zero Width Joiner (ZWJ) - U+200D

The Zero Width Joiner is used to join adjacent characters together, particularly in complex scripts and emoji sequences. It can also turn up in AI-generated text.

Legitimate uses:

  • Emoji sequences: Combining multiple emoji into complex sequences. For example, the family emoji 👨‍👩‍👧‍👦 is created using ZWJ to join individual emoji
  • Complex scripts: Used in languages like Arabic, Persian, and Indic scripts to control character joining
  • Ligatures: Creating ligatures in certain writing systems

Example:

// Family emoji uses ZWJ
const family = "👨\u200D👩\u200D👧\u200D👦";
console.log(family); // Displays as a single family emoji

Why it appears in AI text: ZWJ may be used in AI watermarking because it's common enough in legitimate text (especially with emoji) that it doesn't raise suspicion, yet it can still be detected programmatically.

Zero Width Non-Joiner (ZWNJ) - U+200C

The Zero Width Non-Joiner does the opposite of ZWJ - it prevents adjacent characters from joining together. It's primarily used in scripts where characters normally join, like Arabic and Persian.

Legitimate uses:

  • Persian/Arabic typography: Preventing unwanted character joining in Persian and Arabic text
  • Text formatting: Controlling how characters are displayed in certain contexts
  • Linguistic processing: Marking boundaries where characters should not join

Example:

// In Persian/Arabic text, ZWNJ prevents character joining
const persianText = "مثال\u200Cمثال"; // Prevents joining

Why it appears in AI text: Less common than ZWJ or ZWSP in AI watermarks, but still used by some services as part of watermarking schemes.

Word Joiner (WJ) - U+2060

The Word Joiner is used to prevent line breaks between words, ensuring that certain text sequences stay together on the same line.

Legitimate uses:

  • Preventing line breaks: Keeping text like "price: $100" together on one line
  • Technical formatting: Ensuring code snippets, URLs, or technical terms don't break awkwardly
  • Typography: Maintaining visual consistency in formatted text

Example:

const price = "price:\u2060$100";
// The WJ prevents line breaks between "price:" and "$100"

Why it appears in AI text: Used less frequently in watermarking but can still appear in AI-generated content, especially in formatted or technical text.

Legitimate Uses of Zero-Width Characters

Before we dive into why these characters appear in AI text, it's important to understand that they have many legitimate and important uses:

1. Complex Script Rendering

Languages like Arabic, Persian, Thai, and various Indic scripts rely on zero-width characters for proper text rendering. These characters control how letters connect, how words are separated, and how text flows visually.

Example in Thai:

// Thai text uses ZWSP for word separation
const thaiText = "สวัสดี\u200Bครับ"; // "Hello" in Thai
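
Because ZWSP marks the word boundary here, software can recover the individual words by splitting on it. A small Python sketch, assuming the text really does carry a ZWSP at each boundary (split_on_zwsp is a hypothetical helper name):

```python
ZWSP = "\u200b"

def split_on_zwsp(text):
    """Split text at Zero Width Space markers, dropping empty pieces."""
    return [piece for piece in text.split(ZWSP) if piece]

# Same Thai greeting as above, with a ZWSP between the two words.
thai_text = "สวัสดี\u200bครับ"
words = split_on_zwsp(thai_text)
print(words)
```

Text without any ZWSP comes back as a single piece, so this only helps when the boundaries were marked in the first place.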

2. Emoji Sequences

Modern emoji rely heavily on ZWJ to create complex sequences. Without ZWJ, we wouldn't have emoji like:

  • 👨‍👩‍👧‍👦 (family)
  • 👨‍💻 (technologist)
  • 🏳️‍🌈 (rainbow flag)

How it works:

// Family emoji is created by joining individual emoji with ZWJ
const family = "👨\u200D👩\u200D👧\u200D👦";
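
To see the structure of such a sequence, it can help to list its code points. The Python sketch below is illustrative only: it writes the four emoji as escapes and splits the sequence on ZWJ to recover the individual family members.

```python
ZWJ = "\u200d"

# man + ZWJ + woman + ZWJ + girl + ZWJ + boy
family = "\U0001F468\u200d\U0001F469\u200d\U0001F467\u200d\U0001F466"

code_points = [f"U+{ord(ch):04X}" for ch in family]
members = family.split(ZWJ)

print(len(family))   # Python counts code points: 4 emoji + 3 joiners = 7
print(code_points)
print(members)       # the four separate emoji
```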

3. Typography and Text Formatting

Zero-width characters help control text flow, prevent unwanted line breaks, and maintain formatting consistency. This is especially important in:

  • Technical documentation
  • Code examples
  • Formatted text with specific layout requirements

4. Text Processing and NLP

In natural language processing and text analysis, zero-width characters can mark word boundaries, indicate special formatting, or provide metadata about text structure.

Why Zero-Width Characters Appear in AI-Generated Text

Now, here's where things get interesting. While zero-width characters have legitimate uses, they may also be used by AI services for watermarking. Here's why:

Watermarking and Content Tracking

AI companies may insert zero-width characters into their generated text as a form of watermarking. This serves several purposes:

Content attribution: By embedding invisible markers, AI services can track where their generated content ends up. This helps them understand usage patterns and content distribution.

Detection: Watermarks allow AI services (and others) to detect AI-generated content in the wild. This is becoming increasingly important as AI-generated content becomes more common.

Research and improvement: Tracking how AI-generated content is used helps companies improve their models and understand real-world usage patterns.

Legal and compliance: Watermarks can help with copyright and content ownership tracking, which is important as AI-generated content becomes more prevalent.

The Watermarking Debate

It's worth noting that the use of zero-width characters for watermarking is a topic of ongoing research and debate. While some AI services may use these characters for watermarking, it's important to understand that:

  • Not all zero-width characters are watermarks: These characters may appear due to copy-paste operations, browser rendering, text processing pipelines, or legitimate typographic needs
  • Detection isn't definitive: The presence of zero-width characters doesn't definitively prove they were inserted by an AI service
  • Other watermarking methods exist: Some AI services use statistical watermarking (patterns in word choice) rather than character insertion

However, regardless of their origin, these invisible characters can cause real problems for developers and content creators.

How to Detect Zero-Width Characters

If you suspect your text contains zero-width characters, there are several ways to detect them:

Method 1: Using JavaScript in Browser Console

The easiest way to check for zero-width characters is using JavaScript in your browser's console:

// Function to detect all zero-width characters
function detectZeroWidth(text) {
    const zeroWidthChars = {
        'ZWSP': '\u200B',  // Zero Width Space
        'ZWJ': '\u200D',   // Zero Width Joiner
        'ZWNJ': '\u200C',  // Zero Width Non-Joiner
        'WJ': '\u2060'     // Word Joiner
    };

    const results = {};

    for (const [name, char] of Object.entries(zeroWidthChars)) {
        const count = (text.match(new RegExp(char, 'g')) || []).length;
        if (count > 0) {
            results[name] = count;
        }
    }

    return results;
}

// Usage
const text = "Your text here";
const detected = detectZeroWidth(text);
console.log('Detected zero-width characters:', detected);

Method 2: Using Python

Python makes it easy to detect and count zero-width characters:

def detect_zero_width(text):
    """Detect zero-width characters in text"""
    zero_width_chars = {
        'ZWSP': '\u200B',  # Zero Width Space
        'ZWJ': '\u200D',   # Zero Width Joiner
        'ZWNJ': '\u200C',  # Zero Width Non-Joiner
        'WJ': '\u2060'     # Word Joiner
    }

    results = {}
    for name, char in zero_width_chars.items():
        count = text.count(char)
        if count > 0:
            results[name] = count

    return results

# Usage
text = "Your text here"
detected = detect_zero_width(text)
print(f"Detected zero-width characters: {detected}")
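
Counting tells you that something is there, but not where. As a hedged extension, the sketch below reports the position and official name of every character in Unicode's Cf (format) category. Note the assumption: Cf is broader than the four characters covered here, so it also flags things like the soft hyphen or byte order mark. The function name find_format_characters is made up for this example.

```python
import unicodedata

def find_format_characters(text):
    """Return (index, code point, name) for every Cf-category character in text."""
    hits = []
    for index, char in enumerate(text):
        if unicodedata.category(char) == "Cf":
            name = unicodedata.name(char, "UNKNOWN")
            hits.append((index, f"U+{ord(char):04X}", name))
    return hits

sample = "Hello\u200bWorld\u2060!"
for hit in find_format_characters(sample):
    print(hit)
```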

Method 3: Using Online Unicode Analyzers

Several online Unicode analyzers can help you visualize and detect zero-width characters by listing every code point in the text you paste in.

Method 4: Using Text Editors

Many code editors have extensions or built-in features to reveal zero-width characters:

VS Code:

  • Install the "Zero Width Characters" extension
  • Or use the built-in Unicode highlighting (the editor.unicodeHighlight.invisibleCharacters setting); the "Render Whitespace" feature only shows spaces and tabs, not zero-width characters

Sublime Text:

  • Use the "Unicode Character Highlighter" plugin
  • Or enable "Show All Characters" in view settings

Vim:

  • Vim usually renders these characters as hex codes such as <200b>; :set list on its own does not reveal them
  • Search for them directly with a pattern such as /[\u200b-\u200d\u2060]

Notepad++:

  • Enable "Show All Characters" from the View menu
  • Zero-width characters may appear as special symbols
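
If your editor doesn't show them, you can make the characters visible yourself before pasting text anywhere. A minimal Python sketch, where reveal_zero_width is a hypothetical helper that swaps each zero-width character for a readable tag:

```python
import re

# Covers U+200B, U+200C, U+200D and U+2060 only (an assumption of this sketch).
ZERO_WIDTH_PATTERN = re.compile("[\u200b-\u200d\u2060]")

def reveal_zero_width(text):
    """Replace each zero-width character with a visible <U+XXXX> tag."""
    return ZERO_WIDTH_PATTERN.sub(lambda m: f"<U+{ord(m.group()):04X}>", text)

print(reveal_zero_width("Hello\u200bWorld"))  # Hello<U+200B>World
```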

Problems Caused by Zero-Width Characters

Even though these characters are invisible, they can cause real problems in various scenarios:

1. String Length Mismatches

Zero-width characters are counted in string length, which can cause unexpected behavior:

const text = "Hello\u200BWorld";
console.log(text.length); // Returns 11, not 10
console.log(text === "HelloWorld"); // Returns false!

// This can break validation
if (text.length === 10) {
    // This will never execute because length is 11
}
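
The same mismatch shows up in Python. One possible workaround, sketched here under the assumption that only the four characters from this guide matter, is to compare and measure strings after dropping them (visible_length and visually_equal are illustrative names):

```python
# Translation table that deletes ZWSP, ZWNJ, ZWJ and WJ.
ZERO_WIDTH_TABLE = str.maketrans("", "", "\u200b\u200c\u200d\u2060")

def visible_length(text):
    """Length of text once the four zero-width characters are ignored."""
    return len(text.translate(ZERO_WIDTH_TABLE))

def visually_equal(a, b):
    """Compare two strings while ignoring the four zero-width characters."""
    return a.translate(ZERO_WIDTH_TABLE) == b.translate(ZERO_WIDTH_TABLE)

text = "Hello\u200bWorld"
print(len(text), visible_length(text))     # 11 10
print(visually_equal(text, "HelloWorld"))  # True
```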

2. Regex Pattern Failures

Regular expressions may fail to match text containing zero-width characters:

// This regex won't match if there's a zero-width character
const pattern = /^HelloWorld$/;
const text = "Hello\u200BWorld";
console.log(pattern.test(text)); // Returns false!

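// Word boundaries behave unexpectedly too
const wordPattern = /\bHello\b/;
const text2 = "Hello\u200BWorld";
console.log(wordPattern.test(text2)); // Returns true: ZWSP is not a word character, so \b matches even though the text looks like "HelloWorld"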
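
When you can't clean the input first, another option is a pattern that tolerates zero-width characters between the letters it is looking for. This Python sketch is one way to do that; tolerant_pattern is a hypothetical helper, and it only accounts for the four characters discussed in this guide.

```python
import re

# Zero or more of ZWSP, ZWNJ, ZWJ, WJ.
ZERO_WIDTH_CLASS = "[\u200b-\u200d\u2060]*"

def tolerant_pattern(literal):
    """Compile a regex matching `literal` with optional zero-width characters between its characters."""
    return re.compile(ZERO_WIDTH_CLASS.join(re.escape(ch) for ch in literal))

pattern = tolerant_pattern("HelloWorld")
print(bool(pattern.fullmatch("Hello\u200bWorld")))  # True
print(bool(pattern.fullmatch("HelloWorld")))        # True
```

Cleaning the text before matching is usually simpler; the tolerant pattern is mainly useful when the original text must stay untouched.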

3. Database Storage Issues

Some database systems don't handle zero-width characters well:

  • Encoding errors: Databases or columns that use a non-Unicode character set may reject or mangle these characters
  • Search failures: Exact-match queries won't find text that contains hidden characters
  • Duplicate keys: Values that look identical can be stored as distinct entries in unique indexes
  • Storage overhead: While minimal, these characters do take up space
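
The search failure is easy to reproduce with Python's built-in sqlite3 module. The table, the find_user function, and strip_zero_width below are all made up for this sketch; the point is only that an exact-match lookup misses until the pasted value is cleaned.

```python
import sqlite3

def strip_zero_width(text):
    """Delete ZWSP, ZWNJ, ZWJ and WJ from text."""
    return text.translate(str.maketrans("", "", "\u200b\u200c\u200d\u2060"))

# In-memory database with one clean username (hypothetical schema).
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE users (username TEXT PRIMARY KEY)")
connection.execute("INSERT INTO users VALUES (?)", ("johndoe",))

def find_user(username):
    """Exact-match lookup; returns the stored username or None."""
    row = connection.execute(
        "SELECT username FROM users WHERE username = ?", (username,)
    ).fetchone()
    return row[0] if row else None

pasted = "john\u200bdoe"  # looks like "johndoe" on screen
print(find_user(pasted))                    # None
print(find_user(strip_zero_width(pasted)))  # johndoe
```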

4. API Integration Problems

Many APIs expect clean text without special Unicode characters:

// API validation may fail
const apiData = {
    username: "user\u200Bname",
    // Some APIs reject this
};

// JSON parsing is usually fine, but validation may fail
fetch('/api/user', {
    method: 'POST',
    body: JSON.stringify(apiData)
});
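
On the receiving side, a server could flag such fields before accepting them. The Python sketch below shows one hedged way to do it: validate_payload and the field names are hypothetical, and it also demonstrates that JSON serialization round-trips the hidden character without complaint.

```python
import json
import re

ZERO_WIDTH = re.compile("[\u200b-\u200d\u2060]")

def validate_payload(payload):
    """Return the names of string fields that contain zero-width characters."""
    return [
        key for key, value in payload.items()
        if isinstance(value, str) and ZERO_WIDTH.search(value)
    ]

api_data = {"username": "user\u200bname", "email": "user@example.com"}

body = json.dumps(api_data)          # the ZWSP is escaped as \u200b, not removed
print(validate_payload(api_data))    # ['username']
print(json.loads(body) == api_data)  # True: the hidden character survives the round trip
```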

5. Code and Programming Issues

When using AI-generated text in code, zero-width characters can break:

  • Code comments: May cause parsing issues
  • String literals: Can break string matching
  • Configuration files: May cause parsing errors
  • Template strings: Can break template processing

6. Content Management Systems

Some CMS platforms strip or mishandle zero-width characters:

  • Text truncation: Characters may be counted but not displayed, causing truncation issues
  • Formatting loss: May interfere with text formatting
  • Display issues: Can cause rendering problems in the frontend
  • Search functionality: May break search features

7. Text Processing and Analysis

Zero-width characters can interfere with:

  • Word counting: May affect word count accuracy
  • Text analysis: Can interfere with NLP tools
  • Plagiarism detection: May cause false positives or negatives
  • Text comparison: Can break text diff tools
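
Word counting is a good illustration, because two common counting methods disagree once a ZWSP is present. In this Python sketch (function names are illustrative), whitespace splitting treats the ZWSP as part of a word, while a \w+ regex treats it as a separator:

```python
import re

def count_words_by_whitespace(text):
    """Count words by splitting on whitespace; ZWSP is not whitespace to str.split()."""
    return len(text.split())

def count_words_by_regex(text):
    r"""Count runs of word characters; ZWSP is not matched by \w."""
    return len(re.findall(r"\w+", text))

sample = "zero\u200bwidth text"  # looks like "zerowidth text"
print(count_words_by_whitespace(sample))  # 2
print(count_words_by_regex(sample))       # 3
```

After removing the ZWSP, both methods agree on two words.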

Real-World Examples

Let me share some real-world scenarios where zero-width characters caused problems:

Example 1: Form Validation Failure

// User pastes AI-generated text into a form
const username = "john\u200Bdoe"; // Contains ZWSP

// Validation checks length (limit of 7 characters)
if (username.length > 7) {
    showError("Username too long");
    // This triggers even though it looks like 7 characters ("johndoe")
}

// Database query fails
db.query("SELECT * FROM users WHERE username = ?", [username]);
// No match found because database has "johndoe" without ZWSP

Example 2: Email Parsing Issue

// Email address with zero-width character
const email = "user\u200B@example.com";

// Email validation
const emailRegex = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;
console.log(emailRegex.test(email)); // Returns true: \s does not match U+200B, so the bad address passes validation

// Email sending fails
sendEmail(email, "Subject", "Body");
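
The same pattern behaves the same way in Python, because \s does not cover U+200B there either. One possible stricter check, sketched below, rejects any address containing a Cf-category (format) character; is_plausible_email is a hypothetical helper, not a full email validator.

```python
import re
import unicodedata

# Same shape check as the JavaScript example above.
EMAIL_SHAPE = re.compile(r"^[^\s@]+@[^\s@]+\.[^\s@]+$")

def has_invisible_format_chars(text):
    """True if text contains any Unicode format (Cf) character, such as ZWSP."""
    return any(unicodedata.category(ch) == "Cf" for ch in text)

def is_plausible_email(address):
    """Shape check plus a rejection of invisible format characters."""
    return bool(EMAIL_SHAPE.match(address)) and not has_invisible_format_chars(address)

email = "user\u200b@example.com"
print(bool(EMAIL_SHAPE.match(email)))  # True: the shape check alone is fooled
print(is_plausible_email(email))       # False
```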

Example 3: URL Processing

// URL with zero-width character
const url = "https://example.com/page\u200B1";

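// URL validation
try {
    const parsed = new URL(url); // Does not throw: the parser percent-encodes the ZWSP
    console.log(parsed.pathname); // "/page%E2%80%8B1"
} catch (e) {
    console.error("Invalid URL");
}

// Fetch requests the wrong path
fetch(url); // Typically a 404, because the server only knows /page1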
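
Python's standard library makes the underlying issue visible: the hidden character stays in the path and is percent-encoded as three UTF-8 bytes when the URL is prepared for a request. A short sketch using urllib.parse (the URL is the same made-up example):

```python
from urllib.parse import quote, urlsplit

url = "https://example.com/page\u200b1"

parts = urlsplit(url)
encoded_path = quote(parts.path)  # percent-encode non-ASCII characters in the path

print(repr(parts.path))  # '/page\u200b1'
print(encoded_path)      # /page%E2%80%8B1
```

The server sees a path that differs from /page1, which explains why a link that looks correct can still lead to a missing page.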

How to Remove Zero-Width Characters

If you've detected zero-width characters in your text and want to remove them, you have several options:

Method 1: Using Our Cleaning Tool

The easiest way is to use our watermark cleaning tool. It's designed specifically for this purpose and handles all types of zero-width characters:

  1. Paste your text into the tool
  2. Click "Clean Text"
  3. Copy the cleaned result

The tool processes everything locally in your browser - no data is sent to any server, ensuring complete privacy.

Method 2: JavaScript Function

You can create a simple JavaScript function to remove zero-width characters:

function removeZeroWidth(text) {
    return text
        .replace(/\u200B/g, '')  // Zero Width Space
        .replace(/\u200D/g, '')  // Zero Width Joiner
        .replace(/\u200C/g, '')  // Zero Width Non-Joiner
        .replace(/\u2060/g, ''); // Word Joiner
}

// Usage
const cleaned = removeZeroWidth("Hello\u200BWorld");
console.log(cleaned); // "HelloWorld"

Or using a single regex:

function removeZeroWidth(text) {
    return text.replace(/[\u200B-\u200D\u2060]/g, '');
}

Method 3: Python Function

In Python, you can remove zero-width characters like this:

import re

def remove_zero_width(text):
    """Remove zero-width characters from text"""
    # Remove all zero-width characters
    return re.sub(r'[\u200B-\u200D\u2060]', '', text)

# Usage
text = "Hello\u200BWorld"
cleaned = remove_zero_width(text)
print(cleaned)  # "HelloWorld"

Method 4: Using a Library

Several libraries can help with general Unicode handling, although Unicode normalization (NFC or NFKC) alone does not remove zero-width characters:

JavaScript:

  • unorm - Unicode normalization
  • punycode - Encoding/decoding

Python:

  • unicodedata - Built-in Unicode database
  • unidecode - ASCII transliterations

Best Practices

Here are some best practices for dealing with zero-width characters:

1. Always Clean User Input

If you're accepting text input from users (especially if it might come from AI tools), clean it before processing:

function cleanUserInput(input) {
    // Remove zero-width characters
    return input.replace(/[\u200B-\u200D\u2060]/g, '');
}

2. Validate Before Storage

Clean text before storing it in databases:

function sanitizeForDatabase(text) {
    return text
        .replace(/[\u200B-\u200D\u2060]/g, '') // Remove zero-width
        .trim(); // Remove leading/trailing whitespace
}

3. Be Careful with Emoji

Remember that some emoji legitimately use ZWJ. If you're removing zero-width characters, you might break emoji sequences:

// This emoji uses ZWJ - removing it will break it
const family = "👨\u200D👩\u200D👧\u200D👦";
const broken = family.replace(/\u200D/g, ''); // Breaks the emoji

Consider preserving ZWJ in emoji contexts, or at least being aware of this limitation.
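
One way to preserve ZWJ in emoji contexts is a heuristic: keep a ZWJ only when the characters on both sides are symbols. The Python sketch below is a rough approximation, not a full implementation of the Unicode emoji rules. It assumes that emoji have the general category So, and it skips over the variation selector U+FE0F and skin-tone modifiers when looking backwards. The function name is hypothetical.

```python
import unicodedata

ZWJ = "\u200d"
OTHER_ZERO_WIDTH = {"\u200b", "\u200c", "\u2060"}  # always removed
VARIATION_SELECTOR_16 = "\ufe0f"

def _is_symbol(char):
    """Heuristic: treat 'Symbol, other' (So) characters as emoji."""
    return unicodedata.category(char) == "So"

def _is_skin_tone_modifier(char):
    return "\U0001F3FB" <= char <= "\U0001F3FF"

def remove_zero_width_keep_emoji(text):
    """Remove zero-width characters, keeping ZWJ only between symbol characters."""
    kept = []
    for i, char in enumerate(text):
        if char in OTHER_ZERO_WIDTH:
            continue
        if char == ZWJ:
            # Look back past variation selectors and skin-tone modifiers.
            j = i - 1
            while j >= 0 and (text[j] == VARIATION_SELECTOR_16
                              or _is_skin_tone_modifier(text[j])):
                j -= 1
            previous_ok = j >= 0 and _is_symbol(text[j])
            next_ok = i + 1 < len(text) and _is_symbol(text[i + 1])
            if not (previous_ok and next_ok):
                continue  # stray ZWJ outside an emoji sequence
        kept.append(char)
    return "".join(kept)

family = "\U0001F468\u200d\U0001F469\u200d\U0001F467\u200d\U0001F466"
print(remove_zero_width_keep_emoji(family) == family)   # True
print(remove_zero_width_keep_emoji("Hello\u200dWorld"))  # HelloWorld
```

This still removes ZWJ from Arabic, Persian, or Indic text, so it is only suitable when emoji are the one legitimate use you need to keep.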

4. Log Detections

If you're cleaning text, consider logging when zero-width characters are detected:

function cleanAndLog(text) {
    const before = text.length;
    const cleaned = text.replace(/[\u200B-\u200D\u2060]/g, '');
    const after = cleaned.length;

    if (before !== after) {
        console.warn(`Removed ${before - after} zero-width characters`);
    }

    return cleaned;
}

5. Test Your Code

Always test your code with text that contains zero-width characters:

// Test cases
const testCases = [
    "Hello\u200BWorld",
    "Test\u200DString",
    "Normal text"
];

testCases.forEach(text => {
    const cleaned = removeZeroWidth(text);
    console.assert(cleaned.length <= text.length, "Cleaning should not increase length");
});

Frequently Asked Questions (FAQ)

Here are some common questions about zero-width characters:

Q: Are zero-width characters always watermarks?

No, not necessarily. Zero-width characters have many legitimate uses:

  • Emoji sequences (family emoji, etc.)
  • Complex script rendering (Arabic, Persian, Thai)
  • Typography and text formatting
  • Text processing and NLP

They may also appear due to:

  • Copy-paste operations
  • Browser rendering
  • Text processing pipelines
  • Font rendering

The presence of zero-width characters doesn't definitively prove they were inserted by an AI service.

Q: Will removing zero-width characters break my text?

Usually not, but there are exceptions:

  • Emoji sequences: Removing ZWJ from emoji sequences will break them (e.g., 👨‍👩‍👧‍👦 becomes separate emoji)
  • Complex scripts: Removing zero-width characters from Arabic, Persian, or Thai text may affect rendering
  • Formatted text: May affect text flow or formatting in some cases

For most English text and code, removing zero-width characters is safe.

Q: How do I know if my text has zero-width characters?

You can:

  1. Use the detection methods described above (JavaScript, Python, online tools)
  2. Use our watermark cleaning tool - it will show you if any are detected
  3. Check in your code editor with appropriate extensions
  4. Use Unicode analysis tools

Q: Are zero-width characters harmful?

Not harmful in the security sense, but they can cause:

  • Code bugs and failures
  • Database issues
  • API integration problems
  • Text processing errors
  • Formatting issues

They're more of an annoyance than a security threat, but they can definitely cause problems.

Q: Can I prevent zero-width characters from being inserted?

If you're generating text yourself, you can avoid inserting them. However, if you're receiving text from AI services or other sources, you can't prevent them from being inserted - but you can detect and remove them.

Q: Do all AI services use zero-width characters for watermarking?

No. Different AI services use different methods:

  • Some use zero-width characters
  • Some use statistical watermarking (patterns in word choice)
  • Some use semantic watermarking
  • Some may not watermark at all

The use of zero-width characters for watermarking is not officially documented by most AI services.

Q: Is it legal to remove zero-width characters?

This depends on the terms of service of the AI service you're using. Generally, removing invisible tracking characters is similar to removing cookies or tracking pixels from websites. However, you should:

  • Review the terms of service for the AI tool you're using
  • Consult legal counsel if you have concerns
  • Consider the ethical implications

Q: Will removing zero-width characters make AI text undetectable?

Not necessarily. Removing zero-width characters only removes one potential detection method. Advanced AI detection systems may use:

  • Statistical analysis of writing patterns
  • Vocabulary and sentence structure analysis
  • Semantic analysis
  • Other steganographic methods

Removing zero-width characters helps, but doesn't guarantee undetectability.

Additional Resources

If you want to dive deeper into zero-width characters and Unicode, the authoritative resources are the Unicode Standard, the Unicode Character Database, and the Unicode Technical Reports.

Bottom Line

Zero-width characters are fascinating and complex. They serve legitimate purposes in typography, linguistics, and text processing, but they can also cause problems when they appear unexpectedly in AI-generated text or other sources.

Understanding what they are, how to detect them, and how to handle them is essential for anyone working with text processing, especially in the age of AI-generated content. Whether you're a developer dealing with code, a content creator working with AI tools, or just someone curious about how text works, knowing about zero-width characters can save you a lot of headaches.

If you've encountered zero-width characters in your text and want to clean them up, try our watermark cleaning tool. It's free, works entirely in your browser, and handles all the common zero-width character types.

Remember: these characters aren't inherently bad - they're tools that can be used for good or problematic purposes. The key is understanding them and knowing how to work with them effectively.

