Tokens are a big reason today’s generative AI falls short

Tokenization limits generative AI's performance; byte-level models like MambaByte show promise but require significant computational resources.

Generative AI models like GPT-4o process text through tokenization, a workaround for limitations of the transformer architecture that often produces confusing outputs. Tokenization introduces challenges across languages and with numerical data, exposing real inefficiencies. Alternative models like MambaByte could be a solution but are still in early research stages.

Generative AI models, including industry leaders like GPT-4o, rely on a technique called tokenization, which breaks text into smaller units for processing by transformer architectures. This method, while practical, introduces biases and inconsistencies, since a token can represent a word, a syllable, or even an individual character. The result is often unexpected and incorrect output, particularly with non-English languages and numerical data, because token boundaries don't carry over cleanly across languages and don't treat numbers or dates consistently.
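To see what this looks like in practice, here is a minimal sketch using the open-source tiktoken package (the tokenizer library used for OpenAI models); it assumes the package is installed, and the exact splits and counts depend on the encoding chosen. Small surface changes such as capitalization, a leading space, or how digits group can alter how many tokens a model sees for what a person would consider the same text.

```python
# Minimal illustration of tokenization inconsistencies.
# Assumes the `tiktoken` package is installed; splits vary by encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["Hello world", "hello world", " hello world", "HELLO WORLD", "380", "381"]:
    ids = enc.encode(text)                      # token ids the model would actually receive
    pieces = [enc.decode([i]) for i in ids]     # the text span each token covers
    print(f"{text!r:16} -> {len(ids)} tokens: {pieces}")
```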

Languages such as Chinese, Japanese, Korean, Thai, and Khmer suffer under tokenization methods designed primarily around English, creating bias and performance gaps. A 2023 Oxford study found that transformers can take roughly twice as long to complete the same task when it is phrased in a non-English language, and because many vendors bill per token, users of less token-efficient languages may face both poorer model performance and higher costs. Agglutinative and logographic languages likewise break into far more tokens, which further complicates processing and comprehension.
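The token-efficiency gap is easy to measure. The sketch below, again assuming tiktoken is installed, counts tokens for loose translations of the same short greeting; the exact numbers depend on the encoding, but non-English scripts typically consume noticeably more tokens per sentence.

```python
# Rough comparison of token counts across languages for roughly equivalent text.
# Assumes `tiktoken` is installed; the translations are loose and only illustrative.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "Hello, how are you today?",
    "Chinese": "你好，你今天好吗？",
    "Thai": "สวัสดี วันนี้คุณเป็นอย่างไรบ้าง",
    "Khmer": "សួស្តី ថ្ងៃនេះអ្នកសុខសប្បាយទេ?",
}

for lang, text in samples.items():
    print(f"{lang:8} {len(enc.encode(text)):3d} tokens  ({len(text)} characters)")
```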

To mitigate tokenization issues, researchers are exploring models like MambaByte, which work with raw bytes rather than tokens. Byte-level state space models can handle more data without performance penalties and offer better resilience to noisy input. However, these approaches are still in the early research phases: byte sequences are far longer than token sequences, so the models face real computational feasibility challenges. Models like MambaByte may eventually provide a solution if those constraints can be addressed.
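The sketch below shows only the input representation such byte-level models consume, not the models themselves: raw UTF-8 bytes need no learned vocabulary and treat every script uniformly, but the sequences are much longer than token sequences, which is where the extra computational cost comes from. It assumes tiktoken is installed purely for the token-count comparison.

```python
# Byte-level input representation (as used by models like MambaByte) vs. token counts.
# Assumes `tiktoken` is installed only to provide the token-count comparison.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenization-free models read raw bytes. 字节级模型直接读取字节。"
byte_seq = list(text.encode("utf-8"))   # every value is 0..255: a fixed 256-symbol "vocabulary"

print("characters:", len(text))
print("tokens:    ", len(enc.encode(text)))
print("bytes:     ", len(byte_seq))
print("first bytes:", byte_seq[:12])
```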