Tokens are a big reason today’s generative AI falls short
Tokenization limits generative AI's performance; character-based models like MambaByte show promise, but require significant computational resources.
Generative AI models, including industry leaders like GPT-4o, rely on a technique called tokenization: breaking text into smaller units so it can be processed by transformer architectures. The method is practical, but it introduces biases and inconsistencies, because tokens can represent words, syllables, or even individual characters, and the boundaries are often arbitrary. This frequently leads to unexpected and incorrect outputs, particularly with non-English languages and numerical data, where tokenization schemes don't carve the input into meaningful units.
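To make the failure mode concrete, here is a minimal sketch of subword tokenization using greedy longest-match against a tiny hypothetical vocabulary. Real tokenizers like BPE learn their vocabularies from data, but the arbitrariness is the same: whether a word survives as one token or shatters into pieces depends entirely on what happens to be in the vocabulary.

```python
# Hypothetical toy vocabulary -- real subword vocabularies are learned
# from a training corpus and contain tens of thousands of entries.
VOCAB = {"once", "up", "on", "a", "time", " "}

def tokenize(text, vocab):
    """Greedily match the longest vocabulary entry at each position."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character falls back to itself
            i += 1
    return tokens

print(tokenize("once upon a time", VOCAB))
# ['once', ' ', 'up', 'on', ' ', 'a', ' ', 'time']
```

Note how "once" survives as a single token while "upon" splits into "up" + "on": the model sees different units for strings a human reads the same way.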
Languages such as Chinese, Japanese, Korean, Thai, and Khmer suffer because tokenization methods are designed primarily around English, raising both bias and performance concerns. A 2023 Oxford study found that transformers can take twice as long to complete tasks phrased in non-English languages, and users of these less token-efficient languages may see poorer model performance while paying higher costs. Agglutinative and logographic languages fare similarly: their tokenizations produce higher token counts, which further complicates processing and comprehension.
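A rough way to see why the disparity exists is to compare the UTF-8 byte footprints of equivalent short phrases. Byte counts are only a proxy for token counts, and actual tokenizers vary, but scripts outside the Latin range consistently need more bytes (and usually more tokens) per unit of meaning:

```python
# Equivalent greetings in three languages; translations are illustrative.
samples = {
    "English": "hello",
    "Thai": "สวัสดี",
    "Japanese": "こんにちは",
}

for lang, text in samples.items():
    chars = len(text)
    utf8_bytes = len(text.encode("utf-8"))
    print(f"{lang}: {chars} chars, {utf8_bytes} UTF-8 bytes")
# English: 5 chars, 5 bytes; Thai: 6 chars, 18 bytes;
# Japanese: 5 chars, 15 bytes -- Thai and Japanese code points
# each take 3 bytes in UTF-8, versus 1 byte per ASCII letter.
```

When a tokenizer's vocabulary is dominated by English subwords, those extra bytes rarely merge into long tokens, so the same sentence costs more tokens, more latency, and more money.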
To mitigate these issues, researchers are exploring models like MambaByte, which operates on raw bytes rather than tokens. Byte-level state space models can ingest far more data without a performance penalty and are more resilient to noise such as typos. They remain in early research phases, however, and face steep computational costs; models like MambaByte may become a practical alternative only if those constraints are addressed.
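The byte-level input representation itself is simple to sketch. Instead of a learned subword vocabulary, the model's input is just the raw UTF-8 byte stream, giving a fixed vocabulary of 256 symbols for every language (this mirrors the input side of models like MambaByte; the state space model itself is omitted here):

```python
def to_byte_ids(text: str) -> list[int]:
    """Encode text as a sequence of byte values in [0, 255]."""
    return list(text.encode("utf-8"))

ids = to_byte_ids("Hi!")
print(ids)              # [72, 105, 33]
print(max(ids) < 256)   # True -- the vocabulary never exceeds 256 symbols
```

The trade-off is visible even in this sketch: byte sequences are several times longer than token sequences for the same text, which is exactly the computational burden that byte-level models must overcome.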