Published: Sep 1, 2023
When you feed a text prompt to a large language model, or any text-based input to audio, video, or image models, it’s crucial to know that these models don’t process the raw text (often referred to as ‘strings’) as-is. Instead, they convert the text into smaller pieces called tokens. For prompt engineers, understanding the intricacies of tokenization is vital: it helps you grasp how a model interprets and processes the text you provide, which is key to getting the desired output.
Tokenization serves as a sort of ‘pre-processing’ step, breaking down a chunk of text into more manageable units—be it sentences, words, subwords, or individual characters. Imagine having a set of Legos: each Lego block could represent a character, word, or even a subword. Tokenization is like sorting these Lego blocks in particular orders or groupings, making it easier for the machine to build meaning from them.
Character Tokenization
Imagine you have the sentence “Hello, World!”. Character tokenization would break this down into individual characters like this:
H, e, l, l, o, ",", " " (a space), W, o, r, l, d, "!"
Each character, including spaces and punctuation, becomes a separate “token” or unit. It’s like breaking down a Lego castle into individual blocks. These characters often get converted into numbers for the machine to understand, but let’s keep it simple for now.
Pros and Cons: This method is pretty straightforward but not very efficient for understanding words or context. It’s like trying to understand the plot of a movie by looking at each frame—possible but hard!
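If you want to see this in code, here’s a minimal Python sketch of character tokenization; it’s just plain string splitting, not any particular model’s tokenizer:

```python
# Character tokenization: every character, including punctuation
# and spaces, becomes its own token.
text = "Hello, World!"
tokens = list(text)

print(tokens)
# ['H', 'e', 'l', 'l', 'o', ',', ' ', 'W', 'o', 'r', 'l', 'd', '!']
```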
Word Tokenization
Instead of breaking down the sentence “Hello, World!” into individual characters, word tokenization would look like:
Hello, ",", World, "!"
In this method, each word becomes its own unit. You’ll notice that punctuation marks like the comma and exclamation mark are also treated as separate units.
Pros and Cons: This gives a better understanding of the text’s structure but can struggle with variations of words and may not handle punctuation well. In terms of our Lego analogy, you can think of this as breaking down the castle into larger sections—like a wall or a turret—but still not recognizing the castle as a whole entity.
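As a rough sketch in Python, a regular expression can approximate word tokenization by keeping words whole and splitting punctuation into separate tokens (dedicated NLP libraries such as NLTK or spaCy offer more robust word tokenizers):

```python
import re

# Keep runs of word characters together; split punctuation into its own tokens.
text = "Hello, World!"
tokens = re.findall(r"\w+|[^\w\s]", text)

print(tokens)
# ['Hello', ',', 'World', '!']
```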
Subword Tokenization
This is a kind of middle-ground approach. For a sentence containing an uncommon word, such as “ChatGPT is cool!”, the tokens might look like:
Chat, G, PT, is, cool!
Here, “ChatGPT” is broken down into ‘Chat,’ ‘G,’ ‘PT’.
Pros and Cons: This strikes a balance. It’s good for understanding complex words by breaking them down into known parts while still recognizing common words as whole entities. So, in our Lego analogy, it’s like breaking down the castle into individual blocks for the uncommon or tricky structures, but keeping the well-known sections intact.
Subword tokenization is smart. It learns the best way to break down words based on the text data it’s trained on.
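You can peek at a real subword (byte-pair encoding) tokenizer with OpenAI’s tiktoken library. This sketch assumes the cl100k_base encoding used by GPT-3.5 Turbo and GPT-4; the exact pieces you get back depend on the encoding:

```python
import tiktoken  # pip install tiktoken

# Load the byte-pair encoding used by recent OpenAI chat models.
enc = tiktoken.get_encoding("cl100k_base")

text = "ChatGPT is cool!"
token_ids = enc.encode(text)

# Decode each token id on its own to see the subword pieces.
pieces = [enc.decode([tid]) for tid in token_ids]
print(pieces)  # e.g. ['Chat', 'G', 'PT', ' is', ' cool', '!']
```

Notice how the uncommon word “ChatGPT” gets split into several pieces, while common words like “is” survive as single tokens.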
And that’s it! You now know the basics of tokenization. It’s like different strategies for deconstructing a Lego castle so you can understand how it’s built—either block by block, section by section, or a mix of both.
Why Tokenization Matters for Prompt Engineering
When working with Large Language Models (LLMs) like ChatGPT, it’s crucial to understand that these models don’t process text word-by-word; they process it token-by-token. Tokens are smaller pieces of text that can be as small as a single character or as large as a whole word. This affects how the model understands and manipulates text, which has implications for tasks you might think are simple, like reversing the letters in a word.
Example: The ‘Lollipop’ Dilemma
Imagine asking ChatGPT to reverse the letters in the word “lollipop.” One would think it’s a straightforward task, but you may get a garbled response. Why? Because the tokenizer in ChatGPT breaks the word “lollipop” into tokens: “l,” “oll,” and “ipop.” Since the model sees these tokens instead of the individual letters, reversing them becomes a challenge.
How to Solve This
A quick trick to fix this is to insert dashes between the letters in the word you’d like to reverse. For example, instead of “lollipop,” input “l-o-l-l-i-p-o-p.” This forces the tokenizer to treat each letter as an individual token, making it easier for the model to reverse them.
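Again assuming tiktoken with the cl100k_base encoding, you can see what the model is actually working with in each case (the exact splits may differ across encodings):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def show_pieces(text):
    # Decode each token id separately to see the chunks the model receives.
    return [enc.decode([tid]) for tid in enc.encode(text)]

print(show_pieces("lollipop"))
# a few subword chunks, e.g. ['l', 'oll', 'ipop']

print(show_pieces("l-o-l-l-i-p-o-p"))
# roughly one letter per token, e.g. ['l', '-o', '-l', '-l', '-i', '-p', '-o', '-p']
```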
Practical Tips for Prompt Engineering
Delimiters
We’ve covered how important it is to provide clear and specific instructions, and how using simple delimiters improves the results from your prompts.
Tokenization knowledge becomes valuable here because some simple delimiters, like a quadruple hashtag (####), are converted into a single token. That makes them easy for the model to identify, and at the same time they don’t waste the valuable and limited token space (see next section).
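If you’re unsure what a delimiter costs, you can check it directly (again a sketch using tiktoken with the cl100k_base encoding; counts vary across encodings):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Compare the token cost of a few candidate delimiters before
# baking one into your prompt template.
for delimiter in ["####", "===", "<<<>>>"]:
    count = len(enc.encode(delimiter))
    print(f"{delimiter!r} -> {count} token(s)")
```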
Understanding Token Limits
The token limit is another critical aspect to consider, especially for GPT-3.5 Turbo, which has an approximate limit of 4,000 tokens for input and output combined. Exceeding this limit will result in errors. Knowing how many tokens you’re using can be essential to avoid overstepping this limitation.
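As a rough pre-flight check, you can count a prompt’s tokens before sending it (a sketch assuming tiktoken and cl100k_base; the 4,000 figure and the output reservation below are illustrative, so check your model’s documented limit):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

MAX_TOKENS = 4000           # approximate combined budget for GPT-3.5 Turbo
RESERVED_FOR_OUTPUT = 500   # illustrative headroom left for the model's reply

prompt = "Summarize the following article: ..."
prompt_tokens = len(enc.encode(prompt))

if prompt_tokens > MAX_TOKENS - RESERVED_FOR_OUTPUT:
    print(f"Prompt is {prompt_tokens} tokens; trim it before sending.")
else:
    print(f"Prompt uses {prompt_tokens} tokens; within budget.")
```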
By understanding tokenization and these practical tips, you’ll be better equipped to engineer prompts that generate useful and accurate outputs from Large Language Models.