I’m making this post to share with friends who ask about AI. I need this to throw at people, as in “read this now”. It will have all the necessary links, resources and terminology in simple English.
What do I need?
You will need a programming language. I write JavaScript and PHP because that's what I prefer, but almost all of AI is written in Python. You can of course write in JavaScript, but it isn't recommended, since nearly everything in AI is built in Python.
If you’re not familiar with Python or programming in general, I recommend studying it on freecodecamp.org. Once you have the basics of Python, you’ll also need some math, but don’t panic: you can learn what you need with Khan Academy, Brilliant or GPT.
Linear Algebra
Linear algebra is important for understanding vectors and matrices. Words, sentences, and model parameters are represented as vectors or matrices: each word is embedded into a vector space (word embeddings), and the model’s weights are organized into matrices.
Matrix multiplication is used to transform input vectors through the layers of the neural network; it is essentially how data passes through and is processed across neurons.
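To make this concrete, here is a minimal NumPy sketch (the sizes and values are made up for illustration) showing that a “layer” is, at its core, a matrix multiplication applied to a word vector:

```python
import numpy as np

# A made-up 4-dimensional embedding for one word (real models learn these).
word_vector = np.array([0.2, -1.3, 0.7, 0.05])

# A layer's weight matrix: it maps a 4-dimensional input to a 3-dimensional output.
W = np.random.randn(3, 4)

# Passing data "through a layer" is just matrix multiplication.
output = W @ word_vector
print(output.shape)  # (3,) -- the transformed representation
```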
Calculus
Derivatives and Gradients: Understanding derivatives helps with backpropagation, the algorithm used to train the model. You’ll calculate gradients of the loss function to update the weights in the model.
Gradient Descent: A basic grasp of gradient descent is needed to see how the model’s weights are optimized. This algorithm minimizes the error between the model’s predictions and the actual data.
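As a toy illustration (not how GPT is actually trained), here is gradient descent minimizing a simple squared error by repeatedly stepping against the derivative:

```python
# Toy gradient descent: find w that minimizes the loss (w * x - y)**2.
x, y = 2.0, 10.0          # one training example: we want w * 2 = 10, so w should approach 5
w = 0.0                   # initial weight
learning_rate = 0.05

for step in range(100):
    prediction = w * x
    loss = (prediction - y) ** 2
    gradient = 2 * (prediction - y) * x   # derivative of the loss with respect to w
    w -= learning_rate * gradient         # step against the gradient to reduce the loss

print(round(w, 3))  # close to 5.0
```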
Probability and Statistics
Softmax Function: Converts raw model outputs (logits) into probabilities. The model assigns a probability to each possible next word, and the word with the highest probability is selected.
Cross-Entropy Loss: This is a common loss function used to measure how well the model’s predicted probability distribution matches the actual distribution. It’s essential for training.
Maximum Likelihood Estimation (MLE): The model is trained to maximize the likelihood of the correct output, given the input. This involves using probability distributions over possible outcomes.
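Here is a small NumPy sketch of the softmax and cross-entropy pieces described above, with made-up logits for a tiny three-word vocabulary:

```python
import numpy as np

# Made-up raw model outputs (logits) for a tiny 3-word vocabulary.
logits = np.array([2.0, 1.0, 0.1])

# Softmax: turn logits into probabilities that sum to 1.
probs = np.exp(logits) / np.sum(np.exp(logits))
print(probs, probs.sum())          # roughly [0.66 0.24 0.10], sums to 1.0

# Suppose the correct next word is the one at index 0.
correct_index = 0

# Cross-entropy loss: -log(probability assigned to the correct word).
loss = -np.log(probs[correct_index])
print(loss)  # small when the model puts high probability on the right word
```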
Optimization
Backpropagation: Understand the chain rule of derivatives (from calculus) to see how backpropagation works in neural networks, allowing the model to adjust its weights based on errors.
Stochastic Gradient Descent (SGD): You’ll need to be comfortable with this optimization technique, which is used to update weights in a way that minimizes the loss function over small batches of data.
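A rough sketch of stochastic gradient descent on a made-up linear problem, updating the weight from small random batches instead of the whole dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up dataset generated from y = 3 * x plus noise; SGD should recover w close to 3.
X = rng.uniform(-1, 1, size=1000)
Y = 3.0 * X + rng.normal(0, 0.1, size=1000)

w, learning_rate, batch_size = 0.0, 0.1, 32

for step in range(500):
    idx = rng.integers(0, len(X), size=batch_size)   # pick a small random batch
    x_batch, y_batch = X[idx], Y[idx]
    predictions = w * x_batch
    # Gradient of the mean squared error with respect to w, estimated on the batch only.
    gradient = np.mean(2 * (predictions - y_batch) * x_batch)
    w -= learning_rate * gradient

print(round(w, 2))  # approximately 3.0
```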
Discrete Mathematics
Attention Mechanism: The core of the GPT model is the self-attention mechanism. Understanding how the model calculates attention scores between different words in a sentence (using dot products and normalization) is crucial.
Dot Products: Used to measure the similarity between different word vectors in the attention mechanism. You should understand how dot products work to see how the model focuses on certain parts of the input.
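A quick illustration of the dot product as a similarity score, using made-up word vectors:

```python
import numpy as np

# Made-up 3-dimensional "embeddings"; real models learn these and use far more dimensions.
cat = np.array([0.9, 0.1, 0.3])
dog = np.array([0.8, 0.2, 0.35])
car = np.array([-0.5, 0.9, -0.2])

# Dot product: large when two vectors point in similar directions.
print(np.dot(cat, dog))  # relatively high -- "cat" and "dog" look related
print(np.dot(cat, car))  # lower -- "cat" and "car" look unrelated
```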
Entropy
Understanding entropy can help in learning about the uncertainty in probability distributions, particularly when discussing loss functions like cross-entropy.
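For intuition, here is entropy computed for a confident versus an uncertain distribution (made-up numbers):

```python
import numpy as np

def entropy(p):
    # Shannon entropy in bits: high when the distribution is spread out (uncertain).
    return -np.sum(p * np.log2(p))

confident = np.array([0.97, 0.01, 0.01, 0.01])  # model is almost sure of one word
uncertain = np.array([0.25, 0.25, 0.25, 0.25])  # model has no idea

print(entropy(confident))  # low, about 0.24 bits
print(entropy(uncertain))  # high, exactly 2.0 bits
```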
Linear Transformations in Attention
The attention mechanism in GPT relies on linear transformations of the input data. This involves applying learned matrices to the input to produce query, key, and value vectors, then using dot products to compute attention scores.
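Putting the last few ideas together, here is a minimal self-attention sketch in NumPy. It uses random weight matrices and a toy “sentence” of 4 tokens; real transformers add multiple heads, masking, and much larger dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model = 4, 8                    # a toy "sentence" of 4 tokens, 8-dimensional embeddings
x = rng.normal(size=(seq_len, d_model))    # stand-in for the token embeddings

# Learned linear transformations (here just random) that produce queries, keys, values.
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Attention scores: dot products between queries and keys, scaled by sqrt(d_model).
scores = Q @ K.T / np.sqrt(d_model)

# Softmax over each row turns scores into attention weights that sum to 1.
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# Each token's output is a weighted mix of the value vectors.
output = weights @ V
print(weights.shape, output.shape)  # (4, 4) (4, 8)
```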
“What is this telling me?”
When you are having difficulty with a particular piece of math, literally ask yourself “What is this telling me?”. It’s the best way to pinpoint what information you gain by performing a certain operation.
Where do I start?
Get used to Andrej Karpathy because he is the AI boss.
This one is for a general audience to understand Large Language Models (LLMs). Start right here.
Then there is a playlist of 10 videos called Neural Networks: Zero to Hero; this is where we explore the rules underneath the magic of AI.
Within this playlist you have ‘Let’s build GPT: from scratch, in code, spelled out.’ You will learn the transformer architecture; everything will make sense with this video.
The paper “Attention Is All You Need” (2017) introduced the model which is central to GPT’s design.
And you also have ‘Let’s reproduce GPT-2 (124M)’.
Terminology
Every new technology brings new terminology. Here is the essential list.
What is ChatGPT?
ChatGPT is an interactive, text-based AI system. It uses probabilities to generate different answers to the same question. Essentially, it is a language model, meaning it predicts the most likely next word in a sentence based on the data it was trained on.
GPT stands for Generative Pre-trained Transformer. This refers to the type of model and how it was trained, which we’ll explain shortly.
People & AI
There are three general groups when it comes to AI:
Tech Accelerators: They want AI to advance quickly, even aiming for futuristic breakthroughs like the singularity (where AI surpasses human intelligence).
Tech Luddites: They are skeptical of rapid AI growth and prefer to slow it down.
Dave: He doesn’t care much about the debate. He just enjoys the fun and uses it casually.
What is a Neural Network?
A neural network is the system that powers AI like ChatGPT. The most important type for GPT is called a Transformer (“Attention Is All You Need”).
The transformer is like the blueprint for how GPT works—how it processes data and predicts text using an attention mechanism. It revolutionized how AI models handle language by making them more efficient and capable of processing large sequences of text.
What is Training?
Training means teaching the model using a large set of data. For example, if we train the model with the US Constitution, the model learns to predict the next word of any sentence from that document. However, the model can sometimes make mistakes or “hallucinate” (generate incorrect or nonsensical responses).
Once training is complete, the model can generate predictions or responses based on any input provided.
What are Tokens?
Tokens are chunks of words, not full words. AI splits text into these smaller pieces to process it more efficiently. For example, the word “chatbot” might be broken into “chat” and “bot” as two separate tokens.
Tokens help the model process and understand language at a more granular level, allowing for better prediction and handling of a wide variety of inputs.
What is Tokenization?
Tokenization is the process of splitting text into tokens—small, manageable pieces of language. These can be words, parts of words, or even characters. Tokenization allows the model to handle text more efficiently and process information in smaller chunks.
For instance, the word “unbelievable” might be tokenized into “un,” “believe,” and “able.” Tokenization is crucial because it enables the model to process even unknown or complex words effectively.
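If you want to see real tokens, the tiktoken library (a tokenizer library used for OpenAI models) lets you inspect them. The exact splits depend on the tokenizer, so treat the output as illustrative:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("gpt2")   # the tokenizer used by GPT-2

ids = enc.encode("unbelievable chatbot")
print(ids)                                  # a list of integer token IDs
print([enc.decode([i]) for i in ids])       # the text chunk behind each ID
```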
What are Model Parameters?
Parameters are the internal settings of the model that adjust during training. For example, GPT-2 has 124 million parameters. The more parameters a model has, the more complex patterns and relationships it can learn from data, resulting in a more accurate and detailed model.
Parameters are like dials that the model tunes to improve its predictions and performance.
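As a tiny illustration (nothing like GPT-2’s 124 million), here is how you might count the parameters of a small PyTorch network:

```python
import torch.nn as nn

# A tiny made-up network: two linear layers.
model = nn.Sequential(
    nn.Linear(8, 16),   # 8*16 weights + 16 biases = 144 parameters
    nn.ReLU(),
    nn.Linear(16, 4),   # 16*4 weights + 4 biases = 68 parameters
)

total = sum(p.numel() for p in model.parameters())
print(total)  # 212 -- GPT-2 small has about 124,000,000 of these
```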
What are Models?
A model is the AI system itself, trained on a specific set of data. For example, a model can be trained to perform various tasks, like converting text to video or generating images from text. Each model is specialized based on the data and tasks it was trained for.
In the case of GPT models, they are trained primarily on large text datasets, allowing them to understand and generate human-like text.
Training on Data Types
A model can be trained on any data type, but GPT is primarily trained on text data. You can take the GPT model and train it on a specific text dataset, like books, articles, or conversations. This enables the model to specialize in different domains, from casual conversation to more technical topics.
What are Weights?
Weights are numerical values inside the model that adjust during training. They control how much influence each part of the input data (tokens) has on the final prediction. As the model processes data, the weights are continuously updated to improve accuracy.
Weights get “tuned” based on how well the model performs during training, allowing it to make better predictions over time.
What is the Transformer Architecture?
The transformer is a type of neural network used in GPT models. It uses a mechanism called self-attention to decide which words or tokens in a sentence are most important when making predictions.
Transformers allow the model to process text in parallel, meaning it can handle entire sequences of words at once, making them more efficient and powerful for tasks like language generation.
What is a Context Window?
A context window refers to how much input text the model can “see” or remember at once when making predictions. For example, if the model has a context window of 2048 tokens, it can only process and generate text based on that number of tokens at a time.
A larger context window allows the model to consider more text at once, making it better at handling long conversations or documents.
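A rough sketch of what a context window limit means in practice: only the most recent tokens fit, so older ones are dropped (the token IDs and window size here are made up; real systems may also summarize or otherwise compress history):

```python
CONTEXT_WINDOW = 8   # real models use thousands of tokens; this is just for illustration

token_ids = list(range(20))          # pretend these are 20 tokens of conversation history

# The model can only "see" the last CONTEXT_WINDOW tokens.
visible = token_ids[-CONTEXT_WINDOW:]
print(visible)  # [12, 13, ..., 19] -- everything earlier is forgotten
```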
What is Backpropagation?
Backpropagation is a key part of training neural networks. It’s the process where the model learns from its mistakes by sending information backward through the network to adjust the weights and improve accuracy.
During training, after the model makes a prediction, backpropagation compares the prediction with the correct answer and adjusts the model’s parameters to reduce the error for future predictions.
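A minimal look at backpropagation using PyTorch’s autograd, which applies the chain rule for you (toy numbers):

```python
import torch

w = torch.tensor(0.0, requires_grad=True)   # a single trainable weight
x, y = torch.tensor(2.0), torch.tensor(10.0)

prediction = w * x
loss = (prediction - y) ** 2   # compare the prediction with the correct answer

loss.backward()        # backpropagation: compute d(loss)/d(w) via the chain rule
print(w.grad)          # tensor(-40.) -- the gradient used to adjust the weight

with torch.no_grad():
    w -= 0.01 * w.grad  # one small weight update to reduce the error
```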
What is Inference?
Inference is the process of using a trained model to make predictions based on new input data. Once the model is trained, it no longer needs to learn; it simply infers or predicts based on the patterns it learned during training.
For example, when you ask ChatGPT a question, it uses inference to generate a response by predicting the next word or phrase based on your input and its training data.
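Here is a heavily simplified sketch of what inference looks like: repeatedly predict a next token and append it to the sequence. The `model` function below is a made-up stand-in that returns random logits, not a real API:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE = 50

def model(token_ids):
    # Hypothetical stand-in: a real model would return learned logits for the next token.
    return rng.normal(size=VOCAB_SIZE)

def generate(prompt_ids, n_new_tokens=5):
    ids = list(prompt_ids)
    for _ in range(n_new_tokens):
        logits = model(ids)
        probs = np.exp(logits) / np.exp(logits).sum()   # softmax over the vocabulary
        next_id = int(np.argmax(probs))                 # greedy: pick the most likely token
        ids.append(next_id)
    return ids

print(generate([1, 2, 3]))  # the prompt plus 5 "predicted" tokens
```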
What is Fine-Tuning?
Fine-tuning is the process of taking a pre-trained model and training it further on a specific dataset to specialize it for a particular task. For example, you might take a general language model like GPT and fine-tune it on medical texts to create a specialized AI for healthcare-related questions.
What is Zero-Shot Learning?
Zero-shot learning refers to the ability of a model to perform a task it has never explicitly been trained on. For example, if ChatGPT is asked to generate a recipe for a dish it’s never seen in its training data, it may still be able to do so by inferring from its general knowledge of cooking.
What is Few-Shot Learning?
Few-shot learning allows the model to perform a task after seeing only a few examples. For instance, if you give ChatGPT a couple of examples of how to write a particular kind of poem, it can then generate similar poems even if it hasn’t been trained specifically for that task.
What is Pre-training?
Pre-training refers to the process where the model is initially trained on a large, diverse dataset (like the entirety of Wikipedia or books). This gives the model a general understanding of language and the world, which can later be fine-tuned for specific applications.
What is Transfer Learning?
Transfer learning is the idea that a model trained on one task (like understanding general language) can be adapted to perform a different, more specific task (like legal document classification) with minimal additional training.
What do Models Do?
Models like those in ChatGPT are part of machine learning, specifically a type called deep learning. These models use neural networks with many layers to learn patterns in data. In ChatGPT’s case, the model processes large amounts of text to predict language patterns, allowing it to generate human-like responses. While based on machine learning principles, models like GPT use advanced techniques like the transformer architecture, which is designed to handle large-scale language tasks efficiently.
Another example is image recognition models, such as those used in facial recognition or object detection. These models, like Convolutional Neural Networks (CNNs), also use deep learning but are optimized for visual data. CNNs learn to identify patterns in images by recognizing edges, textures, and shapes, making them highly effective for tasks like classifying objects or detecting faces in photos. Just like GPT models process text, CNN models process visual data through deep learning techniques.
Difference between CNNs and Transformers:
Convolutional Neural Networks (CNNs) are specialized for processing visual data, such as images and videos, by detecting patterns like edges, textures, and shapes. On the other hand, transformers are designed for handling text and sequential data, making them ideal for language tasks like those performed by ChatGPT. While CNNs excel at tasks like object detection, transformers focus on understanding and generating language.
AI Landscape Tree
1. Artificial Intelligence (AI)
Goal: Simulate human intelligence.
Types:
Narrow AI (e.g., language models, image recognition)
General AI (not yet achieved)
2. Machine Learning (ML)
Subset of AI: Models learn patterns from data to make predictions or decisions.
Types of Learning:
Supervised Learning: Learns from labeled data (e.g., classification, regression)
Unsupervised Learning: Learns from unlabeled data (e.g., clustering)
Reinforcement Learning: Learns through trial and error (e.g., game-playing AIs)
3. Deep Learning (DL)
Subset of Machine Learning: Uses neural networks with many layers (hence “deep”) to learn complex patterns.
Common Deep Learning Models:
GPT (Generative Pre-trained Transformer): For text/language tasks like ChatGPT.
CNN (Convolutional Neural Networks): For image and video processing (e.g., facial recognition).
RNN (Recurrent Neural Networks): For sequence data (e.g., time-series prediction, language translation).
4. Transformer Architecture
Used in Models like GPT: Designed to handle language processing and large text data efficiently through mechanisms like self-attention.
5. Convolutional Neural Networks (CNNs)
Used in Image/Video Recognition: Specialized for processing visual data, identifying features like edges and textures.