
I’ve been reading “Why We Need Continual Learning” by Malika Aubakirova and Matt Bornstein recently. I also listened to a podcast interview with Malika on a16z. Now, I’m no AI researcher. But I do like exploring the scientific foundations on which advanced software tools are built, especially since I use these applications every day and hope to leverage them more in the future. So although I don’t fully understand what’s actually happening underneath, poking around a bit is an interesting exercise. What follows below is what I’ve learned from the article. Consider it a work in progress. If you want the expert version from Malika and Matt, go read their original piece for a deep dive. This text here is just me working through things as best as I can at my level. At the end of this post, I also include a long list of terms with definitions. I’ll make that a standard feature in similar upcoming posts for my own short-term recall practice and also for long-term memory consolidation. Memory practice (the human kind) is a hobby of mine.

Anyway, here we go. The authors open their article on continual learning by referring back to Christopher Nolan’s “Memento,” which is a film about a man named Leonard Shelby who suffers from anterograde amnesia that prevents him from forming new memories. Every few minutes his world resets and he wakes up in the same perpetual present with no idea what just happened in the past. He tattoos notes on his body and carries Polaroids as memory aids just to function throughout the day. It turns out that he’s very resourceful because he uses whatever he can in his environment to get by. He even appears pretty capable within any given scene in the movie. But, as the authors put it, his tragedy is that “he can never compound. Every experience remains external.” So, I guess that means he can’t learn based on his present moment to prepare for the future like most of us who have normal memories.

That seems to be a good description of where AI models are right now. I thought about this when I first used ChatGPT and Grok a few years ago. It was clear from my chats that the models were not “learning” from our conversations at all. I kept spinning around in circles. And, in fact, some of those earlier models didn’t even know basic facts from current events, which was shocking since AI was sold to us as being super smart. That’s when I realized that the “learning” for LLMs took place at some point in the past and then they were locked shut while life continued on. That experience of an AI not knowing simple bits in the news rarely happens now, so the user experience has improved significantly. However, there’s a lot more to it than I realized from those first few frustrating conversations.

What’s Actually Happening When You Type Into That Text Box

Here’s what I didn’t fully understand before reading this article. When you type into a chat window and stuff happens before you get an answer, that process is not the model learning anything from your input. It’s reading what you gave it and generating a response. When the conversation ends, the model forgets everything. The next conversation starts from exactly the same place as every other new conversation. Initially, that felt unnerving so I had to figure out ways to leverage the knowledge from the LLM without all that forgetting going on.

The text box we type into is just a door into the system. What matters is the context window behind the door, which is everything the model can see at once. So, your message, the whole conversation history, any documents you shared, and any background instructions — all of these things represent what the model is working with when it responds. And it has a size limit. When it fills up, older content gets dropped to make room for new content. So if you spend an hour explaining your company’s internal processes to an AI assistant and then start a fresh conversation the next day in a new text box, the AI has no memory of the previous conversation. You have to start over. Not because it forgot. Because it never learned in the first place.
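To make that concrete, here’s a tiny Python sketch I put together of how a chat system might trim history to fit the window. This is my own toy illustration, not how any real product works, and the word-count budget is just a stand-in for real token counting.

```python
# Toy sketch of context-window trimming, for illustration only.
# Real systems count tokens with the model's tokenizer; here word
# count stands in as a rough proxy.
MAX_BUDGET = 50  # pretend the model can only "see" 50 words at once

def trim_context(messages, budget=MAX_BUDGET):
    """Keep the most recent messages that fit within the budget."""
    kept, used = [], 0
    for msg in reversed(messages):       # walk newest to oldest
        cost = len(msg.split())
        if used + cost > budget:
            break                        # older content gets dropped here
        kept.append(msg)
        used += cost
    return list(reversed(kept))          # restore chronological order

history = ["(an hour of setup about internal processes...) " * 10,
           "Here is today's question about the quarterly report."]
print(trim_context(history))  # only the recent message survives
```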

There’s a name for this phenomenon. The article calls it in-context learning, which is really just the model making smart use of whatever sits in front of it right now. It’s temporary by design. The model reads, responds, and moves on. It’s similar to glancing at your notes before a meeting rather than actually deeply studying, internalizing, and using the material beforehand. When the meeting ends, those casual notes go back in the drawer and are forgotten.

The Frozen Model Problem

To understand why this matters, you need to know a little about what’s inside these models. During training, a model reads an insane amount of text and gradually adjusts billions of numerical values called parameters or weights. You can think of each weight as a dial on a pipe connecting two nodes in the network controlling how much signal flows through. The model trains by turning billions of those dials very slightly over and over again until it gets good at predicting language. That right there is really impressive to me given the scale of information these models are working with. But when the training process ends, all those dials get locked. That stage represents deployment. The model then goes out into the world with its knowledge frozen in place.

Training works because it’s a compression process. The model can’t store everything it reads verbatim. It has to find the underlying patterns, generalize the data, and build something compact that transfers to new situations it’s never seen before. The authors describe this as lossy compression, and that lossiness is actually what produces what seems like intelligence to us when we talk to an AI. When I first read that I thought of a camera compressing a RAW file to a JPEG file. The RAW image contains all the available data, but the file is massive and requires editing in post-production to produce a beautiful image. The JPEG, however, is much smaller because it’s been compressed by the camera to just what’s needed to display a good quality image at a certain size. I’ve always understood that process in photography, but I didn’t realize that LLMs go through a similar process.

Here’s another way to think about it. Remember when you first learned how to ride a bike? You didn’t read the entire manual every time. You just got some guidance from a friend or a parent and you practiced. You fell down a few times and adjusted your technique, and then eventually your brain distilled your experience into something automatic and compact. That’s compression. You still remember falling down, but the details of each fall are no longer needed for riding once the learning has taken place. What remains is the final skill of balancing to ride. An AI model that memorized every training sentence perfectly would be less useful, not more, because it would have no ability to generalize. It would just be a search engine.

The painful irony the authors identify is this. The very mechanism that makes these models powerful during training is exactly what we stop them from doing once they’ve been deployed. We freeze the compression at the moment of release and replace it with what’s called external memory. That really hit me. I got it right away that this is a layered experience that will expand over time.

The Filing Cabinet

To compensate for frozen models, developers have built elaborate scaffolding systems, such as chat histories, retrieval databases, system prompts, external document stores, and more. All of these things make up what the article calls external memory. They are flexible and they live outside the model’s internal, frozen weights. When you need information, the system retrieves it and feeds it into the context window. Then the model reads it and responds.

This architecture works as is and the authors are honest about that. However, they make a point I hadn’t considered before. “A bigger filing cabinet is still a filing cabinet.” Retrieval is not learning. The model is looking things up, not actually knowing them. It just does it very quickly and uses natural language so you get the impression you are talking to someone who is intelligent.

Here’s another practical example. Say a hospital deploys an AI assistant to help with real world clinical decisions. That model was trained on medical literature through some cutoff date. A major new clinical trial or medical policy comes out afterward that changes how doctors treat a particular condition. The hospital can feed that paper into a retrieval database so the AI can surface it when it’s relevant. But the model doesn’t internalize that new research the way doctors would after reading it, applying it to patients, observing the outcomes, and revising their practice accordingly. The AI can retrieve the abstract. But it can’t reason from the new finding the way someone who has truly learned it can in practice. That’s the limitation these researchers are trying to fix.

The same problem exists in cybersecurity. Threats evolve daily. A frozen model can be given descriptions of new attack patterns through retrieval, but it can’t compress and generalize from those patterns the way an analyst who has spent months chasing a specific class of threat genuinely does. The knowledge stays external. It never becomes part of what the model actually knows unless the model is updated with a new learning process, which is time consuming and expensive.

What Real Learning Requires

So what’s the alternative? The article introduces a concept called continual learning, which is the field of research aimed at letting models actually update their weights based on new experience after deployment. Not just read notes. Actually learn live like humans do.

And here’s where the Memento metaphor really makes sense. The authors say that today’s AI is stuck in Leonard Shelby’s perpetual present. The scaffolding, the Polaroids and tattoos, and other memory aids work well enough within any given scene. But the model can never compound in real time. Every new thing it encounters stays external.

Think about the difference between a doctor who simply retrieves a recent study and a doctor who has spent years treating patients with that knowledge fully and personally internalized. Or consider the difference between someone who has your email history in front of them and someone who actually knows how you think over time. The article frames this cleanly. “The difference between ‘Here is what you responded to this email before’ versus ‘I understand how you think well enough to anticipate what you need’ is the difference between retrieval and learning.” Even in normal human memory, immediate retrieval is necessary to manage your present experience. However, it’s also required that your present experience be embedded into long term memory for continual learning.

The authors bring up Fermat’s Last Theorem as the hardest version of this problem. Mathematicians worked on it for 350 years. Eventually the problem was solved by Andrew Wiles. But he didn’t crack it by retrieving the right papers. He solved it by working in near-total isolation for seven years and inventing entirely new mathematical techniques to bridge two previously disconnected fields. That kind of discovery required genuine compression, generalization, and creative combination. Not simply fast retrieval. And the article asks directly whether a model that can’t compound from experience could ever do anything like that. The honest answer is they don’t know yet.

Why Updating Weights Is So Hard

At this point I had to ask myself: if real-time continual learning is so important, why can’t LLMs do it now? The short answer is that updating a model’s weights after deployment is genuinely dangerous and technically unsolved at scale.

The most obvious problem is called catastrophic forgetting. When you update a model’s weights to learn something new, it tends to overwrite what it already knew. New learning crowds out old learning. If you fine-tune a general model specifically on medical records, it might get better at clinical language while getting noticeably worse at everything else because the new training has nudged weights that were also doing other jobs. The model gets better at one thing and potentially worse at everything it was already good at. When you understand this you can really appreciate how humans have benefited from millions of years of evolution. The AI machines seem rather clunky by comparison. When humans learn, new neural connections are made in the brain that stick for a long time as new learning is layered on top. But even in humans, old learning does actually fade gradually over time if a specific neural pathway isn’t continually reinforced. It just takes a very long period of time. With AI systems, however, the new learning wipes out the old learning immediately. The authors didn’t address this issue directly in humans, but the example seems similar if you study biology.
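To see the effect in miniature, here’s a toy Python sketch I worked out, which is nothing like a real LLM: one tiny linear model with a single set of weights is trained on task A, then trained only on task B, and its performance on task A collapses because both tasks are sharing the same dials.

```python
import numpy as np

# Toy illustration of catastrophic forgetting, not how real LLMs
# are trained: one tiny linear model, two incompatible "tasks."
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
task_a = X @ np.array([1., 2., 0., 0., 0.])   # task A's true rule
task_b = X @ np.array([0., 0., 0., 3., -1.])  # task B wants different weights

def sgd(targets, w, steps=500, lr=0.01):
    for _ in range(steps):
        grad = X.T @ (X @ w - targets) / len(X)
        w = w - lr * grad
    return w

def loss(targets, w):
    return float(np.mean((X @ w - targets) ** 2))

w = sgd(task_a, np.zeros(5))
print("task A loss after learning A:", round(loss(task_a, w), 4))

w = sgd(task_b, w)   # keep training, but only on task B
print("task A loss after learning B:", round(loss(task_a, w), 4))  # rises sharply
print("task B loss after learning B:", round(loss(task_b, w), 4))
```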

There’s also the problem of data poisoning. If a model’s weights can be updated through interactions after deployment, bad actors could gradually manipulate its behavior through carefully crafted inputs over time. Unlike a one-time attack, poisoned weights persist across every future conversation. The damage would live in the model itself, so safety alignment could degrade unpredictably, either immediately or at some point in the future. The article notes that “even narrow fine-tuning on benign data can produce broadly misaligned behavior,” which is a sobering thought to sit with. And we all know this would happen right away, based on our own experience being online every day fighting bots and hackers.

These aren’t hypothetical concerns. They’re real problems without clean solutions yet.

Where Things Are Heading

The article maps out a spectrum of approaches to continual learning that are organized around a question I found clarifying: where does the compaction actually happen? It seems there is a stack of technologies managing the process.

On one end you have pure retrieval. No compaction. The model just reads notes. That’s most of what exists today. In the middle there are modules, which are attachable and specialized components that let a model develop some expertise in a specific domain without retraining the entire thing from scratch. A hospital might attach a medical module to a general model so it performs at a specialist level on clinical questions, while the same base model with a different module handles legal contracts. Each module is swappable independently. That’s a practical middle ground for now.

On the far end you have full parametric learning, where the model’s weights actually update from new experience after deployment. This is the goal, but it remains largely unsolved at scale with current technologies. Still, there are serious research efforts moving in this direction, with things like test-time training, where the model runs brief learning cycles before it generates a response. There are also self-improvement approaches, where models like AlphaEvolve and AlphaProof have generated their own training data and genuinely improved from it, at least within constrained problem domains like mathematics.

The authors frame the path forward as layered. In-context learning stays as the first line of adaptation because it works now and keeps getting better. Modules offer some personalization and domain specialization. But for genuinely novel problems, adversarial scenarios, and knowledge too tacit to put into words, models may eventually need to compress new experience directly into their parameters after training. Otherwise, as the authors put it, we stay stuck in Memento’s perpetual present.

What I Took Away

I started reading this article as someone who uses AI tools every day without thinking much about what’s happening underneath. What I came away with is a clearer sense of the gap between what these systems appear to do, what they’re actually doing now, and what’s up for the future. The models appear to learn. They respond to new information, adapt to what you give them, and most times they feel like they understand you. But the reality is that they don’t compound. They don’t learn. Everything stays external. The dials are locked. And until engineers figure out how to update those dials safely and continuously after deployment, the models we’re using now are doing something more like reading notes than actually learning. That’s a distinction with a very big difference.

Check the article and the podcast for more context. Below is a list of related terms and definitions.


Continual Learning: Vocabulary List

This list of terms is based on the a16z article “Why We Need Continual Learning” by Malika Aubakirova and Matt Bornstein, the podcast interview with Malika discussing the article, and the field generally. It’s crafted from a long conversation I had with Claude to better understand the details. I error-checked the terms and definitions with Grok, ChatGPT, Gemini, Perplexity, and DeepSeek. So, hopefully it’s mostly accurate and helps with some initial human learning about AI continual learning.

Agentic Loops

A mode of operation where the model works autonomously step by step toward a goal without you typing each instruction. Each step produces output that feeds into the next. This process can go on for many cycles. The article identifies two related problems as steps accumulate: (1) the immediate symptom is coherence degradation, where the agent loses the thread and starts making poor decisions, and (2) the underlying cause is that maintaining a growing context becomes increasingly expensive and inefficient. Both concerns together are why the article frames agentic loops as a key pressure point on the current in-context learning paradigm. For example, an agent tasked with researching a topic, drafting a report, checking sources, and revising the draft might handle the first twenty steps cleanly. But by step eighty the accumulating context has grown so large and costly that the agent starts losing track of earlier decisions and repeating work it already did.

Attention Heads

A key mechanism inside transformers that allows the model to weigh how relevant each part of the context is to every other part when generating a response. Multiple attention heads run in parallel, each learning to focus on different kinds of relationships in the text. One head might learn to track grammatical agreement between subject and verb across a long sentence, while another tracks thematic connections between paragraphs. Together they allow transformers to handle complex, long range dependencies in language that earlier architectures struggled with. For example, in the sentence “The lawyer who argued the case, despite the objections raised by her colleagues, ultimately won,” an attention head helps the model correctly connect “won” back to “lawyer” across all the intervening words.
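Here’s a minimal numpy sketch of the core computation inside a single head, with toy dimensions I made up for illustration: every position scores every other position for relevance, then takes a relevance-weighted blend of their values.

```python
import numpy as np

# A single attention head in miniature (toy dimensions).
rng = np.random.default_rng(1)
seq_len, d = 6, 4                      # 6 tokens, 4-dim representations
x = rng.normal(size=(seq_len, d))      # token representations
Wq = rng.normal(size=(d, d))           # learned projection matrices
Wk = rng.normal(size=(d, d))
Wv = rng.normal(size=(d, d))

Q, K, V = x @ Wq, x @ Wk, x @ Wv
scores = Q @ K.T / np.sqrt(d)          # relevance of token j to token i
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
output = weights @ V                   # relevance-weighted blend of values

print(weights[-1].round(2))            # how the last token attends to all six
```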

Catastrophic Forgetting

When a model updates its weights to learn something fresh, it tends to overwrite what it already knew. In other words, new learning crowds out old learning and sometimes dramatically. This is one of the central unsolved problems in continual learning, and one of the main reasons models are not updated continuously after deployment. Think of it like overwriting a hard drive. The new files go in, but the old ones can be partially or fully lost. For example, if you fine-tune a general purpose model specifically on a medical records archive, the model will get better at clinical language but noticeably worse at writing poetry or explaining history because the new training has nudged weights that were doing other jobs.

Compression / Compaction

The process of taking a vast amount of raw information and distilling it into something compact and generalized. During training, a model compresses an enormous amount of human writing into its parameters and finds the underlying patterns rather than storing things verbatim. The article uses “compaction” as a broad organizing term for how deeply new information gets digested, which ranges from not at all (pure retrieval, where facts just sit in a database) to fully (weight-level learning, where the model actually internalizes new knowledge). For example, rather than memorizing every recipe ever written, a well-trained model compresses the underlying logic of cooking: how heat transforms food, how flavors balance, how techniques generalize across cuisines.

Continual Learning

The broader field of research aimed at letting models learn from new experience after deployment, ideally by updating their weights rather than relying on external scaffolding. It’s the opposite of the current norm, where training and deployment are completely separate and weights are frozen the moment a model is released. The goal is something closer to how humans learn continuously from experience without needing to be retrained from scratch every time the world changes. For example, a customer service model using continual learning could gradually internalize patterns from thousands of resolved support tickets over time, getting genuinely better at its job rather than just retrieving past examples.

Context Window

The full body of text the model can see at once when generating a response. It includes your message, the full conversation history, any documents you shared, and any background instructions passed to the model. It has a size limit measured in tokens. When it fills up, older content must be dropped to make space for new content. For example, if you have a long conversation with an AI assistant and then ask it to recall something you mentioned earlier, it may not be able to answer because that part of the conversation has already been pushed out of the window.

Data Poisoning

One of several serious governance and security risks the article raises around continuous weight updates. If a model’s weights can be updated after deployment interactions, bad actors could gradually manipulate its behavior through carefully crafted inputs over time, which is a slow and hard-to-detect form of corruption that lives in the weights rather than just in the context. Unlike a one-time prompt injection attack, poisoned weights persist across every future conversation. The article groups this alongside other unsolved challenges: alignment degradation, the impossibility of unlearning toxic knowledge, auditability failures, and privacy risks from user interactions being compressed into parameters. For example, an adversary could repeatedly feed a customer-facing AI subtly misleading information about a competitor’s product until the model begins reproducing those inaccuracies on its own with no obvious sign of tampering.

Distillation

A process involving two models: (1) a large, capable, frozen teacher and (2) a smaller student. The student is trained to match the teacher’s outputs as closely as possible and absorb its knowledge in a more compact form. The result is a smaller, more efficient model that performs nearly as well as the larger model on the tasks it was trained for. It’s like an apprentice learning by closely watching and mimicking a master until the skill becomes their own. For example, a large hospital system might use a massive general-purpose model as the teacher and distill its medical reasoning capabilities into a smaller model that can run efficiently on local hospital hardware without requiring a cloud connection.
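A minimal sketch of the training signal, with toy numbers I made up: the student is pushed to match the teacher’s full output distribution, softened by a temperature, rather than just the single right answer.

```python
import numpy as np

# Sketch of the distillation idea: the student is trained to match
# the teacher's full output distribution, not just the top answer.
def softmax(logits, T=1.0):
    z = np.exp((logits - logits.max()) / T)
    return z / z.sum()

teacher_logits = np.array([4.0, 1.5, 0.2])   # frozen teacher's raw scores
student_logits = np.array([2.0, 2.0, 1.0])   # smaller student's raw scores

T = 2.0                                      # temperature softens both
p_teacher = softmax(teacher_logits, T)
p_student = softmax(student_logits, T)

# KL divergence: the training signal that pushes the student's
# distribution toward the teacher's.
kl = np.sum(p_teacher * np.log(p_teacher / p_student))
print("distillation loss (KL):", round(float(kl), 4))
```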

External Memory

Anything outside the model’s weights used to store and retrieve information. Chat history, databases, document stores, and agent notes are all examples of external memory. Information gets fed back into the context window when necessary. The model does not update its weights from that information during deployment, but similar knowledge could later be internalized through additional training or fine-tuning. The key limitation is that external memory requires retrieval. The model has to be given the right information at the right moment, and if it isn’t, the knowledge might as well not exist. For example, a legal AI might have a database of ten thousand case summaries it can search, but if the retrieval system surfaces the wrong cases, the model has no way to compensate from its own knowledge.

Few-Shot Learning

The ability of a model to perform well on a new task after seeing only a handful of examples, rather than requiring thousands of training samples. Transformers are surprisingly good at this when examples are provided in the context window. Meta-learning approaches aim to make weight-level, few-shot learning just as effective, so the model can internalize new tasks from just a few examples even without them being available in the context. For example, if you show a model three examples of how you want your emails formatted and then ask it to format a fourth, it adapts immediately without any retraining. That’s few-shot learning in action.
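Here’s a small sketch of what that looks like mechanically. The examples and the new case are simply assembled into one prompt string, which is all the model ever sees. The notes and formats are invented for illustration.

```python
# Few-shot learning lives entirely in the prompt: three formatting
# examples followed by the new case. No weights change anywhere.
examples = [
    ("meeting moved to 3pm", "Subject: Schedule Update\nBody: Meeting moved to 3pm."),
    ("budget approved", "Subject: Budget News\nBody: Budget approved."),
    ("server is down", "Subject: Incident Alert\nBody: Server is down."),
]

prompt = "Rewrite each note as a formatted email.\n\n"
for note, email in examples:
    prompt += f"Note: {note}\nEmail: {email}\n\n"
prompt += "Note: launch delayed one week\nEmail:"

print(prompt)  # this full string is what the model actually "sees"
```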

Fine-Tuning

A more targeted form of additional training done after the initial training run. Instead of training from scratch on everything that’s known, you take an already-trained model and update it on a smaller or specific dataset. The new information shapes the model’s behavior for a particular use case without rebuilding it from the ground up, but the process still risks catastrophic forgetting if pushed too hard. For example, a company might take a general-purpose language model and fine-tune it on thousands of their internal support conversations, so the model learns the company’s terminology, tone, and common issue patterns without losing its broader language capabilities.

Gradient Descent

The mathematical process by which a model adjusts its weights during training. It measures how wrong the model’s predictions are on a given example and then calculates which direction to nudge each weight to reduce that error slightly. It’s called “descent” because the process is navigating downhill on a mathematical landscape, always moving toward lower error rates. Repeat this across billions of examples and the model gradually gets much better. For example, if the model predicts “cat” when the correct answer is “dog,” gradient descent works backward through the network to figure out which weights contributed to that wrong answer and adjusts them a tiny amount. Do that enough times and the model learns to tell cats from dogs reliably.
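Here’s the idea in a few lines of Python, on a single made-up dial rather than billions: compute the slope of the error at the current weight, then nudge the weight against the slope.

```python
# Gradient descent in its simplest form: nudge one "dial" downhill
# on the error surface until the error stops shrinking.
def error(w):
    return (w - 3.0) ** 2          # lowest error when w == 3

def gradient(w):
    return 2.0 * (w - 3.0)         # slope of the error surface at w

w, lr = 0.0, 0.1                   # start far from the answer
for step in range(50):
    w -= lr * gradient(w)          # nudge the dial against the slope

print(round(w, 4), round(error(w), 8))   # w has descended to ~3.0
```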

In-Context Learning (ICL)

Everything the model reads and uses during a single conversation without updating its underlying knowledge. You paste in a document, it reads it and responds. You describe a task, it follows your instructions. But when the conversation ends, none of that experience changes the model itself. The next conversation starts with the same frozen weights as always. This is a smart use of temporary information, but it’s not genuine learning. For example, if you spend an hour teaching an AI assistant about your company’s internal processes and then start a new conversation the next day, the model will have no memory of the previous conversation. You would need to paste in that information all over again.

Inference

The act of a model generating a response from input. It’s the opposite of training. Training occurs when the model learns by adjusting its weights. Inference occurs when the frozen model takes what it knows and produces an output. Any time you send a message and get a reply, that’s inference. The term “inference-time compute” (below) builds on this and refers specifically to spending extra computational effort during inference to get a better result. But plain inference just means the model is running, not learning. For example, asking a model what the capital of France is and getting back “Paris” in a fraction of a second is inference in its simplest form. No learning took place. The model just performed a simple action.

Inference-Time Compute

The current dominant paradigm for improving model performance by spending more computational effort at the moment of response rather than updating weights. This includes chain-of-thought reasoning, tool use, search, and iterative problem-solving, all of which cost more compute at response time but produce better results. The article positions this process as a workaround, a scaling of what already works rather than a true solution to the learning problem. Test-time training is the most aggressive form of this learning because it actually runs gradient updates on new information during inference, which begins to compress it into weights in real time. This process sits at the boundary between the current paradigm and genuine parametric learning. For example, when you ask a model a complex math problem and it works through each step before giving a final answer rather than just guessing immediately, that is inference-time compute. The model is using more processing in the moment to arrive at a better result.

Instruction Tuning

A form of fine-tuning where the model is trained specifically on examples of instructions paired with ideal responses. It’s one of the main reasons modern models are so much better at following directions than earlier versions, which tended to just complete text rather than actually do what you asked. The model learns not just facts but the shape of helpful behavior, including how to interpret requests, how to structure answers, and when to ask for clarification. For example, an early language model asked to “summarize this article” might just continue writing in the same style as the article. An instruction-tuned model understands that the request calls for a concise, distinct summary and produces one.

KV Cache

Short for key-value cache. A technical mechanism that stores intermediate computations during inference so the model does not have to redo them from scratch for every token it generates. The article discusses it specifically in the context of KV cache compaction where the cache functions as a form of non-parametric memory but grows substantially as conversations and agent loops get longer. The authors argue that learning to compress this cache more efficiently is one of the meaningful challenges in moving from pure retrieval toward more durable knowledge storage. For example, in a long agentic task, the KV cache holds the computed representations of everything the model has processed so far. Without it, each new token would require reprocessing the entire history from scratch, which would be prohibitively slow.
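Here’s a toy numpy sketch of the bookkeeping, with dimensions I made up: without a cache, every generation step reprojects the whole history into keys and values; with a cache, only the newest token gets projected and appended.

```python
import numpy as np

# Toy sketch of the bookkeeping behind a KV cache (made-up sizes).
d = 8
rng = np.random.default_rng(2)
Wk, Wv = rng.normal(size=(d, d)), rng.normal(size=(d, d))
tokens = rng.normal(size=(5, d))      # five tokens arriving one by one

# Without a cache: each new token forces reprojecting the entire
# history into keys and values from scratch (O(history) per step).
def project_no_cache(all_tokens):
    return all_tokens @ Wk, all_tokens @ Wv

# With a cache: only the newest token is projected, then appended
# to the stored keys and values (O(1) new work per step).
cache = {"K": np.empty((0, d)), "V": np.empty((0, d))}
def project_with_cache(new_token):
    cache["K"] = np.vstack([cache["K"], new_token @ Wk])
    cache["V"] = np.vstack([cache["V"], new_token @ Wv])
    return cache["K"], cache["V"]

for t in tokens:
    K, V = project_with_cache(t)

K_full, V_full = project_no_cache(tokens)
print(np.allclose(K, K_full), np.allclose(V, V_full))  # True True: same result, far less rework
```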

Lossy Compression

Compression where some information is permanently lost in the process, as opposed to lossless compression where everything can be recovered exactly. For LLMs, the inability to store everything verbatim during training forces the model to find patterns, generalize, and abstract. That forced abstraction is precisely what makes the model seem intelligent and useful in new situations it has never seen before. A JPEG image is the familiar everyday example. Save a photo as a JPEG and the file shrinks dramatically because fine detail is discarded. But if you zoom in close enough you can see the degradation. For most purposes, though, the image is perfectly usable. The tradeoff is the point. For a language model, the equivalent is that it cannot recite every sentence it ever trained on, but it can write a new sentence in any style on any topic because it extracted the underlying structure rather than memorizing the surface.

Meta-Learning

Teaching a model how to learn rather than what to learn. The model is pre-trained in a way that positions it to update quickly and effectively with just a few new examples, rather than requiring extensive retraining. It’s the difference between educating someone to be a quick study versus simply giving them a lot of facts to memorize. A quick study can walk into an unfamiliar subject and get up to speed fast, whereas someone who only memorized facts cannot. For example, a meta-learned model shown three examples of a new classification task, say sorting customer complaints into categories it has never seen before, should be able to generalize accurately to new complaints after just those three examples rather than needing hundreds.

Modules

The article uses this as a broad middle-ground category on the compaction spectrum that sits between pure retrieval and full weight-level learning. In practice, modules can take several forms: adapter layers, LoRA-style weight updates, memory components, or cached representations. What they share is the ability to specialize a general-purpose model for a specific domain without retraining the entire model from scratch. They offer more than retrieval in that some digestion of information happens, but less than full parametric learning in that the core model does not change. For example, a hospital might attach a medical module to a general-purpose model so it performs at a specialist level on clinical questions, while the same base model with a legal module performs at a specialist level on contract review, with each module being swappable independently.
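Here’s a rough numpy sketch of the LoRA-style flavor of this idea, with toy sizes I chose for illustration: the big weight matrix stays frozen while a small low-rank correction, trainable per domain, is added on top and can be swapped out.

```python
import numpy as np

# Sketch of a LoRA-style module (toy sizes): the frozen matrix W
# never changes; a small low-rank correction B @ A is trained per
# domain and swapped in and out.
rng = np.random.default_rng(3)
d, r = 64, 4                             # full width 64, module rank 4

W = rng.normal(size=(d, d))              # frozen base weights
A_med = rng.normal(size=(r, d)) * 0.01   # "medical module" (trainable)
B_med = np.zeros((d, r))                 # initialized so the module starts as a no-op

def forward(x, A=None, B=None):
    out = x @ W.T                        # the frozen model's contribution
    if A is not None:
        out += x @ (B @ A).T             # the attached module's correction
    return out

x = rng.normal(size=d)
print(np.allclose(forward(x), forward(x, A_med, B_med)))  # True: module starts neutral
```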

Multi-Agent Architectures

Systems where multiple AI models work in parallel with each one handling a slice of a larger task and communicating results to each other or to an orchestrating layer. If a single model is limited by its context window, a coordinated group of agents can collectively handle far more. But this shifts the problem rather than eliminating it. Each agent still faces its own context limit, and coordinating many smaller contexts introduces its own complexity for the system to manage. It’s a non-parametric workaround for scale, not a solution to the underlying constraint. For example, a research task that would overflow one model’s context window might be split across ten agents, each reading a different section of source material with a coordinating agent assembling their summaries into a final report.

Neural Network

The underlying computational structure of an LLM. It’s a network of interconnected nodes organized in layers, loosely inspired by neurons in the brain. But the analogy should not be pushed too far. Each connection between nodes has a weight that determines how strongly one node influences another. Information flows forward through the layers, gets transformed at each step, and eventually produces an output. The network learns by adjusting those weights during training until it gets good at its task. For example, in an image recognition network, early layers might learn to detect simple edges and colors, middle layers might learn to recognize shapes, and later layers might learn to identify objects. Language models work on the same principle but applied to sequences of text.
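A miniature forward pass makes the flow concrete. The sizes here are made up: information enters, each layer transforms it, and the weights decide how strongly each node influences the next.

```python
import numpy as np

# A two-layer network in miniature (made-up sizes).
rng = np.random.default_rng(5)
W1 = rng.normal(size=(4, 3))   # layer 1 weights: 3 inputs -> 4 nodes
W2 = rng.normal(size=(2, 4))   # layer 2 weights: 4 nodes -> 2 outputs

def forward(x):
    h = np.maximum(0, W1 @ x)  # transform, then keep positive signals (ReLU)
    return W2 @ h              # final transformation to the output

print(forward(np.array([1.0, -0.5, 2.0])))
```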

Parameters / Weights

The billions of numerical values inside a model that encode everything it learned during training. Each value represents the strength of a connection between two nodes in the neural network. During training, these values get adjusted gradually until the model becomes good at predicting language. After training they are frozen, and the model’s knowledge and capabilities are entirely determined by those fixed numbers. “Parameters” and “weights” refer to the same thing and are used interchangeably throughout the article. For example, GPT-4 is estimated to have around a trillion parameters. Each one is a small dial that was tuned during training and now stays locked in place, collectively encoding an enormous amount of compressed knowledge about language, facts, and reasoning patterns.

Parametric Learning

Learning that actually updates the model’s weights based on new experience, as opposed to in-context learning which uses information temporarily without changing anything permanent. It’s the deeper form of learning the article is ultimately arguing we need more of. When a model learns parametrically, new knowledge gets compressed into its weights the same way training data did and becomes a durable part of what it knows rather than a note it holds briefly and then discards. For example, a parametric update after a model encounters thousands of conversations about a new programming language would leave it genuinely better at that language going forward across all future conversations, not just within the session where it learned.

Regularization

A cautious approach to weight updates that penalizes changes to parameters deemed important to existing knowledge. Before updating a weight, the system estimates how critical that weight is to the model’s current capabilities. If it’s very important, the update is constrained or slowed down. This is one of the older approaches to continual learning and helps manage the stability-plasticity dilemma. But it tends to be brittle at scale. Think of it like a renovation rule that protects load-bearing walls. You can still remodel, but certain structures are off-limits because removing them would collapse the building. For example, EWC (Elastic Weight Consolidation), one of the most cited regularization methods, computes an importance score for each weight after training on a task and uses that score to resist changes when training on subsequent tasks.
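A tiny sketch of the EWC-style penalty with made-up numbers: each weight’s importance score scales the cost of moving it away from its post-training value, so critical weights resist change.

```python
import numpy as np

# Sketch of an EWC-style penalty (toy values): weights important to
# the old task resist being moved by the new one.
theta_star = np.array([1.0, 2.0, 0.5])    # weights after the old task
fisher = np.array([9.0, 0.1, 0.1])        # importance: weight 0 is critical

def ewc_loss(theta, new_task_loss, lam=1.0):
    penalty = 0.5 * lam * np.sum(fisher * (theta - theta_star) ** 2)
    return new_task_loss + penalty

# Moving the unimportant weight is cheap; moving the important one is not.
theta_a = theta_star + np.array([0.0, 1.0, 0.0])   # nudge weight 1
theta_b = theta_star + np.array([1.0, 0.0, 0.0])   # nudge weight 0
print(ewc_loss(theta_a, 0.0))   # 0.05 -> barely penalized
print(ewc_loss(theta_b, 0.0))   # 4.5  -> strongly resisted
```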

Reinforcement Learning (RL)

A training approach where a model learns from feedback signals rather than from labeled examples. It tries things, receives a reward or penalty based on how well it did, and adjusts its behavior accordingly over many iterations. The article mentions RL-based feedback loops as one direction in continual learning research where models could improve from real-world deployment signals like user corrections or task outcomes. However, it’s not the central mechanism the authors emphasize. The core focus of the article is on compaction, weight updates, and memory structures. For example, the systems that learned to play chess and Go at superhuman levels used reinforcement learning by playing millions of games against themselves and adjusting strategies based on wins and losses rather than being taught explicit strategies.
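A bare-bones sketch of the try-and-adjust loop, using a two-armed bandit with invented payoffs rather than anything from the article: the system samples actions, receives rewards, and shifts toward what pays off.

```python
import numpy as np

# Minimal reinforcement learning loop (two-armed bandit, toy rewards).
rng = np.random.default_rng(6)
true_payoff = [0.3, 0.7]        # hidden: action 1 is actually better
estimates = [0.0, 0.0]
counts = [0, 0]

for step in range(1000):
    # Mostly exploit the best-looking action, sometimes explore.
    a = int(rng.integers(2)) if rng.random() < 0.1 else int(np.argmax(estimates))
    reward = float(rng.random() < true_payoff[a])        # noisy feedback signal
    counts[a] += 1
    estimates[a] += (reward - estimates[a]) / counts[a]  # running average

print([round(e, 2) for e in estimates])  # converges near [0.3, 0.7]
```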

Retrieval-Augmented Generation (RAG)

A common approach to giving models access to current or specialized information without retraining. Instead of baking knowledge into weights, you build a searchable database the model can query at response time. The retrieved content gets injected into the context window and the model uses it to generate its answer. It’s purely non-parametric. The model retrieves information but never internalizes it. The limitation is that retrieval only works if the right information gets surfaced at the right time, and no amount of retrieval can substitute for knowledge the model needs to reason with flexibly. For example, a financial AI might use RAG to pull in the latest earnings reports before answering questions about a company’s performance because that information changes constantly and cannot be baked into training data.
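Here’s a minimal end-to-end sketch. Real RAG systems rank documents with vector embeddings; I’ve swapped in crude keyword overlap to keep the toy self-contained, and the documents and query are invented.

```python
# Minimal RAG sketch with keyword overlap standing in for embeddings.
documents = [
    "Q3 earnings: revenue rose 12 percent year over year.",
    "The company appointed a new chief financial officer in May.",
    "Product recall announced for the 2024 model line.",
]

def retrieve(query, docs, k=1):
    """Score each document by shared words with the query."""
    q_words = set(query.lower().split())
    scored = sorted(docs, key=lambda d: -len(q_words & set(d.lower().split())))
    return scored[:k]

query = "What happened to revenue in Q3 earnings?"
context = "\n".join(retrieve(query, documents))

# The retrieved text is injected into the prompt. The model reads
# it and answers, but its weights never absorb any of it.
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
print(prompt)
```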

Safety Alignment

The work done during training to make a model helpful, honest, and safe to use. It involves carefully curated training data, human feedback on model outputs, and specific training objectives designed to shape the model’s values and behavior. One of the serious risks of continuous weight updates after deployment is that alignment can degrade unpredictably even from adding seemingly benign new data. It seems that fine-tuning on almost anything can shift the weights that govern behavior, not just the ones governing the specific knowledge update. For example, researchers have shown that even brief fine-tuning on ordinary instructional text can weaken safety guardrails in ways that are not obvious until the model is probed specifically for harmful outputs.

Self-Improvement

An approach where the model generates its own training data, filters out low-quality results, trains on the high-quality results, and repeats the cycle. It learns from its own work rather than from human-provided data and compounds capability over many iterations. The article cites AlphaEvolve and AlphaProof as examples of this kind of closed-loop improvement. But these systems operate in constrained domains like mathematics and algorithm optimization, not open-ended real-world learning. The article uses these examples to illustrate iterative self-training loops, and what qualifies as a genuinely new discovery in this context remains debated. For example, AlphaEvolve used self-generated solutions and automated evaluation to discover improvements to algorithms that human programmers could not find because it worked within a well-defined problem space where correctness could be verified automatically.

Stability-Plasticity Dilemma

The fundamental tension in any learning system between staying stable, meaning not forgetting what it already knows, and staying plastic, meaning remaining able to learn new things. Push too hard toward plasticity and you get catastrophic forgetting. Push too hard toward stability and the model cannot adapt to anything new. Solving this dilemma is one of the core engineering challenges in continual learning, and no approach has fully solved the problem at scale. The dilemma exists in biological brains too. That’s why human memory consolidates during sleep rather than updating continuously throughout the day. For example, a model trained to be highly stable might refuse to update its belief that a particular drug is safe even after being shown new clinical evidence, while a model trained to be highly plastic might update so aggressively that it forgets basic grammar rules after a week of medical fine-tuning.

State Space Models (SSMs)

An alternative to traditional transformer architecture that the article highlights for offering a fundamentally better scaling profile for long contexts. The article describes them as using fixed memory layers interspersed with normal attention, which unlike transformers does not grow unboundedly with every token added to the context. Traditional transformers scale quadratically with context length, while SSMs aim for near-linear scaling. However, this remains an active area of research rather than a fully settled property. The article treats SSMs as a promising architectural direction for enabling much longer agentic loops rather than a definitive solution to the broader continual learning problem. For example, a transformer handling a 100,000-token conversation requires vastly more compute than handling a 10,000-token request. But an SSM handling the same expansion would ideally require only proportionally more, which could make very long agentic tasks far more practical.

Temporal Disentanglement

A core limitation of parametric memory since a model’s weights do not separate timeless facts from information that changes over time. Both get compressed into the same parameters and are tangled together with no internal label distinguishing what’s permanent from what’s mutable. This makes continual weight updates risky because changing a time-sensitive piece of knowledge can corrupt stable knowledge stored in nearby weights. The article frames this as one of the fundamental unsolved problems standing between today’s frozen models and genuinely adaptive ones. For example, the fact that two plus two equals four and the fact that a particular person holds a particular job title are both encoded somewhere in the weights. Updating the job title risks disturbing the arithmetic, because the model has no mechanism for knowing which facts are stable laws and which are contingent facts about the world.

Test-Time Training

An approach that blurs the line between training and responding by letting the model do a small amount of learning before it generates a final answer. Rather than relying entirely on what it learned during the original training run, the model runs brief gradient updates based on what it’s currently seeing and then responds. The article describes this as running gradient descent on test-time data, compressing new information into parameters at the moment it matters, and treats it as one of the more substantive moves toward genuine continual learning because it actually changes weights at inference time. For example, if a model is asked to analyze a long, unusual technical document, test-time training would let it briefly train on that document before responding, compressing its key patterns into weights rather than just reading it as context. This method potentially produces a much more accurate analysis as a result.
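Here’s a toy linear-model sketch of the flavor of this, with made-up data: copy the deployed weights, run a brief burst of gradient steps on the material in front of the model, then answer with the adapted copy. Whether and how such updates persist varies across the actual research approaches.

```python
import numpy as np

# Sketch of the test-time training idea (toy linear model).
rng = np.random.default_rng(4)
w_deployed = rng.normal(size=3)            # the "frozen" deployed weights

X_doc = rng.normal(size=(20, 3))           # the unusual test-time document
y_doc = X_doc @ np.array([2.0, -1.0, 0.5]) # its internal pattern

w = w_deployed.copy()                      # adapt a copy, not the original
for _ in range(100):                       # brief burst of gradient updates
    grad = X_doc.T @ (X_doc @ w - y_doc) / len(X_doc)
    w -= 0.1 * grad

x_query = rng.normal(size=3)
print("adapted prediction:", round(float(x_query @ w), 4))
print("frozen prediction: ", round(float(x_query @ w_deployed), 4))
```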

The Bitter Lesson

A well-known observation in AI research. It holds that given more compute and data, general methods that let models figure things out at scale consistently outperform clever human-engineered solutions over time. Every time researchers have tried to hardcode structure and shortcuts into AI systems, the simpler but more scalable approaches have eventually won. The article invokes this phenomenon to question why we still hand-engineer memory and compression pipelines rather than letting models learn to do it themselves. For example, early chess programs used elaborate human-crafted rules about piece values and board positions. They were eventually crushed by systems that simply learned from millions of games with minimal human guidance and relied on scale rather than cleverness. The same pattern has repeated across nearly every domain in AI.

Token

The basic unit of text that a large language model processes. A token is roughly a word, though it can also be a fragment of a word, a punctuation mark, or a short common sequence like “ing” or “un.” Models do not read text the way humans do, character by character or word by word. Instead, they break input into tokens first and then process the sequence. The size of a context window is measured in tokens, not words or characters. For example, the sentence “The cat sat on the mat” would be broken into something like seven tokens, roughly one per word. But a word like “unbelievable” might be broken into two or three tokens: “un,” “believ,” “able,” because it’s less common and gets split into recognizable subunits the model has seen frequently.
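One concrete way to see this is with OpenAI’s open-source tiktoken library (pip install tiktoken). The exact splits depend on the tokenizer, so treat the output as illustrative rather than guaranteed.

```python
# Seeing tokens concretely. Counts vary by model and tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["The cat sat on the mat", "unbelievable"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r} -> {len(ids)} tokens: {pieces}")
```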

Training Run

The large-scale and expensive process of building a model’s knowledge by exposing it to massive amounts of data and adjusting its weights. Training involves feeding these huge datasets through the network repeatedly and using gradient descent to nudge weights toward better predictions. The process runs on clusters of specialized hardware for weeks at a time and consumes substantial amounts of electricity. It’s all carefully controlled, occurs before deployment, and produces a fixed set of weights that define everything the model knows. Once training ends, the weights are frozen and the model goes out into the world as-is. For example, training a frontier model like GPT-4 or Claude is estimated to cost tens or hundreds of millions of dollars and requires specialized data centers. This is precisely why continuous post-deployment learning is so appealing because rerunning a full training run every time the world changes isn’t practical.

Transformer

The dominant architecture underlying most major AI models today, including Claude, GPT, and Gemini. At its core, a transformer predicts the next token in a sequence of text based on everything that came before it. It does this extremely fast, one token at a time. That sounds simple, but at scale it’s not. Models built on this architecture are trained on so much human-generated text that they capture the statistical relationships in language and produce behavior consistent with understanding context, logic, and meaning. For example, when you ask a transformer-based model to explain a complex idea, it makes predictions about what a good explanation would look like given your question, based on patterns it absorbed from vast amounts of human writing on similar topics. That’s why it seems smart. It’s familiar. Whether the final output constitutes genuine understanding is a separate philosophical debate that the article doesn’t address.

