What's in the Box? A Field Guide to AI Models

Maybe a year ago, I had a bit of idle curiosity about how good local AI models were getting, and I started poking around on Hugging Face, because it seemed roughly like the GitHub of AI models.

After a few minutes of browsing, I ended up on a page that said something along the lines of:

Meta-Llama-3-8B-Instruct.Q4_K_M.gguf

That was the moment I realized I had no idea what I was doing.

I’ve been using a number of AI tools for development purposes for a while now, but as I’ve started to get more ambitious about what I can do with them, I’m ending up in situations where I can’t really justify paying for metered model rates for every single prompt.

Anyways, I wanted to run a model locally. So I end up squinting at a list of forty nearly-identical files that differ only by cryptic suffixes, beside a “model card” tossing around parameters, quantization, MoE, context length, and BF16 as if I’d nod along, but for the uninitiated, it all reads less like meaningful data and more like someone decided to name things using a license-plate generator. I had no clue which file to click, what my laptop could actually run, or what any of it would cost me. I closed the tab.

Now, I’m getting a new laptop at work. I think it’s the first machine I will have owned with enough firepower to run something actually decent locally, which has incintivized me to revisit this stuff and actually make sense of it. I’m not an AI cultist; I find these tools useful in some contexts and overhyped in plenty of others. But I’ve got a backlog of code refactors that aren’t purely mechanical.

A load of systematic improvements I want to make to codebases go beyond find-and-replace, so that requires making a bit of a judgement call about what to do next. I can generally synthesize guidance for this and get an LLM to do the heavy lifting. Doing that well with my workflow currently means firing off a lot of subagents across a lot of files, and paying a provider per call, which adds up fast. Running a capable model on my own machine makes the marginal cost roughly “my home electric bill,” which is a very different proposition.

So, naturally, I did what anyone does and went looking for a decent explainer first. That did not go well. Half of what I found was the exact AI-generated slop these models are so good at producing, confidently-worded, faintly wrong, weirdly repetitive, the textual equivalent of a stock photo. The other half was some variation of “how to install Ollama”: here’s the one command, congratulations, you’re running a model. Great, thanks, internet. I didn’t know how to install an app until you helpfully explained the concept to me.

What I really needed to know was things like: What am I running? How do I tell whether it’ll fit on my machine? What do all those letters in the filename mean? Why are there forty of them? How do I choose? Will it actually be useful? The “install Ollama” genre treats the model itself as an opaque box you simply point a tool at, which is precisely the part I feel like I need to understand better. So I went back to that wall of jargon determined to actually understand it, and this is what I have pieced together.

As an aside, I write Haskell for a living. This means I’m legally obligated to inform you that every Haskeller, at the precise moment a concept finally clicks, is seized by an irresistible compulsion to write the explainer for it. This is why the internet contains roughly nine thousand blog posts patiently explaining that a monad is just a burrito, each author convinced their analogy is the one that will finally do it. So, dear reader, I cannot help myself. Here we are: yet another tutorial about a hotly discussed topic, written by someone who only just feels like they have a handle on it. At least I’m self-aware about it.

This is the post I wish I’d found instead. It’s not about prompting or which chatbot is best. It’s about what the models are underneath the chat box: what a parameter is, why some run on your phone and others need a rack of data-center GPUs, and what every piece of that intimidating jargon actually means. We’ll build up the vocabulary one concept at a time, with a couple of interactive demos along the way, and by the end that filename up top will read like an ordinary sentence, and you’ll know exactly what to do with it.

Also, this is really just a meandering tour of the concepts, so don’t necessarily expect everything to flow cohesively. This is more of a smattering of ideas that I’ve collected over the past few months as I’ve been learning about LLMs that I want to take a stab at articulating.

Remember: I am smol bean. I’m not really an ML practitioner, so I’m writing this down mainly to cement my own understanding, on the well-worn theory that you don’t really know a thing until you’ve tried to explain it, and the act of explaining is how I can shore up my own knowledge. Which means, I am probably very wrong about some of the details below. If you’re an actual linear algebra nerd and you spot something I’m getting wrong: corrections are very welcome. Treat this as a curious person’s field notes, not gospel from someone who trains hungry ghosts trapped in a jar for a living.

What a model actually is

A large language model is, mechanically, a very large pile of numbers plus a recipe for using them. The numbers are called weights (or parameters), and the recipe is the architecture, the fixed structure that says how inputs flow through those numbers to produce outputs.

When you hear “training,” that’s the process of finding good values for those numbers by showing the model enormous amounts of text and nudging the weights until its predictions get better. When you hear “inference,” that’s actually running the trained model to get an answer. Training is expensive and happens once (or occasionally); inference is what happens every time you send a message.

The core thing these models do is deceptively simple: predict the next token given everything so far. A token is a chunk of text, roughly 3 to 4 characters or about three-quarters of a word on average. “Predicting the next token” over and over, feeding each prediction back in, is what produces the (usually)fluent paragraphs you see.

Parameters: the “7B” in the model name

When you see a model called Llama 3 8B, Qwen 32B, or DeepSeek 671B, the number is the parameter count: 8 billion, 32 billion, 671 billion individual weights. This is the single most useful number for predicting two things:

How capable the model is, roughly. More parameters generally means more knowledge and stronger reasoning, all things being equal. Nonetheless, training data quality and technique can let a smaller model beat a larger older one, which is a lot of the arms race we see with models today.
How much hardware you need to run it. This is much more mechanical, and it determines whether a model will run on your machine at all.

Each parameter is a number that has to be stored in memory while the model runs. How much memory each one takes depends on its precision, which is the next concept.

Precision: what “4-bit” and “8-bit” mean

A parameter is a number, and numbers can be stored at different levels of detail. This is precision, usually measured in bits per parameter.

FP32 (32-bit floating point): 4 bytes per parameter. The “full” precision models are often trained at this precision.
FP16 / BF16 (16-bit): 2 bytes per parameter. The common default for distributing models. Half the memory of FP32, with negligible quality loss for inference.
INT8 / 8-bit: 1 byte per parameter.
INT4 / 4-bit: half a byte per parameter.

The rough memory math is just multiplication. A 7B model at 16-bit needs about 7 billion × 2 bytes ≈ 14 GB. The same model at 4-bit needs about 7 billion × 0.5 bytes ≈ 3.5 GB. That difference is exactly why quantization (below) is what makes local AI practical.

Bar chart comparing the memory footprint of the same 7B model at FP32, FP16, INT8, and INT4 precision.

Quantization

Quantization is the process of taking a model trained at high precision and squeezing its weights down to fewer bits, 8-bit or 4-bit being the popular targets. You’re trading a little accuracy for a lot of memory savings and speed.

The surprising part is how little you lose. A well-done 4-bit quantization of a model is often nearly indistinguishable from the 16-bit original for most tasks, while using a quarter of the memory. This is the central trick of the local-AI world: it’s why a 4-bit 7B model fitting in 4 GB lets you run something reasonably capable on a consumer GPU or even a laptop.¹

There’s a limit, though. Push too far (2-bit, sometimes even 3-bit) and quality degrades noticeably, with the smaller models suffering more because they have less redundancy to spare. The practical sweet spot most people land on is 4-bit for running locally, 8-bit when you have a bit more memory and want more confidence in the results, 16-bit when memory is no object.

You’ll see quantization schemes with names like Q4_K_M, GPTQ, AWQ, and others. The naming is its own rabbit hole, but the family of “K-quants” (the _K_M, _K_S suffixes you see in GGUF files) are mixed schemes that keep the more sensitive parts of the model at higher precision and squeeze the rest harder. You mostly don’t need to choose by hand; the defaults are sensible.

Dense vs. Mixture of Experts (MoE)

This is the architectural distinction that confuses people most, partly because it makes the parameter count mean two different things.

In a dense model, every parameter is used for every token. A 32B dense model does 32 billion parameters’ worth of computation for each token it processes. Simple, predictable: the parameter count and the compute cost track each other directly.

A Mixture of Experts model splits a large chunk of its parameters into many parallel sub-networks called “experts,” and for each token a small router picks just a few of them to actually run. So an MoE model has two relevant numbers:

Total parameters: everything stored in memory.
Active parameters: what actually runs per token.

For example, a model described as “8x7B” or with “47B total, 13B active” stores the full 47B (you need memory for all of it, since any expert might be picked) but only computes with about 13B per token. You pay the memory cost of the large model but get inference speed closer to the smaller active size.

The reason MoE exists: it decouples capacity from compute. You can scale knowledge (total parameters) without scaling the per-token cost as steeply. The catch for local users is that memory cost: an MoE model is memory-hungry even though it’s compute-light, so it doesn’t necessarily fit on smaller hardware just because it’s fast.

	Dense	MoE
Params used per token	All of them	A small fraction
Memory needed	Param count	Full (total) param count
Compute per token	Proportional to size	Proportional to active size
Mental model	One big network	Many specialists + a router

If you take one thing away: for MoE, judge memory by total parameters and speed by active parameters.

Diagram contrasting a dense model, where every parameter runs for each token, with an MoE model, where a router activates only a few experts per token.

Local vs. hosted (API) models

This is less about the models themselves and more about where the inference runs.

Hosted / API models run on someone else’s hardware. You send text over the network, they run the model, you get an answer back. The frontier models (the GPT, Claude, and Gemini families) are hosted, and many are closed-weight: the company never releases the actual parameter files, so you can only access them through their service. Upsides: no hardware needed, always the latest version, the largest and most capable models. Downsides: cost per use, your data leaves your machine, and you’re dependent on the provider’s uptime and policies (which seem to be increasingly draconian).

Local models run on your own hardware. You download the weight files and run them with software like llama.cpp, Ollama, LM Studio, or vLLM. These are open-weight models (Llama, Qwen, Mistral, DeepSeek, Gemma, and many others), where the parameters are published for download. Upsides: privacy (nothing leaves your machine), no per-use cost, full control, offline capability. Downsides: you’re limited by your own hardware, so you generally run smaller or more aggressively quantized models than the hosted frontier.

A quick note on terminology that trips people up: open-weight is not the same as open-source. Open-weight means you can download and run the parameters. Truly open-source would also include the training data and full training code, which most “open” models don’t release. The license matters too; some open-weight models have restrictions on commercial use, or have geographic restrictions on use (thanks EU).

Base vs. instruct (and “chat”) models

When a model finishes its initial training on raw text, it’s a base model. It’s good at continuing text but doesn’t naturally follow instructions or hold a conversation; ask it a question and it might just continue with more questions, because that’s a plausible text continuation.

To get the assistant behavior you expect, the base model goes through further training:

Instruction tuning / SFT (supervised fine-tuning): trained on examples of instructions paired with good responses, so it learns to actually answer.
Preference tuning (RLHF, DPO, and relatives): trained on comparisons of better vs. worse responses to align tone, helpfulness, and safety.

The result is an instruct or chat model. On a download page you’ll see both variants, e.g. Llama-3-8B (base) and Llama-3-8B-Instruct. For almost any interactive use, you want the instruct version. The base model is mainly interesting if you’re doing your own fine-tuning or research.

Fine-tuning more broadly is taking an existing model and training it further on your own narrower data, to specialize it (medical text, a particular code style, a company’s tone) without paying the enormous cost of training from scratch.

It sounds like the obvious move for any “I want the model to know my stuff” problem, but it’s worth knowing why it’s often the wrong one.

The primary risk is catastrophic forgetting: train hard on your narrow dataset and the model can lose general capabilities it used to have, getting better at your thing while getting worse at everything else. This is a regression you may not notice until it fumbles something basic in production.

There’s also a steep data burden. Fine-tuning rewards quality and consistency, and a few thousand mediocre or contradictory examples can actively make the model worse rather than better; assembling a clean dataset is usually more work than people expect. The result is a maintenance liability too: your fine-tune is frozen against the model it was built on, so when a noticeably better base model ships six weeks later (and it will, probably in like a week or two), you don’t get to just use it, since you have to redo your whole training run.

Lastly, fine-tuning teaches style and behavior far more reliably than it injects facts; if your goal is “the model should answer from my documents,” retrieval (the RAG approach mentioned later) usually beats fine-tuning, which has a depressing tendency to make the model state your facts confidently and the subtly-wrong neighbors of your facts equally confidently.

So, TL;DR: try prompting first, then few-shot examples in the context, then retrieval, and reach for fine-tuning only when you specifically need a behavior or format those can’t produce. The cost has come down a lot (techniques like LoRA train a small set of add-on weights instead of the whole model, which is cheaper and sidesteps some of the forgetfulness issues), but just because it’s cheap to run doesn’t mean you should reach for it unless you have no other option.

Attention

Every concept so far has treated the model as a black box that maps tokens to tokens. Attention is the mechanism inside that box, and it’s worth understanding at a conceptual level because it underpins a lot of the terminology you come across in the LLM space, such as “context windows”.

The idea

When the model processes a token, it needs to decide which other tokens in the context are relevant to it. In the sentence “The trophy didn’t fit in the suitcase because it was too big,” figuring out what “it” refers to means looking back at “trophy” and “suitcase” and weighing them. Attention is the formal version of that looking-back: for each token, the model computes a relevance score against every other token, then builds that token’s understanding as a weighted blend of the ones that scored high. “Attending to” a token just means giving it a high weight in that blend.

Here’s the mechanism, in three steps. For the token “it,” the model produces a score against every other token measuring how relevant each one is. Those raw scores get normalized into weights that add up to 1 (so “trophy” might get 0.71, “suitcase” 0.22, everything else a sliver). Finally, the model takes a weighted blend of all the tokens, mostly “trophy,” a bit of “suitcase,” traces of the rest, and that blend becomes the token’s updated, context-aware understanding. After this step, “it” effectively carries the meaning of “trophy.”

The standard names are just labels for the pieces of that process: each token’s query is what it’s looking for, each token’s key is what it offers (the query gets matched against keys to produce the scores), and each token’s value is what it actually contributes to the blend once the weights are set. You don’t need the linear algebra to hold the intuition: every token runs a weighted lookup over all the others and pulls in what’s relevant. This is what lets the model handle long-range dependencies that simpler approaches couldn’t, and it’s the core innovation of the transformer, the architecture essentially all current LLMs are built on.

Attention, interactive: a weighted lookup over the other tokens

Click a token to make it the query. It scores every other token; the scores become weights that sum to 1; the result is a blend of the high-weight tokens.

focusdiffusebalanced

trophy

0.52

suitcase

0.16

it (query)

0.07

big

0.06

fit

0.04

The

0.02

didn't

0.02

the

0.02

because

0.02

was

0.02

too

0.02

it rebuilds itself as a blend of trophy (52%), suitcase (16%), big (6%), plus smaller traces of the rest.

The names: a token's query (what it's looking for) is matched against every token's key (what it offers) to get the scores; the weights then pull in each token's value (what it contributes). The focus slider is the softmax temperature, the same knob as sampling temperature, applied to attention.

Why it’s expensive

Notice the “every token against every other token.” For n tokens that’s roughly n² comparisons. Double the context and you quadruple the attention work; this is the quadratic scaling mentioned in the speed section. On top of the compute, the model stores a key and value for every token so it doesn’t recompute them on each step. That stored set is the KV cache from the context section, and it grows linearly with context length, eating memory and bandwidth as the conversation gets longer.

So plain attention carries a couple of costs: compute that grows with the square of length, and a cache that grows linearly. Both are why early models capped out at 512 or 2,048 tokens. Pushing the window higher wasn’t a matter of flipping a setting; it ran straight into significant technical hurdles in terms of what you could do with available hardware.

Context windows

The context window is how much text the model can “see” at once, measured in tokens, counting both your input and its output. A model with an 8K context can work with about 8,000 tokens (roughly 6,000 words) before something has to give; modern models range from 8K up into the hundreds of thousands or millions. This is one of the most misunderstood specs on a model card.

It’s a hard limit, and it’s shared

The window is a fixed ceiling baked into how the model was built, not a soft preference. Everything has to fit inside it at once: the system prompt, the entire conversation history, any documents you’ve pasted or attached, and the space reserved for the model’s own reply. They all draw from the same budget. If you’re 500 tokens from the limit, the model can only produce a 500-token answer, no matter how much you want more.

When you exceed the window, the model doesn’t error out gracefully on its own; the surrounding software has to decide what to drop. The usual strategy is to truncate from the oldest end, sliding the window forward so the earliest messages fall out of view. This is why a long chat session can seem to “forget” how it started: those early tokens have literally scrolled off the edge of what the model can see. Some tools instead summarize older turns to compress them back into budget, which preserves the gist but loses detail and exact wording.

How context windows have grown over time

Larger windows came from attacking the issues mentioned earlier: avoiding evaluating every token against every other token, and improving KV caching. None of this is one breakthrough; it’s a stack of complementary techniques.

More efficient exact attention. FlashAttention is the standout: it computes the same attention result but reorganizes the work to avoid writing the giant intermediate score matrix to slow memory, processing it in tiles in fast on-chip memory instead. You get the same answer, but using far less memory traffic, so longer sequences become tractable.
Shrinking the KV cache. Since the cache is a major bottleneck, several designs reduce it. Grouped-query and multi-query attention let many query heads share one set of keys and values instead of each having its own, cutting cache size several-fold with little quality loss. Smaller cache means more room for more tokens.
Not attending to everything. If full n² attention is the problem, have each token attend to only some others. Sliding-window attention restricts a token to a fixed neighborhood of recent tokens; sparse and other patterned schemes pick a structured subset. These break the quadratic curve toward something closer to linear, at the cost of no longer being mathematically exact. They lean on the fact that most relevance is local, with a few long-range links.
Position encoding that extrapolates. The model tracks token positions through a positional encoding. Some schemes (notably RoPE, rotary positional embeddings) can be stretched or interpolated to cover more positions than the model originally trained on, letting a model be extended to a longer window with only modest additional training rather than a full retrain. A lot of “we extended this model to 128K context” work is exactly this.

Stacking these techniques together: very large windows come from combining cheaper-but-exact attention (FlashAttention), a smaller per-token footprint (grouped-query attention), approximations that dodge the quadratic term (sparse/sliding patterns), and position tricks that let training generalize to longer inputs (RoPE scaling). And this ties back to the previous section’s warning: stretching the window via these methods is why a model can technically accept 128K tokens while still degrading well before it. In exchange for extending the context window, you lose some amount of quality since attention is shaved down a bit; the model’s ability to use all of it well doesn’t automatically come along for free.

Compaction: the trick behind “infinite” chats

Summarization deserves its own section, because it has become prevalent within most chat products and coding assistants, and it explains a lot of otherwise-baffling behavior.

Here’s the problem it solves. The model itself is stateless: it has no memory between messages. The only thing that makes a chat feel like a continuous conversation is that the app resends the entire history with every single turn. Message twenty isn’t answered using some stored memory of messages one through nineteen; all nineteen are physically re-fed into the context window alongside your new message, every time. That’s why a long conversation gets slower and (on metered APIs) more expensive as it goes: each turn is reprocessing everything before it.

Eventually that history bumps against the window ceiling, and naive truncation would mean the assistant abruptly forgets the start of the conversation. Compaction (also called context compression or, in some tools, “summarizing the conversation”) is the fix: when the history grows too large, the system pauses, asks the model to write a compact summary of everything so far, and then replaces the old turn-by-turn transcript with that summary. The conversation continues from the summary plus the most recent messages, freeing up room while keeping the thread.

The catch is that compaction is lossy. A summary keeps what it judged important and drops the rest. This is the mechanism behind a familiar frustration: you’re deep in a long session, you refer back to a specific detail from much earlier, “use the variable name we agreed on,” “remember that constraint about the budget,” and the assistant has no idea what you mean. It didn’t malfunction. That detail didn’t survive the compaction step; it got summarized away, and from the model’s point of view it was never said. Coding agents that run for a long time hit this constantly, which is why they increasingly write durable notes to a file or a to-do list rather than trusting the conversation to remember.

Two practical consequences worth internalizing. First, “memory” in most chat products is a combination of this in-context history plus, sometimes, a separate stored-facts feature, not the model actually retaining anything on its own. Second, when a tool offers to “start a new chat” to fix sluggish or confused behavior, this is usually why: a fresh conversation is an empty, uncompacted window, which is both faster and sharper than a long one that’s been compacted several times over. If something important needs to survive, you must restate it explicitly rather than assume a long-running conversation still holds it.

A large context window doesn’t mean it uses the whole thing well

This is the part that matters most in practice and is least visible from the spec sheet. The advertised number is the size of the window, not a promise that the model attends to all of it equally. Two distinct effects are at work.

The first is “lost in the middle.” Models reliably attend best to the start and end of their context and weakest to the middle.² A fact buried halfway through a long document is measurably more likely to be missed than the same fact placed at the top or bottom, even when it’s well within the stated window. The simplest fix: put the instructions and the most important material at the very beginning or the very end of what you send, not sandwiched in the center.

Lost in the middle, interactive: find the buried fact

One sentence (the highlighted access code) is hidden in a long document of filler. Put it near the start, middle, or end, choose how long the document is, then ask the model to recall it. Watch how reliably it finds the same fact depending on where it sits and how much text surrounds it.

fact position

document length

The quarterly logistics review covered warehouse throughput and staffing.Shipments from the northern depot were delayed by two days in March.Routine maintenance on the loading docks is scheduled for the weekend.The procurement team renegotiated the packaging supplier contract.Forklift inspections passed without any flagged safety concerns.Inventory counts reconciled cleanly against the central ledger.The night shift reported normal activity across all three bays.A new barcode scanner rollout begins at the end of the month.Cold-storage temperatures stayed within tolerance all quarter.Visitor badges must be returned to the front desk before leaving.The break room coffee machine was finally replaced last Tuesday.Pallet wrapping was switched to a thinner recyclable film.Outbound trucks are weighed twice before departing the yard.The fire drill last month cleared the building in four minutes.Parking in the south lot is reserved for delivery vehicles.The quarterly logistics review covered warehouse throughput and staffing.Shipments from the northern depot were delayed by two days in March.Routine maintenance on the loading docks is scheduled for the weekend.The procurement team renegotiated the packaging supplier contract.Forklift inspections passed without any flagged safety concerns.

The access code for the east gate is velvet-marble-87.

Inventory counts reconciled cleanly against the central ledger.The night shift reported normal activity across all three bays.A new barcode scanner rollout begins at the end of the month.Cold-storage temperatures stayed within tolerance all quarter.Visitor badges must be returned to the front desk before leaving.The break room coffee machine was finally replaced last Tuesday.Pallet wrapping was switched to a thinner recyclable film.Outbound trucks are weighed twice before departing the yard.The fire drill last month cleared the building in four minutes.Parking in the south lot is reserved for delivery vehicles.The quarterly logistics review covered warehouse throughput and staffing.Shipments from the northern depot were delayed by two days in March.Routine maintenance on the loading docks is scheduled for the weekend.The procurement team renegotiated the packaging supplier contract.Forklift inspections passed without any flagged safety concerns.Inventory counts reconciled cleanly against the central ledger.The night shift reported normal activity across all three bays.A new barcode scanner rollout begins at the end of the month.Cold-storage temperatures stayed within tolerance all quarter.Visitor badges must be returned to the front desk before leaving.

~64K tokens of context · scroll to see the whole document

recall vs. position at 64K

here: 18%

recall odds here: 18%

Where the answers land: a density heatmap of many trials

The y-axis is the model's answer instead of duration. Each “Ask” above drops one trial into this grid at its position; darker cells hold more trials. Run a sweep at the current length to fill it in, then watch the correct-answer band fade out through the middle while wrong codes and blanks pile up.

fewer trials

more trials

Try this: put the fact near the start at 4K and ask a few times — it almost always nails it. Move it to the middle at 128K and ask again: now it misses far more often, sometimes inventing a plausible-looking wrong code. Same fact, same question; only its position and the surrounding volume changed. The curves here are illustrative, but the effect is what real long-context retrieval tests show.

Why does the middle get shortchanged? A few mechanisms stack up, and this is firmly in the territory where I’m reconstructing my own understanding rather than reporting settled fact, but here’s how I read it. The cleanest part of the story goes back to attention and that softmax from earlier. Attention spreads a fixed budget of weight (it sums to 1) across every token in context. With a handful of tokens, even a middling token gets a meaningful slice. With tens of thousands of tokens competing, the weight any single buried token can attract is diluted toward nothing unless it’s a very strong match, so a mildly-relevant fact in the crowd gets drowned out. More haystack means less attention per straw.

On top of that dilution sits a learned positional bias. Models pick up habits from how their training data is shaped, and text overwhelmingly front-loads and back-loads what matters: introductions state the thesis, conclusions restate it, the first and last lines of an email tend to be where people get around to asking for what they actually want. The model learns that the edges are where the important stuff tends to live, and it allocates attention accordingly, a prior that helps on typical text but actively hurts when the thing you care about is sitting in the middle. There are also subtler effects from how positions are encoded (some schemes represent nearby tokens more sharply than distant ones, and a long context pushes the middle far from both ends at once), but these two factors are the two I have a reasonable hypothesis for.³

The second effect is degradation well before the limit. A model rated for 128K tokens often performs noticeably worse at 100K than at 8K, even though both are “within spec.” Retrieval gets less reliable, instructions get diluted, and the model is likelier to lose the thread. There’s an industry term, the “effective context length,” for the point past which quality starts to fall off, and it’s frequently a fraction of the advertised maximum. A useful heuristic: treat the headline number as the absolute ceiling and assume your reliable working range is some way below it. Benchmarks that test this directly (variants of “needle in a haystack,” which hide a fact in a long document and check whether the model can find it) are a better guide than the spec, when you can find them.

Practical takeaways

The window is shared across prompt, history, attachments, and reply; budget for all of them.
Bigger is not automatically better. Filling a huge context with marginally-relevant material can hurt by diluting attention and increasing both cost and latency. Curating what you put in often beats dumping everything in.
Position matters: lead and trail with what’s important.
A long context is expensive in memory and (especially) prefill time, so there’s a real tradeoff against speed, not just a capability win.
Treat the advertised number as a ceiling, not a promise.

File formats you’ll encounter

If you start downloading models, a few formats show up constantly:

GGUF: the format used by llama.cpp and Ollama, designed for running on CPUs and consumer GPUs, with quantization baked in. If you’re running locally on a typical machine, this is what you’ll see most.
Safetensors: a safe, fast format for storing weights, common in the Hugging Face / PyTorch ecosystem. Often the format for the full-precision originals before someone converts them to GGUF.
PyTorch .bin / .pt: older checkpoint formats; functional but being displaced by safetensors, partly because the old pickle format could execute arbitrary code on load.

Temperature and the rest of the sampling knobs

Here’s a fact that surprises people: the model doesn’t output a word. It outputs a probability for every possible next token at once, the whole vocabulary, tens of thousands of options, each with a score. “Sampling” is the separate step that picks one token from that distribution. The model produces the probabilities; the sampler makes the choice. Temperature and its friends are knobs on that choice, not on the model itself, which is why you can change them freely at inference time without retraining anything.

Temperature controls how sharply the sampler favors the high-probability tokens.

Low temperature (near 0) sharpens the distribution toward the single most likely token. At 0 it’s effectively deterministic: same input, same output, always picking the top option. This gives focused, predictable, repetitive text. You want this for code, extraction, math, anything with a correct answer.
High temperature (say 1.0 and up) flattens the distribution, giving lower-probability tokens a realer chance of being picked. This gives variety and surprise, and eventually incoherence as you push it higher. You want some of this for brainstorming, fiction, and creative work.

Mechanically, temperature divides the scores before they’re turned into probabilities. Dividing by a small number exaggerates the gaps between options (the leader pulls away); dividing by a large number compresses them (the field bunches up). A temperature of exactly 1 leaves the model’s own distribution untouched.

The other two knobs you’ll commonly see both work by truncating the set of candidates before sampling, which is a different lever than temperature’s reshaping:

Top-k: only consider the k most likely tokens, discard the rest. Top-k of 40 means “never pick anything outside the top 40 candidates.”
Top-p (nucleus sampling): consider the smallest set of tokens whose probabilities add up to p. Top-p of 0.9 means “keep the most likely options until they cover 90% of the probability mass, then sample from just those.” This adapts to context: when the model is confident the set is tiny, when it’s uncertain the set is larger.

In practice top-p and temperature together cover most needs, and most tools ship with reasonable defaults (often something like temperature 0.7–0.8, top-p 0.9). Temperature reshapes the odds, top-k/top-p cut off the long tail. If you ever see a model producing weird, off-the-rails output, an accidentally high temperature is a common culprit; if it’s flat and repetitive, temperature too low.

One caveat that connects back to determinism: low temperature reduces randomness but doesn’t always guarantee identical output across different hardware or runs, because of floating-point imprecision. For most purposes, though, temperature 0 is your “be consistent” setting.

What actually makes some models faster

“Faster” splits into two measurements that people often conflate:

Latency to first token: how long before the response starts. Dominated by the time to process your prompt (the “prefill” step, where the model reads everything you sent).
Throughput / tokens per second: how fast text comes out after it starts (the “decode” step, generating one token at a time).

Several factors drive both, and understanding them explains most of the speed differences you’ll notice.

1. Active parameter count. Every token generated requires computing through the active parameters. A 7B dense model does far less work per token than a 70B dense model, so it’s roughly an order of magnitude faster, all else equal. This is the biggest single factor and the reason MoE matters: a model with 47B total but 13B active generates at roughly 13B speed, not 47B speed. Fast and knowledgeable, at the cost of needing memory for all 47B.

2. Memory bandwidth, not raw compute. This is the counterintuitive one. Generating tokens one at a time is usually memory-bound, not compute-bound: the bottleneck is reading all those weights out of memory for each token, not doing the arithmetic. A token only gets generated as fast as the hardware can stream the relevant weights through. This is why GPUs (with very high memory bandwidth) crush CPUs at inference, why a model that fits entirely in fast GPU memory dramatically outruns one that spills over into slower system RAM, and why quantization speeds things up: 4-bit weights are a quarter the bytes to move compared to 16-bit, so there’s simply less to read per token.

It helps to know the three places a model’s weights can live, because they have wildly different bandwidth. GPU memory (VRAM) is the fast one, soldered onto a discrete graphics card; this is what you want the whole model to fit inside. System RAM is much slower for this purpose and is where weights spill when they don’t fit in VRAM, which tanks speed. And then there’s unified memory, Apple Silicon’s trick (the M-series chips), where the CPU and GPU share one fast memory pool. That’s why a MacBook with, say, 64GB of unified memory can run models that would need an expensive discrete GPU on a PC: there’s no slow CPU-to-GPU handoff, and the whole pool is reasonably fast. It’s the main reason Macs became a surprisingly popular local-inference platform despite not having traditional gaming GPUs.

3. Whether the model fits in the right memory. As mentioned earlier in the article, there’s a cliff, not a slope, here. If a model fits in your GPU’s memory (VRAM), it’s fast. If it doesn’t and gets partially “offloaded” to CPU/system RAM, speed can drop by 10x or more, because the slow part bottlenecks everything. This is why people obsess over fitting a model in VRAM, and why a slightly smaller or more-quantized model that fits often beats a larger one that almost fits in terms of local usefulness.

4. Context length. Longer prompts cost more, in two ways. Prefill has to process every input token, so a long prompt means a longer wait for the first output token. And the model maintains a KV cache (a running memory of the attention computation for all prior tokens) that grows with context length, consuming memory and bandwidth that scale with how much text is in play. A 100K-token context is meaningfully slower and hungrier than a 2K one on the same model.

5. Batching (mostly a hosted-model thing). Inference servers process many users’ requests together in a batch, which uses the hardware far more efficiently since the weights get read once and reused across requests. It’s a big reason hosted APIs feel snappy at scale: your request is sharing the expensive weight-reading work with everyone else’s. Running locally, you’re a batch of one, so you don’t get this benefit.

6. Architectural and software optimizations. Beyond size, techniques like FlashAttention (a more memory-efficient attention implementation), speculative decoding (a small fast model drafts tokens that the big model verifies in bulk), and grouped-query attention (which shrinks the KV cache) all buy speed without changing the parameter count. Two models of identical size can differ substantially in speed based on these. This is also why the same model often gets faster over time as the runtime software improves around it.

The short version: active parameters and memory bandwidth set the performance ceiling; fitting in VRAM, context length, and batching determines how close you get to that ceiling.⁴

Reasoning models: trading speed for thinking

A recent and important split in the model world is between ordinary models and reasoning models (also called “thinking” models). An ordinary model starts emitting its answer immediately, token by token. A reasoning model is trained to first spend a chunk of tokens working through the problem (planning, trying approaches, checking itself) before it commits to a final answer. You’ll often see this surfaced in the interface as a collapsible “thinking” section that appears before the real response.

The mechanism connects to everything above: those thinking tokens are just more generated tokens, so reasoning models are slower and burn more compute per answer. The bet is that for hard problems (math, multi-step logic, tricky code) the extra tokens spent reasoning buy a more correct answer, the same way you’d solve a hard problem on scratch paper instead of blurting the first thing that comes to mind. This is sometimes called spending more test-time compute: getting better answers by thinking longer at the moment of asking, rather than by making the model bigger.

The tradeoff is pretty straightforward. For a quick factual question or a simple rewrite, a reasoning model is overkill: you wait longer and pay more for thinking the task didn’t need. For a gnarly refactor or a proof, it can be the difference between a usable answer and confident bullshit. Many current models now expose this as a dial you can turn up or down per request, rather than being purely one type or the other.

Multimodal models: not just text

Everything so far has operated on the assumption that we are just using text. Text in, text out. Increasingly that’s not the whole story. Multimodal models accept and sometimes produce more than text, most commonly images: you can hand them a screenshot, a photo, a diagram, or a PDF page and ask questions about it. The same tokenization idea extends, with the image getting converted into tokens the model can attend to alongside the words. Some models go further into audio or video. The practical upshot for a beginner is just to know the term and to check a model’s card for what inputs it accepts, since “can it see images?” is now a real axis of difference between models that otherwise look similar on paper. (A purely text model handed an image will, at best, politely tell you it can’t look at it, and at worst confabulate.)

Tool use: how models do things they can’t do themselves

A model on its own can only produce text. It can’t look up today’s weather, run a calculation reliably, search your files, or call an API; its knowledge is frozen at training time and it has no hands. Tool use (also called function calling) is the mechanism that bridges that gap, and it’s worth understanding because it’s the foundation under “agents” and most of the genuinely useful AI products.

The mechanism is simpler than it sounds

The model still only outputs text. The trick is that some of that text is structured as a request to call a tool, and a surrounding program (not the model) actually performs the action. The loop goes like this:

You describe the available tools to the model: their names, what they do, and the parameters they take, usually as a structured schema. Something like a get_weather tool that takes a location string.
The model decides whether answering needs a tool. If you ask “what’s the weather in The Hague?”, it recognizes its own text-prediction can’t know that and instead emits a structured call: get_weather(location="The Hague"), rather than a prose answer.
Your program executes the call. The model didn’t fetch anything; the harness around it sees the request, actually calls the weather API, and gets a result back. This is the part people miss: the model only asks; the runtime does.
The result is fed back into the context as a new message, the tool’s output. Now the model can see “it’s 14°C and raining” and write a natural-language answer using it.

So tool use is really a conversation with an extra participant: the model proposes actions, an external executor carries them out and reports back, and the model continues with that new information in context. Round-trips can chain: the model might call a search tool, read the results, then call a second tool based on what it found.

How a model knows how to do this

This isn’t an innate ability; it’s trained in, much like instruction-following. The instruct/chat tuning stage includes examples of tools being offered and correctly called, so the model learns the convention of emitting a structured call when appropriate and weaving the returned result into its answer. A base model generally won’t do this well. Models also vary in how reliably they pick the right tool, format the parameters correctly, and avoid calling tools they don’t need; “good at tool use” is now a distinct axis people benchmark separately from raw knowledge.

Why it matters

Tool use is what turns a text predictor into something that can act. A few of the things it unlocks:

Current information. A search or API tool sidesteps the frozen-knowledge problem entirely. The model doesn’t need to know today’s news; it needs to know how to look it up.
Reliable computation. Rather than predicting the answer to a math problem (which are notoriously unreliable), the model calls a calculator or runs code and uses the actual result.
Acting on the world. Tools that send email, edit files, query a database, or hit any API let the model take real actions, not just describe them.
Grounding in your data. This is the basis of retrieval-augmented generation (RAG): a tool fetches relevant chunks from your documents and feeds them into context, so the model answers from your material instead of its training-time memory.

An agent is, loosely, this loop running with enough autonomy to chain many tool calls toward a goal: decide, call a tool, observe the result, decide again, repeat until done. Everything fancy in that picture still reduces to the four-step loop above; the model proposes, an executor disposes, the result comes back as context.

One thing worth keeping straight: because the model only emits a request, the safety and correctness of what actually happens depend on the surrounding program, not the model’s good intentions. Whether a proposed delete_file call is permitted, sandboxed, or confirmed with you first is a decision the harness makes. The model asking for something and the system allowing it are two separate steps, by design.

It’s also worth noting that tool use, along with multimodality and human feedback, complicates the popular charge that these models are “just stochastic parrots” echoing their training data with no understanding. I find that debate interesting and quite unresolved, but it’s a detour from the machinery, so I’ve parked my attempt at both sides of it in a footnote.⁵

Putting it together: deciphering a model’s functionality

Remember the filename from the top, the one that made me close the tab? Meta-Llama-3-8B-Instruct.Q4_K_M.gguf. You can now read it left to right like a sentence:

Meta-Llama-3: the model family and version (Llama 3, from Meta).
8B: 8 billion parameters. Dense (no “active/total” split mentioned), so all 8B run per token.
Instruct: tuned to follow instructions and chat, so it’s ready to use, not a base model.
Q4_K_M: 4-bit K-quant, so roughly 8B × 0.5 bytes ≈ 4–5 GB of memory, comfortably runnable on a mid-range GPU or a decent laptop.
.gguf: packaged for llama.cpp / Ollama, the format for running on consumer hardware.

And those forty near-identical files differing only by suffix? Those are the same model at different quantization levels: Q8_0 (8-bit, bigger and safer), Q4_K_M (the sensible 4-bit default), Q2_K (squeezed hard, for when memory is tight and you’ll tolerate some quality loss), and so on. You’re not meant to download all of them; you pick the one that fits your hardware. The wall of cryptic files was just one model, offered at a range of sizes.

That’s the whole vocabulary. The field moves fast and the specific model names will be stale within months, but these concepts (parameters, precision, quantization, dense vs. MoE, local vs. hosted, base vs. instruct, context) seem relatively stable idioms, underneath the churn. The next time you open a model page, it should read like a menu instead of a ransom note.

So what can I actually run?

This was my original question, the one that sent me running from that Hugging Face page, so might as well circle back to it. The good news is you already have the pieces.

Start with how much fast memory you have. On a PC with a discrete GPU, that’s your VRAM (an RTX-class card might have 8, 12, 16, or 24GB). On Apple Silicon, it’s most of your unified memory (a 32GB or 64GB Mac, minus a few GB for the OS). That number, in gigabytes, is roughly your budget.

Then use the precision math from earlier: a model needs about (parameters × bytes-per-parameter) of memory, and at the common 4-bit quantization that’s roughly half a byte per parameter. So a 7–8B model at 4-bit wants ~4–5GB, a 13B wants ~8GB, a 30B-ish wants ~18–20GB, and a 70B wants ~40GB (which is why the big ones need either a high-end card, a Mac with lots of unified memory, or splitting across two GPUs). Leave yourself a few GB of headroom on top, because the KV cache for your context eats memory too, and that grows with how much text you feed in. The practical rule: pick the largest model whose 4-bit size leaves comfortable room under your memory budget. A model that fits and runs fast beats a bigger one that spills into slow memory and crawls.

From there, pick from one of the tools that do all the fiddly bits for you. Ollama is the easiest start: install it, run one command, and it pulls and runs a model. LM Studio is a friendly desktop app with a model browser and a chat UI, good if you’d rather click than type. Both sit on top of llama.cpp, the underlying engine that actually runs GGUF files efficiently on consumer hardware (CPU, GPU, or Apple Silicon); you can use it directly if you want maximum control. Any of them turns that intimidating filename into “download and chat” in a couple of minutes. (For my subagent-swarm refactor plans, the same local server these expose is what the tooling talks to, so the marginal cost of a run really does collapse to electricity.)

One last little grumble to close out this exploration of LLM concepts:

None of this feels terribly hard to grasp once someone explains it. Hugging Face is a genuinely impressive piece of infrastructure, the closest thing the field has to a shared commons, but it makes almost no effort to meet a first-time visitor where they are, at least that I could find on the landing or docs pages. A model page hands you the raw artifacts, the file list, the config, etc., and assumes you already know which file is for you, whether your machine can hold it, and what to do with it once downloaded. There’s no “you have 16 GB of memory, here’s the quant to grab,” no plain-language “this is what these suffixes mean,” no gentle path from “I have heard of models” to “I am running one.” That’s a defensible choice for what is, at heart, a tool built by practitioners for practitioners, but it does mean the on-ramp for everyone else is a wall of jargon that doesn’t really need to be that way. The information I needed wasn’t hidden so much as simply never addressed to me. None of what’s in this post is secret or advanced; it’s just the context that the place handing out the models declines to provide, and that absence is most of what makes the whole thing feel so much more intimidating than it actually is.

So there you have it. We did a nice little tour of lots of interesting concepts and jargon. Now you know what your machine can hold, and you know what to install.

Good luck!

A short glossary

Parameter / weight: one of the numbers that make up the model. Counted in billions (B).
Token: a chunk of text, ~¾ of a word.
Inference: running a trained model to get output.
Precision: bits used per parameter (FP32, FP16/BF16, INT8, INT4).
Quantization: reducing precision to save memory and increase speed.
Dense model: every parameter runs for every token.
MoE (Mixture of Experts): only a few “expert” sub-networks run per token; total params ≫ active params.
Open-weight: the parameters are downloadable (not necessarily open-source).
Base vs. instruct: raw text-continuation model vs. one tuned to follow instructions.
Fine-tuning: further training of an existing model on narrower data; good for style/behavior, risky for facts, and prone to catastrophic forgetting.
LoRA: a cheap fine-tuning method that trains a small set of add-on weights rather than the whole model.
Catastrophic forgetting: when fine-tuning on narrow data erodes the general capabilities the model previously had.
Context window: how much text the model can attend to at once.
Effective context length: the range a model actually uses well, usually below the advertised maximum.
Compaction: replacing an over-long conversation history with a model-written summary to stay under the window; lossy, so older details can vanish.
Attention: the mechanism by which each token weighs the relevance of every other token; the core of the transformer.
Transformer: the architecture, built on attention, behind essentially all current LLMs.
FlashAttention / grouped-query / sliding-window: techniques that make attention cheaper in memory or compute, enabling larger context.
Tool use / function calling: the model emits a structured request to call an external tool; a surrounding program executes it and feeds the result back.
RAG (retrieval-augmented generation): using a retrieval tool to pull relevant documents into context so the model answers from them.
Agent: a model running the tool-use loop with enough autonomy to chain calls toward a goal.
Temperature: sampling knob controlling randomness; low = focused/deterministic, high = varied/creative.
Top-k / top-p: sampling knobs that truncate the candidate tokens before picking one.
Prefill vs. decode: processing your prompt vs. generating output one token at a time.
Memory-bound: limited by how fast weights can be read from memory, the usual inference bottleneck.
KV cache: stored attention state for prior tokens; grows with context length.
GGUF / safetensors: common weight file formats.
VRAM: a GPU’s own fast memory; the model you want to run should fit inside it.
Unified memory: Apple Silicon’s shared CPU/GPU memory pool, which lets Macs run larger models than their lack of a discrete GPU would suggest.
Reasoning / “thinking” model: one trained to spend tokens working through a problem before answering; slower, but better on hard tasks.
Test-time compute: improving answers by having the model think longer when asked, rather than by making it bigger.
Multimodal: a model that handles more than text, most often images as input.
Ollama / LM Studio / llama.cpp: tools that download and run local models for you; llama.cpp is the underlying engine, the other two are friendlier front-ends.

A corollary that surprised me and reshaped how I pick models: when memory is the constraint, a bigger model at aggressive 4-bit quantization usually beats a smaller model at full 16-bit precision, for the same memory budget. A 13B model squeezed to 4-bit (~8 GB) tends to outperform a 7B model at full precision (~14 GB), despite the harsher squeezing, because parameter count buys more than precision does down in this range. The rough rule people repeat: prefer more parameters at lower precision over fewer parameters at higher precision, right up until you hit the 2-bit floor where things fall apart. ↩
The canonical source is Nelson F. Liu et al., “Lost in the Middle: How Language Models Use Long Contexts” (TACL 2024; arXiv:2307.03172). They moved a relevant fact to different positions in the context and measured retrieval, finding the characteristic U-shaped curve: strong at the start, strong at the end, sagging in the middle, and getting worse as the total context grew, even for models explicitly marketed as long-context. The interactive demo above is a toy re-creation of exactly that experimental setup. ↩
Fair warning that the why is more contested than the what. The U-shaped curve is robustly observed; the mechanism behind it is still argued over. My attention-dilution intuition is plausible but probably not the whole story, and at least one controlled study found that attention allocation explains less of the effect than you’d expect, pointing instead at things like drift in the model’s internal representations as the context grows. So treat this paragraph as “a reasonable mental model” rather than “the settled mechanism.” ↩
A personal aside, since this whole section is about speed: I remain befuddled by the models and benchmarks that chase tokens-per-second above all else. I keep drifting back to slower, more expensive models, because a slop railgun does very little for me. Firing garbage at the wall faster is still just firing garbage. And in my actual work, the model’s output is rarely the bottleneck anyway, I’ll spend far longer on compilation, type-checking, and reading the diff than I ever saved by generating it quickly, so a model that spits out confidently-wrong code at blazing speed is optimizing the one part of my loop that wasn’t particularly slow. I would rather wait and get something I don’t have to throw away. ↩
The objection is that this is all just next-token prediction, a statistical echo of the training data with no understanding behind it. The phrase is “stochastic parrot,” from a 2021 paper by Bender, Gebru, and colleagues, and the underlying argument (developed in earlier work by Bender and Koller) is as follows: understanding means connecting language to something outside language (intent, the world, referents), and a system trained only on the form of text has, by construction, no access to that outside thing and so no foothold from which to learn meaning. It can be an extraordinary mimic of meaningful text without any of the meaning. Worth taking seriously, not least because the mechanical chapters above pretty much prove the point; I just spent a whole post describing a next-token predictor. Still, a few things give me pause about the common argument that LLMs are only stochastic parrots and unable to produce anything novel / of value. One: “just predicting the next token” smuggles in a lot. Predicting well across the whole internet pressures a model to build internal machinery, and interpretability work keeps finding structured representations inside these models, features that track whether a statement is true, internal maps of space and time, the board state of a game shown only as move notation. You can call that elaborate correlation, but “it built a usable model of the board in order to predict the next move” is doing more than just parroting; the objective is prediction, but the solution found for that objective need not be shallow. After all, evolution’s fundamental objective was just reproduction, and it produced eyes. Two: the form-versus-meaning line was much less blurry in 2021 than today. Models are no longer trained on pure text in a sealed room; they’re tuned against human feedback (a thread from form to something outside it), increasingly multimodal (grounding words in images, the very language-to-world link the argument says is missing), and through tool use they act and get real results back. None of that is necessarily fatal to the argument, but the premise has become, in my opinion, somewhat outdated. Three: it’s quite difficult to frame these things such that we huumans don’t arguable fit the criteria as well. We also learn language largely by statistical exposure, and the brain is, at one level, a prediction engine minimizing surprise; I don’t think that makes us parrots, which suggests “it’s prediction” can’t by itself rule out understanding. Whatever separates real comprehension from sophisticated mimicry has to be more specific than “involves prediction,” and stating a criterion that includes us and excludes the models, without just asserting the conclusion, is harder than it looks. None of which is a proof that models understand anything, and the parrot camp has rejoinders to each point. Bender herself has noted “stochastic parrot” was never meant as an empirical hypothesis to be proved or disproved, so some of this aims where the meme stands rather than where the paper does. My own position is deflationary: I don’t need to settle whether the model “understands” to notice whether it writes a correct migration or doesn’t. The philosophical question is unsettled, and quite unlikely to ever settle; the engineering question, does this save me time without creating cognitive debt or tech debt, has an answer I can check. ↩

Index

What a model actually is

Parameters: the “7B” in the model name

Precision: what “4-bit” and “8-bit” mean

Quantization

Dense vs. Mixture of Experts (MoE)

Local vs. hosted (API) models

Base vs. instruct (and “chat”) models

Attention

The idea

Why it’s expensive

Context windows

It’s a hard limit, and it’s shared

How context windows have grown over time

Compaction: the trick behind “infinite” chats

A large context window doesn’t mean it uses the whole thing well

Practical takeaways

File formats you’ll encounter

Temperature and the rest of the sampling knobs

What actually makes some models faster

Reasoning models: trading speed for thinking

Multimodal models: not just text

Tool use: how models do things they can’t do themselves

The mechanism is simpler than it sounds

How a model knows how to do this

Why it matters

Putting it together: deciphering a model’s functionality

So what can I actually run?

A short glossary

Footnotes