
Multilayer Perceptrons for Real Builders: From Linear Models to the Foundations of Generative AI

Rafael Fischer · 25 min read

1. Why AI Feels Like Magic

You type a prompt and a clean paragraph appears. You upload a dataset and a prediction appears. You fine-tune a model and its behavior suddenly shifts.

From the outside, it feels mysterious. Almost supernatural.

But what looks like magic is usually scale plus math.

If you are building AI features, you are already making decisions about model capacity, generalization, and system design. You might not use those words. You might just say, “Let’s add a model here” or “Let’s fine-tune this.” But under the surface, those are architectural decisions.

When you understand the foundations, those decisions stop being accidental. They become deliberate.

What Is Actually Under the Hood

Modern AI systems are built from a surprisingly small set of mathematical building blocks. These blocks are combined, stacked, and adjusted using data.

One of the most important of these blocks is the Multilayer Perceptron, or MLP.

Here is a simple way to think about it.

A spreadsheet applies formulas to numbers. A backend service applies functions to inputs. A graphics engine applies transformations to shapes.

An MLP does something very similar.

It takes numbers as input. It multiplies them by other numbers. It adds the results together. It applies a small rule to the result. Then it passes that output forward and repeats the process.

That is it.

There is no hidden reasoning engine. There is no symbolic rule tree. There are just layers of small numerical operations stacked on top of each other.

When you repeat these simple steps many times and train them on large amounts of data, the system starts to capture patterns. Those patterns can look intelligent. But they emerge from repetition, not from a single clever rule.

Complex behavior does not come from one sophisticated formula. It comes from many very simple operations applied again and again.

Why Builders Should Care

If you are building scoring systems, forecasting engines, personalization layers, or AI copilots, you are building on top of these primitives.

Without a mental model, the system feels like a black box. When something breaks, you do not know whether the issue is data, model capacity, or system design.

With a clear mental model, you can reason about tradeoffs. You can anticipate failure modes. You can decide when more capacity helps and when it only adds cost and instability.

Understanding MLPs is not about becoming a researcher. It is about demystifying the foundation that modern deep learning and LLMs are built on so you can design systems with intention instead of hope.

2. What Is a Multilayer Perceptron?

A Single Neuron: A Weighted Decision Rule

Let’s slow this down and make it very concrete.

Imagine you are trying to decide whether a user is "high value" or "low value". You look at three things:

  • Monthly usage
  • Payment history
  • Company size

    Each of these is just a number. Nothing fancy.

    Now suppose you decide that some of these numbers matter more than others. Maybe usage matters a lot. Payment history matters even more. Company size matters, but not as much.

    So what do you do?

    You multiply each input by how important it is. Then you add everything together. You get a final score.

    If that score is above a certain threshold, you label the user "high value". If it is below, you label them "low value".

    That entire procedure is what we call a single artificial neuron.

    In plain language:

  • Inputs are numbers describing something about the world.
  • Weights are numbers that represent importance.
  • The weighted sum is just a combined score.
  • The activation function is the rule that turns that score into a decision.
  • The output is the prediction.

    There is no reasoning engine hiding inside. There is no symbolic logic. It is simply a scoring formula whose importance values can be adjusted using data.

    When we say "perceptron", we mean exactly that: a learnable decision rule.
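    This learnable decision rule fits in a few lines of Python. The feature values, weights, and threshold below are illustrative placeholders, not learned values:

```python
# A single artificial neuron as a plain function: weighted sum + threshold.
# The feature names and numbers are made up for illustration, not learned.

def neuron(inputs, weights, bias, threshold=0.0):
    score = sum(x * w for x, w in zip(inputs, weights)) + bias
    label = "high value" if score > threshold else "low value"
    return label, score

# usage, payment history, company size -- already scaled to comparable ranges
user = [0.8, 0.9, 0.3]
weights = [0.5, 0.7, 0.2]   # payment history matters most in this sketch
label, score = neuron(user, weights, bias=-0.9)
```

    Training, covered later, is nothing more than adjusting `weights` and `bias` using data instead of picking them by hand.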

    From One Neuron to a Layer

    Now imagine you do not trust a single scoring formula. Instead, you create several of them.

    All of them look at the same inputs. But each one learns its own set of importance values.

    For the same user:

  • One neuron might care mostly about usage.
  • Another might focus heavily on payment behavior.
  • Another might react strongly to company size.
  • Another might learn that a specific combination, like high usage and late payments, is important.

    Each neuron produces its own score.

    Instead of one number, you now have a list of numbers. Those numbers are learned signals.

    This group of neurons working side by side is called a layer.

    A helpful way to picture it is as a panel of specialists. You present the same case to several experts:

  • One evaluates financial risk.
  • One evaluates growth potential.
  • One evaluates operational stability.

    Each expert gives you an opinion. You are not making the final decision yet. You are collecting perspectives.

    That is what a layer does.

    In simple terms:

  • One neuron produces one learned signal.
  • A layer produces many learned signals.
  • Together, those signals form a new representation of the input.

    This is the key shift. The system is no longer just scoring raw data. It is transforming raw numbers into richer internal signals that the next layer can build on.
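    A layer is just several of these scoring formulas running side by side. In this sketch the weight rows are hand-picked to mimic the "panel of specialists" idea; in a real network every number would be learned:

```python
# A layer: several neurons, each with its own weights, reading the same
# inputs. The weight values here are illustrative, not trained.

def relu(x):
    return max(0.0, x)

def layer(inputs, weight_rows, biases):
    signals = []
    for weights, bias in zip(weight_rows, biases):
        score = sum(x * w for x, w in zip(inputs, weights)) + bias
        signals.append(relu(score))  # activation turns a score into a signal
    return signals

user = [0.8, 0.9, 0.3]         # usage, payment history, company size
W = [[1.0, 0.0, 0.0],          # neuron 1: watches usage
     [0.0, 1.0, 0.0],          # neuron 2: watches payment behavior
     [0.5, -0.8, 0.1]]         # neuron 3: a combination signal
b = [0.0, 0.0, 0.0]
signals = layer(user, W, b)    # one learned signal per neuron
```

    The output is no longer one score but a list of signals, which is exactly what the next layer receives as its input.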

    From One Layer to Many

    So far, we have:

  • One neuron that produces one score.
  • One layer that produces several learned signals.

    Now imagine stacking layers.

    The second layer does not look at the raw inputs anymore. It looks at the signals produced by the first layer.

    This is a key shift.

    Layer 1 might detect simple signals like:

  • High usage
  • Late payments
  • Large company size

    Layer 2 can now detect combinations such as:

  • High usage AND late payments
  • Large company AND consistent payments

    Layer 3 can go even further:

  • Patterns of patterns
  • More abstract behaviors like "risky but growing" or "stable and expanding"

    Each layer builds on top of the previous one.

    If you come from software engineering, think about abstraction layers:

  • Low-level utilities handle raw operations.
  • Higher-level services combine them.
  • Even higher layers express business logic.

    An MLP works in a similar way. Lower layers detect simple signals. Higher layers combine them into more meaningful internal concepts.

    In simple terms:

  • More layers mean the model can represent more complex relationships.
  • It can move from raw numbers to increasingly structured internal representations.

    An MLP is simply this stack of fully connected layers, trained together so that all importance values adjust in coordination.
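    The full stack can be sketched in a few lines of numpy. The weights here are random, so the output is meaningless; the point is the shape of the computation: each layer reads the previous layer's signals, never the raw inputs:

```python
import numpy as np

# A small MLP with random (untrained) weights: 3 inputs -> 4 signals
# -> 4 signals -> 1 score. Sizes are arbitrary, chosen for illustration.

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def mlp(x, params):
    h = x
    for W, b in params[:-1]:
        h = relu(W @ h + b)        # each layer transforms the previous signals
    W, b = params[-1]
    return W @ h + b               # final layer emits the raw score(s)

params = [(rng.normal(size=(4, 3)), np.zeros(4)),   # layer 1
          (rng.normal(size=(4, 4)), np.zeros(4)),   # layer 2
          (rng.normal(size=(1, 4)), np.zeros(1))]   # output layer

score = mlp(np.array([0.8, 0.9, 0.3]), params)
```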

    Why Non-Linearity Matters

    This is one of the most important ideas in the entire article. And it is often explained too quickly.

    Let’s go step by step.

    So far, each neuron:

  • Multiplies inputs by weights.
  • Adds them together.

    Mathematically, that is called a linear operation.

    Here is the surprising part: If you stack multiple layers that only do linear operations, the whole network is still just one big linear operation.

    It does not matter if you have 1 layer or 100 layers. If every step is just multiply and add, the final result can always be simplified into a single weighted formula.

    So stacking layers alone is not enough.

    Something else must happen between layers.

    That "something" is the activation function.

    After each weighted sum, we apply a small transformation. It might squash the value. It might zero out negative numbers. It might bend the output in a curved way.

    This is what introduces non-linearity.

    What does "non-linear" actually mean in practice?

    It means the relationship between inputs and outputs is no longer a straight line.

    Imagine plotting users on a chart using two features. A linear model can separate them using one straight line.

    But what if the real pattern looks like a circle? Or two curved clusters intertwined?

    A straight line cannot separate those shapes.

    Non-linearity allows the model to bend.

    Instead of drawing one straight boundary, it can build curved, flexible boundaries by combining many small transformations across layers.

    Another way to think about it:

    A linear model can only say: "If X increases, Y increases or decreases at a constant rate."

    A non-linear model can say: "X matters only when Z is high." Or: "X increases risk up to a point, then decreases it."

    That is a completely different level of expressiveness.

    In real products, user behavior is rarely linear. Churn is not triggered by one variable moving steadily upward. Fraud is not caused by one feature crossing a simple threshold.

    Variables interact. Effects compound. Patterns curve.

    Non-linearity is what allows neural networks to represent those interactions.

    Stacked layers plus non-linear activation functions give the model the ability to:

  • Combine signals.
  • Bend decision boundaries.
  • Capture conditional relationships.

    Without non-linearity, depth would be useless. With it, depth becomes powerful.

    That is why the small activation function between layers changes everything.
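    You can verify the collapse claim numerically. Two stacked linear layers are exactly equal to one linear layer; inserting a ReLU between them is what breaks that equivalence. The matrices below are random placeholders:

```python
import numpy as np

# Stacking two purely linear layers collapses into a single linear layer:
# W2 @ (W1 @ x) == (W2 @ W1) @ x for every input x.

rng = np.random.default_rng(1)
W1 = rng.normal(size=(5, 3))
W2 = rng.normal(size=(2, 5))
x = rng.normal(size=3)

two_linear_layers = W2 @ (W1 @ x)
one_collapsed_layer = (W2 @ W1) @ x
assert np.allclose(two_linear_layers, one_collapsed_layer)  # depth bought nothing

# Insert an activation between the layers and, in general, the result can
# no longer be written as any single linear map applied to x.
with_relu = W2 @ np.maximum(0.0, W1 @ x)
```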

    3. Why Founders Should Care

    Up to this point, this might still feel a bit academic. It is not.

    This is about product decisions. Real ones.

    Every time you choose a model, adjust features, or decide to fine-tune something, you are making tradeoffs about capacity, complexity, and risk.

    The Moment Linear Models Stop Being Enough

    Most early systems start simple. A weighted score. A handful of rules. Maybe a linear model.

    And in the beginning, that often works.

    But real user behavior is rarely clean.

    In fraud detection, risk is usually not triggered by one variable moving in isolation. It is triggered by combinations.

    In churn prediction, high usage can signal loyalty in one segment and burnout in another.

    In dynamic pricing, demand shifts differently depending on who the customer is and what else is happening in the system.

    These are not straight-line relationships.

    A linear model can only express something like: “As X increases, Y increases or decreases.”

    But real products live in a world closer to: “X matters only when Z is high and W is low.”

    That is the moment when additional model capacity becomes necessary. Not because AI is trendy. But because the structure of the problem demands it.

    Two Ways to Handle Complexity

    When reality becomes messy, you have two broad options.

    Option A: Encode the complexity manually.

    You add more rules. You engineer more features. You stack exception logic on top of exception logic.

    Over time, the system grows. Not in model capacity, but in external complexity.

    Option B: Increase model capacity.

    You allow the model to learn interactions directly from data. Instead of manually specifying every combination, you let layered transformations discover them.

    An MLP is often the first structured step in that direction. It increases expressive power without forcing you to handcraft every interaction.

    Both approaches can work. But they shift complexity to different places.

    The Tradeoff No One Talks About

    More capacity is not automatically better.

    When you increase flexibility, you also increase risk.

    The model may memorize instead of generalize. It may demand significantly more data. It becomes harder to interpret. Training becomes more sensitive to noise.

    In simple terms, expressive models amplify both signal and noise.

    For founders, the real question is not: “Should we use AI?”

    It is: “How much capacity does this problem actually require?”

    You can think of this as balancing three core levers:

  • Data. How much do you have? How clean is it? How fast does it change?
  • Model capacity. How flexible is the model? How complex are the patterns it can represent?
  • System complexity. How many rules, pipelines, and moving parts exist outside the model?

    You can compensate for one lever by adjusting another. Less model capacity often means more manual rules. Less data often means simpler models.

    Strong builders understand they are tuning these three levers. They are not just “adding AI.”

    4. How a Multilayer Perceptron Learns

    So far, we have talked about structure. Layers. Neurons. Weights. Activations.

    Now the obvious question is: how does any of this actually learn?

    Let’s walk through it slowly, the way you would explain it on a whiteboard to a colleague.

    Step 1: The Forward Pass (Making a Prediction)

    Start with a single example.

    Imagine one real customer with real values for usage, payment history, company size, and whatever other features you are using.

    You feed those numbers into the network.

    They move through the first layer. Each neuron:

  • Multiplies inputs by its weights.
  • Adds them together.
  • Applies the activation function.

    That produces a new set of numbers. Those numbers flow into the next layer. The same thing happens again.

    Layer by layer, the data is transformed.

    Eventually, the final layer produces an output. Maybe it is a probability like 0.82 for “this user will stay.”

    At this moment, nothing has been learned. The model is simply applying its current formula.

    The forward pass is just computation. It is the model saying, “Given my current weights, here is my guess.”

    Step 2: Measuring How Wrong It Was (The Loss)

    Now we compare that guess to reality.

    Suppose the customer actually churned. But the model predicted a high probability of staying.

    That is an error.

    We need a way to measure how big that error is. Not emotionally. Numerically.

    That number is called the loss.

    The loss answers one simple question: How far was the prediction from the truth?

    If the prediction was close, the loss is small. If the prediction was far off, the loss is large.

    This number is crucial. It is the only feedback the model receives from the real world.

    Without loss, there is no direction. The model would have no idea whether it is improving or getting worse.
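    As a concrete example, here is binary cross-entropy, one common loss for a stay/churn prediction. The 0.82 probability echoes the example above; the function itself is standard:

```python
import math

# Binary cross-entropy: small when the predicted probability is close to
# the truth, large when the model is confidently wrong.

def binary_cross_entropy(p_stay, actually_stayed):
    y = 1.0 if actually_stayed else 0.0
    return -(y * math.log(p_stay) + (1.0 - y) * math.log(1.0 - p_stay))

# The model said 0.82 "will stay"...
confident_wrong = binary_cross_entropy(0.82, actually_stayed=False)  # churned
confident_right = binary_cross_entropy(0.82, actually_stayed=True)   # stayed
```

    A confidently wrong prediction produces a much larger loss than a confidently right one, which is exactly the feedback signal training needs.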

    Step 3: Deciding What to Change (Backpropagation)

    Now we reach the core of learning.

    The model made a mistake. But the prediction depends on hundreds, thousands, sometimes millions of weights.

    Which ones were responsible?

    We need a systematic way to assign responsibility for the error.

    That process is called backpropagation.

    Conceptually, it works like this:

  • Start from the final error.
  • Move backward through the network.
  • Estimate how much each weight contributed to that error.

    If a weight had a strong influence on the wrong prediction, it gets adjusted more. If it had little influence, it changes only slightly.

    A helpful analogy is an orchestra. If the music sounds wrong, you do not randomly adjust every instrument. You listen carefully and identify which section is off. Then you make small corrections.

    The same thing happens here. Each weight gets a small update. Not a drastic rewrite. Just a nudge.
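    A one-neuron example makes the "responsibility" idea concrete. For a squared-error loss, the gradient with respect to each weight is proportional to that weight's input, so influential inputs receive larger corrections. Backpropagation pushes this same chain-rule logic through every layer of a deep network. All numbers below are invented:

```python
# One linear neuron, squared-error loss. For loss = error**2,
# d(loss)/d(w_i) = 2 * error * x_i: the weight attached to the dominant
# input receives the largest correction.

inputs = [2.0, 0.1, 1.0]         # one feature dominates
weights = [0.5, 0.5, 0.5]
target = 0.0

prediction = sum(x * w for x, w in zip(inputs, weights))
error = prediction - target

gradients = [2 * error * x for x in inputs]   # responsibility per weight

lr = 0.01                         # a small step, not a drastic rewrite
weights = [w - lr * g for w, g in zip(weights, gradients)]
```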

    Step 4: Repeat at Scale

    One example is not enough.

    The network sees thousands, millions, sometimes billions of examples. For each one, it repeats the same loop:

  • Make a prediction.
  • Measure the loss.
  • Adjust the weights slightly.

    Over time, the weights shift. Not randomly, but gradually, guided by feedback.

    Certain neurons become sensitive to specific patterns. Some layers specialize in detecting particular combinations.

    No one explicitly programs those behaviors. They emerge from repeated exposure to data and small corrections.

    That is the key insight.

    The network is not given rules. It is shaped by data.

    Learning in an MLP is simply this loop, executed at scale:

    Predict. Measure. Adjust. Repeat.
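    Here is that loop in miniature: a single logistic neuron learning the toy rule "label is 1 when the first feature exceeds the second" purely from examples. The data, learning rate, and epoch count are made up; the structure of the loop is the point:

```python
import math
import random

# Predict. Measure. Adjust. Repeat.
random.seed(0)
data = []
for _ in range(200):
    x0, x1 = random.random(), random.random()
    data.append(((x0, x1), 1.0 if x0 > x1 else 0.0))

w = [0.0, 0.0]
b = 0.0
lr = 0.5

def predict(x):
    z = w[0] * x[0] + w[1] * x[1] + b
    return 1.0 / (1.0 + math.exp(-z))      # probability of label 1

for epoch in range(200):
    for x, y in data:
        p = predict(x)                     # 1. make a prediction
        err = p - y                        # 2. measure the error
        w[0] -= lr * err * x[0]            # 3. nudge each weight slightly
        w[1] -= lr * err * x[1]
        b -= lr * err

accuracy = sum((predict(x) > 0.5) == (y == 1.0) for x, y in data) / len(data)
```

    No rule was ever written down; the weights were shaped by thousands of small corrections until the scoring formula matched the pattern in the data.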

    5. From MLP to Deep Learning

    Up to now, we have described an MLP as a stack of layers that learn progressively more complex signals.

    Deep learning is simply what happens when this idea is pushed further.

    More layers. More parameters. More data.

    But the core logic does not change.

    What "Depth" Really Means

    When people say "deep learning", they are not talking about something mystical. They are talking about networks with many layers.

    Why does adding layers matter?

    Because each layer transforms the representation it receives.

    Think back to our earlier example:

    Layer 1 detects simple signals. Layer 2 detects combinations of signals. Layer 3 detects patterns of patterns.

    If you continue stacking layers, the representations become increasingly abstract.

    In computer vision:

  • Early layers detect edges.
  • Middle layers detect shapes.
  • Later layers detect objects.

    In language systems:

  • Early layers detect local word patterns.
  • Middle layers detect phrase structure.
  • Later layers capture semantic relationships and context.

    The principle is the same. Each layer builds on what the previous layer discovered.

    Depth increases the ability to represent complex structure. Not because the math changes. But because the composition becomes richer.

    For builders, this matters because: More depth means more expressive power. But it also means more data, more compute, and more potential instability.

    Depth is capacity. Capacity must match the problem.

    The MLP as the Core Building Block

    Even when architectures evolve, the core operations remain the same.

    Most modern deep learning systems still rely on:

  • Linear transformations (multiplying and adding numbers)
  • Non-linear activation functions
  • Gradient-based optimization

    These are exactly the ingredients we described in an MLP.

    Even transformers, which power large language models, contain feed-forward blocks that are structurally very similar to MLP layers.

    This is the key idea.

    The MLP is not an outdated toy model. It is the basic computational pattern that deep learning scales.

    When you understand an MLP, you understand:

  • How representations are built.
  • How model capacity grows.
  • How learning happens through repeated small adjustments.

    Deep learning is not a different species. It is a scaled-up version of the same idea.

    And that is why this matters for founders building generative AI products.

    If you treat LLMs as magic APIs, you stay at the surface. If you understand that they are large stacks of these same primitives, you can reason about:

  • Why fine-tuning works.
  • Why small data can overfit.
  • Why larger models cost more.
  • Why deeper systems can generalize better but fail unpredictably.

    The MLP is foundational. Not obsolete. It is the conceptual bridge between simple models and the systems powering modern generative AI.

    6. Relationship to Generative AI and LLMs

    At this point, you might be thinking:

    "This is clear for scoring models. But what does this have to do with ChatGPT, copilots, and generative AI?"

    The answer is simple.

    Large Language Models are built on the same learning loop and the same core building blocks you just understood.

    They are not a different species of system. They are scaled, specialized versions of the same principles.

    LLMs Follow the Same Learning Principle

    An LLM is still:

  • A deep neural network.
  • Trained with backpropagation.
  • Optimizing a loss function.

    But here is something important and often misunderstood.

    At its core, an LLM is trained as a classification system.

    Yes, classification.

    During training, the model sees a sequence of words. For example:

    "The capital of France is ___"

    The model’s job is to choose the correct next word from a very large list of possible words.

    You can think of it like this:

    Given all possible tokens in the vocabulary, the model assigns a probability to each one.

    Paris might get 0.82. London might get 0.01. Banana might get 0.000001.

    Choosing the next word is simply selecting from this probability distribution.

    That is a classification problem.

    The model is classifying which token is most likely to come next.

    Now here is the key insight.

    Generation is just classification repeated many times.

    Step 1: Predict the next token. Step 2: Append it to the text. Step 3: Predict the next token again, given the updated text. Step 4: Repeat.

    Word by word. Token by token.

    So what looks like "creative generation" is actually a sequence of classification decisions made extremely fast.
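    A sketch of that classification step: one score (logit) per vocabulary token, a softmax to turn scores into probabilities, then a selection. The vocabulary and logit values below are invented for illustration:

```python
import math

# Softmax converts raw scores into a probability distribution over tokens.

def softmax(logits):
    m = max(logits.values())                               # for numerical stability
    exps = {tok: math.exp(z - m) for tok, z in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# hypothetical scores for: "The capital of France is ___"
logits = {"Paris": 9.0, "London": 4.5, "Lyon": 3.0, "banana": -5.0}
probs = softmax(logits)
next_token = max(probs, key=probs.get)    # greedy decoding picks the argmax
```

    Real models sample from this distribution rather than always taking the argmax, which is one reason the same prompt can produce different continuations.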

    And how does the model learn to do this well?

    The same loop we described earlier:

  • Predict the next token.
  • Measure how wrong it was compared to the actual next token in the training data.
  • Adjust weights slightly.
  • Repeat billions of times.

    The difference is scale. Instead of thousands of weights, there are billions. Instead of small datasets, there are massive text corpora. Instead of a few layers, there are many stacked layers.

    Another difference is architecture. Transformers introduce attention mechanisms, which help the model decide which parts of the input matter most in context.

    But attention does not replace the MLP idea. It works alongside it.

    Under the hood, the learning principle is unchanged. It is still about learning better scoring rules over representations, refined through repeated feedback.

    MLP Blocks Inside Transformers

    Up to this point, we built a very specific mental model:

    An MLP layer:

  • Takes a vector of numbers.
  • Applies weighted sums.
  • Applies a non-linear rule.
  • Produces a new vector.

    And by stacking layers, we progressively transform representations.

    Now let’s connect that directly to transformers.

    A transformer layer has two main stages:

  • Attention.
  • A feed-forward block.

    Think of it as a two-step cycle that repeats many times.

    First, attention. Then, an MLP-style transformation. Then repeat in the next layer.

    Let’s zoom in.

    Imagine again that each word in a sentence has already been turned into a vector. So instead of words, we have a table of numbers. One vector per token.

    If we applied only an MLP to each token independently, the model would transform each word in isolation. It would never explicitly look at other words in the sentence.

    That is a limitation for language.

    Language is contextual. The meaning of a word depends on other words.

    This is where attention enters.

    Attention does not transform a token in isolation. It first allows each token to look at all other tokens.

    For one token, the model:

  • Compares it with every other token.
  • Computes how relevant each other token is.
  • Uses those relevance scores to create a weighted mixture of information.

    So attention answers this question: "Given the whole sentence, what information should influence this token?"

    Only after this context-mixing step does the model apply the feed-forward block.

    Now the feed-forward block is where your MLP mental model clicks back in.

    The feed-forward block:

  • Takes the updated vector.
  • Applies linear transformations.
  • Applies a non-linear activation.
  • Produces a new vector.

    This is structurally the same pattern we described in Sections 2 and 4.

    So the full transformer layer can be understood as:

  • Mix information across tokens (attention).
  • Transform each token representation with an MLP-style block.
  • Pass the result to the next layer.
  • Repeat this many times.

    Earlier layers refine local relationships. Later layers encode more abstract structure.
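    The whole two-step cycle fits in a short numpy sketch. The weights are random, there is a single attention head, and residual connections and normalization are omitted, so this is a skeleton rather than a working transformer layer; it exists only to show attention mixing followed by the MLP-style transform:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                        # embedding size
tokens = rng.normal(size=(5, d))             # 5 tokens, one vector each

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
W1 = rng.normal(size=(d, 4 * d))             # feed-forward: expand
W2 = rng.normal(size=(4 * d, d))             # feed-forward: project back

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Step 1, attention: every token looks at every other token.
Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
scores = Q @ K.T / np.sqrt(d)                # token-to-token relevance
mixed = softmax(scores) @ V                  # weighted mixture of information

# Step 2, feed-forward block: the MLP pattern, applied per token.
out = np.maximum(0.0, mixed @ W1) @ W2       # linear, ReLU, linear
```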

    The important connection to the rest of this article is this:

    Even in the most advanced generative models, the core computational pattern is still stacked transformations of vectors using weighted sums, non-linearities, and gradient-based learning.

    Attention adds dynamic context mixing. The MLP-style block performs the structured transformation.

    Together, they scale the exact principles you already understand.

    Not a new kind of intelligence. A deeper composition of the same building blocks.

    Not simple in scale. But simple in principle.

    Why This Matters for Generative AI Builders

    Now let’s connect this directly to product decisions.

    If you are building with LLMs, you will face questions like:

  • Should we fine-tune or just prompt?
  • Should we use embeddings for retrieval?
  • Why does the model hallucinate?
  • Why does performance degrade with certain inputs?
  • Why is latency increasing as we add complexity?

    The MLP mental model helps you reason about these issues.

    Fine-tuning works because you are slightly reshaping weights through additional rounds of prediction, error measurement, and adjustment. You are nudging the internal scoring rules.

    But notice how different techniques act on different parts of the system:

  • Prompting changes the input. You are not modifying the model. You are changing what it sees and how it frames the task.
  • Fine-tuning changes the weights. You are modifying the internal transformation rules.
  • RAG (retrieval-augmented generation) changes the context. You are injecting additional information before the model makes its prediction.
  • Increasing model size changes capacity. You are expanding how complex the internal representations can become.

    These are not interchangeable decisions. They operate at different layers of the stack.

    Embeddings are internal representations. They are the outputs of deep layers that encode structured signals about meaning.

    Adapter layers and LoRA techniques insert small trainable components into the layered structure. They modify specific transformation paths without retraining everything.

    Once you see the system this way, architectural choices become clearer. You are deciding whether to:

  • Change the data.
  • Change the context.
  • Change the weights.
  • Or change the capacity.

    Latency increases with depth and size because each layer performs numerical transformations. More layers mean more computation.

    Hallucination and memorization issues relate to capacity and generalization. A highly expressive model can capture subtle structure. But it can also overfit patterns from its training data.

    When you understand that an LLM is a very large stack of representation-transforming layers trained through iterative feedback, you stop treating it like magic.

    You start asking better questions:

  • How much capacity does this task require?
  • Is this a data problem or a model problem?
  • Are we pushing the model outside its training distribution?
  • Are we adding complexity in the right place: data, architecture, or system design?

    That shift in thinking is the real value.

    Understanding MLPs is not about nostalgia for early neural networks. It is about building a mental model that scales all the way up to generative AI systems used in production.

    7. From Demo to Production

    Up to now, we have been talking about models. Weights. Layers. Capacity.

    But founders do not ship models. They ship systems.

    And this is where reality tends to hit.

    In many cases, the neural network is not the first thing that breaks. The surrounding system is.

    Architecture Is Rarely the Bottleneck

    Let’s make this practical.

    Imagine you build a churn prediction model using an MLP. In a notebook, it performs beautifully. The metrics look strong. The curves look clean.

    Then you deploy it.

    Suddenly, the real world shows up.

    Your feature pipeline starts lagging. "Monthly usage" arrives 24 hours late. A new pricing tier changes user behavior. Some records are missing fields.

    The model is fine. The data feeding it is not.

    Then behavior shifts. You launch a marketing campaign and attract a new type of customer. You change pricing and user patterns adjust.

    The distribution of inputs drifts. The model was trained on yesterday’s reality. Today’s reality is slightly different.

    Now you have to decide: How often do we retrain? On a schedule? When performance drops? Fully automated?

    That is not a neural network question. It is a systems design question.

    Then come operational constraints.

    If each prediction costs $0.01 and you serve 5 million requests per day, that is real money. If your pricing API must respond in 50 milliseconds, you do not get to ignore latency.

    In production, these constraints often dominate architectural decisions.

    The MLP is rarely the bottleneck. The data pipeline, monitoring strategy, cost structure, and latency budget usually are.

    Simplicity as a Strategic Advantage

    Early-stage teams often assume that a more complex model automatically means a better product.

    In practice, simplicity is often a competitive advantage.

    A smaller model:

  • Is easier to debug when predictions look suspicious.
  • Is easier to explain to investors, clients, or regulators.
  • Trains faster.
  • Costs less to run.

    Suppose a simple two-layer MLP achieves 92 percent of the performance of a much larger architecture.

    In an early-stage product, that simpler model is often the better business decision.

    Especially when:

  • Your dataset is still growing.
  • Requirements are evolving weekly.
  • Your infrastructure is not yet mature.

    Model complexity should follow product maturity. Not the other way around.

    When Not to Use an MLP

    There are also cases where an MLP is simply the wrong tool.

    If you have a very small dataset, a high-capacity model will likely memorize rather than generalize. A linear model or even well-designed rules may perform better.

    If you operate in a regulated environment, you may need to justify every decision. In that case, a simpler, more interpretable model might be preferable.

    If a logistic regression already achieves stable, acceptable performance, adding layers may increase complexity without delivering meaningful gains.

    The key idea is simple.

    Do not use an MLP because it sounds modern. Use it because the structure of the problem requires non-linear capacity and you have the data and infrastructure to support it.

    That difference in mindset is what separates a demo from a production system.

    8. Final Mental Model Summary

    Let’s compress everything into a few durable mental models.

    A neuron is just a weighted decision rule. A layer is a way to extract richer signals from raw numbers. Depth is how we move from simple signals to structured abstractions. Loss is structured feedback from reality. Backpropagation is how responsibility for mistakes gets distributed and corrected.

    An MLP is not an old academic artifact. It is the foundation of modern deep learning.

    And LLMs are not magic engines. They are scaled, specialized descendants of this same principle: stacked transformations, trained through feedback, operating at massive scale.

    When you understand this, AI systems stop feeling mystical. They become design spaces.

    You start asking better questions. How much capacity do we really need? Is this a data problem or a modeling problem? Are we adding complexity in the right layer of the system?

    Foundations matter because complexity scales. Mental models compound.

    If you are building a product that relies on AI features and you want it to scale, survive scrutiny, and support real business growth, this is the level of thinking required.

    And if you want help designing or building a system that uses AI in a way that is pragmatic, robust, and production-ready, feel free to reach out.

    Building with AI should be intentional. Not accidental.


    Thanks for reading. Follow me for more content.

    Connect on LinkedIn