Tesla’s FSD Redefines Autonomous Driving
- Martin Otterbach
- Nov 18
- 16 min read
This article analyzes the technical foundations and architectural concepts behind Tesla’s FSD — from perception and planning logic to training mechanisms. The analysis draws on Phil Beisel’s seven-part series “The Magic of FSD,” which provides in-depth insights into the inner workings of Tesla’s approach to autonomous driving.

From rules to intuition: Tesla rethinks FSD code
Tesla's Full Self-Driving (FSD) system marks a fundamental break with traditional approaches to driver assistance systems. While classic systems such as Autopilot rely on rule-based decision logic (finite state machines, or FSMs for short) and explicitly programmed responses, Tesla has been pursuing a completely different path since version 12: a fully neural, end-to-end learning architecture.
Breaking with FSM systems
Tesla's motivation for this change is technical and strategic. Classic FSMs, which work with thousands of individually coded responses, are reaching their limits. Although they are comprehensible, they are prone to errors, inflexible in new situations, and difficult to scale. Even tiny changes in the environment can trigger a domino effect in the ruleset and make system maintenance a balancing act.
So instead of continuing to rely on manually maintained C++ code, Tesla now uses a model that learns from real driving situations: the neural network is trained with video data from real journeys. This also includes the reactions of human drivers. The goal is not to code rules, but to imitate behavior that has proven itself over billions of kilometers.
The architecture of this system is based on a simple principle: a neural network computes the control commands for acceleration, braking, and steering directly from raw camera inputs. The entire decision-making process takes place inside the network. The classic code, which previously had to define an explicit path for every driving situation, is replaced by a data-based model that adapts to the situation and can also generalize to completely new scenarios.
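To make this contrast with rule-based code concrete, here is a minimal sketch in Python/PyTorch. The layer sizes, module names, and input shapes are invented for illustration; Tesla's actual network layout is not public. The point is only the shape of the mapping: camera frames go in, control commands come out, and there is no hand-written driving rule anywhere in between.

```python
# Minimal end-to-end sketch (assumed architecture, not Tesla's actual network):
# raw camera frames in, control commands out, no hand-coded driving rules.
import torch
import torch.nn as nn

class EndToEndDriver(nn.Module):
    def __init__(self, num_cameras: int = 8):
        super().__init__()
        # A tiny convolutional encoder stands in for the real perception stack.
        self.encoder = nn.Sequential(
            nn.Conv2d(num_cameras * 3, 32, kernel_size=5, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # The head maps the latent scene representation to control commands.
        self.head = nn.Linear(64, 3)  # [steering, acceleration, braking]

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_cameras * 3, height, width) stacked camera images
        return self.head(self.encoder(frames))

# One inference step: eight simulated camera images produce one control vector.
model = EndToEndDriver()
controls = model(torch.rand(1, 8 * 3, 240, 320))
print(controls)  # tensor of shape (1, 3): steering, acceleration, braking
```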
What are the consequences of this change?
- Debugging is undergoing fundamental changes: errors are no longer found in the source code, but in the training data set.
- Rule compliance is being replaced by probability distributions: the network learns from thousands of variants what “good” driving behavior means statistically.
- Functional improvements are not created by new lines of code, but are generated through targeted data enrichment and retraining.
The introduction of this architecture in FSD v12 (presumably for the first time in Build 2023.44.30) not only ushered in a new technical era, but also forced Tesla to take a new approach to safety logic, validation, and model transparency. This is because end-to-end AI is difficult to explain, but it can be trained, observed, and refined.
Tesla is demonstrating more than just a new generation of software here. It is an attempt not to program autonomous driving, but to let it develop through training, feedback, and system intelligence.
Perception and planning with FSD
One of the key challenges of autonomous driving software is the reliable detection and interpretation of the environment. Tesla addresses this challenge with a two-stage neural system consisting of perception and planning. Although these terms are technically established, Tesla fills them with its own data-driven interpretation.
Perception: Breaking the world down into tokens
The first step of the FSD system is to translate the camera image into structured objects. These objects, referred to as “tokens” at Tesla, are abstract representations of road markings, vehicles, pedestrians, traffic signs, construction site objects, and much more. Each object is assigned a set of characteristics: position, speed, direction of movement, intention, confidence level, and more.
This tokenization is an attempt to provide the AI with a “semantic map” of the scene: not an image, but a meaningful representation of the relevant elements. The term “token” is deliberately chosen: as in language processing in models such as ChatGPT or Grok, complex inputs are translated into manageable units in order to derive meaning and structure from them.
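Read as a data structure, such a token might look roughly like the following sketch. The field names mirror the attributes listed above, but they are illustrative assumptions, not Tesla's internal schema.

```python
# Illustrative token structure (assumed fields, not Tesla's internal format).
from dataclasses import dataclass
from enum import Enum

class ObjectClass(Enum):
    VEHICLE = "vehicle"
    PEDESTRIAN = "pedestrian"
    CYCLIST = "cyclist"
    LANE_MARKING = "lane_marking"
    TRAFFIC_SIGN = "traffic_sign"
    CONSTRUCTION = "construction"

@dataclass
class SceneToken:
    object_class: ObjectClass
    position_m: tuple[float, float]    # x/y position relative to the ego vehicle, in meters
    velocity_mps: tuple[float, float]  # velocity vector, meters per second
    heading_rad: float                 # direction of movement
    intent: str                        # e.g. "crossing", "yielding", "overtaking"
    confidence: float                  # how certain the perception network is (0..1)

# A swaying oncoming cyclist, expressed as one token of the scene.
cyclist = SceneToken(ObjectClass.CYCLIST, (35.0, -1.2), (-4.5, 0.3), 3.1, "oncoming_unsteady", 0.87)
print(cyclist)
```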
Planning: Responding to context rather than rules
Based on these tokens, the planning module generates the next driving decisions. These include steering, acceleration, and braking. The whole process takes place at a frequency of 15 to 30 decisions per second.
And what's so special about that? These decisions are not made on the basis of fixed if-then rules, but as a statistically trained response to similar situations from the training data set.
An example:
The vehicle detects an oncoming bicycle that is swaying slightly. Instead of relying on rigid distances or preprogrammed rules, the neural network compares the scene with similar cases from its training and selects the most likely safe response: in this case, gentle braking and a slight offset within the lane.
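A highly simplified way to picture this statistical selection is to score a handful of candidate maneuvers by how often similar training scenes ended safely and comfortably, and pick the most probable one. This is purely illustrative: the real planner is a learned network, not a lookup over hand-listed candidates, and all numbers below are invented.

```python
# Toy illustration of probability-weighted planning (not Tesla's actual planner):
# candidate maneuvers are scored against what "worked" in similar training scenes.
candidate_maneuvers = {
    # maneuver: (assumed probability of a safe outcome, assumed comfort score)
    "keep_speed_keep_lane":        (0.62, 0.9),
    "brake_slightly_offset_right": (0.97, 0.8),
    "brake_hard":                  (0.99, 0.2),
}

def score(safety: float, comfort: float, safety_weight: float = 0.8) -> float:
    # Blend safety and comfort; safety dominates, the exact weighting is a tuning choice.
    return safety_weight * safety + (1 - safety_weight) * comfort

best = max(candidate_maneuvers, key=lambda m: score(*candidate_maneuvers[m]))
print(best)  # -> "brake_slightly_offset_right"
```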

End-to-end, but designed to be modular
Although Tesla pursues an end-to-end approach with FSD, in which camera images are translated directly into driving decisions, the system is logically modular. Perception and planning are designed as separate submodels, each specializing in specific tasks, but they remain linked via shared training data and feedback loops.
This modular structure not only increases the scalability of the system, but also enables targeted debugging and fine-tuning. If, for example, the planner responds incorrectly even though the perception was correct, the source of the error can be narrowed down, not in classic code, but in the data-driven training.
What is particularly striking about Tesla's approach is the importance it places on understanding context. The FSD system takes into account not only current objects, but also the spatial-temporal development of a scene.
- How fast is a vehicle approaching?
- Is a pedestrian about to step onto the road?
- Are we on a tight curve or an open country road?
All this information flows into the decision. Not through fixed rules, but through probability distributions and learned behavior. The result is a system that not only sees, but interprets and acts on what it sees.
Temporal logic and vector spaces explained
A key problem in autonomous driving lies not only in recognizing objects in space, but also in processing their movement through time. To drive safely, you need to know not only what is there, but also where something is moving, how fast, with what intention, and in what context.
Tesla's FSD system addresses this challenge with a clear paradigm shift. It does not think in terms of individual images or rigid coordinate systems, but in terms of time-based vector spaces that can map movements, intentions, and dynamics.
Are individual images not enough?
Traditional systems process camera images as high-frequency snapshots. Although they can detect movement by comparing successive images, this remains rudimentary. Pixel shifts and heuristic estimates are no substitute for a genuine understanding of movement.
Tesla, on the other hand, takes a temporalized approach. The neural network does not process individual images, but rather short video sequences that capture the scene over several seconds. These sequences are analyzed not only to recognize movements, but also to interpret them as causal events.
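In code, the difference between single-frame and sequence-based processing shows up simply as the shape of the input: a short clip instead of one image. The sketch below (PyTorch, with assumed shapes and a generic recurrent layer standing in for whatever temporal fusion the real system uses) illustrates the idea.

```python
# Sketch: a short clip, rather than a single frame, as network input (shapes are assumptions).
import torch
import torch.nn as nn

single_frame = torch.rand(1, 3, 240, 320)      # (batch, channels, H, W)
clip = torch.rand(1, 45, 3, 240, 320)          # (batch, time = 45 frames ~ 3 s at 15 fps, C, H, W)

# A GRU over per-frame features stands in for the real temporal fusion mechanism.
frame_encoder = nn.Sequential(nn.Flatten(start_dim=1), nn.Linear(3 * 240 * 320, 128), nn.ReLU())
temporal_encoder = nn.GRU(input_size=128, hidden_size=128, batch_first=True)

batch, steps = clip.shape[:2]
features = frame_encoder(clip.reshape(batch * steps, 3, 240, 320)).reshape(batch, steps, 128)
_, scene_memory = temporal_encoder(features)   # a summary of the last few seconds, not one snapshot
print(scene_memory.shape)                      # torch.Size([1, 1, 128])
```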
An example:
A child steps onto the sidewalk, hesitates, looks at the road, takes a step, and then stops. For rule-based software, this is a nightmare of unpredictability.
For a learning system, on the other hand, it is a pattern. One that has been seen dozens of times during training and is highly likely to represent an uncertain intention to cross. It is precisely this assessment that can be translated into a gentle braking maneuver. Not as a reaction to a rule, but as a probability weighting.
How does a vector space work?
At Tesla, the internal representation of these dynamics is no longer a map display or object list, but a vector field: a mathematical space in which each object is described by its direction of movement, speed, and intention. The scene is not static but dynamically spanned: each road user is an arrow moving through space, and the system evaluates how these arrows interact with each other.
This way of thinking is based on neural language models. There, too, words are not stored as character strings, but as points in vector space that represent semantic proximity. Tesla applies this principle to traffic. A turning truck and a hesitant SUV in front of a crosswalk are two different but related semantic events and thus mathematically similar in the model.
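The language-model analogy can be made concrete with embeddings: two scenes whose vectors point in a similar direction are treated as semantically related, which is usually measured with cosine similarity. A minimal sketch follows; the vectors are invented toy values, whereas in practice they would come from the perception network.

```python
# Cosine similarity between scene embeddings (toy vectors, purely illustrative).
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

turning_truck = [0.9, 0.7, 0.1, 0.4]   # hypothetical embedding: large vehicle, yielding intent
hesitant_suv  = [0.8, 0.8, 0.2, 0.3]   # hypothetical embedding: similar "yielding before crossing" event
open_highway  = [0.1, 0.0, 0.9, 0.8]   # hypothetical embedding: unrelated scene

print(cosine_similarity(turning_truck, hesitant_suv))  # high (~0.99): related events
print(cosine_similarity(turning_truck, open_highway))  # low (~0.34): dissimilar events
```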
FSD recognizes intentions
Another highlight of Tesla's approach is that the system not only tries to recognize where something is, but also what it is likely to do. These “intentions” are not recorded as fixed labels, but as behavioral patterns.
This means that a pedestrian who stops is assessed differently than one who accelerates, even if they are both the same distance from the road.
This semantic depth is achieved through training with real video data. The AI not only learns that a cyclist exists, but also how cyclists typically behave. For example, by looking over their shoulder, turning left, moving out slightly, and then overtaking. These recurring patterns become decision-making aids.
Causality through data, not rules
With this temporal architecture, Tesla is building a kind of “motor memory.” A structure that remembers scenes it has seen and derives behavior from them.
This not only enables smoother and safer driving, but also reduces false-positive reactions. Instead of braking abruptly at every shadow, the system “knows” that a person walking briskly and making eye contact is more likely to enter the road than someone walking parallel to the road and talking on the phone.
Tesla's data-driven world model
A neural network is only as good as the data it is trained with. But autonomous driving is not just about large amounts of data; it's primarily about the quality of the labels. These are the pieces of information that tell the system what can be seen in an image. Tesla has a decisive advantage here: instead of relying on manual labeling, the company uses data-driven, automated labeling strategies that use reality itself as a benchmark.
Labeling in the traditional sense: slow, expensive, limited
Traditionally, labeling means that people mark where a car, a stop sign, or a pedestrian can be seen in images. This work is time-consuming, expensive, and prone to errors. In addition, it is difficult to capture context or causality, such as whether an object is in motion, whether it interacts with its surroundings, or how its position in space changes.
This is a serious disadvantage, especially for FSD systems. Driving situations are highly dynamic and often ambiguous. A static image can hardly capture what is really happening.
The solution: labeling by the vehicle itself
Tesla takes a different approach: the vehicles generate the training data and provide the labels through their own behavior. More specifically, the behavior of the vehicles under certain software versions is compared with the actual outcomes. For example, if one version drives too close to parked cars and a new version later resolves the same situation differently, a “labeled data pair” is created. In other words, a before/after comparison that identifies the better option.
For example:
A vehicle drives through a construction site with version 12.3.1 and comes too close to a barrier post. Weeks later, the same route is driven again with version 12.3.8. This time, the vehicle clearly avoids the post. The difference between the two trips, including camera image, movement path, and vehicle data, is automatically labeled: “This distance is better than that one.”
This automatically generates millions of training examples with behavior-based labels: The system not only learns what it sees, but also how it can act better.
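A schematic of this before/after comparison is sketched below. The metric, the threshold, and the label names are invented for illustration; real labels would be derived from far richer signals than a single distance value.

```python
# Sketch of behavior-based auto-labeling: compare two drives of the same route
# under different software versions and keep the better one as the training target.
from dataclasses import dataclass

@dataclass
class DriveLog:
    software_version: str
    min_distance_to_obstacle_m: float   # simplified stand-in for a richer set of metrics
    camera_clip_id: str

def label_pair(old: DriveLog, new: DriveLog, margin_m: float = 0.3) -> dict | None:
    """Emit a preference-labeled pair if the newer drive kept clearly more clearance."""
    if new.min_distance_to_obstacle_m - old.min_distance_to_obstacle_m > margin_m:
        return {
            "preferred": new.camera_clip_id,
            "rejected": old.camera_clip_id,
            "label": "keeps_larger_distance",
        }
    return None  # no clear improvement, no label

old_run = DriveLog("12.3.1", min_distance_to_obstacle_m=0.4, camera_clip_id="clip_0391_a")
new_run = DriveLog("12.3.8", min_distance_to_obstacle_m=1.1, camera_clip_id="clip_0391_b")
print(label_pair(old_run, new_run))
```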
Simulation + comparison = supervised learning on a new level
This method is a hybrid of real observation and simulation. Tesla can replay sequences from the past with new software versions and then compare which version behaves more intelligently, safely, or smoothly. This comparison results in labels such as “better,” “safer,” “deviates earlier,” etc.
This creates a self-reinforcing learning process:
- Old mistakes or uncertainties are identified.
- New software versions attempt to avoid these mistakes.
- Improved behaviors are recognized and become the training basis for the next generation of the network.
This cycle enables continuous improvement without costly human intervention. The key point is that reality itself is the corrective factor.
This shift from label to result changes everything. It reduces human bias in annotations, allows context, movement, and causality to be captured, and brings the training data into line with the real requirements of road traffic.
Mixture of Experts
In the context of autonomous mobility, scalability means not only rolling out a system to more vehicles or cities, but also intelligently handling local peculiarities, regional driving styles, and specific environmental conditions. Tesla's answer to this challenge is a so-called Mixture of Experts (MoE) architecture. This is an AI concept that has already proven itself in other areas and is now being transferred to the road.
From generalists to specialists
A neural network quickly reaches its limits when it has to function in completely different contexts, such as in snowfall in Canada or in the dense city traffic of Mumbai. Tesla is therefore increasingly relying on a specialized architecture in which different subnetworks (“experts”) are responsible for different driving situations.
Such a system works as follows:
- The main networks for perception and planning remain in place.
- Within these networks, there are expert layers that are selectively activated depending on the driving situation.
- A so-called gating layer (a kind of intelligent “router”) decides in real time, depending on the context, which experts are weighted how heavily.
Examples of expert modules
Tesla differentiates between different driving scenarios and environments and trains specific expert networks for them. For example:
- Urban Intersection Expert: for complex intersections with traffic lights, pedestrians, and multi-lane turning maneuvers.
- Highway Expert: for lane keeping, merging, and overtaking at high speeds.
- Wet Weather Expert: for wet or slippery road conditions.
- Rural Road Expert: for narrow, unmarked roads with unpredictable obstacles.
- Parking Lot Expert: for slow, tight maneuvers and pedestrian detection.
- Construction Zone Expert: for temporary traffic patterns, detours, and construction.
- Night/Low-Light Expert: for limited visibility conditions.
These experts are not deployed in isolation, but are dynamically combined. For example, if a Tesla is driving through an urban intersection at night in drizzling rain, the system could activate a mixture of the Night Expert (e.g., 20%) and the Urban Intersection Expert (e.g., 80%). The final decision, e.g., on speed, is then based on a weighted average.
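The weighting described above can be sketched as a gating function that turns context into expert weights and blends the experts' outputs. Everything in the sketch below (the expert names, their speed recommendations, and the gating logits) is invented for illustration; it only shows the mechanics of a weighted mixture.

```python
# Toy Mixture-of-Experts blend (illustrative only): a gating layer weights experts
# by context, and the final command is their weighted average.
import math

def softmax(logits: dict[str, float]) -> dict[str, float]:
    exps = {name: math.exp(v) for name, v in logits.items()}
    total = sum(exps.values())
    return {name: v / total for name, v in exps.items()}

# Hypothetical per-expert speed recommendations for "urban intersection, night, drizzle".
expert_speed_kmh = {"urban_intersection": 22.0, "night_low_light": 18.0, "wet_weather": 20.0}

# Hypothetical gating logits produced from context features (location, light, rain intensity).
gate_weights = softmax({"urban_intersection": 2.0, "night_low_light": 0.6, "wet_weather": 0.2})

blended_speed = sum(gate_weights[name] * expert_speed_kmh[name] for name in expert_speed_kmh)
print(gate_weights)             # roughly 0.71 / 0.17 / 0.12
print(round(blended_speed, 1))  # ~21.1 km/h, dominated by the intersection expert
```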
Technical advantages of the MoE model
- Adaptability: The system learns faster and more robustly because new scenarios can be fed into individual experts in a more targeted manner without having to retrain the entire model.
- Efficiency: Only relevant experts are activated per inference cycle. This reduces the computational effort compared to a fully loaded mega model.
- Granularity: The system response can be refined depending on the context, for example by simultaneously taking into account weather, light, environment, and destination.
Training in modular steps
There are also advantages when it comes to training: instead of retraining the entire network every time the data changes, Tesla can specifically adapt individual expert modules or add new ones, for example for particular regions, new driving conditions, or legal requirements.

Context becomes part of intelligence
While many systems attempt to generate “generic” driving behavior, Tesla takes a different approach with MoE. Intelligence is fragmented but orchestrated in a meaningful way. The result is a system that can not only generalize but also contextualize. It knows where it is driving, under what conditions, and adapts accordingly.
With this architecture, Tesla combines general AI power with local specialized knowledge. This is not only scalable, but also promises safer, more culturally sensitive, and more dynamic driving behavior. A decisive step toward global deployability.
The leap to autonomous scaling
Robotaxis and photon-based perception
On June 22, 2025, a new chapter in the development of autonomous driving began in Austin. Tesla launched the live operation of its robotaxi program with a small fleet of Model Y vehicles that complete several hundred trips daily in the geofenced urban area of Austin. Initially, there is still a safety observer in the passenger seat, but the vehicles are already equipped with an FSD version (presumably 13.3) that technically does not require driver supervision. This marks the transition from a trained system to an independently operating service.
Vision-only as a cornerstone
Tesla has always pursued a radically different sensor approach than most other OEMs in the field of autonomous driving: no LiDAR, no HD maps, no radar, only cameras. Eight cameras capture the surroundings in 360 degrees. The special feature is not the sensors themselves, but how their signal is processed.
Photon counting instead of classic images
While conventional systems work with classic images processed by an ISP (image signal processor), Tesla accesses the raw data from the camera sensors directly. This means no JPEGs, no contrast enhancement, no color correction. Instead, the raw photon measurements (12-bit Bayer mosaic, later an RCCC configuration) are fed directly into the neural networks.
Tesla no longer lets the camera decide what is “relevant” in the image, but rather the FSD network itself. The goal is not aesthetics, but responsive, safe vehicle control. The neural networks act as learning signal processors that are trained to optimize results rather than image quality.
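To illustrate the difference between ISP output and raw sensor input, the sketch below contrasts an 8-bit RGB image with an unprocessed 12-bit mosaic of photon counts. Shapes, value ranges, and the first network layer are assumptions made for this sketch; the real pipeline is not public.

```python
# Sketch: a raw 12-bit sensor mosaic as network input instead of an ISP-processed RGB image.
import torch
import torch.nn as nn

# ISP path (what the network does NOT see here): 8-bit, 3-channel, tone-mapped image.
rgb_image = torch.randint(0, 256, (1, 3, 960, 1280), dtype=torch.uint8)

# Raw path: one channel of 12-bit photon counts in a color-filter mosaic (values 0..4095).
raw_mosaic = torch.randint(0, 4096, (1, 1, 960, 1280), dtype=torch.int16)

# The first layer consumes the raw counts directly; the scaling to [0, 1] is just a
# convenience for this sketch, not a claim about the real preprocessing.
first_layer = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=5, stride=2)
features = first_layer(raw_mosaic.float() / 4095.0)
print(features.shape)  # torch.Size([1, 16, 478, 638])
```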
LiDAR as a training aid, not as part of the product
Although Tesla does not use LiDAR in production, it is used in the background as a ground truth instrument. During development, it provides precise depth data that is used to calibrate the camera-based networks. In training vehicles and validation fleets, LiDAR records distances, object sizes, and movement patterns, for example. This results in a high-quality correction data stream for training.
Once the network is sufficiently trained, LiDAR is removed from the equation. It remains a learning aid, not part of the solution.
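A schematic of using LiDAR only as a training-time teacher is sketched below: the camera network predicts depth, LiDAR supplies the ground truth, and the loss exists only during training, never in the deployed model. The loss formulation, scales, and module names are assumptions for illustration.

```python
# Sketch: LiDAR as training-time ground truth for a camera-only depth head (illustrative).
import torch
import torch.nn as nn

depth_head = nn.Sequential(                 # stands in for the camera network's depth output
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 1, 1),
)

camera_image = torch.rand(1, 3, 120, 160)
lidar_depth_gt = torch.rand(1, 1, 120, 160) * 80.0   # reference depth in meters (synthetic here)
valid_mask = lidar_depth_gt > 0                      # real LiDAR returns are sparse

predicted_depth = depth_head(camera_image) * 80.0
loss = nn.functional.l1_loss(predicted_depth[valid_mask], lidar_depth_gt[valid_mask])
loss.backward()                                      # gradients flow only into the camera network
print(float(loss))

# At inference time the LiDAR branch simply does not exist: only depth_head(camera_image) runs.
```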
Simulation & Auto-Labeling: Validation at the System Level
Tesla uses its real-world robotaxi fleet not only for live operations, but also for continuous validation and expansion. With the help of special validation vehicles and a sophisticated simulation environment, each new urban area is systematically explored:
- Drives through new areas are recorded.
- Sensor and LiDAR data are synchronized.
- The scenes are replayed by the FSD software and checked for errors.
- Faulty sections are automatically labeled, corrected, and added to the training set.
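A simplified version of this validation loop might look like the sketch below. The function names, the single distance metric, and the error threshold are invented; real checks compare far more than one number per scene.

```python
# Sketch of the replay-and-auto-label loop for a newly recorded area (illustrative names only).
from dataclasses import dataclass

@dataclass
class RecordedScene:
    scene_id: str
    lidar_distance_m: float          # reference distance from the validation vehicle's LiDAR
    fsd_estimated_distance_m: float  # what the FSD software estimated when replaying the scene

def find_faulty_sections(scenes: list[RecordedScene], tolerance_m: float = 0.5) -> list[dict]:
    """Flag scenes where the replayed software disagrees with the reference measurement."""
    corrections = []
    for scene in scenes:
        error = abs(scene.fsd_estimated_distance_m - scene.lidar_distance_m)
        if error > tolerance_m:
            corrections.append({
                "scene_id": scene.scene_id,
                "label": ("distance_underestimated"
                          if scene.fsd_estimated_distance_m < scene.lidar_distance_m
                          else "distance_overestimated"),
                "ground_truth_m": scene.lidar_distance_m,
            })
    return corrections  # these corrected scenes would be added back into the training set

drives = [RecordedScene("austin_017", 12.4, 12.2), RecordedScene("austin_018", 6.1, 7.3)]
print(find_faulty_sections(drives))  # only austin_018 is flagged and auto-labeled
```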
In addition, Tesla generates synthetic driving situations by adding virtual elements to real scenes (e.g., an intersection). These include pedestrians with atypical behavior or changing lighting conditions. This allows rare edge cases to be simulated en masse.
Retraining & Versioning: A learning cycle
Each of these validated or simulated driving sections flows back into the training set as an “improved sequence.” Retraining then creates a new model, such as FSD version 13.3.1, which is reintroduced into the fleet once it has passed validation. Tesla is thus establishing a continuous learning cycle of real driving, simulation, analysis, and improvement.
An engineering paradigm: The brain is the solution
While other approaches rely on sensors and hard programming, Tesla takes the opposite approach, namely that of minimal hardware and maximum intelligence. Neural networks replace not only map data and LiDAR, but also the camera's own image processing. Everything is concentrated on one point: a learning system that can think directly from photons to control impulses.
This principle is reminiscent of biological models. Humans also drive with senses that are prone to error, but the brain compensates. Tesla is not building a perfect eye, but a good memory.
FSD v14
From human role model to superhuman performance
With version 14 of the Full Self-Driving (FSD) system, Tesla has reached a new milestone on the road to fully autonomous mobility. Compared to previous iterations, v14 is not only an enlarged model with a higher number of parameters, but also a fundamentally better trained system.
Imitation and reinforcement: two learning paths to autonomy
The FSD training process is based on two pillars:
- Imitation learning draws on high-quality human driving data to learn basic skills such as lane guidance and distance control.
- Reinforcement learning (RL) goes beyond this. It uses simulated driving scenarios to specifically optimize behavior in complex, rare, or safety-critical situations. In these simulations, a so-called reward mechanism evaluates the resulting actions. Positive behavior, such as safely driving around a pedestrian who suddenly appears, is reinforced. Risky decisions, on the other hand, are penalized. The result is a system that is not only based on human driving behavior, but can also develop better solutions through targeted exploration.
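How such a reward might be scored can be sketched as follows. The reward terms, weights, and numbers are invented for illustration; Tesla's actual reward design is not public.

```python
# Toy reward function for a simulated driving episode (illustrative terms and weights only).
def episode_reward(collided: bool, min_clearance_m: float, max_jerk: float, reached_goal: bool) -> float:
    reward = 0.0
    if collided:
        reward -= 100.0                            # safety-critical outcomes dominate everything else
    if min_clearance_m < 1.0:
        reward -= (1.0 - min_clearance_m) * 20.0   # penalize cutting it close, even without contact
    reward -= max_jerk * 2.0                       # jerky control is uncomfortable and gets penalized
    if reached_goal:
        reward += 10.0                             # completing the maneuver is rewarded
    return reward

# Safely driving around a suddenly appearing pedestrian vs. a risky near-miss:
print(episode_reward(collided=False, min_clearance_m=1.8, max_jerk=0.6, reached_goal=True))  # 8.8
print(episode_reward(collided=False, min_clearance_m=0.3, max_jerk=2.5, reached_goal=True))  # -9.0
```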

Bigger, more specific, more efficient
V14 is ten times larger than v13 in terms of the number of network parameters. This allows for finer distinctions and greater context sensitivity. To keep the associated computing effort manageable, Tesla relies on the mixture-of-experts architecture already introduced in v13. Only those submodules (expert networks) that are relevant to the current driving situation are activated, significantly reducing the computational load on the vehicle hardware (HW4).
Training infrastructure: Supercomputers in the service of the road
The necessary training runs for v14 are carried out at the company's own “Cortex” data center in Austin, Texas. This center comprises tens of thousands of GPUs and is one of the most powerful dedicated AI infrastructures in the mobility sector worldwide.
Reinforcement learning in particular requires extremely high computing power, as each scenario is played through and evaluated with numerous variants in order to generate robust behavior patterns.
The result of this training methodology is driving behavior that not only appears safer, but also more natural and predictable to other road users. FSD v14 responds smoothly, anticipatively, and with a level of contextual understanding that is increasingly described as “human” or, according to Elon Musk, even “sentient.”
For the first time, the system can be operated in a completely unsupervised mode, for example in the context of robotaxi operations in Austin. In regions where it is permitted by regulation, human supervision will therefore (soon) no longer be required.
With version 14, Tesla is making the transition from an imitative system to a superior driver AI model. The combination of scalable architecture, highly dynamic training, and targeted error reduction via RL achieves an unprecedented level of confidence in road traffic.
Although this version already significantly exceeds the average human driving performance, it is expected to be only an intermediate step. The next software generations – v15, v16, and beyond – as well as the next hardware generations AI5 and AI6 will build on these foundations with the goal of completely replacing human driving in the long term.
The Reasoning System in v14
Tesla’s Full Self-Driving system is evolving beyond purely reactive driving decisions. It is beginning to “think.” Elon Musk has said that FSD will soon feel “almost sentient,” a reference to the introduction of reasoning capabilities. This marks a new phase for Tesla: a shift from autonomous vehicle control to situational, dialog-based decision-making.
At the core of this paradigm shift are two tightly integrated subsystems:
- Action AI: responsible for core driving decisions. It processes camera data, vehicle kinematics, audio input, map information, and internal states 36 times per second to generate steering, braking, and acceleration commands.
- Interactivity AI: a language-capable LLM that interacts with passengers, answers questions, stores preferences, and adapts future behavior accordingly.
To make the system more transparent and interpretable, Tesla employs reasoning tokens. These are intermediate representations that reveal how the network understands its environment.
These include:
- Panoptic segmentation (assigning a class to every pixel in the camera feed)
- 3D occupancy grids (real-time volumetric understanding of the surrounding space)
- Language tokens (used to explain actions in plain, human-readable terms)
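Bundled together, these intermediate outputs could be represented roughly as in the sketch below. The types, shapes, and field names are assumptions; the sketch only illustrates what such reasoning tokens expose, not Tesla's internal format.

```python
# Sketch of a per-frame "reasoning" bundle exposing intermediate representations (assumed format).
from dataclasses import dataclass
import numpy as np

@dataclass
class ReasoningTokens:
    panoptic_segmentation: np.ndarray   # (H, W) class id per camera pixel
    occupancy_grid: np.ndarray          # (X, Y, Z) booleans: which volume around the car is occupied
    language_explanation: str           # human-readable summary of the intended action

frame = ReasoningTokens(
    panoptic_segmentation=np.zeros((240, 320), dtype=np.int16),
    occupancy_grid=np.zeros((200, 200, 16), dtype=bool),
    language_explanation="Yielding to pedestrian at the gated entry, then proceeding at walking speed.",
)
print(frame.language_explanation)
print(frame.occupancy_grid.shape)   # volumetric understanding of the space around the vehicle
```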
Combined with user interaction, a system emerges that does more than just react. It plans proactively while storing personal preferences, such as favored routes, behavioral tendencies, or specific comfort settings. In complex real-world scenarios like gated entries, drive-throughs, or airport pickups, FSD v14 already demonstrates how it can interpret, evaluate, and adapt to multi-step situations.
FSD v14 is not merely a software update; it represents a turning point. The control system inside the car becomes an intelligent agent with memory, logic, and interaction capabilities. What began as neural planning is evolving into machine cognition on the road.
The transition from automation to autonomy
With version 14, a self-driving assistance system becomes an independently operating entity. Tesla is shifting its focus away from the sensory overkill of traditional OEMs toward maximally scaled, software-based intelligence that learns from real and simulated driving experience.
What is special about this is not only the technical implementation, but also the underlying paradigm. FSD is not a hard-coded rule machine, but a learning system that becomes an increasingly better driver through imitation, reinforcement, and gigantic amounts of data. V14 thus marks not only a step forward in terms of functionality, but also a philosophical turning point. Autonomy is no longer understood as perfect mastery of all cases, but as constant learning in the flow of reality.
While other players rely on geofencing, human control, or hybrid approaches, Tesla is going all in, making it unmistakably clear where the real bet lies: not in better sensors, but in a better brain.
last updated: November 18th 2025