TL;DR

Published March 11, 2025

Update on GitHub

Upvote

sandhawalia Harsimrat Sandhawalia

yaak-ai

cadene Remi Cadene

A snapshot of L2D, the world's largest self-driving dataset!

90+ TeraBytes of multimodal data (5000+ hours of driving) from 30 cities in Germany.
6 surrounding HD cameras + vehicle state (Speed/Heading/GPS/IMU)
Continuous (Gas/Brake/Steering) and discrete actions (Gear/Turn Signals)
OpenStreetMap matched waypoints from birds-eye-view.
Natural language instructions. F.ex "When the light turns green, drive over the tram tracks and then through the roundabout")
Expert (driving instructors) and student (learner drivers) policies

LeRobot goes to driving school

State-of-the art Vision Language Models and Large Language Models are trained on open-source image-text corpora sourced from the internet, which spearheaded the recent acceleration of open-source AI. Despite these breakthroughs, the adoption of end-to-end AI within the robotics and automotive community remains low, primarily due to a lack of high quality, large scale multimodal datasets like OXE. To unlock the potential for robotics AI, Yaak teamed up with the LeRobot team at 🤗 and is excited to announce Learning to Drive (L2D) to the robotics AI community. L2D is the world’s largest multimodal dataset aimed at building an open-sourced spatial intelligence for the automotive domain with first class support for 🤗’s LeRobot training pipeline and models. Drawing inspiration from the best practices of source version control, Yaak also invites the AI community to search and discover novel episodes in our entire dataset (> 1 PetaBytes), and queue their collection for review to be merged into future release (R5+).

Dataset	Observation	State	Actions	Task/Instructions	Episodes	Duration (hr)	Size TB
WAYMO	RGB (5x)	—	—	—	2030	11.3	0.5*
NuScenes	RGB (6x)	GPS/IMU	—	—	1000	5.5	0.67*
MAN	RGB (4x)	GPS/IMU	—	—	747	4.15	0.17*
ZOD	RGB (1x)	GPS/IMU/CAN	☑️	—	1473	8.2	0.32*
COMMA	RGB (1x)	GPS/IMU/CAN	☑️	—	2019	33	0.1
L2D (R4)	RGB (6x)	GPS/IMU/CAN	☑️	☑️	1000000	5000+	90+

Table 1: Open source self-driving datasets (*excluding lidar and radar). Source

L2D was collected with identical sensor suites installed on 60 EVs operated by driving schools in 30 German cities over the span of 3 years. The policies in L2D are divided into two groups — expert policies executed by driving instructors and student policies by learner drivers. Both the policy groups include natural language instructions for the driving task. For example, “When you have the right of way, take the third exit from the roundabout, carefully driving over the pedestrian crossing”.

Expert policy — Driving instructor	Student policy — Learner driver

Fig 1: Visualization: Nutron (3 of 6 cameras shown for clarity) Instructions: “When you have the right of way, drive through the roundabout and take the third exit”.

Expert policies have zero driving mistakes and are considered as optimal, whereas student policies have known sub optimality (Fig 2).

Fig 2: Student policy with jerky steering to prevent going into lane of the incoming truck

Both groups cover all driving scenarios that are mandatory for completion to obtain a driving license within the EU (German version), for example, overtaking, roundabouts and train tracks. In the release (See below R3+), for suboptimal student policies, a natural language reasoning for sub-optimality will be included. F.ex “incorrect/jerky handling of the steering wheel in the proximity of in-coming traffic ” (Fig 2)

Expert: Driving Instructor	Student: Learner Driver

Expert policies are collected when driving instructors are operating the vehicle. The driving instructors have at least 10K+ hours of experience in teaching learner drivers. The expert policies group covers the same driving tasks as the student policies group.	Student policies are collected when learner drivers are operating the vehicle. Learner drivers have varying degrees of experience (10–50 hours). By design, learner drivers cover all EU-mandated driving tasks, from high-speed lane changes on highways to navigating narrow pedestrian zones.

L2D: Learning to Drive

L2D (R2+) aims to be the largest open-source self-driving dataset that empowers the AI community with unique and diverse ‘episodes’ for training end-to-end spatial intelligence. With the inclusion of a full spectrum of driving policies (student and experts), L2D captures the intricacies of safely operating a vehicle. To fully represent an operational self-driving fleet, we include episodes with diverse environment conditions, sensor failures, construction zones and non-functioning traffic signals.

Both the expert and student policy groups are captured with the identical sensor setup detailed in the table below. Six RGB cameras capture the vehicle’s context in 360o, and on-board GPS captures the vehicle location and heading. An IMU collects the vehicle dynamics, and we read speed, gas/brake pedal, steering angle, turn signal and gear from the vehicle’s CAN interface. We synchronized all modality types with the front left camera (observation.images.front_left) using their respective unix epoch timestamps. We also interpolated data points where feasible to enhance precision (See Table 2.) and finally reduced the sampling rate to 10 hz.

Fig 3: Multimodal data visualization with Visualization: Nutron (only 3 of 6 cameras shown for clarity)

Modality	LeRobotDataset v2.1 key	Shape	alignment[tol][strategy]
image (x6)	observation.images.front_left[left_forward,..]	N3HW	asof[20ms][nearest]
speed	observation.state.vehicle.speed	N1	interp
heading	observation.state.vehicle.heading[heading_error]	N1	asof[50ms][nearest]
GPS	observation.state.vehicle.latitude[longitude/altitude]	N1	asof[50ms][nearest]
IMU	observation.state.vehicle.acceleration_x[y]	N1	interp
waypoints	observation.state.vehicle.waypoints	N2L	asof[10m][nearest]
timestamp	observation.state.timestamp	N1	observation.images.front_left
gas	action.continous.gas_pedal_normalized	N1	interp
brake	action.continous.brake_pedal_normalized	N1	interp
steering	action.continous.steering_angle_normalized	N1	interp
turn signal	action.discrete.turn_signal	N1	asof[100ms][nearest]
gear	action.discrete.gear	N1	asof[100ms][nearest]
language	task.policy	N1	—
language	task.instructions	N1	—

Table 2: Modality types, LeRobot v2.1 key, shape and interpolation strategy.

L2D follows the official German driving task catalog (detailed version) definition of driving tasks, driving sub-tasks and task definition. We assign a unique Task ID and natural language instructions to all episodes. The LeRobot:task for all episodes is set to “Follow the waypoints while adhering to traffic rules and regulations”. The table below shows a few sample episodes, their natural language instruction, driving tasks and subtasks. Both expert and student policies have an identical Task ID for similar scenarios, whereas the instructions vary with the episode.

Episode	Instructions	Driving task	Driving sub-task	Task Definition	Task ID
Visualization LeRobot Visualization Nutron	Drive straight through going around the parked delivery truck and yield to the incoming traffic	3 Passing, overtaking	3.1 Passing obstacles and narrow spots	This sub-task involves passing obstacles or navigating narrow roads while following priority rules.	3.1.1.3a Priority regulation without traffic signs (standard)
Visualization LeRobot Visualization Nutron	Drive through the unprotected left turn yielding to through traffic	4 Intersections, junctions, entering moving traffic	4.1 Crossing intersections & junctions	This sub-task involves crossing intersections and junctions while following priority rules and observing other traffic.	4.1.1.3a Right before left
Visualization LeRobot Visualization Nutron	Drive straight up to the yield sign and take first exit from the roundabout	5 Roundabouts	5.1 Roundabouts	This sub-task involves safely navigating roundabouts, understanding right-of-way rules, and positioning correctly.	5.1.1.3a With one lane

Table 3: Sample episodes in L2D, their instructions and Task ID derived from EU driving task catalog

We automate the construction of the instructions and waypoints using the vehicle position (GPS), Open-Source Routing Machine, OpenStreetMap and a Large Language Model (LLM) (See below). The natural language queries are constructed to closely follow the turn-by-turn navigation available in most GPS navigation devices. The waypoints (Fig 4) are computed by map-matching the raw GPS trace to the OSM graph and sampling 10 equidistant points (orange) spanning 100 meters from the vehicles current location (green), and serve as drive-by-waypoints.

Fig 4: L2D 6x RGB cameras, waypoints (orange) and vehicle location (green) Instructions: drive straight up to the stop stop sign and then when you have right of way, merge with the moving traffic from the left

Search & Curation


Expert policies		Student policies
GPS traces from the expert policies collected from the driving school fleet. Click here to see the full extent of expert policies in L2D.		Student policies cover the same geographical locations as expert policies. Click here to see the full extent of student policies in L2D.

We collected the expert and student policies with a fleet of 60 KIA E-niro driving school vehicles operating in 30 German cities, with an identical sensor suite. The multimodal logs collected with the fleet are unstructured and void of any task or instructions information. To search and curate for episodes we enrich the raw multimodal logs with information extracted through map matching the GPS traces with OSRM and assigning node and way tags from OSM (See next section). Coupled with a LLM, this enrichment step enables searching for episodes through the natural language description of the task.

OpenStreetMap

For efficiently searching relevant episodes, we enrich the GPS traces with turn information obtained by map-matching the traces using OSRM. We additionally use the map-matched route and assign route features, route restrictions and route maneuvers, collectively referred to as route tasks, to the trajectory using OSM (See sample Map). Appendix A1-A2 provides for more details on the route tasks we assign to GPS traces.

Fig 5: Driving tasks assigned to raw GPS trace (View map)

The route tasks which get assigned to the map-matched route, are assigned the beginning and end timestamps (unix epoch), which equates to the time when the vehicle enters and exits the geospatial linestring or point defined by the task (Fig 6).

Begin: Driving task (Best viewed in a separate tab)	End: Driving task (Best viewed in a separate tab)

Fig 6: Pink: GNSS trace, Blue: Matched route, tasks: Yield, Train crossing and Roundabout (View Map)

Multimodal search

We perform semantic spatiotemporal indexing of our multimodal data with the route tasks as described in Fig 5. This step provides a rich semantic overview of our multimodal data. To search within the semantic space for representative episodes by instructions, for example, “drive up to the roundabout and when you have the right of way turn right”, we built a LLM-powered multimodal natural language search, to search within all our drive data (> 1 PetaBytes) and retrieve matching episodes.

We structured the natural language queries (instructions) to closely resemble turn-by-turn navigation available in GPS navigation devices. To translate instructions to route tasks, we prompt the LLM with the instructions and steer its output to a list of route features, route restrictions, route maneuvers and retrieve episodes assigned to these route tasks. We perform a strict validation of the output from the LLM with a pydantic model to minimize hallucinations. Specifically we use llama-3.3-70b and steer the output to the schema defined by the pydantic model. To further improve the quality of the structured output, we used approx 30 pairs of known natural language queries and route tasks for in-context learning. Appendix A. 2 provides details on the in-context learning pairs we used.

Instructions: Drive up to the roundabout and when you have the right of way turn right

LeRobot

L2D on 🤗 is converted to LeRobotDataset v2.1 format to fully leverage the current and future models supported within LeRobot. The AI community can now build end-to-end self-driving models leveraging the state-of-the-art imitation learning and reinforcement learning models for real world robotics like ACT, Diffusion Policy, and Pi0.

Existing self-driving datasets (table below) focus on intermediate perception and planning tasks like 2D/3D object detection, tracking, segmentation and motion planning, which require high quality annotations making them difficult to scale. Instead L2D is focused on the development of end-to-end learning which learns to predict actions (policy) directly from sensor input (Table 1.). These models leverage internet pre-trained VLM and VLAM.

Releases

Robotics AI models’ performances are bounded by the quality of the episodes within the training set. To ensure the highest quality episodes, we plan a phased release for L2D. With each new release we add additional information about the episodes. Each release R1+ is a superset of the previous releases to ensure clean episode history.

1. instructions: Natural language instruction of the driving task 2. task_id: Mapping of episodes to EU mandated driving tasks Task ID 3. observation.state.route : Information about lane count, turn lanes from OSM 4. suboptimal: Natural language description for the cause of sub-optimal policies

HF	Nutron	Date	Episodes	Duration	Size	instructions	task_id	observation.state.route	suboptimal
R0	R0	March 2025	100	0.5+ hr	9,5 GB	☑️
R1	R1	April 2025	1K	5+ hr	95 GB	☑️
R2	R2	May 2025	10K	50+ hr	1 TB	☑️	☑️	☑️	☑️
R3	R3	June 2025	100K	500+ hr	10 TB	☑️	☑️	☑️	☑️
R4	R4	July 2025	1M	5000+ hr	90 TB	☑️	☑️	☑️	☑️

Table 5: L2D release dates

The entire multimodal dataset collected by Yaak with the driving school fleet is 5x larger than the planned release. To further the growth of L2D beyond R4, we invite the AI community to search and uncover scenarios within our entire data collection and build a community powered open-source L2D. The AI community can now search for episodes through our natural language search and queue their collection for review by the community for merging them into the upcoming releases. With L2D, we hope to unlock an ImageNet moment for spatial intelligence.

Fig 1: Searching episodes by natural language instructions

Closed Loop Testing

LeRobot driver

For real world testing of the AI models trained with L2D and LeRobot, we invite the AI community to submit models for closed loop testing with a safety driver, starting summer of 2025. The AI community will be able to queue their models for closed loop testing, on our fleet and choose the tasks they’d like the model to be evaluated on and, for example, navigating roundabouts or parking. The model would run in inference mode (Jetson AGX or similar) on-board the vehicle. The models will drive the vehicle with LeRobot driver in two modes

drive-by-waypoints: “Follow the waypoints adhering to driving rules and regulations” given observation.state.vehicle.waypoints
drive-by-language: “Drive straight and turn right at the pedestrian crossing”

Additional Resources

Driving task catalog (Fahraufgabenkatalog)
Official German practical driving exam
Groq

References

@article{yaak2023novel,
    author = {Yaak team},
    title ={A novel test for autonomy},
    journal = {https://www.yaak.ai/blog/a-novel-test-for-autonomy},
    year = {2023},
}
@article{yaak2023actiongpt,
    author = {Yaak team},
    title ={Next action prediction with GPTs},
    journal = {https://www.yaak.ai/blog/next-action-prediction-with-gpts},
    year = {2023},
}
@article{yaak2024si-01,
    author = {Yaak team},
    title ={Building spatial intelligence part - 1},
    journal = {https://www.yaak.ai/blog/buildling-spatial-intelligence-part1},
    year = {2024},
}
@article{yaak2024si-01,
    author = {Yaak team},
    title ={Building spatial intelligence part - 2},
    journal = {https://www.yaak.ai/blog/building-spatial-intelligence-part-2},
    year = {2024},
}

Appendix

A.1 Route tasks

List of route restrictions. We consider route tags from OSM a restriction if it imposes restrictions on the policy, for example speed limit, yield or construction. Route features are physical structures along the route, for example inclines, tunnels and pedestrian crossing. Route maneuvers are different scenarios which a driver encounters during a normal operation of the vehicle in an urban environment, for example, multilane left turns and roundabouts.

Type	Name	Assignment	Task ID	Release
Route restriction	CONSTRUCTION	VLM		R1
Route restriction	CROSS_TRAFFIC	VLM	4.3.1.3a, 4.3.1.3b, 4.3.1.3d, 4.2.1.3a, 4.2.1.3b, 4.2.1.3d	R2
Route restriction	INCOMING_TRAFFIC	VLM		R2
Route restriction	LIMITED_ACCESS_WAY	OSM		R0
Route restriction	LIVING_STREET	OSM		R0
Route restriction	LOW_SPEED_REGION (5, 10, 20 kph)	OSM		R0
Route restriction	ONE_WAY	OSM	3.2.1.3b	R0
Route restriction	PEDESTRIANS	VLM	7.2.1.3b	R1
Route restriction	PRIORITY_FORWARD_BACKWARD	OSM	3.1.1.3b	R0
Route restriction	ROAD_NARROWS	OSM		R0
Route restriction	STOP	OSM	4.1.1.3b, 4.2.1.3b, 4.3.1.3b	R0
Route restriction	YIELD	OSM	4.1.1.3b, 4.2.1.3b, 4.3.1.3b	R0
Route feature	BRIDGE	OSM		R0
Route feature	CURVED_ROAD	OSM (derived)	2.1.1.3a, 2.1.1.3b	R0
Route feature	BUS_STOP	OSM	7.1.1.3a	R0
Route feature	HILL_DRIVE	OSM		R0
Route feature	LOWERED_KERB	OSM		R0
Route feature	NARROW_ROAD	VLM
Route feature	PARKING	OSM		R0
Route feature	PEDESTRIAN_CROSSING	OSM	7.2.1.3b	R0
Route feature	TRAFFIC_CALMER	OSM		R0
Route feature	TRAIN_CROSSING	OSM	6.1.1.3a, 6.1.1.3b	R0
Route feature	TRAM_TRACKS	OSM	6.2.1.3a	R0
Route feature	TUNNEL	OSM		R0
Route feature	UNCONTROLLED_PEDESTRIAN_CROSSING	OSM	7.2.1.3b	R0
Route maneuver	ENTERING_MOVING_TRAFFIC	OSM (derived)	4.4.1.3a	R0
Route maneuver	CUTIN	VLM		R3
Route maneuver	LANE_CHANGE	VLM	1.3.1.3a, 1.3.1.3b	R3
Route maneuver	MERGE_IN_OUT_ON_HIGHWAY	OSM	1.1.1.3a, 1.1.1.3b, 1.1.1.3c, 1.2.1.3a, 1.2.1.3b, 1.2.1.3c	R0
Route maneuver	MULTILANE_LEFT	OSM (derived)	4.3.1.3b, 4.3.1.3c, 4.3.1.3d	R0
Route maneuver	MULTILANE_RIGHT	OSM (derived)	4.2.1.3b, 4.2.1.3c, 4.2.1.3d	R0
Route maneuver	PROTECTED_LEFT	OSM (derived)	4.3.1.3c, 4.3.1.3d	R0
Route maneuver	PROTECTED_RIGHT_WITH_BIKE	OSM (derived)	4.2.1.3c, 4.2.1.3d	R0
Route maneuver	RIGHT_BEFORE_LEFT	OSM (derived)	4.1.1.3a, 4.2.1.3a, 4.3.1.3a	R0
Route maneuver	RIGHT_TURN_ON_RED	OSM	4.2.1.3c	R0
Route maneuver	ROUNDABOUT	OSM	5.1.1.3a, 5.1.1.3b	R0
Route maneuver	STRAIGHT	OSM (derived)	8.1.1.3a	R0
Route maneuver	OVER_TAKE	VLM	3.2.1.3a, 3.2.1.3b	R4
Route maneuver	UNPROTECTED_LEFT	OSM (derived)	4.3.1.3a, 4.3.1.3b	R0
Route maneuver	UNPROTECTED_RIGHT_WITH_BIKE	OSM	4.2.1.3a, 4.2.1.3b	R0

OSM = Openstreetmap, VLM= Vision Language Model, derived: Hand crafted rules with OSM data

A.2 LLM prompts

Prompt template and pseudo code for configuring the LLM using groq to parse natural language queries into structured prediction for route features, restrictions and maneuvers with a pydantic model. The natural language queries are constructed to closely follow the turn-by-turn navigation available in most GPS navigation devices.

prompt_template: "You are parsing natural language driving instructions into PyDantic Model's output=model_dump_json(exclude_none=True) as JSON. Here are a few example pairs of instructions and structured output: {examples}. Based on these examples parse the instructions. The JSON must use the schema: {schema}"
groq:
model: llama-3.3-70b-versatile
temperature: 0.0
seed: 1334
response_format: json_object
max_sequence_len: 60000

Example pairs (showing 3 / 30) for in-context learning to steer the structured prediction of LLM, where ParsedInstructionModel is a pydantic model.

PROMPT_PAIRS = [
(
            "Its snowing. Go straight through the intersection, following the right before left rule at unmarked intersection",
        ParsedInstructionModel(
            	eventSequence=[
                    EventType(speed=FloatValue(value=10.0, operator="LT", unit="kph")),
                EventType(osmRouteManeuver="RIGHT_BEFORE_LEFT"),
                		EventType(speed=FloatValue(value=25.0, operator="LT", unit="kph")),
            	],
            turnSignal="OFF",
            weatherCondition="Snow",
        ),
        ),
(
            "stop at the stop sign, give way to the traffic and then turn right",
        ParsedInstructionModel(
            eventSequence=[
                	EventType(osmRouteRestriction="STOP"),
                EventType(turnSignal="RIGHT"),
                	EventType(speed=FloatValue(value=5.0, operator="LT", unit="kph")),
                EventType(osmRouteManeuver="RIGHT"),
            ],
            ),
    ),
    (
            "parking on a hill in the rain on a two lane road",
        ParsedInstructionModel(
            	osmLaneCount=[IntValue(value=2, operator="EQ")],
                osmRouteFeature=["PARKING", "HILL_DRIVE"],
            weatherCondition="Rain",
            ),
    ),
]

EXAMPLES = ""
for idx, (instructions, parsed) in enumerate(PROMPT_PAIRS):
    parsed_json = parsed.model_dump_json(exclude_none=True)
    update = f"instructions: {instructions.lower()} output: {parsed_json}"
    EXAMPLES += update

from groq import Groq
client = Groq(api_key=os.environ.get("GROQ_API_KEY"))

chat_completion = client.chat.completions.create(
                messages=[
                    {
                        "role": "system",
                        "content": prompt_template.format(examples=EXAMPLES, schema=json.dumps(ParsedInstructionModel.model_json_schema(), indent=2))
                    },
                    {
                        "role": "user",
                        "content": f"instructions : its daytime. drive to the traffic lights and when it turns green make a left turn",
                    },
                ],
                model=config["groq"]["model"],
                temperature=config["groq"]['temperature'],
                stream=False,
                seed=config["groq"]['seed'],
                response_format={"type": config['groq']['response_format']},
            )

            parsed_obj = ParsedInstructionModel.model_validate_json(chat_completion.choices[0].message.content)
            parsed_obj = parsed_obj.model_dump(exclude_none=True)

A.2 Data collection hardware

Onboard compute: NVIDIA Jetson AGX Xavier

8 cores @ 2/2.2 GHz, 16/64 GB DDR5
100 TOPS , 8 lanes MIPI CSI-2 D-PHY 2.1 (up to 20Gbps)
8x 1080p30 video encoder (H.265)
Power: 10-15V DC input, ~90W power consumption
Storage: SSD M.2 (4gen PCIe 1x4)
Video input 8 cameras:
- 2x Fakra MATE-AX with 4x GMSL2 with Power-over-Coax support

Onboard compute: Connectivity

Multi-band, Centimeter-level accuracy RTK module
5G connectivity: M.2 USB3 module with maximum downlink rates of 3.5Gbps and uplink rates of 900Mbps, dual SIM

Component	#	Vendor	Specs
RGB: Camera	1	connect-tech	Techspecs
RGB: Rugged Camera	5	connect-tech	Techspecs
GNSS	1	Taoglas	Techspecs
5G antenna	2	2J Antenna	Datasheet
NVIDIA Jetson Orin NX - 64 GB	1	Nvidia	Techspecs