When LLMs Meet PDDL: A Survey of Neurosymbolic Planning
Imagine an earthquake has just struck. Two aerial drones and two ground robots are staged and ready to deploy. Victims need extracting, roads are blocked, medical supplies need delivering, and every second counts. You turn to your favorite LLM and say: "Plan a rescue mission."
What you get back is confident, articulate, and wrong. The LLM sends a ground robot down a blocked road. It tries to extract a victim from a building that hasn't been scanned yet. It forgets the med-kit entirely. The plan reads beautifully. It just doesn't work.
This is the fundamental tension at the heart of LLM-based planning. Large language models are extraordinary at understanding natural language, reasoning about context, and generating plausible-sounding sequences of actions. But "plausible-sounding" is not the same as "valid." When plans have hard constraints (preconditions that must hold, resources that are finite, actions that must happen in a specific order), LLMs tend to hallucinate their way through them like a student who skimmed the textbook five minutes before the exam.
So what if we stopped asking LLMs to be planners, and instead let them be translators?
That's the core idea behind a growing body of neurosymbolic AI research: use LLMs to convert messy, natural language goals into formal planning representations (like PDDL), hand the actual planning off to a classical planner that guarantees valid solutions, and then translate the result back into something a human (or a robot) can act on. The LLM speaks human. The planner speaks logic. Together, they get the job done.
In this post, we'll walk through what PDDL is, why LLMs struggle with planning, how these two very different technologies complement each other, and survey the key papers pushing this field forward. We'll even run through a complete disaster response example end to end.
The Scenario: Earthquake Disaster Response
To make all of this concrete, we'll use a running example throughout the post. Here's the setup:
An earthquake disaster response environment has four zones: a staging area, Zone 1 (north sector), Zone 2, and Zone 3 (south sector). The staging area connects to Zone 1, Zone 1 connects to Zone 2, and Zone 2 connects to Zone 3. Zone 1 contains building-A with one trapped victim, and Zone 3 contains building-B with one trapped victim. The road between Zone 1 and Zone 2 is blocked by debris.
Four robots are available, all starting at the staging area:
- UAV uvl-1 can operate over Zone 1 and Zone 2.
- UAV uvl-2 can operate over Zone 3.
- UGV ground-1 starts with no cargo.
- UGV ground-2 starts carrying a med-kit.
The diagram below shows the starting state of the world: <img src="../assets/starting_conditions_gv.svg" width="600">
The Planning Problem with LLMs
Large language models are next-token predictors, not search algorithms. A classical planner systematically explores a state space, tracking what's true at every step and pruning invalid paths. An LLM generates whatever sounds right based on patterns in its training data. For open-ended text generation, that's a superpower. For multi-step planning with hard constraints, it's a liability.
The failure mode is subtle, which makes it dangerous. LLMs don't produce gibberish plans. They produce confident, well-structured, completely wrong plans. The output reads like it was written by someone who understands the problem. It just wasn't written by something that tracked the problem.
There are three core reasons LLMs struggle here:
They don't maintain world state. Every action in a plan changes the state of the world. A road gets cleared. A zone gets searched. A robot moves. A classical planner updates its state model after each action and checks preconditions against it. An LLM has no such mechanism. It's generating step N based on the vibes of steps 1 through N-1, not based on a verified model of what's currently true.
They skip intermediate steps. Humans do this too, but planners don't. If extracting a victim requires the zone to be searched, and searching requires a UAV survey, an LLM might jump straight from "survey" to "extract" because it understands the goal without internalizing the dependency chain. The result is a plan with invisible gaps that only surface when you try to execute it.
They offer no correctness guarantees. A classical planner either finds a valid plan or reports that none exists. An LLM will always give you an answer. It will never say "this problem is unsolvable." It will never flag that it violated a precondition. It will hand you a broken plan with the confidence of someone who has never been wrong.
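To see what "tracking the problem" means mechanically, here's a toy sketch of the check a planner performs before every single step. The names and representation are ours, purely for illustration, not any real planner's API:

```python
# Toy illustration of the bookkeeping an LLM lacks: an explicit set of
# facts that are currently true, and a hard refusal to apply any action
# whose preconditions fail against it.
state = {
    ("at", "ground-1", "staging"),
    ("blocked", "zone-1", "zone-2"),
}

def applicable(state, positive, negative):
    """An action fires only if every positive precondition is in the
    state and no negative precondition is."""
    return all(p in state for p in positive) and not any(n in state for n in negative)

# An LLM-style hallucinated step: drive ground-1 from zone-1 to zone-2.
print(applicable(
    state,
    positive=[("at", "ground-1", "zone-1")],
    negative=[("blocked", "zone-1", "zone-2")],
))  # False: the robot is actually at staging, and the road is blocked
```

A planner runs this check at every node of its search. An LLM never runs it at all.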
Seeing It in Practice
To make this concrete, we took our earthquake response scenario and handed it to eight different local LLMs, asking each to produce a step-by-step plan satisfying all preconditions. We then validated every plan against the formal action model. For timing, the models were run on a MacBook Pro with an M2 Max chip and 64 GB of memory.
| Model | CoT Reasoning | Think Time | Valid Plan? |
|---|---|---|---|
| mistral-small3.2 (24B) | No | — | ❌ |
| llama3.1 (8B) | No | — | ❌ |
| qwen3.5 (9B) | Yes | 321.8s | ✅ |
| qwen3.5 (35B) | Yes | 414.1s | ✅ |
| qwen3 (8B) | Yes | 114.8s | ❌ |
| gemma4 (26B) | Yes | 171.4s | ✅ |
| gemma4 (31B) | Yes | 103.5s | ✅ |
| nemotron-3-nano | Yes | 66.9s | ✅ |
A few things stand out.
Non-reasoning models failed outright. Both models without chain-of-thought reasoning (Mistral and Llama) produced fundamentally broken plans. Mistral invented actions that don't exist in the domain, like delivering victims back to staging. Llama sent a UAV outside its operational range and teleported a ground robot across zones. These aren't edge cases. They're the kind of errors that get a robot stuck in the field.
Chain-of-thought helps, but isn't a guarantee. Five of the six reasoning models produced valid plans. But Qwen3 8B spent nearly two minutes thinking and still forgot that zones need to be searched before any ground operation can happen. It missed the same precondition three times. More thinking time doesn't help if a constraint never enters the reasoning loop.
The "search-rubble" precondition was the most common trap. It's an intermediate step between surveying a zone and extracting a victim or delivering supplies. LLMs that reasoned about high-level goals ("get the victim out") tended to skip this middle step, because it's not part of the intuitive story. It's a bookkeeping requirement, and LLMs don't do bookkeeping.
Correct plans all converged on the same structure. Every model that produced a valid plan arrived at essentially the same strategy: UAVs survey and scan first, one ground robot handles Zone 1 extraction and road clearing, the other transits to Zone 3 for delivery. The differences were minor (which robot does what), but the dependency ordering was identical. This makes sense: the constraint structure only admits a narrow family of valid plans.
The key takeaway is that LLMs are simply not built to guarantee that a sequence of actions is executable. For that, you need a planner.
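For reference, the validation we ran boils down to simulating each proposed plan against the action model and rejecting it at the first unmet precondition. Here's a stripped-down sketch in that spirit (illustrative Python, not our exact harness, and with only one action modeled):

```python
# Minimal plan validator: simulate every action against the formal model,
# fail on the first unsatisfied precondition.

def move_ground(robot, z1, z2):
    pre = [("at", robot, z1), ("adjacent", z1, z2)]
    add = [("at", robot, z2)]
    delete = [("at", robot, z1)]
    return pre, add, delete

ACTIONS = {"move-ground": move_ground}

def validate(init, plan):
    state = set(init)
    for step, (name, args) in enumerate(plan, 1):
        pre, add, delete = ACTIONS[name](*args)
        missing = [p for p in pre if p not in state]
        if missing:
            return f"INVALID at step {step} ({name}): missing {missing}"
        state = (state - set(delete)) | set(add)  # apply the effects
    return "valid"

init = {("at", "ground-1", "staging"), ("adjacent", "staging", "zone-1")}
# The kind of plan Llama produced: the robot "teleports" past staging.
plan = [("move-ground", ("ground-1", "zone-1", "zone-2"))]
print(validate(init, plan))  # INVALID at step 1 (move-ground): ...
```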
PDDL 101: Teaching Robots to Think Before They Act
PDDL (Planning Domain Definition Language) is a standardized language for describing planning problems. Think of it as a way to tell a planner "here's what the world looks like, here's what actions are possible, and here's what I want to achieve." The planner's job is to figure out the sequence of actions that gets from A to B, guaranteed to be valid.
If you've worked with declarative languages or constraint solvers, PDDL will feel familiar. You describe what you want, not how to get it. The planner handles the "how."
PDDL splits every planning problem into two files: a domain file and a problem file. The domain defines the rules of the world (what types of things exist, what's true about them, and what actions can change that truth). The problem file defines a specific scenario within that world (which objects are in play, what's currently true, and what you want to be true when the plan is done). One domain can support many different problem files, the same way one game engine can run many different levels.
We'll use the STRIPS fragment of PDDL, named after the Stanford Research Institute Problem Solver and the most widely used formalism in the LLM + planning literature. STRIPS keeps things simple: states are sets of true/false propositions, and actions flip propositions on or off. If you see :requirements :strips :typing at the top of a domain file, that's what's going on.
The Domain File: Rules of the World
Let's look at our earthquake response domain. Here's the skeleton:
(define (domain EarthquakeResponse)
(:requirements :strips :typing)
(:types ...)
(:predicates ...)
(:action ...
:parameters (...)
:precondition (...)
:effect (...))
(:action ...
:parameters (...)
:precondition (...)
:effect (...))
...
)
Everything in PDDL is an S-expression (yes, it's parentheses all the way down). Let's break down each section.
Types define the categories of objects in your world. If you're a programmer, think of these as your class hierarchy:
(:types uav ground_robot victim zone building supply_crate)
Every object in the problem will be one of these types, and actions will use types to constrain which parameters are valid.
Predicates are boolean propositions that describe the state of the world. They're your fact database. Each predicate has a name and typed parameters; note that variable names start with ?:
(:predicates
  (at ?agent - ground_robot ?zone - zone)          ; where is a ground robot?
  (surveyed ?zone - zone)                          ; has this zone been aerially surveyed?
  (blocked ?z1 - zone ?z2 - zone)                  ; is the road between two zones blocked?
  (victim_at ?v - victim ?b - building ?z - zone)  ; where is a victim?
  (extracted ?v - victim)                          ; has this victim been rescued?
...
)
At any point during the plan, each grounded predicate (e.g., (surveyed zone-1)) is either true or false. The entire set of true predicates is the state.
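In code terms, a state is nothing more exotic than a set of tuples. A tiny illustrative sketch:

```python
# A state is just the set of grounded predicates that are currently true.
# Everything absent from the set is false.
zones = ["staging", "zone-1", "zone-2", "zone-3"]
state = {("surveyed", "zone-1")}

for z in zones:
    print(("surveyed", z), ("surveyed", z) in state)
# Only ('surveyed', 'zone-1') prints True; the rest are false by omission.
```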
Actions are where the magic happens. Each action has three parts:
- Parameters: the typed variables the action operates on
- Preconditions: what must be true before the action can execute
- Effects: what changes after the action executes
Here's the extract-victim action from our domain:
(:action extract-victim
:parameters (?robot - ground_robot ?v - victim ?b - building ?z - zone)
:precondition (and
    (at ?robot ?z)          ; robot must be in the zone
    (searched ?z)           ; zone must have been searched
    (thermal_scanned ?b)    ; building must have been thermally scanned
    (victim_at ?v ?b ?z))   ; victim must actually be here
  :effect (and
    (extracted ?v)))        ; victim is now extracted
This is where PDDL earns its keep. That and in the precondition means all four conditions must hold before a robot can extract a victim. No shortcuts, no hallucinating. A classical planner will never generate a plan that attempts extraction without the zone being searched and the building being scanned first.
The key logical operators you'll encounter in STRIPS-style PDDL are:
- and: all conditions must hold (used in both preconditions and effects)
- not: negates a predicate (used in effects to "delete" a fact from the state)
For example, when a ground robot moves, we need to update its location:
(:action move-ground
:parameters (?robot - ground_robot ?z1 ?z2 - zone)
:precondition (and (at ?robot ?z1) (adjacent ?z1 ?z2) (not (blocked ?z1 ?z2)))
:effect (and (at ?robot ?z2) (not (at ?robot ?z1))))
Notice not appears in two different roles here. In the precondition, (not (blocked ?z1 ?z2)) checks that the road isn't blocked. In the effect, (not (at ?robot ?z1)) removes the robot from its old location. The robot can't be in two places at once, and PDDL makes sure of it. (One technicality: negated preconditions step slightly outside pure STRIPS, so most planners expect :negative-preconditions added to the domain's :requirements line.)
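Mechanically, an effect is just an add list and a delete list applied to the state set. A minimal sketch:

```python
# Applying move-ground's effect: delete the old location fact, add the new.
state = {("at", "ground-1", "zone-1"), ("adjacent", "zone-1", "zone-2")}

delete_list = {("at", "ground-1", "zone-1")}   # (not (at ?robot ?z1))
add_list = {("at", "ground-1", "zone-2")}      # (at ?robot ?z2)

state = (state - delete_list) | add_list
print(("at", "ground-1", "zone-1") in state)   # False: old location gone
print(("at", "ground-1", "zone-2") in state)   # True: new location recorded
```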
The Problem File: A Specific Scenario
The problem file takes the domain's rules and sets up a concrete situation:
(define (problem EarthquakeResponseProblem)
(:domain EarthquakeResponse)
(:objects
uvl-1 uvl-2 - uav
ground-1 ground-2 - ground_robot
staging zone-1 zone-2 zone-3 - zone
building-a building-b - building
victim-1 victim-2 - victim
med-kit-1 - supply_crate)
(:init ...)
(:goal ...)
)
Objects are the typed instances that exist in this specific scenario. We have two UAVs, two ground robots, four zones, two buildings, two victims, and one supply crate. A different earthquake might have more zones, more victims, different resources. The domain stays the same; only the problem file changes.
Init defines the initial state, the set of predicates that are true at the start. Anything not listed is false (this is called the "closed-world assumption"):
(:init
(at ground-1 staging)
(at ground-2 staging)
(has_supply ground-2 med-kit-1)
(adjacent staging zone-1)
(adjacent zone-1 zone-2)
(adjacent zone-2 zone-3)
(blocked zone-1 zone-2)
(building_in_zone building-a zone-1)
(building_in_zone building-b zone-3)
(victim_at victim-1 building-a zone-1)
(victim_at victim-2 building-b zone-3)
(uav_can_survey uvl-1 zone-1)
(uav_can_survey uvl-1 zone-2)
(uav_can_survey uvl-2 zone-3)
)
Reading this, you can reconstruct the entire scenario: both robots start at staging, ground-2 has the med-kit, the road between zone-1 and zone-2 is blocked, and so on. If (surveyed zone-1) isn't in the init, then zone-1 hasn't been surveyed yet.
Goal defines what must be true for the plan to succeed:
(:goal (and
(extracted victim-1)
(extracted victim-2)
(delivered med-kit-1 zone-3)))
Both victims extracted, med-kit delivered. The planner's job is to find a sequence of actions that transforms the init state into a state where the goal holds, with every action's preconditions satisfied along the way. That's it. No more, no less.
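In the same set-of-facts terms as before, the goal test is just a subset check over the final state. A sketch:

```python
# Goal satisfaction is a subset test: every goal atom must appear in the
# state the plan produces. (Real planners test this during search.)
goal = {
    ("extracted", "victim-1"),
    ("extracted", "victim-2"),
    ("delivered", "med-kit-1", "zone-3"),
}
# The final state also contains leftover facts the goal doesn't mention.
final_state = goal | {("at", "ground-1", "zone-3"), ("surveyed", "zone-1")}
print(goal <= final_state)  # True: the plan achieved everything required
```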
The Full Pipeline: From English to Rescue Plan
Now let's put it all together. We'll walk through exactly what happens when you feed a natural language mission briefing through the LLM-as-translator pipeline, step by step.
Step 1: Natural Language Input
The human operator's briefing is the one given in The Scenario: Earthquake Disaster Response section above. It's messy, information-dense, and full of implicit structure. An LLM is very good at parsing text like this. A classical planner would have no idea what to do with it.
Step 2: LLM Generates PDDL
The LLM's job is translation, not planning. It reads the natural language spec and produces two files.
The domain file defines the rules of the world: what types of things exist, what actions are possible, and what conditions must hold before and after each action. Here's a snippet from the generated domain:
(:action extract-victim
:parameters (?robot - ground_robot ?v - victim ?b - building ?z - zone)
:precondition (and (at ?robot ?z) (searched ?z)
(thermal_scanned ?b) (victim_at ?v ?b ?z))
:effect (and (extracted ?v)))
Read that out loud and it almost sounds like English: "A ground robot can extract a victim from a building in a zone, if the robot is at that zone, the zone has been searched, the building has been thermal-scanned, and the victim is actually there." Every precondition that an LLM might skip over is explicitly encoded.
The problem file defines the specific situation: where everything starts, what exists, and what the goal is.
(:init
(at ground-1 staging) (at ground-2 staging)
(has_supply ground-2 med-kit-1)
(adjacent staging zone-1) (adjacent zone-1 zone-2) (adjacent zone-2 zone-3)
(blocked zone-1 zone-2)
(building_in_zone building-a zone-1) (building_in_zone building-b zone-3)
(victim_at victim-1 building-a zone-1) (victim_at victim-2 building-b zone-3)
(uav_can_survey uvl-1 zone-1) (uav_can_survey uvl-1 zone-2)
(uav_can_survey uvl-2 zone-3))
(:goal (and (extracted victim-1) (extracted victim-2)
(delivered med-kit-1 zone-3)))
Every fact from the operator's briefing is now a formal predicate. The blocked road, the UAV coverage areas, the med-kit assignment, the victim locations. Nothing is left to interpretation.
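If you're curious what this translation call looks like in practice, here's a minimal sketch assuming an OpenAI-style chat client. The model name, prompt wording, and function are ours for illustration, not taken from any of the surveyed systems:

```python
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You translate disaster-response briefings into PDDL problem files. "
    "Use only the types and predicates defined in the provided domain. "
    "Output raw PDDL only, with no commentary."
)

def briefing_to_problem_pddl(domain_pddl: str, briefing: str) -> str:
    """Hypothetical translation step: natural language in, PDDL out."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative choice of model
        temperature=0,   # we want determinism here, not creativity
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",
             "content": f"Domain:\n{domain_pddl}\n\nBriefing:\n{briefing}"},
        ],
    )
    return response.choices[0].message.content
```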
Step 3: Classical Planner Solves It
Now the LLM steps aside entirely. We hand both files to Fast Downward, a state-of-the-art classical planner, and let it search for a valid plan. Fast Downward doesn't guess. It doesn't hallucinate. It systematically explores the state space, checking preconditions at every step, and either finds a provably valid plan or reports that none exists.
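In practice this is a subprocess call. A sketch, assuming fast-downward.py is on your PATH; the search configuration shown is one common choice for optimal planning:

```python
import subprocess
from pathlib import Path

# Run Fast Downward with A* and the admissible LM-cut heuristic, which
# guarantees an optimal plan under unit costs. The plan lands in sas_plan.
subprocess.run(
    ["fast-downward.py", "domain.pddl", "problem.pddl",
     "--search", "astar(lmcut())"],
    check=True,
)

for line in Path("sas_plan").read_text().splitlines():
    print(line)
```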
For our earthquake scenario, it returns this in milliseconds:
(aerial-survey uvl-1 zone-1)
(aerial-survey uvl-2 zone-3)
(thermal-scan uvl-2 building-b zone-3)
(thermal-scan uvl-1 building-a zone-1)
(move-ground ground-1 staging zone-1)
(search-rubble ground-1 zone-1)
(extract-victim ground-1 victim-1 building-a zone-1)
(clear-road ground-1 zone-1 zone-2)
(move-ground ground-1 zone-1 zone-2)
(move-ground ground-1 zone-2 zone-3)
(search-rubble ground-1 zone-3)
(extract-victim ground-1 victim-2 building-b zone-3)
(move-ground ground-2 staging zone-1)
(move-ground ground-2 zone-1 zone-2)
(move-ground ground-2 zone-2 zone-3)
(deliver-supplies ground-2 med-kit-1 zone-3)
; cost = 16 (unit cost)
Sixteen actions. Every precondition satisfied. No robot moves through a blocked road. No victim is extracted before their building is scanned. No supplies are delivered before the zone is searched. The plan is optimal under unit cost.
Compare this to the LLM attempts from the previous section, where models invented nonexistent actions, teleported robots, and forgot entire precondition chains.
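And if you want belt-and-suspenders verification, the VAL plan validator (from KCL-Planning/VAL) can independently confirm the plan against the domain, assuming its Validate binary is installed:

```python
import subprocess

# Independent check with VAL: it replays the plan against the domain and
# reports success or the first violated precondition.
result = subprocess.run(
    ["Validate", "domain.pddl", "problem.pddl", "sas_plan"],
    capture_output=True, text=True,
)
print(result.stdout)
```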
Step 4: Translation Back to Natural Language
The raw plan is perfectly valid but not particularly readable for a human operator coordinating a disaster response. So the LLM gets one more job: translate the plan back into structured, actionable English.
Phase 1: Aerial Reconnaissance
1. UAV uvl-1 performs an aerial survey of Zone 1.
2. UAV uvl-2 performs an aerial survey of Zone 3.
3. UAV uvl-2 performs a thermal scan of building-B in Zone 3.
4. UAV uvl-1 performs a thermal scan of building-A in Zone 1.
Phase 2: Ground Operations — Robot ground-1 (Victim Extraction)
5. ground-1 moves from staging to Zone 1.
6. ground-1 searches rubble in Zone 1.
7. ground-1 extracts victim-1 from building-A.
8. ground-1 clears the blocked road between Zone 1 and Zone 2.
9. ground-1 transits through Zone 2 to Zone 3.
10. ground-1 searches rubble in Zone 3.
11. ground-1 extracts victim-2 from building-B.
Phase 2: Ground Operations — Robot ground-2 (Supply Delivery)
12. ground-2 moves from staging through Zone 1 and Zone 2 to Zone 3.
13. ground-2 delivers med-kit-1 to Zone 3.
The LLM adds structure that the planner doesn't care about (phasing, grouping by robot, descriptive labels) but that makes the plan legible to a human team. This is exactly the kind of work LLMs excel at.
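The structural part of that job is almost trivial to sketch; the LLM's real contribution is the fluent phrasing layered on top. Illustrative code:

```python
from collections import defaultdict

# Group the planner's raw steps by acting robot -- the skeleton of the
# phased briefing above -- before asking the LLM to phrase each group.
plan = [
    "(aerial-survey uvl-1 zone-1)",
    "(move-ground ground-1 staging zone-1)",
    "(search-rubble ground-1 zone-1)",
    "(move-ground ground-2 staging zone-1)",
]

by_agent = defaultdict(list)
for line in plan:
    action, agent, *args = line.strip("()").split()
    by_agent[agent].append((action, args))

for agent, steps in by_agent.items():
    print(agent, "->", [a for a, _ in steps])
# uvl-1 -> ['aerial-survey']
# ground-1 -> ['move-ground', 'search-rubble']
# ground-2 -> ['move-ground']
```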
What Just Happened
Each component did what it's best at:
- The LLM parsed ambiguous natural language into formal representations, and later turned a formal plan into readable instructions.
- The classical planner found an optimal, provably valid action sequence.
- Neither could have done the other's job. The LLM can't guarantee valid plans. The planner can't read English.
This is the neurosymbolic handshake. And the research community has been exploring many variations of it.
Survey of Approaches
The research in this space clusters into a few distinct strategies, all sharing the same intuition: let LLMs handle language, and let something more rigorous handle planning.
Cluster 1: LLM as PDDL Translator
The most direct approach is also the most intuitive. Give the LLM a natural language goal, have it produce a PDDL problem file, and hand that to a classical planner.
LLM+P (Liu et al., 2023) established this pattern. The pipeline is straightforward: natural language in, PDDL out, Fast Downward solves it, and the LLM translates the plan back to English. With just one in-context example, GPT-4 could produce valid PDDL problem files that led to 85-100% success rates across benchmark domains. Without the planner, GPT-4 alone failed on nearly all of them. Real-robot demos confirmed the planner found optimal plans (cost 22) where the LLM alone produced suboptimal ones (cost 31).
The End-to-End Planning Framework with Agentic LLMs and PDDL (La Malfa et al., 2025) scaled this idea up with an agentic architecture. An orchestrator LLM generates the PDDL, but specialized sub-agents (syntax fixer, constraint verifier, temporal consistency checker) iteratively refine it before planning. The most frequently called agent? The syntax fixer. LLMs consistently produce malformed PDDL on the first pass, and the refinement loop is what makes the system work. This framework improved accuracy by ~12% on Google Natural Plan and solved 7-disk Tower of Hanoi at 90% accuracy, a problem where LLMs alone are hopeless.
PIP-LLM (Shi et al., 2024) extended the pattern to multi-robot teams by adding an integer programming layer. The LLM translates the command into a team-level PDDL problem (deliberately abstracting away individual robot assignments), a classical planner solves it, and then IP optimization assigns specific robots to subtasks based on capabilities and travel costs. On complex tasks, non-reasoning baselines scored 0% success while PIP-LLM hit 87%.
Cluster 2: LLM as Program Synthesizer
What if the LLM didn't just translate one problem, but wrote a reusable program that could solve any problem in a domain?
Generalized Planning in PDDL Domains with Pretrained Large Language Models (Silver et al., 2023) prompted GPT-4 to synthesize Python programs that act as generalized planners for PDDL domains. The pipeline is a three-stage prompt chain: summarize the domain, propose a strategy in natural language, then implement it in code. Automated debugging (feeding errors back to GPT-4 for up to 4 rounds) proved critical. One surprising finding: human-readable PDDL names mattered enormously. Strip the meaningful names and performance collapsed, because the LLM was leaning on language understanding, not symbolic reasoning.
Language Models For Generalised PDDL Planning (Chen et al., 2025) pushed this further, generating Python programs that serve as either policies (pick the next action) or value functions (score states for search). Their best portfolio solved 630 out of 900 benchmark problems, beating the LAMA planner's 557. And here's the twist: LLMs sometimes planned better when predicates were replaced with meaningless symbols. Silver et al.'s assumption that LLMs need semantic names to reason may not always hold.
Cluster 3: Grounded and Embodied Planning
Not every environment has a tidy PDDL domain file waiting to be filled in. These approaches keep the propose-and-validate pattern (the LLM proposes, something else checks) but swap the formal planner for learned feasibility models and simulators.
SayCan (Ahn et al., 2022) introduced the template: the LLM suggests actions ("Say"), and a learned affordance model scores whether each action is physically feasible right now ("Can"). This grounds the LLM's proposals in the robot's actual capabilities. SayPlan (Rana et al., 2023) scaled this to large multi-room environments using 3D scene graphs, with an iterative replanning loop where a simulator validates the plan and sends textual feedback when actions fail. Without this loop, executability on long-horizon tasks dropped from near-perfect to 13.3%.
SayCanPay (Hazra et al., 2024) added a third dimension: long-term cost. Beyond asking "can I do this action?", it asks "should I do this action, or will it lead to a dead end?" Beam search over actions (not tokens) selects plans that balance feasibility and payoff. It's a middle ground between pure LLM generation and full PDDL formalization, trading correctness guarantees for flexibility in environments where authoring a complete domain model isn't practical.
Cluster 4: Multi-Agent and Long-Horizon Coordination
Scaling to multiple heterogeneous robots with long task chains is where things get truly hard.
LaMMA-P (2024) decomposes natural language instructions into subtasks, allocates them across robots based on skill sets, and generates per-robot PDDL problem files solved by Fast Downward. Even with Llama 3.1 8B as the backbone, the framework outperformed pure CoT prompting with GPT-4o, demonstrating that architecture matters more than model size. EmboTeam (Zeng et al., 2026) took this further by compiling plans into parallel behavior trees with precondition checks and fallback mechanisms, improving task success from 12% to 55% over the LaMMA-P baseline. The classical planner proved especially critical for temporal dependencies, where removing it caused goal condition recall to drop from 0.62 to 0.22.
Cluster 5: PDDL as Scaffolding for Reinforcement Learning
Finally, PDDL's usefulness extends beyond LLM translation entirely. PaRL (Lee et al.) uses PDDL operators to define options (temporally extended actions) in a hierarchical RL framework. An AI planner selects which option to execute, while low-level RL agents learn the actual control policies. On complex multi-room navigation tasks, their approach was the only algorithm that showed consistent learning progress where flat RL methods stalled. It's a reminder that PDDL isn't just a translation target for language models. It's a general-purpose abstraction for decomposing complex tasks into manageable pieces.
Challenges and Open Questions
The neurosymbolic pipeline is compelling, but it's not a solved problem. Several hard challenges remain, and some of them are fundamental rather than engineering details.
Domain authoring is still the bottleneck (mostly). The majority of systems in this space assume a human expert has already written the PDDL domain file, the one that defines all the actions, preconditions, and effects. The LLM's job is limited to generating the problem file: the specific objects, initial state, and goals. That's the easier half of the translation. Writing a correct domain model requires deep understanding of the environment's mechanics, and a missing precondition or a subtly wrong effect can produce plans that are technically "valid" but catastrophically wrong in practice. The End-to-End Planning Framework is pushing into this territory by generating both domain and problem files, but their most frequently called sub-agent was the syntax fixer, which tells you something about how reliably LLMs handle domain authoring on the first pass.
PDDL generation is fragile even for problem files. Getting the problem file right isn't trivial either. LLMs hallucinate predicates that don't exist in the domain, omit initial conditions, and occasionally produce syntactically creative PDDL that no parser has ever seen before. Every successful system in the survey relies on some form of validation loop (syntax checking, constraint verification, planner feedback) to catch and repair these errors. The pipeline works, but it works because of the guardrails, not in spite of them.
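The guardrail pattern itself is simple to state in code, even though the components vary system to system. In this sketch, generate, check, and repair are hypothetical caller-supplied hooks standing in for the LLM call, a PDDL parser plus unknown-predicate checker, and an error-feedback prompt:

```python
def generate_with_repair(briefing, domain, generate, check, repair,
                         max_rounds=4):
    """The common loop: generate PDDL, validate, feed errors back, retry.
    generate/check/repair are hypothetical caller-supplied functions."""
    pddl = generate(briefing, domain)
    for _ in range(max_rounds):
        errors = check(pddl, domain)    # syntax + unknown-predicate errors
        if not errors:
            return pddl                 # clean: safe to hand to the planner
        pddl = repair(pddl, errors)     # the LLM sees its own errors, retries
    raise ValueError("could not produce parseable PDDL within the budget")
```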
Classical planning assumes you know everything. PDDL's default setting is full observability: the planner knows the complete state of the world at every step. Our earthquake scenario knew exactly where victims were, which roads were blocked, and what each robot was carrying. Real disaster response involves fog of war. Victims might be in unexpected locations. Roads that looked clear might be blocked. PDDL extensions for partial observability and contingent planning exist, but planner support is limited and computational costs increase dramatically.
Time and resources break the abstraction. Our scenario used unit-cost actions, where clearing a road takes the same "time" as moving between zones. Real operations care deeply about durations, fuel consumption, battery levels, and payload capacity. PDDL 2.1 and later versions support temporal and numeric planning, but the ecosystem of efficient planners thins out considerably once you leave the STRIPS fragment. This is an active area of planner development, but today there's a real gap between what PDDL can express and what planners can efficiently solve.
Agentic architectures are sensitive to their own design. The End-to-End framework discovered that adding or removing a single sub-agent (like a temporal consistency checker) dramatically changed how the orchestrator LLM behaved. These systems are not yet plug-and-play. Small architectural decisions ripple through the whole pipeline in unpredictable ways, which makes systematic evaluation and reproducibility harder than it should be.
The open frontier. The progression across the papers we surveyed is clear. LLM+P had the LLM write problem files. The End-to-End framework had it write domain files too. Chen et al. had it write reusable planning programs. The trajectory points toward systems where the LLM handles the entire formalization pipeline, from ambiguous human intent all the way to executable PDDL, with minimal human intervention. Whether that's achievable with current architectures or requires fundamentally new approaches to LLM grounding is an open question.
Where This Is Headed
The core lesson from this survey is simple: LLMs and classical planners are better together than either is alone. LLMs bring fluency, flexibility, and the ability to interface with humans in natural language. Planners bring guarantees, optimality, and the kind of rigorous state tracking that no amount of next-token prediction can replicate. The neurosymbolic pipeline is a division of labor that makes both sides more useful. What's striking is how quickly the field is moving. In 2023, LLM+P showed that a single in-context example was enough to get GPT-4 producing valid PDDL problem files. By the end of 2025, agentic frameworks were generating entire domains, debugging their own syntax, and solving problems that stump LLMs on their own.
Our earthquake scenario had four robots, four zones, and seven action types. Fast Downward solved it in milliseconds. But real disaster response involves hundreds of agents, thousands of locations, and action models with temporal constraints, resource limits, and partial information. The beauty of this architecture is that complexity scales in the planner, not in the LLM's context window. Whether the operator describes a four-robot building search or a city-wide relief coordination across fifty teams, the LLM's translation job stays roughly the same size. The hard work shifts to the component that was built for hard work. That's the whole point.
References
Liu, B., Jiang, Y., Zhang, X., Liu, Q., Zhang, S., Biswas, J., & Stone, P. (2023). LLM+P: Empowering Large Language Models with Optimal Planning Proficiency. https://arxiv.org/pdf/2304.11477
La Malfa, E., Zhu, P., Marro, S., Bernardini, S., & Wooldridge, M. (2025). An End-to-End Planning Framework with Agentic LLMs and PDDL. University of Oxford. https://arxiv.org/pdf/2512.09629
Shi, G., Wu, Y., Kumar, V., & Sukhatme, G. S. (2024). PIP-LLM: Integrating PDDL-Integer Programming with LLMs for Coordinating Multi-Robot Teams Using Natural Language. University of Southern California & University of Pennsylvania. https://arxiv.org/pdf/2510.22784
Silver, T., Dan, S., Srinivas, K., Tenenbaum, J., Kaelbling, L., & Katz, M. (2023). Generalized Planning in PDDL Domains with Pretrained Large Language Models. MIT & IBM Research. https://arxiv.org/pdf/2305.11014
Chen, D. Z., Zenn, J., Cinquin, T., & McIlraith, S. A. (2025). Language Models For Generalised PDDL Planning: Synthesising Sound and Programmatic Policies. Vector Institute / University of Toulouse / University of Toronto. https://arxiv.org/pdf/2508.18507
Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O., David, B., Finn, C., et al. (2022). Do As I Can, Not As I Say: Grounding Language in Robotic Affordances. https://arxiv.org/pdf/2204.01691
Rana, K., Haviland, J., Garg, S., Abou-Chakra, J., Reid, I., & Suenderhauf, N. (2023). SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning. Conference on Robot Learning (CoRL). https://arxiv.org/pdf/2307.06135
Hazra, R., Zuidberg Dos Martires, P., & De Raedt, L. (2024). SayCanPay: Heuristic Planning with Large Language Models using Learnable Domain Knowledge. Örebro University & KU Leuven. https://arxiv.org/pdf/2308.12682
LaMMA-P. (2024). LaMMA-P: Generalizable Multi-Agent Long-Horizon Task Allocation and Planning with LM-Driven PDDL Planner. https://arxiv.org/pdf/2409.20560
Zeng, H., Wang, M., & Li, P. (2026). EmboTeam: Grounding LLM Reasoning into Reactive Behavior Trees via PDDL for Embodied Multi-Robot Collaboration. University of Chinese Academy of Sciences / Institute of Software, CAS. https://arxiv.org/pdf/2601.11063
Lee, J., Katz, M., Agravante, D. J., Liu, M., Nangue Tasse, G., Klinger, T., & Sohrabi, S. Hierarchical Reinforcement Learning with AI Planning Models. IBM Research AI. https://arxiv.org/pdf/2203.00669
Haslum, P., Lipovetzky, N., Magazzeni, D., & Muise, C. (2019). An Introduction to the Planning Domain Definition Language. https://link.springer.com/book/10.1007/978-3-031-01584-7