%
% This is an example file and is hereby explicitly put in the
% public domain.
%
\documentclass[ppgc,diss,english]{iiufrgs}
% To use this template, specify the program and the document type.
% Programs:
% * cic -- Undergraduate Program in Computer Science
% * ecp -- Undergraduate Program in Computer Engineering
% * ppgc -- Graduate Program in Computing
% * pgmicro -- Graduate Program in Microelectronics
%
% Document types:
% * tc -- Undergraduate Final Project (cic and ecp only)
% * diss or mestrado -- Master's Dissertations (ppgc and pgmicro)
% * tese or doutorado -- Doctoral Theses (ppgc and pgmicro)
% * ti -- Individual Study (ppgc and pgmicro)
%
% Other options:
% * english -- for texts in English
% * openright -- forces chapters to start on odd-numbered pages (library
%   default)
% * oneside -- disables two-sided printing
% * nominatalocal -- reads the nominata data from the file nominatalocal.def
% Use unicode
\usepackage[utf8]{inputenc} % package for accented characters
%\usepackage[hyperpageref]{backref}
% Required to include figures
\usepackage{graphicx} % package to import figures
\usepackage{tikz}
\usetikzlibrary{positioning, calc, decorations.pathreplacing}
\usepackage{listofitems} % for \readlist to create arrays
\usetikzlibrary{arrows, arrows.meta} % for arrow size
\usepackage[outline]{contour} % glow around text
\contourlength{1.4pt}
\tikzset{>=latex} % for LaTeX arrow head
\colorlet{myred}{red!80!black}
\colorlet{myblue}{blue!80!black}
\colorlet{mygreen}{green!60!black}
\colorlet{myorange}{orange!70!red!60!black}
\colorlet{mydarkred}{red!30!black}
\colorlet{mydarkblue}{blue!40!black}
\colorlet{mydarkgreen}{green!30!black}
\tikzset{ % node styles, numbered for easy mapping with \nstyle
my nodes/.style={circle,inner sep=0,minimum size=8mm},
input/.style={my nodes,draw=mygreen,fill=mygreen!20,text=green!50!black},
hidden/.style={my nodes,draw=violet,fill=violet!20,text=violet},
output/.style={my nodes,draw=myred!60!black,fill=myred!20,text=red!60!black},
my text/.style={text=#1,text width=1cm,align=center}
}
\newcommand{\textred}[1]{{\color{myred} #1}}
\usepackage{nicefrac} % compact symbols for 1/2, etc.
\usepackage{amsfonts} % blackboard math symbols
\usepackage{xspace}
\usepackage{amsmath}
\usepackage{amsthm}
\usepackage{amssymb}
\usepackage{algorithm}
\usepackage{algpseudocode}
\usepackage{varioref}
\usepackage[capitalise]{cleveref}
\usepackage{booktabs} % professional-quality tables
% \usepackage{times} % package for the Adobe Times font
\usepackage{palatino}
% \usepackage{mathptmx} % to use the Adobe Times font in formulas
%
\usepackage[alf,abnt-emphasize=bf]{abntex2cite} % package for ABNT-style citations
\usepackage{subcaption}
\input{definitions}
\providecommand{\floor}[1]{\ensuremath{\left\lfloor #1\right\rfloor}}
\providecommand{\ceil}[1]{\ensuremath{\left\lceil #1\right\rceil}}
\newtheorem{theorem}{Theorem}
\newtheorem{proposition}{Proposition}
\newtheorem{definition}{Definition}
\newtheorem{property}{Property}
% Hack to make todonotes use only the right margin
\makeatletter
\@mparswitchfalse%
\makeatother
\normalmarginpar
\usepackage[textsize=tiny,colorinlistoftodos,prependcaption,textwidth=width]{todonotes}
\newcommand{\mr}[2][noinline]{\todo[#1,fancyline,color=blue!20]{#2}}
\newcommand{\mri}[2][inline]{\todo[#1,fancyline,color=blue!20]{#2}}
\newcommand{\agp}[2][noinline]{\todo[color=orange!60,linecolor={orange!100},#1,fancyline,author=André]{#2}}
\newcommand{\agpi}[2][inline]{\todo[color=orange!60,linecolor={orange!100},#1,fancyline,author=André]{#2}}
\newcommand{\pp}[2][noinline]{\todo[color=purple!50,linecolor={purple!100},#1,fancyline,author=Pedro]{#2}}
\newcommand{\ppi}[2][inline]{\todo[color=purple!50,linecolor={purple!100},#1,fancyline,author=Pedro]{#2}}
% nominata
\renewcommand{\nominataReit}{Prof.~Carlos Andr{\'e} Bulh{\~o}es}
% \renewcommand{\nominataReitname}{Rector}
\renewcommand{\nominataPRCA}{Prof\textsuperscript{a}.~Patricia Pranke}
% \renewcommand{\nominataPRCAname}{Vice-Rector}
\renewcommand{\nominataPRAPG}{Prof.~J{\'u}lio Ot{\'a}vio Jardim Barcellos}
% \renewcommand{\nominataPRAPGname}{Dean of Graduate Studies}
\renewcommand{\nominataDir}{Prof\textsuperscript{a}.~Carla Maria Dal Sasso Freitas}
% \renewcommand{\nominataDirname}{Director of the Institute of Informatics}
\renewcommand{\nominataCoordPPGC}{Prof.~Alberto Egon Schaeffer Filho}
% \renewcommand{\nominataCoordnamePPGC}{Coordinator of the PPGC}
\renewcommand{\nominataBibchefe}{Alexsander Borges Ribeiro}
% \renewcommand{\nominataBibchefename}{Chief Librarian of the Institute of Informatics}
%
% General information
%
\title{Discovering and Learning Preferred Operators for Classical Planning with Neural Networks}
\translatedtitle{Descoberta e Aprendizado de Operadores Preferidos para Planejamento Clássico com Redes Neurais}
\author{Minini}{Pedro Probst}
% some documents may have multiple authors:
%\author{Flaumann}{Frida Gutenberg}
%\author{Flaumann}{Klaus Gutenberg}
% advisor and co-advisor are optional (don't tell them that :))
\advisor[Prof.~Dr.]{Rolf Peter Ritt}{Marcus}
%\coadvisor[Prof.~Dr.]{Pereira}{André Grahl}
% the date must be the defense date; if not specified, the current
% month and year are used
\date{July 3rd}{2023}
% Specific commands for some document types are presented below.
% Research Report [rp]:
% \rp{123} % RP number
% \financ{CNPq, CAPES} % funding agencies
% Individual Study [ti]:
% \ti{123} % TI number
% \ti[II]{456} % in case it is the second TI
%
% keywords
% start all keywords with a capital letter
%
\keyword{Classical planning}
\keyword{Heuristic search}
\keyword{Preferred operators}
\keyword{Machine learning}
%
% keywords in the foreign language
% start all keywords with a capital letter
%
\translatedkeyword{Planejamento clássico}
\translatedkeyword{Busca heurística}
\translatedkeyword{Operadores preferidos}
\translatedkeyword{Aprendizado de máquina}
%
% start of the document
%
\begin{document}
% title page
% Sometimes it is necessary to redefine a command right before
% producing the title page:
% \renewcommand{\coordname}{Coordenadora do Curso}
\maketitle
% dedication
\clearpage
\begin{flushright}
\mbox{}\vfill
{\sffamily\itshape
``All this from a slice of gabagool?''\\}
--- \textsc{Tony Soprano}
\end{flushright}
% acknowledgements
\chapter*{Acknowledgements}
I want to express my sincere gratitude to Marcus Ritt and André Grahl Pereira for their support and guidance throughout this two-year journey. I consider myself incredibly fortunate to have had such dedicated mentors.
I also extend my appreciation to my colleague, Rafael Vales Bettker, whose tireless dedication and ability to program long hours into the night have contributed significantly to the extensibility and success of our research. It was a pleasure to work alongside such a responsible individual.
Lastly, I want to express my deep gratitude to my mother for being the only person I consider family.
% abstract in english
\begin{abstract}
In a planning task, an agent must choose the most efficient action from a potentially large set of actions at each step. During a heuristic search, logic-based planners use preferred operators to reduce the branching factor significantly. This work presents a method for sampling and learning preferred operators, aiming for their applicability across the entire state space of a planning task. We demonstrate that these learned preferred operators have competitive results compared to the current best logic-based approach.
Our objective is to identify ideal preferred operators, situated along the shortest paths leading to some goal. However, due to the huge size of search state spaces, we introduce a novel sampling strategy tailored for extracting preferred operators that approximate the ideal ones. Our research shows we can obtain high-quality preferred operators from a sample set covering a fraction of the state space.
To understand this new category of preferred operators, we conduct controlled experiments using planning tasks where we have access to the entire state space with perfect cost-to-goal estimates. We systematically compare the proposed approach to baselines, evaluate the effectiveness of preferred operators learned from several sample set sizes, and assess their performance when combined with different heuristic functions.
\end{abstract}
% abstract in portuguese
\begin{translatedabstract}
Em uma tarefa de planejamento, um agente deve escolher a ação mais eficiente de um conjunto potencialmente grande de ações em cada passo. Durante uma busca heurística, planejadores lógicos usam operadores preferidos para reduzir significativamente o fator de ramificação. Este trabalho apresenta um método para amostragem e aprendizagem de operadores preferidos, visando sua aplicabilidade em todo o espaço de estados de uma tarefa de planejamento. Demonstramos que esses operadores preferidos aprendidos têm resultados próximos à melhor abordagem lógica atual.
Nosso objetivo é identificar os operadores preferidos ideais, que estão situados ao longo dos caminhos mais curtos que levam a algum objetivo. No entanto, devido ao enorme tamanho dos espaços de estado, apresentamos uma nova estratégia de amostragem adaptada para extrair operadores preferidos que aproximam os ideais. Nossa pesquisa mostra que podemos obter operadores preferidos de alta qualidade a partir de um conjunto de amostras que abrange uma fração do espaço de estados.
Para obter uma compreensão mais aprofundada sobre essa nova categoria de operadores preferidos, realizamos experimentos controlados usando tarefas de planejamento sobre as quais temos acesso a todo o espaço de estados com estimativas perfeitas de custo para o objetivo. Nós comparamos sistematicamente a abordagem proposta com \textit{baselines}, avaliamos a eficácia dos operadores preferidos aprendidos com variados tamanhos de conjuntos de amostras e avaliamos o desempenho quando combinados com diferentes funções heurísticas.
\end{translatedabstract}
% list of abbreviations and acronyms
% the parameter must be the longest abbreviation
% NBR 14724:2011 requires the abbreviations in the list
% to be in alphabetical order (as in the example below).
\begin{listofabbrv}{SPMD}
\item[BCE] Binary Cross-Entropy
\item[BFS] Breadth-First Search
\item[\bsp] Backward State Space
\item[DFS] Depth-First Search
\item[DQ] Dual-Queue
\item[FF] Fast-Forward
\item[\bfsrw] Focused Sampling Method
\item[\fsp] Forward State Space
\item[GBFS] Greedy Best-First Search
\item[GPU] Graphics Processing Unit
\item[IPC] International Planning Competition
\item[MSE] Mean Squared Error
\item[NN] Neural Network
\item[ResNet] Residual Network
\item[\sai] Sample Improvement
\item[\sas] Simplified Action Structures Plus
\item[SAT] Propositional Satisfiability Problem
%\item[SPG] Shortest Path Graph
\item[STRIPS] Stanford Research Institute Problem Solver
\item[\sui] Successor Improvement
\item[\bfsrs] Expansion from Random Successors
\end{listofabbrv}
\begin{listofsymbols}{$\alpha\beta\pi\omega$}
%\item[$G$] Sample set graph
%\item[$G'$] Shortest path sample set graph
\item[$d^{*}$] Longest distance between the goal condition and any initial state
\item[\h] Heuristic
\item[\hstar] Perfect heuristic
\item[\hnn] Learned heuristic
\item[\hadd] Add heuristic
\item[\hff] FF heuristic
\item[\hgc] Goal-count heuristic
\item[\postartable] Oracle of ideal preferred operators
\item[\postar] Learned ideal preferred operators
\item[\poff] FF preferred operators
\item[\pofsm] \bfsrw-based learned preferred operators
\item[\pog] Shortest-path-graph-based learned preferred operators
\item[\pogstar] Shortest-path-graph-based (with \hstar-values) learned preferred operators
\end{listofsymbols}
\listoffigures
\listoftables
\listofalgorithms
\tableofcontents
%
% - - - - - - - - - - - - - - - -- - - - - -
% intro
% - - - - - - - - - - - - - - - -- - - - - -
% introduce the problem,
% show why the problem is interesting,
% and how we attack it
%
\chapter{Introduction}
\label{cha:introduction}
Planning tasks can be solved through heuristic search, which systematically expands potential states in a search space based on an informed guess, or heuristic, about which states are likely to lead to a solution.
In planning, preferred operators act in conjunction with heuristic functions and help reduce the number of expanded states during the planning process by prioritizing states considered advantageous.
This study introduces a novel approach to deriving preferred operators in planning tasks. Unlike the traditional logic-based methods, we use a sample-based approach to discover preferred operators. By training a neural network (NN) with a sample set consisting of pairs of states and their preferred operators discovered during the sampling procedure, the NN learns to generalize preferred operators for the given planning task.
\section{Planning}
\label{sec:intro-planning}
Planning involves determining a series of operators (or actions) that transform a given initial state to satisfy a goal condition.
In a classical planning task, the agent acts in a fully-observable environment, i.e., with access to all relevant information about the current state of the world, such as the positions of objects. The agent starts in an initial state and needs to fulfill a specific goal condition. This is achieved by using deterministic operators to modify the current state of the world. A solution plan for the planning task is a sequence of operators that satisfies the goal condition when applied to the initial state. A state expansion involves the application of all applicable operators to a given state, thereby generating its successor states.
For example, in a Blocks World task, consider the initial state (left) and the goal state (right) shown in Figure \ref{fig:intro-blocks}. The agent needs to apply a sequence of operators to reach the goal state from the initial state. We can define operators such as ``pick up block X'', ``put block X on block Y'', and ``put block X on the table.'' In this example, the agent can find one of the possible plans by applying the following operators: pick up block G, put block G on the table, pick up block B, put block B on block G, pick up block R, and put block R on block B. %Applying each of these operators generate a successor state.
\begin{figure}[ht]
\caption[Initial and goal states of a Blocks World task]{Initial state and goal state of a Blocks World task.}
\vspace{\baselineskip}
\centering
\begin{tikzpicture}
% Initial state
\node[align=center] at (7,-0.3) {(initial state)};
\drawCube{7}{0.5}{0}{R}{red}{0.7}
\drawCube{7}{1.2}{0}{B}{blue}{0.7}
\drawCube{7}{1.9}{0}{G}{green}{0.7}
% Goal state
\node[align=center] at (10.2,-0.3) {(goal state)};
\drawCube{10}{0.5}{0}{G}{green}{0.7}
\drawCube{10}{1.2}{0}{B}{blue}{0.7}
\drawCube{10}{1.9}{0}{R}{red}{0.7}
\end{tikzpicture}
\label{fig:intro-blocks}
\end{figure}
Planners are software systems specifically designed to find plans for planning tasks. Planners that do not rely on reasoning over SAT formulas may use several algorithms and techniques to explore the search space of possible operators and states, either to find an optimal plan, where the best possible solution is returned, or a satisfactory plan, where a suboptimal solution may be returned. Planning systems typically take as input a formal representation of the planning problem, including the initial state, the goal condition, and a set of available operators. The formal representation of a planning task can be specified using various notations~(\cref{sec:background-planning-notation}).
In this study, we focus on discovering a satisfactory plan for a given task using a best-first search algorithm.%, without the need to achieve the best possible plan.
\section{Heuristic Search}
\label{sec:intro-heuristic-search}
Planners commonly use a best-first search algorithm such as greedy best-first search (GBFS)~\cite{Doran.Michie/1966}. GBFS arranges states in a priority queue based on their cost-to-goal estimate (also known as heuristic value or $h$-value) provided by the heuristic function. It first expands states with the lowest cost-to-goal estimate to find a solution. Various domain-independent heuristics effectively compute the cost-to-goal estimate of a state by using domain logic, which has information that permits reasoning about operators and other rules, such as mutually exclusive relations (mutexes) that indicate infeasible states. These heuristics are based mainly on techniques such as abstractions~\cite{Culberson.Schaeffer/1998}, delete relaxation~\cite{Hoffmann.Nebel/2001}, and landmarks~\cite{Hoffmann.etal/2004,Helmert.Domshlak/2009}. The heuristic function is the most important component of a planner since it guides the search.
\section{Preferred Operators}
\label{sec:intro-preferred-ops}
Operators considered advantageous for specific reasons, such as generating states closer to the goal, are referred to as preferred operators~\cite{Helmert/2006,Richter.Helmert/2009}. These operators are used alongside heuristic functions to assist planners in minimizing the number of expanded states when solving a planning task.
By prioritizing the expansion of states generated by preferred operators, planners benefit from additional guidance, often resulting in a higher success rate for solving tasks than relying only on the heuristic function. Learning preferred operators shares similarities with learning policies, as both involve selecting actions more likely to result in desirable outcomes. The existing methods for identifying preferred operators are currently limited to logic-based approaches. The current most effective method uses the preferred operators computed alongside the Fast-Forward (FF) heuristic~\cite{Hoffmann.Nebel/2001}, as implemented in the Fast Downward planning system~\cite{Helmert/2006}. Planners incorporating preferred operators emerged as winners in the satisficing track of the International Planning Competition (IPC) in the years 2004~\cite{Helmert/2006}, 2008~\cite{Richter.lama.etal/2010}, 2011~\cite{Richter.lama.etal/2011}, and 2018~\cite{Seipp-fast.etal/2018}.
\section{Learning with Neural Networks}
\label{sec:intro-deep-learning}
Learning with NNs refers to a machine learning approach that involves designing and training interconnected layers of artificial neurons, known as neural networks, inspired by the biological brain.
It involves learning hierarchical representations of data, where each layer in the network progressively extracts more complex and abstract features. Through the usage of NNs, learning algorithms can automatically discover and capture patterns from large-scale datasets in various tasks, including image classification, speech recognition, and natural language processing.
Learning models based on NNs can learn in different ways. This study focuses on supervised learning, which uses labeled datasets to classify or make predictions. In this case, a dataset used to train an NN can be represented as multiple pairs in the format $(y, x)$. The target $y$ refers to the desired output or label associated with a specific input instance $x$. With supervised learning, an NN can effectively learn and generalize from the provided labeled data to make accurate predictions or classify new, unseen inputs.
\section{Learning in Planning}
\label{sec:intro-learning-planning}
In recent years, interest in using NNs to learn heuristic functions~\cite{Ferber.etal/2020a,Yu.etal/2020,Shen.etal/2020,Ferber.etal/2022,OToole/2022,Bettker.etal/2022} or policies~\cite{Toyer.etal/2018,Toyer.etal/2020,Stahlberg.etal/2022} to solve planning tasks has increased. The general approach for supervised methods is to generate a set of samples as pairs of states and cost-to-goal values and use them for training an NN. However, it is challenging to learn effective heuristic functions since state spaces tend to grow exponentially with the amount of information needed to describe them, while the portion of the state space that we can actually sample is relatively small. Logic-based heuristics can be applied to any domain, while learned heuristics depend on the learning model, and even with domain-independent models, transfer learning can be difficult~\cite{Shen.etal/2020}. Moreover, learned heuristics are generally slow to compute, so they need to be more informed, i.e., to expand fewer states than logic-based heuristics to reach the goal. These characteristics also apply to learning preferred operators.
\section{Contributions}
\label{sec:intro-contributions}
This study represents the first attempt to discover preferred operators from a sample set and use an NN to learn them. We present a new sampling method and a novel sample-based technique for identifying preferred operators. The technique involves backward search from the goal condition (also known as regression), constructing a graph with sampled states representing their successor-predecessor relationships, and determining the operators used to reach the goal condition as preferred operators for each state. We show that an NN can learn the preferred operators from a subset of the state space and effectively extend this learning to the entire state space across diverse planning domains. Furthermore, the proposed approach outperforms the current best logic-based preferred operator method in the benchmark tasks. In particular, this work presents:
\begin{itemize}
\item A novel method based on shortest path graphs to discover preferred operators in an existing sample set~(\cref{sec:sample-discovered-po}).
\item A new sampling method tailored for discovering preferred operators~(\cref{sec:sample-learn-po}).
\item An analysis of the quality of the learned preferred operators and a comparison to existing logic-based methods~(\cref{cha:exp-experiments}).
\end{itemize}
%
% - - - - - - - - - - - - - - - -- - - - - -
% background
% - - - - - - - - - - - - - - - -- - - - - -
% these are all things that exist already
% and that the reader needs to know before
%
\chapter{Background}
\label{cha:background}
This chapter provides essential information for comprehending the subsequent chapters of this work.
\section{Classical Planning Notation}
\label{sec:background-planning-notation}
In this section we present two ways to represent a classical planning task. The first one, STRIPS~\cite{Fikes.Nilsson/1971}, represents a planning task using propositional facts (Boolean variables). The second way, \sas~\cite{Backstrom.Nebel/1995}, represents a planning task using multi-valued state variables. %The third way, PDDL~\cite{Ghallab.etal/1998}, is based on predicate logic and is commonly used for writing planning tasks to be used as input for planners.
\subsection{STRIPS}
\label{sec:background-strips}
\begin{definition}[STRIPS Planning Task]\label{def:strips}
A planning task in STRIPS representation is defined by~$\Pi=\langle\mathcal{F},\mathcal{O},s_{0},s^{*}\rangle$, where~$\mathcal{F}$ has all the possible facts (propositions) that can be used to describe a state, $\mathcal{O}$~is a set of operators,~$s_{0} \subseteq \mathcal{F}$ is an initial state, and~$s^{*} \subseteq \mathcal{F}$ is the goal condition that specifies the facts that should be true to solve the task.
A state $s$ is defined as $s \subseteq \mathcal{F}$, where each fact $f \in \mathcal{F}$ is either true ($f \in s$) or false ($f \notin s$). %The set~$\mathcal{S}$ of all states over $\mathcal{F}$ is the state space.
An operator~$o \in \mathcal{O}$ is defined by a triple~$\langle\pre(o),\add(o),\del(o)\rangle$ that specifies its precondition, add-effects and delete-effects, respectively, which are denoted as sets of facts.
\end{definition}
In STRIPS, we say that operator $o$ is applicable to state $s$ if its preconditions are satisfied by $s$, i.e., $\pre(o) \subseteq s$, and it produces a successor state~$s' = \sucs(s,o) = (s \setminus \del(o)) \cup \add(o)$. In other words, we progress a state $s$ with operator $o$ by setting the propositions in $\del(o)$ to false and those in $\add(o)$ to true. The set of successor states of $s$ is denoted by~$\sucs(s)=\{\sucs(s,o)\mid o\in \mathcal{O}, \pre(o) \subseteq s\}$.
STRIPS formulas use conjunctions of propositions to express compound preconditions and effects. STRIPS has no negated preconditions, and it is not possible to directly specify conditional effects, where the effect of an operator depends on the state in which it is applied. However, conditional effects can be simulated by creating additional operators in the planning task.
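To make the STRIPS progression above concrete, the following Python sketch applies an operator to a state represented as a set of facts. The operator fields and the Blocks World facts in the example are illustrative assumptions of this sketch, not part of any specific planner implementation.
\begin{verbatim}
# Minimal sketch of STRIPS progression using Python sets (illustrative only).
from typing import FrozenSet, NamedTuple

class Operator(NamedTuple):
    name: str
    pre: FrozenSet[str]   # precondition facts
    add: FrozenSet[str]   # add-effects
    dele: FrozenSet[str]  # delete-effects

def applicable(s, o):
    # o is applicable in s if pre(o) is a subset of s.
    return o.pre <= s

def progress(s, o):
    # succ(s, o) = (s \ del(o)) U add(o)
    return (s - o.dele) | o.add

def successors(s, operators):
    return {progress(s, o) for o in operators if applicable(s, o)}

# Tiny example: pick up a clear block from the table.
s = frozenset({"ontable-A", "clear-A", "handempty"})
pickup_A = Operator("pickup-A",
                    pre=frozenset({"ontable-A", "clear-A", "handempty"}),
                    add=frozenset({"holding-A"}),
                    dele=frozenset({"ontable-A", "clear-A", "handempty"}))
print(progress(s, pickup_A))  # frozenset({'holding-A'})
\end{verbatim}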
\begin{definition}[Plan]\label{def:plan}
A sequence of operators~$\pi=(o_1,\ldots,o_k)$ is valid for a state~$s_0$ if it produces a sequence of states~$s_1,\ldots,s_k$ such that $s_i=\sucs(s_{i-1},o_i)$. A valid sequence~$\pi$ for the initial state~$s_{0}$ is called a plan if $s^{*} \subseteq s_{k}$. The plan length is defined as $|\pi|$. Among all valid plans, one with minimum length is referred to as an optimal plan $\pi^{*}$, i.e., the shortest plan that achieves the goal condition from the initial state. Since in this work we only consider unit-cost operators, the plan cost equals the plan length.
\end{definition}
Heuristics based on delete relaxation obtain the heuristic value from a \emph{relaxed} version of the planning task. A relaxed planning task in STRIPS is defined as $\Pi^{+}=\langle\mathcal{F},\mathcal{O'},s_{0},s^{*}\rangle$, where $\mathcal{O'} = \{ \langle \pre(o), \add(o), \emptyset \rangle \,|\, o \in \mathcal{O} \}$, i.e., the delete-effects of the planning task are ignored. Relaxed tasks can be solved efficiently, even though finding an optimal relaxed solution is $\mathcal{NP}$-hard~\cite{Bylander/1994}. For example, the heuristic \hadd approximates the perfect heuristic value for a state $s$ as the sum of the costs of achieving each proposition in $s^{*}$ independently of the others.
\subsection{\sas}
\label{sec:background-sas}
We also use the \sas representation to describe classical planning tasks independent of any particular domain. \sas shares similarities with STRIPS, but it allows state variables to have an arbitrary and potentially non-binary finite domain. These multi-valued variables can express mutexes that are not explicitly recognized in STRIPS. Suppose we have a planning task with a robot in a grid with $n$ possible locations it needs to visit. With STRIPS, we need $n$ facts to indicate all the possible locations where the robot can be, and the STRIPS representation does not explicitly capture the mutex that the robot cannot exist in multiple locations simultaneously. In \sas, this information can be naturally expressed using a single multi-valued variable for the location of the robot. The variable has a finite domain consisting of the $n$ possible locations, enforcing that the robot can only occupy one location at a time. Thus, only one proposition from the $n$ possibilities can be true in any state.
\begin{definition}[\sas Planning Task]\label{def:sas}
A \sas task is defined as $\Pi=\langle\mathcal{V},\mathcal{O},s_0,s^*\rangle$, where $\mathcal{V}$ is a set of state variables, and each variable $v\in \mathcal{V}$ has a finite domain~$\dom(v)$ that represents the possible mutually exclusive values of the variable. $\mathcal{O}$ is a set of operators, where each operator $o \in \mathcal{O}$ is defined as a pair $(\pre(o),\eff(o))$ of preconditions and effects, both of which are partial states. A partial state~$s$ is a partial function $s:\mathcal{V}\rightarrow \mathcal{D}$, where $\mathcal{D}=\cup_{v\in \mathcal{V}}\dom(v)$, such that $s(v)\in \dom(v)$ whenever $s(v)$ is defined; otherwise, $s(v)$ is undefined and written as $s(v)=\bot$. Given a partial state $s$, $var(s) \subseteq \mathcal{V}$ is the finite set of all variables assigned in $s$. A (complete) state $s$ is a partial state defined on all variables in~$\mathcal{V}$, i.e., $var(s) = \mathcal{V}$. The state~$s_0$ defines the initial state, and the partial state~$s^*$ defines the goal condition.
\end{definition}
An operator $o \in \mathcal{O}$ is applicable to a state $s$ if its preconditions are fulfilled by $s$, i.e., $\pre(o) \subseteq s$, and it generates a successor state $s'=\sucs(s,o):=\eff(o)\circ s$. Here, $s'=t\circ s$ is defined as $s'(v)=t(v)$ for all $v$ such that $t(v)$ is defined, and $s'(v)=s(v)$ in all other cases. The set of all successor states of state $s$ is denoted by $\sucs(s)=\{{\sucs(s,o)\mid o\in \mathcal{O}, \pre(o) \subseteq s}\}$. Once a state variable has been assigned a value by an operator, it can never be made undefined again.
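As an illustration, the following Python sketch represents partial states as dictionaries from variables to values (an assumption of the sketch) and implements applicability and progression as defined above.
\begin{verbatim}
# Minimal sketch of SAS+ applicability and progression, with partial states
# as Python dicts mapping variables to values (illustrative assumption).

def applicable(s, pre):
    # pre(o) is satisfied by s if s agrees with every assignment in pre(o).
    return all(s.get(v) == d for v, d in pre.items())

def progress(s, eff):
    # succ(s, o) = eff(o) o s: effects override, other variables keep their value.
    s2 = dict(s)
    s2.update(eff)
    return s2

# Example: a robot moving between two locations.
s = {"robot-at": "loc1", "door": "open"}
move = {"pre": {"robot-at": "loc1", "door": "open"},
        "eff": {"robot-at": "loc2"}}
if applicable(s, move["pre"]):
    print(progress(s, move["eff"]))  # {'robot-at': 'loc2', 'door': 'open'}
\end{verbatim}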
We can represent planning tasks in STRIPS using~\sas by converting each fact to a state variable with domain true and false. On the other hand,~\sas can be represented in STRIPS by converting each possible variable assignment to a fact. Note that \sas supports partial assignments while STRIPS assumes all states are complete (unmentioned facts are considered false).
\sas tasks are generally more concise than STRIPS tasks. In STRIPS, if we have $n$ facts, the number of states is $2^n$. In~\sas, with $n$ variables, the number of states is $|\dom(v_{1})| \times |\dom(v_{2})| \times \ldots \times |\dom(v_{n})|$.
%\subsection{PDDL}
%\label{sec:background-pddl}
%We can also represent a classical planning task using PDDL. Tasks specified in PDDL are written in a Lisp-like syntax and are separated into two definitions, the domain, which specifies the operators and predicates, and the task, which specifies the objects, the initial state, and the goal condition. The following definitions describe a restricted version of PDDL that encodes STRIPS-like planning tasks.
%\begin{definition}[PDDL Domain]\label{def:pddl-domain}
%A PDDL domain is defined as a tuple $D = (\mathcal{P}, \mathcal{O})$, where $\mathcal{P}$ is a set of predicates representing the state of the world. Each predicate $p \in \mathcal{P}$ can be true or false in a given state. The set of operators is defined as $\mathcal{O}$. Each operator $o \in \mathcal{O}$ has a pre-condition $\pre(o)$ and an effect $\eff(o)$.
%\end{definition}
%In PDDL, effects are not explicitly separated into add-effects and delete-effects, instead, delete-effects are represented by negating the predicates.
%\begin{definition}[PDDL Task]\label{def:pddl-domain}
%A PDDL task is a triple $T = (D, s_{0}, s^{*})$, where $D$ is the domain definition associated with the problem, $s_{0}$ is the initial state, represented as $s_{0} = \{p_1, p_2, \ldots, p_n\}$, where $p_i \in \mathcal{P}$, and $s^{*}$ is the goal condition, denoted as $s^{*} = \{q_1, q_2, \ldots, q_m\}$, where $q_i \in \mathcal{P}$ represents the desired state of predicate $q_i$ to be achieved.
%\end{definition}
%Actions represent the possible operations or transformations that can be performed, while predicates describe the properties or conditions that can be true or false within the domain. Objects are the entities involved in the planning task, while the initial state describes the starting configuration or state of the objects. The goal condition defines the desired state that a planner aims to achieve.
%As an example, consider Blocks World, where $(ontable~?x)$ represents a predicate that indicates if a block $x$ is on the table or not, and $(pickup~?x)$ is an action that defines the act of picking up block $x$. This action is further specified by a set of preconditions $(and~(clear~?x)~(ontable~?x)~(handempty))$ that must hold true before $(pickup~?x)$ is applied, i.e., $x$ must be clear (no other block above it), $x$ must be on the table, and the hand must be empty. The effects of applying $(pickup~?x)$ are $$(and~(not~(ontable~?x))~(not~(clear~?x))~(not~(handempty))~(holding~?x)),$$
%that is, $x$ is not on the table, $x$ is not clear, the hand is not empty, and the hand is holding $x$.
\section{Regression}
\label{background-regression}
Progression involves determining the successor states of the current state by applying a sequence of operators, and regression involves determining predecessor states that can lead to the current partial state.
%In regression, for each operator~$o \in \mathcal{O}$ we construct a backward operator~$o_r \in \mathcal{O}_r$ as follows. From an operator~$o = (\pre(o), \eff(o))$, we generate the corresponding backward operator~$o_r = (\eff(o), \pre(o))$ if $\eff(o)$ and $\pre(o)$ are defined. Otherwise, $o_r=(\eff(o),\bot)$ if $\pre(o)=\emptyset$, or $o_r=o$ if $\eff(o)=\emptyset$.
In regression, an operator $o$ is considered \emph{relevant} for a partial state $s$ if $\eff_r = var(\eff(o)) \cap var(s) \neq \emptyset$, and \emph{consistent} if $s|_{\eff_r} = \eff(o)|_{\eff_r}$. Relevance requires that at least one variable is defined both in the effect and in the partial state to be regressed, and consistency requires agreement on these variables. An operator $o$ is \emph{backward applicable} in a partial state $s$ if it is both relevant and consistent with $s$, and it leads to a predecessor $\pred(s,o)$ given by $\pred(s,o)=\pre(o)\circ (s|_{var(s)\setminus\eff_r})$. Note that $s \subseteq \sucs(\pred(s,o),o)$; the progression of $\pred(s,o)$ may define additional variables that are undefined in $s$, since the operator $o$ may set the values of variables not mentioned in $s$.
A partial state $s$ has predecessors $\pred(s)=\{\pred(s,o)\mid o \in \mathcal{O}\}$, where $o$ is backward applicable in $s$. A regression sequence from a partial state $s_0$ is valid if each $o_i$ is backward applicable in $s_{i-1}$ and produces $s_i=\pred(s_{i-1},o_i)$. Any state that satisfies a partial state~$s_k$ of the sequence can reach a state satisfying $s_{0}$ in at most $k$ forward applications of the reversed operator sequence.
Since the goal is a partial state, a valid regression sequence $\rho=(o_1,\ldots,o_k)$ starting from the goal condition generates a sequence of partial states from which the goal can be reached in at most $k$ steps.
Note that progression and regression in planning are asymmetric because each state in the regression (backward) state space can represent a set of states in the progression (forward) state space~\cite{Alcazar.etal/2013}.
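The following Python sketch mirrors these definitions for partial states represented as dictionaries (an assumption of the sketch, as above): it checks relevance and consistency, and computes $\pred(s,o)$.
\begin{verbatim}
# Minimal sketch of backward application (regression) over partial states
# represented as dicts (illustrative assumption).

def backward_applicable(s, o):
    eff = o["eff"]
    shared = set(eff) & set(s)                        # variables defined in both
    relevant = bool(shared)
    consistent = all(s[v] == eff[v] for v in shared)  # agreement on shared variables
    return relevant and consistent

def regress(s, o):
    # pred(s, o) = pre(o) o (s restricted to variables not shared with eff(o))
    shared = set(o["eff"]) & set(s)
    p = {v: d for v, d in s.items() if v not in shared}
    p.update(o["pre"])
    return p

# Example: regress the goal {robot-at: loc2} through the move operator.
goal = {"robot-at": "loc2"}
move = {"pre": {"robot-at": "loc1", "door": "open"},
        "eff": {"robot-at": "loc2"}}
if backward_applicable(goal, move):
    print(regress(goal, move))  # {'robot-at': 'loc1', 'door': 'open'}
\end{verbatim}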
\section{State Spaces}
\label{background-state-spaces}
Let $S$ be a set of states, $s_0 \in S$ be an initial state, $s^{*} \in S$ be the goal condition, and $\sucs : S \to 2^S$ be a successor function, which maps each state to a set of possible successor states and determines the available transitions. A state space is a tuple $\spp=\langle S, s_0, s^{*}, \sucs \rangle$.
\begin{definition}[State Space of $\Pi$]
For any planning task $\Pi$ with states $S$, initial state $s_0$, goal condition $s^{*}$, and successor function $\sucs$, the corresponding state space of $\Pi$ is denoted as $\spp(\Pi) = \langle S, s_0, s^{*}, \sucs \rangle$.
\end{definition}
The forward state space (\fsp) refers to the set of states that can be reached from the initial state $s_0$ by applying the successor function $\sucs$ in the forward direction. It represents the states that can be reached forward from the initial state towards the goal condition.
\begin{definition}[Forward State Space of $\Pi$]
The forward state space for a planning task $\Pi$ is denoted as $\fsp(\Pi) = \langle S_{\text{F}}, s_0, s^{*}, \sucs \rangle$, where $S_{\text{F}}$ is the set of states reachable from the initial state, and $\sucs : S_{\text{F}} \to 2^{S_{\text{F}}}$ is the successor function that maps each state to a set of possible successor states and determines the available transitions in the forward direction.
\end{definition}
In other words, $\fsp(\Pi)$ is defined as the subset of states and transitions within $\spp$ that are relevant to the planning task $\Pi$ and the forward expansion towards the goal condition. States from the \fsp are expanded when solving a task, for example by using a best-first search algorithm.
The backward state space (\bsp), on the other hand, refers to the set of states that can be reached from the goal condition $s^{*}$ by applying a predecessor function $\pred$. It represents the states that can be reached backward from the goal condition towards the initial state (regression).
\begin{definition}[Backward State Space of $\Pi$]
The backward state space of a planning task $\Pi$ is denoted as $\bsp(\Pi) = \langle S_{\text{B}}, s_{0}, s^{*}, \pred \rangle$, where $S_{\text{B}}$ is the set of states reachable from the goal condition and $\pred : S_{\text{B}} \to 2^{S_{\text{B}}}$ is a predecessor function that maps each state to a set of possible predecessor states and determines the available transitions in the backward direction.
\end{definition}
\section{Heuristic Functions}
\label{sec:background-heuristics}
A heuristic function $h:\mathcal{S}\rightarrow \mathbb{R}^{\geq 0}\cup\{\infty\}$ estimates the optimal plan length from a state $s$ to the goal $s^*$, where the perfect heuristic function is defined as $\hstar(s) = |\pi_{s}^{*}|$, i.e., the length of the optimal plan from $s$ to $s^{*}$. Heuristic functions are used to guide search algorithms, optimization techniques, or decision-making processes by providing informed estimates or approximations based on available information or problem-specific knowledge. The goal of a heuristic function is to efficiently guide the search or decision-making process towards more promising paths or solutions, even in the absence of complete or perfect information. A heuristic function can have several properties that indicate its quality, such as:
\begin{itemize}
\item Admissibility: $h(s) \le \hstar{(s)}$, i.e., the heuristic never overestimates the true cost-to-goal.
\item Consistency (or monotonicity): $h(s) \le h(s') + cost(o)$ for all transitions from a state $s$ to a successor $s'$, i.e., $s \xrightarrow{o} s'$, where $cost(o)$ is the cost of applying operator $o$ to reach $s'$ from $s$.
\item Goal-awareness: $h(s) = 0$ for all goal states.
\item Safeness: $h(s) = \infty$ implies $\hstar (s) = \infty$, for example in dead-ends.
\end{itemize}
Heuristics are typically derived from a model of the task, such as the \sas model introduced earlier, but they can also be obtained by learning a mapping from a state $s$ to its heuristic value $h(s)$, where the desired output of the NN can be either the direct cost-to-goal estimate or some encoding of it.
In this study, we use a propositional representation of a state to learn heuristic functions and preferred operators~\cite{Ferber.etal/2020a,Yu.etal/2020,Ferber.etal/2022,OToole/2022,Bettker.etal/2022}. We use the notation from~\citet{Bettker.etal/2022}.
Consider a planning task $\Pi=\langle\mathcal{V},\mathcal{O},s_0,s^*\rangle$, where $\mathcal{V}=\{v_1,\ldots,v_n\}$ is a set of variables, and $\dom(v_i)=\{d_{i1},\ldots,d_{i,s_i}\}$ is the domain of variable $v_i$, for $i\in[n]$.
A state $s$ is represented as a sequence of facts $\mathcal{F}(s)=(f_{11},f_{12},\ldots,f_{1,s_1},\ldots,f_{n1},f_{n2},\ldots,f_{n,s_n})$. Each fact $f_{ij}=[s(v_i)=d_{ij}]$ corresponds to a variable $v_i$ taking on a specific value $d_{ij}$ in state $s$. If the variable $v_i$ has the assigned value $d_{ij}$ in state $s$, then the fact $f_{ij}$ is considered true.
The facts $\mathcal{F}_{i}=\{f_{i1},\ldots,f_{i,s_i}\}$ associated with variable $v_i$ must adhere to the consistency condition $\sum_{f\in \mathcal{F}_i} f\leq 1$. This means that each variable can take at most one value. When $v_i$ is undefined, $\sum_{f\in \mathcal{F}_i} f=0$.
For example, let $\mathcal{F}$ be a set of facts, and let $f_i, f_j \in \mathcal{F}$ be two facts. We say that $f_i$ and $f_j$ are mutex if $\neg(f_i \land f_j)$ holds, i.e., $f_i$ and $f_j$ cannot both be true at the same time in any valid state of the planning problem.
We use the notation $\mutex(\mathcal{F})$ to denote when the constraint $\sum_{f\in \mathcal{F}} [f]\leq 1$ must be satisfied for the states of the planning task.
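To illustrate how a state is presented to the NN, the following Python sketch builds the Boolean fact vector of a \sas state; the variable names, domains, and their ordering are illustrative assumptions of the sketch.
\begin{verbatim}
# Minimal sketch of the propositional (one fact per variable-value pair)
# encoding of a state. Variables, domains, and ordering are illustrative.

variables = {"robot-at": ["loc1", "loc2", "loc3"],
             "door": ["open", "closed"]}

def encode(state, variables):
    # One fact f_ij = [state(v_i) = d_ij] per value of each variable;
    # an undefined variable contributes an all-zero block.
    facts = []
    for v, domain in variables.items():
        facts.extend(1 if state.get(v) == d else 0 for d in domain)
    return facts

s = {"robot-at": "loc2", "door": "open"}
print(encode(s, variables))  # [0, 1, 0, 1, 0]
\end{verbatim}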
\section{Greedy Best-First Search}
\label{sec:background-gbfs}
Greedy best-first search (GBFS, \cref{alg:gbfs}) is a best-first search algorithm typically used by planners to solve planning tasks by expanding the \fsp. GBFS organizes states in a priority queue based on their cost-to-goal estimate, which is determined by a heuristic function $h$. GBFS expands states with the \emph{lowest} cost-to-goal estimate first in order to find a solution. Unlike the \astar algorithm~\cite{Hart.etal/1968}, GBFS does not consider the cost of the path taken ($g$-value). This \emph{can} make GBFS efficient when the heuristic function is accurate, but it sacrifices optimality guarantees, whereas~\astar is optimal when the heuristic used is both admissible and consistent.
\begin{algorithm}[tb]
\caption{Greedy best-first search}
\label{alg:gbfs}
\begin{algorithmic}[1]
\Procedure{GBFS}{$s_{0}, s^{*}, h$}
\State $Q \gets$ priority queue ordered by lowest $h$
\If{$h(s_{0}) < \infty$}
\State Insert $s_0$ into $Q$ with priority $h(s_0)$
\State Mark the initial state $s_0$ as visited
\EndIf
\While{$Q$ is not empty}
\State $s \gets$ state with the highest priority in $Q$
\State Remove $s$ from $Q$
\If{$s$ satisfies the goal condition $s^{*}$}
\State \textbf{return} the solution
\EndIf
\State $S' \gets \{s' \mid s' \in succ(s)\}$
\ForAll{successor states $s' \in S'$}
\If{$s'$ has not been visited \textbf{and} $h(s') < \infty$}
\State Mark $s'$ as visited
\State Insert $s'$ into $Q$ with priority $h(s')$
\EndIf
\EndFor
\EndWhile
\State \textbf{return} failure (no solution found)
\EndProcedure
\end{algorithmic}
\end{algorithm}
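For concreteness, the following Python sketch mirrors \cref{alg:gbfs} using a binary heap as the priority queue. The state, successor-function, and heuristic interfaces are assumptions of this sketch, and plan reconstruction via parent pointers is omitted for brevity.
\begin{verbatim}
# Compact sketch of GBFS; interfaces are illustrative assumptions.
import heapq
import itertools
from math import inf

def gbfs(s0, is_goal, successors, h):
    counter = itertools.count()            # tie-breaker for equal h-values
    queue, visited = [], {s0}
    if h(s0) < inf:
        heapq.heappush(queue, (h(s0), next(counter), s0))
    while queue:
        _, _, s = heapq.heappop(queue)     # state with the lowest h-value
        if is_goal(s):
            return s                       # solution found (plan omitted)
        for t in successors(s):
            if t not in visited and h(t) < inf:
                visited.add(t)
                heapq.heappush(queue, (h(t), next(counter), t))
    return None                            # failure: no solution found
\end{verbatim}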
\section{Preferred Operators}
\label{sec:background-preferred-operators}
Preferred operators can be described as operators that, given a particular state $s$, tend to generate successors more likely to satisfy the goal condition when compared to other successors of $s$. Preferred operators provide a way to prioritize the expansion of certain states over others during the search. Although the method of identifying preferred operators varies,~\citet{Hoffmann.Nebel/2001} introduced the first approach in the original Fast-Forward (FF), where preferred operators are computed alongside the FF heuristic.
Specifically, the FF planner computes a relaxed planning graph~(\cref{alg:computing-rpg}) that represents the relaxed task, ignoring delete-effects. From the relaxed planning graph, FF extracts a relaxed plan~(\cref{alg:extracting-relaxed-plan}) and uses its length as the cost-to-goal estimate for a state $s$. To extract the relaxed plan, the algorithm marks the goal facts that need to be achieved at each layer $k$ of the relaxed planning graph, then iterates from the last layer to the first, marking the actions at layer $k$ that achieve the goal facts of the same layer. The preferred operators, there called \emph{helpful actions}, are defined as the set $\{o \mid \pre(o) \subseteq s, \add(o) \cap S_{1}^{*} \neq \emptyset \}$, where $S_{1}^{*}$ denotes the set of goal facts at layer $1$ of the relaxed plan (note that the preconditions of operators that achieve a goal fact are also inserted into $S^{*}$ as subgoals). In summary, the preferred operators of the FF planner are the subset of operators that satisfy two conditions: they are applicable in the current state and they contribute to achieving at least one goal fact at the first layer of the relaxed plan.\footnote{\cref{alg:computing-rpg,alg:extracting-relaxed-plan} were adapted from~\citet{Wickler.etal/2015}.}
In the original implementation of the FF planner, the preferred operators prune the search space: only successors generated via preferred operators are evaluated. However, this makes the search incomplete~\cite{Richter.Helmert/2009}, i.e., it does not guarantee finding a solution or determining that none exists, and FF restarts without preferred operators if the pruned search fails.
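As an illustration of this definition, the following Python sketch filters the helpful actions of a state given the first-layer subgoal set; the set-based operator representation and the set passed in as \texttt{first\_layer\_goals} are assumptions of the sketch (in FF this set results from the relaxed-plan extraction in \cref{alg:extracting-relaxed-plan}).
\begin{verbatim}
# Minimal sketch of FF-style helpful actions: keep the operators that are
# applicable in s and add at least one subgoal marked at layer 1.

def helpful_actions(s, operators, first_layer_goals):
    return [o for o in operators
            if o["pre"] <= s and o["add"] & first_layer_goals]

# Example with set-based STRIPS operators (illustrative).
s = {"ontable-A", "clear-A", "handempty"}
ops = [{"name": "pickup-A",
        "pre": {"ontable-A", "clear-A", "handempty"},
        "add": {"holding-A"},
        "del": {"ontable-A", "clear-A", "handempty"}}]
print([o["name"] for o in helpful_actions(s, ops, {"holding-A"})])  # ['pickup-A']
\end{verbatim}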
\begin{algorithm}[tb]
\caption{Computing the relaxed planning graph}
\label{alg:computing-rpg}
\begin{algorithmic}[1]
\Procedure{RelaxedPlanningGraph}{relaxed planning task $\Pi^{+}$}
\State $P_{0} \gets s_{0}$ \textcolor{gray}{\# \emph{Initial proposition layer}.}
\State $k \gets 0$ \textcolor{gray}{\# \emph{Index of the current layer}.}
\While{$s^{*} \nsubseteq P_{k}$}
\State $k \gets k + 1$
\State \textcolor{gray}{\# \emph{Computes the action layer $O_{k}$ at layer $k$, i.e., all the actions}}
\State \textcolor{gray}{\# \emph{with the preconditions satisfied in the preceding layer}.}
\State $O_{k} \gets \{o \in O\ \mid pre(o) \subseteq P_{k-1}\}$
\State $P_{k} \gets P_{k-1}$
\ForAll{$o \in O_{k}$}
\State \textcolor{gray}{\# \emph{Computes the proposition layer $P_{k}$ at layer $k$}.}
\State $P_{k} \gets P_{k} \cup add(o)$
\EndFor
\If{$P_{k} = P_{k-1}$}
\State \textbf{return} failure
\EndIf
\EndWhile
\State $G^{+} \gets $ [$P_{0}, O_{1}, P_{1},\ldots,O_{k}, P_{k}$] \textcolor{gray}{\# \emph{The relaxed planning graph}.}
\State \textbf{return} $G^{+}$
\EndProcedure
\end{algorithmic}
\end{algorithm}
\begin{algorithm}[tb]
\caption{Extracting the relaxed plan}
\label{alg:extracting-relaxed-plan}
\begin{algorithmic}[1]
\Procedure{RelaxedPlan}{relaxed planning graph $G^{+}$, goal facts $s^{*}$}
\State $P \gets$ proposition layers of $G^{+}$
\State $O \gets$ action layers of $G^{+}$
%\State $l \gets$ number of layers in $G^{+}$
%\If{$s^{*} \nsubseteq P_{l}$}
% \State \textbf{return} failure
%\EndIf
\State $plan \gets \emptyset$
\State \textcolor{gray}{\# \emph{firstlayer($x$,$y$) returns the number of the first layer at which $x$ appears in $y$.}}
\State \textcolor{gray}{\# \emph{M = maximum(index of the first layer where each fact of $s^{*}$ first appears in $P$).}}
\State $M \gets$ $\max \{\text{firstlayer}(s_{i}^{*}, P) \mid s_{i}^{*} \in s^{*}\}$
\For{$k \gets 0$ \textbf{to} $M$}
\State \textcolor{gray}{\# \emph{$S_{k}^{*}$ are the goal facts that need to be achieved in $P_{k}$.}}
\State $S_{k}^{*} \gets$ $\{s_{i}^{*} \in s^{*} \mid \text{firstlayer}(s_{i}^{*}, P) = k\}$
\EndFor
\For{$k \gets M$ \textbf{to} $1$}
\ForAll{$s_{k}^{*} \in S_{k}^{*}$}
\State \textcolor{gray}{\# \emph{Selects the action $o$ that achieves the goal fact $s_{k}^{*}$}}
\State \textcolor{gray}{\# \emph{and appears for the first time in layer $k$}.}
\State $o \gets$ an action with $s_{k}^{*} \in add(o)$ and $\text{firstlayer}(o, O) = k$
\State $plan \gets plan \cup \{o\}$
\State \textcolor{gray}{\# \emph{Now add the preconditions of $o$ as sub-goals to $S^{*}$}}
\State \textcolor{gray}{\# \emph{in the layer where $p$ first appears}.}
\ForAll{$p \in pre(o)$}
\State $S_{firstlayer(p, P)}^{*} \gets S_{firstlayer(p, P)}^{*} \cup \{p\}$
\EndFor
\EndFor
\EndFor
\State \textbf{return} $plan$
\EndProcedure
\end{algorithmic}
\end{algorithm}
The current approach to extracting preferred operators is the one implemented in the Fast Downward planning system~\cite{Helmert/2009}. It is based on domain transition graphs, which are compatible with \sas, instead of the planning graphs described above; the preferred operators obtained during the computation of the FF heuristic remain the current best.
% A DTG is defined as follows.
%Let $\Pi = \langle \mathcal{V}, \mathcal{O}, s_0, s^* \rangle$ be a \sas planning task, and let $v \in \mathcal{V}$. The DTG of variable $v$ is the labeled directed graph $\text{DTG}(v, \Pi)$ with vertices $D_v$ and an arc $(d, d')$ labeled with operator $o \in \mathcal{O}$ whenever either $\pre_{o}(v) = d$ and $\eff_{o}(v) = d'$, or $\pre_{o}(v)$ is undefined and $\eff_{o}(v) = d'$. We refer to $(d, d')$ as a value transition of $v$. We write $d \xrightarrow{o} d'$ if there exists an operator $o$ that can modify the value of $v$ from $d$ to $d'$.
%To summarize, the DTG of a variable $v$ shows all the possible values of $v$ as vertices, and the arcs with their corresponding operators represent the transitions between those values. Furthermore, in Fast Downward, ignoring delete-effects in a multi-valued task is equivalent to assume that each state variable can hold several values simultaneously.
Fast Downward extends the ``vanilla'' GBFS algorithm to support preferred operators and ensure completeness. This is achieved by introducing a dual-queue approach: the ``default queue'' receives all generated states (default behavior without preferred operators), while the ``preferred queue'' only accepts states generated by preferred operators. We call this DQ-GBFS (dual-queue greedy best-first search). In this version of the algorithm, the expansion of states occurs alternately from both queues or may use boosting~\cite{Richter.Helmert/2009}.
With a boost value $n$, when a state with a lower $h$-value than any previously expanded state is encountered during the search (meaning the search progresses) the preferred queue is used for the next $n$ expansions, as long as it has elements. The boost value is cumulative, so each time the search progresses, $n$ expansions are added to the remaining number of subsequent expansions from the preferred queue.
% With a boost value $n$, if the search expands a state with an $h$-value lower than all previously expanded states (indicating progress in the search), the preferred queue is used for the subsequent $n$ expansions, as long as it contains elements. Furthermore, the boost value is cumulative, meaning that whenever the search progresses, $n$ expansions are added to the remaining number of subsequent expansions from the preferred queue.
In Fast Downward, the initial expansion of the search originates from the default queue, so even if we have preferred operators that always generate the successor closest to the goal, an inaccurate heuristic can result in a suboptimal plan. Furthermore, there are cases where a preferred operator generates a successor state that has already been generated. Consequently, the state is not added to the preferred queue, which means it cannot be expanded first (as the preferred queue has higher priority), and another state, potentially further from the goal, is expanded instead.
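The following Python sketch illustrates the dual-queue selection with boosting described above; the open-list interface is an assumption of the sketch, and duplicate detection when popping (needed because preferred states appear in both queues) is omitted.
\begin{verbatim}
# Minimal sketch of dual-queue state selection with boosting (cf. DQ-GBFS).
import heapq, itertools

class DualQueue:
    def __init__(self, boost=1000):
        self.default, self.preferred = [], []
        self.boost, self.remaining = boost, 0
        self.turn = 0
        self.tie = itertools.count()     # tie-breaker for equal h-values

    def push(self, h, state, via_preferred_op):
        entry = (h, next(self.tie), state)
        heapq.heappush(self.default, entry)        # all generated states
        if via_preferred_op:
            heapq.heappush(self.preferred, entry)  # preferred successors only

    def report_progress(self):
        # A state with a new lowest h-value was expanded: boost is cumulative.
        self.remaining += self.boost

    def pop(self):
        # Boosted expansions come from the preferred queue while it has elements.
        if self.remaining > 0 and self.preferred:
            self.remaining -= 1
            return heapq.heappop(self.preferred)[-1]
        # Otherwise alternate between the two queues, skipping empty ones.
        self.turn ^= 1
        if self.turn and self.preferred:
            return heapq.heappop(self.preferred)[-1]
        if self.default:
            return heapq.heappop(self.default)[-1]
        return heapq.heappop(self.preferred)[-1]
\end{verbatim}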
\section{Neural Networks}
\label{sec:background-neural-nets}
A multi-layer perceptron is a commonly used NN architecture comprising an input layer, one or more hidden layers, and an output layer. The neurons within the hidden layers typically apply an activation function to the weighted sum of their inputs, which can introduce nonlinear transformations. A bias term is a constant value that is optionally added to the weighted sum of inputs of a neuron before applying the activation function, introducing an offset in the activation that allows the NN to account for systematic errors or deviations in the data. The weights and biases of the neurons are learned through a process called backpropagation, which optimizes a loss function using gradient descent or one of its variations. For more information on backpropagation and gradient descent, refer to~\citet{Goodfellow.etal/2016}.
The mean squared error (MSE) loss function is commonly used for regression problems, where the goal is to predict continuous values. By minimizing the MSE, the NN aims to make its predictions as close as possible to the actual values.
\begin{definition}[Mean Squared Error]\label{def:mse}
Given a prediction $\hat{y}$ and the corresponding target value $y$, the MSE loss is computed as the mean of the squared differences between the prediction and the target:
$$\text{MSE}(\hat{y}, y) = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2,$$
where $\hat{y}_i$ represents the $i$-th element of the prediction vector $\hat{y}$, $y_i$ represents the $i$-th element of the target vector $y$, and $N$ is the total number of elements in the vectors.
\end{definition}
The binary cross-entropy (BCE) loss function is commonly used for binary classification problems, where the goal is to predict a binary outcome, or multi-label classification problems where there can be more than one correct outcome~\cite{Tsoumakas.etal/2007}.
\begin{definition}[Binary Cross-Entropy]\label{def:bce}
Given a prediction $\hat{y}$ and the corresponding binary target value $y$, the BCE loss is computed as the average of the element-wise cross-entropy between the prediction and the target:
$$\text{BCE}(\hat{y}, y) = -\frac{1}{N} \sum_{i=1}^{N} \left[y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)\right],$$
where $\hat{y}_i$ represents the $i$-th element of the prediction vector $\hat{y}$, $y_i$ represents the $i$-th element of the target vector $y$, and $N$ is the total number of elements in the vectors.
\end{definition}
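The following Python sketch evaluates both losses on small illustrative vectors, matching the two definitions above.
\begin{verbatim}
# Numeric sketch of the MSE and BCE losses (illustrative values).
import math

def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def bce(pred, target, eps=1e-12):
    # eps guards against log(0) for saturated predictions.
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for p, t in zip(pred, target)) / len(target)

print(mse([2.5, 0.0], [3.0, 0.0]))   # 0.125
print(bce([0.9, 0.2], [1, 0]))       # ~0.164
\end{verbatim}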
In this study, we use a feedforward NN with residual blocks.
%\begin{definition}[Feedforward Neural Network]
Let us consider a feedforward NN with $L$ layers. For simplicity, assume each layer has the same number of neurons, denoted by $N$. The input to the network is represented by $x$, and the output is denoted by $y$. The activation function of a neuron in the $l$-th layer is denoted as $a^l(\cdot)$, and the weight connecting the $i$-th neuron in layer $l-1$ to the $j$-th neuron in layer $l$ is denoted as $w^{l}_{ij}$. The bias of the $j$-th neuron in layer $l$ is denoted as $b^{l}_{j}$. The pre-activation of the $j$-th neuron in layer $l$ is given by:
$$z^{l}_{j} = \sum_{i=1}^{N} w^{l}_{ij} a^{l-1}_{i} + b^{l}_{j}$$
The activation of the $j$-th neuron in layer $l$ is then computed as:
$$a^{l}_{j} = a^{l}(z^{l}_{j})$$
%\end{definition}
\cref{fig:neural-network} shows an example of a simple fully-connected NN. Common activation functions include sigmoid, rectified linear units (ReLU), and tanh~(\cref{fig:activation-functions}).
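The following Python sketch (NumPy, with ReLU activations and illustrative layer sizes; it is not our actual model) shows how the activations of such a fully-connected network are computed layer by layer according to the equations above.
\begin{verbatim}
import numpy as np

def relu(z: np.ndarray) -> np.ndarray:
    return np.maximum(0.0, z)

def forward(x: np.ndarray, layers) -> np.ndarray:
    """Forward pass of a fully-connected feedforward NN.

    Each layer l is a pair (W, b); its activation is a^l = relu(W @ a^{l-1} + b),
    i.e., W[j, i] plays the role of w^l_{ij} and b[j] of b^l_j."""
    a = x
    for W, b in layers:
        a = relu(W @ a + b)
    return a

# Illustrative network: 3 inputs, one hidden layer with 4 neurons, 2 outputs.
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(4, 3)), np.zeros(4)),
          (rng.normal(size=(2, 4)), np.zeros(2))]
print(forward(np.array([1.0, 0.0, 1.0]), layers))
\end{verbatim}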
\begin{figure}[tb]
\caption[A fully-connected neural network]{A neural network with $n$ input neurons, two hidden layers with $m$ neurons, and an output layer with $k$ neurons.}
\vspace{\baselineskip}
\centering
\begin{tikzpicture}[x=1.5cm,y=1.0cm,scale=0.8]
\foreach\i in {1,2,3}
\node[input] (x\i) at (0,-1.5*\i) {$x_\i$};
\node[input] (x4) at (0,-7) {$x_n$};
\foreach\i in {1,2,3,4}
\node[hidden] (w_1\i) at (2.5,0.75-1.5*\i) {$w_\i^{(1)}$};
\node[hidden] (w_15) at (2.5,-7.75) {$w_m^{(1)}$};
\foreach\i in {1,2,3,4}
\node[hidden] (w_2\i) at (5.0,0.75-1.5*\i) {$w_\i^{(2)}$};
\node[hidden] (w_25) at (5.0,-7.75) {$w_m^{(2)}$};
\foreach\i in {1,2,3}
\node[output] (y\i) at (7,-1.5*\i) {$y_\i$};
\node[output] (y4) at (7,-7) {$y_k$};
% connections
\foreach\i in {1,2,3,4,5} \foreach\j in {1,2,3,4}
{
\draw[black,thick] (w_1\i) -- (x\j);
\draw[black,thick] (w_2\i) -- (w_1\j);
\draw[black,thick] (w_2\i) -- (y\j);
}
% dots
\foreach\i/\j in {x3/x4,w_14/w_15,w_24/w_25,y3/y4}
\node at ($(\i)!0.5!(\j)$) {\strut$\vdots$};
\end{tikzpicture}
\label{fig:neural-network}
\end{figure}
\begin{figure}[tb]
\caption[Common activation functions]{Sigmoid, ReLU, and tanh activation functions.}
\vspace{\baselineskip}
\centering
\includegraphics[scale=1.0]{img/sigmoid-relu-tanh}
\label{fig:activation-functions}
\end{figure}
The data ``fed'' to an NN is typically divided into two sets. The training set is used to train the model by adjusting its parameters based on input samples and corresponding target values. The validation set is used to evaluate the performance of the model and detect overfitting during training, which occurs when a model excessively fits the training data but fails to generalize to unseen data.
An epoch is one complete pass over the entire training dataset: each sample is propagated forward through the network, the loss is computed, and the weights are updated by backpropagation.
Early stopping, or patience, halts training if there is no improvement in a chosen metric for a specified number of consecutive epochs, preventing overfitting and preserving the performance of the model at the point of its best validation metric.
\begin{definition}[Early Stop]
Let $M$ represent the chosen metric (e.g., validation loss, where lower values are better) and let $n$ denote the number of consecutive epochs without improvement that is tolerated. At epoch $t > n$, training is stopped if none of the last $n$ epochs improved on the best value observed before them, i.e., if
$$M(t') \geq \min\{M(1), M(2), \ldots, M(t-n)\} \quad \text{for all } t' \in \{t-n+1, \ldots, t\},$$
where $M(t)$ represents the metric value at epoch $t$.
\end{definition}
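A minimal Python sketch of this stopping rule, assuming a metric where lower is better and a list of per-epoch values (function and variable names are ours), could look as follows:
\begin{verbatim}
def should_stop(history: list, patience: int) -> bool:
    """Stop if none of the last `patience` epochs improved on the best
    metric value observed before them (lower is better)."""
    if len(history) <= patience:
        return False
    best_before = min(history[:-patience])
    return min(history[-patience:]) >= best_before

# Example: best value 0.30 at epoch 3, no improvement in the last two epochs.
print(should_stop([0.50, 0.40, 0.30, 0.31, 0.32], patience=2))  # True
\end{verbatim}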
During training, data can be fed in batches, which are subsets of training samples processed together in each epoch iteration. This method enhances computational efficiency by simultaneously processing multiple samples, instead of updating network weights after each training sample (which is inefficient). Larger batch sizes enable parallel processing but demand more memory, while smaller batch sizes consume less memory but may lead to slower training convergence due to more frequent weight updates.
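As an illustration of this batching scheme (a generic sketch, not tied to any particular framework; the names are ours), the samples of one epoch can be served in shuffled mini-batches as follows:
\begin{verbatim}
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Yield shuffled (inputs, targets) mini-batches covering one epoch."""
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

rng = np.random.default_rng(0)
X = np.arange(10, dtype=float).reshape(10, 1)
y = np.arange(10, dtype=float)
for xb, yb in iterate_minibatches(X, y, batch_size=4, rng=rng):
    print(xb.shape, yb.shape)  # batches of 4, 4, and 2 samples
\end{verbatim}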
Standard NNs may suffer from the vanishing gradient problem~\cite{Hochreiter/1991}. The gradients tend to diminish as they propagate backward through multiple layers, making it challenging for earlier layers to learn meaningful representations. This problem hampers the optimization process and restricts the overall performance of the network.
Residual Neural Networks (ResNets)~\cite{He.etal/2016} can reduce the vanishing gradient problem. ResNets use skip connections that bypass layers, allowing the information to flow directly from one layer to another. This bypassing mechanism mitigates the vanishing gradient problem and facilitates the training of deep networks.
\cref{fig:residual-block} shows a visual representation of a small \emph{residual block} with a skip connection.
\begin{definition}[Skip Connection]
Let us consider a ResNet architecture with $L$ layers. The output of the $l$-th layer is denoted by $a^{l}$, and the output of the previous layer is denoted by $a^{l-1}$. The residual connection between the $l$-th and $(l-1)$-th layers can be represented as:
$$a^{l} = a^{l-1} + \mathcal{F}(a^{l-1}, W^{l}),$$
where $\mathcal{F}$ represents a residual function, typically implemented as a fully connected layer, and $W^{l}$ denotes the learnable parameters of this function.
\end{definition}
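The following NumPy sketch (illustrative, with ReLU activations and a two-layer residual function as in \cref{fig:residual-block}; it is not our actual model) shows how the skip connection adds the block input directly to $\mathcal{F}(x)$:
\begin{verbatim}
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def residual_block(x, W1, b1, W2, b2):
    """Residual block: relu(x + F(x)), where F is a two-layer transformation.

    The skip connection lets the input x bypass F, so gradients can flow
    through the identity path during backpropagation."""
    f = W2 @ relu(W1 @ x + b1) + b2  # F(x)
    return relu(x + f)

# Illustrative block over 4-dimensional activations.
rng = np.random.default_rng(0)
x = rng.normal(size=4)
print(residual_block(x, rng.normal(size=(4, 4)), np.zeros(4),
                        rng.normal(size=(4, 4)), np.zeros(4)))
\end{verbatim}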
\begin{figure}[tb]
\caption[A residual block]{A residual block with two hidden layers.}
%\vspace{\baselineskip}
\centering
\begin{tikzpicture}[scale=1.0]
\node[fill=myorange!30] (l1) {layer 1};
\node[blue!50!black, right=of l1, label={below:activation}] (act1) {$a(x)$};
\node[fill=myblue!30, right=of act1] (l2) {layer 2};
\node[right=of l2, font=\Large, label={below:add}, inner sep=0, pin={60:$x + \mathcal F(x)$}] (add) {$\oplus$};
\node[blue!50!black, right=of add, label={below:activation}] (act2) {$a(x)$};
\draw[->] (l1) -- (act1);
\draw[->] (act1) -- (l2);
\draw[<-] (l1) -- ++(-2,0) node[below, pos=0.8] {$x$};
\draw[->] (l2) -- (act2) node[above, pos=0.8] {};
\draw[->] ($(l1)-(1.5,0)$) to[out=90, in=90] node[below=1ex, midway, align=center] {skip connection\\(identity)} node[above, midway] {$x$} (add);
\draw[decorate, decoration={brace, amplitude=1ex, raise=1cm}] (l2.east) -- node[midway, below=1.2cm] {$\mathcal F(x)$} (l1.west);
\end{tikzpicture}
\label{fig:residual-block}
\end{figure}
\section{Learning Heuristic Functions}
\label{sec:related-h}
There have been several studies on learning heuristic functions to solve planning tasks. These studies can be categorized into two primary approaches.
The first approach heavily relies on domain logic to generate samples and instantiate networks. Its objective is to generalize across domains or a planning formalism.
On the other hand, the second approach minimally uses domain logic for generating samples and focuses on generalization within a state space.
The first approach uses structured NNs with architectures specifically designed to align with the characteristics and requirements of the given input task. Examples include learning domain-independent heuristic functions with hypergraph NNs~\cite{Shen.etal/2020}, and learning policies with action schema networks~\cite{Toyer.etal/2018,Toyer.etal/2020} and graph NNs~\cite{Stahlberg.etal/2022}. These approaches are generally applicable to state spaces that differ from the ones they were originally trained on. Their major limitations are the heavy reliance on domain logic and, in some cases, a significant computational overhead, as with action schema networks. Moreover, in domain-independent approaches such as hypergraph NNs, where the network can be applied to a domain different from the one it was originally trained on, the search performance is considerably inferior to logic-based heuristics in terms of coverage (number of solved tasks within a given time limit).
The second approach uses supervised learning to train a feedforward NN with pairs of states and cost-to-goal estimates. Samples can be generated with forward search from the initial state~\cite{Ferber.etal/2020a}, backward search from the goal~\cite{Yu.etal/2020,OToole/2022,Bettker.etal/2022}, or a combination of both~\cite{Ferber.etal/2022}.
This set of approaches requires fewer computational resources but is limited to the specific state space it was trained on. Search performance is also a problem, since these approaches typically solve fewer tasks than simple logic-based heuristics such as goal-count. An advantage is that this set of approaches relies on domain logic to a lesser extent, and only during sample generation, e.g., to determine applicable operators and mutexes.
Regarding sampling by backward search,~\citet{Yu.etal/2020} use depth-first search,~\citet{OToole/2022} use random walks, and~\citet{Bettker.etal/2022} use a combination of breadth-first search and random walks. In these cases, the cost-to-goal estimates are assigned as the lowest distance from the goal at which the states were generated.~\citet{Ferber.etal/2022} use a combination of backward and forward searches, using bootstrap learning~\cite{Arfaee.etal/2011}.
Specifically,~\citet{Bettker.etal/2022} introduce several cost-to-goal improvement methods to enhance sample quality and emphasize that the effectiveness of learned heuristics relies on two crucial factors: a sample set that encompasses diverse regions of the state space and accurate cost-to-goal estimates; one without the other is insufficient.
In this study, we aim to use minimal domain logic. As such, we use learned heuristic functions and learned preferred operators in most experiments. In particular, we follow the second set of approaches described previously. In the following sections, we introduce relevant concepts for the proposed approach to compute sample-based preferred operators in~\cref{cha:proposed-approach}.
\subsection{Generating Samples with FSM}
\label{sec:sample-learn-h}
To learn heuristic functions, sampled states are labeled with a cost-to-goal estimate. In regression-based methods, the value assigned to a sampled state is determined by its distance to the goal condition. In this work, we generate samples by regression, expanding partial states in the backward state space \bsp. Precisely, we follow the best-performing approach used by~\citet{Bettker.etal/2022}, using a combination of breadth-first search (BFS) with multiple random walk (RW) rollouts, named ``focused sampling method''~(\bfsrw), which aims to achieve good coverage near the goal (\bfs) while obtaining a diverse set of samples from the remaining state space (\rw).
A regression rollout refers to a sequence of partial state expansions, concluding under two conditions: when the last expanded state has no predecessors or when it reaches the regression limit (depth) $L$. The process of generating samples halts once the desired number of samples $N$ is reached. Random walks can have multiple rollouts due to the regression limit $L$, whereas BFS has a single rollout. Within a single RW rollout, repeated states are neither sampled nor expanded, to avoid cycles during the backward search, and the current rollout ends early if only repeated states are available to continue the search. However, repeated states are permitted across rollouts.
\citet{Bettker.etal/2022} set the regression limit to $L=\ceil{\facts/\overline{\eff}}$, where $\overline{\eff}=\sum_{o\in \mathcal{O}} |\eff(o)|/|\mathcal{O}|$, i.e., the number of facts per mean number of effects in the operators of the input task. Unlike using a fixed $L$~\cite{Yu.etal/2020, OToole/2022}, this adaptive $L$ aims to approximate the longest distance $d^{*}$ between the goal condition and any potential initial state.
\bfsrw consists of two phases. The first phase uses BFS to generate a fraction $p_\bfsrw$ of the $N$ samples. The BFS expands a state from layer $k$ and generates $n$ states from layer $k+1$. These generated states are sampled only if the resulting number of BFS samples does not exceed $p_{\bfsrw}N$; otherwise, no states are sampled, and BFS expands another state. The set of unexpanded states, i.e., leaves, is denoted as $Q$. The second phase involves multiple random walk rollouts starting from randomly selected states in $Q$. This process continues until the sample set reaches $N$ samples. During the random walk phase, states already sampled in the BFS phase are not sampled again.
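The following Python sketch illustrates the two phases over an abstract predecessor function; it is a simplification of \bfsrw (mutexes, hashing of partial states, the exact leaf-replacement policy, and duplicate handling across rollouts, which \sai later resolves, are omitted, and all names are ours):
\begin{verbatim}
import random
from collections import deque

def bfs_rw(goal, pred, N, L, p, seed=0):
    """Simplified two-phase sampling: BFS near the goal, then random-walk
    rollouts from the BFS leaves.  `pred(s)` returns the predecessors of s in
    the backward space; the result maps each sampled state to its depth
    (cost-to-goal estimate)."""
    rng = random.Random(seed)
    # Phase 1: breadth-first search collects at most p*N samples.
    samples = {goal: 0}
    queue, leaves = deque([goal]), {goal}
    while queue and len(samples) < int(p * N):
        s = queue.popleft()
        new = [t for t in pred(s) if t not in samples]
        if len(samples) + len(new) > int(p * N):
            continue  # skip this expansion to respect the BFS budget
        for t in new:
            samples[t] = samples[s] + 1
            queue.append(t)
            leaves.add(t)
        if new:
            leaves.discard(s)
    # Phase 2: random-walk rollouts of depth at most L until N samples exist.
    leaves, attempts = list(leaves), 0
    while len(samples) < N and leaves and attempts < 100 * N:
        attempts += 1
        s = rng.choice(leaves)
        d, seen = samples[s], {s}
        while d < L and len(samples) < N:
            options = [t for t in pred(s) if t not in seen]
            if not options:
                break  # only repeated states are available; end this rollout
            s = rng.choice(options)
            d += 1
            seen.add(s)
            samples.setdefault(s, d)
    return samples

# Toy usage on a chain state space where pred(i) = [i + 1]:
print(bfs_rw(goal=0, pred=lambda i: [i + 1] if i < 20 else [], N=10, L=12, p=0.5))
\end{verbatim}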
After finishing regression,~\citet{Bettker.etal/2022} improve the cost-to-goal estimates of each sampled partial state, and complete them to full states~(\cref{sec:sample-completion}).
Generally, accurate cost-to-goal estimates result in improved learned heuristics, leading to fewer expanded states during a search. However, this correlation is not always guaranteed~\cite{Holte/2010}. To enhance the cost-to-goal estimates of each sampled state,~\citet{Bettker.etal/2022} developed two methods used in this study. The first method, \sai~(\cref{sec:sample-sai}), minimizes estimates across repeated samples, while the second method, \sui~(\cref{sec:sample-sui}), minimizes estimates across the successors of samples.
%\begin{algorithm}
%\caption{Sampling states for heuristic values using \bfsrw}
%\begin{algorithmic}[1]
%\Procedure{\bfsrw}{$\Pi$, $N$, $L$, $p_\bfsrw$}
%\State \textbf{Phase 1}
%\State Initialize empty set $Q$
%\State Initialize empty set $S_{bfs}$
%\State $S_{bfs}$, $Q \gets$ BFSLimited($s_{0}$, $p_{\bfsrw}N$)
%
%\State Initialize empty set $S_{rw}$
%\State \textbf{Phase 2}
%\State Initialize Boolean array $V$ of size $|Q|$ with all indexes set to $False$
%\While{$|S| < N$}
% \State $rnd\_idx \gets$ random index of $Q$ where $V$[$rnd\_idx$]$ = False$
% \State $V$[$rnd\_idx$]$~\gets True$
% \If{all positions in $V$ are $True$}
% \State Set all positions in $V$ to $False$~\emph{\# ``Replacement''}
% \EndIf
% \State Initialize a partial state $s$
% \State $s \gets$ $Q$[$rnd\_idx$]
% \State $n_{rw} \gets min(L - h(s), N - |S|)$ \# \emph{$n$ states to sample in this RW rollout}
% \State $S_{rw} \gets$ RW($s$, $n_{rw}$) \# \emph{Random walk sampling starting from $s$}
% \State $S \gets S \cup (S_{rw} \setminus S_{bfs}) $
%\EndWhile
%\State \textbf{return} $S$
%\EndProcedure
%\end{algorithmic}
%\end{algorithm}
%\begin{algorithm}
%\caption{Breadth-first search of \bfsrw}
%\begin{algorithmic}[1]
%\Procedure{BFSLimited}{$s_{0}$, $M$}
%\State Initialize set $Q$ with $s_{0}$
%\State Initialize set $K$ with $s_{0}$ \emph{\# i.e., layer $k$}
%\State Initialize empty set $K_{+1}$ \emph{\# i.e., layer $k+1$}
%\State Initialize set $S_{bfs}$ with $s_{0}$
%\While{$|S_{bfs}| < M$ and $K$ is not empty}
% \State Shuffle $K$
% \ForAll{partial states $s$ of $K$}
% \State Initialize empty set $P_{s}$
% \State $P_{s} \gets \{s' \mid s' \in pred_{bfs}(s)~\text{and}~s' \notin S_{bfs}\}$
% \If{$|S_{bfs}| + |P_{s}| \le M$}
% \State $S_{bfs} \gets S_{bfs} \cup P_{s} $
% \State $K_{+1} \gets K_{+1} \cup P_{s} $
% \State $Q \gets (Q \cup P_{s}) \setminus \{s\} $
% \EndIf
% \If{$|S_{bfs}| = M$}
% \State break
% \EndIf
% \EndFor
% \State $K \gets K_{+1}$
% \State $K_{+1} \gets \emptyset$
%\EndWhile
%\State \textbf{return} $S_{bfs}$, $Q$
%\EndProcedure
%\end{algorithmic}
%\end{algorithm}
\subsection{Sample Completion}
\label{sec:sample-completion}
Regression sampling produces a set of partial states with undefined variables. However, the NN is trained on complete states and, during the search, expects complete states as input.
Each partial state can be completed by assigning a value $s(v)\in\dom(v)$ to all fact pairs $(v,s(v))$ where $s(v)=\bot$.
Like~\citet{Bettker.etal/2022} and~\citet{Ferber.etal/2022}, we complete partial states by assigning a random value $s(v) \in \dom(v)$ to each undefined variable, ensuring that the assigned values satisfy the mutexes derived from the planning task. We process the undefined variables in random order and assign them random values that do not violate the mutexes. The undefined variables are set to false if we cannot complete the state after $10$\,K attempts. We use the mutexes automatically deduced by Fast Downward~\cite{Helmert/2006,Helmert/2009}.
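A simplified Python sketch of this completion procedure is shown below (partial states are dictionaries mapping variables to values, with \texttt{None} for undefined variables; mutexes are given as pairs of facts; the restart policy and all names are illustrative assumptions of ours):
\begin{verbatim}
import random

def complete_state(partial, domains, mutexes, rng, max_tries=10_000):
    """Assign random values to undefined variables (None) without violating
    any mutex; restart from scratch up to max_tries times."""
    def violates(state):
        facts = set(state.items())
        return any(a in facts and b in facts for a, b in mutexes)

    for _ in range(max_tries):
        state = dict(partial)
        undefined = [v for v in domains if state.get(v) is None]
        rng.shuffle(undefined)
        ok = True
        for v in undefined:
            values = [x for x in domains[v] if not violates({**state, v: x})]
            if not values:
                ok = False
                break
            state[v] = rng.choice(values)
        if ok:
            return state
    return None  # no mutex-consistent completion found within max_tries

# Toy usage: the mutex forbids (v1, 'a') together with (v2, 'x').
rng = random.Random(0)
domains = {'v1': ['a', 'b'], 'v2': ['x', 'y']}
mutexes = [(('v1', 'a'), ('v2', 'x'))]
print(complete_state({'v1': None, 'v2': 'x'}, domains, mutexes, rng))
\end{verbatim}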
%\subsection{Improving Cost-to-Goal Estimates}
%\label{sec:sample-improving-h}
\subsection{Sample Improvement}
\label{sec:sample-sai}
It is possible to generate duplicate states with different estimates due to multiple random walk rollouts. Thus, we update the cost-to-goal estimate for each sampled state $s$ to the minimum estimate $h(s) = \min\{h_i \mid s=s_i, i\in[N]\}$.
This procedure is called ``sample improvement''~(\sai) and is applied twice: first to the partial states after regression is complete, and then to the completed states, since two distinct partial states can be completed to the same state.
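The core of \sai can be sketched in a few lines of Python (states are assumed to be hashable, e.g., tuples of facts; names are ours):
\begin{verbatim}
def sample_improvement(samples):
    """Keep the minimum cost-to-goal estimate among duplicate states.
    `samples` is a list of (state, h) pairs."""
    best = {}
    for state, h in samples:
        best[state] = min(h, best.get(state, h))
    return best

# Example: the same state sampled at depths 4 and 2 keeps the estimate 2.
print(sample_improvement([(('a', 'b'), 4), (('a', 'b'), 2), (('c',), 3)]))
\end{verbatim}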
\subsection{Successor Improvement}
\label{sec:sample-sui}
Sampling neighboring states of the state space is also common, especially for states close to the goal. The following approach leverages this to enhance the cost-to-goal estimates, starting by constructing a graph $G$ that captures the relations between all partial states in the sample set.
\begin{definition}[Sample Set Graph]\label{def:graph}
A sample set graph is a directed and labeled graph $G=(V,A)$ defined by the states obtained in the sampling. Each vertex is labeled by a sampled partial state, i.e., $V = \{s_i \mid i \in [N]\}$. (Note that the undefined variables of a partial state can typically be completed in multiple ways, so each vertex represents a set of complete states.) Set $A$ contains an arc $(s,t)$ for each operator $o \in \mathcal{O}$ where $\sucs(s,o) \subseteq t$.
\end{definition}
With partial states generated by regression, we can always find at least one successor, except for the goal $s^{*}$. We compute the shortest paths to the goal in the graph $G$ using Dijkstra's algorithm. Afterward, we update the cost-to-goal estimates of each state accordingly. This process is called ``successor improvement'' (\sui) and is applied after regression, before completion. \cref{fig:g-example} illustrates how \sui enhances cost-to-goal estimates in practice.
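The update step of \sui can be sketched in Python as follows (a simplification: the sample set graph is assumed to be already given as an adjacency mapping, operator costs are provided by a \texttt{cost} function with unit costs by default, and all names are ours):
\begin{verbatim}
import heapq

def successor_improvement(h, arcs, goal, cost=lambda s, t: 1):
    """Lower the cost-to-goal estimates h using shortest goal distances in the
    sample set graph G.  `arcs[s]` lists the sampled successors of s; distances
    are computed backwards from the goal with Dijkstra's algorithm."""
    rev = {s: [] for s in h}                     # reverse the arcs
    for s, succs in arcs.items():
        for t in succs:
            rev.setdefault(t, []).append(s)
    dist, queue = {goal: 0}, [(0, goal)]
    while queue:
        d, t = heapq.heappop(queue)
        if d > dist.get(t, float('inf')):
            continue
        for s in rev.get(t, []):
            nd = d + cost(s, t)
            if nd < dist.get(s, float('inf')):
                dist[s] = nd
                heapq.heappush(queue, (nd, s))
    # A state never receives a worse estimate than it already has.
    return {s: min(h[s], dist.get(s, float('inf'))) for s in h}

# Toy usage mirroring the figure: the arc s3 -> s7 lowers h(s3) to 2 and,
# through s3, h(s1) to 3.
h = {'s1': 5, 's3': 4, 's7': 1, 'goal': 0}
arcs = {'s1': ['s3'], 's3': ['s7'], 's7': ['goal']}
print(successor_improvement(h, arcs, 'goal'))
\end{verbatim}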
\begin{figure}[htb]
\centering
\caption[Example sample set graph with \sui]{Constructing the sample set graph $G$ and updating the $h$-values of sampled states. The $h$-value of each state is in parentheses.}
\vspace{\baselineskip}
\begin{subfigure}{0.60\linewidth}
\centering
\resizebox{\linewidth}{!}{%
\begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,node distance=3cm,semithick]
\tikzstyle{state}=[circle,draw,minimum size=1cm]
\node[state] (s1) {\footnotesize $s_1~(5)$};
\node[state] (s2) [right of=s1] {\footnotesize $s_2~(4)$};
\node[state] (s3) [below right of=s1] {\footnotesize $s_3~(4)$};
\node[state] (s4) [below left of=s3] {\footnotesize $s_4~(3)$};
\node[state] (s5) [right of=s4] {\footnotesize $s_5~(2)$};
\node[state] (s6) [right of=s2] {\footnotesize $s_6~(3)$};
\node[state] (s7) [right of=s3] {\footnotesize $s_7~(1)$};
\node[state] (s8) [below right of=s6] {\footnotesize $s_8~(2)$};
\node[state] (s9) [right of=s8] {\footnotesize $s_9~(1)$};
\node[state] (s10) [below left of=s9] {\footnotesize $s^*~(0)$};
\path (s1) edge [draw=myorange] node[above] {} (s2);
\path (s1) edge [draw=myblue] node[left] {} (s3);
%\path (s1) edge [bend left=70] node[above] {\footnotesize $o_{3}$} (s8);
\path (s2) edge [draw=myorange] node[above] {} (s6);
%\path (s3) edge node[below] {\footnotesize $o_{8}$} (s7);
\path (s3) edge [draw=myblue] node[left] {} (s4);
\path (s4) edge [draw=myblue] node[below] {} (s5);
\path (s5) edge [draw=myblue] node[right] {} (s7);
\path (s6) edge [draw=myorange] node[right] {} (s8);
%\path (s6) edge node[above] {} (s3);
%\path (s7) edge node[above] {} (s8);
\path (s7) edge [draw=myblue] node[below] {} (s10);
\path (s8) edge [draw=myorange] node[above] {} (s9);
\path (s9) edge [draw=myorange] node[right] {} (s10);
\end{tikzpicture}%
}
\vspace{\baselineskip}
\caption{Collected samples from two regression rollouts (blue and orange).}
\label{fig:subfig1}
\end{subfigure}
\begin{subfigure}{0.60\linewidth}
\centering
\resizebox{\linewidth}{!}{%
\begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,node distance=3cm,semithick]
\tikzstyle{state}=[circle,draw,minimum size=1cm]
\node[state] (s1) {\footnotesize $s_1~(5)$};
\node[state] (s2) [right of=s1] {\footnotesize $s_2~(4)$};
\node[state] (s3) [below right of=s1] {\footnotesize $s_3~(4)$};
\node[state] (s4) [below left of=s3] {\footnotesize $s_4~(3)$};
\node[state] (s5) [right of=s4] {\footnotesize $s_5~(2)$};
\node[state] (s6) [right of=s2] {\footnotesize $s_6~(3)$};
\node[state] (s7) [right of=s3] {\footnotesize $s_7~(1)$};
\node[state] (s8) [below right of=s6] {\footnotesize $s_8~(2)$};
\node[state] (s9) [right of=s8] {\footnotesize $s_9~(1)$};
\node[state] (s10) [below left of=s9] {\footnotesize $s^*~(0)$};
\path (s1) edge node[above] {} (s2);
\path (s1) edge node[left] {} (s3);
\path (s1) edge [draw=myred] [bend left=70] node[above] {} (s8);
\path (s2) edge node[above] {} (s6);
\path (s3) edge [draw=myred] node[below] {} (s7);
\path (s3) edge node[left] {} (s4);
\path (s4) edge node[below] {} (s5);
\path (s5) edge node[right] {} (s7);
\path (s6) edge node[right] {} (s8);
\path (s6) edge [draw=myred] node[above] {} (s3);
\path (s7) edge [draw=myred] node[above] {} (s8);
\path (s7) edge node[below] {} (s10);
\path (s8) edge node[above] {} (s9);
\path (s9) edge node[right] {} (s10);
\end{tikzpicture}%
}
\vspace{\baselineskip}
\caption{Constructed graph $G$ with new successor-predecessor relationships.}
\label{fig:subfig2}
\end{subfigure}
%\vspace{1cm} % Adjust the vertical spacing between the subfigures
\begin{subfigure}{0.60\linewidth}
\centering
\resizebox{\linewidth}{!}{%
\begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,node distance=3cm,semithick]
\tikzstyle{state}=[circle,draw,minimum size=1cm]
\node[state] (s1) {\footnotesize $s_1~(\textred{3})$};
\node[state] (s2) [right of=s1] {\footnotesize $s_2~(4)$};
\node[state] (s3) [below right of=s1] {\footnotesize $s_3~(\textred{2})$};
\node[state] (s4) [below left of=s3] {\footnotesize $s_4~(3)$};
\node[state] (s5) [right of=s4] {\footnotesize $s_5~(2)$};
\node[state] (s6) [right of=s2] {\footnotesize $s_6~(3)$};
\node[state] (s7) [right of=s3] {\footnotesize $s_7~(1)$};
\node[state] (s8) [below right of=s6] {\footnotesize $s_8~(2)$};
\node[state] (s9) [right of=s8] {\footnotesize $s_9~(1)$};
\node[state] (s10) [below left of=s9] {\footnotesize $s^*~(0)$};
\path (s1) edge node[above] {} (s2);
\path (s1) edge node[left] {} (s3);
\path (s1) edge [bend left=70] node[above] {} (s8);
\path (s2) edge node[above] {} (s6);
\path (s3) edge node[below] {} (s7);
\path (s3) edge node[left] {} (s4);
\path (s4) edge node[below] {} (s5);
\path (s5) edge node[right] {} (s7);
\path (s6) edge node[right] {} (s8);
\path (s6) edge node[above] {} (s3);
\path (s7) edge node[above] {} (s8);
\path (s7) edge node[below] {} (s10);
\path (s8) edge node[above] {} (s9);
\path (s9) edge node[right] {} (s10);
\end{tikzpicture}%
}
\vspace{\baselineskip}
\caption{Updating the cost-to-goal estimates according to the newly discovered relationships.}
\label{fig:subfig3}
\end{subfigure}
\label{fig:g-example}
\end{figure}
%
% - - - - - - - - - - - - - - - -- - - - - -
% proposed approach
% - - - - - - - - - - - - - - - -- - - - - -
% now we talk about the stuff we're doing
%
\chapter{Proposed Approach}
\label{cha:proposed-approach}
Our objective is to discover preferred operators from a sample set and use them to train an NN to predict preferred operators. In this chapter, we first introduce ideal preferred operators, then describe the proposed approach for identifying preferred operators within an existing sample set. Finally, we present a sampling algorithm designed to discover preferred operators.
\section{Ideal Preferred Operators}
\label{sec:sample-ideal-po}
Ideally, we want preferred operators that help solve a task with the least effort, which we call ideal preferred operators.
\begin{definition}[Ideal Preferred Operator]\label{def:ideal_preferred_operator}
An ideal preferred operator generates a state with the shortest distance to the goal among all successors. Given a state $s$ and an operator $o \in \mathcal{O}$ where $\sucs(s,o) = t$, $o$ is considered an ideal preferred operator if $\hstarp{t} = \min_{s' \in \sucs(s)} \hstarp{s'}$.
\end{definition}
Every state $s$ that has a plan is associated with at least one ideal preferred operator. These operators generate successor states $t$ where $\hstarp{t} < \hstarp{s}$. Thus, only goal states and dead-end states lack ideal preferred operators.
\cref{fig:pot-example} presents a visual example where $o_{4}$ is an ideal preferred operator of state $s_{1}$, since it leads to the successor state $s_{4}$ of minimum $h$-value among all successors of $s_{1}$.
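A direct Python sketch of \cref{def:ideal_preferred_operator} (illustrative; it assumes access to the perfect heuristic $h^{*}$, which is of course unavailable in practice, and all names are ours) is:
\begin{verbatim}
def ideal_preferred_operators(successors, h_star):
    """Return the operators whose successor has minimum h* among all successors.
    `successors` maps each applicable operator to the state it generates."""
    if not successors:
        return set()  # goal states and dead ends have no preferred operators
    best = min(h_star[t] for t in successors.values())
    return {o for o, t in successors.items() if h_star[t] == best}

# Example with the values of the figure: o4 leads to the successor s4 with
# the minimum h*-value among all successors of s1.
successors = {'o2': 's2', 'o3': 's3', 'o4': 's4', 'o5': 's5', 'o6': 's6'}
h_star = {'s2': 6, 's3': 8, 's4': 4, 's5': 10, 's6': 22}
print(ideal_preferred_operators(successors, h_star))  # {'o4'}
\end{verbatim}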
\begin{figure}[htb]
\caption[Example of an ideal preferred operator]{Example of an ideal preferred operator $o_{4}$ of state $s_{1}$. The $h$-value of each state is in parentheses.}
\vspace{\baselineskip}
\centering
\begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,node distance=3cm,semithick]
\tikzstyle{state}=[circle,draw,minimum size=1cm]
\node[state] (s1) {\footnotesize$s_{1}~(5)$} ;
\node[state] (s2) [right = of s1] {\footnotesize$s_{2}~(6)$} ;
\node[state] (s3) [below = of s2] {\footnotesize$s_{3}~(8)$} ;
\node[state] (s4) [below = of s1] {\footnotesize$s_{4}~(4)$} ;
\node[state] (s5) [left = of s4] {\footnotesize$s_{5}~(10)$} ;
\node[state] (s6) [left = of s1] {\footnotesize$s_{6}~(22)$} ;
\draw[->] (s1) -- (s4) node[midway,right] {\textred{$o_{4}$}} ;
\draw[->] (s1) -- (s2) node[midway,above] {$o_{2}$} ;
\draw[->] (s1) -- (s3) node[midway,right] {$o_{3}$} ;
\draw[->] (s1) -- (s5) node[midway,left] {$o_{5}$} ;
\draw[->] (s1) -- (s6) node[midway,above] {$o_{6}$} ;