LNAI 13796
Agostino Dovier Angelo Montanari Andrea Orlandini (Eds.)
AIxIA 2022 – Advances in Artificial Intelligence XXIst International Conference of the Italian Association for Artificial Intelligence AIxIA 2022, Udine, Italy, November 28 – December 2, 2022 Proceedings
Lecture Notes in Computer Science 13796

Lecture Notes in Artificial Intelligence

Founding Editor
Jörg Siekmann

Series Editors
Randy Goebel, University of Alberta, Edmonton, Canada
Wolfgang Wahlster, DFKI, Berlin, Germany
Zhi-Hua Zhou, Nanjing University, Nanjing, China
The series Lecture Notes in Artificial Intelligence (LNAI) was established in 1988 as a topical subseries of LNCS devoted to artificial intelligence. The series publishes state-of-the-art research results at a high level. As with the LNCS mother series, the mission of the series is to serve the international R&D community by providing an invaluable service, mainly focused on the publication of conference and workshop proceedings and post-proceedings.
Editors Agostino Dovier University of Udine Udine, Italy
Angelo Montanari University of Udine Udine, Italy
Andrea Orlandini National Research Council (CNR-ISTC) Rome, Italy
ISSN 0302-9743    ISSN 1611-3349 (electronic)
Lecture Notes in Artificial Intelligence
ISBN 978-3-031-27180-9    ISBN 978-3-031-27181-6 (eBook)
https://doi.org/10.1007/978-3-031-27181-6

LNCS Sublibrary: SL 7 – Artificial Intelligence

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023

Chapters “Approximate Inference in Probabilistic Answer Set Programming for Statistical Probabilities” and “MAP Inference in Probabilistic Answer Set Programs” are licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/). For further details see license information in the chapters.

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
This volume contains the proceedings of the 21st International Conference of the Italian Association for Artificial Intelligence, referred to for short as AIxIA. AIxIA is very active in organizing scientific initiatives as well as events for the dissemination of Artificial Intelligence in industry, society, and schools. Among these activities, an international scientific conference has been organized every two years since 1991 and then yearly since 2015. In the last two years, due to the COVID-19 pandemic, the conference was organized in remote mode in Torino and Milano (LNCS 13196 and 12414). Previously, it was organized as a standard conference in Rende (2019: LNCS 11946), Trento (2018: LNCS 11298), Bari (2017: LNCS 10640), Genova (2016: LNCS 10037), Ferrara (2015: LNCS 9336), Torino (2013: LNCS 8249), Palermo (2011: LNCS 6934), Reggio Emilia (2009: LNCS 5883), Roma (2007: LNCS 4733), Milano (2005: LNCS 3673), Pisa (2003: LNCS 2929), Bari (2001: LNCS 2175), Bologna (1999: LNCS 1792), Roma (1997: LNCS 1321), Firenze (1995: LNCS 992), Torino (1993: LNCS 728), Palermo (1991: LNCS 549), and Trento (1989).

The recent positive evolution of the pandemic allowed us to take the risk of organizing an in-person conference. This seems to have been much appreciated by the community, as more than 350 people attended the meeting. As for the numbers, 54 research papers were submitted to the conference, and each was evaluated by at least three reviewers; moreover, 29 discussion papers were submitted and 16 were selected for presentation at the conference. 227 authors were involved: 155 from Italy, 13 from France, 12 from the USA, 8 from India, 7 from the UK, and 32 from other countries. Among the regular papers, 33 were selected for publication in these proceedings.
The conference program included two prestigious keynote speakers: Subbarao Kambhampati (Arizona State University), “Symbols as a Lingua Franca for Supporting Human-AI Interaction for Explainable and Advisable AI Systems”, and Georg Gottlob (University of Oxford), “My Adventures with Datalog: Walking the Thin Line Between Theory and Practice” (with a paper also included in the proceedings). In addition, it offered three tutorials on hot research topics: Ferdinando Fioretto (Syracuse University), “End-to-end Constrained Optimization Learning”; Antonio Lieto (University of Turin), “Cognitive Design for Artificial Minds”; and Angelo Oddi, Riccardo Rasconi (CNR-ISTC, Rome), and Marco Baioletti (University of Perugia), “Quantum Computing and Planning”.

AIxIA 2022 also covered many aspects of theoretical and applied AI through 17 co-located workshops devoted to specific topics and bringing together the corresponding AI communities. The workshop chairs were Andrea Formisano and Alberto Finzi. Thus, in parallel to the main program, the conference featured the following workshops, for a total of 175 accepted regular papers, plus a number of invited talks:

– 6th Workshop on Advances in Argumentation in Artificial Intelligence;
– 11th Workshop on Machine Learning and Data Mining;
– 4th Workshop on Artificial Intelligence and fOrmal VERification, Logic, Automata, and sYnthesis;
– 1st Workshop on Artificial Intelligence for Cultural Heritage;
– 1st Workshop on Artificial Intelligence and Creativity;
– 3rd Italian Workshop on Artificial Intelligence for an Ageing Society;
– R.i.C.e.R.c.A: RCRA Incontri E Confronti;
– 10th Italian Workshop on Planning and Scheduling;
– 9th Italian Workshop on Artificial Intelligence and Robotics;
– 1st Workshop on Artificial Intelligence for Healthcare;
– 6th Workshop on Natural Language for Artificial Intelligence;
– 3rd Workshop on Explainable Artificial Intelligence;
– 1st Workshop on Artificial Intelligence for Human Machine Interaction;
– 1st Workshop on Artificial Intelligence for Public Administration;
– 2nd Italian Workshop on Artificial Intelligence and Applications for Business and Industries;
– 1st Workshop on Bias, Ethical AI, Explainability and the role of Logic and Logic Programming;
– 1st Workshop on Strategies, Prediction, Interaction, and Reasoning in Italy.

Finally, a doctoral consortium with 20 presentations from PhD students was organized on the first day of the conference. The doctoral consortium chairs were Gabriella Cortellessa and Luca Di Gaspero.

The organization benefited from “Platinum” sponsorships from EUSTEMA, Danieli Automation, Generali, Intesa Sanpaolo, and TechEdge, “Gold” sponsorships from OverIT, Previnet, and u-blox, and “Bronze” sponsorships from BeanTech, SMC, and Confindustria Udine. The conference was kindly supported by the Artificial Intelligence Journal, and received the patronage of the European Commission, the Friuli Venezia Giulia Region, and the Municipality of Udine. A special session devoted to industry and AI was organized by Giuseppe Serra and Fabio Mercorio. Last but not least, we would like to thank the organizing committee, in particular Andrea Brunello and Nicola Saccomanno, who did a huge amount of high-quality work.
Moreover, we thank the webmaster Nicola Gigante, our colleagues and friends Dario Della Monica and Gabriele Puppis, and all the PhD students for their help in the practical management of the conference. Finally, we thank the board of directors of AIxIA for their constant support, the Rector of the University of Udine for the opportunity to organize the conference in the new building of the scientific library, and the technical staff of the University of Udine (in particular, Renato Spoletti, Stefano Bonomi, and Ester Orlandi) for their precious work.

December 2022
Agostino Dovier Angelo Montanari Andrea Orlandini
Organization
General Chair

Angelo Montanari, University of Udine, Italy

Program Committee Chairs

Agostino Dovier, University of Udine, Italy
Andrea Orlandini, National Research Council (CNR-ISTC), Italy
Program Committee

Davide Bacciu, University of Pisa, Italy
Marco Baioletti, University of Perugia, Italy
Matteo Baldoni, University of Turin, Italy
Stefania Bandini, Complex Systems & AI Research Center, Italy
Adriano Barra, University of Salento, Italy
Sebastiano Battiato, University of Catania, Italy
Stefano Bistarelli, University of Perugia, Italy
Stefano Borgo, National Research Council (CNR-ISTC), Italy
Francesco Calimeri, University of Calabria, Italy
Alberto Casagrande, University of Trieste, Italy
Antonio Chella, University of Palermo, Italy
Alessandro Cimatti, Fondazione Bruno Kessler, Italy
Gabriella Cortellessa, National Research Council (CNR-ISTC), Italy
Stefania Costantini, University of L'Aquila, Italy
Alessandro Dal Palù, University of Parma, Italy
Dario Della Monica, University of Udine, Italy
Stefano Ferilli, University of Bari, Italy
Alberto Finzi, University of Naples “Federico II”, Italy
Fabio Fioravanti, University of Chieti-Pescara, Italy
Andrea Formisano, University of Udine, Italy
Salvatore Gaglio, University of Palermo, Italy
Chiara Ghidini, Fondazione Bruno Kessler, Italy
Gianluigi Greco, University of Calabria, Italy
Luca Iocchi, University of Rome “Sapienza”, Italy
Antonio Lieto, University of Turin, Italy
Francesca A. Lisi, University of Bari, Italy
Michele Loreti, University of Camerino, Italy
Fabio Mercorio, University of Milano-Bicocca, Italy
Angelo Oddi, National Research Council (CNR-ISTC), Italy
Andrea Omicini, University of Bologna “Alma Mater Studiorum”, Italy
Luigi Palopoli, University of Trento, Italy
Filippo Palumbo, National Research Council (CNR-ISTI), Italy
Fabio Patrizi, University of Rome “Sapienza”, Italy
Luigi Portinale, University of Piemonte Orientale, Italy
Gian Luca Pozzato, University of Turin, Italy
Luca Pulina, University of Sassari, Italy
Alessandro Raffetà, University of Venezia “Ca’ Foscari”, Italy
Riccardo Rasconi, National Research Council (CNR-ISTC), Italy
Francesco Ricca, University of Calabria, Italy
Fabrizio Riguzzi, University of Ferrara, Italy
Marco Roveri, University of Trento, Italy
Salvatore Ruggieri, University of Pisa, Italy
Enrico Scala, University of Brescia, Italy
Giovanni Semeraro, University of Bari, Italy
Luciano Serafini, Fondazione Bruno Kessler, Italy
Gianluca Torta, University of Turin, Italy
Mauro Vallati, University of Huddersfield, UK
Eloisa Vargiu, CETaqua Water Technology Center, Spain
Additional Reviewers

Carlo Adornetto, Damiano Azzolini, Daniele Baccega, Patrizio Bellan, Gloria Beraldo, Luigi Bonassi, Fabio Buttussi, Pierluigi Cassotti, Federico Cerutti, Riccardo De Benedictis, Alessandro De Paola, Francesco Fabiano, Francesco Faloci, Antonino Fiannaca, Federico Fogolari, Francesca Fracasso, Francesca Gasparini, Francesco Guarnera, Dario Guidotti, Eleonora Iotti, Andrea Iovine, Maria Mannone, Marta Marchiori Manerba, Claudio Masolo, Ivan Mercanti, Laura Pandolfo, Marco Polignano, Andrea Pugnana, Alessandro Quarta, Chiara Renso, Francesco Santini, Laura State, Carlo Taticchi, Alessandro Umbrico, Alberto Valese
Contents
Hybrid Approaches

The PSyKE Technology for Trustworthy Artificial Intelligence . . . . 3
Roberta Calegari and Federico Sabbatini

A Declarative Approach to Contrast Pattern Mining . . . . 17
Francesca Alessandra Lisi and Gioacchino Sterlicchio

Graphs and Networks

Approximate Inference in Probabilistic Answer Set Programming for Statistical Probabilities . . . . 33
Damiano Azzolini, Elena Bellodi, and Fabrizio Riguzzi

Decision Trees with a Modal Flavor . . . . 47
Dario Della Monica, Giovanni Pagliarini, Guido Sciavicco, and Ionel Eduard Stan

Assisted Process Knowledge Graph Building Using Pre-trained Language Models . . . . 60
Patrizio Bellan, Mauro Dragoni, and Chiara Ghidini

Neural Networks Reduction via Lumping . . . . 75
Dalila Ressi, Riccardo Romanello, Carla Piazza, and Sabina Rossi

Knowledge Enhanced Neural Networks for Relational Domains . . . . 91
Alessandro Daniele and Luciano Serafini

Logic Tensor Networks for Top-N Recommendation . . . . 110
Tommaso Carraro, Alessandro Daniele, Fabio Aiolli, and Luciano Serafini

Multi-agent Systems

A Review of the Muddy Children Problem . . . . 127
Yusuf Izmirlioglu, Loc Pham, Tran Cao Son, and Enrico Pontelli

Multi-agent Cooperative Argumentation in Arg2P . . . . 140
Giuseppe Pisano, Roberta Calegari, and Andrea Omicini
Ethics by Design for Intelligent and Sustainable Adaptive Systems . . . . 154
Luca Squadrone, Danilo Croce, and Roberto Basili

Automated Planning and Scheduling

Verification of Numeric Planning Problems Through Domain Dynamic Consistency . . . . 171
Enrico Scala, Thomas L. McCluskey, and Mauro Vallati

Comparing Multi-Agent Path Finding Algorithms in a Real Industrial Scenario . . . . 184
Enrico Saccon, Luigi Palopoli, and Marco Roveri

Logic-Based Ethical Planning . . . . 198
Umberto Grandi, Emiliano Lorini, Timothy Parker, and Rachid Alami

A Hybrid Recommender System with Implicit Feedbacks in Fashion Retail . . . . 212
Ilaria Cestari, Luigi Portinale, and Pier Luigi Riva

Incremental Timeline-Based Planning for Efficient Plan Execution and Adaptation . . . . 225
Riccardo De Benedictis, Gloria Beraldo, Amedeo Cesta, and Gabriella Cortellessa

Knowledge Acquisition and Completion for Long-Term Human-Robot Interactions Using Knowledge Graph Embedding . . . . 241
Ermanno Bartoli, Francesco Argenziano, Vincenzo Suriani, and Daniele Nardi

Construct, Merge, Solve and Adapt Applied to a Bus Driver Scheduling Problem with Complex Break Constraints . . . . 254
Roberto Maria Rosati, Lucas Kletzander, Christian Blum, Nysret Musliu, and Andrea Schaerf

Topic Modelling and Frame Identification for Political Arguments . . . . 268
Shohreh Haddadan, Elena Cabrio, Axel J. Soto, and Serena Villata

Substitute Plastic Film with Kraft Paper in Automatic Pallet Wrapping: An AI Pipeline . . . . 282
Eleonora Iotti, Alessandro Dal Palù, Gianluca Contesso, and Francesco Bertinelli
AI Applications

Transformer Based Motion In-Betweening . . . . 299
Pavithra Sridhar, V. Aananth, Madhav Aggarwal, and R. Leela Velusamy

A Logic-Based Tool for Dynamic Generation and Classification of Musical Content . . . . 313
Antonio Lieto, Gian Luca Pozzato, Alberto Valese, and Mattia Zito

Why Can Neural Networks Recognize Us by Our Finger Movements? . . . . 327
Elena Mariolina Galdi, Marco Alberti, Alessandro D’Ausilio, and Alice Tomassini

Miscellany

Labelled Sequent Calculi for Conditional Logics: Conditional Excluded Middle and Conditional Modus Ponens Finally Together . . . . 345
Nicola Olivetti, Nikola Panic, and Gian Luca Pozzato

Deep Learning for ECoG Brain-Computer Interface: End-to-End vs. Hand-Crafted Features . . . . 358
Maciej Śliwowski, Matthieu Martin, Antoine Souloumiac, Pierre Blanchart, and Tetiana Aksenova

Quantum Circuit Compilation for the Graph Coloring Problem . . . . 374
Angelo Oddi, Riccardo Rasconi, Marco Baioletti, Vieri Giuliano Santucci, and Hamish Beck

Toward a Heterogeneous Multirobot Framework for Priority-Based Sanitization of Railway Stations . . . . 387
Riccardo Caccavale, Mirko Ermini, Eugenio Fedeli, Alberto Finzi, Vincenzo Lippiello, and Fabrizio Tavano

Simulated Annealing for the Home Healthcare Routing and Scheduling Problem . . . . 402
Sara Ceschia, Luca Di Gaspero, and Andrea Schaerf

MAP Inference in Probabilistic Answer Set Programs . . . . 413
Damiano Azzolini, Elena Bellodi, and Fabrizio Riguzzi

Verifying a Stochastic Model for the Spread of a SARS-CoV-2-Like Infection: Opportunities and Limitations . . . . 427
Marco Roveri, Franc Ivankovic, Luigi Palopoli, and Daniele Fontanelli
Natural Language Processing

DelBERTo: A Deep Lightweight Transformer for Sentiment Analysis . . . . 443
Luca Molinaro, Rosalia Tatano, Enrico Busto, Attilio Fiandrotti, Valerio Basile, and Viviana Patti

A BERT-Based Scoring System for Workplace Safety Courses in Italian . . . . 457
Nicola Arici, Alfonso E. Gerevini, Luca Putelli, Ivan Serina, and Luca Sigalini

Embedding Contextual Information in Seq2seq Models for Grounded Semantic Role Labeling . . . . 472
Claudiu Daniel Hromei, Lorenzo Cristofori, Danilo Croce, and Roberto Basili

Keynote Talk

Adventures with Datalog: Walking the Thin Line Between Theory and Practice . . . . 489
Georg Gottlob

Author Index . . . . 501
Hybrid Approaches
The PSyKE Technology for Trustworthy Artificial Intelligence

Roberta Calegari¹ and Federico Sabbatini²(B)

¹ Alma AI – Alma Mater Research Institute for Human-Centered Artificial Intelligence, Alma Mater Studiorum—Università di Bologna, Bologna, Italy
[emailprotected]
² Department of Pure and Applied Sciences (DiSPeA), University of Urbino, Via S. Chiara, 27, 61029 Urbino, Italy
[emailprotected]
Abstract. Transparency is one of the “Ethical Principles in the Context of AI Systems” as described in the Ethics Guidelines for Trustworthy Artificial Intelligence (TAI). It is closely linked to four other principles – respect for human autonomy, prevention of harm, traceability, and explainability – and involves numerous ways in which opaqueness can have undesirable impacts, such as discrimination, inequality, segregation, marginalisation, and manipulation. The opaqueness of many AI tools and the inability to understand the underpinning black boxes contradict these principles and prevent people from fully trusting them. In this paper we discuss the PSyKE technology, a platform providing general-purpose support to symbolic knowledge extraction from different sorts of black-box predictors via many extraction algorithms. The extracted knowledge is easily injectable into existing AI assets, making them meet the transparency TAI requirement.

Keywords: Trustworthy Artificial Intelligence · Transparency · Explainability · Symbolic knowledge extraction · PSyKE

1 Introduction
The innovative potential of Artificial Intelligence (AI) is clear, but AI tools can reflect, amplify, and even create untrustworthy behaviours, beliefs, decisions or results [15]. As we use AI systems to formalise, scale, and accelerate processes, we have the opportunity, as well as the duty, to revise and enhance the existing processes, avoiding perpetuating existing patterns of untrustworthiness by detecting, diagnosing, and repairing them. To trust these systems, domain experts and stakeholders need to trust the decisions made by them.

(This work has been partially supported by the EU ICT-48 2020 project TAILOR (No. 952215) and by the European Union's Horizon 2020 research and innovation programme under G.A. no. 101017142, StairwAI project.)

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. A. Dovier et al. (Eds.): AIxIA 2022, LNAI 13796, pp. 3–16, 2023. https://doi.org/10.1007/978-3-031-27181-6_1

Europe's
strategy aims to create an AI Ecosystem of Excellence and Trust where ethical and legal principles are pursued in all AI systems. Transparency is one of the “Ethical Principles in the Context of AI Systems” described in the Ethics Guidelines for Trustworthy Artificial Intelligence (EGTAI) [9] and in the first AI regulation (the “AI Act”) [8]. It is closely linked to four other principles (respect for human autonomy, prevention of harm, traceability, and explainability) and involves numerous ways in which opaqueness can have undesirable impacts, such as discrimination, inequality, exclusion, segregation, marginalisation, exploitation, and manipulation. However, translating ethical principles and the EGTAI into practical requirements is needed to boost high-quality AI innovation in Europe.

Concrete methods to ensure that AI systems adhere to the transparency requirement can be borrowed from the explainability domain, since providing explanations contributes to achieving transparency. Different strategies can be exploited to meet transparency and explainability [11]. For instance, it is possible to obtain explainable data-driven solutions by using only interpretable algorithms [16]—such as decision lists, decision trees, sparse integer linear models, and algorithms based on discrete optimisation. However, this kind of technique often has repercussions on the final predictive performance, since the most effective algorithms – like artificial neural networks – are not taken into account. Deriving post-hoc explanations [14] is an alternative strategy aimed at reverse-engineering the black-box (BB) inner behaviour to make it explicit. This is a way of combining the performance of prediction-effective (even if opaque) machine learning models with human-interpretable output predictions.
Symbolic knowledge extraction (SKE) represents one of the most promising techniques to derive post-hoc explanations from sub-symbolic BB models and to interpret the notion of explainability from the transparency perspective, i.e. proposing a transparent model adhering to the non-transparent predictor. Its main idea is to build a symbolic – and thus interpretable – model that mimics the behaviour of the original BB, intended as the capability to provide outputs that are as close as possible to those of the underlying BB queried on the same inputs. Symbols may consist of comprehensible knowledge—e.g., lists or trees of rules that can be exploited either to derive predictions or to better understand the BB behaviour and, as a further step, as knowledge on which to perform any kind of logical reasoning. Currently, SKE techniques have already been applied in a wide variety of areas, ranging from medical diagnosis [10] to finance [1] and astrophysics [22].

Despite the wide adoption of SKE and the existence of different techniques for extracting symbolic knowledge out of a BB, a unified and general-purpose software technology supporting such methods and their comparison is currently lacking. In other words, the burden of implementing SKE algorithms, or selecting the best one from the state of the art, is currently on AI stakeholders alone, who are likely to realise custom solutions for a specific application need. Other than slowing down the adoption of SKE as an effective method for reaching transparency, such a lack of viable technologies is somewhat anachronistic in the data-driven AI era, where a plethora of libraries and frameworks
The PSyKE Technology for TAI
5
are flourishing, targeting all major programming paradigms and platforms, and making state-of-the-art machine learning (ML) algorithms easily accessible to the general public—cf. Scikit-Learn¹ for Python. Accordingly, in this paper we present a general-purpose Platform for Symbolic Knowledge Extraction – PSyKE – as a way to turn the TAI requirements – transparency in particular – from high-level principles into concrete methods. Moreover, one of the PSyKE goals is filling the gap between the current state of the art of SKE and the available technology, as well as providing a concrete toolkit for testing, evaluating, and reaching transparency in AI applications. It provides a controlled experimentation environment for transparency via SKE methods, enabling the creation of different simulations/experiments for the specific application at hand. The framework comes as a toolkit in which experiments on transparency can be built and run, comparing different solutions and selecting the best option. More precisely, PSyKE is conceived as an open library where different sorts of knowledge extraction algorithms can be realised, exploited, or compared. PSyKE supports rule extraction from both classifiers and regressors, and makes the extraction procedure as transparent as possible w.r.t. the underlying BB, depending on the particular extraction procedure at hand. The extraction of first-order logic clauses is also supported, with the twofold advantage of providing human- and machine-interpretable rules as output. These can then be used either as an explanation for the original BB or as a starting point for further symbolic computations and reasoning.
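The post-hoc extraction idea described above can be reproduced with Scikit-Learn alone. The sketch below is illustrative, not PSyKE code: it trains an opaque MLP, then fits a shallow decision tree on the MLP's own predictions (rather than on the true labels), so that the tree's rules mimic the black box; the model choices and hyperparameters are assumptions.

```python
# Illustrative post-hoc SKE: a shallow decision tree is trained to mimic
# an opaque MLP, so the tree's rules act as a symbolic surrogate.
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X, y = iris.data, iris.target

# The black box: an opaque neural predictor.
black_box = MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000,
                          random_state=0).fit(X, y)

# The surrogate is fitted on the black box's *predictions*, not on the
# true labels: it approximates the BB, not the data.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, black_box.predict(X))

# Fidelity: agreement between surrogate and black box on the same inputs.
fidelity = (surrogate.predict(X) == black_box.predict(X)).mean()
print(f"fidelity: {fidelity:.2f}")

# Each root-to-leaf path of the tree reads as one human-interpretable rule.
print(export_text(surrogate, feature_names=list(iris.feature_names)))
```

Any other black box (k-NN, random forest, ...) can be swapped in without touching the surrogate-fitting step, which is precisely the decoupling that a general-purpose extraction API builds upon.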
2 The PSyKE Framework
PSyKE² [18,19] is a platform providing general-purpose support to symbolic knowledge extraction from different sorts of black-box predictors via many extraction algorithms.

2.1 Functionalities and Main Components
PSyKE comes as a software library providing general-purpose support to the extraction of logic rules out of BB predictors, letting users choose the most adequate SKE method for the task and data at hand. The framework exposes a unified API covering virtually all extraction algorithms targeting supervised learning tasks, and experiments can also be run via a GUI. Currently, PSyKE grants access to state-of-the-art SKE algorithms, providing implementations of several interoperable, interchangeable, and comparable extraction methods [2,6,7,13,17,20]. PSyKE is conceived as an open-ended project, exploitable to design and implement new extraction procedures behind a unique API. Essentially, PSyKE is designed around the notion of extractor, whose overall design is depicted in Fig. 1. Within the scope of PSyKE, an extractor is any algorithm accepting a machine learning predictor as input (classifier or regressor) and producing a theory of logic rules as output.

¹ https://scikit-learn.org/stable
² https://apice.unibo.it/xwiki/bin/view/PSyKE/
Fig. 1. PSyKE design
PSyKE extractors require additional information to complete the extraction task. Such information consists of the data set used to train the predictor and its schema. Data sets are required to allow the extraction procedure to inspect the BB behaviour – and therefore build the corresponding output rules – whereas schemas are required to allow (i) the extraction procedure to take decisions based on feature types, and (ii) the extracted knowledge to be more interpretable by referring to the feature names. Accordingly, extractors also expect the data set and its schema metadata as input. Figure 1 also shows the discretiser and scaler components. The former provides facilities for discretising (binarising) data sets including continuous (categorical) data—a procedure often needed for data sets involving these kinds of attributes to be given as input to extractors that only accept discrete or binary input features.

2.2 Architecture and API
As depicted in Fig. 2, a key role in the design of PSyKE is played by the Extractor interface, defining the general contract of any knowledge-extraction procedure. Each Extractor encapsulates a single machine learning Predictor and a particular Discretisation strategy. Given a set of inputs, an extractor is capable of extracting a Theory of logic Rules out of a DataFrame containing the examples the Predictor has been trained upon. PSyKE assumes underlying libraries to be available on the runtime adopted for implementation, from which AI facilities can be inherited. These include: a machine learning library, exposing ad-hoc types aimed at representing data sets, data schemas, or predictors, and a symbolic AI library, exposing ad-hoc types for representing and manipulating logic theories, clauses, and rules. PSyKE inherits high-level abstractions from these libraries. These include the following components:
Fig. 2. PSyKE’s Extractor interface

- DataFrame — a container of tabular data, where rows commonly denote instances and columns denote their features; bulk operations are available to manipulate the table as a whole, as well as any of its rows/columns;
- Predictor — a computational entity which can be trained (a.k.a. fitted) against a DataFrame and used to draw predictions of type R;
- Classifier — a particular case of predictor where R represents a type having a finite amount of admissible values;
- Regressor — a particular case of predictor where R represents a type having a potentially infinite (possibly continuous) amount of admissible values;
- Rule — a semantic, intelligible representation of the function mapping a Predictor's inputs into the corresponding outputs, for a portion of the input space;
- Theory — an ordered collection of rules.

For example, PSyKE borrows ML-related abstractions – such as DataFrame, Predictor, or Classifier – from either the Pandas or Scikit-Learn Python libraries. Similarly, it borrows high-level symbolic-AI-related abstractions – such as Theory or Rule – from 2P-Kt³ [5]. PSyKE constructs its notion of Extractor upon these inherited concepts, thus designing an Extractor as any method capable of extracting logic Rules out of some trained Predictor. PSyKE extractors are bound to the particular underpinning black-box Predictor, as well as to the Discretisation strategy exploited for the input space. Extractors also expose a method for extracting an explainable Theory from the Predictor – namely, extract – and a method to draw predictions by using the extracted rules—namely, predict. Any attempt to use the extracted rules to draw explainable predictions triggers extraction first—i.e., the prediction procedure implies extraction. Both extraction and prediction rely on a DataFrame that must be provided by the user upon invocation. Extractors, in the general case, may also be used to perform rule induction from data, without any intermediate predictor.
3 https://github.com/tuProlog/2ppy.
R. Calegari and F. Sabbatini
It is worth noting that Predictors are parametric types: the meta-parameter R represents the type of predictions the predictor may produce. The rules possibly extracted by such predictors – as well as the predictions drawn from them – may differ significantly depending on the particular data and on the selected predictors. For instance, when rules are extracted from mono-dimensional regressors, R may be the type of floating point numbers, whereas, for multi-class classifiers, R may consist of a set of types (like integer, string, ...). Depending on the nature of R, the extracted rules may differ significantly. However, the proposed API makes it possible to switch between different extraction algorithms and predictors with no changes in the PSyKE architecture. Output rules produced by PSyKE’s extractors may be tailored more to human interpretability or to agent/machine interoperability [21]. In the former case, a Prolog theory of logic clauses is provided as output; in the latter case, the knowledge is extracted as an OWL ontology containing SWRL rules.
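The extract/predict contract described above can be sketched in a few lines of Python. The following is a hypothetical illustration, not PSyKE’s actual API: the class name, its grid parameter, and its single-feature interval rules are invented for the example. It shows a pedagogical-style extractor that queries the black box on sample points to build rules, and whose predict triggers extract on first use, mirroring the behaviour described above.

```python
class IntervalRuleExtractor:
    """Hypothetical sketch of an extractor: queries a black-box predictor
    on a grid of 1-D inputs and builds one interval rule per output label."""

    def __init__(self, predictor, grid):
        self.predictor = predictor  # callable black box: x -> label
        self.grid = grid            # sample points used to query the BB
        self.rules = None           # extracted theory: label -> (low, high)

    def extract(self):
        """Build one interval rule per label from the BB's answers on the grid."""
        rules = {}
        for x in self.grid:
            label = self.predictor(x)
            lo, hi = rules.get(label, (x, x))
            rules[label] = (min(lo, x), max(hi, x))
        self.rules = rules
        return rules

    def predict(self, xs):
        """Predict via the extracted rules; prediction implies extraction."""
        if self.rules is None:
            self.extract()
        def classify(x):
            for label, (lo, hi) in self.rules.items():
                if lo <= x <= hi:
                    return label
            return None
        return [classify(x) for x in xs]
```

A caller can then invoke predict directly, and extraction happens transparently the first time, as in the workflow the text describes.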
3 Examples
In this section, some examples of PSyKE at work in different scenarios are reported: the Iris data set4 as a classification task and the Combined Cycle Power Plant5 (CCPP) data set as a regression case study.

3.1 Classification: The Iris Data Set
In the following we report the outcome of PSyKE when applying different SKE techniques to the Iris data set. All the results are summarised in Fig. 3 and Table 1. Column “Predictor” represents the ML step of the process; column “Extractor” represents the output of PSyKE. Different extraction procedures – namely, Iter, GridEx, and Cart – are applied to some selected BB classifiers: a k-nearest neighbour with k = 5 (5-NN), a decision tree (DT) and a multi-layer perceptron (MLP). A numerical assessment of the aforementioned predictors and extractors is reported in Table 1 in terms of the number of extracted rules and of the predictive performance w.r.t. both the data and the BB predictions. The predictive performance is expressed through both classification accuracy and F1 score. Values are averaged over 25 executions, each one with a different random train/test split, but with the same test set percentage and the same parameters for predictors and extractors. Table 1 also reports the accuracy of the underpinning BB predictor and the fidelity and accuracy of the extraction procedure. It is worth noting that, thanks to the controlled experimentation environment provided by PSyKE, different SKE techniques can be easily compared and the best option for the scenario at hand can be selected.
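The two views of classification quality used in Table 1 – accuracy against the data and fidelity against the BB predictions – are both label-agreement rates, so a single helper suffices. The sketch below is generic (the toy predictions are made up and not taken from the experiments):

```python
def accuracy(y_ref, y_pred):
    """Fraction of matching labels. With y_ref = ground truth this is
    accuracy w.r.t. the data; with y_ref = BB outputs it is fidelity."""
    assert len(y_ref) == len(y_pred)
    return sum(a == b for a, b in zip(y_ref, y_pred)) / len(y_ref)

# hypothetical predictions on 10 test instances
y_true = ["setosa"] * 4 + ["versicolor"] * 3 + ["virginica"] * 3
y_bb   = ["setosa"] * 4 + ["versicolor"] * 2 + ["virginica"] * 4
y_ext  = ["setosa"] * 4 + ["versicolor"] * 2 + ["virginica"] * 4

data_accuracy = accuracy(y_true, y_ext)  # extractor vs. ground truth
fidelity      = accuracy(y_bb, y_ext)    # extractor vs. black box
```

Here the extractor mimics the black box perfectly (fidelity 1.0) while inheriting one of its mistakes on the data, which is exactly the distinction the two column groups of Table 1 capture.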
4 https://archive.ics.uci.edu/ml/datasets/iris.
5 https://archive.ics.uci.edu/ml/datasets/combined+cycle+power+plant.
Fig. 3. Comparison between Iris data set input space partitionings performed by the algorithms implemented in PSyKE. Only the two most relevant features are reported—i.e., petal width and length.
3.2 Regression: The Combined Cycle Power Plant Data Set
In this example, PSyKE is exploited to extract rules out of different BB regressors trained upon the CCPP data set. The data set contains 9568 instances, each one composed of 4 real-valued input attributes. Diverse regressors are trained on the CCPP data set: a 3-NN, a DT and a linear regressor (LR). As in the previous example, PSyKE is used to extract logic rules out of the selected BB models, exploring some of the SKE methods it supports—namely, Iter, GridEx, GridREx and Cart. The metrics measuring the fidelity of the extractor w.r.t. the underlying BB predictions, as well as the predictive accuracy w.r.t. the data, are the mean absolute error (MAE) and the R2 score. The same metrics are used to assess the predictive performance of the BBs
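Both regression metrics can be computed from scratch; the following plain-Python sketch (no library assumed) shows their definitions. As in the classification case, feeding BB outputs as the reference series yields the fidelity variant of each metric.

```python
def mae(y_true, y_pred):
    """Mean absolute error: average absolute deviation from the reference."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot
```

For example, mae([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8]) evaluates to 0.15, and a perfect fit gives an R2 score of 1.0, the upper bound appearing in Table 2.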
Table 1. Comparison between predictive performance and fidelity measurements applied to the Iris data set. The best extractors are highlighted.

Predictor                    Extractor
Type   Accuracy  F1 score    Algorithm  Rules  Accuracy  Accuracy  F1 score  F1 score
       (data)    (data)                        (data)    (BB)      (data)    (BB)
5-NN   0.96      0.96        Iter       3      0.91      0.93      0.91      0.93
                             GridEx     3      0.94      0.96      0.94      0.96
                             Cart       3      0.92      0.93      0.92      0.93
DT     0.96      0.96        Iter       3      0.96      0.94      0.96      0.94
                             GridEx     3      0.94      0.96      0.94      0.96
                             Cart       3      0.89      0.93      0.89      0.93
MLP    0.99      0.99        Iter       5      0.80      0.79      0.78      0.76
                             GridEx     3      0.94      0.96      0.94      0.96
                             Cart       3      0.95      0.93      0.95      0.93
and, as for the Iris case study, the extracted knowledge readability is expressed as the number of rules. The results of PSyKE applied to the CCPP data set are summarised in Fig. 4 and Table 2. Each of the extraction procedures suitable for regression tasks is applied to all the aforementioned BB regressors. Figure 4 shows that all the extractors are able to capture the behaviour of the output values w.r.t. the input variables. Table 2 reports the predictive performance of predictors and extractors. Values are averaged over 25 executions, each one with a different train/test split, but with the same parameters for both predictors and extractors. Results show that, in the case at hand, all predictors have comparable performance in terms of MAE and R2 score. Conversely, Cart, GridEx and GridREx always appear more explainable than Iter in terms of the number of extracted rules. The table also shows that GridEx and Cart generally present analogous performance. This depends on the nature of the corresponding output rules: both produce rules having constant output values, introducing an undesired discretisation of the predicted variable. Both are able to outperform Iter also in terms of predictive performance (smaller MAE and larger R2 score). On the other hand, GridREx outperforms all the other algorithms, achieving higher fidelity and readability. This depends on the regressive nature of its outputs, enabling the creation of more concise output rules that perform more accurate predictions. Indeed, GridREx rules have linear combinations of the input variables as postconditions. The nature of the different predictors and extractors used in this case study can be easily observed in Fig. 4. The boundaries identified by the 3-NN clearly follow a proximity pattern. Conversely, the DT performs variable slicing along
Fig. 4. Comparison between CCPP data set output predictions provided by the algorithms implemented in PSyKE. Only the two most relevant features are reported—i.e., ambient temperature and exhaust vacuum.
each input dimension, and the LR produces a gradually decreasing output value for growing input values. As for the extractors, the same considerations made for the DT hold for Cart. The hypercubic nature of Iter and GridEx can be detected by observing the rectangular boundaries they provide. Finally, GridREx provides local linear regressive laws for hypercubic regions, merging the advantages of both DTs and LRs.
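The difference between constant-output rules (Iter, GridEx, Cart) and GridREx's linear postconditions can be illustrated with a minimal sketch. The function below and all its parameter values are hypothetical, not GridREx's actual rule representation: it merely shows a rule whose precondition is a hypercube over the inputs and whose postcondition is a linear combination of them.

```python
def hypercube_rule(lows, highs, weights, bias):
    """GridREx-style rule sketch: inside the hypercube [lows, highs] the
    output is a linear combination of the inputs; outside, it does not fire."""
    def rule(x):
        inside = all(lo <= xi <= hi for xi, lo, hi in zip(x, lows, highs))
        if not inside:
            return None
        return bias + sum(w * xi for w, xi in zip(weights, x))
    return rule

# e.g. a rule over (ambient temperature, exhaust vacuum); coefficients made up
r = hypercube_rule(lows=[5.0, 30.0], highs=[25.0, 60.0],
                   weights=[-1.5, -0.2], bias=490.0)
```

A constant-output rule is the special case with all weights equal to zero, which is why Cart and GridEx necessarily discretise the predicted variable while GridREx does not.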
Table 2. Comparison between predictive performance and fidelity measurements applied to the CCPP data set. The number of extracted rules is also reported. The best extractors are highlighted.

Predictor               Extractor
Type   MAE     R2       Algorithm  Rules  MAE     MAE    R2      R2
       (data)  (data)                     (data)  (BB)   (data)  (BB)
3-NN   3.09    0.94     Iter       22     4.19    3.78   0.94    0.96
                        GridEx     5      5.02    4.63   0.87    0.88
                        GridREx    5      3.25    2.52   0.94    0.96
                        Cart       6      4.45    3.90   0.89    0.91
DT     3.31    0.92     Iter       14     4.27    4.32   0.93    0.92
                        GridEx     5      5.02    5.10   0.87    0.86
                        GridREx    5      3.24    3.38   0.94    0.93
                        Cart       6      4.46    4.50   0.89    0.88
LR     3.59    0.92     Iter       43     4.42    2.74   0.93    1.00
                        GridEx     5      5.15    3.80   0.86    0.92
                        GridREx    1      3.59    0.00   0.93    1.00
                        Cart       6      4.97    3.49   0.87    0.93
Once again it is worth noting how the PSyKE technology enables different SKE techniques to be compared. Such a comparison also provides a measure of the explainability and transparency that can be achieved out of the BB predictor.

3.3 PSyKE GUI
Figure 5 shows a screenshot of the PSyKE GUI, highlighting how the toolkit also enables fast and easy interaction with users. The GUI is simple and user-friendly, and is divided into four panels. The top panel is dedicated to task selection (classification vs. regression) and to data set selection/pre-processing. Users can choose between several predefined data sets, or load a custom file. Furthermore, they can choose to discretise/scale the features and, on the right, select among all the available features (i) the one to be used as output; (ii) those to be used as inputs; and (iii) those to be neglected. On the same panel it is possible to select two input features to be plotted together with the output feature. Plots appear in the rightmost central panel of the GUI. The first one represents the data set instances, the second depicts the decision boundaries of the trained BB predictor, and the third does the same for the selected extractor. Plots are shown after pressing the corresponding button, but each plot depends on the operations previously performed by the user. The predictor plot requires a BB predictor to have been previously chosen and trained. This can be done in the leftmost central panel of the interface. Several models are available, each one with corresponding text boxes
Fig. 5. PSyKE GUI
to allow users to customise the required hyper-parameters. Users can also choose the train/test splitting percentage. Each parameter has a default value, so user inputs are optional. Analogously, the bottommost panel is dedicated to the selection, training and tuning of knowledge extractors. Training an extractor enables the visualisation of the third plot. The knowledge extracted by PSyKE extractors is displayed below the plots, in Prolog syntax. Finally, information about the chosen data set (e.g., number of features, classes and instances), predictor (e.g., parameters and predictive performance) and extractor (e.g., parameters, predictive performance and fidelity measurements) is shown next to the corresponding selection commands (after their selection). The example reported in Fig. 5 shows the application of PSyKE to the Iris data set. The data set has been loaded without discretisation or feature pruning; then a 5-NN has been trained on 80% of the data set. The Cart extractor has finally been chosen, with maximum depth and maximum leaf amount equal to 3. Only the input features concerning petal width and length have been selected to be plotted. In conclusion, the framework makes it possible to build different experiments in a controlled environment, enabling easy exploitation of the technology and offering the possibility to compare the results in a simple way.
4 Impact
The PSyKE technology may impact many research areas. It provides a well-grounded technological basis and a software engineering practice for implementing and experimenting with the transparency and explainability dimensions in AI
applications. It provides an extensible framework for collecting the SKE methods and approaches proposed in the literature, creating a controlled environment for testing, evaluating and comparing transparency.

PSyKE also has an important role from the point of view of software engineering, providing a methodology that can be exploited for grounding all the TAI dimensions—i.e., the design and implementation of a controlled experimentation environment that can also act as a sandbox for simulating the trustworthiness of an AI system. Accordingly, the framework provides a concrete example of the feasibility of building a practical toolkit for AI stakeholders to test the dimensions of TAI.

Moreover, PSyKE has a role to play in the field of XAI [12]. Integrating symbolic and sub-symbolic AI – i.e., using them in synergy, as an ensemble – is a strategic research direction [4], and PSyKE offers a sound technological foundation for this purpose. Finally, the distributed systems community needs interoperable and general-purpose logic-based technologies that can easily be injected into already existing systems [3]. There, PSyKE provides a technological layer, easily injectable into distributed systems, that supports agents' reasoning via the production of logical knowledge the agents can exploit.

Given all the potential of the described framework, there is room for several future research directions. PSyKE already enables the investigation of relevant research questions involving symbolic manipulation or automated reasoning, thanks to its modularity and interoperability. Under such a perspective, PSyKE enables exploring how to: (i) blend SKE with other AI techniques, and (ii) exploit SKE to build flexible intelligent systems. Along these lines, future research will consider integrating into the framework a larger suite of methods, so as to deal with a wider variety of datasets and predictors.
Some preliminary experiments have shown that the SKE algorithms can also be exploited for rule induction starting from data. This line is particularly interesting in all the cases in which a BB predictor is not available. Moreover, new SKE techniques are under development that combine SKE with explainable clustering techniques to increase both performance and fidelity. Finally, the framework is a preliminary example of how TAI dimensions can be tested and evaluated, and an interesting research line is to extend the environment in order to achieve a certification of the level of transparency – or, more generally, trustworthiness – of given AI applications. The challenge here is to find a way to define effective metrics for the certification of TAI dimensions.
5 Conclusion
In this paper we discussed the PSyKE technology, a platform providing general-purpose support for symbolic knowledge extraction from different sorts of black-box predictors via many extraction algorithms. PSyKE is designed to be easily injectable into existing AI assets, making them meet the transparency requirement of TAI. The framework provides a controlled experimentation environment in which transparency and explainability can be tested, assessed and compared.
Even if still at a preliminary stage, it provides a software engineering practice for grounding all the TAI dimensions, translating them from high-level principles into practical requirements.
References

1. Baesens, B., Setiono, R., De Lille, V., Viaene, S., Vanthienen, J.: Building credit-risk evaluation expert systems using neural network rule extraction and decision tables. In: Storey, V.C., Sarkar, S., DeGross, J.I. (eds.) ICIS 2001 Proceedings, pp. 159–168. Association for Information Systems (2001). http://aisel.aisnet.org/icis2001/20
2. Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A.: Classification and Regression Trees. CRC Press, Boca Raton (1984)
3. Calegari, R., Ciatto, G., Mascardi, V., Omicini, A.: Logic-based technologies for multi-agent systems: a systematic literature review. Auton. Agents Multi-Agent Syst. 35(1), 1:1–1:67 (2021). https://doi.org/10.1007/s10458-020-09478-3
4. Calegari, R., Ciatto, G., Omicini, A.: On the integration of symbolic and sub-symbolic techniques for XAI: a survey. Intell. Artif. 14(1), 7–32 (2020). https://doi.org/10.3233/IA-190036
5. Ciatto, G., Calegari, R., Omicini, A.: 2P-Kt: a logic-based ecosystem for symbolic AI. SoftwareX 16(100817), 1–7 (2021). https://doi.org/10.1016/j.softx.2021.100817
6. Craven, M.W., Shavlik, J.W.: Using sampling and queries to extract rules from trained neural networks. In: Machine Learning Proceedings 1994, pp. 37–45. Elsevier (1994). https://doi.org/10.1016/B978-1-55860-335-6.50013-1
7. Craven, M.W., Shavlik, J.W.: Extracting tree-structured representations of trained networks. In: Touretzky, D.S., Mozer, M.C., Hasselmo, M.E. (eds.) Advances in Neural Information Processing Systems 8, pp. 24–30. The MIT Press (1996). http://papers.nips.cc/paper/1152-extracting-tree-structured-representations-of-trained-networks.pdf
8. European Commission: AI Act - Proposal for a regulation of the European Parliament and the Council laying down harmonised rules on artificial intelligence (Artificial Intelligence Act) and amending certain union legislative acts (2021). https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:52021PC0206
9. European Commission, Directorate-General for Communications Networks, Content and Technology: Ethics guidelines for trustworthy AI. Publications Office (2019). https://doi.org/10.2759/346720
10. Franco, L., Subirats, J.L., Molina, I., Alba, E., Jerez, J.M.: Early breast cancer prognosis prediction and rule extraction using a new constructive neural network algorithm. In: Sandoval, F., Prieto, A., Cabestany, J., Graña, M. (eds.) IWANN 2007. LNCS, vol. 4507, pp. 1004–1011. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-73007-1_121
11. Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., Giannotti, F., Pedreschi, D.: A survey of methods for explaining black box models. ACM Comput. Surv. 51(5), 1–42 (2018). https://doi.org/10.1145/3236009
12. Gunning, D., Aha, D.: DARPA's explainable artificial intelligence (XAI) program. AI Mag. 40(2), 44–58 (2019)
13. Huysmans, J., Baesens, B., Vanthienen, J.: ITER: an algorithm for predictive regression rule extraction. In: Tjoa, A.M., Trujillo, J. (eds.) DaWaK 2006. LNCS, vol. 4081, pp. 270–279. Springer, Heidelberg (2006). https://doi.org/10.1007/11823728_26
14. Kenny, E.M., Ford, C., Quinn, M., Keane, M.T.: Explaining black-box classifiers using post-hoc explanations-by-example: the effect of explanations and error-rates in XAI user studies. Artif. Intell. 294, 103459 (2021). https://doi.org/10.1016/j.artint.2021.103459
15. Mökander, J., Morley, J., Taddeo, M., Floridi, L.: Ethics-based auditing of automated decision-making systems: nature, scope, and limitations. Sci. Eng. Ethics 27(4), 1–30 (2021)
16. Rudin, C.: Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 1(5), 206–215 (2019). https://doi.org/10.1038/s42256-019-0048-x
17. Sabbatini, F., Calegari, R.: Symbolic knowledge extraction from opaque machine learning predictors: GridREx & PEDRO. In: Kern-Isberner, G., Lakemeyer, G., Meyer, T. (eds.) Proceedings of the 19th International Conference on Principles of Knowledge Representation and Reasoning (KR 2022), Haifa, Israel, 31 July–5 August 2022 (2022). https://proceedings.kr.org/2022/57/
18. Sabbatini, F., Ciatto, G., Calegari, R., Omicini, A.: On the design of PSyKE: a platform for symbolic knowledge extraction. In: Calegari, R., Ciatto, G., Denti, E., Omicini, A., Sartor, G. (eds.) WOA 2021 – 22nd Workshop From Objects to Agents, Bologna, Italy, 1–3 September 2021. CEUR Workshop Proceedings, vol. 2963, pp. 29–48. Sun SITE Central Europe, RWTH Aachen University (2021)
19. Sabbatini, F., Ciatto, G., Calegari, R., Omicini, A.: Symbolic knowledge extraction from opaque ML predictors in PSyKE: platform design & experiments. Intell. Artif. 16(1), 27–48 (2022). https://doi.org/10.3233/IA-210120
20. Sabbatini, F., Ciatto, G., Omicini, A.: GridEx: an algorithm for knowledge extraction from black-box regressors. In: Calvaresi, D., Najjar, A., Winikoff, M., Främling, K. (eds.) EXTRAAMAS 2021. LNCS (LNAI), vol. 12688, pp. 18–38. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-82017-6_2
21. Sabbatini, F., Ciatto, G., Omicini, A.: Semantic web-based interoperability for intelligent agents with PSyKE. In: Calvaresi, D., Najjar, A., Winikoff, M., Främling, K. (eds.) Explainable and Transparent AI and Multi-Agent Systems. EXTRAAMAS 2022. LNCS, vol. 13283, pp. 124–142. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-15565-9_8
22. Sabbatini, F., Grimani, C.: Symbolic knowledge extraction from opaque predictors applied to cosmic-ray data gathered with LISA Pathfinder. Aeronaut. Aerosp. Open Access J. 6(3), 90–95 (2022). https://doi.org/10.15406/aaoaj.2022.06.00145
A Declarative Approach to Contrast Pattern Mining

Francesca Alessandra Lisi1(B) and Gioacchino Sterlicchio2

1 Dipartimento di Informatica and CILA, University of Bari “Aldo Moro”, Bari, Italy
2 Department of Mechanics, Mathematics and Management, Polytechnic University of Bari, Bari, Italy
Abstract. This paper proposes a declarative approach to the problem of contrast pattern mining. The approach is based on encodings of the data and the problem with Answer Set Programming (ASP), and evaluated in a novel AI application in the ﬁeld of Digital Forensics.
Keywords: Contrast Pattern Mining · Answer Set Programming · Digital Forensics

1 Introduction
Pattern mining [12] is a class of data mining tasks that consist of extracting interesting structured patterns from a dataset. These tasks encompass itemset mining, sequence mining and graph mining. The interestingness measure of a pattern is, in most of the algorithms, the number of its occurrences in the dataset. Given a threshold k, interesting patterns are those that occur in at least k data instances. In this case, the task is known as frequent pattern mining, for which many algorithms have been proposed. An interesting extension of the frequent pattern mining task is the one that aims at the discovery of so-called contrast patterns. Whereas frequent patterns are statistically significant regularities in a set of transactions, contrast patterns denote statistically significant differences between two or more disjoint sets of transactions [6].

Recently there has been an increasing interest in declarative approaches to pattern mining, giving rise to a novel stream of research known under the name of Declarative Pattern Mining (DPM). So far, DPM has addressed tasks such as frequent itemset mining [10,13] and sequence mining [7,17]. Different declarative frameworks have been explored: SAT [13], Constraint Programming [5,10], and Answer Set Programming (ASP) [7,11]. In this paper we propose a declarative approach to contrast pattern mining which leverages the expressive and inferential power of ASP. To the best of our knowledge, this interesting class of pattern mining problems has not yet been addressed in DPM.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. A. Dovier et al. (Eds.): AIxIA 2022, LNAI 13796, pp. 17–30, 2023. https://doi.org/10.1007/978-3-031-27181-6_2
Declarative approaches are generally desirable in application domains where the requirements of transparency, verifiability and explainability of the employed AI techniques are of paramount importance. One of these cases is the field of Digital Forensics (DF), a branch of criminalistics that deals with the identification, acquisition, preservation, analysis and presentation of the information content of computer systems, or of digital devices in general, by means of specialized software and according to specific regulations. A declarative approach to DF was first explored by Costantini et al. [2,3], and subsequently adopted by the COST Action “Digital forensics: evidence analysis via intelligent systems and practices” (DigForASP)1. The aim of DigForASP is to promote formal and verifiable AI methods and techniques in the analysis of evidence [4]. In this paper, we report the preliminary results obtained by applying the proposed ASP-encoded contrast pattern mining algorithm to a dataset of phone records made available within DigForASP.

The paper is organized as follows. In Sect. 2 we provide the necessary preliminaries on contrast pattern mining and ASP. In Sect. 3 we introduce the proposed ASP encoding for contrast pattern mining. In Sect. 4 we describe the application of this encoding to the analysis of phone records, and report the results of some experiments. In Sect. 5 we conclude with final remarks.
2 Preliminaries

2.1 Contrast Pattern Mining in Brief
We assume the set I = {1, ..., m} of m items, and the set T = {1, ..., n} of n transactions. Intuitively, a transaction t ∈ T is a subset of items from I, which is typically associated with a transaction identifier (TID). A transactional database D ∈ {0, 1}^(n×m) can be seen as a binary matrix, in which each row D_t represents the transaction t consisting of the items {i ∈ I | D_{t,i} = 1}, where D_{t,i} denotes the value in the i-th column and t-th row of D. The subsets of I are called itemsets or patterns. In pattern mining we are interested in finding patterns that satisfy constraints relative to a set of transactions. In particular, given a pattern P ⊆ I and a set of transactions T, the subset of T covered by P is cover(P, T) = {t ∈ T | ∀i ∈ P : D_{t,i} = 1}. Then the absolute support of P in T is defined as:

supp(P, T) = |cover(P, T)|    (1)

and quantifies the number of transactions in T containing the pattern P. Frequent pattern mining algorithms are used to discover statistically significant regularities in a set of transactions, whereas the contrast pattern mining task is about detecting statistically significant differences (contrasts) between two or more disjoint sets of transactions [6]. To this aim, we also assume a finite set L of class labels which are used by the labelling function L(t) ∈ L to label each transaction t. In our setting, the label α ∈ L partitions T into two samples:

1 https://digforasp.uca.es/.
1. T(α) = {t ∈ T | L(t) = α}, i.e., the transactions labelled with α;
2. its complement T̄(α) = T \ T(α).

The contrast pattern P with respect to α is quantified by the so-called absolute support difference, which is defined as:

diff(P, α) = supp(P, T(α)) − supp(P, T̄(α))    (2)
The problem of contrast pattern mining concerns the enumeration of all frequent patterns whose absolute support difference exceeds a user-defined minimum threshold minDiff. More specifically, given:

– the transaction database D over the set of transactions T;
– the maximum pattern length threshold maxLength;
– the minimum absolute support threshold minSupp ≥ 0;
– the minimum absolute support difference threshold minDiff ≥ 0;
– the label α ∈ L;

the problem of contrast pattern mining is to find all patterns (P, diff(P, α)) such that:

1. |P| ≤ maxLength;
2. supp(P, T(α)) ≥ minSupp;
3. diff(P, α) ≥ minDiff.

To understand the meaning of contrast patterns, it is important to comment further on formula (2). Given a class α, a pattern P is a contrast pattern for that class if its support differs from the support of the same pattern for the complementary class. If the difference in support is equal to 0, P is present in the same way in the two classes, so it does not help to find the differences between them. Conversely, the further the support difference moves away from 0, the more P is to be understood as a pattern that allows the two classes under comparison to be distinguished. Therefore, P is a representative pattern for the class α but not for the complementary class.
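The definitions above translate directly into a brute-force miner. The following sketch (with a made-up four-transaction toy database; it makes no attempt at the efficiency of real pattern mining algorithms) enumerates all itemsets up to maxLength and filters them by minSupp and minDiff, exactly as in conditions 1–3:

```python
from itertools import combinations

def cover(pattern, rows):
    """Transactions (t -> itemset) whose itemset contains every item of the pattern."""
    return [t for t, items in rows.items() if pattern <= items]

def contrast_patterns(db, labels, alpha, max_length, min_supp, min_diff):
    """Enumerate all (P, diff(P, alpha)) satisfying the three conditions."""
    t_alpha = {t: i for t, i in db.items() if labels[t] == alpha}   # T(alpha)
    t_other = {t: i for t, i in db.items() if labels[t] != alpha}   # complement
    items = sorted(set().union(*db.values()))
    result = []
    for k in range(1, max_length + 1):                # |P| <= maxLength
        for combo in combinations(items, k):
            p = frozenset(combo)
            supp = len(cover(p, t_alpha))             # supp(P, T(alpha))
            diff = supp - len(cover(p, t_other))      # absolute support difference
            if supp >= min_supp and diff >= min_diff:
                result.append((p, diff))
    return result
```

For instance, with transactions {1: {a,b}, 2: {a,b}} labelled positive and {3: {a}, 4: {b}} labelled negative, the pattern {a,b} has support 2 in the positive class and 0 in the negative one, hence diff = 2: it is the strongest contrast pattern for the positive class.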
Answer Set Programming in a Nutshell
In the following we give a brief overview of the syntax and semantics of disjunctive logic programs in ASP. The reader can refer to, e.g., [1] for a more extensive introduction to ASP. Let U be a ﬁxed countable set of (domain) elements, also called constants, upon which a total order ≺ is deﬁned. An atom α is an expression p(t1 , . . . , tn ), where p is a predicate of arity n ≥ 0 and each ti is either a variable or an element from U (i.e., the resulting language is functionfree). An atom is ground if it is free of variables. We denote the set of all ground atoms over U by BU . A (disjunctive) rule r is of the form a1 ∨ . . . ∨ an ← b1 , . . . , bk , not bk+1 , . . . , not bm
with n ≥ 0, m ≥ k ≥ 0, n + m > 0, where a1 , . . . , an , b1 , . . . , bm are atoms, or a count expression of the form #count{l : l1 , . . . , li } u, where l is an atom and lj is a literal (i.e., an atom which can be negated or not), 1 ≥ j ≥ i, ∈ {≤, , ≥}, and u ∈ N. Moreover, “not” denotes default negation. The head of r is the set head(r) = {a1 , . . . , an } and the body of r is body(r) = {b1 , . . . , bk , notbk+1 , . . . , notbm }. Furthermore, we distinguish between body + (r) = {b1 , . . . , bk } and body − (r) = {bk+1 , . . . , bm }. A rule r is normal if n ≤ 1 and a constraint if n = 0. A rule r is safe if each variable in r occurs in body + (r). A rule r is ground if no variable occurs in r. A fact is a ground rule with body(r) = ∅ and head(r) = 1. An (input) database is a set of facts. A program is a ﬁnite set of rules. For a program Π and an input database D, we often write Π(D) instead of D ∪ Π. If each rule in a program is normal (resp. ground), we call the program normal (resp. ground). Given a program Π, let UΠ be the set of all constants appearing in Π. Gr(Π) is the set of rules rσ obtained by applying, to each rule r ∈ Π, all possible substitutions σ from the variables in r to elements of UΠ . For countexpressions, {l : l1 , . . . , ln } denotes the set of all ground instantiations of l, governed through l1 , . . . , ln . An interpretation I ⊆ BU satisﬁes a ground rule r iﬀ head(r)∩I = ∅ whenever body + (r) ⊆ I, body − (r)∩I = ∅, and for each contained countexpression, N u holds, where N = {ll1 , . . . , ln }, u ∈ N and ∈ {≤, < , =, >, ≥}. A ground program Π is satisﬁed by I, if I satisﬁes each r ∈ Π. A nonground rule r (resp., a program Π) is satisﬁed by an interpretation I iﬀ I satisﬁes all groundings of r (resp., Gr(Π)). A subsetminimal set I ⊆ BU satisfying the GelfondLifschitz reduct Π I = {head(r) ← body + (r)I ∩ body − (r) = ∅, r ∈ Gr(Π)} is called an answer set of Π. We denote the set of answer sets for a program Π by AS(Π). 
The tools used in this work are part of the Potassco2 collection [9]. The main tool of the collection is the clingo ASP solver [8].
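The stable-model semantics just recalled can be made concrete with a tiny brute-force checker for ground normal programs. This is a didactic sketch only, nothing like clingo's actual grounding and solving machinery: it represents a rule as a (head, positive body, negative body) triple, computes the Gelfond-Lifschitz reduct, and checks every candidate interpretation.

```python
from itertools import combinations

def reduct(program, interp):
    """Gelfond-Lifschitz reduct: delete rules whose negative body meets
    the interpretation, then drop the remaining 'not' literals."""
    return [(h, pos) for h, pos, neg in program if not (set(neg) & interp)]

def least_model(definite):
    """Least model of a definite program via naive fixpoint iteration."""
    model, changed = set(), True
    while changed:
        changed = False
        for h, pos in definite:
            if set(pos) <= model and h not in model:
                model.add(h)
                changed = True
    return model

def answer_sets(atoms, program):
    """I is an answer set iff it equals the least model of the reduct w.r.t. I."""
    return [set(c)
            for k in range(len(atoms) + 1)
            for c in combinations(sorted(atoms), k)
            if least_model(reduct(program, set(c))) == set(c)]
```

For the classic program a ← not b, b ← not a (encoded as [("a", [], ["b"]), ("b", [], ["a"])]), the checker finds exactly the two answer sets {a} and {b}, matching the semantics above.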
3 Mining Contrast Patterns with ASP
Within the declarative framework of ASP, the transaction database D is represented by means of facts of the following two kinds: class(t, c) and db(t, f(v)). Here, t is the TID, c represents the class, f represents a feature and v its value. In particular, we introduce the fact db(t, f(v)) if and only if D_{t,i} = 1, so there is a db fact for each feature. In DPM, patterns are represented as answer sets. More precisely, a single pattern is associated with each answer set and, in our approach, it is represented by means of the in_pattern/1 and absolute_diff/1 predicates. The latter expresses the difference in support of the pattern between the class under consideration and the complementary class. Each pattern conveys information that allows the considered class to be characterised.
2 https://potassco.org/.
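Producing such facts from a tabular dataset is straightforward; the helper below is a sketch (its name and the toy feature names are made up) showing the db/2 and class/2 format consumed by the encoding in Listing 1.1:

```python
def to_asp_facts(rows, labels):
    """Serialise a transaction database into class/2 and db/2 ASP facts;
    each feature-value pair of transaction t becomes a db(t, f(v)) fact."""
    lines = []
    for t in sorted(rows):
        lines.append(f"class({t},{labels[t]}).")
        for feature, value in sorted(rows[t].items()):
            lines.append(f"db({t},{feature}({value})).")
    return "\n".join(lines)
```

For example, a transaction 1 labelled positive with features duration=short and hour=night yields the facts class(1,positive)., db(1,duration(short)). and db(1,hour(night))., ready to be passed to an ASP solver together with the encoding.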
 1  #const minSupp = 2.
 2  #const maxLength = 3.
 3  #const minDiff = 1.
 4  #const class = positive.
 5
 6  % link facts to objects used in the encoding
 7  item(I) :- db(_, I).
 8  transaction(T) :- db(T, _).
 9
10  % problem encoding (frequent itemset mining)
11  { in_pattern(I) } :- item(I).
12  in_support(T) :- { conflict_at(T, I) : item(I) } 0, transaction(T), class(T, class).
13  out_support(T) :- { conflict_out(T, I) : item(I) } 0, transaction(T), not class(T, class).
14  conflict_at(T, I) :- not db(T, I), in_pattern(I), transaction(T), class(T, class).
15  conflict_out(T, I) :- not db(T, I), in_pattern(I), transaction(T), not class(T, class).
16
17  % definition of absolute support difference (Dong et al.)
18  absolute_diff(D) :- N = #count{ T : in_support(T) }, M = #count{ T : out_support(T) }, D = N - M.
19
20  % length constraint
21  :- maxLength+1 { in_pattern(I) }.
22  :- { in_pattern(I) } 0.
23
24  % frequency constraint
25  :- { in_support(T) } minSupp-2.
26
27  % absolute growth-rate constraint
28  :- absolute_diff(D), D < minDiff.
29
30  % print directives for an answer set
31  #show in_pattern/1.
32  #show absolute_diff/1.

Listing 1.1. Full ASP encoding for contrast pattern mining.
F. A. Lisi and G. Sterlicchio

The ASP encoding for the contrast pattern mining problem introduced in Sect. 2.1 is reported in Listing 1.1. The values for minSupp, minDiff and maxLength are encoded as symbolic constants; the values chosen in Lines 1–4 are for demonstration purposes only. The predicate in_pattern/1 (Line 11) is true for an item i if and only if i is included in a pattern P; it encodes the most important part of a solution (P, diff(P, α)). The predicate in_support/1 (Line 12) is true for a transaction t if and only if t ∈ T. The intuition is that each t has to support each i ∈ I, in the sense that t must include i. Additionally, we use the auxiliary predicates item/1 (Line 7, true for each item in D), transaction/1 (Line 8, true for each transaction in D) and conflict_at/2 (Line 14), which is true for (t, i) if and only if t does not support i, that is, we have the conflict Dt,i = 0 and i ∈ I, thus violating the premises. In particular, the predicates in_support/1 and conflict_at/2 encode the construction of patterns for the class α. Conversely, the predicates out_support/1 (Line 13) and conflict_out/2 (Line 15) are used to generate patterns for the complementary class. Finally, the definition of the absolute support difference is encoded at Line 18. After the pattern generation step, the encoding applies three constraints corresponding to the thresholds maxLength, minSupp, and minDiff. The first constraint, expressed by Lines 21–22, rules out patterns having 0 items or more than maxLength items. The second constraint is expressed at Line 25: patterns supported by at most minSupp − 2 instances are not allowed as an answer. The third constraint, encoded at Line 28, discards from the answer set patterns with absolute support difference lower than minDiff. The two #show directives on Lines 31–32 display, for each answer set, the atoms that compose a solution (P, diff(P, α)) to the problem at hand. The encoding and further material can be found online.³
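For intuition, the same mining problem can also be solved by brute-force enumeration in plain Python. The following is only a reference sketch of the problem definition (pattern = itemset, contrast = absolute support difference, constraints mirroring Lines 21–28), not the paper's ASP pipeline, and the toy database is invented:

```python
from itertools import combinations

def mine_contrast_patterns(db, labels, target, min_supp=2, max_len=3, min_diff=1):
    """Brute-force contrast pattern mining.
    db: {tid: set of items}; labels: {tid: class}; target: class under study.
    Mirrors the ASP constraints: length in 1..max_len, in-class support
    strictly above min_supp - 2, support difference at least min_diff."""
    items = sorted({i for t in db.values() for i in t})
    in_tids = [t for t in db if labels[t] == target]
    out_tids = [t for t in db if labels[t] != target]
    results = []
    for k in range(1, max_len + 1):
        for pattern in combinations(items, k):
            p = set(pattern)
            n = sum(1 for t in in_tids if p <= db[t])   # in-class support
            m = sum(1 for t in out_tids if p <= db[t])  # out-of-class support
            if n > min_supp - 2 and n - m >= min_diff:
                results.append((pattern, n - m))
    return results

# Toy data (invented): two classes, three items.
db = {1: {"a", "b"}, 2: {"a"}, 3: {"a", "c"}, 4: {"b"}}
labels = {1: "pos", 2: "pos", 3: "pos", 4: "neg"}
print(mine_contrast_patterns(db, labels, "pos"))
```

Each returned pair corresponds to one answer set of the encoding: the itemset plays the role of the in_pattern atoms and the integer that of absolute_diff.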
4 An Application in Digital Forensics
Digital Forensics (DF) is a branch of criminalistics that deals with the identification, acquisition, preservation, analysis and presentation of the information content of computer systems, or in general of digital devices, by means of specialized software, and according to specific regulations. In particular, the phase of Evidence Analysis involves examining and aggregating evidence about possible crimes and crime perpetrators collected from various electronic devices, in order to reconstruct events, event sequences and scenarios related to a crime. Results from this phase are then made available to law enforcement, investigators, intelligence agencies, public prosecutors, lawyers and judges. During the investigation of a crime, it is common to analyze the communications of a particular suspect. Since nowadays mobile phones are owned by almost everyone, it can be useful for investigators to analyze the calls or messages exchanged. The telephone records are a set of data relating to the external communications of a device. In other words, they contain all the traces of communications (calls, SMS, and all the data traffic) concerning a specific user over a certain period of time. Note that phone records do not trace sensitive data such as the audio of sent or received calls: they only provide a trace of the communication that has taken place, not its content. Phone records can be requested by the Judicial Authority if deemed useful to carry out investigations involving the individual owner of the phone. Correctly analyzing the telephone records is essential to obtain useful hints. Depending on the analysis, different kinds of information can be extracted. The records are typically analyzed for comparing the geographical positions with
³ https://github.com/mpia3/ContrastPatternMining.
respect to the declarations, and for reconstructing the network of contacts of a single user, in order to trace which conversations (s)he has had with whom, where and when. In this section we report the preliminary results obtained by applying our ASP encoding for contrast pattern mining to a dataset of phone records.

4.1 The DigForASP Dataset
For our experiments we have considered a dataset that consists of the telephone records of four users from a real-world investigative case. The dataset has been made available by Prof. David Billard (University of Applied Sciences in Geneva) under NDA to DigForASP members for academic experimentation. Each file in the dataset has the following schema:

– Type: what kind of operation the user has performed (e.g., incoming/outgoing call or SMS);
– Caller: who makes the call or sends an SMS;
– Callee: who receives the call or SMS;
– Street: where the operation has taken place;
– Time: when the operation has taken place (ISO format⁴ HH:MM:SS);
– Duration: how long the operation has lasted (ISO format HH:MM:SS);
– Date: when the operation has taken place (format: day, month, year).

The type of the operation is one of the following cases: "config", "gprs", "redirect", "out sms(SUB TYPE)", "in sms(SUB TYPE)", "out call(SUB TYPE)", "in call(SUB TYPE)". Subtypes are: "simple", "ack", "foreign". The dataset has undergone the mandatory anonymization process for reasons of privacy and confidentiality. Therefore it does not contain data that allows tracing back to the real people involved in the investigative case. For instance, there is no phone number for the caller/callee but only a fictitious name. The names and the sizes (# rows) of the four files in the dataset are the following: Eudokia Makrembolitissa (8,783), Karen Cook McNally (20,894), Laila Lalami (12,689), and Lucy Delaney (8,480).

4.2 Preprocessing and ASP Encoding of the Dataset
The DigForASP dataset in its original format cannot be considered as a set of transactions in ASP syntax. It needs to undergo a transformation into the format described in Sect. 3. In short, each row of the dataset is encoded as a collection of facts through the class and db predicates. The transformation has been done by means of a Python script. The classes refer to the operation type, namely: "in sms", "out sms", "in call", "out call", "config", "redirect", "gprs". The features are: caller, callee, street a, street b, time, weekday and duration. The weekday feature does not appear in the original dataset; it has been added with the following values: 0 = Monday, ..., 6 = Sunday. The duration feature has undergone a transformation
⁴ Format to describe dates and times: https://en.wikipedia.org/wiki/ISO_8601.
in order to obtain a value expressed in seconds. The time feature has been discretized into four time slots: "morning" (from 06:00:00 to 11:59:59), "afternoon" (from 12:00:00 to 17:59:59), "evening" (from 18:00:00 to 23:59:59), and "night" (from 00:00:00 to 05:59:59). Depending on the analyst's needs, it is possible to consider (and encode) only the transactions related to specific days, months or years, so as to subsequently carry out a more granular analysis. The transactions are sorted by date and time, as shown in Table 1.

Table 1. ASP encoding of some transactions from Karen's phone recordings from the morning of 07/09/2040 to the night of 08/09/2040.
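The preprocessing steps just described (weekday extraction, duration in seconds, time-slot discretization) can be sketched as follows; the function names are ours, not those of the paper's script:

```python
from datetime import date, time

def weekday_of(d: date) -> int:
    """0 = Monday, ..., 6 = Sunday, as in the paper's encoding."""
    return d.weekday()

def duration_seconds(hms: str) -> int:
    """Convert an HH:MM:SS duration into a value in seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

def time_slot(t: time) -> str:
    """Discretize a time of day into the four slots used in the paper."""
    if time(6, 0, 0) <= t <= time(11, 59, 59):
        return "morning"
    if time(12, 0, 0) <= t <= time(17, 59, 59):
        return "afternoon"
    if time(18, 0, 0) <= t <= time(23, 59, 59):
        return "evening"
    return "night"

print(duration_seconds("00:02:30"), time_slot(time(9, 15, 0)))
```

A row of the raw dataset would then be turned into db facts such as db(t, weekday(4)) and db(t, time(morning)).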
4.3 Experiments
For the experiments presented here we have run the ASP encoding reported in Listing 1.1 over the largest file from the DigForASP dataset, namely Karen's phone records, made up of more than 20,000 rows. As regards the ASP solver, we have used version 5.4.0 of clingo, with default solving parameters. The hardware and software platform was a laptop computer with Windows 10 (with an Ubuntu 20.04.4 subsystem), an AMD Ryzen 5 3500U @ 2.10 GHz and 8 GB RAM, without using the multithreading mode of clingo. Multithreading reduces the mean runtime but introduces variance due to the random allocation of tasks; such variance is inconvenient when interpreting results over repeated executions.

Exploratory Tests. During an investigation it is useful to understand what kind of information the extracted patterns can offer, in order to guide and support law enforcement in deciding the next steps to take during the investigation. In Listing 1.2, as an illustrative example of the potential usefulness of contrast pattern mining in the DF field, we report the results obtained on Karen's phone records for the class "out call". Here, we have set the minimum support threshold
to 10% and the maximum pattern length to 3. Overall, the nine contrast patterns returned by the algorithm provide rich information about the habits of Karen as regards outgoing calls, in contrast to other types of communication. Notably, they tell us that Karen's outgoing calls are mainly made in the morning (Line 8) or in the afternoon (Line 6). In particular, the answer at Line 4 highlights that outgoing calls are made mainly on Fridays.

1  in_pattern(caller(karen_cook_mcnally)) absolute_diff(430)
2  in_pattern(time(evening)) absolute_diff(24)
3  in_pattern(caller(karen_cook_mcnally)) in_pattern(time(evening)) absolute_diff(129)
4  in_pattern(weekday(4)) absolute_diff(14)
5  in_pattern(weekday(4)) in_pattern(caller(karen_cook_mcnally)) absolute_diff(126)
6  in_pattern(time(afternoon)) absolute_diff(34)
7  in_pattern(caller(karen_cook_mcnally)) in_pattern(time(afternoon)) absolute_diff(202)
8  in_pattern(time(morning)) absolute_diff(37)
9  in_pattern(time(morning)) in_pattern(caller(karen_cook_mcnally)) absolute_diff(103)

Listing 1.2. Contrast patterns for the "out call" class.
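Output lines like those in Listing 1.2 are easy to post-process. A possible parser (our own, illustrative) that turns one answer-set line into a (pattern, diff) pair:

```python
import re

def parse_answer_line(line: str):
    """Parse one answer-set line from Listing 1.2 into (items, diff).
    Each item is the argument of an in_pattern/1 atom, e.g. 'time(morning)'."""
    items = re.findall(r"in_pattern\((\w+\([^()]*\))\)", line)
    diff = int(re.search(r"absolute_diff\((\d+)\)", line).group(1))
    return items, diff

line = "in_pattern(time(morning)) in_pattern(caller(karen_cook_mcnally)) absolute_diff(103)"
print(parse_answer_line(line))
# → (['time(morning)', 'caller(karen_cook_mcnally)'], 103)
```

Such a step would let an analyst sort the extracted patterns by support difference before inspecting them.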
Scalability Tests. With scalability tests, the goal is to assess the performance of the ASP encoding on datasets of increasing size. Once again, we have considered the file of Karen's phone records, from which we have extracted 100, 1,000 and 10,000 rows for the three groups of experiments. In each group, the experiments have been conducted by varying the class for the contrast and the minimum support threshold, while keeping the maximum pattern length fixed to 3. The first group of experiments considers the subset consisting of 100 rows. Observing Table 2, the class with the greatest number of contrast patterns concerns the "out call" operation. With this order of magnitude, the extraction times of the patterns are less than one second for all classes. In general, the memory used for this operation is at most 25 MB. The second group of experiments considers a subset consisting of 1,000 rows. From Table 3, we observe that the class with the greatest number of contrast patterns is again "out call". It is worthwhile to note that, with an increase in the order of magnitude from hundreds to thousands, the execution time fluctuates in a range between 5 and 10 s, with a minimum percentage variation equal to 400% (Fig. 1 B). The memory consumed in this case is much higher than in the previous batch of experiments, since it jumps to a minimum of more than 300 MB and a maximum of around 460 MB (Fig. 1 C). The third group of experiments considers a subset consisting of 10,000 rows. Unlike the previous two groups, this group did not produce results because the amount of RAM to be allocated was so high
Table 2. Number of patterns, execution time (seconds), solver time (seconds) and memory consumption (MB) for 100 rows from Karen's phone records.

in sms
Th.   #Pat.  Exec. t.  Solv. t.  Memory
10%   14     0.119     0.01      23.67
20%   0      0.087     0.00      22.11
30%   0      0.081     0.00      22.35
40%   0      0.089     0.00      22.11
50%   0      0.085     0.00      21.85

out sms
Th.   #Pat.  Exec. t.  Solv. t.  Memory
10%   0      0.085     0.00      22.31
20%   0      0.076     0.00      21.93
30%   0      0.086     0.00      21.67
40%   0      0.086     0.00      22.31
50%   0      0.086     0.00      22.18

in call
Th.   #Pat.  Exec. t.  Solv. t.  Memory
10%   21     0.137     0.03      24.22
20%   14     0.118     0.01      24.01
30%   7      0.120     0.01      24.01
40%   0      0.086     0.00      21.98
50%   0      0.084     0.00      21.44

out call
Th.   #Pat.  Exec. t.  Solv. t.  Memory
10%   32     0.136     0.03      24.23
20%   14     0.122     0.02      24.23
30%   14     0.128     0.01      24.23
40%   7      0.121     0.01      24.23
50%   7      0.117     0.01      24.44
(around 8 GB) that the clingo process was killed by the operating system. Considering the pattern generation rule at Line 11 of Listing 1.1, the number of item atoms that must be combined to form the in_pattern atoms is equal to 2,010; in the case of 100 and 1,000 rows the numbers of items are 180 and 670, respectively. Since the total number of combinations of k items out of n is defined by

    C(n, k) = n! / (k!(n − k)!)     (3)

and the pattern length k varies from 1 to 3 in our tests, the total number of combinations for the problem at hand is given by the sum of:

– groupings of class 1: 2010!/(1!(2010 − 1)!);
– groupings of class 2: 2010!/(2!(2010 − 2)!);
– groupings of class 3: 2010!/(3!(2010 − 3)!).

It is clear that the computation required to solve the problem at hand is very heavy for a dataset of tens of thousands of rows or even more.
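The sum above is easy to evaluate with Python's math.comb, using the item counts reported for the three subsets:

```python
import math

def candidate_patterns(n_items: int, max_len: int = 3) -> int:
    """Number of candidate itemsets of length 1..max_len over n_items items,
    i.e. the sum of C(n, k) for k = 1..max_len (Eq. 3)."""
    return sum(math.comb(n_items, k) for k in range(1, max_len + 1))

for n in (180, 670, 2010):
    print(n, candidate_patterns(n))
# For 2,010 items: 2010 + 2019045 + 1351414120 = 1353435175,
# i.e. over 1.3 billion candidate patterns.
```

This makes concrete why the 10,000-row subset exhausted the available memory while the 100- and 1,000-row subsets did not.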
5 Final Remarks
DPM is a promising direction of research in AI. We do not expect DPM to be competitive with dedicated algorithms; rather, it takes advantage of the versatility of declarative frameworks to propose pattern mining tools that can exploit background knowledge during the mining process so as to extract fewer but more meaningful patterns. Such tools are particularly welcome in application domains where the requirement of transparency is crucial. This motivation is at the
Table 3. Number of patterns, execution time (sec), solver time (sec) and memory consumption (MB) for 1,000 rows from Karen's phone records.

in sms
Th.   #Pat.  Exec. t.  Solv. t.  Memory
10%   3      5.929     0.15      427.18
20%   0      5.178     0.00      345.7
30%   0      4.939     0.00      345.7
40%   0      4.843     0.00      345.7
50%   0      4.980     0.00      345.7

out sms
Th.   #Pat.  Exec. t.  Solv. t.  Memory
10%   0      4.979     0.00      336.36
20%   0      4.761     0.00      325.87
30%   0      4.715     0.00      336.36
40%   0      4.795     0.00      336.36
50%   0      4.733     0.00      323.02

in call
Th.   #Pat.  Exec. t.  Solv. t.  Memory
10%   5      7.683     1.65      453.93
20%   1      6.834     0.71      453.92
30%   1      6.423     0.36      454.03
40%   0      4.916     0.00      346.46
50%   0      4.978     0.00      346.46

out call
Th.   #Pat.  Exec. t.  Solv. t.  Memory
10%   9      10.155    3.87      465.07
20%   3      8.591     2.41      465.11
30%   1      7.603     1.40      464.89
40%   1      6.765     0.56      465.08
50%   0      4.945     0.00      354.01
basis of a renewed interest of the AI community in declarative approaches. In particular, the expressive power of ASP makes the definition of algorithmic variants of the basic encoding quite easy, mainly thanks to a clever use of constraints. Also, the availability of efficient ASP solvers encourages their use in applications characterized by combinatorial problems, such as the ones in pattern mining. Contrast Pattern Mining is an interesting class of pattern mining problems. It is somehow halfway between discrimination and characterization of a data set, due to the use of class labels to guide the search for regularities. Nevertheless, to the best of our knowledge, it has not been addressed so far in DPM research. Our declarative approach is therefore a novel contribution to pattern mining which paves the way to new exciting AI applications. In particular, due to its inherent transparency, it appears to be suitable for analysing evidence in the context of DF investigations. As a case study we have considered the analysis of a real-world dataset of anonymised phone recordings. The preliminary results are encouraging, although they highlight some weaknesses. In particular, the combinatorial explosion affects the scalability of the approach. However, when compared to sequential pattern mining on the same dataset [15,16], it is noteworthy that in contrast pattern mining the solver takes much less time. This is partially due to the fact that the labeling of transactions with classes makes the search space smaller. For the future we plan to explore several directions of improvement of the work as regards efficiency and scalability. This implies different choices for the encoding, the solver, and the computing platform. Experiments could, for instance, be replicated with other ASP solvers, such as DLV2 [14], which has proved to be scalable on large datasets. Hybrid ASP approaches to pattern mining such as [18] could be adopted. An empirical evaluation of the approach with a more
Fig. 1. Comparison w.r.t. the number of patterns extracted (A), execution time (B) and memory consumption (C) for the “out call” class (Tables 2 and 3).
performant hardware is also planned. Besides the improvement of the current work, we intend to consider other variants of the contrast pattern mining problem. In parallel to the methodological work, we would like to benefit from a tighter interaction with DF experts, in order to get their feedback on the validity and usefulness of our work from the DF viewpoint, and their suggestions for new interesting directions of applied research in this field.

Acknowledgments. This article is based upon work from COST Action 17124 "Digital forensics: evidence analysis via intelligent systems and practices (DigForASP)", supported by COST (European Cooperation in Science and Technology). The work is also partially funded by the Università degli Studi di Bari "Aldo Moro" under the 2017–2018 grant "Metodi di Intelligenza Artificiale per l'Informatica Forense".
References

1. Brewka, G., Eiter, T., Truszczynski, M.: Answer set programming at a glance. Commun. ACM 54(12), 92–103 (2011). http://doi.acm.org/10.1145/2043174.2043195
2. Costantini, S., De Gasperis, G., Olivieri, R.: How answer set programming can help in digital forensic investigation. In: Ancona, D., Maratea, M., Mascardi, V. (eds.) Proceedings of the 30th Italian Conference on Computational Logic, Genova, Italy, 1–3 July 2015. CEUR Workshop Proceedings, vol. 1459, pp. 53–65. CEUR-WS.org (2015)
3. Costantini, S., De Gasperis, G., Olivieri, R.: Digital forensics and investigations meet artificial intelligence. Ann. Math. Artif. Intell. 86(1–3), 193–229 (2019). https://doi.org/10.1007/s10472-018-09632-y
4. Costantini, S., Lisi, F.A., Olivieri, R.: DigForASP: a European cooperation network for logic-based AI in digital forensics. In: Casagrande, A., Omodeo, E.G. (eds.) Proceedings of the 34th Italian Conference on Computational Logic, Trieste, Italy, 19–21 June 2019. CEUR Workshop Proceedings, vol. 2396, pp. 138–146. CEUR-WS.org (2019)
5. De Raedt, L., Guns, T., Nijssen, S.: Constraint programming for data mining and machine learning. In: Twenty-Fourth AAAI Conference on Artificial Intelligence (2010)
6. Dong, G., Bailey, J.: Contrast Data Mining: Concepts, Algorithms, and Applications. CRC Press, Boca Raton (2012)
7. Gebser, M., Guyet, T., Quiniou, R., Romero, J., Schaub, T.: Knowledge-based sequence mining with ASP. In: IJCAI 2016 – 25th International Joint Conference on Artificial Intelligence, p. 8. AAAI (2016)
8. Gebser, M., Kaminski, R., Kaufmann, B., Schaub, T.: Clingo = ASP + control: preliminary report. arXiv preprint arXiv:1405.3694 (2014)
9. Gebser, M., Kaufmann, B., Kaminski, R., Ostrowski, M., Schaub, T., Schneider, M.: Potassco: the Potsdam answer set solving collection. AI Commun. 24(2), 107–124 (2011)
10. Guns, T., Dries, A., Nijssen, S., Tack, G., De Raedt, L.: MiningZinc: a declarative framework for constraint-based mining. Artif. Intell. 244, 6–29 (2017)
11. Guyet, T., Moinard, Y., Quiniou, R., Schaub, T.: Efficiency analysis of ASP encodings for sequential pattern mining tasks. In: Pinaud, B., Guillet, F., Cremilleux, B., de Runz, C. (eds.) Advances in Knowledge Discovery and Management. SCI, vol. 732, pp. 41–81. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-65406-5_3
12. Han, J., Cheng, H., Xin, D., Yan, X.: Frequent pattern mining: current status and future directions. Data Min. Knowl. Discov. 15(1), 55–86 (2007). https://doi.org/10.1007/s10618-006-0059-1
13. Jabbour, S., Sais, L., Salhi, Y.: Decomposition based SAT encodings for itemset mining problems. In: Cao, T., Lim, E.P., Zhou, Z.H., Ho, T.B., Cheung, D., Motoda, H. (eds.) PAKDD 2015. LNCS (LNAI), vol. 9078, pp. 662–674. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-18032-8_52
14. Leone, N., et al.: Enhancing DLV for large-scale reasoning. In: Balduccini, M., Lierler, Y., Woltran, S. (eds.) Logic Programming and Nonmonotonic Reasoning - 15th International Conference, LPNMR 2019, Philadelphia, PA, USA, 3–7 June 2019, Proceedings. LNCS, vol. 11481, pp. 312–325. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20528-7_23
15. Lisi, F.A., Sterlicchio, G.: Declarative pattern mining in digital forensics: preliminary results. In: Calegari, R., Ciatto, G., Omicini, A. (eds.) Proceedings of the 37th Italian Conference on Computational Logic, Bologna, Italy, 29 June – 1 July 2022. CEUR Workshop Proceedings, vol. 3204, pp. 232–246. CEUR-WS.org (2022)
16. Lisi, F.A., Sterlicchio, G.: Mining sequences in phone recordings with answer set programming. In: Bruno, P., Calimeri, F., Cauteruccio, F., Maratea, M., Terracina, G., Vallati, M. (eds.) HYDRA - RCRA 2022: 1st International Workshop on Hybrid Models for Coupling Deductive and Inductive Reasoning and 29th RCRA Workshop on Experimental Evaluation of Algorithms for Solving Problems with Combinatorial Explosion. CEUR Workshop Proceedings. CEUR-WS.org (2022)
17. Negrevergne, B., Guns, T.: Constraint-based sequence mining using constraint programming. In: Michel, L. (ed.) CPAIOR 2015. LNCS, vol. 9075, pp. 288–305. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-18008-3_20
18. Paramonov, S., Stepanova, D., Miettinen, P.: Hybrid ASP-based approach to pattern mining. Theory Pract. Log. Program. 19(4), 505–535 (2019). https://doi.org/10.1017/S1471068418000467
Graphs and Networks
Approximate Inference in Probabilistic Answer Set Programming for Statistical Probabilities

Damiano Azzolini¹(B), Elena Bellodi², and Fabrizio Riguzzi³

¹ Dipartimento di Scienze dell'Ambiente e della Prevenzione, Università di Ferrara, Ferrara, Italy [emailprotected]
² Dipartimento di Ingegneria, Università di Ferrara, Ferrara, Italy [emailprotected]
³ Dipartimento di Matematica e Informatica, Università di Ferrara, Ferrara, Italy [emailprotected]

Abstract. "Type 1" statements were introduced by Halpern in 1990 with the goal of representing statistical information about a domain of interest. These are of the form "x% of the elements share the same property". The recently proposed language PASTA (Probabilistic Answer set programming for STAtistical probabilities) extends Probabilistic Logic Programs under the Distribution Semantics and allows the definition of this type of statement. To perform exact inference, PASTA programs are converted into probabilistic answer set programs under the Credal Semantics. However, this algorithm is infeasible for scenarios in which more than a few random variables are involved. Here, we propose several algorithms to perform both conditional and unconditional approximate inference in PASTA programs and test them on different benchmarks. The results show that approximate algorithms scale to hundreds of variables and thus can manage real-world domains.

Keywords: Probabilistic Answer Set Programming · Credal Semantics · Statistical statements · Approximate inference
1 Introduction
In [14] Halpern discusses the difference between "Type 1" (T1) and "Type 2" (T2) statements: the former describe a statistical property of the world of interest, while the latter represent a degree of belief. "The probability that a random person smokes is 20%" is an example of a "Type 1" statement, while "John smokes with probability 30%", where John is a particular individual, is an example of a "Type 2" statement. Answer Set Programming (ASP) [7] is a powerful language that makes it easy to encode complex domains. However, ASP does not allow expressing uncertainty on the data. To handle this, we need to consider Probabilistic ASP (PASP), where the uncertainty is expressed through probabilistic facts, as done in Probabilistic

© The Author(s) 2023. A. Dovier et al. (Eds.): AIxIA 2022, LNAI 13796, pp. 33–46, 2023. https://doi.org/10.1007/978-3-031-27181-6_3
Logic Programming [10]. We focus here on PASP under the Credal Semantics [9], where each query is associated with a probability interval defined by a lower and an upper bound. Recently, the authors of [3] introduced PASTA ("Probabilistic Answer set programming for STAtistical probabilities"), a new language (and software) where statistical statements are translated into PASP rules and inference is performed by converting the PASP program into an equivalent answer set program. However, performing exact inference is exponential in the number of probabilistic facts, and thus it is infeasible in the case of more than a few dozen variables. In this paper, we propose four algorithms to perform approximate inference in PASTA programs: one for unconditional sampling and three for conditional sampling, which adopt rejection sampling, Metropolis-Hastings sampling, and Gibbs sampling. Empirical results show that our algorithms can handle programs with hundreds of variables. Moreover, we compare our algorithms with PASOCS [23], a solver able to perform approximate inference in PASP programs under the Credal Semantics, showing that our algorithms reach comparable accuracy in lower execution time. The paper is structured as follows: Sect. 2 discusses some related works and Sect. 3 introduces background concepts. Section 4 describes our algorithms for approximate inference in PASTA programs, which are tested in Sect. 5. Section 6 concludes the paper.
2 Related Work
PASTA [3] extends Probabilistic Logic Programming [20] under the Distribution Semantics [21] by allowing the definition of statistical statements. Statistical statements, also referred to as "probabilistic conditionals", are discussed in [16], where the authors give a semantics to T1 statements leveraging the maximum entropy principle. Under this interpretation, they consider the unique model that yields the maximum entropy. Unlike them, we consider all the models, thus obtaining a more general framework [3]. T1 statements are also studied in [15] and [24]: the former adopts the cross entropy principle to assign a semantics to T1 statements, while the latter identifies only a specific model and a sharp probability value, rather than all the models and an interval for the probability, as we do. We adopt the Credal Semantics [9] for PASP, where the probability of a query is defined by a range. To the best of our knowledge, the only work which performs inference in PASP under the Credal Semantics is PASOCS [23]. Its authors propose both an exact solver, which relies on the generation of all the possible combinations of facts, and an approximate one, based on sampling. We compare our approach with it in Sect. 5. Other solutions for inference in PASP consider different semantics that assign a sharp probability value to a query, such as [6,17,19,22].
3 Background
We assume that the reader is familiar with the basic concepts of Logic Programming. For a complete treatment of the field, see [18]. An Answer Set Programming (ASP) [7] rule has the form h1 ; ... ; hm :- b1, ..., bn. where each hi is an atom, each bi is a literal, and :- is called the neck operator. The disjunction of the hi's is called the head, while the conjunction of the bi's is called the body of the rule. Particular configurations of the atoms/literals in the head/body identify specific types of rules: if the head is empty and the body is not, the rule is a constraint. Likewise, if the body is empty and the head is not, the rule is a fact, and the neck operator is usually omitted. We consider only rules where every variable also appears in a positive literal in the body; these rules are called safe. Finally, a rule is called ground if it does not contain variables. In addition to atoms and literals, we also consider aggregate atoms of the form γ1 ω1 #ζ{ε1; ...; εl} ω2 γ2, where γ1 and γ2 are constants or variables called guards, ω1 and ω2 are arithmetic comparison operators (such as > and ≥), #ζ is an aggregate function symbol (such as #count), and each εi is an expression of the form t1, ..., ti : F, where F is a conjunction of literals and i > 0. Moreover, each variable in t1, ..., ti also appears in F. We denote an answer set program with P and its Herbrand base, i.e., the set of atoms that can be constructed with all the symbols in it, as BP. An interpretation I ⊆ BP satisfies a ground rule if at least one of the hi's is true in I whenever the body is true in I. A model is an interpretation that satisfies all the ground rules of a program P. The reduct [11] of a ground program Pg with respect to an interpretation I is a new program Pgr obtained from Pg by removing the rules in which some bi is false in I. Finally, an interpretation I is an answer set for P if it is a minimal model of Pgr. We consider minimality in terms of set inclusion and denote with AS(P) the set of all the answer sets of P.
Probabilistic Answer Set Programming (PASP) [8] is to Answer Set Programming what Probabilistic Logic Programming [20] is to Logic Programming: it allows the definition of uncertain data through probabilistic facts. Following the ProbLog [10] syntax, these facts can be represented with Π::f, where f is a ground atom and Π is its probability. If we assign a truth value to every probabilistic fact (where ⊤ represents true and ⊥ represents false) we obtain a world, i.e., an answer set program. There are 2^n worlds for a probabilistic answer set program, where n is the number of ground probabilistic facts. Many Probabilistic Logic Programming languages rely on the distribution semantics [21], according to which the probability of a world w is computed with the formula

    P(w) = ∏_{i : fi = ⊤} Πi · ∏_{i : fi = ⊥} (1 − Πi)

while the probability of a query q (a conjunction of ground literals) is computed with the formula

    P(q) = Σ_{w ⊨ q} P(w)

when the world has a single answer set.
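These two formulas are straightforward to compute by enumeration when the number of probabilistic facts is small. A sketch, using the fact names and probabilities of Example 1 below:

```python
from itertools import product

# Probabilities of the ground probabilistic facts iron(1), iron(2), iron(3).
probs = {"iron(1)": 0.2, "iron(2)": 0.9, "iron(3)": 0.6}

def world_probability(world: dict) -> float:
    """P(w): product of Pi for true facts and (1 - Pi) for false ones."""
    p = 1.0
    for fact, prob in probs.items():
        p *= prob if world[fact] else 1.0 - prob
    return p

# Enumerate all 2^3 worlds; their probabilities must sum to 1.
worlds = [dict(zip(probs, values)) for values in product([True, False], repeat=3)]
total = sum(world_probability(w) for w in worlds)
print(len(worlds), round(total, 10))  # → 8 1.0
```

P(q) would then be obtained by summing world_probability over the worlds whose (single) answer set entails q.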
For performing inference in PASP we consider the Credal Semantics [8], where every query q is associated with a probability range: the upper probability bound P̄(q) is given by the sum of the probabilities of the worlds w where there is at least one answer set of w in which the query is present. Conversely, the lower probability bound P̲(q) is given by the sum of the probabilities of the worlds w where the query is present in all the answer sets of w, i.e.,

    P̲(q) = Σ_{wi : |AS(wi)| > 0 ∧ ∀m ∈ AS(wi), m ⊨ q} P(wi),    P̄(q) = Σ_{wi : ∃m ∈ AS(wi), m ⊨ q} P(wi)
Note that the credal semantics requires that every world has at least one answer set. In the remaining part of the paper we consider only programs where this requirement is satisfied.

Example 1 (PASP Example). We consider 3 objects whose components are unknown and suppose that some of them may be made of iron with a given probability. An object made of iron may get rusty or not. We want to know the probability that a particular object is rusty. This can be modelled with:

1  0.2::iron(1).
2  0.9::iron(2).
3  0.6::iron(3).
4  rusty(X) ; not_rusty(X) :- iron(X).
5  :- #count{X : rusty(X), iron(X)} = RI, #count{X : iron(X)} = I, 10*RI < 6*I.
The constraint states that at least 60% of the objects made of iron are rusty. This program has 2^3 = 8 worlds. For example, the world where all three probabilistic facts are true has 4 answer sets. If we consider the query q = rusty(1), this world contributes only to the upper probability, since the query is present in only 3 of the 4 answer sets. By considering all the worlds, we get \underline{P}(q) = 0.092 and \overline{P}(q) = 0.2, so the probability of the query lies in the range [0.092, 0.2].

If we want to compute the conditional probability of a query q given evidence e, we need two different formulas for the lower and the upper probability bound [8]:

\underline{P}(q \mid e) = \frac{\underline{P}(q, e)}{\underline{P}(q, e) + \overline{P}(\neg q, e)}, \qquad \overline{P}(q \mid e) = \frac{\overline{P}(q, e)}{\overline{P}(q, e) + \underline{P}(\neg q, e)} \qquad (1)
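Eq. (1) can be sketched as a small helper; the argument names (lqe for the lower bound on P(q, e), unqe for the upper bound on P(not q, e), and so on) are our own:

```python
# Sketch of Eq. (1): conditional credal bounds from the four joint bounds.
# lqe/uqe: lower/upper P(q, e); lnqe/unqe: lower/upper P(not q, e).

def conditional_bounds(lqe, uqe, lnqe, unqe):
    # the bounds are undefined (None) when a denominator is 0
    lower = lqe / (lqe + unqe) if (lqe + unqe) > 0 else None
    upper = uqe / (uqe + lnqe) if (uqe + lnqe) > 0 else None
    return lower, upper

# If P(q, e) lies in [0.1, 0.3] and P(not q, e) in [0.2, 0.4]:
print(conditional_bounds(0.1, 0.3, 0.2, 0.4))  # approximately (0.2, 0.6)
```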
Approximate Algorithms for Statistical Probabilities    37

Clearly, these formulas are valid only if the denominator is different from 0; otherwise, the value is undefined. If we consider again Example 1 with query q = rusty(1) and evidence e = iron(2), we get \underline{P}(q \mid e) = 0.08 and \overline{P}(q \mid e) = 0.2.

Following the syntax proposed in [3], a probabilistic conditional is a formula of the form (C \mid A)[\Pi_l, \Pi_u] stating that the fraction of As that are also Cs is between \Pi_l and \Pi_u, where C and A are conjunctions of literals. To perform inference, a conditional is converted into three answer set rules: i) C ; not_C :- A, ii) :- #count{X : C, A} = V0, #count{X : A} = V1, 10*V0 < 10*\Pi_l*V1, and iii) :- #count{X : C, A} = V0, #count{X : A} = V1, 10*V0 > 10*\Pi_u*V1, where X is a vector containing all the variables in C and A. If \Pi_l is 0 or \Pi_u is 1, rule ii) or iii), respectively, can be omitted. Moreover, if the probability values \Pi_l and \Pi_u have n decimal digits, the 10 in the multiplications above should be replaced by 10^n, because ASP cannot deal with floating point values.

A PASTA program [3] is composed of a set of probabilistic facts, a set of ASP rules, and a set of probabilistic conditionals.

Example 2 (Probabilistic conditional (PASTA program)). The following program
0.2::iron(1). 0.9::iron(2). 0.6::iron(3).
(rusty(X) | iron(X))[0.6,1].
is translated into the PASP program shown in Example 1. Rule iii) is omitted since \Pi_u = 1.

In [3], an exact inference algorithm was proposed to perform inference with probabilistic conditionals; it basically requires the enumeration of all the worlds, which is clearly infeasible when the number of probabilistic facts grows beyond 20-30. To overcome this issue, in the following section we present different algorithms that compute the probability interval in an approximate way, based on sampling techniques.
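The translation above can be sketched as follows; the function below is a simplified illustration (a single variable X, with the conjunctions C and A passed as strings), not the actual PASTA implementation:

```python
# Sketch of the conditional-to-ASP translation described above. Simplified:
# one variable X, conjunctions given as strings; names are illustrative.

def translate_conditional(c, a, pi_l, pi_u):
    # scale the bounds to integers: ASP cannot deal with floating point values
    digits = max(len(str(pi_l).split(".")[-1]), len(str(pi_u).split(".")[-1]))
    scale = 10 ** digits
    rules = [f"{c} ; not_{c} :- {a}."]
    counts = f"#count{{X : {c}, {a}}} = V0, #count{{X : {a}}} = V1"
    if pi_l > 0:  # rule ii), omitted when the lower bound is 0
        rules.append(f":- {counts}, {scale}*V0 < {round(pi_l * scale)}*V1.")
    if pi_u < 1:  # rule iii), omitted when the upper bound is 1
        rules.append(f":- {counts}, {scale}*V0 > {round(pi_u * scale)}*V1.")
    return rules

for r in translate_conditional("rusty(X)", "iron(X)", 0.6, 1):
    print(r)
```

On the conditional of Example 2 this reproduces, up to variable names, the disjunctive rule and the single constraint of Example 1 (rule iii) is dropped since the upper bound is 1).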
4 Approximate Inference for PASTA Programs
To perform approximate inference in PASTA programs, we developed four algorithms: one for unconditional sampling (Algorithm 1) and three for conditional sampling, adopting rejection sampling (Algorithm 2), Metropolis-Hastings sampling (Algorithm 3), and Gibbs sampling (Algorithm 4) [4,5].

Algorithm 1 describes the basic procedure to sample a query (without evidence) in a PASTA program. First, we keep a list of sampled worlds. Then, for a given number of times (the number of samples), we sample a world id with function SampleWorld by choosing a truth value for every probabilistic fact according to its probability. For every probabilistic fact f_i with associated probability \Pi_i, the process is the following: we sample a random value r between 0 and 1; if r < \Pi_i, f_i is set to true, otherwise to false. id is a binary string representing a world where, if the nth digit is 0, the nth probabilistic fact (in order of appearance in the program) is false, and true otherwise. To clarify this, if we consider the program shown in Example 2, a possible world id could be 010, indicating that iron(1) is not selected, iron(2) is selected, and iron(3) is not selected. The probability of this world is (1 − 0.2) · 0.9 · (1 − 0.6) = 0.288.

If we have already considered the currently sampled world, we look up in the list of sampled worlds whether it contributes to the lower or upper counter (function GetContribution) and update the lower (lp) and upper (up) counters accordingly. In particular, GetContribution returns two values, one for the lower and one for the upper probability, each of which can be either 0 (the world id
does not contribute to the probability) or 1 (the world id contributes to the probability). If, instead, the world had never been encountered before, we assign a probability value to the probabilistic facts in the program according to the truth values that had been sampled (probability \Pi for ⊤, 1 − \Pi for ⊥; function SetFacts), we compute its contribution to the lower and upper probabilities (function CheckLowerUpper, with the same output as GetContribution), and store the results in the list of already encountered worlds (function InsertContribution). In this way, if we sample the same world again, there is no need to compute its contribution to the two probability bounds again. Once we have a number of samples equal to Samples, we simply return the lower and upper counters divided by Samples.

Algorithm 1. Function Sample: computation of the unconditional probability from a PASTA program.
1:  function Sample(Query, Samples, Program)
2:    sampled ← {}                                   ▷ list of sampled worlds
3:    lp ← 0, up ← 0, n ← 0
4:    while n ≤ Samples do                           ▷ Samples is the number of samples
5:      id ← SampleWorld(Program)
6:      n ← n + 1
7:      if id ∈ sampled then                         ▷ a world was already sampled
8:        up0, lp0 ← GetContribution(sampled, id)
9:        up ← up + up0
10:       lp ← lp + lp0
11:     else
12:       Program_d ← SetFacts(Program, id)
13:       lp0, up0 ← CheckLowerUpper(Program_d)
14:       lp ← lp + lp0
15:       up ← up + up0
16:       InsertContribution(sampled, id, lp0, up0)
17:     end if
18:   end while
19:   return lp/Samples, up/Samples
20: end function
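A minimal Python sketch of Algorithm 1 follows, with a stubbed check_lower_upper standing in for the ASP solver call of CheckLowerUpper (all names are illustrative, not PASTA's actual API):

```python
import random

def sample_world(probs):
    # one digit per probabilistic fact: '1' if r < prob (fact sampled true)
    return "".join("1" if random.random() < p else "0" for p in probs)

def world_probability(world_id, probs):
    prob = 1.0
    for digit, p in zip(world_id, probs):
        prob *= p if digit == "1" else 1 - p
    return prob

def sample(probs, check_lower_upper, n_samples):
    sampled = {}                   # cache: world id -> (lp0, up0)
    lp = up = 0.0
    for _ in range(n_samples):
        wid = sample_world(probs)
        if wid not in sampled:     # solver consulted once per distinct world
            sampled[wid] = check_lower_upper(wid)
        lp0, up0 = sampled[wid]
        lp += lp0
        up += up0
    return lp / n_samples, up / n_samples

# World 010 of Example 2: iron(2) selected, iron(1) and iron(3) not selected.
print(round(world_probability("010", [0.2, 0.9, 0.6]), 3))  # 0.288
```

The cache mirrors the role of the sampled list: each distinct world triggers at most one call to the (expensive) answer set computation.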
When we also need to account for evidence, other algorithms should be applied, such as rejection sampling, described in Algorithm 2. As in Algorithm 1, we maintain a list with the already sampled worlds. Moreover, we need four variables to store the joint lower and upper counters for q and e (lpqe and upqe) and for ¬q and e (lpnqe and upnqe); see Eq. 1. Then, with the same procedure as before, we sample a world. If we have already considered it, we retrieve its contribution from the sampled list. If not, we set the probabilistic facts according to the sampled choices, compute the contribution to the four values, update them accordingly, and store the results. lpqe0 is 1 if both the evidence and the query are present in all the answer sets of the current world, 0 otherwise. upqe0 is 1 if both the evidence and the query are present in at least one answer set of the current world, 0 otherwise. lpnqe0 is 1 if the evidence is present and the query is absent in all the answer sets of the current world, 0 otherwise. upnqe0 is 1 if the evidence is present and the query is absent in at least one answer set of the current world, 0 otherwise. Finally, we return the counters combined as in Eq. 1.
Algorithm 2. Function RejectionSample: computation of the conditional probability from a PASTA program using rejection sampling.
1:  function RejectionSample(Query, Evidence, Samples, Program)
2:    lpqe ← 0, upqe ← 0, lpnqe ← 0, upnqe ← 0, n ← 0, sampled ← {}
3:    while n ≤ Samples do
4:      id ← SampleWorld(Program)
5:      n ← n + 1
6:      if id ∈ sampled then
7:        lpqe0, upqe0, lpnqe0, upnqe0 ← GetContribution(sampled, id)
8:        lpqe ← lpqe + lpqe0, upqe ← upqe + upqe0
9:        lpnqe ← lpnqe + lpnqe0, upnqe ← upnqe + upnqe0
10:     else
11:       Program_d ← SetFacts(Program, id)
12:       lpqe0, upqe0, lpnqe0, upnqe0 ← CheckLowerUpper(Program_d)
13:       lpqe ← lpqe + lpqe0, upqe ← upqe + upqe0
14:       lpnqe ← lpnqe + lpnqe0, upnqe ← upnqe + upnqe0
15:       InsertContribution(sampled, id, lpqe0, upqe0, lpnqe0, upnqe0)
16:     end if
17:   end while
18:   return lpqe/(lpqe + upnqe), upqe/(upqe + lpnqe)
19: end function
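Algorithm 2 can be sketched analogously; here check is a hypothetical stand-in that returns the four contributions (lpqe0, upqe0, lpnqe0, upnqe0) of a world:

```python
import random

# Sketch of Algorithm 2 with a stubbed world checker; all names are
# illustrative, not PASTA's actual API.

def rejection_sample(probs, check, n_samples, rng=random.random):
    sampled = {}
    lpqe = upqe = lpnqe = upnqe = 0
    for _ in range(n_samples):
        wid = "".join("1" if rng() < p else "0" for p in probs)
        if wid not in sampled:
            sampled[wid] = check(wid)   # one solver call per distinct world
        l0, u0, ln0, un0 = sampled[wid]
        lpqe += l0; upqe += u0; lpnqe += ln0; upnqe += un0
    # combine the four counters as in Eq. (1), guarding zero denominators
    lower = lpqe / (lpqe + upnqe) if lpqe + upnqe > 0 else None
    upper = upqe / (upqe + lpnqe) if upqe + lpnqe > 0 else None
    return lower, upper

# Degenerate check: query and evidence hold in every answer set of any world.
print(rejection_sample([1.0], lambda w: (1, 1, 0, 0), 10))  # (1.0, 1.0)
```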
In addition to rejection sampling, we developed two other algorithms that mimic Metropolis-Hastings sampling (Algorithm 3) and Gibbs sampling (Algorithm 4).

Algorithm 3 proceeds as follows. The overall structure is similar to that of Algorithm 2; however, after sampling a world, we count the number of probabilistic facts set to true (function CountTrueFacts). Then, with function CheckContribution, we check whether the current world has already been considered. If so, we accept it with probability min(1, N0/N1) (line 18), where N0 is the number of true probabilistic facts in the previous iteration and N1 is the number of true probabilistic facts in the current iteration. If the world was never considered before, we set the truth values of the probabilistic facts in the program (function SetFacts), compute its contribution with function CheckLowerUpper, save the values (function InsertContribution), and check whether the sample is accepted or not (line 27) with the same criterion just discussed. As for rejection sampling, we return the counters combined as in Eq. 1.

Finally, for Gibbs sampling (Algorithm 4), we first sample a world until the evidence e is true in it (function TrueEvidence), saving, as before, the already encountered worlds. Once we get a world that satisfies this requirement, we switch the truth values of Block random probabilistic facts (function SwitchBlockValues, line 19) and check the contribution of this new world as in Algorithm 2. Here too, the return value is the one described by Eq. 1.
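The acceptance step shared by both branches of CheckContribution can be isolated as follows (a sketch; accept and its argument names are our own):

```python
import random

# The Metropolis-Hastings acceptance step described above: a sampled world is
# accepted with probability min(1, N0/N1), where N0 and N1 count the true
# probabilistic facts in the previous and current world. Illustrative only;
# n_curr is assumed to be at least 1.

def accept(n_prev, n_curr, rng=random.random):
    return rng() < min(1.0, n_prev / n_curr)

# Moving to a world with at most as many true facts is always accepted;
# moving to one with more true facts is accepted with probability N0/N1.
random.seed(0)
accepted = sum(accept(1, 4) for _ in range(10000))
print(0.2 < accepted / 10000 < 0.3)  # True: roughly 1/4 of proposals accepted
```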
5 Experiments
We implemented the previously described algorithms in Python 3 and integrated them into the PASTA solver [3]; source code and datasets are available at https://github.com/damianoazzolini/pasta. We use clingo [12] to compute the answer sets.

Algorithm 3. Function MHSample: computation of the conditional probability from a PASTA program using Metropolis-Hastings sampling.
1:  function MHSample(Query, Evidence, Samples, Program)
2:    sampled ← {}
3:    lpqe ← 0, upqe ← 0, lpnqe ← 0, upnqe ← 0, n ← 0, trueFacts0 ← 0
4:    while n ≤ Samples do
5:      id ← SampleWorld(Program)
6:      n ← n + 1
7:      trueFacts1 ← CountTrueFacts(id)
8:      lpqe0, upqe0, lpnqe0, upnqe0 ←
9:        CheckContribution(Program_d, trueFacts0, trueFacts1, id, sampled)
10:     lpqe ← lpqe + lpqe0, upqe ← upqe + upqe0
11:     lpnqe ← lpnqe + lpnqe0, upnqe ← upnqe + upnqe0
12:     trueFacts0 ← trueFacts1
13:   end while
14:   return lpqe/(lpqe + upnqe), upqe/(upqe + lpnqe)
15: end function
16: function CheckContribution(Program_d, N0, N1, id, sampled)
17:   if id ∈ sampled then
18:     if random < min(1, N0/N1) then              ▷ random is a random value ∈ [0, 1]
19:       return GetContribution(id, sampled)
20:     else
21:       return 0, 0, 0, 0
22:     end if
23:   else
24:     Program_d ← SetFacts(Program, id)
25:     lpqe0, upqe0, lpnqe0, upnqe0 ← CheckLowerUpper(Program_d)
26:     InsertContribution(sampled, id, lpqe0, upqe0, lpnqe0, upnqe0)
27:     if random < min(1, N0/N1) then
28:       return lpqe0, upqe0, lpnqe0, upnqe0
29:     else
30:       return 0, 0, 0, 0
31:     end if
32:   end if
33: end function

To assess the performance, we ran multiple experiments on a computer with an Intel Xeon E5-2630 v3 running at 2.40 GHz with 16 GB of RAM. Execution times are computed with the bash command time; the reported values are those of the real field.

We consider two datasets with different configurations. The first one, iron, contains programs with the structure shown in Example 2; in this case, the size of an instance indicates the number of probabilistic facts. The second dataset, smoke, describes a network where some people are connected by a probabilistic friendship relation; in this case, the size of an instance is the number of involved people. Some of the people in the network smoke. A conditional states that at least 40% of the people that have a friend that smokes are smokers. An example of an instance of size 5 is

0.5::friend(a,b). 0.5::friend(b,c). 0.5::friend(a,d).
0.5::friend(d,e). 0.5::friend(e,c).
smokes(b). smokes(d).
(smokes(Y) | smokes(X), friend(X,Y))[0.4,1].
Algorithm 4. Function GibbsSample: computation of the conditional probability from a PASTA program using Gibbs sampling.
1:  function GibbsSample(Query, Evidence, Samples, Block, Program)
2:    sampledEvidence ← {}, sampledQuery ← {}
3:    lpqe ← 0, upqe ← 0, lpnqe ← 0, upnqe ← 0, n ← 0
4:    while n ≤ Samples do
5:      ev ← false, n ← n + 1
6:      while ev is false do
7:        id ← SampleWorld(Program)
8:        if id ∈ sampledEvidence then
9:          ev ← sampledEvidence[id]
10:       else
11:         Program_d ← SetFacts(Program, id)
12:         if TrueEvidence(Program_d) then
13:           ev ← true, sampledEvidence[id] ← true
14:         else
15:           sampledEvidence[id] ← false
16:         end if
17:       end if
18:     end while
19:     id_s ← SwitchBlockValues(id, Block, Program, Evidence)
20:     if id_s ∈ sampled then
21:       lpqe0, upqe0, lpnqe0, upnqe0 ← GetContribution(sampled, id)
22:       lpqe ← lpqe + lpqe0, upqe ← upqe + upqe0
23:       lpnqe ← lpnqe + lpnqe0, upnqe ← upnqe + upnqe0
24:     else
25:       Program_d ← SetFacts(Program, id)
26:       lpqe0, upqe0, lpnqe0, upnqe0 ← CheckLowerUpper(Program_d)
27:       lpqe ← lpqe + lpqe0, upqe ← upqe + upqe0
28:       lpnqe ← lpnqe + lpnqe0, upnqe ← upnqe + upnqe0
29:       InsertContribution(sampled, id, lpqe0, upqe0, lpnqe0, upnqe0)
30:     end if
31:   end while
32:   return lpqe/(lpqe + upnqe), upqe/(upqe + lpnqe)
33: end function
The friendship network follows a Barabási-Albert preferential attachment model generated with the networkx [13] Python package: the initial number of nodes of the graph, n, is the size of the instance, while the number of edges connecting a new node to existing ones, m, is 3.

In a first set of experiments, we fixed the number of probabilistic facts (for iron) and the number of people (for smoke) to 10, and plotted the computed lower and upper probabilities and the execution time for an increasing number of samples. All the probabilistic facts have probability 0.5. The goal of these experiments is to check how many samples are needed to converge and how the execution time varies with the number of samples, for a fixed program. For the iron dataset, the query q is rusty(1) and the evidence e is iron(2); here, the exact values are \underline{P}(q) = 0.009765625, \overline{P}(q) = 0.5, \underline{P}(q \mid e) = 0.001953125, and \overline{P}(q \mid e) = 0.5. For the smoke dataset, the program has 21 connections (probabilistic facts): node 0 is connected to all the other nodes, node 2 with 4, 6, and 8, node 3 with 4, 5, and 7, node 4 with 5, 6, 7, and 9, and node 7 with 8 and 9. All the connections have probability 0.5. Nodes 2, 5, 6, 7, and 9 certainly smoke. The query q is smokes(8) and the evidence is smokes(4); the targets are \underline{P}(q) = 0.158, \overline{P}(q) = 0.75, \underline{P}(q \mid e) = 0, and \overline{P}(q \mid e) = 0.923. Results for all four algorithms are shown in Figs. 1 (iron) and 2 (smoke).
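The generation of such probabilistic friendship facts can be sketched in pure Python; the function below is a simplified preferential-attachment process, not networkx's exact barabasi_albert_graph, and all names are our own:

```python
import random

# A simplified preferential-attachment sketch: each new node attaches to
# m = 3 distinct existing nodes chosen with probability proportional to their
# degree, and each edge becomes a 0.5::friend probabilistic fact.

def preferential_attachment_facts(n, m=3, seed=42):
    rng = random.Random(seed)
    initial = list(range(m))      # start from m isolated nodes
    weighted = []                 # node list repeated proportionally to degree
    edges = []
    for src in range(m, n):
        chosen = set()
        while len(chosen) < m:
            # degree-proportional choice, uniform fallback for the first node
            pool = weighted if weighted else initial
            chosen.add(rng.choice(pool))
        for dst in chosen:
            edges.append((src, dst))
            weighted += [src, dst]   # both endpoints gain one degree
    return [f"0.5::friend({a},{b})." for a, b in edges]

facts = preferential_attachment_facts(10)
print(len(facts))  # (10 - 3) * 3 = 21 probabilistic facts, as in the text
```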
Fig. 1. Comparison of the sampling algorithms on the iron dataset. Solid lines show the results for PASTA, dashed lines those for PASOCS.
Fig. 2. Comparison of the sampling algorithms on the smoke dataset. Solid lines show the results for PASTA, dashed lines those for PASOCS. In Fig. 2a, the target line at 0.75 refers to the upper unconditional probability.
For Gibbs sampling, we set Block (i.e., the number of probabilistic facts to resample) to 1. All the algorithms seem to stabilize after a few thousand samples on both datasets. For iron, MH seems to slightly overestimate the upper probability. Gibbs and rejection sampling require a few seconds to take 10^6 samples, while Metropolis-Hastings (MH) requires almost 100 s. However, for the smoke dataset, MH and rejection sampling have comparable execution times (more than 100 s for 5 · 10^5 samples), while Gibbs is the slowest among the three. This may be due to a low probability of the evidence.

We compared our results with PASOCS [23] (after translating the probabilistic conditionals into PASP rules by hand). We used the following settings: n_min = n, n_max = −1, ut = −1, p = 300, sb = 1, b = 0, where n is the number of considered samples, n_min is the minimum number of samples, n_max is the maximum number of samples (−1 deactivates it), ut is the uncertainty threshold (−1 deactivates it), p is the percentile (since PASOCS estimates values with Gaussians), sb is the number of samples to run at once during sampling, and b is the burn-in value for Gibbs and Metropolis-Hastings sampling (0 deactivates it). We do not select parallel solving, since PASTA is not parallelized yet (this may be the subject of future work). PASOCS adopts a different approach for conditional inference: at each iteration, instead of sampling a world, it updates the probabilities of the probabilistic facts and samples a world using these values.

In Fig. 1b, the execution times of PASOCS for all the tested algorithms are comparable and seem to grow exponentially with the number of samples. The lines for rejection and unconditional sampling for PASTA overlap; this also happens for the lines for MH, Gibbs, and rejection sampling for PASOCS. PASOCS seems to be slower also on the smoke dataset (Fig. 2b), but the difference with PASTA is smaller.

Fig. 3. Comparison of Gibbs sampling on the iron dataset.

Fig. 4. Comparison of Gibbs sampling and MH on the smoke dataset.

We also plotted how PASTA and PASOCS perform in terms of the number of samples required to converge. In Fig. 3, we compare Gibbs sampling on the iron dataset; here, PASTA seems to be more stable on both the lower and the upper probability. However, even with 5000 samples, both solvers still underestimate the lower probability, although the values involved are considerably small. In Fig. 4, we compare PASOCS and PASTA with Gibbs sampling and Metropolis-Hastings sampling on the smoke dataset; here too, PASTA seems more stable, but neither solver has completely settled on the real probability after 5000 samples. Finally, Fig. 5 compares the unconditional sampling of PASTA and PASOCS on both datasets. Here, the results are similar: after approximately 3000 samples, the computed probability seems to have stabilized.

In another experiment, we fixed the number of samples to 1000, increased the size of the instances of the iron dataset, and plotted how the execution time varies for PASTA and PASOCS; the goal is to check how the execution time varies with the size of the program. The query is rusty(1). Results are shown in Fig. 6. For PASOCS, we get a memory error starting from size 32. PASTA requires approximately 500 s to take 1000 samples on a program with the structure of Example 2 with 1500 probabilistic facts. Note again that, during sampling, we assume that every world has at least one answer set: checking this requirement would force the generation of all the worlds, and inference would clearly not scale.

Fig. 5. Comparison of unconditional sampling on the iron and the smoke datasets.

Fig. 6. Comparison between PASTA and PASOCS for an increasing number of probabilistic facts on the iron dataset.
6 Conclusions
In this paper, we proposed four algorithms to perform approximate inference, both conditional and unconditional, in PASTA programs. We tested execution time and accuracy against the PASOCS solver (after manually converting the probabilistic conditionals). Empirical results show that our algorithms reach comparable accuracy in lower execution time. As future work, we plan to investigate the convergence of the algorithms in more depth and to develop approximate methods for abduction [1,2] in PASTA programs.
References

1. Azzolini, D., Bellodi, E., Ferilli, S., Riguzzi, F., Zese, R.: Abduction with probabilistic logic programming under the distribution semantics. Int. J. Approx. Reason. 142, 41–63 (2022). https://doi.org/10.1016/j.ijar.2021.11.003
2. Azzolini, D., Bellodi, E., Riguzzi, F.: Abduction in (probabilistic) answer set programming. In: Calegari, R., Ciatto, G., Omicini, A. (eds.) Proceedings of the 36th Italian Conference on Computational Logic. CEUR Workshop Proceedings, vol. 3204, pp. 90–103. Sun SITE Central Europe, Aachen, Germany (2022)
3. Azzolini, D., Bellodi, E., Riguzzi, F.: Statistical statements in probabilistic logic programming. In: Gottlob, G., Inclezan, D., Maratea, M. (eds.) Logic Programming and Nonmonotonic Reasoning (LPNMR 2022). LNCS, vol. 13416, pp. 43–55. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-15707-3_4
4. Azzolini, D., Riguzzi, F., Lamma, E.: An analysis of Gibbs sampling for probabilistic logic programs. In: Dodaro, C., et al. (eds.) Workshop on Probabilistic Logic Programming (PLP 2020). CEUR-WS, vol. 2678, pp. 1–13. Sun SITE Central Europe, Aachen, Germany (2020)
5. Azzolini, D., Riguzzi, F., Masotti, F., Lamma, E.: A comparison of MCMC sampling for probabilistic logic programming. In: Alviano, M., Greco, G., Scarcello, F. (eds.) AI*IA 2019. LNCS (LNAI), vol. 11946, pp. 18–29. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-35166-3_2
6. Baral, C., Gelfond, M., Rushton, N.: Probabilistic reasoning with answer sets. Theory Pract. Log. Program. 9(1), 57–144 (2009). https://doi.org/10.1017/S1471068408003645
7. Brewka, G., Eiter, T., Truszczyński, M.: Answer set programming at a glance. Commun. ACM 54(12), 92–103 (2011). https://doi.org/10.1145/2043174.2043195
8. Cozman, F.G., Mauá, D.D.: On the semantics and complexity of probabilistic logic programs. J. Artif. Intell. Res. 60, 221–262 (2017). https://doi.org/10.1613/jair.5482
9. Cozman, F.G., Mauá, D.D.: The joy of probabilistic answer set programming: semantics, complexity, expressivity, inference. Int. J. Approx. Reason. 125, 218–239 (2020). https://doi.org/10.1016/j.ijar.2020.07.004
10. De Raedt, L., Kimmig, A., Toivonen, H.: ProbLog: a probabilistic Prolog and its application in link discovery. In: Veloso, M.M. (ed.) IJCAI 2007, vol. 7, pp. 2462–2467. AAAI Press/IJCAI (2007)
11. Faber, W., Pfeifer, G., Leone, N.: Semantics and complexity of recursive aggregates in answer set programming. Artif. Intell. 175(1), 278–298 (2011). https://doi.org/10.1016/j.artint.2010.04.002
12. Gebser, M., Kaminski, R., Kaufmann, B., Schaub, T.: Multi-shot ASP solving with clingo. Theory Pract. Log. Program. 19(1), 27–82 (2019). https://doi.org/10.1017/S1471068418000054
13. Hagberg, A.A., Schult, D.A., Swart, P.J.: Exploring network structure, dynamics, and function using NetworkX. In: Varoquaux, G., Vaught, T., Millman, J. (eds.) Proceedings of the 7th Python in Science Conference, pp. 11–15. Pasadena, CA, USA (2008)
14. Halpern, J.Y.: An analysis of first-order logics of probability. Artif. Intell. 46(3), 311–350 (1990)
15. Jaeger, M.: Probabilistic reasoning in terminological logics. In: Doyle, J., Sandewall, E., Torasso, P. (eds.) 4th International Conference on Principles of Knowledge Representation and Reasoning, pp. 305–316. Morgan Kaufmann (1994). https://doi.org/10.1016/B978-1-4832-1452-8.50124-X
16. Kern-Isberner, G., Thimm, M.: Novel semantical approaches to relational probabilistic conditionals. In: Proceedings of the Twelfth International Conference on Principles of Knowledge Representation and Reasoning, pp. 382–392. AAAI Press (2010)
17. Lee, J., Wang, Y.: A probabilistic extension of the stable model semantics. In: AAAI Spring Symposia (2015)
18. Lloyd, J.W.: Foundations of Logic Programming, 2nd edn. Springer, Heidelberg (1987). https://doi.org/10.1007/978-3-642-83189-8
19. Nickles, M.: A tool for probabilistic reasoning based on logic programming and first-order theories under stable model semantics. In: Michael, L., Kakas, A. (eds.) JELIA 2016. LNCS (LNAI), vol. 10021, pp. 369–384. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-48758-8_24
20. Riguzzi, F.: Foundations of Probabilistic Logic Programming: Languages, Semantics, Inference and Learning. River Publishers, Gistrup, Denmark (2018)
21. Sato, T.: A statistical learning method for logic programs with distribution semantics. In: Sterling, L. (ed.) ICLP 1995, pp. 715–729. MIT Press (1995). https://doi.org/10.7551/mitpress/4298.003.0069
22. Totis, P., Kimmig, A., De Raedt, L.: SMProbLog: stable model semantics in ProbLog and its applications in argumentation. arXiv preprint arXiv:2110.01990 (2021)
23. Tuckey, D., Russo, A., Broda, K.: PASOCS: a parallel approximate solver for probabilistic logic programs under the credal semantics. arXiv preprint arXiv:2105.10908 (2021)
24. Wilhelm, M., Kern-Isberner, G., Finthammer, M., Beierle, C.: Integrating typed model counting into first-order maximum entropy computations and the connection to Markov logic networks. In: Barták, R., Brawner, K.W. (eds.) Proceedings of the Thirty-Second International Florida Artificial Intelligence Research Society Conference, pp. 494–499. AAAI Press (2019)
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Decision Trees with a Modal Flavor

Dario Della Monica¹, Giovanni Pagliarini²,³, Guido Sciavicco³(B), and Ionel Eduard Stan²,³

¹ University of Udine, Udine, Italy
² University of Parma, Parma, Italy
³ University of Ferrara, Ferrara, Italy
{giovanni.pagliarini,ioneleduard.stan}@unife.it
Abstract. Symbolic learning is the subfield of machine learning that deals with symbolic algorithms and models, which have been known for decades and successfully applied to a variety of contexts, and of which decision trees are the quintessential expression. The main limitation of current symbolic models is the fact that they are essentially based on classical propositional logic, which implies that data with an implicit dimensional component, such as temporal data (e.g., time series) or spatial data (e.g., images), cannot be properly dealt with within the standard symbolic framework. In this paper, we show how propositional logic in decision trees can be replaced with the more expressive (propositional) modal logics, and we lay down the formal bases of modal decision trees by first systematically delineating interesting and well-known properties of propositional ones and then showing how to transfer these properties to the modal case.
Keywords: Machine learning · Decision trees · Modal logic · Learning from dimensional data

1 Introduction
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. A. Dovier et al. (Eds.): AIxIA 2022, LNAI 13796, pp. 47–59, 2023. https://doi.org/10.1007/978-3-031-27181-6_4

The most iconic and fundamental separation between subfields of machine learning is the one between functional and symbolic learning. Functional learning is the process of learning a function that represents the theory underlying a certain phenomenon, while symbolic learning is the process of learning a logical description that represents that phenomenon. Whether one or the other approach should be preferred has raised a long-standing debate among experts, which is rooted in the fact that functional methods tend to be more versatile and statistically accurate than symbolic ones, while symbolic methods are able to extract models that can be interpreted, explained, and then enhanced using human-expert knowledge. These characteristics of symbolic methods, both for political reasons (consider, for instance, the recent General Data Protection Regulation (GDPR) of the European Union [13], that highlights
48
D. Della Monica et al.
the need for interpretable/explainable automatic learning-based decision-making processes, including those involving AI technologies) and technical ones (interpretable models are often easier to train, explore, integrate, and implement), are sometimes used as arguments for preferring a symbolic approach over a functional one.

From a logical standpoint, canonical symbolic learning methods are all characterized by the use of propositional logic (they are, indeed, sometimes called propositional methods), and, among them, propositional decision trees are probably the best known. The origin of modern decision trees dates back to the fifties [2]; a lot of work has been done since then, including, among others, [4,8,10,11,15–17], and decision tree models extracted with popular algorithms such as ID3, C4.5, and more recent ones have been widely applied in the literature. Different decision tree models differ in their structure and in the language on which they are based, but only slightly; from a structural point of view, it can be argued that virtually all such structures and learning algorithms stemmed, in some sense, from CART [4], which already contained all the fundamental ideas of decision trees.

Dimensional data, such as temporal or spatial data, cannot be dealt with in a proper, native way using propositional decision trees. The general go-to strategy to treat dimensional data with propositional models, such as decision trees, is to flatten the dimensional component, effectively hiding it. Flattening consists in massaging the dataset in such a way that dimensional attributes become scalar ones. As an example, a multivariate time series with n temporal attributes A1, . . . , An can be transformed by applying one or more feature extraction functions to all attributes, e.g., average, minimum, maximum, and the like, to obtain (a feature representation of) an instance f1(A1), f2(A1), . . . , f1(A2), f2(A2), . . .
, which can now be treated, for example, by a standard decision tree. A more general approach consists in applying the same strategy to different windows along all dimensions, e.g., intervals in the temporal case, rectangles in the spatial one, and so on, obtaining several new attributes for each original one and each feature extraction function. At the limit, each temporal (spatial, . . . ) point may become a window. As an example, a single-variate time series A with N ordered points ends up being represented as the (unordered) collection A(1), A(2), . . . , A(N). Such a representation is called lagged (for temporal data) or flattened (for spatial ones).

In this paper, we adopt a different point of view, aiming at laying down the formal bases of modal symbolic learning, by means of which dimensional datasets can be dealt with in a native way. To this end, we replace propositional logic with the more expressive (propositional) modal logic (modal logic for short) and we enhance decision trees accordingly. Modal logic [3] generalizes propositional logic by allowing one to natively express the relationships that emerge among the different worlds, e.g., time points, time intervals, multidimensional areas, that contribute to describing real-world scenarios. Since modal logic can be declined into more practical languages, such as temporal and spatial logics, and dimensional data can be seen as modal data, modal symbolic learning is immediately applicable to the dimensional case.

Modal Decision Trees    49

Moreover, this is not the only possible application, as modal data emerge in a natural way also from non-dimensional data, like, for instance, textual and graph-based data. Here, we introduce modal decision trees, and we systematically study their logical properties, specifically, correctness. Standard decision trees are, indeed, correct, although the nature of their presentation, mostly driven by applications, tends to hide their theoretical aspects. While we are not interested in studying efficient implementations of learning algorithms, the driving principle of the definition of modal decision trees is the preservation of the simplicity and interpretability that characterize propositional ones. As a result, modal decision tree learning algorithms can be implemented starting from any implementation of propositional ones, working one's way up.

The paper is organized as follows. In Sect. 2, we provide some preliminary definitions and concepts. In Sect. 3, we define modal decision trees and study their properties. Then, in Sect. 4, we briefly show how modal decision trees can be applied to learn from dimensional data, before concluding.
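The flattening strategy described in the introduction can be sketched as follows (the feature names and dictionary representation are our own illustration, not the paper's formalism):

```python
# Sketch of flattening: a multivariate time series becomes a scalar instance
# by applying feature extraction functions (here minimum, maximum, average)
# to each temporal attribute, hiding the dimensional component.

def flatten(series):
    features = {}
    for name, values in series.items():
        features[f"min({name})"] = min(values)
        features[f"max({name})"] = max(values)
        features[f"avg({name})"] = sum(values) / len(values)
    return features

instance = flatten({"A1": [1, 3, 2], "A2": [10, 10, 40]})
print(instance["max(A1)"], instance["avg(A2)"])  # 3 20.0
```

The resulting dictionary can be fed to any propositional learner, at the cost of discarding the temporal ordering, which is precisely the limitation the modal approach avoids.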
2 Preliminaries
Let P be a set of propositional letters. The well-formed formulas of modal logic (ML) are obtained from the following grammar: ϕ ::= p | ¬ϕ | ϕ ∧ ϕ | ♦ϕ. The other usual Boolean connectives can be derived from them and, as is standard, we use □ϕ to denote ¬♦¬ϕ. The modality ♦ (resp., □) is usually referred to as it is possible that (resp., it is necessary that). Modal logic is considered the archetype of (propositional) temporal, spatial, and spatio-temporal logics, and it is a non-conservative extension of propositional logic (PL). Its semantics is given in terms of Kripke models. A Kripke model K = (W, R, V) over P consists of a (finite) set of worlds W, which contains a distinguished world w0, called the initial world, a binary accessibility relation R ⊆ W × W, and a valuation function V : W → 2^P, which associates each world with the set of proposition letters that are true on it. The truth relation K, w ⊨ ϕ for a model K and a world w in it is expressed by the following clauses:

K, w ⊨ p      iff p ∈ V(w);
K, w ⊨ ¬ϕ     iff K, w ⊭ ϕ;
K, w ⊨ ϕ ∧ ψ  iff K, w ⊨ ϕ and K, w ⊨ ψ;
K, w ⊨ ♦ϕ     iff ∃v s.t. wRv and K, v ⊨ ϕ.
We write K ⊨ ϕ as an abbreviation for K, w0 ⊨ ϕ. The importance of modal logic comes from the fact that most classic temporal [5,7,14] and spatial [1,9] logics stem from (generalizations of) modal logic. Therefore, the theory of modal logic and the tools built on it can be reused to cope with more practical situations. We now introduce the notion of modal dataset and its associated problems.
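The semantics above translates directly into a small model checker. The following sketch (our own encoding of formulas as nested tuples, not tied to any library) mirrors the four truth clauses, with □ derived as ¬♦¬:

```python
# Minimal Kripke model and model checker for ML, mirroring the truth
# clauses above. Formulas are nested tuples, e.g. ("dia", ("p", "q")).
# This is an illustrative sketch, not the paper's implementation.

class Kripke:
    def __init__(self, worlds, relation, valuation, initial):
        self.W = set(worlds)      # finite set of worlds
        self.R = set(relation)    # accessibility relation (pairs of worlds)
        self.V = valuation        # world -> set of proposition letters
        self.w0 = initial         # distinguished initial world

def check(K, w, phi):
    op = phi[0]
    if op == "p":                 # proposition letter: phi = ("p", name)
        return phi[1] in K.V[w]
    if op == "not":
        return not check(K, w, phi[1])
    if op == "and":
        return check(K, w, phi[1]) and check(K, w, phi[2])
    if op == "dia":               # diamond: some successor satisfies phi[1]
        return any(check(K, v, phi[1]) for (u, v) in K.R if u == w)
    if op == "box":               # box phi = not dia not phi
        return not check(K, w, ("dia", ("not", phi[1])))
    raise ValueError(f"unknown operator {op!r}")

K = Kripke({0, 1, 2}, {(0, 1), (0, 2)},
           {0: {"p"}, 1: {"q"}, 2: {"p", "q"}}, initial=0)
print(check(K, K.w0, ("dia", ("p", "q"))))   # True: worlds 1 and 2 satisfy q
print(check(K, K.w0, ("box", ("p", "q"))))   # True: all successors satisfy q
```

Model checking against a finite Kripke model is polynomial in the sizes of the model and the formula, which is what Theorem 1 later relies on.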
50
D. Della Monica et al.
Fig. 1. An example of a modal dataset with 4 instances, each described by a Kripke model.
Definition 1 (Modal dataset). Let P be a set of proposition letters. A modal dataset I = {I1, ..., Im} over P is a finite collection of m instances, each of which is a Kripke model over P, and such that I, J are not bisimilar, for each I, J ∈ I with I ≠ J, that is, there exists at least one formula ϕ ∈ ML with I ⊨ ϕ and J ⊭ ϕ. We say that I is labeled if it is equipped with a labeling function L : I → C which associates every instance with a class from a finite set C = {C1, ..., Ck}.

In the static case, a dataset is usually defined as a collection I = {I1, ..., Im} of m instances, each described by the value of n distinct attributes A = {A1, ..., An}. However, since each attribute A is associated with its finite domain dom(A), that is, the finite set of all values taken by A across I, the latter naturally induces a set of propositional letters:

P = {A ⋈ a | ⋈ ∈ {<, ≤, =, ≥, >}, A ∈ A, a ∈ dom(A)}.

Learning-wise, therefore, we can always define a static dataset as if the corresponding set of propositional letters were fixed. A modal dataset immediately generalizes a static one, by postulating that instances are described by Kripke models in which attributes change value across different worlds. There are several scenarios that can be naturally modeled by modal, non-static datasets; by way of example, dimensional datasets are characterized by each attribute in each instance being described by a d-dimensional matrix (e.g., d = 1 in the temporal case, and d = 2 in the spatial case). In such cases, fixed a set of feature extraction functions F = {f1, ..., fk}, the set of induced propositional letters becomes:

P = {f(A) ⋈ a | ⋈ ∈ {<, ≤, =, ≥, >}, A ∈ A, a ∈ dom(A), f ∈ F}.
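The induction of propositional letters from a static dataset can be sketched as follows; note that the exact set of comparison operators is an assumption here, since it is not fully legible in the source:

```python
# Sketch of inducing propositional letters from a static dataset: one
# letter per (attribute, comparison operator, value) triple. The operator
# set {<=, >=, =} used below is an assumption for illustration only.

def induced_letters(instances, attributes):
    letters = set()
    for A in attributes:
        dom = {inst[A] for inst in instances}   # dom(A): values of A across I
        for a in dom:
            for op in ("<=", ">=", "="):
                letters.add(f"{A} {op} {a}")
    return letters

I = [{"temp": 37.0}, {"temp": 39.5}]
P = induced_letters(I, ["temp"])
# e.g. the letter "temp <= 37.0" holds on the first instance only
```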
Modal Decision Trees
51
Dimensional datasets are not the only source of modal datasets; in fact, our definition of modal dataset is more general, and captures a wide range of practical situations. In the static case two instances cannot be identical, that is, there must be a propositional formula that distinguishes them; at the modal level, this requirement translates into constraining every two instances to be non-bisimilar (see, again, [3]), that is, to be distinguishable by at least one modal formula. In machine learning, several problems are associated with a labeled dataset I. Among them, a fundamental and ubiquitous one is the classification problem, that is, the problem of synthesizing an algorithm (a classifier) that is able to classify the instances of an unlabeled dataset J of the same type as I. In the symbolic context, learning a classifier from a dataset requires extracting from it the logical property that defines each class, that is, its characteristic formula. Then, instances are seen as models of the considered logical formalism and the classification task is performed via model checking an instance against characteristic formulas. Although, in principle, one can be interested in learning characteristic formulas of any logic on any dataset, it is natural to associate modal (resp., propositional) characteristic formulas with modal (resp., propositional) datasets. Binary decision trees, which are typical classifiers, are binary trees whose leaves and edges are equipped with labels. Leaf labels identify the different classes an instance can belong to; edge labels are atomic logical elements which are then composed to obtain complex formulas in the considered logical formalism (in the propositional case, edge labels are literals and formulas are Boolean combinations). A tree associates a formula with every class it features (i.e., every label occurring in a leaf) and it classifies an instance into a class if and only if the instance satisfies the formula corresponding to that class.
As there can be exponentially many leaves in a tree, the classification process could in principle require verifying the satisfaction of an instance against exponentially many formulas. However, decision trees provide an efficient mechanism for classifying an instance that does not explore the entire tree: for every node, starting from the root and going down towards the leaves, the truth of the formula associated with that node is checked against the instance to be classified and, depending on the outcome, the instance is passed to the right or the left child, and the process is repeated. When a leaf is reached, the instance is classified into the class that labels that leaf. Summing up, the desired properties for a family M of decision trees include: (i) correctness (every tree classifies any given instance into exactly one class); (ii) completeness (for every formula ϕ of the considered formalism, there is a decision tree τ ∈ M that realizes ϕ); and (iii) efficiency (a decision tree τ of height h must be able to classify an instance I by checking the truth of, at most, a number of formulas polynomial in h). In the rest of this paper we consider the problem of designing modal decision trees in such a way that they are correct, complete, and efficient with respect to modal logic.
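The root-to-leaf classification mechanism can be sketched generically as follows (node layout and predicate encoding are illustrative, not the paper's):

```python
# Sketch of root-to-leaf classification: at each internal node exactly one
# formula is checked and the instance descends left or right, so a tree of
# height h needs at most h checks rather than one check per leaf.

class Node:
    def __init__(self, test=None, left=None, right=None, label=None):
        self.test = test      # predicate instance -> bool (the node's formula)
        self.left = left      # child followed when the test succeeds
        self.right = right    # child followed when the test fails
        self.label = label    # class label, set on leaves only

def classify(tree, instance):
    node = tree
    while node.label is None:           # descend until a leaf is reached
        node = node.left if node.test(instance) else node.right
    return node.label

leaf_c1, leaf_c2 = Node(label="C1"), Node(label="C2")
root = Node(test=lambda x: x["p"], left=leaf_c1, right=leaf_c2)
print(classify(root, {"p": True}))    # C1
```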
3 Modal Decision Trees
Let τ = (V, E) be a full directed finite binary tree with vertexes in V and edges in E. We denote by root(τ) the root of τ, by Vℓ ⊆ V the set of its leaves, and by Vι the set of its internal nodes (that is, non-root and non-leaf nodes). For each non-leaf node ν we denote by ↙(ν) (resp., ↘(ν)) its left (resp., right) child, and by ↑(ν) its parent. Similarly, for a tree τ, we denote by ↙(τ) (resp., ↘(τ)) its left (resp., right) subtree. Finally, for a node ν, the set of its ancestors (ν included) is denoted by ↑∗(ν), where ↑∗ is the transitive and reflexive closure of ↑; we also define ↑+(ν) = ↑∗(ν) \ {ν}. A path π^τ = ν0 ⋯ νh in tree τ (or, simply, π, if τ is clear from the context) of length h ≥ 0 from ν0 to νh is a finite sequence of h + 1 nodes ν0, ..., νh such that νi = ↑(νi+1), for each i = 0, ..., h − 1. We denote by π1 · π2 the operation of appending the path π2 to the path π1. We also say that a path ν0 · ν1 ⋯ νh is left (resp., right) if ν1 = ↙(ν0) (resp., ν1 = ↘(ν0)). For a path π = ν0 ⋯ νh, the set of its improper prefixes is denoted by prefix(π), and if ν is a node in τ, π^τ_ν (or, simply, πν, if τ is clear from the context) denotes the unique path root(τ) ⋯ ν. Finally, a branch of τ is a path π^τ_ℓ (or, simply, πℓ, if τ is clear from the context) for some ℓ ∈ Vℓ.
Definition 2 (modal decisions). Fixed a modal dataset I over P, the set of decisions is: Λ = {⊤, ⊥, p, ¬p, ♦⊤, □⊥ | p ∈ P}. We say that p, ¬p are propositional decisions, while ♦⊤ (resp., □⊥) are modal existential (resp., modal universal) ones, and we use the symbol λ ∈ Λ to denote a decision. For each λ ∈ Λ, the decision that corresponds to its logical negation ¬λ is univocally identified, so when λ = ⊤ (resp., p, ♦⊤) we use ¬λ to denote ⊥ (resp., ¬p, □⊥), and vice versa.

Definition 3 (modal decision tree). Fixed a propositional alphabet P and a set of classes C, a modal decision tree τ over P and C is a structure: τ = (V, E, b, l, s), where (V, E) is a full directed finite binary tree, l : Vℓ → C is the leaf-labeling function, b : Vι → Vι is the back-edge function, s : E → Λ is the edge-labeling function, and the following conditions hold:

1. ∀ν, ν′ ∈ V. (b(ν) = ν′ → ν′ ∈ ↑∗(ν));
2. ∀ν, ν′ ∈ V. ((b(ν) = ν̄ ∧ b(ν′) = ν̄) → ν = ν′);
3. ∀ν, ν′, ν′′ ∈ V. ((b(ν) = ν′ ∧ ν′ ∈ ↑+(ν′′) ∧ ν′′ ∈ ↑+(ν)) → ν′ ∈ ↑+(b(ν′′)));
4. ∀(ν, ν′) ∈ E. ((s(ν, ν′) ∈ {⊥, □⊥} ∧ ν′ ∉ Vℓ) → b(ν′) = ν);
5. ∀(ν, ν′), (ν, ν′′) ∈ E. (ν′ ≠ ν′′ → s(ν, ν′) = ¬s(ν, ν′′)).
For every c ∈ C, we denote by leaves^τ(c) (or, simply, leaves(c), when τ is clear from the context) the set of leaves of τ labeled with c.
A propositional decision tree is a modal decision tree in which edges are labeled with propositional decisions and the back-edge function plays no role (therefore, in propositional decision trees only condition 5 is still non-trivial); thus, propositional decision trees are a particular case of modal decision trees. In the following, we denote by MDT the family of modal decision trees (or modal decision tree classification model), and by DT its propositional counterpart (that is, the subfamily of MDT that only contains propositional trees). From now on, we use the term decision tree to refer to an element of either DT or MDT. We now show how a modal decision tree defines a modal formula for each of its classes. This is obtained by associating a formula with each branch; the formula of a class is then the disjunction of all the formulas associated with branches whose leaf is labeled with that class. In the propositional case, each branch is associated with the conjunction of the labels that occur on its edges; as every propositional formula can be written in disjunctive normal form, propositional decision trees are complete with respect to propositional logic. Modal logic does not have a normal form that allows one to bound the nesting of modal operators, and this makes the construction of formulas more complicated. Let us first fix the following useful concepts.
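For the propositional case, the branch-conjunction/class-disjunction construction can be sketched as follows (the nested-tuple tree encoding and the textual formula syntax are ours):

```python
# Sketch: collecting the class-formulas of a *propositional* decision tree.
# Each branch contributes the conjunction of its edge labels; a class's
# formula is the disjunction over the branches ending in its leaves, i.e.
# a formula in disjunctive normal form.

def branch_conjunctions(tree, path=()):
    """tree is either ('leaf', class) or ('node', p, left, right), where
    the left edge is labeled p and the right edge is labeled ~p."""
    if tree[0] == "leaf":
        return [(tree[1], " & ".join(path) or "true")]
    _, p, left, right = tree
    return (branch_conjunctions(left, path + (p,))
            + branch_conjunctions(right, path + (f"~{p}",)))

def class_formulas(tree):
    out = {}
    for c, conj in branch_conjunctions(tree):
        out.setdefault(c, []).append(conj)
    return {c: " | ".join(f"({d})" for d in ds) for c, ds in out.items()}

tree = ("node", "p",
        ("node", "q", ("leaf", "C1"), ("leaf", "C2")),
        ("leaf", "C1"))
print(class_formulas(tree))
# {'C1': '(p & q) | (~p)', 'C2': '(p & ~q)'}
```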
Definition 4 (contributor, node agreement). Given a decision tree τ and a path π = ν0 ⋯ νh, with h > 1, the contributor of π, denoted ctr(π), is defined as the only node νi in π such that νi ≠ ν1, 0 < i < h, and b(νi) = ν1, if it exists, and as ν1 otherwise. Moreover, given two nodes νi, νj ∈ π, with i, j < h, we say that they agree if νi+1 = ↙(νi) (resp., νi+1 = ↘(νi)) and νj+1 = ↙(νj) (resp., νj+1 = ↘(νj)), and we denote this situation by A(νi, νj); we say that they disagree (denoted by D(νi, νj)), otherwise.

For our purposes, we use the following grammar to generate formulas of ML: ϕ ::= λ | λ ∧ (ϕ ∧ ϕ) | λ → (ϕ → ϕ) | ♦(ϕ ∧ ϕ) | □(ϕ → ϕ), where λ ∈ Λ.

Definition 5 (implicative formulas). We say that a modal formula ϕ is implicative if it has the form ψ → ξ or □(ψ → ξ), and we denote by Im the set of implicative formulas.

As a matter of fact, in order to assign a formula to each leaf, and then to each class, we first associate a formula with every path (see Fig. 2 for an example).

Definition 6 (path-, leaf-, and class-formula). Let τ be a decision tree. For each path π = ν0 ⋯ νh in τ, the path-formula ϕ^τ_π (or, simply, ϕπ, when τ is clear from the context) is defined inductively as:

– If h = 0, then ϕπ = ⊤.
– If h = 1, then ϕπ = s(ν0, ν1).
Fig. 2. On the left-hand side, an example of a modal decision tree; on the right-hand side, all relevant path-, leaf-, and class-formulas (ϕ5 is included in the second group from the top).
– If h > 1, let λ = s(ν0, ν1), π1 = ν1 ⋯ ctr(π), and π2 = ctr(π) ⋯ νh. Then:

ϕπ = λ ∧ (ϕπ1 ∧ ϕπ2)   if λ = ♦⊤, A(ν0, ctr(π)), and ϕπ2 ∉ Im,
                        or λ = ♦⊤, D(ν0, ctr(π)), and ϕπ2 ∈ Im;
ϕπ = λ → (ϕπ1 → ϕπ2)   if λ = ♦⊤, D(ν0, ctr(π)), and ϕπ2 ∉ Im,
                        or λ = ♦⊤, A(ν0, ctr(π)), and ϕπ2 ∈ Im;
ϕπ = ♦(ϕπ1 ∧ ϕπ2)      if λ = ♦⊤, A(ν0, ctr(π)), and ϕπ2 ∉ Im,
                        or λ = ♦⊤, D(ν0, ctr(π)), and ϕπ2 ∈ Im;
ϕπ = □(ϕπ1 → ϕπ2)      if λ = ♦⊤, D(ν0, ctr(π)), and ϕπ2 ∉ Im,
                        or λ = ♦⊤, A(ν0, ctr(π)), and ϕπ2 ∈ Im.

Then, for each leaf ℓ ∈ Vℓ, the leaf-formula ϕ^τ_ℓ (or, simply, ϕℓ, when τ is clear from the context) is defined as:

ϕℓ = ⋀_{π ∈ prefix(πℓ)} ϕπ.
Finally, for each class c, the class-formula ϕ^τ_c (or, simply, ϕc, when τ is clear from the context) is defined as:

ϕc = ⋁_{ℓ ∈ leaves(c)} ϕℓ.
Definition 7 (run). Let τ = (V, E, b, l, s) be a modal decision tree, ν a node in τ, and I an instance in a modal dataset I. Then, the run of τ on I from ν, denoted τ(I, ν), is defined as:

τ(I, ν) = l(ν)           if ν ∈ Vℓ;
τ(I, ν) = τ(I, ↙(ν))     if I ⊨ ϕ_{π↙(ν)};
τ(I, ν) = τ(I, ↘(ν))     if I ⊨ ϕ_{π↘(ν)}.

The run of τ on I (or the class assigned to I by τ), denoted τ(I), is defined as τ(I, root(τ)). Following the above definition, a modal decision tree classifies an instance using its class-formulas, and does so by checking, progressively, the path-formulas that contribute to build a leaf-formula, which, in turn, is one of the disjuncts that take part in a class-formula. Observe that, inter alia, this implies that propositional decision trees can be seen as particular cases of modal decision trees even from a semantic point of view: formulas of the type ϕ1 ∧ ϕ2 behave exactly as in the propositional case, while those of the type ϕ1 → ϕ2 are such that their antecedent is always included as a conjunct in their corresponding leaf-formula, effectively reducing them to conjunctions, as in the propositional case. Now, on the one side, the efficiency of classification depends on how leaf-formulas are checked, while, on the other side, correctness and completeness depend on their semantics. Let us start by evaluating the efficiency of modal decision trees.

Definition 8 (efficiency). We say that a decision tree τ of height h is efficient if and only if, for every dataset I and every instance I ∈ I, its run τ(I) can be computed in polynomial time with respect to h and to the size of I. A family of decision trees is efficient if and only if all of its decision trees are efficient.

The following result holds due to the fact that model checking an ML formula against a Kripke structure can be done in polynomial time in the sizes of the structure and the formula [6], and the fact that the size of the formula associated with a path is linear in the length of the path itself.

Theorem 1 (efficiency of MDT). The family MDT is efficient.

Now, we want to prove that modal decision trees are correct.

Definition 9 (correctness).
We say that a decision tree τ is correct if and only if, for every dataset I and every instance I ∈ I, I satisfies exactly one of its class-formulas ϕc. A family of decision trees is correct if and only if all of its decision trees are correct. The following lemma can be proved by induction on the lengths of the paths, and the correctness of MDT follows.
Lemma 1. Let τ be a modal decision tree, and let π1 = ν0 ⋯ νh−1 · ↙(νh−1) and π2 = ν0 ⋯ νh−1 · ↘(νh−1) be two paths. Then, ϕπ1 ↔ ¬ϕπ2 is valid.

Theorem 2 (correctness of MDT). The family MDT is correct.
Fig. 3. Typical presentation of an implicit temporal data set.
Finally, we discuss the completeness of modal decision trees with respect to modal logic.

Definition 10 (completeness). We say that a family of decision trees is strongly complete for a logical formalism if and only if, for each of its formulas ϕ, there is a decision tree τ and a class c ∈ C such that ϕc ↔ ϕ is valid. We also say that a family of decision trees is weakly complete for a logical formalism if and only if, for each of its formulas ϕ, there is a decision tree τ and two classes c, c̄ ∈ C such that ϕc → ϕ and ϕc̄ → ¬ϕ are both valid.

Modal decision trees are strongly complete with respect to propositional logic by definition, and weakly complete with respect to modal logic.

Lemma 2. Let ϕ ∈ ML. Then, there exists a modal decision tree τ and two leaves ℓc, ℓc̄ ∈ Vℓ such that ϕ_{πℓc} ↔ ϕ and ϕ_{πℓc̄} ↔ ¬ϕ are both valid.

Theorem 3 (completeness of MDT). The family MDT is strongly complete for PL and weakly complete for ML.
4 Applications
To show the potential of modal symbolic learning, in this section we consider two representative learning situations: learning from temporal data and learning from spatial data. As we have observed, spatial/temporal datasets can be seen as modal ones, and modal logic can be declined into suitable spatial/temporal logics that are able to describe such data. An example of a dimensional dataset in the temporal case is given in Fig. 3 (left); here, m instances are described by several attributes, each of which takes a value at each of the time points that contribute to the description. Thus, this is a set of multivariate time series; examples of real
Table 1. Test results: propositional versus modal learning from the public, 1-dimensional dataset NATOPS (left), and from the public, 2-dimensional dataset INDIAN PINES (right). Performances are reported in percentage points.

     |              Temporal               |               Spatial
     |   Propositional   |      Modal      |   Propositional   |      Modal
Seed | Acc.  Sen.  Spe.  | Acc.  Sen.  Spe.  | Acc.  Sen.  Spe.  | Acc.  Sen.  Spe.
   1 | 79.17 79.17 95.83 | 88.89 88.89 97.78 | 59.58 59.58 96.33 | 79.58 79.58 98.14
   2 | 83.33 83.33 96.67 | 88.89 88.89 97.78 | 62.50 62.50 96.59 | 79.58 79.58 98.14
   3 | 80.56 80.56 96.11 | 93.06 93.06 98.61 | 63.75 63.75 96.70 | 67.92 67.92 97.08
   4 | 77.78 77.78 95.56 | 91.67 91.67 98.33 | 62.50 62.50 96.59 | 79.58 79.58 98.14
   5 | 84.72 84.72 96.94 | 91.67 91.67 98.33 | 62.92 62.92 96.63 | 75.83 75.83 97.80
   6 | 77.78 77.78 95.56 | 88.89 88.89 97.78 | 57.08 57.08 96.10 | 71.25 71.25 97.39
   7 | 83.33 83.33 96.67 | 84.72 84.72 96.94 | 71.25 71.25 97.39 | 80.00 80.00 98.18
   8 | 80.56 80.56 96.11 | 91.67 91.67 98.33 | 62.92 62.92 96.63 | 75.83 75.83 97.80
   9 | 80.56 80.56 96.11 | 84.72 84.72 96.94 | 58.75 58.75 96.25 | 77.08 77.08 97.92
  10 | 75.00 75.00 95.00 | 87.50 87.50 97.50 | 62.92 62.92 96.63 | 79.58 79.58 98.14
Avg. | 80.27 80.27 96.05 | 89.16 89.16 97.83 | 62.42 62.42 96.58 | 76.62 76.62 97.87
situations that can be described by sets of multivariate time series range from hospitalized patients that are constantly monitored, to different sport activities described by the values of wearable sensors, to industrial machines whose behaviour is recorded over time. In many such situations, the relevant information is not necessarily visible at time points, but rather at time intervals, and in many cases the information to be extracted concerns prolonged events that take place at the same, or overlapping, or separate times, which is, again, a situation that is more naturally described with intervals rather than points. One way to extract such information is to consider the multivariate time series that corresponds to each instance, as in Fig. 3 (right), and each interval that can be built on it. Each such interval is regarded as a world, as in Fig. 3 (right), and worlds are connected through interval-interval relations. Taking the standard approach to do so results in having 12 interval-interval relations, excluding equality: meets (RA), overlaps (RO), begins (RB), ends (RE), during (RD), and later (RL), together with their six inverses. In turn, these give rise to a multi-modal logic known as HS (from the authors that first introduced it, Halpern and Shoham [7]), which we can use to extract knowledge from a one-dimensional dataset. In Fig. 3 (right), we have shown the relation overlaps by way of example. In the spatial case, we can generalize both the definition of world and the relations between worlds, and devise a 2-dimensional version of HS, in order to apply the same idea.

We performed a simple classification experiment on two public datasets, using a prototype, simple version of MDT (available at [12]); besides being publicly available, the chosen datasets have been selected taking into account their number of attributes and instances, and their representativeness for temporal and spatial problems. The first dataset is temporal, and known as NATOPS. It contains data generated by sensors on the hands, elbows, wrists and thumbs, in all three coordinates, along the temporal axis, of subjects performing several repetitions of aircraft hand signals, chosen among the 24 most often used ones; the problem consists in recognizing the specific signal. The second one is spatial, known as INDIAN PINES, and contains a hyperspectral image of a single landscape in Indiana (US) with 145×145 pixels, each represented by 220 spectral reflectance bands, and classified into one or more of sixteen types of crops; the problem is to recognize the type of cultivation in each pixel. While it would be premature to draw any conclusions from a single group of experiments, Table 1 already shows the improvement that we can expect when stepping from a static to a modal approach. The results (accuracy, sensitivity, specificity) marked as modal are compared with those obtained on the same datasets using simple aggregating functions and propositional decision trees (propositional).
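The six basic interval-interval relations (whose inverses give the remaining six of HS) can be sketched as comparisons over interval endpoints; direction conventions vary in the literature, so the ones below are one reasonable reading, not necessarily the paper's:

```python
# Sketch of the six basic HS/Allen interval-interval relations used to
# connect interval worlds (their inverses give the other six relations).
# Intervals are pairs (a, b) with a < b; directions are one convention.

def meets(i, j):    return i[1] == j[0]                    # R_A: i meets j
def later(i, j):    return i[1] < j[0]                     # R_L: j strictly after i
def begins(i, j):   return i[0] == j[0] and j[1] < i[1]    # R_B: j begins i
def ends(i, j):     return i[1] == j[1] and i[0] < j[0]    # R_E: j ends i
def during(i, j):   return i[0] < j[0] and j[1] < i[1]     # R_D: j during i
def overlaps(i, j): return i[0] < j[0] < i[1] < j[1]       # R_O: i overlaps j

print(overlaps((0, 4), (2, 6)))   # True
print(meets((0, 2), (2, 5)))      # True
```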
5 Conclusions
In this paper, we have shown how propositional decision trees can be generalized into modal decision trees. To this end, we have first highlighted the desirable properties of a family of decision trees in terms of efficiency of classification and logical properties with respect to a given logical formalism. Then, we designed a family of efficient decision trees that is correct with respect to modal logic. Application-wise, we have argued that, on the one side, different kinds of data are inherently non-propositional, including dimensional (temporal, spatial, spatio-temporal) data, graph-based data, and textual data, and that, on the other side, the logical formalisms that fit such cases are inherently modal. We considered two specific dimensional cases (a temporal one and a spatial one), and executed a learning experiment comparing the performances of propositional and modal decision trees on the same problem and under the same conditions. Temporal and spatial learning have been deeply studied in the machine learning literature; our purpose here is not to compare the performances of learning models in absolute terms, but to show the improvement that we can expect from introducing modal logic in symbolic learning schemata. The current implementation of modal decision trees is simpler than the one presented in this paper. The problem of devising an efficient implementation of a learning algorithm that extracts full modal decision trees is still open. While the problem of extracting the optimal decision tree is known to be NP-hard already at the propositional level, much work has been done on approximation algorithms; adapting such algorithms to this proposal, and studying their computational complexity, is an open issue as well.
Finally, decision trees are not the only symbolic learning classification method that can be generalized from the propositional to the modal case; the same can be done, at least, with rule-based systems and ensembles of trees, giving rise to what could be called modal symbolic learning.
References

1. Aiello, M., van Benthem, J.: A modal walk through space. J. Appl. Non-Class. Log. 12(3–4), 319–364 (2002)
2. Belson, W.A.: A technique for studying the effects of television broadcast. J. Roy. Stat. Soc. Ser. C 5(3), 195–202 (1956)
3. Blackburn, P., de Rijke, M., Venema, Y.: Modal Logic. Cambridge University Press, Cambridge (2001)
4. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees. Wadsworth Publishing Company, New York (1984)
5. Clarke, E.M., Emerson, E.A.: Design and synthesis of synchronization skeletons using branching time temporal logic. In: Kozen, D. (ed.) Logics of Programs 1981. LNCS, vol. 131, pp. 52–71. Springer, Heidelberg (1982). https://doi.org/10.1007/BFb0025774
6. Clarke, E.M., Grumberg, O., Peled, D.A.: Model Checking. MIT Press, Cambridge (2001)
7. Halpern, J.Y., Shoham, Y.: A propositional modal logic of time intervals. J. ACM 38(4), 935–962 (1991)
8. Kass, G.V.: An exploratory technique for investigating large quantities of categorical data. J. Roy. Stat. Soc. Ser. C 29(2), 119–127 (1980)
9. Lutz, C., Wolter, F.: Modal logics of topological relations. Log. Methods Comput. Sci. 2(2), 1–41 (2006)
10. Messenger, R., Mandell, L.: A modal search technique for predictive nominal scale multivariate analysis. J. Am. Stat. Assoc. 67(340), 768–772 (1972). https://doi.org/10.1080/01621459.1972.10481290
11. Morgan, J.N., Sonquist, J.A.: Problems in the analysis of survey data, and a proposal. J. Am. Stat. Assoc. 58(302), 415–434 (1963). https://doi.org/10.2307/2283276
12. Pagliarini, G., Manzella, F., Sciavicco, G., Stan, I.E.: ModalDecisionTrees.jl: interpretable models for native time-series & image classification (v0.80). Zenodo (2022). https://doi.org/10.5281/zenodo.7040420
13. Parliament and Council of the European Union: General Data Protection Regulation (2016). https://gdpr-info.eu/
14. Pnueli, A.: The temporal logic of programs. In: 18th Annual Symposium on Foundations of Computer Science (SFCS 1977), pp. 46–57. IEEE (1977)
15. Quinlan, J.R.: Induction of decision trees. Mach. Learn. 1, 81–106 (1986). https://doi.org/10.1007/BF00116251
16. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, Burlington (1993)
17. Quinlan, J.R.: Simplifying decision trees. Int. J. Hum. Comput. Stud. 51(2), 497–510 (1999)
Assisted Process Knowledge Graph Building Using Pre-trained Language Models

Patrizio Bellan1,2(B), Mauro Dragoni1, and Chiara Ghidini1

1 Fondazione Bruno Kessler, Trento, Italy
{pbellan,dragoni,ghidini}@fbk.eu
2 Free University of Bozen-Bolzano, Bolzano, Italy
Abstract. The automated construction of knowledge graphs from procedural documents is a challenging research area. Here, the lack of annotated data, as well as of raw text repositories describing real-world procedural documents, makes it extremely difficult to adopt deep learning approaches. Pre-trained language models have shown promising results concerning the knowledge extraction tasks from the models themselves. Although several works have explored this strategy to build knowledge graphs, the viability of knowledge base construction using a prompt-based learning strategy on such language models has not yet been investigated deeply. In this work, we present a prompt-based in-context learning strategy to extract, from natural language process descriptions, conceptual information that can be converted into its equivalent knowledge graph. The strategy is performed in a multi-turn dialog fashion. We validate the accuracy of the proposed approach from both quantitative and qualitative perspectives. The results highlight the feasibility of the proposed approach within low-resource scenarios.

Keywords: Process extraction from text · In-context learning · Knowledge graph · Pre-trained language model · Business process management
1 Introduction
The automatic building of knowledge graphs (KGs) from text is a long-standing goal in the Artificial Intelligence (AI) community that has opened many challenges within specific research areas, e.g., information extraction (IE), natural language processing (NLP), and knowledge representation and reasoning (KRR). KGs aim to organize raw information in an appropriate structured form by capturing the entities described within the source repositories (represented through nodes) and their relationships (represented through labeled edges). The availability of effective KGs may trigger reasoning tasks to infer unobserved facts from observed evidence, i.e., the nodes and the labeled edges contained within the KGs. The building of such KGs may pass through the analysis of complex and dynamic

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
A. Dovier et al. (Eds.): AIxIA 2022, LNAI 13796, pp. 60–74, 2023. https://doi.org/10.1007/978-3-031-27181-6_5
textual information containing both entities and relationships that have to be included within the KG. The construction of KGs starting from this type of information may be challenging, since the relevant entities contained within these texts are common-sense terms that within specific contexts can assume relevant conceptual meaning. Recent advances in NLP, like the availability of large pre-trained language models (PLMs) and the introduction of novel in-context learning approaches, enable the possibility to mimic few-shot learning techniques without changing any model parameter [6,18]. Conversational information seeking (CIS) systems can be exploited to extract conceptual information from natural language text and to represent such information in a structured form within a KG. These systems are drawing growing attention in both academia and industry. They aim at supporting search and question answering (among other tasks) using multi-turn dialogues. Given the high quality of such language models as potential representations of relational knowledge, an interesting research direction to explore is to support the automatic construction of KGs through the use of PLMs, in order to understand how much conceptual and relational knowledge they can extract, how much such knowledge differs from reality, and how it is possible to make them more effective within specific contexts. In this paper, we explore the feasibility of using in-context learning to perform knowledge extraction from procedural documents in a question-and-answer, multi-turn dialog fashion. To the best of our knowledge, this task is performed for the first time in the literature. An example of a multi-turn dialog is shown in Fig. 1. A user interacts with a cognitive artificial agent that mimics an expert of a specific domain. The agent guides the knowledge extraction task by answering a set of questions posed incrementally by the user. Then, the KG is built on top of the answers generated by the PLM.
As a representative scenario, we target the Business Process Management (BPM) area, and in particular the process extraction from text task [4]. This domain is characterized by the limited size of the available gold-standard data, which is highly hampering its development. We use the Generative Pre-trained Transformer 3 model (GPT-3) [6] as the artificial agent. We explore different settings of the adopted PLM to perform in-context learning: (i) no fine-tuning; (ii) providing conceptual definitions; and (iii) providing an extremely limited number of examples, i.e., mimicking the few-shot learning technique. Within our use case, we aim to extract entities and relations from procedural descriptions.

Fig. 1. In this example of a multi-turn dialog, the artificial agent guides the construction of the process knowledge graph by answering the user.
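A multi-turn, in-context prompt of the kind described above can be assembled by concatenating an optional preamble (definitions or few-shot examples), the process description, the dialog history, and the new question. The template below is purely illustrative and is not the authors' actual prompt:

```python
# Illustrative sketch of assembling a multi-turn, in-context prompt for a
# GPT-3-style completion model. The question wording and dialog format are
# hypothetical, not the exact prompts used in the paper.

def build_prompt(context, history, question, preamble=""):
    """Concatenate preamble, process description, dialog so far, and the
    new question; the trailing 'A:' asks the model to complete the answer."""
    turns = "\n".join(f"Q: {q}\nA: {a}" for q, a in history)
    parts = [preamble, f"Process description:\n{context}", turns,
             f"Q: {question}\nA:"]
    return "\n\n".join(p for p in parts if p)

context = "The clerk receives the claim and then checks its validity."
history = [("What are the activities?",
            "receive the claim; check its validity")]
prompt = build_prompt(context, history,
                      "Which actor performs 'check its validity'?")
print(prompt.endswith("A:"))   # True: the model completes from here
```

Each extraction step appends the model's answer to the history, which is what makes the interaction incremental.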
2 Related Work
The information extraction research area has been widely explored in the literature, embracing many domains [14]. Specifically, concerning the use of PLMs, several works investigated them with the aim of understanding both the linguistic and semantic properties of possible word representations, and also how PLMs can be exploited within specific knowledge and linguistic tasks. Compared to the aims mentioned above, our approach goes against the trend by trying to exploit PLMs to extract and store factual and common-sense knowledge with the aim of constructing KGs automatically. However, the adoption of PLMs has been investigated from several perspectives. A systematic comparison between neural-based and count-based representations has been performed in [1]. Neural-based representations were demonstrated to be more effective than count-based ones in most of the tested tasks. Hence, given the neural-based nature of PLMs, they may be considered a suitable starting point for this work. Details about the required granularity of these representations have been investigated, instead, in [12]. The capability of PLMs concerning the understanding and, in turn, the generation of grammatically correct sentences has been investigated in [15] and [20]. The former demonstrated how ungrammatical sentences do not affect the understanding capability of PLMs. The latter investigated how PLMs are suitable for being used within different domains and tasks without particular fine-tuning activities. However, even if the flexibility of PLMs is quite high, this work highlighted how it is possible to customize them to obtain more effective PLMs addressing specific tasks or to be used in specific domains. Moreover, it provides little insight into whether these models can compete with traditional approaches to representing knowledge, like symbolic knowledge bases. Our work goes in this direction, since we intend to verify the feasibility of PLMs in extracting symbolic knowledge from natural language texts.
Finally, in [16] the authors introduced a transformer-based PLM which they called generative pre-training (GPT-1). This work evolved into two further versions: GPT-2 [17] and GPT-3 [6]. These PLMs demonstrated their suitability for zero-shot settings in several tasks and their ability to store factual knowledge. Moreover, the authors of GPT-3 demonstrated how it is possible to fine-tune the PLM to enhance its effectiveness on specific tasks or domains. Differently from state-of-the-art research, and to the best of our knowledge, this is the first investigation concerning the extraction of conceptual knowledge from text using in-context learning that deals with an entire textual description without making assumptions about the input text. Then, extraction is
Process Knowledge Graph Building Using PLM
done in an incremental and flexible conversational fashion, retrieving the required information via question-and-answer dialogues. Our work is the first attempt to use these models on this specific problem, and therefore the results and lessons learned are likely to pave the way for future efforts, possibly involving different strategies, target entities and relations, and also other PLMs.
3 Use Case
As introduced in Sect. 1, the strategy proposed in this paper is agnostic with respect to the domain. Indeed, the use of in-context learning techniques to extract knowledge from natural language text allows one to ask a PLM for specific types of information without specifying the domain of interest a priori. To demonstrate the feasibility of the proposed solution, we rely on a use case related to the construction of small-size KGs starting from natural language documents describing procedures. This task, known as process information extraction from text, can be regarded as the problem of transforming process descriptions into structured representations of different expressivity, up to the entire formal process model diagram [3,4]. We chose this task since it is highly hampered by data scarcity issues. We extract from raw texts the entities that are relevant to populate the equivalent KG, which could then be refined or used to build graphical models. We exploited part of the PET dataset [2] to validate our strategy. This dataset is the only publicly available annotated gold-standard dataset specific to process extraction from text tasks. It contains 45 texts annotated with process model elements and relations¹. KGs are built by means of a question-answering style that mimics a multi-turn dialog with an expert. In our setup, the GPT-3 model acts as a domain expert.
4 Knowledge Extraction from Text via In-Context Learning
This section describes the approach we designed and implemented to extract knowledge, both entities and relations, from text via in-context learning. The starting point is the set of conceptual elements (entities and/or relationships) we aim to extract. The first building block of the approach is the formulation of a series of incremental questions (posed, e.g., by the user in Fig. 1, from Q1 to Q3), submitted to the GPT-3 model sequentially, which enables the extraction of those specific entities and relationships. These questions become the specific tasks that GPT-3 has to solve with the help of specific prompts.
¹ The description of the dataset, the annotation guidelines, the annotation schema, and the annotation process are out of the scope of this paper; the interested reader can find all the material at https://huggingface.co/datasets/patriziobellan/PET.
P. Bellan et al.
In our dialog pipeline, answers to a question are used as inputs to formulate further questions. For instance, as shown in the figure, we first ask for the list of activities described in a text (Q1) and use the answers (A1) to populate the KG with activity nodes. Then, for each activity, we ask which participant performs it (Q2); we use this information (A2) to populate the KG with the participant nodes and the perform relation. Finally, for each activity pair, we ask whether they stand in a following relation and use this information (A3) to complete the KG with activity-activity relations. The overall pipeline supports both options, i.e., the use of gold information to perform each step or the reuse of the output obtained from the previous step to perform the new one. In this work, we focused on the latter strategy, since we intend to investigate the capability of constructing a KG from scratch, without the usage of gold information. The second building block of the approach is the construction of the prompt (the input fed to the model) to perform in-context learning. Prompts are generated starting from templates that are filled using two types of information: (i) contextual knowledge, which enables GPT-3 to identify the specific domain at hand and the elements to be extracted for the different tasks; and (ii) a few examples of the task at hand. Once ready, the prompts are fed into the model in order to generate the answer. The third building block of our approach is the PLM used in the conversation to mimic an expert in the field. As motivated in Sect. 2, we decided to start from GPT-3 [6], since it is one of the state-of-the-art PLMs and it can be adopted without fine-tuning it toward a specific goal. Other transformer-like models, such as BERT [9] or RoBERTa [13], could not be adopted to perform in-context learning, since they usually require specific training (or fine-tuning) toward a downstream task to exhibit acceptable performance.
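The multi-turn pipeline just described can be sketched as follows. The `ask_model` stub and its canned answers are hypothetical stand-ins for the actual GPT-3 calls (the paper queries text-davinci-001 with temperature 0.0); only the Q1–Q3 answer-to-KG bookkeeping is illustrated, not the authors' exact code:

```python
from itertools import permutations

def ask_model(prompt):
    """Stand-in for the GPT-3 call (hypothetical stub). Returns canned
    answers so the pipeline logic can be run offline."""
    canned = {
        "activities": ["send invoice", "pay invoice"],
        "performer of 'send invoice'": "clerk",
        "performer of 'pay invoice'": "customer",
        "follows: 'send invoice' -> 'pay invoice'": "yes",
        "follows: 'pay invoice' -> 'send invoice'": "no",
    }
    return canned[prompt]

def build_kg(text):
    kg = {"activities": [], "participants": set(), "performs": [], "follows": []}
    # Q1: list the activities described in the text -> activity nodes.
    kg["activities"] = ask_model("activities")
    # Q2: for each activity, ask which participant performs it; the answer
    # yields both a participant node and a `performs` edge.
    for act in kg["activities"]:
        who = ask_model(f"performer of {act!r}")
        kg["participants"].add(who)
        kg["performs"].append((who, act))
    # Q3: for each ordered activity pair, ask whether they stand in a
    # directly-follow relation -> activity-activity edges.
    for a, b in permutations(kg["activities"], 2):
        if ask_model(f"follows: {a!r} -> {b!r}") == "yes":
            kg["follows"].append((a, b))
    return kg

kg = build_kg("...")  # the process description would be embedded in each prompt
```

Note that each answer feeds the next round of questions: the activity list from Q1 drives both the Q2 loop and the Q3 pairwise comparisons, mirroring the "output of the previous step" strategy adopted in the paper.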
Real-world scenarios may often not be supported by transformer-like approaches due to low-resource issues. Hence, we decided to start our investigation directly from GPT-3 and from the notion of in-context learning, since it can overcome this issue. We therefore tackle the task in a question-and-answering fashion, not as an information extraction task. We used the answers generated by the model to build the knowledge graph of the process described in a text. Needless to say, this first investigation into the usage of PLMs for process extraction from text does not aim to say the final word on this topic; on the contrary, it aims to open up the possibility of better investigating and exploiting this kind of approach in the future.

4.1 In-Context Learning
PLMs, such as GPT-3 [6] or BERT [9], are built using an impressive amount of data, exploiting advances in deep learning engineering and computational power [6,18]. PLMs are becoming a hot topic in NLP as they can be adopted, and fine-tuned, to solve complex tasks in different domains, such as open question answering for prototypical commonsense reasoning [5]. While the fine-tuning of PLMs for task-specific applications has become standard practice in NLP in the last few years, the advent of GPT-3 greatly changed this paradigm. This model
opens the possibility of injecting task-specific knowledge without a "classical" fine-tuning of the model parameters toward a specific downstream task. The model uses the knowledge provided in input to refine its reasoning capabilities toward the task to solve. This technique is called in-context learning: contextual knowledge, task instructions, very few examples of how to solve the task, and the actual input data are given together in a single prompt, which is then fed into the model. This approach has been shown to be extremely useful for addressing the low-resource issue [19] and has been used on topics ranging from medical dialogue summarization [8] to hate speech detection [11]. We illustrate the notion of prompt with an abstract example in Fig. 2. The prompts used in our experiments are customizations of this prompt.
Fig. 2. Abstract example of a prompt adopted in our experiments to perform in-context learning.
Lines 1–3 form the contextual knowledge component and provide the model with contextual information that narrows the model's reasoning to the specific context at hand (Business Process Management, in our case). This knowledge can help the model disambiguate the different meanings of a word (e.g., activity). In our example, it consists of the identification of the domain (Business Process Management) and definitions of its elements. Lines 4–7 form the examples component and provide examples of the task to be solved together with their solutions. It is composed of three parts: (i) a textual example [line 4], (ii) the task instructions to be performed upon the text [line 6], and (iii) the correct answer(s) [line 7]. In the sample prompt, we included only one example. Lines 8–10 form the task instructions component and provide the task instructions describing the actual problem to be solved [line 10] and the input process description [line 9] upon which the task has to be performed. Finally, line 11 is an answer-eliciting mark that tells the model that the prompt has ended and that it should start producing an answer. At inference time, the prompt is the input fed into the model.
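The prompt layout just described can be assembled programmatically. The sketch below mirrors the components (contextual knowledge, one worked example, task instructions, answer-eliciting mark); the strings are illustrative placeholders, not the exact text of Fig. 2, and the commented-out API call shows how such a prompt could be sent to text-davinci-001 via the legacy OpenAI completions endpoint:

```python
def build_prompt(context, example_text, example_instruction, example_answer,
                 input_text, instruction):
    parts = [
        context,                             # lines 1-3: contextual knowledge
        f"Text: {example_text}",             # line 4: example text
        f"Question: {example_instruction}",  # line 6: example task instruction
        f"Answer: {example_answer}",         # line 7: correct answer
        f"Text: {input_text}",               # line 9: input process description
        f"Question: {instruction}",          # line 10: actual task instruction
        "Answer:",                           # line 11: answer-eliciting mark
    ]
    return "\n".join(parts)

prompt = build_prompt(
    context="Considering the context of Business Process Management ...",
    example_text="The clerk sends the invoice; then the customer pays it.",
    example_instruction="List the activities described in the text.",
    example_answer="send the invoice, pay the invoice",
    input_text="<process description under analysis>",
    instruction="List the activities described in the text.",
)

# Hypothetical call with the legacy OpenAI completions API of the time
# (text-davinci-001, temperature 0.0, as in the paper's experiments):
# response = openai.Completion.create(engine="text-davinci-001",
#                                     prompt=prompt, temperature=0.0)
```

Ending the prompt with the bare `Answer:` mark is what triggers the model to complete rather than continue the instructions.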
4.2 Implementing the Approach
While the overall approach presented here does not depend upon the particular process elements extracted, in this paper we use it for the extraction of activities, participants (that is, actors in this context), the performing relation between a participant and the activity(ies) it performs, and the sequence relation between activities (hereafter the directly follow relation). We focus on these four elements as they constitute the basic building blocks of any business process model; they enable the construction of a structured representation such as the one in Fig. 3 and were therefore deemed an appropriate starting point for the empirical investigation of a new approach. The graph shown in Fig. 3 is the KG representing the procedure described in document doc3.3 of the PET dataset. The questions used to extract activities, participants, the performing relation, and the directly follow relation from text are reported in Fig. 4. As shown in Fig. 1, these questions are posed incrementally: first, we ask about the process activities (Q1), then we enrich the activities with the participants performing them (Q2), and finally we ask about the precedence relation among activities (Q3). Question Q2 is used to retrieve both the participant and the performing relationship. The incremental order of the questions is interesting because it mimics the way we often build conceptual models using follow-up questions. This first work does not aim to investigate this aspect in depth. We are aware that there is a growing body of literature on prompt-based fine-tuning, as described, e.g., in [10].

Fig. 3. The entities and relations contained in the KG of document 3.3 of the PET dataset. Green circles represent the activities. Orange circles represent the participants. Blue arrows represent the directly follow relations. Orange arrows represent the performing relations. (Color figure online)
However, an investigation into the most efficient prompt formulation is out of the scope of this first paper.
Fig. 4. The questions adopted as task instructions in prompts.
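The three question types can be kept as parameterized templates that are instantiated per activity or per activity pair. The wording below paraphrases the kinds of questions described in the text; it is hypothetical and not the exact task instructions reported in Fig. 4:

```python
# Illustrative paraphrases of the three question types (hypothetical wording,
# not the exact task instructions of Fig. 4).
Q1 = "What are the activities described in the text?"
Q2 = "Who is the participant performing the activity {activity!r}?"
Q3 = "Does the activity {b!r} directly follow the activity {a!r}?"

q2 = Q2.format(activity="send the invoice")
q3 = Q3.format(a="send the invoice", b="pay the invoice")
```

Q1 is asked once per document, Q2 once per extracted activity, and Q3 once per ordered activity pair, which is what makes the interaction incremental.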
Our in-context learning approach exploits two sources of information: contextual knowledge and a few examples related to the task at hand. For this specific
paper, contextual knowledge consists of the text in Fig. 5: a preamble identifying the business process management (BPM) context and the definitions of the process elements to be extracted.
Considering the context of Business Process Management and process modeling and the following definitions:
Activity: An activity is a unit of work that can be performed by an individual or a group. It is a specific step in the process.
Participant: A participant is any individual or entity that participates in a business process. This could include individuals who initiate the process, those who respond to it, or those who are affected by it.
Process Model: A process model is a model of a process in terms of process activities and their sequence flow relations.
Flow: A flow object captures the execution flow among the process activities. It is a directional connector between activities in a Process. It defines the activities' execution order.
Sequence Flow: A Sequence Flow object defines a fixed sequential relation between two activities. Each Flow has only one source and only one target. The direction of the flow (from source to target) determines the execution order between two Activities. A sequence relation is an ordered temporal relation between a source activity and the activity that immediately follows it in the process model.
Fig. 5. Contextual knowledge provided in prompts.
5 Empirical Assessment
Below we describe the procedure adopted to evaluate the proposed approach. We start by specifying the tasks to be solved, then describe the experimental settings given by the different prompts and the dataset used for the evaluation, and finally report the obtained results. Even though we automatically extract the target elements (activities, participants, and relations) from the GPT-3 answers, we manually validated them all. We performed the experiments with the text-davinci-001 engine, setting all the other model parameters (e.g., sampling temperature) to 0.0. We remark that the comparison among different model configurations is postponed to future investigation, since it is out of the scope of this paper. We performed a quantitative evaluation by applying the Graph Edit Distance (GED) [7] metric to compare the KGs created from the gold standard annotations with those generated from the information extracted by the GPT-3 PLM. We then provide a qualitative evaluation in which we analyze, starting from a representative example from our dataset, the main pros and cons of using a PLM to automatically build a KG².
² The reader may find all the material of this research at https://pdi.fbk.eu/pet/aixia/aixia2022_material.zip.
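For labeled KGs like ours, where node correspondence is fixed by the entity labels, graph edit distance reduces to counting node and edge insertions/deletions. The sketch below uses this simplification (it is an assumption for illustration; the paper applies the general GED of [7], for which a library routine such as networkx's `graph_edit_distance` could be used instead):

```python
def labeled_ged(gold_nodes, gold_edges, pred_nodes, pred_edges):
    """Edit distance between two labeled graphs when nodes are matched by
    label: every node or edge present in exactly one of the two graphs
    costs one edit operation (a simplification of general GED)."""
    node_edits = len(set(gold_nodes) ^ set(pred_nodes))  # symmetric difference
    edge_edits = len(set(gold_edges) ^ set(pred_edges))
    return node_edits + edge_edits

gold = ({"a1", "a2", "p1"}, {("a1", "a2"), ("p1", "a1")})
pred = ({"a1", "a2"}, {("a1", "a2")})  # missed participant p1 and its edge
dist = labeled_ged(*gold, *pred)  # 1 missing node + 1 missing edge -> 2
```

Under this reading, a GED of 0 means the extracted KG matches the gold standard exactly, and each hallucinated or missing element adds one to the score.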
5.1 The Tasks
The overall task we assess is the generation of KGs starting from procedural documents. We designed a multi-turn dialog pipeline in which each interaction provides KG information about the nodes and the edges of the graph, in order to obtain a process representation similar to the one in Fig. 3. To get the information required to build the KG, our dialog pipeline addresses two categories of subtasks: process element extraction (activities and participants) and relation extraction (participant performer and activity-activity relations). In the activity extraction subtask, we customized the prompt templates' task instructions with question Q1. We performed the extraction of process participants together with the performs relation extraction subtask by customizing the prompt templates' task instructions with question Q2. Finally, the follows relation subtask compares pairs of activities to assess, for each pair, whether they stand in sequential order; here we customized the prompt templates' task instructions with question Q3, completing the instructions with one pair of activities at a time. We are aware that the extraction of participants and of the follows and performs relations is influenced by the quality of the activity extraction. We remark that we evaluated the proposed approach by comparing the extracted graphs with the gold standard ones. In our experiments, we did not consider the accuracy of extracting such relations when using the gold annotations provided in the PET dataset. Instead, we measured the ability of the system to extract these three elements on the basis of the activities extracted by Q1, thus measuring the effective quality of the incremental question-answering interaction.
5.2 Experimental Setting
We evaluated the proposed approach in four experimental settings; here we adopt the terminology of Sect. 4 to describe them. In the Raw setting, the GPT-3 model is used as provided by its maintainers, without any customization. We created this setting by providing only the task instructions and the process description text in the prompt template. This setting works as a baseline to observe the capability of the native model within complex scenarios. We built the second setting, called Defs, on top of the Raw setting: we customized the prompt template by adding contextual knowledge to narrow the model's reasoning. The contextual knowledge provided is composed of the contextual information and the definitions in Fig. 5. The aim was to inject domain-specific conceptual knowledge into the language model and observe the capability of the system to exploit basic domain knowledge. The third setting, called 2Shots, was built on top of the Raw setting by adding the examples component. In our experiments, we used the gold standard annotations of documents 2.2 and 10.9 of the PET dataset. Here, for the extraction of activities and participants, only the annotations related to
activities, activity data, and participants were provided, while for the extraction of the follows and performs relationships, only the annotations related to sequence flow and performing were provided. This strategy was adopted to avoid injecting non-essential information that may introduce noise into the model. Finally, in the Defs+2Shots setting we use both strategies described above: we enhanced the Defs setting with the examples component.
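The four settings differ only in which prompt components are present. A minimal sketch of how the variants could be assembled is shown below; the component strings are placeholders, not the actual prompt text used in the experiments:

```python
DEFS = "Considering the context of Business Process Management ..."  # Fig. 5 text
SHOTS = "Text: <doc 2.2> ... Answer: ...\nText: <doc 10.9> ... Answer: ..."

def make_prompt(setting, task_instruction, process_text):
    """Assemble a prompt for one of the four experimental settings."""
    parts = []
    if setting in ("Defs", "Defs+2Shots"):
        parts.append(DEFS)   # contextual knowledge component
    if setting in ("2Shots", "Defs+2Shots"):
        parts.append(SHOTS)  # two worked examples component
    parts += [f"Text: {process_text}", f"Question: {task_instruction}", "Answer:"]
    return "\n".join(parts)

raw = make_prompt("Raw", "List the activities.", "<text>")
full = make_prompt("Defs+2Shots", "List the activities.", "<text>")
```

Raw thus contains only the task instructions and the input text, while Defs+2Shots stacks both the definitions and the two gold-annotated examples in front of them.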
Fig. 6. The entities and relations contained in the KG of document 3.3 extracted using the Raw prompt.
Fig. 7. The entities and relations contained in the KG of document 3.3 extracted using the Defs prompt.
Fig. 8. The entities and relations contained in the KG of document 3.3 extracted using the 2Shots prompt. Here, false-positive Follows relationships have been omitted for readability purposes.
Fig. 9. The entities and relations contained in the KG of document 3.3 extracted using the Defs+2Shots prompt. Here, false-positive Follows relationships have been omitted for readability purposes.
5.3 Test Dataset
We selected 7 representative documents from the PET dataset to empirically evaluate our approach. Since the dataset is annotated with process elements and process relations, we manually constructed the gold standard graph for each test text. Table 1 reports the overall statistics of the selected documents in terms of number of words and of annotated activities, participants, and performs and follows relations.
Fig. 10. The set of false positive Follows relations contained in the KG of document 3.3 extracted using the 2Shots prompt.
Fig. 11. The set of false positive Follows relations contained in the KG of document 3.3 extracted using the Defs+2Shots prompt.
We are aware that the analysis of seven documents has limitations from a statistical significance perspective. However, the rationale behind this empirical evaluation is twofold. First, since this is a first observational study of a promising, groundbreaking strategy, we decided to select documents having specific characteristics in order to perform an ad-hoc analysis of how the pre-trained language model worked on them. Second, the application of the proposed approach went through several refinement rounds before being tested, since we had to understand how the pre-trained language model actually works. Hence, to better understand the impact of the information we provided to enrich the pre-trained language model, the most suitable way was to observe its behavior on a small but characteristic subset of documents.

Table 1. Characteristics of test set documents.

Text      word#  activity#  participant#  follow#  perform#
doc1.2    100    10         2             10       10
doc1.3    162    11         5             11       12
doc3.3    71     7          2             6        4
doc5.2    83     7          3             6        4
doc10.1   29     4          2             4        4
doc10.6   30     4          2             4        4
doc10.13  39     3          2             2        3
5.4 Quantitative Evaluation
Table 2 reports the results of the empirical assessment, namely the GED measures obtained by comparing the gold standard graph with the graphs generated in each of the experimental settings. Such a measure represents the minimum number of edit operations required to transform the gold standard graph into the generated one: the higher the value, the larger the difference between the two KGs, and a low value means that the two KGs are similar. In general, the Raw and Defs settings registered higher GED values than the Defs+2Shots and 2Shots ones. Nevertheless, the results highlight a few interesting patterns. On average, the Raw setting registered the highest GED values. This result highlights the inability of the raw PLM to extract information that is useful for the construction of the final KG. For instance, as shown in Fig. 6, when tested on document 3.3 this prompt was able to extract only some activities, and no relations at all; as for participants, it was not able to address their extraction properly. The Defs setting suffers from the same drawback, as proven by the identical GED value obtained by the two settings in several cases, with the consequence of producing very similar graph representations. Indeed, as shown in Fig. 7, this customization was able to extract the activities described but failed completely to extract their relations; it also over-generated participant elements and created many false-positive performer relations. An exception is doc5.2, where the Defs setting outperformed the other settings. Here, the conservative strategy (i.e., not to fine-tune the model with annotated procedural documents) adopted in both the Raw and Defs settings produced slightly better results than the Defs+2Shots and 2Shots ones. Between Defs+2Shots and 2Shots, the latter proved the most effective. Indeed, in several cases, e.g. doc10.6, the 2Shots setting produced a KG similar to the gold standard one. Comparing these two prompts on document 3.3, as shown in Fig. 9 and Fig. 8, both are able to detect the activities and the participants described in the text; however, the Defs+2Shots prompt generated many false-positive performer relations. Two interesting trends are worth discussing. First, the length of a text seems not to be related to the GED value obtained in each setting. This is an interesting aspect, since it suggests the hypothesis that the effectiveness of the model is independent of the length of the text; future work will also focus on a deeper investigation of this aspect. The second interesting aspect is related to the impact of the few-shot strategy within the in-context learning approach.
Here, we can observe the results by splitting the GED values into those of the Raw and Defs settings and those of the 2Shots and Defs+2Shots ones. It is interesting to observe how the effectiveness of the first two settings is, generally speaking, the opposite of the other two. An example is given by document doc3.3 where, unexpectedly, the few-shot strategy over-produced incorrect Follows relations between activities, causing higher GED values, as shown in Fig. 11 and Fig. 10. Finally, we may state that the 2Shots and Defs+2Shots settings achieved a notable effectiveness, demonstrating the viability of a few-shot approach integrated within an in-context learning strategy. They performed well on the extraction of process elements from the natural language descriptions, even if they are inclined to generate several false Follows relations between activities.
6 Qualitative Analysis
The quantitative analysis provided some preliminary insights into the actual performance of PLMs within the four experimental settings adopted. By analyzing the GED values from a qualitative perspective, and by also taking into account the different types of process workflow described in the textual documents we considered, we can highlight some further considerations.
First, both the Raw and Defs settings obtained very low effectiveness in the extraction of both process elements and relations. The GED values obtained were very close to their upper bounds, i.e., all extracted elements were wrong or the settings were not able to produce any results. Hence, we may state that these two settings are not good candidates for correctly extracting process elements from a natural language text. On the one side, by observing the Raw strategy, we may conclude that in most cases it fails in the extraction of all elements. This is an important point of attention, because it demonstrates that PLMs per se might not be able to support the knowledge extraction task without the adoption of a fine-tuning strategy. On the other side, the Defs setting improves a little over the Raw one, especially concerning the identification of activities. However, it proved inadequate for detecting temporal relations among activities, i.e. the Follows relation. Second, we have already shown how both the Defs+2Shots and 2Shots settings demonstrated their effectiveness, proving the viability of a few-shot approach integrated within an in-context learning strategy. However, some issues emerged concerning the extraction of relations among activities. Indeed, the Defs+2Shots setting obtained good results in finding the activities themselves, but it often fails at detecting the appropriate relations between them: on the one hand, it finds the actually existing ones; on the other hand, it finds many Follows relations that are not mentioned in the original text.

Table 2. Graph edit distance scores results.

Text ID   Raw   Defs  Defs+2Shots  2Shots
doc1.2    31.0  33.0  13.0         9.0
doc1.3    20.0  32.0  42.0         39.0
doc3.3    12.0  14.0  30.0         17.0
doc5.2    30.0  12.0  22.0         21.0
doc10.1   19.0  19.0  4.0          6.0
doc10.6   19.0  19.0  4.0          2.0
doc10.13  15.0  15.0  13.0         5.0
Average   21.0  18.7  11.2         7.5
The trend of obtaining many incorrect relations between activities obviously led to higher GED values. Overall, the 2Shots setting proved more balanced, since (i) it was able to find all the process elements described in the text and (ii) it did not add too many non-existing relations, especially Follows ones. This is an important insight because, while the use of domain-specific definitions is in any case useful to improve the overall effectiveness of the extraction process, it is important to dedicate effort to detecting the most appropriate definitions, e.g., avoiding over-generalized ones. The detection of the definitions most appropriate for instructing the model is not trivial: the PLM may associate different semantic meanings with the same words, so it is crucial to support its disambiguation capability in order to inject the correct knowledge into it. Finally, by analyzing the process workflows contained in the natural language documents adopted, we may state that the detection of split points and parallel branches is challenging. Indeed, we observed that, in general, split points are difficult for the PLM to interpret, given the necessity of taking a larger portion of text into account.
7 Conclusion
In this paper, we explored the feasibility of leveraging PLMs and an in-context learning approach to automatically build KGs from textual documents in an incremental, question-answering, multi-turn dialog manner. The results highlight the feasibility of the in-context learning approach when deep-learning-based NLP techniques are used within low-resource scenarios, and they show the viability of our proposed methodology. This opens the possibility of using this technique to construct KGs starting from natural language text in scenarios where low-resource issues must be managed, exploiting the human-in-the-loop paradigm given the role of the domain expert in processing the information provided by the model. We also reported a suite of lessons learned from this experience that will drive the development of further research.
References

1. Baroni, M., Dinu, G., Kruszewski, G.: Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), vol. 1, pp. 238–247. The Association for Computer Linguistics (2014)
2. Bellan, P., van der Aa, H., Dragoni, M., Ghidini, C., Ponzetto, S.P.: PET: an annotated dataset for process extraction from natural language text tasks. In: Cabanillas, C., Garmann-Johnsen, N.F., Koschmider, A. (eds.) Business Process Management Workshops (BPM 2022). LNBIP, vol. 460, pp. 315–321. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-25383-6_23
3. Bellan, P., Dragoni, M., Ghidini, C.: A qualitative analysis of the state of the art in process extraction from text. In: Proceedings of the AIxIA 2020 Discussion Papers Workshop co-located with the 19th International Conference of the Italian Association for Artificial Intelligence (AIxIA 2020), Anywhere, 27th November 2020. CEUR Workshop Proceedings, vol. 2776, pp. 19–30. CEUR-WS.org (2020)
4. Bellan, P., Dragoni, M., Ghidini, C.: Process extraction from text: state of the art and challenges for the future. arXiv preprint arXiv:2110.03754 (2021)
5. Boratko, M., Li, X., O'Gorman, T., Das, R., Le, D., McCallum, A.: ProtoQA: a question answering dataset for prototypical commonsense reasoning. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), pp. 1122–1136. ACL (2020)
6. Brown, T.B., et al.: Language models are few-shot learners. In: Annual Conference on Neural Information Processing Systems (NeurIPS) (2020)
7. Bunke, H.: On a relation between graph edit distance and maximum common subgraph. Pattern Recognit. Lett. 18(8), 689–694 (1997)
8. Chintagunta, B., Katariya, N., Amatriain, X., Kannan, A.: Medically aware GPT-3 as a data generator for medical dialogue summarization. In: Proceedings of the 6th Machine Learning for Healthcare Conference. Proceedings of Machine Learning Research, vol. 149, pp. 354–372. PMLR (2021)
9. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT 2019, vol. 1, pp. 4171–4186. ACL (2019)
10. Gao, T., Fisch, A., Chen, D.: Making pre-trained language models better few-shot learners. In: Proceedings of ACL/IJCNLP 2021, pp. 3816–3830. ACL (2021). https://doi.org/10.18653/v1/2021.acl-long.295
11. Gupta, S.: Hate speech detection using OpenAI and GPT-3. Int. J. Emerging Technol. Adv. Eng. (2022)
12. Hill, F., Reichart, R., Korhonen, A.: SimLex-999: evaluating semantic models with (genuine) similarity estimation. Comput. Linguist. 41(4), 665–695 (2015)
13. Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
14. Martínez-Rodríguez, J., Hogan, A., López-Arévalo, I.: Information extraction meets the semantic web: a survey. Semantic Web 11(2), 255–335 (2020)
15. Marvin, R., Linzen, T.: Targeted syntactic evaluation of language models. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1192–1202. Association for Computational Linguistics (2018)
16. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language understanding by generative pre-training. OpenAI Blog (2018)
17. Radford, A., et al.: Language models are unsupervised multitask learners. OpenAI Blog 1(8), 9 (2019)
18. Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21(1), 5485–5551 (2020)
19. Scao, T.L., Rush, A.M.: How many data points is a prompt worth? In: Proceedings of NAACL-HLT 2021, pp. 2627–2636. ACL (2021)
20. Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., Bowman, S.R.: GLUE: a multi-task benchmark and analysis platform for natural language understanding. In: Linzen, T., Chrupala, G., Alishahi, A. (eds.) Proceedings of the Workshop: Analyzing and Interpreting Neural Networks for NLP, BlackboxNLP@EMNLP 2018, pp. 353–355. ACL (2018)
Neural Networks Reduction via Lumping Dalila Ressi1(B) , Riccardo Romanello1 , Carla Piazza1 , and Sabina Rossi2 1 Università di Udine, Udine, Italy {dalila.ressi,riccardo.romanello,carla.piazza}@uniud.it 2 Università Ca’ Foscari Venezia, Venice, Italy [emailprotected]
Abstract. The increasing size of recently proposed neural networks makes it hard to implement them on embedded devices, where memory, battery and computational power are a non-trivial bottleneck. For this reason, the network compression literature has been thriving in recent years, and a large number of solutions have been published to reduce both the number of operations and the number of parameters involved in these models. Unfortunately, most of these reduction techniques are heuristic and usually require at least one retraining step to recover the accuracy. The need for model reduction procedures is well known also in the fields of Verification and Performance Evaluation, where large efforts have been devoted to the definition of quotients that preserve the observable underlying behaviour. In this paper we try to bridge the gap between the most popular and very effective network reduction strategies and formal notions, such as lumpability, introduced for the verification and evaluation of Markov chains. Elaborating on lumpability, we propose a pruning approach that reduces the number of neurons in a network without using any data or fine-tuning, while completely preserving the exact behaviour. Relaxing the constraints of the exact quotienting method, we can give a formal explanation of some of the most common reduction techniques.

Keywords: Neural networks · Compression · Pruning · Lumpability

1 Introduction
Since 2012, when AlexNet [29] won the famous ImageNet Large Scale Visual Recognition Challenge (ILSVRC), the number of proposed Artificial Neural Network (ANN or NN) architectures has increased exponentially. Their intrinsic flexibility, together with the superior performance they can achieve, has made neural networks the tool of choice for a wide variety of tasks. As these models have evolved to process large amounts of data or to solve complicated tasks, their complexity has increased at the same pace [12]. Such elaborate and deep networks are the foundation of Deep Learning (DL), and they stand out both for the large number of layers they are made of and for the high level of accuracy they can reach on difficult tasks [56].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. A. Dovier et al. (Eds.): AIxIA 2022, LNAI 13796, pp. 75–90, 2023. https://doi.org/10.1007/978-3-031-27181-6_6
While the academic community has mostly focused its efforts on training large and deep models [9,28,57], adopting such networks in embedded devices turned out to be a problem. Physical constraints such as battery, memory and computational power greatly limit both the number of parameters used to define the architecture and the number of Floating Point Operations (FLOPs) to be computed at inference time. A commonly used strategy to address this problem is called Network Compression. The compression literature has grown substantially over the last years, and there are many different ways to group together methods that reduce a model in similar ways. Methods focusing on finding the best possible structure to solve a particular task can be grouped together as architecture-related strategies. These methods usually require training the network from scratch each time the structure is modified. In particular, Neural Architecture Search (NAS) techniques aim to find the best possible architecture for a certain task with minimal human intervention [14,35,44]. This is usually made possible by modelling the search as an optimization problem and applying Reinforcement Learning (RL)-based methods to find the best architecture [3,60]. In this group we can also find Tensor Decomposition, where matrix decomposition/factorization principles are applied to the d-dimensional tensors in neural networks. Tensor decomposition generalizes the widely used Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) to an arbitrary number of dimensions [7,19,54]. The goal of these techniques is to reduce the rank of tensors in order to efficiently decompose them into smaller ones and drastically reduce the number of operations [12]. As the rank of a tensor is usually far from small, the most common solutions are either to force the network to learn filters with small rank or to use an approximate decomposition [13].
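To make the low-rank idea concrete, the sketch below factors a single dense weight matrix with a truncated SVD; the matrix, the sizes and the chosen rank are illustrative choices of ours, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 256 x 512 fully connected weight matrix.
W = rng.standard_normal((256, 512))

def low_rank_factors(W, rank):
    """Factor W ~= A @ B (A: m x rank, B: rank x n) via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # absorb the singular values into A
    B = Vt[:rank, :]
    return A, B

A, B = low_rank_factors(W, rank=32)

# The dense layer x @ W becomes two thinner layers (x @ A) @ B:
# 256*512 = 131072 parameters vs. 256*32 + 32*512 = 24576.
x = rng.standard_normal((1, 256))
print(np.abs(x @ W - (x @ A) @ B).max())  # approximation error of the rank-32 model
```

With the full rank the factorization is exact; shrinking the rank trades accuracy for a drastic reduction in parameters and FLOPs, which is the trade-off the decomposition methods cited above exploit.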
Using a similar approach, Lightweight or Compact Networks focus on modifying the design of the architecture so that it performs fewer operations while maintaining the same capability. This is the case for the MobileNet series [23,24,46], the ShuffleNet series [37,59], and the EfficientNet series [52,53]. They exploit the idea of using 1 × 1 filters introduced by Network in Network [32] and by GoogLeNet [49,50] in its inception modules. A similar concept is explored by the SqueezeNet [26] architecture in its Fire module, where the classical convolutional layers are replaced so that the network achieves the same accuracy as AlexNet on the ImageNet dataset with a model 510 times smaller. A different methodology consists in training a big model from the start, and then pruning superfluous parameters. In particular, Weight Pruning consists in zeroing connections or parameters already close to zero [30], but more elaborate methods can also take into account the impact of single weights on the final results [18]. Even if weight pruning is a very powerful tool to reduce the network parameters [15], its major drawback is that it does not actually reduce the number of FLOPs at inference time. A more effective solution consists instead in skipping some of the operations entirely. This is the case of Filter Pruning, where whole nodes or filters (in the case of convolutional layers) are removed from the architecture. Pruning usually
requires some degree of retraining to recover the accuracy lost due to the reduced network capability, but an interesting phenomenon in the early stages of pruning is that most of the time the test accuracy actually increases, due to the regularization effect that pruning unnecessary parameters has on the network. While weight pruning allows more control over which parameters to remove, filter pruning is usually the best solution compression-wise, as it allows one to drastically reduce the network parameters so that the models can actually be implemented on small embedded devices [45]. Another technique often used in conjunction with pruning is quantization [17]. While pruning aims to reduce the number of parameters, quantization instead targets their precision. As the weights are usually represented by floating point numbers, it is possible to reduce the bits used for the number representation down to a single bit [43] without affecting the network accuracy.

In the context of performance evaluation of computer systems, stochastic models whose underlying stochastic processes are Markov chains play a key role, providing a sound high-level framework for the analysis of software and hardware architectures. Although the use of high-level modelling formalisms greatly simplifies the specification of quantitative models (e.g., by exploiting compositionality properties [21]), the stochastic process underlying even a very compact model may have a number of states that makes its analysis a difficult, sometimes computationally impossible, task. In order to study models with a large state space without using approximations or resorting to simulations, one can attempt to reduce the state space of the underlying Markov chain by aggregating states with equivalent behaviours. Lumpability is an aggregation technique used to cope with the state space explosion problem inherent in the computation of the stationary performance indices of large stochastic models.
The lumpability method turns out to be useful on Markov chains exhibiting some structural regularity. Moreover, it allows one to efficiently compute the exact values of the performance indices when the model is actually lumpable. In the literature, several notions of lumping have been introduced: ordinary and weak lumping [27], exact lumping [47], and strict lumping [6]. With this paper we aim to link together the work of two different communities, the first one focusing on machine learning and network compression, the second one focusing on lumping-based aggregation techniques for performance evaluation. Even though a large number of efficient compression techniques has already been published, our aim is to give a formal demonstration of how some of the network parameters can be removed deterministically to obtain a smaller network with the same performance. Our method condenses many different concepts, such as some of the ideas exploited by tensor decomposition methods, filter pruning, and the lumpability used to evaluate the performance of complex systems. The paper is structured as follows. In Sect. 2 we provide a literature review. Section 3 gives the necessary background. Section 4 formally describes our technique, which exploits exact lumpability for quotienting NNs. Section 5 presents some experimental results. Finally, Sect. 6 concludes the paper.
2 Related Work
To the best of our knowledge, the only work similar to ours is [42], where the authors apply the classical notion of equivalence between systems from Process Algebra to reduce a neural network into a semantically equivalent one. They propose a filter pruning technique, based on some properties of the network, that does not need any data to perform the compression. They also define an approximated version of their algorithm to relax some of the strong constraints they pose on the weights of the network. While data-free pruning algorithms are convenient when a dataset is incomplete, unbalanced or missing, they usually achieve poorer results than data-based compression solutions. Indeed, most pruning techniques require at least one stage of fine-tuning of the model. The recovery is often performed in an iterative fashion after removing a single parameter, but there are also techniques that retrain the model only after a certain level of compression has been carried out [4]. As defined in [33], filter pruning techniques can be divided into property-importance and adaptive-importance methods. In the first group we find pruning methods that look at intrinsic properties of the networks and do not modify the training loss, such as [8,20,25,31,42,45]. Adaptive-importance pruning algorithms like [34,36] usually change the loss function drastically, requiring a heavy retraining step and a search for a new proper set of hyperparameters, although they often achieve better performance than property-importance methods. Avoiding retraining the network at each pruning step, as in [33,55], is usually faster than other solutions, but there is a higher risk of not being able to recover the performance. Another option consists in deciding which parameters to remove according to the impact they have on the rest of the network [40,58].
Finally, while most of the methods mentioned above focus on removing whole filters or kernels from convolutional layers, some other methods target only fully connected layers, or are designed to compress classical neural networks [2,51].
3 Preliminaries
In this section we formally introduce the notion of neural network in the style of [42]. Moreover, we recall the concept of exact lumpability as it has been defined in the context of continuous-time Markov chains.

Neural Networks

A neural network is formed by a layered set of nodes or neurons, consisting of an input layer, an output layer and one or more hidden layers. Each node that does not belong to the input layer is annotated with a bias and an activation function. Moreover, there are weighted edges between nodes of adjacent layers. We use the following formal definition of neural network.
NNs Reduction via Lumping
79
Fig. 1. Node v behaviour on input x1, x2, ..., xm
For k ∈ N, we denote by [k] the set {0, 1, ..., k}, by (k] the set {1, ..., k}, by [k) the set {0, ..., k − 1}, and by (k) the set {1, ..., k − 1}.

Definition 1 (Neural Network). A Neural Network (NN) is a tuple N = (k, Act, {S_ℓ}_{ℓ∈[k]}, {W_ℓ}_{ℓ∈(k]}, {b_ℓ}_{ℓ∈(k]}, {A_ℓ}_{ℓ∈(k]}) where:

– k is the number of layers (except the input layer);
– Act is the set of activation functions;
– for ℓ ∈ [k], S_ℓ is the set of nodes of layer ℓ, with S_ℓ ∩ S_ℓ′ = ∅ for ℓ ≠ ℓ′;
– for ℓ ∈ (k], W_ℓ : S_{ℓ−1} × S_ℓ → R is the weight function that associates a weight with the edges between nodes at layers ℓ − 1 and ℓ;
– for ℓ ∈ (k], b_ℓ : S_ℓ → R is the bias function that associates a bias with the nodes at layer ℓ;
– for ℓ ∈ (k], A_ℓ : S_ℓ → Act is the activation association function that associates an activation function with the nodes of layer ℓ.

S_0 and S_k denote the nodes in the input and output layers, respectively. In the rest of the paper we refer to NNs in which all the activation association functions are constant, i.e., all the neurons of a layer share the same activation function. Moreover, such activation functions A_ℓ are either ReLU (Rectified Linear Unit) or LeakyReLU, i.e., they are combinations of linear functions. So, from now on we omit the set Act from the definition of the NNs.

Example 1. Figure 1 shows the behaviour of node v in layer ℓ. The input values x_1, x_2, ..., x_m are propagated by nodes u_1, u_2, ..., u_m, respectively. Node v computes the ReLU of the weighted sum of the inputs plus the bias. The result of this application is the output of v and it is propagated to z.
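To fix ideas, the snippet below encodes a tiny instance of Definition 1 as plain Python dictionaries and evaluates it layer by layer; the evaluation anticipates the layer semantics formalized later in this section. All node names, weights and biases are illustrative choices of ours.

```python
# Toy NN per Definition 1: k = 2, S0 = {u1, u2}, S1 = {v, t}, S2 = {z}.
# W[l] and b[l] encode W_l and b_l for l = 1, ..., k; index 0 is unused.
relu = lambda y: max(0.0, y)

S = [["u1", "u2"], ["v", "t"], ["z"]]
W = [None,
     {("u1", "v"): 1.0, ("u2", "v"): 2.0, ("u1", "t"): 2.0, ("u2", "t"): 4.0},
     {("v", "z"): 1.0, ("t", "z"): 0.5}]
b = [None, {"v": 0.5, "t": 1.0}, {"z": 0.0}]
k = 2

def layer_sem(l, val):
    """Maps a valuation of S_{l-1} to a valuation of S_l (the layer semantics)."""
    return {s2: relu(sum(W[l][(s1, s2)] * val[s1] for s1 in S[l - 1]) + b[l][s2])
            for s2 in S[l]}

def io_sem(val):
    """Composes the layer semantics from layer 1 to k (input-output semantics)."""
    for l in range(1, k + 1):
        val = layer_sem(l, val)
    return val

print(io_sem({"u1": 1.0, "u2": 1.0}))   # {'z': 7.0}
```

Note that hidden neuron t has weights and bias exactly twice those of v, so t is proportionally redundant: this is precisely the situation exploited by the quotienting technique of Sect. 4.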
The operational semantics of a neural network is as follows. Let v : S_ℓ → R be a valuation for the ℓ-th layer of N and Val(S_ℓ) be the set of all valuations for the ℓ-th layer of N. The operational semantics of N, denoted by [[N]], is defined in terms of the semantics of its layers [[N]]_ℓ, where each [[N]]_ℓ associates with any valuation v for layer ℓ − 1 the corresponding valuation for layer ℓ according to the definition of N. The valuation for the output layer of N is then obtained by composing the functions [[N]]_ℓ.

Definition 2. The semantics of the ℓ-th layer is the function [[N]]_ℓ : Val(S_{ℓ−1}) → Val(S_ℓ) where for all v ∈ Val(S_{ℓ−1}), [[N]]_ℓ(v) = v′ and for all s′ ∈ S_ℓ,

v′(s′) = A_ℓ(s′)( Σ_{s∈S_{ℓ−1}} W_ℓ(s, s′) v(s) + b_ℓ(s′) ).

The input-output semantics of N is obtained by composing these one-layer semantics. More precisely, we denote by [[N]]^ℓ the composition of the first ℓ layers, so that [[N]]^ℓ(v) provides the valuation of the ℓ-th layer given v ∈ Val(S_0) as input. Formally, [[N]]^ℓ is inductively defined by:

[[N]]^1 = [[N]]_1
[[N]]^ℓ = [[N]]_ℓ ∘ [[N]]^{ℓ−1}  for all ℓ ∈ (k]

where ∘ denotes function composition. We are now in a position to define the semantics of N as the input-output semantic function [[N]] defined below.

Definition 3. The input-output semantic function [[N]] : Val(S_0) → Val(S_k) is defined as [[N]] = [[N]]^k.

Lumpability

The notion of lumpability has been introduced in the context of performance and reliability analysis. It provides a model aggregation technique that can be used for generating a Markov chain that is smaller than the original one while allowing one to determine exact results for the original process. The concept of lumpability can be formalized in terms of equivalence relations over the state space of the Markov chain. Any such equivalence induces a partition on the state space of the Markov chain, and aggregation is achieved by clustering equivalent states into macro-states, reducing the overall state space. Let S be a finite state space. A (time-homogeneous) Continuous-Time Markov Chain (CTMC) over S is defined by a function Q : S × S → R such that for all u, v ∈ S with u ≠ v it holds that:
– Q(u, v) ≥ 0, and
– Σ_{v∈S, v≠u} Q(u, v) = −Q(u, u).

A CTMC defined over S by Q models a stochastic process where a transition from u to v can occur according to an exponential distribution with rate Q(u, v). Given an initial probability distribution p over the states of a CTMC, one can consider the problem of computing the probability distribution to which p converges as time tends to infinity. This is the stationary distribution, and it exists only when the chain satisfies additional constraints. The stationary distribution reveals the limit behaviour of a CTMC. Many other performance indices and temporal logic properties can be defined for studying both the transient and the limit behaviour of the chain. Different notions of lumpability have been introduced with the aim of reducing the number of states of the chain while preserving its behaviour [1,6,22,27,38,39,47]. In particular, we consider here the notion of exact lumpability [6,47].

Definition 4 (Exact Lumpability). Let (S, Q) be a CTMC and R be an equivalence relation over S. R is an exact lumpability if for all S, S′ ∈ S/R and for all v, t ∈ S it holds that:

Σ_{u∈S′} Q(u, v) = Σ_{u∈S′} Q(u, t).

There always exists a unique maximum exact lumpability relation, which allows one to quotient the chain by taking one state for each equivalence class and replacing the rates of the incoming edges with the sum of the rates from equivalent states. In many application domains, however, the notion of exact lumpability is too demanding, thus providing poor reductions. This issue is well known for all lumpability notions that do not allow any form of approximation. With the aim of obtaining smaller quotients, while still avoiding rough approximations, the notion of proportional lumpability has been presented in [38,39,41] as a relaxation of ordinary lumpability. In this paper we instead introduce proportional exact lumpability, which is defined as follows.

Definition 5 (Proportional Exact Lumpability). Let (S, Q) be a CTMC and R be an equivalence relation over S. R is a proportional exact lumpability if there exists a function ρ : S → R>0 such that for all S, S′ ∈ S/R and for all v, t ∈ S it holds that:

ρ(v) Σ_{u∈S′} Q(u, v) = ρ(t) Σ_{u∈S′} Q(u, t).
It can be proved that there always exists a unique maximum proportional exact lumpability, which can be computed in polynomial time. This holds also if (S, Q) is a labelled graph instead of a CTMC, i.e., if no constraints are imposed on Q.
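As a sanity check of the proportional condition, the following sketch tests a candidate partition and a candidate ρ against an explicit rate matrix. The chain, the partition and ρ are small illustrative choices of ours (not the chain of Fig. 2).

```python
import numpy as np

def is_prop_exact_lumping(Q, classes, rho):
    """Check Definition 5: for all classes S, S' and all v, t in S,
    rho[v] * sum_{u in S'} Q[u, v] == rho[t] * sum_{u in S'} Q[u, t]."""
    for S_prime in classes:
        # incoming[v] = total rate from the states of S_prime into v
        incoming = Q[S_prime, :].sum(axis=0)
        for S in classes:
            vals = [rho[v] * incoming[v] for v in S]
            if not np.allclose(vals, vals[0]):
                return False
    return True

# States 0..3; states 1 and 2 receive proportional incoming rates
# (factor 2), compensated by rho = (1, 2, 1, 1).
Q = np.array([[-3.0,  1.0,  2.0,  0.0],
              [ 0.0, -1.0,  0.0,  1.0],
              [ 0.0,  0.0, -2.0,  2.0],
              [ 0.0,  1.0,  2.0, -3.0]])
rho = np.array([1.0, 2.0, 1.0, 1.0])

print(is_prop_exact_lumping(Q, [[0], [1, 2], [3]], rho))  # True
print(is_prop_exact_lumping(Q, [[0, 1], [2], [3]], rho))  # False
```

Note that the sums include the diagonal entries Q(v, v), so the self-rates of lumped states must also respect the proportionality, as they do in this chain.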
Fig. 2. Proportionally exact lumpable CTMC.
Example 2. Figure 2 shows a proportionally exact lumpable Markov chain with respect to the function ρ deﬁned as: ρ(1) = 1, ρ(2) = 3, ρ(3) = 1, ρ(4) = 3, ρ(5) = 1, ρ(6) = 3, ρ(7) = 1, ρ(8) = 1 and the equivalence classes S1 = {1}, S2 = {2, 3, 4}, S3 = {5, 6, 7}, S4 = {8}.
4 Lumping Neural Networks
The idea of exploiting exact lumpability for quotienting NNs has been proposed in [42], where a notion of pre-sum preserving backward bisimulation has been considered. It can be easily observed that such a notion coincides with that of exact lumpability. The term (probabilistic) bisimulation is standard in the area of Model Checking, where (probabilistic) temporal logic properties are used both for specifying and for synthesizing systems having a desired behaviour [5,10,11,16]. Since such logics usually formalize behaviours in terms of forward temporal operators, the bisimulation notions tend to preserve the rates of the outgoing edges [48]. However, as proved in [42], in order to preserve the behaviour of a NN it is necessary to refer to the rates/weights of the incoming edges. This is referred to as backward probabilistic bisimulation and coincides with the well-known notion of exact lumpability used in the area of performance evaluation. In this paper we extend the proposal of [42]. We prove that in the case of ReLU and LeakyReLU activations, proportional exact lumpability preserves the behaviour of the network, allowing one to obtain smaller quotients. It does not require any retraining step and it ensures the same behaviour on all possible inputs. Moreover, since the neural networks we refer to are acyclic, it can be computed in linear time.

Definition 6 (Proportional Exact Lumpability over a NN). Let N be a NN. Let R = ∪_{ℓ∈[k)} R_ℓ be such that R_ℓ is an equivalence relation over S_ℓ for all ℓ ∈ (k), and R_0 is the identity relation over S_0. We say that R is a proportional
exact lumpability over N if for each ℓ ∈ (k) there exists ρ_ℓ : S_ℓ → R>0 such that for all S ∈ S_ℓ/R_ℓ, for all S′ ∈ S_{ℓ−1}/R_{ℓ−1}, and for all v, t ∈ S it holds that:

ρ_ℓ(v) b_ℓ(v) = ρ_ℓ(t) b_ℓ(t),
ρ_ℓ(v) Σ_{u∈S′} W_ℓ(u, v) = ρ_ℓ(t) Σ_{u∈S′} W_ℓ(u, t).

There are some differences with respect to the definition of proportional exact lumpability over CTMCs. First, we impose that two equivalent neurons belong to the same layer. However, we could have omitted this restriction from the definition and proved that neurons from different layers are never equivalent; this is an immediate consequence of the fact that we refer to acyclic NNs. Moreover, we demand that on input and output nodes the only admissible relation is the identity. This is a substantial difference. Since the nodes in the input layer have no incoming edges, the definition of proportional lumpability given over CTMCs would allow them to be collapsed. However, the input nodes in NNs hold the input values that have to be propagated, so they cannot be collapsed. The same holds for the output nodes, since they represent the result of the computation. It can be proved that there always exists a unique maximum proportional exact lumpability over a NN. If we use proportional exact lumpability for reducing the dimension of a NN by collapsing the equivalent neurons, we have to modify the topology and the weights of the NN as formalized below.

Definition 7 (Proportional Reduced NN). Let N = (k, {S_ℓ}_{ℓ∈[k]}, {W_ℓ}_{ℓ∈(k]}, {b_ℓ}_{ℓ∈(k]}, {A_ℓ}_{ℓ∈(k]}) be a NN. Let R be a proportional exact lumpability over N. The NN N/R = (k, {S′_ℓ}_{ℓ∈[k]}, {W′_ℓ}_{ℓ∈(k]}, {b′_ℓ}_{ℓ∈(k]}, {A′_ℓ}_{ℓ∈(k]}) is defined by:

– S′_ℓ = {[v] | [v] ∈ S_ℓ/R_ℓ}, where v is an arbitrarily chosen representative of the class;
– W′_ℓ([u], [v]) = ρ_{ℓ−1}(u) Σ_{w∈[u]} W_ℓ(w, v) / ρ_{ℓ−1}(w);
– b′_ℓ([v]) = b_ℓ(v);
– A′_ℓ([v]) = A_ℓ(v).

Despite the arbitrary choice of the representative, we can prove that the reduced NN's behaviour coincides with that of the initial one over all inputs.

Theorem 1. Let N be a NN and R be a proportional exact lumpability over N. It holds that [[N/R]] = [[N]].

Proof (sketch). Let us focus on two neurons v and t belonging to layer 1 that are equivalent in R_1, and let ReLU be the activation function for both of them.
Fig. 3. Pruning one node and updating the network.

On input x_1, x_2, ..., x_m for the nodes u_1, u_2, ..., u_m of layer 0, the nodes v and t take the values

Val(v) = ReLU(Σ_{j=1}^m W_1(u_j, v) x_j + b_1(v)) and
Val(t) = ReLU(Σ_{j=1}^m W_1(u_j, t) x_j + b_1(t)),

respectively. However, since v and t are equivalent, it holds that:

Σ_{j=1}^m W_1(u_j, t) x_j + b_1(t) = (ρ_1(v)/ρ_1(t)) (Σ_{j=1}^m W_1(u_j, v) x_j + b_1(v)).

Since ρ_1(v) and ρ_1(t) are positive numbers, we get that:

Val(t) = ReLU(Σ_{j=1}^m W_1(u_j, t) x_j + b_1(t))
       = (ρ_1(v)/ρ_1(t)) ReLU(Σ_{j=1}^m W_1(u_j, v) x_j + b_1(v))
       = (ρ_1(v)/ρ_1(t)) Val(v).

Let now z be a neuron of layer 2. The value of z depends on

W_2(v, z) Val(v) + W_2(t, z) Val(t) = (W_2(v, z) + (ρ_1(v)/ρ_1(t)) W_2(t, z)) Val(v).
So, the definition of W′_2 takes care of the fact that in the reduced network v represents the equivalence class, while t has been "eliminated". Such a definition ensures that the value of neuron z is unchanged. A formal proof can be obtained by generalizing the above arguments.

Example 3. Figure 3 shows how the pruning technique works on two nodes v, t. In particular, t's input weights are proportional to v's. The algorithm proceeds in two steps. First, t is deleted together with all its input and output edges. Second, the weight from v to z is modified by adding (ρ_ℓ(v)/ρ_ℓ(t)) W_{ℓ+1}(t, z).

The maximum proportional exact lumpability over N, together with the reduced network, can be efficiently computed by proceeding top-down from layer 1 to layer k − 1. Since the network is acyclic, each layer is influenced only by the previous one. Hence, the computation is linear with respect to the number of edges of the network.

Theorem 2. Let N be a NN. There exists a unique maximum proportional exact lumpability R over N. Moreover, R and N/R can be computed in linear time with respect to the size of N, i.e., in time Θ(Σ_{ℓ∈(k]} |S_{ℓ−1}| · |S_ℓ|).

Intuitively, Theorem 1 exploits the following property of ReLU (LeakyReLU):

∀y ∈ R, ∀r ∈ R>0: ReLU(r · y) = r · ReLU(y).
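The construction of the reduced network and the argument of Theorem 1 can be replayed numerically. In the sketch below (all shapes, weights and the proportionality factor are illustrative choices of ours), hidden neuron 3 is proportional to neuron 0 with factor 2; folding 2·W2[3] into W2[0] and deleting neuron 3 leaves the input-output behaviour unchanged on every sampled input.

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda x: np.maximum(x, 0.0)

# Toy two-layer ReLU network. Hidden neuron 3 is constructed to be
# proportional to neuron 0, with ratio rho(v)/rho(t) = 2.
W1 = rng.standard_normal((5, 4))      # W1[u, v]: input u -> hidden v
b1 = rng.standard_normal(4)
W1[:, 3] = 2.0 * W1[:, 0]             # t's incoming weights = 2 * v's
b1[3] = 2.0 * b1[0]                   # t's bias = 2 * v's bias
W2 = rng.standard_normal((4, 3))      # W2[v, z]: hidden v -> output z
b2 = rng.standard_normal(3)

def forward(x, W1, b1, W2, b2):
    return relu(x @ W1 + b1) @ W2 + b2

# Prune neuron 3: Val(3) = 2 * Val(0), so fold 2 * W2[3] into W2[0]
# (the updated weight of Definition 7) and drop neuron 3 everywhere.
W2p = W2.copy()
W2p[0] += 2.0 * W2[3]
W1p, b1p, W2p = W1[:, :3], b1[:3], W2p[:3]

x = rng.standard_normal((10, 5))
assert np.allclose(forward(x, W1, b1, W2, b2), forward(x, W1p, b1p, W2p, b2))
print("reduced network matches the original on all sampled inputs")
```

The assertion holds with no retraining and no data, exactly as Theorem 1 predicts; the only ingredients are the proportionality factor and the weight update.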
This allows us to remove some neurons by exploiting the proportionality relation with others. In order to guarantee the correctness of the removal on all possible inputs, as stated in Theorem 1, it is not possible to exploit less restrictive relationships than proportionality. This fact can also be formally proved, under the hypothesis that the input set is sufficiently rich. However, one could ask what happens if we move from a simple proportionality relation to a linear dependence. For instance, what happens if in Definition 6 we relax the two equations by considering that t is a linear combination of v_1 and v_2, i.e.:

ρ_ℓ(t) b_ℓ(t) = ρ_ℓ(v_1) b_ℓ(v_1) + ρ_ℓ(v_2) b_ℓ(v_2),
ρ_ℓ(t) Σ_{u∈S′} W_ℓ(u, t) = ρ_ℓ(v_1) Σ_{u∈S′} W_ℓ(u, v_1) + ρ_ℓ(v_2) Σ_{u∈S′} W_ℓ(u, v_2).

In this case we could eliminate t by including its contribution on the outgoing edges of both v_1 and v_2. Unfortunately, the behaviour of the network is preserved only for those input values x_1, x_2, ..., x_m which ensure that Σ_{j=1}^m W_ℓ(u_j, v_1) x_j + b_ℓ(v_1) and Σ_{j=1}^m W_ℓ(u_j, v_2) x_j + b_ℓ(v_2) have the same sign, since for all y_1, y_2 ∈ R and r_1, r_2 ∈ R>0,

ReLU(r_1 · y_1 + r_2 · y_2) = r_1 · ReLU(y_1) + r_2 · ReLU(y_2)  iff  y_1 · y_2 ≥ 0.

In other words, our analysis points out that reduction techniques based on linear combinations of neurons can be exploited without retraining the network only when strong hypotheses on the signs of the neurons hold. More sophisticated methods that exploit Principal Component Analysis can be seen as a further shift towards approximation, since they involve not only linear combinations of neurons, but also a change of basis and the elimination of the less significant dimensions.
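The sign condition is easy to check numerically; the coefficients and values below are arbitrary.

```python
import numpy as np

relu = lambda y: np.maximum(y, 0.0)
r1, r2 = 1.0, 2.0

# Same sign: ReLU distributes over the positive combination.
y1, y2 = 3.0, 5.0
print(relu(r1 * y1 + r2 * y2), r1 * relu(y1) + r2 * relu(y2))   # 13.0 13.0

# Opposite signs: the two sides differ, so eliminating t is unsound here.
y1, y2 = 3.0, -5.0
print(relu(r1 * y1 + r2 * y2), r1 * relu(y1) + r2 * relu(y2))   # 0.0 3.0
```

This is the precise point where data-free exactness is lost: as soon as some input drives the two pre-activations to opposite signs, folding t into v_1 and v_2 changes the output.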
5 Experimental Results
To assess the robustness of our method, we set up some simple experiments implementing neural network pruning by lumping. In particular, we want to show how the accuracy is affected when the weights of the node to prune are not simply proportional to the weights of another node in the same layer, but are instead a linear combination of the weights of two or more other nodes. We designed and trained a simple Convolutional Neural Network (CNN) made of two convolutional blocks (32 3 × 3 filters each, both followed by a max-pooling layer); after a simple flatten, we add three fully connected layers with 16, 128 and 10 nodes, respectively, where the last one is the softmax layer. As required by our method, we use only ReLU activations, except for the output layer. We used the benchmark MNIST dataset, consisting of 70,000 28 × 28 greyscale images of handwritten digits divided into 10 classes. After a quick training of the model, we focused on the second-to-last fully connected layer for our pruning method. We randomly selected a subset of nodes in this layer and then manually overwrote the weights of the rest of the nodes in the
same layer as linear combinations of the fixed ones. We then froze this synthetic layer and retrained the network to recover the lost accuracy. The resulting model presents a fully connected layer with 2176 parameters (2048 weights + 128 biases) that can be the target of our pruning method. During the first round of experiments we confirmed that if the weights in the fixed subset all have the same sign, then our method prunes the linearly dependent vectors and the updating step does not introduce any performance loss. Conversely, as illustrated in Fig. 4, when the weights in the subset have different signs, the updating step can introduce some loss. This happens only when the weights are a linear combination of the weights incoming to two or more of the other nodes in the synthetic layer. In particular, the accuracy drops faster as the number of nodes involved in the linear combination increases.
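The synthetic-layer construction described above can be sketched as follows. The layer sizes match the description (16 → 128), while the size of the fixed subset, the number k of combined nodes, and the random coefficients are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_out, k = 16, 128, 2             # 16 -> 128 fc layer; combine k fixed nodes
W = rng.standard_normal((n_in, n_out))  # W[:, j] = incoming weights of node j
b = rng.standard_normal(n_out)

fixed = rng.choice(n_out, size=32, replace=False)   # randomly selected subset
combos = {}
for j in range(n_out):
    if j in fixed:
        continue
    basis = rng.choice(fixed, size=k, replace=False)
    coeff = rng.standard_normal(k)
    W[:, j] = W[:, basis] @ coeff       # overwrite incoming weights ...
    b[j] = b[basis] @ coeff             # ... and bias as a linear combination
    combos[j] = (basis, coeff)

# Every non-fixed node is now linearly dependent on the fixed ones, so the
# linear-combination variant of the pruning discussed in Sect. 4 applies.
```

Freezing this layer and retraining the rest of the network, as described above, then yields a model whose prunable structure is known by construction.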
Fig. 4. Accuracy loss when pruning nodes whose incoming weights are linear combinations of the weights of two, three and four other nodes in the same layer.
6 Conclusion
In this paper we present a data-free filter pruning compression method based on the notion of lumpability. Even though we impose rigid constraints on the weights in order to obtain a reduced network, we also demonstrate that the resulting model exhibits exactly the same behaviour as the original one. Despite the limitations of our method, this work opens the door to a new research field where the aggregation techniques typical of performance evaluation are adopted for network compression, which is usually explored only by the machine learning community. In the future, we would like to further analyze how our algorithm works in different case studies, and in particular to test how an approximation of the linear
dependence would affect the accuracy under different conditions. Another interesting experiment would be to use SVD on the fully connected layers to estimate how many vectors are linearly independent, and therefore to compute the reduction potentially achievable by our method, especially for quantized networks.

Acknowledgements. This work has been partially supported by the Project PRIN 2020 "Nirvana - Non-interference and Reversibility Analysis in Private Blockchains" and by the Project GNCS 2022 "Proprietà qualitative e quantitative di sistemi reversibili".
References

1. Alzetta, G., Marin, A., Piazza, C., Rossi, S.: Lumping-based equivalences in Markovian automata: algorithms and applications to product-form analyses. Inf. Comput. 260, 99–125 (2018)
2. Ashiquzzaman, A., Van Ma, L., Kim, S., Lee, D., Um, T.W., Kim, J.: Compacting deep neural networks for light weight IoT & SCADA based applications with node pruning. In: 2019 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), pp. 082–085. IEEE (2019)
3. Baker, B., Gupta, O., Naik, N., Raskar, R.: Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167 (2016)
4. Blalock, D., Gonzalez Ortiz, J.J., Frankle, J., Guttag, J.: What is the state of neural network pruning? Proc. Mach. Learn. Syst. 2, 129–146 (2020)
5. Bossi, A., Focardi, R., Macedonio, D., Piazza, C., Rossi, S.: Unwinding in information flow security. Electron. Notes Theor. Comput. Sci. 99, 127–154 (2004)
6. Buchholz, P.: Exact and ordinary lumpability in finite Markov chains. J. Appl. Probab. 31, 59–75 (1994)
7. Carroll, J.D., Chang, J.J.: Analysis of individual differences in multidimensional scaling via an n-way generalization of "Eckart-Young" decomposition. Psychometrika 35(3), 283–319 (1970). https://doi.org/10.1007/BF02310791
8. Castellano, G., Fanelli, A.M., Pelillo, M.: An iterative pruning algorithm for feedforward neural networks. IEEE Trans. Neural Netw. 8(3), 519–531 (1997)
9. Dai, Z., Liu, H., Le, Q.V., Tan, M.: CoAtNet: marrying convolution and attention for all data sizes. In: Advances in Neural Information Processing Systems, vol. 34, pp. 3965–3977 (2021)
10. Dang, T., Dreossi, T., Piazza, C.: Parameter synthesis using parallelotopic enclosure and applications to epidemic models. In: Maler, O., Halász, Á., Dang, T., Piazza, C. (eds.) HSB 2014. LNCS, vol. 7699, pp. 67–82. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-27656-4_4
11. Dang, T., Dreossi, T., Piazza, C.: Parameter synthesis through temporal logic specifications. In: Bjørner, N., de Boer, F. (eds.) FM 2015. LNCS, vol. 9109, pp. 213–230. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-19249-9_14
12. Deng, L., Li, G., Han, S., Shi, L., Xie, Y.: Model compression and hardware acceleration for neural networks: a comprehensive survey. Proc. IEEE 108(4), 485–532 (2020)
13. Denton, E.L., Zaremba, W., Bruna, J., LeCun, Y., Fergus, R.: Exploiting linear structure within convolutional networks for efficient evaluation. In: Advances in Neural Information Processing Systems, pp. 1269–1277 (2014)
14. Elsken, T., Metzen, J.H., Hutter, F.: Neural architecture search: a survey. J. Mach. Learn. Res. 20(1), 1997–2017 (2019)
88
D. Ressi et al.
15. Frankle, J., Carbin, M.: The lottery ticket hypothesis: ﬁnding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635 (2018) 16. Gallina, L., Hamadou, S., Marin, A., Rossi, S.: A probabilistic energyaware model for mobile adhoc networks. In: AlBegain, K., Balsamo, S., Fiems, D., Marin, A. (eds.) ASMTA 2011. LNCS, vol. 6751, pp. 316–330. Springer, Heidelberg (2011). https://doi.org/10.1007/9783642217135_23 17. Han, S., Mao, H., Dally, W.J.: Deep compression: compressing deep neural networks with pruning, trained quantization and Huﬀman coding. arXiv preprint arXiv:1510.00149 (2015) 18. Han, S., Pool, J., Tran, J., Dally, W.J.: Learning both weights and connections for eﬃcient neural networks. arXiv preprint arXiv:1506.02626 (2015) 19. Harshman, R.A., et al.: Foundations of the PARAFAC procedure: models and conditions for an “explanatory” multimodal factor analysis (1970) 20. He, Y., Liu, P., Wang, Z., Hu, Z., Yang, Y.: Filter pruning via geometric median for deep convolutional neural networks acceleration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4340–4349 (2019) 21. Hillston, J.: A compositional approach to performance modelling. Ph.D. thesis, Department of Computer Science, University of Edinburgh (1994) 22. Hillston, J., Marin, A., Piazza, C., Rossi, S.: Contextual lumpability. In: Proceedings of ValueTools 2013 Conference, pp. 194–203. ACM Press (2013) 23. Howard, A., et al.: Searching for MobileNetV3. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1314–1324 (2019) 24. Howard, A.G., et al.: MobileNets: eﬃcient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017) 25. Hu, H., Peng, R., Tai, Y.W., Tang, C.K.: Network trimming: a datadriven neuron pruning approach towards eﬃcient deep architectures. arXiv preprint arXiv:1607.03250 (2016) 26. 
Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K.: SqueezeNet: AlexNetlevel accuracy with 50x fewer parameters and < 0.5 mb model size. arXiv preprint arXiv:1602.07360 (2016) 27. Kemeny, J.G., Snell, J.L.: Finite Markov Chains. Springer, New York (1976) 28. Kolesnikov, A., et al.: Big Transfer (BiT): general visual representation learning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) ECCV 2020. LNCS, vol. 12350, pp. 491–507. Springer, Cham (2020). https://doi.org/10.1007/9783030585587_29 29. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classiﬁcation with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, vol. 25 (2012) 30. LeCun, Y., Denker, J.S., Solla, S.A.: Optimal brain damage. In: Advances in Neural Information Processing Systems, pp. 598–605 (1990) 31. Li, H., Kadav, A., Durdanovic, I., Samet, H., Graf, H.P.: Pruning ﬁlters for eﬃcient convnets. arXiv preprint arXiv:1608.08710 (2016) 32. Lin, M., Chen, Q., Yan, S.: Network in network. arXiv preprint arXiv:1312.4400 (2013) 33. Lin, M., et al.: HRank: ﬁlter pruning using highrank feature map. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1529–1538 (2020) 34. Lin, S., et al.: Towards optimal structured CNN pruning via generative adversarial learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2790–2799 (2019)
NNs Reduction via Lumping
89
35. Liu, Y., Sun, Y., Xue, B., Zhang, M., Yen, G.G., Tan, K.C.: A survey on evolutionary neural architecture search. IEEE Trans. Neural Netw. Learn. Syst. 34(2), 550–570 (2023) 36. Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., Zhang, C.: Learning eﬃcient convolutional networks through network slimming. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2736–2744 (2017) 37. Ma, N., Zhang, X., Zheng, H.T., Sun, J.: ShuﬄeNet V2: practical guidelines for eﬃcient CNN architecture design. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) Computer Vision – ECCV 2018. LNCS, vol. 11218, pp. 122–138. Springer, Cham (2018). https://doi.org/10.1007/9783030012649_8 38. Marin, A., Piazza, C., Rossi, S.: Proportional lumpability. In: André, É., Stoelinga, M. (eds.) FORMATS 2019. LNCS, vol. 11750, pp. 265–281. Springer, Cham (2019). https://doi.org/10.1007/9783030296629_16 39. Marin, A., Piazza, C., Rossi, S.: Proportional lumpability and proportional bisimilarity. Acta Informatica 59(2), 211–244 (2022). https://doi.org/10.1007/s0023602100404y 40. Molchanov, P., Mallya, A., Tyree, S., Frosio, I., Kautz, J.: Importance estimation for neural network pruning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11264–11272 (2019) 41. Piazza, C., Rossi, S.: Reasoning about proportional lumpability. In: Abate, A., Marin, A. (eds.) QEST 2021. LNCS, vol. 12846, pp. 372–390. Springer, Cham (2021). https://doi.org/10.1007/9783030851729_20 42. Prabhakar, P.: Bisimulations for neural network reduction. In: Finkbeiner, B., Wies, T. (eds.) VMCAI 2022. LNCS, vol. 13182, pp. 285–300. Springer, Cham (2022). https://doi.org/10.1007/9783030945831_14 43. Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: XNORNet: ImageNet classiﬁcation using binary convolutional neural networks. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 525–542. Springer, Cham (2016). 
https://doi.org/10.1007/9783319464930_32 44. Ren, P., et al.: A comprehensive survey of neural architecture search: challenges and solutions. ACM Comput. Surv. (CSUR) 54(4), 1–34 (2021) 45. Ressi, D., Pistellato, M., Albarelli, A., Bergamasco, F.: A relevancebased CNN trimming method for lowresources embedded vision. In: Bandini, S., Gasparini, F., Mascardi, V., Palmonari, M., Vizzari, G. (eds.) AIxIA 2021 – Advances in Artiﬁcial Intelligence, AIxIA 2021. Lecture Notes in Computer Science, vol. 13196, pp. 297– 309. Springer, Cham (2022). https://doi.org/10.1007/9783031084218_20 46. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV2: inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520 (2018) 47. Schweitzer, P.: Aggregation methods for large Markov chains. In: Procedings of the International Workshop on Computer Performance and Reliability, pp. 275–286. North Holland (1984) 48. Sproston, J., Donatelli, S.: Backward stochastic bisimulation in CSL model checking. In: 2004 First International Conference on the Quantitative Evaluation of Systems, QEST 2004. Proceedings, pp. 220–229. IEEE (2004) 49. Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015) 50. Szegedy, C., Vanhoucke, V., Ioﬀe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference On Computer Vision and Pattern Recognition, pp. 2818–2826 (2016)
90
D. Ressi et al.
51. Tan, C.M.J., Motani, M.: DropNet: reducing neural network complexity via iterative pruning. In: International Conference on Machine Learning, pp. 9356–9366. PMLR (2020) 52. Tan, M., Le, Q.: EﬃcientNet: rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114. PMLR (2019) 53. Tan, M., Le, Q.V.: EﬃcientNetV2: smaller models and faster training. arXiv preprint arXiv:2104.00298 (2021) 54. Tucker, L.R.: Some mathematical notes on threemode factor analysis. Psychometrika 31(3), 279–311 (1966). https://doi.org/10.1007/BF02289464 55. Wang, Z., Xie, X., Shi, G.: RFPruning: a retrainingfree pruning method for accelerating convolutional neural networks. Appl. Soft Comput. 113, 107860 (2021) 56. Xiao, L., Bahri, Y., SohlDickstein, J., Schoenholz, S., Pennington, J.: Dynamical isometry and a mean ﬁeld theory of CNNs: how to train 10,000layer vanilla convolutional neural networks. In: International Conference on Machine Learning, pp. 5393–5402. PMLR (2018) 57. Yu, J., Wang, Z., Vasudevan, V., Yeung, L., Seyedhosseini, M., Wu, Y.: CoCa: contrastive captioners are imagetext foundation models. arXiv preprint arXiv:2205.01917 (2022) 58. Yu, R., et al.: NISP: pruning networks using neuron importance score propagation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9194–9203 (2018) 59. Zhang, X., Zhou, X., Lin, M., Sun, J.: ShuﬄeNet: an extremely eﬃcient convolutional neural network for mobile devices. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6848–6856 (2018) 60. Zoph, B., Le, Q.V.: Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578 (2016)
Knowledge Enhanced Neural Networks for Relational Domains

Alessandro Daniele(B) and Luciano Serafini

Data and Knowledge Management Research Unit, Fondazione Bruno Kessler, Trento, Italy
{daniele,serafini}@fbk.eu

Abstract. In the recent past, there has been a growing interest in Neural-Symbolic Integration frameworks, i.e., hybrid systems that integrate connectionist and symbolic approaches to obtain the best of both worlds. In this work we focus on a specific method, KENN (Knowledge Enhanced Neural Networks), a Neural-Symbolic architecture that injects prior logical knowledge into a neural network by adding, on top of it, a residual layer that modifies the initial predictions according to the knowledge. Among the advantages of this strategy is the inclusion of clause weights, learnable parameters that represent the strength of the clauses, meaning that the model can learn the impact of each rule on the final predictions. As a special case, if the training data contradicts a constraint, KENN learns to ignore it, making the system robust to the presence of wrong knowledge. In this paper, we propose an extension of KENN for relational data. One of the main advantages of KENN resides in its scalability, thanks to a flexible treatment of dependencies between the rules, obtained by stacking multiple logical layers. We show experimentally the efficacy of this strategy: the results show that KENN is capable of increasing the performance of the underlying neural network, obtaining better or comparable accuracy with respect to two other related methods that combine learning with logic, while requiring significantly less time for learning.
1 Introduction
In the last decade, deep learning approaches gained a lot of interest in the AI community, becoming the state of the art in many fields, such as Computer Vision [17], Machine Translation [2], and Speech Recognition [14]. Indeed, neural networks (NNs) are suited for pattern recognition, even in the presence of noisy data. They are particularly good at mapping low-level perceptions to more abstract concepts (for instance, going from images to classes). However, it is hard for a NN to reason with these high-level abstractions. Furthermore, NNs are demanding in terms of training data. On the other hand, pure logical approaches are not suited for learning from low-level features and they struggle in the presence of noise. Nevertheless, they perform well in reasoning with highly abstract concepts and learning from a small number of samples. Given these opposite strengths and weaknesses, it is not a surprise that a lot of interest
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
A. Dovier et al. (Eds.): AIxIA 2022, LNAI 13796, pp. 91–109, 2023. https://doi.org/10.1007/978-3-031-27181-6_7
92
A. Daniele and L. Seraﬁni
has been drawn toward Neural-Symbolic (NeSy) systems. Indeed, the goal is to combine these two paradigms to obtain the best of the two worlds. Among NeSy methods there is KENN (Knowledge Enhanced Neural Network) [6], a model composed of a neural network enhanced with additional layers which codify logical knowledge. KENN has multiple advantages over other NeSy methods, such as its capacity to learn clause weights and the ability to impose the knowledge not only during training but also at inference time. In particular, KENN showed remarkable results on the Predicate Detection task of the Visual Relationship Detection dataset (VRD dataset) [19] using the manually curated prior knowledge proposed by [9], outperforming the previous state-of-the-art results, with particularly good performance on the Zero Shot Learning subtask [6]. Moreover, it outperformed Logic Tensor Networks [29], one of its major competitors, using the same knowledge. Despite these good empirical results, KENN has been applied only to multi-label classification tasks with no relational data. Indeed, a limitation of KENN resides in its inability to take into account binary predicates. This is because KENN expects the NN's predictions to be stored in a matrix format, where the columns represent different unary predicates and the rows their possible groundings (i.e., substitutions of the free variable of such predicates). For this reason, it is not straightforward to apply KENN to relational data, where binary predicates are available. In this paper, we propose an updated version of KENN which can deal with relational data. Particular attention was paid to defining a scalable strategy to deal with binary predicates, obtaining good performance in terms of execution time. Indeed, KENN assumes independence between the logical rules, allowing for a scalable inclusion of the underlying knowledge. However, this assumption is often violated in real scenarios, in particular in the context of relational domains.
To deal with this problem, we propose a strategy that consists of adding multiple logical layers inside the model. We provide proof of the eﬃcacy of this strategy in a simple scenario with two logical rules. Additionally, we tested this idea on Citeseer, a dataset for Collective Classiﬁcation [28], showing that the additional layers improve the performance of the model. Moreover, the experiments on this dataset provide a comparison between KENN and two other approaches: Semantic Based Regularization (SBR) [8] and Relational Neural Machines (RNM) [22].
2 Related Works
Many previous works attempt to combine learning models with logical knowledge. Among them there is Statistical Relational Learning (SRL), a subﬁeld of Machine Learning that aims at applying statistical methods in domains that exhibit both uncertainty and relational structure [16]. Generally speaking, SRL deals with the knowledge either by combining logic rules with probabilistic graphical models (e.g., Markov Logic Networks [25], and Probabilistic Soft Logic (PSL) [1]) or by extending logic programming languages to handle uncertainty (e.g., ProbLog [7]).
Relational KENN
The recent achievements of deep learning methods led to a renewed interest in another line of research, called Neural-Symbolic Integration, which focuses on combining neural network architectures with logical knowledge [3]. This can be achieved in multiple ways depending on the role of the knowledge. For instance, works like TensorLog [5], Neural Theorem Prover (NTP) [26,27], DeepProbLog [21], Neural Logic Machines [10], and NeuralLog [13] focus on the development of differentiable approaches for reasoning, which can be used in combination with neural networks. Another line of research comes from methods like ∂ILP [4,11] and Neural Logic Rule Layer (NLRL) [24]. In these cases, the goal is to learn general knowledge from the data, either from scratch or by refining an initial knowledge. Finally, more related to our purposes, some methods focus on learning in the presence of prior knowledge, which acts as additional supervision. In this section, we focus on these types of methods, since KENN falls in this category. There are mainly two approaches for learning in the presence of prior knowledge: the first consists of treating logical rules as constraints on the predictions of the neural network; the problem is reduced to maximizing the satisfiability of the constraints and can be efficiently tackled by adding a regularization term to the loss function. The second approach is to modify the neural network by injecting the knowledge into its structure. Two notable examples of regularization approaches are Logic Tensor Network (LTN) [29] and Semantic Based Regularization (SBR) [8]. Both methods maximize the satisfaction of the constraints, expressed as FOL formulas, under a fuzzy logic semantics. A similar strategy is employed also by Semantic Loss Function [30], but instead of relying on fuzzy logic, it optimizes the probability of the rules being true. Nevertheless, this approach is restricted to propositional logic. [12] introduces DL2.
Nonetheless, it can be used only in the context of regression tasks, where the predicates correspond to comparison constraints (e.g., =, ≠, ≤). [23] also proposes a method that regularizes the loss, but they focus on a specific task of Natural Language Processing. Their approach differs from the others because it makes use of adversarial examples to calculate the regularization term. Finally, in [15], a distillation mechanism is used to inject FOL rules: here a teacher network (which encodes the rules) is used to regularize the loss applied to a student network. Approaches based on regularization force the satisfaction of the constraints solely at training time. As a consequence, there are no guarantees that they will be satisfied at inference time as well. Instead, model-based methods inject knowledge directly into the model structure, and they are naturally capable of enforcing the knowledge at inference time. Another advantage is the possibility to learn a weight that codifies the importance of a logical rule directly from the data. This is not possible with methods based on regularization, since the logical formulas are directly codified inside the loss function. Among the model-based approaches there is KENN, a framework that injects knowledge on top of the NN model through an additional layer which increases the satisfaction of the constraints under a fuzzy logic semantics. Another approach is provided by Li and Srikumar, who recently proposed a method that codifies the
logical constraints directly into the neural network model [18]. However, they restrict the rules to implications with exactly one consequent, and they do not provide the possibility to learn clause weights, which in their system are added as hyperparameters. Going in the same direction, [22] proposed Relational Neural Machines (RNM). RNM can also be inserted in the set of approaches that add the logic directly into the model and, to the best of our knowledge, it is the only method other than KENN which is capable of integrating logical knowledge with a neural network while learning the clause weights. RNM integrates a neural network model with a FOL reasoner. This is done in two stages: in the first one, the NN is used to calculate initial predictions for the atomic formulas; in the second stage, a graphical model is used to represent a probability distribution over the set of atomic formulas. To obtain the final predictions, a Maximum a Posteriori (MAP) estimation is performed, finding the most probable assignment to the grounded atoms given the output of the NN and the set of constraints. At a high level the RNM approach is similar to KENN, since in both cases a NN makes initial predictions and a post-elaboration step is applied to such predictions to provide the final classification. However, RNM requires solving an optimization problem at inference time and after each training step. This has the advantage of considering all the logical rules together at the same time, at the expense of an increased computational effort. On the contrary, in KENN each rule is considered separately from the others, and the second stage is directly integrated inside the model as a differentiable function that can be trained end-to-end with the NN. However, with this strategy there could be some contradictory changes when combining multiple clauses with the same predicates. We will further analyze this aspect in Sect. 3.4, proposing a strategy to handle this limitation. Moreover, in Sect.
4, we analyze this strategy empirically.
3 Knowledge Enhanced Neural Networks
We define the prior knowledge in terms of formulas of a function-free first-order language L. Its signature is defined by a set of domain constants C = {a_1, a_2, ..., a_m} and a set of predicates P = {P_1, P_2, ..., P_q}. In our setting, predicates can be unary or binary. Binary predicates can express relations among pairs of objects in the domain, e.g., Friends(a, b) states that person a is a friend of b. The prior knowledge is defined as a set of clauses: K = {c_1, c_2, ..., c_r}. A clause is a disjunction of literals, each of which is a possibly negated atom:

$c = \bigvee_{i=1}^{k} l_i$

where k is the number of literals in c and l_i is the i-th literal. We assume that there are no repeated literals. Since we are interested in representing only general knowledge, the literals do not contain any constant, only variables that are assumed to be universally quantified. If the predicate is binary, the two variables are x and y, otherwise only x. When an entire clause contains only the variable x (i.e., only unary predicates), we call it unary. Similarly, if it contains both x and y, we call it binary (we restrict to the case where clauses contain at most two variables).
As an example, the clause ¬Smoker(x) ∨ Cancer(x) is unary and states that all smokers also have cancer (notice that the clauses are not assumed to be hard constraints). Instead, the clause

¬Smoker(x) ∨ ¬Friends(x, y) ∨ Smoker(y)   (1)

is binary. It states that if a person x is a smoker and he is a friend of another person y, then y is also a smoker. We will use this clause extensively in the remainder of the paper, referring to it as c_SF. We define the grounding of a unary clause c, denoted by c[a], as the clause obtained by substituting the variable x with the constant a. Similarly, if c is binary, its grounding c[a, b] is obtained by substituting x and y with a and b, respectively. For instance, the grounding c_SF[a, b] of the clause defined in Eq. 1 corresponds to ¬Smoker(a) ∨ ¬Friends(a, b) ∨ Smoker(b).
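As a concrete illustration, clauses and their groundings can be represented with simple data structures. The following is a minimal sketch (the representation is ours, chosen for illustration, not KENN's actual implementation):

```python
from itertools import permutations

# A literal is (sign, predicate, variables); variables are universally
# quantified. c_SF: not Smoker(x) or not Friends(x, y) or Smoker(y)
c_sf = [(-1, "Smoker", ("x",)),
        (-1, "Friends", ("x", "y")),
        (1,  "Smoker", ("y",))]

def ground(clause, a, b):
    """Grounding c[a, b]: substitute x -> a and y -> b."""
    subst = {"x": a, "y": b}
    return [(s, p, tuple(subst[v] for v in vs)) for s, p, vs in clause]

# all groundings of c_SF over the constants {a, b, c}
groundings = [ground(c_sf, u, v) for u, v in permutations("abc", 2)]
```

For instance, `ground(c_sf, "a", "b")` yields the grounded clause ¬Smoker(a) ∨ ¬Friends(a, b) ∨ Smoker(b).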
3.1 KENN Architecture
Suppose we have a NN for a classification task which takes as input a matrix x ∈ R^{d×n} containing n features for each of d samples, and returns an output y ∈ [0, 1]^{d×q} which contains the predictions for q classes corresponding to the q predicates. A prior knowledge K is also provided. It can be used by KENN to improve the predictions of the NN. Figure 1(left) shows a high-level overview of KENN, where a residual layer, called Knowledge Enhancer (KE), is inserted between the NN and the final activation function. The role of the KE is to revise the final predictions returned by the NN in order to increase the truth value of each clause c ∈ K. It does so by calculating a residue δ, a matrix that is added to the predictions of the NN.
Fig. 1. Model architecture. Left: KENN model overview. Right: Knowledge Enhancer.
The KE works in the preactivation space, i.e., on z, and the activation function (σ) is applied afterwards. In order for KENN to work, the activation function must be monotonic and return values in the range [0, 1] (for more details on why the KE is applied to preactivations, please refer to [6]). Since both the NN and the KE are
differentiable, the entire architecture is differentiable end-to-end, making it possible to apply the backpropagation algorithm to the whole model. Figure 1(right) shows the architecture of the KE, which calculates the residual matrix δ. More in detail, for each clause c ∈ K, the KE contains a submodule, the Clause Enhancer (CE), which proposes the changes δ_c to be applied to the NN's preactivations in order to increase the satisfaction of c. Indeed, the CE computes a soft differentiable approximation of a function called t-conorm boost function (TBF). Intuitively, a TBF is a function φ : R^k → R^k_+ that proposes the changes to be applied to the preactivations z of k truth values, such that ⊥(σ(z + φ(z))) ≥ ⊥(σ(z)), where ⊥ : [0, 1]^k → [0, 1] is a t-conorm function, used in fuzzy logic to represent the semantics of the disjunction operator. In [6] the function

$\phi(z)_i = \begin{cases} 1 & \text{if } i = \arg\max_{j=1}^{n} z_j \\ 0 & \text{otherwise} \end{cases}$   (2)

was defined and proved to be the optimal TBF for the Gödel t-conorm. KENN employs the softmax function as a continuous and differentiable approximation of φ. The δ_c matrices are combined linearly inside the KE to obtain the final change δ to be applied to the NN's predictions; finally, δ is summed to the initial preactivations z and passed to the activation function:

$y_{P(a)} = \sigma\big( z_{P(a)} + \sum_{c \in K,\, P(x) \in c} w_c \cdot \delta_{c[a], P(a)} \big)$   (3)

where w_c is the clause weight, P(a) a grounded atom, y_{P(a)} its final prediction, and z_{P(a)} the NN's preactivation. Finally, δ_{c[a],P(a)} is the change applied to P(a) based on the grounded clause c[a]:

$\delta_{c[a], P(a)} = \begin{cases} \phi(z_c)_{P(a)} & \text{if } P(a) \in c[a] \\ -\phi(z_c)_{\neg P(a)} & \text{if } \neg P(a) \in c[a] \end{cases}$   (4)

where z_c are the preactivations of the literals of c. Note that applying a linear combination of the δ_c matrices can be done under the assumption of independence between the clauses. When multiple clauses share common predicates, the changes proposed by KENN could only partially improve the satisfaction of the knowledge. We will further analyze this problem in Sect. 3.4. Note that, when the NN predictions satisfy the constraints, the effect of the KE is to increase the confidence of the current predictions. Therefore, if the NN predictions are correct with respect to the ground truth, the clause weights tend to increase during learning.

3.2 Extending KENN for Relational Domains
In the architecture defined so far, the groundings involve a single object and z is defined as a matrix, where columns represent predicates and rows constants. (In [6], the function φ is called δ. Here we changed the name to avoid confusion with its output, which is also referred to as δ.)
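To make the mechanism of Eqs. 2–4 concrete, here is a minimal NumPy sketch of a single Clause Enhancer, using softmax as the soft approximation of the optimal TBF (the names are ours, not KENN's actual API):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def clause_enhancer(z_pred, signs, weight):
    """Changes proposed by one grounded clause (Eq. 4).
    z_pred: preactivations of the predicates appearing in the clause;
    signs:  +1 for a positive literal, -1 for a negated one."""
    z_lit = signs * z_pred         # preactivation of each literal
    boost = softmax(z_lit)         # soft version of the TBF of Eq. 2
    return weight * signs * boost  # positive literals raised, negated lowered

# clause: not Smoker(a) or Cancer(a); columns = [Smoker, Cancer]
z = np.array([2.0, -1.0])
delta = clause_enhancer(z, signs=np.array([-1.0, 1.0]), weight=1.0)
y = 1.0 / (1.0 + np.exp(-(z + delta)))  # final activation (Eq. 3)
```

Here the literal Cancer(a) has the highest truth value among the clause's literals, so most of the boost raises Cancer(a) while Smoker(a) is slightly lowered, increasing the satisfaction of ¬Smoker(a) ∨ Cancer(a).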
Figure 2(left) introduces the representation of z: it is defined as a matrix such that the element z_{ij} contains the preactivation of P_j(a_i), with P_j the j-th predicate and a_i the i-th constant. Note that this kind of representation is common when working with neural networks, since the columns (predicates) correspond to the labels and the rows (groundings) to the samples. An important aspect of this representation lies in the fact that each grounded atom can be found in the matrix exactly once. This makes it possible to parallelize the computation of Eq. 3, since a grounded clause involves only atoms in the same row, and each row can be managed in parallel inside a GPU. This can be done only if the same atom does not appear in multiple rows, since the changes are applied independently to each row and are not aggregated together. This property always holds with unary clauses.
Fig. 2. The representation of NN’s ﬁnal preactivations. Left: unary case. Right: representation of relational data. Preactivations are represented as integers instead of reals to simplify the ﬁgure. (Color ﬁgure online)
To represent relational data, we extend KENN with an extra matrix z_B, which contains the binary predicates' preactivations. For uniformity of notation, we use z_U to denote the unary matrix z of the non-relational KENN. Matrix z_B contains one row for every pair of objects we are interested in and a column for each binary predicate. Figure 2(right) shows this representation using the classical Smoker-Friends-Cancer example, where the domain is composed of three constants (persons) C = {a, b, c}, the unary predicates are S and C (for Smoker and Cancer), and there is a binary predicate F (for Friends). The blue box shows the graph representation, with nodes and edges labelled with the preactivations of unary and binary predicates, respectively. The grey box shows the corresponding matrix representation used by KENN. Notice that it is not required that the entire graph is computed by the NN. For instance, in the experiments on Citeseer, the Cite predicate is provided directly as a feature (see Sect. 4). The architecture of KENN for relational domains is very similar to the architecture of traditional KENN of Fig. 1, with the KE substituted by a Relational KE (RKE). From a high-level perspective, the RKE differs from the traditional KE in the number of inputs and outputs. As seen before, in the relational case the preactivations are divided into two different matrices (z_U and z_B) and, as a consequence, the δ matrix and the predictions y are also split into unary and binary matrices (δ_U and δ_B for the residues, y_U and y_B for the final predictions).
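The two-matrix representation just described can be sketched as follows for the Smoker-Friends-Cancer example (the preactivation values below are chosen arbitrarily for illustration):

```python
import numpy as np

constants = ["a", "b", "c"]

# z_U: one row per constant, one column per unary predicate [S, C]
zU = np.array([[ 2.0, -1.0],   # S(a), C(a)
               [ 0.5,  0.3],   # S(b), C(b)
               [-1.2,  0.8]])  # S(c), C(c)

# z_B: one row per pair of interest, one column per binary predicate [F]
pairs = [("a", "b"), ("b", "c")]
zB = np.array([[3.0],          # F(a, b)
               [1.5]])         # F(b, c)

# each grounded atom appears exactly once, e.g. the preactivation of S(b):
z_Sb = zU[constants.index("b"), 0]
```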
The RKE has the same role as the KE in the unary case. However, it is also capable of considering binary predicates. When binary knowledge is available, additional steps are required, since independence between objects can no longer be assumed. Let K_U be the set of unary clauses and K_B the set of binary clauses; the prior knowledge is now defined as K = K_U ∪ K_B. The idea is to apply the KE to these two sets separately. Equation 3 can be decomposed using the newly defined partition of the knowledge:

$y_A = \sigma\big( z_A + \sum_{c \in K_U[C]} w_c \cdot \delta_{c, A} + \sum_{c \in K_B[C]} w_c \cdot \delta_{c, A} \big)$

where A is a grounded atom (i.e., P(a) or P(a, b), depending on the arity of P). We define δ_{K_U} as the changes deriving from the unary clauses:

$\delta_{K_U, P(a)} = \sum_{c \in K_U,\, P(x) \in c} w_c \cdot \delta_{c[a], P(a)}$   (5)

Similarly, δ_{K_B} are the changes calculated from K_B. Notice that the approach defined so far can be directly applied to the unary knowledge K_U to calculate δ_U, since the traditional KE can manage unary knowledge. Indeed, internally the RKE contains a standard KE which manages the unary clauses. We need to define a strategy to deal with binary clauses. Indeed, when a clause c contains two variables, a grounding of a unary predicate may occur in multiple groundings of c. For instance, consider the clause of Eq. 1: the two groundings c_SF[a, b] and c_SF[b, c] share a common grounded atom, Smoker(b). For this reason, when dealing with the predictions of a unary predicate in a relational domain, we need to account for such repetitions:

$\delta_{K_B, P(a)} = \sum_{b \neq a} \Big( \sum_{c \in K_B,\, P(x) \in c} w_c \cdot \delta_{c[a,b], P(a)} + \sum_{c \in K_B,\, P(y) \in c} w_c \cdot \delta_{c[b,a], P(a)} \Big)$   (6)

Putting it all together, the prediction y_{P(a)} for a grounded unary predicate P is:

$y_{P(a)} = \sigma\big( z_{P(a)} + \delta_{K_U, P(a)} + \delta_{K_B, P(a)} \big)$   (7)

The predictions for a binary predicate R can be found only in binary clauses, and any possible grounding of R can be found in only one corresponding grounding of each clause:

$y_{R(a,b)} = \sigma\big( z_{R(a,b)} + \delta_{K_B, R(a,b)} \big), \quad \text{with} \quad \delta_{K_B, R(a,b)} = \sum_{c \in K_B,\, R(x,y) \in c} w_c \cdot \delta_{c[a,b], R(a,b)}$   (8)
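The aggregation of Eqs. 6–8 can be sketched for the single clause c_SF: every grounding proposes its changes, and contributions landing on the same grounded atom (such as Smoker(b) in c_SF[a, b] and c_SF[b, c]) are summed. A simplified sketch with our own names, not KENN's implementation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def relational_ke(z_S, z_F, pairs, w):
    """Deltas for the clause: not S(x) or not F(x,y) or S(y) (Eqs. 6 and 8).
    z_S: Smoker preactivations per constant; z_F: Friends per pair."""
    dU = np.zeros_like(z_S)
    dB = np.zeros_like(z_F)
    for k, (a, b) in enumerate(pairs):
        # literal preactivations: [not S(a), not F(a,b), S(b)]
        boost = w * softmax(np.array([-z_S[a], -z_F[k], z_S[b]]))
        dU[a] -= boost[0]   # negated literal: lower S(a)
        dB[k] -= boost[1]   # negated literal: lower F(a,b)
        dU[b] += boost[2]   # positive literal: raise S(b)
    return dU, dB

z_S = np.array([2.0, 0.5, -1.2])        # S(a), S(b), S(c)
z_F = np.array([3.0, 1.5])              # F(a,b), F(b,c)
dU, dB = relational_ke(z_S, z_F, [(0, 1), (1, 2)], w=1.0)
# dU[1] sums the contributions of both groundings to the shared atom S(b)
```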
3.3 Time Complexity
Here we analyze the time complexity of an RKE layer with respect to the domain size m, the number of predicates P, and the number of rules K. We also assume the maximum number L of literals in a clause to be a small constant. Let us first analyze the time complexity of calculating the δ_{c[a]} used in Eqs. 5 and 6. Each δ_{c[a],P(a)} can be calculated in time O(1) (see Eq. 4). Computing δ_{c[a]}
also requires constant time. The sum in Eq. 5 requires time O(K), which is the time necessary to compute δ_{K_U,P(a)}. Note that neural networks are usually run on GPUs, where the computations can be parallelized. Assuming enough parallel processes (K in this case), a sum can be performed in time logarithmic in the number of addends, and the complexity of δ_{K_U,P(a)} becomes O(log(K)). Finally, Eq. 5 needs to be calculated for all the grounded unary predicates P(a), for a total time of O(m · P · K) in a single process, and O(log(K)) with multiple parallel processes (each grounded atom can be considered independently of the others). With similar reasoning, we find the time complexity of Eqs. 6 and 8 to be O(m² · P · K). Note that with enough parallel processes we can compute all the deltas in O(log(m) + log(K)).

3.4 Treatment of Dependencies Among the Rules
In the previous section we showed the efficacy of the method in terms of execution time, which can be achieved thanks to the assumption of independence. However, when this assumption is violated, KENN provides no guarantees on the satisfaction of the knowledge. As an example, suppose that we have two grounded clauses c1: ¬A ∨ B and c2: ¬B ∨ C, with their respective clause enhancers CE1 and CE2, where A, B, and C are grounded unary or binary predicates. The atom B appears in both clauses with opposite signs. Since a CE increases the highest literal value (see Eq. 2), if A < B⁴ and C < ¬B, then CE1 increases B and CE2 decreases it. As a consequence, the satisfaction of only one of c1 and c2 is increased. The satisfaction of the entailed clause ¬A ∨ C is also not improved. For any grounded atom G, let us define G^(0) as its initial prediction and G^(i) as the prediction of the i-th KE layer. Moreover, suppose that all KEs share the same clause weights w1 and w2 (for c1 and c2, respectively). From Eqs. 3 and 4 we can derive B^(1) = B^(0) + w1 − w2 and ¬B^(1) = ¬B^(0) + w2 − w1. If w1 ≥ w2, then A^(1) = A^(0) < B^(0) ≤ B^(1). As a consequence, the first rule will increase B again at the next KE layer. On the other hand, the value of ¬B is reduced, which means that there is an increased chance that C^(1) > ¬B^(1), which would solve the problem, since CE2 would then increase C instead of ¬B. Notice that, since the weights are the same at each level, it is always true that ¬B^(i+1) ≤ ¬B^(i), meaning that with enough KE layers the satisfaction of both clauses will be increased (and, as a consequence, also that of their entailments). The problem analyzed in this section becomes even more relevant in relational domains since, in these contexts, an atom can be shared not only by multiple clauses but also by different groundings of the same clause (for instance, in c_SF[a, b] and c_SF[b, c]). For this reason, in these contexts stacking multiple RKEs is recommended (more details in Sect. 4.2).
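The dynamics described above can be simulated with a toy model. This is a hedged sketch, not the authors' implementation: we simplify Eq. 2 by having each clause enhancer add its whole clause weight to the single highest literal of its clause (real KENN distributes the boost with a softmax), and the initial values and weights are made up for illustration.

```python
# Toy simulation of stacked clause enhancers for c1: not A or B and c2: not B or C.
# A negated atom (e.g. "not A") has preactivation -a, so boosting "not A" by w
# means subtracting w from a.

def ke_layer(a, b, c, w1=1.0, w2=0.5):
    """One KE layer: CE1 (for c1) and CE2 (for c2) each boost the highest
    literal of their clause by the clause weight (simplified update)."""
    da = db = dc = 0.0
    # CE1 boosts the highest literal of c1 = not A or B
    if -a > b:
        da -= w1
    else:
        db += w1
    # CE2 boosts the highest literal of c2 = not B or C
    if -b > c:
        db -= w2
    else:
        dc += w2
    return a + da, b + db, c + dc

# The conflicting case from the text: A < B and C < not B.
a, b, c = 0.0, 1.0, -2.0
for _ in range(6):
    a, b, c = ke_layer(a, b, c)
# With w1 > w2, "not B" (-b) shrinks layer after layer until C >= -b;
# from then on CE2 boosts C instead of "not B", so both clauses improve.
```

Running the loop, C stops being penalized after two layers and is then boosted at every layer, matching the claim that enough KE layers eventually increase the satisfaction of both clauses.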
4 Evaluation of the Model
In this section, the relational extension of KENN is tested on the task of Collective Classification: given a graph, we are interested in finding a classification
⁴ With an abuse of notation, we use atom symbols to refer also to their truth values.
100
A. Daniele and L. Seraﬁni
for its nodes using both the features of the nodes (the objects) and the information coming from the edges of the graph (relations between objects) [28]. In Collective Classification, there are two different learning tasks: inductive and transductive learning. In inductive learning, there are two separate graphs, one for training and the other for testing. On the contrary, in transductive learning, there is only one graph that contains nodes for both training and testing. In other words, in inductive learning there are no edges between training and test nodes, while in transductive learning there are. The tests have been performed on both tasks to analyze the behavior of KENN in relational domains. In particular, we tested KENN with a varying number of KE layers to validate the proposal of Sect. 3.4⁵.

4.1 Experimental Setup
We followed the evaluation methodology of [22], where the experiments were carried out on the Citeseer dataset [20] using SBR and RNM. The Citeseer dataset used in the evaluation is a citation network: the graph's nodes represent documents and the edges represent citations. The nodes' features are bag-of-words vectors, where an entry is zero if the corresponding word of the dictionary is absent from the document, and one if it is present. The classes to be predicted represent possible topics for a document. The dataset contains 3312 nodes that must be classified into 6 different classes: AG, AI, DB, IR, ML, and HCI. The classification is obtained from the 3703 features of the nodes, with the addition of the information coming from the citations (4732 edges). We use the same NN and knowledge as in [22], allowing for a comparison with SBR and RNM. The NN is a dense network with 3 hidden layers, each with 50 hidden nodes and ReLU activation. The knowledge consists of six rules obtained by substituting the topic T in ¬T(x) ∨ ¬Cite(x, y) ∨ T(y) with each of the classes, codifying the idea that papers cite works on the same topic. Tests have been conducted by selecting 10%, 25%, 50%, 75%, and 90% of the nodes for training, to evaluate the efficacy of the three methods as the training set size varies. For each of these values, training and evaluation were performed 100 times, each with a different split of the dataset. At each run the training set is created by selecting random nodes of the graph, with the constraint that the dataset must be balanced.
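Two pieces of this setup can be sketched in a few lines. This is our reading of the protocol, not the authors' code: the rule template is from the text, while the balanced-split function (its name, seeding, and rounding) is an illustrative assumption.

```python
import random
from collections import defaultdict

# The six knowledge rules, obtained by substituting each topic T
# into the template "not T(x) or not Cite(x, y) or T(y)".
TOPICS = ["AG", "AI", "DB", "IR", "ML", "HCI"]
RULES = [f"not {t}(x) or not Cite(x,y) or {t}(y)" for t in TOPICS]

def balanced_train_split(labels, fraction, seed=0):
    """Pick `fraction` of the nodes for training, with (roughly) the same
    number of nodes per class, as the balancing constraint requires."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for node, label in enumerate(labels):
        by_class[label].append(node)
    per_class = int(fraction * len(labels)) // len(by_class)
    train = []
    for nodes in by_class.values():
        train.extend(rng.sample(nodes, min(per_class, len(nodes))))
    return sorted(train)
```

Repeating `balanced_train_split` with 100 different seeds reproduces the spirit of the 100 random balanced splits per training-set size.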
4.2 Results
Figure 3 shows the test accuracies obtained by KENN as the number of KE layers increases, starting from 0 (corresponding to the NN accuracy) up to 6. Note that, for each line in the figure, there is a surrounding band corresponding to a 99% confidence interval. To calculate the intervals, we assumed the distribution of improvements obtained by the injection of the logical rules to
⁵ The source code of the experiments is available at https://github.com/rmazzier/KENN-Citeseer-Experiments.
Relational KENN
101
Fig. 3. Accuracies of KENN as the number of KE layers varies. (Color figure online)
be normally distributed (see the figures in Appendix B and C). We also computed the p-values for each setting, taking as null hypothesis that the distribution of accuracies of the NN is the same as that of KENN. Since the number of runs is quite high, the resulting p-values are very small. For this reason, we can safely reject the null hypothesis, and we are very confident that the improvements given by KENN do not depend on the random initialization of the models' parameters or on the specific choices of the dataset splits. More in detail, we found p-values in the range from 8.2e−42 to 1.6e−09 in the inductive case, and from 5.3e−72 to 2.1e−23 in the transductive one. The only exception is with 90% of the samples in the inductive scenario, where the p-value is 0.35. This is because the improvements over the NN are very small there. Indeed, in both learning paradigms, the effect of the knowledge is reduced when the amount of available data is larger. This behavior is consistent with the simple intuition that, when training data is scarce, the usage of knowledge should bring higher benefits. A more important result of these experiments is that in all cases adding a new KE layer does not reduce the test accuracy. On the contrary, most of the time the metric increases until a certain number of layers is reached, after which the accuracy stabilizes. This behavior is in line with the discussion of Sect. 3.4 and confirms the efficacy of the proposed strategy for dealing with the violation of the independence assumption. Finally, Fig. 3 also provides a measure of the amount of information carried by the knowledge. For instance, consider the blue and yellow lines, corresponding to training sets with 25% and 50% of the samples, respectively. In the inductive scenario, the accuracy obtained with 25% of the data plus the knowledge is almost the same as that of the standard NN with 50% of the data (and even higher in the transductive scenario).
In this case, adding the knowledge has the same effect as doubling the training data! Indeed, one of the main motivations behind Neural-Symbolic Integration is to reduce the required amount of training samples, since collecting labeled data is costly in practice.
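The confidence intervals and p-values described above can be computed along these lines. This is a hedged sketch under the stated normality assumption: we use a paired z-test on the per-run improvements, with made-up accuracy values; the authors' exact procedure may differ.

```python
from statistics import NormalDist, mean, stdev

def improvement_stats(acc_nn, acc_kenn, confidence=0.99):
    """Mean improvement, its confidence interval, and a two-sided p-value
    for the null hypothesis of no improvement (normal approximation)."""
    diffs = [k - n for k, n in zip(acc_kenn, acc_nn)]
    m = mean(diffs)
    se = stdev(diffs) / len(diffs) ** 0.5           # standard error of the mean
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # e.g. 2.576 for 99%
    ci = (m - z * se, m + z * se)
    p_value = 2 * (1 - NormalDist().cdf(abs(m) / se))
    return m, ci, p_value

# Illustrative accuracies over a few runs (not the paper's data):
acc_nn = [0.70, 0.71, 0.69, 0.70, 0.72, 0.71]
acc_kenn = [0.75, 0.77, 0.73, 0.76, 0.78, 0.74]
```

With 100 runs per setting, even a small mean improvement yields a large z statistic, which is why the reported p-values are so small.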
Table 1. Improvements in terms of accuracy on inductive and transductive learning.

% Tr   Inductive                 Transductive
       SBR     RNM     KENN      SBR     RNM     KENN
10     0.005   0.040   0.052     0.063   0.068   0.110
25     0.008   0.035   0.044     0.062   0.068   0.074
50     0.005   0.019   0.036     0.052   0.058   0.064
75     0.002   0.009   0.021     0.056   0.058   0.057
90     0.003   0.009   0.001     0.054   0.054   0.043

4.3 Comparison with Other NeSy Frameworks
Table 1 shows a comparison of KENN with SBR and RNM. We used the results of KENN with 3 KEs, since more layers do not provide a significant advantage (see Sect. 4.2). As we can see from the table, in the inductive case SBR produces much lower improvements compared to the other two methods. Note that these results are in line with previous results obtained on the VRD dataset, where another regularization approach (LTN) was compared with KENN [6]. Indeed, the results obtained on both VRD and Citeseer suggest better performance of model-based approaches as compared to the ones based on regularization. Note that methods based on regularization of the loss do not impose the knowledge at inference time. In the transductive scenario, the situation is different and SBR behaves similarly to the other two. Indeed, in this case, citations between training and test nodes are available and there is no distinction between training and inference. Finally, the results suggest that KENN is particularly useful when the available training data is scarce. On the contrary, when data is abundant, our results tend to degrade faster than those of RNM and SBR. However, the greatest advantage of KENN over the other architectures is its scalability. This is confirmed by the comparison of the execution times of the three methods: we found KENN to be very fast compared to the other two, with an average of 7.96 s required for a single run, as compared to the NN, which requires 2.46 s (an average of 1.83 s for each KE layer). A run of SBR costs 87.36 s (almost 11 times slower than KENN), while RNM requires 215.69 s per run (27 times slower)⁶.
⁶ All the experiments have been run on the same architecture, an NVIDIA Tesla V100.
5 Conclusions
KENN is a NeSy architecture that injects prior logical knowledge into a neural network by stacking a residual layer on top of it. In [6], it proved able to effectively inject knowledge in the context of multi-label classification tasks. In this work, we extended KENN to relational domains, where the presence of both unary and binary predicates does not allow for the simple tabular representation of the data used in the previous version of the framework. Moreover, we proposed a strategy to deal with violations of the independence assumption made by KENN. The experiments on Citeseer show the effectiveness of this strategy, obtaining statistically significant improvements over the NN performance, meaning that KENN can successfully inject knowledge even in the presence of relational data. Finally, KENN provided quality results also in comparison with the other two NeSy frameworks. In particular, the large difference in performance between KENN/RNM and SBR provides additional evidence in support of model-based approaches over regularization-based ones, with KENN being the best option in terms of scalability. However, the scalability of KENN largely depends on the fixed structure of the knowledge, with only universally quantified formulas allowed. This is a limitation of KENN in comparison with other frameworks, such as LTN, which supports the usage of existential quantifiers.
Appendix A
Relational KENN Architecture
See Fig. 4.
Fig. 4. KENN for relational domains: (a) the architecture of KENN. A graph (blue box) is represented in terms of the two matrices z_U and z_B and given as input to the Relational KE (RKE). Multiple RKEs are stacked together and the activation function is applied; (b) the architecture of the RKE module: the unary knowledge is enforced directly by the KE_U; the binary knowledge is enforced by the KE_B on the matrix z_M, which is created by joining z_U with z_B in the pre-elaboration step. z_M contains multiple instances of the same atoms, for instance S[a] (red cells). As a consequence, multiple residues are returned for a single atom, and such values are summed in the post-elaboration step (blue cells). The pre- and post-elaboration steps are efficiently implemented using the TensorFlow gather and scatter_nd functions. (Color figure online)
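In spirit, the pre- and post-elaboration steps of Fig. 4 behave like the following pure-Python analogue of TensorFlow's gather and scatter_nd. This is an illustrative sketch, not the paper's implementation: the index lists and values are made up, and the real code operates on matrices rather than flat lists.

```python
def gather(values, indices):
    """Pre-elaboration: duplicate unary predictions into the joined matrix,
    one copy per binary grounding in which the atom appears."""
    return [values[i] for i in indices]

def scatter_add(residues, indices, size):
    """Post-elaboration: sum the multiple residues computed for the same
    atom back into a single value per atom."""
    out = [0.0] * size
    for r, i in zip(residues, indices):
        out[i] += r
    return out
```

For example, if atom 0 appears in two binary groundings, `gather` copies its value twice, and `scatter_add` sums the two residues the KE computes for those copies.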
B Results Distribution – Inductive Learning
See Fig. 5.
Fig. 5. Left: distributions of accuracies achieved by the NN and KENN (3 KE layers) on 100 runs of Inductive Learning; Right: distributions of the improvements in accuracy obtained by the injection of the logical rules.
C Results Distribution – Transductive Learning
See Fig. 6.
Fig. 6. Left: distributions of accuracies achieved by the NN and KENN (3 KE layers) on 100 runs of Transductive Learning; Right: distributions of the improvements in accuracy obtained by the injection of the logical rules.
D Comparison with SBR and RNM
Test Accuracy. See Fig. 7.
Fig. 7. Comparison between KENN (3 KE layers), SBR and RNM in terms of accuracy improvements over the NN.
Execution Time. See Fig. 8.
Fig. 8. Execution time (logarithmic scale) of the different methods. A bar labelled with number i corresponds to KENN with i KE layers (0 represents the NN without logic).
References

1. Bach, S.H., Broecheler, M., Huang, B., Getoor, L.: Hinge-loss Markov random fields and probabilistic soft logic. J. Mach. Learn. Res. 18(109), 1–67 (2017)
2. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
3. Besold, T.R., et al.: Neural-symbolic learning and reasoning: a survey and interpretation. CoRR abs/1711.03902 (2017). http://arxiv.org/abs/1711.03902
4. Campero, A., Pareja, A., Klinger, T., Tenenbaum, J., Riedel, S.: Logical rule induction and theory learning using neural theorem proving. arXiv preprint arXiv:1809.02193 (2018)
5. Cohen, W.W.: TensorLog: a differentiable deductive database. arXiv preprint arXiv:1605.06523 (2016)
6. Daniele, A., Serafini, L.: Knowledge enhanced neural networks. In: Nayak, A.C., Sharma, A. (eds.) PRICAI 2019. LNCS (LNAI), vol. 11670, pp. 542–554. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-29908-8_43
7. De Raedt, L., Kimmig, A., Toivonen, H.: ProbLog: a probabilistic Prolog and its application in link discovery. In: IJCAI, Hyderabad, vol. 7, pp. 2462–2467 (2007)
8. Diligenti, M., Gori, M., Saccà, C.: Semantic-based regularization for learning and inference. Artif. Intell. 244, 143–165 (2017)
9. Donadello, I.: Semantic image interpretation - integration of numerical data and logical knowledge for cognitive vision. Ph.D. thesis, University of Trento, Italy (2018)
10. Dong, H., Mao, J., Lin, T., Wang, C., Li, L., Zhou, D.: Neural logic machines. arXiv preprint arXiv:1904.11694 (2019)
11. Evans, R., Grefenstette, E.: Learning explanatory rules from noisy data. J. Artif. Intell. Res. 61, 1–64 (2018)
12. Fischer, M., Balunovic, M., Drachsler-Cohen, D., Gehr, T., Zhang, C., Vechev, M.: DL2: training and querying neural networks with logic. In: International Conference on Machine Learning, pp. 1931–1941 (2019)
13. Guimarães, V., Costa, V.S.: NeuralLog: a neural logic language. CoRR abs/2105.01442 (2021). http://arxiv.org/abs/2105.01442
14. Hinton, G., et al.: Deep neural networks for acoustic modeling in speech recognition. IEEE Sig. Process. Mag. 29, 82–97 (2012)
15. Hu, Z., Ma, X., Liu, Z., Hovy, E., Xing, E.: Harnessing deep neural networks with logic rules. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, Berlin, Germany, vol. 1. Association for Computational Linguistics (2016). http://aclweb.org/anthology/P/P16/P16-1228.pdf
16. Koller, D., et al.: Introduction to Statistical Relational Learning. MIT Press, Cambridge (2007)
17. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Proceedings of the 25th International Conference on Neural Information Processing Systems, NIPS 2012, vol. 1, pp. 1097–1105. Curran Associates Inc. (2012). http://dl.acm.org/citation.cfm?id=2999134.2999257
18. Li, T., Srikumar, V.: Augmenting neural networks with first-order logic. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 292–302. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/P19-1028
19. Lu, C., Krishna, R., Bernstein, M., Fei-Fei, L.: Visual relationship detection with language priors. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 852–869. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_51
20. Lu, Q., Getoor, L.: Link-based classification. In: Proceedings of the Twentieth International Conference on Machine Learning, ICML 2003, pp. 496–503. AAAI Press (2003)
21. Manhaeve, R., Dumancic, S., Kimmig, A., Demeester, T., De Raedt, L.: DeepProbLog: neural probabilistic logic programming. In: Advances in Neural Information Processing Systems, pp. 3749–3759 (2018)
22. Marra, G., Diligenti, M., Giannini, F., Gori, M., Maggini, M.: Relational neural machines. arXiv preprint arXiv:2002.02193 (2020)
23. Minervini, P., Riedel, S.: Adversarially regularising neural NLI models to integrate logical background knowledge. arXiv preprint arXiv:1808.08609 (2018)
24. Reimann, J.N., Schwung, A.: Neural logic rule layers. arXiv preprint arXiv:1907.00878 (2019)
25. Richardson, M., Domingos, P.: Markov logic networks. Mach. Learn. 62(1–2), 107–136 (2006). https://doi.org/10.1007/s10994-006-5833-1
26. Rocktäschel, T., Riedel, S.: Learning knowledge base inference with neural theorem provers. In: Proceedings of the 5th Workshop on Automated Knowledge Base Construction, pp. 45–50 (2016)
27. Rocktäschel, T., Riedel, S.: End-to-end differentiable proving. In: Advances in Neural Information Processing Systems, pp. 3788–3800 (2017)
28. Sen, P., Namata, G.M., Bilgic, M., Getoor, L., Gallagher, B., Eliassi-Rad, T.: Collective classification in network data. AI Mag. 29(3), 93–106 (2008)
29. Serafini, L., d'Avila Garcez, A.: Logic tensor networks: deep learning and logical reasoning from data and knowledge. CoRR abs/1606.04422 (2016)
30. Xu, J., Zhang, Z., Friedman, T., Liang, Y., Van den Broeck, G.: A semantic loss function for deep learning with symbolic knowledge. In: Dy, J., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning, PMLR 80, Stockholmsmässan, Stockholm, Sweden, pp. 5502–5511 (2018). http://proceedings.mlr.press/v80/xu18h.html
Logic Tensor Networks for Top-N Recommendation

Tommaso Carraro1,2(B), Alessandro Daniele2, Fabio Aiolli1, and Luciano Serafini2

1 Department of Mathematics, University of Padova, Padova, Italy
[emailprotected]
2 Data and Knowledge Management Research Unit, Fondazione Bruno Kessler (FBK), Trento, Italy
Abstract. Despite being studied for more than twenty years, state-of-the-art recommendation systems still suffer from important drawbacks which limit their usage in real-world scenarios. Among the well-known issues of recommender systems are data sparsity and the cold-start problem. These limitations can be addressed by providing some background knowledge to the model to compensate for the scarcity of data. Following this intuition, we propose to use Logic Tensor Networks (LTN) to tackle the top-n item recommendation problem. In particular, we show how LTN can be used to easily and effectively inject commonsense recommendation knowledge inside a recommender system. We evaluate our method on MindReader, a knowledge graph-based movie recommendation dataset containing plentiful side information. In particular, we perform an experiment to show how the benefits of the knowledge increase with the sparsity of the dataset. Eventually, a comparison with a standard Matrix Factorization approach reveals that our model is able to reach and, in many cases, outperform state-of-the-art performance.

Keywords: Recommender systems · Top-n recommendation · Logic tensor networks · Neural-symbolic integration

1 Introduction
Recommender system (RS) technologies are nowadays an essential component of e-services (e.g., Amazon, Netflix, Spotify). Generally speaking, an RS aims at providing suggestions for items (e.g., movies, songs, news) that are most likely of interest to a particular user [25]. Since the first appearance of RSs in the early 2000s, Collaborative Filtering (CF) [1,16,28] has affirmed itself as the standard recommendation approach. In particular, Latent Factor models, and especially Matrix Factorization (MF), have dominated the CF scene [14,20,22] for years, a dominance further emphasized by the rise of deep learning [7,13,19,26,27]. Despite their success, state-of-the-art models still suffer from important drawbacks, which limit their applicability in real-world scenarios.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. A. Dovier et al. (Eds.): AIxIA 2022, LNAI 13796, pp. 110–123, 2023. https://doi.org/10.1007/978-3-031-27181-6_8

Among the most
crucial problems are data sparsity and the cold-start problem [21,25]. Data sparsity leads to datasets where the density of ratings is usually less than 1%, while cold-start makes recommendation challenging for new users and items. One way to address these limitations is to provide additional information to the models to compensate for the scarcity of data. Following this intuition, methods based on Tensor Factorization [4] and Factorization Machines [24,30] have recently been proposed. These models make it possible to extend the user-item matrix with new dimensions containing content (e.g., movie genres, demographic information) and/or contextual side information (e.g., location, time). Though these techniques have been shown to improve recommendation performance, they are usually designed for one specific type of side information (e.g., the user or item content) and lack explainability [6,31]. Novel recommendation datasets (e.g., [5]) provide manifold side information (e.g., ratings on movie genres, actors, directors), and hence models which can exploit all the available information are required. Neural-Symbolic Integration (NeSy) [3] and Statistical Relational Learning (SRL) [23] are good candidates for incorporating knowledge into learning. These two branches of Artificial Intelligence study approaches for integrating some form of prior knowledge, usually expressed through First-Order Logic (FOL), with statistical models. This integration has been shown to be beneficial in addressing data scarcity [11]. In this paper, we propose to use a Logic Tensor Network (LTN) [2] to inject commonsense knowledge into a standard Matrix Factorization model for the top-n item recommendation task. LTN is a NeSy framework that allows using logical formulas to instruct the learning of a neural model. We use the MindReader dataset [5] to test our model.
This dataset includes a variety of information, such as users' tastes across movie genres, actors, and directors. In this work, we show how LTN can naturally and effectively exploit all this information to improve the generalization capabilities of the MF model. In addition, an experiment that drastically reduces the density of the training ratings reveals that our model can effectively mitigate data sparsity, outperforming the standard MF model especially in the most challenging scenarios.
2 Related Works
The integration of logical reasoning and learning in RSs is still in its early stages. Among the NeSy approaches for RSs, the most prominent is Neural Collaborative Reasoning (NCR) [10]. In this work, the recommendation problem is formalized as a logical reasoning problem. In particular, the user's ratings are represented using logical variables; then, logical operators are used to construct formulas that express facts about them. Afterward, NCR maps the variables to logical embeddings and the operators to neural networks which act on those embeddings. By doing so, each logical expression can be equivalently organized as a neural network, so that logical reasoning and prediction can be conducted in a continuous space. In [9], the idea of NCR is applied to knowledge graphs for RSs, while [29] uses a NeSy approach to tackle the explainability of RSs.
The seminal approach that successfully applied SRL to RSs is HyPER [17], which is based on Probabilistic Soft Logic (PSL) [15]. In particular, HyPER exploits the expressiveness of FOL to encode knowledge from a wide range of information sources, such as multiple user and item similarity measures, content, and social information. Then, Hinge-Loss Markov Random Fields are used to learn how to balance the different information types. HyPER is highly related to our work, since the logical formulas that we use resemble the ones used in HyPER. After HyPER, other SRL approaches have been proposed for RSs [8,12].
3 Background
This section provides useful notation and terminology used in the remainder of the paper.

3.1 Notation
Bold notation is used to differentiate between vectors, e.g., x = [3.2, 2.1], and scalars, e.g., x = 5. Matrices and tensors are denoted with upper-case bold notation, e.g., X. Then, X_i denotes the i-th row of X, while X_{i,j} denotes the entry at row i and column j. We refer to the set of users of an RS as U, with |U| = n. Similarly, the set of items is referred to as I, with |I| = m. We use D to denote a dataset, defined as a set of N triples D = {(u, i, r)^(j)}_{j=1}^{N}, where u ∈ U, i ∈ I, and r ∈ N is a rating. We assume that a user u cannot give more than one rating to an item i, namely, there are no r1, r2 ∈ N with r1 ≠ r2 such that {(u, i, r1)} ∪ {(u, i, r2)} ⊆ D. D can be reorganized into the so-called user-item matrix R ∈ N^{n×m}, with users on the rows and items on the columns, such that R_{u,i} = r if (u, i, r) ∈ D, and 0 otherwise.
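The definition of the user-item matrix can be sketched directly. This is a minimal illustration of the notation above (the function name and list-of-lists representation are our own choices):

```python
def build_rating_matrix(D, n, m):
    """Build the user-item matrix R from the dataset of (u, i, r) triples:
    R[u][i] = r if (u, i, r) is in D, and 0 otherwise."""
    R = [[0] * m for _ in range(n)]
    for u, i, r in D:
        R[u][i] = r
    return R
```

Since each (user, item) pair carries at most one rating by assumption, the assignment never overwrites a different value.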
3.2 Matrix Factorization
Matrix Factorization (MF) is a Latent Factor Model that aims at factorizing the user-item matrix R into the product of two lower-dimensional rectangular matrices, denoted U and I. U ∈ R^{n×k} and I ∈ R^{m×k} contain the users' and items' latent factors, respectively, where k is the number of latent factors. The objective of MF is to find U and I such that R ≈ U · I^T. An effective way to learn the latent factors is gradient-descent optimization. Given the dataset D, an MF model seeks to minimize the following loss function:

L(θ) = (1/N) Σ_{(u,i,r)∈D} (r̃ − r)² + λ‖θ‖²    (1)

where r̃ = U_u · I_i^T and θ = {U, I}. The first term of Eq. (1) is the Mean Squared Error (MSE) between the predicted and target ratings, while the second is an L2 regularization term; λ is a hyperparameter setting the strength of the regularization.
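Eq. (1) can be minimized with plain stochastic gradient descent. The sketch below is illustrative only: hyperparameters, initialization, and the per-sample update scheme are our assumptions, and real systems use vectorized libraries rather than nested lists.

```python
import random

def train_mf(D, n, m, k=2, lr=0.05, lam=0.01, epochs=300, seed=0):
    """SGD on the MF loss of Eq. (1): squared error plus L2 regularization."""
    rng = random.Random(seed)
    U = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n)]
    I = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(m)]
    for _ in range(epochs):
        for u, i, r in D:
            pred = sum(U[u][f] * I[i][f] for f in range(k))
            err = pred - r
            for f in range(k):
                gu = err * I[i][f] + lam * U[u][f]   # dL/dU[u][f]
                gi = err * U[u][f] + lam * I[i][f]   # dL/dI[i][f]
                U[u][f] -= lr * gu
                I[i][f] -= lr * gi
    return U, I

def mse(D, U, I):
    k = len(U[0])
    return sum((sum(U[u][f] * I[i][f] for f in range(k)) - r) ** 2
               for u, i, r in D) / len(D)
```

After training on a small toy rating matrix, the MSE term of the loss drops close to zero while the L2 term keeps the factors small.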
3.3 Logic Tensor Networks
Logic Tensor Networks (LTN) [2] is a Neural-Symbolic framework that enables an effective integration of deep learning and logical reasoning. It allows defining a knowledge base composed of a set of logical axioms and using them as the objective of a neural model. To define the knowledge base, LTN uses a specific first-order language, called Real Logic, which forms the basis of the framework. It is fully differentiable and has a concrete semantics that maps every symbolic expression into the domain of real numbers. Thanks to Real Logic, LTN can convert logical formulas into computational graphs that enable gradient-based optimization based on fuzzy logic semantics. Real Logic is defined on a first-order language L with a signature that contains a set C of constant symbols, a set X of variable symbols, a set F of functional symbols, and a set P of predicate symbols. A term is constructed recursively from constants, variables, and functional symbols. An expression formed by applying a predicate symbol to some term(s) is called an atomic formula. Complex formulas are constructed recursively using connectives (i.e., ¬, ∧, ∨, ⇒, ↔) and quantifiers (i.e., ∀, ∃). To emphasize the fact that symbols are mapped onto real-valued features, we use the term grounding¹, denoted by G. In particular, each individual (e.g., a user) is grounded as a tensor of real features (e.g., the user's demographic information), functions as real functions, and predicates as real functions that project onto a value in the interval [0, 1]. A variable x is grounded to a sequence of n_x individuals from a domain, with n_x ∈ N⁺. As a consequence, a term t(x) or a formula P(x), constructed recursively with a free variable x, is grounded to a sequence of n_x values too. Afterward, connectives are grounded using fuzzy semantics, and quantifiers using special aggregation functions.
In this paper, we use the product configuration, which is better suited for gradient-based optimization [18]. Specifically, conjunctions are grounded using the product t-norm T_prod, negations using the standard fuzzy negation N_S, implications using the Reichenbach implication I_R, and the universal quantifier using the generalized mean w.r.t. the error values ME_p. The other connectives and quantifiers are not used in this paper, hence not reported.

T_prod(u, v) = u · v,  u, v ∈ [0, 1]
N_S(u) = 1 − u,  u ∈ [0, 1]
I_R(u, v) = 1 − u + u · v,  u, v ∈ [0, 1]
ME_p(u_1, ..., u_n) = 1 − ( (1/n) Σ_{i=1}^{n} (1 − u_i)^p )^{1/p},  p ≥ 1,  u_1, ..., u_n ∈ [0, 1]

Connective operators are applied element-wise to the input tensors, while aggregators aggregate the dimension of the input tensor that corresponds to
¹ Notice that this is different from the common use of the term grounding in logic, which indicates the operation of replacing the variables of a term or formula with constants or with terms containing no variables.
the quantified variable. Real Logic also provides a special type of quantification, called diagonal quantification, denoted Diag(x_1, ..., x_n). It applies only to variables that have the same number of individuals (i.e., n_{x_1} = n_{x_2} = ... = n_{x_n}) and allows quantifying over specific tuples of individuals, such that the i-th tuple contains the i-th individual of each of the variables in the argument of Diag. An intuition of how these operations work in practice is given in Sect. 3.4. Given a Real Logic knowledge base K = {φ_1, ..., φ_n}, where φ_1, ..., φ_n are closed formulas, LTN allows learning the grounding of the constants, functions, and predicates appearing in them. In particular, if constants are grounded as embeddings, and functions/predicates onto neural networks, their grounding G depends on some learnable parameters θ. We denote a parametric grounding as G(·|θ). In LTN, the learning of parametric groundings is obtained by finding parameters θ* that maximize the satisfaction of K:

θ* = argmax_θ SatAgg_{φ∈K} G(φ|θ)    (2)
where SatAgg : [0, 1]* → [0, 1] is a formula-aggregating operator, often defined using ME_p. Because Real Logic grounds expressions in real and continuous domains, LTN attaches gradients to every subexpression and consequently learns through gradient-descent optimization.

3.4 Intuition of Real Logic Grounding
In Real Logic, differently from first-order logic, a variable x is grounded as a sequence of n_x individuals (i.e., tensors) from a domain, with n_x ∈ N⁺. As a direct consequence, a term t(x) or a formula P(x), with a free variable x, is grounded to a sequence of n_x values too. For example, P(x) returns a vector in [0, 1]^{n_x}, namely ⟨P(x_i)⟩_{i=1}^{n_x}, where x_i is the i-th individual of x. Similarly, t(y) returns a matrix in R^{n_y × z}, assuming that t maps to individuals in R^z. This formalization is intuitively extended to terms and formulas with arity greater than one. In such cases, Real Logic organizes the output tensor in such a way that it has a dimension for each free variable involved in the expression. For instance, t_2(x, y) returns a tensor in R^{n_x × n_y × z}, assuming that t_2 maps to individuals in R^z. In particular, at position (i, j) there is the evaluation of t_2(x_i, y_j), where x_i denotes the i-th individual of x and y_j the j-th individual of y. Similarly, P_2(x, y) returns a tensor in [0, 1]^{n_x × n_y}, where at position (i, j) there is the evaluation of P_2(x_i, y_j). The connective operators are applied element-wise to the input tensors. For instance, ¬P_2(x, y) returns a tensor in [0, 1]^{n_x × n_y}, where at position (i, j) there is the evaluation of ¬P_2(x_i, y_j), namely N_S (i.e., ¬) applied to each truth value in the tensor P_2(x, y) ∈ [0, 1]^{n_x × n_y}. For binary connectives, the behavior is similar. For instance, let Q be a predicate symbol and u a variable. Then, P_2(x, y) ∧ Q(x, u) returns a tensor in [0, 1]^{n_x × n_y × n_u}, where at position (i, j, k) there is the evaluation of the formula on the i-th individual of x, the j-th individual of y, and the k-th individual of u.
The quantifiers aggregate the dimension that corresponds to the quantified variable. For instance, ∀x P_2(x, y) returns a tensor in [0, 1]^{n_y}, namely the aggregation is performed across the dimension of x. Since y is the only free variable remaining in the expression, the output has a single dimension, corresponding to the dimension of y. Specifically, the framework first computes P_2(x, y) ∈ [0, 1]^{n_x × n_y}, then aggregates the dimension corresponding to x. Similarly, ∀(x, y) P_2(x, y) returns a scalar in [0, 1], namely the aggregation is performed across the dimensions of both variables x and y. In the case of diagonal quantification, the framework behaves differently. For instance, ∀Diag(w, v) P_2(w, v), where w and v are two variables with the same number of individuals n_w = n_v, returns a scalar in [0, 1], which is the result of the aggregation of n_w truth values, namely P_2(w_1, v_1), P_2(w_2, v_2), ..., P_2(w_{n_w}, v_{n_v}). Without diagonal quantification (i.e., ∀(w, v) P_2(w, v)), the framework aggregates across the dimensions of both variables, involving n_w² values, namely P_2(w_1, v_1), P_2(w_1, v_2), ..., P_2(w_{n_w}, v_{n_v−1}), P_2(w_{n_w}, v_{n_v}). Intuitively, ∀(w, v) aggregates all the values in [0, 1]^{n_w × n_v}, while ∀Diag(w, v) aggregates only the values on the diagonal.
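The fuzzy semantics of Sect. 3.3 and the two ways of quantifying a binary predicate can be sketched in pure Python. This is an illustrative analogue, not the LTN implementation: nested comprehensions stand in for the tensor operations, and the predicate used below is made up.

```python
def t_prod(u, v):                     # conjunction: product t-norm
    return u * v

def n_s(u):                           # negation: standard fuzzy negation
    return 1.0 - u

def i_r(u, v):                        # implication: Reichenbach
    return 1.0 - u + u * v

def me_p(values, p=2):                # universal quantifier: ME_p
    n = len(values)
    return 1.0 - (sum((1.0 - v) ** p for v in values) / n) ** (1.0 / p)

def forall_all_pairs(P2, xs, ys, p=2):
    """forall(x, y) P2(x, y): aggregate the whole n_x * n_y truth matrix."""
    return me_p([P2(x, y) for x in xs for y in ys], p)

def forall_diag(P2, xs, ys, p=2):
    """forall Diag(x, y) P2(x, y): aggregate the matched pairs only."""
    return me_p([P2(x, y) for x, y in zip(xs, ys)], p)
```

For a predicate that is true only on matched pairs, the diagonal quantifier reports full satisfaction while the all-pairs quantifier is dragged down by the off-diagonal entries, which is exactly why Axiom (3) in Sect. 4.1 uses Diag.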
4 Method
Our approach uses a Logic Tensor Network to train a basic Matrix Factorization (MF) model for the top-n item recommendation task. The LTN is trained using a Real Logic knowledge base containing common-sense knowledge about the movie recommendation domain. This section formalizes the knowledge base used by our model, how the symbols appearing in it are grounded in the real field, and how the learning of the LTN takes place.

4.1 Knowledge Base
The Real Logic knowledge base that our model seeks to maximally satisfy is composed of the following axioms:

φ1 : ∀Diag(user, movie, rating) (Sim(Likes(user, movie), rating))   (3)

φ2 : ∀(user, movie, genre) (¬LikesGenre(user, genre) ∧ HasGenre(movie, genre) ⇒ Sim(Likes(user, movie), rating−))   (4)
where user, movie, rating, and genre are variable symbols to denote the users of the system, the items of the system, the ratings given by the users to the items, and the genres of the movies, respectively. rating− is a constant symbol denoting the negative rating. Likes(u, m) is a functional symbol returning the prediction for the rating given by user u to movie m. Sim(r1 , r2 ) is a predicate symbol measuring the similarity between two ratings, r1 and r2 . LikesGenre(u, g) is a
predicate symbol denoting whether the user u likes the genre g. HasGenre(m, g) is a predicate symbol denoting whether the movie m belongs to the genre g. Notice the use of diagonal quantification in Axiom (3). When user, movie, and rating are grounded with three sequences of values, the i-th value of each variable matches the i-th values of the other variables. This is useful here, since the dataset D comes as a set of triples. Diagonal quantification forces the satisfaction of Axiom (3) for these triples only, rather than for every combination of users, items, and ratings in D.

4.2 Grounding of the Knowledge Base
The grounding defines how the symbols of the language are mapped onto the real field, and hence how they can be used to construct the architecture of the LTN. In particular, given D = {(u, m, r)^(j)}_{j=1..N}:

– G(user) = ⟨u^(j)⟩_{j=1..N}, namely user is grounded as the sequence of the N user indexes in D.
– G(movie) = ⟨m^(j)⟩_{j=1..N}, namely movie is grounded as the sequence of the N movie indexes in D.
– G(rating) = ⟨r^(j)⟩_{j=1..N} with r^(j) ∈ {0, 1} for all j, namely rating is grounded as the sequence of the N ratings in D, where 0 denotes a negative rating and 1 a positive one.
– G(rating−) = 0, namely rating− is grounded as the negative rating.
– G(genre) = ⟨1, ..., N_g⟩, namely genre is grounded as a sequence of N_g genre indexes, where N_g is the number of genres appearing in the movies of D.
– G(Likes) : u, m ↦ U_u · I_m, namely Likes is grounded as a function that takes as input a user index u and a movie index m and returns the prediction of the MF model for the user at index u and the movie at index m, where U ∈ R^(n×k) and I ∈ R^(m×k) are the matrices of the users' and items' latent factors, respectively.
– G(LikesGenre) : u, g ↦ {0, 1}, namely LikesGenre is grounded as a function that takes as input a user index u and a genre index g and returns 1 if user u likes genre g in the dataset, and 0 otherwise.
– G(HasGenre) : m, g ↦ {0, 1}, namely HasGenre is grounded as a function that takes as input a movie index m and a genre index g and returns 1 if movie m belongs to genre g in the dataset, and 0 otherwise.
– G(Sim) : r̃, r ↦ exp(−α(r̃ − r)^2), namely Sim is grounded as a function that computes the similarity between a predicted rating r̃ and a target rating r. The use of the exponential allows Sim to be treated as a predicate, since its output is restricted to the interval [0, 1]. The square gives a larger penalty to larger errors during the optimization. α is a hyperparameter that controls the smoothness of the function.
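The groundings of Likes and Sim can be sketched as follows (the latent factors U and I below are hypothetical toy values, not learned parameters):

```python
import math

# Illustrative groundings (toy numbers, not the trained model).
k = 2                                      # number of latent factors
U = [[0.5, 1.0], [1.0, -0.5]]              # users' latent factors (n x k)
I = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # items' latent factors (m x k)

def likes(u, m):
    """G(Likes): dot product U_u . I_m, the MF rating prediction."""
    return sum(a * b for a, b in zip(U[u], I[m]))

def sim(r_pred, r_target, alpha=0.1):
    """G(Sim): exp(-alpha * (r_pred - r_target)^2), a truth value in (0, 1]."""
    return math.exp(-alpha * (r_pred - r_target) ** 2)

r_hat = likes(0, 2)       # 0.5*0.5 + 1.0*0.5 = 0.75
s_pos = sim(r_hat, 1.0)   # similarity to a positive rating
s_neg = sim(r_hat, 0.0)   # similarity to the negative rating rating^- = 0
```

Since the prediction 0.75 is closer to 1 than to 0, Sim assigns it a higher truth degree against the positive rating than against the negative one.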
Intuitively, Axiom (3) states that, for each user-movie-rating triple in the dataset D = {(u, m, r)^(j)}_{j=1..N}, the prediction computed by the MF model for user u and movie m should be similar to the target rating r provided by user u for movie m. Axiom (4), instead, states that, for each possible combination of users, movies, and genres taken from the dataset, if user u does not like a genre of movie m, then the prediction computed by the MF model for user u and movie m should be similar to the negative rating rating−, namely the user should not like movie m. By forcing the satisfaction of Axiom (3), the model learns to factorize the user-item matrix using the ground truth, while Axiom (4) acts as a kind of regularization for the latent factors of the MF model.

4.3 Learning of the LTN
The objective of our LTN is to learn the latent factors in U and I such that the axioms in the knowledge base K = {φ1, φ2} are maximally satisfied, namely

argmax_θ SatAgg_{φ∈K} G_{(user,movie,rating)←D}(φ | θ),   with θ = {U, I}.

The notation (user, movie, rating) ← D means that the variables user, movie, and rating are grounded with the triples taken from the dataset D, namely user takes the sequence of user indexes, movie the sequence of movie indexes, and rating the sequence of ratings. In practice, this objective corresponds to the following loss function:

L(θ) = (1 − SatAgg_{φ∈K} G_{(user,movie,rating)←B}(φ | θ)) + λ‖θ‖^2   (5)
where B denotes a batch of training triples randomly sampled from D. An L2 regularization term has been added to the loss to prevent overfitting, and the hyperparameter λ defines the strength of the regularization. Notice that the loss does not specify how the variable genre is grounded: its grounding depends on the sampled batch B. In our experiments, we grounded it with the sequence of genres of the movies in the batch. It is worth highlighting that the loss function depends on the semantics used to approximate the logical connectives, quantifiers, and formula aggregating operator. In our experiments, we used the stable product configuration, a stable variant of the product configuration, introduced in [2]. Then, we selected ME_p as the formula aggregating operator, with p = 2.
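A sketch of Eq. (5) on a single batch, assuming SatAgg is realized by the ME_p aggregator with p = 2 (the truth degrees and parameter values below are made up for illustration):

```python
# Sketch of the loss in Eq. (5); not the actual LTNtorch implementation.
def me_p(values, p=2):
    """p-mean error aggregator: 1 - (mean((1 - v)^p))^(1/p)."""
    n = len(values)
    return 1.0 - (sum((1.0 - v) ** p for v in values) / n) ** (1.0 / p)

def loss(axiom_truths, theta, lam=0.0001, p=2):
    """1 - SatAgg over the axioms' truth degrees, plus L2 regularization."""
    sat = me_p(axiom_truths, p)
    l2 = sum(w * w for w in theta)       # squared L2 norm of the parameters
    return (1.0 - sat) + lam * l2

# Made-up truth degrees of phi_1 and phi_2 on a batch, and flattened factors.
batch_loss = loss([0.9, 0.7], theta=[0.5, 1.0, -0.5])
```

In the actual model, the truth degrees of φ1 and φ2 are themselves computed from U and I, so minimizing L(θ) by gradient descent updates the latent factors.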
5 Experiments
This section presents the experiments we have performed with our method. They have been executed on an Apple MacBook Pro (2019) with a 2.6 GHz 6-core Intel Core i7. The model has been implemented in Python using PyTorch; in particular, we used the LTNtorch library (https://github.com/logictensornetworks/LTNtorch). Our source code is freely available at https://github.com/tommasocarraro/LTNrec.

5.1 Dataset

In our experiments, we used the MindReader [5] dataset. It contains 102,160 explicit ratings collected from 1,174 real users on 10,030 entities (e.g., movies, actors, movie genres) taken from a knowledge graph in the movie domain. The explicit ratings in the dataset can be of three types: like (1), dislike (−1), or unknown (0). The dataset is subdivided into 10 splits; in our experiments, we used split 0. Each split has a training set, a validation set, and a test set. The
training set contains both ratings given on movies and ratings given on the other entities, while the validation and test sets contain only ratings given on movies. The validation and test sets are built so as to perform a leave-one-out evaluation. In particular, for each user of the training set, one random positive movie rating is held out for the validation set, and one for the test set. The validation/test example of the user is completed by adding 100 negative movie ratings randomly sampled from the dataset. To improve the quality of the dataset, we removed the unknown ratings. Moreover, we removed the top 2% most popular movies from the test set to reduce the popularity bias and hence see how the model performs on non-trivial recommendations, as suggested in [5]. Afterward, we considered only the training ratings given on movies and movie genres, since our model uses only this information. After these steps, we converted the negative ratings from −1 to 0. Our final dataset contains 962 users, 3,034 movies, 164 genres, 16,351 ratings on movies, and 10,889 ratings on movie genres. The density of the user-movie ratings is 0.37%.

5.2 Experimental Setting
In our experiments, we compared the performance of three models: (1) a standard MF model trained on the movie ratings of MindReader using Eq. (1), denoted as MF; (2) an LTN model trained on the movie ratings of MindReader using Eq. (5) with K = {φ1}, denoted as LTN; and (3) an LTN model trained on the movie and genre ratings of MindReader using Eq. (5) with K = {φ1, φ2}, denoted as LTNgenres. To compare the performance of the models, we used two widely used ranking-based metrics, namely hit@k and ndcg@k, explained in Sect. 5.3. In our experiments, we used the following procedure: (1) we generated additional training sets by randomly sampling 80%, 60%, 40%, and 20% of the movie ratings of each user from the entire training set, referred to as 100%. Then, (2) for each training set Tr ∈ {100%, 80%, 60%, 40%, 20%} and for each model m ∈ {MF, LTN, LTNgenres}: (2a) we performed a grid search of model m on training set Tr to find the best hyperparameters on the validation set, using hit@10 as the validation metric; then, (2b) we tested the performance of the best model on the test set in terms of hit@10 and ndcg@10. We repeated this procedure 30 times using seeds from 0 to 29. The test metrics have been averaged across these runs and reported in Table 1. Due to computational cost, the grid search has been computed only for the first run. Starting from the second run, step (2a) is replaced with the training of model m on training set Tr with the best hyperparameters found during the first run. A description of the hyperparameters tested in the grid searches, as well as the training details of the models, is given in Sect. 5.4.

5.3 Evaluation Metrics
The selected ranking-based metrics are defined as follows:

– hit@k: the Hit Ratio measures whether a test item is placed in the top-k positions of the ranking, counting the presence of the item as a hit;
– ndcg@k: the Normalized Discounted Cumulative Gain measures the quality of the recommendation based on the position of the target item in the ranking. In particular, it uses a monotonically increasing discount to emphasize the importance of higher ranks versus lower ones.

Formally, let us define ω(r) as the item at rank r, I[·] as the indicator function, and I_u as the set of held-out items for user u. hit@k for user u is defined as

hit@k(u, ω) := I[ Σ_{r=1..k} I[ω(r) ∈ I_u] ≥ 1 ].

Truncated discounted cumulative gain (dcg@k) for user u is defined as

dcg@k(u, ω) := Σ_{r=1..k} (2^(I[ω(r) ∈ I_u]) − 1) / log(r + 1).
ndcg@k is dcg@k linearly normalized to [0, 1] by dividing by the best possible dcg@k, in which all the held-out items are ranked at the top. Notice that in this paper |I_u| = 1. Specifically, for each validation/test example, the scores for the positive movie and the 100 randomly sampled negative movies are computed using the Likes(u, m) function (i.e., the dot product between the user and movie latent factors). Then, a ranking is created based on these scores. The metrics evaluate the recommendation based on the position of the positive movie in the produced ranking.
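Since |I_u| = 1, both metrics reduce to simple functions of the rank of the held-out positive among the 101 candidates; a minimal sketch (the rank used in the example is made up):

```python
import math

# Leave-one-out metrics with a single held-out positive item (|I_u| = 1).
def hit_at_k(rank_of_positive, k=10):
    """1 if the held-out positive appears in the top-k positions, else 0."""
    return 1 if rank_of_positive <= k else 0

def ndcg_at_k(rank_of_positive, k=10):
    """With |I_u| = 1, dcg@k is 1/log2(r+1) at the positive's rank r (0 if
    r > k); the ideal dcg@k is 1/log2(2) = 1, so this is already normalized."""
    if rank_of_positive > k:
        return 0.0
    return 1.0 / math.log2(rank_of_positive + 1)

# Example: positive item ranked 3rd among the 101 candidates.
h = hit_at_k(3)    # 1
n = ndcg_at_k(3)   # 1 / log2(4) = 0.5
```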
5.4 Training Details
The hyperparameters tested during the grid searches explained in Sect. 5.2 vary depending on the model. For all the models, we tried a number of latent factors k ∈ {1, 5, 10, 25}, a regularization coefficient λ ∈ {0.001, 0.0001}, a batch size in {32, 64}, and whether it was better to add user and item biases to the model. For LTN and LTNgenres, we tried α ∈ {0.05, 0.1, 0.2} for the predicate Sim and used p = 2 for the aggregator ME_p of Axiom (3). For LTNgenres, we tried p ∈ {2, 5} for the aggregator ME_p of Axiom (4). Notice that lim_{p→∞} ME_p(u_1, ..., u_n) = min{u_1, ..., u_n}. Intuitively, p offers flexibility to account for outliers in the data: the higher the p, the more the model focuses on the outliers. For all the models, the latent factors U and I, for users and items respectively, have been randomly initialized using the Glorot initialization, while the biases have been initialized with values sampled from a normal distribution with zero mean and unit variance. All the models have been trained for 200 epochs using the Adam optimizer with a learning rate of 0.001. For each training run, we used early stopping to halt the learning if no improvement on the validation metric (i.e., hit@10) was found for 20 epochs.
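The limit claim about ME_p can be checked numerically (a small sketch; the truth degrees are made up):

```python
# Numerical check of lim_{p->inf} ME_p(u_1,...,u_n) = min(u_i): larger p makes
# the aggregator focus on the worst-satisfied (outlier) truth values.
def me_p(values, p):
    n = len(values)
    return 1.0 - (sum((1.0 - v) ** p for v in values) / n) ** (1.0 / p)

truths = [0.2, 0.8, 0.9]
approx = [me_p(truths, p) for p in (2, 5, 50, 500)]
# approx decreases toward min(truths) = 0.2 as p grows
```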
6 Results
A comparison between MF, LTN, and LTNgenres is reported in Table 1. The table reports the performance of the three models on a variety of tasks with different sparsity.

Table 1. Test hit@10 and ndcg@10 averaged across 30 runs. Standard deviations are between brackets.

% of training ratings | Metric  | MF             | LTN            | LTNgenres
100%                  | hit@10  | 0.4499(0.0067) | 0.4636(0.0040) | 0.4642(0.0054)
                      | ndcg@10 | 0.1884(0.0028) | 0.1899(0.0014) | 0.1905(0.0022)
80%                   | hit@10  | 0.4459(0.0057) | 0.4585(0.0066) | 0.4616(0.0069)
                      | ndcg@10 | 0.1864(0.0023) | 0.1881(0.0023) | 0.1894(0.0025)
60%                   | hit@10  | 0.4274(0.0107) | 0.4475(0.0087) | 0.4487(0.0080)
                      | ndcg@10 | 0.1798(0.0039) | 0.1853(0.0034) | 0.1862(0.0031)
40%                   | hit@10  | 0.3983(0.0105) | 0.4087(0.0117) | 0.4322(0.0102)
                      | ndcg@10 | 0.1692(0.0047) | 0.1726(0.0052) | 0.1807(0.0049)
20%                   | hit@10  | 0.2956(0.0196) | 0.3764(0.0170) | 0.3761(0.0160)
                      | ndcg@10 | 0.1367(0.0093) | 0.1594(0.0069) | 0.1598(0.0068)
By looking at the table, it is possible to observe that LTN outperforms MF in all five tasks. In particular, on the dataset with 20% of the training ratings, the improvement is drastic, with a 27.33% increase in hit@10. We want to emphasize that the two models differ only in the loss function. This demonstrates that the loss based on the fuzzy logic semantics of LTN is beneficial in dealing with data sparsity. Then, with the addition of knowledge about the users' tastes across the movie genres, it is possible to further improve the results, as shown in the last column of the table. LTNgenres outperforms the other models on almost all the tasks. On the dataset with 20% of the ratings, the hit@10 of LTNgenres is slightly worse than that of LTN. This could be related to the quality of the training ratings sampled from the original dataset, as also suggested by the higher standard deviations associated with the sparser datasets.

6.1 Training Time
A comparison of the training times required by the models on the different datasets is presented in Table 2. The models have been trained for 200 epochs with a learning rate of 0.001, a batch size of 64, one latent factor (i.e., k = 1), without bias terms, and without early stopping; the other hyperparameters do not affect the training time. In particular, LTNgenres increases the time complexity considerably. This is due to Axiom (4), which has to be evaluated for each possible combination of users, items, and genres. This drawback can limit the applicability of LTNgenres to datasets with larger numbers of users and items, since more groundings of the formula have to be evaluated. Generally, when the number of
groundings becomes huge, Logic Tensor Networks have scalability issues. However, it is possible to mitigate this problem by designing logical axioms that make use of diagonal quantification: this special quantification considerably reduces the number of evaluated groundings by explicitly specifying them. Finally, by looking at the results in Sect. 6, it is possible to observe that the improvements of LTNgenres w.r.t. LTN are marginal. This suggests that LTN can implicitly learn user preferences among movie genres without direct supervision, and that LTNgenres can be avoided in this particular scenario, since the underlying MF model is powerful enough while also being more efficient. We believe that LTNgenres is best suited for extremely sparse datasets and cold-start scenarios. We leave this investigation for future work.

Table 2. Training time in seconds.

% of training ratings | MF    | LTN   | LTNgenres
100%                  | 26.99 | 50.87 | 247.30
80%                   | 22.52 | 37.79 | 213.62
60%                   | 18.31 | 28.97 | 145.86
40%                   | 15.60 | 20.09 |  97.43
20%                   |  8.12 | 10.68 |  50.85

7 Conclusions
In this paper, we proposed the use of Logic Tensor Networks to tackle the top-n recommendation task. We showed how, by design, LTN makes it easy to integrate side information into a recommendation model. We compared our LTN models with a standard MF model on a variety of tasks with different sparsity, showing the benefits provided by the background knowledge, especially when the task is challenging due to data scarcity.
References

1. Aiolli, F.: Efficient top-n recommendation for very large scale binary rated datasets. In: Proceedings of the 7th ACM Conference on Recommender Systems, RecSys 2013, pp. 273–280. Association for Computing Machinery, New York (2013). https://doi.org/10.1145/2507157.2507189
2. Badreddine, S., d'Avila Garcez, A., Serafini, L., Spranger, M.: Logic tensor networks. Artif. Intell. 303, 103649 (2022). https://doi.org/10.1016/j.artint.2021.103649
3. Besold, T.R., et al.: Neural-symbolic learning and reasoning: a survey and interpretation (2017). https://doi.org/10.48550/ARXIV.1711.03902
4. Bhargava, P., Phan, T., Zhou, J., Lee, J.: Who, what, when, and where: multi-dimensional collaborative recommendations using tensor factorization on sparse user-generated data. In: Proceedings of the 24th International Conference on World Wide Web, WWW 2015, pp. 130–140. International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE (2015). https://doi.org/10.1145/2736277.2741077
5. Brams, A.H., Jakobsen, A.L., Jendal, T.E., Lissandrini, M., Dolog, P., Hose, K.: MindReader: recommendation over knowledge graph entities with explicit user ratings. In: CIKM 2020, pp. 2975–2982. Association for Computing Machinery, New York (2020). https://doi.org/10.1145/3340531.3412759
6. Carraro, T., Polato, M., Aiolli, F.: A look inside the black-box: towards the interpretability of conditioned variational autoencoder for collaborative filtering. In: Adjunct Publication of the 28th ACM Conference on User Modeling, Adaptation and Personalization, UMAP 2020 Adjunct, pp. 233–236. Association for Computing Machinery, New York (2020). https://doi.org/10.1145/3386392.3399305
7. Carraro, T., Polato, M., Bergamin, L., Aiolli, F.: Conditioned variational autoencoder for top-n item recommendation. In: Pimenidis, E., Angelov, P., Jayne, C., Papaleonidas, A., Aydin, M. (eds.) ICANN 2022. LNCS, vol. 13530, pp. 785–796. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-15931-2_64
8. Catherine, R., Cohen, W.: Personalized recommendations using knowledge graphs: a probabilistic logic programming approach. In: Proceedings of the 10th ACM Conference on Recommender Systems, RecSys 2016, pp. 325–332. Association for Computing Machinery, New York (2016). https://doi.org/10.1145/2959100.2959131
9. Chen, H., Li, Y., Shi, S., Liu, S., Zhu, H., Zhang, Y.: Graph collaborative reasoning. In: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, WSDM 2022, pp. 75–84. Association for Computing Machinery, New York (2022). https://doi.org/10.1145/3488560.3498410
10. Chen, H., Shi, S., Li, Y., Zhang, Y.: Neural collaborative reasoning. In: Proceedings of the Web Conference 2021. ACM (2021). https://doi.org/10.1145/3442381.3449973
11. Daniele, A., Serafini, L.: Neural networks enhancement with logical knowledge (2020). https://doi.org/10.48550/ARXIV.2009.06087
12. Gridach, M.: Hybrid deep neural networks for recommender systems. Neurocomputing 413, 23–30 (2020). https://doi.org/10.1016/j.neucom.2020.06.025
13. He, X., Liao, L., Zhang, H., Nie, L., Hu, X., Chua, T.S.: Neural collaborative filtering. In: Proceedings of the 26th International Conference on World Wide Web, WWW 2017, pp. 173–182. International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE (2017). https://doi.org/10.1145/3038912.3052569
14. Hu, Y., Koren, Y., Volinsky, C.: Collaborative filtering for implicit feedback datasets. In: 2008 Eighth IEEE International Conference on Data Mining, pp. 263–272 (2008). https://doi.org/10.1109/ICDM.2008.22
15. Kimmig, A., Bach, S., Broecheler, M., Huang, B., Getoor, L., Mansinghka, V.: A short introduction to probabilistic soft logic, pp. 1–4 (2012). https://lirias.kuleuven.be/retrieve/204697
16. Koren, Y., Bell, R.: Advances in collaborative filtering. In: Ricci, F., Rokach, L., Shapira, B., Kantor, P. (eds.) Recommender Systems Handbook, pp. 145–186. Springer, Boston (2011). https://doi.org/10.1007/978-0-387-85820-3_5
17. Kouki, P., Fakhraei, S., Foulds, J., Eirinaki, M., Getoor, L.: HyPER: a flexible and extensible probabilistic framework for hybrid recommender systems. In: RecSys 2015, pp. 99–106. Association for Computing Machinery, New York (2015). https://doi.org/10.1145/2792838.2800175
18. van Krieken, E., Acar, E., van Harmelen, F.: Analyzing differentiable fuzzy logic operators. Artif. Intell. 302, 103602 (2022). https://doi.org/10.1016/j.artint.2021.103602
19. Liang, D., Krishnan, R.G., Hoffman, M.D., Jebara, T.: Variational autoencoders for collaborative filtering. In: Proceedings of the 2018 World Wide Web Conference, WWW 2018, pp. 689–698. International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE (2018). https://doi.org/10.1145/3178876.3186150
20. Ning, X., Karypis, G.: SLIM: sparse linear methods for top-n recommender systems. In: 2011 IEEE 11th International Conference on Data Mining, pp. 497–506 (2011). https://doi.org/10.1109/ICDM.2011.134
21. Polato, M., Aiolli, F.: Exploiting sparsity to build efficient kernel based collaborative filtering for top-n item recommendation. Neurocomputing 268, 17–26 (2017). https://doi.org/10.1016/j.neucom.2016.12.090
22. Polato, M., Aiolli, F.: Boolean kernels for collaborative filtering in top-n item recommendation. Neurocomputing 286, 214–225 (2018). https://doi.org/10.1016/j.neucom.2018.01.057
23. Raedt, L.D., Kersting, K.: Statistical relational learning. In: Sammut, C., Webb, G.I. (eds.) Encyclopedia of Machine Learning, pp. 916–924. Springer, Boston (2010). https://doi.org/10.1007/978-0-387-30164-8_786
24. Rendle, S.: Factorization machines. In: 2010 IEEE International Conference on Data Mining, pp. 995–1000 (2010). https://doi.org/10.1109/ICDM.2010.127
25. Ricci, F., Rokach, L., Shapira, B.: Recommender systems: introduction and challenges. In: Ricci, F., Rokach, L., Shapira, B. (eds.) Recommender Systems Handbook, pp. 1–34. Springer, Boston (2015). https://doi.org/10.1007/978-1-4899-7637-6_1
26. Shenbin, I., Alekseev, A., Tutubalina, E., Malykh, V., Nikolenko, S.I.: RecVAE: a new variational autoencoder for top-n recommendations with implicit feedback. In: Proceedings of the 13th International Conference on Web Search and Data Mining. ACM (2020). https://doi.org/10.1145/3336191.3371831
27. Steck, H.: Embarrassingly shallow autoencoders for sparse data. In: The World Wide Web Conference, WWW 2019, pp. 3251–3257. Association for Computing Machinery, New York (2019). https://doi.org/10.1145/3308558.3313710
28. Su, X., Khoshgoftaar, T.M.: A survey of collaborative filtering techniques. Adv. Artif. Intell. (2009). https://doi.org/10.1155/2009/421425
29. Xian, Y., et al.: CAFE: coarse-to-fine neural symbolic reasoning for explainable recommendation. In: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, CIKM 2020, pp. 1645–1654. Association for Computing Machinery, New York (2020). https://doi.org/10.1145/3340531.3412038
30. Xin, X., Chen, B., He, X., Wang, D., Ding, Y., Jose, J.: CFM: convolutional factorization machines for context-aware recommendation. In: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, pp. 3926–3932. International Joint Conferences on Artificial Intelligence Organization (2019). https://doi.org/10.24963/ijcai.2019/545
31. Zhang, Y., Chen, X.: Explainable recommendation: a survey and new perspectives. Found. Trends Inf. Retrieval 14(1), 1–101 (2020). https://doi.org/10.1561/1500000066
Multiagent Systems
A Review of the Muddy Children Problem

Yusuf Izmirlioglu(B), Loc Pham, Tran Cao Son, and Enrico Pontelli

New Mexico State University, Las Cruces, NM 88003, USA
{yizmir,locpham}@nmsu.edu, {tson,epontell}@cs.nmsu.edu
Abstract. The "Muddy Children" puzzle is a well-known problem in the multi-agent epistemic reasoning literature; however, it has not been studied in other fields of Artificial Intelligence. In this paper, we present the "Muddy Children" problem as a challenge to the Artificial Intelligence and Computer Science community. The interesting aspect of this problem is that agents have asymmetric and incomplete information, and each agent needs to reason about his own knowledge as well as the knowledge of other agents. The existing solutions use Kripke structures and possible-world semantics, which are not scalable to large problem sizes. Hence, we solicit alternative solution methodologies and explore the problem's relation to other problems in the applied sciences. We go over several variations of the Muddy Children puzzle and discuss the challenges for future research.

Keywords: Muddy children · Multi-agent systems · Epistemic reasoning · Analytical puzzles

1 Introduction
In this paper, we present the "Muddy Children" problem as a challenge to the general Artificial Intelligence and Computer Science community. The Muddy Children is a well-known puzzle in the multi-agent epistemic reasoning literature; however, it has not been studied in other fields of Artificial Intelligence. This problem was originally introduced by [2]; it also appears in the literature under different names, such as "Three Wise Men" [12] and "Coloured Hat" [14]. The interesting aspect of this problem is that agents have asymmetric and incomplete information, and they cannot directly disclose their knowledge to the other agents. Rather, an agent can only learn partial knowledge of the others through their actions. Thus, agents need to perform sophisticated reasoning about the available information to infer the actual state. In particular, this puzzle requires not only the reasoning of an agent about himself, but also reasoning about other agents. That is, an agent needs to put himself "in the shoes of others" to infer their knowledge about the world. There are several existing solutions to this puzzle using possible-world semantics and epistemic reasoning. These solutions employ Kripke structures as the representation of agents' knowledge, which have an exponential number of worlds in the number of agents. As such, they are not scalable to larger problem sizes. Furthermore, the existing methods cannot offer complete solutions to the variations of the problem which we present in this paper. Our objective in introducing the Muddy Children problem is to suggest research challenges and inspire new solution methodologies. We believe that the Muddy Children puzzle may have alternative or more efficient solutions using methodologies from Game Theory, Dynamic Programming, Constraint Programming, or other fields. In the rest of the paper, we first explain the problem and its formal definition, followed by possible variations, and then briefly go over the existing solutions.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
A. Dovier et al. (Eds.): AIxIA 2022, LNAI 13796, pp. 127–139, 2023. https://doi.org/10.1007/978-3-031-27181-6_9
2 The Muddy Children Problem
For understandability, let us first illustrate a particular instance of the Muddy Children problem with 3 children. We assume that all agents are truthful and that they are perfect reasoners. Each child can hear the announcements and observe the actions of the other agents. The children have played outside and then returned home together. Their father looks at the children and tells them that at least one of them has mud on his forehead. Each child can see the foreheads of the other children but not his own. Consequently, a child can observe whether the other children are muddy or not, but cannot identify his own status. The father asks the following question to the children: "Those of you who know whether you have mud on your own head or not, please raise your hand". No child raises his hand. The father asks the same question a second time, and again no child raises his hand. However, when the father asks the same question a third time, all children raise their hands. How is this outcome possible, and how did the children determine their status the third time? The resolution of the puzzle with 3 children is as follows. After the father's initial announcement, it is common knowledge that the number of muddy children is 1 or more. Similarly, at each round, a child's raising or not raising his hand is also a public announcement action which reveals his knowledge to the other agents. After child i executes the (not) raising hand action, it is common knowledge that i does (not) know whether he is muddy or not. At round 1, child 1 does not raise his hand, so it must be the case that at least one of child 2 or child 3 is muddy; otherwise, child 1 would infer that he is muddy, since there is at least one muddy child. Children 2 and 3 do not raise their hands either, hence at least one of child 1 or child 3 must be muddy, and at least one of child 1 or child 2 must be muddy. In sum, at the end of round 1, we understand that at least two children are muddy.
Still, no child knows whether he is muddy or not, since none of them raised their hands at round 2. Now suppose that exactly two children were muddy, say children 1 and 2. If this were the case, the two muddy children would raise their hands at round 2. The reasoning is as follows. At the end of round 1, child 1 would realize that he is muddy. Child 1 can observe that child 2 is muddy and child 3 is not muddy. He will think: "If I were not muddy, then child 2 would have raised his hand at round 1. Therefore I
must be muddy". The situation is symmetric for the other muddy child, 2, so he would also understand that he is muddy at the end of round 1. Therefore the number of muddy children cannot be two, and hence all children must be muddy. Until now, we have made an analysis as an outsider who only reads the narrative but does not know the status of the children beforehand. Let us now look at the puzzle from the viewpoint of the individual children, first examining the case of child 1. At the beginning of round 0, child 1 can observe that children 2 and 3 are muddy, hence his father's announcement action does not change his beliefs. At the beginning of round 1, child 1 does not know whether he is muddy or not, and does not raise his hand. Child 1 also knows that child 2 observes that child 3 is muddy and vice versa, so the other children did not raise their hands in the first round, as he expected. After round 1, child 1 still does not know whether he is muddy or not, so he does not raise his hand in round 2. The other children did not raise their hands either. Then, after the actions in round 2, child 1 performs the following reasoning: "If I were not muddy, then children 2 and 3 would have raised their hands in round 2, because in round 1 no one raised their hands. Assuming that I am not muddy, at the beginning of round 2, child 2 would think that, if he were not muddy, child 3 would have raised his hand in round 1. Hence child 2 would understand that he is muddy and raise his hand at round 2. Since this did not happen, I must be muddy!" Therefore, at the beginning of round 3, child 1 realizes that he is muddy and raises his hand at this round. Since all children are muddy and their actions are the same at every round, the analysis is symmetric for children 2 and 3.
3 Properties
In the previous section, we analyzed a specific instance of the problem with 3 children, all of them muddy. What would the outcome be if the number of children or of muddy children were different? We now provide some results about the game with different parameters.

Theorem 1. In the muddy children problem, suppose that there are n children and l of them are muddy, 1 ≤ l ≤ n. At round 0, the father announces that at least one child is muddy, and in the consecutive rounds he asks every child whether he knows whether he is muddy or not. Then, at round l, all muddy children will raise their hands, and at round l + 1 all non-muddy children (if any) will raise their hands.

Theorem 1 and its proof have been developed in [1,5]. Using a similar induction technique, we can establish the next theorem.

Theorem 2. In the muddy children problem, suppose that there are n children and l of them are muddy, 1 ≤ l ≤ n. If the father announces that at least q children are muddy at round 0, then all muddy children will raise their hands at round l − q + 1, and all non-muddy children (if any) will raise their hands at round l − q + 2.
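Theorems 1 and 2 can be checked by brute-force simulation on small instances. The sketch below is our own code, not taken from [1,5]: it simulates the game by publicly eliminating candidate configurations after each simultaneous round of announcements, and all function and variable names are our assumptions.

```python
from itertools import product

def knows(i, w, worlds):
    """Child i knows his own status in world w iff every candidate world that
    agrees with w on all the other children also agrees on child i."""
    vals = {v[i] for v in worlds
            if all(v[j] == w[j] for j in range(len(w)) if j != i)}
    return len(vals) == 1

def rounds_until_all_know(n, muddy, q):
    """Father announces 'at least q muddy' at round 0; return, for each child,
    the first round at which he raises his hand."""
    actual = tuple(1 if i < muddy else 0 for i in range(n))
    worlds = {w for w in product((0, 1), repeat=n) if sum(w) >= q}
    first, t = {}, 0
    while len(first) < n:
        t += 1
        raised = [knows(i, actual, worlds) for i in range(n)]
        for i, r in enumerate(raised):
            if r and i not in first:
                first[i] = t
        # every child observes who raised a hand; worlds in which the observed
        # pattern could not have occurred are publicly eliminated
        worlds = {w for w in worlds
                  if all(knows(i, w, worlds) == raised[i] for i in range(n))}
    return first
```

For every small n, 1 ≤ l ≤ n and 1 ≤ q ≤ l, the simulation returns round l − q + 1 for the muddy children and l − q + 2 for the others, matching Theorem 2; with n = l = 3 and q = 1 it reproduces the round-3 outcome of the narrative above.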
Y. Izmirlioglu et al.

4 Possible Worlds Semantics
This section provides background information about possible worlds semantics. A Kripke structure represents the agents’ own beliefs and their beliefs about other agents using possible worlds. Properties of the world are represented by binary-valued atomic propositions called fluents. A world is a complete interpretation of the fluents. Beliefs of the agents are encoded by accessibility relations between possible worlds. Let us now provide the formal definition of the Kripke structure and the semantics of belief formulae. A multi-agent domain ⟨AG, F⟩ includes a finite and nonempty set of agents AG and a finite set of fluents F encoding the properties of the world. Belief formulae over ⟨AG, F⟩ are defined by the BNF: ϕ ::= p | ¬ϕ | (ϕ ∧ ϕ) | (ϕ ∨ ϕ) | Bi ϕ | Eα ϕ | Cα ϕ where p ∈ F is a fluent, i ∈ AG and ∅ ≠ α ⊆ AG. Bi is the belief operator, and Bi ϕ stands for “agent i believes formula ϕ”. Eα ϕ and Cα ϕ denote group belief formulae whose semantics are defined below; intuitively, Eα ϕ indicates that all agents in α believe ϕ, while Cα ϕ indicates that ϕ is common belief among α. We refer to a belief formula which does not contain any occurrence of Bi, Eα, Cα as a fluent formula. Let LAG denote the set of belief formulae over ⟨AG, F⟩. To exemplify, in the muddy children domain, the fluents F = {m1, ..., mn} denote whether each child is muddy. The fluent formula m1 ∨ m2 ∨ ... ∨ mn states that at least one child is muddy, and the belief formula ¬B2 m1 ∧ ¬B2 ¬m1 states that agent 2 does not know whether agent 1 is muddy or not. A Kripke structure M is a tuple ⟨S, π, B1, ..., Bn⟩, where S is a set of worlds (denoted by M[S]), π : S → 2^F is a function that associates an interpretation of F to each element of S (denoted by M[π]), and, for i ∈ AG, Bi ⊆ S × S is a binary relation over S (denoted by M[i]).
For convenience, we will often draw a Kripke structure M as a directed labeled graph, whose set of labeled nodes represents S and whose set of labeled edges contains s →i t iff (s, t) ∈ Bi. The label of each node has two parts: the name of the world followed by the associated interpretation. For u ∈ S and a fluent formula ϕ, M[π](u) and M[π](u)(ϕ) denote the interpretation associated to u via π and the truth value of ϕ with respect to M[π](u). For a world u ∈ M[S], (M, u) is a pointed Kripke structure, also called a state hereafter. The accessibility relations of an agent in the Kripke structure capture the uncertainty in his beliefs: if the agent considers possible multiple worlds with different valuations, then his beliefs involve uncertainty. Satisfaction of belief formulae is defined over pointed Kripke structures [7]. Given a belief formula ϕ, a Kripke structure M = ⟨S, π, B1, ..., Bn⟩ and a state u ∈ S:
(i) (M, u) ⊨ p if p ∈ F and M[π](u) ⊨ p;
(ii) (M, u) ⊨ ¬ϕ if (M, u) ⊭ ϕ;
(iii) (M, u) ⊨ ϕ1 ∨ ϕ2 if (M, u) ⊨ ϕ1 or (M, u) ⊨ ϕ2;
(iv) (M, u) ⊨ ϕ1 ∧ ϕ2 if (M, u) ⊨ ϕ1 and (M, u) ⊨ ϕ2;
(v) (M, u) ⊨ Bi ϕ if (M, t) ⊨ ϕ for every t such that (u, t) ∈ Bi;
(vi) (M, u) ⊨ Eα ϕ if (M, u) ⊨ Bi ϕ for every i ∈ α;
(vii) (M, u) ⊨ Cα ϕ if (M, u) ⊨ Eα^k ϕ for every k ≥ 0, where Eα^0 ϕ = ϕ and Eα^{k+1} ϕ = Eα(Eα^k ϕ).
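Clauses (i)–(vii) can be prototyped directly on finite structures. The following sketch is our own encoding, not code from the paper: the tuple-based formula syntax and all names are assumptions, and common belief is unfolded as reachability through the agents’ accessibility relations.

```python
class Kripke:
    """A Kripke structure ⟨S, π, B1, ..., Bn⟩: worlds with valuations plus
    one accessibility relation per agent."""
    def __init__(self, worlds, rel):
        self.worlds = worlds   # π: world name -> {fluent: bool}
        self.rel = rel         # agent -> set of (u, v) pairs

def holds(M, u, phi):
    """Satisfaction of a belief formula at the pointed structure (M, u).
    A bare string is a fluent; compound formulas are nested tuples."""
    if isinstance(phi, str):                                   # clause (i)
        return M.worlds[u][phi]
    op = phi[0]
    if op == 'not':                                            # clause (ii)
        return not holds(M, u, phi[1])
    if op == 'or':                                             # clause (iii)
        return holds(M, u, phi[1]) or holds(M, u, phi[2])
    if op == 'and':                                            # clause (iv)
        return holds(M, u, phi[1]) and holds(M, u, phi[2])
    if op == 'B':                                              # clause (v)
        return all(holds(M, t, phi[2])
                   for (s, t) in M.rel[phi[1]] if s == u)
    if op == 'E':                                              # clause (vi)
        return all(holds(M, u, ('B', i, phi[2])) for i in phi[1])
    if op == 'C':                                              # clause (vii)
        # E^k for every k >= 0: phi must hold at u (k = 0) and at every
        # world reachable from u through the relations of the agents in alpha
        alpha, f = phi[1], phi[2]
        seen, frontier = set(), [u]
        while frontier:
            v = frontier.pop()
            if v in seen:
                continue
            seen.add(v)
            frontier += [t for i in alpha for (s, t) in M.rel[i] if s == v]
        return all(holds(M, v, f) for v in seen)
    raise ValueError(f'unknown operator: {op!r}')
```

For instance, on a two-world structure where agent 1 cannot distinguish a p-world from a ¬p-world while agent 2 can, `holds` reports B2 p true, B1 p false, and C{1,2} p false at the p-world.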
5 Formal Definition of the Muddy Children Problem
We describe the Muddy Children problem as ⟨AG, I, A⟩, where AG = {f, 1, .., n} is the set of agents, I is the set of children who are muddy, and A = {announce_atleast_one, raise_handi, not_raise_handi}, i ∈ {1, .., n}, is the set of possible actions. Here n is the number of children and l = |I| is the number of muddy children. The action announce_atleast_one denotes the father’s announcement that at least one child is muddy; raise_handi denotes child i raising his hand, and not_raise_handi denotes child i not raising his hand. For the particular problem instance in the introduction, n = 3 and I = {1, 2, 3}. The game proceeds as follows: at round 0, the father executes the action announce_atleast_one. Then, at each round j ≥ 1, every child i ∈ {1, .., n} executes exactly one of the actions raise_handi or not_raise_handi. The actions of the children are simultaneous. We assume that all agents announce truthfully. At each round, all agents observe the actions of the other agents and update their beliefs accordingly. The game ends at the round k when all children raise their hands.
6 Related Literature and Existing Solutions
The existing solutions to the Muddy Children puzzle employ possible worlds semantics and epistemic reasoning methods. The state of the world is represented by a Kripke structure with an external view: the Kripke structure shows the unique actual world, and the agents’ beliefs via accessibility relations to other possible worlds. The Kripke structure is updated by a state transition function upon an action. If the actions of the children are simultaneous, the state transition function treats them as a single action and updates the Kripke structure once. We now provide the details of the solutions in the literature.

6.1 Eliminating Possible Worlds
Let D = ⟨AG, I, A, F⟩ be the Muddy Children domain. We illustrate the solution of [5,10] with n = 3 children, of which l = 2 are muddy, where the father announces that at least q = 1 child is muddy; their method also works for other values of n, l, q. For this instance, AG = {f, 1, 2, 3}, I = {1, 2}, F = {m1, m2, m3}, and A = {atleast_one_muddy, know_muddyi, not_know_muddyi} for i ∈ {1, 2, 3}. The actions are modelled as epistemic actions, i.e., announcements of belief formulae. For example, the action know_muddyi announces the belief formula Bi mi ∨ Bi ¬mi.
The formal definition of the transition function is as follows. Suppose that the current state and the agents’ beliefs are represented by a pointed Kripke structure (M, s), where s is the actual world (i.e., the “real” state of affairs). Consider the occurrence of an action a which announces the belief formula γ to all agents, where every agent i observes the action occurrence. At the next state (M′, s), the set of worlds, their valuations, and the actual world remain the same, but each agent i revises his accessibility relations such that (u, v) ∈ M′[i] iff (u, v) ∈ M[i] and (M, v) ⊨ γ. The Kripke structure showing the actual world and the beliefs of the agents at the beginning of the problem is depicted in Fig. 1(a). There are 2^3 = 8 possible worlds encoding the different combinations of m1, m2, m3. To make the figure easy to read, each world is represented by its valuation of the fluents; e.g., in the world 100, child 1 is muddy (i.e., m1 is true) but children 2 and 3 are not (i.e., m2, m3 are false). In the actual world, only children 1 and 2 are muddy (denoted by a double circle in the figure). The accessibility relations of the agents show the worlds that they consider possible, and hence the uncertainty in their beliefs. In the actual world, child 1 considers both 110 and 010 possible, since he cannot distinguish these two worlds based on his knowledge. Child 2 considers the worlds 110 and 100 possible. As another example, in the world 100, child 3 considers 100 and 101 possible. By the nature of the Kripke structure, the accessibility relations also capture the belief/knowledge of an agent about the beliefs of other agents (higher-order beliefs). According to the semantics of entailment explained in Sect. 4, in the actual world 110, child 1 believes that child 2 believes that child 3 is not muddy, child 1 believes that child 2 does not know whether he (child 2) is muddy or not, and child 1 believes that child 2 knows whether child 1 is muddy or not.
In reality, child 1 does not know his own status but he knows that child 2 knows the status of child 1.
Fig. 1. (a) The initial state (b) At the end of round 0
The method of [5,10] eliminates the accessibility relations to those worlds which do not satisfy the announced belief formula. After the father’s announcement, the children update their beliefs by removing their accessibility relations to the world 000, which does not satisfy the announced belief formula γ1 = m1 ∨ m2 ∨ m3. Namely, since all children hear the announcement, they stop considering this world possible. The updated Kripke structure representing the agents’ beliefs at the end of round 0 is shown in Fig. 1(b). After round 0, in the actual world, none of the children knows whether he is muddy or not, i.e., (M, 110) entails the belief formulae ¬B1 m1 ∧ ¬B1 ¬m1, ¬B2 m2 ∧ ¬B2 ¬m2, ¬B3 m3 ∧ ¬B3 ¬m3. Thus, at round 1, the children do not raise their hands. The children’s actions are simultaneous and all of them can observe each other’s actions. We can therefore consider the three simultaneous actions as a single epistemic action announcing the belief formula γ2 = (¬B1 m1 ∧ ¬B1 ¬m1) ∧ (¬B2 m2 ∧ ¬B2 ¬m2) ∧ (¬B3 m3 ∧ ¬B3 ¬m3). Upon this action, the agents update their beliefs of Fig. 1(b) by removing their accessibility relations to the worlds which do not satisfy γ2. Note that the world 100 does not satisfy child 1 not knowing his status, the world 010 does not satisfy child 2 not knowing his status, and the world 001 does not satisfy child 3 not knowing his status. Hence the agents remove their accessibility relations to the worlds 100, 010, 001. The new Kripke structure at the end of round 1 is depicted in Fig. 2(a). In the actual world of the updated structure, children 1 and 2 now know that they are muddy, while child 3 does not know whether he is muddy or not. In round 2, children 1 and 2 raise their hands but child 3 does not. Similarly to round 1, we can consider this as a single epistemic action which announces the belief formula γ3 = (B1 m1 ∨ B1 ¬m1) ∧ (B2 m2 ∨ B2 ¬m2) ∧ (¬B3 m3 ∧ ¬B3 ¬m3).
Since all children observe this action, each of them removes the edges to the worlds which do not satisfy γ3. The worlds 011, 111, 101 in Fig. 2(a) do not satisfy γ3, because in 011 and 111 child 1 does not know whether he is muddy or not, and in 101 child 2 does not know whether he is muddy or not. Consequently, the agents remove the edges to the worlds 011, 111, 101. The updated Kripke structure at the end of round 2 is shown in Fig. 2(b). Now child 3 also knows his status: he is not muddy. In fact, all children know the actual state of the world at the end of round 2. Therefore all children raise their hands at round 3 and the game ends.
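The edge-removal updates of rounds 0–2 can be reproduced in a few lines. This is our own sketch of the method of [5,10], not code from those works; the world naming, the helper names, and the relation encoding are assumptions.

```python
from itertools import product

# Worlds are named by their valuations, e.g. '110' means m1, m2 true, m3 false.
worlds = {''.join(w) for w in product('01', repeat=3)}

def sees_same(u, v, i):
    # child i cannot distinguish u from v iff they agree on all other children
    return all(u[j] == v[j] for j in range(3) if j != i)

# Initial accessibility relations: each child sees only the others' foreheads.
rel = {i: {(u, v) for u in worlds for v in worlds if sees_same(u, v, i)}
       for i in range(3)}

def knows_own(i, u, rel):
    """Child i knows his own status at world u iff every world he considers
    from u agrees on the value of m_i."""
    return len({v[i] for (s, v) in rel[i] if s == u}) == 1

def announce(rel, holds_at):
    """Public announcement: every agent drops edges to worlds where the
    announced formula does not hold."""
    return {i: {(u, v) for (u, v) in r if holds_at(v)} for i, r in rel.items()}

actual = '110'
rel = announce(rel, lambda v: '1' in v)    # round 0: "at least one is muddy"
rounds = 0
while not all(knows_own(i, actual, rel) for i in range(3)):
    rounds += 1
    raised = [knows_own(i, actual, rel) for i in range(3)]
    old = rel
    rel = announce(rel, lambda v: all(knows_own(i, v, old) == raised[i]
                                      for i in range(3)))
print(rounds)  # 2: all children know the actual world at the end of round 2
```

Here each round’s simultaneous actions are folded into a single public announcement of the joint “who knows” pattern, as described above.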
6.2 Logic Programming
Baral et al. [1] use Answer Set Programming (ASP) to solve the Muddy Children problem. They develop an ASP program, i.e. a set of logical rules, to encode the beliefs, actions and the state transition. The advantage of Answer Set Programming is that the state transition and the entailment of belief formulae can be computed by simple logical rules. In the ASP formulation, the possible worlds and the accessibility relations in the Kripke structure are represented by propositional atoms. The initial beliefs of the agents are given as an input to the ASP program and the initial state and the initial accessibility
Fig. 2. (a) At the end of round 1 (b) At the end of round 2
relations are nondeterministically generated so as to satisfy the given beliefs. In their model, the father tells the children that at least one of them is muddy at step 0. This action is encoded as an epistemic action which announces a belief formula. In step 1 and the subsequent odd-indexed steps, the father executes the ask action. In step 2 and the subsequent even-indexed steps, the children reply “Yes” or “No” simultaneously. Occurrences of the ask, Reply-Yes, and Reply-No actions are represented by the atoms occ(ask,T), occ(announce_k(A,true),T), and occ(announce_k(A,false),T), respectively, where A is the agent and T is the time step. The ask action does not change the beliefs of the agents, hence the same Kripke structure carries over to the next step. The announcement actions, however, alter the agents’ beliefs and change the Kripke structure. Their state transition works as follows. At every step, the entailment of belief formulae at each world is computed by a set of ASP rules. After the father or the children announce a belief formula, the worlds which do not satisfy the announced formula are identified, and the accessibility relations of the agents to these worlds are removed from the structure. Hence the children commonly observe the effect of every action and update their beliefs. The game continues until all children answer Yes. The authors give an example with 3 children (children 1 and 2 are muddy) and show that the ASP program yields an answer set in which all children respond Yes at step 6, as expected. They also prove a general proposition stating that if there are l muddy children, then the father must ask l questions before all children answer Yes.

6.3 Other Potential Solutions
Alternative solutions to the Muddy Children problem may be developed in the future using Game Theory, Mathematical Programming, or other fields. One potential approach is to model it as an incomplete-information game: agents’ strategies depend on the history of actions, and they update their beliefs accordingly. Another approach is Mathematical Programming. With constraint programming, we can impose constraints on the number of muddy children and on the possible configurations of the children; constraint rules can then eliminate some configurations based on the actions in the previous rounds. Dynamic programming can be used to memoize the possible configurations that agents consider at every round. The children’s actions are then computed from their beliefs, and some configurations can be eliminated at the next state.
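As a minimal illustration of the constraint-style elimination suggested above (our own sketch, with hypothetical names), the outsider’s counting argument from Sect. 2 can be phrased as three constraints over the 2^3 candidate configurations:

```python
from itertools import product

def eliminate(configs, constraints):
    """Constraint-style elimination: keep only the configurations of muddy
    children that satisfy every announced or inferred constraint."""
    for ok in constraints:
        configs = {c for c in configs if ok(c)}
    return configs

n = 3
configs = set(product((0, 1), repeat=n))
survivors = eliminate(configs, [
    lambda c: sum(c) >= 1,   # father: "at least one is muddy"
    lambda c: sum(c) != 1,   # nobody knew at round 1 => not exactly one muddy
    lambda c: sum(c) != 2,   # nobody knew at round 2 => not exactly two muddy
])
print(survivors)  # {(1, 1, 1)}: only the all-muddy configuration survives
```

Each constraint after the first encodes the inference that, had fewer children been muddy, someone would already have raised a hand; the filter leaves only the all-muddy configuration.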
7 Variations of the Muddy Children Puzzle
There are several variations of the muddy children puzzle with respect to the father’s announcement, the order of the children’s announcements, the abilities of the children, and the mistake factor. Father’s Announcement: In one variation, also discussed in [8], the father can make an announcement of the form “Q of you have mud on their foreheads.”, where Q can be substituted by quantifiers such as “At least q”, “At most q”, “Exactly q”, “An even number”, etc. We assume that the father can see the foreheads of all children and always tells the truth. Order of Children: In the original formulation, at every round the actions of the children are simultaneous. Alternatively, the children can take actions in a sequential manner; this is equivalent to a single child taking an action at every round. The children can take actions in a predetermined fixed order (i.e., a permutation of (1, ..., n)) or in a random order. When child i makes his announcement action, the other children update their beliefs, and this process goes on with the next child in the sequence. We represent the order of the children by O. If the actions of the children are sequential, O is a permutation of (1, 2, ..., n); if their actions are simultaneous, O = ∅. Agent Abilities: We can also imagine an alternative scenario where some children lack a subset of the sensing or action abilities. For example, some children cannot see the foreheads of other children and/or cannot observe when other children raise their hands. Moreover, some children may not be able to raise their own hands. Let W = (X, Y, Z) denote the abilities of the children, where X, Y, Z are the sets containing the indices of the children who cannot see the foreheads of others, the children who cannot observe the actions of other children, and the children who cannot raise their hands, respectively. Note that a child may lack multiple abilities (i.e., the three sets may not be disjoint).
We assume that sensing or action abilities of children are common knowledge among all agents. The children who cannot see the foreheads of other children will consider the actions of other children to update their beliefs and reach the actual state. The children who cannot observe the actions of other children still know the number of rounds of the problem from the father’s announcements, and hence may infer the number of muddy children. Consequently, each child needs to take into account the abilities of others while reasoning and updating his beliefs.
Rationalizability and Mistakes: Another feasible case is that not all children are perfect reasoners. Some of them are boundedly rational and can sometimes make mistakes in their reasoning. These agents are not able to process all available information, and therefore their beliefs might differ from those of a perfectly reasoning agent. Hence the announcement actions of these children might be incorrect. If the identities of the boundedly rational children are common knowledge, this case can be handled simply by disregarding their actions. In another case, the perfectly rational children commonly know that there are exactly b (or at least b, or at most b) boundedly rational children, but do not know their identities; the children then need to perform more sophisticated reasoning to resolve this case. We denote the information about boundedly rational children by U. In the former case, U is the set of indices of the boundedly rational children; in the latter case, U = [b, b]. Considering all the above variations, we describe the general Muddy Children problem by D = ⟨AG, I, A, O, W, U⟩. The set of actions A includes the father’s various announcement actions with different cardinality Q. Namely, A = {number_muddyQ, know_muddyi, not_know_muddyi}, where i ∈ {1, .., n} and Q is an identifier like “at least q”, “odd”, “prime”. The Active Muddy Child Problem: The Active Muddy Child [10] is another version of the Muddy Children problem in which a particular child, with index k, needs to find out whether he is muddy or not by asking questions. There are n children, some of whom are non-muddy. The father makes an announcement action at round 0, as before. The active child asks an individual child at each round whether he is muddy or not. The child asked answers the question truthfully, and all agents listen to his response. The problem is to find the optimal strategy for the active child to achieve his goal in the smallest number of time steps.
Note that a strategy is a conditional plan which specifies the index of the next child to ask, depending on the history of the children’s responses.
8 Challenges
We now describe the current challenges in the Muddy Children problem and its variations, which need to be addressed in future research. Representation of the State: The initial state H of an epistemic problem is generally given as a set of literals describing the actual world and the agents’ beliefs (including their beliefs about other agents). For the muddy children instance in Sect. 2, the initial state is¹ H = {m1, m2, m3, C(¬B1 m1 ∧ ¬B1 ¬m1), C B1 m2, C B1 m3, ..., C B3 m1, C B3 m2, C(¬B3 m3 ∧ ¬B3 ¬m3)}. However, in epistemic reasoning, state transition functions and entailment of belief formulae are defined over pointed Kripke structures. Thus, we need to determine the Kripke structure(s) corresponding to the initial state of the epistemic problem; unfortunately, in general, the resulting Kripke structure may not be unique [13]. This is indeed a central issue in epistemic reasoning and planning
¹ When we omit the set of agents in the formulae Cα, we assume α = AG.
methods. The state should be represented as a set of belief formulae B, but it is instead represented by a Kripke structure (Mt, st), where t is the time point and st is the actual world. The state transition function Φ then applies to the current Kripke structure to obtain the next one, i.e., (Mt+1, st+1) = Φ((Mt, st), a). Some authors [9,11] have developed transition functions which operate on the set of beliefs: the belief set is revised in order to incorporate the incoming belief formulae. However, [9] allows only propositional common knowledge, and the method of [11] requires, in the action description, a prespecification of the agents’ beliefs at the next time step for each possible belief formula at the current time step. Ideally, agents’ beliefs at the next time step should arise endogenously as an outcome of the model, instead of being given as an input. Individual View: The existing solutions look at the problem from an external view. However, in reality, agents observe the world from their own private, individual perspective. Each child has his own Kripke structure representing his beliefs and does not know the actual world. An example of an individual Kripke structure of child 2 is shown in Fig. 3(a). The actual world is 100, but child 2 considers the two worlds 100 and 110 possible.
Fig. 3. (a) A private view (b) A contingency
As in the external view, removing the accessibility relations to the worlds which do not satisfy the announced formula also works for the individual view of the Muddy Children problem. However, this edge-removal method has been found problematic for other epistemic problems which involve multiple possible worlds [3,4]: after removing edges, an agent might end up considering no world possible. As an alternative approach, researchers apply the action sequence to each possible world in the initial structure as a contingency [3,4,6]. The intuition is that the agent treats each of those worlds as if it were the actual world
and examines the outcome of a sequence of actions. The state transition function then branches on the contingencies. However, this method might produce counterintuitive results for the Muddy Children problem, as in the following example: in the Kripke structure in Fig. 3(b), child 2 treats world 110 as if it were the actual world. He updates the structure upon every action using the same edge-removal method and obtains its final form in Fig. 4. However, this Kripke structure is not realistic for the individual view of child 2: he believes that the actual world is 110, but he also believes in another world, 100! Thus, how to solve the Muddy Children problem in a distributed setting, and how to define the state transition for an individual agent, are challenges for future research.
Fig. 4. The outcome if the actual world is 110
Variations: Whether the variations of Muddy Children can be solved by the existing methods or by other potential methods (discussed in Sect. 6.3) is an open problem. The cardinality of the muddy children in the father’s announcement can be represented by possible worlds in the Kripke structure or by constraint programming. The state transition function on the Kripke structure, or dynamic programming, can be modified to incorporate the agents’ abilities. If the number (or range) of boundedly rational children is known, this case can be handled by considering all possible candidate subsets of children as boundedly rational. A perfectly rational agent then revises the possible worlds he considers by pooling those candidate subsets of boundedly rational agents. Implementing these methods is a direction for future research.
9 Conclusion
The Muddy Children problem is famous in the epistemic reasoning literature but is not widely known in other fields of Artificial Intelligence, Computer Science, and Game Theory. The challenge of this problem is that it is a repeated game and requires sophisticated reasoning about other agents’ beliefs at every round. Besides, the agents cannot directly reveal their knowledge to other agents; they need to infer other agents’ knowledge from their actions. This paper has introduced the Muddy Children puzzle and its variations to the general AI and Computer Science community. We have provided some theorems about the outcome for some variations of the problem. We have illustrated the existing solutions of the puzzle, which use epistemic reasoning methods, and stressed that they are not scalable and cannot solve all variations. In our opinion, the Muddy Children problem may be related to other problems in AI and Game Theory, and may stimulate further research ideas and solution methodologies in other fields.
References

1. Baral, C., Gelfond, G., Son, T.C., Pontelli, E.: Using answer set programming to model multi-agent scenarios involving agents’ knowledge about other’s knowledge. In: Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems, vol. 1, pp. 259–266 (2010)
2. Barwise, J.: Scenes and other situations. J. Philos. 78(7), 369–397 (1981)
3. Bolander, T., Andersen, M.: Epistemic planning for single- and multi-agent systems. J. Appl. Non-Classical Logics 21(1) (2011)
4. Bolander, T.: A gentle introduction to epistemic planning: the DEL approach. arXiv preprint arXiv:1703.02192 (2017)
5. van Ditmarsch, H., van der Hoek, W., Kooi, B.: Dynamic Epistemic Logic, 1st edn. Springer, Heidelberg (2007)
6. Engesser, T., Bolander, T., Mattmüller, R., Nebel, B.: Cooperative epistemic multi-agent planning for implicit coordination. In: Ghosh, S., Ramanujam, R. (eds.) Proceedings of the Ninth Workshop on Methods for Modalities, M4M@ICLA 2017, Indian Institute of Technology, Kanpur, India, 8th to 10th January 2017. EPTCS, vol. 243, pp. 75–90 (2017)
7. Fagin, R., Halpern, J., Moses, Y., Vardi, M.: Reasoning About Knowledge. MIT Press, Cambridge (1995)
8. Gierasimczuk, N., Szymanik, J.: A note on a generalization of the muddy children puzzle. In: Proceedings of the 13th Conference on Theoretical Aspects of Rationality and Knowledge, pp. 257–264 (2011)
9. Huang, X., Fang, B., Wan, H., Liu, Y.: A general multi-agent epistemic planner based on higher-order belief change. In: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI 2017) (2017)
10. Kominis, F., Geffner, H.: Beliefs in multiagent planning: from one agent to many. In: Proceedings of the Twenty-Fifth International Conference on Automated Planning and Scheduling, ICAPS 2015, Jerusalem, Israel, 7–11 June 2015, pp. 147–155 (2015)
11. Liu, Q., Liu, Y.: Multi-agent epistemic planning with common knowledge. In: Lang, J. (ed.)
Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, Stockholm, Sweden, 13–19 July 2018, pp. 1912–1920. ijcai.org (2018)
12. McCarthy, J.: Formalization of two puzzles involving knowledge. Formalizing Common Sense: Papers by John McCarthy, pp. 158–166 (1990)
13. Son, T.C., Pontelli, E., Baral, C., Gelfond, G.: Finitary S5-theories. In: Fermé, E., Leite, J. (eds.) JELIA 2014. LNCS (LNAI), vol. 8761, pp. 239–252. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11558-0_17
14. van Tilburg, G.: Doe wel en zie niet om (do well and don’t look back). Katholieke Illustratie (Catholic Illustrated Journal) 90(32), 47 (1956)
Multiagent Cooperative Argumentation in Arg2P

Giuseppe Pisano¹(B), Roberta Calegari², and Andrea Omicini²

¹ Alma AI – Alma Mater Research Institute for Human-Centered Artificial Intelligence, Alma Mater Studiorum, Università di Bologna, Bologna, Italy
[emailprotected]
² Dipartimento di Informatica – Scienza e Ingegneria (DISI), Alma Mater Studiorum, Università di Bologna, Bologna, Italy
{roberta.calegari,andrea.omicini}@unibo.it
http://giuseppepisano.apice.unibo.it, http://robertacalegari.apice.unibo.it, http://andreaomicini.apice.unibo.it
Abstract. This work focuses on cooperative argumentation and conversation in multi-agent systems by introducing an extension of the Arg2P technology that enables parallelisation and distribution of the argumentation process. The computational model and the implementation underpinning the Arg2P technology are presented and discussed.

Keywords: Argumentation · Arg2P · Cooperative argumentation · Multi-agent systems · Cooperative reasoning

1 Introduction
Human-centred intelligent systems are densely populated by agents (either software or human) capable of understanding, arguing about, and reporting, via factual assertions and arguments, what is happening and what they could make happen [19]. A multi-agent system (MAS) based on argumentation, dialogue, and conversation can then work as the basis for designing human-centred intelligent systems: through argumentation, dialogue, and adherence to social justice, the behaviour of the intelligent system can be reached, shaped, and controlled [1,25], and conflict can be resolved by adopting a cooperative argumentation approach [10]. There, the purpose of multi-agent argumentative dialogues is to let agents reach an agreement on (i) the evaluation of goals and corresponding actions (or plans), and (ii) the adoption of a decentralised strategy for reaching a goal, by allowing agents to refine or revise other agents’ goals and defend their own proposals. In this scenario, intelligent behaviours are likely to become associated with the capability of arguing about situations as well as the current state and circumstances, by reaching a consensus on what is happening around and what is

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. A. Dovier et al. (Eds.): AIxIA 2022, LNAI 13796, pp. 140–153, 2023. https://doi.org/10.1007/978-3-031-27181-6_10
needed, and by triggering and orchestrating proper decentralised semantic conversations so as to determine how to collectively act in order to reach a future desirable state [8]. Thus, argumentation [14] and related technologies become a fundamental building block for the design of these systems, thanks to their potential to be an effective communication medium for heterogeneous intelligent agents while enabling a natural form of interaction between users and computational systems, towards explainability features. However, for argumentation tools to be able to meet the aforementioned expectations, a huge effort is required from a software engineering perspective. The last decades’ continuous improvement in the design and development of technologies for human-centred intelligent systems has not been matched by an analogous improvement of argumentation technologies, where the technological landscape is nowadays populated by very few systems, most of them mere prototypes [6]. A key problem in existing argumentation technology is that a widely-acknowledged, well-founded computational model for argumentation is currently missing: this makes it difficult to investigate the convergence and scalability of argumentation techniques in highly-distributed environments [10,18]. At the same time, the field has seen a constant flow of theoretical contributions [17,20]. Arg2P [9] is a logic-based technology offering a thorough instantiation of the ASPIC+ framework [21] for structured argumentation. The purpose of this paper is to effectively distribute the argumentation process (the evaluation of arguments) so as to enable the exploitation of Arg2P in the context of cooperative argumentation, according to the aforementioned perspective. Accordingly, the work is structured as follows. Section 2 contains a brief introduction to structured argumentation. Section 3 presents the core contribution of this work, i.e., the distribution of the argumentation process and its implementation. Finally, Sect. 4 concludes the work.
2 Background Notion: Structured Argumentation
Let us start by defining a generic structured argumentation framework. This introduction has two purposes: (i) to give the reader with no specific knowledge of the formal argumentation field an idea of its main concepts and notions; (ii) to serve as a basis for the analysis contained in subsequent sections. For a more complete introduction, we invite the reader to consult the vast amount of available literature on the topic [3,4]. We first introduce the notion of argumentation language. In the argumentation language, a literal is either an atomic proposition or its negation.

Notation 1. For any literal φ, its complement is denoted as φ̄. That is, if φ is a proposition p, then φ̄ = ¬p, while if φ is ¬p, then φ̄ is p.

Literals are brought into relation through rules.

Definition 1 (Rules). A defeasible rule r has the form: ρ : φ1, ..., φn ⇒ ψ with 0 ≤ n, and where
G. Pisano et al.
– ρ is the unique identifier for r;
– each φ1, . . ., φn, ψ is a literal;
– the set {φ1, . . ., φn} is denoted by Antecedent(r) and ψ by Consequent(r).

Defeasible rules – denoted by DefRules – are rules that can be defeated by contrary evidence. Pragmatically, a defeasible rule is used to represent defeasible knowledge, i.e., tentative information that may be used if nothing can be posed against it. For the sake of simplicity, we define non-axiom premises via defeasible rules with an empty Antecedent. A theory consists of a set of rules.

Definition 2 (Theory). A defeasible theory is a set Rules ⊆ DefRules.

Arguments are built from defeasible rules. Given a defeasible theory, arguments can be constructed by chaining rules from the theory, as specified in the definition below—cf. [21].

Definition 3 (Argument). An argument A constructed from a defeasible theory Rules is a finite construct of the form: A : A1, . . ., An ⇒r φ with 0 ≤ n, where
– r is the top rule of A, denoted by TopRule(A);
– A is the argument's unique identifier;
– Sub(A) denotes the entire set of subarguments of A, i.e., Sub(A) = Sub(A1) ∪ . . . ∪ Sub(An) ∪ {A};
– φ is the conclusion of the argument, denoted by Conc(A).

Arguments can be in conflict, according to two kinds of attack: rebutting and undercutting, here defined as in [21].

Definition 4 (Attack). An argument A attacks an argument B (i.e., A is an attacker of B) at B′ ∈ Sub(B) iff A undercuts or rebuts B (at B′), where:
– A undercuts B (at B′) iff Conc(A) is the complement of TopRule(B′);
– A rebuts B (at B′) iff Conc(A) = φ̄ and Conc(B′) = φ.

Then, an abstract argumentation framework can be defined by exploiting arguments and attacks.

Definition 5 (Argumentation Framework). An argumentation framework constructed from a defeasible theory T is a tuple ⟨A, ⇝⟩, where A is the set of all arguments constructed from T, and ⇝ is the attack relation over A. The corresponding argumentation graph is a directed graph whose arcs are attacks and nodes are arguments.

Notation 2.
Given an argumentation framework G = ⟨A, ⇝⟩, we write AG and ⇝G to denote the framework's arguments and attacks, respectively.

Given an argumentation framework, we leverage labelling semantics [2,14] to compute the sets of arguments that are accepted or rejected. Accordingly, each argument is associated with one label, which is either IN, OUT, or UND—respectively meaning that the argument is accepted, rejected, or undecided. Given a labelling for a framework, an IN, OUT, UND labelling for the statements claimed by the arguments in the graph can also be derived.
Multiagent Cooperative Argumentation in Arg2P
3 Distributed Argumentation in Arg2P
Arg2P is a logic-based technology, an easily deployable argumentation tool built to meet the requirements of intelligent software systems.1 It is built upon 2P-Kt—a reboot of the tuProlog [11,13] project offering a general, extensible, and interoperable ecosystem for logic programming and symbolic AI. Whereas a complete overview of the features of this specific implementation is out of the scope of this paper, we refer the reader to [7,9,24] for more details. In this section we focus on how to effectively distribute its argumentation process (evaluation of arguments). A first version of a message-based distributed argumentation algorithm is here discussed as the basic pillar of a computational model for cooperative argumentation in MAS. We ignore issues such as agent autonomy and MAS coordination artefacts [22,23], and focus instead on the distribution issues of cooperative argumentation, which enables agent dialogue and defeasible reasoning in MAS. The first issue to face in cooperative argumentation is the parallelisation of the argumentation process. Parallelisation needs to be tackled under two distinct perspectives: (i) the algorithmic perspective and (ii) the data perspective. Under the algorithmic perspective, we divide the argument evaluation (w.r.t. a given semantics) into smaller subtasks to be executed in parallel. Under the data perspective, instead, we split the data used by the algorithm—i.e., the defeasible argumentation theory. Action here is therefore at the data level, looking for possible data partitionings on which the argumentation process can be run in parallel. As a premise, we first introduce the algorithm that served as a starting point in the parallelisation of the argumentation process.
Among the available libraries, Arg2P includes a query-based mode, which allows for single-query evaluation according to the selected semantics.2 The feature is accessible in the default instance of the Arg2P framework through the predicate answerQuery(+Goal, Yes, No, Und), which requests the evaluation of the given Goal and gets as a result the set of facts matching the goal, distributed in the three sets IN, OUT, and UND. The algorithm used to evaluate a single claim (or query) according to grounded semantics is inspired by the DeLP dialectical trees evaluation [15]. Listing 1.1 shows the pseudocode – AnswerQuery(Goal) – for the answerQuery/4 predicate: given a claim (Goal) as input, the function first builds all the arguments sustaining that claim (buildSustainingArguments(Goal)), and then requires their evaluation via the Evaluate(A, Chain) function. In order to assess the status of A1, ..., An (acceptance or rejection), three conditions are evaluated: (Cond1) if a conflicting argument labelled as IN exists, then A1 is OUT; (Cond2) if a cycle in the route from the root to the leaves (Chain) exists, then the A1 argument is UND;
1 http://arg2p.apice.unibo.it.
2 At the time of writing, only grounded semantics is fully implemented.
Listing 1.1. Structured argumentation, Arg2P answer query algorithm for grounded semantics (pseudocode).

AnswerQuery(Goal):
  A1, ..., An = buildSustainingArguments(Goal)
  Res = ∅
  for A in A1, ..., An:
    Res = Res ∪ Evaluate(A, ∅)
  return Res.

Evaluate(A, Chain):
  if (∃ B ∈ Attacker(A): Evaluate(B, A ∪ Chain) = IN) return OUT
  if (∃ B ∈ Attacker(A): B ∈ Chain) return UND
  if (∃ B ∈ Attacker(A): Evaluate(B, A ∪ Chain) = UND) return UND
  return IN.
(Cond3) if a conflicting argument labelled as UND exists, then also the A1 argument is UND.
If none of the above conditions is met, then the argument can be accepted.

Example 1. Let us consider the following theory and the corresponding arguments (depicted in Fig. 1).

r1 : ⇒ a        A0 : ⇒r1 a
r2 : a ⇒ b      A1 : A0 ⇒r2 b
r3 : ⇒ ¬b       A2 : ⇒r3 ¬b
r4 : b ⇒ c      A3 : A1 ⇒r4 c

According to grounded semantics, A0 is IN – there are no arguments contending its claim or undercutting its inferences – whereas A1, A2 and A3 are UND—A1 and A2 have opposite conclusions and thus attack each other; the conflict is then propagated to the derived argument A3. Let us suppose we require the evaluation of claim b via the AnswerQuery(Goal) function in Listing 1.1. First, the arguments sustaining b are created, in this case only A1. Then the evaluation conditions on A1's attackers – only A2 in this case – are assessed. However, A2's admissibility depends, in turn, on A1—as shown in Fig. 1, A1 also attacks A2. There is a cycle in the graph (Cond2), and no other attacker matches (Cond1). As a consequence, A2 is UND, and thus so is A1 (Cond3). Accordingly, claim b is labelled UND, as expected.

Let us now consider the algorithm in Listing 1.1 to analyse the requirements and implications of its parallelisation. The algorithm structure is simple: the argument evaluation leverages the evaluation obtained from its attackers—i.e., the attackers are recursively evaluated using the same algorithm, and the result is exploited to determine the state of the target argument. Intuitively, a first point
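For illustration, the recursion of Listing 1.1 can be rendered in runnable form as follows. This is a Python sketch, not the actual Arg2P code: the attack relation is supplied explicitly as a dictionary, and a small guard skips attackers already on the evaluation chain (those are exactly the ones caught by the cycle check of Cond2), so that the recursion terminates.

```python
def evaluate(arg, attackers, chain=frozenset()):
    """Grounded labelling of a single argument, following Listing 1.1.

    attackers maps each argument to the set of its attackers;
    chain is the route from the root of the evaluation to arg.
    """
    atts = attackers.get(arg, set())
    # Recurse only on attackers not already on the chain; the ones on
    # the chain are handled by the cycle condition (Cond2) below.
    labels = {b: evaluate(b, attackers, chain | {arg})
              for b in atts if b not in chain}
    if any(l == "IN" for l in labels.values()):
        return "OUT"                      # (Cond1) an IN attacker exists
    if any(b in chain for b in atts):
        return "UND"                      # (Cond2) a cycle is detected
    if any(l == "UND" for l in labels.values()):
        return "UND"                      # (Cond3) an UND attacker exists
    return "IN"                           # no condition met: accepted

# The attack relation of Example 1: A1 and A2 rebut each other,
# and A2 attacks A3 at its subargument A1.
attacks = {"A1": {"A2"}, "A2": {"A1"}, "A3": {"A2"}}
print(evaluate("A0", attacks))  # → IN
print(evaluate("A1", attacks))  # → UND (the label of claim b)
```

Running the sketch on the graph of Example 1 yields IN for A0 and UND for A1, A2 and A3, matching the labelling discussed above.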
Fig. 1. Argumentation graph for arguments from Example 1, in which nodes are arguments and edges are attacks between arguments.
of parallelisation can be found in the search and evaluation of the Attackers. Indeed, every condition exploited by the algorithm – (Cond1), (Cond2), and (Cond3) – to evaluate an argument requires one and only one attacker to match the constraint. Those conditions directly suggest a parallelisation in the search and evaluation of the attackers. We could evaluate the arguments simultaneously under different branches, and success in one of the branches would lead to the success of the entire search. However, the algorithm exposes another point of parallelisation. The order in the evaluation of the conditions is essential for the soundness of the algorithm—as illustrated by the following example.

Example 2. Let us consider argument A and its two attackers B and C. Let it be the case in which we know B's and C's labelling: IN for the former and UND for the latter. If we do not respect the order dictated by the algorithm, A's labelling is either UND (Cond3) or OUT (Cond1). Of course, the first result would be in contrast with the original grounded semantics requirements, for which every argument having an IN attacker should be definitively OUT. Conversely, if we respect the evaluation order, A's labelling would be OUT in every scenario.

Although the evaluation order is strict, we can evaluate all the conditions simultaneously and consider the ordering only while providing the labelling for the target argument. In other words, the three conditions are evaluated in parallel, but the result is given according to the defined priorities. If (Cond1) is met, the argument is labelled as OUT. Conversely, even if (Cond2) or (Cond3) are met, one should first verify that (Cond1) does not hold. Only then can the argument be labelled as UND. Listing 1.2 contains the version of the algorithm taking into account both points of parallelisation. The three conditions – (Cond1), (Cond2) and (Cond3) – are evaluated at the same time.
Then the results of the three subtasks are combined to provide the ﬁnal solution according to the conditions’ priority. Of course, if we consider a scenario where only the ﬁrst condition (Cond1) is required to determine the status of the argument in input, the parallel evaluation of all three conditions would lead to a waste of computational resources. However, this problem is easily mitigated by evaluating the subtask results as soon as they are individually available—i.e. in the case we receive a positive result from a
Listing 1.2. Evaluate predicate with both parallel conditions evaluation and parallel attackers evaluation (pseudocode).

Evaluate(A, Chain):
  PARALLEL {
    Cond1 = PARALLEL { ∃ B ∈ Attacker(A): Evaluate(B, A ∪ Chain) = IN }
    Cond2 = PARALLEL { ∃ B ∈ Attacker(A): B ∈ Chain }
    Cond3 = PARALLEL { ∃ B ∈ Attacker(A): Evaluate(B, A ∪ Chain) = UND }
  }
  if (Cond1) return OUT
  if (Cond2 AND NOT Cond1) return UND
  if (Cond3 AND NOT Cond1) return UND
  if (NOT Cond1 AND NOT Cond2 AND NOT Cond3) return IN
single subtask, and it is enough to compute the argument status, we can cut the superfluous computational branches and return the final solution. In the first part of our analysis we focused on the parallelisation problem from a purely computational perspective, by discussing whether the evaluation task could be split into a group of subtasks to be executed simultaneously. However, there is another perspective to take into account when parallelising: the one concerning the data.

Example 3. For instance, let us consider a job computing the sum and the product of a set of numbers. Using the subtask approach, we could have two subroutines running in parallel, one computing the sum and the other computing the product of the numbers. However, leveraging the associative property of addition and multiplication, we can split the problem into a series of tasks computing both sum and product on a subset of the original data. Then the final result would be obtained by combining the sums and the products of the tasks' results.

Let us now apply the same principle to the argumentation task. We build arguments from a base theory according to the relations illustrated in Sect. 2. The logic theory is, for all intents and purposes, the input data of our algorithm (argumentation task). Now, the question is whether we can effectively split the data into subportions to be evaluated in parallel without affecting the global soundness of the original algorithm. Let us consider a splitting principle based on rule dependency – i.e., if two rules can be chained, they must stay together – and the algorithm in Listing 1.2. According to the algorithm, the search and evaluation of the attackers are performed in a distinct subtask (concurrent evaluation). Then, we can split the knowledge concerning attacked and attackers into separate sets, since the subtasks evaluating an attacker require only the knowledge to infer such an attacker—i.e., the dependency principle must be respected.
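The prioritised combination of Listing 1.2 can be sketched with Python futures. This is an illustrative sketch, not the Arg2P implementation: the three conditions are submitted to a thread pool simultaneously, and only the final decision applies the priority of Cond1 over Cond2 and Cond3. Note that, as discussed above, Cond1 and Cond3 here duplicate the recursive work; a real implementation would mitigate this by consuming results as they arrive and cancelling superfluous branches.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_parallel(arg, attackers, chain=frozenset()):
    """Parallel variant of Evaluate from Listing 1.2 (illustrative)."""
    atts = attackers.get(arg, set())
    ext = chain | {arg}
    # Each call gets its own small pool, so nested evaluations cannot
    # starve each other; a guard skips attackers already on the chain.
    with ThreadPoolExecutor(max_workers=3) as pool:
        f1 = pool.submit(lambda: any(
            b not in chain and evaluate_parallel(b, attackers, ext) == "IN"
            for b in atts))                                      # (Cond1)
        f2 = pool.submit(lambda: any(b in chain for b in atts))  # (Cond2)
        f3 = pool.submit(lambda: any(
            b not in chain and evaluate_parallel(b, attackers, ext) == "UND"
            for b in atts))                                      # (Cond3)
        cond1, cond2, cond3 = f1.result(), f2.result(), f3.result()
    # Combine according to the fixed priority: Cond1 dominates.
    if cond1:
        return "OUT"
    if cond2 or cond3:
        return "UND"
    return "IN"

attacks = {"A1": {"A2"}, "A2": {"A1"}, "A3": {"A2"}}
print(evaluate_parallel("A1", attacks))  # → UND, as in Example 1
```

The combination step makes the ordering explicit: even when Cond2 or Cond3 hold, the argument is labelled UND only if Cond1 does not.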
Indeed, there is no task that needs to know how to build both an argument and its attackers, since the search is delegated to another process. In
other words, a single subprocess in charge of evaluating an argument needs only the portion of the theory needed to infer the argument itself—i.e., the chainable rules concluding the target claim.

3.1 Computational Model: The Master-Slave Actor Model
We can now provide a complete and sound mechanism for the admissibility task in a fully concurrent way, exploiting the insights from Sect. 3 and applying them to an actor-based model [16]. In short, the actor model is based on a set of computational entities – the actors – communicating with each other through messages. The interaction between actors is the key to computation. Actors are purely reactive entities that, only in response to a message, can:
– create new actors;
– send messages to other actors;
– change their internal state through a predefined behaviour.
Actors work in a fully concurrent way – asynchronous communication and message passing are fundamental to this end – making the actor model well suited to concurrent applications and scenarios. We choose this model for its simplicity: it presents very few abstractions, making it easy to study both how to model a concurrent system and its properties. The final goal is to provide a sound model for agents' cooperative argumentation in MAS, enabling concurrent evaluation of the argumentation algorithms (focusing on distribution). The actor paradigm is a straightforward choice for an analysis of this sort. Since the actor model focuses on actors and their communication, the following design will review the structure and behaviour of the actors involved. Although a fully-distributed version of the model is possible, we choose to adopt a master-slave approach in order to simplify the functioning of the system as much as possible. Accordingly, two main sorts of actors are conceived in the system: master and worker. Master actors coordinate the knowledge-base distribution phase, while the workers hold a portion of the theory, concurring in the evaluation of a claim through their interaction. Let us start with the knowledge distribution. Since actors are reactive entities, in order to completely adhere to the actor model, the master knowledge base can be changed from outside the actor system.
If the master receives the order to add a new element to the theory, three possible scenarios can be configured:
1. none of the workers contains a compatible knowledge base (kb) – i.e., it is not possible to chain the new rule to the knowledge base – and, consequently, the master creates a new worker containing that portion of the theory;
2. one or more workers have a compatible knowledge base, and they add the element to their kb;
3. a set of workers possess overlapping knowledge bases – i.e., the union of the workers' knowledge bases can be used to create a unique inference chain – and, as a consequence, we merge their knowledge bases and destroy the extra workers.
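The three scenarios above can be sketched as follows. This is an illustrative Python rendition of the master's distribution logic, under a simplifying assumption: compatibility is approximated by literal overlap between the new rule and a worker's kb (a proxy for chainability), and rules are represented as (antecedents, consequent) pairs.

```python
def literals(rule):
    """All literals mentioned by a rule (antecedents plus consequent)."""
    ants, cons = rule
    return set(ants) | {cons}

def add_rule(rule, workers):
    """Master-side distribution of one rule over worker kbs (sketch).

    workers is a list of rule lists; compatibility is approximated by
    shared literals between the new rule and a worker's kb.
    """
    compatible = [w for w in workers
                  if any(literals(rule) & literals(r) for r in w)]
    if not compatible:
        workers.append([rule])                 # scenario 1: spawn a worker
    elif len(compatible) == 1:
        compatible[0].append(rule)             # scenario 2: extend its kb
    else:
        merged = [r for w in compatible for r in w] + [rule]
        for w in compatible:
            workers.remove(w)                  # scenario 3: merge workers,
        workers.append(merged)                 # destroying the extra ones
    return workers

# The theory of Example 1, inserted in the order r1, r3, r4, r2.
r1, r2 = ((), "a"), (("a",), "b")
r3, r4 = ((), "¬b"), (("b",), "c")
workers = []
for r in (r1, r3, r4, r2):
    add_rule(r, workers)
print([len(w) for w in workers])  # → [1, 3]: {r3} and {r1, r4, r2}
```

The final partition matches the one reached in Example 4: two workers, one holding r3 alone and one holding the chain r1, r4, r2.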
Iterating this procedure for all the elements of an input knowledge base, we obtain as a result a set of workers, each of them containing a portion of the theory in accordance with the dependency splitting principle. Once the knowledge has been correctly split between workers, we can proceed with the actor-based evaluation of an argument. Each actor is responsible for evaluating those arguments that can be built using its portion of the theory. When an actor receives an evaluation request, it first checks whether attackers exist, w.r.t. its portion of the knowledge base. Then, the actor can: (i) register the impossibility of evaluating the argument – only if a cycle through the evaluation chain is detected – or (ii) request the evaluation of the attacker arguments from all the other actors. In the latter case, the actor shall answer the original evaluation request only after receiving a response from the other actors. The conditions to match while evaluating an argument are the same as in the original algorithm in Listing 1.1:
– if one counterargument is admissible, we evaluate the argument as OUT;
– if any actor decides for the argument's undecidability, with none advancing its rejection, we mark the argument as UND;
– if all the actors agree that no counterarguments can be provided as acceptable, we evaluate the argument as IN.
Actors provide their suggestions on the state of the requested argument according to all the labels of their counterarguments. We can describe the interactions between the system's actors as a sequence diagram (Fig.
2) of messages exchanged between masters and workers, where:
– Add, sent from the master to a worker, through which the master sends the new theory member to be stored in the worker's kb; the decision on which is the right worker to send the data to is the responsibility of the master, which knows the entire state of the system and how data has been divided;
– RequireEvaluation, sent from outside the system to the master to require the evaluation of a claim;
– Eval, sent from the master to all workers to require the evaluation of a claim;
– FindAttacker, sent from a worker to the master to require the broadcasting of a request for counterarguments to all the available workers;
– ExpectedResponses, sent from the master to a worker to communicate the number of expected responses to a request for counterarguments;
– AttackerResponse, sent from a worker to a worker in response to a request for counterarguments; the message contains the state of the counterargument obtained through a new FindAttacker evaluation;
– EvalResponse, sent from workers to the master to communicate their decision on a claim; the decision is taken after all the AttackerResponse messages containing the state of possible counterarguments have been received;
– EvaluationResponse, sent from the master, containing the system's decision on the state of a claim.
Note that the Add and RequireEvaluation messages come from outside the actor system and start the distribution and evaluation process. This interaction
Fig. 2. Masterslave interaction for argument evaluation.
model implements both the parallelisation strategies described in Listing 1.2: the search for counterarguments is executed concurrently by all the worker nodes, as is the evaluation of the admissibility of arguments.

Example 4. Let us consider again the theory in Example 1. Let us assume a single MasterActor and the following order in the inclusion of the rules in the system: r1, r3, r4, r2.3 For the first three rules, the behaviour is the same: since the rules are not chainable, the master creates three distinct workers and sends a single rule to each of them via the Add message. We now have Worker 1, Worker 2, and Worker 3 with, respectively, r1, r3, and r4 in their knowledge bases. Then the inclusion of rule r2 is required, and both workers 1 and 3 turn out to have a chainable knowledge base. Rule r2 is, in fact, the missing link in the inference chain of r1 and r4. As a consequence, the Master stops the two workers, creates a new one, and then requires it to include rules r1, r4, r2 via three Add messages. At the end of the distribution phase, we have two workers, one containing r1, r2, r4, and the other just r3. The dependency principle is thus respected. Going on with the example, we require the evaluation of claim b via the RequireEvaluation message: so, the Master sends an Eval
3 The order of inclusion affects the steps required to converge, not the final state of the system.
message to all the actors. Worker 1 succeeds in building an argument (A1) and sends to all the other Workers – Worker 1 itself included – a FindAttacker message requiring the evaluation of attackers—the broadcasting of the message is done by the Master actor. The master also communicates the number of responses that are expected (ExpectedResponses message)—only two in this case. Worker 1 answers with an AttackerResponse communicating that there are no attacking arguments according to its knowledge, while Worker 2 sends back an AttackerResponse with an Und result. Indeed, Worker 2 is able to create a valid counterargument (A2), but a cycle is detected in the inference chain. According to the evaluation algorithm, upon receiving an Und response, Worker 1 can finally label A1 as UND and let the master know via an EvalResponse message.

3.2 Implementation: The Parallel Library
The model in Subsect. 3.1 has been implemented as a library – the Parallel library – for the Arg2P framework.4 The goal of the implementation is twofold: (i) providing a mechanism for the concurrent evaluation of a claim by a single Arg2P instance – actors in execution on a single machine can achieve real parallelisation thanks to multicore hardware architectures – and (ii) enabling cooperative argumentation by allowing different Arg2P instances to create a single actor system, thus sharing their knowledge base or their hardware resources. Among the available technologies for the implementation, we selected Akka5 [12]. Akka is an open-source middleware for programming concurrent and distributed actor systems based on the original actor model by Hewitt [16]. Built upon the JVM platform, the framework offers an easy way of deploying network-distributed systems observant of the original actor principles—e.g., reactivity, asynchronous communication, and absence of shared state between actors. All these features have made the Akka framework one of the reference technologies in the distributed landscape. The final implementation makes use of the Akka Clustering features to enable the collaboration of different Arg2P instances. In particular, we rely on Cluster Singletons6 to handle the Master actor lifecycle, and on Cluster Sharding7 for Worker nodes. The Parallel library makes available five directives:
– join(Port), requesting the creation of an actor system on the local machine, exposed on port Port;
– join(Port, Address), to join an actor system on the machine at the given Address, exposed on port Port;
– load, requesting the distribution of the rules contained in the knowledge base of the local instance between all the members of the actor system;
– reset, requesting the deletion of the data previously distributed in the actor system via the load directive;
4 Sources available at https://github.com/tuProlog/arg2p-kt.
5 https://akka.io/.
6 https://doc.akka.io/docs/akka/current/typed/cluster-singleton.html.
7 https://doc.akka.io/docs/akka/current/typed/cluster-sharding.html.
– solve(Goal, In, Out, Und), requesting the evaluation of the Goal claim by the actor system according to the procedure in Fig. 2; the results are the sets of facts matching the goal, distributed in the three sets IN, OUT, and UND.
All the application scenarios can be modelled by using the directives above. We achieve a parallel evaluation of a claim on a single Arg2P instance in three steps: (i) creating a local actor system (join(Port)), (ii) distributing the theory between local actors (load), and (iii) requiring the evaluation of a statement through the solve(Goal, In, Out, Und) directive. At the same time, we could have other Arg2P instances offering their hardware resources (join(Port, Address)) or also participating in the resolution if they share their own knowledge (load).
4 Conclusion
In this work, given the relevance of issues such as pervasiveness and interconnection in the current technological landscape, we address the problem of the distribution of the argumentation workload. We follow some insights from [5] and [22,23]. In [5] the first proposal of a tuProlog-based argumentation engine is presented, which exploits a dialogical argumentation mechanism—i.e., argumentation is performed across multiple processes proposing arguments and counterarguments. However, the distribution of the argumentation algorithm has not been addressed there. Conversely, in [22,23] the authors directly address the problem of enabling argumentation techniques in MAS. Nonetheless, their approach just depicts a general-purpose architectural solution for the multi-party argumentation problem in the MAS context, providing neither an actual technology nor a precise model for the distribution and parallelisation of the argumentation process. Overall, we believe that our approach is a step forward in the direction of a full argumentation-based MAS and, more in general, of the diffusion of argumentation theories as a solid foundation for the engineering of complex intelligent systems. Yet, many issues are still to be considered. We should provide a complete analysis of the computational properties of the presented model – e.g., correctness, completeness, termination – and also consider its relation with alternative distribution schemes (e.g., peer-to-peer). Moreover, an empirical evaluation of the performance of the system compared to traditional solvers should also be provided. Another topic of future investigation is the extension to different argumentation semantics. The main difference would be in the labelling conditions used to classify the arguments according to the different semantics. Moreover, a branching mechanism allowing the coexistence of multiple labellings should be devised in order to support semantics with multiple extensions.
However, most of the ideas behind the presented model should still remain applicable. Acknowledgements. This work was supported by the H2020 ERC Project “CompuLaw” (G.A. 833647).
References 1. Andrighetto, G., Governatori, G., Noriega, P., van der Torre, L.W.: Normative multiagent systems, Dagstuhl FollowUps, vol. 4. Schloss DagstuhlLeibnizZentrum fuer Informatik (2013). http://www.dagstuhl.de/dagpub/9783939897514 2. Baroni, P., Caminada, M., Giacomin, M.: An introduction to argumentation semantics. Knowl. Eng. Rev. 26(4), 365–410 (2011). https://doi.org/10.1017/ S0269888911000166 3. Baroni, P., Gabbay, D., Giacomin, M., van der Torre, L.: Handbook of Formal Argumentation. College Publications, London (2018). https://www. collegepublications.co.uk/handbooks/?00003 4. Besnard, P., et al.: Introduction to structured argumentation. Argument Comput. 5(1), 1–4 (2014). https://doi.org/10.1080/19462166.2013.869764 5. Bryant, D., Krause, P.J., Vreeswijk, G.: Argue tuProlog: a lightweight argumentation engine for agent applications. In: Computational Models of Argument. Frontiers in Artiﬁcial Intelligence and Applications, vol. 144, pp. 27–32. IOS Press (2006). https://ebooks.iospress.nl/publication/2929 6. Calegari, R., Contissa, G., Lagioia, F., Omicini, A., Sartor, G.: Defeasible systems in legal reasoning: a comparative assessment. In: Araszkiewicz, M., RodríguezDoncel, V. (eds.) Legal Knowledge and Information Systems, JURIX 2019: The Thirtysecond Annual Conference, Frontiers in Artiﬁcial Intelligence and Applications, vol. 322, pp. 169–174. IOS Press (2019). https://doi.org/10.3233/ FAIA190320 7. Calegari, R., Contissa, G., Pisano, G., Sartor, G., Sartor, G.: ArgtuProlog: a modular logic argumentation tool for PIL. In: Villata, S., Harašta, J., Křemen, P. (eds.) Legal Knowledge and Information Systems, JURIX 2020: The Thirtythird Annual Conference. Frontiers in Artiﬁcial Intelligence and Applications, vol. 334, pp. 265–268 (2020). https://doi.org/10.3233/FAIA200880 8. Calegari, R., Omicini, A., Sartor, G.: Computable law as argumentationbased MAS. In: Calegari, R., Ciatto, G., Denti, E., Omicini, A., Sartor, G. (eds.) 
WOA 2020–21st Workshop “From Objects to Agents”. CEUR Workshop Proceedings, vol. 2706, pp. 54–68. Sun SITE Central Europe, RWTH Aachen University, Aachen, Germany (2020). http://ceurws.org/Vol2706/paper10.pdf, 21st Workshop “From Objects to Agents” (WOA 2020), Bologna, Italy, 14–16 September 2020. Proceedings 9. Calegari, R., Pisano, G., Omicini, A., Sartor, G.: Arg2P: an argumentation framework for explainable intelligent systems. J. Logic Comput. 32(2), 369–401 (2022). https://doi.org/10.1093/logcom/exab089, Special Issue from the 35th Italian Conference on Computational Logic (CILC 2020) 10. Carrera, Á., Iglesias, C.A.: A systematic review of argumentation techniques for multiagent systems research. Artif. Intell. Rev. 44(4), 509–535 (2015). https:// doi.org/10.1007/s1046201594359 11. Ciatto, G., Calegari, R., Omicini, A.: 2P KT: a logicbased ecosystem for symbolic AI. SoftwareX 16(100817), 1–7 (2021). https://doi.org/10.1016/j.softx.2021. 100817 12. Cossentino, M., Lopes, S., Nuzzo, A., Renda, G., Sabatucci, L.: A comparison of the basic principles and behavioural aspects of Akka, JaCaMo and Jade development frameworks. In: Proceedings of the 19th Workshop “From Objects to Agents”. CEUR Workshop Proceedings, vol. 2215, pp. 133–141. CEURWS.org (2018). http://ceurws.org/Vol2215/paper_21.pdf
Ethics by Design for Intelligent and Sustainable Adaptive Systems

Luca Squadrone, Danilo Croce, and Roberto Basili
Department of Enterprise Engineering, University of Rome Tor Vergata, Via del Politecnico 1, 00133 Rome, Italy
{croce,basili}@info.uniroma2.it
Abstract. AI systems are increasingly dependent on the data and information sources they are developed with. In particular, learning machines are highly exposed to undesirable behaviors due to biased and incomplete coverage of the training data. The autonomy exhibited by machines trained on low-quality data raises an ethical concern, as it may infringe on social rules and security constraints. In this paper, we extensively experiment with a learning framework, called Ethics by Design, which aims to ensure a supervised learning policy that can pursue both the satisfaction of ethical constraints and the optimization of task (i.e., business) accuracy. The results obtained on the considered tasks and datasets confirm the positive impact of the method in ensuring ethical compliance. This paves the way for a large set of industrial applications whose ethical dimension is critical to increasing trust in this technology.

Keywords: Ethical issues of AI · Ethics by design in machine learning · Bias in deep learning · Empirical evaluation of ethical AI systems
1 Introduction
Machine learning applications are experiencing exponential growth and are now being deployed in high-risk ethical scenarios, such as lending, hiring, or legal decision support [22]. The clear advantages of using machine learning algorithms include the ability to quickly and accurately analyze large amounts of data. However, this paves the way for algorithms to generate discriminatory predictions against individuals or social groups [1,2,6,23], as per the bias inherent in the way historical data are collected. Consider, for example, COMPAS, a system used as a support tool by judges to predict a defendant's risk of recidivism. African American defendants have
(The articles for [2] and [23] can be found, respectively, at https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing and https://www.aclu.org/blog/privacy-technology/surveillance-technologies/amazons-face-recognition-falsely-matched-28.)
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. A. Dovier et al. (Eds.): AIxIA 2022, LNAI 13796, pp. 154–167, 2023. https://doi.org/10.1007/978-3-031-27181-6_11
been found to be exposed to a higher risk of recidivism than Caucasian defendants due to an unbalanced representation in historical data. This is further evidenced by recent studies [1,2,6,12,16,23] which have shown how machine learning algorithms may emphasize human factors such as prejudices, clichés, and errors of assessment. Since the algorithms are based on mathematical and statistical principles, they cannot independently recognize the ethical values related to the fair treatment of races or genders. This negatively impacts the trust in this class of methods, especially in critical scenarios. Therefore, it becomes critically important that machines are somehow able to make data-driven decisions aligned with human values and expectations, in order to avoid the risk of dangerous drifts in terms of ethics and human values. The framework proposed in [21], while extending the generic applications of AI, focuses primarily on learning ethical behavior by numerical optimization, that is, through a deep neural model. The core idea is to model ethics as automated reasoning over formal descriptions of fair decisions, e.g., ontologies, while making it available during the learning stage. Note that this approach does not induce a set of ethical rules from a set of observable behaviors, but rather does the opposite: it takes for granted an explicit formulation of ethical principles (as done, for example, in earlier work [5,24]) and focuses on a form of ethical learning as external alignment (learning from others, [15]). It uses evidence inferred from an ethical ontology to guide model selection during the training process. The resulting deep neural network jointly models the functional and ethical conditions that characterize the underlying decision-making process.
In this way, the discovery of latent ethical knowledge (that is, information hidden in the data that is meaningful from an ethical perspective) is enabled and made available to the learning process. Instead of relying on simulation to proceed in ethical decisions [24], the adopted framework integrates the acquisition of high-quality inference abilities that simultaneously reflect ethical expectations. In other words, the learning machine is expected to select the “best decision” among those that are also ethically sustainable. In this work, we test the beneficial impact of the above Ethics by Design technology on five well-known datasets by (1) adopting ethical principles that allow the ethical encoding of the original instances into a space corresponding to ethical properties, and (2) reformulating the learning function to favor decisions that better balance operational (i.e., business) efficiency and ethical compliance. The proposed experiments adopt ethical principles in the form of task-specific ethical rules that constrain the learning algorithm through the definition of dedicated preferences, the so-called truth-makers, as in [21]. We measured the impact of the Ethics by Design approach by showing the effectiveness of the parameterization and “tweaking” of the ethical constraint weights.
(The study referred to by [12] is available at https://www.bloomberg.com/graphics/2016-amazon-same-day/. The code is made available at https://github.com/crux82/nnebd.)
As a result, we show that across all datasets, i.e., tasks, ethical conditions, and domains, a large improvement in ethical behavior (lower ethical risks) can be achieved at the cost of a small reduction in accuracy. In the remainder of the article, ethical issues in example-based machine learning approaches are first presented in Sect. 2. Section 3 summarizes the Ethics by Design approach, a neural architecture that applies ethical constraints during the training process. In Sect. 4, experimental results are reported. Finally, in Sect. 5 the conclusions are drawn.
2 Ethics in Inductive Decision Systems

2.1 Ethics in Different Application Scenarios
Regardless of their effectiveness, ethical concerns are raised about the autonomy exhibited by machines trained on (possibly limited) data and their potential to violate social rules and security constraints. A first example involves Amazon's recruitment algorithm, used to automatically screen candidates' curricula during the selection process. As indicated by the Business Insider report, this algorithm was found to discriminate against women, particularly in professions requiring technological skills. This bias was introduced by the data (i.e., real curricula) used in training: these were mostly related to male candidates, so the algorithm overweighted the contribution of the candidates' gender-related characteristics. In [6], the output of facial recognition algorithms released to the market by three major tech companies showed a significant racial and gender bias: these methods had very low error rates (never more than 0.8%) in determining the sex of light-skinned men, but when applied to dark-skinned women the error rates rose to between 20% and 34%. In automatic recommendation, the analysis presented in [1] suggests that the algorithm adopted by Facebook for recommendation also applies racial and gender biases when offering ads to more than two billion users, based on their demographic information. Similar issues are surveyed in [23]. As a consequence, growing attention is paid to the analysis of “sensitive features” (e.g., gender, ethnicity, and age) to identify and limit undesirable effects of bias, discrimination, or prejudice, as surveyed in [8]. Several studies have shown that the definition and acquisition of a dataset affected by (any kind of) bias significantly affect the quality of a data-driven method trained on it, as discussed below. The COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) dataset discussed in [20] was released by ProPublica in 2016 based on the Broward County data.
It assigns people a recidivism risk score computed from the defendant's responses to the COMPAS screening survey. This dataset is generally used to train machine learning algorithms that predict
(The Business Insider report is available at www.businessinsider.com/amazon-built-ai-to-hire-people-discriminated-against-women-2018-10; the COMPAS data at https://github.com/propublica/compas-analysis.)
if an individual will be arrested again within two years after the first arrest. According to ProPublica's analysis [2], African Americans are more likely than Caucasians to be mislabeled as being at higher risk. The German credit dataset is defined to represent bank account holders and is used in automatic risk assessment prediction, that is, to determine whether or not it is risky to extend credit to a person. The potential ethical risk of deriving a data-driven model that makes it difficult to lend to women, youth, or foreign workers is generally discussed, as in [20]. The Adult dataset was derived from U.S. Census data in 1994. It includes attributes describing social information about registered citizens (in terms of age, race, sex, or marital status) and is generally used to determine whether a person's annual income exceeds 50,000 US dollars. As discussed in [20], this dataset is subject to bias, as automatic classifiers generally overweight information about the sex and race of the individuals being considered. The Default Credit Card Clients dataset describes customers' default payments and contains payment information, demographics, credit data, and payment history. The goal is to predict whether or not a client will default in the next month. However, as suggested by [20], women are penalized compared to men. The Law School dataset is defined after the survey conducted by the Law School Admission Council (LSAC) across 163 law schools in the United States. It contains the law school admission records and is generally used to predict whether a candidate would pass the bar exam or to predict a student's first-year average grade (FYA). As discussed in [20], this prediction is generally biased by features like the gender or the race of the candidates.

2.2 Computational Methods for Fair Inductive Systems
When training machine learning methods over potentially unfair datasets, much of the discussion focuses on various solutions to reduce “bias” in algorithms, such as modifying training data or diversifying data sources to reduce disparities between groups [10,11]. However, research such as [17] suggests that such approaches may fail when it is difficult to isolate protected attributes from the data. As extensively discussed in [8,19] and [18], methods to reduce bias effects fall into three categories: pre-processing, in-processing, and post-processing algorithms. Pre-processing methods manipulate the training dataset before training a model, under the assumption that changing the input data can prevent the emergence of undesirable effects. In-processing methods modify the learning machine
(Dataset sources: German credit: https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data); Adult: https://archive.ics.uci.edu/ml/datasets/adult; Default Credit Card Clients: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients; Law School: https://storage.googleapis.com/lawschool_dataset/bar_pass_prediction.csv. The study referred to by [17] can be found at https://hai.stanford.edu/sites/default/files/2021-12/Policy%20Brief%20-%20Risks%20of%20AI%20Race%20Detection%20in%20the%20Medical%20System.pdf.)
itself, while post-processing techniques modify the decisions made by a given machine. An example of a pre-processing approach is presented in [13], where a classification model is learned on biased training data but works impartially on future data: a ranking function is learned to detect biased data that must be “sanitized” in order to learn a non-discriminatory model. In [10] a different approach is defined to modify each attribute so that the marginal distributions, based on subsets of that attribute characterized by a given sensitive value, are all the same, without changing the target training labels. On the other hand, [7] and [11] present post-processing methods. Rather than changing the training pipeline, they propose post-processing frameworks for measuring and removing discrimination based on “protected” attributes. In-processing methods do not directly process input/output data, but instead extend a machine learning algorithm so that (in addition to the originally targeted task) it also considers one or more additional tasks reflecting some sort of ethical principle. Such an extension is generally based on regularization, constrained optimization, or adversarial learning techniques. In [3,4,9,14], the authors outline a general framework for empirical risk minimization under fairness constraints, such as the introduction of specific regularizers. Regardless of the type of approach used among those discussed so far, the goal is always to minimize the negative effect of sensitive variables during the training process. In practice, adding constraints generally results in a trade-off: a fairer algorithm at the cost of small drops in accuracy on the original problem. Inspired by [21] and [13], we are interested here in methods that allow controlling this trade-off between system performance (in terms of accuracy on the target task) and ethics.
We extensively investigate the method in [21], a neural framework that allows us (i) to directly control the trade-off between accuracy and ethical principles and (ii) to explicitly define these principles in terms of truth-makers, described hereafter.
2.3 Ethics, Principles and Truth-Makers
Ethical approaches to data-driven applications aim at minimizing the undesirable effects that learning processes may introduce in the acceptability of the resulting decisions. This “ethical acceptability” is often related to principles that establish norms over the decisions. Violations of principles correspond to an imbalance in the treatment of equal rights among individuals, i.e., ethical risks, or to missed opportunities for individuals or social groups, i.e., reduced benefits. The idea in [21] is to introduce the notion of a truth-maker as a model for representing ethical risks and benefits, exploiting them during the training process of a learning algorithm. We promote an ethical approach by assuming that reasoning over ethical ontologies is carried out through principles that apply as truth-makers. As an example, the application of an ethical principle such as “All minorities must be protected and ensured equal rights” to an inductive classification method C is based on training datasets where some social groups, e.g., women,
are somehow disadvantaged. Women might be discriminated against by C, such as when they take out a loan: an ethical constraint might in this case assume an ethical advantage when the loan is given to a woman. From a computational perspective, determining such an advantage requires some explicit “rules” that work as constraints for the learning process, without any manipulation of the training data. In this way, two aspects are optimized: on the one side, the quality of future decisions should reflect the past ones (as we usually do by optimizing accuracy) and, on the other, they should also satisfy ethical compliance, i.e., minimize ethical risks and maximize any potential ethical benefit. In the ProPublica case, as the COMPAS dataset suggests, African Americans are more often mislabeled as being at higher risk than Caucasians. An ethical principle that may be used against this potentially unfair situation could be expressed as “Safeguarding the minority of African Americans from discriminatory classification can be achieved by avoiding severe decisions when identifying them as repeat offenders.” This principle suggests a constraint favoring situations in which it is particularly beneficial to protect a minority, such as African Americans. At the same time, decisions about African Americans being repeat offenders are also risky, because of the potential social characterization of the community. In fact, the COMPAS dataset contains the variable race (expressing whether the individual is African American or Caucasian), which seems to suggest that African Americans are, on average, positively correlated with the repeat offender class: however, race should not be linked to such bias, and the following principle can be used to counterbalance this trend: “We expect there is a substantial benefit and low risk in classifying an African American as a non-repeat offender.”
Rules summarizing the above principles can be derived for “NON-repeat offender” decisions:

– the Benefit in categorizing an African American as a NON-repeat offender is high;
– the Risk in classifying an African American as a NON-repeat offender is low;

as well as for “repeat offender” decisions:

– the Benefit in classifying an African American as a repeat offender is very low;
– the Risk in classifying an African American as a repeat offender is very high.

The above rules are typical examples of truth-makers, acting as constraints on decisions about recidivism based on the race feature. Notice that the adjectives low, high, and very high are vague but can easily be translated into fuzzy sets, as subjective models of the two meta-variables expressing the Risk and Benefit of any individual decision, as formalized in [21].
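As an illustration only (not the authors' released implementation), the four rules above can be encoded as a small lookup from (feature value, decision) pairs to fuzzy (Benefit, Risk) levels; the five-level numerical scale follows the values fixed in [21], while the function and dictionary names are ours:

```python
# Illustrative sketch: COMPAS truth-maker rules as (race, decision) -> (Benefit, Risk),
# using the five-level fuzzy scale of [21].

LEVELS = {"very low": 0.1, "low": 0.25, "mild": 0.5, "high": 0.75, "very high": 0.9}

# Rules protecting the African American subpopulation; other cases are untouched.
compas_truthmaker = {
    ("african_american", "non_repeat_offender"): (LEVELS["high"], LEVELS["low"]),
    ("african_american", "repeat_offender"): (LEVELS["very low"], LEVELS["very high"]),
}

def ethical_profile(race, decision):
    """Return the (Benefit, Risk) pair for a decision; when no rule fires,
    fall back to a neutral (mild, mild) profile (a simplification of the
    paper's uniform-distribution fallback)."""
    return compas_truthmaker.get((race, decision), (LEVELS["mild"], LEVELS["mild"]))
```

In this toy encoding, labeling an African American individual as a repeat offender yields a very low benefit and a very high risk, exactly mirroring the textual rules.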
3 Formalizing Principles, Rules and Truth-Makers
The core of the adopted approach [21] is to model ethics via automated reasoning over formal descriptions, e.g., ontologies, while making it available during the
learning stage. We suggest the explicit formulation of ethical principles through truth-makers and let the resulting ethical evidence guide the model selection of the deep learning architecture. This network will jointly model causal as well as ethical conditions that characterize optimal decision-making. In this way, rules (i.e., truth-makers) are used to estimate risks and benefits connected to training cases; then the discovery of latent ethical knowledge, i.e., hidden information in the data that is meaningful under the ethical perspective, is carried out; finally, the latter evidence is made available when learning the target decision function. This framework results in a learning machine able to select the best decisions among those that are also ethically sustainable. As already exemplified, abstract ethical principles can be enforced through Ethical Rules that constrain individual features (e.g., gender or race) and determine the degree of ethicality of decisions. Ethical Rules usually depend on one or more features and assign values (or better, establish probability distributions) over the domains of some features. These rules are termed truth-makers (TM), as they account for the possibly uncertain ethical state of the world determined by decisions over individual instances. Truth-makers are thus rules of an ethical ontology EO that actively determine the ethical profile of a decision d(i) over an input instance i, e.g., an individual associated with the repeat offender class in the COMPAS dataset. In particular, given a pair ⟨i, d(i)⟩, a truth-maker tm will determine a probability distribution over the set of ethical benefit and ethical risk dimensions. For every tm, ethical dimension e_j(i), and possible ethical value v_k ∈ V, e.g., low or high risk, the following probability is defined:

P(e_j(i) = v_k | ⟨i, d(i)⟩, tm)   ∀j, ∀k = 1, ..., 5

which expresses the evaluation of the truth-maker tm on the representation of an instance i.
Here d(i) denotes the decision over the i-th instance and v_k is the k-th value of the j-th ethical dimension (constrained by the truth-maker). A truth-maker thus assigns probabilities to the ethical signature of an individual i for all possible combinations of business characteristics and decisions d(i); if no truth-maker is triggered by an instance, the uniform probability distribution u is used over the values v_k and the different ethical features, i.e., P(e_j(i) = v_k | ⟨i, d(i)⟩, tm) = 1/m, ∀j, k. Multiple truth-makers can contribute to a given ethical feature e_j(i) by individually biasing its overall probability P(e_j(i)). When all truth-makers are fired, the resulting ethical signature es(i) over an instance i and its decision d(i) consists, ∀j, k, of:

es_j(i) = Σ_{tm ∈ EO} P(e_j(i) = v_k | ⟨i, d(i)⟩, tm)
(Consistently with [21], for both benefits and risks we fixed m = 5 and limited values to the [0, 1] range. The following five labels are adopted: {“very low”, “low”, “mild”, “high”, “very high”}, corresponding to the numerical values v1 = 0.1, v2 = 0.25, v3 = 0.5, v4 = 0.75, and v5 = 0.9.)
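The signature computation can be sketched as follows. This is a simplified illustration of the mechanism (each fired truth-maker contributes a distribution over the m = 5 ethical values, silent truth-makers contribute the uniform 1/m fallback, and contributions are combined; here with a plain average, which is our assumption, not necessarily the paper's exact aggregation):

```python
# Sketch of the ethical-signature computation for one ethical dimension e_j.
# A "truth-maker" is modeled as a callable returning a length-M distribution
# over the ethical values, or None when it does not fire on the instance.

M = 5  # number of ethical values {very low, low, mild, high, very high}

def signature(instance, decision, truthmakers):
    """Combine per-truth-maker distributions into the signature es_j(i)."""
    dists = []
    for tm in truthmakers:
        d = tm(instance, decision)
        dists.append(d if d is not None else [1.0 / M] * M)  # uniform fallback
    # average the contributions value by value (still a valid distribution)
    return [sum(d[k] for d in dists) / len(dists) for k in range(M)]
```

For example, one truth-maker asserting “very high” with certainty, combined with a silent one, yields a distribution pulled halfway between that assertion and the uniform fallback.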
The Deep Network Architecture. The network consists of several components, each trained under different constraints expressed by specific loss functions (Fig. 1). In the first component, an Ethics Encoding network is defined, responsible for learning the combinations of input features that capture possible relationships between business observations and (desired or undesired) ethical consequences: the network acts as an encoder of ethical consequences (i.e., further features) for each instance i. A second component includes two networks, a Business Expert and an Ethics Expert, whose roles are, respectively, to independently estimate the distributions of suitable business decisions and to predict their ethical consequences. In the final component, an Ethical-aware Deep Neural Network (DNN) is responsible for estimating the joint probability of the possible triplets (decision, benefit, risk), which determines the risks and benefits associated with individual decisions for each instance.
Fig. 1. Network architecture proposed in [21]
This last component produces the final business decision of the network by applying a decision policy over risks and opportunities. Different policies are possible: from rejecting all decisions that are not ethically adequate (with respect to thresholds imposed on the probabilities of risks and benefits) to selecting other specific trade-offs between business accuracy and ethical compliance. Policies are designed as different loss functions used to train the specialized sub-networks, i.e., the Business Expert and the Ethics Expert. This architectural formulation thus makes it possible to emphasize the contribution of each triplet in the probability estimation through a factor β (the exponential tweaking factor in [21]): in this way, we can train the overall network by balancing business accuracy and ethical compliance. Emphasis on ethical consequences can be achieved by amplifying the ethics constraints, i.e., by tweaking β toward larger values. Notice that training data usually provide discrete (i.e., crisp) business decisions, which do not give rise to any uncertainty. However, these are not guaranteed to be ethical decisions. Introducing probability distributions over all possible outcomes and smoothing them towards the non-gold decisions allows us to disregard unethical cases and reserve some probability for decisions d_i different from the gold-standard ones.
162
L. Squadrone et al.
Several policies exist to derive the final decisions: in this work, the final decision is derived only from the probability triplets that respect the ethical constraints. For more details, refer to [21].
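One such policy can be sketched as follows (a minimal illustration under our own assumptions; the function name, triplet layout, and thresholds are ours and do not come from [21]):

```python
# Sketch of a decision policy: keep only the (decision, probability, benefit,
# risk) triplets that respect the ethical constraints, then pick the most
# probable decision among the survivors.

def ethical_policy(triplets, min_benefit=0.5, max_risk=0.5):
    """triplets: iterable of (decision, p_decision, benefit, risk).
    Falls back to the plain argmax when no triplet is ethically admissible."""
    admissible = [t for t in triplets
                  if t[2] >= min_benefit and t[3] <= max_risk]
    pool = admissible if admissible else list(triplets)
    return max(pool, key=lambda t: t[1])[0]
```

With such a policy, a less probable but ethically admissible decision can override the raw argmax of the business model, which is precisely the trade-off the tweaking factor modulates.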
4 Evaluating Ethical Compliance in Inductive Learning
The effectiveness of the investigated method is evaluated using five well-known datasets, always preserving the architecture across them, while defining task-specific truth-makers reflecting different ethical principles. To verify the effectiveness of the “tweaking” parameter in controlling the trade-off between the task-specific accuracy and the sensitivity to ethical principles, we systematically measure the system over a range β ∈ {0.001, 0.03, 0.05, 0.07, 0.1, 0.12, 0.14}, where higher values of β correspond to more influential ethical losses during training. Each dataset is divided into three parts: a test set (10%); the remaining 90% is split into a validation set (10%) and a training set (90%). To assess whether or not the decisions made by our model also respect the ethical ontology in use, a measure, namely Ethical Compliance (EthCompl), is computed as D+/(D+ + D−), where D+ represents the number of ethically compliant instances and D− the non-compliant ones. Finally, as in [25], we adopted disparate mistreatment to measure the change in bias. A decision-making process suffers from disparate mistreatment with respect to a given sensitive attribute (e.g., race) if the misclassification rates differ for groups of people having different values of that sensitive attribute (e.g., African Americans vs. Caucasians). The following equations

D_FPR = P(ŷ ≠ y | z = 0, y = −1) − P(ŷ ≠ y | z = 1, y = −1)
D_FNR = P(ŷ ≠ y | z = 0, y = 1) − P(ŷ ≠ y | z = 1, y = 1)

quantify the disparate mistreatment incurred by a classifier: the closer the values of D_FPR and D_FNR are to 0, the lower the degree of disparate mistreatment.
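The two evaluation measures can be sketched directly from their definitions (a self-contained illustration assuming binary labels y ∈ {−1, 1}, predictions ŷ, and a binary sensitive attribute z ∈ {0, 1}; function names are ours, not the paper's):

```python
# Sketch of the two evaluation measures used in this section.

def ethical_compliance(compliant, non_compliant):
    """EthCompl = D+ / (D+ + D-)."""
    return compliant / (compliant + non_compliant)

def disparate_mistreatment(y_true, y_pred, z):
    """Return (D_FPR, D_FNR): group-wise gaps in misclassification rates."""
    def err_rate(target_y, group):
        idx = [i for i, (y, g) in enumerate(zip(y_true, z))
               if y == target_y and g == group]
        if not idx:
            return 0.0
        return sum(y_pred[i] != y_true[i] for i in idx) / len(idx)
    d_fpr = err_rate(-1, 0) - err_rate(-1, 1)  # gap in false positive rates
    d_fnr = err_rate(1, 0) - err_rate(1, 1)    # gap in false negative rates
    return d_fpr, d_fnr
```

Values of D_FPR and D_FNR near 0 indicate that both groups are misclassified at comparable rates, i.e., low disparate mistreatment.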
4.1 Use Cases
We now describe the different investigated datasets, emphasizing the targeted sensitive features and the adopted truth-makers. The COMPAS Use Case. We selected the subset of instances completely defined in COMPAS, obtaining 6,908 samples. The target variable recid indicates whether a defendant committed a crime in the two years after being scored. The definition of the truth-maker focused on the sensitive attribute race: it assigns a high benefit to classifying African Americans as not recidivists and a high risk to classifying them as recidivists, while not acting
on the other subpopulations, such as Caucasians (no benefits and risks are assigned to the other subpopulations). The German Credit Use Case. This dataset contains 1,000 examples, each described by 21 attributes, where the target variable default indicates good or bad customers. The truth-maker focused on the sex attribute (derived from personal status and sex), assigning a high benefit to classifying females as good customers and a high risk to classifying them as bad customers, while not acting on males. The Adult Use Case. The Adult dataset consists of 48,842 instances, each described by 15 attributes. The target boolean variable y indicates whether the annual income of a person exceeds 50,000 US dollars. The truth-maker focused on the sex attribute, assigning a low benefit to classifying females as “under 50,000 US dollars” and a low risk to classifying them as “over 50,000 US dollars”, while not acting on males. The Default Credit Card Use Case. The dataset includes 30,000 customers described by 24 attributes. The target variable default indicates whether a customer will suffer a default payment situation in the next month (1) or not (0). The truth-maker focused on the sex attribute, assigning a high benefit to classifying males as “NOT default” and a high risk to classifying them as “default”, while not acting on females. The Law School Use Case. The Law School dataset has 26,553 instances, where the target variable pass bar indicates whether a person passes the bar exam. The truth-maker focused on the race attribute, assigning a low benefit to classifying other races as “NOT passed the exam” and a low risk to classifying them as “passed”, while not acting on whites.
4.2 Discussion of the Results
Table 1 reports the experimental results. Cross-validation has been applied to study the behavior of the Accuracy and EthCompl scores for different values of β. The first row for each dataset in the table shows the performance of a Multi-Layer Perceptron (MLP) whose loss does not depend on any ethical dimension of the problem. This is compared with the proposed ethical networks obtained with different settings of the β parameter, whose role is to increase the impact of the ethical constraints. The overall experimental outcome strongly confirms the ability of the network to learn ethical constraints. In fact, in every targeted dataset, the measure of ethical compliance EthCompl grows as β (which emphasizes the impact of the ethical component of the network on the loss) increases. At the same time, disparate mistreatment also seems to be reduced: this is shown by the last pair of columns in Table 1 (namely Disp. mistr.) and by the false positive rates on protected groups, which are comparable to the corresponding rates on non-protected groups (e.g., African Americans vs. other races in COMPAS). This is exactly the effect on unfair decisions that we expect. The fact that
Table 1. Results by varying the parameter β on the COMPAS, German Credit, Adult, Default and Law school datasets. Values express the average over 5 different runs.

COMPAS

| β     | Accuracy | Eth.Compl. | Afr. Am. FPR | Afr. Am. FNR | Others FPR | Others FNR | DFPR  | DFNR   |
|-------|----------|------------|--------------|--------------|------------|------------|-------|--------|
| (MLP) | 0.681    | 0.682      | 0.433        | 0.233        | 0.183      | 0.506      | 0.250 | −0.273 |
| 0.001 | 0.676    | 0.681      | 0.442        | 0.237        | 0.168      | 0.539      | 0.274 | −0.302 |
| 0.030 | 0.668    | 0.723      | 0.359        | 0.318        | 0.146      | 0.584      | 0.212 | −0.267 |
| 0.050 | 0.666    | 0.742      | 0.317        | 0.353        | 0.140      | 0.595      | 0.177 | −0.242 |
| 0.070 | 0.664    | 0.761      | 0.279        | 0.391        | 0.133      | 0.603      | 0.146 | −0.212 |
| 0.100 | 0.653    | 0.782      | 0.246        | 0.435        | 0.108      | 0.664      | 0.138 | −0.228 |
| 0.120 | 0.645    | 0.802      | 0.213        | 0.480        | 0.092      | 0.697      | 0.121 | −0.216 |
| 0.140 | 0.640    | 0.814      | 0.196        | 0.507        | 0.089      | 0.705      | 0.107 | −0.198 |

German Credit

| β     | Accuracy | Eth.Compl. | Male FPR | Male FNR | Female FPR | Female FNR | DFPR  | DFNR   |
|-------|----------|------------|----------|----------|------------|------------|-------|--------|
| (MLP) | 0.704    | 0.490      | 0.542    | 0.185    | 0.413      | 0.275      | 0.130 | −0.090 |
| 0.001 | 0.688    | 0.522      | 0.489    | 0.229    | 0.316      | 0.347      | 0.173 | −0.118 |
| 0.030 | 0.671    | 0.558      | 0.436    | 0.282    | 0.310      | 0.357      | 0.126 | −0.075 |
| 0.050 | 0.645    | 0.611      | 0.361    | 0.359    | 0.319      | 0.359      | 0.042 | 0.000  |
| 0.070 | 0.637    | 0.603      | 0.382    | 0.351    | 0.286      | 0.416      | 0.096 | −0.065 |
| 0.100 | 0.625    | 0.640      | 0.328    | 0.403    | 0.266      | 0.410      | 0.062 | −0.007 |
| 0.120 | 0.613    | 0.668      | 0.288    | 0.445    | 0.235      | 0.418      | 0.053 | 0.027  |
| 0.140 | 0.598    | 0.695      | 0.242    | 0.482    | 0.235      | 0.448      | 0.007 | 0.034  |

Adult

| β     | Accuracy | Eth.Compl. | Male FPR | Male FNR | Female FPR | Female FNR | DFPR  | DFNR   |
|-------|----------|------------|----------|----------|------------|------------|-------|--------|
| (MLP) | 0.852    | 0.822      | 0.106    | 0.364    | 0.029      | 0.459      | 0.076 | −0.095 |
| 0.001 | 0.853    | 0.831      | 0.094    | 0.385    | 0.023      | 0.499      | 0.072 | −0.114 |
| 0.030 | 0.852    | 0.847      | 0.078    | 0.427    | 0.020      | 0.520      | 0.058 | −0.092 |
| 0.050 | 0.849    | 0.859      | 0.068    | 0.462    | 0.018      | 0.554      | 0.050 | −0.092 |
| 0.070 | 0.847    | 0.871      | 0.058    | 0.498    | 0.015      | 0.570      | 0.042 | −0.073 |
| 0.100 | 0.841    | 0.893      | 0.040    | 0.564    | 0.012      | 0.618      | 0.028 | −0.053 |
| 0.150 | 0.824    | 0.924      | 0.022    | 0.676    | 0.008      | 0.712      | 0.014 | −0.036 |

Default

| β     | Accuracy | Eth.Compl. | Male FPR | Male FNR | Female FPR | Female FNR | DFPR   | DFNR   |
|-------|----------|------------|----------|----------|------------|------------|--------|--------|
| (MLP) | 0.818    | 0.945      | 0.063    | 0.620    | 0.049      | 0.640      | 0.015  | −0.020 |
| 0.001 | 0.818    | 0.947      | 0.060    | 0.632    | 0.045      | 0.649      | 0.014  | −0.017 |
| 0.030 | 0.818    | 0.951      | 0.053    | 0.654    | 0.044      | 0.653      | 0.009  | 0.001  |
| 0.050 | 0.819    | 0.954      | 0.049    | 0.669    | 0.043      | 0.655      | 0.006  | 0.014  |
| 0.070 | 0.818    | 0.957      | 0.045    | 0.687    | 0.043      | 0.657      | 0.002  | 0.030  |
| 0.100 | 0.817    | 0.962      | 0.038    | 0.718    | 0.040      | 0.668      | −0.002 | 0.050  |

Law school

| β     | Accuracy | Eth.Compl. | White FPR | White FNR | Others FPR | Others FNR | DFPR  | DFNR   |
|-------|----------|------------|-----------|-----------|------------|------------|-------|--------|
| (MLP) | 0.829    | 0.964      | 0.869     | 0.009     | 0.547      | 0.089      | 0.322 | −0.081 |
| 0.001 | 0.827    | 0.964      | 0.873     | 0.010     | 0.556      | 0.091      | 0.318 | −0.081 |
| 0.030 | 0.827    | 0.971      | 0.885     | 0.007     | 0.617      | 0.060      | 0.269 | −0.053 |
| 0.050 | 0.827    | 0.975      | 0.893     | 0.006     | 0.656      | 0.045      | 0.237 | −0.039 |
| 0.070 | 0.824    | 0.978      | 0.909     | 0.004     | 0.691      | 0.037      | 0.218 | −0.032 |
| 0.100 | 0.823    | 0.983      | 0.921     | 0.003     | 0.740      | 0.022      | 0.181 | −0.020 |
Ethics by Design for Intelligent and Sustainable Adaptive Systems
165
this effect is systematic across all the analyzed datasets is strong evidence that the proposed method is an effective and reliable in-process approach to fairness. These datasets in fact represent quite different tasks and domains, characterized by different sensitive features as well as by different data distributions. As already noticed, the proposed Ethics by Design inevitably faces some drop in (business) accuracy, in order to adjust for unfair training data (i.e., gold decisions to be neglected for the sake of fairness). However, such a small loss in accuracy corresponds to more balanced (i.e., ethical) decisions: for example, a 0.682 vs. 0.814 increase in ethical compliance on the COMPAS dataset corresponds to a small accuracy loss, 0.681 vs. 0.640. Tuning the ethical sensitivity of the method is thus effective: it allows identifying the optimal balance, as an operationally cost-effective compromise, between the business and the ethical performance of the system. The injection of ethical rules within neural learning seems to be effective in balancing biases that arise within datasets. Biased human judgments are the main cause of errors, as statistical surveys suggest. The ethical rules we have defined have reduced this distortion, leading to more ethically effective outcomes. Although not conclusive, this approach results in an improvement. The suggested framework allows the management of incoming data from an ethical perspective. When operational decisions are monitored across time, further adjustments through training are possible and incremental ethical optimization is enabled.
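The disparate mistreatment figures in Table 1 can be reproduced from per-group false positive and false negative rates. The following sketch (helper names are ours, not from the paper) computes FPR, FNR and the DFPR/DFNR gaps for a binary classifier:

```python
def rates(y_true, y_pred):
    """False positive rate and false negative rate of binary predictions."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fpr = fp / (fp + tn) if fp + tn else 0.0
    fnr = fn / (fn + tp) if fn + tp else 0.0
    return fpr, fnr

def disparate_mistreatment(y_true, y_pred, protected):
    """DFPR and DFNR: rate gaps between the protected and the reference group."""
    prot = [(t, p) for t, p, g in zip(y_true, y_pred, protected) if g]
    ref = [(t, p) for t, p, g in zip(y_true, y_pred, protected) if not g]
    fpr_p, fnr_p = rates(*zip(*prot))
    fpr_r, fnr_r = rates(*zip(*ref))
    return fpr_p - fpr_r, fnr_p - fnr_r
```

A perfectly fair classifier in this sense yields DFPR = DFNR = 0, which is the trend the β parameter pushes towards in Table 1.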
5 Conclusions
In this work, we experimented with the Ethics by Design framework discussed in [21] on quite different biased datasets, such as COMPAS. The tests confirm the method's ability to strongly foster fairness, ensuring responsibility and accountability of AI systems' behavior. For example, in COMPAS the result is much better decisions over African Americans at essentially no cost, i.e., with basically no change for any other social group. This result is systematically achieved across the different datasets adopted, at the expense of a more than acceptable loss of (business) performance, which in our view is a very significant result. This confirms the wide applicability of the Ethics-by-Design framework [21]. As a future extension, the automatic identification of sensitive features and of the strategies adopted by the model to propose truthmakers against the corresponding "unfair" decisions is under investigation. The possibility of cross-validating the role of different features through quantitative assessment (i.e., the fairness measures proposed) makes it possible to assume an autonomous behavior for auditing the system in search of ethical balance between social groups.
References

1. Ali, M., Sapiezynski, P., Bogen, M., Korolova, A., Mislove, A., Rieke, A.: Discrimination through optimization: how Facebook's ad delivery can lead to biased outcomes. Proc. ACM Hum.-Comput. Interact. 3(CSCW), 1–30 (2019)
2. Angwin, J., et al.: Machine bias (2016)
3. Bechavod, Y., Ligett, K.: Penalizing unfairness in binary classification. arXiv preprint arXiv:1707.00044 (2017)
4. Berk, R., et al.: A convex framework for fair regression. arXiv preprint arXiv:1706.02409 (2017)
5. Bonnemains, V., Saurel, C., Tessier, C.: Embedded ethics: some technical and ethical challenges. Ethics Inf. Technol. 20(1), 41–58 (2018). https://doi.org/10.1007/s10676-018-9444-x
6. Buolamwini, J., Gebru, T.: Gender shades: intersectional accuracy disparities in commercial gender classification. In: Conference on Fairness, Accountability and Transparency, pp. 77–91. PMLR (2018)
7. Calders, T., Verwer, S.: Three Naive Bayes approaches for discrimination-free classification. Data Min. Knowl. Disc. 21(2), 277–292 (2010)
8. Caton, S., Haas, C.: Fairness in machine learning: a survey. arXiv preprint arXiv:2010.04053 (2020)
9. Donini, M., Oneto, L., Ben-David, S., Shawe-Taylor, J., Pontil, M.: Empirical risk minimization under fairness constraints. arXiv preprint arXiv:1802.08626 (2018)
10. Feldman, M., Friedler, S.A., Moeller, J., Scheidegger, C., Venkatasubramanian, S.: Certifying and removing disparate impact. In: Proceedings of the KDD 2015, pp. 259–268 (2015)
11. Hardt, M., Price, E., Srebro, N.: Equality of opportunity in supervised learning. In: Advances in Neural Information Processing Systems, vol. 29, pp. 3315–3323 (2016)
12. Ingold, D., Soper, S.: Amazon doesn't consider the race of its customers. Should it? (2016)
13. Kamiran, F., Calders, T.: Classifying without discriminating. In: 2009 2nd International Conference on Computer, Control and Communication, pp. 1–6. IEEE (2009)
14. Kamishima, T., Akaho, S., Asoh, H., Sakuma, J.: Fairness-aware classifier with prejudice remover regularizer. In: Flach, P.A., De Bie, T., Cristianini, N. (eds.) ECML PKDD 2012. LNCS (LNAI), vol. 7524, pp. 35–50. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33486-3_3
15. Kleiman-Weiner, M., Saxe, R., Tenenbaum, J.B.: Learning a commonsense moral theory. Cognition 167, 107–123 (2017). Moral Learning
16. Lambrecht, A., Tucker, C.: Algorithmic bias? An empirical study of apparent gender-based discrimination in the display of STEM career ads. Manag. Sci. 65(7), 2966–2981 (2019)
17. Lungren, M.: Risks of AI race detection in the medical system (2021)
18. Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., Galstyan, A.: A survey on bias and fairness in machine learning. ACM Comput. Surv. (CSUR) 54(6), 1–35 (2021)
19. Pessach, D., Shmueli, E.: Algorithmic fairness. arXiv preprint arXiv:2001.09784 (2020)
20. Quy, T.L., Roy, A., Iosifidis, V., Ntoutsi, E.: A survey on datasets for fairness-aware machine learning. arXiv preprint arXiv:2110.00530 (2021)
21. Rossini, D., Croce, D., Mancini, S., Pellegrino, M., Basili, R.: Actionable ethics through neural learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 5537–5544 (2020)
22. Savani, Y., White, C., Govindarajulu, N.S.: Intra-processing methods for debiasing neural networks. In: Advances in Neural Information Processing Systems, vol. 33, pp. 2798–2810 (2020). arXiv preprint arXiv:2006.08564
23. Snow, J.: Amazon's face recognition falsely matched 28 members of Congress with mugshots (2018)
24. Vanderelst, D., Winfield, A.: An architecture for ethical robots inspired by the simulation theory of cognition. Cogn. Syst. Res. 48, 56–66 (2018). Cognitive Architectures for Artificial Minds
25. Zafar, M.B., Valera, I., Gomez Rodriguez, M., Gummadi, K.P.: Fairness beyond disparate treatment & disparate impact: learning classification without disparate mistreatment. In: Proceedings of the 26th International Conference on World Wide Web, pp. 1171–1180 (2017)
Automated Planning and Scheduling
Verification of Numeric Planning Problems Through Domain Dynamic Consistency

Enrico Scala(1), Thomas L. McCluskey(2), and Mauro Vallati(2)

(1) Università degli Studi di Brescia, Brescia, Italy
(2) University of Huddersfield, Huddersfield, UK
Abstract. Verification of the development of complex problem models is an open problem in real-world applications of automated planning. To facilitate the verification task, this paper introduces the notion of Domain Dynamic Consistency for planning problems expressed in PDDL. This notion is aimed at signalling suspicious inputs arising at the intersection between the abstract description of the model and its concrete instantiation. Together with the notion, we present an approximation-based approach devoted to automatically deciding when a PDDL numeric planning problem is not Domain Dynamic Consistent. The paper concludes with an example of the application of this notion and its related technique within an Urban Traffic Control scenario.
Keywords: Automated planning · Numeric planning · Verification

1 Introduction
AI Planning is an important research area of Artificial Intelligence that deals with the problem of finding a sequence of actions whose application in an initial state of the environment leads to a desired goal state [12]. Automated planning is exploited in many real-world applications, as it is a common capability requirement for intelligent autonomous agents [18]. Example application domains include drilling [11], smart grids [28], machine tool calibration [20], and mining [16]. Modelling AI planning problems is a challenging and error-prone task, as even small mistakes can compromise the validity of a representation. In real-world planning applications, where knowledge is acquired from different sources, the verification of the problem model is crucial. Modelling errors may be caused both by some erroneous input from the user and by some automatic tool that does not work properly. For instance, one can simply forget to mention the initial value of a variable, and this may indirectly cause some other variable to no longer be changeable. Syntactic errors are easily recognised, whilst more profound interactions among the variables are difficult to intercept.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
A. Dovier et al. (Eds.): AIxIA 2022, LNAI 13796, pp. 171–183, 2023. https://doi.org/10.1007/978-3-031-27181-6_12
172
E. Scala et al.
Verification of a problem model means demonstrating that it is a correct implementation of the abstract or conceptual model. One important aspect of this is checking that the implementation does not introduce errors or behaviours inconsistent with the conceptual model. To help address this problem, we propose the notion of Domain Dynamic Consistency (DDC) of a planning problem, and illustrate its use in problems expressed in the Planning Domain Definition Language (PDDL), the de-facto standard language used by the AI Planning community. Intuitively, we say that a planning problem is DDC if each variable that is present in the initial state is fluent in the same way in which it is fluent in the model of the domain dynamics. Consider a problem involving a robot that can move in a metric unidimensional space. Assume that variable x is used to model its position, and that the movement of such a robot is modelled through a single move-right PDDL action, whose precondition requires the fuel to be at least one unit. Further assume that the effects simply state that the position of the robot is increased by 1 unit anytime the action is applied. Now consider a state where the position of the robot is such that x = 1, and the fuel is equal to 1. The initial state, and therefore the planning problem, is DDC in that the only fluent variable that we are modelling can indeed be increased by 1 unit. Let us consider another situation. This time, assume a state has variable x set to 1 (as before) but the fuel is instead equal to 0. According to our definition, this state is not DDC in that the variable can never be increased. Although this does not represent an issue from a semantics perspective, in that it is perfectly possible given the domain and the problem instance, it is a somewhat suspicious situation for an initial state: why would a state like this one make any sense at all if we cannot even model the movement of the robot?
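The robot example can be made concrete in a few lines. The state and action encodings below are our own minimal sketch, not the paper's implementation:

```python
# State: mapping fluent -> value.  Action: numeric precondition + constant effects.
MOVE_RIGHT = {
    "pre": lambda s: s["fuel"] >= 1,   # fuel must be at least one unit
    "eff": {"x": +1},                  # position increases by 1; fuel is untouched
}

def successors(state, actions):
    """States reachable from `state` by applying one applicable action."""
    out = []
    for a in actions:
        if a["pre"](state):
            new = dict(state)
            for fluent, delta in a["eff"].items():
                new[fluent] += delta
            out.append(new)
    return out

ddc_state = {"x": 1, "fuel": 1}    # move-right is applicable: x can grow
stuck_state = {"x": 1, "fuel": 0}  # x can never change: suspicious
```

Under this encoding, `ddc_state` has a successor in which x grows, while `stuck_state` has none, which is exactly the suspicious situation DDC is meant to flag.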
Why did we bother modelling its position and its modification, if this position cannot actually be changed? Though the illustration above is simple, in reality, when initial states are complex and/or auto-generated, this property helps to uncover errors in the verification and validation process. Other works have looked into the problem of verification and validation of planning problems, e.g., [3,8,22,27]. Yet, to the best of our knowledge, none has investigated the problem through the lens of planning with numeric information [9] and without the need to express some additional explicit knowledge (e.g., through Linear Temporal Logic [6,21]). In this study we formally characterise the notion of DDC focusing on PDDL 2.1, the extension of PDDL that lets the user explicitly express numeric conditions and numeric changes. First, we discuss the general difficulty of the apparently simple problem of checking problems for DDC, observing that in general terms deciding when an initial state is DDC is as hard as solving a planning problem. To overcome this barrier, we present an approximation schema and show how it can be used to verify whether a planning problem is not DDC. Finally, we show an example of the use of DDC in strategy generation for Urban Traffic Control. The remainder of this paper is organised as follows. Section 2 provides the necessary background. Then, the Domain Dynamic Consistency notion is introduced, and Sect. 4 presents an approach to test the DDC property. The usefulness of the notion is then assessed using a case study presented in Sect. 5. Finally, conclusions are given.
2 Background

This section provides the necessary background on numeric planning, on the corresponding definition of a numeric planning problem, and on the additive interval-based relaxation.

2.1 Numeric Planning
We consider planning problems [12] as those that can be expressed in PDDL 2.1 level 2 [9]. These problems are called numeric planning problems [25], but in the rest of the paper we will simply refer to them as planning problems. Without loss of generality, we restrict our attention to the case with untyped objects and with only numeric fluents¹ (see below). A full treatment of the syntax and semantics of the language is beyond the scope of this work; the full details can be found in [9]. Next, we provide only those aspects necessary to understand our proposal. A planning problem consists of two elements: a domain model and a problem model. Following the PDDL terminology, the domain model contains the definition of the predicates, the numeric fluents, and a set of actions. In particular, numeric fluents indicate properties of lists of objects; mathematically, they define mappings from lists of objects to numeric values. The domain model defines them in an abstract way: it specifies a name, i.e., a string label for each such mapping, and a list of variables. Variables specify the order and the number of objects to be mapped. An action a is defined by means of a name (which we will often omit in the interest of space), a list of variables (called the parameters of the action), a precondition formula (i.e., pre(a)) and a set of effects (i.e., eff(a)). The precondition formula is a first-order logic proposition having equalities or inequalities involving numeric fluents as terms (e.g., (> (battery ?r1) 4) ∧ (> (battery ?r2) 5)). Each formula can make use of the standard logical connectives from propositional logic, i.e., ∧, ∨, ¬, together with arbitrary nesting of universal (∀) and existential (∃) quantifiers over the objects of the problem.
Effects are triples of the form ⟨op, x, ξ⟩, where the first term op ∈ {inc, dec, ass} is the modifier, and can either be an increase (inc), a decrease (dec) or an assignment (ass); the second term x is a numeric fluent; and the third term ξ is a numeric expression that, together with the modifier, determines the value of the numeric fluent if the action is applied. Each numeric fluent in the action structure can have its parameters expressed as concrete objects (i.e., actual objects of the problem to be solved) or variables. When all parameters are concrete objects, a numeric fluent is said to be ground.

¹ A Boolean fluent can be mapped into a {0, 1} numeric fluent.
Similarly, an action with all parameters and free variables substituted with concrete objects is said to be ground. This also requires eliminating all quantifiers in the preconditions using standard quantifier elimination techniques. In this work we focus on actions whose effects can increase, decrease or assign the value of a numeric fluent by means of a constant (e.g., (increase (battery ?r1) 5.4)). A domain model is a tuple ⟨X, A⟩ where X is the numeric fluents set as above, and A the set of actions. Let O be a set of objects and x a numeric fluent from X. The grounding of x is the set of numeric fluents each having the same name of x but with the list of variables replaced by concrete objects from some subset of O. The set of ground numeric fluents given O is denoted by X[O]. Finally, we use abs(x) to denote the abstraction of an object x into a variable, i.e., the ungrounded version of the numeric fluent. A state s gives a value to each numeric fluent in X[O]. The domain of each numeric fluent is the set of rational numbers plus the special term ⊥; ⊥ is used to state that a given numeric fluent is undefined. Let x ∈ X and s be a state; we denote with [x]s the value of numeric fluent x in state s. Then, we use succ(s) for the set of states reachable from s through actions from A. For more information on what a ground action is, and how actions can be grounded automatically, see [14] and [26]. A ground action is applicable in state s iff its precondition is satisfied in s. A precondition is satisfied iff, by assigning all numeric fluents their values as for state s, the evaluation of the formula returns true. The application of a ground action a in a state s generates a new state s′ = s[a] such that all numeric fluents conform with the effects of the action. For instance, if an action features a numeric effect ⟨inc, x, 1⟩ and the state is such that x = 1, then the successor state will be such that x = 2.
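The semantics just described — precondition satisfaction followed by modifier-driven effect application — can be sketched as follows (our own minimal ground representation; fluents are strings, comparisons and effects are triples):

```python
def holds(state, precondition):
    """Conjunction of ground comparisons (fluent, op, constant), PDDL-style."""
    ops = {">": lambda a, b: a > b, ">=": lambda a, b: a >= b,
           "<": lambda a, b: a < b, "<=": lambda a, b: a <= b,
           "=": lambda a, b: a == b}
    return all(ops[op](state[f], k) for f, op, k in precondition)

def apply_effects(state, effects):
    """Effects are triples (modifier, fluent, constant), modifier in {inc, dec, ass}."""
    s = dict(state)
    for mod, f, k in effects:
        if mod == "inc":
            s[f] += k
        elif mod == "dec":
            s[f] -= k
        else:  # "ass": plain assignment
            s[f] = k
    return s
```

For example, `apply_effects({"x": 1}, [("inc", "x", 1)])` realises the ⟨inc, x, 1⟩ effect from the text, producing the successor state with x = 2.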
A problem model is given by a set of objects, a state, called the initial state, and a goal. The goal is structured as the precondition of an action, with the difference that any component which is not quantified only involves ground numeric fluents. A problem model is formally expressed as a tuple ⟨O, I, G⟩. The combination of a domain model D and a problem model P is a planning problem P = ⟨D, P⟩. A plan for a planning problem is a sequence of actions τ such that τ can be iteratively applied starting from the initial state I, and the last produced state is one where the goal G is satisfied.

2.2 Problem Relaxation and Heuristics
A popular technique for finding plans for planning problems is to perform a search over the state space induced by the problem. In order to make such a search effective, planners usually employ heuristics devised directly from the description of the problem, and a very solid approach to make that happen is to extract such a heuristic from a proper relaxation of the problem itself [5]. State-space planners use these two facilities during search by avoiding the exploration of dead-end states, and by steering the search only towards the most promising paths. Heuristics that well approximate the cost to reach the goal can lead the
search to only explore a number of states linear in the length of the optimal path. The Additive Interval-Based Relaxation (AIBR) of a numeric planning problem is a relaxation specifically designed to support problems involving complex numeric expressions. As many other relaxations (e.g., [5,7,25]), the AIBR serves two purposes in state-space planners: the former is to prune states and the latter is to provide the basis for computing heuristic estimates [2,15,24]. Pruning is given by the ability of the AIBR to correctly identify when a state does not allow the planner to reach the goal. Heuristic estimates can be computed by finding concrete relaxed plans, that is, plans that solve the problem at a relaxed level. As hinted at above, the additive interval-based relaxation belongs to the family of frameworks that try to exploit as much as possible the structure of the problem expressed in some language, in our case PDDL. This means that the user can take advantage of induced heuristics without the need of providing them manually. The relaxation at the basis of the AIBR is grounded on a reinterpretation of the semantics of the numeric planning problem. Such a reinterpretation guarantees to over-approximate whether some goal or subgoal is reachable. Indeed, AIBR is able to correctly identify unsolvable problems with an algorithm that is polynomial in the size of the problem. It does so with the following expedients. First, under AIBR, a planning state is not a single valuation for each numeric fluent. Rather, each numeric fluent x is mapped into an interval (x−, x+) defining the minimum (i.e., x−) and the maximum (i.e., x+) value for x; this way, an AIBR relaxed planning state approximates a concrete state with a number of intervals. Each such interval approximates all values that can ever be attained by a single numeric fluent. Second, the AIBR changes the way satisfiability of a formula is evaluated.
Instead of operating with standard arithmetic, it uses interval analysis [19]. That is, given some state s, an inequality in some formula is evaluated using interval enclosures of the possible evaluations of the numeric fluents it encompasses. A generic propositional formula is then evaluated by combining the evaluated terms, recursively navigating the tree-shaped formula up to the root. Finally, whenever an action is applied in the AIBR, the result is given by the convex union of the interval for each variable associated with the state in which the action is applied, and the interval associated with the state obtained by applying the effects of the action. This way, the successor state monotonically accepts the values of the state in which the action is applied, together with the new values that can be obtained by the execution of the action. Because of this, all formulas that are satisfied before the execution of the action are also satisfied after its application. To make this process terminate after a finite number of iterations, the AIBR makes use of the notion of asymptotic supporters. Intuitively, each asymptotic supporter makes the effect of an action idempotent, therefore limiting the number of iterations needed to estimate the relaxed reachability of a condition. The AIBR is not the only heuristic seen in the literature. For instance, [25] defines subgoaling-based relaxations that work with a different principle. Albeit
such relaxations can provide more guidance, they are focused more on improving the performance of state-space planners. The AIBR, on the other hand, aims at handling general numeric planning problems, which is what we target in this paper.
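The interval machinery described above can be sketched in a few lines; the representation (fluents mapped to (lo, hi) pairs) and function names are ours, limited to constant inc/dec effects:

```python
def convex_union(a, b):
    """Smallest interval containing both intervals a and b."""
    return (min(a[0], b[0]), max(a[1], b[1]))

def shift(iv, k):
    """Interval enclosure of adding the constant k (an inc/dec effect)."""
    return (iv[0] + k, iv[1] + k)

def relaxed_apply(state, effects):
    """Monotone AIBR-style successor: each fluent keeps its old values
    and additionally gains those produced by the effect."""
    new = dict(state)
    for fluent, k in effects.items():
        new[fluent] = convex_union(state[fluent], shift(state[fluent], k))
    return new

def satisfiable_geq(state, fluent, k):
    """Interval evaluation of (>= fluent k): true if some enclosed value satisfies it."""
    return state[fluent][1] >= k
```

The monotonicity is visible directly: applying an increase effect widens the interval instead of moving it, so any condition satisfiable before the action remains satisfiable after it.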
3 Domain Dynamic Consistency
Modelling planning problems using abstract, parametrized actions (also known as lifted actions) is very convenient. Indeed, one may compactly encode several actual transitions by just declaring the types of the variables the actions depend on. However, the plans that are going to be executed are composed of ground actions only, that is, actions where all variables are substituted with concrete objects from some particular problem model. While the modelling of abstract actions makes things much more elegant, it may introduce some false expectations too. We argue that when one models an action at an abstract level, and some set of objects compatible with that action has most but not all of the object-relevant conditions in the action preconditions satisfiable, it is very likely that some modelling bug has occurred at the level of the problem formulation. And this may be related to the fact that a condition we were expecting to be satisfiable at some point is actually not satisfiable, because it does not follow the dynamics that we were expecting at an abstract level. To capture situations such as this one, we formalise the notion of Domain Dynamic Consistency. Roughly speaking, we say that a problem is domain dynamic consistent if and only if, whenever some fluent is expected to be dynamic at an abstract level, it is dynamic at a ground level too. We focus our attention on numeric fluents only, as we expect these to be the main source of domain inconsistencies. In what follows we formalise the notion of Domain Dynamic Consistency (DDC). DDC is a property desired of some particular state, and it makes sense when the state is evaluated in a planning problem context.

Definition 1 (Domain Dynamic Consistency). Let P = ⟨D, P⟩ be a planning problem such that D = ⟨X, A⟩ and P = ⟨O, I, G⟩. We say that P is Domain Dynamic Consistent (DDC) iff ∀x ∈ X[O] it holds that:

– if ∃⟨inc, y, k⟩ ∈ eff(a) for some a ∈ A with k > 0 s.t. y = x or y = abs(x), then ∃s ∈ succ(I) s.t. [x]I < [x]s;
– if ∃⟨dec, y, k⟩ ∈ eff(a) for some a ∈ A with k > 0 s.t. y = x or y = abs(x), then ∃s ∈ succ(I) s.t. [x]I > [x]s;
– if ∃⟨ass, y, k⟩ ∈ eff(a) for some a ∈ A with k ≠ [x]I s.t. y = x or y = abs(x), then ∃s ∈ succ(I) s.t. [x]s = k.

Intuitively, the notion establishes that a planning problem is DDC if each numeric fluent mentioned in the initial state is dynamic, i.e., if actions in the domain model enable the numeric fluent to dynamically change at an abstract level. We are interested in determining if that is the case. To understand whether
this property is generally true for well-formed and operational planning problems, we considered a range of well-known numeric benchmark instances [23]. The set includes the following domains: Counters, Plant-watering, Block-grouping, Sailing, and Farmland. We manually checked all the instances of the benchmarks, and observed that all of them are DDC. In all the considered instances, all the numeric fluents that can be modified via actions are indeed initially set to be modifiable. This empirical evidence gives solid ground to support our intuition, and suggests that the notion can provide a meaningful way to verify initial states. Of course, the considered instances are very easy to check, given their simple structure. Yet, and that is also where the DDC notion can be helpful, real-world planning applications can lead to problem models that are complex and large. An example will be given in Sect. 5. It can be proven that, in general, checking the DDC is much more involved.

Proposition 1. Deciding whether a planning problem is DDC is undecidable.

Proof (Sketch). Observe that deciding whether a planning problem is DDC is as hard as finding a solution plan for it. Indeed, we can emulate a planning problem by encoding the goal into the precondition of a dummy action having a single numeric effect. Then we make sure that this action is necessary to solve the problem. To do so we introduce a fresh numeric fluent, initially set to an arbitrary number, say 0, and model a numeric effect for this action that sets the fresh numeric fluent to 1. Checking whether this problem is DDC necessitates making sure that the precondition of this action is achievable, which is possible iff the original problem admits a solution. As numeric planning is undecidable [13], so is the problem of verifying whether a planning problem is DDC.
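Since Definition 1 quantifies over succ(I), an exact check amounts to a reachability test. A depth-bounded exploration gives a direct but necessarily incomplete check (by Proposition 1 no complete procedure exists); the state and action encoding below is our own sketch:

```python
def can_increase(init, fluent, actions, max_depth=3):
    """Incomplete test of the first DDC condition: can `fluent` ever exceed
    its initial value within `max_depth` action applications?
    Actions are dicts with a "pre" predicate and constant "eff" deltas."""
    frontier = [init]
    seen = {tuple(sorted(init.items()))}
    for _ in range(max_depth):
        nxt = []
        for state in frontier:
            for a in actions:
                if not a["pre"](state):
                    continue
                new = dict(state)
                for f, d in a["eff"].items():
                    new[f] += d
                if new[fluent] > init[fluent]:
                    return True
                key = tuple(sorted(new.items()))
                if key not in seen:
                    seen.add(key)
                    nxt.append(new)
        frontier = nxt
    return False  # not observed within the bound; no guarantee either way
```

A `False` answer here proves nothing in general, which is precisely the motivation for the over-approximating AIBR-based check presented next.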
4 Approximating Domain Dynamic Consistency
To overcome the complexity of determining if a planning problem P is DDC, we approximate the DDC check through the additive interval-based relaxation [1,24]. We make use of the AIBR for a different purpose than that employed in state-space planners (e.g., [15,24]). Our objective is not to provide a heuristic estimate or to prune states. Instead, we aim at evaluating the DDC of a problem. As a very first step, we run the AIBR up to a fixpoint; note that such a fixpoint does exist and can be computed efficiently because of the use of asymptotic supporters. This gives us an interval for each variable. Then we use such intervals to predict whether the conditions of Definition 1 are satisfied. More precisely, for each variable for which we know that there exists an action that can change its value abstractly, we check whether this may happen at the ground level too. We do so for each of the conditions that we want to evaluate. Algorithm 1 reports the AIBR reachability algorithm [24], slightly modified to return the last relaxed state obtained after fixpoint computation. Algorithm 2 describes how to use Algorithm 1 to approximate the DDC of a problem w.r.t. a
Algorithm 1: AIBR (slightly revisited from Scala et al. 2016)

Input: P++
Output: The set of intervals at the asymptotic fixpoint
1: Ω = supporters of A
2: s+ = s+₀
3: S = {a ∈ Ω : s+ |= pre(a)}
4: while S ≠ ∅ do
5:     s+ = succ+(s+, S)
6:     Ω = Ω \ S
7:     S = {a ∈ Ω : s+ |= pre(a)}
8: return s+
Algorithm 2: DDC Approximation

Input: P = ⟨D, P⟩
Output: Is P Domain Dynamic Consistent?
1:  Pg = grounding(P)
2:  s+ = AIBR(Pg++)
3:  foreach x ∈ XP do
4:      foreach a ∈ AD such that ∃⟨x′, +=, k⟩ ∈ eff(a). x′ = abs(x) ∧ k > 0 do
5:          if up([x]s+) = [x]I then
6:              return False
7:      foreach a ∈ AD such that ∃⟨x′, −=, k⟩ ∈ eff(a). x′ = abs(x) ∧ k > 0 do
8:          if lo([x]s+) = [x]I then
9:              return False
10:     foreach a ∈ AD such that ∃⟨x′, =, k⟩ ∈ eff(a). x′ = abs(x) ∧ k ≠ [x]I do
11:         if k ∉ [x]s+ then
12:             return False
13: return True
domain. For any fluent x, [x]s+ denotes the interval of values for x in s+; lo([x]s+) and up([x]s+) denote its minimum and maximum value, respectively. Algorithm 2 works as follows. First, it grounds the planning problem, obtaining Pg; AIBR indeed is defined for fully grounded problems only. Then it calls the AIBR specified by Algorithm 1, which returns the fixpoint AIBR planning state. Then, we iterate over all the variables that are expressed in the initial state of P. This set is denoted by XP. For each action that abstractly modifies the variable under iteration, we distinguish the three possible effects of the action on the variable: an increase, a decrease and an assignment. If the action abstractly increases (decreases) the value of a numeric fluent x, then we check whether the interval for x at the fixpoint s+ allows the variable to be increased (decreased). This is done by inspecting the lower and the upper bound of the interval (functions lo and up in the code), and determining whether the fixpoint
value admits an increase, a decrease, or the foreseen assignment; for the assignment it suffices to check whether the interval at the fixpoint does not include the value k. For instance, if we have a variable x with an initial value of 0, an effect ⟨inc, x′, 5⟩ where x′ is the abstracted version of x, and a fixpoint [x]s+ = [−∞, 0], then x is never going to be increased, even if it was supposed to be at an abstract level. If at least one of these cases is not satisfied, the algorithm returns that the problem is not DDC. Otherwise it carries on and explores the next variable from XP. Algorithm 2 correctly identifies whether a problem is not DDC and can thus be used to signal suspicious situations.

Proposition 2. If Algorithm 2 returns False for a problem P, then P is not DDC.

Proof (Sketch). Observe that the algorithm terminates with False only in those cases where the relaxation proves that one variable violates Definition 1. AIBR overestimates all values that can ever be obtained; if some value is not reached under AIBR, it is not reachable under the real semantics either.
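Given the fixpoint intervals returned by Algorithm 1, the three tests of Algorithm 2 reduce to a few comparisons. The sketch below uses our own flat representation (ground fluents as strings, intervals as pairs) rather than full PDDL structures:

```python
def ddc_violations(init, fixpoint, abstract_effects):
    """abstract_effects: triples (fluent, modifier, k) over ground fluents,
    one per ground effect expected to change the fluent at the abstract level.
    Returns the fluents whose expected change is provably never realised."""
    bad = set()
    for fluent, mod, k in abstract_effects:
        lo, up = fixpoint[fluent]
        if mod == "inc" and k > 0 and up == init[fluent]:
            bad.add(fluent)                 # upper bound never moved: no increase
        elif mod == "dec" and k > 0 and lo == init[fluent]:
            bad.add(fluent)                 # lower bound never moved: no decrease
        elif mod == "ass" and k != init[fluent] and not (lo <= k <= up):
            bad.add(fluent)                 # assigned value unreachable
    return bad
```

On the worked example from the text, a fluent x initialised to 0 with fixpoint interval [−∞, 0] and an abstract increase effect is reported as a violation, i.e., the problem is flagged as not DDC.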
5 The Case of Urban Traffic Control
Urban traffic control (UTC) aims at optimising traffic flows in urban areas by reducing travel time delays and avoiding congestion of road links. One possibility, usually considered by traffic authorities, is configuring the traffic lights at intersections [17,29]. A traffic signal configuration of an intersection is defined by a sequence of green light phases, each with its specified duration, that, in consequence, affects the traffic movement through the intersection. Traffic movements are described in terms of the Passenger Car Units (PCUs) that on average can move from incoming to outgoing links of the intersection. Traffic signal configurations operate in cycles, i.e. the sequences of green phases they define are repeated (until the configuration changes). When specifying a configuration, we need to keep in mind any rules governing minimum and maximum green phase length, and we also need to respect the constraints on the minimum and maximum duration of entire cycles. Intergreens typically have specified durations which we are not allowed to change. This section shows a UTC instance where the notion of DDC can be used to capture when the PDDL encoding of the UTC is faulty because of some erroneous input in the problem definition. A UTC problem includes the definition of two actions modelling the extension and the reduction of the length of the default green time for a stage s in a junction j. The PDDL abstract model for these two actions is reported in Fig. 1. To change the default green time for a phase, several conditions have to be satisfied; focusing on the numeric conditions, the green time needs to be less than the maximum green time or greater than the minimum green time. It is important, therefore, that both the minimum green time and the maximum green time are
E. Scala et al.
Fig. 1. Snippet of the PDDL UTC model. All blocks have a direct correspondence to the mathematical formalisation provided in Sect. 2.
properly set in order to give the planner room to modify the value of the default green time if necessary. Figure 2 shows an excerpt of a problem specification. Notably, UTC problem specifications include knowledge pulled from a range of different data sources, which may therefore be inconsistent or noisy and needs to be carefully verified [4]. Further, the models are large, composed of thousands of lines, making manual verification infeasible.
Run over the problem of Fig. 2, Algorithm 2 yields a fixpoint interval state where (defaultgreentime wrac1_stage1) is any value between −∞ and ∞. Instead, in the considered excerpt, the value of (defaultgreentime wrac1_stage2) will never change through time. Indeed, neither reduceStage nor extendStage can be applied: the default green time is not within the minimum and maximum green time. Although this is not a problem modelling-wise, the notion of DDC detects it as a suspicious situation. The abstract version of the default green time is non-static due to the actions of Fig. 1. Yet, there is a concrete specialisation, (defaultgreentime wrac1_stage2), that is static, and this makes the problem inconsistent w.r.t. the domain. Because such a problem is deemed not Domain Dynamic Consistent, the user can be alerted and fix the problem accordingly, i.e., by modifying the minimum green time variable for wrac1_stage2 to a consistent value.
Using a prototype implementation of the presented algorithm on real-world data, we were able to quickly identify a dozen issues and inconsistencies in automatically generated UTC initial states, effectively addressing the issues arising from pulling data from different sources. The use of DDC also allowed us to identify unforeseen failure points of the knowledge acquisition process. For instance, we identified a case where one junction went offline and did not communicate its status (missing defaultgreentime value).
Fig. 2. Snippet of a UTC problem, presenting some elements of a single junction with two stages. In PDDL syntax, the ":init" block is the initial state; ":objects" defines the universe of objects.
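The suspicious situation described above boils down to a simple numeric check. The following sketch uses hypothetical green-time values (the actual figures for wrac1_stage2 are those of Fig. 2, not reproduced here); the preconditions merely mirror the informal description of extendStage and reduceStage:

```python
# Hypothetical values illustrating the wrac1_stage2 situation: when the
# default green time sits outside [min, max], neither extendStage
# (which needs default < max) nor reduceStage (which needs default > min)
# is applicable, so the fluent is de facto static.

def stage_is_static(default, min_green, max_green):
    can_extend = default < max_green   # numeric precondition of extendStage
    can_reduce = default > min_green   # numeric precondition of reduceStage
    return not (can_extend or can_reduce)

# Default green time outside the [min, max] window: both actions blocked.
print(stage_is_static(default=40, min_green=45, max_green=40))  # -> True
# A consistent window makes at least one action applicable again.
print(stage_is_static(default=30, min_green=10, max_green=50))  # -> False
```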
6 Conclusion
The use of automated planning in real-world applications, particularly when instances are generated by pulling together data from a range of sources, comes with the challenge of verifying that the resulting instances are consistent. In this paper, to address this challenge, we introduced the notion of Domain Dynamic Consistency (DDC) to identify instances that may not behave as expected. The notion of DDC can be used as a means to verify the knowledge acquisition process behind a planning problem's initial state, and the fact that the pulled data provide a consistent overall picture. The DDC notion has been captured in PDDL, a well-known formalism used by the planning community. This notion can be useful in contexts where one wants an automatic mechanism to inspect suspicious input: DDC does not necessarily identify mistakes, but can flag aspects that are suspicious and deserve in-depth investigation. We then presented a sound technique to prove when a problem is not DDC, which leverages existing
numeric relaxation-based heuristics. Finally, we provided an example application where the use of DDC helped in catching a number of issues in large PDDL models.
We see several avenues for future work. First, we are interested in extending the DDC notion to more complex planning formalisms, for instance PDDL+ [10]. Second, we plan to develop a suitable interface to allow non-planning experts to take advantage of this technique. Finally, we are interested in exploiting the DDC notion also to suggest potential issues in the domain models, so as to provide a tool that can also help in revising and improving the planning models used.

Acknowledgements. Mauro Vallati was supported by a UKRI Future Leaders Fellowship [grant number MR/T041196/1]. Enrico Scala has been partially supported by AIPlan4EU, a project funded by the EU Horizon 2020 research and innovation programme under GA n. 101016442, and by the Italian MUR programme PRIN 2020, Prot. 20203FFYLK (RIPER – Resilient AI-Based Self-Programming and Strategic Reasoning).
References

1. Aldinger, J., Mattmüller, R., Göbelbecker, M.: Complexity of interval relaxed numeric planning. In: Hölldobler, S., Krötzsch, M., Peñaloza, R., Rudolph, S. (eds.) KI 2015. LNCS (LNAI), vol. 9324, pp. 19–31. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24489-1_2
2. Aldinger, J., Nebel, B.: Interval based relaxation heuristics for numeric planning with action costs. In: Kern-Isberner, G., Fürnkranz, J., Thimm, M. (eds.) KI 2017. LNCS (LNAI), vol. 10505, pp. 15–28. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-67190-1_2
3. Bensalem, S., Havelund, K., Orlandini, A.: Verification and validation meet planning and scheduling. Int. J. Softw. Tools Technol. Transf. 16(1), 1–12 (2014)
4. Bhatnagar, S., Mund, S., Scala, E., McCabe, K., McCluskey, L., Vallati, M.: On the challenges of on-the-fly knowledge acquisition for automated planning applications. In: 14th International Conference on Agents and Artificial Intelligence (2022)
5. Bonet, B., Geffner, H.: Planning as heuristic search. Artif. Intell. 129(1–2), 5–33 (2001)
6. De Giacomo, G., Vardi, M.: Synthesis for LTL and LDL on finite traces. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 1558–1564. AAAI Press (2015)
7. Edelkamp, S., Kissmann, P.: Partial symbolic pattern databases for optimal sequential planning. In: Dengel, A.R., Berns, K., Breuel, T.M., Bomarius, F., Roth-Berghofer, T.R. (eds.) KI 2008. LNCS (LNAI), vol. 5243, pp. 193–200. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-85845-4_24
8. Fourati, F., Bhiri, M.T., Robbana, R.: Verification and validation of PDDL descriptions using Event-B formal method. In: Proceedings of the 5th International Conference on Multimedia Computing and Systems (ICMCS), pp. 770–776 (2016)
9. Fox, M., Long, D.: PDDL2.1: an extension to PDDL for expressing temporal planning domains. J. Artif. Intell. Res. 20, 61–124 (2003)
10. Fox, M., Long, D.: Modelling mixed discrete-continuous domains for planning. CoRR abs/1110.2200 (2011)
11. Fox, M., Long, D., Tamboise, G., Isangulov, R.: Creating and executing a well construction/operation plan. US Patent App. 15/541,381 (2018)
12. Ghallab, M., Nau, D.S., Traverso, P.: Automated Planning and Acting. Cambridge University Press, Cambridge (2016)
13. Helmert, M.: Decidability and undecidability results for planning with numerical state variables. In: Proceedings of the Sixth International Conference on Artificial Intelligence Planning Systems (AIPS), pp. 44–53. AAAI (2002)
14. Helmert, M.: Concise finite-domain representations for PDDL planning tasks. Artif. Intell. 173(5–6), 503–535 (2009)
15. Hoffmann, J.: The Metric-FF planning system: translating "ignoring delete lists" to numeric state variables. J. Artif. Intell. Res. 20, 291–341 (2003)
16. Lipovetzky, N., Burt, C.N., Pearce, A.R., Stuckey, P.J.: Planning for mining operations with time and resource constraints. In: Proceedings of the International Conference on Automated Planning and Scheduling (2014)
17. McCluskey, T.L., Vallati, M., Franco, S.: Automated planning for urban traffic management. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 5238–5240 (2017)
18. McCluskey, T.L., Vaquero, T.S., Vallati, M.: Engineering knowledge for automated planning: towards a notion of quality. In: Proceedings of the Knowledge Capture Conference, K-CAP, pp. 14:1–14:8 (2017)
19. Moore, R.E., Kearfott, R.B., Cloud, M.J.: Introduction to Interval Analysis. SIAM (2009)
20. Parkinson, S., Longstaff, A., Fletcher, S.: Automated planning to minimise uncertainty of machine tool calibration. Eng. Appl. Artif. Intell. 30, 63–72 (2014)
21. Pnueli, A.: The temporal semantics of concurrent programs. In: Proceedings of Semantics of Concurrent Computation, pp. 1–20 (1979)
22. Raimondi, F., Pecheur, C., Brat, G.: PDVer, a tool to verify PDDL planning domains. In: Proceedings of the Workshop on Verification and Validation of Planning and Scheduling Systems, ICAPS (2009)
23. Scala, E., Haslum, P., Thiébaux, S.: Heuristics for numeric planning via subgoaling. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 3228–3234. IJCAI/AAAI Press (2016)
24. Scala, E., Haslum, P., Thiébaux, S., Ramírez, M.: Interval-based relaxation for general numeric planning. In: Proceedings of the 22nd European Conference on Artificial Intelligence (ECAI), pp. 655–663 (2016)
25. Scala, E., Haslum, P., Thiébaux, S., Ramírez, M.: Subgoaling techniques for satisficing and optimal numeric planning. J. Artif. Intell. Res. 68, 691–752 (2020)
26. Scala, E., Vallati, M.: Exploiting classical planning grounding in hybrid PDDL+ planning engines. In: Proceedings of the 32nd IEEE International Conference on Tools with Artificial Intelligence (ICTAI), pp. 85–92 (2020)
27. Shrinah, A., Eder, K.: Goal-constrained planning domain model verification of safety properties. In: STAIRS@ECAI (2020)
28. Thiébaux, S., Coffrin, C., Hijazi, H., Slaney, J.: Planning with MIP for supply restoration in power distribution systems. In: Proceedings of the International Joint Conference on Artificial Intelligence (2013)
29. Vallati, M., Magazzeni, D., Schutter, B.D., Chrpa, L., McCluskey, T.L.: Efficient macroscopic urban traffic models for reducing congestion: a PDDL+ planning approach. In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 3188–3194 (2016)
Comparing Multi-Agent Path Finding Algorithms in a Real Industrial Scenario

Enrico Saccon(B), Luigi Palopoli, and Marco Roveri

University of Trento, Trento, Italy
{enrico.saccon,luigi.palopoli,marco.roveri}@unitn.it
Abstract. There is an increasing trend towards automating warehouses and factories by leveraging teams of autonomous agents. The orchestration problem for a fleet of autonomous cooperating robotic agents has been tackled in the literature as Multi-Agent Path Finding (MAPF), for which several algorithms have been proposed. However, these algorithms have only been applied to synthetic, randomly generated scenarios. Their application in real scenarios demands scalability (being able to deal with realistic-size warehouses) and efficiency (being able to quickly adapt to changes in the problems, e.g., new orders or changes in their priorities). In this work we analyse the MAPF literature, select the most effective algorithms, implement them, and carry out an experimental analysis on a real scalable warehouse of a large distribution company to evaluate their applicability in such scenarios. The results show that a) no algorithm prevails over the others; b) there are difficult (realistic) cases beyond the reach of all the algorithms.
1 Introduction
Robots are becoming a familiar presence in the daily life of people, helping them in different application domains: industry, warehousing, healthcare, search and rescue, and office automation. Among these, industry is the domain in which automated machines have had the most successful applications. Indeed, the Industry 4.0 revolution has meant for many workers an increased level of interaction with the machines present in the factory [4], with a significant impact on productivity [29]. Indeed, robotics has proven to enhance and more easily solve logistics and manufacturing problems, allowing for a better use of the industrial space [12]. Over the last decade, robots have also been used with great profit in the healthcare sector. For example, they have been successfully used in precise surgical procedures to help surgeons reach difficult anatomical compartments and perform operations that would otherwise be impossible [5]. Also, robotics has been applied to help elderly and impaired people move more freely, besides being used to assist during rehabilitation [10].

M. Roveri—The work of M. Roveri was partially funded by the Italian MUR programme PRIN 2020, Prot. 20203FFYLK (RIPER – Resilient AI-Based Self-Programming and Strategic Reasoning).
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. A. Dovier et al. (Eds.): AIxIA 2022, LNAI 13796, pp. 184–197, 2023. https://doi.org/10.1007/978-3-031-27181-6_13

Robots have also been successfully utilized in
search and rescue missions in challenging environments [1]. Finally, robots can be used to help in the day-to-day life of an office, allowing affairs to be sped up and simplifying the general workday [27].
The majority of the above applications involve multiple robots that need to cooperate while moving in a shared environment (e.g. a warehouse) without interfering with each other, in order to complete one or multiple tasks in the most efficient way possible while responding promptly to contingencies (e.g., the arrival of a new task). This can be achieved by the automatic synthesis of a plan (i.e., a sequence of movements for each agent) to fulfill the full set of tasks. For automatic synthesis to be applied in real industrial scenarios, it is required that i) the solution plan can be generated for real-size industrial scenarios; and ii) the solution plan can be generated quickly (e.g., in at most one minute), to quickly adapt to contingencies (e.g., a new order, a change of priority, an order cancellation).
The problem of finding a plan for coordinating a fleet of autonomous cooperating robotic agents aiming to complete assigned tasks has been tackled in the literature as Multi-Agent Path Finding (MAPF). Several algorithms have been proposed to solve the MAPF problem, e.g., Kornhauser's algorithm [13], the Extended A* [11], the Increasing Cost Tree Search (ICTS) [22], several variants of Conflict Based Search (CBS) [21], and the Constraint Programming (CP) and Mixed Integer Linear Programming (MILP) approaches. However, to the best of our knowledge, these algorithms have only been applied to synthetic, randomly generated graphs, and their application in real scenarios has not been studied.
In this work we make the following contributions. First, we perform a detailed analysis of the MAPF literature, from which we selected the most effective algorithms, and we implemented them as efficiently as possible.
Second, we carry out an experimental analysis on a real warehouse of a large distribution company. To evaluate the performance and applicability of the considered algorithms, we decomposed the whole warehouse into sub-areas of increasing size (from a smaller area up to the whole warehouse). For each scenario we considered different numbers of agents and several randomly generated tasks for each agent. The results show that the CP approach is able to solve very small cases and does not scale to large real-size scenarios, although it generates optimal solutions and is also able to solve small critical hard problems. The algorithms that perform best are the two variants of CBS, although none of them is able to solve many of the cases. This work also contributed to identifying some situations for which none of the considered algorithms is able to find a solution in a set amount of time. These results help pave the way for investigating new heuristics to solve the hard problems that appear in real scenarios.
This paper is organized as follows. In Sect. 2, we review the literature on MAPF. In Sect. 3, we formally define the problem we aim to solve, and we provide a high-level description of the most relevant approaches studied in the literature. In Sect. 4, we describe the most relevant implementation details and the considered warehouse, and we critically discuss the results. Finally, in Sect. 5, we draw conclusions and outline future work.
2 Related Works
In this work, we focus on motion planning, considering the equally important problem of mission planning as completed before the motion planning task starts. While the former focuses on the best path to follow starting from a position, executing the intermediate objectives and reaching the final destination [14], the latter focuses on the best way of organizing the goals for each robot in the environment [6]. Mission planning is not considered here because warehouses usually rely on specialized software to handle their internal structures, and such software is usually responsible for the generation of an ordered set of goals. The motion planning aspect is particularly important in a populated environment because it needs to guarantee people's safety.

2.1 Single-Agent Path Finding
The Single-Agent Path Finding (SAPF) problem is the problem of finding the best path on a graph between two given nodes, or vertexes. This problem is of great importance in various scenarios. Indeed, one of the main algorithms used to solve the SAPF problem, A*, has been successfully applied to GPS localization in order to improve waypoint accuracy for remote-controlled agents [15]. Nevertheless, the field in which single-agent path finding has found the most importance is that of robot routing and planning, as the problem name also suggests. SAPF algorithms have been successfully employed in robot routing, where they are used to search a graph constructed from environmental data in order to avoid obstacles and to explore possible routes [2]. This paper focuses on the path planning problem, which can be defined as follows:

Definition 1 (Single-Agent Path Finding). Given an undirected graph G = (V, E), where V is the set of vertexes (corresponding to possible locations for the agent) and E is the set of edges joining two vertexes (representing the possible transitions between two locations), the Single-Agent Path Finding (SAPF) problem consists in finding the shortest feasible plan π between a starting vertex vS ∈ V and a final one vF ∈ V.

A plan π is a sequence of N actions αi, i ∈ {1, ..., N}, that take the agent from the starting position vS ∈ V to the final position vF in N steps by following the graph edges:

π = [α1, ..., αN] : π(vS) = αN(...α2(α1(vS))...) = vF

where with αi(vs) we denote the movement from vs ∈ V to the vertex ve ∈ V, such that (vs, ve) ∈ E. We denote with π[h], h ≤ N, the h-th action of the plan π = [α1, ..., αN], i.e. π[h] = αh. We also denote with |π| = N the length of the plan π = [α1, ..., αN]. Due to this definition, the SAPF problem can be reduced to the problem of finding the shortest path on a graph.
What follows is a brief description of the main algorithms that can be applied to single-agent path finding, which can be divided into deterministic algorithms (e.g. Dijkstra's) and heuristic ones (e.g. A*).
Dijkstra's Algorithm. Dijkstra's algorithm [9] aims to find the shortest path between two nodes of a graph whose edges have only positive weights. Note that the graph needs to be strongly connected, i.e., there must be at least one path between any two nodes. While this seems quite a strong limitation, industrial scenarios usually provide such a graph: no node can be a sink, since it must be possible for an agent to come back from each location; that is, graphs modeled on warehouses are usually either undirected, and hence strongly connected, or directed with no sink node. The work of Dijkstra published in 1959 [9] presents two possible algorithms, one to find the shortest path from one node to another and one to find a tree of minimum length starting from a node and reaching all the other nodes. We focus on the second aspect. The complexity of the algorithm depends on the number of vertexes and edges; different and improved versions of the algorithm have different worst-case performance, and common priority-queue implementations run in time O((|V| + |E|) log |V|). Finally, the algorithm has been successfully used in robot path planning [7,16,28].

A* Algorithm. A* is a heuristic best-first search algorithm for finding the shortest path on a graph [25]. It is also an admissible algorithm, that is, it is guaranteed to find an optimal path from the starting node to the arrival one [11]. The idea of A* is to direct the search over the nodes towards the arrival node without necessarily having to examine all the vertexes. To do so, A* keeps a set of nodes to be visited, which is initialized with only the starting node and is then enlarged with the neighbors that the algorithm deems worth expanding. A node is said to be expanded when it is added to the set to be analyzed later on. The choice of which nodes should be expanded, and which not, is driven by the heuristic function.
Indeed, when examining the neighbors u ∈ neigh(n) of the considered node n, A* uses a heuristic h(u) to estimate the distance to the arrival vertex. Let h*(u) be the perfect heuristic, that is, a function that returns the exact distance from the node u to the arrival vertex; if h*(u) were known for all the nodes, the best path would be obtained simply by always moving to the neighbor with the lowest heuristic distance. It has been proved that if h(n) ≤ h*(n), then the heuristic is admissible and A* is optimal [11].
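The expansion scheme just described can be sketched for a unit-cost graph as follows (an illustrative implementation, not the one evaluated in the paper; the adjacency-dict encoding and the example heuristic are assumptions):

```python
import heapq

def astar(graph, start, goal, h):
    """A* on a unit-cost graph; graph: node -> iterable of neighbors,
    h: admissible heuristic with h(n) <= true distance to goal."""
    g = {start: 0}
    parent = {start: None}
    pq = [(h(start), start)]           # set of nodes to be visited
    while pq:
        f, u = heapq.heappop(pq)
        if u == goal:                  # goal popped: path is optimal
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in graph[u]:             # expand the promising neighbors
            if g[u] + 1 < g.get(v, float("inf")):
                g[v] = g[u] + 1
                parent[v] = u
                heapq.heappush(pq, (g[v] + h(v), v))
    return None                        # goal unreachable

# A 1D line 0-1-2-3; |n - goal| is an admissible heuristic here.
line = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(astar(line, 0, 3, lambda n: abs(n - 3)))  # -> [0, 1, 2, 3]
```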
3 Problem Statement
The Multi-Agent Path Finding (MAPF) problem is the problem of planning feasible movements for multiple agents [18] such that each one can reach its final location from its respective initial one.

Definition 2 (Multi-Agent Path Finding). Given a finite set A = {a1, ..., ak} of k agents, an undirected graph G = (V, E), where V is the set of vertexes (corresponding to possible locations for the agents) and E is the set of edges joining two vertexes (representing the possible transitions between two locations), and an initial location vS^ai ∈ V and a final location vF^ai ∈ V for each agent ai, the Multi-Agent Path Finding (MAPF) problem consists in finding
Fig. 1. The diﬀerent kinds of conﬂicts.
a joint feasible plan Π = {πa1, ..., πak} such that, for each πai, πai(vS^ai) = vF^ai, and it minimizes a given cost function C(Π).

In this work, we focus on edges with unitary cost (i.e. all edges have cost 1; extensions in which edges have non-unitary costs are left for future work). We say that a joint plan Π is feasible if no conflict happens between any two different agents. In the literature, the most widely used notions of conflict are the following [18]:

– Vertex conflict: two agents ai, aj ∈ A with i ≠ j occupy the same vertex at the same time. The two agents have a vertex conflict iff ∃ 1 ≤ h ≤ N such that πai[h](vS^ai) = πaj[h](vS^aj).
– Edge conflict: two agents ai, aj ∈ A with i ≠ j aim to use the same edge in the same direction at the same time. The two agents have an edge conflict iff ∃ 1 ≤ h < N such that πai[h](vS^ai) = πaj[h](vS^aj) ∧ πai[h+1](vS^ai) = πaj[h+1](vS^aj).
– Swap conflict: two agents ai, aj ∈ A with i ≠ j aim to use the same edge but in opposite directions at the same time. The two agents have a swap conflict iff ∃ 1 ≤ h < N such that πai[h](vS^ai) = πaj[h+1](vS^aj) ∧ πai[h+1](vS^ai) = πaj[h](vS^aj).
– Follow conflict: agent ai wants to occupy at a given time h a position that was occupied by agent aj (with i ≠ j) at time h − 1. The two agents have a follow conflict iff ∃ 1 < h ≤ N such that πai[h](vS^ai) = πaj[h−1](vS^aj).

In Fig. 1, we provide a pictorial representation of the vertex, swap, and follow conflicts. The edge conflict is pictorially similar to the swap conflict, with the two agents in the same location wanting to take the same edge. It should be noted that avoiding vertex conflicts avoids edge conflicts by definition.
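The four conflict notions can be checked directly on the sequences of occupied vertexes. A small sketch (an assumption here: each plan is given as the list of vertexes visited at steps 0, 1, 2, ..., and the check compares only up to the shorter horizon):

```python
def conflicts(p1, p2):
    """Return the kinds of conflict between two plans, each given as
    the sequence of occupied vertexes (index = time step)."""
    n = min(len(p1), len(p2))
    found = set()
    for h in range(n):
        if p1[h] == p2[h]:
            found.add("vertex")
        if h + 1 < n:
            if (p1[h], p1[h + 1]) == (p2[h], p2[h + 1]) and p1[h] != p1[h + 1]:
                found.add("edge")     # same edge, same direction
            if p1[h] == p2[h + 1] and p1[h + 1] == p2[h]:
                found.add("swap")     # same edge, opposite directions
        # follow conflict, checked for both orderings of the two agents
        if h >= 1 and (p1[h] == p2[h - 1] or p2[h] == p1[h - 1]):
            found.add("follow")
    return found

# Two agents crossing a 3-vertex corridor meet at B at step 1.
print(conflicts(["A", "B", "C"], ["C", "B", "A"]))  # -> {'vertex'}
```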
In the literature, two different kinds of cost function C(Π) have been considered: the makespan and the sum of costs (we refer to [18] for a more thorough discussion).

– The makespan is a function that returns the length of the longest plan πai ∈ Π, i.e. C(Π) = MKS(Π) = max_{πai ∈ Π} |πai|. Thus, minimizing the makespan
means finding the joint plan whose longest individual path is shortest.
– The sum of costs is a function that returns the sum of the individual costs of the different plans πai ∈ Π, i.e. C(Π) = SIC(Π) = Σ_{πai ∈ Π} |πai|. Here we
assume that each action costs 1. If a cost c_ei is associated to each edge ei ∈ E, then instead of the length of the plan one has to consider the sum of the costs of the actions in the plan.

The classical multi-agent path finding problem has been proved to be NP-hard, i.e., it is not possible to find an optimal solution in polynomial time [17,26,30]. Notice that the problem is NP-hard when finding an optimal solution, i.e., a solution that minimizes the objective function, be it the makespan or the sum of individual costs.
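The two cost functions can be stated in a few lines (plans represented simply as action lists, with unit action cost as assumed above):

```python
# The two MAPF objective functions over a joint plan, with each
# individual plan given as its list of actions (unit action cost).

def makespan(joint_plan):
    return max(len(pi) for pi in joint_plan)    # MKS: length of longest plan

def sum_of_costs(joint_plan):
    return sum(len(pi) for pi in joint_plan)    # SIC: total plan length

plans = [["up", "up", "left"], ["down"]]
print(makespan(plans), sum_of_costs(plans))  # -> 3 4
```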
3.1 Solutions
In the literature, several algorithms to solve the MAPF problem have been proposed. These algorithms can be correct and complete (i.e. if they terminate with a solution, then the computed solution is a solution to the given MAPF problem, and if the problem admits no solution the algorithm reports that no solution exists); they can compute an optimal solution, which minimizes the given cost function, a bounded optimal one, which minimizes the cost function within a given bound (i.e. with some degree of freedom), or a non-optimal one, with no guarantee of optimality. In the following, we consider these approaches with the corresponding algorithms: Kornhauser's algorithm [13], the Extended A* [24], the Increasing Cost Tree Search (ICTS) [22], the Conflict Based Search (CBS) [21], and the Constraint Programming (CP) and Mixed Integer Linear Programming (MILP) approaches.

Kornhauser's algorithm [13] is a complete but non-optimal algorithm that solves the MAPF problem in O(|V|^3). This algorithm considers all the agents in their positions and tries to move one single agent at a time to a free neighboring location, with the aim of finding a way to move all the agents from one arrangement to another. The solution is obtained by decomposing the problem into subproblems, each one composed of the agents that can reach the same set of nodes and the subgraph made of these nodes [19]. This algorithm is considered very hard to implement efficiently [25].

The extended A* algorithm considers moving all possible agents from one location to a free neighboring one at the same time. This results in a search space of |V|^k and a branching factor of |E|^k, which are both exponential in the number of agents and hence intractable [25]. Two extensions were proposed to solve the MAPF problem [24]: Operator Decomposition (OD) and Independence Detection (ID).
The first aims at reducing the exponential branching factor, while the
other tries to decouple the problem over k agents into smaller problems with fewer agents. The two extensions can also be combined. This algorithm is correct, complete and optimal.

The Increasing Cost Tree Search (ICTS) algorithm is a two-stage search in which a high-level search aims at finding the lengths of the paths for the different agents, while the low-level search carries out the creation of the paths for the various agents under the cost constraints given by the high-level search [22,25]. This algorithm creates a tree called the Increasing Cost Tree (ICT), in which each node contains a vector of the costs Ci of the individual paths of the agents. The total cost C of a node is given by the result of the objective function applied to the joint plan, and all the nodes at the same level of the tree have the same total cost. The root of the tree is initialized with the costs of the individual paths of the agents as if each were solving a SAPF problem. If there are no conflicts, then the solution is fine as it is and the algorithm stops. If instead a conflict is found, then k new nodes are created, one for each agent: the i-th node copies the solution of the parent, increasing only the cost of the i-th agent by one unit. The idea is the following: if no conflict-free solution exists under the given cost vector, one may be found by allowing the path of one agent to be one step longer. The algorithm continues until a solution is found. The ICT nodes not containing conflicts are called goal nodes. The low-level search is instead the part of the algorithm that has to find a path for the i-th agent of cost Ci that reaches its final destination. There may be different implementations for this part of the algorithm: the most trivial would be to start from the initial node, enumerate all the possible paths of length Ci, and check which ones reach the final node.
This, though, may become very expensive, as the number of possible paths of cost Ci may be exponential. The solution proposed in [22] uses a Multi-valued Decision Diagram (MDD) [23], a generalization of binary decision diagrams in the sense that it allows more than two choices at every node. Basically, the MDD has a single source node, which corresponds to the starting node of the agent. Then, it keeps track of all the neighbors of the source node, adding them only if the path going through them can lead to the final node with cost Ci. This also implies that the MDD has a single sink, which corresponds to the final goal of the agent. The problem is then how to choose which path is best to return to the high-level search, since a path may produce more conflicts than another, leading to a bigger and suboptimal ICT. This is done by taking the cross-product of the different MDDs, i.e., merging them, and removing those branches that contain conflicts. We remark that, given the structures of the ICT and of the cross-product of the MDDs, the optimization problem can be reduced to a satisfaction problem: the first ICT node that satisfies the constraint of not having any conflict is also going to be optimal, and the same is true for the paths found in the combination of the MDDs.

The Conflict Based Search (CBS) algorithm uses two distinct search processes, similarly to ICTS, a high-level and a low-level one, and a tree to solve the MAPF
Comparing MAPF Algorithms in a Real Industrial Scenario
problem. Differently from ICTS, the CBS algorithm builds a Constraint Tree (CT) composed of nodes tracking three elements: i) the joint plan; ii) the cost of the joint plan; iii) a set of constraints associated with the joint plan. The idea is that, whenever a joint plan contains a conflict, it is resolved by creating two new nodes with different constraints, which are limitations on an agent's movement. In particular, the original CBS [21] defines a constraint as a negative restriction tuple (ai, n, t), meaning that agent ai is not allowed to be on node n at time t. The protocol works in the following way: the root is built by considering the paths of the agents as in a single-agent path finding (SAPF) problem. Then, the high-level search checks for possible conflicts. Let πi and πj be the plans for agents ai and aj, respectively, and suppose that they have a vertex conflict at time t on node n. Then, the high-level search creates two new CT nodes from the parent, one in which agent ai cannot be on node n at time t, and the other in which agent aj cannot be on node n at time t. An improvement to CBS [3] suggests that using two positive constraints and a negative one may produce better results, since the sets of paths that comply with the constraints are disjoint [25]. This means that, instead of having two children from a node, the high-level search creates three children: one in which agent ai must be on node n at time t, one in which agent aj must be on node n at time t, and one in which neither of them is allowed to be on node n at time t. The process of expanding nodes, i.e., creating new children, stops when there are no more conflicts to be solved. Whenever a new node is added, the low-level search is called to find a solution to the problem with the newly added constraints. If a feasible solution can be found, then the node is added to the set of nodes to be further explored. To pick the next node to examine, CBS uses the cost function of the joint plan.
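The branching scheme of the original CBS (negative vertex constraints only) can be sketched as a best-first search over the Constraint Tree. The four callbacks below are hypothetical placeholders for the problem-specific components; this is an illustrative sketch, not the paper's implementation:

```python
import heapq

def cbs_high_level(agents, low_level, find_conflict, cost):
    """High-level search of Conflict-Based Search over the Constraint Tree.

    low_level(agent, constraints) -> a path for `agent` respecting the
    (agent, node, time) negative constraints, or None if none exists.
    find_conflict(plan) -> a vertex conflict (ai, aj, node, t), or None.
    cost(plan) -> total cost of the joint plan (e.g. sum of path lengths).
    """
    constraints = frozenset()
    plan = {a: low_level(a, constraints) for a in agents}
    if any(p is None for p in plan.values()):
        return None
    open_list = [(cost(plan), 0, constraints, plan)]
    counter = 1                              # tie-breaker so the heap never compares plans
    while open_list:
        _, _, constraints, plan = heapq.heappop(open_list)
        conflict = find_conflict(plan)
        if conflict is None:
            return plan                      # cheapest conflict-free CT node
        ai, aj, node, t = conflict
        for agent in (ai, aj):               # branch: forbid (agent, node, t)
            new_constraints = constraints | {(agent, node, t)}
            new_plan = dict(plan)
            new_plan[agent] = low_level(agent, new_constraints)
            if new_plan[agent] is not None:  # only keep feasible children
                heapq.heappush(open_list,
                               (cost(new_plan), counter, new_constraints, new_plan))
                counter += 1
    return None
```

The sketch handles vertex conflicts only, matching the description above; edge (swap) conflicts and the three-way positive/negative split of [3] would require additional branching cases.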
Finally, as regards the low-level search, it can be any SAPF algorithm, although it needs to be properly modified to support the presence of constraints. The Constraint Programming (CP) approach leverages a mathematical modeling paradigm in which the problem is encoded as a set of constraints among two kinds of variables: state variables and decision variables. This approach is usually divided into two parts: a modeling part, which shapes the aspects of the problem by introducing variables over specific domains and constraints over such variables, and a solving part, which aims at choosing the values of the decision variables that satisfy the constraints and minimize a given cost function. If the constraints are well-formed, i.e., they correctly cover the variables and their domains, then constraint programming is both optimal and correct. A typical model considers a Boolean variable for each agent, for each vertex, and for each time point, a constraint enforcing that each agent occupies exactly one vertex in each time step, and a constraint ensuring no vertex conflict. Agents are positioned at their initial position at the first time step, and must be at their arrival position at the last time step. Agents move along edges towards neighbors of the node on which they are: this ensures the validity of the solution, since an agent cannot jump from one node to another. Once the constraints are fixed, the model can be solved with any off-the-shelf constraint solver, which searches the possible combinations of values without violating any constraint.
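As an illustration of such a modeling part, the sketch below generates a CNF-style Boolean encoding of a toy MAPF instance. The clause representation and helper names are our own for illustration; they are not the paper's CPLEX model:

```python
from itertools import combinations, product

def encode_mapf(graph, starts, goals, horizon):
    """Generate a CNF-style Boolean encoding of a MAPF instance.

    graph: dict node -> set of neighbour nodes (undirected adjacency).
    Variable x[a][v][t] is True iff agent a occupies vertex v at time t;
    a clause is a list of signed literals (agent, vertex, time, polarity).
    """
    agents, nodes, clauses = list(starts), list(graph), []
    lit = lambda a, v, t, pos=True: (a, v, t, pos)
    for a, t in product(agents, range(horizon + 1)):
        # each agent occupies at least one vertex per time step ...
        clauses.append([lit(a, v, t) for v in nodes])
        # ... and at most one (pairwise mutual exclusion)
        for v, w in combinations(nodes, 2):
            clauses.append([lit(a, v, t, False), lit(a, w, t, False)])
    for v, t in product(nodes, range(horizon + 1)):
        # no vertex conflict: two agents never share a vertex
        for a, b in combinations(agents, 2):
            clauses.append([lit(a, v, t, False), lit(b, v, t, False)])
    for a in agents:
        clauses.append([lit(a, starts[a], 0)])        # initial position
        clauses.append([lit(a, goals[a], horizon)])   # arrival position
        # movement: from v an agent can only wait or move to a neighbour
        for v, t in product(nodes, range(horizon)):
            successors = [lit(a, w, t + 1) for w in graph[v] | {v}]
            clauses.append([lit(a, v, t, False)] + successors)
    return clauses
```

The resulting clauses could then be handed to any off-the-shelf SAT or CP solver; the encoding grows with agents × vertices × time steps, which is consistent with the memory blow-up discussed in the results below.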
E. Saccon et al.

4 Experimental Evaluation
In this section, we first provide the high-level details of the implementation of the considered algorithms, together with information about the software and hardware infrastructure used for the experiments. Then, we describe the considered industrial scenarios, and we report and critically discuss the results of the experiments.
4.1 Implementation
For the implementation we considered only three of the approaches discussed in Sect. 3. We implemented the CP approach and two variants of the CBS family of algorithms, namely the Spanning Tree (ST) and the Time-Dependent Shortest Path (TDSP) variants. CBS ST and CBS TDSP differ in the low-level search used to build the constraint tree. CBS ST builds a spanning tree in its low-level search, so as to allow the high-level search to choose among the possible different paths that have the same length. CBS TDSP uses in its low-level search a variant of Dijkstra's algorithm [9] to compute shortest paths in which the cost of an edge depends on the time at which the edge is considered. For lack of space, we do not report here the pseudo-code of the considered algorithms, and we refer to [20] for further details. We decided not to implement Kornhauser's algorithm, since it is considered by the research community very hard to implement efficiently [25], and it produces non-optimal solutions. We did not implement the extended A* algorithm because of its large branching factor, which would make it inapplicable in large industrial scenarios. Finally, we also did not implement the ICTS approach, since it requires a priori knowledge of bounds on the costs of the searched solutions, which is impractical information to obtain in realistic scenarios. All the algorithms have been implemented in C++ using the standard template libraries. For the CP algorithm we leveraged the latest release of the C++ API of the CPLEX commercial constraint solver [8]. The source code with the implementation of all the algorithms is available at our open repository.1 We ran all the experiments on an AMD Ryzen 3700X with an 8-core CPU at 3.6 GHz base clock and 32 GB of RAM, running Linux. We considered runtime timeouts of 1 s, 10 s, and 60 s to mimic the response-time expectations of realistic industrial scenarios.
4.2 Industrial Scenarios
For the experiments we considered a real warehouse, taken from a collaboration with a company operating in the field of robot-assisted warehouses. The entire warehouse and its graph representation are depicted in Fig. 2. The topological graph obtained from the map consists of 414 nodes with undirected edges. For the experiments we decomposed the warehouse into sub-problems as follows: i) WH1, which corresponds to the gold rectangle in the top right corner of Fig. 2; ii)
1 https://www.bitbucket.org/chaff800/maof.
Fig. 2. The schema of the real warehouse considered for the experiments.
WH2, which corresponds to the blue rectangle in the bottom left corner of Fig. 2; iii) WH2 1, which corresponds to the red rectangle in the bottom left corner of Fig. 2; iv) WH2 2, which corresponds to the green rectangle in the bottom left corner of Fig. 2; v) WH2 1 1, which corresponds to the top four rows of the red rectangle; vi) WH2 1 2, which corresponds to the bottom four rows of the red rectangle; vii) WH2 2 1, which corresponds to the top four rows of the green rectangle; viii) WH2 2 2, which corresponds to the bottom four rows of the green rectangle. For each scenario, we considered problems with an increasing number of robotic agents, taken from {2, 5, 10, 20}, and an increasing number of goals, taken from {1, 2, 5, 10, 20}. These numbers are the result of a discussion with the company owning the reference warehouse we considered. The goals have been generated to resemble typical goals of the logistic activities carried out in the considered warehouse. In the results, we report the name of the scenario followed by the number of problems considered in that scenario in parentheses (e.g., WH2 2 2 (10) means the scenario WH2 2 2 with ten problems). For each experiment, we report the number of problems solved among the ones considered, and the average search time in milliseconds (ms) over the solved problems. We use TO to indicate that the algorithm was not able to find a solution within the given time budget for any of the problems in the scenario.
4.3 Results
The results are reported in Table 1: the top left table reports the results for CBS with TDSP; the top right table reports the results for CBS with ST; the bottom table reports the results for CP. For CP we also report the average memory, in megabytes (MB), required either to find a solution or before reaching the timeout.

Table 1. Results for CBS with TDSP (top left), CBS with ST (top right), and CP (bottom).
The results clearly show that none of the considered algorithms was able to solve all the problems within the considered budget constraints, except in very few cases (e.g., in WH2 2, WH2 2 2, and WH1). In particular, the results show that the CBS algorithms are able to solve slightly more scenarios than CP (which solves only 3 cases within the 60 s time boundary, with the best runtime completed in 1.1 s). More specifically, the results show that the CBS algorithms are complementary: for WH2 2 2, CBS TDSP is slower than CBS ST, whereas for WH1 CBS TDSP is able to solve one instance while CBS ST solves none, always ending in TO. CP always performs worse than the CBS algorithms. As the table with the results for CP shows, this approach also consumes a larger amount of memory w.r.t. the other approaches. Indeed, each time it does not find a solution, it increases the number of time steps by one unit, which results in a much larger complexity due to the variable matrix structure used.
These results clearly show that, although these algorithms have been thoroughly studied in the literature and experimented with on random graphs with random goals, when applied to realistic scenarios they fail to find solutions within typical industrial resource budgets. A more thorough analysis shows that the cases where no solution was found (even with larger resource budgets) are cases where two robotic agents need to follow the same shortest path but in opposite directions, thus being required to swap places along one edge (see Fig. 3).

Fig. 3. A simple scenario not solvable by CBS.

In these cases, a simple strategy would move one of the two agents into a lateral position (if available) to allow the other to pass, and then go back to the previous location (thus taking a longer path that visits the same node more than once). The difficulty in solving such a situation lies in differentiating between a waiting action, which can be done on the node on which the agent currently is, and the action of exploring the neighbors of the node. Algorithms such as TDSP and ST are not meant to visit the same node multiple times. To solve this problem, both the high-level and low-level searches of CBS should be modified: the former to consider multiple possible nodes for a given time step h in the plan of an agent, and the latter to allow moving over the same node multiple times. Both changes are already planned as future work.
5 Conclusions
In this paper, we studied the performance of state-of-the-art MAPF algorithms on a set of scalable industrial scenarios, all derived from a real warehouse of a large distribution company. The results show that the CP approach finds optimal solutions, but it is applicable only to very small scenarios. The CBS approaches scale better and solve more problems within the given resource budgets. However, these approaches fail to find a solution in cases where some agent is required to move to another location and then come back to the same location to continue its motion, so as to allow other agents to exit from conflicting situations. This particular case is very likely to happen by construction of the graph: the aisles are long and can basically be occupied by just one agent at a time without incurring many swap conflicts. The results show that there is no clear winner, but all the approaches have pros and cons. This work paves the way for several future directions, ranging from new heuristics for the hard problems that appear in real scenarios, to new algorithms that combine the pros of each approach, or that use divide-et-impera approaches to leverage different low-level search strategies. Moreover, we also aim to extend the work so that each agent considers not only a set of tasks, but also other information such as battery levels and the possibility to recharge. Also, while in this work we have taken the mission planning for granted, integrating mission planning into the MAPF problem
196
E. Saccon et al.
may lead to more effective ways of allocating tasks to the different agents, so as to minimize the overall cost of the computed solution. The final goal is an open-source framework containing different MAPF solvers that can be used to tackle the problem and that may be integrated into platforms such as ROS. For this same reason, the algorithms have been re-implemented instead of employing pre-existing code; moreover, any existing code would have had to be adapted to our use case, leading to a loss in performance.
References

1. Arnold, R.D., Yamaguchi, H., Tanaka, T.: Search and rescue with autonomous flying robots through behavior-based cooperative intelligence. J. Int. Humanit. Action 3(1), 1–18 (2018). https://doi.org/10.1186/s41018-018-0045-4
2. Bhattacharya, S., Likhachev, M., Kumar, V.: Topological constraints in search-based robot path planning. Auton. Robots 33, 273–290 (2012). https://doi.org/10.1007/s10514-012-9304-1
3. Boyarski, E., et al.: ICBS: the improved conflict-based search algorithm for multi-agent pathfinding (2015)
4. Bragança, S., Costa, E., Castellucci, I., Arezes, P.M.: A brief overview of the use of collaborative robots in industry 4.0: human role and safety (2019). https://doi.org/10.1007/978-3-030-14730-3_68
5. Brett, P., Taylor, R., Proops, D., Coulson, C., Reid, A., Griffiths, M.: A surgical robot for cochleostomy, pp. 1229–1232. IEEE (2007). https://doi.org/10.1109/IEMBS.2007.4352519
6. Brumitt, B., Stentz, A.: Dynamic mission planning for multiple mobile robots, pp. 2396–2401. IEEE (1996). https://doi.org/10.1109/ROBOT.1996.506522
7. Chen, Y.Z., Shen, S.F., Chen, T., Yang, R.: Path optimization study for vehicles evacuation based on Dijkstra algorithm. Procedia Eng. 71, 159–165 (2014). https://doi.org/10.1016/j.proeng.2014.04.023
8. IBM Corporation: IBM ILOG CPLEX Optimization Studio
9. Dijkstra, E.W.: A note on two problems in connexion with graphs. Numer. Math. 1, 269–271 (1959). https://doi.org/10.1007/BF01386390
10. Ferrari, F., et al.: Human–robot interaction analysis for a smart walker for elderly: the ACANTO interactive guidance system. Int. J. Soc. Robot. 12(2), 479–492 (2019). https://doi.org/10.1007/s12369-019-00572-5
11. Hart, P.E., Nilsson, N.J., Raphael, B.: A formal basis for the heuristic determination of minimum cost paths. IEEE Trans. Syst. Sci. Cybern. 4, 100–107 (1968). https://doi.org/10.1109/TSSC.1968.300136
12. Javaid, M., Haleem, A., Singh, R.P., Suman, R.: Substantial capabilities of robotics in enhancing industry 4.0 implementation.
Cogn. Robot. 1, 58–75 (2021). https://doi.org/10.1016/j.cogr.2021.06.001
13. Kornhauser, D., Miller, G., Spirakis, P.: Coordinating pebble motion on graphs, the diameter of permutation groups, and applications, pp. 241–250. IEEE (1984). https://doi.org/10.1109/SFCS.1984.715921
14. Latombe, J.C.: Robot Motion Planning, vol. 124. Springer Science & Business Media, Berlin, Heidelberg (2012). https://doi.org/10.1007/978-1-4615-4022-9
15. Pouke, M.: Using GPS data to control an agent in a realistic 3D environment, pp. 87–92. IEEE, September 2013. https://doi.org/10.1109/NGMAST.2013.24
16. Qing, G., Zheng, Z., Yue, X.: Path-planning of automated guided vehicle based on improved Dijkstra algorithm, pp. 7138–7143. IEEE, May 2017. https://doi.org/10.1109/CCDC.2017.7978471
17. Ratner, D., Warmuth, M.K.: Finding a shortest solution for the n × n extension of the 15-puzzle is intractable (1986)
18. Stern, R., et al.: Multi-agent pathfinding: definitions, variants, and benchmarks. CoRR abs/1906.08291 (2019)
19. Röger, G., Helmert, M.: Non-optimal multi-agent pathfinding is solved (since 1984) (2012)
20. Saccon, E.: Comparison of Multi-Agent Path Finding Algorithms in an Industrial Scenario. Master's thesis, Department of Information Engineering and Computer Science, University of Trento, July 2022. https://www5.unitn.it/Biblioteca/en/Web/RichiestaConsultazioneTesi
21. Sharon, G., Stern, R., Felner, A., Sturtevant, N.R.: Conflict-based search for optimal multi-agent pathfinding. Artif. Intell. 219, 40–66 (2015). https://doi.org/10.1016/j.artint.2014.11.006
22. Sharon, G., Stern, R., Goldenberg, M., Felner, A.: The increasing cost tree search for optimal multi-agent pathfinding. Artif. Intell. 195, 470–495 (2013). https://doi.org/10.1016/j.artint.2012.11.006
23. Srinivasan, A., Ham, T., Malik, S., Brayton, R.: Algorithms for discrete function manipulation, pp. 92–95. IEEE Computer Society Press. https://doi.org/10.1109/ICCAD.1990.129849
24. Standley, T.: Finding optimal solutions to cooperative pathfinding problems, vol. 24, pp. 173–178 (2010)
25. Stern, R.: Multi-agent path finding - an overview (2019). https://doi.org/10.1007/978-3-030-33274-7_6
26. Surynek, P.: An optimization variant of multi-robot path planning is intractable, vol. 2, July 2010
27. Veloso, M.M., Biswas, J., Coltin, B., Rosenthal, S.: CoBots: robust symbiotic autonomous mobile service robots, pp. 4423–4429, July 2015
28. Wang, H., Yu, Y., Yuan, Q.: Application of Dijkstra algorithm in robot path-planning, pp. 1067–1069. IEEE (2011). https://doi.org/10.1109/MACE.2011.5987118
29. Wurman, P.R., D'Andrea, R., Mountz, M.: Coordinating hundreds of cooperative, autonomous vehicles in warehouses. AI Mag. 29, 9 (2008). https://doi.org/10.1609/aimag.v29i1.2082
30. Yu, J., LaValle, S.M.: Structure and intractability of optimal multi-robot path planning on graphs, pp. 1443–1449. AAAI Press (2013)
Logic-Based Ethical Planning

Umberto Grandi1, Emiliano Lorini1, Timothy Parker1(B), and Rachid Alami2

1 IRIT, CNRS, Toulouse University, Toulouse, France
2 LAAS, CNRS, Toulouse, France
Abstract. In this paper we propose a framework for ethical decision-making in the context of planning, with intended application to robotics. We put forward a compact but highly expressive language for ethical planning that combines linear temporal logic with lexicographic preference modelling. This original combination allows us to assess plans both with respect to an agent's values and its desires, introducing the novel concept of the morality level of an agent and moving towards multi-goal, multi-value planning. We initiate the study of the computational complexity of planning tasks in our setting, and we discuss potential applications to robotics.
1 Introduction
In ethical planning, the planning agent has to find a plan for promoting a certain number of ethical values. The latter include both abstract values, such as justice, fairness, reciprocity, equity, and respect for human integrity, and more concrete ones, such as "greenhouse gas emissions are reduced". Unlike classical planning, in which the goal to be achieved is unique, in ethical planning the agent can have multiple and possibly conflicting values, that is, values that cannot be concomitantly satisfied. Typical of ethical planning is the problem of facing a moral struggle, which is "...provoked by inconsistencies between value commitments and information concerning the kinds of decision problems which arise..." [18, p. 8]. Consequently, in ethical planning the agent needs to evaluate and compare the ideality (or goodness) of different plans, depending on how many and which values are promoted by each of them. In this paper, our intended application field is that of robotics. Including ethical considerations in robotics planning requires (at least) three steps. First, identify ethically sensitive situations in the robotics realm and how these situations are represented. Planning seems to be the first candidate in which to include ethical considerations, thus we assume that values or ethical judgments are expressed about the results of plans. Second, design a language to express such values, bearing in mind that they can be, and often are, potentially conflicting in multiple ways: among values, between a value and a goal, or between a value and good practices. Such a value representation language needs to be compact and computationally tractable. Third, complete the picture of ethical planning by designing algorithms that compare plans based on the ethical values.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. A. Dovier et al. (Eds.): AIxIA 2022, LNAI 13796, pp. 198–211, 2023. https://doi.org/10.1007/978-3-031-27181-6_14
In this paper we put forward a framework for ethical planning based on a simple temporal logic language to express both an agent's values and goals. For ease of exposition, we focus on single-agent planning with deterministic sequential actions in a known environment. Our model borrows from the existing literature on planning and combines it in an original way with research on compact representation languages for preferences. The latter is a widely studied topic in knowledge representation, where logical and graphical languages are proposed to compactly represent the preferences of an agent over a combinatorial space of alternatives, often described by means of variables. In particular, we commit to a prioritised, or lexicographic, approach to solve the possible inconsistencies arising among goals, desires, and good practices in a unified planning model.
2 Related Work
There is considerable research in the field of ethics and AI; see Müller [25] for a general overview. Popular ethical theories for application are consequentialism, deontology, and virtue ethics.1 Our approach should be able to work with any notion of "good actions", but is probably a most natural fit for pluralistic consequentialism [30]. While there is a lot of work at the theoretical/abstract level, there is comparatively less that examines how ethical reasoning in artificial agents could actually be done in practice. There are approaches both in terms of formal models [12] and of allowing agents to learn ethical values [2]. Yu et al. [33] provide a recent survey of this research area. The closest approaches to ours are the recent work on (i) logics for ethical reasoning and (ii) the combination of a compact representation language, such as conditional preference networks, with decision-making in an ethically sensitive domain. The former are based on different methodologies, including event calculus (ASP) [6], epistemic logic and preference logic [22,24], BDI (belief, desire, intention) agent languages [11], and classical higher-order logic (HOL) [5]. The latter was presented in "blue sky" papers [21,28], complemented with a technical study of distances between CP-nets [20] and, more recently, with an empirical study on human ethical decision-making [4]. CP-nets are a compact formalism to order states of the world described by variables. We take inspiration from these lines of work, but depart from them in two respects. First, robotics applications are dynamic ones, and ethical principles must be expressed over time. Hence, unlike existing logics for ethical reasoning, our focus is on a specification language for values based on linear temporal logic. Second, ethical decision-making in robotic applications requires mixing potentially conflicting values with the desires of the agent and expressing the notion of plan, for which CP-nets alone are not sufficient.
In the field of robotics, there are approaches to enabling artificial agents to compute ethical plans. The evaluative component, which consists in assessing the "goodness" of an action or a plan in relation to the robot's values, is made explicit by Arkin et al. [3] and Vanderelst and Winfield [32]. Evans et al. [13] focus on a collision scenario involving an autonomous vehicle, proposing to prioritise the ethical claims depending on the situation, e.g., by giving more priority to the claims of the more endangered agents. Related work explores the design of planning algorithms that help robots produce socially acceptable plans by assigning weights to social rules [1]. In preference-based planning by Bienvenu et al. [7], plans are compared relative to a single (possibly lexicographic) preference formula about temporal properties. Similarly, Lindner et al. [19] evaluate the permissibility of plans according to a specific ethical principle, such as the deontological principle, the utilitarian principle, the do-no-harm principle, or the double effect principle. In our approach, plans are compared relative to sets of values. Comparison of alternatives (e.g., plans, states, histories) relative to a set of values is an essential aspect of ethics which is not considered in these two works. As we will show in Sect. 3.5, it opens up the possibility of formalizing the notion of moral conflict.

1 See Copp [9] for a philosophical introduction, and Jenkins et al. [15], Powers [27], and Vallor [31] for a discussion of these three theories in robotics.
3 Model
In this section, we present the formal model of ethical evaluation and planning, which consist, respectively, in comparing the goodness of plans and in finding the best plan relative to a given base of ethical values.
3.1 LTL Language
Let Prop be a countable set of atomic propositions and let Act be a finite non-empty set of action names. Elements of Prop are noted p, q, ..., while elements of Act are noted a, b, .... We assume the existence of a special action skip. The set of states is S = 2^Prop, with elements s, s′, .... In order to represent the agent's values, we introduce the language of LTLf (Linear Temporal Logic over Finite Traces) [10,26], noted L_LTLf(Prop) (or L_LTLf), defined by the following grammar:

ϕ ::= p | ¬ϕ | ϕ1 ∧ ϕ2 | Xϕ | ϕ1 U ϕ2,

with p ranging over Prop. X and U are the operators "next" and "until" of LTLf. The operators "henceforth" (G) and "eventually" (F) are defined in the usual way: Gϕ =def ¬(⊤ U ¬ϕ) and Fϕ =def ¬G¬ϕ. The propositional logic fragment of L_LTLf is noted L_PL and is defined in the usual way. We will use L_PL to describe the effect preconditions of the agent's actions.
3.2 Histories
The notion of history is needed for interpreting formulas in L_LTLf. We define a k-history to be a pair H = (Hst, Hact) with Hst : [0, k] → S and Hact : [1, k] → Act.
A history specifies the actual configuration of the environment at a certain time point and the action executed by the agent that leads to the next state. The set of k-histories is noted Hist_k. The set of histories is Hist = ⋃_{k∈N} Hist_k. The semantic interpretation of formulas in L_LTLf relative to a k-history H ∈ Hist and a time point t ∈ [0, k] goes as follows (we omit the Boolean cases, which are defined as usual):

H, t ⊨ p ⟺ p ∈ Hst(t),
H, t ⊨ Xϕ ⟺ t < k and H, t + 1 ⊨ ϕ,
H, t ⊨ ϕ1 U ϕ2 ⟺ ∃t′ ≥ t : t′ ≤ k and H, t′ ⊨ ϕ2 and ∀t″ ≥ t : if t″ < t′ then H, t″ ⊨ ϕ1.
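These truth conditions can be read directly as a recursive model checker over finite traces. The sketch below is our illustration, not part of the paper's formalism; formulas are nested tuples, and ⊤ is added as a primitive for convenience when building the derived operators:

```python
def holds(formula, states, t=0):
    """Check an LTLf formula on a finite trace: `states` is a list of
    states (each a set of atomic propositions) indexed by time 0..k.
    Formula shapes: ('true',), ('atom', p), ('not', f), ('and', f, g),
    ('X', f), ('U', f, g)."""
    k = len(states) - 1
    op = formula[0]
    if op == 'true':
        return True
    if op == 'atom':
        return formula[1] in states[t]
    if op == 'not':
        return not holds(formula[1], states, t)
    if op == 'and':
        return holds(formula[1], states, t) and holds(formula[2], states, t)
    if op == 'X':                               # "next": requires a successor time point
        return t < k and holds(formula[1], states, t + 1)
    if op == 'U':                               # "until", bounded by the end of the trace
        return any(holds(formula[2], states, u) and
                   all(holds(formula[1], states, v) for v in range(t, u))
                   for u in range(t, k + 1))
    raise ValueError(f"unknown operator {op!r}")

def F(phi):                                     # eventually, defined equivalently as ⊤ U φ
    return ('U', ('true',), phi)

def G(phi):                                     # henceforth, defined equivalently as ¬F¬φ
    return ('not', F(('not', phi)))
```

Note the finite-trace reading of X: at the last time point t = k, Xϕ is false for every ϕ, exactly as in the semantics above.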
3.3 Action Theory
We suppose actions in Act are described by an action theory γ = (γ+, γ−), where γ+ and γ− are, respectively, the positive and negative effect precondition functions γ+ : Act × Prop → L_PL and γ− : Act × Prop → L_PL. The truth of γ+(a, p) guarantees that proposition p will be true in the next state when action a is executed, while the truth of γ−(a, p) guarantees that proposition p will be false in the next state when action a is executed. We stipulate that if γ+(a, p) and γ−(a, p) are concomitantly true at a given state and action a is executed, then the truth value of p will not change in the next state. The latter captures an inertial principle for fluents.

Definition 1 (Action-compatible histories). Let γ = (γ+, γ−) be an action theory and let H = (Hst, Hact) be a k-history. We say H is compatible with γ if the following condition holds, for every t ∈ [1, k] and for every a ∈ Act: if Hact(t) = a then

Hst(t) = ( Hst(t − 1) \ {p ∈ Prop : H, t − 1 ⊨ ¬γ+(a, p) ∧ γ−(a, p)} ) ∪ {p ∈ Prop : H, t − 1 ⊨ γ+(a, p) ∧ ¬γ−(a, p)}.

The set of γ-compatible histories is noted Hist(γ).
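Definition 1 can be read as a one-step successor computation. The sketch below is purely illustrative; it assumes an `entails` callback deciding the truth of a precondition formula in a state, and all names are ours:

```python
def successor(state, action, gamma_pos, gamma_neg, props, entails):
    """One step of Definition 1: the state reached by executing `action`.

    state: set of true atoms; gamma_pos / gamma_neg: effect precondition
    functions (action, atom) -> propositional formula; entails(state, f)
    decides whether formula f holds in a state."""
    removed = {p for p in props
               if not entails(state, gamma_pos(action, p))
               and entails(state, gamma_neg(action, p))}
    added = {p for p in props
             if entails(state, gamma_pos(action, p))
             and not entails(state, gamma_neg(action, p))}
    # atoms whose two preconditions are both true (or both false) keep
    # their previous truth value: the inertial principle for fluents
    return (state - removed) | added
```

A usage example with trivially true/false precondition formulas shows both the effect case and the inertia case.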
3.4 Plans
Let us now move from the notion of action to the notion of plan. Given k ∈ N, a k-plan is a function π : {1, ..., k} → Act. The set of k-plans is noted Plan_k. The set of plans is Plan = ⋃_{k∈N} Plan_k. The following definition introduces the notion of the history generated by a k-plan π at an initial state s0. It is the action-compatible k-history along which the agent executes the plan π starting at state s0.
Definition 2 (History generated by a k-plan). Let γ = (γ+, γ−) be an action theory, s0 ∈ S and π ∈ Plan_k. Then, the history generated by plan π from state s0 in conformity with the action theory γ is the k-history H^{π,s0,γ} = (H^{π,s0,γ}_st, H^{π,s0,γ}_act) such that:

(i) H^{π,s0,γ} ∈ Hist(γ),
(ii) H^{π,s0,γ}_st(0) = s0,
(iii) ∀k′ s.t. 1 ≤ k′ ≤ k : H^{π,s0,γ}_act(k′) = π(k′).

Given a set of LTLf formulas Σ, we define Sat(Σ, π, s0, γ) to be the set of formulas from Σ that are guaranteed to be true by the execution of plan π at state s0 under the action theory γ. That is, Sat(Σ, π, s0, γ) = {ϕ ∈ Σ : H^{π,s0,γ}, 0 ⊨ ϕ}.
3.5 Moral Conflicts
An ethical planning agent is likely to have multiple values that it wishes to satisfy when making plans. Some of these values will be ethical in nature ("do not harm humans"), and some may not be ("do not leave doors open"). However, the more values the robot has, the more likely it is to face scenarios where it cannot satisfy all of its values with any given plan, and must violate some of them. In such a scenario, the agent must first work out which subsets of its value base are jointly satisfiable, and then which of those subsets it should choose to satisfy. To this end we define the notion of a moral conflict (note that, in line with Levi [18], we refer to any conflict between an agent's values as a "moral conflict", even if some or all of those values are not strictly moral/ethical in nature).

Definition 3 (Moral problem). A moral problem is a tuple M = (Ω, γ, s0) where:
– Ω ⊆ L_LTLf is a set of values (which may or may not be strictly moral in nature);
– γ = (γ+, γ−) is an action theory and s0 is an initial state, as described above.

Definition 4 (Moral conflict). A moral problem M = (Ω, γ, s0) is a moral conflict if:
– ∀k ∈ N, there is no k-plan π such that Sat(Ω, π, s0, γ) = Ω.

In other words, a moral conflict occurs when it is not possible to satisfy all of our values with any plan. In some cases, a moral conflict may not depend on any particular feature of the start state, but may result simply from the value base and action theory, or even from the value base alone. This allows us to define two further notions of moral problem.
Definition 5 (Physical moral problem). A physical moral problem is a pair (Ω, γ) where:
– Ω ⊆ L_LTLf is a set of values;
– γ is an action theory.

Definition 6 (Logical moral problem). A logical moral problem is a set of values Ω ⊆ L_LTLf.

We can also define moral conflict for these moral problems. A physical (logical) moral problem is a physical (logical) value conflict if for every possible start state s0 (and every possible action theory γ), the resulting moral problem M = (Ω, γ, s0) is a moral conflict. By our definition, conflict mirrors the concept of necessity: necessity would imply that every possible plan satisfies all the values in Ω, whereas conflict implies that no plan satisfies all values. Thus it is interesting to note that our definitions of conflict have mirrors in the philosophical literature [16]. A physical moral conflict mirrors the notion of nomic necessity (necessary given the laws of nature), at least from the perspective of the robot, for whom the action theory comprises the laws of nature, whereas a logical moral conflict mirrors the notion of logical necessity (necessary given the nature of logic). If an agent is experiencing a moral conflict, one response would be to "temporarily forget" values until it has a satisfiable set.

Definition 7 (Contraction). If M = (Ω, γ, s0) is a moral problem and M′ = (Ω′, γ, s0) is a moral problem, we say that M′ is a contraction of M if:
– Ω′ ⊆ Ω;
– M′ is not a moral conflict.

Note that if M = (Ω, γ, s0) is a moral problem, π is a plan, and Ω′ = Sat(Ω, π, s0, γ), then M′ = (Ω′, γ, s0) must be a contraction of M. In this case, we refer to M′ as the contraction generated by π. This also illustrates that the current notion of contraction is unhelpful for an agent attempting to select a plan in a moral conflict, as all plans generate contractions. What would be helpful is some notion of a "minimal" or "ideal" contraction that sacrifices as few values as possible.
If M = (Ω, γ, s0) is a moral problem and M′ = (Ω′, γ, s0) is a contraction of M, then M′ is:
– a qual-minimal contraction if there is no contraction M″ = (Ω″, γ, s0) such that Ω′ ⊂ Ω″;
– a quant-minimal contraction if there is no contraction M″ such that |Ω′| < |Ω″|.

Proposition 1. If M = (Ω, γ, s0) is a moral problem and is not a moral conflict, then the only qual-minimal and quant-minimal contraction of M is M.
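For intuition, the two minimality notions of Definition 8 can be sketched by brute-force enumeration. The following is an illustrative Python sketch under simplifying assumptions of our own: values are plain labels, and conflict-freeness is supplied as a caller-provided predicate (`conflict_free`) standing in for the planning machinery; none of these names come from the paper.

```python
from itertools import combinations

def contractions(omega, conflict_free):
    """All contractions of omega: subsets that are not moral conflicts
    (Definition 7), with conflict_free abstracting plan existence."""
    return [frozenset(c)
            for r in range(len(omega) + 1)
            for c in combinations(sorted(omega), r)
            if conflict_free(frozenset(c))]

def qual_minimal(omega, conflict_free):
    """Contractions not strictly contained in any other contraction."""
    cs = contractions(omega, conflict_free)
    return [c for c in cs if not any(c < d for d in cs)]

def quant_minimal(omega, conflict_free):
    """Contractions of maximum cardinality."""
    cs = contractions(omega, conflict_free)
    best = max(len(c) for c in cs)
    return [c for c in cs if len(c) == best]
```

When the value set is itself conflict-free, both functions return only the full set, mirroring Proposition 1; the empty contraction always exists, since an empty value set is vacuously satisfiable.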
204
U. Grandi et al.
For either notion of minimality, there will be cases with multiple minimal contractions of a given moral conflict. This can produce unintuitive results: if there is some moral conflict with Ω = {"do not kill humans", "do not leave the door open"} with contractions {"do not kill humans"} and {"do not leave the door open"}, then either notion of minimality will tell you that both contractions are ideal. On the other hand, it does seem that any stronger notion of minimality should at least respect qualitative minimality, since (intuitively), if plan π1 fulfills all of the values fulfilled by π2, and fulfills more values besides, then π1 should be preferred to π2.

Proposition 2. Given a moral conflict M, a contraction M′ is quant-minimal only if it is qual-minimal.

One way to resolve this is to recognise, in line with Levi [18], that some of our values are only used as tie-breakers to separate otherwise-equivalent plans, and should not be considered directly alongside our more important values. To model this, our values exist in lexicographically ordered sets, where each set is examined only if the sets above cannot deliver a verdict.

3.6 Lexicographic Value Base
Together with an action theory and an initial state, an agent's value base constitutes an ethical planning domain.

Definition 9 (Ethical planning domain). An ethical planning domain is a tuple Δ = (γ, s0, Ω) where:
– γ = (γ+, γ−) is an action theory and s0 is an initial state, as specified above;
– Ω = (Ω1, . . . , Ωm) is the agent's value base with Ωk ⊆ L_LTLf for every 1 ≤ k ≤ m.

Ω1 is the agent's set of values with priority 1, Ω2 is the agent's set of values with priority 2, and so on. For notational convenience, given a value base Ω = (Ω1, . . . , Ωm), we denote by dg(Ω) its degree (or arity). The agent's values are used to compute the relative ideality of plans, namely, whether a plan π2 is at least as ideal as another plan π1. Following [24], we call evaluation the operation of computing an ideality ordering over plans from a value base. Building on classical preference representation languages [17], we define the following qualitative criterion of evaluation, noted ⪯^qual_Δ, which compares two plans lexicographically on the basis of inclusion between sets of values.

Definition 10 (Qualitative ordering of plans). Let Δ = (γ, s0, Ω) be an ethical planning domain with Ω = (Ω1, . . . , Ωm) and π1, π2 ∈ Plan. Then, π1 ⪯^qual_Δ π2 if and only if:
(i) ∃ 1 ≤ k ≤ m s.t. Sat(Ωk, π1, s0, γ) ⊂ Sat(Ωk, π2, s0, γ), and ∀ 1 ≤ k′ < k, Sat(Ωk′, π1, s0, γ) = Sat(Ωk′, π2, s0, γ); or
(ii) ∀ 1 ≤ k ≤ m, Sat(Ωk, π1, s0, γ) = Sat(Ωk, π2, s0, γ).
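The comparison in Definition 10 is straightforward to implement. A minimal Python sketch, assuming each plan is summarised by the tuple of sets Sat(Ωk, π, s0, γ) for k = 1, …, m (passed in precomputed; the function name is ours):

```python
def qual_preceq(sat1, sat2):
    """pi1 qual-precedes pi2, where sat1[k] and sat2[k] are the sets of
    level-(k+1) values satisfied by pi1 and pi2 (Definition 10)."""
    for s1, s2 in zip(sat1, sat2):
        if s1 != s2:
            # clause (i): at the first differing level, pi1's satisfied
            # set must be a strict subset of pi2's
            return s1 < s2
    return True  # clause (ii): equal at every level
```

Note that the resulting ordering is partial: when the first differing level is subset-incomparable, neither plan precedes the other.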
Note that a quantitative criterion can also be defined by counting the number of satisfied values in each level and, in line with the previous definition, comparing these counts lexicographically. The quantitative criterion, noted ⪯^quant_Δ, compares two plans lexicographically on the basis of comparative cardinality between sets of values.

Definition 11 (Quantitative ordering of plans). Let Δ = (γ, s0, Ω) be an ethical planning domain with Ω = (Ω1, . . . , Ωm) and π1, π2 ∈ Plan. Then, π1 ⪯^quant_Δ π2 if and only if:
(i) ∃ 1 ≤ k ≤ m s.t. |Sat(Ωk, π1, s0, γ)| < |Sat(Ωk, π2, s0, γ)|, and ∀ 1 ≤ k′ < k, |Sat(Ωk′, π1, s0, γ)| = |Sat(Ωk′, π2, s0, γ)|; or
(ii) ∀ 1 ≤ k ≤ m, |Sat(Ωk, π1, s0, γ)| = |Sat(Ωk, π2, s0, γ)|.

This allows us to define another notion of minimal contraction for a moral conflict, namely a minimal contraction with respect to a lexicographic value base.

Definition 12 (Lexicographic-minimal contraction). If Ω = (Ω1, ..., Ωm) is a value base and M = (∪Ω, γ, s0) is a moral problem, then M′ = (Ω′, γ, s0) is an Ω-qual-minimal contraction of M if and only if:
(i) Ω′ ⊆ ∪Ω;
(ii) M′ is not a moral conflict;
(iii) if M″ = (Ω″, γ, s0) is also a contraction of M, there is no k such that:
(a) 1 ≤ k ≤ m and Ω′ ∩ Ωk ⊂ Ω″ ∩ Ωk, and
(b) ∀ 1 ≤ i < k, Ω′ ∩ Ωi = Ω″ ∩ Ωi.

Note that by combining Definitions 11 and 12 we can define a notion of Ω-quant-minimal contraction.

Proposition 3. Given a moral conflict M, a contraction M′ is Ω-qual-minimal or Ω-quant-minimal only if it is qual-minimal.

3.7 Adding Desires
The behavior of autonomous ethical agents is driven not only by ethical values aimed at promoting the good of society but also by their endogenous motivations, also called desires or goals. Following existing theories of ethical preferences in philosophy, economics and logic [14,23,29], we assume that (i) desires and values are competing motivational attitudes, and (ii) the agent's degree of morality is a function of its disposition to promote the fulfilment of its values at the expense of the satisfaction of its desires. The following definition extends the notion of ethical planning domain with the notion of desire and introduces the novel concept of degree of morality.

Definition 13 (Mixed-motive planning domain). A mixed-motive planning domain is a tuple Γ = (γ, s0, Ω, ΩD, μ) where
206
U. Grandi et al.
– (γ, s0, Ω) is an ethical planning domain (Definition 9);
– ΩD ⊆ L_LTLf is the agent's set of desires or goals;
– μ ∈ {1, . . . , dg(Ω) + 1} is the agent's degree of morality.

A mixed-motive planning domain induces an ethical planning domain in which the agent's set of desires is treated as a set of values whose priority level depends on the agent's degree of morality. Specifically, the lower the agent's degree of morality, the higher the priority of the agent's set of desires in the induced ethical planning domain. In many practical applications it is likely to be desirable to restrict the range of values that μ can take, in order to prevent (for example) the robot's goal from overriding its safety values.

Definition 14 (Induced ethical planning domain). Let Γ = (γ, s0, Ω, ΩD, μ) be a mixed-motive planning domain. The ethical planning domain induced by Γ is the tuple Δ = (γ, s0, Ω′) such that dg(Ω′) = dg(Ω) + 1 with:
(i) Ω′μ = ΩD;
(ii) Ω′k = Ωk for 1 ≤ k < μ;
(iii) Ω′k = Ωk−1 for μ < k ≤ dg(Ω) + 1.
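Definition 14 amounts to inserting ΩD at position μ in the priority tuple. A small sketch (the function name is ours, value bases modelled as tuples of sets):

```python
def induced_value_base(omega, omega_d, mu):
    """Build the induced value base of Definition 14: the desires omega_d
    get priority mu, values below mu keep their level, and values from
    level mu onwards shift down by one."""
    if not 1 <= mu <= len(omega) + 1:
        raise ValueError("mu must be in {1, ..., dg(omega) + 1}")
    omega = list(omega)
    return tuple(omega[:mu - 1] + [omega_d] + omega[mu - 1:])
```

With μ = dg(Ω) + 1 the desires become the least important level; with μ = 1 they override every value, which is exactly the behaviour a restriction on μ would rule out.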
4 An Example
Consider a blood delivery robot in a hospital. The robot mostly makes deliveries between different storage areas, and sometimes delivers blood to surgeries. The robot may have to deal with various kinds of obstacles to complete its deliveries, but we will consider only one: people blocking the robot. The robot has two methods to resolve this obstacle: it can ask people to move and then wait for them to do so (ask), or it can use a loud airhorn to "force" them to move (horn). Once the person has moved, the robot can reach its destination (move). We suppose that the robot can tell some things about its environment: it knows if it is blocked (blocked), if it is near the operating theatre (theatre) and if it has reached its destination (destination). We can then define the action model as follows:

γ+(move, destination) = ¬blocked
γ−(ask, blocked) = blocked
γ+(ask, delayed) = ⊤
γ−(horn, blocked) = blocked
γ+(horn, annoyed) = ⊤
γ+(horn, dangerous) = theatre
otherwise, γ±(a, p) = ⊥
The propositions delayed, annoyed and dangerous are used to keep track of the robot's actions; we suppose that using the horn near the operating theatre is dangerous. The values and desires of the robot can be presented as follows:

Ω = (Ω1, Ω2)
Ω1 = {G¬dangerous}
Ω2 = {G¬annoyed}
ΩD = {Fdestination, F(destination ∧ ¬delayed)}

In words, the robot's goal is to reach its destination without delays, with the primary value to never do anything dangerous, and the secondary value to never be annoying. Let Ω′ be the value base induced by Ω, ΩD and μ = 3. Now we can compare the following 2-plans π1 = (ask, move) and π2 = (horn, move). If we assume that in the initial state the robot is blocked but far from an operating theatre, we can represent the histories generated by these plans as follows (each state contains exactly the propositions that are true in it):

H^π1: {blocked} —ask→ {delayed} —move→ {delayed, destination}
H^π2: {blocked} —horn→ {annoyed} —move→ {annoyed, destination}
In this case Sat(∪Ω′, π1, s0, γ) = {G¬dangerous, G¬annoyed, Fdestination} = A ⊇ Ω1 ∪ Ω2, whereas Sat(∪Ω′, π2, s0, γ) = {G¬dangerous, Fdestination, F(destination ∧ ¬delayed)} = B ⊇ Ω1 ∪ ΩD. Therefore π1 will be preferred to π2. However, if we change the morality level to 2, perhaps to represent an urgent delivery to an ongoing surgery, then we see that the robot will choose plan π2 rather than π1. This illustrates how we can adjust the morality level of the robot to reflect the urgency of its goals. If we move the example to the operating theatre (so now theatre ∈ s0 instead of ¬theatre ∈ s0), then the robot would not sound its horn even if the delivery was urgent, as Ω1 still overrides ΩD. This also means that for this robot we should restrict μ to 2 or 3 to ensure that being safe is always prioritised over goals. Furthermore, notice that for any lexicographic value structure containing exactly these values and goals, the set of non-dominated plans will always contain π1, π2 or both, since A and B are exactly the qual-minimal contractions of ∪Ω′ given an initial state where the robot is blocked.
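The whole example can be replayed in a few lines of Python. This is a sketch under encoding assumptions of our own: states are frozensets of true propositions, the action theory is hard-coded from the γ definitions above, and the two temporal patterns used here (G¬p, and F over a state predicate) are evaluated directly on finite histories rather than through a general LTLf checker.

```python
def gamma_effects(action, state):
    """Positive and negative effects of the example's action theory."""
    add, rem = set(), set()
    if action == "move" and "blocked" not in state:
        add.add("destination")
    if action == "ask":
        add.add("delayed")
        if "blocked" in state:
            rem.add("blocked")
    if action == "horn":
        add.add("annoyed")
        if "theatre" in state:
            add.add("dangerous")
        if "blocked" in state:
            rem.add("blocked")
    return add, rem

def history(plan, s0):
    """Sequence of states generated by executing the plan from s0."""
    states = [frozenset(s0)]
    for a in plan:
        add, rem = gamma_effects(a, states[-1])
        states.append(frozenset((set(states[-1]) - rem) | add))
    return states

def never(p):          # G not-p on a finite trace
    return lambda h: all(p not in s for s in h)

def eventually(pred):  # F pred on a finite trace
    return lambda h: any(pred(s) for s in h)

V1 = never("dangerous")                                       # G¬dangerous
V2 = never("annoyed")                                         # G¬annoyed
D1 = eventually(lambda s: "destination" in s)                 # Fdestination
D2 = eventually(lambda s: "destination" in s
                and "delayed" not in s)                       # F(dest ∧ ¬delayed)
```

Running π1 = (ask, move) and π2 = (horn, move) from {blocked} reproduces the two histories above, and checking V1, V2, D1, D2 on them reproduces the satisfied sets A and B.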
5 Computational Complexity
In this section we initiate the study of the computational complexity of ethical planning in our setting. We borrow our terminology from the work of Lang [17]
on compact preference representation, but the problems we study have obvious counterparts in the planning literature, as should be clear from the proofs. In the interest of space, all proofs can be found in the appendix. We begin by studying the problem Conflict, which determines if a moral problem is also a moral conflict.

Conflict
Input: Moral problem M = (Ω, γ, s0)
Question: Is there some k ∈ N such that there is a k-plan π such that Sat(Ω, π, s0, γ) = Ω?

Theorem 1. Conflict is PSPACE-complete.

We then study the case of contractions, in particular, determining if a given moral problem is a qual-minimal contraction.

MinimalContraction
Input: Moral problem M = (Ω, γ, s0), moral problem M′ = (Ω′, γ, s0)
Question: Is M′ a qual-minimal contraction of M?

Theorem 2. MinimalContraction is PSPACE-complete.

Neither of these results is particularly technically advanced; indeed, Conflict is almost exactly equivalent to PLANSAT from classical planning [8]. The purpose of these results is to indicate that, quite apart from the issue of how a robot should select the best option when faced with a moral conflict, the task of identifying that the robot is facing a moral conflict and determining all of its options is extremely computationally difficult. On the subject of planning, we begin by studying the problem Comparison, which, given two k-plans π1 and π2, asks whether π1 ⪯^qual_Δ π2. Despite the apparent complexity of our setting, this problem can be solved efficiently:

Comparison
Input: Ethical planning domain Δ = (γ, s0, Ω), k ∈ N, k-plans π1, π2
Question: Is it the case that π1 ⪯^qual_Δ π2?

Theorem 3. Comparison is in P.

We then move to the problem of non-dominance, i.e., the problem of determining, given a g-plan π1, whether there exists a better k-plan wrt. ⪯^qual_Δ (where g ≤ k).

NonDominance
Input: Ethical planning domain Δ = (γ, s0, Ω), k ∈ N, g-plan π for g ≤ k
Question: Is there a k-plan π′ such that π ⪯^qual_Δ π′ and not π′ ⪯^qual_Δ π?

We show that this problem, like most instances of classical planning satisfaction, is PSPACE-complete:

Theorem 4. NonDominance is PSPACE-complete.
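Since Conflict is PSPACE-complete, no efficient general procedure is to be expected, but a depth-bounded brute force conveys the definition. The following toy Python sketch uses interface assumptions of our own (an `apply_action` transition function and values as predicates over finite histories); the length bound `max_len` makes it only an approximation of the real, unbounded problem.

```python
from itertools import product

def is_conflict_bounded(actions, apply_action, s0, values, max_len):
    """Return True if no plan of length <= max_len satisfies every value,
    i.e. the moral problem looks like a conflict up to that bound."""
    for k in range(max_len + 1):
        for plan in product(actions, repeat=k):
            h = [frozenset(s0)]
            for a in plan:
                h.append(frozenset(apply_action(a, h[-1])))
            if all(v(h) for v in values):
                return False  # witness plan found: not a conflict
    return True
```

The exhaustive enumeration of |actions|^k plans is exactly the exponential blow-up that the PSPACE-completeness results say cannot be avoided in general.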
Proposition 4. Given an ethical planning domain Δ = (γ, s0, Ω), a k-plan π, and S = Sat(∪Ω, π, s0, γ), π is non-dominated for Δ if and only if M′ = (S, γ, s0) is an Ω-qual-minimal contraction of (∪Ω, γ, s0).

Theorems 3 and 4 are to be interpreted as baseline results showing the computational feasibility of our setting for ethical planning with LTLf. One clear direction for future work would be to expand on the computational complexity analysis, identifying tractable fragments and exploring their expressivity in ethical applications. An important property for an ethical planner is explainability. While explaining why a particular plan was chosen is difficult to do succinctly (even for humans), a simpler problem is to explain why the chosen plan was better than another proposed alternative. Our approach enables this in a way that is both computationally straightforward and intuitively understandable to humans, since by the lexicographic ordering of plans there always exists a single value or set of values that decides between two plans.
6 Conclusion
We put forward a novel setting for ethical planning obtained by combining a simple logical temporal language with lexicographic preference modelling. Our setting applies to planning situations with a single agent who has deterministic and instantaneous actions to be performed sequentially in a static and known environment. Aside from the addition of values, our framework differs from classical planning in two respects: by having multiple goals and by allowing temporal goals. In particular, the expressiveness of LTL means that we can express a wide variety of goals and values, including complex temporal values such as "if the weather is cold, close external doors immediately after opening them", with a computational complexity equivalent to that of standard planners. As a limitation, the system is less able to express values that tend to be satisfied by degree rather than absolutely or not at all. Among the multiple directions for future work that our definitions open, we plan to study the multi-agent extension with possibly conflicting values among agents, moving from plans to strategies (functions from states or histories to actions), from complete to incomplete information, and, most importantly, to test our model by implementing it in simple robotics scenarios. Furthermore, given the computational complexity of Conflict, MinimalContraction and NonDominance, it may often be the case that in practical applications we cannot guarantee finding a non-dominated plan. Therefore, it would be valuable to find more tractable algorithms that at least guarantee some degree of approximation of a non-dominated plan, or restrictions (likely to the language or action theory) that improve the tractability of the problem.

Acknowledgements. This work is supported by the CNRS project LEXIA ("The Logic of Explanation: From Explainable to Explaining Legal Knowledge-based Systems").
References 1. Alili, S., Alami, R., Montreuil, V.: A task planner for an autonomous social robot. In: Asama, H., Kurokawa, H., Ota, J., Sekiyama, K. (eds.) Distributed Autonomous Robotic Systems 8. Springer, Berlin, Heidelberg (2009). https://doi.org/10.1007/ 9783642006449_30 2. Anderson, M., Anderson, S.L.: Geneth: a general ethical dilemma analyzer. Paladyn (Warsaw) 9(1), 337–357 (2018) 3. Arkin, R.C., Ulam, P., Wagner, A.R.: Moral decision making in autonomous systems: enforcement, moral emotions, dignity, trust, and deception. Proc. IEEE 100(3), 571–589 (2012) 4. Awad, E., et al.: When is it acceptable to break the rules? Knowledge representation of moral judgement based on empirical data. CoRR abs/2201.07763 (2022) 5. Benzmüller, C., Parent, X., van der Torre, L.W.N.: Designing normative theories for ethical and legal reasoning: logiKEy framework, methodology, and tool support. Artif. Intell. 287, 103–348 (2020) 6. Berreby, F., Bourgne, G., Ganascia, J.: A declarative modular framework for representing and applying ethical principles. In: Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems (AAMAS) (2017) 7. Bienvenu, M., Fritz, C., McIlraith, S.A.: Planning with qualitative temporal preferences. In: Doherty, P., Mylopoulos, J., Welty, C.A. (eds.) Proceedings of the 10th International Conference on Principles of Knowledge Representation and Reasoning (KR), pp. 134–144. AAAI Press (2006) 8. Bylander, T.: The computational complexity of propositional STRIPS planning. Artif. Intell. 69(1–2), 165–204 (1994) 9. Copp, D.: The Oxford Handbook of Ethical Theory. Oxford University Press, Oxford (2007) 10. De Giacomo, G., Vardi, M.Y.: Linear temporal logic and linear dynamic logic on finite traces. In: Rossi, F. (ed.) Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI), pp. 854–860. IJCAI/AAAI (2013) 11. 
Dennis, L.A., Fisher, M., Slavkovik, M., Webster, M.: Formal verification of ethical choices in autonomous systems. Robot. Auton. Syst. 77, 1–14 (2016) 12. Dennis, L.A., del Olmo, C.P.: A defeasible logic implementation of ethical reasoning. In:1st International Workshop on Computational Machine Ethics (CME) (2021) 13. Evans, K., de Moura, N., Chauvier, S., Chatila, R., Dogan, E.: Ethical decision making in autonomous vehicles: The AV ethics project. Sci. Eng. Ethics 26(6), 3285–3312 (2020) 14. Harsanyi, J.: Utilitarianism and beyond. In: Sen, A.K., Williams, B. (eds.) Morality and the Theory of Rational Behaviour. Cambridge University Press, Cambridge (1982) 15. Jenkins, R., Talbot, B., Purves, D.: When robots should do the wrong thing. In: Robot Ethics 2.0. Oxford University Press, New York (2017) 16. Kment, B.: Varieties of Modality. In: Zalta, E.N. (ed.) The Stanford Encyclopedia of Philosophy. Spring (2021) 17. Lang, J.: Logical preference representation and combinatorial vote. Ann. Math. Artif. Intell. 42(1–3), 37–71 (2004) 18. Levi, I.: Hard Choices: Decision Making Under Unresolved Conflict. Cambridge University Press, Cambridge (1990)
19. Lindner, F., Mattmüller, R., Nebel, B.: Moral permissibility of action plans. In: Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI), pp. 7635–7642. AAAI Press (2019) 20. Loreggia, A., Mattei, N., Rossi, F., Venable, K.B.: On the distance between cpnets. In: Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS) (2018) 21. Loreggia, A., Rossi, F., Venable, K.B.: Modelling ethical theories compactly. In: The Workshops of the 31st AAAI Conference on Artificial Intelligence (2017) 22. Lorini, E.: A logic for reasoning about moral agents. Logique Analyse 58(230), 177–218 (2015) 23. Lorini, E.: Logics for games, emotions and institutions. FLAP 4(9), 3075–3113 (2017) 24. Lorini, E.: A logic of evaluation. In: Proceedings of the 20th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pp. 827–835. ACM (2021) 25. Müller, V.C.: Ethics of artificial intelligence and robotics. In: Zalta, E.N. (ed.) The Stanford Encyclopedia of Philosophy. Summer 2021 (2021) 26. Pnueli, A.: The temporal logic of programs. In: Proceedings of the 18th Annual Symposium on Foundations of Computer Science (FOCS) (1977) 27. Powers, T.M.: Deontological machine ethics. In: Anderson, M., Anderson, S.L., Armen, C. (eds.) Association for the Advancement of Artificial Intelligence Fall Symposium Technical Report (2005) 28. Rossi, F., Mattei, N.: Building ethically bounded AI. In: The 33rd AAAI Conference on Artificial Intelligence (AAAI) (2019) 29. Searle, J.: Rationality in Action. Cambridge University Press, MIT Press (2001) 30. Sen, A.: On Ethics and Economics. Basil Blackwell, Oxford (1987) 31. Vallor, S.: Technology and the Virtues: A Philosophical Guide to a Future Worth Wanting. Oxford University Press, New York (2016) 32. Vanderelst, D., Winfield, A.F.T.: An architecture for ethical robots inspired by the simulation theory of cognition. Cogn. Syst. Res. 48, 56–66 (2018) 33. 
Yu, H., Shen, Z., Miao, C., Leung, C., Lesser, V.R., Yang, Q.: Building ethics into artificial intelligence. In: Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI) (2018)
A Hybrid Recommender System with Implicit Feedbacks in Fashion Retail

Ilaria Cestari1,2, Luigi Portinale2,3(B), and Pier Luigi Riva1

1 ORS Group, Roddi, Italy
{ilaria.cestari,pierluigi.riva}@ors.it
2 Computer Science Institute, DiSIT, Univ. Piemonte Orientale, Alessandria, Italy
[emailprotected]
3 Inferendo srl, Alessandria, Italy
Abstract. In the present paper we propose a hybrid recommender system dealing with implicit feedbacks in the domain of fashion retail. The proposed architecture is based on a collaborative-filtering module taking into account the fact that user feedbacks are not explicit scores on the items, but are obtained through user interactions with the products in terms of number of purchases; moreover, a second module provides a knowledge-based contextual post-filtering, based on both customer-oriented and business-oriented objectives. We finally present a case study where "look-oriented" recommendations have been implemented for a specific fashion retail brand.

Keywords: Recommender systems · Implicit feedbacks · Hybrid architecture · Fashion retail

1 Introduction
Recommender Systems (RS) are software products based on machine learning whose goal is to learn user preferences for specific items or services in very different contexts, particularly e-commerce and online retail. They can employ various methods such as collaborative filtering, content-based, hybrid, and knowledge-based approaches [13]. The most widely adopted approaches are those based on collaborative filtering; the idea is that user preferences about specific items can be captured by looking at the interactions such users have with the set of available items. In general, one can think of a user-item interaction as a "feedback" the user provides with respect to the item. Formally, given a set of m users U, a set of n items I and a set of possible feedbacks F, we can define a feedback matrix R(m×n) = [rij] with rij ∈ F for i ∈ U, j ∈ I. In the most general case, the values in F are ranked preferences expressed as natural numbers (e.g., from 1 star up to 5 stars). In this situation we talk about explicit feedbacks, and a special case is that of binary feedbacks, where F = {0, 1} (i.e., like/dislike). However, very often users are not able or willing to leave explicit feedbacks, and what
can be done is to "count" the interactions between users and items. In this case rij ≥ 0 is just the number of times user i has interacted with item j. Interactions must be defined as specific actions such as item search, item view, item purchase or others. Of course, different kinds of interactions can have different meanings, leading to different information concerning the user preferences. For instance, some actions can be considered positive (purchase, addition to cart), while others are actually negative (removal from cart). Positive and negative actions should be treated according to their meaning (for example adding 1 to rij if the action is positive and subtracting 1 when it is negative). Moreover, even if all the actions are positive, they may have different relevance (e.g., a search is usually less indicative of a preference than a purchase); the different roles of the different actions can be taken into account directly in the collaborative filtering process [6, Chapter 4], or by resorting to some hybrid form of recommendation taking into account multiple kinds of knowledge [2,9]. In the present paper we consider implicit feedbacks in the form of positive interactions, counting the number of purchases of an item by a given user. We address the problem of implicit feedbacks by resorting to the confidence model proposed in [8], and by adopting a hybrid architecture where collaborative filtering is complemented with a knowledge-based subsystem taking into account specific business rules. The considered domain is that of fashion retail. The paper is organized as follows: in Sect. 2, the main concepts about collaborative filtering and the approach proposed to deal with implicit feedbacks are discussed; Sect. 3 illustrates the proposed hybrid architecture, focusing on each module and explaining how they have been developed, and in Sect.
4 a case study illustrating the diﬀerent steps of the recommendation process implemented for an important fashion retailer is discussed. Section 5 ﬁnally reports the conclusions and some comparisons with related works.
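As an illustration of the signed interaction-counting scheme described above, here is a minimal Python fragment; the action names and weights are assumptions of ours, not a scheme prescribed by the paper (which ultimately counts purchases only, see Sect. 3):

```python
from collections import defaultdict

# Assumed interaction weights: positive actions add evidence of interest,
# negative actions remove it.
WEIGHTS = {"purchase": 1, "add_to_cart": 1, "remove_from_cart": -1}

def implicit_feedbacks(events):
    """Aggregate (user, item, action) events into implicit feedbacks r_ij,
    keeping only the pairs with positive accumulated evidence."""
    r = defaultdict(int)
    for user, item, action in events:
        r[(user, item)] += WEIGHTS.get(action, 0)
    return {pair: count for pair, count in r.items() if count > 0}
```

The surviving counts are exactly the non-zero entries of the sparse feedback matrix R used in the next section.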
2 Collaborative Filtering with Implicit Feedbacks
Collaborative Filtering (CF) produces user-specific recommendations based on patterns of user actions such as ratings or item usage, without the need for explicit user or item meta-information. One of the most popular CF approaches is the latent factor model, a model-based technique relying on the low-rank factorization of the feedback matrix [14]. It addresses the sparsity problem by projecting users and items into a reduced latent space containing the most salient features of their interactions. Given the feedback matrix R(m×n), the idea is to decompose it into two matrices U(m×k) and V(n×k) such that

R(m×n) ≈ U(m×k) V^T(k×n)

where k ≪ n and k ≪ m is the size of the latent space and U, V are the latent feature matrices for users and items respectively. Once this factorization is obtained, if ui represents the i-th row of matrix U and vj the j-th row of matrix V, we can predict the feedback of each pair (i, j) of users and items as
214
I. Cestari et al.
r̂ij = ui · vj^T = Σ_{h=1}^{k} uih vjh
which is computable for each user-item pair, even for those having a missing entry in the original matrix R. Let the set of all user-item pairs (i, j) which are observed in R be denoted by S: S = {(i, j) : rij is observed}. A typical way of solving this factorization problem involves an optimization procedure (e.g., stochastic gradient descent) on the following objective function

J(U, V) = (1/2) Σ_{(i,j)∈S} ( (rij − r̂ij)² + λ(‖ui‖² + ‖vj‖²) )
where λ is a regularization hyperparameter. The above characterization is suitable when the entries encoded into the feedback matrix are explicit, such as precise ratings provided by the users. In the case of implicit feedbacks, which are those that interest us in the present work, some modifications to the framework must be considered. First of all, we must notice that in our case only positive interactions are considered, meaning that whenever the feedback is not null there is an interest of the user in the corresponding item. We then introduce an auxiliary indicator variable pij representing the generic interest of user i in item j, simply defined as

pij = 1 if rij > 0, and pij = 0 otherwise.

Following the approach suggested in [8], we also consider a confidence level for the indicator pij, depending on the actual number of interactions a user had with a given item and defined as follows:

cij = 1 + α rij

Given this characterization we have to find a user matrix U and an item matrix V minimizing the following cost function:

J(U, V) = (1/2) ( Σ_{i,j} cij (pij − ui · vj^T)² + λ( Σ_i ‖ui‖² + Σ_j ‖vj‖² ) )    (1)
In other words, we need to find a vector ui for each user and a vector vj for each item factorizing in the best possible way the user preferences. Preferences are represented by pij and are approximated by the inner product ui · vj^T. The main difference with the explicit feedback framework is that we need to take into account the confidence levels cij, but above all the fact that we need to consider every possible user-item pair (i, j), and not only those pairs for which we have an explicit interaction. This makes standard optimization procedures such as stochastic gradient descent impractical. A possible solution is to adopt Alternating Least Squares (ALS) optimization. The idea is conceptually simple: fix the user matrix U and find the optimal item matrix V; then fix the item
matrix V and find the optimal user matrix U; keep alternating the previous steps until convergence. However, the implicit feedback framework requires a careful strategy to deal with a dense cost function (all possible user-item pairs must be considered) and to integrate the confidence levels. In [8], Hu et al. propose the following procedure. First we compute user factors from the item factors contained in V(n×k):

– Compute the (k × k) matrix V^T V in time O(k²n).
– For each user i, let C^i(n×n) be a diagonal matrix with C^i_jj = cij (the diagonal contains the confidence in the preferences of the given user with respect to all n possible items); let also p(i) ∈ R^n be the vector containing the preferences pij of user i.
– The following expression¹ minimizes the cost function in (1):

ui = (V^T C^i V + λI)^−1 V^T C^i p(i)

In a similar fashion, once the user matrix U has been obtained, we can recompute² the entries of the item matrix as

vj = (U^T C^j U + λI)^−1 U^T C^j p(j)

where C^j(m×m) is the diagonal matrix with C^j_ii = cij (the diagonal contains the confidence in the preferences for the given item with respect to all m possible users), and p(j) ∈ R^m is the vector containing the preferences pij of every possible user for item j. The procedure alternates the above user-factor and item-factor computations until convergence. Once the final matrices U and V have been computed, the K available items with the largest score p̂ij = ui · vj^T are recommended to user i.
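The user-side update above can be sketched in pure Python. This is an illustrative implementation of one user update only (the item-side update is symmetric); the helper names, the tiny Gaussian-elimination solver, and the α and λ defaults are our choices, not values prescribed by [8]:

```python
def solve(A, b):
    """Gauss-Jordan elimination with partial pivoting for the small
    k x k linear systems appearing in the ALS updates."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def user_factor(V, r_i, alpha=40.0, lam=0.1):
    """One ALS user update u_i = (V^T C^i V + lam I)^-1 V^T C^i p(i),
    with confidence c_ij = 1 + alpha * r_ij and preference
    p_ij = 1 iff r_ij > 0; alpha and lam are illustrative defaults."""
    n, k = len(V), len(V[0])
    c = [1.0 + alpha * r_i[j] for j in range(n)]
    p = [1.0 if r_i[j] > 0 else 0.0 for j in range(n)]
    # A = V^T C^i V + lam I  (k x k),  rhs = V^T C^i p(i)  (length k)
    A = [[sum(c[j] * V[j][a] * V[j][b] for j in range(n))
          + (lam if a == b else 0.0)
          for b in range(k)] for a in range(k)]
    rhs = [sum(c[j] * V[j][a] * p[j] for j in range(n)) for a in range(k)]
    return solve(A, rhs)
```

For clarity this naive loop recomputes the full sums over all n items per user; the O(k²N + k³m) bound in the footnote is obtained by precomputing V^T V once and rebuilding only the correction coming from the non-zero entries of r_i.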
3 System Architecture
The main goal of the present work is to define an architecture for the recommendation of products in the fashion domain. Recommender systems in fashion retail are usually integrated into e-commerce platforms or digital marketing campaigns as personalized recommendation generators [15]; in this setting it is very unlikely that users release explicit "scores" on the products, thus the designed system must deal with the availability of feedbacks which are implicit by nature (i.e., user-item interactions such as purchases). Moreover, such a system also needs to fulfill specific objectives that can be divided into "customer-oriented" and "business-oriented". Regarding customers, an important aspect of fashion recommendations is the ability to propose an
¹ In [8] the authors show that the corresponding computation can be performed in time O(k²N + k³m), where N is the total number of non-zero entries in the feedback matrix.
² Similarly to the previous step, one can show that the computation takes O(k²N + k³n) time.
overall look composed of different types of products that fit well together and are tailored to the user's individual preferences. Indeed, fashion products belonging to different technical categories are commonly bought together, in order to obtain a given look which can fit the customer's style; thus the level of accuracy may be reduced with the goal of achieving higher levels of diversity and novelty, and to prevent over-specialization on the customer's past purchases. From the business point of view, user satisfaction after buying the recommended products can improve customer loyalty, while the goal of recommending products belonging to various categories can also be helpful to increase cross-selling and, more generally, to let the user explore, and ideally buy, as many products as possible. For these reasons, we propose a hybrid architecture based on two main modules (see Fig. 1):
Fig. 1. Hybrid system architecture.
– a collaborative ﬁltering module with implicit feedbacks – a knowledgebased postﬁltering module 3.1
Collaborative Filtering Module
We know that entities involved in the recommendation process are the customers (the set of users U ) and the products (the set of items I); in the considered domain, the interactions between them can be collected from the purchases history in various distribution channels (retail, outlet, web) and from any available system capable of tracking user activities, such as visiting products pages or searching for keywords, or ideally giving an explicit rating to purchased products.
A Hybrid Recommender System with Implicit Feedbacks in Fashion Retail
In the absence of such systems, purchases can be considered a good starting point to learn customers' preferences and how they are distributed. Usually, transactions (S) are stored in relational databases as receipt rows represented by tuples s = ⟨t, p, i, j, a⟩, where t is the transaction date, p, i, and j are the point of sale, customer, and item identifier respectively, and a is the activity or transaction type. Rows may be decorated with other data about the transaction, e.g., whether the customer used a coupon, the actual purchase value, or the chosen payment method; all of these can be used in further analyses to better characterize the customers. In the present work, we only consider transactions corresponding to purchases, i.e., a = 1; in addition, in the following discussion we are only interested in tracking which user has purchased a given item, in a given set of stores and in a particular time interval. We then define an indicator

f_ijt = 1 if ∃ s = ⟨t, _, i, j, 1⟩, and f_ijt = 0 otherwise,

with the symbol "_" meaning "don't care" (and the last element of s equal to 1 meaning purchase). Hence, by projecting S on users and items, and by considering a given time interval T, the transaction table can be reduced to tuples having the following structure:

⟨i, j, r_ij⟩ with r_ij = Σ_{t∈T} f_ijt
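The aggregation of receipt rows into the r_ij counts can be sketched as follows; the tuple layout mirrors s = ⟨t, p, i, j, a⟩, while the function name and the dict-of-counts representation are illustrative, not the actual system's implementation:

```python
from collections import defaultdict

def feedback_matrix(transactions, period):
    """Aggregate receipt rows s = (t, p, i, j, a) into implicit feedback
    counts r_ij = sum over t in `period` of f_ijt, keeping only purchases
    (a == 1). `transactions` is any iterable of
    (date, store, user, item, activity) tuples."""
    r = defaultdict(int)
    for t, _p, i, j, a in transactions:
        if a == 1 and t in period:  # f_ijt = 1 for a purchase in the window
            r[(i, j)] += 1
    return dict(r)
```

In practice the resulting counts would be loaded into a sparse matrix; the dict form keeps the sketch self-contained.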
Finally, the (sparse) feedback matrix is defined as R_{m×n} = [r_ij], where m = |U| and n = |I|. As described in Sect. 2, the feedback matrix is then used as the basis for the implementation of the collaborative filtering module: we first produce the factorization R_{m×n} = U_{m×k} · V_{k×n}^T, then compute the predicted feedback r̂_ij = u_i · v_j^T, and finally, for a given user i ∈ U, we return the top-k items j ∈ I with respect to r̂_ij.
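The factorization and ranking step can be sketched as follows. A plain alternating least squares is used here as a simplified stand-in for the implicit-feedback method of [8] (confidence weighting is omitted for brevity), and all function names are illustrative:

```python
import numpy as np

def fit_factors(R, k, iters=100, lam=0.1, seed=0):
    """Factorize R (m x n) as U V^T with k latent factors via plain ALS;
    a simplified stand-in for the implicit-feedback ALS of [8]."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = rng.normal(scale=0.1, size=(m, k))
    V = rng.normal(scale=0.1, size=(n, k))
    reg = lam * np.eye(k)
    for _ in range(iters):
        U = R @ V @ np.linalg.inv(V.T @ V + reg)
        V = R.T @ U @ np.linalg.inv(U.T @ U + reg)
    return U, V

def recommend(U, V, i, purchased, top_n=5):
    """Predicted feedback r_hat_ij = u_i . v_j for every item j, excluding
    the items the user already bought; returns (item, score) pairs."""
    scores = U[i] @ V.T
    scores[list(purchased)] = -np.inf  # never re-recommend purchased items
    ranked = np.argsort(-scores)[:top_n]
    return [(int(j), float(scores[j])) for j in ranked]
```

On the real data the dense matrix would be replaced by a sparse representation; the structure of the computation stays the same.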
3.2 Knowledge-Based Post-filtering Module
The second module executes a knowledge-based post-filtering on the results list produced by the CF model. It allows filtering the results by adding constraints based on product features or contextual data, such as location, time, or other domain-specific elements. In our case, constraints are defined by domain experts as business rules, which describe known relationships between the domain entities (users or items), and are used to adjust the model's results either by filtering out or replacing some items, or by adding gains and penalties to the scores, in order to change the ranking. Such rules are defined using a GUI (Fig. 2). For each entity, a set of variables is defined representing its characteristics (e.g., the product description, the customer's age or the estimated annual income). In addition, a set of actions available on specific instances is also defined; the main action, especially for item entities j ∈ I, is to select the instance and add
Fig. 2. An example of rule definition using the developed GUI: experts can choose the target entity and define multiple conditions with logical operators over their variables.
it to an output list to proceed with the post-filtering operations. This is also the main goal of the rules in this architecture: they can be considered as queries targeting the selection of some recommended products on which to execute a specific filtering function. A rule is composed of a condition over some entity variables, and a consequence which determines the action the system must perform on the instances satisfying the constraint. Actions select instances whose values meet given conditions or, in more complex cases, link the status of one instance to that of the instances of another entity, in order to create an explicit correlation between them. For example, let us consider some features of the entities I ("Status", "IsCurrentSeason" and "ConsumerGroup") and U ("Age" and "Gender"). A rule representing a constraint on items is reported in (2), and a more complex rule that links the two entities in (3):

I.Status = "Adoption" ∧ I.IsCurrentSeason = True
(2)
U.Gender = "F" ∧ U.Age ≥ 18 ⇒ I.ConsumerGroup = "Misses"
(3)
The first rule selects every item in the "Adoption" production status and available for the current season; the second one is used to define a constraint on the entity I on the basis of the instance of U (i.e., the target user). Such rules can be applied in a modular way over the model's results, by introducing the idea of "context", which is an aspect of the domain a group of rules refers to. Following the framework proposed in [1], the aim is to implement a type of context-aware recommender system with contextual post-filtering. For each context, the domain expert defines the filtering operations to perform on the instances selected by the context's rules. The number of contexts may vary depending on the number of entities involved, or on the different objectives that the system must achieve. In fashion retail, one usually considers at least two main contexts: the "catalog context", containing constraints about product availability, which can change depending on the temporal context of the recommendation or other external causes; and the "customer context", which allows retailers to define correlations between the features of products and customers. The next section describes how they can be applied in a specific case study.
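A business rule of this kind can be seen as a pair of predicates over entity attributes: a condition (over items or the target user) and a consequence. A minimal sketch follows; the attribute names match the examples above, while the dict-based entity representation and function names are illustrative, not the actual system's API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    """condition => consequence, both predicates over attribute dicts."""
    condition: Callable[[dict], bool]
    consequence: Callable[[dict], bool]

# Rule (2): select items in "Adoption" status and in the current season.
rule2 = Rule(
    condition=lambda item: item["Status"] == "Adoption" and item["IsCurrentSeason"],
    consequence=lambda item: True,
)

# Rule (3): for adult female users, constrain items to the "Misses" group.
rule3 = Rule(
    condition=lambda user: user["Gender"] == "F" and user["Age"] >= 18,
    consequence=lambda item: item["ConsumerGroup"] == "Misses",
)

def select(items, rule):
    """Return the items satisfying a purely item-level rule such as (2)."""
    return [j for j in items if rule.condition(j) and rule.consequence(j)]
```

For a rule like (3), the condition is evaluated once on the target user, and the consequence is then applied as a constraint on each candidate item.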
4 Case Study and Experimental Results
In a fashion store or e-commerce platform, item features are usually organized hierarchically. The actual hierarchy can vary from one brand to another; in this paper we refer to a case study concerning an important fashion brand, where a given product belongs to a merchandise group characterized by the department store (outlet or retail), the technical category (such as trousers, shirts or skirts), and the "lifecycle" (fashion or basic). Inside a given merchandise group we identify different models; the model identifies a series of technical features, like the garment materials, fitting and other specific characteristics, that help to distinguish the product style. Finally, for a given model, we can specify "low-level" features such as the color or size of the garment (see Table 1).

Table 1. Hierarchy of product features, with the example of a men's dress shirt.

Hierarchy level    | Description                                                             | Example
Merchandise Group  | Indicates the product's department, technical category and lifecycle    | Retail, Dress Shirt, Fashion
Group Type         | The target customer group                                               | Men
Model              | Id that identifies the stylistic features of the product                | Regular Point Collar
Color (Style)      | Color shade                                                             | Blue Navy
Size               | Stock Keeping Unit (SKU), the unit sold and registered in the receipts  | S
Customers buy products at the lowest hierarchy level: the so-called Stock Keeping Unit (SKU). Each SKU represents a specific garment as an instance of a merchandise group (store type, technical category and lifecycle) with a specific model, size and color. Typically, all these features are categorical and may have a broad range of values. The CF model discussed in Sect. 2 does not use the explicit features of users and products to learn their latent factors, so it is crucial to decide over which items the preferences should be determined. By considering the feedback as purchases at the SKU level, the user-item preference is implicitly computed over all the product characteristics, and thus the model will propose the most preferable SKUs for the users. Depending on the recommender's final objectives, it may be useful to consider a more abstract level of attributes (i.e., to get rid of some details, such as size and color) or to group the values of some features. In this way, we can learn preferences over more abstract aspects, such as the style, instead of specific characteristics such as the size or the shade of color. Furthermore, this helps to reduce the number of user-item pairs and thus the dimensionality of the feedback matrix. In the present case study, we are targeting look-oriented recommendations; in this situation, the size is too specific, related more to the user's need to
find clothes suitable for her/his body, rather than to an actual preference; thus it should be excluded. On the other hand, color is a fundamental feature of fashion products, very representative of the user's preferences, but with a huge number of "nuances"; this could lead to recommendations which are too biased by a specific color shade. Hence, shades have been grouped into their main color (e.g., "blue" instead of "light blue", "dark blue" or "navy blue") to keep the recommendations as generic as possible, and to easily recommend different colors as well. We have taken into account the retail transaction history of the last two years, containing purchases from both the stores and the web distribution channels of the considered brand, and selected customers with at least 3 purchases; the final (sparse) feedback matrix R_{m×n} contained 5,200,649 positive interactions between m = 662,964 loyal customers and n = 26,185 model-color items. In the following, in order to provide a recommendation example, we consider a specific user case: a 30-year-old man, some of whose purchases are listed in the first column of Table 2.

Table 2. Some of the customer's purchases and the top 5 recommendations of the model with k = 500 factors. The descriptions of the recommended items also report their seasonality and production status.

Purchased                                      | Recommended                                                | r̂_ij  | Item id
Ribbed Crew Socks Beige Men                    | Ribbed Crew Socks Black Men (Ongoing, Design)              | 0.749 | 46943
Leather Belt Brown Men                         | Casual Trousers Relaxed Plain Blue Men (Ongoing, Adoption) | 0.710 | 99337
Supima Cotton Crewneck Sweater Stripe Blue Men | Dress Shirt Blue Men (Fall, Adoption)                      | 0.688 | 126718
Dress Shirt Stretch White Summer Men           | Ribbed Crew Socks Blue Men (Ongoing, Design)               | 0.686 | 46945
Dress Shirt Purple Fall Men                    | Set Shorts Bermuda Dyed Beige Men (Fall, Adoption)         | 0.657 | 71958
The model has been fitted in two versions: one with k > 1 factors, which is the main model, and one benchmark with 1 factor (equivalent to recommending the most popular items in terms of purchases); the model's parameters have been tuned with a 10-fold cross-validation, and the best results have been obtained with k = 500 factors, achieving an AUC score of 0.73 on the test set against the 0.56 of the benchmark model. If we denote by Ii the set of items already purchased by user i, the output for user i will be a list L of pairs (j, r̂_ij) with j ∈ (I \ Ii), ranked by r̂_ij (see Sect. 2). Table 2 reports (last three columns) the top 5 items recommended by the model for the customer described above, together with the corresponding r̂_ij score. For the post-filtering module, three contexts have been defined: "catalog", "customer", and "look". As described in Sect. 3.2, the catalog context determines
which items are available at the time of the recommendation; it identifies two subsets, I_out and I_v, of unavailable and valid items respectively. The post-filtering operation consists in removing from the output list L any pair (j, r̂_ij) with j ∈ I_out, replacing it with the pair (j′, r̂_ij′), where j′ ∈ I_v is the most similar (available) item to j (in case such a j′ exists). In the present case study, the similarity score between two items j1, j2 has been computed as s_{j1 j2} = v_{j1} · v_{j2}^T, reusing the latent factors V learned by the CF model (see Sect. 2). The score of the new replacing item j′ is again computed from the latent factors as r̂_ij′ = u_i · v_{j′}^T (see again Sect. 2). The catalog context has the highest priority, and thus the strongest effect on the recommended items. For instance, considering rule (2) as the only rule in the catalog context, Table 4 reports the new top 5 items (first column), in which only the product with ItemId = 99337 is kept, while the others have been replaced. Score s1 in Table 4 refers to the usual item relevance score r̂_ij (i being the current user and j the considered item). Customer context rules are applied after the catalog rules, and link customer and item characteristics with respect to known statistical correlations, such as relationships between age (or gender) and particular garment categories, relationships between an item's price and how much the customer usually spends, and so on. However, the descriptive characteristics of loyal customers are not necessarily representative of their normal buying behaviour or style preferences: a common case is when registered customers buy products for family members or other people.
Thus, the application of this context shouldn't be too disruptive to the model's output; it consists in increasing the score of each currently recommended item j as follows:

s2 = s1 · (1 + P(j, Ci))    (4)

where s1 is the old item relevance score, Ci is the set of customer context rules whose antecedent is satisfied by the features of user i, and P(j, Ci) is a term computed as:

P(j, Ci) = t(j, Ci) / |Ci|

where t(j, Ci) is the number of customer context rules in Ci whose consequent is satisfied by the features of item j. The idea is to increment the score by a quantity proportional to the number of satisfied rules. Table 3 shows some examples of customer context rules. The items listed in Table 4 are those resulting from first applying catalog rule (2), and then replacing the items different from ItemId = 99337 (the only one satisfying the rule) with their most similar available items. Score s2 is the result of selecting customer context rule 1 of Table 3 (the only one satisfied by the current user) and applying formula (4).

Table 3. Rules available in the customer context. Given the target customer's profile, since his parental status is unknown, rule 1 is the only one that can be added to the customer's specific context Ci, and thus |Ci| = 1.

Rule | Definition
0    | Gender = "F" ∧ Age ≥ 18 ⇒ ConsumerGroup = "Misses"
1    | Gender = "M" ∧ Age ≥ 18 ⇒ ConsumerGroup = "Men"
2    | Children = "Yes" ⇒ ConsumerGroup = ("Boys" ∨ "Girls")

Table 4. Changes in scores after applying the catalog and the customer contexts. Here, the listed items are available in the current season and in adoption status (Lc1), and all have had their score increased by the customer context, since all the visible items belong to the "Men" consumer group.

Item  | Score s1 | Score s2 | Description
99337 | 0.710    | 1.420    | Casual Trousers Relaxed Plain Blue Men
99341 | 0.489    | 0.976    | Pants Lightweight Stretch Chino Gray Men
74393 | 0.437    | 0.874    | Shoes Leather Boat Blue Men
57871 | 0.436    | 0.871    | Boots Field Chukka Brown Men
99302 | 0.421    | 0.842    | Casual Trousers Relaxed Plain Blue

Finally, the look context has been added to introduce more diversity among the recommended product categories, in order to increase cross-selling. In this phase, the items in the current recommended list are penalized with a term c(Tj), which represents how many times the technical category Tj of item j has already appeared in higher rank positions in the list. The new score for each item is computed as:

s′_ij = s_ij / c(Tj)

Table 5 shows the results of applying this penalty score to the items of Table 4. Here it is not necessary to define explicit rules, because the post-filtering operation is performed on every item without any selection; notice that, in principle, one could also integrate business rules to replace the penalized items with others "compatible" with those in the list.

Table 5. Penalties assigned by the look context and the final score of each recommended item.

Item  | Score | Tech. category  | Penalty | New score
99337 | 1.420 | CASUAL TROUSERS | 1       | 1.420
99341 | 0.976 | CASUAL TROUSERS | 2       | 0.488
74393 | 0.874 | SHOES           | 1       | 0.874
57871 | 0.871 | SHOES           | 2       | 0.436
99302 | 0.842 | CASUAL TROUSERS | 3       | 0.281
5 Conclusions and Related Works
As reported in [5], the fashion and apparel industries have grown tremendously in recent years, especially because of the availability of a great number of products in online stores, coupled with the support provided by recommender systems. One specific challenge is the large vocabulary of distinct fashion items, leading to very sparse user-item interaction matrices, often represented at a given over-specification level (as we discussed above). Other issues in fashion recommendation are related to the suggestion of a suitable "look" or outfit [4,10], as well as to the evolution of fashion trends across time and location [11]. The vocabulary problem is often tackled through computer vision techniques for the determination of item categories and attributes [3,7], while the other issues can be addressed by learning suitable models from massive amounts of social data [16], customer-reported information [12], or customer reviews [17]. In the present work, we have dealt with the above issues by resorting to a hybrid architecture, where collaborative filtering is complemented with specific contextual knowledge-based rules. The drawback is that expert knowledge must be elicited in order to build the contextual rules; however, as we have outlined in the case study, the fashion domain provides precise contextual situations where such rules can be obtained from experts without a huge effort. The experience gained in this application suggests that the approach is feasible and beneficial.
References

1. Adomavicius, G., Mobasher, B., Ricci, F., Tuzhilin, A.: Context-aware recommender systems. AI Mag. 32(3), 67–80 (2011)
2. Burke, R.: Hybrid recommender systems: survey and experiments. User Model. User-Adap. Inter. 12(4), 331–370 (2002)
3. Chen, H., Gallagher, A., Girod, B.: Describing clothing by semantic attributes. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7574, pp. 609–623. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33712-3_44
4. Chen, W., et al.: POG: personalized outfit generation for fashion recommendation at Alibaba iFashion. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2662–2670 (2019)
5. Deldjoo, Y., et al.: A review of modern fashion recommender systems. ACM Comput. Surv. 37(4), 111:1–111:35 (2021)
6. Dunning, T., Friedman, E.: Practical Machine Learning: Innovations in Recommendation. O'Reilly, Sebastopol (2014)
7. Ferreira, B., Costeira, J., Sousa, R., Gui, L.Y., Gomes, J.: Pose guided attention for multi-label fashion image classification. In: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW 2019), pp. 3125–3128 (2019)
8. Hu, Y., Koren, Y., Volinsky, C.: Collaborative filtering for implicit feedback datasets. In: Proceedings of the 8th IEEE International Conference on Data Mining (ICDM), pp. 263–272 (2008)
9. Koren, Y., Bell, R.: Advances in collaborative filtering. In: Ricci, F., Rokach, L., Shapira, B. (eds.) Recommender Systems Handbook, pp. 77–118. Springer, Boston (2015). https://doi.org/10.1007/978-1-4899-7637-6_3
10. Lin, Y.L., Tran, S., Davis, L.: Fashion outfit complementary item retrieval. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), pp. 3311–3319 (2020)
11. Matzen, K., Bala, K., Snavely, N.: StreetStyle: exploring world-wide clothing styles from millions of photos. CoRR abs/1706.01869 (2017). http://arxiv.org/abs/1706.01869
12. Parr, J., Pookulangara, S.: The impact of True Fit technology on consumer confidence in their online clothing purchase. In: Proceedings of the Annual Conference of the International Textile and Apparel Association. Iowa State University Press (2017)
13. Ricci, F., Rokach, L., Shapira, B.: Recommender Systems Handbook, 2nd edn. Springer, New York (2015). https://doi.org/10.1007/978-1-4899-7637-6
14. Takacs, G., Pilaszy, I., Nemeth, B., Tikk, D.: Scalable collaborative filtering approaches for large recommender systems. J. Mach. Learn. Res. 10, 623–656 (2009)
15. Walter, F., Battiston, S., Yildirim, M., Schweitzer, F.: Moving recommender systems from online commerce to retail stores. Inf. Syst. e-Bus. Manag. 10, 367–393 (2012)
16. Wen, Y., Liu, X., Xu, B.: Personalized clothing recommendation based on knowledge graph. In: Proceedings of the International Conference on Audio, Language and Image Processing (ICALIP 2018), pp. 1–5 (2018)
17. Zhao, K., Hu, X., Bu, J., Wang, C.: Deep style match for complementary recommendation. CoRR abs/1708.07938 (2017). http://arxiv.org/abs/1708.07938
Incremental Timeline-Based Planning for Efficient Plan Execution and Adaptation

Riccardo De Benedictis(B), Gloria Beraldo, Amedeo Cesta, and Gabriella Cortellessa

CNR - Italian National Research Council, ISTC, Via S. Martino della Battaglia 44, 00185 Rome, RM, Italy
{riccardo.debenedictis,gloria.beraldo,amedeo.cesta,gabriella.cortellessa}@istc.cnr.it
https://istc.cnr.it

Abstract. The increasing deployment, in real environments, of intelligent and distributed systems such as robotic platforms, wearable sensors and AI-based devices requires robust solutions that allow planned activities to converge with the emerging dynamic reality. Once a planning problem has been solved, indeed, it needs to be executed and, in the real world, things might not go as expected. Although planned activities may be carried out by some underlying reactive modules, the adaptation to the surrounding environment provided by such components may not be sufficient to achieve the planned goals. Planned activities, for example, can be delayed or last longer than expected. The execution of other activities could fail, threatening the achievement of the desired goals. Finally, new objectives may emerge during execution, thus requiring changes to ongoing plans. This paper presents a timeline-based framework for efficiently adapting plans in order to cope with possible complications which might emerge during execution. By exploiting the information gathered during the solution-finding process, the proposed framework allows the generated plan to be adapted, efficiently and without overturning it, in case of unexpected events during its execution. Empirical results show that, compared to replanning from scratch, plan adaptations can be obtained more efficiently, reducing computational costs and consequently enhancing the ability of the whole system to react quickly to unexpected events.

Keywords: Automated planning · Plan execution · Plan adaptation · Timeline-based planning

1 Introduction
This work is partially supported by the "SI-Robotics: SocIal ROBOTICS for active and healthy ageing" project (Italian M.I.U.R., PON – Ricerca e Innovazione 2014–2020 – G.A. ARS01 01120).
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. A. Dovier et al. (Eds.): AIxIA 2022, LNAI 13796, pp. 225–240, 2023. https://doi.org/10.1007/978-3-031-27181-6_16

Automated planning has been defined as "the reasoning side of acting" [24]. Planning, in particular, represents an abstract, explicit deliberation process that
chooses and organizes actions by anticipating their expected outcomes. Although automated planning constitutes a rich technical field, most of the literature on domain-independent planning is biased towards that "reasoning" side [36]. Whether due to partial knowledge of the world, or to the impossibility of predicting the actions of other agents that autonomously act in the same environment, a large part of any agent's behavior can be traced back to its ability to react to dynamic changes occurring, or predicted, in the world. Unlike other approaches that propose the integration of planning systems into the executives of autonomous ones [6,31,38], this paper has the twofold objective of: a) concentrating on a specific form of planning, called timeline-based [34]; and b) proposing a new framework which, by exploiting the knowledge acquired during previous reasoning processes, is able to adapt plans more efficiently than replanning from scratch. The reasons for concentrating on timeline-based planning are mainly due to the fact that the constraints introduced among the different elements of the plan during the reasoning process produce a partial plan [47] which, compared to the total-order plans usually generated by more classical approaches, often turns out to be more suitable for adaptation during plan execution. Despite its ability to adapt dynamically, however, this type of formalism is particularly expressive and, consequently, associated with a high computational complexity, which makes the reasoning process significantly onerous, with consequently long computation times.
While in [13] it has been demonstrated that the computation time can be effectively reduced thanks to the introduction of some domain-independent heuristics, herein we focus on showing how the introduction of some of the data structures necessary for the computation of the above heuristics also allows, as will be detailed further on, the dynamic adaptation of plans during their execution to be managed efficiently.
2 Related Works
The problem of the dynamic adaptation of plans has already been tackled from various points of view. Some approaches, such as those relying on simple temporal networks with uncertainty [32,33,50] or those based on model checking [2,3], aim to generate robust solutions that do not require (or, in any case, minimize) the need for adaptations at runtime. Although desirable in contexts in which certain safety conditions in interacting with people are required (such as, for example, industrial contexts), these approaches prove to be unattractive in situations with fairly free interactions (e.g., navigation or dialogues) between the user and the machine. Unlike in standard planning problems, solving a contingency planning problem, as described in [39,48], does not generate a plan, but a decision tree with different contingent branches that could arise at execution time, whose execution allows the agent to achieve the planned objectives. These approaches allow for practically immediate adaptation at runtime and therefore, once the solution is found, they are probably the best possible choice. Nonetheless, they require considering, in the problem solving phase, all the possible events that may
occur during the execution, making the reasoning process particularly burdensome even for relatively simple problems. Furthermore, the approaches adopted in contingency planning rarely manage forms of numerical and/or temporal reasoning. Nebel et al. compare the advantages of replanning from scratch and of reusing information from the old plan to generate a new one, showing that, from a theoretical point of view, the two approaches have, not surprisingly, the same computational complexity [37]. Relying on such theoretical results, approaches like ROSPlan [6] generate a new planning problem from scratch whenever an exogenous event, incompatible with the current solution, occurs. These approaches have the great advantage of being able to use any existing planner as a black box. However, they have the obvious disadvantage of potentially taking a lot of computational time whenever some exogenous event requiring adaptation occurs. Despite the theoretical results, indeed, further studies, such as [16,22,28,40,41], show that plan adaptation can be, in practice, more effective than replanning. Such repair approaches, furthermore, help to maintain plan stability, that is, how close the newly generated plan is to the one it must replace. The approach proposed in this paper is situated within the latter context. Unlike the cited approaches, however, we focus on a particular class of automated planning which, in addition to explicitly allowing forms of temporal and numerical reasoning, relies on partial-order planning and, hence, produces solutions that usually require a smaller number of causal adaptations when unexpected events occur at execution time. Particularly relevant to our approach, the Flexible Acting and Planning Environment (FAPE), introduced in [15], combines plan-space planning with simple temporal networks and hierarchical decomposition rules.
A dispatcher calls, for each planned action, a set of skills and keeps track of their evolution over time, allowing plan repair, extension, and replanning, while being able to check and keep up to date the temporal relations and the causal constraints. Compared to FAPE, our architecture assigns a more central role to the acting component, giving it the ability to determine when and how to generate plans, execute them, adapt them or, if they are no longer needed as a consequence of a drastic change in the environment, discard them and generate new ones. More than on architectural aspects, however, we focus, in this paper, on the possibility of dynamically and efficiently adapting plans in the event of failures. When dealing with failures, indeed, FAPE is limited to removing just the one failing action, without considering cascades of other potential failures. Thanks to the adaptation of classical planning heuristics, as we will see, and similarly to what is done in the previously cited works on classical planning, we are able to overcome this limitation.
3 Technical Background
Timeline-based planning constitutes a form of deliberative reasoning which, in an integrated way, allows carrying out different forms of semantic and causal reasoning. Although this approach to planning has mostly been relegated to forms
of causal reasoning in the space domain, many solvers have been proposed over time, like, for example, IXTET [23], Europa [26], Aspen [11], the Trf [8,19], on which the APSI framework [20] relies, and, more recently, PLATINUm [45]. Some theoretical works on timeline-based planning, like [18,26], were mostly dedicated to identifying connections with classical planning à la PDDL [17]. The work on IXTET and the Trf has tried to clarify some key underlying principles, but mostly succeeded in underscoring the role of time and resource reasoning [9,29]. The planner CHIMP [44] follows a Meta-CSP approach, having meta-constraints which heavily resemble timelines. The already mentioned FAPE [4,15] tightly integrates structures similar to timelines with acting. The Action Notation Modeling Language (ANML) [42] is an interesting development which combines Hierarchical Task Network (HTN) [7,35,49] decomposition methods with the expressiveness of the timeline representation. Finally, it is worth mentioning that timeline-based approaches have often been associated with resource-management capabilities. By leveraging constraint-based techniques, most of the above approaches, like IXTET [10,29,30,43] or [46], integrate planning and scheduling capabilities. Finally, [12] proposes a recent formalization of timeline-based planning. Given the mentioned link with the heuristics, in this paper we will refer to the timeline-based planning formalization defined in [13]. According to this formalization, specifically, the basic building block of timeline-based planning is the token which, intuitively, is used to represent a single unit of information. Through their introduction and their constraining during the planning process, in particular, tokens allow representing the different components of the high-level plans. In its most general form, a token is formally described by an expression like n(x0, . . . , xi)^χ, where n is a predicate symbol, x0, . . . , xi are its parameters (i.e., constants, numeric variables or object variables) and χ ∈ {f, g} is a constant representing the class of the token (i.e., either a fact or a goal). The token's parameters are constituted, in general, by variables of a constraint network N (refer to [14] for further details) and can be used, among other things, to represent temporal information such as the start or the end of some tasks. The semantics of the χ constant, on the contrary, is borrowed from Constraint Logic Programming (CLP) [1]. Specifically, while facts are considered inherently true, goals must be achieved as defined by a set of rules. Rules, in particular, are expressions of the form n(x0, . . . , xk) ← r, where n(x0, . . . , xk) is the head of the rule and r is its body. In particular, r represents the requirement for achieving any goal having the "form" of the head of the rule. Such a requirement can be either a token, a constraint among tokens (possibly including the x0, . . . , xk variables), a conjunction of requirements, or a disjunction of requirements. It is worth noting the recursive definition of requirement, which allows the body of a rule to be any logical combination of tokens and constraints. Similarly to CLP, through the application of the rules it is hence possible to establish and generate relationships among tokens. Compared to CLP, however, timelines introduce an added value: some tokens may be equipped with a special
Incremental TimelineBased Planning
229
Fig. 1. Different timelines extracted from their associated tokens.
object variable τ that identifies the timeline affected by the token. Different tokens with the same value for the τ parameter affect the same timeline and, depending on the nature of the timeline, might interact with each other. There can indeed be different types of timelines. In the case of state-variable timelines (see Fig. 1a), for example, different tokens on the same state variable cannot temporally overlap. In the case of reusable-resource timelines (see Fig. 1b), on the contrary, tokens represent resource usages and can hence overlap as long as the concurrent usage remains below the resource's capacity. Given the ingredients mentioned above, we can now formally introduce the addressed planning problem. A timeline-based planning problem is a triple P = (O, R, r), where O is a set of typed objects, needed for instantiating the initial domains of the constraint network variables and, consequently, the tokens' parameters, R is a set of rules, and r is a requirement. Intuitively, a solution to such a problem is a set of tokens whose parameters assume values that guarantee the satisfaction of all the constraints imposed by the problem's requirement, by the application of the rules, and by the cumulative constraints imposed by the timelines. Unfortunately, the previous definition, although intuitive, is not easily translatable into a reasoning process which guarantees its achievement starting from the definition of the planning problem. For this reason, just like common partial-order planners, timeline-based planners often rely on the concepts of flaw and resolver. The planner internally maintains a data structure, called a token network, which represents a partial plan π = (T, N), where T is a set of tokens whose parameters are constrained by the constraint network N.
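To make the notions above concrete, the following Python sketch models tokens n(x0, . . . , xi)^χ, rules, and the two timeline semantics. It is a minimal illustration under our own assumptions: all class and function names are ours, not the authors' implementation, and constraint-network propagation is deliberately omitted.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    predicate: str       # the symbol n
    params: tuple        # x0, ..., xi (constants or constraint-network variables)
    klass: str           # 'f' (fact) or 'g' (goal)

@dataclass
class Rule:
    head: str            # predicate achieved by this rule
    body: list           # requirement: tokens, constraints, nested combinations

def achievable(token, rules):
    """Facts are inherently true; a goal needs a rule whose head matches it."""
    return token.klass == 'f' or any(r.head == token.predicate for r in rules)

def state_variable_consistent(intervals):
    """State-variable timelines: tokens (start, end) must not overlap in time."""
    intervals = sorted(intervals)
    return all(a[1] <= b[0] for a, b in zip(intervals, intervals[1:]))

def reusable_resource_consistent(usages, capacity):
    """Reusable-resource timelines: concurrent (start, end, amount) usages
    may overlap as long as their sum stays within the resource's capacity."""
    events = []
    for start, end, amount in usages:
        events += [(start, amount), (end, -amount)]
    level = 0
    # at equal times, releases (-amount) sort before acquisitions (+amount),
    # so back-to-back usages are allowed
    for _, delta in sorted(events):
        level += delta
        if level > capacity:
            return False
    return True
```

A timeline-based planning problem P = (O, R, r) would then pair such rules with a requirement and a set of typed objects; the real solver additionally threads all token parameters through the constraint network N [14].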
During the resolution process, the reasoner incrementally refines the current token network π by identifying its flaws and solving them through the application of resolvers, while keeping the constraints of N consistent. There can be, in general, different types of flaws, each resolvable by applying the corresponding resolvers. The achievement of a goal, for example, can take
230
R. De Benedictis et al.
place either through the application of a rule or through a unification with either a fact or another already achieved goal with the same predicate (i.e., the parameters of the current goal and of the token with which it is unifying are constrained to be pairwise equal). In the case of disjunctions, introduced either in the initial problem or by the application of a rule, a disjunct must be chosen. The domain of each variable that makes up a token's parameters must be reduced to a single allowed value. Finally, timelines must be consistent, possibly requiring the introduction of constraints which prevent disallowed overlaps. Thanks to the introduction of the flaw and resolver concepts, it is therefore possible to provide an implementable definition of solution. Specifically, a solution to a timeline-based planning problem is a flawless token network whose constraint network is consistent.

3.1 A Lifted Heuristic for Timeline-Based Planning
Finding a solution to a timeline-based planning problem is far from simple. Choosing the right flaw and the right resolver, in particular, is crucial for coping with the computational complexity and hence for efficiently generating solutions. Taking a cue from classical planning heuristics, [13] describes how, by building a causal graph and analyzing its topology, it is possible to estimate the costs for the resolution of the flaws and for the application of the resolvers. Flaws and resolvers, in particular, are treated as if they were, respectively, classical planning propositions and actions. The effect of applying a resolver is, intuitively, the resolution of a flaw (the sole positive effect of the corresponding classical action). In the case of the application of a rule or the choice of a disjunct in a disjunction, however, further flaws (the preconditions of the corresponding classical action) can be introduced. Starting from the initial facts, which have a zero estimated resolution cost, the cost of applying a resolver is estimated as an intrinsic cost of the resolver plus the maximum cost of the flaws it introduces (h_max heuristic). The cost of resolving a flaw, on the other hand, is given by the minimum cost of its resolvers. Starting from the top-level goals of the planning problem, initially estimated with infinite cost, a graph is constructed by proceeding backwards, considering all the possible resolvers for all the possible flaws. The estimated costs are updated every time a unification is found or whenever a resolver does not introduce further flaws. The graph-building procedure proceeds until a finite estimated cost for the top-level goals is reached. Compared to other state-of-the-art timeline-based solvers, the above heuristics allow solving problems up to one order of magnitude faster [13]. The most interesting aspect for the current topic, however, concerns the management of the causal constraints in the causal graph.
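The cost estimation described above can be sketched as a fixpoint computation over the bipartite flaw/resolver graph. This is a hedged illustration in Python: the data structures and function names are our own, and the simple iteration stands in for the backward graph-building procedure of [13].

```python
import math

def estimate_costs(resolvers, intrinsic, precs, res):
    """
    resolvers: iterable of resolver ids
    intrinsic: resolver -> intrinsic cost
    precs:     resolver -> flaws it introduces (its 'preconditions')
    res:       flaw -> resolvers that can solve it
    Returns the (flaw_cost, resolver_cost) fixpoint: a resolver costs its
    intrinsic cost plus its most expensive precondition flaw (h_max), and
    a flaw costs its cheapest resolver.
    """
    flaw_cost = {f: math.inf for f in res}
    resolver_cost = {r: math.inf for r in resolvers}
    changed = True
    while changed:                       # costs only decrease, so this terminates
        changed = False
        for r in resolvers:
            c = intrinsic[r] + max((flaw_cost[f] for f in precs[r]), default=0)
            if c < resolver_cost[r]:
                resolver_cost[r] = c
                changed = True
        for f in res:
            c = min((resolver_cost[r] for r in res[f]), default=math.inf)
            if c < flaw_cost[f]:
                flaw_cost[f] = c
                changed = True
    return flaw_cost, resolver_cost
```

With a flaw whose only finite-cost path goes through a rule with an empty body, the computation reproduces the kind of estimates discussed in the example of Sect. 3.2 (a finite cost for one disjunct, ∞ for the other).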
Similarly to planning models based on satisfiability [27], a set of propositional variables is assigned to flaws and resolvers. For the sake of brevity, we use subscripts to indicate flaws (e.g., ϕ0, ϕ1, etc.) and resolvers (e.g., ρ0, ρ1, etc.) as well as their associated propositional variables. Additionally, given a flaw ϕ, we denote by res(ϕ) the set of its possible resolvers and by cause(ϕ) the set of resolvers (possibly empty, in the case of the flaws of the problem's requirement) which are responsible
for introducing it. Moreover, given a resolver ρ, we denote by precs(ρ) the set of its preconditions (e.g., the set of tokens introduced by the application of a rule) and by eff(ρ) the flaw solved through its application. The introduction of such variables makes it possible to constrain them so as to guarantee the satisfaction of the causal relations. Specifically, for each flaw ϕi, we guarantee that the preconditions of all the applied resolvers are satisfied, i.e., ϕi = ⋁_{ρk ∈ cause(ϕi)} ρk (1), and that at least one resolver is active whenever the flaw becomes active, i.e., ϕi ⇒ ⋁_{ρl ∈ res(ϕi)} ρl (2). Additionally, we need a mechanism to link the presence of the tokens with the causality constraints. To this end, a further variable σ ∈ {inactive, active, unified} is associated with each token. A partial solution hence consists solely of those tokens of the token network which are active. Moreover, in case such tokens are goals, the bodies of the associated rules must also be present within the solution. Later on, we refer to tokens by means of their σ variables (using subscripts to denote specific tokens, e.g., σ0, σ1, etc.) and to the flaws introduced by tokens by means of the ϕ(σ) function. The last aspect to consider concerns the update of such variables as a consequence of the activation of a rule application resolver or of a unification resolver. Specifically, each rule application resolver ρa binds the σa variable of the goal token whose rule has been applied to assume the active value (formally, ρa = [ϕ(σa) = active]). Finally, for each unification resolver ρu representing the unification of a token σu with a target token σt, the constraints ρu = [σu = unified] and ρu ⇒ [σt = active] guarantee the update of the σ variables, while adding ϕ(σt) to the preconditions of ρu guarantees the operation of the heuristic.

3.2 An Explanatory Example
In order to better understand how the heuristics and the causality constraints work, we introduce in this section a very simple example of a planning problem, whose objective is to plan a physical rehabilitation session for a hypothetical user. Figure 2 shows the causal graph generated for the problem, whose requirement is constituted by the sole goal σ0. Estimated costs for flaws (boxes) and resolvers (circles) are shown at their upper right. The propositional variables that participate in the causal constraints are at their upper left. Solid (true) and dashed (unassigned) contour lines are used to distinguish the values of the propositional variables associated with flaws and resolvers. In the figure, in particular, the ϕ0 variable, representing a flaw which is present in the problem requirement and must therefore necessarily be solved, assumes the true value.

Fig. 2. An example of causal graph for the planning of a physical rehabilitation session. Tokens' parameters are omitted to avoid burdening the notation.
It is worth noting that, in the example, the ϕ0 flaw, for achieving the σ0 goal, can only be solved through the ρ0 resolver, which is hence directly applied (notice the solid line) as a consequence of the propagation of the causal constraints. Since res(ϕ0) = {ρ0}, indeed, expression (2) translates into ϕ0 ⇒ ρ0. This, in turn, forces the σ0 goal to assume the active value as a consequence of ρ0 = [ϕ(σ0) = active]. The ρ0 resolver, furthermore, represents the application of a rule having PhysicalExercise() in the head and, in the body, a conjunction of the two goals σ1 and σ2. The application of this resolver introduces the flaws ϕ1 = ϕ(σ1) and ϕ2 = ϕ(σ2), each of which must necessarily be resolved as a consequence of the causal constraints ϕ1 = ρ0 and ϕ2 = ρ0, from expression (1). These flaws, in turn, can be solved through the application of the ρ1 and ρ2 resolvers, which introduce, respectively, the disjunctions represented by the ϕ3 and ϕ4 flaws. Proceeding backwards, the propagation of the causal constraints no longer allows inferring what is present in the current partial plan (notice the dashed lines). The resolution of the ϕ3 and ϕ4 flaws, in particular, constitutes two choices that the planner must make during the resolution process. The ϕ3 flaw, for example, can be solved either by applying the Disj0 disjunct, represented by the ρ3 resolver, or by applying the Disj1 disjunct, represented by the ρ4 resolver. The graph construction process, however, which proceeds in a breadth-first manner, has identified, in the example, a possible solution for the ϕ3 flaw by applying first the ρ3 resolver and then the ρ7 resolver (the latter corresponding, in this simple example, to a rule with an empty body). The heuristics' estimated cost propagation procedure hence makes the ρ3 resolver, with an estimated cost of 2, much more attractive than the ρ4 resolver, with an estimated cost of ∞.
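The propagation just walked through can be sketched as a tiny forced-choice closure: a flaw with a single resolver forces that resolver (the ϕ ⇒ ρ case above), and an applied resolver activates its precondition flaws. This is an illustrative stand-in for the solver's constraint propagation, with our own names and a simplification of the full Boolean encoding.

```python
def propagate(true_flaws, res, precs):
    """
    true_flaws: flaws known to hold (e.g., those of the problem requirement)
    res:        flaw -> possible resolvers
    precs:      resolver -> flaws it introduces
    Returns the (flaws, resolvers) forced by the causal constraints alone.
    """
    flaws, resolvers = set(true_flaws), set()
    frontier = list(true_flaws)
    while frontier:
        phi = frontier.pop()
        if len(res.get(phi, [])) == 1:          # forced choice: phi => rho
            (rho,) = res[phi]
            if rho not in resolvers:
                resolvers.add(rho)
                for sub in precs.get(rho, []):  # rho activates its preconditions
                    if sub not in flaws:
                        flaws.add(sub)
                        frontier.append(sub)
    return flaws, resolvers
```

On the example of Fig. 2, starting from ϕ0 the closure forces ρ0, ρ1 and ρ2 and activates ϕ1 through ϕ4, while the disjunctive flaws ϕ3 and ϕ4 (two candidate resolvers each) remain open, mirroring the solid and dashed contours in the figure.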
For a similar reason, the ρ5 resolver will be preferred over the ρ6 resolver, leading to a (possible) solution of the planning problem. It is worth noting that, for the sake of simplicity, the tokens' parameters are not represented in the example figure. All tokens, however, are endowed with numerical variables that represent the start and the end of the associated activities, appropriately constrained according to common sense. Upper and lower body exercises, for example, represented respectively by the σ1 and σ2 tokens, will take place as part of the more general physical exercise represented by the σ0 token. The σ3 and σ5 tokens, additionally, are endowed with their τ variables, which prevent them from temporally overlapping if they assume the same value.
4 An Architecture for Deliberative and Reactive Reasoning
In order to integrate deliberative and reactive capabilities, we have adopted the architecture that, from a high-level perspective, is depicted in Fig. 3. Taking inspiration from classical robotics architectures [21], our system consists of a deliberative tier, responsible for the generation, execution, and dynamic adaptation of plans; a sequencing tier which, through the application of a policy (out of the scope of this paper), executes a sequence of
actions according to the current state of the system; and a sensing and a controlling tier, which respectively interpret data produced by sensors and translate the sequencer's actions into lower-level commands for the actuators. Particularly interesting from an execution perspective, the state according to which actions are selected by the sequencer tier's policy is described by the combination of three distinct states:

– the ss state, generated by the sensing tier and characterized as a consequence of the interpretation of sensory data, which can represent, for example, the intentions of the users, the estimation of their current pose, the users' emotions perceived from the camera, as well as situations which might be dangerous for both the users and the robot;
– the sc state, generated by the control tier, representing the state of the controllers, such as whether the robot is navigating or not, or whether it is talking to or listening to the user;
– the sd state, generated by the deliberative tier, representing the high-level commands generated as a result of the execution of the generated plans.

Similarly, the actions executed by the sequencer tier can be of three distinct types:

– the as actions, towards the sensors, responsible, for example, for their activation or shutdown;
– the ac actions, towards the controllers, responsible, for example, for activating contextual navigation commands as well as conversational interactions with the users;
– the ad actions, towards the deliberative tier, responsible, for example, for the creation and the adaptation of the generated plans.
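A minimal sketch of such a sequencer policy π(s) follows. The state is the combination (ss, sc, sd) and the selected action targets the sensors (as), the controllers (ac), or the deliberative tier (ad). All keys and action names are illustrative assumptions, not the paper's actual policy, which is out of its scope.

```python
def policy(ss, sc, sd):
    """Select one action from the combined state (ss, sc, sd).
    Returns a (tier, action) pair: tier is 'as', 'ac' or 'ad'."""
    if ss.get('danger'):                      # a sensed hazard overrides everything
        return ('ac', 'stop_navigation')
    if sd.get('command') and not sc.get('busy'):
        return ('ac', sd['command'])          # follow the planner's "suggestion"
    if ss.get('user_request'):                # introspective action: new requirement
        return ('ad', ('add_requirement', ss['user_request']))
    return ('as', 'keep_sensing')             # default: keep observing
```

Note how the deliberative command in sd is only one input among several: the policy may override it, matching the view of planned actions as suggestions rather than mandates.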
Fig. 3. The three-layer architecture.
It is worth noting that, through the application of the π(s) policy, the sequencing tier can act both indirectly on the environment, through the ac actions, and, through the ad actions, introspectively on the other, higher-level forms of reasoning adopted by the agent itself. The high-level actions generated by the deliberative tier while executing the plans, moreover, constitute only one component among those that determine the choice of the actions by the policy: they are not mandatory for the autonomy of the robot and represent a sort of "suggestion", for the agent, on the things to do.
5 Plan Execution and Possible Adaptations
Once the graph has been built, the heuristics introduced in the previous section guide the resolution process by providing an indication of the flaws to be solved
(those with the most expensive estimate) through the application of their best resolvers (those with the least expensive estimate)¹. After a flawless partial plan has been found, it is time to execute it. An internal current-time variable, in particular, is incremented at each execution step (e.g., every second) and, whenever its value is greater than or equal to the beginning (end) of an unexecuted (executing) task, the task is started (ended). The generated plan, however, must deal with the evolving reality of all the (often unpredictable) events that can happen in the real world. More than simply dispatching the planned activities, indeed, the main challenge of executing a plan consists in modifying it to make it fit the current reality. The causal graph, in particular, can also be useful during the execution of the plan, whenever it becomes necessary to update it. Coherently with what is described in [25], the possible adaptations that a plan can undergo during its execution, represented by the ad actions of Fig. 3, can be of four types: temporal delays, in case tasks are not ready to start/end; variable freezes, to freeze, for example, the start variable of a task and prevent the violation of a duration constraint in case of delays on its termination; task failures, in case inconsistent constraints related to the task are introduced or unexpected events decree its failure; and requirement additions, in case the underlying reactive module requires the introduction of a new requirement (e.g., a new goal). We have no theoretical results in this regard, but it is worth noting that it is possible to build a new plan from scratch by incrementally introducing the new requirements. Additionally, adding delays and failing tasks can bring the solver back to the root level.
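The dispatching mechanism described above can be sketched as follows. This is a hedged illustration, assuming tokens are simple (name, start, end) triples with fixed bounds; the names are ours, and the real executor of course reacts to feedback rather than blindly advancing time.

```python
def execute(plan, horizon):
    """Advance an internal current-time counter one step at a time
    (e.g., one second per step): tokens whose start has been reached are
    started, executing tokens whose end has been reached are ended.
    plan: list of (name, start, end) triples. Returns the dispatch log."""
    current_time, executing, log = 0, set(), []
    while current_time <= horizon:
        for name, start, end in plan:          # start newly ready tokens
            if name not in executing and start <= current_time < end:
                executing.add(name)
                log.append((current_time, 'start', name))
        for name, start, end in plan:          # end tokens whose end has passed
            if name in executing and end <= current_time:
                executing.discard(name)
                log.append((current_time, 'end', name))
        current_time += 1
    return log
```

In the real system, each of the four adaptation types (delays, freezes, failures, requirement additions) would interrupt this loop and hand the token network back to the solver.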
These considerations, coherently with the theoretical results on classical planning, suggest that the cost of the adaptations, except for the freezes, is asymptotically equal to the cost of replanning. Most of the time, however, an adapted plan differs little from the original plan. Furthermore, the information gathered during the initial search can be exploited to generate the adapted plan. For this reason, we aim to show empirically that adaptation, especially in those contexts where reactivity is required, can be advantageous. The pursued approach consists in introducing a new propositional variable, called ξ, that is used, before starting the execution, to force the propagation of the execution constraints (i.e., delays and freezes). Additional propositional variables, called σiξ and associated with each token σi, are used as the "reason" for the propagation of the execution constraints. Finally, these variables must be causally linked to the planned activities, so as to allow, as a consequence of the introduction of an inconsistent constraint, the removal from the plan of the corresponding activity. We obtain this result by introducing, for each token σi, the clause (¬ξ ∨ ¬[σi = active] ∨ σiξ). Once the planning problem has been solved, the assignment of the true value to the ξ variable will cause, through propagation, the assignment of the true value to the σiξ variables corresponding to those
¹ There is, intuitively, no guarantee that the built graph contains a solution. Similarly to what happens in Graphplan [5], indeed, the addition of a "layer" to the graph might be required.
tokens σi which are in the current plan. Since the σiξ variables are assigned at the last search level, they can be safely used as the "reason" for the propagation of the execution constraints. The introduction of an inconsistent constraint first leads to the analysis of the introduced conflict, allowing a more targeted backtracking (a.k.a. backjumping [14]) to be carried out. Whenever possible, the tokens incompatible with the execution are eliminated and, subsequently, the resolution process is invoked to guarantee the satisfaction of the causal constraints. Finally, in case some delaying constraints cannot be added or, in the absence of alternative plans, some tokens cannot be removed, the false value is assigned to the ξ variable, decreeing that the plan is not executable and, consequently, the need to replan.
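The effect of the clause (¬ξ ∨ ¬[σi = active] ∨ σiξ) can be mimicked with plain Python in place of a real constraint engine. This is only a behavioral sketch under our own assumptions: asserting ξ forces the execution flag σiξ of every active token, and when the execution constraints of some active token cannot hold, ξ itself is withdrawn, signaling the need to replan.

```python
def apply_execution(active_tokens, constraint_ok):
    """
    active_tokens: tokens sigma_i currently active in the plan
    constraint_ok: token -> bool, whether its execution constraints
                   (delays, freezes) can be consistently propagated
    Returns {token: True} for the forced sigma_i^xi flags, or None when
    xi must be set to False (the plan is not executable: replan).
    """
    exec_flags = {}
    for t in active_tokens:
        if not constraint_ok(t):
            return None                        # xi = False: plan not executable
        exec_flags[t] = True                   # sigma_i^xi forced by the clause
    return exec_flags
```

In the actual solver this propagation happens inside the constraint network, so an inconsistency additionally triggers conflict analysis and backjumping before ξ is given up.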
6 Experimental Results
Fig. 4. Adaptation vs replanning from scratch in case of failures and new requirements.
We have conducted some experiments to demonstrate the effectiveness of the proposed approach. Given the project needs, we focused on planning problems similar to those described in Sect. 3.2, in which the user has to carry out some physical and cognitive rehabilitation exercises to keep active and preserve his/her wellbeing. In particular, series of physical exercises chosen from 14 different types (e.g., chest press, biceps curl, etc.) are planned in order to guarantee the training of all parts of the body. The exercises are repeated several times and with different characteristics depending on the profile of the user. The interesting aspect of the current experimentation is that the user may refuse to perform these exercises, or he/she may have problems in doing them. For this reason, the planner must be ready to provide alternatives that still achieve the desired rehabilitative goals (i.e., training all the parts of the body). Whenever the user is particularly confident in carrying out the exercises, on the contrary, the system can add new tasks through the addition of new goals to the planning problem, hence requesting the adaptation of the plan for
taking into account the new requirements dynamically introduced. We have left out the temporal adaptations as they are less interesting and already managed by several existing frameworks. Figure 4a shows a comparison between adaptation and replanning on 5 different generated plans. To demonstrate the effectiveness of the proposed approach, we artificially made the first, the second, and the third activity fail during execution. This allows us to compare the adaptation times with the times required by the solver to generate a new plan from scratch without the failed task. It is worth noting how, often, subsequent adaptations take less and less time. This is because the information collected during the previous searches (the topology of the causal graph and the no-goods learned during the search) is exploited to make the adaptation more effective. Figure 4b, on the contrary, compares the adaptation times with the replanning times in the case of adding one and two new goals. Also in this case, the information collected during the previous searches is exploited, reducing the computation time necessary for the addition of new goals. If new plans had to be generated from scratch, on the contrary, they would have an increasing number of goals and would therefore require more and more computation time. Although the reasoning times are relatively small for this type of planning problem, we are talking about robots that interact with people. Reducing the computation times allows such robots to behave more fluidly in a dynamic environment, in which activities fail easily and new goals can emerge during the execution of the planned tasks. Whether it is a failure or the addition of a new requirement, the sum of the reasoning times of the adaptations is significantly less than the sum of the reasoning times of the replannings.
Furthermore, as the number of adaptations increases, there is an ever greater divergence between the adaptive and replanning behaviors, showing that the more dynamic the environment, the more advantageous the adaptive approach is.
7 Conclusions
The word agent comes from the Latin verb agere, which means, in English, "to do". Much of the literature on automated planning, however, focuses on those forms of reasoning that lead to the definition of a plan, rather than on its actual application to the real world, hence neglecting much of that agere. Acting in the real world requires an agent to be able to adapt to its perceptions, which might not necessarily be consistent with the expected plans, either because the agent's knowledge is partial, or because of the impossibility of predicting the behavior of other agents acting in the same environment. Much of an agent's behavior in the real world is therefore related to reacting to its dynamic evolution, taking advantage, from time to time, of higher-level information coming from more deliberative forms of reasoning, which in turn require high adaptability skills. In this paper we have presented some techniques that make it possible to realize these adaptation skills. An underlying reactive tier, in particular, continuously reacts to the environment's dynamic changes. When perceiving particular situations,
this module triggers adjustments to the planned tasks, which can range from introducing delays to the removal of some tasks, up to the generation of (parts of) new plans. Adapting a plan, in general, can be as complex as generating a new plan from scratch. Since an adapted plan is typically similar to the original plan, however, it is possible to exploit part of the information learned in the initial search to make the adaptation more efficient. Empirical results show that some of the data structures introduced to make the reasoning process more efficient can be exploited also to improve the dynamic adaptation of the plan to the emerging reality.
References

1. Apt, K.R., Wallace, M.G.: Constraint Logic Programming Using ECLiPSe. Cambridge University Press, New York (2007)
2. Bensalem, S., Havelund, K., Orlandini, A.: Verification and validation meet planning and scheduling. Int. J. Softw. Tools Technol. Transfer 16(1), 1–12 (2014)
3. Bertoli, P., Cimatti, A., Roveri, M., Traverso, P.: Strong planning under partial observability. Artif. Intell. 170(4), 337–384 (2006). https://doi.org/10.1016/j.artint.2006.01.004
4. Bit-Monnot, A., Ghallab, M., Ingrand, F., Smith, D.E.: FAPE: a constraint-based planner for generative and hierarchical temporal planning. arXiv preprint arXiv:2010.13121 (2020)
5. Blum, A.L., Furst, M.L.: Fast planning through planning graph analysis. Artif. Intell. 90(1–2), 281–300 (1997)
6. Cashmore, M., et al.: ROSPlan: planning in the robot operating system. In: Proceedings of the Twenty-Fifth International Conference on Automated Planning and Scheduling, ICAPS 2015, pp. 333–341. AAAI Press (2015)
7. Castillo, L., Fdez-Olivares, J., García-Pérez, O., Palao, F.: Efficiently handling temporal knowledge in an HTN planner. In: Proceedings of the Sixteenth International Conference on Automated Planning and Scheduling, ICAPS 2006, pp. 63–72. AAAI Press (2006)
8. Cesta, A., Cortellessa, G., Fratini, S., Oddi, A.: Developing an end-to-end planning application from a timeline representation framework. In: IAAI 2009, Proceedings of the 21st Innovative Applications of Artificial Intelligence Conference, Pasadena, CA, USA, pp. 66–71 (2009)
9. Cesta, A., Oddi, A.: Gaining efficiency and flexibility in the simple temporal problem. In: Chittaro, L., Goodwin, S., Hamilton, H., Montanari, A. (eds.) Proceedings of the Third International Workshop on Temporal Representation and Reasoning (TIME 1996), pp. 45–50. IEEE Computer Society Press, Los Alamitos (1996)
10.
Cesta, A., Oddi, A., Smith, S.F.: A constraint-based method for project scheduling with time windows. J. Heuristics 8(1), 109–136 (2002). https://doi.org/10.1023/A:1013617802515
11. Chien, S., Tran, D., Rabideau, G., Schaffer, S., Mandl, D., Frye, S.: Timeline-based space operations scheduling with external constraints. In: ICAPS 2010, Proceedings of the 20th International Conference on Automated Planning and Scheduling, pp. 34–41 (2010)
12. Cialdea Mayer, M., Orlandini, A., Umbrico, A.: Planning and execution with flexible timelines: a formal account. Acta Informatica 53(6), 649–680 (2016). https://doi.org/10.1007/s00236-015-0252-z
13. De Benedictis, R., Cesta, A.: Lifted heuristics for timeline-based planning. In: ECAI 2020, 24th European Conference on Artificial Intelligence, pp. 498–2337. Santiago de Compostela, Spain (2020)
14. Dechter, R.: Constraint Processing. Elsevier Morgan Kaufmann, Cambridge (2003)
15. Dvořák, F., Bit-Monnot, A., Ingrand, F., Ghallab, M.: Plan-space hierarchical planning with the action notation modeling language. In: IEEE International Conference on Tools with Artificial Intelligence (ICTAI), Limassol, Cyprus (2014). https://hal.archives-ouvertes.fr/hal-01138105
16. Fox, M., Gerevini, A., Long, D., Serina, I.: Plan stability: replanning versus plan repair. In: Long, D., Smith, S.F., Borrajo, D., McCluskey, L. (eds.) Proceedings of the Sixteenth International Conference on Automated Planning and Scheduling, ICAPS 2006, Cumbria, UK, 6–10 June 2006, pp. 212–221. AAAI (2006). http://www.aaai.org/Library/ICAPS/2006/icaps06-022.php
17. Fox, M., Long, D.: PDDL2.1: an extension to PDDL for expressing temporal planning domains. J. Artif. Intell. Res. 20, 61–124 (2003)
18. Frank, J., Jónsson, A.K.: Constraint-based attribute and interval planning. Constraints 8(4), 339–364 (2003)
19. Fratini, S., Pecora, F., Cesta, A.: Unifying planning and scheduling as timelines in a component-based perspective. Arch. Control Sci. 18(2), 231–271 (2008)
20. Fratini, S., Cesta, A., De Benedictis, R., Orlandini, A., Rasconi, R.: APSI-based deliberation in goal oriented autonomous controllers. In: ASTRA 2011 (2011)
21. Gat, E.: On three-layer architectures. In: Artificial Intelligence and Mobile Robots, pp. 195–210. AAAI Press (1997)
22. Gerevini, A., Serina, I.: Fast plan adaptation through planning graphs: local and systematic search techniques. In: Chien, S.A., Kambhampati, S., Knoblock, C.A. (eds.)
Proceedings of the Fifth International Conference on Artificial Intelligence Planning Systems, Breckenridge, CO, USA, 14–17 April 2000, pp. 112–121. AAAI (2000). http://www.aaai.org/Library/AIPS/2000/aips00-012.php
23. Ghallab, M., Laruelle, H.: Representation and control in IxTeT, a temporal planner. In: AIPS 1994, Proceedings of the 2nd International Conference on AI Planning and Scheduling, pp. 61–67 (1994)
24. Ghallab, M., Nau, D., Traverso, P.: Automated Planning: Theory and Practice. Morgan Kaufmann Publishers Inc., Burlington (2004)
25. Ingrand, F., Ghallab, M.: Robotics and artificial intelligence: a perspective on deliberation functions. AI Commun. 27(1), 63–80 (2014). https://doi.org/10.3233/AIC-130578
26. Jonsson, A., Morris, P., Muscettola, N., Rajan, K., Smith, B.: Planning in interplanetary space: theory and practice. In: AIPS 2000, Proceedings of the Fifth International Conference on AI Planning and Scheduling, pp. 177–186 (2000)
27. Kautz, H., Selman, B.: Planning as satisfiability. In: ECAI, vol. 92, pp. 359–363 (1992)
28. van der Krogt, R., de Weerdt, M.: Plan repair as an extension of planning. In: Biundo, S., Myers, K.L., Rajan, K. (eds.) Proceedings of the Fifteenth International Conference on Automated Planning and Scheduling (ICAPS 2005), 5–10 June 2005, Monterey, California, USA, pp. 161–170. AAAI (2005). http://www.aaai.org/Library/ICAPS/2005/icaps05-017.php
29. Laborie, P.: Algorithms for propagating resource constraints in AI planning and scheduling: existing approaches and new results. Artif. Intell. 143, 151–188 (2003)
30. Laborie, P., Ghallab, M.: Planning with sharable resource constraints. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI 1995, pp. 1643–1649. Morgan Kaufmann Publishers Inc. (1995)
31. McGann, C., Py, F., Rajan, K., Thomas, H., Henthorn, R., McEwen, R.: A deliberative architecture for AUV control. In: 2008 IEEE International Conference on Robotics and Automation, pp. 1049–1054. IEEE (2008)
32. Morris, P., Muscettola, N., Vidal, T.: Dynamic control of plans with temporal uncertainty. In: Proceedings of the 17th International Joint Conference on Artificial Intelligence - Volume 1, IJCAI 2001, pp. 494–499. Morgan Kaufmann Publishers Inc., San Francisco (2001)
33. Morris, P.H., Muscettola, N.: Temporal dynamic controllability revisited. In: Veloso, M.M., Kambhampati, S. (eds.) Proceedings, The Twentieth National Conference on Artificial Intelligence and the Seventeenth Innovative Applications of Artificial Intelligence Conference, 9–13 July 2005, Pittsburgh, Pennsylvania, USA, pp. 1193–1198. AAAI Press/The MIT Press (2005). http://www.aaai.org/Library/AAAI/2005/aaai05-189.php
34. Muscettola, N.: HSTS: integrating planning and scheduling. In: Zweben, M., Fox, M.S. (eds.) Intelligent Scheduling. Morgan Kaufmann (1994)
35. Nau, D.S., et al.: SHOP2: an HTN planning system. J. Artif. Intell. Res. 20, 379–404 (2003)
36. Nau, D.S., Ghallab, M., Traverso, P.: Blended planning and acting: preliminary approach, research challenges. In: Bonet, B., Koenig, S. (eds.) Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 25–30 January 2015, Austin, Texas, USA, pp. 4047–4051. AAAI Press (2015)
37. Nebel, B., Koehler, J.: Plan reuse versus plan generation: a theoretical and empirical analysis. Artif. Intell. 76(1–2), 427–454 (1995)
38. Niemueller, T., Hofmann, T., Lakemeyer, G.: Goal reasoning in the CLIPS executive for integrated planning and execution.
In: Proceedings of the International Conference on Automated Planning and Scheduling, vol. 29, no. 1, pp. 754–763 (2021) 39. Peot, M.A., Smith, D.E.: Conditional nonlinear planning. In: Proceedings of the First International Conference on Artiﬁcial Intelligence Planning Systems, pp. 189– 197. Morgan Kaufmann Publishers Inc., San Francisco (1992) 40. Saetti, A., Scala, E.: Optimising the stability in plan repair via compilation. In: Kumar, A., Thi´ebaux, S., Varakantham, P., Yeoh, W. (eds.) Proceedings of the ThirtySecond International Conference on Automated Planning and Scheduling, ICAPS 2022, Singapore (virtual), 13–24 June 2022, pp. 316–320. AAAI Press (2022) 41. Scala, E., Torasso, P.: Deordering and numeric macro actions for plan repair. In: Yang, Q., Wooldridge, M.J. (eds.) Proceedings of the TwentyFourth International Joint Conference on Artiﬁcial Intelligence, IJCAI 2015, Buenos Aires, Argentina, 25–31 July 2015, pp. 1673–1681. AAAI Press (2015) 42. Smith, D.E., Frank, J., Cushing, W.: The ANML language. In: ICAPS Workshop on Knowledge Engineering for Planning and Scheduling (KEPS) (2008) 43. Smith, D.E., Frank, J., J´ onsson, A.K.: Bridging the gap between planning and scheduling. Knowl. Eng. Rev. 15(1), 47–83 (2000) 44. Stock, S., Mansouri, M., Pecora, F., Hertzberg, J.: Hierarchical hybrid planning in a mobile service robot. In: KI 2015 Proceedings, pp. 309–315 (2015) 45. Umbrico, A., Cesta, A., Cialdea Mayer, M., Orlandini, A.: Platinum: a new framework for planning and acting. In: AI*IA 2017 Proceedings, pp. 498–512 (2017)
240
R. De Benedictis et al.
46. Verfaillie, G., Pralet, C., Lemaˆıtre, M.: How to model planning and scheduling problems using constraint networks on timelines. Knowl. Eng. Rev. 25(3), 319– 336 (2010) 47. Weld, D.S.: An introduction to least commitment planning. AI Mag. 15(4), 27–61 (1994) 48. Weld, D.S., Anderson, C.R., Smith, D.E.: Extending graphplan to handle uncertainty and sensing actions. In: Proceedings of the Fifteenth National/Tenth Conference on Artiﬁcial Intelligence/Innovative Applications of Artiﬁcial Intelligence, AAAI 1998/IAAI 1998, pp. 897–904. American Association for Artiﬁcial Intelligence, USA (1998) 49. Wilkins, D.E.: Practical Planning: Extending the Classical AI Planning Paradigm. Morgan Kaufmann Publishers, San Mateo (1988) 50. Zavatteri, M., Vigan` o, L.: Conditional simple temporal networks with uncertainty and decisions. Theor. Comput. Sci. 797, 77–101 (2019). https://doi. org/10.1016/j.tcs.2018.09.023. https://www.sciencedirect.com/science/article/pii/ S0304397518305942. Temporal Representation and Reasoning (TIME 2017)
Knowledge Acquisition and Completion for Long-Term Human-Robot Interactions Using Knowledge Graph Embedding

Ermanno Bartoli2, Francesco Argenziano1(B), Vincenzo Suriani1, and Daniele Nardi1

1 Department of Computer, Control, and Management Engineering, Sapienza University of Rome, Rome, Italy
{argenziano,suriani,nardi}@diag.uniroma1.it
2 Division of RPL (Robotics, Perception and Learning), KTH Royal Institute of Technology, Stockholm, Sweden
[emailprotected]
Abstract. In Human-Robot Interaction (HRI) systems, a challenging task is sharing the representation of the operational environment between users and robots, fusing symbolic knowledge and perceptions. With existing HRI pipelines, users can teach robots some concepts to increase their knowledge base. Unfortunately, the data coming from the users are usually not dense enough to build a consistent representation. Furthermore, existing approaches are not able to incrementally build up their knowledge base, which is very important when robots have to deal with dynamic contexts. To this end, we propose an architecture that gathers data from users and environments in long runs of continual learning. We adopt Knowledge Graph Embedding techniques to generalize the acquired information with the goal of incrementally extending the robot's inner representation of the environment. We evaluate the performance of the overall continual learning architecture by measuring the robot's capability of learning entities and relations coming from unknown contexts through a series of incremental learning sessions.

Keywords: Human-robot interaction · Knowledge graphs · Knowledge graph embeddings · Continual learning · Robot's knowledge base · Knowledge representation
1 Introduction
In recent years, robots have started leaving laboratories to enter our daily environments, where they are asked to operate autonomously, often sharing the working area with humans. To be effective in this goal, representing and storing information in a suitable way is fundamental, regardless of the specific robotic application. This problem acquires particular relevance when designing Human-Robot Interaction (HRI) systems, since there is the intrinsic need to make the human and the robot participants interact with each other. For this interaction to be successful, the robot and the human must not only be able to communicate and understand each other, but should also have a mutual understanding of the world they both operate in. Therefore, a shared semantics of the environment is needed. In many HRI applications, this knowledge (the building block on which the whole system rests) is often embedded in the agent's behaviour and changes from one episode to another, since it is usually very domain-dependent for the specific application of the system (Fig. 1). A way to improve it is through a generalization of the knowledge that is transferred to and acquired by the robot.

In this paper, we propose a novel architecture for acquiring knowledge from sparse data gathered from environments. The acquired knowledge is represented and organized so as to improve the completeness of the robot's previous knowledge base, leading to a more extensive knowledge base that is built up incrementally. The architecture is meant to be robust to any change in context, so that it is suitable for several HRI applications, even ones very different from each other. A major advantage of the proposed approach is that, differently from previous HRI systems, it is not necessary to modify the software architecture when the context of the interaction changes: it is only needed to start a new learning session that shapes the existing learning skills of the robot. The acquisition of the data is human-driven, and the human who collaborates with the robot is not required to know anything about the agent's software, nor how the knowledge is represented; the user just needs to share their knowledge of the world with the robot. This process needs to take some aspects into account.

Fig. 1. Complete architecture of the system: from the interaction with the user to the deployment of learned knowledge and capabilities after long-run training.

E. Bartoli and F. Argenziano—These two authors contributed equally.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
A. Dovier et al. (Eds.): AIxIA 2022, LNAI 13796, pp. 241–253, 2023. https://doi.org/10.1007/978-3-031-27181-6_17
First of all, this kind of interaction is not defined over a short period of time: long runs are necessary to achieve good results. However, long runs are not that common in the HRI field, since interactions between humans and robots happen quite fast, and therefore this problem must be addressed. Moreover, because of these long runs, the robot will face information that needs to be stored and effectively processed, without forgetting acquired knowledge as the run goes on. To solve these problems, the methodology we propose relies on Continual Learning (CL) and Knowledge Graph Embeddings (KGEs): the former is used to deal with the catastrophic forgetting phenomenon during incremental knowledge acquisition sessions, while the latter is used to efficiently use the information, stored in a Knowledge Graph (KG) database, to perform knowledge completion. In the end, the knowledge of the system spans from grounded facts about the environment to more general concepts on which the system can make predictions. This knowledge allows for several reasoning processes, depending on the kind of query the human operator may ask: if the query is very specific (namely, the human asks for a particular object in a particular location), the robot can answer by exploiting its experience, that is, what it has detected in past explorations; for more general queries (namely, general objects or concepts), the robot can answer by making predictions based on what it has learned, i.e., by using an ontological scheme of the environment that it has slowly built over the past days.

Fig. 2. Interaction with the robot before the long-run training and knowledge acquisition. The robot still has difficulties in carrying on a correct interaction.

Fig. 3. Interaction with the robot after the long-run training and knowledge acquisition. The robot has improved its capabilities, can correctly carry on the interaction, and exploits it to learn new relations.
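The two reasoning modes described above (answering specific queries from stored experience, and general queries via prediction) can be sketched minimally as follows; `facts` and `predict_tail` are hypothetical stand-ins for the robot's stored detections and its trained link predictor, not part of the original system:

```python
# Hypothetical sketch of the two reasoning modes: specific queries are
# answered from stored experience, general queries via link prediction.
facts = {("bottle", "objInLoc", "kitchen"), ("book", "objInLoc", "office")}

def predict_tail(head, relation):
    # stand-in for the KGE model trained over past learning sessions
    return "kitchen"

def answer(head, relation, tail=None):
    if tail is not None:                 # specific query: check past detections
        return (head, relation, tail) in facts
    return predict_tail(head, relation)  # general query: predict from the model
```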
2 Related Work
In order to have robots working and acting in human-shaped environments, semantic mapping approaches have been studied, aiming at constructing a common representation of the world between robots and humans [11]. To this end, there has been a growing need to represent the robot's knowledge with appropriate techniques, in order to allow for faster and more precise reasoning about the environment the agents live in. One form of knowledge representation that has been demonstrated to be very effective is triples [12], in which objects of the world are linked together by some sort of relation. This way of memorizing facts enables the usage of a particular kind of data structure, the Knowledge Graph (KG) [4], in which it is possible to represent collections of triples as directed graphs. In those graphs, objects and entities are represented as nodes, and relations between entities are represented as directed edges. This representation allows for better data integration, unification, and reuse of information, since it also makes it easier to represent ontological structures. However, one of the biggest problems of KGs is that they do not scale well with size: the bigger the graph, the harder it is to navigate through it and to make any sort of inference from it. For this reason, instead of working directly with KGs, techniques of Knowledge Graph Embeddings (KGEs) [15] have been developed through the years, in which KGs are transformed into lower-dimensional representations in order to reduce the number of parameters of the system while preserving the information of the graph. Another problem in representing information with KGs is that when knowledge comes from multiple sources, there is often the possibility of incorporating contradictory pieces of information that will eventually compromise the quality of the system (in particular during the training of the embedding). For this reason, it is important to introduce some sort of validation procedure in the process of knowledge acquisition, and this validation can be done by interacting with humans.
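As a minimal illustration of this representation, a set of triples can be viewed as a directed labelled graph; the entity and relation names here are invented examples:

```python
from collections import defaultdict

# A knowledge graph as a collection of (head, relation, tail) triples,
# stored as an adjacency structure: node -> list of (relation, node) edges.
triples = [("bottle", "hasMaterial", "plastic"),
           ("bottle", "objInLoc", "kitchen")]

graph = defaultdict(list)
for h, r, t in triples:
    graph[h].append((r, t))  # entities become nodes, relations directed edges
```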
In recent years, the human participant in the interaction has acquired a bigger and bigger role in the robot's acquisition of knowledge from the world [3,13], because through the filtering process of a human we are able to transfer to the robot only useful information, which can significantly improve further reasoning processes down the interaction pipeline. Although the human can weed out useless information, a human-driven acquisition of knowledge needs much time to become robust and efficient, because the data that the robot acquires through the human can be sparse and not cohesive. For that purpose, the development of systems capable of handling long runs of one single experiment has become more popular [8]. This kind of experiment allows the robot to build up robust and dense knowledge. An interesting way to build up the robot's knowledge is to do it incrementally through human-robot interaction. Such a class of problems has been addressed in applications focused on learning without forgetting [7]. These approaches typically operate in a task-based sequential learning setup. This formulation, whose assumptions are rarely encountered in practical applications, has also been studied in a task-free scenario [1].
3 Methodology
The proposed approach aims at making the robot able to address the multi-relational embedding problem while incrementally building up the robot's knowledge base in a unique long run. This goal can be subdivided into three subtasks, which are addressed at the same time: acquiring data in collaboration with the human, incorporating the acquired data into the infrastructure designed for semantic mapping, and improving the accuracy of the robot's predictions by training the model on the new data (Fig. 4).

Fig. 4. The final task is the composition of three subtasks.
3.1 Acquiring and Extending the Knowledge Base
To properly build a knowledge base (KB) for the purpose of this work, we chose to have a basic predicate represented by a triple (h, r, t), where h is the head entity, t is the tail entity, and r is the relation that connects the head to the tail. A set of such triples is suitable for Continual Learning on Knowledge Graph Embeddings. In fact, a dataset of triples can easily be split into learning sessions, each of them comprising a portion of the data. This can be used to simulate the fact that data are not all available at once: in training session n, only the n-th portion of the dataset is given to the model, and it trains itself only on those data. This procedure is valid, but it assumes that, even if the dataset is not given to the model entirely, it must be known in advance in order to be divided. This is a huge constraint when dealing with real robots and real environments, for two main reasons. The first is that, when the robot is put into an environment, the number and types of objects in the environment are unknown. This means that the number of predicates that the robot collects while operating in the environment, and hence the number of entities and relations in the robot's knowledge base, can vary. The second reason is that the number of tasks can vary as well. In fact, when the robot detects an unknown object, the system has to take care of a new entity but also of a new task. The architecture will assign an embedding to the new entity, and the next training will also include that entity. From a conceptual point of view, the interaction between the robot and the human cooperating to enlarge the knowledge base is shown in Fig. 5, on the left.

Fig. 5. On the left, the process of acquiring meaningful information, composed of 3 phases: retrieving information (A), asking for correctness (B), and updating based on feedback (C). On the right, the workflow for a long-run execution.

In the context of Interactive Task Learning (ITL) [6], the setup of our experiments aims at developing agents and systems that are focused on generality across different tasks. In fact, the main focus is the ability of the system to abstract concepts from different domains on a more general level. Our work, which exploits embedding algorithms on the triples of a KG, adopts these principles. The knowledge acquisition procedure consists of three chronologically consecutive phases. First, the objects detected using the YOLO neural network [14] reach the robot as simple labels, and phase A starts. The robot queries its KB in order to retrieve the semantic meaning of the detected object. The semantic meaning could also be inaccurate: the more often an entity appears in the KB, the more precise its embedding will be and the more accurate the predictions on that entity. If there are not enough data to grant an accurate embedding of the entity, the predictions will be incorrect. The predictions are represented by predicates (h, r, t) where the head entity is the detected object, the relation is chosen randomly among all known relations, and the tail entity is the result of the prediction. After the generation of the predicates, phase B starts. Here the robot asks the human about the correctness of the predicates, with a question for each predicate. Communication is very important and needs to be well-defined, because a misunderstanding could provoke incorrect answers that lead to the addition of wrong data to the KB. Since the data in the KB are not always human-interpretable ("objInLoc" stands for "object is in Location"), the question is generated, according to the relation of the predicate, so that it is human-understandable. As soon as the robot asks the question, it waits for the user's answer, and phase C starts. In this phase, the user can answer positively or negatively.
If the user answers positively, the robot's prediction was correct and the predicate (h, r, t) is a true fact, so it can be added to the KB. If the prediction is judged as false, the robot asks the user for the correct tail of the predicate (h, r, ?), where h and r are the same head entity and relation as before. Once the user answers the robot with the correct tail entity, a new predicate is created and added to the KB. In the end, both a correct prediction and an incorrect prediction lead to the addition of a true predicate to the KB. Moreover, when the robot adds the predicate to its KB, it provides an implicit confirmation to the user. In this way, the user knows which predicate is being added to the knowledge base and, if there is an error, can recover from it.

3.2 Knowledge Graph Embedding
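Before detailing the embedding model, the phase A-C loop of the previous subsection can be summarized in a minimal sketch; `predict_tail`, `ask_user`, and the other helper names are hypothetical stand-ins, not part of the original system:

```python
import random

def acquisition_step(detected_label, kb, known_relations, predict_tail, ask_user):
    # Phase A: build a predicted predicate (h, r, t) for the detected object
    r = random.choice(known_relations)
    t = predict_tail(detected_label, r)
    # Phase B: turn the predicate into a human-understandable question
    if ask_user(f"Is it true that ({detected_label}, {r}, {t})?"):
        kb.add((detected_label, r, t))          # Phase C: prediction confirmed
    else:
        correct_t = ask_user(f"What is the correct tail for ({detected_label}, {r}, ?)?")
        kb.add((detected_label, r, correct_t))  # Phase C: corrected predicate
    # Either way, a true predicate has been added to the KB.
```

Note that both branches end with an insertion into the KB, mirroring the observation above that correct and incorrect predictions alike yield a true predicate.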
In order to predict new predicates, we adopted the Knowledge Graph Embedding (KGE) technique, which uses supervised learning models capable of learning vector representations of nodes and edges. By definition, the objective of the Knowledge Graph Embedding problem is to learn a continuous vector representation of a Knowledge Graph G, encoding the vertices that represent the entities E as a set of vectors v_E ∈ R^{|E|×d_E}, where d_E is the dimension of the entity vectors, and the edges that represent the relations R as mappings between vectors W_R ∈ R^{|R|×d_R}, where d_R is the dimension of the relation vectors. The knowledge graph G is composed of triples (h, r, t), where h, t ∈ E are the head and tail of the relation, while r ∈ R is the relation itself. One example of such a triple is (bottle, hasMaterial, plastic). In the literature, there are numerous ways of embedding the knowledge in a knowledge graph: translational models, rotational models, Gaussian models, and many others. However, independently of the class of methods used, the embedding is learned by minimizing a loss L computed on a scoring function f(h, r, t) over the set of triples in the knowledge graph and over the set of negative triples generated by negative sampling over the same graph. For this research, the embedding model we used is ANALOGY, which represents a relation as a matrix. This model can cope with asymmetric relations and imposes the structure of the matrix to be block-diagonal, to minimize the number of parameters that need to be stored by the system.

ANALOGY. In the field of KGEs, there are numerous ways of representing relations in lower-dimensional spaces. Usually, these techniques are grouped into families of models that share the general principle that makes the embedding of the information possible.
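As a toy illustration of this score-and-negative-sampling scheme (using a TransE-style translational score rather than ANALOGY's relation matrices; all embeddings here are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(42)
d = 8
# toy embedding tables: one random vector per entity and per relation
E = {e: rng.normal(size=d) for e in ["bottle", "plastic", "kitchen"]}
R = {r: rng.normal(size=d) for r in ["hasMaterial", "objInLoc"]}

def score(h, r, t):
    # TransE-style score: for true triples, h + r should be close to t
    return np.linalg.norm(E[h] + R[r] - E[t])

def margin_loss(pos, neg, gamma=1.0):
    # hinge loss between a true triple and a negatively sampled corruption
    return max(0.0, gamma + score(*pos) - score(*neg))
```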
For instance, translational models (like TransE [2]) represent relationships as translations in the embedding space, while Gaussian embedding models also capture the uncertainty of the information contained in a KG. Despite being simpler than other models, these fail to correctly represent more complex kinds of relations (like symmetric relations), so more advanced models are needed. For this reason, we chose ANALOGY as our KGE model. ANALOGY is an improvement of the RESCAL [9] model, a tensor factorization approach able to perform collective training on multi-relational data. In this approach, a triple (h, r, t) is represented as an entry in a three-way tensor X. A tensor entry X_{ijk} = 1 means that the triple composed of the i-th and the k-th entity as, respectively, head and tail, and the j-th relation, is a true fact. Otherwise, unknown or non-existing facts have their entry set to 0. Each slice X_k of the tensor is then factorized as X_k ≈ A R_k A^T, where A is a matrix that contains the latent-component representation of the entities, while R_k is a matrix that models the interactions of the latent components; both are computed by solving the minimization problem

\min_{A, R_k} f(A, R_k) + g(A, R_k)    (1)

where

f(A, R_k) = \frac{1}{2} \sum_k \| X_k - A R_k A^T \|_F^2    (2)

and g is a regularization term

g(A, R_k) = \frac{1}{2} \lambda \Big( \|A\|_F^2 + \sum_k \|R_k\|_F^2 \Big)    (3)
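The RESCAL objective above can be checked numerically; the following sketch (toy sizes, random data, invented names) evaluates Eqs. (2)-(3):

```python
import numpy as np

rng = np.random.default_rng(0)
n_ent, n_rel, d = 5, 3, 4           # toy sizes: entities, relations, latent rank
A = rng.normal(size=(n_ent, d))     # latent entity factors
R = rng.normal(size=(n_rel, d, d))  # per-relation interaction matrices R_k
X = rng.integers(0, 2, size=(n_rel, n_ent, n_ent)).astype(float)

def rescal_objective(X, A, R, lam=0.1):
    # f: reconstruction error over all relation slices, Eq. (2)
    f = 0.5 * sum(np.linalg.norm(X[k] - A @ R[k] @ A.T, "fro") ** 2
                  for k in range(X.shape[0]))
    # g: Frobenius-norm regularization on all factors, Eq. (3)
    g = 0.5 * lam * (np.linalg.norm(A, "fro") ** 2
                     + sum(np.linalg.norm(R[k], "fro") ** 2 for k in range(R.shape[0])))
    return f + g
```

With a perfect reconstruction and no regularization, the objective vanishes, as the definition requires.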
Starting from this approach, ANALOGY makes some important improvements: it constrains each R_k to be a diagonal matrix (like DistMult), and it introduces complex-valued embeddings to cope with asymmetric relations, X = E W Ē^T (as ComplEx does); most importantly, it imposes analogical structures among the representations by means of block-diagonal matrices (reducing the number of parameters needed by the model), modifying the objective function as follows:

\min_{v, W} \mathbb{E}_{(s, r, o, y) \sim D} \, \ell(\phi_{v, W}(s, r, o), y)
s.t. W_r W_r^T = W_r^T W_r, ∀r ∈ R
     W_r W_{r'} = W_{r'} W_r, ∀r, r' ∈ R    (4)

3.3 Long-Run
The process described is robust, because it allows a robot that is put in a completely unknown environment to incrementally build a robust knowledge of it. A completely unknown environment means that no entity or relation is present in the robot's KB at the beginning. Moreover, one of the advantages of this approach is that some knowledge can be transferred to the robot. For example, it is possible to exploit existing knowledge graph databases to give some a priori knowledge to the robot; in this way, the robot will learn to build up its KB much faster. During this process, the robot's KB evolves with the environment, acquiring information and communicating with the human. This approach is meant for designing a single long run, instead of multiple short runs. Figure 5, on the right, shows the block scheme of this approach. The circular block, depicting the robot and the user, wraps all the infrastructure responsible for enlarging the KB and communicating with the human, shown in Fig. 5, on the left. The two blocks, i.e. exploration and training, are mutually exclusive: which of the two is executed depends on whether a condition is verified. Three different conditions have been implemented. The
Table 1. HITS@10 of ANALOGY with standard settings

                                 sess 0  sess 1  sess 2  sess 3  sess 4  sess 5
classical context on ai2thor 5   –       –       –       –       –       0.764
classical context on ai2thor 4   –       –       –       –       0.238   0.705
classical context on ai2thor 3   –       –       –       0.346   0.336   0.676
classical context on ai2thor 2   –       –       0.382   0.371   0.389   0.647
classical context on ai2thor 1   –       0.402   0.385   0.361   0.380   0.558
classical context on ai2thor 0   0.339   0.355   0.343   0.343   0.336   0.500

Table 2. MRR of ANALOGY with standard settings

                                 sess 0  sess 1  sess 2  sess 3  sess 4  sess 5
classical context on ai2thor 5   –       –       –       –       –       0.569
classical context on ai2thor 4   –       –       –       –       0.104   0.385
classical context on ai2thor 3   –       –       –       0.129   0.128   0.338
classical context on ai2thor 2   –       –       0.136   0.127   0.130   0.322
classical context on ai2thor 1   –       0.153   0.146   0.141   0.146   0.270
classical context on ai2thor 0   0.151   0.134   0.134   0.130   0.130   0.198
first (shown in Fig. 5, on the right) deals with the amount of data collected by the robot during the exploration phase: it ensures that the robot collects the same amount of data in each learning session, so the dataset will always be balanced. The second condition deals with the battery level of the robot: the robot is free to explore the environment until the battery goes below a certain threshold, after which it comes back to its docking station and, while recharging, performs a training session. The final condition only involves time: two periods, namely day and night, are defined; in the first one the robot explores, while in the latter it trains.
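The three switching conditions just described could be sketched as follows; the thresholds are invented placeholders, not values taken from the paper:

```python
def should_train(mode, triples_collected=0, battery=1.0, hour=12):
    # Decide whether the robot should switch from exploration to training.
    if mode == "data":      # fixed amount of data per session -> balanced dataset
        return triples_collected >= 100
    if mode == "battery":   # train while docked and recharging
        return battery < 0.2
    if mode == "time":      # explore by day, train by night
        return hour >= 22 or hour < 6
    raise ValueError(f"unknown condition: {mode}")
```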
4 Results
In the evaluation of the presented work, we would like to capture the capability of the robot to exploit its knowledge while learning whatever the human teaches it. The learning procedure is built so as to recognize the entities in a certain environment, learn the relations between these entities, and predict them even when they are not explicitly mentioned by the human. The first thing that we want to prove is that models based on the standard learning process tend to forget what they have learned when new things to learn arrive. To prove this, we have simulated with the TIAGo robot a situation in which it learns from the human some information belonging to a certain context, and is then asked to learn other information from a different context.

Fig. 6. The loss during learning sessions 0, 2, 4, 5 (the last one, in blue, represents the training considering the last subset of the data acquired through the proposed methodology). This shows that the trend is constantly decreasing. (Color figure online)

From a technical point of view, this experiment consists of training the robot over 6 learning sessions, using a dataset structure inspired by AI2-THOR [5]. In the first one, the dataset sess 5 ai2thor has been taken as input; for the subsequent 5 learning sessions, the datasets sess i ai2thor with i ∈ {0, 1, 2, 3, 4} have been used. In particular, the dataset sess 5 ai2thor has been created by the robot through the methodology described. Moreover, the model used for this experiment is ANALOGY, developed in a "classical context", which means that it has not been made suitable for continual learning but is the standard model for KGEs. The results of this experiment are shown in Tables 1 and 2, which report the performance of the model in terms of HITS@10 (Hits at 10) and MRR (Mean Reciprocal Rank). The two metrics are defined as follows:

MRR = \frac{1}{Q} \sum_{i=1}^{Q} \frac{1}{rank_{(s,p,o)_i}}    (5)

Hits@N = \frac{1}{Q} \sum_{i=1}^{Q} \mathbb{1}\big[ rank_{(s,p,o)_i} \le N \big]    (6)
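Computed over a list of test-triple ranks, these two metrics might be implemented as follows (the normalization to a fraction matches the values reported in the tables):

```python
def mrr(ranks):
    # Mean Reciprocal Rank over Q ranked test triples, Eq. (5)
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_n(ranks, n=10):
    # fraction of test triples whose correct answer is ranked within the top n
    return sum(1 for r in ranks if r <= n) / len(ranks)
```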
Each table must be read from top to bottom, since the order is chronological. Each row reports the performance of the model (trained on subset i of the dataset) with respect to the other subsets. The first row of Table 1, for instance, shows the HITS@10 of ANALOGY trained on sess 5 ai2thor; since it has only been trained on that subset, it is evaluated only on sess 5 ai2thor. The row "classical context on ai2thor 2" shows the HITS@10 of ANALOGY trained on sess 5 ai2thor, sess 4 ai2thor, sess 3 ai2thor (previously), and sess 2 ai2thor (currently), which means it can be evaluated on the subsets sess i ai2thor with i ∈ {2, 3, 4, 5}. The model runs into the catastrophic forgetting phenomenon: the more it trains on the subsets sess i ai2thor with i ∈ {4, 3, 2, 1, 0}, which contain the same entities and relations, the less precise its HITS@10 becomes on sess 5 ai2thor, whose data remain unseen in all the subsequent learning sessions.
Fig. 7. The MRR during learning sessions 0, 3, 5 (the last one, in blue, represents the training considering the last subset of the data acquired through the proposed methodology). The graph on the left compares the MRR of the last learning session with the average MRR among all the previous learning sessions. (Color figure online)
Fig. 8. The HITS@10 during learning sessions 0, 3, 5 (the last one, depicted in blue, represents the training considering the last subset of the data acquired through the proposed methodology). The graph on the left compares the HITS@10 of the last learning session with the average HITS@10 among all the previous learning sessions. (Color figure online)
For the next experiment, the ANALOGY model is considered only in the continual context, since it proves effective against the problem of catastrophic forgetting. The same dataset as before has been used, i.e. sess i ai2thor with i ∈ {0, 1, 2, 3, 4, 5}, where the partitions i ∈ {0, 1, 2, 3, 4} are composed of the same types of entities, while the partition i = 5 consists mostly of new entities. Figure 6 shows 5 different graphs, representing the trend of the loss function in each learning session. Some elements of these graphs are very important. First, the trend of the loss is always decreasing, with the steepest decrease in the first forty epochs of each learning session. Since each learning session contains a limited amount of data, after some epochs the trend becomes quite stable and the model is no longer improving. Here "early stopping" comes in, set with a patience of 50, which stops the training for that learning session and starts the next one. Although the entities are almost the same in each learning session, the predicates are different; for this reason, at the beginning of each learning session the loss is quite high, but then it decreases. The overall trend of the loss decreases from one learning session to the next. The loss function is an important metric for checking whether the model is learning, but it is not significant if considered alone; Fig. 7 and Fig. 8 therefore show the graphs of the two metrics considered for the evaluation of the models, MRR and HITS@10. The increasing learning skills are confirmed by the graphs of MRR and HITS. The model, in fact, is not only evaluated on the n-th portion of the dataset given in input for training: all portions of the data are considered in the evaluation. Hence, while good performances were expected when evaluating the current portion of the dataset (see MRR/DevSess 5 in Fig. 7 and HITS/DevSess 5 in Fig. 8), it was not certain that the same would hold for the previous ones. The results show a remarkable ability not to forget what has been learned, visible in MRR/DevSess i and HITS/DevSess i with i ∈ {0, 1, 2, 3, 4}. In these graphs, the performance of the last learning session is marked in blue. Both for MRR and for HITS, the performances of the last learning session (represented in blue) are not worse than the performance of the model at the previous learning session (depicted in red). Finally, when evaluating performances, it might be worth considering whether they are affected by performative effects [10]. These phenomena have always been present in several fields, like statistical decision theory and causal reasoning, but in the last years they have been brought to attention also in the deep learning field. They can occur when predictions influence the outcomes they are trying to predict, causing a distribution shift of the obtained results. It has been observed that these effects are reduced if multiple retraining procedures are performed. In the present work, we propose a retraining procedure at the end of each learning session; this operation reduces such distribution shifts. A video representing a key result of this work can be found at the following link: https://www.youtube.com/watch?v=vQbyn7hs8 4. It shows the process of enlarging the knowledge base of the robot thanks to the interaction with the human: entities that were at first unknown become part of the robot's knowledge.
5 Conclusions and Future Directions
In this work, we show (as in Fig. 2 and 3) the ability of the robot to learn from unknown environments by relying on the answers of the human. Thanks to the proposed architecture, the robot uses Knowledge Graph Embedding techniques to generalize the acquired information, with the goal of incrementally extending its inner representation of the environment. We evaluate the performance of the overall architecture by measuring the capability of the robot to learn entities and relations coming from unknown contexts through a series of incremental learning sessions, demonstrating the ability of the presented architecture to cope with the catastrophic forgetting phenomenon. For example, at the beginning of the experiments, the robot is unable to find any meaningful information about an unknown detected object it has never encountered before. After some learning sessions, it becomes able to retrieve accurate information about it. The learning process of the robot is human-driven, and the human is not required to be an expert. This allows the application of the system in many dynamic scenarios where a robot needs to learn information about its operating environment. Although the data driving the learning are sparse and unbalanced, the designed architecture allows the learning curve to converge quickly.
Knowledge Acquisition and Completion for Long-Term HRI
253
In addition to these improvements, the whole architecture would make the interactions between humans and robots more natural, taking a further step toward systems that can handle long interactions with humans in an environment whose knowledge is built incrementally during the interaction, rather than having to be given to the robot in advance.
References

1. Aljundi, R., Kelchtermans, K., Tuytelaars, T.: Task-free continual learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11254–11263 (2019)
2. Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., Yakhnenko, O.: Translating embeddings for modeling multi-relational data. In: Advances in Neural Information Processing Systems, vol. 26 (2013)
3. Gemignani, G., Capobianco, R., Bastianelli, E., Bloisi, D.D., Iocchi, L., Nardi, D.: Living with robots: interactive environmental knowledge acquisition. Robot. Auton. Syst. 78, 1–16 (2016)
4. Ji, S., Pan, S., Cambria, E., Marttinen, P., Philip, S.Y.: A survey on knowledge graphs: representation, acquisition, and applications. IEEE Trans. Neural Netw. Learn. Syst. 33(2), 494–514 (2021)
5. Kolve, E., et al.: AI2-THOR: an interactive 3D environment for visual AI. arXiv preprint arXiv:1712.05474 (2017)
6. Laird, J.E., et al.: Interactive task learning. IEEE Intell. Syst. 32(4), 6–21 (2017)
7. Li, Z., Hoiem, D.: Learning without forgetting. IEEE Trans. Pattern Anal. Mach. Intell. 40(12), 2935–2947 (2018). https://doi.org/10.1109/TPAMI.2017.2773081
8. Lindblom, J., Andreasson, R.: Current challenges for UX evaluation of human-robot interaction. In: Schlick, C., Trzcieliński, S. (eds.) Advances in Ergonomics of Manufacturing: Managing the Enterprise of the Future, pp. 267–277. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-41697-7_24
9. Nickel, M., Tresp, V., Kriegel, H.P.: A three-way model for collective learning on multi-relational data. In: ICML (2011)
10. Perdomo, J., Zrnic, T., Mendler-Dünner, C., Hardt, M.: Performative prediction. In: International Conference on Machine Learning, pp. 7599–7609. PMLR (2020)
11. Pronobis, A.: Semantic mapping with mobile robots. Ph.D. thesis, KTH Royal Institute of Technology (2011)
12. Pronobis, A., Jensfelt, P.: Large-scale semantic mapping and reasoning with heterogeneous modalities. In: 2012 IEEE International Conference on Robotics and Automation, pp. 3515–3522. IEEE (2012)
13. Randelli, G., Bonanni, T.M., Iocchi, L., Nardi, D.: Knowledge acquisition through human-robot multimodal interaction. Intel. Serv. Robot. 6(1), 19–31 (2013)
14. Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767 (2018)
15. Wang, Q., Mao, Z., Wang, B., Guo, L.: Knowledge graph embedding: a survey of approaches and applications. IEEE Trans. Knowl. Data Eng. 29(12), 2724–2743 (2017)
Construct, Merge, Solve and Adapt Applied to a Bus Driver Scheduling Problem with Complex Break Constraints

Roberto Maria Rosati1(B), Lucas Kletzander2, Christian Blum3, Nysret Musliu2, and Andrea Schaerf1

1 DPIA, University of Udine, via delle Scienze 206, 33100 Udine, Italy
{robertomaria.rosati,andrea.schaerf}@uniud.it
2 Christian Doppler Laboratory for Artificial Intelligence and Optimization for Planning and Scheduling, DBAI, TU Wien, Vienna, Austria
{lucas.kletzander,nysret.musliu}@tuwien.ac.at
3 Artificial Intelligence Research Institute (IIIA-CSIC), Campus of the UAB, Bellaterra, Spain
[emailprotected]
Abstract. Bus Driver Scheduling (BDS) is a combinatorial optimization problem that consists in assigning atomic driving duties (legs), belonging to predetermined routes, to bus drivers. We consider the highly constrained, real-world version of the problem proposed by Kletzander and Musliu (2020), with complex break rules specified by a collective agreement and public regulation. We propose a Construct, Merge, Solve and Adapt (CMSA) algorithm; CMSA is a recent metaheuristic proposed by Blum et al. (2016), based on the idea of problem instance reduction. At each iteration of the algorithm, sub-instances of the original instance are solved by an exact solver. These sub-instances are obtained by merging the components of the solutions generated by a probabilistic greedy algorithm. We compare our method with the state-of-the-art approaches on the benchmark instances. The results show that CMSA compares favourably with other metaheuristics on most instances and with exact techniques on large ones.
Keywords: Bus driver scheduling · Metaheuristics · Optimization · CMSA

1 Introduction
Driver scheduling problems are complex combinatorial problems that integrate the scheduling part with routing issues, due to the fact that drivers and vehicles are moved to different locations by their duties. Different driver scheduling problems have been proposed in the literature, differing mainly in the type of vehicles involved and in the constraints.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
A. Dovier et al. (Eds.): AIxIA 2022, LNAI 13796, pp. 254–267, 2023. https://doi.org/10.1007/978-3-031-27181-6_18
Construct, Merge, Solve and Adapt for Bus Driver Scheduling
255
We consider here a Bus Driver Scheduling (BDS) problem, characterized by the fact that the atomic driving duties (called legs) are short compared to those of other vehicles (e.g., planes or trains). Therefore, the daily shift of a driver is composed of a relatively large number of independent legs, which must be assembled into a working shift respecting various regulations, mainly connected to safety issues. We focus on the specific BDS formulation proposed by [13], which arises from a public transportation setting in Austria and is subject to many constraints related to rest time (breaks) regulated by legal requirements and collective agreements. This formulation comes with a challenging dataset composed of many realistic instances, which has already been used in the experimental analysis of a few exact and metaheuristic techniques [12,13,15]. We propose for this problem a Construct, Merge, Solve and Adapt (CMSA) approach, a metaheuristic technique recently proposed by [3] and applied to a variety of combinatorial problems [9,16,22]. Additionally, we have been able to reuse a greedy algorithm developed in previous work [13], which we suitably randomized in order to employ it for the generation of solutions within the CMSA algorithm. For our CMSA solver, we performed a principled tuning procedure to obtain the best configuration of the parameters, and we compared the tuned solver with the best results from the literature. The outcome is that our solver is able to improve the state-of-the-art results for a range of problem instances, in particular the large ones.
2 Problem Description
The investigated Bus Driver Scheduling problem deals with the assignment of bus drivers to vehicles that already have a predetermined route for one day of operation, according to the rules specified by an Austrian collective agreement. We use the same specification as presented in previous work [13], where the reader can find a more detailed description of the problem.

2.1 Problem Input
The bus routes are given as a set L of individual bus legs; each leg ℓ ∈ L is associated with a tour tour_ℓ (corresponding to a particular vehicle), a start time start_ℓ, an end time end_ℓ, a starting position startPos_ℓ, and an end position endPos_ℓ. The actual driving time for the leg is denoted by drive_ℓ. The benchmark instances use drive_ℓ = length_ℓ = end_ℓ − start_ℓ. Table 1 shows a short example of one particular bus tour. The vehicle starts at time 360 (6:00 am) at position 0, does multiple legs with stops including waiting time at positions 1 and 2, and finally returns to position 0. A valid tour never has overlapping bus legs, and consecutive bus legs satisfy endPos_i = startPos_{i+1}. A tour change occurs when a driver has an assignment of two consecutive bus legs i and j with tour_i ≠ tour_j.
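The leg attributes and the tour-validity conditions above can be sketched as follows (an illustrative structure, not the benchmark file format; the sample data reproduces the tour of Table 1):

```python
from dataclasses import dataclass

@dataclass
class Leg:
    tour: int
    start: int       # minutes from midnight
    end: int
    start_pos: int
    end_pos: int

    @property
    def drive(self):
        # benchmark instances: driving time equals the leg length
        return self.end - self.start

def is_valid_tour(legs):
    """No overlapping legs, and consecutive legs must be spatially
    contiguous: endPos_i == startPos_{i+1}."""
    return all(b.start >= a.end and a.end_pos == b.start_pos
               for a, b in zip(legs, legs[1:]))

tour1 = [Leg(1, 360, 395, 0, 1), Leg(1, 410, 455, 1, 2),
         Leg(1, 460, 502, 2, 1), Leg(1, 508, 540, 1, 0)]
print(is_valid_tour(tour1))   # True
print(tour1[0].drive)         # 35
```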
256
R. M. Rosati et al.

Table 1. A bus tour example

leg  tour  start  end  startPos  endPos
1    1     360    395  0         1
2    1     410    455  1         2
3    1     460    502  2         1
4    1     508    540  1         0
A distance matrix specifies, for each pair of positions p and q, the time d_{p,q} a driver takes to get from p to q when not actively driving a bus. If no transfer is possible, then d_{p,q} = ∞. d_{p,q} with p ≠ q is called the passive ride time. d_{p,p} represents the time it takes to switch tour at the same position, but is not considered passive ride time. Finally, each position p is associated with an amount of working time for starting a shift (startWork_p) and ending a shift (endWork_p) at that position. The instances in this paper use startWork_p = 15 and endWork_p = 10 at the depot (p = 0), to account for the time needed to enter and exit the depot. These values are 0 for all other positions, given that the bus is already on the street.

2.2 Solution
A solution to the problem is an assignment of exactly one driver to each bus leg. Criteria for feasibility are:

– No overlapping bus legs are assigned to any driver.
– Changing tour or position between consecutive assignments i and j requires start_j ≥ end_i + d_{endPos_i, startPos_j}.
– Each shift respects all hard constraints regarding work regulations, as specified in the next section.
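The second feasibility criterion can be checked directly from the distance matrix (a sketch with illustrative names; d[p][q] is the transfer time between positions, with ∞ meaning no transfer is possible):

```python
from collections import namedtuple

Leg = namedtuple("Leg", "tour start end start_pos end_pos")
INF = float("inf")

def compatible(leg_i, leg_j, d):
    """Can one driver serve leg_j right after leg_i?  There must be
    enough time for the passive ride (or in-place tour switch) from the
    end position of leg_i to the start position of leg_j."""
    transfer = d[leg_i.end_pos][leg_j.start_pos]
    return transfer != INF and leg_j.start >= leg_i.end + transfer

d = [[0, 5],      # toy distance matrix: 5 min between the two positions,
     [5, 0]]      # 0 min to switch tour at the same position
a = Leg(1, 360, 395, 0, 1)
b = Leg(2, 398, 430, 0, 1)   # only 3 min after a ends, 5 min ride needed
print(compatible(a, b, d))   # False
```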
2.3 Work and Break Regulations
Valid shifts for drivers are constrained by work regulations and require frequent breaks. First, different measures of time related to a shift s containing the set of bus legs L_s need to be distinguished, as visualized in Fig. 1:

– The total amount of driving time: D_s = Σ_{i∈L_s} drive_i
– The span from the start of work until the end of work, T_s, with a maximum of T_max = 14 h.
– The working time W_s = T_s − unpaid_s, not including certain unpaid breaks.

Fig. 1. Example shift

Fig. 2. Rest break positioning

Driving Time Regulations. The maximum driving time is restricted to D_max = 9 h. The whole distance start_j − end_i between consecutive bus legs i and j qualifies as a driving break, including passive ride time. Breaks from driving need to be taken repeatedly after at most 4 h of driving time. In case a driving break is split into several parts, all parts must occur before a driving block exceeds the 4-h limit. Once the required amount of break time is reached, a new driving block starts. The following options are possible:

– One break of at least 30 min
– Two breaks of at least 20 min each
– Three breaks of at least 15 min each

Working Time Regulations. The working time W_s has a hard maximum of W_max = 10 h and a soft minimum of W_min = 6.5 h. If the employee works for a shorter period, the difference has to be paid anyway: the actual paid working time is W′_s = max{W_s, 390}. A minimum rest break is required according to the following options:

– W_s < 6 h: no rest break
– 6 h ≤ W_s ≤ 9 h: at least 30 min
– W_s > 9 h: at least 45 min

The rest break may be split into one part of at least 30 min and one or more parts of at least 15 min. The first part has to occur after at most 6 h of work. Note that a break can be a rest break and a driving break simultaneously, or just qualify as one of the two types. Whether rest breaks are paid or unpaid depends on break positions, as shown in Fig. 2. Every period of at least 15 min of consecutive rest break is unpaid as long as it does not intersect the first 2 or the last 2 h of the shift (a longer rest break might be partially paid and partially unpaid). The maximum amount of unpaid rest is limited:

– If 30 consecutive minutes of rest break are located such that they do not intersect the first 3 h of the shift or the last 3 h of the shift, at most 1.5 h of unpaid rest are allowed.
– Otherwise, at most one hour of unpaid rest is allowed.

Rest breaks beyond this limit are paid.

Shift Splits. If a rest break exceeds 3 h, it is instead considered a shift split, which is unpaid and does not count towards W_s. However, such splits are typically regarded badly by the drivers. A shift split counts as a driving break, but does not contribute to rest breaks.
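As a compact illustration of the working-time rules above (times in minutes; a sketch of the rule thresholds, not the authors' implementation):

```python
def required_rest_break(W):
    """Minimum total rest break (minutes) for a working time W (minutes)."""
    if W < 6 * 60:
        return 0
    if W <= 9 * 60:
        return 30
    return 45

def paid_working_time(W):
    """Soft minimum of 6.5 h (390 min): shorter shifts are paid anyway."""
    return max(W, 390)

print(required_rest_break(8 * 60))   # 30
print(paid_working_time(300))        # 390
```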
2.4 Objectives
As argued in previous work [13], practical schedules must not consider operating costs only. The objective

    cost_s = 2 · W′_s + T_s + ride_s + 30 · ch_s + 180 · split_s    (1)

represents a linear combination of several criteria for shift s. The paid working time W′_s is the main objective, and it is combined with the total time T_s to reduce long unpaid periods for employees. The next sub-objectives reduce the passive ride time ride_s and the number of tour changes ch_s, which is beneficial both for employees and for efficient schedules. The last objective aims to reduce the number of shift splits split_s, as they are very unpopular.
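Equation (1) translates directly into code (all quantities in minutes; the shift statistics are assumed to be computed beforehand):

```python
def shift_cost(W, T, ride, tour_changes, splits):
    """Objective (1) for one shift: W' = max(W, 390) is the paid
    working time, T the shift span, ride the passive ride time."""
    W_paid = max(W, 390)
    return 2 * W_paid + T + ride + 30 * tour_changes + 180 * splits

# An 8 h working time, 9 h span, 20 min passive ride, one tour change:
print(shift_cost(W=480, T=540, ride=20, tour_changes=1, splits=0))  # 1550
```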
3 Related Work
Different variants of BDS have been studied since the early 1960s [27]. BDS is often modelled as a Set Partitioning Problem, and exact methods have been used in many publications to solve various variants of this problem [8,15,18,23,25]. To solve very large real-world problems in reasonable time, several metaheuristic methods have been studied for BDS, including greedy approaches [20], Tabu Search [12,24], Simulated Annealing [13], GRASP [7], and Genetic Algorithms [17,19]. The problem definition of BDS is highly dependent on a country's labour regulations; therefore, algorithms for other BDS variants cannot be used directly for the Austrian rules, which are more complex than most found in the literature. Previous work mostly focuses on cost only, sometimes including the minimization of idle times and vehicle changes [6,11], but without the additional objectives for shift ergonomics that are considered for the BDS problem in this paper. Our problem variant has been introduced recently in the literature and, to the best of our knowledge, the recently introduced exact approach based on branch and price [15], the metaheuristic approaches simulated annealing [13] and tabu search [12], as well as the application of problem-independent hyper-heuristics in combination with a set of problem-dependent low-level heuristics [14], represent the current state of the art for this problem. Although these approaches give very good results, the optimal solutions are still not known for most instances. Therefore, the investigation of new approaches is important for this problem.
4 The CMSA Approach to the BDS Problem
Construct, Merge, Solve and Adapt (CMSA) is a metaheuristic proposed recently in [3], based on the idea of problem instance reduction [4]. At each iteration, the algorithm generates a number of solutions in a probabilistic way (Construct). The solution components found in these solutions are added to an initially empty sub-instance of the tackled problem instance (Merge). Then, an independent algorithm—typically an exact solver—is applied to the current sub-instance, in order to find the best possible solution to the original problem instance that only contains solution components currently present in the sub-instance (Solve). Finally, the sub-instance is adapted according to the result of the independent algorithm: solution components that are frequently chosen by the independent algorithm are kept, while those that are never used over a certain number of iterations are discarded (Adapt). The four phases are repeated in a loop until a stop criterion is met, a CPU time limit being the most commonly employed one. When the independent algorithm is a MIP solver such as CPLEX or Gurobi (the typical case for CMSA), the procedure can be classified as a matheuristic, because it envelops an exact solver inside a metaheuristic procedure.

4.1 The CMSA Algorithm
Our CMSA algorithm for the BDS problem is based on the following main idea. Given the set of legs L = {ℓ_1, ..., ℓ_n}, let S be the collection of all possible feasible bus shifts, where each shift s ∈ S is a sequence of legs that does not violate any of the constraints of the problem. A feasible solution is any collection of shifts Φ ⊂ S such that every leg ℓ ∈ L belongs to one and only one shift s ∈ Φ; such a Φ is a valid solution of the set partitioning problem on S. Let then t_{ℓs} ∈ {0, 1} be 1 if leg ℓ forms part of shift s, and 0 otherwise. Moreover, let c_s be the cost of shift s, calculated according to the objectives explained in Sect. 2. If we were able to enumerate all shifts in S, the optimal solution of the BDS problem could be found by solving the following ILP model of the set partitioning problem to optimality:

    min   Σ_{s∈S} c_s · x_s                           (2)
    s.t.  Σ_{s∈S} t_{ℓs} · x_s = 1     ∀ ℓ ∈ L        (3)
          x_s ∈ {0, 1}                 ∀ s ∈ S        (4)
Algorithm 1. CMSA for the BDS Problem
1: input: a set of legs L, values for n_sols, d_rate, age_limit
2: Φ_bsf ← ∅; S′ ← ∅
3: while CPU time limit not reached do
4:   for i ← 1, ..., n_sols do
5:     Φ_cur ← ProbabilisticGenerateSolution(L, d_rate)
6:     if Φ_cur is better than Φ_bsf then Φ_bsf ← Φ_cur end if
7:     for all s ∈ Φ_cur such that s ∉ S′ do
8:       S′ ← S′ ∪ {s}
9:       age[s] ← 0
10:    end for
11:  end for
12:  Φ_opt ← ApplyExactSolver(S′)
13:  if Φ_opt is better than Φ_bsf then Φ_bsf ← Φ_opt end if
14:  for all s ∈ S′ do
15:    if s ∈ Φ_opt then age[s] ← 0 else age[s] ← age[s] + 1 end if
16:    if age[s] > age_limit then S′ ← S′ \ {s} end if
17:  end for
18: end while
19: output: Φ_bsf
This ILP model is based on a binary variable x_s for each bus shift s ∈ S, where x_s = 1 means that shift s is chosen to be part of the solution. Constraints (3) ensure that each leg in L is present exactly once among the chosen bus shifts; in this way, all bus legs are assigned to exactly one bus driver and no leg is left uncovered. The objective (2) is to minimize the total cost, i.e., the sum of the costs c_s of the shifts that belong to the solution. Nonetheless, in real-world instances, and in most instances proposed for this formulation, the cardinality of S is too large to make the enumeration of the shifts practical, and even the application of efficient generation procedures, such as backtracking, would lead to ILP models that are too large to be solved in reasonable time with the current availability of memory and computational resources. However, we can use the above ILP model for solving reduced sub-instances S′ ⊂ S, as required by the solve phase of CMSA. Algorithm 1 provides the pseudocode of our CMSA algorithm for the BDS problem. CMSA takes as input values for the following three parameters:

– n_sols, which fixes the number of solutions to be probabilistically generated by the construction procedure at each CMSA iteration.
– d_rate, which guides the determinism rate in the solution construction procedure.
– age_limit, which limits the number of iterations a solution component (shift) s can remain in the sub-instance S′ without being chosen by the exact solver. The age of a solution component s is maintained in a variable age[s].
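To make the set-partitioning semantics of model (2)–(4) concrete, here is a tiny brute-force version of the solve step over a toy sub-instance (pure enumeration, only viable for a handful of shifts; the paper applies CPLEX to the reduced model instead):

```python
from itertools import combinations

def solve_set_partitioning(legs, shifts, cost):
    """Pick a minimum-cost subset of `shifts` (frozensets of legs) that
    covers every leg exactly once, i.e. constraint (3)."""
    legs = frozenset(legs)
    best, best_cost = None, float("inf")
    for k in range(1, len(shifts) + 1):
        for phi in combinations(shifts, k):
            covered = [l for s in phi for l in s]
            if len(covered) == len(legs) and frozenset(covered) == legs:
                c = sum(cost[s] for s in phi)
                if c < best_cost:
                    best, best_cost = phi, c
    return best, best_cost

shifts = [frozenset({1, 2}), frozenset({3, 4}),
          frozenset({1, 2, 3}), frozenset({4})]
cost = {shifts[0]: 10, shifts[1]: 12, shifts[2]: 18, shifts[3]: 5}
phi, c = solve_set_partitioning([1, 2, 3, 4], shifts, cost)
print(c)  # 22: partition {1,2} + {3,4} beats {1,2,3} + {4} at cost 23
```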
CMSA starts by initializing the best solution found so far, Φ_bsf, to an empty solution; the sub-instance S′ is likewise initialized to an empty set. The main loop of CMSA starts at line 3 of Algorithm 1. The four phases of CMSA take place, respectively, at line 5 (construct), lines 7–10 (merge), line 12 (solve) and lines 14–17 (adapt). At each CMSA iteration, the construct and merge steps are repeated until n_sols solutions have been generated and merged into S′. The construction procedure, called at line 5, consists of a probabilistic greedy heuristic that generates a solution Φ_cur to the original set L. It uses the parameter d_rate to decide whether certain internal choices are performed in a deterministic or probabilistic way; details on the heuristic procedure are given in Sect. 4.2. After the construction of every new solution, the corresponding merge step is performed in lines 7–10: all shifts s ∈ Φ_cur that are not yet present in the sub-instance S′ are added to S′, and their age values age[s] are initialized to zero. After generating and merging n_sols solutions, the CMSA algorithm enters the solve phase at line 12. In our case, the ILP solver CPLEX 20.1 is applied in function ApplyExactSolver(S′), by solving the ILP model stated at the beginning of this section after replacing all occurrences of S with S′. We do not set a separate time limit for CPLEX; the time limit for the solve phase is the remaining CPU time budget. This implies that, apart from the last iteration of CMSA, when CPLEX may be capped by the time limit, the solution Φ_opt found in the solve phase is always optimal for the sub-instance S′. Finally, in lines 14–17, the sub-instance is adapted. This adaptation comprises the following steps. First, the ages of the shifts in Φ_opt are reset to zero. Second, the age values of all remaining shifts in S′ are incremented by one.
Finally, all shifts s ∈ S′ with age[s] > age_limit are removed from S′.
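Algorithm 1 can be sketched in a few lines; the construction heuristic and the exact solver are passed in as callables, and all names are illustrative:

```python
import time

def cmsa(legs, construct, solve_exact, n_sols, age_limit, time_limit, cost):
    """Skeleton of Construct, Merge, Solve and Adapt.
    construct(legs) -> candidate solution (a set of shifts);
    solve_exact(sub) -> best solution using only shifts in `sub`, or None;
    cost(solution)  -> objective value (lower is better)."""
    best = None
    sub, age = set(), {}
    deadline = time.monotonic() + time_limit
    while time.monotonic() < deadline:
        for _ in range(n_sols):                      # construct
            cur = construct(legs)
            if best is None or cost(cur) < cost(best):
                best = cur
            for s in cur - sub:                      # merge new components
                sub.add(s)
                age[s] = 0
        opt = solve_exact(frozenset(sub))            # solve sub-instance
        if opt is not None and (best is None or cost(opt) < cost(best)):
            best = opt
        for s in list(sub):                          # adapt ages
            age[s] = 0 if (opt is not None and s in opt) else age[s] + 1
            if age[s] > age_limit:
                sub.discard(s)
                del age[s]
    return best
```

In the real solver, `construct` is the probabilistic greedy heuristic of Sect. 4.2 and `solve_exact` runs CPLEX on the reduced set-partitioning model.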
4.2 Greedy Heuristic
The greedy heuristic employed in the construction step of our CMSA, called at line 5 of Algorithm 1, is described in Algorithm 2. It is a revisited and randomized version of the greedy algorithm proposed in [13]. The procedure takes as input a value for the parameter d_rate.

Algorithm 2. Probabilistic greedy procedure
1: input: a set of legs L, value for d_rate
2: Φ_cur ← ∅
3: L_sorted ← ApplySorting(L, d_rate)
4: for all ℓ in L_sorted do
5:   s_best ← argmin_{s ∈ Φ_cur} (c_{s∪{ℓ}} − c_s)
6:   τ ← SetThreshold(d_rate)
7:   if c_{{ℓ}} < c_{s_best∪{ℓ}} − c_{s_best} + τ then
8:     Add new shift {ℓ} to Φ_cur
9:   else
10:    Add leg ℓ to shift s_best in Φ_cur
11:  end if
12:  for all ℓ′ ≠ ℓ in L_sorted such that tour(ℓ′) = tour(ℓ) do
13:    Add leg ℓ′ to shift s_best in Φ_cur if s_best ∪ {ℓ′} is feasible
14:  end for
15:  Remove from L_sorted all legs added to shifts in Φ_cur at the current iteration
16: end for
17: output: Φ_cur

The algorithm starts by sorting the legs, done at line 3 of Algorithm 2 in function ApplySorting. This subprocedure adds the legs, one by one, to a sorted sequence L_sorted, initially empty, choosing among the legs that have not been added to L_sorted yet. Every new entry is chosen according to the following criterion: with probability d_rate, the leg with the earliest start time is added to L_sorted; otherwise, that is, with probability 1 − d_rate, a random leg is chosen. If d_rate is set to 1.0, the legs in L_sorted are sorted by their start time, as in the original algorithm [13]. Then, beginning at line 4, the main loop of the algorithm takes place. The legs are explored in the order defined by L_sorted, and each leg ℓ is either inserted into the shift that produces the least cost increase, or a new shift is created if the cost of a new shift containing solely ℓ is less than the least cost increase plus a certain threshold τ. Function SetThreshold chooses the value of τ as follows: with probability d_rate, τ is set to a fixed value of 500, while with probability 1 − d_rate a random number between 500 and 1000 is chosen uniformly. These bounds (500 and 1000, respectively) were selected according to problem-specific knowledge. After inserting a leg ℓ in an existing or in a new shift, the algorithm tries to perform all feasible additions to that shift of other legs belonging to the same tour as ℓ. This subprocedure explores the legs by increasing start time, and it terminates at the first infeasible insertion or when no other legs of the same tour are left. The procedure ends when all legs from L_sorted have been added to the shifts in the solution Φ_cur.
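The probabilistic sorting step (ApplySorting) can be sketched as follows (legs are represented here simply by their start times; illustrative only):

```python
import random

def apply_sorting(starts, d_rate, rng=random):
    """Build the sequence L_sorted: with probability d_rate take the
    remaining leg with the earliest start time, otherwise a random one."""
    remaining = list(starts)
    ordered = []
    while remaining:
        nxt = min(remaining) if rng.random() < d_rate else rng.choice(remaining)
        remaining.remove(nxt)
        ordered.append(nxt)
    return ordered

# With d_rate = 1.0 the result is fully sorted by start time:
print(apply_sorting([410, 360, 508, 460], d_rate=1.0))  # [360, 410, 460, 508]
```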
5 Experimental Results
We tested the CMSA algorithm on the wide set of realistic instances available in the literature. Instance sizes range from 10 tours (about 70 legs) to 250 tours (at most 2500 legs). The instances with sizes 10–100 were released in [13], while the larger instances with sizes 150–250 were introduced later in [12]. We compare our CMSA with the state-of-the-art algorithms previously presented in the literature: Simulated Annealing (SA) and Hill Climbing (HC) [13], Tabu Search (TS) [12], and three hyper-heuristics using low-level heuristics proposed in [14]: Chuang-Pruning (CHPR) [5], a combination of adaptive mechanisms to manage a set of active low-level heuristics (GIHH) [21], and a streamlined (lean) version of GIHH (LGIHH) [1]. We also compare the results with the Branch and Price (B&P) approach developed by [15].
Table 2. CMSA parameters, the considered domains for parameter tuning, and the finally determined parameter values.

Parameter  | 10–100 tours: Domain | Value | 150–250 tours: Domain | Value
n_sols     | {2, 3, ..., 500}     | 300   | {2, 3, ..., 200}      | 66
d_rate     | [0.50, 1.00]         | 0.77  | [0.80, 1.00]          | 0.96
age_limit  | {2, 3, ..., 50}      | 4     | {2, 3, ..., 30}       | 4
We implemented CMSA in C++, compiled with GNU g++ 9.4.0 on Ubuntu 20.04.4 LTS. The experiments were run on a machine equipped with an AMD Ryzen Threadripper PRO 3975WX processor (32 cores, base clock frequency 3.5 GHz) and 64 GB of RAM, allowing one core per experiment. The experiments for the other algorithms were run on a different and slower machine, with a base clock frequency of 2.20 GHz and a maximum frequency of 2.90 GHz. Although a completely fair comparison is not possible, for the above-mentioned reasons and because the algorithms were not all implemented in the same programming language, the experimental data presented in Sect. 5.2 clearly show that CMSA is able to outperform the other metaheuristics on most instance classes, even when the time limit for CMSA is kept much shorter than for the other methods.

5.1 Parameter Tuning
We tuned the values of the parameters n_sols, d_rate and age_limit through the automatic algorithm configuration tool json2run [26], which implements the F-Race procedure [2]. The parameter space was sampled using a Hammersley point set [10]. We tuned the parameters independently for the instances with sizes from 10 to 100 tours and for the new larger instances, with sizes from 150 to 250 tours. Indeed, we had to allow smaller domains for the larger instances, because combinations of high values of age_limit and n_sols together with a small d_rate are very likely to produce ILP models that are too large and may saturate the memory during the solve phase. Parameters n_sols and age_limit have domains of natural numbers, while d_rate takes real numbers with a precision of two decimal places. Table 2 shows the domains that we applied to the parameters and the different outcomes of the tuning procedures.
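For illustration, a two-dimensional Hammersley point set (the low-discrepancy construction used to sample the parameter space) can be generated as follows (a textbook construction, not the tuning tool's code; each coordinate would then be rescaled to the corresponding parameter domain):

```python
def van_der_corput(i, base=2):
    """Radical inverse of i in the given base, a value in [0, 1)."""
    v, denom = 0.0, 1.0
    while i > 0:
        denom *= base
        i, r = divmod(i, base)
        v += r / denom
    return v

def hammersley_2d(n):
    """n well-spread points in the unit square: (i/n, radical_inverse(i))."""
    return [(i / n, van_der_corput(i)) for i in range(n)]

print(hammersley_2d(8)[1])  # (0.125, 0.5)
```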
5.2 Analysis of the Results
Table 3 shows the average results, grouped by instance size, for the different methods. Each instance size class contains five distinct instances and we executed 10 independent runs on each instance, so that each value is calculated over 50 runs. The values for SA and HC are also taken over 10 runs per instance, while for the hyper-heuristics 5 runs per instance were executed. TS and B&P are deterministic, so runs are not repeated. All algorithms worked with time limits of 1 h, except for B&P, which was allowed up to 2 h. Values in bold report the best results among metaheuristics, while underlined values are the best values including also the exact approach.

Table 3. Average results (costs) for classes of instances (sizes expressed by number of tours) and methods.

Size | CMSA     | SA       | HC       | TS       | CHPR     | GIHH     | LGIHH    | B&P
10   | 14879.7  | 14739.6  | 14988.4  | 15036.4  | 14956.2  | 14847.4  | 14810.6  | 14709.2
20   | 30745.9  | 30971.0  | 31275.6  | 31248.4  | 30896.7  | 30892.2  | 30810.8  | 30294.8
30   | 50817.2  | 51258.0  | 51917.4  | 51483.0  | 51331.4  | 51059.4  | 51037.6  | 49846.4
40   | 68499.9  | 69379.8  | 71337.6  | 69941.2  | 69182.9  | 68988.4  | 69022.2  | 67000.4
50   | 86389.2  | 87557.4  | 87262.4  | 87850.6  | 87394.3  | 87184.4  | 87145.2  | 84341.0
60   | 102822.9 | 104333.0 | 104296.4 | 104926.2 | 103921.5 | 103491.6 | 103467.3 | 99727.0
70   | 121141.9 | 123225.6 | 123304.0 | 123632.2 | 122502.9 | 122198.6 | 122321.8 | 118524.2
80   | 138760.3 | 140914.0 | 140508.0 | 140482.4 | 139931.8 | 139648.2 | 139551.9 | 134513.8
90   | 155078.3 | 157426.0 | 156862.4 | 156296.4 | 155520.8 | 155560.8 | 155649.6 | 150370.8
100  | 171786.7 | 174501.8 | 172909.0 | 172916.0 | 171901.0 | 171879.8 | 172763.7 | 172582.2
150  | 263387.7 | 266705.5 | 265492.3 | 265654.8 | –        | –        | –        | –
200  | 349017.0 | 354408.4 | 353494.9 | 350747.2 | –        | –        | –        | –
250  | 439234.5 | 446525.0 | 446000.9 | 443845.8 | –        | –        | –        | –

We can observe that CMSA outperforms the other metaheuristics on all instance groups but the smallest one (size 10). In general, the best results for instances up to size 90 remain those set by the B&P, whilst for larger instances CMSA obtains the new best results. For the larger instances with 150, 200 and 250 tours, only data for SA, HC and TS are available for comparison.

Table 4 shows mean values of the objective function, collected from the same CMSA experiments as those presented in Table 3, after 15 min (900 s) and 30 min (1800 s). We compare them with the state-of-the-art metaheuristic, which is specified in the "Benchmark" column. We also report the results of the B&P, which has a time limit of 2 h but may stop earlier if an optimal solution is found, so the actual B&P execution time is specified as well. Results that improve or equal the current state of the art among metaheuristics are marked in bold. The data show that CMSA converges very quickly toward good solutions. After 15 min it already achieves better results than the other metaheuristics for 10 out of 13 instance classes, and for 11 out of 13 after 30 min. For instances of size 100, CMSA after 15 or 30 min already performs better than the exact method in 2 h, but not better than the hyper-heuristic LGIHH. The data also suggest that CMSA is unlikely to get stuck in early local minima, as we always observe a consistent decrease of the cost function value over time.
Finally, the fact that CMSA is able to provide good solutions quickly may be interesting for real-world applications, where human decision makers are likely to prefer waiting only a short time to have the results of the automated scheduling at hand.
Table 4. CMSA results (costs) measured after 15, 30, and 60 min (900, 1800 and 3600 s), and comparison with state-of-the-art metaheuristics and B&P. Best values among metaheuristic methods are in bold.

Size | CMSA 900 s | CMSA 1800 s | CMSA 3600 s | Benchmark Method | Benchmark 3600 s | B&P Time | B&P Best
10   | 14899.0    | 14886.4     | 14879.7     | SA               | 14739.6          | 7.2      | 14709.2
20   | 30805.1    | 30770.3     | 30745.9     | LGIHH            | 30810.8          | 1201.4   | 30294.8
30   | 50911.6    | 50863.0     | 50817.2     | LGIHH            | 51037.6          | 3610.6   | 49846.4
40   | 68711.3    | 68600.2     | 68499.9     | GIHH             | 68988.4          | 3605.8   | 67000.4
50   | 86674.3    | 86517.0     | 86389.2     | LGIHH            | 87145.2          | 3674.4   | 84341.0
60   | 103206.0   | 102998.3    | 102822.9    | LGIHH            | 103467.3         | 4373.2   | 99727.0
70   | 121734.6   | 121410.7    | 121141.9    | GIHH             | 122198.6         | 6460.4   | 118524.2
80   | 139397.4   | 139073.7    | 138760.3    | LGIHH            | 139551.9         | 5912.4   | 134513.8
90   | 155674.5   | 155387.4    | 155078.3    | CHPR             | 155520.8         | 7390.4   | 150370.8
100  | 172447.3   | 172086.9    | 171786.7    | LGIHH            | 171833.5         | 7395.8   | 172582.2
150  | 264261.6   | 263803.9    | 263387.7    | HC               | 265492.3         | –        | –
200  | 350638.9   | 349707.2    | 349017.0    | TS               | 350747.2         | –        | –
250  | 441917.3   | 440364.5    | 439234.5    | TS               | 443845.8         | –        | –
6 Conclusions
We applied the CMSA metaheuristic to BDS, a complex and challenging real-world problem that integrates scheduling and routing issues. CMSA turned out to compare favourably with the state-of-the-art metaheuristics for this problem. In particular, it showed good performance on the large instances, which are in general the most critical ones. In the future, we plan to investigate the use of feature-based tuning mechanisms, in which the parameters are not fixed to specific values but are computed as functions of the features of the instance. Indeed, our analysis highlighted that the best parameter configuration depends on some of the features, in particular those related to the size of the instance. We would also like to study the option of performing an online tuning of the CMSA parameters, so that the parameters are adjusted during a single execution of the algorithm, using some learning mechanism. Finally, we will investigate the use of different techniques for the building blocks of the CMSA technique, in particular for the construct phase. To this aim, we plan to test both other greedy techniques and some form of backtracking procedure to generate shifts with suitable characteristics.

Acknowledgements. We thank Tommaso Mannelli Mazzoli for helpful discussions about the BDS problem and for sharing the code of the problem validator with us.
266
R. M. Rosati et al.
Roberto Maria Rosati acknowledges support by TAILOR, a project funded by the EU Horizon 2020 research and innovation programme under GA No. 952215, which facilitated his research stay at the IIIA-CSIC. The financial support by the Austrian Federal Ministry for Digital and Economic Affairs, the National Foundation for Research, Technology and Development, and the Christian Doppler Research Association is gratefully acknowledged by Lucas Kletzander and Nysret Musliu. Finally, Christian Blum acknowledges support by grant PID2019-104156GB-I00 funded by MCIN/AEI/10.13039/501100011033.
References
1. Adriaensen, S., Nowé, A.: Case study: an analysis of accidental complexity in a state-of-the-art hyper-heuristic for HyFlex. In: 2016 IEEE Congress on Evolutionary Computation (CEC), pp. 1485–1492. IEEE (2016)
2. Birattari, M., Yuan, Z., Balaprakash, P., Stützle, T.: F-Race and iterated F-Race: an overview. In: Experimental Methods for the Analysis of Optimization Algorithms, pp. 311–336 (2010)
3. Blum, C., Pinacho, P., López-Ibáñez, M., Lozano, J.A.: Construct, merge, solve & adapt: a new general algorithm for combinatorial optimization. Comput. Oper. Res. 68, 75–88 (2016)
4. Blum, C., Raidl, G.R.: Hybridization based on problem instance reduction. In: Blum, C., Raidl, G.R. (eds.) Hybrid Metaheuristics. AIFTA, pp. 45–62. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-30883-8_3
5. Chuang, C.Y.: Combining multiple heuristics: studies on neighborhood-based heuristics and sampling-based heuristics. Ph.D. thesis, Carnegie Mellon University (2020)
6. Constantino, A.A., de Mendonça Neto, C.F.X., de Araujo, S.A., Landa-Silva, D., Calvi, R., dos Santos, A.F.: Solving a large real-world bus driver scheduling problem with a multi-assignment based heuristic algorithm. J. Univ. Comput. Sci. 23(5), 479–504 (2017)
7. De Leone, R., Festa, P., Marchitto, E.: Solving a bus driver scheduling problem with randomized multistart heuristics. Int. Trans. Oper. Res. 18(6), 707–727 (2011)
8. Desrochers, M., Soumis, F.: A column generation approach to the urban transit crew scheduling problem. Transp. Sci. 23(1), 1–13 (1989)
9. Ferrer, J., Chicano, F., Ortega-Toro, J.A.: CMSA algorithm for solving the prioritized pairwise test data generation problem in software product lines. J. Heuristics 27(1), 229–249 (2021)
10. Hammersley, J.M., Handscomb, D.C.: Monte Carlo Methods. Chapman and Hall, London (1964)
11. Ibarra-Rojas, O., Delgado, F., Giesen, R., Muñoz, J.: Planning, operation, and control of bus transport systems: a literature review. Transp. Res. Part B Methodol. 77, 38–75 (2015)
12. Kletzander, L., Mazzoli, T.M., Musliu, N.: Metaheuristic algorithms for the bus driver scheduling problem with complex break constraints. In: Proceedings of the Genetic and Evolutionary Computation Conference, pp. 232–240 (2022)
13. Kletzander, L., Musliu, N.: Solving large real-life bus driver scheduling problems with complex break constraints. In: Proceedings of the International Conference on Automated Planning and Scheduling, vol. 30, pp. 421–429 (2020)
14. Kletzander, L., Musliu, N.: Hyper-heuristics for personnel scheduling domains. In: Proceedings of the International Conference on Automated Planning and Scheduling, vol. 32, pp. 462–470 (2022)
15. Kletzander, L., Musliu, N., Van Hentenryck, P.: Branch and price for bus driver scheduling with complex break constraints. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 11853–11861 (2021)
16. Lewis, R., Thiruvady, D., Morgan, K.: Finding happiness: an analysis of the maximum happy vertices problem. Comput. Oper. Res. 103, 265–276 (2019)
17. Li, J., Kwan, R.S.: A fuzzy genetic algorithm for driver scheduling. Eur. J. Oper. Res. 147(2), 334–344 (2003)
18. Lin, D.Y., Hsu, C.L.: A column generation algorithm for the bus driver scheduling problem. J. Adv. Transp. 50(8), 1598–1615 (2016)
19. Lourenço, H.R., Paixão, J.P., Portugal, R.: Multiobjective metaheuristics for the bus driver scheduling problem. Transp. Sci. 35(3), 331–343 (2001)
20. Martello, S., Toth, P.: A heuristic approach to the bus driver scheduling problem. Eur. J. Oper. Res. 24(1), 106–117 (1986)
21. Misir, M., De Causmaecker, P., Vanden Berghe, G., Verbeeck, K.: An adaptive hyper-heuristic for CHeSC 2011. In: OR53 Annual Conference, Nottingham, UK (2011)
22. Pinacho-Davidson, P., Bouamama, S., Blum, C.: Application of CMSA to the minimum capacitated dominating set problem. In: Proceedings of the Genetic and Evolutionary Computation Conference, pp. 321–328 (2019)
23. Portugal, R., Lourenço, H.R., Paixão, J.P.: Driver scheduling problem modelling. Public Transp. 1(2), 103–120 (2008)
24. Shen, Y., Kwan, R.S.K.: Tabu search for driver scheduling. In: Fandel, G., Trockel, W., Aliprantis, C.D., Kovenock, D., Voß, S., Daduna, J.R. (eds.) Computer-Aided Scheduling of Public Transport, vol. 505, pp. 121–135. Springer, Heidelberg (2001). https://doi.org/10.1007/978-3-642-56423-9_7
25. Smith, B.M., Wren, A.: A bus crew scheduling system using a set covering formulation. Transp. Res. Part A General 22(2), 97–108 (1988)
26. Urli, T.: json2run: a tool for experiment design & analysis. CoRR abs/1305.1112 (2013)
27. Wren, A.: Scheduling vehicles and their drivers: forty years' experience. Technical report, University of Leeds (2004)
Topic Modelling and Frame Identification for Political Arguments

Shohreh Haddadan¹, Elena Cabrio², Axel J. Soto³,⁴, and Serena Villata²(B)

¹ University of Luxembourg, Esch-sur-Alzette, Luxembourg [emailprotected]
² Université Côte d'Azur, CNRS, Inria, I3S, Nice, France {elena.cabrio,serena.villata}@univ-cotedazur.fr
³ Universidad Nacional del Sur, Bahía Blanca, Argentina
⁴ Institute for Computer Science and Engineering (CONICET–UNS), Bahía Blanca, Argentina [emailprotected]
Abstract. Presidential debates are one of the most salient moments of a presidential campaign, where candidates are challenged to discuss the main contemporary and historical issues in a country. These debates represent a natural ground for argumentative analysis, which has always been employed to investigate political discourse structure in philosophy and linguistics. In this paper, we take up the challenge of analysing these debates from the topic modelling and framing perspective, to enrich the investigation of these data. Our contribution is threefold: first, we apply transformer-based language models (i.e., BERT and RoBERTa) to the classification of generic frames, showing that these models improve the results presented in the literature for frame identification; second, we investigate the task of topic modelling in political arguments from the U.S. presidential campaign debates, applying an unsupervised machine learning approach; and finally, we discuss various visualisations of the identified topics and frames from these U.S. presidential election debates to allow a further interpretation of such data.
Keywords: Argument mining · Framing · Political debates

1 Introduction
Argumentation is a rhetorical means used by politicians to put forward their own arguments in front of their audience. As highlighted by Boydstun et al. [3], candidates strive to focus the debate on a topic that advantages them and/or their party. A candidate whose party's or administration's economy was thriving would either prefer to discuss topics related to the economy or try as much as she can to portray her arguments on other topics from the perspective of economics. The latter strategy is referred to as framing in rhetoric. Entman [10] defines framing as follows: "To frame is to select some aspects of a perceived reality and make them more salient in a communicating text, in such a way as to promote a particular problem definition, causal interpretation, moral evaluation and/or treatment recommendation."

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. A. Dovier et al. (Eds.): AIxIA 2022, LNAI 13796, pp. 268–281, 2023. https://doi.org/10.1007/978-3-031-27181-6_19

In the U.S. presidential debates, topics are customarily demanded by the audience either explicitly (i.e., through questions by moderators), e.g., the Iraq war, which dominates the debates in 2004, or implicitly, i.e., as an important issue which the audience might crave hearing about, like the Watergate scandal in 1976. Topics and frames cover two different viewpoints on the arguments put forward in the debate. On the one hand, topics are identified by the keywords that make them distinct from the other topics. The language or the set of keywords describing the topic of an argument is the same regardless of the stance the debater is taking towards this topic, e.g., Iraq, war, military, Saddam Hussein. On the other hand, framing is how an argument by a debater is put forward through selected words to react to the discussion about the topics in the debate. Lakoff [11] highlights the importance of framing in political speeches and debates by giving an example from United States politics. He points out that the term "tax relief", introduced by George W. Bush's administration, puts the topic of "taxation" in a frame which implies that the party advocating taxation is a villain, while the (Republican) party against it is relieving people from this affliction. In the example below, about the topic of "death penalty", from the 1988 U.S. presidential elections, the candidate from the Democratic party, Michael Dukakis, chooses words such as education and prevention in his premises against the death penalty, whilst the Republican candidate George H. W. Bush uses words like inhibiting, rape and brutalization. Dukakis's choice of words portrays his argument on the death penalty in a different framing dimension than Bush's, and this is how their stances for and against the death penalty are formed. Both arguments are on the topic of "death penalty".
Thus, framing can be a determining factor in the recognition of the stance for or against a topic in a debate, as framing defines the aspects about which a topic can be discussed.

1. Bush-Dukakis, September 25, 1988:
DUKAKIS: "I'm opposed to the death penalty. I think everybody knows that. I'm also very tough on violent crime. And that's one of the reasons why my state has cut crime by more than any other industrial state in America. It's one of the reasons why we have the lowest murder rate of any industrial state in the country. It's one of the reasons why we have a drug education and prevention program that is reaching out and helping youngsters all over our state, the kind of thing I want to do as president of the United States..."
LEHRER: "Response, Mr. Vice President."
BUSH: "... And I favor the death penalty. I know it's tough and honest people can disagree. But when a narcotics wrapped up guy goes in and murders a police officer, I think they ought to pay with their life. And I do believe it would be inhibiting. And so I am not going to furlough men like Willie Horton, and I would meet with their, the victims of his last escapade, the rape and the brutalization of the family down there in Maryland. Maryland would not extradite Willie Horton, the man who was furloughed, the murderer, because they didn't want him to be furloughed again. And so we have a fundamental difference on this one."
270
S. Haddadan et al.
The automatic identification of topics and frames in political argumentation is therefore of major importance to enrich argument-based information like the claims put forward in the debate, the supporting evidence, and the relations between the identified arguments. In this paper, we address this issue on the ElecDeb60To16 dataset¹ of U.S. political presidential debates [12,14]. More precisely, the three main contributions of this paper are the following: first, we apply transformer-based language models (i.e., BERT and RoBERTa) to classify generic frames on the "Media Frame Corpus" dataset with five topics, showing that these models improve the results achieved in the literature for the task of frame identification; second, we apply an unsupervised machine learning approach for topic modeling which takes advantage of sentence embeddings to represent the debates. This approach integrates the argument components in the debates as a source to extract issue-specific frames from the ElecDeb60To16 dataset; finally, we provide some visualisations of the identified topics and frames from different debates which allow for insightful interpretations. Such visualisations are also meant to be consumed by lay people, hence enabling the use of NLP methods by non-technical persons.
2 Related Work
In this section, we discuss the related literature focusing on Argument Mining (AM) with topic modeling and frame identification in the political domain. In computational studies of rhetoric in the political domain, two main definitions of frames have been discussed. In the first one, frames are defined in a certain predefined dimension space, as in Boydstun et al. [4]: this definition is referred to as generic frames [20]. The other approach considers frames as an extra topic dimension of a speech, which is defined by the choice of words in a statement [1,25]. This is referred to as issue-specific frames. The "Policy Frames Codebook" [4] considers the following 15 frames to be comprehensive enough to cover most issue-specific frames in most topics: economic frames, capacity and resources frames, morality frames, fairness and equality frames, constitutionality and jurisprudence frames, policy prescription and evaluation frames, law, order, crime and justice frames, security and defense frames, health and safety frames, quality of life frames, cultural identity frames, public opinion frames, political frames, external regulations and reputation frames, and other frames. Boydstun et al. [4] discuss that issue-specific frames, such as "right of life for a fetus" in the argument against the topic of "abortion", can be interpreted to fall into one of such generic framing dimensions, i.e., "Morality". From the computational point of view, some approaches address the issue of automatically identifying topics and classifying frames in text. Nguyen et al. [21] introduce the concept of Hierarchical Ideal Point Topic Model (HIPTM) to identify Ideal Points from the bill speeches and voting patterns of Republican legislators in the U.S. Congress. Using the hierarchy of topics, they identify the issue-specific frames politicians used in their arguments. Tsur et al. [25]
¹ https://github.com/pierpaologoffredo/disputool2.0/tree/main/Dataset/ElecDeb60To16.
analyse framing strategies in an unsupervised setting using topic models fitted on time series through regression methods. The authors use this framework to identify temporal topic relations and expressed agendas, and to analyse framing dynamics known as "political spin". They use lagged dependency between two time series to uncover framing patterns or attention shifts in the campaigns. The data they use consist of 134,000 statements made by 641 representatives (i.e., members of Congress) between two Congressional elections in 2010 and 2012. In alignment with the growing attention towards applying computational methods to identifying frames in the social/political domains, Card et al. [5] build the Media Frame Corpus of news articles annotated with the above-mentioned generic frames and the tone of the article (i.e., pro, anti, neutral). We describe this dataset in detail in Sect. 3. Hartmann et al. [15] also introduce a dataset of online fora discussions extracted from the Argument Extraction Corpus [24], annotated with a subset of generic frames from the Policy Frames Codebook. Finally, Naderi and Hirst [20] compare several baselines with neural-network-based methods on multi-class classification and one-against-others classification to identify the generic frames in the Media Frame Corpus. They achieve the highest accuracy using GRUs with pretrained GloVe embeddings as features. Ajjour et al. [1] also leverage the concept of framing in arguments. They define frames to be non-overlapping aspects of arguments on the same subject while concealing other aspects. In this context, frames are aspects taken for/against a controversial issue. They build a dataset of premise-conclusion pairs from Debatepedia², and annotate each pair with a few keyphrases, which are then lemmatised and unified. As an example of unification, the terms unhealthy, non-smoker, and US business are transformed to health, smoker and business, respectively.
Counting the number of labels for each pair, frames are considered generic when the label is used for more than one argument, and topic-specific when the label occurs in one argument pair only. The final dataset includes 7052 generic frame arguments (i.e., economics, public opinion, environment, feasibility, rights, democracy, crime, politics, security and safety), and 5274 specific frame arguments. They first cluster the documents into topics using TF-IDF features of the debate and argument components with k-means; then they remove the topic from these clusters by using the prominent terms for each topic extracted by c-TF-IDF³, and again cluster the results into frames. Analogously, Dumani et al. [9] consider the classification of stances and frames as a preliminary stage of argument clustering for the argument retrieval task. Reframing, i.e., controllable text generation, has also recently attracted attention in similar studies. The aim of reframing is to change the perspective of the argument with respect to its target audience and the aspects that might be more appealing to it. Chen et al. [7] train neural models to reframe sentences on the Media Frame Corpus. They apply a sentence-level blank-filling method. Chakrabarty et al. [6] create a parallel dataset of arguments with the same purpose but different framings. Then, they apply a text generation method along with textual entailment to reframe the arguments.
² http://www.debates.org.
³ C stands for cluster.
3 Datasets
In this section, we present the datasets we used in this paper for our experiments.

– Media Frame Corpus: it consists of English news articles on 5 controversial topics (gun control, death penalty, same sex marriage, immigration and smoking) annotated with generic frames using the 15 framing dimensions introduced in [4], on three different levels: 1) headline frame, 2) primary frame, and 3) span level [5]. The following example from Card et al. [5] depicts a piece of a news article from a 2006 editorial in the Denver Post on the topic of immigration, annotated with headline and span frames.

• [WHERE THE JOBS ARE]Economic [Critics of illegal immigration can make many cogent arguments to support the position that the U.S. Congress and the Colorado legislature must develop effective and well-enforced immigration policies that will restrict the number of people who migrate here legally and illegally.]Public opinion [It's true that all forms of immigration exert influence over our economic and [cultural makeup.]Cultural identity In some ways, immigration improves our economy by adding laborers, taxpayers and consumers, and in other ways [immigration detracts from our economy by increasing the number of students, health care recipients and other beneficiaries of public services.]Capacity ]Economic [Some economists say that immigrants, legal and illegal, produce a net economic gain, while others say that they create a net loss.]Economic There are rational arguments to support both sides of this debate, and it's useful and educational to hear the varying positions.
The Inter-Annotator Agreement (IAA) on primary frames in the three stages of annotation is reported to be between 0.4 and 0.6 based on Krippendorff's α, which is considered moderate agreement [2]. However, due to the complexity of overlapping span-level annotation, the IAA on this task is at most 0.23 on one of the topics, which is in any case higher than chance agreement.⁴

– ElecDeb60To16: this dataset contains the transcripts of the speeches given by the candidates of the two major parties during the final stages of the U.S. presidential debates (in 1980 and 1992 independent candidates were also included in the final debates). The dataset contains 6666 speech turns from 41 different debates across these years (1960–2016). No annotation of frames/topics is available for this dataset. However, each debate has been segmented into sections where the moderator asks a new question. There are 467 sections over all debates; on average, each debate contains approximately 12 sections. This dataset is annotated with argument components and relations, which we profit from in the methodology adopted in this paper [14].
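As an illustration only (the corpus's actual storage format differs, and the primary frame is explicitly annotated by humans rather than derived), span-level frame annotations like those in the immigration example above can be modelled as offset-labelled spans. The following hypothetical Python sketch (class and function names are ours) also shows a simple coverage heuristic:

```python
from dataclasses import dataclass

@dataclass
class FrameSpan:
    """One span-level frame annotation: [start, end) character offsets."""
    start: int
    end: int
    frame: str

def frame_coverage(spans):
    """Total number of characters each frame covers (spans may overlap,
    as in the Media Frame Corpus span-level annotation)."""
    cov = {}
    for s in spans:
        cov[s.frame] = cov.get(s.frame, 0) + (s.end - s.start)
    return cov

def dominant_frame(spans):
    """Toy heuristic: the frame covering the most text."""
    return max(frame_coverage(spans), key=frame_coverage(spans).get)

# Toy annotations loosely mimicking the nested immigration example.
spans = [
    FrameSpan(0, 120, "Economic"),
    FrameSpan(10, 60, "Public opinion"),
    FrameSpan(70, 95, "Cultural identity"),
]
print(dominant_frame(spans))  # Economic
```

Note that such a coverage-based heuristic would at best approximate the annotated primary frame; it is shown here only to make the span representation concrete.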
⁴ For more details, see https://github.com/dallascard/media_frames_corpus.
4 Topic Modeling and Frame Classification for Arguments
In this section, we describe the two tasks we focus on to enrich the analysis of political argumentation. The first task consists in uncovering the topics discussed in the political debates data. Following the work of Ajjour et al. [1] on unsupervised issue-specific frame identification, we also apply a hierarchical topic modeling approach using sentence-based transformer language models to discover the framing of the arguments by the presidential candidates in the ElecDeb60To16 dataset. This experimental setting is discussed in Sect. 4.1. Secondly, we focus on identifying frames in arguments occurring in the same dataset of presidential debates. We adopt a frame identification approach by training a supervised model using transformer-based language models on the "Media Frame Corpus" to classify generic frame spans and primary frames. We later use this model to classify frames in the ElecDeb60To16 dataset. This experimental setting is discussed in Sect. 4.2.

4.1 Generic Frame Classification
Naderi and Hirst [20] applied different approaches to the Media Frame Corpus data to classify frames at sentence level, and achieved the best results with LSTM-based neural networks. In this paper, we employ transformer-based models like BERT [8] and RoBERTa [16] to address the task of generic frame classification and compare our results with those obtained by Naderi and Hirst [20]. It is noteworthy that their experiments were done on version v1.0 of the Media Frame Corpus, whilst we run ours on v2.0. To ensure a fair comparison, we implemented the experiments of Naderi and Hirst [20] and ran them on v2.0 of the dataset, applying the same data preprocessing. We use the pretrained embeddings of BERT (uncased) and RoBERTa, and a softmax function over the labels for sequence classification. The fine-tuning process is done in 4 epochs using an Adam optimiser with a learning rate of 2e−5 and an epsilon parameter of 1e−8. The Media Frame Corpus contains frame annotations at article level (primary frame) and span level. We perform our experiments on both levels. Furthermore, we perform a cross-topic experiment to evaluate to what extent the fine-tuned model is able to predict the frame on a topic which has not been seen before in the training data (albeit from the same dataset). This experiment has been conducted on both the primary frames of the news articles and the span-level frames.
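The classification head and optimiser settings described above can be illustrated with a toy, dependency-free sketch. This is our own simplification, not the authors' code: real fine-tuning updates all transformer weights through a deep learning framework, whereas here we only show a softmax over frame labels, the cross-entropy gradient, and one Adam update with the stated hyperparameters (lr = 2e−5, eps = 1e−8; the betas are the common defaults, assumed here):

```python
import math

def softmax(logits):
    """Softmax over the label logits, as used for sequence classification."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

class Adam:
    """Minimal Adam optimiser with the hyperparameters from the text."""
    def __init__(self, n, lr=2e-5, b1=0.9, b2=0.999, eps=1e-8):
        self.lr, self.b1, self.b2, self.eps = lr, b1, b2, eps
        self.m, self.v, self.t = [0.0] * n, [0.0] * n, 0

    def step(self, params, grads):
        self.t += 1
        for i, g in enumerate(grads):
            self.m[i] = self.b1 * self.m[i] + (1 - self.b1) * g
            self.v[i] = self.b2 * self.v[i] + (1 - self.b2) * g * g
            mhat = self.m[i] / (1 - self.b1 ** self.t)   # bias correction
            vhat = self.v[i] / (1 - self.b2 ** self.t)
            params[i] -= self.lr * mhat / (math.sqrt(vhat) + self.eps)
        return params

# The gradient of cross-entropy w.r.t. the logits is simply p - y.
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)
target = 1                       # gold frame index (toy example)
grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
logits = Adam(n=3).step(logits, grads)  # one tiny update toward the gold label
```

With a learning rate of 2e−5 each update is tiny, which is typical for transformer fine-tuning: the pretrained weights are only gently nudged over the 4 epochs.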
4.2 Topic Modeling and Issue-Specific Frame Identification
In this second experimental setting, we address the topic modeling task on the ElecDeb60To16 dataset, taking advantage of debate features such as the questions, speeches and sections identified in this dataset. Furthermore, we employ this topic modeling approach and the annotated argument components in this dataset to identify issue-specific frames used in candidates' arguments on various topics. More precisely, this experimental setting includes:
– Topics from questions: we assume that the questions asked by the moderators, panelists and audiences set the theme and determine the topic for the arguments made by the candidates. The theme set for the debate is then discussed by the candidates using various frames to structure their arguments concerning the issue/topic/theme. For instance, in Example 2 below, the moderator explicitly sets the topic of the debate to "gun control laws".

2. Clinton-Trump, October 19, 2016:
WALLACE: [...] I want to focus on two issues that, in fact, by the justices that you name could end up changing the existing law of the land. First is one that you mentioned, Mr. Trump, and that is guns.
– Topics from speeches: occasionally candidates digress from the topics set by moderators and set another topic for the rest of the debate. This argumentative technique is called agenda setting [3]. In order to retrieve topics initiated through this rhetorical technique, we also consider extracting topics from the speeches made by the candidates during the debates.
– Frames from argument components: frames are provided as a contextual setting for taking a stance for or against an argument. For instance, the topic of "tax laws" can be argued in different frames such as "middle-class families" or "small business owners". The two argument components annotated on the debate of September 26th, 2008 in Example 3 below indicate two different frames provided by the two candidates (belonging to different parties) on the topic of "taxation law". Based on this evidence, we assume that extracting more detailed topics from the argument components may help retrieve the frames about the discussed topics.

3. McCain-Obama, September 26, 2008:
OBAMA: And I think that the fundamentals of the economy have to be measured by whether or not the middle class is getting a fair shake.
MCCAIN: Senator Obama's secret that you don't know is that his tax increases will increase taxes on 50 percent of small business revenue.
Topic modeling [26] has long been used along with bag-of-words features and Latent Dirichlet Allocation (LDA) or other matrix factorisation models. Recently, with the advancement of language models, topic modeling has also been adapted to the use of transformer-based models such as BERT [8], and later on sentence embeddings [22]. In order to obtain these sentence embeddings, Reimers and Gurevych [22] add a pooling layer (MEAN pooling by default) on top of the pretrained BERT output layer to get a fixed-size vector. Then they rely on siamese and triplet neural networks, previously used in machine vision [23], to fine-tune the model. They use different objective functions (classification, regression, triplet) depending on the task. The sentence embeddings resulting from this model are shown to improve the results of many semantic textual similarity tasks and clustering algorithms. We apply some preprocessing steps to the text inputs (i.e., questions, speeches and argument components) before encoding them using the sentence embedding model proposed by Reimers and Gurevych [22]. This preprocessing
includes replacing the names of the candidates in the debates by "candidate" or "other candidate", depending on the speaker. Speeches shorter than 16 tokens (word tokeniser function from the nltk library [17]) have been removed, as well as interruptions and cut-off speeches, such as "Can I respond?" and "Thank you". With this preprocessing, which is based on the assumption that these speeches do not contribute to the topic distribution in the debates, ∼25% of the speeches are set aside from the clustering input. We then apply a topic modeling approach to the input, implemented with a density-based clustering method called HDBSCAN [18]. In this way, we cluster documents based on their encoded representations using sentence embeddings, following the implementation of Grootendorst and Reimers [13]. They reduce the dimensions of the input encoded by the sentence embeddings with UMAP [19], and they implement c-TF-IDF to automatically extract the prominent terms characterising each cluster. We adopt this architecture at the different levels to extract the topics and frames in the debates, employing the annotated argument structure. Our architecture is visualised in Fig. 1.

Fig. 1. Overall architecture of the clustering system implemented for topic modeling and issue-specific frame identification.
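The c-TF-IDF step that extracts prominent terms per cluster can be sketched in pure Python. This follows one common formulation of class-based TF-IDF (as popularised by the BERTopic implementation: each cluster is concatenated into one pseudo-document, and a term's weight is its in-cluster frequency scaled by log(1 + A/f(t)), with A the average number of tokens per cluster and f(t) the term's total frequency); the exact variant used by the authors may differ, and the toy clusters below are illustrative only:

```python
import math
from collections import Counter

def c_tf_idf(clusters):
    """Class-based TF-IDF over {cluster_id: [doc_string, ...]}.
    Returns {cluster_id: {term: weight}} with
    w(t, c) = tf(t, c) * log(1 + A / f(t))."""
    # Concatenate each cluster into a single pseudo-document.
    tf = {c: Counter(" ".join(docs).lower().split())
          for c, docs in clusters.items()}
    total_freq = Counter()
    for counts in tf.values():
        total_freq.update(counts)
    avg_tokens = sum(total_freq.values()) / len(clusters)  # A
    return {
        c: {t: n * math.log(1 + avg_tokens / total_freq[t])
            for t, n in counts.items()}
        for c, counts in tf.items()
    }

def top_terms(scores, k=3):
    """The k highest-weighted terms of one cluster."""
    return [t for t, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]

# Toy clusters echoing the topic-word lists discussed in Sect. 5.
clusters = {
    "energy": ["oil drilling gas offshore pipeline",
               "energy solar wind coal policy"],
    "abortion": ["abortion women life child",
                 "church faith religion prayer"],
}
weights = c_tf_idf(clusters)
print(top_terms(weights["energy"]))
```

Terms shared across clusters receive a lower weight than cluster-exclusive ones, which is exactly what makes the extracted term lists characterise each cluster.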
5 Results
In this section, we report the results obtained for the two tasks presented in Sect. 4, and we discuss some visualisations that help to get a better understanding of the identified topics and frames.
Table 1. Multi-class classification results (accuracy) of different methods on sentences on the 5 most common frames: Economic; Legality, Constitutionality, Jurisdiction; Policy Prescription and Evaluation; Crime and Punishment; Political.

| Method | Death pen. | Immigr. | Same sex marriage | Tobacco | Gun contr. | All |
|---|---|---|---|---|---|---|
| Bi-LSTM (no pretrained embeddings)      | 0.7371 | 0.6270 | 0.7618 | 0.6369 | 0.6892 | 0.7131 |
| Bi-LSTM (GloVe emb.)                    | 0.7535 | 0.6207 | 0.7699 | 0.6441 | 0.6974 | 0.7120 |
| Bi-LSTM (GloVe emb., updating weights)  | 0.7573 | 0.6310 | 0.7856 | 0.6419 | 0.7041 | 0.7236 |
| GRU (GloVe emb.)                        | 0.7406 | 0.6238 | 0.7687 | 0.6412 | 0.6967 | 0.7132 |
| GRU (GloVe emb., updating weights)      | 0.7555 | 0.6366 | 0.7776 | 0.6505 | 0.6996 | 0.7231 |
| LSTM (GloVe emb.)                       | 0.7521 | 0.6105 | 0.7786 | 0.6346 | 0.6885 | 0.7091 |
| BERT-cased                              | 0.8097 | 0.7316 | 0.8276 | 0.7562 | 0.7730 | 0.7737 |
| BERT-uncased                            | 0.8115 | 0.7450 | 0.8349 | 0.7641 | 0.7826 | 0.7764 |
| RoBERTa                                 | 0.8117 | 0.7395 | 0.8320 | 0.7578 | 0.7916 | 0.7859 |
Naderi and Hirst [20] provide the results for generic frame classification both for all classes of generic frames and for multi-class classification over the 5 most frequent frames, which cover more than 60% of the data, namely: Economic; Legality, constitutionality and jurisprudence; Policy prescription and evaluation; Crime and punishment; and Politics. We also run our experiments on all these classes. Table 1 compares the results of the multi-class classification on the 5 most common frames using the methods applied by Naderi and Hirst [20] with the fine-tuning of the pretrained transformer-based models BERT and RoBERTa. It is worth noticing that results improve by at least 0.7% when applying frame classification on all topics. The results of the multi-class classification of news articles from the Media Frame Corpus on all frames are reported in Table 2. Table 3 shows the results of the primary frame classification task using the same methods. Results in all experiments show an improvement when fine-tuning pretrained BERT and RoBERTa. Table 4 shows the results of span-level frame identification when the articles of a particular topic are left out from the training data and used only in the test set. Results indicate that span-level identification does not change drastically, whilst primary frame identification appears strongly correlated with the topics used in the training data, leading to a substantial impairment of the results, which are therefore not reported. Due to the highly imbalanced number of frame classes, we report the weighted F-score in all results.

We illustrate the results of topic modeling and issue-specific frame identification using some visualisation techniques to get a better understanding of the obtained results. In Fig. 2, the size of each bubble represents the topic frequency, while the colour is given by the party that uses that topic the most.
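For reference, the support-weighted F-score used in the tables averages per-class F1 values weighted by each class's number of true instances, so frequent frames dominate the aggregate. A minimal sketch with hypothetical labels:

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted F1: per-class F1 averaged with weights
    proportional to each class's number of gold instances."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for c in set(y_true):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += (support[c] / total) * f1
    return score

# Toy imbalanced frame labels (hypothetical, 8 vs 2 instances).
y_true = ["Economic"] * 8 + ["Political"] * 2
y_pred = ["Economic"] * 7 + ["Political"] * 3
print(round(weighted_f1(y_true, y_pred), 4))  # → 0.9067
```

This matches the behaviour of scikit-learn's `f1_score` with `average="weighted"`, which is the usual choice for imbalanced multi-class settings like this one.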
The ﬁgure on the left shows how prominent the topic of “Iraq and Afghanistan war” is in
Topic Modelling and Frame Identiﬁcation for Political Arguments
Table 2. Multiclass classification results (in terms of accuracy) of different methods on sentences on all 15 frames, plus the irrelevant class.

Method                              Death pen.  Immigr.  Same sex marriage  Tobacco  Gun contr.  All
BiLSTM (Glove emb., upd. weights)   0.6027      0.4840   0.5856             0.5320   0.5660      0.5761
GRU (Glove emb., upd. weights)      0.6065      0.5034   0.5942             0.5454   0.5858      0.5805
LSTM (Glove emb., upd. weights)     0.6109      0.4879   0.5876             0.5368   0.5685      0.5718
BERT-uncased-base                   0.70        0.6217   0.6869             0.6715   0.6657      0.6514
RoBERTa-base                        0.7040      0.6234   0.6845             0.6721   0.6732      0.6672
Table 3. Multiclass classification results of primary frames of articles.

                    Death penalty            Immigration              Same sex marriage
Method              P       R       F1       P       R       F1       P       R       F1
LSTM                0.7064  0.7042  0.6890   0.6444  0.6469  0.6309   0.7086  0.7129  0.6880
BiLSTM              0.7077  0.7061  0.7068   0.6482  0.6423  0.6349   0.7376  0.7371  0.7363
GRU                 0.7208  0.7167  0.7070   0.6981  0.6906  0.6827   0.7742  0.7678  0.7575
BERT-uncased-base   0.7071  0.7081  0.6964   0.8284  0.8262  0.8088   0.7491  0.7167  0.7070
RoBERTa-base        0.7240  0.7343  0.7256   0.8248  0.8404  0.8286   0.7160  0.7167  0.7126

                    Tobacco                  Gun control              All
Method              P       R       F1       P       R       F1       P       R       F1
LSTM                0.6715  0.6589  0.6440   0.9139  0.9118  0.9106   0.5514  0.5604  0.5417
BiLSTM              0.6100  0.5845  0.5970   0.9084  0.9057  0.9046   0.5635  0.5761  0.5661
GRU                 0.7046  0.7013  0.6950   0.9225  0.8348  0.8627   0.7506  0.7531  0.7495
BERT-base-uncased   0.7389  0.7652  0.7387   0.9413  0.9416  0.9377   0.8540  0.8547  0.8540
RoBERTa-base        0.7409  0.7591  0.7307   0.9238  0.9260  0.9208   0.8086  0.8114  0.8093
2004, while the figure on the right shows that the topic of "schools and education" was discussed twice as much by the Democratic candidate as by his opponent in 1960. This visualisation also reveals the participation of each candidate in each topic (e.g., in the second figure, of the 16% of the speeches on "school and education", 10.71% were from Kennedy and only 4.11% from Nixon). Figure 3 shows the distribution of frames on the topic of abortion in 1984. Two of the most frequent frames in the topic of abortion, represented by the topic words "abortion, women, life, child", are "church, faith, religion, religious, catholic, prayer, separation, practice, state" and "abortion, abortions, life, pro-life, unborn, rape, birth, child, reduce, incest" from argument components. Figure 4 also illustrates the highest-ranking frames on the topic of energy in 1980 to be "oil, drilling, gas, offshore, gasoline, dependence, pipeline, production, natural", "environment, clean, environmental, water, air, pollution, toxic, waste, standards" and "energy, solar, independence, wind, coal, policy, alternative, gas, independent". The keywords dependence and independence refer to energy production in the U.S. being dependent on other countries.
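The bubble visualisation is driven by two quantities per topic: its overall frequency (bubble size) and the party that uses it most (bubble colour). A sketch of that aggregation, on made-up speech counts (the numbers and topic names below are illustrative, not the actual 1960 data):

```python
from collections import defaultdict

# Hypothetical (party, topic) speech counts.
counts = {
    ("Democratic", "schools and education"): 30,
    ("Republican", "schools and education"): 12,
    ("Democratic", "economy"): 25,
    ("Republican", "economy"): 33,
}

def bubble_data(counts):
    """For each topic: total frequency share (bubble size) and dominant party (colour)."""
    per_topic = defaultdict(dict)
    for (party, topic), n in counts.items():
        per_topic[topic][party] = per_topic[topic].get(party, 0) + n
    total = sum(counts.values())
    out = {}
    for topic, by_party in per_topic.items():
        size = sum(by_party.values()) / total      # share of all speeches on this topic
        colour = max(by_party, key=by_party.get)   # party using the topic the most
        out[topic] = (round(size, 4), colour)
    return out

print(bubble_data(counts))
```

The resulting dictionary is exactly what a plotting library needs to draw one sized, coloured bubble per topic.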
S. Haddadan et al.
Table 4. Multiclass classification results of sentences of articles taking one set of articles as test set after fine-tuning.

Test set            Precision  Recall  F-score  Size of test set
Death penalty       0.6270     0.6084  0.5958   38590
Immigration         0.5476     0.5413  0.5251   45959
Tobacco             0.6119     0.5901  0.5919   30773
Gun control         0.6207     0.6027  0.6070   45544
Same sex marriage   0.6546     0.6521  0.6486   35774
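The leave-one-topic-out protocol behind Table 4 (fine-tune on all topics but one, test only on the held-out topic) can be sketched as a simple split generator (the toy triples below are illustrative, not corpus data):

```python
def leave_one_topic_out(samples):
    """samples: list of (topic, sentence, frame) triples.
    Yields (held_out_topic, train, test), where the held-out
    topic's sentences appear only in the test set."""
    topics = sorted({t for t, _, _ in samples})
    for held_out in topics:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test

# Toy corpus
data = [("Tobacco", "s1", "Economic"),
        ("Immigration", "s2", "Legality"),
        ("Tobacco", "s3", "Politics")]
for topic, train, test in leave_one_topic_out(data):
    print(topic, len(train), len(test))
```

Each iteration gives one row of a table like Table 4: the model is trained on `train` and evaluated on `test`.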
Fig. 2. Visualisation of the distribution of topics in 2004 and 1960.
Fig. 3. Distribution of frames over the topic of Abortion in 1984.
Fig. 4. Distribution of frames over the topic of Energy in 1980.
6 Concluding Remarks
In this paper, we presented a new architecture to automatically identify and classify the topics and frames in political debates, namely the debates of the US presidential campaigns from 1960 to 2016. Our extensive empirical evaluation shows good results, outperforming standard baselines and similar approaches [20]. Finally, we proposed some intuitive visualisations of the extracted topics and frames, which allow a better understanding of the nuances of the argumentation. Future work perspectives include, in addition to an improvement of the obtained results, a time-guided analysis of the evolution of the topics and frames in the U.S. presidential debates, with the goal of highlighting how the way politicians discuss these topics has changed over time.

Acknowledgments. This work was partly supported by the French government, through the 3IA Côte d'Azur "Investments in the Future" project managed by the National Research Agency (ANR) with the reference number ANR-19-P3IA-0002. This work was also partly supported by the EU Horizon 2020 projects AI4Media, under contract no. 951911 (https://ai4media.eu/), and MIREL (http://www.mirelproject.eu/), under contract no. 690974. Shohreh Haddadan hereby acknowledges that this research is supported by the Luxembourg National Research Fund (FNR) (10929115).
References

1. Ajjour, Y., Alshomary, M., Wachsmuth, H., Stein, B.: Modeling frames in argumentation. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2922–2932. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/D19-1290
2. Artstein, R., Poesio, M.: Inter-coder agreement for computational linguistics. Comput. Linguist. 34(4), 555–596 (2008). https://doi.org/10.1162/coli.07-034-R2
3. Boydstun, A.E., Glazier, R.A., Pietryka, M.T.: Playing to the crowd: agenda control in presidential debates. Polit. Commun. 30(2), 254–277 (2013). https://doi.org/10.1080/10584609.2012.737423
4. Boydstun, A.E., Gross, J.H., Resnik, P., Smith, N.A.: Identifying media frames and frame dynamics within and across policy issues. In: New Directions in Analyzing Text as Data Workshop, London (2013)
5. Card, D., Boydstun, A.E., Gross, J.H., Resnik, P., Smith, N.A.: The media frames corpus: annotations of frames across issues. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp. 438–444. Association for Computational Linguistics (2015). https://doi.org/10.3115/v1/P15-2072
6. Chakrabarty, T., Hidey, C., Muresan, S.: ENTRUST: argument reframing with language models and entailment. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4958–4971. Association for Computational Linguistics (2021). https://doi.org/10.18653/v1/2021.naacl-main.394
7. Chen, W.F., Al Khatib, K., Stein, B., Wachsmuth, H.: Controlled neural sentence-level reframing of news articles. In: Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 2683–2693. Association for Computational Linguistics (2021). https://aclanthology.org/2021.findings-emnlp.228
8. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT 2019, pp. 4171–4186 (2019). https://doi.org/10.18653/v1/N19-1423
9. Dumani, L., Wiesenfeldt, T., Schenkel, R.: Fine and coarse granular argument classification before clustering. In: Proceedings of the 30th ACM International Conference on Information & Knowledge Management, CIKM 2021, pp. 422–432. Association for Computing Machinery (2021). https://doi.org/10.1145/3459637.3482431
10. Entman, R.M.: Framing: toward clarification of a fractured paradigm. J. Commun. 43(4), 51–58 (1993). https://doi.org/10.1111/j.1460-2466.1993.tb01304.x
11. Lakoff, G.: The ALL NEW Don't Think of an Elephant!: Know Your Values and Frame the Debate. Chelsea Green Publishing (2014)
12. Goffredo, P., Haddadan, S., Vorakitphan, V., Cabrio, E., Villata, S.: Fallacious argument classification in political debates. In: Raedt, L.D. (ed.) Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23–29 July 2022, pp. 4143–4149. ijcai.org (2022). https://doi.org/10.24963/ijcai.2022/575
13. Grootendorst, M., Reimers, N.: MaartenGr/BERTopic: v0.9.3 - quick-fix. Zenodo (2021). https://doi.org/10.5281/zenodo.5574296
14. Haddadan, S., Cabrio, E., Villata, S.: Yes, we can! Mining arguments in 50 years of US presidential campaign debates. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4684–4690. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/P19-1463
15. Hartmann, M., Jansen, T., Augenstein, I., Søgaard, A.: Issue framing in online discussion fora. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 1401–1407. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/N19-1142
16. Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. CoRR abs/1907.11692 (2019). https://arxiv.org/abs/1907.11692
17. Loper, E., Bird, S.: NLTK: the natural language toolkit (2002). https://arxiv.org/abs/cs/0205028
18. McInnes, L., Healy, J., Astels, S.: HDBSCAN: hierarchical density based clustering. J. Open Source Softw. 2(11), 205 (2017). https://doi.org/10.21105/joss.00205
19. McInnes, L., Healy, J., Melville, J.: UMAP: uniform manifold approximation and projection for dimension reduction (2020). https://arxiv.org/abs/1802.03426
20. Naderi, N., Hirst, G.: Classifying frames at the sentence level in news articles. In: Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, pp. 536–542. INCOMA Ltd. (2017). https://doi.org/10.26615/978-954-452-049-6_070
21. Nguyen, V.A., Boyd-Graber, J., Resnik, P., Miler, K.: Tea party in the house: a hierarchical ideal point topic model and its application to republican legislators in the 112th congress. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1438–1448. Association for Computational Linguistics (2015). https://doi.org/10.3115/v1/P15-1139
22. Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3982–3992. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/D19-1410
23. Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: a unified embedding for face recognition and clustering. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 815–823. IEEE (2015). https://doi.org/10.1109/CVPR.2015.7298682
24. Swanson, R., Ecker, B., Walker, M.: Argument mining: extracting arguments from online dialogue. In: Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pp. 217–226. Association for Computational Linguistics (2015). https://doi.org/10.18653/v1/W15-4631
25. Tsur, O., Calacci, D., Lazer, D.: A frame of mind: using statistical models for detection of framing and agenda setting campaigns. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1629–1638. Association for Computational Linguistics (2015). https://doi.org/10.3115/v1/P15-1157
26. Xia, L., Luo, D., Zhang, C., Wu, Z.: A survey of topic models in text classification. In: 2019 2nd International Conference on Artificial Intelligence and Big Data (ICAIBD), pp. 244–250 (2019). https://doi.org/10.1109/ICAIBD.2019.8836970
Substitute Plastic Film with Kraft Paper in Automatic Pallet Wrapping: An AI Pipeline

Eleonora Iotti1(B), Alessandro Dal Palù1, Gianluca Contesso2, and Francesco Bertinelli2

1 Department of Mathematical, Physical and Computer Sciences, University of Parma, Parco Area delle Scienze 53/A, 43124 Parma, Italy
{eleonora.iotti,alessandro.dalpalu}@unipr.it
2 ACMI S.p.A., Via G. Di Vittorio, 60, 43045 Fornovo di Taro, Parma, Italy
{gianluca.contesso,francesco.bertinelli}@acmispa.com
Abstract. This paper presents and discusses an overview of an AI pipeline to analyze the effects of substituting plastic film with Kraft paper in tertiary packaging, i.e., in the external envelope of a pallet. Since there is no prior knowledge about paper wrapping yet, the goal is to understand the physics of the load unit—wrapped in paper—when subject to horizontal accelerations. This makes it possible to study and analyze its rigidity and robustness to permanent deformations and/or excessive shifting during road or rail freight, to avoid damage to and ripping of the envelope. The idea behind our AI pipeline is to virtually simulate such a situation, to precisely identify critical use cases, and eventually suggest a correction to the wrapping format. The first gain in using such an approach is to drastically reduce the number of physical tests needed to build a solid base of knowledge about the behavior of Kraft paper enveloping the pallet during motion. The proposed pipeline consists of three phases: (i) data collection from real tests, (ii) modeling of the simulation, fitting relevant parameters between the actual test and the simulated one, and (iii) virtual experiments on different settings, to suggest the best format. Computer vision and machine learning techniques are employed to accomplish these tasks, and preliminary results show encouraging performances of the proposed idea.

Keywords: Multi-physics simulation · Machine learning · Multiple objects tracking · Automatic pallet wrapping

1 Motivations
For some years now, we have witnessed the rise of a global movement pointing towards a more sustainable future. Such a campaign has caused renewed interest

Project entitled "Machine learning to substitute LLDPE plastic film with Kraft paper in automatic pallet wrapping," supported by ACMI S.p.A. and funded with D.M. 10.08.2021 n. 1062 on FSE REACT-EU, by Ministero dell'Università e della Ricerca (MUR), under the Programma Operativo Nazionale (PON) "Ricerca e Innovazione" 2014–2020, Azione Green.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
A. Dovier et al. (Eds.): AIxIA 2022, LNAI 13796, pp. 282–296, 2023. https://doi.org/10.1007/978-3-031-27181-6_20
An AI Pipeline to Substitute Plastic w. Paper in Automatic Pallet Wrapping
from companies and institutions, which resulted in the search for novel technologies to reduce pollution and plastic usage, and in the advancement of strategic actions to change course. For public institutions, this translates into long-term objectives and into the development of resolutions and plans such as, for example, the European Strategy for Plastics in a Circular Economy [9]. In particular, such a strategy was put in place in January 2018, and still applies in the broader context of the European Green Deal [10], which consists of a set of proposals, actions, and funding during the five-year term 2019–2024. The EU strategy for plastics poses a crucial challenge to companies and industries working in the field of packaging, especially for food and beverages. As a matter of fact, the majority of primary and secondary packaging (respectively, the single product's actual container and the grouping of such containers into a larger set for handling purposes) for food and beverages consists of multi-layered plastic, which is not recyclable but still plays an important role in food safety compared to other available materials. A recent survey on the effects of the EU plastic resolution pointed out these issues, stating that there are currently "no viable alternatives" that could ensure the same level of safety and avoidance of food waste [31], while a lack of safety and the risk of wasting food are themselves sufficient to produce negative environmental impacts. However, there is also evidence of virtuous examples of plastic elimination, or at least reduction and recycling, in the primary packaging of food [19].
Despite these discussions, attempts, and efforts on primary and secondary packaging, nowadays LLDPE (Linear Low-Density Polyethylene) stretch film and heat-shrink wrap are still the best available choices for the tertiary packaging of food and beverages (the enclosure or wrapping of stacked secondary packaging onto a pallet for safe transportation and logistics). Over the years, the amount of plastic material and its thickness have been constantly decreased, and the use of recyclable plastic for wrapping has made it possible in some cases to comply with EU requirements. Managing plastic materials and adapting them to shrink around the loaded pallet is a well-known automation task that is currently efficiently performed by end-of-line packaging machines. Nevertheless, due to this know-how and to some of LLDPE's main properties, such as resistance to water and UV rays, to the best of our knowledge there have been no attempts worldwide at automatic pallet wrapping with sustainable materials. ACMI S.p.A.1 is an Italian manufacturer of high-tech bottling and packaging lines, specialized in beverages and food. ACMI has international relevance, serving both national companies and large multinational groups, such as The Coca-Cola Company™. The recent work of ACMI is significant in the open discussion about plastic, since their novel "green" approach to the end of line proposes to replace the external wrapping material, from LLDPE to Kraft paper (a recyclable and biodegradable paper with specific elastic strength). This represents the first attempt at substituting plastic tertiary packaging in the food and beverages industry. A completely plastic-free end of line opens up a series of engineering and automation challenges which have yet to be explored. 1
https://www.acmispa.it/en/.
E. Iotti et al.
One of these challenges is to ensure that the wrapped envelope can withstand road or rail freight, thus guaranteeing safety for truck workers and avoiding loss of product. This aspect is of key importance in the plastic-to-paper transition of tertiary packaging, and it requires a thorough understanding of the paper's behavior in relation to safety aspects, such as, e.g., how many layers of paper are needed for wrapping and how they should be stratified, how much pulling tension has to be applied to the paper while wrapping, what the optimal pallet loading schema is for better stability, and so on. Such knowledge should, in turn, give engineers some hints for the actual development of the automatic wrapping machine and its controlling software.

1.1 A Note on Methodology and Purposes of This Work
The growth and increasing impact of Artificial Intelligence (AI) technologies [24] in almost every aspect of human development also opens up the challenges posed by the field of automation [37,42]. The research question posed by ACMI's innovative idea (to wrap pallets of food or beverage products in Kraft paper instead of plastic) requires rethinking the design of so-called wrapping formats, i.e., the series of parameters and indications given to the pallet stretch wrapper to perform the actual wrapping of products. Therefore, the long-term goal of the research project is to develop an intelligent automatic recommendation system that is able to suggest the safest, most robust, and most reliable wrapping format to the paper wrapping machine. To pursue such an objective, preliminary studies have to be carried out with the help of other machinery. In our case, we make use of an in-house special testing bench, which is able to reproduce the actual horizontal acceleration of transport with the load unit carried on it, to control the dynamics parameters, and to record a video of the test. This paper focuses on the short-term goal of using such raw data to incrementally build enough knowledge to virtually simulate the behavior of any physical setup, while minimizing the number of actual experiments needed to gain useful information. In summary, the proposed AI pipeline consists of a low-level computer vision system to extract raw data, a realistic multi-body simulation enhanced with a machine learning method to fit the concrete behavior of the pallet during the test, and the use of such a simulation to perform virtual tests and give feedback on possible improvements of the wrapping format. The approach allows us to estimate and simulate any wrapping, from plastic to paper, and even to simulate an arbitrary number of overlapping wraps. The input of the first phase consists of raw video recordings of load units subject to horizontal accelerations.
Those videos are analyzed with standard computer vision techniques, whose goal is to extract the centers of gravity of the wooden pallet and of the bundles (secondary packages) on the pallet. Moreover, we also extract the rotation of each package, in order to detect instability and deformations. The second phase concerns the development of a simulation of the test case, using a multi-physics engine that models the testing machine, its acceleration, and the load unit. At the beginning of this phase, the simulation cannot be realistic, due to the lack of physical parameters related to secondary/tertiary packaging (i.e., the static and
dynamic friction forces at work). We aim at learning those parameters, and, in turn, the global behavior of the pallet during acceleration, by matching the ideal conditions of the simulation to the actual measurements of the centers of gravity obtained from the vision system. Such a task employs machine learning algorithms to match the actual behavior. Once the physical parameters are accurate enough, the third phase proceeds to identify critical issues of a specific wrapping, e.g., points with a high risk of paper ripping, and to suggest corrections. This paper overviews the whole pipeline, covering in particular the implementation details of the first and second phases of the project.
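The three phases above can be laid out as a thin pipeline skeleton. This is only a structural sketch: all function bodies are placeholders, and every name and value below is ours, not taken from ACMI's actual code.

```python
from dataclasses import dataclass, field

@dataclass
class TestRun:
    video_path: str
    acceleration_g: float
    trajectories: dict = field(default_factory=dict)   # object id -> list of (x, y) centers
    fitted_params: dict = field(default_factory=dict)  # e.g. friction coefficients

def extract_trajectories(run: TestRun) -> TestRun:
    """Phase 1: computer vision -- centers of gravity of pallet/bundles per frame."""
    run.trajectories = {"pallet": [(0.0, 0.0)]}        # placeholder result
    return run

def fit_simulation(run: TestRun) -> TestRun:
    """Phase 2: tune multi-physics parameters until the simulated
    centers of gravity match the measured ones."""
    run.fitted_params = {"mu_static": 0.4, "mu_dynamic": 0.3}  # placeholder values
    return run

def virtual_experiments(run: TestRun, formats: list) -> str:
    """Phase 3: simulate candidate wrapping formats, return the best one."""
    return formats[0]                                  # placeholder choice

run = fit_simulation(extract_trajectories(TestRun("test_03g.mp4", 0.3)))
best = virtual_experiments(run, ["format_A", "format_B"])
print(best)
```

The value of making the interfaces explicit is that each phase can be developed and validated independently, with `TestRun` as the shared contract.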
2 Background
Pallets wrapped with paper, like those with a plastic envelope, must comply with the European Roadworthiness Directive in order to guarantee security during rail or road freight. In this field, safety and reliability are expressed in terms of the European Standard EUMOS 40509 [3], which aims at quantifying the rigidity of the pallet when it is subject to a force (due to an acceleration) along a direction. The situation that such a standard aims to simulate and investigate is the motion dynamics of a truck loaded with one or more EUR-pallets. The rigidity of the load, in fact, impacts the transport effectiveness, and measuring such a quantity is needed to prevent permanent deformations or excessive shifting of the load during transportation; in other words, the stability of the load unit [41]. Moreover, such rigidity and robustness directly impact the holding strength of the external wrapping, which in turn could be deformed or ripped during motion. Excessive deformations and/or shifting of the load units result in a lack of stability of the overall truck and in unsafe transportation. In detail, the Acceleration Bench Test, in line with EUMOS 40509, defines a test load unit, which is typically a pallet with a number of layers of products on it, wrapped with plastic film (or Kraft paper, in our case), and that can be oriented in the LP-direction, i.e., with the long side of the pallet parallel to the acceleration direction, or in the BP-direction, i.e., with the short side of the pallet parallel to the acceleration direction. Such a test unit, with its orientation, is then subjected, using a special testing machine detailed later, to an acceleration impulse that immediately stops and gives rise to a constant deceleration, until the load unit stops. In a real setting, the acceleration impulse is modeled as a constant acceleration that lasts for half a second.
Typical tests are performed with constant accelerations from 0.2 g up to 0.5 g, the acceleration that must be supported as stated by EUMOS 40509. The acceleration may cause permanent deformations and elastic deformations, which are, respectively, the residual deformation of the load unit after the test and the deformation of the load unit during the test. In the latter case, the tilting of the entire load unit during the test is not considered an elastic deformation, since the wooden pallet is taken as the reference for the coordinate system in which such deformations are measured. The Acceleration Bench Test of EUMOS 40509 defines the test setup and some test acceptance criteria, as follows: (i) the permanent displacement of all parts of
Fig. 1. The ESTL Machine in the R&D Department of ACMI S.p.A. The acceleration bench holds a sleight on which a wooden pallet is loaded with two layers.
Fig. 2. Centers of gravity of pallet and packages, returned by the MOSSE tracker.
the test load unit (after the test) must not exceed 5% of the total height of the load unit; (ii) the permanent displacement in the lowest 20 cm of the test load unit must be less than 4 cm on the wooden pallet; (iii) the elastic displacement of all parts of the test load unit (during the test) must not exceed 10% of the total height of the load unit; (iv) there must be no visible structural damage and/or leakage of products at the end of the test. The development of new wrapping technologies must comply with these criteria.

2.1 The ESTL Machine
Fig. 3. Example of static (on the left) and dynamic (on the right) deformations, recorded by the ESTL vision system after the test and during the test.
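The four EUMOS 40509 acceptance criteria listed above can be condensed into a single check. The thresholds are those stated in the text; the function and field names are illustrative, not part of the standard:

```python
def eumos_40509_pass(total_height_mm, permanent_disp_mm, permanent_disp_low20_mm,
                     elastic_disp_mm, visible_damage_or_leakage):
    """Acceptance test: (i) permanent displacement <= 5% of load-unit height,
    (ii) permanent displacement in the lowest 20 cm < 4 cm,
    (iii) elastic displacement <= 10% of load-unit height,
    (iv) no visible structural damage or product leakage."""
    return (permanent_disp_mm <= 0.05 * total_height_mm
            and permanent_disp_low20_mm < 40
            and elastic_disp_mm <= 0.10 * total_height_mm
            and not visible_damage_or_leakage)

# A 1800 mm load unit with 60 mm permanent and 150 mm elastic displacement passes.
print(eumos_40509_pass(1800, 60, 20, 150, False))   # → True
```

For a 1800 mm unit the limits work out to 90 mm permanent and 180 mm elastic displacement, so the example passes; raising the elastic displacement to 200 mm would fail criterion (iii).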
Tests are performed with a special testing machine, called here the ESTL Machine after the name of its manufacturer [2]. The pallets are loaded on a movable platform, called a sleight, on an acceleration bench. An example is illustrated in Fig. 1. Once the pallet is loaded, the acceleration bench can generate a constant horizontal acceleration impulse,
which moves the sleight with the load unit on it. The acceleration can be set between 0 m/s² and 10 m/s² in steps of 0.5 m/s². The duration of the acceleration is at least 500 ms. These parameters permit the simulation of road transport events such as diverting maneuvers and/or emergency stops. Usually, the pallets are tested at different acceleration levels: tests start at a low acceleration level of 0.2 g or 0.3 g (about 1.962 m/s² and 2.943 m/s², respectively). Then, if the result is successful (w.r.t. EUMOS 40509), the constant acceleration impulse is increased by 0.1 g, heading for the legal requirement of 0.5 g for load safety. While testing the acceleration impulse, high-speed recordings are made. Three markers are attached to the load unit and two markers to the sleight, as in Fig. 1, so that the ESTL Machine vision system can detect fluctuations of the pallet. In fact, to detect the plastic (or static) deformation of the load unit after the test, measurements are taken at three different points, before and after the test. The difference between them gives an indication of the plastic deformation. The elastic (or dynamic) deformation, instead, is measured at a height of approximately 1 m with an ultrasonic sensor. The video recordings are then annotated by the ESTL vision system with the detected boundaries of the load over the wooden pallet and the value (in mm) of the current and maximum deformation and its angle. Figure 3 shows an example of the annotations of the ESTL system on the video recording, denoting the static and dynamic deformations that occurred during and after the test. The actual acceleration and displacement of the sleight are known, since the ESTL machine also employs an X-Y accelerometer, and the acceleration profile data are recorded and plotted as well, as shown in Fig. 4.
The plot shows the detected acceleration and deceleration along the x and y axes in a 0.3 g test, in orange and black, respectively (it is worth noting that the acceleration in the y direction is almost zero for the whole duration of the test). Theoretical values of speed (magenta) and displacement (blue) are also plotted.
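Given the constant 500 ms impulse followed by a constant deceleration until standstill, the theoretical speed and displacement curves follow from elementary kinematics. A sketch for a 0.3 g test; note the braking magnitude is not specified in the text, so assuming it equal to the impulse acceleration is purely an illustrative choice:

```python
G = 9.81  # m/s^2

def impulse_profile(accel_g, impulse_s=0.5, decel_g=0.3):
    """Peak speed after the impulse and total sleight travel until standstill,
    for a constant acceleration followed by a constant deceleration.
    decel_g is an assumed braking level, not a value from EUMOS 40509."""
    a = accel_g * G
    d = decel_g * G
    v_peak = a * impulse_s                 # speed at the end of the impulse
    s_accel = 0.5 * a * impulse_s ** 2     # travel during the impulse
    s_decel = v_peak ** 2 / (2 * d)        # travel while braking to a stop
    return v_peak, s_accel + s_decel

v, s = impulse_profile(0.3)
print(round(v, 3), round(s, 3))
```

For a 0.3 g impulse this gives a peak speed of about 1.47 m/s and, under the assumed symmetric braking, roughly 0.74 m of total travel.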
3 Multiple Tracking of Bundles
The first goal of this work is to extract relevant information regarding the behavior of the pallet and bundles during motion. This phase is necessary to understand the dynamics of the sleight-pallet-load system of the ESTL machine in terms of the visible displacements of each part of the system. We want to identify the actual displacement of the load unit, i.e., the wooden pallet and the layers of product, w.r.t. the sleight, and the relative displacements between each pair of elements of the load unit. Each unit of product is called a bundle, and the traceable bundles are only those in front of the camera. In general, the displacement of the pallet and of the traceable bundles can vary according to the type of pallet, its orientation, and the type of product; the primary and secondary packaging also make their contribution to the dynamics of the system. Moreover, the possible presence of a paper/plastic interlayer between the layers of bundles also impacts the amount of displacement. It can be noticed that the motion of the
Fig. 4. Examples of acceleration proﬁles measured by ESTL machine, with a 0.3 g acceleration impulse setting. (Color ﬁgure online)
entire load unit is delayed w.r.t. the motion of the sleight, because of the friction between the two objects. Using the sleight displacement as a reference, we can compute the difference between that displacement and the one of the load unit. The same reasoning can be applied to obtain the relative differences between pallet and bundle displacements, and between each layer of bundles. Such differences are strongly related to the friction coefficients (static and dynamic) of the pallet over the sleight, of each bundle over the pallet, and of bundles with each other. To the best of our knowledge, there are no similar approaches in the context of estimating friction constants and parameters for the simulation of pallet dynamics. In the literature, there are plenty of AI approaches to processing raw video data in order to detect objects and their positions. Such approaches can be roughly divided into standard computer vision approaches and deep learning approaches. Deep neural networks and learning algorithms outperform standard techniques in almost any mainstream recognition/segmentation/detection task, such as the recognition of common objects [20,21,27,34,35], segmentation of a typical external scene [36], tracking of pedestrians from surveillance cameras [17], human pose estimation [38], and recognition of handwritten text [28]. Unfortunately, except for some notable examples [33], not yet mature enough for video processing, deep learning methods usually require a huge amount of homogeneous data, which have to be carefully annotated in the case of supervised learning. This is one of the reasons why, when deviating from mainstream applications, deep neural networks are difficult to train and not always successful at generalizing information. Moreover, recent criticism has pointed out that such networks are often treated as black boxes full of parameters which do not have an intelligible semantics, so a deep network cannot explain its decisions [30].
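With the sleight displacement as the reference, the friction-induced lag of any tracked object reduces to a per-frame difference of positions. A minimal sketch on made-up x-coordinates (the pixel values below are hypothetical, not measured data):

```python
def relative_displacement(sleight_x, object_x):
    """Displacement of a tracked object relative to the sleight, frame by frame:
    (object motion since frame 0) minus (sleight motion since frame 0)."""
    s0, o0 = sleight_x[0], object_x[0]
    return [(o - o0) - (s - s0) for s, o in zip(sleight_x, object_x)]

# Hypothetical tracked x-positions (pixels): the pallet lags behind the sleight.
sleight = [100, 110, 125, 140]
pallet  = [200, 208, 220, 235]
lag = relative_displacement(sleight, pallet)
print(lag)   # → [0, -2, -5, -5]
```

The increasingly negative values are exactly the friction-induced lag that the simulation parameters of phase 2 are fitted against.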
There are cutting-edge works that try to obtain natural language explanations from such systems, but those results are still a subject of discussion in the field of eXplainable Artificial Intelligence (XAI) [14]. Explainability is of course an appreciable feature of an
An AI Pipeline to Substitute Plastic w. Paper in Automatic Pallet Wrapping
289
AI system, but it is crucial for high-risk safety systems. The topic of XAI in critical systems is being addressed by the European Commission, which developed an AI strategy to regulate the trustworthiness of AI systems and enhance excellence in the field [11,12]. In our case, the first phase of the AI pipeline is not required to produce transparent processes or explain its outputs, but those features will be useful in the last phase of the project, where suggestions and recommendations to correct the wrapping format have to be delivered to the final user. Despite this, following a deep learning plan for the development of the computer vision system remains unfeasible due to the requirement for a vast amount of data. Our raw data, in fact, are produced by physical experiments with the acceleration testing machine, and each experiment has a high cost in terms of time and power consumption. Moreover, even if we wanted to use pre-trained networks, the mainstream datasets on which those networks are trained are too general, and adapting them from their domain to ours could lower overall network performance (multi-domain methods are still under investigation). Therefore, our system was crafted for the specific task, with the aid of standard computer vision algorithms and techniques. Our computer vision system consists of a program that processes the video frame by frame. We employ (i) automatic methods to identify a region of interest (ROI), based on the prediction of the position of the load unit given by the acceleration profile data of the test; (ii) a template matching technique on the ROI to detect bundles; (iii) a set of multi-tracking algorithms tailored to the specific task, with the goal of following the bundles during motion; (iv) optical flow detection methods to measure the actual displacements and rotations of packages.
To aid the detection of bundles in raw videos, we provide the system with some general template images to be matched against the pallet and the visible bundles. These templates undergo standard augmentation by stretching, rotating, and cutting the reference image. Given the dimensions of the pallet template and of a bundle template, together with the number of layers loaded on the pallet, the ROI is obtained. In fact, in the first few frames of the video, the load unit is approximately in the center of the view. Displacement data from the ESTL machine, obtained from the acceleration profile depicted in Fig. 4, are used to slightly move the ROI frame by frame. This process is correct only at the very beginning of video processing, when perspective effects are barely noticeable. The ROI is maintained until the template matching algorithm (a normalized cross-correlation between templates and pixels of the ROI) recognizes all bundles and the tracking is ready to start. To keep computational times under control, a Non-Maximum Suppression (NMS) algorithm [32] follows the template matching. We used several state-of-the-art methods for multi-object tracking: from the basic Discriminative Correlation Filter with Channel and Spatial Reliability (CSRT) [29] and the Kernelized Correlation Filter (KCF) [23], to the AdaBoost-based Boosting tracker [22] and the Multiple Instance Learning (MIL) algorithm [15]; the TLD (Tracking-Learning-Detection) [26], MedianFlow [25], and Minimum Output Sum of Squared Error (MOSSE) [16] trackers have also been tested. When tracking starts,
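A minimal sketch of the greedy non-maximum suppression step [32] that prunes overlapping template-matching detections; the box coordinates, scores, and IoU threshold below are illustrative assumptions, not values from the experiments.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression over axis-aligned boxes (x1, y1, x2, y2)."""
    order = np.argsort(scores)[::-1]          # best matches first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the best box with all remaining ones
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_threshold]  # drop overlapping detections
    return keep

# Two near-duplicate detections of the same bundle plus one distinct box
boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # the near-duplicate detection (index 1) is suppressed
```

In practice the boxes and scores would come from thresholding the normalized cross-correlation map produced by the template matcher.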
290
E. Iotti et al.
the ROI detaches from the acceleration profile model (which in the meantime has become increasingly inaccurate) and takes the center of the tracked bundles as a reference. We process one frame out of every three to reduce the computational burden. If the tracking loses a bundle for some frames, the template matching phase is repeated (inside the new ROI). Finally, a dense optical flow is computed for all points in the frame, using an algorithm based on [18]. The vector field of the optical flow is then converted to polar coordinates to trace the rays and rotations of groups of pixels, for each bundle and the wooden pallet. Optical flow thus retrieves information about bundle displacements and their bounces/tilting/turning. For each bundle and the wooden pallet, an approximation of the center of gravity is computed by taking a weighted mean of all displacements, centered on the bounding box of the tracked object. Figure 2 shows an example of the results, where colored dots are the computed centers of gravity of the bundles and the wooden pallet.
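The polar conversion of the flow field and the weighted center estimate can be sketched as follows. This is a simplified NumPy illustration: the paper computes dense flow with the Farnebäck algorithm [18] via OpenCV, and the flow field and box coordinates below are hypothetical.

```python
import numpy as np

def flow_to_polar(flow):
    """Convert a dense optical-flow field of shape (H, W, 2) to magnitude and angle."""
    mag = np.hypot(flow[..., 0], flow[..., 1])
    ang = np.arctan2(flow[..., 1], flow[..., 0])
    return mag, ang

def weighted_center(flow, box):
    """Approximate an object's center as a flow-magnitude-weighted mean of
    pixel coordinates inside its bounding box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    mag, _ = flow_to_polar(flow[y1:y2, x1:x2])
    ys, xs = np.mgrid[y1:y2, x1:x2]
    w = mag + 1e-9                       # avoid division by zero on static regions
    return (np.sum(xs * w) / np.sum(w), np.sum(ys * w) / np.sum(w))

# Uniform rightward flow inside the box -> the center is the box midpoint
flow = np.zeros((100, 100, 2))
flow[20:40, 30:70, 0] = 2.0
cx, cy = weighted_center(flow, (30, 20, 70, 40))
```

Weighting by flow magnitude biases the estimate toward the pixels that actually moved, which is the intent of the displacement-weighted mean described above.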
4 Developing the Simulation
The second phase of the AI pipeline consists of developing a simulation of acceleration tests and tuning such a simulation on 'real' data from video recordings. Those simulations are called multi-(rigid-)body dynamics simulations, and many commercial and free software packages are capable of more or less accurately reproducing rigid-body motion, such as Autodesk AutoCAD [1], MathWorks Simscape Multibody [8], NVIDIA PhysX [6], and so on. They differ in the way they manage friction, velocity, and particle motion, each using specific formulations. For our purposes, we need an engine that allows the user to model wrapping envelopes and that is flexible enough to shape the parameters of such an envelope. Kraft paper, and paper dynamics in general, is still an open challenge for these types of engines. On the other hand, our preliminary work aims at reproducing the dynamics of unwrapped load units (wooden pallet and bundles) first. A closed envelope is a complex system, so the understanding of the global system dynamics depends on what is happening to single packages under the cover. The choice fell on an open-source multi-physics simulation engine called Project Chrono [13,40], developed by the University of Wisconsin-Madison and the University of Parma, which allows the positioning of rigid bodies on a scene, along with various types of links between them. Each body can be a simple shape (e.g., a box, a sphere) and/or a user-defined 3D model. Each body has a center of gravity, a mass, a moment of inertia, and a collision model. The masses of pallets and bundles are easily obtainable from real measurements. The initial centers of gravity of objects depend on their shape and their initial position in the simulation. We chose to approximate bundles with boxes, so the center of gravity can be easily calculated. Then, a linear motor engine is initialized to model the sleight.
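The constant-acceleration speed profile imposed on the linear motor can be sketched in plain Python; the function name, parameter names, and default values below are illustrative assumptions, not Chrono API calls.

```python
def sleight_speed(t, accel=0.3 * 9.81, t_acc_end=1.0, t_dec_start=2.0):
    """Piecewise speed profile: linear ramp up (constant acceleration),
    constant plateau, then a symmetric ramp back down to rest."""
    v_max = accel * t_acc_end            # ramp height = acceleration * ramp length
    if t < 0:
        return 0.0
    if t < t_acc_end:                    # constant-acceleration ramp
        return accel * t
    if t < t_dec_start:                  # plateau at maximum speed
        return v_max
    return max(0.0, v_max - accel * (t - t_dec_start))  # decelerate to rest
```

The ramp length/height and the acceleration/deceleration switching times correspond to the parameters that, as described below, are read off the ESTL acceleration profile.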
Chrono has a facility to create functions for vertical and/or horizontal motion, and in our case a constant x acceleration could be modeled by imposing the ramp length and height (of the speed function), the ending time of the acceleration
and the starting time of the deceleration. All such parameters can easily be obtained from the acceleration profile provided by the ESTL machine. In the Chrono engine, each body has its own material properties. The ESTL machine, the wooden pallet, and each of the bundles are composed of different materials. For each object/material, a value of static friction and a value of kinetic friction must be set. Since these values are unknown, we employ a machine learning method to approximate them. Input data are the extracted positions (centers of gravity) p_i(t) = (x_{p_i}^{(t)}, y_{p_i}^{(t)}) of each relevant object i visible in the video recordings of the ESTL machine at time t. The predicted outputs are the computed positions c_j(t) = (x_{c_j}^{(t)}, y_{c_j}^{(t)}, z_{c_j}^{(t)}) of all the objects in the simulation at time t. Note that a computed position also depends on the static and kinetic friction coefficients of its material, c_j(t) = c_j(t, μ_s, μ_k). Of the latter, only the visible objects should be compared to the extracted data, i.e., the line of bundles in front of the camera. Since the position on the z axis is constant, we consider only c_i(t) = (x_{c_i}^{(t)}, y_{c_i}^{(t)}). The objective is to minimize the distance between real and simulated positions, for each time instant t, with an L2 loss:

L_2(p_i, c_i) = Σ_{t=0}^{T} ‖ p_i(t) − c_i(t, μ_s, μ_k) ‖²    (1)
where T is the final time instant, i.e., the last frame of the video. The idea is to apply the gradient descent algorithm with (1) as cost function to find μ_s and μ_k. However, input data frames are few (∼50 positions for each bundle) and also very noisy due to the previous processing steps (matching, tracking, optical flow detection). A statistical method to denoise the data is the Exponential Moving Average (EMA), which defines a new sequence from raw data depending on the value of a parameter β ∈ (0, 1); larger values of β (close to 1) produce smoother sequences. The machine learning method is a variation of the gradient descent algorithm that applies EMA to the gradient sequence:

v_0 = ∇_0 L_2(p_i, c_i),    v_k = β v_{k−1} + (1 − β) ∇_k L_2(p_i, c_i)    (2)

where ∇_k L_2(p_i, c_i) is the loss gradient at step k of gradient descent. Each step moves the values of μ_s and μ_k toward the minimum of the loss function, as follows:

μ_*^{(0)} randomly initialized,    μ_*^{(k)} = μ_*^{(k−1)} − η v_k    (3)

where η is the learning rate of the gradient descent method. In the deep learning field, such a method is known as gradient descent with momentum [39].
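A minimal sketch of the update rules of Eqs. (2)-(3) on a toy quadratic loss; the loss function, β, η, and step count are illustrative assumptions, since in the paper the loss of Eq. (1) is evaluated by running the simulation.

```python
import numpy as np

def momentum_descent(grad, mu0, beta=0.9, eta=0.1, steps=200):
    """Gradient descent with an exponential moving average of the gradients."""
    mu = np.asarray(mu0, dtype=float)
    v = grad(mu)                              # v_0: the first gradient
    for _ in range(steps):
        mu = mu - eta * v                     # parameter update, Eq. (3)
        v = beta * v + (1 - beta) * grad(mu)  # EMA of gradients, Eq. (2)
    return mu

# Toy stand-in for the L2 loss: minimum at (mu_s, mu_k) = (0.1, 0.04)
target = np.array([0.1, 0.04])
grad = lambda mu: 2 * (mu - target)
mu = momentum_descent(grad, [0.5, 0.5])       # converges close to `target`
```

The EMA of gradients damps the frame-to-frame noise described above, which is why plain gradient descent is replaced by its momentum variant.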
5 Experiments
The computer vision system and the virtual simulation are developed in Python 3.8.12. We used the Python versions of the OpenCV [5] open-source library and of
Project Chrono, i.e., PyChrono [7]. The Irrlicht [4] engine renders the simulation. Both programs run in an Anaconda environment on a laptop with a 6-core 10th-gen. i7 CPU, base speed 1.61 GHz up to 3.60 GHz, and 16 GB of RAM. We chose bundles of six Coca-Cola Zero™ and maintained the same product for all the experiments. Different products would have different shapes and dynamics, making it impossible to compare experiments with each other. Nevertheless, in the future we plan to extend our tests to different types of products. We performed 3 single-layer tests, with accelerations of 0.2 g, 0.3 g, and 0.4 g. The orientation of the load unit was LP, and the layout of products over the pallet is the first layer of a columnar one. At a later time, double-layer experiments were made. The first three tests use a columnar layout. In detail, the first test has no interlayer between layers, the second includes a paper one, and the third a plastic one. Then, we tested two types of symmetric cross layouts, the first one including a deliberate fracture line that negatively impacts load stability. Plastic and paper interlayers were also considered in combination with cross layouts, resulting in six more experiments. All double-layer experiments were performed with an acceleration of 0.3 g. Recordings of experiments last 8–10 s, at a rate of 20 frames per second, with a frame size of 1920 × 1200.

Table 1. CPU execution times of tracking algorithms on the whole video, for tests without interlayers, with columnar layouts and 0.3 g constant acceleration.

                 CPU execution time [seconds]
Tracker name     Single-layer    Double-layer
Boosting          91.671         122.828
CSRT             112.531         176.094
KCF               62.734          63.641
MedianFlow        81.750         118.187
MIL              184.953         295.0312
MOSSE             46.719          56.484
TLD              263.781         453.312
Table 1 shows the CPU execution times of the computer vision system with the different tracking algorithms, and Fig. 5 shows visual results of some of these elaborations (the faster ones). Using the Boosting tracker, the retrieved centers of gravity were then passed to the machine learning algorithm to tune the simulation. Figure 6 shows two different time instants of a simulation that reproduces the behavior of a single-layer unit with LP orientation. From initial values of μs = 0.5 and μk = 0.5 for bundles, and μs = 0.5 and μk = 0.06 for the pallet, the final values decrease to μs = 0.1, μk = 0.001 for bundles and μs = 0.15, μk = 0.04 for the pallet. Figure 6 also shows how the absence of the z axis in parameter fitting results in an uncontrolled expansion of the columnar layer in the z direction.
Fig. 5. Example results of elaborations on video recordings, where the first column refers to the Boosting, the second to the KCF, and the third to the MOSSE tracker.
Fig. 6. Example frames of a simulation with only one layer of products, at 0.4 g acceleration. On the left, the rotation of the rightmost bundles, and on the right the expansion of the columnar layout in z direction.
6 Conclusions and Future Works
This paper proposes an AI pipeline to tackle the challenge of substituting LLDPE stretch film with Kraft paper in automatic pallet wrapping. The design of the pipeline strongly relies on the EUMOS 40509 safety requirements for rail and road transport of packages. The key idea is to simulate acceleration test benches according to such a regulation, in order to produce an automatic recommendation system for the development of wrapping formats. The first important step of the pipeline is to fit the simulation to real tests. A computer vision approach, with mixed methods, serves as an input to obtain measurements of product displacements during acceleration tests with the ESTL machine. Attempts with single- and double-layer settings showed good results. The handling of noisy data was addressed by using a momentum version of gradient descent, which tunes the parameters of the virtual simulation. Results are promising, even if further investigations are needed, and future work will be devoted to the identification of critical points of tension that could impact the paper wrapping, the
development of a realistic simulation of the whole envelope (with wrapping), and the use of such insights to tell final users how many wrapping layers are needed, at which heights, with what tension, and so on. Moreover, XAI techniques will be of primary relevance in this last phase of the pipeline.
References

1. AutoCAD. https://www.autodesk.it/solutions/simulation/overview
2. Engineering & Solutions for Transport & Logistic NV (ESTL NV). https://www.estl.be. Wafelstraat 46, 8540 Deerlijk, Belgium
3. EUMOS, the European Safe Logistics Association. Quality standards. https://eumos.eu/quality-standards/. Accessed 5 Aug 2022
4. Irrlicht. https://irrlicht.sourceforge.io/
5. OpenCV. https://opencv.org/
6. PhysX. https://github.com/NVIDIAGameWorks/PhysX
7. PyChrono. https://www.projectchrono.org/pychrono/
8. Simscape. https://www.mathworks.com/products/simscape-multibody.html
9. European Commission: A European Strategy for Plastics in a Circular Economy (2018). https://ec.europa.eu/environment/circular-economy/pdf/plastics-strategy-annex.pdf. Accessed 5 Aug 2022
10. European Commission: European Green Deal (2019–2024). https://ec.europa.eu/info/strategy/priorities-2019-2024/european-green-deal_en. Accessed 5 Aug 2022
11. European Commission: Proposal for a regulation of the European Parliament and of the Council laying down harmonised rules on Artificial Intelligence (Artificial Intelligence Act) and amending certain union legislative acts (2021). https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:52021PC0206&from=EN. Accessed 5 Aug 2022
12. European Commission: A European approach to artificial intelligence (2022). https://digital-strategy.ec.europa.eu/en/policies/european-approach-artificial-intelligence. Accessed 5 Aug 2022
13. Anitescu, M., Tasora, A.: An iterative approach for cone complementarity problems for nonsmooth dynamics. Comput. Optim. Appl. 47(2), 207–235 (2010). https://doi.org/10.1007/s10589-008-9223-4
14. Arrieta, A.B., et al.: Explainable artificial intelligence (XAI): concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion 58, 82–115 (2020)
15. Babenko, B., Yang, M.H., Belongie, S.: Visual tracking with online multiple instance learning. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 983–990. IEEE (2009)
16.
Bolme, D.S., Beveridge, J.R., Draper, B.A., Lui, Y.M.: Visual object tracking using adaptive correlation filters. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2544–2550. IEEE (2010)
17. Brunetti, A., Buongiorno, D., Trotta, G.F., Bevilacqua, V.: Computer vision and deep learning techniques for pedestrian detection and tracking: a survey. Neurocomputing 300, 17–33 (2018)
18. Farnebäck, G.: Two-frame motion estimation based on polynomial expansion. In: Bigun, J., Gustavsson, T. (eds.) SCIA 2003. LNCS, vol. 2749, pp. 363–370. Springer, Heidelberg (2003). https://doi.org/10.1007/3-540-45103-X_50
19. Foschi, E., Bonoli, A.: The commitment of packaging industry in the framework of the European strategy for plastics in a circular economy. Adm. Sci. 9(1), 18 (2019)
20. Girshick, R.: Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448 (2015)
21. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587 (2014)
22. Grabner, H., Grabner, M., Bischof, H.: Real-time tracking via on-line boosting. In: BMVC, vol. 1, p. 6. Citeseer (2006)
23. Henriques, J.F., Caseiro, R., Martins, P., Batista, J.: High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell. 37(3), 583–596 (2014)
24. Iman, M., Arabnia, H.R., Branchinst, R.M.: Pathways to artificial general intelligence: a brief overview of developments and ethical issues via artificial intelligence, machine learning, deep learning, and data science. In: Arabnia, H.R., Ferens, K., de la Fuente, D., Kozerenko, E.B., Olivas Varela, J.A., Tinetti, F.G. (eds.) Advances in Artificial Intelligence and Applied Cognitive Computing. TCSCI, pp. 73–87. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-70296-0_6
25. Kalal, Z., Mikolajczyk, K., Matas, J.: Forward-backward error: automatic detection of tracking failures. In: 2010 20th International Conference on Pattern Recognition, pp. 2756–2759. IEEE (2010)
26. Kalal, Z., Mikolajczyk, K., Matas, J.: Tracking-learning-detection. IEEE Trans. Pattern Anal. Mach. Intell. 34(7), 1409–1422 (2011)
27. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, vol. 25 (2012)
28. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
29. Lukezic, A., Vojir, T., Zajc, L.C., Matas, J., Kristan, M.: Discriminative correlation filter with channel and spatial reliability. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6309–6318 (2017)
30. Marcus, G.: Deep learning: a critical appraisal. arXiv preprint arXiv:1801.00631 (2018)
31. Matthews, C., Moran, F., Jaiswal, A.K.: A review on European Union's strategy for plastics in a circular economy and its impact on food safety. J. Clean. Prod. 283, 125263 (2021)
32. Neubeck, A., Van Gool, L.: Efficient non-maximum suppression. In: 18th International Conference on Pattern Recognition (ICPR 2006), vol. 3, pp. 850–855. IEEE (2006)
33. Nichol, A., Achiam, J., Schulman, J.: On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999 (2018)
34. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788 (2016)
35. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, vol. 28 (2015)
36. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
37. Shekhar, S.S.: Artificial intelligence in automation. Artif. Intell. 3085(06), 14–17 (2019)
38. Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learning for human pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5693–5703 (2019)
39. Sutskever, I., Martens, J., Dahl, G., Hinton, G.: On the importance of initialization and momentum in deep learning. In: International Conference on Machine Learning, pp. 1139–1147. PMLR (2013)
40. Tasora, A., et al.: Chrono: an open source multi-physics dynamics engine. In: Kozubek, T., Blaheta, R., Šístek, J., Rozložník, M., Čermák, M. (eds.) HPCSE 2015. LNCS, vol. 9611, pp. 19–49. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-40361-8_2
41. Tkaczyk, S., Drozd, M., Kędzierski, Ł., Santarek, K.: Study of the stability of palletized cargo by dynamic test method performed on laboratory test bench. Sensors 21(15), 5129 (2021)
42. Wan, J., Li, X., Dai, H.N., Kusiak, A., Martínez-García, M., Li, D.: Artificial-intelligence-driven customized manufacturing factory: key technologies, applications, and challenges. Proc. IEEE 109(4), 377–398 (2021). https://doi.org/10.1109/JPROC.2020.3034808
AI Applications
Transformer Based Motion In-Betweening
Pavithra Sridhar(B), V. Aananth, Madhav Aggarwal, and R. Leela Velusamy
National Institute of Technology - Tiruchirappalli, Tiruchirappalli 620015, TN, India
[emailprotected], [emailprotected]
Abstract. In-betweening is the process of drawing transition frames between temporally sparse keyframes to create a smooth animation sequence. This work presents a novel transformer-based in-betweening technique that serves as a tool for 3D animators. We first show that this problem can be represented as a sequence-to-sequence problem and introduce Tween Transformers, a model that synthesizes high-quality animations using temporally sparse keyframes as input constraints. We evaluate the model's performance via two complementary methods: quantitative and qualitative evaluation. The model is compared quantitatively with the state-of-the-art models using LaFAN1, a high-quality animation dataset. Mean-squared metrics like L2P, L2Q, and NPSS are used for evaluation. Qualitatively, we provide two straightforward methods to assess the model's output. First, we implement a custom Three.js-based motion visualizer to render the ground truth, input, and output sequences side by side for comparison. The visualizer renders custom sequences by specifying skeletal positions at temporally sparse keyframes in JSON format. Second, we build a motion generator to generate custom motion sequences using the model.

Keywords: Motion in-betweening · Kinematics · Transformer · LaFAN1

1 Introduction
Realistic and accurate animation generation is an important but challenging problem with many applications, including animating 3D characters in films, real-time character motion synthesis in video games, and educational applications. One widely used method to generate animations is motion in-betweening, commonly known as tweening. It generates intermediate frames called in-betweens between two temporally sparse keyframes to deliver an illusion of movement by smoothly transitioning from one position to another. In traditional animation pipelines, animators manually draw motion frames between a set of still keyframes indicative of the most critical positions the body must occupy during its motion sequence. Recent improvements include Motion Capture (MOCAP) technologies [9] and query-based methods [15,19] to generate animations. However, MOCAP technology is expensive, and human-drawn animations are preferred. With the rise of computer-aided animation, deep learning-based algorithms have enabled the smooth generation of keyframes from sparse
300
P. Sridhar et al.
frames by learning from large-scale motion capture data. Existing models currently use Recurrent Neural Networks (RNNs) [7,10], Long Short-Term Memory networks (LSTMs) [8], and BERT-based models [3,4]. The complexity of generating character animations includes:

1. Replicating complex human behavior to create realistic characters.
2. The predominantly used transition generation methods are either expensive or inefficient.
3. RNNs/LSTMs, though they can capture long-term dependencies, cannot be parallelized due to the sequential processing of input, resulting in longer training times.
4. RNNs/LSTMs do not support transfer learning, making it hard to use pre-trained models.

Inspired by the concept of self-attention to capture long-term dependencies, this paper proposes a transformer-based model to generate realistic animation sequences. Model generalization constitutes the main effort this framework puts into improving the performance of machine learning predictions. This would be analogous to large text transformer models like GPT-3 [2]. This work not only eases the effort put in by animators but also helps researchers by unblocking transfer learning for the task of in-betweening, thus introducing a level of generalization into the model. Overall, the contributions of this paper can be summarized as follows:¹

1. Represent motion in-betweening as a sequence-to-sequence problem where the input sequence consists of keyframes and the output sequence represents the complete and smoothed motion sequence.
2. Set a baseline for the input sequence by filling the frames between the keyframes with interpolated values.
3. Experiment with the efficiency and viability of using transformers to achieve sequence-to-sequence translation for human motion and compare them with existing results.
4. Evaluate the model against other state-of-the-art models [4,8,16] for the same task using the L2P, L2Q, and NPSS metrics.
5.
Build a visualizer and a motion generator that qualitatively evaluate the output of the model in comparison to the ground truth and input sequences.
2 Related Work
The problem is analogous to machine translation, where sequence-to-sequence (seq2seq) architectures are prevalent [1,18,21]. "Encoder-only" models like BERT [3] are designed to learn the context of a word based on all its surroundings (left and right of the word), making them suitable for feature extraction, sentiment classification, or span prediction tasks, but not for generative tasks like
¹ Code can be found at https://github.com/Pavi114/motion-completion-using-transformers.
Transformer Based Motion InBetweening
301
translation or sequence completion. The pre-training objectives used by encoder-decoder transformers like T5 [17] include a fill-in-the-blank task where the model predicts missing words within a corrupted piece of text, which is analogous to in-betweening when motion sequences replace sentences. Early works in human motion prediction include using Conditional Restricted Boltzmann Machines (RBMs) [20] to encode the sequence information in latent variables and predict using decoders. More recently, many RNN-based approaches like Encoder-Recurrent-Decoder (ERD) networks [5] propose separating spatial encoding and decoding from the temporal dependencies. Other recent approaches investigate new architectures like transformers [13] and loss functions to further improve human motion prediction [6,12]. Initial approaches to motion in-betweening focused on generating missing frames by integrating keyframe information with space-time models [23]. The next widely successful methods for in-betweening adopted a probabilistic approach, framing it as a maximum a posteriori (MAP) optimization problem [14], a dynamical Gaussian process model [22], or Markov models with dynamic autoregressive forests [11]. The latest deep learning approaches include works by Holden et al. [10] and Harvey et al. [7], and helped RNNs dominate this field. The latest work using RNNs focuses on augmenting a Long Short-Term Memory (LSTM) based architecture with time-to-arrival embeddings and a scheduled target noise vector, allowing the system to be robust to target distortions [8]. Some recent work includes BERT-based encoder-only models [3,4] that predict the entire sequence in one pass, and deep learning approaches for interpolation [16]. However, BERT-based models are less effective than encoder-decoder models for generative tasks.
3 Methodology
The following sections detail the model architecture, Tween Transformers, which performs motion frame completion similarly to sentence completion.

3.1 Tween Transformers (TWTR)
The architecture of Tween Transformers (TWTR) consists of four main components:

1. An input masking module
2. An input encoding neural network that encodes each motion sequence and converts the input to a set of sequential tokens
3. A transition generation network that includes a standard transformer comprising encoder and decoder modules with feed-forward and multi-head attention networks
4. An output decoding neural network that computes a sequence of character motion
Fig. 1. Model architecture of TWTR
While the transition generation module learns the temporal dependencies, the input and output encoding networks aim to learn the spatial dependencies between the different body joints for encoding and decoding motion sequences. Finally, the model also uses multiple losses, including a forward kinematics loss, to improve the realism of the generated sequences. It is assumed that the input has both position (x, y, z) and orientation (q0, q1, q2, q3) variables. Therefore, a single pose can be defined with a root position coordinate P ∈ R³ and a quaternion matrix Q ∈ R^{J×4}, where J is the number of joints of the input pose (here, 22). The following sections discuss the model's architecture in detail, as indicated in Fig. 1.

Input Masking. Multiple keyframe gaps k are specified in the model configuration. The frames belonging to a keyframe gap are filled with interpolated values derived from the frames at the two ends of the gap. Two kinds of interpolation are carried out and compared:
– positions and rotations are linearly interpolated;
– positions are linearly interpolated while rotations are spherically interpolated.
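The two interpolation schemes can be sketched as follows; this is a simplified NumPy illustration (not the paper's code), with a hypothetical example rotation and quaternions in (w, x, y, z) order.

```python
import numpy as np

def lerp(a, b, t):
    """Linear interpolation between two position vectors."""
    return (1 - t) * a + t * b

def slerp(q0, q1, t):
    """Spherical linear interpolation between two unit quaternions."""
    q0, q1 = q0 / np.linalg.norm(q0), q1 / np.linalg.norm(q1)
    dot = np.dot(q0, q1)
    if dot < 0:                 # take the shorter arc
        q1, dot = -q1, -dot
    if dot > 0.9995:            # nearly parallel: fall back to lerp + renormalize
        q = lerp(q0, q1, t)
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)
    return (np.sin((1 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)

# Interpolate halfway between the identity and a 90-degree rotation about z
q0 = np.array([1.0, 0.0, 0.0, 0.0])
q1 = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])
q_half = slerp(q0, q1, 0.5)     # a 45-degree rotation about z
```

Slerp keeps a constant angular velocity along the arc between the two rotations, which is why it is preferred over plain linear interpolation of quaternion components.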
Input Encoding. As seen in Fig. 1, model encoding has three modules  Input Sequence Encoding, Positional Encoding, and Keyframe Embedding. 1. Input Sequence Encoding: The input sequence encoder network is a set of three Linear encoders fully connected to twolayer FeedForward Networks (FFN) with ReLU activations. The input sequence encoder takes in the global root position root p, local quaternions q, and global root velocity root v and outputs a set of “sequential tokens”. The hidden sizes of the FFNs are 16, 8, and 8 for q, root p, and root v, respectively. The embedding hyperparameter deﬁnes the output sizes of the FFNs. The outputs from the FFNs are concatenated to form the output of the input sequence encoding network. Equation (1) describes the Linear Encoder, and Eq. (2) describes the Input Sequence Encoder. L(x) = Linear(ReLU(Linear(x))) I(root p, root v, q) = Lp (root p) Lv (root v) Lq (q1 ) ... Lq (qJ )
(1)
(2)
where root p ∈ R³, root v ∈ R³, qᵢ ∈ R⁴, I denotes the Input Sequence Encoder, and L denotes the Linear Encoder.

2. Positional Encoding: Positional encoding, a popular method introduced by Vaswani et al. [21], adds a set of predefined sine and cosine signals to introduce temporal knowledge to the transformer model. The positional encoding for the source Zs and the target Zt is computed using Eq. (3):

z_tta,2i = sin(tta / basis^(2i/d))
z_tta,2i+1 = cos(tta / basis^(2i/d))    (3)

where tta is the number of timesteps until arrival and the basis component influences the rate of change in frequencies along the embedding dimension d. A basis of 10,000 is used.

3. Keyframe Embedding: Following previous works [4], the model incorporates additive keyframe embeddings. The keyframe embeddings Ekf classify the frames in the sequence into keyframes, unknown frames, and ignored frames, represented by the learnable embedding vectors ê0, ê1, and ê2, respectively. The keyframe embeddings are given by Eq. (4), where e^t_kf ∈ {ê0, ê1, ê2} and T is the sequence length. The embeddings are added to the input sequence, similar to positional encodings.

Ekf = [e^1_kf, e^2_kf, ..., e^T_kf]    (4)
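Equation (3) can be sketched as follows. This is an illustrative NumPy implementation; the function name and array layout are our assumptions, not the authors' code:

```python
import numpy as np

def positional_encoding(tta: int, d: int, basis: float = 10_000.0) -> np.ndarray:
    """Sinusoidal encoding keyed on time-to-arrival (tta), as in Eq. (3).

    Even indices 2i hold sin(tta / basis^(2i/d)); odd indices 2i+1 hold
    the matching cosine. d is assumed even here for simplicity.
    """
    i = np.arange(d // 2)
    angles = tta / basis ** (2 * i / d)
    z = np.empty(d)
    z[0::2] = np.sin(angles)   # z_{tta,2i}
    z[1::2] = np.cos(angles)   # z_{tta,2i+1}
    return z
```

The same vector is added to every feature of a frame that is tta timesteps away from the target keyframe, for both source and target sequences.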
304
P. Sridhar et al.
Transformer. A transformer consists of multiple encoder and decoder layers. Each encoder includes a multi-head self-attention layer (MHSA) and a feed-forward network (FFN), and each decoder consists of a masked multi-head self-attention layer (MMHSA), a multi-head attention layer (MHA), and a feed-forward network. The attention function leveraged in the transformer maps a query and a set of key-value pairs, all vectors, to an output. The processing of a single attention head can be represented as follows:

Attention(Q, K, V) = Softmax(QKᵀ / √dk) V    (5)
where Q = WqA represents a query matrix, K = WkA represents a key matrix, and V = WvA represents a value matrix. Wq, Wk, and Wv are the corresponding weight matrices, and dk represents the dimension of the key matrix. The Query matrix can be interpreted as the keyframe for which attention is calculated. The Key and Value matrices represent the keyframes that are "attended to", i.e., how relevant each keyframe is to the query keyframe. In MMHSA, the target is masked before applying the attention mechanism. All the attention outputs are concatenated and sent to the FFN.

Output Decoding. The decoder takes in the concatenated "sequential tokens" output by the Input Sequence Encoder and outputs the global root position root p, local quaternions q, and global root velocity root v. To reverse-engineer the spatial dependencies, each of the three FFNs, one for each output, comprises two linear layers with ReLU activation. The hidden sizes of the FFNs are the same as in the Input Sequence Encoder, and the output sizes are defined by the original dimensions of the three parameters. Equation (6) describes the Output Decoder.

O(x) = (Lp(x[: dp]), Lv(x[dp : dp + dv]), Q)    (6)

Q = [ Lq(x[dp + dv : dp + dv + dq]),
      Lq(x[dp + dv + dq : dp + dv + 2·dq]),
      ...,
      Lq(x[dp + dv + (J − 1)·dq : dp + dv + J·dq]) ]

where dp, dv, and dq are the embedding dimensions for p, v, and q. x[i : j] denotes a tensor containing the values in x from the i-th index to the (j − 1)-th index. J denotes the number of joints in the skeleton, Q ∈ R^{J×4} denotes the tensor of stacked quaternions, O denotes the Output Decoder, and L denotes the Linear Encoder.

3.2 Loss Computation
Given a collection of predicted motion sequences and the ground truth, the in-betweening loss is computed as the scaled sum of two individual losses: Reconstruction loss and Forward Kinematics (FK) loss.
Transformer Based Motion InBetweening
L = αr LR + αfk LFK    (7)
where αr and αfk are constants that balance the disparity of the individual losses. For training we use αr = 100 and αfk = 1.

Reconstruction Loss LR. Reconstruction loss evaluates the ability of the model to "reconstruct" the target sequence from the input sequence. It accounts for the difference between the output and target quaternion values and is computed using an L1 norm. While Harvey et al. [8] compute and sum reconstruction losses for q, x, and contacts, they acknowledge that the most important component is q. Reconstruction loss is computed using Eq. (8).

LR = (1/NT) Σ_{n=0}^{N−1} Σ_{t=0}^{T−1} ‖ q̂_n^t − q_n^t ‖₁    (8)
where q̂_n^t is the rotational quaternion of the predicted motion sequence n at time t, and q_n^t is the corresponding ground-truth quaternion. N refers to the number of sequences, and T refers to the length of each motion sequence.

Forward Kinematics Loss LFK. Forward Kinematics loss compares the difference in the global positions of joints between the ground truth and the model's output. It evaluates the ability of the model to "understand" the relationships between relative angles and global positions. Although the offsets of the various joints in the skeleton are not provided to the model, it learns to respect human geometry and maintain correct posture by minimizing the Forward Kinematics loss. The Forward Kinematics loss is computed using Eq. (9).

LFK = ‖ p̂_global − p_global ‖₁ + ‖ q̂_global − q_global ‖₁    (9)
where p̂_global and q̂_global are derived from the local coordinates using Forward Kinematics, FK(p̂_local, q̂_local), and, similarly, p_global and q_global are derived from the local coordinates using FK(p_local, q_local).

3.3 Training
Following previous works [8,16], the entire dataset was split into windows of maximum length Tmax = 65. To construct each batch, the number of start keyframes is set to 10 and the number of end keyframes to 1. The number of in-between frames n_in is sampled from the range [5, 44] without replacement. The weight associated with the number of in-between frames is set to be inversely proportional to it, w_{n_in} = 1/n_in. This prevents overfitting on windows with a large number of in-between frames: shorter windows must be sampled more often because they are more abundant, and hence harder to overfit, since the number of unique non-overlapping sequences of a given total length 10 + 1 + n_in is approximately inversely proportional to n_in. Finally, given the total sampled sequence length, the sequence start index is sampled uniformly at random in the range [0, Tmax − (1 + 10 + n_in)].
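The window sampling above can be sketched as follows. This is a minimal illustration; the helper name and the use of weighted random choice are our assumptions:

```python
import random

T_MAX, N_START, N_END = 65, 10, 1

def sample_window(rng: random.Random):
    """Draw one training window: the in-between gap n_in is chosen from
    [5, 44] with weight w = 1/n_in, then the start index is uniform over
    the positions where the whole window fits inside T_MAX frames."""
    gaps = list(range(5, 45))
    n_in = rng.choices(gaps, weights=[1.0 / g for g in gaps], k=1)[0]
    total = N_START + N_END + n_in          # 10 start keyframes + 1 end + gap
    start = rng.randrange(0, T_MAX - total + 1)
    return start, n_in
```

With these weights a gap of 5 is drawn roughly nine times as often as a gap of 44, mirroring the relative abundance of short windows.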
Fig. 2. Stills from the Ground Truth, LERP, Model Output, and Smoothed Output sequences at diﬀerent timestamps for the action “Aiming2” performed by subject “Subject5”. Considering the frames at t = 20, it is clear that the output produced by our model resembles the ground truth more than the interpolated sequence.
4 Setup and Experimental Results

4.1 Dataset
The publicly available Ubisoft La Forge Animation (LaFAN1) dataset was used for all the experiments. Introduced by Harvey et al. [8] at Ubisoft, LaFAN1 consists of general motion capture clips in high definition. The motion sequences are in BVH format. The LaFAN1 dataset comprises five subjects, 77 sequences, and 496,672 motion frames at 30 fps, for a total of 4.6 hours. There are around 15 themes, from everyday actions like walking, sprinting, and falling to uncommon actions like crawling, aiming, and a few sports movements. Similar to other works [4,8,16], all sequences of subject five were used for testing and benchmarking, with the remaining used for training.

4.2 Evaluation Metrics
The model is evaluated against the L2P, L2Q, and NPSS metrics used in previous studies, on the subject five sequences of the LaFAN1 dataset. L2P is the average L2 distance between the positions of the predicted motion sequence and the ground truth sequence; Eq. (10) shows the L2P calculation. Similarly, L2Q is the average L2 distance of the global quaternions; Eq. (11) shows the L2Q calculation. A combination of local quaternions, positions, and motion sequence properties is used to compute these metrics.

L2P = (1/NT) Σ_{n=0}^{N−1} Σ_{t=0}^{T−1} ‖ p̂_n^t − p_n^t ‖₂    (10)
Fig. 3. Stills from the Ground Truth, LERP, Model Output, and Smoothed Output sequences at different timestamps for the action "Dance2" performed by subject "Subject5". The dance action is unconventional and full of seemingly random movements. Considering the frames at t = 10, t = 20, and t = 30, the output produced by the model is better at t = 10, the output produced by interpolation is better at t = 20, and neither comes close at t = 30.
L2Q = (1/NT) Σ_{n=0}^{N−1} Σ_{t=0}^{T−1} ‖ q̂_n^t − q_n^t ‖₂    (11)
where q̂_n^t is the rotational quaternion of the predicted motion sequence n at time t and q_n^t is the ground-truth quaternion; similarly, p̂_n^t refers to the position of the predicted motion sequence and p_n^t to the ground-truth position. N refers to the number of sequences, and T refers to the length of each motion sequence. Normalized Power Spectrum Similarity (NPSS) is an approach comparing angular frequencies with the ground truth. It is an Earth Mover Distance (EMD) based metric over the power spectrum, which uses the squared magnitude spectrum values of the Discrete Fourier Transform coefficients. Equation (12) computes the NPSS metric.

NPSS = ( Σ_{i=0}^{N−1} Σ_{j=0}^{T−1} w_{i,j} · emd_{i,j} ) / ( Σ_{i=0}^{N−1} Σ_{j=0}^{T−1} w_{i,j} )    (12)
where emd_{i,j} refers to the EMD, and w_{i,j} refers to the weights. Harvey et al. [8] state that the L2P metric is better than any angular loss for assessing the visual quality of transitions with global displacements, as it weighs the positions of the bones and joints. Hence, they argue that L2P is a much more critical metric than L2Q and NPSS.
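Equations (10) and (11) share the same form and can be sketched with a single helper. This is an illustrative implementation, not the benchmark code:

```python
import numpy as np

def l2_metric(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean L2 distance over N sequences and T frames (Eqs. (10)-(11)).

    pred, gt: shape (N, T, D), where each frame is a flattened vector of
    global joint positions (L2P) or stacked global quaternions (L2Q).
    """
    dists = np.linalg.norm(pred - gt, axis=-1)  # (N, T) per-frame L2 distances
    return float(dists.mean())                  # 1/(N*T) * sum over n, t
```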
Fig. 4. Still from the motion generator
4.3 Data Preprocessing
First, the local position and orientation values are extracted from the BVH files provided in the LaFAN1 dataset [7]. Twenty-two joints are considered for the skeleton model. Forward Kinematics was used to compute the absolute position of each joint from the relative positions (relative to the hip) given in the dataset. Positions are modeled as standard matrices, and orientations are modeled using quaternions. Further, the global positions and root velocity are computed from the local positions using Forward Kinematics.

4.4 Hyperparameters
Most hyperparameters from previous baselines are retained to show the relative improvement in performance obtained by using Transformers. This study also presents a comparison using different interpolation techniques, Linear and Spherical, against the performance of several baseline studies. A batch size of 64 was used for 100 epochs. The Adam optimizer with a learning rate of 10⁻⁴ and a constant dropout of 0.2 was utilized. Keyframe gaps of 5, 15, and 30 were tested to compare the performance of the transformer over longer frame gaps.

4.5 Visualizer and Motion Generator
To qualitatively evaluate the model, a visualizer was built that juxtaposes the ground truth, the interpolated sequence, the output sequence, and a smoothed output sequence of the transformer model. The model's output is stored in JSON format and rendered using this custom web-based visualizer, built from scratch using TypeScript, NodeJs, Express, and ThreeJs. Figures 2 and 3 show sample outputs of the model generated using the visualizer. Further, the motion generator was built using Python,
Fig. 5. (a) Comparison of model performance at keyframe gap = 30 with three commonly used metrics (L2P, L2Q, and NPSS); (b) comparison of L2P losses at various keyframe gaps of the motion in-betweening methods included in this study; (c) comparison of NPSS losses at various keyframe gaps; (d) comparison of L2Q losses at various keyframe gaps.
Flask, Node, and ThreeJs using the visualizer module as a base. The motion generator allows a user to modify keyframes in a given motion sequence and generate in-between frames for it. The plugin consists of a backend Flask server that uses an instance of our model to generate the in-between frames. Figure 4 shows a still from the motion generator where the stick model is animating a generated custom motion sequence.

4.6 Inferences
As expected, SLERP performs better than LERP. However, the performance at 30 fps is almost comparable, as seen in Fig. 5a, because the spherical motion becomes almost linear over very short timescales. As seen in Table 1, the Tween Transformer model outperforms the interpolation models and performs closely to the baseline models. Figures 5b, 5d, and 5c confirm that Tween Transformers follow a similar trend to that of
Table 1. The Tween Transformer model is compared with baseline motion in-betweening methods using the L2Q, L2P, and NPSS metrics for various in-between sequence lengths. Interpolation-based methods are included as part of the study. TT (Ours) refers to the Tween Transformer model.

Length           L2Q                    L2P                    NPSS
                 5     15    30         5     15    30         5       15      30
Zero Velocity    0.56  1.10  1.51       1.52  3.69  6.60       0.0053  0.0522  0.2318
SLERP            0.22  0.62  0.98       0.37  1.25  2.32       0.0023  0.0391  0.2013
TG_rec           0.21  0.48  0.83       0.32  0.85  1.82       0.0025  0.0304  0.1608
TG_complete      0.17  0.42  0.69       0.23  0.65  1.28       0.0020  0.0258  0.1328
SSMCT_local      0.17  0.44  0.71       0.23  0.74  1.37       0.0019  0.0291  0.1430
SSMCT_Global     0.14  0.36  0.61       0.22  0.56  1.10       0.0016  0.0234  0.1222
Δ-Interpolator   0.11  0.32  0.57       0.13  0.47  1.00       0.0014  0.0217  0.1217
TT (Ours)        0.16  0.39  0.65       0.21  0.59  1.21       0.0019  0.0261  0.1358
other models. Experiments show that training is crucial to obtain a visually smooth output. Moving Average Smoothing was observed to have minimal effect on the output sequence as the model trains.
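The LERP/SLERP comparison above hinges on the fact that the two coincide as the rotation angle shrinks. A minimal quaternion sketch (our own illustration, not the paper's code):

```python
import numpy as np

def lerp(q0, q1, t):
    """Linear interpolation between unit quaternions, renormalized."""
    q = (1.0 - t) * q0 + t * q1
    return q / np.linalg.norm(q)

def slerp(q0, q1, t, eps=1e-8):
    """Spherical linear interpolation: constant angular velocity."""
    dot = float(np.dot(q0, q1))
    if dot < 0.0:                 # flip to take the shorter arc
        q1, dot = -q1, -dot
    if dot > 1.0 - eps:           # nearly parallel: LERP is numerically safer
        return lerp(q0, q1, t)
    theta = np.arccos(dot)
    return (np.sin((1 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)
```

For the small per-frame rotations seen at 30 fps, slerp() and lerp() agree to within numerical noise, which is consistent with the near-identical interpolator scores in Fig. 5a.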
5 Conclusion
This work presents the Tween Transformer, a novel, robust, transformer-based motion in-betweening technique that serves as a tool for 3D animators and overcomes the challenges faced by existing RNN-based models [8,16], including sequential training, capturing long-term dependencies, and transfer learning. The generic model treats in-betweening as a sequence-to-sequence problem and solves it using a transformer-based encoder-decoder architecture. It unboxes the potential of robust transformer-based models for motion in-betweening applications. To conclude, the results encourage the application of low-resource, cost-efficient models and enable further developments with the scope of transfer learning on the generalized implementation.
References

1. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: Bengio, Y., LeCun, Y. (eds.) 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015, Conference Track Proceedings (2015). https://arxiv.org/abs/1409.0473
2. Brown, T., et al.: Language models are few-shot learners. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901. Curran Associates, Inc. (2020). https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
3. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota (2019). https://doi.org/10.18653/v1/N19-1423
4. Duan, Y., et al.: Single-shot motion completion with transformer. arXiv preprint arXiv:2103.00776 (2021)
5. Fragkiadaki, K., Levine, S., Malik, J.: Recurrent network models for kinematic tracking. CoRR abs/1508.00271 (2015). https://arxiv.org/abs/1508.00271
6. Gopalakrishnan, A., Mali, A.A., Kifer, D., Giles, C.L., Ororbia II, A.G.: A neural temporal model for human motion prediction. CoRR abs/1809.03036 (2018). https://arxiv.org/abs/1809.03036
7. Harvey, F.G., Pal, C.: Recurrent transition networks for character locomotion. In: SIGGRAPH Asia 2018 Technical Briefs, SA 2018. Association for Computing Machinery, New York, NY, USA (2018). https://doi.org/10.1145/3283254.3283277
8. Harvey, F.G., Yurick, M., Nowrouzezahrai, D., Pal, C.: Robust motion in-betweening. ACM Trans. Graph. 39(4), 1–12 (2020). https://doi.org/10.1145/3386569.3392480
9. Holden, D.: Robust solving of optical motion capture data by denoising. ACM Trans. Graph. 37(4), 1–12 (2018). https://doi.org/10.1145/3197517.3201302
10. Holden, D., Saito, J., Komura, T.: A deep learning framework for character motion synthesis and editing. ACM Trans. Graph. 35(4), 1–11 (2016). https://doi.org/10.1145/2897824.2925975
11. Lehrmann, A.M., Gehler, P.V., Nowozin, S.: Efficient nonlinear Markov models for human motion. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2014)
12. Liu, Z., et al.: Towards natural and accurate future motion prediction of humans and animals. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9996–10004 (2019). https://doi.org/10.1109/CVPR.2019.01024
13. Martínez-González, A., Villamizar, M., Odobez, J.: Pose transformers (POTR): human motion prediction with non-autoregressive transformers. CoRR abs/2109.07531 (2021). https://arxiv.org/abs/2109.07531
14. Min, J., Chen, Y.L., Chai, J.: Interactive generation of human animation with deformable motion models. ACM Trans. Graph. 29(1), 1–12 (2009). https://doi.org/10.1145/1640443.1640452
15. Müller, M., Röder, T., Clausen, M.: Efficient content-based retrieval of motion capture data. ACM Trans. Graph. 24(3), 677–685 (2005). https://doi.org/10.1145/1073204.1073247
16. Oreshkin, B.N., Valkanas, A., Harvey, F.G., Ménard, L.S., Bocquelet, F., Coates, M.J.: Motion in-betweening via deep Δ-interpolator. arXiv e-prints arXiv:2201.06701 (2022)
17. Dhariwal, P., Sastry, G., McCandlish, S.: EncT5: fine-tuning T5 encoder for discriminative tasks (2021)
18. Ren, M., Kiros, R., Zemel, R.S.: Exploring models and data for image question answering. In: Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS 2015, vol. 2, pp. 2953–2961. MIT Press, Cambridge, MA, USA (2015)
19. Tanuwijaya, S., Ohno, Y.: TF-DF indexing for mocap data segments in measuring relevance based on textual search queries. Vis. Comput. 26(6–8), 1091–1100 (2010). https://doi.org/10.1007/s00371-010-0463-9
20. Taylor, G.W., Hinton, G.E.: Factored conditional restricted Boltzmann machines for modeling motion style. In: Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, pp. 1025–1032. Association for Computing Machinery, New York, NY, USA (2009). https://doi.org/10.1145/1553374.1553505
21. Vaswani, A., et al.: Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS 2017, pp. 6000–6010. Curran Associates Inc., Red Hook, NY, USA (2017)
22. Wang, J.M., Fleet, D.J., Hertzmann, A.: Gaussian process dynamical models for human motion. IEEE Trans. Pattern Anal. Mach. Intell. 30(2), 283–298 (2008). https://doi.org/10.1109/TPAMI.2007.1167
23. Witkin, A., Kass, M.: Spacetime constraints. In: Proceedings of the 15th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH 1988, pp. 159–168. Association for Computing Machinery, New York, NY, USA (1988). https://doi.org/10.1145/54852.378507
A Logic-Based Tool for Dynamic Generation and Classification of Musical Content

Antonio Lieto(B), Gian Luca Pozzato(B), Alberto Valese, and Mattia Zito

Dipartimento di Informatica, Università di Torino, Turin, Italy
{antonio.lieto,gianluca.pozzato,alberto.valese}@unito.it, [emailprotected]
Abstract. In this work we present NERVOUS, an intelligent recommender system exploiting a probabilistic extension of a Description Logic of typicality to dynamically generate novel contents in AllMusic, a comprehensive and indepth resource about music, providing data about albums, bands, musicians and songs (https://www.allmusic.com). The tool can be used for both the generation of novel music genres and styles, described by a set of typical properties characterizing them, and the reclassification of the available songs within such new genres.
1 Introduction

The ability of generating new knowledge via conceptual combination concerns high-level capacities associated with creative thinking and problem solving, and it represents an open challenge for artificial intelligence [2]. Indeed, dealing with this problem requires, from an AI perspective, the harmonization of two conflicting requirements: on the one hand, the need for syntactic and semantic compositionality; on the other hand, the need to capture typicality effects. However, such requirements can hardly be accommodated in standard symbolic systems, including formal ontologies [4]. According to a well-known argument [18], prototypes, namely commonsense conceptual representations based on typical properties, are not compositional. Consider a concept like pet fish: it results from the composition of the concept pet and of the concept fish; however, the prototype of pet fish cannot result from the composition of the prototypes of pet and fish. For instance, a typical pet is furry, whereas a typical fish is grayish, but a typical pet fish is neither furry nor grayish (typically, it is red). This is a paradigmatic example of the difficulty to address when building formalisms and systems trying to imitate this combinatorial human ability. Examples of such difficulties concern handling exceptions to attribute inheritance and handling the possible inconsistencies arising between conflicting properties of the concepts to be combined. In this work we continue the activity started in [9, 10] with the definition of a Typicality Description Logic for concept combination (TCL, typicality-based compositional logic), which we have exploited in order to build a goal-oriented framework for knowledge invention in the cognitive architecture SOAR [8, 11, 12], as well as for the generation and the suggestion of novel editorial content in multimedia broadcasting [3] and in the artistic domain of paintings, poetic content [15], and museum items [13].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. A. Dovier et al. (Eds.): AIxIA 2022, LNAI 13796, pp. 313–326, 2023. https://doi.org/10.1007/978-3-031-27181-6_22

In the Description Logic TCL, "typical" properties can be directly specified by means of
a "typicality" operator T enriching the underlying DL, and a TBox can contain inclusions of the form T(C) ⊑ D to represent that "typical Cs are also Ds". As a difference with standard DLs, in the logic TCL one can consistently express exceptions and reason about defeasible inheritance as well. Typicality inclusions are also equipped with a real number p ∈ (0.5, 1] representing the probability/degree of belief in such a typical property: this allows us to define a semantics inspired by the DISPONTE semantics [20] characterizing probabilistic extensions of DLs, which in turn is used in order to describe different scenarios where only some typicality properties are considered. Given a KB containing the description of two concepts CH and CM occurring in it, we then consider only some scenarios in order to define a revised knowledge base, enriched by typical properties of the combined concept C ≡ CH ⊓ CM, by also implementing a HEAD/MODIFIER heuristics coming from cognitive semantics. In this work we exploit the logic TCL in order to dynamically generate novel knowledge by means of a mechanism for commonsense combination, which we apply to data extracted from AllMusic (https://www.allmusic.com), a comprehensive and in-depth resource about music. In particular, we introduce NERVOUS (dyNamic gEneratoR of noVel cOntent in mUSic), a tool which is able to perform the following activities:

– it builds the prototypical description of 18 basic musical genres (Blues, Classical, Country, Easy Listening, Holiday, and so on), by extracting data about musical genres and songs from AllMusic by means of a crawler.
Such prototypes are formalized by means of a TCL knowledge base, whose TBox contains both rigid inclusions of the form BasicGenre ⊑ Concept, in order to express essential desiderata but also constraints, for instance Childrens ⊑ ¬Sex (due to law restrictions, sexual contents for kids are forbidden), as well as prototypical properties of the form p :: T(BasicGenre) ⊑ TypicalConcept, representing typical concepts of a given genre, where p is a real number in the range (0.5, 1] expressing the degree of belief of such a concept in items belonging to that genre: for instance, 0.84 :: T(AvantGarde) ⊑ Cerebral is used to express that typical songs belonging to the Avant-garde genre are Cerebral (in some sense) with a probability/degree of belief of 84%, and such a degree is automatically extracted by NERVOUS from the data available on AllMusic for that genre;
– it allows the generation of new musical genres by exploiting the reasoning capabilities of the logic TCL in order to generate new derived genres as the result of the creative combination of two basic or derived ones;
– it implements a mechanism of reclassification of the available songs of AllMusic within the new genres generated in the previous phase. Intuitively, a song is classified as belonging to the new genre if its moods and themes match the typical properties of the prototype of such a genre, obtaining a score of compatibility higher than 0. A positive match, namely when a property has a high score in the song and is a typical property of the genre, provides a positive score, whereas a negative one, e.g. when the song has a high score for a property which is negated in the prototype of
the genre, produces a negative score. A song having at least one positive match and no negative ones has an overall positive score and is then recommended by NERVOUS for that genre. We have tested NERVOUS by reclassifying the available songs in the highlights of AllMusic with respect to the newly generated genres, as well as with an evaluation, in the form of a controlled user study experiment, of the feasibility of using the obtained reclassifications as recommended contents. The obtained results are encouraging and pave the way to many possible further improvements and research directions.
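The matching mechanism just described can be sketched as follows. This is a minimal illustration; the function name, the exact scoring formula, and the 0.5 threshold for "a high score in the song" are our assumptions, not the NERVOUS code base:

```python
def score_song(song_props, prototype, high=0.5):
    """Compatibility of a song with a derived genre's prototype.

    song_props: dict property -> strength in [0, 1] assigned by AllMusic
    prototype:  dict property -> (degree p in (0.5, 1], negated?)
    Returns (recommend?, score): a song is recommended when it has at
    least one positive match and no negative ones (overall score > 0).
    """
    score, positive, negative = 0.0, False, False
    for prop, strength in song_props.items():
        if prop not in prototype or strength < high:
            continue  # property absent from the prototype, or weak in the song
        p, negated = prototype[prop]
        if negated:
            score -= strength * p
            negative = True
        else:
            score += strength * p
            positive = True
    return positive and not negative, score
```

For instance, with a prototype containing 0.84 :: T(AvantGarde) ⊑ Cerebral, a song scoring 0.9 on Cerebral would contribute 0.9 · 0.84 to its compatibility under this sketch.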
2 Combining Concepts: The Description Logic TCL

The tool NERVOUS exploits the Description Logic TCL [9, 10] for the generation of new genres as the combination of two existing ones. The language of TCL extends the basic DL ALC by typicality inclusions of the form p :: T(C) ⊑ D, where p ∈ (0.5, 1] is a real number representing its degree of belief, whose meaning is that "we believe with degree p that, normally, Cs are also Ds". We avoid degrees p ≤ 0.5, which would be misleading for typicality inclusions, since typical knowledge is known to come with a low degree of uncertainty. We define a knowledge base K = ⟨R, T, A⟩ where R is a finite set of rigid properties of the form C ⊑ D, T is a finite set of typicality properties of the form p :: T(C) ⊑ D, where p ∈ (0.5, 1] ⊆ R is the degree of belief of the typicality inclusion, and A is the ABox, i.e. a finite set of formulas of the form either C(a) or R(a, b), where a, b ∈ O and R ∈ R. The Description Logic TCL relies on the DL of typicality ALC + TR introduced in [5], which allows one to describe the prototype of a concept, in this case a musical genre. As a difference with standard DLs, in the logic ALC + TR one can consistently express exceptions and reason about defeasible inheritance as well. For instance, a knowledge base can consistently express that "typical students are young persons", whereas "normally, senior students are not young persons", by T(Student) ⊑ Young and T(SeniorStudent) ⊑ ¬Young, given a knowledge base also containing the standard inclusion SeniorStudent ⊑ Student, representing that all senior students are students. The semantics of the T operator is characterized by the properties of rational logic [7], recognized as the core properties of nonmonotonic reasoning.
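The student example above can be written as a small TCL knowledge base; the degrees of belief below are ours, chosen for illustration only:

```latex
\begin{align*}
  &\mathit{SeniorStudent} \sqsubseteq \mathit{Student}
     && \text{(rigid inclusion in } \mathcal{R}\text{)}\\
  &0.9 :: \mathbf{T}(\mathit{Student}) \sqsubseteq \mathit{Young}
     && \text{(typicality inclusion in } \mathcal{T}\text{)}\\
  &0.8 :: \mathbf{T}(\mathit{SeniorStudent}) \sqsubseteq \lnot\mathit{Young}
     && \text{(exception, consistently expressed)}
\end{align*}
```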
The Description Logic ALC + TR is characterized by a minimal model semantics corresponding to an extension to DLs of the notion of rational closure defined in [7] for propositional logic: the idea is to adopt a preference relation among ALC + TR models, where intuitively a model is preferred to another one if it contains fewer exceptional elements, as well as a notion of minimal entailment restricted to models that are minimal with respect to such a preference relation. As a consequence, the operator T inherits well-established properties like specificity and irrelevance; in the example, the Description Logic ALC + TR allows one to infer that T(Student ⊓ Italian) ⊑ Young (being Italian is irrelevant with
respect to being young) and, if one knows that Rachel is a typical senior student, to infer that she is not young, giving preference to the most specific information. A model M of TCL extends standard ALC models by a preference relation among domain elements as in the logic of typicality [5]. In this respect, x < y means that x is "more normal" than y, and the typical members of a concept C are the minimal elements of C with respect to this relation. An element x ∈ ΔI is a typical instance of some concept C if x ∈ CI and there is no C-element in ΔI more normal than x. Definition 1 (Model of TCL). A model M is any structure ΔI , 0. This song will then be recommended by NERVOUS, as can be seen in Fig. 2, where a picture of NERVOUS's interface is shown. It is worth noticing that, in order to provide a "white-box" recommender system, each recommended song is equipped with an explanation, relying on the pipeline implemented by the system of concept combination. Let us conclude this section by observing that the fact that a recommended song belongs to both original, basic genres that have been combined is far from obvious: indeed, the system NERVOUS also suggests the song "Moanin" by Art Blakey & the Jazz Messengers, which is classified by AllMusic as belonging to the genre Jazz. In our opinion, this is a further interesting mechanism providing the required component of surprise in the recommendation, justified by the fact that the description of the song matches the one of the novel genre, the latter only partially inheriting properties from the basic genres whose combination led to such a new genre. The tool NERVOUS is available at https://github.com/Mattia98779/Nervous. A preliminary version of a web interface is available at https://mattia98779.github.io/#/: by means of such a web interface, a user can select two basic genres and then obtain the list of suggested songs, together with an explanation.
6 Evaluation and Discussion

In this section we provide a preliminary evaluation of our tool NERVOUS. We have tested it in two different ways. The first evaluation is completely automatic and concerns the capability of the system to generate novel hybrid genres that can be populated with the original content of the AllMusic platform via a reclassification mechanism involving the 599 songs of the platform. In this case, the success criterion concerns the avoidance of the creation of empty boxes corresponding to the newly generated combined genres. More in detail, at least 69 songs are reclassified by the tool NERVOUS for each derived music genre (the second genre containing "few" songs contains 138
Fig. 3. Some statistics about the reclassification of NERVOUS.
items), with an average of 307 songs per derived genre. This is summarized in Fig. 3 (picture on the left), whereas from the picture on the right we can observe that only 7 out of the 599 songs on AllMusic (those with very few attributes) are not reclassified in any genre by the system, whereas all the other ones (98.83%) are reclassified in at least one genre. The second evaluation consisted in a user study involving 22 persons (11 females, 11 males, aged 14–72) who evaluated a total of 260 recommendations generated by the system. It is worth observing that this is one of the most commonly used methodologies for the evaluation of recommender systems, based on controlled small-group analysis [22]. The idea was to estimate the satisfaction of the potential users of the platform when exposed to the contents of the novel categories suggested by NERVOUS: all the participants were volunteers recruited using an availability sampling strategy. Participants were all naive to the experimental procedure and to the aims of the study. This evaluation was carried out as a classical "one to one" lab-controlled experiment (i.e. one person at a time with one expert interviewer) and we adopted a thinking-aloud protocol, consisting of recording the verbal explanations provided by the people while executing a given laboratory task [16, 17]. In this setting, the users started the interview by indicating a couple of preferred genres among those available in AllMusic. This selection triggered both the activation of a novel hybrid prototypical genre by NERVOUS and the corresponding reclassification of the AllMusic songs based on such selection. The output of the system, pruned to show the top 10 best results, was then evaluated with a 1–10 voting scale expressing the satisfaction with the received recommendations. The results we have obtained seem promising: the average score assigned by the users to the recommendations of the reclassified elements is 7.44 out of 10.
This score was calculated by considering, for each new category, the score assigned to the top 10 reclassified songs, since these were provided to the users as recommendations for the novel genres. It is worth observing that, in a few cases, the creative classification performed by the tool NERVOUS has led to counterintuitive results. As an example, the song "I'm Eighteen" by Alice Cooper, known as "The Godfather of Shock Rock", is classified as belonging to the derived genre resulting from the combination of Rap and Avant-garde. We strongly conjecture that these situations could easily be avoided by introducing constraints on some genres by means of rigid negated properties.
324
A. Lieto et al.
Furthermore, most of the people we interviewed observed that AllMusic adopts a debatable choice of basic genres, in particular the fact that Pop and Rock, two of the most popular music genres in the world, are grouped into a single category. This immediately implies some difficulties in combining its prototype with that of another basic genre. Moreover, some of the (low-ranked) items corresponded to old songs. This follows immediately from the fact that few recent songs belong to the highlights of AllMusic, since they have received a lower number of scores from the portal's users. Notably, the first two of the above-mentioned issues are not directly related to NERVOUS, since: i) the system cannot know whether the association between description and item is coherent, but just reproduces (for the recommended output) the correspondence already in place in AllMusic; ii) the recommendation of old editorial content is based on the actual dataset of AllMusic (collecting about six hundred songs). This limitation can be overcome by simply adding an additional filter on the period preferences of the users.
7 Conclusions and Future Works

In this work we have presented NERVOUS, a knowledge-based system for the dynamic generation of novel contents about music, exploiting the reasoning mechanism of the logic TCL in order to generate, reclassify and suggest novel content genres in the context of AllMusic, an online platform collecting in-depth information about music genres, albums, musicians and songs. The core component of the system NERVOUS relies on CoCoS, a tool for combining concepts in the logic TCL. According to [23], recommender systems "try to identify the need and preferences of users, filter the huge collection of data accordingly and present the best suited option before the users by using some well-defined mechanism". The literature is rich in proposals, which we can partition into three main groups of recommender systems:
– collaborative filtering, which exploits similarities of usage patterns among mutually similar users;
– content-based filtering, which exploits content similarity;
– hybrid filtering, which combines the two approaches.
It is easy to observe that the tool NERVOUS can be considered a hybrid recommender system, since in its current form it makes use of content descriptions as input. However, it differs from state-of-the-art approaches since it exploits the reasoning power of a logic framework capable of representing new intuitive principles influencing user preferences and usage attitudes, which cannot be derived from the pure analysis of content and/or the comparison of similar users. The system NERVOUS has been tested in a twofold evaluation showing promising results for both the automatic evaluation and the user acceptability of the recommended items. With these evaluation results at hand, we can observe that NERVOUS represents a good approach to addressing the well-known filter bubble effect [19], since it introduces mechanisms that add a sort of "plausible creativity" and a "reasonable serendipity" to content discovery by users.
In future research, we aim at extending our work in several directions. On the one hand, we aim at studying the application of optimization techniques in [1] in order to
A Logic-Based Tool for Dynamic Generation and Classification of Musical Content
325
improve the efficiency of CoCoS and, as a consequence, of the proposed knowledge generation system. On the other hand, we aim at conducting a large-scale experiment to further validate the effectiveness of the proposed approach, including people with sensory impairments, with the objective of promoting empathy, cohesion and inclusion across social groups partially neglected by state-of-the-art recommender systems.
References

1. Alberti, M., Bellodi, E., Cota, G., Riguzzi, F., Zese, R.: cplint on SWISH: probabilistic logical inference with a web browser. Intelligenza Artificiale 11(1), 47–64 (2017). https://doi.org/10.3233/IA-170106
2. Boden, M.A.: Creativity and artificial intelligence. Artif. Intell. 103(1–2), 347–356 (1998)
3. Chiodino, E., Di Luccio, D., Lieto, A., Messina, A., Pozzato, G.L., Rubinetti, D.: A knowledge-based system for the dynamic generation and classification of novel contents in multimedia broadcasting. In: De Giacomo, G., et al. (eds.) ECAI 2020 – 24th European Conference on Artificial Intelligence, Santiago de Compostela, Spain, 29 August – 8 September 2020. Frontiers in Artificial Intelligence and Applications, vol. 325, pp. 680–687. IOS Press (2020). https://doi.org/10.3233/FAIA200154
4. Frixione, M., Lieto, A.: Representing and reasoning on typicality in formal ontologies. In: Ghidini, C., Ngomo, A.N., Lindstaedt, S.N., Pellegrini, T. (eds.) Proceedings of the 7th International Conference on Semantic Systems, pp. 119–125. ACM International Conference Proceeding Series, ACM (2011). https://doi.org/10.1145/2063518.2063534
5. Giordano, L., Gliozzi, V., Olivetti, N., Pozzato, G.L.: Semantic characterization of rational closure: from propositional logic to description logics. Artif. Intell. 226, 1–33 (2015). https://doi.org/10.1016/j.artint.2015.05.001
6. Hampton, J.A.: Inheritance of attributes in natural concept conjunctions. Memory Cognition 15(1), 55–71 (1987)
7. Lehmann, D., Magidor, M.: What does a conditional knowledge base entail? Artif. Intell. 55(1), 1–60 (1992). https://doi.org/10.1016/0004-3702(92)90041-U
8. Lieto, A., Perrone, F., Pozzato, G.L., Chiodino, E.: Beyond subgoaling: a dynamic knowledge generation framework for creative problem solving in cognitive architectures. Cogn. Syst. Res. 58, 305–316 (2019). https://doi.org/10.1016/j.cogsys.2019.08.005
9. Lieto, A., Pozzato, G.L.: A description logic of typicality for conceptual combination. In: Ceci, M., Japkowicz, N., Liu, J., Papadopoulos, G.A., Raś, Z.W. (eds.) ISMIS 2018. LNCS (LNAI), vol. 11177, pp. 189–199. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01851-1_19
10. Lieto, A., Pozzato, G.L.: A description logic framework for commonsense conceptual combination integrating typicality, probabilities and cognitive heuristics. J. Exp. Theor. Artif. Intell. 32(5), 769–804 (2020). https://doi.org/10.1080/0952813X.2019.1672799
11. Lieto, A., Pozzato, G.L., Perrone, F.: A dynamic knowledge generation system for cognitive agents. In: 31st IEEE International Conference on Tools with Artificial Intelligence, ICTAI 2019, Portland, OR, USA, 4–6 November 2019, pp. 676–681. IEEE (2019). https://doi.org/10.1109/ICTAI.2019.00099
12. Lieto, A., Pozzato, G.L., Perrone, F., Chiodino, E.: Knowledge capturing via conceptual reframing: a goal-oriented framework for knowledge invention. In: Proceedings of the 10th ACM Conference on Knowledge Capture, K-CAP 2019, Marina del Rey, pp. 109–114. ACM (2019)
13. Lieto, A., Pozzato, G.L., Striani, M., Zoia, S., Damiano, R.: DEGARI 2.0: a diversity-seeking, explainable, and affective art recommender for social inclusion. Cogn. Syst. Res. 77, 1–17 (2023). https://doi.org/10.1016/j.cogsys.2022.10.001
14. Lieto, A., Pozzato, G.L., Valese, A.: COCOS: a typicality based concept combination system. In: Felli, P., Montali, M. (eds.) Proceedings of the 33rd Italian Conference on Computational Logic, Bolzano, Italy, 20–22 September 2018. CEUR Workshop Proceedings, vol. 2214, pp. 55–59. CEUR-WS.org (2018). https://ceur-ws.org/Vol-2214/paper6.pdf
15. Lieto, A., Pozzato, G.L., Zoia, S., Patti, V., Damiano, R.: A commonsense reasoning framework for explanatory emotion attribution, generation and reclassification. Knowl.-Based Syst. 227, 107166 (2021)
16. Newell, A., Shaw, J.C., Simon, H.A.: Report on a general problem solving program. In: IFIP Congress, vol. 256, p. 64. Pittsburgh, PA (1959)
17. Newell, A., Simon, H.A.: Human Problem Solving, vol. 104, no. 9. Prentice-Hall, Englewood Cliffs (1972)
18. Osherson, D.N., Smith, E.E.: On the adequacy of prototype theory as a theory of concepts. Cognition 9(1), 35–58 (1981)
19. Pariser, E.: The Filter Bubble: What the Internet Is Hiding from You (2012)
20. Riguzzi, F., Bellodi, E., Lamma, E., Zese, R.: Probabilistic description logics under the distribution semantics. Semant. Web 6(5), 477–501 (2015). https://doi.org/10.3233/SW-140154
21. Riguzzi, F., Bellodi, E., Lamma, E., Zese, R.: Reasoning with probabilistic ontologies. In: Yang, Q., Wooldridge, M. (eds.) Proceedings of IJCAI 2015, pp. 4310–4316. AAAI Press (2015). https://ijcai.org/proceedings/2015
22. Shani, G., Gunawardana, A.: Evaluating recommendation systems. In: Ricci, F., Rokach, L., Shapira, B., Kantor, P.B. (eds.) Recommender Systems Handbook, pp. 257–297. Springer, Boston, MA (2011). https://doi.org/10.1007/978-0-387-85820-3_8
23. Sohail, S.S., Siddiqui, J., Ali, R.: Classifications of recommender systems: a review. Eng. Sci. Technol. Rev. 10(4), 132–153 (2017)
Why Can Neural Networks Recognize Us by Our Finger Movements?

Elena Mariolina Galdi1(B), Marco Alberti2, Alessandro D'Ausilio3, and Alice Tomassini4

1 Dipartimento di Ingegneria, Università di Ferrara, Ferrara, Italy [emailprotected]
2 Dipartimento di Matematica e Informatica, Università di Ferrara, Ferrara, Italy [emailprotected]
3 Dipartimento di Neuroscienze e Riabilitazione, Università di Ferrara, Ferrara, Italy [emailprotected]
4 Istituto Italiano di Tecnologia, Ferrara, Italy [emailprotected]
Abstract. Neurobehavioral evidence suggests that human movement may be characterized by relatively stable individual differences (i.e. individual motor signatures or IMS). While most research has focused on the macroscopic level, all attempts to extract IMS have overlooked the fact that functionally relevant discontinuities are clearly visible when zooming into the microstructure of movements. These recurrent (2–3 Hz) speed breaks (submovements) reflect an intermittent motor control policy that might provide a far more robust way to identify IMSs. In this study, we show that individuals can be recognized from motion capture data using a neural network. In particular, we trained a classifier (a convolutional neural network) on a data set composed of time series recording the positions of index finger movements of 60 individuals; in tests, the neural network achieves an accuracy of 80%. We also investigated how different preprocessing techniques affect the accuracy in order to assess which motion features more strongly characterize each individual and, in particular, whether the presence of submovements in the data can improve the classifier's performance.

Keywords: Explainable AI · Convolutional neural networks · Motion capture · Movement analysis · Individual motor signature

1 Introduction
This work was partly supported by the University of Ferrara FIRD 2022 project "Analisi di serie temporali da motion capture con tecniche di machine learning".
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
A. Dovier et al. (Eds.): AIxIA 2022, LNAI 13796, pp. 327–341, 2023. https://doi.org/10.1007/978-3-031-27181-6_23

The possibility of recognizing an individual on the basis of his/her movements or gestures has been studied in depth in past years due to its significant applications in the security and medical areas. Much research has focused on whole-body
movements such as gait [22,23], and most of these analyzed two-dimensional input such as images or videos. Gohar [14] proposed the use of Inertial Measurement Units (IMUs) to identify individuals based on gait. The reason for this different approach, namely a one-dimensional time series instead of two-dimensional image analysis, lies in the fact that "image-based gait analysis often fails to extract quality measurements of an individual's motion patterns owing to problems related to variations in viewpoint, illumination (daylight), clothing, worn accessories, etc." [14]. The latter study showed that individuals could be identified based on an analysis of their gait with considerable accuracy (approximately 75%). However, whole-body data may not be available for person identification in many applications. In addition, very little research has been devoted to investigating individual motor signatures during distal hand movement. The objective of our research is to investigate whether it is possible to identify a subject from the simplest possible movement (i.e., index finger extension and flexion) using a convolutional neural network (CNN). Although CNNs were initially proposed to classify images [17–19,32], the choice of a CNN for multi-class classification has been shown to be effective even for time-series tasks [8,10,34]. The data we used derive from a recent neuroscience project on interpersonal behavioral coordination across multiple temporal scales [30]. That study was aimed at investigating whether the recurrent discontinuities that characterize the microstructure of human movement composition (tiny corrective speed bumps in the range of 2–3 Hz, often called submovements) are finely co-regulated when participants are required to synchronize at the macroscopic scale only. The experimental settings and the resulting speed profiles are shown in Fig. 1. The goal of the present work is very different, and we thus adopt a radically different analytical approach.
In fact, here we investigated whether these microscopic movement characteristics can be used for the identification of individual movement fingerprints. In our research, we first wanted to determine whether finger movements contain sufficient information to allow the neural network to recognize the individual who generated the movements. In addition, we intend to carry out an in-depth post-hoc interpretation [24] of our results to understand the movement characteristics that are most relevant for identification. This latter goal is fully in line with the current interest in explainable artificial intelligence (XAI) [1,26]. The reasons that led to the emergence of this research field lie in the necessity of providing an explanation before making any decision informed by an opaque algorithm. In medical applications, a reasonable explanation is sometimes more important than a correct diagnosis [2,11,15]. The same applies to the security domain [9], and considering the implications of recognizing the identity of an individual from minimal bodily movements, it is self-evident how important explainability is. The European GDPR makes this point very clear, considering that more than one article (13–15, 22) focuses on the importance of correctly motivating the decisions and forecasts made by any automated decision-making process [28]. XAI in machine learning is a well-known problem [4,5,7,13,31], which has also received significant attention in the field of Deep Learning [3,20,24].
Samek [27] provided an extensive description of this field and the tools developed for it. Simic [29] reviewed XAI methods for neural time-series classification, highlighting the class activation maps (CAM) method [35], which was also used by Goodfellow [15]. Nevertheless, we explored a simpler path that, based on neurophysiological knowledge of the multiscale nature of human movements [30], grants easier and more straightforward interpretability. To investigate which movement features (i.e. temporal scales) are most relevant for the neural network, we decided to decompose the time series on the basis of their spectral content, and we evaluated the impact of this and other key preprocessing choices on recognition accuracy. To the best of our knowledge, there is no evidence in the literature of a similar analytic approach to solving an analogous problem (i.e., person identification from minimal kinematic data). Time-series analysis in the frequency domain is, instead, standard in speech technologies [12]. In particular, we show which frequencies produce the largest impact on the ability of a CNN to recognize individuals from the movement of their finger. The remainder of this paper is organized as follows. The experimental settings are described in Sect. 2. We show the most significant experimental results in Sect. 3. We conclude the paper (Sect. 4) with a discussion of the results and possible directions for future research.
2 Experimental Settings
In this section, after a brief introduction to the dataset we worked on (Sect. 2.1), we explain our application's architecture (Sect. 2.2). We then go deeper into two main parts: first, we list the preprocessing techniques we chose (Sect. 2.3), namely series segmentation, series filtering, and series normalization; in the second part, the neural network model (Sect. 2.4) is described.

2.1 Dataset
The dataset we have been working on comes from previous research; all experimental instrumentation is described in depth in [30]. In total, 60 participants, forming 30 couples, performed a movement synchronization task. As shown in Fig. 1, participants were asked to keep their right index fingers pointing toward each other (without touching) and perform rhythmic flexion-extension movements around the metacarpophalangeal joint as synchronously as possible, either in-phase (toward the same direction) or anti-phase (toward opposite directions). Participants were instructed to maintain a slow movement pace (full movement cycle: single flexion/extension movements) by having them practice in a preliminary phase with a reference metronome set at 0.25 Hz. Each participant also performed the same finger movements alone (solo condition) with the only requirement of complying with the instructed pace (0.25 Hz). Finger movements were recorded using retroreflective markers tracked by a 3D real-time motion capture system (Vicon), providing continuous kinematic data sampled at 300
Fig. 1. On the left, the experimental setup for data collection. From top to bottom, there are three settings for the solo, in-phase, and anti-phase tasks. The right panel shows the speed profile for the three different cases. Figure courtesy of Tomassini et al. [30]
Hz. Each trial had a duration of 2.5 min, for a total of 45,000 points per time series. In addition, each of the three different experiments (solo, dyadic in-phase, dyadic anti-phase) was repeated twice. This means that for each subject we had six series, each made of 45,000 points. As a first approximation, in this work we considered the finger movement to be essentially one-dimensional, along the x-axis. To augment our dataset, we segmented each series. As we will describe later in Sect. 2.3, we decided to test two different types of cutting: cutting at the maxima of the position data in order to obtain a complete movement (index extension and flexion), and cutting subseries of fixed length with a specified gap smaller than the subseries dimension. Considering natural movement variability, the first choice further requires a resizing of the subseries, because the convolutional network needs all inputs to have the same dimension. The second method was used by Yang [33] to segment the time-series signal into a collection of shorter pieces. To investigate which movement features (i.e., temporal scales) are most relevant for the neural network, we decomposed the time series based on their spectral content. In particular, we applied different types of filters and studied their influence on the output accuracy of the CNN.
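The resulting data layout can be sketched as follows. This is an illustrative reconstruction only: the container and variable names are ours, and random numbers stand in for the real recordings.

```python
import numpy as np

# 60 subjects, 6 series each (2 repetitions x 3 conditions),
# 45,000 points per series (300 Hz x 2.5 min), x-axis position only
N_SUBJECTS, N_SERIES, N_POINTS = 60, 6, 45_000

rng = np.random.default_rng(42)
data = rng.standard_normal((N_SUBJECTS, N_SERIES, N_POINTS), dtype=np.float32)

# one row per trial, plus the subject index as the 60-class label
flat = data.reshape(N_SUBJECTS * N_SERIES, N_POINTS)
labels = np.repeat(np.arange(N_SUBJECTS), N_SERIES)
```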
Two different types of filtering operations have been investigated: a moving-average window and a band-pass frequency filter. The two techniques are described in Sect. 2.3. Another important variable is whether or not to apply normalization to the signal; see Sect. 2.3. Finally, we decided to investigate whether differentiating the signal would provide different information. Thus, we investigated the accuracy of our neural networks with different types of input, namely position, speed, and acceleration data.

2.2 Application Architecture
We decided to develop our AI program in Python. The software is essentially split into two main components: a first module for preprocessing and a second module for the neural network model. The first module, as described in Sect. 2.1, has a composite structure with different possible choices for preprocessing the data and different parameters to set. As shown in Fig. 2, it is possible to independently choose the series type, filter method, type of segmentation, and whether the series must be normalized. Depending on the choices made, it is necessary to specify different input parameters. Table 1 lists the different parameters required for each preprocessing choice. Once the data preprocessing has been completed, the segmented series are sent
Fig. 2. Modular structure: Series Type = Speed, Position, Acceleration; Filter Methods = MAW (Moving Average Window), Band-Pass Frequency Filter; Cut Choice = Complete movement (extension + flexion), Fixed-dimension windows sliding with a known gap

Table 1. Parameters needed as a function of the different choices

Choice              Parameter
MAW                 Window dimension
Band pass           Low and high frequency cuts
Extension-flexion   Resized subseries dimension
Sliding window      Subseries dimension and gap
to the neural network. The TensorFlow Keras library was used to generate the neural network model. A convolutional neural network (CNN) has been built for multi-class classification. CNNs are known to be among the most suitable architectures for classifying data into multiple classes. Usually, a CNN is applied to 2D input data (i.e., images) [16,17], but it also shows good results for 1D input (i.e., time-series data) [14].

2.3 Preprocessing Techniques
Series Cut. We considered two different methods to cut the series. As a first approach, we cut the time series at the maximum finger positions on the x-axis. In this way, each subseries represents the complete movement, extension and flexion, of the index finger (see Fig. 3). This type of cutting is functionally defined, and each subseries contains information about the entire movement. However, this type of cut creates subseries of different lengths that cannot be directly used as input to a CNN. This means that we had to resize the subseries using the TimeSeriesResampler component from the Python library tslearn.preprocessing. Figure 3 shows the result of resizing different subseries.
Fig. 3. On the left: the blue line represents the position as a function of time, while the orange line is the derived speed; the red spots highlight where the cut was made, corresponding to the maximum finger positions. On the right: the effect of resizing after cutting at functionally defined kinematic landmarks; the blue line is the original time series and the orange one is the resized version (Color figure online)
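The first cutting strategy (cut at the position maxima, then resample every subseries to a common length) can be sketched as follows. The function names are ours, and np.interp stands in for tslearn.preprocessing.TimeSeriesResampler; the toy signal mimics the instructed 0.25 Hz pace sampled at 300 Hz.

```python
import numpy as np

def cut_at_maxima(position, min_separation=300):
    """Cut the series at local maxima of the position (the red spots in
    Fig. 3), so each subseries spans one full extension/flexion cycle."""
    peaks, last = [], -min_separation
    for i in range(1, len(position) - 1):
        if position[i - 1] < position[i] >= position[i + 1] and i - last >= min_separation:
            peaks.append(i)
            last = i
    return [position[a:b] for a, b in zip(peaks, peaks[1:])]

def resample(subseries, size=600):
    """Linear resampling to a common length (stand-in for tslearn's
    TimeSeriesResampler), since the CNN needs equally sized inputs."""
    old = np.linspace(0.0, 1.0, num=len(subseries))
    new = np.linspace(0.0, 1.0, num=size)
    return np.interp(new, old, subseries)

# toy position trace: 0.25 Hz flexion/extension sampled at 300 Hz for 12 s
t = np.arange(0, 12, 1 / 300.0)
pos = np.sin(2 * np.pi * 0.25 * t)
cycles = [resample(c) for c in cut_at_maxima(pos)]
```

Each detected cycle, whatever its natural duration, ends up with the same number of samples, which is the property the CNN input layer requires.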
The second option for cutting the time series is to decide a priori the dimension of the subseries we want to obtain and the gap between two consecutive subseries, as shown in Fig. 4. We applied this method to investigate whether there was any hidden information in the time series that was not locked to the entire movement cycle (extension-flexion). However, we identified two main issues with this method. First, the dataset size increases dramatically, with a significant increase in program execution times. The second point is that, whereas in the previous
case, we had the whole set of subseries and could randomly choose the data for training and testing, in this case, to avoid overlapping of the data, we had to cut the main series into two parts: the first 75% for the training set and the last 25% for testing. Consequently, we cannot exclude the possibility that data organized in this way is biased in some way.
Fig. 4. Example time series with the sliding windows and gaps highlighted, showing how the second segmentation strategy works.
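The second strategy, fixed-length windows advanced by a gap smaller than the window, together with the chronological 75/25 split used to avoid overlap between training and test data, could look like this (a sketch with hypothetical names and parameter values):

```python
import numpy as np

def sliding_windows(series, length=600, gap=150):
    """Cut fixed-length subseries, advancing by `gap` samples; since
    gap < length, consecutive windows overlap."""
    starts = range(0, len(series) - length + 1, gap)
    return np.stack([series[s:s + length] for s in starts])

def chronological_split(series, train_frac=0.75, length=600, gap=150):
    """Split the raw series *before* windowing (first 75% for training,
    last 25% for testing), so no window appears in both sets."""
    cut = int(len(series) * train_frac)
    return (sliding_windows(series[:cut], length, gap),
            sliding_windows(series[cut:], length, gap))

series = np.random.default_rng(0).standard_normal(45_000)
train_w, test_w = chronological_split(series)
```

Splitting before windowing is the step that prevents leakage: a random split after windowing would let overlapping windows straddle the train/test boundary.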
Series Filtering. We also investigated the influence of filtering the time series. To this end, we applied two different types of filters to our data. The moving average window (MAW) is a very basic tool commonly used to smooth time-series signals. For each point of the set, a fixed number of subsequent points is averaged, and the result replaces the starting point. Obviously, we obtain different signals depending on the dimension of the window over which we calculate the average (see Fig. 5); the larger it is, the smoother the signal will be.
Fig. 5. Result of the MAW with different window dimensions.
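The MAW described above amounts to a convolution with a uniform kernel; a minimal sketch (our own helper, not the paper's code):

```python
import numpy as np

def moving_average(series, window):
    """Replace each point with the mean of the next `window` points;
    'valid' mode drops the last window - 1 points."""
    return np.convolve(series, np.ones(window) / window, mode="valid")

# noisy oscillation: a larger window yields a smoother signal
rng = np.random.default_rng(1)
noisy = np.sin(np.linspace(0, 2 * np.pi, 1000)) + 0.1 * rng.standard_normal(1000)
smooth = moving_average(noisy, 100)
```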
We also analyzed the effects of different frequency filters on the accuracy of the CNN. Essentially, we created a band-pass filter where we could set the low and high frequencies. We created a Butterworth filter using the predefined function
in Python's scipy.signal library, with the order set to 4. Thus, it is possible to set both low and high frequencies. If the low-frequency cut is set to 0, the filter is applied as a low-pass filter (Fig. 6).
Fig. 6. Effect of different types of filters on the raw speed time series.
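The band-pass/low-pass behaviour described above can be sketched with scipy.signal (a 4th-order Butterworth, applied here with zero-phase filtfilt; the helper name and the toy 40 Hz component are our own assumptions):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def frequency_filter(series, low, high, fs=300, order=4):
    """Butterworth filter: band-pass when low > 0, low-pass when low == 0."""
    if low == 0:
        b, a = butter(order, high, btype="lowpass", fs=fs)
    else:
        b, a = butter(order, [low, high], btype="bandpass", fs=fs)
    return filtfilt(b, a, series)  # zero-phase: no temporal shift

# toy speed trace: 0.25 Hz fundamental plus a 40 Hz component, 300 Hz sampling
t = np.arange(0, 20, 1 / 300.0)
sig = np.sin(2 * np.pi * 0.25 * t) + 0.5 * np.sin(2 * np.pi * 40 * t)
low_passed = frequency_filter(sig, 0, 20)   # keeps the fundamental
band_passed = frequency_filter(sig, 2, 4)   # removes both components
```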
Series Normalization. As a final step, we explored the effect of normalization on our dataset in terms of accuracy gains. In general, when the data have a comparable scale, as in our case where the movement and its rhythm are predefined, normalization does not improve the accuracy, because it can destroy important information hidden in the dataset. Nevertheless, we performed experiments both with and without normalization. The maximum and minimum values used for normalization were computed over the entire series.

2.4 Neural Network Architecture
We decided to apply a CNN, as suggested in the literature, for multi-class classification of time-series data [8,10,34]. The structure of the neural network is described in Table 2. We used RMSprop as the optimizer and performed early stopping to avoid overfitting. We compared the results obtained with this neural network with those of a similar network built with PyTorch instead of tensorflow.keras. The results of the two networks are very similar.
Table 2. CNN model.

1D convolution   Kernel: 3@64   ActFunct: ReLU
1D convolution   Kernel: 3@64   ActFunct: ReLU
1D max pooling   Pool size: 2
1D convolution   Kernel: 3@64   ActFunct: ReLU
1D convolution   Kernel: 3@64   ActFunct: ReLU
1D max pooling   Pool size: 2
Dense
Dense
Softmax
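To make the layer arithmetic of Table 2 concrete, the following pure-numpy forward pass traces how one 600-sample subseries flows through the conv/pool stack. It uses random, untrained weights and our own helper names; it only illustrates the shapes and is not the authors' tensorflow.keras implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, n_filters=64, kernel=3):
    """'valid' 1D convolution with random weights, then ReLU.
    x: (length, channels) -> (length - kernel + 1, n_filters)."""
    w = 0.1 * rng.standard_normal((kernel, x.shape[1], n_filters))
    out = np.stack([np.tensordot(x[i:i + kernel], w, axes=([0, 1], [0, 1]))
                    for i in range(x.shape[0] - kernel + 1)])
    return np.maximum(out, 0.0)

def max_pool(x, size=2):
    """Non-overlapping max pooling over the time axis."""
    n = (x.shape[0] // size) * size
    return x[:n].reshape(-1, size, x.shape[1]).max(axis=1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = rng.standard_normal((600, 1))   # one subseries, one channel
x = conv1d(conv1d(x))               # 600 -> 598 -> 596
x = max_pool(x)                     # 596 -> 298
x = conv1d(conv1d(x))               # 298 -> 296 -> 294
x = max_pool(x)                     # 294 -> 147
features = x.reshape(-1)            # flatten: 147 * 64 = 9408 values
hidden = np.maximum(features @ (0.01 * rng.standard_normal((features.size, 128))), 0)
probs = softmax(hidden @ (0.01 * rng.standard_normal((128, 60))))  # 60 subjects
```

The final softmax over 60 outputs corresponds to the 60-class subject identification task; the hidden layer size (128) is our assumption, as the paper does not report the dense-layer widths.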
3 Results
To evaluate the impact of different parameters on recognition performance, we examined the accuracy of our CNN. First, we show how accuracy is affected by the choice of the type of time series (see Table 3 below).

Table 3. The accuracy for the different series types, calculated using a low-pass filter set at 50 Hz and the cut based on the maximum finger position.

Series type    w Norm   w/o Norm
Position       35%      2%
Speed          71%      69%
Acceleration   65%      2%
The improvement in accuracy for acceleration and position once normalization is applied is evident. Nevertheless, the accuracy on the speed data was still the highest; therefore, speed was used as a reference for the following experiments. The experiment comparing the two types of series segmentation did not show significant differences in terms of accuracy, while the computational time was drastically longer when the data were cut with sliding windows. This convinced us to choose the cut at the maximum finger position as the standard segmentation method for our series.
The MAW, as shown in Fig. 7, yields maximum performance at 0, i.e., when no moving average has actually been applied. As the dimension of its window increases, the accuracy decreases until it reaches 30% with a 100-point window. Recall that when the MAW window includes 100 points, only the main shape of the movement remains visible (see Fig. 5).
Fig. 7. The ﬁgure shows how the accuracy changed as a function of the number of points included in the moving average.
Figure 8 reports how the accuracy changes with different frequency filters. As we can clearly see, the fundamental frequency (i.e., 0.25 Hz, the instructed finger flexion-extension rhythm) is the most meaningful frequency: if we remove it from the signal, no recognition can be done. The fact that each individual is characterized by their own preferred (self-paced) tapping tempo is well known in the neurophysiology literature [25]. In addition, our experiments show that this frequency alone is not sufficient, and the accuracy increases as we add more frequencies. Interestingly, in neurophysiology it is well known [6] that in movement, albeit with the differences that may exist between moving a finger or a leg, frequencies above 15 Hz begin to be attenuated, and from 20–30 Hz onwards there is no physiological relevance anymore. Instead, in our experiments we still see further increases when adding frequencies above 30 Hz, which means that the network is still learning something. Future in-depth analysis will have to investigate what our neural network is learning in the range of 30 Hz to 70 Hz. More importantly, at 20 Hz the accuracy is already 65%, which is a very good performance for classification among 60 classes. If we look at Fig. 9, we can see the results obtained by applying a frequency-domain filter and a simple MAW. We obtain similar results with a 1 Hz band low-pass filter and a MAW with a window of 100 points (approximately 30% accuracy). Instead, if we did not apply any MAW or filter, or we used a low-pass filter with a band higher than 60 Hz, we found a maximum accuracy of approximately 75%. Because one of our initial goals was to investigate the role of submovements in defining individual motor signatures, we focused our attention
Fig. 8. The figure reports how the accuracy changes for different band-pass filters: each curve corresponds to a filter with a specific low-frequency cut (0 Hz, 2 Hz, 4 Hz, 6 Hz, 8 Hz), while the x-axis reports the high-frequency cut.
on the 2–4 Hz frequencies in Fig. 10. Although we did not notice any significant variation in the accuracy for the band-pass filter with a low cut at 2 Hz and a high cut at 4 Hz, we could clearly observe a significant slope increase for the low-pass filter around these frequencies. As explained earlier, the fundamental frequency (0.25 Hz) probably contains most of the information (slightly less than 30% accuracy on its own). However, the performance is far from its plateau; rather, the model improves largely by adding the submovement range. Future research should further investigate this aspect.
4 Discussion
The proposed work demonstrates that it is possible to recognize subjects starting from their index finger movements. Interestingly, we achieved the same accuracy obtained by Gohar [14], even though that work dealt with human gait instead of finger movement. In addition, we found that the fundamental frequency is undoubtedly the pivotal aspect in the recognition of subjects, but not by itself. Higher frequencies contribute significantly to an increase in accuracy, but only in the presence of the fundamental frequency. For instance, we have not yet explained the gap between the 30% accuracy obtained with the fundamental frequency only and the 75% obtained with a low-pass filter with a bandwidth of 60–70 Hz (Fig. 11). At this point of the research, it is not yet fully demonstrated whether submovements play a central role in the recognition process, but we have some clues. The slope of the accuracy curve may provide some hints for future investigations. It is possible to design more targeted experiments to investigate a range of frequencies with greater granularity. However, this work first had to
E. M. Galdi et al.
Fig. 9. Comparison between the results obtained by applying the frequency-domain filter and the MAW.
Fig. 10. Accuracy of the band-pass filter with different low-frequency cuts (0 Hz, 2 Hz, 4 Hz, 6 Hz, 8 Hz).
Fig. 11. The figure shows that the accuracy increases as the frequencies change. In particular, the slope of the accuracy from 0 to 4 Hz is shown in red, and that from 6 to 20 Hz is shown in blue. (Color figure online)
test several design choices, such as data type, segmentation, normalization, or filtering strategy, which, as we demonstrated here, have a dramatic impact on model performance. After all these tests, we can say that our approach, with few
key design choices, has an interesting potential in recognizing individual motor signatures. In our future work, we will build on this work to better investigate the role of the different time scales of movement composition, and to explore the interaction between macroscopic and microscopic movement features in defining individual movement fingerprints. Moreover, we do not exclude the possibility of answering this question by applying more structured tools, such as class activation maps [35] or a clustering method based on PLiF, as suggested by Li et al. [21], in order to determine which other signal characteristics affect CNN classification the most.
References

1. Adadi, A., Berrada, M.: Peeking inside the black-box: a survey on explainable artificial intelligence (XAI). IEEE Access 6, 52138–52160 (2018). https://doi.org/10.1109/ACCESS.2018.2870052
2. Ahmad, M.A., Eckert, C., Teredesai, A.: Interpretable machine learning in healthcare. In: Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2018, pp. 559–560. Association for Computing Machinery, New York (2018). https://doi.org/10.1145/3233547.3233667
3. Assaf, R., Schumann, A.: Explainable deep neural networks for multivariate time series predictions. In: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, pp. 6488–6490. International Joint Conferences on Artificial Intelligence Organization, Macao (2019). https://doi.org/10.24963/ijcai.2019/932
4. Baehrens, D., Schroeter, T., Harmeling, S., Kawanabe, M., Hansen, K.: How to explain individual classification decisions, p. 29 (2010)
5. Burkart, N., Huber, M.F.: A survey on the explainability of supervised machine learning. J. Artif. Intell. Res. 70, 245–317 (2021). https://doi.org/10.1613/jair.1.12228
6. Burke, R.E.: Motor units: anatomy, physiology, and functional organization, pp. 345–422. Wiley (2011). https://doi.org/10.1002/cphy.cp010210, https://onlinelibrary.wiley.com/doi/abs/10.1002/cphy.cp010210
7. Burrell, J.: How the machine 'thinks': understanding opacity in machine learning algorithms. Big Data Soc. 3(1), 205395171562251 (2016). https://doi.org/10.1177/2053951715622512
8. Cui, Z., Chen, W., Chen, Y.: Multi-scale convolutional neural networks for time series classification (2016)
9. Ernst, C.: Artificial intelligence and autonomy: self-determination in the age of automated systems. In: Wischmeyer, T., Rademacher, T. (eds.) Regulating Artificial Intelligence, pp. 53–73. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-32361-5_3
10. Ismail Fawaz, H., Forestier, G., Weber, J., Idoumghar, L., Muller, P.A.: Deep learning for time series classification: a review. Data Min. Knowl. Disc. 33(4), 917–963 (2019). https://doi.org/10.1007/s10618-019-00619-1
11. Foster, K.R., Koprowski, R., Skufca, J.D.: Machine learning, medical diagnosis, and biomedical engineering research - commentary. Biomed. Eng. Online 13(1), 94 (2014). https://doi.org/10.1186/1475-925X-13-94
12. Gee, A.H., Garcia-Olano, D., Ghosh, J., Paydarfar, D.: Explaining deep classification of time-series data with learned prototypes, p. 8 (2019)
13. Gilpin, L.H., Bau, D., Yuan, B.Z., Bajwa, A., Specter, M., Kagal, L.: Explaining explanations: an overview of interpretability of machine learning. In: 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pp. 80–89 (2018). https://doi.org/10.1109/DSAA.2018.00018
14. Gohar, I., et al.: Person re-identification using deep modeling of temporally correlated inertial motion patterns. Sensors 20(3), 949 (2020). https://doi.org/10.3390/s20030949
15. Goodfellow, S.D., Goodwin, A., Greer, R., Laussen, P.C., Mazwi, M., Eytan, D.: Towards understanding ECG rhythm classification using convolutional neural networks and attention mappings, p. 18 (2018)
16. Heenaye-Mamode Khan, M., et al.: Multi-class classification of breast cancer abnormalities using deep convolutional neural network (CNN). PLOS One 16(8), 1–15 (2021). https://doi.org/10.1371/journal.pone.0256500
17. Hu, Y., Sokolova, M.: Convolutional neural networks in multi-class classification of medical data, p. 13 (2020)
18. Kim, Y.: Convolutional neural networks for sentence classification (2014)
19. LeCun, Y., et al.: Backpropagation applied to handwritten zip code recognition. Neural Comput. 1(4), 541–551 (1989). https://doi.org/10.1162/neco.1989.1.4.541
20. Leventi-Peetz, A.M., Östreich, T.: Deep learning reproducibility and explainable AI (XAI) (2022)
21. Li, L., Prakash, B.A., Faloutsos, C.: Parsimonious linear fingerprinting for time series. Proc. VLDB Endow. 3(1–2), 385–396 (2010). https://doi.org/10.14778/1920841.1920893
22. Little, J.J., Boyd, J.E.: Recognizing people by their gait: the shape of motion, p. 33 (1998)
23. Park, G., Lee, K.M., Koo, S.: Uniqueness of gait kinematics in a cohort study. Sci. Rep. 11(1), 15248 (2021). https://doi.org/10.1038/s41598-021-94815-z
24. Preece, A.: Asking 'Why' in AI: explainability of intelligent systems – perspectives and challenges. Intell. Syst. Account. Financ. Manage. 25(2), 63–72 (2018). https://doi.org/10.1002/isaf.1422
25. Repp, B.H., Su, Y.H.: Sensorimotor synchronization: a review of recent research (2006–2012). Psychon. Bull. Rev. 20(3), 403–452 (2013). https://doi.org/10.3758/s13423-012-0371-2
26. Ribeiro, M.T., Singh, S., Guestrin, C.: "Why should I trust you?": explaining the predictions of any classifier (2016)
27. Samek, W., Montavon, G., Lapuschkin, S., Anders, C.J., Müller, K.R.: Explaining deep neural networks and beyond: a review of methods and applications. Proc. IEEE 109(3), 247–278 (2021). https://doi.org/10.1109/JPROC.2021.3060483
28. Selbst, A.D., Powles, J.: Meaningful information and the right to explanation. Int. Data Priv. Law 7(4), 233–242 (2017). https://doi.org/10.1093/idpl/ipx022
29. Šimić, I., Sabol, V., Veas, E.: XAI methods for neural time series classification: a brief review (2021)
30. Tomassini, A., et al.: Interpersonal synchronization of movement intermittency. iScience 25(4), 104096 (2022). https://doi.org/10.1016/j.isci.2022.104096
31. Vale, D., El-Sharif, A., Ali, M.: Explainable artificial intelligence (XAI) post-hoc explainability methods: risks and limitations in non-discrimination law. AI Ethics (2022). https://doi.org/10.1007/s43681-022-00142-y
32. Woan Ching, S.L., et al.: Multiclass convolution neural network for classification of COVID-19 CT images. Comput. Intell. Neurosci. 2022, 1–15 (2022). https://doi.org/10.1155/2022/9167707
33. Yang, J.B., Nguyen, M.N., San, P.P., Li, X.L., Krishnaswamy, S.: Deep convolutional neural networks on multichannel time series for human activity recognition, p. 7 (2015)
34. Zheng, Y., Liu, Q., Chen, E., Ge, Y., Zhao, J.L.: Time series classification using multi-channels deep convolutional neural networks. In: Li, F., Li, G., Hwang, S., Yao, B., Zhang, Z. (eds.) WAIM 2014. LNCS, vol. 8485, pp. 298–310. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-08010-9_33
35. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2921–2929 (2016). https://doi.org/10.1109/CVPR.2016.319
Miscellany
Labelled Sequent Calculi for Conditional Logics: Conditional Excluded Middle and Conditional Modus Ponens Finally Together

Nicola Olivetti1, Nikola Panic2, and Gian Luca Pozzato2(B)

1
Aix Marseille Université, CNRS, ENSAM, Université de Toulon, LSIS UMR 7296, Marseille, France
[emailprotected]
2 Dipartimento di Informatica, Università di Torino, Turin, Italy
[emailprotected], [emailprotected]

Abstract. We introduce labelled sequent calculi for Conditional Logics with a selection function semantics. Conditional Logics are a sort of generalization of multimodal logics where modalities are labelled by formulas of the same language. Recently, they have received renewed attention and have found several applications in knowledge representation and artificial intelligence. In a previous work, we considered the basic system CK and extensions with the well-known conditions ID, MP, CS and CEM, with the exception of those admitting both conditions CEM and MP, obtaining labelled sequent calculi called SeqS. Here we provide calculi for the whole cube of the extensions of CK generated by the above axioms, including also those with both CEM and MP: the basic idea is that of replacing the rule dealing with CEM in SeqS, which performs a label substitution in both its premises, by a new one that avoids such a substitution and adopts a conditional formula on the right-hand side of a sequent as its principal formula. We have also implemented the proposed calculi in Prolog following the "lean" methodology; we have then tested the performances of the new prover, called CondLean2022, and compared them with those of CondLean, an implementation of SeqS, on the common systems. The performances of CondLean2022 are promising and seem to be better than those of CondLean, witnessing that the proposed calculi also provide a more efficient theorem prover for Conditional Logics.
1 Introduction

Conditional Logics have a long history, starting with the seminal works by [5, 17, 18, 24], and [4] in the seventies. Recently, Conditional Logics have found a renewed interest in several fields of artificial intelligence and knowledge representation, from hypothetical reasoning to belief revision, from diagnosis to non-monotonic reasoning and planning [6, 8–16, 23]. Conditional Logics are extensions of classical logic by a binary operator ⇒, called conditional operator, used in order to express conditional formulas of the form A ⇒ B. Similarly to modal logics, the semantics of Conditional Logics can be defined in terms of possible world structures. In this respect, Conditional Logics can be seen as a generalization of modal logics (or a type of multimodal logic) where the conditional operator is a sort of modality indexed by a formula of the same language. However,

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
A. Dovier et al. (Eds.): AIxIA 2022, LNAI 13796, pp. 345–357, 2023. https://doi.org/10.1007/978-3-031-27181-6_24
N. Olivetti et al.
in contrast with modal logics, the lack of a universally accepted semantics has led to a partial underdevelopment of proof methods and theorem provers for these logics. An effort in the direction of filling this gap is provided in [19]. The semantics considered in this work is the selection function semantics introduced by Nute in [18], where truth values are assigned to formulas depending on a world. Intuitively, the selection function f selects, for a world w and a formula A, the set of worlds f(w, A) which are "most similar to w" given the information A. In normal conditional logics, f depends on the set of worlds satisfying A rather than on A itself, so that f(w, A) = f(w, A′) whenever A and A′ are true in the same worlds. A conditional formula A ⇒ B is true in w whenever B is true in every world selected by f for A and w. With the selection function semantics at hand, CK is the fundamental system, and it plays the same role as the system K in modal logic: formulas valid in CK are exactly those that are valid in every selection function model. Extensions are then obtained by imposing restrictions on the selection function. In [19], a labelled sequent calculus for CK and some standard extensions with conditions ID (conditional identity), MP (conditional modus ponens), CEM (conditional excluded middle), and CS (conditional strong centering) are considered, as well as most of the combinations of them. The proposed calculi, called SeqS, are modular and, in some cases, optimal. The authors also introduce CondLean, a theorem prover implementing the calculi SeqS in Prolog. In [19], however, all the systems including both the axioms CEM and MP are neglected: the reason is that the proof of cut elimination, needed in order to prove the completeness of the calculi, does not work when such axioms are considered together.
In this paper we provide labelled sequent calculi, that we call SeqS22, for the whole cube of the extensions of CK generated by the above mentioned axioms, including those with both CEM and MP, filling the existing gap. The basic idea is that of replacing the rule dealing with CEM in SeqS, which performs a label substitution in both its premises, by a new one that avoids such a substitution and adopts a conditional formula on the right-hand side of a sequent as its principal formula. We show that one can derive a decision procedure from the cut-free calculi, providing a constructive proof of decidability of the logics considered. By estimating the size of the finite derivations of a given sequent, we also obtain a polynomial space complexity bound for these logics. Furthermore, we sketch an implementation of the proposed calculi SeqS22: the program, called CondLean2022, is implemented in Prolog and is inspired by the "lean" methodology, whose aim is to write short programs and exploit the power of Prolog's engine as much as possible: in this respect, every clause of a single predicate, called prove, implements an axiom or rule of the calculi, and the proof search is provided for free by the mere depth-first search mechanism of Prolog, without any additional ad hoc mechanism. We have tested the performances of CondLean2022 and compared them with those of CondLean, obtaining encouraging results that allow us to conclude that the new rule for CEM, on the one hand, makes it possible to conclude the proof of cut elimination also in systems with MP and, on the other hand, by avoiding label substitution, leads to a significant improvement of the performance of the prover.
2 Conditional Logics with Selection Function Semantics In this section we briefly recall propositional Conditional Logics. A propositional conditional language L contains: (i) a set of propositional variables ATM ; (ii) the constants
⊥ and ⊤; (iii) a set of connectives ¬ (unary) and ∧, ∨, →, ⇒ (binary). Formulas of L include formulas of classical logic ¬A, A ∧ B, A ∨ B, A → B, to which we add conditional formulas of the form A ⇒ B. We define the selection function semantics as follows: given a non-empty set of possible worlds W, the selection function f selects, for a world w and a formula A, the set of worlds of W which are closer to w given the information A. A conditional formula A ⇒ B holds in a world w if the formula B holds in all the worlds selected by f for w and A.

Definition 1 (Selection function semantics). A model is a triple M = ⟨W, f, [ ]⟩ where: i) W is a non-empty set of worlds; ii) f is the selection function f : W × 2^W −→ 2^W; iii) [ ] is the evaluation function, which assigns to an atom P ∈ ATM the set of worlds where P is true, and is extended to the other formulas in the usual way for classical connectives, whereas for conditional formulas we have [A ⇒ B] = {w ∈ W | f(w, [A]) ⊆ [B]}.

It is worth noticing that we have defined f taking [A] rather than A (i.e. f(w, [A]) rather than f(w, A)) as an argument; this is equivalent to defining f on formulas, i.e. f(w, A), but imposing that if [A] = [A′] in the model, then f(w, A) = f(w, A′). This condition is called normality. The semantics above characterizes the basic conditional logic CK. An axiomatization of this system is given by:

– any axiomatization of classical propositional calculus;
– (Modus Ponens): from A and A → B infer B;
– (RCEA): from A ↔ B infer (A ⇒ C) ↔ (B ⇒ C);
– (RCK): from (A1 ∧ · · · ∧ An) → B infer (C ⇒ A1 ∧ · · · ∧ C ⇒ An) → (C ⇒ B).

As for modal logics, we can consider extensions of CK by assuming further properties on the selection function. We consider the following ones:

Logic   Axiom                    Model condition
ID      A ⇒ A                    f(w, [A]) ⊆ [A]
CS      (A ∧ B) → (A ⇒ B)        w ∈ [A] → f(w, [A]) ⊆ {w}
CEM     (A ⇒ B) ∨ (A ⇒ ¬B)      |f(w, [A])| ≤ 1
MP      (A ⇒ B) → (A → B)        w ∈ [A] → w ∈ f(w, [A])
The above axiomatizations are complete with respect to the respective semantics [18]. It is worth noticing that:

Proposition 1. In systems with both axioms (CEM) and (MP), axiom (CS) is derivable.

Proof. By (CEM) we have that |f(w, [A])| ≤ 1. By (MP) we have that, if w ∈ [A], then w ∈ f(w, [A]). Therefore, it follows that if w ∈ [A], then f(w, [A]) = {w}, satisfying the (CS) condition.
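Definition 1 can be made concrete with a small executable sketch. The tuple encoding of formulas and the example selection functions below are our own assumptions for illustration, not part of the paper.

```python
def ext(model, formula):
    """[A]: the set of worlds where `formula` holds."""
    W, _, _ = model
    return frozenset(w for w in W if holds(model, w, formula))

def holds(model, w, formula):
    """Truth at world w; formulas are tuples: ('atom', p), ('not', F),
    ('and', F, G), and ('cond', A, B) for the conditional A => B."""
    _, val, f = model
    op = formula[0]
    if op == 'atom':
        return w in val[formula[1]]
    if op == 'not':
        return not holds(model, w, formula[1])
    if op == 'and':
        return holds(model, w, formula[1]) and holds(model, w, formula[2])
    if op == 'cond':  # A => B holds at w iff f(w, [A]) is a subset of [B]
        return f(w, ext(model, formula[1])) <= ext(model, formula[2])
    raise ValueError(op)
```

For instance, with the identity selection function `f = lambda w, S: S` the axiom ID (A ⇒ A) holds at every world, while a selection function that always returns at most a singleton satisfies the (CEM) model condition of the table above.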
3 SeqS22: A Sequent Calculus for Conditional Logics

In this section we introduce SeqS22, a family of labelled sequent calculi for the conditional systems under consideration. The calculi are modular and they are able to deal with the basic system CK as well as with the whole cube of extensions with axioms ID, CS, CEM and MP. Given Proposition 1, it is worth noticing that, concerning systems admitting the axiom (CS), SeqS22 offers two alternative calculi: on the one hand, the calculus obtained by adding the suitable rule (CS), as in SeqS [19]; on the other hand, the calculus obtained by including the rules for (MP) and (CEM) and omitting the rule for (CS), thus avoiding the mechanism of label substitution required by such a rule. The calculi make use of labels to represent possible worlds. We consider a language L and a denumerable alphabet of labels A, whose elements are denoted by x, y, z, .... There are two kinds of labelled formulas:

– world formulas, denoted by x : A, where x ∈ A and A ∈ L, used to represent that A holds in a world x;
– transition formulas, denoted by x −A→ y, where x, y ∈ A and A ∈ L. A transition formula x −A→ y represents that y ∈ f(x, [A]).

A sequent is a pair ⟨Γ, Δ⟩, usually denoted with Γ ⊢ Δ, where Γ and Δ are multisets of labelled formulas. The intuitive meaning of Γ ⊢ Δ is: every model that satisfies all labelled formulas of Γ in the respective worlds (specified by the labels) satisfies at least one of the labelled formulas of Δ (in those worlds). Formally, given a model M = ⟨W, f, [ ]⟩ for L and a label alphabet A, we consider any mapping I : A → W. Let F be a labelled formula; we define M |=I F as follows:

– M |=I x : A if and only if I(x) ∈ [A];
– M |=I x −A→ y if and only if I(y) ∈ f(I(x), [A]).

We say that Γ ⊢ Δ is valid in M if for every mapping I : A → W, if M |=I F for every F ∈ Γ, then M |=I G for some G ∈ Δ.
We say that Γ ⊢ Δ is valid in a system (CK or any extension of it) if it is valid in every M satisfying the specific conditions for that system. We say that a sequent Γ ⊢ Δ is derivable if it admits a derivation in SeqS22, i.e. a proof tree, obtained by applying backwards the rules of the calculi, having Γ ⊢ Δ as a root and whose leaves are all instances of (AX). As usual, the idea is as follows: in order to prove that a formula F is valid in a conditional logic, one has to check whether the sequent ⊢ x : F is derivable in SeqS22, i.e. whether we can obtain a proof tree by applying backwards the rules, starting from the root ⊢ x : F. As a difference with the sequent calculi SeqS introduced in [19], the calculi SeqS22 follow the basic idea of the calculus introduced in [22], which in this paper is extended in order to deal also with MP. Such an idea is that of dealing with the CEM condition by means of a second rule having a conditional A ⇒ B on the right-hand side of a sequent as a principal formula, rather than the one in SeqS, where the condition on the cardinality (at most 1) of the set of worlds selected by the selection function is captured by means of a label substitution mechanism: roughly speaking, given x −A→ y, in order
to prove x −A→ z, we replace both y and z with a new label u, following the observation that they represent the same world. The "old" rule in SeqS is as follows:

    Γ, x −A→ y ⊢ Δ, x −A→ z     (Γ, x −A→ y ⊢ Δ)[y/u, z/u]
    ───────────────────────────────────────────────────── (CEM)
                     Γ, x −A→ y ⊢ Δ

where Σ[x/u] is used to denote the multiset obtained from Σ by replacing, as mentioned here above, the label x by u wherever it occurs, and where it holds that y ≠ z and u ∉ Γ, Δ. The novel rule introduced in SeqS22 is as follows:

    Γ ⊢ Δ, x : A ⇒ B, x −A→ y     Γ ⊢ Δ, x : A ⇒ B, y : B
    ───────────────────────────────────────────────────── (CEM)
                    Γ ⊢ Δ, x : A ⇒ B
Intuitively, given a conditional formula x : A ⇒ B on the right-hand side of a sequent, the calculi apply the rule (⇒ R) only once, introducing a new label y representing the single world selected by the selection function; the new rule (CEM) then makes use of such a label y for all other conditional formulas of the form x : A ⇒ B′. As an example, Fig. 2 shows a derivation of an instance of the characterizing axiom (CEM). The calculi SeqS22 are shown in Fig. 1. They satisfy basic structural properties, namely height-preserving admissibility of weakening, height-preserving invertibility of the rules (with the exception of (EQ)), and height-preserving admissibility of contraction. These are needed in order to show that the following cut rule is admissible:

Theorem 1. The cut rule:

    Γ ⊢ Δ, F     F, Γ ⊢ Δ
    ───────────────────── (cut)
           Γ ⊢ Δ
where F is any labelled formula, is admissible in SeqS22, i.e. if Γ ⊢ Δ, F and F, Γ ⊢ Δ are derivable, so is Γ ⊢ Δ.

Proof. As usual, the proof proceeds by a double induction over the complexity of the cut formula and the sum of the heights of the derivations of the two premises of cut, in the sense that we replace one cut by one or several cuts on formulas of smaller complexity, or on sequents derived by shorter derivations. We show two of the most interesting cases involving the novel rule (CEM). Let us first consider the case involving the rules (CEM) and (MP), i.e. the rules that caused the failure of the proof of admissibility of cut in [19]. We consider the case in which the cut formula is principal in the application of the (MP) rule only, as follows: the left premise of (cut), namely (3) Γ ⊢ Δ, x : A ⇒ B, x −A→ x, is obtained by an application of (MP) from Γ ⊢ Δ, x : A ⇒ B, x −A→ x, x : A; the right premise, namely x −A→ x, Γ ⊢ Δ, x : A ⇒ B, is obtained by an application of (CEM) from (1) x −A→ x, Γ ⊢ Δ, x : A ⇒ B, x −A→ y and (2) x −A→ x, Γ ⊢ Δ, x : A ⇒ B, y : B; the cut on the formula x −A→ x concludes (4) Γ ⊢ Δ, x : A ⇒ B.
Fig. 1. Rules of sequent calculi SeqS22
Fig. 2. A derivation of CEM in SeqS22.
Since weakening is height-preserving admissible and (3) is derivable, so are (3′) Γ ⊢ Δ, x : A ⇒ B, x −A→ x, x −A→ y and (3′′) Γ ⊢ Δ, x : A ⇒ B, x −A→ x, y : B, with derivations of no greater heights. We can then apply the inductive hypothesis on the height of the derivations to (3′) and (1), obtaining a derivation of (5) Γ ⊢ Δ, x : A ⇒ B, x −A→ y, as well as to (3′′) and (2), obtaining a derivation of (6) Γ ⊢ Δ, x : A ⇒ B, y : B. We conclude that (4) can be derived by an application of (CEM) to (5) and (6).
Let us now take into account the case in which the cut formula is the principal formula in both the premises of (cut), and the rules applied to it are (CEM) and (⇒ L). The situation is as follows: the left premise of (cut), namely (11) Γ ⊢ Δ, x : A ⇒ B, is obtained by an application of (CEM) from (7) Γ ⊢ Δ, x : A ⇒ B, x −A→ y and (8) Γ ⊢ Δ, x : A ⇒ B, y : B; the right premise, namely (12) Γ, x : A ⇒ B ⊢ Δ, is obtained by an application of (⇒ L) from (9) Γ, x : A ⇒ B ⊢ Δ, x −A→ y and (10) Γ, x : A ⇒ B, y : B ⊢ Δ; the cut on x : A ⇒ B concludes Γ ⊢ Δ.
Since weakening is height-preserving admissible, we can obtain a proof (with a derivation of at most the same height as that of (11)) for (11′) Γ ⊢ Δ, x : A ⇒ B, y : B. By inductive hypothesis on the height of the derivations, we can cut (10) and (11′), obtaining a derivation of (13) Γ, y : B ⊢ Δ. Since weakening is height-preserving admissible, we can obtain a proof (with a derivation of at most the same height as that of (12)) for (12′) Γ, x : A ⇒ B ⊢ Δ, y : B. By inductive hypothesis on the height of the derivations, we can cut (8) and (12′), obtaining a derivation of (14) Γ ⊢ Δ, y : B. We can then apply the inductive hypothesis on the complexity of the cut formula to cut (13) and (14), and we are done with a derivation of Γ ⊢ Δ. Due to space limitations, the other cases are omitted and left to the reader.

Theorem 2 (Soundness and completeness). Given a conditional formula F, it is valid in a conditional logic if and only if it is derivable in the corresponding calculus of SeqS22, that is to say |= F if and only if ⊢ x : F is derivable in SeqS22.

Proof. (Soundness) We have to prove that, if a sequent Γ ⊢ Δ is derivable, then the sequent is valid. This can be done by induction on the height of the derivation of Γ ⊢ Δ. The basic cases are those corresponding to derivations of height 0, that is to say instances of (AX). It is easy to see that, in all these cases, Γ ⊢ Δ is a valid sequent. As an example, consider Γ, x : P ⊢ Δ, x : P: consider any model M and any mapping I satisfying all formulas in the left-hand side of the sequent, then also x : P. This means that I(x) ∈ [P], but then M satisfies via I at least one formula in the right-hand side of the sequent, namely the same x : P. For the inductive step, we proceed by considering each rule of the calculi SeqS22 in order to check that, if the premise(s) is (are) valid sequent(s), to which we can apply the inductive hypothesis, so is the conclusion.
To save space, we only present the cases of (MP) and of the new rule (CEM); the other ones are left to the reader. Let us start with (MP) and a derivation ended as follows:

    (1) Γ ⊢ Δ, x −A→ x, x : A
    ───────────────────────── (MP)
    (2) Γ ⊢ Δ, x −A→ x

By inductive hypothesis, the sequent (1) is valid. By absurd, suppose that (2) is not: this means that there exist a model M and a mapping I satisfying all formulas in Γ but falsifying all formulas in the right-hand side of the sequent, namely all formulas in Δ and x −A→ x. Since (1) is valid, every model with any mapping satisfying all formulas in Γ also satisfies at least one formula in the right-hand side of the sequent: since M falsifies all formulas in Δ and (∗) x −A→ x via I, it must be that M |=I x : A, that is to say the
world w represented by I(x) is an A-world, i.e. w ∈ [A]. By the condition (MP), this implies that also w ∈ f(w, [A]); however, this would mean that I(x) ∈ f(I(x), [A]), i.e. M |=I x −A→ x, against (∗). Let us now consider the rule (CEM) and a proof ended as:
    (3) Γ ⊢ Δ, x : A ⇒ B, x −A→ y     (4) Γ ⊢ Δ, x : A ⇒ B, y : B
    ───────────────────────────────────────────────────────────── (CEM)
                     (5) Γ ⊢ Δ, x : A ⇒ B
By inductive hypothesis, both (3) and (4) are valid. Again by absurd, suppose (5) is not, that is to say there exist a model M and a mapping I satisfying all formulas in Γ but falsifying all formulas in Δ as well as x : A ⇒ B. Since (3) is valid, and since M and I falsify all formulas in Δ and x : A ⇒ B, necessarily we have that M |=I x −A→ y, that is to say I(y) ∈ f(I(x), [A]). By the (CEM) semantic condition, it follows that (∗∗) f(I(x), [A]) = {I(y)}. Analogously, by the validity of (4) we have that M |=I y : B. Since M falsifies x : A ⇒ B in (5), there exists a world w such that w ∈ f(I(x), [A]) and w ∉ [B]; however, by (∗∗), we have that I(y) = w, against the validity of (4), and we are done.

(Completeness) The completeness is an easy consequence of the admissibility of the cut rule (Theorem 1). We show that if a formula F is valid in a conditional logic, then ⊢ x : F is derivable in SeqS22. We proceed by induction on the complexity of the formulas; therefore, we show that the axioms are derivable and that the set of derivable formulas is closed under (Modus Ponens), (RCEA), and (RCK). A derivation of axioms (ID), (CS) and (MP) can be obtained as in SeqS [19]. A derivation of (CEM) is provided in Fig. 2. For (Modus Ponens), suppose that ⊢ x : A → B and ⊢ x : A are derivable. We easily have that x : A → B, x : A ⊢ x : B is derivable too by applying (→ L). Since cut is admissible by Theorem 1, by two cuts we obtain ⊢ x : B (weakenings are omitted to increase readability):
x:A→B
x:A x:B
(cut)
x:A
x:B
(cut)
For (RCEA), we have to show that if A ↔ B is derivable, then so is (A ⇒ C) ↔ (B ⇒ C). The formula A ↔ B is an abbreviation for (A → B) ∧ (B → A). Suppose that ⊢ x : (A → B) ∧ (B → A) is derivable; then x : A ⊢ x : B and x : B ⊢ x : A are also derivable, since the rules are height-preserving invertible. We can derive x : A ⇒ C ⊢ x : B ⇒ C as follows:

    x : A ⊢ x : B     x : B ⊢ x : A
    ─────────────────────────────────────────────── (EQ)
    x −B→ y, x : A ⇒ C ⊢ x : B ⇒ C, y : C, x −A→ y     y : C, x −B→ y, x : A ⇒ C ⊢ x : B ⇒ C, y : C
    ──────────────────────────────────────────────────────────────────────────────────────── (⇒ L)
    x −B→ y, x : A ⇒ C ⊢ x : B ⇒ C, y : C
    ────────────────────────────────────── (⇒ R)
    x : A ⇒ C ⊢ x : B ⇒ C
The other half is symmetric. For (RCK), suppose that ⊢ x : B1 ∧ B2 ∧ · · · ∧ Bn → C is derivable; by the height-preserving invertibility of the rules, also y : B1, . . . , y : Bn ⊢
y : C is derivable, and then so is (∗) x : A ⇒ B1, x : A ⇒ B2, . . . , x : A ⇒ Bn, y : B1, . . . , y : Bn ⊢ x : A ⇒ C, y : C by admissibility of weakening. We have:

    x −A→ y ⊢ x −A→ y     (∗) x : A ⇒ B1, . . . , y : B1, . . . , y : Bn ⊢ x : A ⇒ C, y : C
    ──────────────────────────────────────────────────────────────────────────────── (⇒ L)
    x −A→ y, x : A ⇒ B1, . . . , x : A ⇒ Bn, y : B1, . . . , y : Bn−1 ⊢ x : A ⇒ C, y : C
        ⋮
    x −A→ y ⊢ x −A→ y     x −A→ y, x : A ⇒ B1, . . . , x : A ⇒ Bn, y : B1 ⊢ x : A ⇒ C, y : C
    ──────────────────────────────────────────────────────────────────────────────── (⇒ L)
    x −A→ y, x : A ⇒ B1, . . . , x : A ⇒ Bn ⊢ x : A ⇒ C, y : C
    ────────────────────────────────────────────────────────── (⇒ R)
    x : A ⇒ B1, . . . , x : A ⇒ Bn ⊢ x : A ⇒ C
The presence of labels and of the rules (⇒ L), (⇒ R), (ID), (MP), (CEM), and (CS), which increase the complexity of the sequent in a backward proof search, is a potential cause of non-termination. However, with an argument similar to the one proposed in [19], we can define a procedure that applies such rules in a controlled way, introducing only a finite number of labels and thus ensuring termination. Intuitively, it can be shown that it is useless to apply (⇒ L) and (⇒ R) on x : A ⇒ B by introducing (looking backward) the same transition formula x −A→ y more than once in each branch of a proof tree. Similarly, it is useless to apply (ID), (MP), (CEM), and (CS) on the same transition formula more than once in a backward proof search in each branch of a derivation. This leads to the decidability of the logics under consideration:

Theorem 3 (Decidability). The conditional logic CK and all its extensions with axioms ID, MP, CS, CEM and all their combinations are decidable.

It can be shown that provability in all the Conditional Logics considered is decidable in O(n² log n) space; we omit the proof, which is essentially the same as in [19].
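The loop-checking discipline just described can be rendered as a schematic backward search. This Python sketch only illustrates the termination argument, not the Prolog implementation discussed in the next section; the `expand` and `is_axiom` parameters are assumptions standing in for the rule set and the axiom check.

```python
def search(sequent, is_axiom, expand, applied=frozenset()):
    """Backward proof search with a per-branch loop check: each pair
    (rule, principal formula) is applied at most once along a branch."""
    if is_axiom(sequent):
        return True
    for rule, principal, premises in expand(sequent):
        key = (rule, principal)
        if key in applied:   # already applied in this branch: skipping it
            continue         # is what guarantees termination
        if all(search(p, is_axiom, expand, applied | {key}) for p in premises):
            return True
    return False
```

Without the `applied` history, a rule that copies its principal formula into its premises would be re-applied forever; with it, each branch can only grow by a bounded number of rule applications.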
4 A Theorem Prover for Conditional Logics with CEM

In this section we briefly present CondLean2022 (https://gitlab2.educ.di.unito.it/pozzato/condlean4), a Prolog implementation of the calculi SeqS22 introduced in the previous section. The prover is in line with the existing provers for these logics [20, 21] and it follows the "lean" methodology introduced by Beckert and Posegga in the middle of the 90s [2, 3, 7]: they proposed a very elegant and extremely efficient first-order theorem prover, called leanTAP, consisting of only five Prolog clauses. The basic idea of the "lean" methodology is "to achieve maximal efficiency from minimal means" [2] by writing short programs and exploiting the power of Prolog's engine as much as possible. Moreover, it is straightforward to prove soundness and completeness of the theorem prover by exploiting the one-to-one correspondence between axioms/rules of SeqS22 and clauses of CondLean2022. We implement each component of a sequent by a list of formulas, partitioned into three sublists: atomic formulas, transitions and complex formulas. Atomic and complex formulas are implemented by a Prolog list of the form [x,a], where x is a Prolog
constant and a is a formula. A transition formula x −A→ y is implemented by a Prolog list of the form [x,a,y]. Labels are implemented by Prolog constants. The sequent calculi are implemented by the predicate prove(Gamma, Delta, Labels, RCond, LCond, Tree), which succeeds if and only if Γ ⊢ Δ is derivable in SeqS22, where Gamma and Delta are the lists implementing the multisets Γ and Δ, respectively, and Labels is the list of labels introduced in that branch. As we will describe later on, the arguments RCond and LCond are used in order to ensure the termination of the proof search by restricting the application of some crucial rules. Tree is an output term: if the proof search succeeds, it matches a Prolog representation of the derivation found by the theorem prover. Each clause of the prove predicate implements one axiom or rule of SeqS22. The theorem prover proceeds as follows. First of all, if Γ ⊢ Δ is an axiom, then the goal will succeed immediately by using the clauses for the axioms. If it is not, then the first applicable rule is chosen. The ordering of the clauses is such that the application of the branching rules is postponed as much as possible. Concerning the rules for ⇒ on the right-hand side of a sequent, the rule (⇒ R), which introduces a new label in a backward proof search, is first applied to a sequent of the form Γ ⊢ Δ, x : A ⇒ B. If this does not lead to a derivation, the new rule for CEM is then applied. As mentioned above, the arguments RCond and LCond are used in order to ensure the termination of the proof search by controlling the application of the rules (⇒ L) and (⇒ R): indeed, these rules copy the conditional formula x : A ⇒ B to which they are applied into their premises, therefore we need to avoid redundant applications that, otherwise, would lead to the expansion of an infinite branch.
For instance, RCond is a Prolog list containing all the formulas x : A ⇒ B to which the rule (⇒ R) has already been applied in the current branch: such a rule will then be applied to x : A ⇒ B only if it does not belong to the list RCond. A similar mechanism is implemented for the extensions of CK, namely further suitable arguments are added to the predicate prove to keep track of the information needed to avoid useless and uncontrolled applications of the rules (MP), (ID), (CEM), and (CS), which copy their principal formulas into their premise(s). As an example, in systems with condition (CEM), a further argument is a Prolog list, called CEM, whose elements are pairs (y, x : A ⇒ B) representing that the rule (CEM) has already been applied (in a backward proof search) to a conditional formula x : A ⇒ B by using the label y in the premises, i.e., by introducing x −A→ y and y : B in the two premises of the rule. In order to apply the rule (CEM) to a formula x : A ⇒ B, the clause implementing it will choose a label y in the list Labels such that the pair (y, x : A ⇒ B) does not belong to the list CEM. Let us now present some clauses of CondLean2022. As a first example, the clause for the axiom checking whether the same atomic formula occurs in both the left- and the right-hand side of a sequent is implemented as follows:

prove([LitGamma,_,_],[LitDelta,_,_],_,_,_,tree(ax)) :-
    member(F,LitGamma), member(F,LitDelta), !.
Labelled Sequent Calculi for Conditional Logics
355
It is easy to observe that the rule succeeds when the same labelled formula F belongs to both the left- and the right-hand side of the sequent under investigation, completing the proof search: indeed, no recursive call to the predicate prove is performed, and the output term Tree matches a representation of a leaf in the derivation (tree(ax)). As another example, we show the code of the novel rule (CEM):

prove([LitGamma,TransGamma,ComplexGamma],
      [LitDelta,TransDelta,ComplexDelta],
      Labels, RCond, LCond, CEM, tree(cem,SubTree1,SubTree2)) :-
    member([X,A => B],ComplexDelta),
    member(Y,Labels),
    \+member([Y,[X,A => B]],CEM),        % (∗)
    !,
    put([Y,B],LitDelta,ComplexDelta,NewLitDelta,NewComplexDelta),
    prove([LitGamma,TransGamma,ComplexGamma],
          [LitDelta,[[X,A,Y]|TransDelta],ComplexDelta],
          Labels, RCond, LCond, [[Y,[X,A => B]]|CEM], SubTree1),
    prove([LitGamma,TransGamma,ComplexGamma],
          [NewLitDelta,TransDelta,NewComplexDelta],
          Labels, RCond, LCond, [[Y,[X,A => B]]|CEM], SubTree2).
The predicate put is used to put [Y,B] in the proper sublist of the consequent. The recursive calls to prove implement the proof search on the two premises. As mentioned, in order to ensure termination, in line (∗) the theorem prover checks whether (CEM) has already been applied in the current branch using the same label y on the conditional formula x : A ⇒ B: to this aim, CondLean2022 looks for the pair [Y,[X,A => B]] in the list CEM and, if needed, it avoids a further, useless application.

In order to search for a derivation of a sequent Γ ⊢ Δ, the theorem prover proceeds as follows. First, if Γ ⊢ Δ is an axiom, the goal succeeds immediately by using the clauses for the axioms. If it is not, then the first applicable rule is chosen, e.g., if ComplexDelta contains a formula [X,A => B], then the clause for the (⇒ R) rule is used, invoking prove on the unique premise of (⇒ R). The prover proceeds in a similar way for the other rules. The ordering of the clauses is such that the application of the branching rules is postponed as much as possible.

In order to check whether a formula is valid in one of the considered systems, one just has to invoke the following auxiliary predicate:

pr(Formula)

which wraps the prove predicate with a suitable initialization of its arguments. In order to provide a first evaluation of the performance of the theorem prover, we have tested both CondLean and CondLean2022 over (i) a set of formulas holding only in systems with CEM, as well as over (ii) a set of randomly generated formulas, either valid or not. We have observed that, over the set of valid formulas, the performance of CondLean2022 improves by 20.57% with respect to CondLean. As an example, running both provers over the formula
(A ⇒ (B1 ∨ ... ∨ B5)) ⇒ ((A ⇒ B1) ∨ ... ∨ (A ⇒ B5))

CondLean2022 is able to build a derivation in 94 ms, against the 266 ms needed by CondLean. Over randomly generated formulas, the statistics are even better: CondLean2022 provides a performance improvement of 48.27% with respect to CondLean. The performance of CondLean2022 is promising, especially concerning all cases in which it has to answer 'no' for a non-valid formula: this is justified by the fact that CondLean has to make a great effort to explore the whole space of alternative choices in label substitution needed in order to conclude the proof. The current version of the theorem prover CondLean2022 is available for free download at https://gitlab2.educ.di.unito.it/pozzato/condlean4, where one can also find an updated version of CondLean, so that the two provers can be compared on common systems.
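The termination bookkeeping described in this section (apply (CEM) to x : A ⇒ B with a label y only if the pair (y, x : A ⇒ B) has not been recorded yet) can be mirrored outside Prolog. A minimal Python sketch; the function name and the tuple encoding of labelled formulas are ours, not part of CondLean2022:

```python
def cem_applicable(formula, labels, cem_history):
    """Return the first label y such that (CEM) has not yet been applied
    to `formula` with y in the current branch, or None if no such label exists.
    Mirrors the Prolog test: member(Y, Labels), \\+member([Y, Formula], CEM)."""
    for y in labels:
        if (y, formula) not in cem_history:
            return y
    return None

# Backward proof search applies (CEM) to x : A => B at most once per label:
history = set()
labels = ["x", "y"]
formula = ("x", ("A", "=>", "B"))  # our encoding of x : A => B

first = cem_applicable(formula, labels, history)
history.add((first, formula))                      # record the application
second = cem_applicable(formula, labels, history)  # next still-unused label
history.add((second, formula))
assert cem_applicable(formula, labels, history) is None  # rule is now blocked
```

Once every label has been consumed for a given conditional, the clause fails and the prover backtracks, which is exactly what guarantees that the branch cannot grow forever.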
5 Conclusions and Future Works

In this work we have introduced labelled sequent calculi for Conditional Logics with the selection function semantics, including the basic system CK as well as extensions with the well-established axioms ID, MP, CEM, and CS, and all their combinations. In contrast with the seminal work in [19], we are also able to deal with systems combining the condition of conditional excluded middle (CEM) and conditional modus ponens (MP), where conditional strong centering (CS) is a derived condition. The same extensions, with condition (CSO) in place of (CS), are considered in [1]. We have provided alternative calculi, where the original rule for CEM, based on an expensive mechanism of label substitution, has been replaced by a "standard" rule, called (CEM), inspired by the one introduced in [22] and specifically tailored for handling conditional formulas A ⇒ B in these systems. We have also implemented the proposed calculi and compared the obtained theorem prover, called CondLean2022, with its ancestor CondLean. The promising performance we obtained provides empirical evidence that the proposed system not only fills a gap in terms of the Conditional Logics considered, but is also a concrete step in the direction of efficient theorem proving for them.

We plan to extend our work in several directions. First, we aim at extending the calculi and the implementation to stronger Conditional Logics. Moreover, we aim at extending the theorem prover CondLean2022 towards a "concrete" theorem prover: in particular, we aim at implementing state-of-the-art heuristics, data structures and suitable refinements, as well as a graphical web interface for it. Last, we aim at extending the set of formulas adopted in the performance evaluation.

Acknowledgement. This work has been partially supported by the INdAM-GNCS Project cod. CUP_E55F22000270001 "LESLIE: LogichE nonclaSsiche per tooL Intelligenti ed Explainable".
References

1. Alenda, R., Olivetti, N., Pozzato, G.L.: Nested sequent calculi for normal conditional logics. J. Log. Comput. 26(1), 7–50 (2016). https://doi.org/10.1093/logcom/ext034
2. Beckert, B., Posegga, J.: leanTAP: lean tableau-based deduction. J. Autom. Reason. 15(3), 339–358 (1995)
3. Beckert, B., Posegga, J.: Logic programming as a basis for lean automated deduction. J. Log. Program. 28(3), 231–236 (1996)
4. Burgess, J.P.: Quick completeness proofs for some logics of conditionals. Notre Dame J. Formal Log. 22, 76–84 (1981)
5. Chellas, B.F.: Basic conditional logics. J. Philos. Log. 4, 133–153 (1975)
6. Delgrande, J.P.: A first-order conditional logic for prototypical properties. Artif. Intell. 33(1), 105–130 (1987)
7. Fitting, M.: leanTAP revisited. J. Log. Comput. 8(1), 33–47 (1998)
8. Friedman, N., Halpern, J.Y.: Plausibility measures and default reasoning. J. ACM 48(4), 648–685 (2001)
9. Gabbay, D.M., Giordano, L., Martelli, A., Olivetti, N., Sapino, M.L.: Conditional reasoning in logic programming. J. Log. Program. 44(1–3), 37–74 (2000)
10. Genovese, V., Giordano, L., Gliozzi, V., Pozzato, G.L.: Logics in access control: a conditional approach. J. Log. Comput. 24(4), 705–762 (2014)
11. Giordano, L., Gliozzi, V., Olivetti, N.: Iterated belief revision and conditional logic. Stud. Log. 70(1), 23–47 (2002)
12. Giordano, L., Gliozzi, V., Olivetti, N.: Weak AGM postulates and strong Ramsey test: a logical formalization. Artif. Intell. 168(1–2), 1–37 (2005)
13. Giordano, L., Schwind, C.: Conditional logic of actions and causation. Artif. Intell. 157(1–2), 239–279 (2004)
14. Giordano, L., Gliozzi, V., Olivetti, N., Pozzato, G.L.: Analytic tableaux for KLM preferential and cumulative logics. In: Sutcliffe, G., Voronkov, A. (eds.) LPAR 2005. LNCS (LNAI), vol. 3835, pp. 666–681. Springer, Heidelberg (2005). https://doi.org/10.1007/11591191_46
15. Grahne, G.: Updates and counterfactuals. J. Log. Comput. 8(1), 87–117 (1998)
16. Kraus, S., Lehmann, D., Magidor, M.: Nonmonotonic reasoning, preferential models and cumulative logics. Artif. Intell. 44(1–2), 167–207 (1990)
17. Lewis, D.: Counterfactuals. Basil Blackwell Ltd. (1973)
18. Nute, D.: Topics in Conditional Logic. Reidel, Dordrecht (1980)
19. Olivetti, N., Pozzato, G.L., Schwind, C.B.: A sequent calculus and a theorem prover for standard conditional logics. ACM Trans. Comput. Log. (ToCL) 8(4), 22-es (2007)
20. Olivetti, N., Pozzato, G.L.: CondLean: a theorem prover for conditional logics. In: Cialdea Mayer, M., Pirri, F. (eds.) TABLEAUX 2003. LNCS (LNAI), vol. 2796, pp. 264–270. Springer, Heidelberg (2003). https://doi.org/10.1007/978-3-540-45206-5_23
21. Olivetti, N., Pozzato, G.L.: CondLean 3.0: improving CondLean for stronger conditional logics. In: Beckert, B. (ed.) TABLEAUX 2005. LNCS (LNAI), vol. 3702, pp. 328–332. Springer, Heidelberg (2005). https://doi.org/10.1007/11554554_27
22. Panic, N., Pozzato, G.L.: Efficient theorem proving for conditional logics with conditional excluded middle. In: Calegari, R., Ciatto, G., Omicini, A. (eds.) Proceedings of the 37th Italian Conference on Computational Logic, Bologna, Italy, 29 June–1 July 2022. CEUR Workshop Proceedings, vol. 3204, pp. 217–231. CEUR-WS.org (2022). https://ceur-ws.org/Vol-3204/paper_22.pdf
23. Schwind, C.B.: Causality in action theories. Electron. Trans. Artif. Intell. (ETAI) 3(A), 27–50 (1999)
24. Stalnaker, R.: A theory of conditionals. In: Rescher, N. (ed.) Studies in Logical Theory, pp. 98–112. Blackwell (1968)
Deep Learning for ECoG Brain-Computer Interface: End-to-End vs. Hand-Crafted Features

Maciej Śliwowski 1,2(B), Matthieu Martin 1, Antoine Souloumiac 2, Pierre Blanchart 2, and Tetiana Aksenova 1

1 Univ. Grenoble Alpes, CEA, LETI, Clinatec, 38000 Grenoble, France
[emailprotected], [emailprotected]
2 Université Paris-Saclay, CEA, List, 91120 Palaiseau, France
Abstract. In brain signal processing, deep learning (DL) models have become commonly used. However, the performance gain from using end-to-end DL models compared to conventional ML approaches is usually significant but moderate, typically at the cost of increased computational load and deteriorated explainability. The core idea behind deep learning approaches is scaling the performance with bigger datasets. However, brain signals are temporal data with a low signal-to-noise ratio, uncertain labels, and non-stationarity in time. These factors may influence the training process and slow down the models' performance improvement. The influence of these factors may differ for an end-to-end DL model and one using hand-crafted features. As this has not been studied before, this paper compares the performance of models that use raw ECoG signals with time-frequency-features-based decoders for BCI motor imagery decoding. We investigate whether the current dataset size is a stronger limitation for either model. Finally, the obtained filters were compared to identify differences between hand-crafted features and those optimized with backpropagation. To compare the effectiveness of both strategies, we used a multilayer perceptron and a mix of convolutional and LSTM layers that were already proven effective in this task. The analysis was performed on the long-term clinical trial database (almost 600 min of recordings over 200 days) of a tetraplegic patient executing motor imagery tasks for 3D hand translation. For a given dataset, the results showed that end-to-end training might not be significantly better than the hand-crafted-features-based model. The performance gap is reduced with bigger datasets, but considering the increased computational load, end-to-end training may not be profitable for this application.

Keywords: Deep learning · ECoG · Brain-computer interfaces · Dataset size · Motor imagery · End-to-end

1 Introduction
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. A. Dovier et al. (Eds.): AIxIA 2022, LNAI 13796, pp. 358–373, 2023. https://doi.org/10.1007/978-3-031-27181-6_25

In the last decade, deep learning (DL) models achieved extraordinary performance in a variety of complex real-life tasks, e.g., computer vision [4], natural
language processing [2], compared to previously developed models. This was possible mainly thanks to the improvements of data processing units and, most importantly, increased dataset sizes [4]. Generally, in brain-computer interface (BCI) research, access to large databases of brain signals is limited due to experimental and medical constraints as well as the immensity of paradigm/hardware combinations. Given limited datasets, can we still train end-to-end (E2E) DL models for the medical BCI application as effectively as in computer vision?

In 2019, Roy et al. [12] reported that the number of studies classifying EEG signals with deep learning using hand-crafted features (mainly frequency domain) and raw EEG signals (end-to-end) was similar. This indicates that decoding raw EEG signals, without feature extraction, is indeed possible. However, in many articles, researchers decided to use harder-to-design hand-crafted features. While end-to-end models dominated computer vision, in brain signal processing it is still common to use extracted features as input to the DL models. It is unclear whether specific signal characteristics cause this, e.g., non-stationarity in time making the creation of a homogeneous dataset impractical, a low signal-to-noise ratio complicating the optimization process and favoring overfitting, label uncertainty originating from the human-in-the-loop experimental setup, or researchers' bias toward solutions that are better understood and more explainable.

Most studies do not directly compare DL using end-to-end and hand-crafted features approaches. Usually, DL architectures are compared with each other and with an additional 'traditional' ML pipeline, e.g., filter-bank common spatial patterns (FBCSP) in [15], xDAWN and FBCSP in [5], SVM and FBCSP in [17]. In Fig. 1, we aggregated the studies analyzed¹ by Roy et al. [12] to present the accuracy improvement of the best proposed DL model in every article compared to the 'traditional' baseline, depending on the recording time and the number of examples in the dataset. The gap between the performance improvement of DL and the 'traditional' baseline increases with the dataset size (except for the last points on the plot, which contain significantly fewer studies). In the right plot, the difference between models using raw EEG and frequency domain features increases, which may exhibit a boost of end-to-end models with access to bigger datasets compared to hand-crafted features. As the proposed DL models are usually compared to the baseline, the boost of end-to-end models cannot be clearly stated, because the accuracy difference depends strongly on the 'traditional' baseline model performance and the particular task tackled in the study.

¹ Limited to the articles that contained all the required information; code adapted from [12].

Fig. 1. Binned average accuracy difference between the best proposed DL model and the 'traditional' baseline on EEG datasets. Error bars denote one standard deviation of the values in the bin. Bins are equal in size on a logarithmic scale. Points' x-axis position denotes the average dataset size in a bin.

While EEG and ECoG signals share many characteristics (both are multichannel temporal signals with information encoded in frequency and space, with a low signal-to-noise ratio and noisy labels), there are also differences, e.g., the higher spatial resolution of ECoG, a higher signal-to-noise ratio, and a higher contribution of the informative high gamma band (>70 Hz). In motor imagery ECoG decoding, end-to-end DL is not commonly used. Instead, 'traditional' ML classifiers are
usually preceded by a feature extraction step creating a brain signal representation, typically in the form of time-frequency features containing information about the power time course in several frequency bands [8,14], or focused only on the low-frequency component (LFC)/Local Motor Potential (LMP) [14] (a detailed analysis can be found in [19]). However, a successful application of an end-to-end DL model to motor imagery decoding of finger movement trajectories from ECoG was performed with convolutional layers filtering the raw signal both in the temporal and spatial domains, followed by LSTM layers [20]. Nevertheless, the average improvement from training the weights compared to fixed hand-crafted features can be estimated as 0.022 ± 0.0393 of the Pearson r correlation coefficient, which is relatively small, with a noticeable improvement from end-to-end training in 66% of cases.

As this was not studied before, we investigated the differences in data requirements between an end-to-end model and one using hand-crafted features on a long-term clinical trial BCI dataset of a 3D target reach task. Unique long-term recordings (several months of experiments, more than 600 min duration in total, compared to a few minutes of ECoG recording available in previous studies, e.g., [20]) allowed us to explore the relationship between dataset size and the type of features used for ECoG signal decoding. In this study, we used architectures previously applied to the ECoG dataset for decoding motor imagery signals with hand-crafted time-frequency features as input [16]. In addition, we optimized the temporal filtering layer with backpropagation, seeking a more efficient set of filters that were initialized to reproduce the continuous wavelet transform. We also investigated whether both approaches react differently to training dataset perturbations, which may be the case due to distinct model properties and may influence the choice of the optimal data processing pipeline for ECoG BCI.
2 Methods

2.1 Dataset
The dataset used in this study was collected as a part of the clinical trial 'BCI and Tetraplegia' (ClinicalTrials.gov identifier: NCT02550522, details in [1]), approved by the ethical Committee for the Protection of Individuals (Comité de Protection des Personnes, CPP) with the registration number 15-CHUG-19 and by the Agency for the Safety of Medicines and Health Products (Agence nationale de sécurité du médicament et des produits de santé, ANSM) with the registration number 2015-A00650-49.

In the experiment, a 28-year-old tetraplegic patient after spinal cord injury was asked to move the hands of a virtual avatar displayed on a screen (see Fig. 2) using motor imagery patterns, i.e., by repeatedly imagining/attempting hand/finger/arm movements (without actual movements) that influence brain activity in the motor cortex. These changes were then recorded with two WIMAGINE [10] implants placed over the primary motor and sensory cortex bilaterally. Each implant consisted of an 8 × 8 grid of electrodes, with recording performed using 32 electrodes selected in a chessboard-like manner due to limited data transfer, with a sampling frequency equal to 586 Hz. Signals from the implants were transferred to the decoding system that performed online predictions. First, one out of 5 possible states (idle, left and right hand translation, left and right wrist rotation) was selected with a state decoder. Then, for every state (except idle), a multilinear REW-NPLS model [3] updated online was used to predict 3D movements or 1D wrist rotation. The dataset consisted of 44 experimental sessions recorded over more than 200 days. It comprises 300 and 284 min for left and right hand translation, respectively.

Fig. 2. Screenshot from the virtual environment. The patient is asked to reach the yellow square (target) with the left hand (effector) using motor imagery. (Color figure online)

2.2 Data Representation and Problem
From the recorded signals, we extracted two datasets for left and right hand translation. The raw signal representation was created from 1-second-long windows of ECoG signal with 90% overlap. Every observation Xi ∈ R^(64×590) contained 590 samples (590 rather than 586 samples due to a 100 ms buffer during recording) for each of the 64 channels, corresponding to the number of electrodes recording the signal. Every signal window Xi was paired with the corresponding desired trajectory yi ∈ R^3 that the patient was asked to follow, i.e., the straight line connecting the tip of the hand to the target. The trajectories were computed in the 3D virtual avatar coordinate system mounted in the pelvis of the effector. Before feeding the data to the models, datasets were cleaned from data loss artifacts that were not caught during the online recordings. Additionally, observations for which the predicted and desired state did not match due to state
decoder errors were also removed to reduce the number of mislabelled observations (e.g., when the patient was asked to control left hand translation but instead the left wrist was rotating). Then, all the models were trained to find the mapping between the Xi ECoG signals and the yi desired trajectories that the hand should follow in the case of optimal prediction. As a performance metric we used cosine similarity (Eq. 1), measuring the cosine of the angle αi between the prediction ŷi and the desired trajectory yi:

CS(yi, ŷi) = (yi · ŷi) / (‖yi‖ ‖ŷi‖) = cos αi    (1)
Cosine loss, defined as CL(yi, ŷi) = 1 − CS(yi, ŷi), was used as the optimization objective.

2.3 Hand-Crafted Features Extraction and DL Optimization
'Traditional' hand-crafted features were extracted using the complex continuous wavelet transform (CWT). CWT was performed with Morlet wavelets with central frequencies ranging from 10 to 150 Hz (a broad band, as ECoG contains higher frequencies than EEG) with a step of 10 Hz. Each wavelet support consisted of 118 samples (0.2 s) centered on its maximum value. Features were obtained by applying CWT on one-second-long signals, computing the modulus of the complex signals, and performing an average pooling of 0.1 s. The resulting feature tensor was of shape 64 × 15 × 10, with dimensions corresponding to channels, frequency bands, and time steps. CWT can be represented as a convolution between a set of filters and a signal in the temporal domain. In the standard case, the filters are fixed and constitute a basis for feature extraction where every filter detects brain activity in a different frequency band. As every spatial channel is convolved separately in time, we obtained a time-frequency-space representation of the ECoG signal (see Table 1 for the feature extractor architecture specification). Here, we propose to adjust the filters during backpropagation together with all other parameters of the models. In the first scenario, the filters were initialized to Morlet wavelets with 15 central frequencies, resulting in 30 kernels (real and imaginary parts). Note that at the beginning of training, the first layer reproduces 'traditional' hand-crafted feature extraction. The filters were fixed for 5 epochs of so-called pretraining; they were then unfrozen and optimized freely (without any additional constraints imposing a specific filter shape) for the following 50 epochs. The pretraining was used so as not to distort the wavelets drastically in the first epochs, when the parameters of the rest of the network are randomly initialized. We also evaluated random weights initialization from a uniform distribution as a solution that does not incorporate prior knowledge about the system.
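The fixed feature extraction described above (complex Morlet wavelets applied by temporal convolution, modulus, time averaging) can be sketched for a single channel and a single frequency band as follows; the sampling rate, names, and the single global average (instead of 0.1 s pooling) are our simplifications:

```python
import cmath, math

FS = 590  # approximate per-window rate: 590 samples in one 1 s window

def morlet(f, fs=FS, support=118):
    """Complex Morlet-like wavelet with central frequency f Hz and
    118-sample (0.2 s) support, as in the hand-crafted extractor."""
    half = support // 2
    return [cmath.exp(-((n / fs) * f) ** 2) * cmath.exp(2j * math.pi * (n / fs) * f)
            for n in range(-half, half)]

def band_power(signal, f, fs=FS):
    """|CWT| at one frequency: slide the wavelet over the signal, take the
    modulus of each inner product, then average over time (a crude stand-in
    for the 0.1 s average pooling of the real pipeline)."""
    w = morlet(f, fs)
    mods = []
    for start in range(len(signal) - len(w) + 1):
        acc = sum(s * kv.conjugate() for s, kv in zip(signal[start:start + len(w)], w))
        mods.append(abs(acc))
    return sum(mods) / len(mods)

# A 50 Hz sine responds far more strongly in the 50 Hz band than at 120 Hz:
sine = [math.sin(2 * math.pi * 50 * n / FS) for n in range(FS)]
assert band_power(sine, 50) > band_power(sine, 120)
```

Repeating this over 64 channels, 15 central frequencies, and 10 pooling steps yields the 64 × 15 × 10 feature tensor described in the text.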
In the second scenario, an alternative approach was used to maintain the wavelet structure by optimizing only the parameters used to generate the wavelets instead of modifying all filters' parameters. In our case, the function generating the wavelets was defined as:

Ψ(t, f) = 1/√(π f_s/f) · e^(−(tf)²) · e^(2iπtf)    (2)

where the central frequency parameter f defines the center of the frequency band analyzed by the wavelet and f_s is the signal sampling frequency. In the central frequency optimization (CFO) scenario, we optimized only the central frequency parameters f (one per wavelet), so the filters after training are still from the Morlet wavelet family.

Table 1. The architecture used to reproduce hand-crafted feature extraction with CWT. Only one convolutional layer (conv time) was used in computations, according to the performed experiment (E2E / E2E CFO).

| Layer                  | Kernel shape       | Output shape         | Param # | Mult-adds      |
|------------------------|--------------------|----------------------|---------|----------------|
| Input                  | –                  | [200, 1, 590, 8, 8]  | –       | –              |
| Conv time              | [1, 30, 118, 1, 1] | [200, 30, 590, 8, 8] | 3,570   | 27,006,336,000 |
| Conv time CFO          | [1, 30, 118, 1, 1] | [200, 30, 590, 8, 8] | 15      | 27,006,336,000 |
| Square                 | –                  | [200, 30, 590, 8, 8] | –       | –              |
| Sum real and imaginary | –                  | [200, 15, 590, 8, 8] | –       | –              |
| Square root            | –                  | [200, 15, 590, 8, 8] | –       | –              |
| Dropout                | –                  | [200, 15, 590, 8, 8] | –       | –              |
| AvgPool                | –                  | [200, 15, 10, 8, 8]  | –       | –              |
| BatchNorm              | [15]               | [200, 15, 10, 8, 8]  | 30      | 6,000          |
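The generating function of Eq. (2) can be sketched as follows, showing why the CFO variant has only 15 trainable parameters (one scalar f per wavelet) while the unconstrained layer trains all 3,570 (consistent with our reading of 30 kernels × 118 taps + 30 biases); helper names are ours:

```python
import cmath, math

FS = 586  # ECoG sampling frequency from the text

def generate_wavelet(f, fs=FS, support=118):
    """Filter taps from the generating function of Eq. (2):
    Psi(t, f) = 1/sqrt(pi*fs/f) * exp(-(t*f)**2) * exp(2j*pi*t*f).
    Under CFO only the scalar f is trainable; the 118 taps are
    deterministically recomputed from it at every step."""
    norm = 1.0 / math.sqrt(math.pi * fs / f)
    half = support // 2
    return [norm * math.exp(-((n / fs) * f) ** 2) * cmath.exp(2j * math.pi * (n / fs) * f)
            for n in range(-half, half)]

central_freqs = list(range(10, 151, 10))  # 10..150 Hz, step 10 -> 15 wavelets
bank = [generate_wavelet(f) for f in central_freqs]

assert len(central_freqs) == 15          # 'Param #' = 15 for Conv time CFO
assert all(len(w) == 118 for w in bank)  # 118-sample (0.2 s) support
assert 2 * len(bank) == 30               # real + imaginary parts: 30 kernels
assert 30 * 118 + 30 == 3570             # free E2E 'Conv time' parameter count
```

Because every tap is a differentiable function of f, gradients from the rest of the network can flow back into the 15 central frequencies while the filters remain exact Morlet wavelets.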
2.4 DL Architectures

In this study, we used two architectures proposed in [16], i.e., CNN+LSTM+MT, which showed the best performance, and MLP, which was the simplest approach. In the baseline approach, the hand-crafted feature extraction was followed by fully connected or convolutional layers. When optimizing the first convolutional layer, we kept the rest of the network the same to isolate the influence of training the feature extraction step. Details of the tested DL architectures are described below and in [16]. Additionally, we used ShallowFBCSPNet and Deep4Net [15] as end-to-end DL baselines.

MLP. The most basic DL architecture evaluated in the study was a multilayer perceptron (MLP), consisting of two fully connected layers. Dropout and batch normalization layers were placed between the fully connected layers for stronger regularization (see Table 2).
Table 2. MLP architecture from [16].

| Layer           | Kernel shape | Output shape | Param # | Mult-adds  |
|-----------------|--------------|--------------|---------|------------|
| Flatten         | –            | [200, 9600]  | –       | –          |
| Fully connected | [9600, 50]   | [200, 50]    | 480,050 | 96,010,000 |
| BatchNorm       | [50]         | [200, 50]    | 100     | 20,000     |
| ReLU            | –            | [200, 50]    | –       | –          |
| Dropout         | –            | [200, 50]    | –       | –          |
| Fully connected | [50, 50]     | [200, 50]    | 2,550   | 510,000    |
| ReLU            | –            | [200, 50]    | –       | –          |
| Dropout         | –            | [200, 50]    | –       | –          |
| Fully connected | [50, 3]      | [200, 3]     | 153     | 30,600     |
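The 'Param #' column of Table 2 can be re-derived from the layer sizes; a quick arithmetic check (helper functions are ours):

```python
def fc_params(n_in, n_out):
    """Weights + biases of a fully connected layer."""
    return n_in * n_out + n_out

def batchnorm_params(n_features):
    """Learnable scale and shift per feature."""
    return 2 * n_features

# Input size: 64 channels x 15 frequency bands x 10 time steps = 9600.
assert 64 * 15 * 10 == 9600
assert fc_params(9600, 50) == 480_050   # first fully connected layer
assert batchnorm_params(50) == 100
assert fc_params(50, 50) == 2_550       # hidden fully connected layer
assert fc_params(50, 3) == 153          # output layer (3D trajectory)
```

The first layer clearly dominates: flattening the full 9600-dimensional feature tensor into a 50-unit layer accounts for over 99% of the MLP's parameters.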
CNN+LSTM+MT. In the CNN+LSTM+MT architecture, CWT features were further analyzed with 3 × 3 convolutional layers in space (electrodes organized on a 4 × 8 array reflecting the positions of the electrodes on the implants). After two convolutional layers, two LSTM layers were applied to analyze temporal information from 10 timesteps. Finally, every output of the last LSTM layer was used for training to compute the loss based on all predicted and ground truth trajectories corresponding to the 1 s (10 timesteps) of signal analyzed (see Table 3).

Table 3. CNN+LSTM+MT architecture from [16].

| Layer             | Kernel shape      | Output shape        | Param # | Mult-adds   |
|-------------------|-------------------|---------------------|---------|-------------|
| Input             | –                 | [200, 15, 8, 8, 10] | –       | –           |
| Input per implant | –                 | [200, 15, 8, 4, 10] | –       | –           |
| Conv space        | [15, 32, 3, 3, 1] | [200, 32, 6, 4, 10] | 4,352   | 208,896,000 |
| ReLU              | –                 | [200, 32, 6, 4, 10] | –       | –           |
| BatchNorm         | [32]              | [200, 32, 6, 4, 10] | 64      | 12,800      |
| Dropout           | –                 | [200, 32, 6, 4, 10] | –       | –           |
| Conv space        | [32, 64, 3, 3, 1] | [200, 64, 4, 2, 10] | 18,496  | 295,936,000 |
| ReLU              | –                 | [200, 64, 4, 2, 10] | –       | –           |
| Dropout           | –                 | [200, 64, 4, 2, 10] | –       | –           |
| LSTM              | –                 | [200, 10, 50]       | 215,200 | 430,400,000 |
| LSTM              | –                 | [200, 10, 3]        | 660     |             |
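The parameter counts of Table 3 follow from standard convolution and PyTorch-style LSTM formulas; the 1024-dimensional LSTM input (two implants × 64 feature maps × 4 × 2 spatial positions) is our reconstruction from the output shapes:

```python
def conv_params(c_in, c_out, k_h, k_w):
    """Spatial convolution: weights + biases."""
    return c_in * c_out * k_h * k_w + c_out

def lstm_params(n_in, n_hidden):
    """PyTorch-style LSTM layer: 4 gates, input and recurrent weight
    matrices, plus two bias vectors (b_ih and b_hh)."""
    return 4 * (n_in * n_hidden + n_hidden * n_hidden + 2 * n_hidden)

assert conv_params(15, 32, 3, 3) == 4_352    # first 'Conv space' layer
assert conv_params(32, 64, 3, 3) == 18_496   # second 'Conv space' layer
# Both implants flattened: 2 x 64 x 4 x 2 = 1024 inputs per timestep.
assert lstm_params(2 * 64 * 4 * 2, 50) == 215_200
assert lstm_params(50, 3) == 660             # output LSTM, 3D trajectory
```

Here the two LSTM layers hold most of the parameters, while the spatial convolutions stay small because they share their 3 × 3 kernels across the electrode grid.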
Models Training and Hyperparameters. For every model evaluation, we used 90% and 10% of the training dataset for training and validation, respectively. The validation dataset was used for early stopping after 20 epochs without improvement. All the models used a fixed set of hyperparameters, i.e., a learning rate of 0.001, weight decay of 0.01, a batch size of 200, and the Adam optimizer [9]. To train the DL models we used PyTorch [11], skorch [18], and braindecode [15].
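The early-stopping criterion (20 epochs of patience on the validation loss) can be sketched as follows; this mirrors the rule described above, not the actual skorch callback:

```python
def train_with_early_stopping(epoch_val_losses, patience=20):
    """Return the epoch at which training stops: after `patience` epochs
    without improvement of the validation loss, or at the last epoch."""
    best, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(epoch_val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch  # no improvement for `patience` epochs: stop here
    return len(epoch_val_losses) - 1

# Validation loss improves until epoch 3, then plateaus: stop at 3 + 20 = 23.
losses = [0.5, 0.4, 0.35, 0.34] + [0.34] * 40
assert train_with_early_stopping(losses) == 23
```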
2.5 Offline Experiments

First, we computed results in a classical evaluation scenario, i.e., a train/valid/test split. We used the calibration dataset (first six sessions, approximately 10% of the dataset) as the training dataset. The rest of the data (online evaluation dataset) was used as the test set. Additionally, we gradually increased the training dataset size from one session up to 22 with a step of 2. As different models may have different dataset requirements, we wanted to verify whether collecting more data can be more profitable for one of the evaluated optimization/architecture combinations.

To investigate the possible influence of end-to-end learning on the models' robustness against data mislabelling, we perturbed the dataset to make training more challenging. In BCI, a part of the observations is often mistakenly labeled due to lack of subject attention, tiredness, the experimental setup, etc. Therefore, we randomly selected a fraction of observations in which targets were shuffled between samples, so that they no longer have a meaningful connection with the ECoG signal while preserving the same distribution. At the same time, we kept the test set unchanged.
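The perturbation described above (shuffling targets within a randomly chosen fraction of observations) can be sketched as:

```python
import random

def shuffle_fraction_of_targets(targets, fraction, seed=0):
    """Randomly pick a fraction of observations and permute their targets
    among themselves: the marginal target distribution is preserved, but
    the link to the corresponding ECoG window is destroyed."""
    rng = random.Random(seed)
    n = len(targets)
    idx = rng.sample(range(n), int(fraction * n))  # observations to perturb
    permuted = idx[:]
    rng.shuffle(permuted)
    out = list(targets)
    for i, j in zip(idx, permuted):
        out[i] = targets[j]
    return out

targets = [(float(i), 0.0, 0.0) for i in range(100)]
perturbed = shuffle_fraction_of_targets(targets, fraction=0.3)
assert sorted(perturbed) == sorted(targets)  # same distribution of targets
assert sum(a != b for a, b in zip(perturbed, targets)) <= 30  # only 30% move
```

Because only the selected subset is permuted, the global label statistics seen by the model are unchanged; what degrades is purely the signal-to-label correspondence.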
3 Results

Table 4. Test cosine similarity computed in the train-valid-test split scenario. Values are sorted by average performance and represent the mean and standard deviation of 5 runs.

| Model                       | Left hand     | Right hand    |
|-----------------------------|---------------|---------------|
| E2E CNN+LSTM+MT CFO         | 0.304 ± 0.005 | 0.266 ± 0.020 |
| CNN+LSTM+MT                 | 0.297 ± 0.008 | 0.270 ± 0.011 |
| E2E CNN+LSTM+MT             | 0.289 ± 0.007 | 0.273 ± 0.015 |
| E2E MLP CFO                 | 0.254 ± 0.012 | 0.230 ± 0.013 |
| MLP                         | 0.247 ± 0.023 | 0.232 ± 0.005 |
| E2E MLP                     | 0.243 ± 0.014 | 0.234 ± 0.020 |
| ShallowFBCSPNet [15]        | 0.235 ± 0.010 | 0.236 ± 0.011 |
| E2E CNN+LSTM+MT random init | 0.216 ± 0.008 | 0.230 ± 0.020 |
| E2E MLP random init         | 0.181 ± 0.029 | 0.223 ± 0.008 |
| Deep4Net [15]               | 0.111 ± 0.021 | 0.259 ± 0.013 |
We started the analysis by comparing different model training scenarios when trained on the first six sessions (online calibration dataset). The results for the train/test split can be found in Table 4. Differences between scenarios are rather small, with only a small performance improvement coming from full end-to-end optimization. The best performance was achieved by the models using CFO. However, the gap between the hand-crafted features approach and CFO is relatively small, considering the standard deviations of the computed values. The worst performance was achieved by Deep4Net (with especially low performance on the left hand dataset) and by both the MLP and CNN+LSTM+MT models with random weights initialization, suggesting the high importance of the prior signal processing knowledge used to define the wavelet shape of the filters at the beginning of the training.
Fig. 3. Difference between the cosine similarity of each end-to-end model and its counterpart using hand-crafted features. The bold line denotes the moving average with a window of size 3.
We did not notice significant improvements coming from end-to-end optimization, so we wanted to verify the hypothesis of different dataset size requirements for different optimization methods. Therefore, the differences between end-to-end models and their hand-crafted features counterparts for several training dataset sizes are presented in Fig. 3. In some cases, end-to-end models increase the cosine similarity faster than the models using fixed features, so the gap between models can be reduced for approaches using random weights initialization. However, only for models initialized to wavelets and optimized directly can an improvement over hand-crafted features be observed for some points (up to 0.05 cosine similarity for the right hand dataset). When comparing CFO and standard E2E optimization in Fig. 4, a higher effectiveness of CFO for small training datasets can be observed. CFO may limit overfitting, as the functions represented by the convolutional filters are constrained to the wavelet family. It may be interpreted as an additional optimization constraint imposed on the model parameters. Note that the diminished gap between CFO and standard end-to-end optimization in Fig. 4 shows only a relative decrease of CFO performance.
End-to-End Deep Learning for ECoG Brain-Computer Interface
Fig. 4. Difference between the cosine similarity of the CFO model and its counterpart using constraint-free end-to-end optimization. The bold line denotes the moving average with a window of size 3.
3.1 Filters Visualization
Fig. 5. Visualized filters before (blue) and after (red) training for the models with parameters optimized freely. Note that only the real part of each wavelet is visualized, for clarity. Plot titles denote the central wavelet frequency at initialization. (Color figure online)
We visualized the filters before and after training to analyze the characteristics of the learned feature extraction. In Fig. 5, we present the filters modified without additional constraints. The biggest change can be observed in the central frequencies between 30 Hz and 80 Hz. In most cases, the initial central frequency was maintained, while the wavelets were extended with a signal similar to a sine wave at the central wavelet frequency. This could indicate the importance of information about the frequencies of which the signal is composed. At the same time, extending wavelets reduces the temporal resolution of the signals. The changes in the high-frequency wavelets (>100 Hz) are less significant, and the pattern of extending wavelets is no longer visible. Instead, components of significantly lower frequencies and smaller amplitude were added. In Fig. 6, we visualize the filters before and after optimization when the first convolutional layer was initialized randomly. As randomly initialized filters were much harder to analyze visually, we present them in the form of power spectra, so that the changes in the filtered frequencies are better visible. All filters have a maximum power peak below 65 Hz, with 40% of the maxima contained in the 25–30 Hz frequency range. Compared to handcrafted features, end-to-end filters initialized randomly covered only approximately half of the frequency band analyzed by the fixed handcrafted feature extraction pipeline. However, in the higher frequencies there are smaller peaks which can also contribute to the extracted representation and may cover the missing frequency band.
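The Fig. 6 analysis, locating each filter's maximum-power frequency from its spectrum, can be sketched as follows. This is a naive DFT sketch of our own, not the paper's pipeline, and the sampling rate in the check is an assumption:

```python
import cmath
import math

def peak_frequency(taps, fs_hz):
    """Frequency (Hz) of the maximum-power bin of a filter's spectrum,
    computed with a naive DFT over the positive-frequency bins (DC skipped)."""
    n = len(taps)
    best_k, best_power = 1, -1.0
    for k in range(1, n // 2 + 1):
        x = sum(taps[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        power = abs(x) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return best_k * fs_hz / n

# Sanity check: a pure 50 Hz cosine sampled at 500 Hz peaks exactly at 50 Hz.
fs = 500.0
taps = [math.cos(2 * math.pi * 50.0 * t / fs) for t in range(100)]
```

An FFT would of course be used for real filter banks; the naive DFT keeps the sketch dependency-free.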
Fig. 6. Power spectra of filters before (blue) and after (red) training for the convolutional layer initialized randomly. The plot titles denote the frequencies at which the maximum power was observed before and after training. (Color figure online)
In Fig. 7a, we present the difference between the initial central wavelet frequency and the one obtained after training. We observed a decrease in almost all frequencies when training the models, and the decrease was larger for higher frequencies. This may suggest that more information can be extracted from lower frequencies. However, in our preliminary results, we noticed that adapting the learning rate of the convolutional layer may significantly change the frequency behavior (see Fig. 7b), which should be taken into account when analyzing the results. This may lead to different changes in the central frequencies than in the base model. The gradient was increased 150 times by squeezing the central frequencies from 10–150 Hz to 0–1. In the case of wavelet initialization, the network may start training near a local minimum, found by the manual design of feature extraction, that is hard to escape. Setting a higher learning rate may enable reaching different regions of the loss function surface. The performance achieved with a higher learning rate was similar to the standard CFO results, with a cosine similarity of 0.283 ± 0.014 (left hand) and 0.270 ± 0.011 (right hand) for CNN+LSTM+MT, and 0.262 ± 0.01 (left hand) and 0.227 ± 0.007 (right hand) for MLP.
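The gradient amplification from squeezing the frequency range follows directly from the chain rule: if the trainable variable u lives in [0, 1] and the central frequency is an affine function of it, the gradient with respect to u is the frequency-range width times the gradient with respect to the frequency in Hz. A toy numerical check of our own (a linear 10–150 Hz mapping gives a factor of 140; the paper quotes 150 for its exact mapping):

```python
# Reparameterize f = lo + (hi - lo) * u, with u the trainable variable in [0, 1].
# By the chain rule, dL/du = (hi - lo) * dL/df: squeezing 10-150 Hz into [0, 1]
# scales the gradient on the frequency parameter by the range width.

def loss(f_hz):
    return (f_hz - 60.0) ** 2              # toy loss with a minimum at 60 Hz

def num_grad(fn, x, eps=1e-6):
    """Central finite-difference gradient."""
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)

lo, hi = 10.0, 150.0
u = 0.2                                    # coordinate in the squeezed space
f = lo + (hi - lo) * u                     # corresponding frequency: 38 Hz
g_f = num_grad(loss, f)                    # gradient w.r.t. frequency in Hz
g_u = num_grad(lambda v: loss(lo + (hi - lo) * v), u)   # gradient w.r.t. u
scale = g_u / g_f                          # ~ hi - lo = 140
```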
Fig. 7. Difference between central wavelet frequencies before and after CFO. Models for left hand translation are presented in the left column, models for right hand translation in the right column. Note that the scale is different for (a) and (b).
3.2 Target Perturbation
In the case of perturbed ground truth (Fig. 8), CNN+LSTM+MT models were more robust to noise in the targets, with increased stability (especially for the left hand) of the handcrafted features and CFO models compared to models optimized freely. For MLP models, on the other hand, almost no differences between optimization methods in the influence of noise on the performance were noticed.
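The perturbation protocol, as we read it, replaces a given fraction of the target vectors with random values. The sketch below shows the shape of such an experiment; the function name and the uniform noise range are our assumptions, not the paper's exact procedure:

```python
import random

def perturb_targets(targets, noise_level, rng=None, scale=1.0):
    """Replace a fraction `noise_level` of target vectors with random ones.

    A sketch of the Fig. 8 protocol: each selected observation gets its 3D
    translation target replaced by uniform noise in [-scale, scale].
    """
    rng = rng or random.Random(0)
    n = len(targets)
    chosen = rng.sample(range(n), int(noise_level * n))
    out = [list(t) for t in targets]
    for i in chosen:
        out[i] = [rng.uniform(-scale, scale) for _ in targets[i]]
    return out

clean = [[0.0, 0.0, 0.0] for _ in range(100)]
noisy = perturb_targets(clean, 0.4)      # 40% of labels perturbed, as in Fig. 8
```

A model is then trained and evaluated against the unperturbed targets to measure robustness, which is how we understand the noise-level axis of Fig. 8.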
Fig. 8. Influence of noise in the targets on the models' performance. The noise level indicates the fraction of observations with perturbed labels.
4 Discussion
We proposed several approaches for the end-to-end optimization of deep learning ECoG decoders. However, in this study, we did not observe an improvement from end-to-end optimization, especially when no prior knowledge was used for filter initialization. This confirms the usefulness of handcrafted features and years of neuroscientific signal processing, while leaving the door open to more sophisticated end-to-end models. Firstly, deeper models with more advanced DL mechanisms [6,13] should be evaluated, as they may allow for the extraction of more complex representations and thus outperform handcrafted features. Secondly, machine learning methods for robust learning may be evaluated, e.g., learning from noisy input data, noisy labels, and out-of-distribution samples [7]. These can particularly tackle problems coming from specific recording/experimental circumstances.

The reasoning behind our study is focused on the specificity of ECoG brain signals and the adequacy of the selected DL methods to the problem. The specificity originates from experimental constraints caused by the presence of a human in the loop, but also from signal characteristics, hardware capabilities, etc. It results in a distorted dataset with a low signal-to-noise ratio, short signal stationarity intervals, and uncertain labels. This is quite different from computer vision problems, which usually have well-defined labels and images understandable to the naked eye. Improving information extraction from noisy data may be especially important in light of the increased robustness to noise in targets shown by the CNN+LSTM+MT model compared to MLP. Using all 10 targets recorded during a 1-s window decreases the influence of single perturbed points on the performance, because the information can be efficiently extracted even with 40% or 60% of targets perturbed. In this case, the CNN+LSTM+MT model using handcrafted features maintains high performance for a higher noise level than the end-to-end model.
However, an important point in the discussion is that our dataset, even after data cleaning, still contains a significant, unknown amount of observations with incorrect labels. Thus, in Fig. 8, a noise level equal to zero corresponds to an unknown noise level in labels originating from the experimental setup. Generative models should therefore be used to create datasets with a known level of noise and to analyze the influence of perturbations on the performance in the case of less distorted datasets.

All the results were computed offline on datasets recorded with only one patient. Such datasets are hardly accessible due to experimental and legal constraints, which makes the generalization of the results to other patients and datasets hard to estimate. Thus, more simulations should be performed to confirm our conclusions, ideally with more patients and tasks. This should also include a hyperparameter search (e.g., learning rate, batch size, weight decay), as those could vary between approaches. However, performing hundreds of evaluations is time-consuming, and the problem is magnified in the case of end-to-end models due to the increased computational load.

Our study focused on feature extraction based on the wavelet transform, which was previously used for this problem. As we optimized the parameters of the wavelet transform without changing other parts of the model, we isolated the influence of end-to-end optimization on the models' performance. While this simplified the problem, our study did not evaluate other feature extraction pipelines, which could behave differently. Thus, an extended analysis of several feature extraction pipelines compared to their end-to-end counterparts would allow for broader generalization and is therefore of great interest. While both this article and [20] analyzed ECoG signals, the targets used for training models in [20] were actual finger trajectories recorded while subjects performed real movements. In our case, targets are much noisier due to the lack of labeling based on hand movements of a tetraplegic patient. This may favor handcrafted features, as can be seen for CNN+LSTM+MT in Fig. 8.
Finally, our conclusions are in line with [20], who observed a relatively small improvement from optimizing handcrafted features, and worse performance and longer training time when initializing the model randomly. In our case, end-to-end models achieved the same performance as models using CWT features only with smart weight initialization, which emphasizes the importance of prior signal processing knowledge in designing DL for ECoG analysis.

Acknowledgement. Clinatec is a Laboratory of CEA-Leti at Grenoble and has statutory links with the University Hospital of Grenoble (CHUGA) and University Grenoble Alpes (UGA). This study was funded by CEA (recurrent funding), the French Ministry of Health (Grant PHRC15150124), Institut Carnot, and Fonds de Dotation Clinatec. Matthieu Martin was supported by the cross-disciplinary program on Numerical Simulation of CEA. Maciej Śliwowski was supported by the CEA NUMERICS program, which has received funding from the European Union's Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No 800945 (NUMERICS, H2020-MSCA-COFUND-2017).
References

1. Benabid, A.L., et al.: An exoskeleton controlled by an epidural wireless brain-machine interface in a tetraplegic patient: a proof-of-concept demonstration. Lancet Neurol. 18(12), 1112–1122 (2019). https://doi.org/10.1016/S1474-4422(19)30321-7
2. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Burstein, J., Doran, C., Solorio, T. (eds.) Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/n19-1423
3. Eliseyev, A., et al.: Recursive exponentially weighted N-way partial least squares regression with recursive-validation of hyper-parameters in brain-computer interface applications. Sci. Rep. 7(1), 16281 (2017). https://doi.org/10.1038/s41598-017-16579-9
4. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Pereira, F., Burges, C., Bottou, L., Weinberger, K. (eds.) Advances in Neural Information Processing Systems, vol. 25. Curran Associates, Inc. (2012). https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf
5. Lawhern, V.J., Solon, A.J., Waytowich, N.R., Gordon, S.M., Hung, C.P., Lance, B.J.: EEGNet: a compact convolutional neural network for EEG-based brain-computer interfaces. J. Neural Eng. 15(5), 056013 (2018). https://doi.org/10.1088/1741-2552/aace8c
6. Lee, Y.E., Lee, S.H.: EEG-transformer: self-attention from transformer architecture for decoding EEG of imagined speech (2021). https://doi.org/10.48550/ARXIV.2112.09239
7. Li, J., Xiong, C., Hoi, S.C.: Learning from noisy data with robust representation learning. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9465–9474 (2021). https://doi.org/10.1109/ICCV48922.2021.00935
8. Liang, N., Bougrain, L.: Decoding finger flexion from band-specific ECoG signals in humans. Front. Neurosci. 6, 91 (2012). https://doi.org/10.3389/fnins.2012.00091
9. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International Conference on Learning Representations (2019). https://openreview.net/forum?id=Bkg6RiCqY7
10. Mestais, C.S., Charvet, G., Sauter-Starace, F., Foerster, M., Ratel, D., Benabid, A.L.: WIMAGINE: wireless 64-channel ECoG recording implant for long term clinical applications. IEEE Trans. Neural Syst. Rehabil. Eng. 23(1), 10–21 (2015). https://doi.org/10.1109/TNSRE.2014.2333541
11. Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, vol. 32, pp. 8024–8035. Curran Associates, Inc. (2019). https://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
12. Roy, Y., Banville, H., Albuquerque, I., Gramfort, A., Falk, T.H., Faubert, J.: Deep learning-based electroencephalography analysis: a systematic review. J. Neural Eng. 16(5), 051001 (2019). https://doi.org/10.1088/1741-2552/ab260c
13. Santamaría-Vázquez, E., Martínez-Cagigal, V., Vaquerizo-Villar, F., Hornero, R.: EEG-inception: a novel deep convolutional neural network for assistive ERP-based brain-computer interfaces. IEEE Trans. Neural Syst. Rehabil. Eng. 28(12), 2773–2782 (2020). https://doi.org/10.1109/TNSRE.2020.3048106
14. Schalk, G., et al.: Decoding two-dimensional movement trajectories using electrocorticographic signals in humans. J. Neural Eng. 4(3), 264–275 (2007). https://doi.org/10.1088/1741-2560/4/3/012
15. Schirrmeister, R.T., et al.: Deep learning with convolutional neural networks for EEG decoding and visualization. Hum. Brain Mapp. 38(11), 5391–5420 (2017). https://doi.org/10.1002/hbm.23730
16. Śliwowski, M., Martin, M., Souloumiac, A., Blanchart, P., Aksenova, T.: Decoding ECoG signal into 3D hand translation using deep learning. J. Neural Eng. 19(2), 026023 (2022). https://doi.org/10.1088/1741-2552/ac5d69
17. Tabar, Y.R., Halici, U.: A novel deep learning approach for classification of EEG motor imagery signals. J. Neural Eng. 14(1), 016003 (2016). https://doi.org/10.1088/1741-2560/14/1/016003
18. Tietz, M., Fan, T.J., Nouri, D., Bossan, B., skorch Developers: skorch: a scikit-learn compatible neural network library that wraps PyTorch (2017). https://skorch.readthedocs.io/en/stable/
19. Volkova, K., Lebedev, M.A., Kaplan, A., Ossadtchi, A.: Decoding movement from electrocorticographic activity: a review. Front. Neuroinform. 13, 74 (2019). https://doi.org/10.3389/fninf.2019.00074
20. Xie, Z., Schwartz, O., Prasad, A.: Decoding of finger trajectory from ECoG using deep learning. J. Neural Eng. 15(3), 036009 (2018). https://doi.org/10.1088/1741-2552/aa9dbe
Quantum Circuit Compilation for the Graph Coloring Problem

Angelo Oddi1(B), Riccardo Rasconi1, Marco Baioletti2(B), Vieri Giuliano Santucci1, and Hamish Beck3

1 Institute of Cognitive Sciences and Technologies (ISTC-CNR), Rome, Italy
{angelo.oddi,riccardo.rasconi,vieri.santucci}@istc.cnr.it
2 University of Perugia, Perugia, Italy
[emailprotected]
3 Advanced Concepts Team, ESA European Space Research and Technology Centre, Noordwijk, The Netherlands
Abstract. In this work we investigate the performance of greedy randomised search (GRS) techniques applied to the problem of compiling quantum circuits that solve instances of the Graph Coloring problem. Quantum computing uses quantum gates that manipulate quantum bits (qubits). A quantum circuit is composed of a number of qubits and a series of quantum gates that operate on those qubits, and whose execution realises a specific quantum algorithm. Current quantum computing technologies limit the qubit interaction distance, allowing the execution of gates between adjacent qubits only. This has opened the way to the exploration of techniques aimed at guaranteeing nearest-neighbor (NN) compliance in any quantum circuit through the addition of a number of so-called swap gates between adjacent qubits. In addition, technological limitations (the decoherence effect) impose that the overall duration (i.e., depth) of the quantum circuit realization be minimized. One core contribution of the paper is the application of an upgraded version of the greedy randomized search (GRS) technique originally introduced in the literature, which synthesises NN-compliant quantum circuit realizations starting from a set of benchmark instances of different size belonging to the Quantum Approximate Optimization Algorithm (QAOA) class, tailored for the Graph Coloring problem. We propose a comparison between the presented method and the SABRE compiler, one of the best-performing compilation procedures present in Qiskit, an open-source SDK for quantum development, both from the CPU efficiency and from the solution quality standpoint.

Keywords: Randomized search · Quantum circuit compilation · Planning · Scheduling · Optimization
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
A. Dovier et al. (Eds.): AIxIA 2022, LNAI 13796, pp. 374–386, 2023. https://doi.org/10.1007/978-3-031-27181-6_26
1 Introduction
Quantum algorithms process information represented as qubits, the basic unit of quantum information, and quantum operations (called gates) are the building blocks of quantum algorithms. In order to be run on real quantum computing hardware, quantum algorithms must be compiled into a set of elementary machine instructions (or gates). Since currently available quantum devices suffer from a number of technological problems such as noise and decoherence, it is important that the process carrying out the quantum computation be adapted to the physical limitations of the quantum hardware of interest, by means of a proper compilation. For practical applications, it is essential to make quantum computation able to tackle problem instances of increasingly realistic size. To this aim, the ability to produce compiled quantum circuits of good quality is of paramount importance. In this paper, we focus our efforts on the so-called Quantum Alternate Operator Ansatz (QAOA) algorithms [9] applied on gate-model noisy intermediate-scale quantum (NISQ) processor units [18]. Our approach intends to improve over the compilation algorithms employed in the Qiskit quantum computing software development kit [1], and to devise solutions that are easily adaptable to different classes of problems. In the NISQ era, the leading quantum processors are characterized by about 50 to a few hundred qubits, but are not advanced enough to reach fault tolerance, nor large or sophisticated enough to continuously implement quantum error correction. The term "noisy" refers to the fact that quantum processors are very sensitive to the environment and may lose their quantum state due to quantum decoherence. The term "intermediate-scale" refers to the relatively small number of qubits and moderate gate fidelity. The term NISQ algorithms refers to algorithms designed for NISQ quantum processors.
For example, the Variational Quantum Eigensolver (VQE) and the Quantum Alternate Operator Ansatz (QAOA) (and its subclass, the Quantum Approximate Optimization Algorithm [6,8]) are hybrid algorithms that use NISQ devices but reduce the calculation load by implementing some parts of the algorithm in usual classical processors. Usually, NISQ algorithms require error mitigation techniques to recover useful data, which however use up precious qubits when implemented. Thus, the creation of a computer with tens of thousands of qubits and sufficient error correction capabilities would eventually end the NISQ era. These "beyond-NISQ" devices would be able, for example, to implement Shor's algorithm for very large numbers and break RSA encryption. Until that point, however, the need to produce circuits runnable on current (or near-future) quantum architectures in a reasonably reliable manner (i.e., counting on noise minimization techniques rather than on error-correcting techniques) will stand. Hence the need to provide quantum circuit compilation procedures that minimize the effects of decoherence by minimizing the circuit's depth.

In this work, we investigate the performance of an upgraded version of the greedy randomized search (GRS) technique [10,16,19], originally introduced in [17], applied to the problem of compiling quantum circuits to emerging quantum hardware. In particular, we experiment on a set of benchmark instances belonging to the Quantum Alternate Operator Ansatz (QAOA) class tailored for the Graph Coloring problem, devised to be executed on top of a hardware architecture inspired by Rigetti Computing Inc. [20]. We compare our algorithm's performance against the SABRE compiler [13], one of the best compilers present in the Qiskit framework, and demonstrate the superiority of our approach.

The paper is organized as follows. Section 2 provides some background information. Section 3 formally describes the problem, whereas Sect. 4 describes the proposed heuristic solving algorithms and the Greedy Randomised Search approach. Finally, an empirical comparison with the results obtained from the SABRE compiler [1] and some conclusions close the paper.
2 Background
Quantum computing is based on the manipulation of qubits rather than conventional bits; a quantum computation is performed by executing a set of quantum gates on the qubits. A gate whose execution involves k qubits is called a k-qubit quantum gate. Current NISQ devices only allow the direct execution of 1-qubit and 2-qubit quantum gates. In order to be executed, a quantum circuit must be mapped onto a quantum chip, which determines the circuit's hardware architecture specification [14]. The chip can be seen as an undirected multigraph whose nodes represent the qubits (quantum physical memory locations) and whose edges represent the types of gates that can be physically implemented between adjacent qubits of the physical hardware (see Fig. 1 for an example of three chip topologies of increasing size). Since a 2-qubit gate requiring two specific qstates can only be executed on a pair of adjacent (NN) qubits, the required qstates must be made nearest-neighbors prior to gate execution. NN-compliance can be obtained by adding a number of swap gates so that every pair of qstates involved in the quantum gates can eventually be made adjacent, allowing all gates to be correctly executed. Figure 2 shows an example of a quantum circuit that only uses the first three qubits of the chip (N = 8) of Fig. 1, assuming that qstates q1, q2 and q3 are initially allocated to qubits n1, n2 and n3. The circuit is composed of four generic 2-qubit gates (i.e., CNOT gates) and one generic 1-qubit gate (i.e., the Hadamard gate). Note that the circuit is not NN-compliant, as the last gate involves two qstates resting on two non-adjacent qubits (n1 and n3). The right side of Fig. 2 shows the same circuit made NN-compliant through the insertion of a swap gate. In this work, we tackle the quantum circuit compilation problem following a scheduling-oriented formulation, as described in the next sections.
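The adjacency reasoning above can be made concrete with a small sketch of our own (the toy 4-qubit line topology and the function names are assumptions, not from the paper): the chip is an adjacency map over qubits, a 2-qubit gate is executable only when its qstates rest on adjacent qubits, and a swap gate exchanges two adjacent qstates, mirroring the Fig. 2 example.

```python
# CHIP: undirected adjacency of a toy 4-qubit line n1 - n2 - n3 - n4.
CHIP = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}

def is_nn_compliant(qa, qb, location):
    """True if qstates qa and qb rest on adjacent qubits.

    `location` maps each qstate to the qubit it currently occupies."""
    return location[qb] in CHIP[location[qa]]

def apply_swap(qa, qb, location):
    """A swap gate exchanges the positions of two adjacent qstates."""
    assert is_nn_compliant(qa, qb, location), "swap itself requires adjacency"
    location[qa], location[qb] = location[qb], location[qa]

loc = {1: 1, 2: 2, 3: 3}                 # qstate i initially rests on qubit i
assert not is_nn_compliant(1, 3, loc)    # gate(q1, q3) not yet executable
apply_swap(1, 2, loc)                    # q1 moves to n2, q2 to n1
assert is_nn_compliant(1, 3, loc)        # gate(q1, q3) now executable
```

A compiler's job, in these terms, is choosing which swaps to insert (and when) so that every gate becomes executable while keeping the circuit shallow.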
In particular, our approach is related to a body of heuristic efforts available in the current literature; see [11,12] for two relatively recent representative works. Even though these papers pursue the same objective, i.e., optimizing the realization of nearest-neighbor compliant quantum circuits, they focus on quantum circuits characterized by pre-ordered non-commutative gates. On the contrary, our approach leverages the parallel nature of the considered planning/scheduling problem, and proposes a greedy randomized algorithm that exploits gate commutativity through a heuristic ranking function for quantum gate selection.

Fig. 1. Three quantum chip designs characterized by an increasing number of qubits (N = 8, 21, 40), inspired by Rigetti Computing Inc. Every qubit is located at a different location (node), and the integers at each node represent the qubit's identifier.
3 The QCC Problem
The problem tackled in this work consists in compiling a given quantum circuit on a specific quantum hardware architecture. To this aim, we focus on the Quantum Alternating Operator Ansatz (QAOA) framework [9], a generalization of the Quantum Approximate Optimization Algorithm (QAOA) circuits [6,8], a class of hybrid quantum algorithms used in the literature to solve problems like MaxCut, while the Graph Coloring problem has received much less attention. The quantum hardware architecture we consider is inspired by the one proposed by Rigetti Computing Inc. [20]. The quantum circuits that solve the benchmark problems considered in this work are characterized by a high number of commuting quantum gates (i.e., gates among which no particular order is superimposed) that allow for great flexibility and parallelism in the solution. This makes the corresponding optimization problem very interesting and gives significant depth minimization potential to limit the circuit's decoherence [21]. The rest of this section is devoted to: (i) describing the Graph Coloring problem and (ii) providing a formulation of the Quantum Circuit Compilation Problem (QCCP).
Fig. 2. Example of a quantum circuit: (a) not NN-compliant; (b) made NN-compliant through the insertion of a swap gate between qubits n1 and n2 just before the last gate, which exchanges the position of their respective qstates. It is implicitly assumed that at the beginning the i-th qstate rests on the i-th qubit.
3.1 The Graph Coloring Problem
Given a graph G(V, E) with n = |V| nodes and m = |E| edges, the objective is to maximize the number of edges in E whose endpoints have different colors, using for each node one among k available colors (k > 2); see Fig. 3a. Similarly to the MaxCut case, the quantum state preparation circuit within the QAOA solving framework for the Graph Coloring problem is divided into the following ordered phases: (i) initial state preparation (INIT block), (ii) phase-shift (PS block), and (iii) mixing (MIX block); see Fig. 3b. Specifically, the initial state preparation phase initializes the quantum states to represent a feasible initial assignment; its objective is to create a superposition with equal coefficients of all the k^n possible colorings (W_N state [4]), following the one-hot encoding [7]. According to the one-hot encoding, k qubits are required to represent the color of each vertex: all but the i-th qubit (1 ≤ i ≤ k) are assigned the value 0, and the i-th qubit, which is assigned the value 1, indicates that the node is colored with color i. As a consequence, in order to solve a Graph Coloring instance characterized by n nodes and k colors following the one-hot encoding, it is necessary to use quantum machines with at least nk qubits. More concretely, the feasible initial state assignment is obtained through a series of controlled-G(p) rotations followed by an inverted CNOT (W_N gates, see Fig. 3c). The analysis of the specific circuitry necessary to develop the W_N quantum state is beyond the scope of this paper; the interested reader may refer to [4]. The PS phase is composed of a series of phase-shift (R_ZZ) gates whose task is counting the edges colored with different colors. For this purpose, an R_ZZ gate (see Fig. 3c) is applied to all the (k^2 − k)/2 combinations of different colors associated with the endpoints of any edge of the graph to be colored.
All the phase-shift gates are commutative, so the compilation process does not need to worry about their order in the final circuit.
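The counting in the paragraph above can be checked with a few lines. This is a sketch of our own; the 0-indexed color convention is ours, while the text indexes colors from 1:

```python
def onehot_qubits(n_nodes, k_colors):
    """One-hot encoding: k qubits per node, hence at least n*k qubits in total."""
    return n_nodes * k_colors

def rzz_per_edge(k_colors):
    """One R_ZZ gate per unordered pair of distinct colors: (k^2 - k) / 2."""
    return (k_colors * k_colors - k_colors) // 2

def encode_color(color, k_colors):
    """One-hot bit pattern for one node (colors indexed from 0 here)."""
    return [1 if i == color else 0 for i in range(k_colors)]

# The 5-node, 8-edge, k = 3 instance of Fig. 3a:
assert onehot_qubits(5, 3) == 15        # qubits needed for the whole graph
assert rzz_per_edge(3) == 3             # (9 - 3) / 2 color pairs per edge
assert encode_color(1, 3) == [0, 1, 0]  # node colored with the second color
assert 8 * rzz_per_edge(3) == 24        # R_ZZ gates in one PS pass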
Finally, the MIX phase implements the rotation of all the k colors on every node of the graph, thus potentially allowing any possible color assignment. The basic component of the MIX phase is the R_XX R_YY (or MIX_XY) gate (see Fig. 3c), applied to each vertex of the graph to be colored, and for each pair of adjacent colors in the graph that represents the color rotation on each vertex. The placement of the MIX_XY gates in the compiled circuit requires some attention, as these gates are only partially commutative (see the next section).
Fig. 3. (a) An example of a Graph Coloring instance with k = 3 colors. (b) Schema of the quantum state preparation circuit within the QAOA framework, composed of the initialization block, the PS (phase-shift) block and the MIX block. (c) Decomposition into unary and binary basic gates of the quantum gates that respectively compose the three blocks.
Figure 3a shows an example of a graph G representing a Graph Coloring problem instance composed of 5 vertices, 8 edges and k = 3 colors. Figure 3b presents the quantum state preparation schema of the QAOA framework, typically composed of the initial qubit allocation block (state initialization), the PS (phase-shift) block and the MIX block. In the Graph Coloring case, each of these three blocks is composed of particular quantum gate aggregations: the W_N, the R_ZZ (phase-shift), and the MIX_XY gates, respectively, shown in Fig. 3c. Generally, the PS and MIX blocks within the QAOA framework can be executed along multiple passes (p) in order to obtain more accurate results; in the context of this work, we consider quantum circuits composed of two passes (p = 2).

3.2 Quantum Gate Compilation Problem
Formally, the Quantum Circuit Compilation Problem (QCCP) is a tuple P = ⟨C0, L0, QM⟩, where C0 is the input quantum circuit, representing the execution of the Graph Coloring algorithm, L0 is the initial assignment of the i-th qstate qi to the i-th qubit ni, and QM is a representation of the quantum hardware as a multigraph.

– The input quantum circuit is a tuple C0 = ⟨Q, VC0, TC0⟩, where: (1) Q = {q1, q2, ..., qN} is the set of qstates which, from a planning & scheduling perspective, represent the resources necessary for each gate's execution (see for example [15], Chap. 15); (2) VC0 = W_N ∪ PS ∪ MIX_XY ∪ {gstart, gend} represents the set of state initialization, phase-shift and mix gate operations that have to be scheduled. Note that all the previous gates are binary, in the sense that they require two qstates, while gstart and gend are two fictitious reference gate operations requiring no qstates. The execution of every quantum gate requires the uninterrupted use of the involved qstates during its processing time, and each qstate qi can process at most one quantum gate at a time. (3) Finally, TC0 is a set of simple precedence constraints imposed on the W_N, PS, MIX_XY and {gstart, gend} sets, such that: (i) each gate in the three sets W_N, PS, MIX_XY occurs after gstart and before gend; moreover, within the same pass: (ii) every PS gate must follow any W_N gate with which it shares a qstate; (iii) any MIX_XY gate must follow any PS gate with which it shares a qstate; (iv) all the PS gates are totally commutative; (v) a partial ordering exists in the MIX_XY set, as follows: MIX_XY is initially partitioned into two sets, MIX_odd and MIX_even, depending on the numbering of their initial qstate; all the gates mix ∈ MIX_odd can commute as they have no qstate in common, and the same applies to all the gates mix ∈ MIX_even, while a precedence is imposed between a mix ∈ MIX_odd and a mix ∈ MIX_even if and only if they share at least one qstate. Between two consecutive passes, no PS gate that belongs to the (i+1)-th pass can be executed before any MIX_XY gate that belongs to the i-th pass if they share at least one qstate.
– L0 is the initial assignment, at the time origin t = 0, of qstates qi to qubits ni.
– QM is a representation of the quantum hardware as an undirected multigraph QM = ⟨VN, E_WN, E_ps, E_swap⟩, where VN = {n1, n2, ..., nN} is the set of qubits (nodes), and E_WN, E_ps and E_swap are sets of undirected edges (ni, nj) representing the sets of adjacent locations to which the qstates qi and qj of the gates WN(qi, qj), ps(qi, qj) or swap(qi, qj) can potentially be allocated. Figure 1 shows an example of quantum hardware.

A feasible solution is a tuple S = ⟨SWAP, TC⟩, which extends the initial circuit C0 to a circuit CS = ⟨Q, VCS, TCS⟩, such that VCS = SWAP ∪ W_N ∪ PS ∪ MIX ∪ {gstart, gend} and TCS = TC0 ∪ TC, where: (i) SWAP is a set of additional swap(qi, qj) gates added to guarantee the adjacency constraints for the set of W_N, PS and MIX_XY gates, and (ii) TC is a set of additional simple precedence constraints such that:

– for each qstate qi, a total order ≺i is imposed among the set Qi of operations requiring qi, with Qi = {op ∈ W_N ∪ PS ∪ MIX_XY ∪ SWAP : op requires qi};
Quantum Circuit Compilation for the Graph Coloring Problem
381
Algorithm 1. Greedy Randomized Search
Require: A problem P, a stopping criterion
  Sbest ← CompileCircuit(P)
  while (stopping criterion not satisfied) do
    S ← CompileCircuit(P)
    if (depth(S) < depth(Sbest)) then
      Sbest ← S
    end if
  end while
  return (Sbest)
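Algorithm 1 amounts to a restart loop that retains the minimum-depth solution. A minimal Python sketch, assuming a time-limit stopping criterion and using a hypothetical `compile_circuit` stand-in for the CompileCircuit() procedure (here it merely returns a random depth; the real procedure is described in Sect. 4.1):

```python
import random
import time

def compile_circuit(problem, rng):
    """Hypothetical stand-in for CompileCircuit() (Algorithm 2): returns a
    randomized solution; for this sketch only its depth matters."""
    return {"depth": rng.randint(10, 30)}

def depth(solution):
    return solution["depth"]

def greedy_randomized_search(problem, time_limit_s=1.0, seed=0):
    """Keep the best (minimum-depth) solution found within the time limit."""
    rng = random.Random(seed)
    best = compile_circuit(problem, rng)
    deadline = time.monotonic() + time_limit_s
    while time.monotonic() < deadline:  # stopping criterion: max time limit
        candidate = compile_circuit(problem, rng)
        if depth(candidate) < depth(best):
            best = candidate
    return best
```

Because each call to CompileCircuit() is randomized, repeated invocations sample different solutions and the loop simply keeps the best one seen so far.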
– all the wN(qi, qj), ps(qi, qj), mixXY(qi, qj) and swap(qi, qj) gate operations are allocated on adjacent qubits in QM;
– the graph ⟨VCS, TCS⟩ does not contain cycles.
Given a solution S, a path between the two fictitious gates gstart and gend is a sequence of gates ⟨gstart, op1, op2, ..., opk, gend⟩, with opj ∈ WN ∪ PS ∪ MIXXY ∪ SWAP, such that gstart ≺ op1, op1 ≺ op2, ..., opk ≺ gend ∈ TC0 ∪ TCS. The length of the path is the number of the path's gates, and depth(S) is the length of the longest path from gstart to gend. An optimal solution S is a feasible solution with minimum depth.
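Since the extended precedence graph is acyclic by construction, depth(S) can be computed as a longest path via a topological traversal. A sketch (our naming, not the authors' code), counting the gates on the path as in the definition above:

```python
from collections import defaultdict, deque

def circuit_depth(gates, precedences):
    """Longest path, measured in number of gates, from 'g_start' to 'g_end'
    in a DAG.  gates: iterable of gate ids including 'g_start' and 'g_end';
    precedences: iterable of (a, b) pairs meaning a must precede b."""
    succ = defaultdict(list)
    indeg = {g: 0 for g in gates}
    for a, b in precedences:
        succ[a].append(b)
        indeg[b] += 1
    # Kahn's algorithm yields a topological order for the DP below.
    queue = deque(g for g in gates if indeg[g] == 0)
    longest = {g: 1 for g in gates}  # a path's length counts the gates on it
    while queue:
        g = queue.popleft()
        for h in succ[g]:
            longest[h] = max(longest[h], longest[g] + 1)
            indeg[h] -= 1
            if indeg[h] == 0:
                queue.append(h)
    return longest['g_end']
```

For example, with precedences g_start ≺ a ≺ b ≺ g_end the depth is 4, since the fictitious gates are counted as part of the path.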
4 A Greedy Randomized Search Algorithm
In this section, we provide a detailed description of the Greedy Randomized Search (GRS) procedure used to compile the circuit introduced in the previous Sect. 3. GRS has traditionally proved to be a very effective method for the resolution of complex optimization problems (such as the QCCP), as it realizes a simple optimization process that quickly guides the search towards good solutions [10,16,19]. GRS is particularly useful in cases where a high-quality solution is needed in a relatively short time. Among other applications, it is particularly suitable for constraint-based scheduling problems, since the QCCP can be reduced to a Planning and Scheduling (P&S) problem [17,21]. Algorithm 1 depicts the complete randomized search algorithm for generating a near-optimal solution; it is designed to invoke the CompileCircuit() procedure until a stopping criterion is satisfied. It essentially realizes an optimization cycle in which a new solution S is computed at each iteration through the CompileCircuit() algorithm, and its depth, depth(S), is compared with the best depth found so far, depth(Sbest). In case depth(S) is smaller than depth(Sbest), the current solution S becomes the new best solution Sbest. The optimization process continues until a stopping condition (generally a maximum time limit) is met, at which point the GRS procedure returns the best solution found. As can be readily observed, the efficacy of the GRS mainly depends on the efficacy of the
382
A. Oddi et al.
Algorithm 2. Compile Circuit
Require: A problem P = ⟨C0, L0, QM⟩
  S ← InitSolution(P)
  t ← 0
  while not all the PS and MIXXY operations are inserted in S do
    op ← SelectExecutableGate(P, S, t)
    if op ≠ nil then
      S ← InsertGate(op, S, t)
    else
      t ← t + 1
    end if
  end while
  return S
CompileCircuit() procedure (described in the following section), which has the task of synthesizing increasingly better solutions.

4.1 Compile Circuit Algorithm
Algorithm 2 is a randomized algorithm that operates on macro-gates containing primitive gates that use at most two qstates. Indeed, Algorithm 2 is in itself a heuristically-based iterative algorithm that implements a constructive methodology where a solution is built from scratch using a randomized ranking heuristic. This heuristic returns a ranking among the gates that takes into account the "neighbouring cost" of all the gates that have yet to be inserted in the solution. At each iteration, a subset of gates that guarantee the fastest realization of the neighbouring conditions of all the remaining gates is generated, and one gate is selected at random from this subset for insertion in the current partial solution. Algorithm 2 takes as input a QCCP problem P = ⟨C0, L0, QM⟩ and proceeds by chronologically inserting in the partial solution S one gate operation at a time, until all the gates in the set WN ∪ PS ∪ MIXXY are in S. Let op ∈ Qi be a general gate operation that involves qstate qi; we define a chain chi = {op ∈ Qi : op ∈ S} as the set of gates involving qi and currently present in the partial solution S, among which a total order is imposed. Let us also define last(chi) as the last gate in the chain chi according to the imposed total order, and nlast(chi) as the QM node at which the last operation in the chain chi terminates its execution. Finally, we define the state of a partial solution as follows. Given a partial solution S, the state LS is the tuple LS = ⟨nlast(ch1), nlast(ch2), ..., nlast(chN)⟩ of QM locations (nodes) where each last chain operation last(chi) terminates its execution. The first step of Algorithm 2 is the initialisation of the partial solution S; in particular, it sets the current state LS to the initial value L0 by initialising the locations of every qstate qi (i.e., of every chain chi) at the time origin¹ t = 0.
¹ It is implicitly assumed that, at the beginning, the i-th qstate is initialized at the i-th location.
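The chains chi and the state LS can be tracked with a small bookkeeping structure. A sketch under the definitions above (the gate name and the location-exchange effect of the swap in the usage lines are illustrative):

```python
class PartialSolution:
    """Tracks, for each qstate q_i, the chain ch_i of gates already inserted
    and the qubit (node) at which its last operation terminates."""

    def __init__(self, initial_locations):
        # L_0: qstate index -> qubit; by assumption q_i starts at node n_i.
        self.chains = {q: [] for q in initial_locations}
        self.state = dict(initial_locations)  # L_S: nlast(ch_i) per qstate

    def insert_gate(self, op, qstates, end_locations):
        """Append op as last(ch_i) for each involved qstate, update L_S."""
        for q in qstates:
            self.chains[q].append(op)
            self.state[q] = end_locations[q]

sol = PartialSolution({1: 1, 2: 2})
# A swap exchanges the locations of the two involved qstates.
sol.insert_gate('swap(q1,q2)', (1, 2), {1: 2, 2: 1})
```

After the insertion, last(ch1) is the swap gate and the state records that q1 now sits at node n2 and q2 at node n1.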
The core of the algorithm is the function SelectExecutableGate(), which returns at each iteration either one of the gates in the set WN ∪ PS ∪ MIXXY, or a swap(qi, qj) gate in the SWAP set necessary to guarantee NN-compliance, as described in the previous Sect. 3. It is a randomized algorithm targeted at minimizing the solution depth; in particular, its implementation is inspired by [3], so that the selection of a gate is based on two criteria: (i) the earliest start time of the gate (a value correlated with depth minimization); (ii) a metric that minimizes the number of swaps. At each iteration, the gate selected by SelectExecutableGate(P, S, t) is inserted in the solution by means of the InsertGate(op, S, t) method. In all time instants t where no quantum gate can be selected for insertion, the current time t is increased (t = t + 1). In particular, SelectExecutableGate() resembles Algorithm 3 of [2] (see page 8) with the following important difference: while the cited Algorithm 3 generates a set of eligible gates Ω and then selects a gate at random on the basis of the proposed pheromone model (see [2]), the SelectExecutableGate() procedure chooses one gate at random following the same strategy proposed in [17]: a set of equivalent gates Ω* is extracted from Ω by identifying the gate op* with the minimal lexicographic heuristic value Δsum(op*) (see [17] for further details on its definition) and by considering equivalent to op* all the gates op such that Δsum(op) = Δsum(op*), i.e., Ω* = {op ∈ Ω : Δsum(op) = Δsum(op*)}. A full description of the procedure SelectExecutableGate() is given in [2]. The randomly selected gate op ∈ Ω* is inserted in the partial solution S at the earliest feasible time as the last operation of the chains relative to the qstates involved in op: last(chi) ← op; subsequently, the state LS of the partial solution is updated accordingly. Algorithm 2 proceeds until a complete solution is built.
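The extraction of Ω* and the random choice within it can be written compactly. A sketch, with `delta_sum` standing in for the Δsum heuristic of [17] (whose actual definition is not reproduced here):

```python
import random

def select_executable_gate(eligible, delta_sum, rng=random):
    """Pick uniformly at random among the gates of minimal Delta_sum value.
    eligible: the set Omega of currently executable gates;
    delta_sum: gate -> heuristic value (lower is better)."""
    if not eligible:
        return None  # no gate executable at the current time t
    best = min(delta_sum(op) for op in eligible)
    equivalent = [op for op in eligible if delta_sum(op) == best]  # Omega*
    return rng.choice(equivalent)
```

The random tie-breaking within Ω* is what makes each CompileCircuit() run sample a different solution for the GRS loop of Algorithm 1.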
5 Experimental Evaluation
We have implemented and tested the proposed ideas leveraging the open-source quantum computing framework Qiskit [1]. Qiskit is a well-known open-source Software Development Kit for working with quantum computers at the level of pulses, circuits and application modules. It allows for the creation, modification, simulation, and optimization of quantum circuits on a set of both simulated and real quantum architectures, and it also makes it possible to test mapping algorithms on arbitrary quantum hardware topologies. Our contribution in this study focuses on the process of quantum circuit compilation with reference to a given hardware topology, with the aim of minimizing the circuit's depth. The proposed procedure was implemented in Python in order to allow its integration within Qiskit. The performance of the algorithm was tested on a benchmark set specifically created to represent the application of quantum computing to the Graph Coloring problem.

5.1 Setup
The benchmark set for the graph colouring circuits is obtained as an extension of part of the N8 benchmark set for the MaxCut problem [21]. Following the
Fig. 4. Comparison between GRS and SABRE
approach in [21], the graphs G for which the optimal coloring assignment needs to be found are randomly generated as Erdős–Rényi graphs [5]. In particular, 100 graphs are generated for the N = 8 qubit case. Half of them (50 problems) are generated by choosing N of the N(N − 1)/2 edges over 7 qstates randomly located on a circuit of size 8 qubits (referred to as utilization u = 90%). The other 50 problems are generated by choosing N edges over 8 qstates (referred to as utilization u = 100%). For the graph colouring benchmark, we only consider the N8 problems with utilization u = 100% whose connected graph contains exactly 7 nodes, assigning three colours (k = 3) to each node of the graph, for a total of 22 graph instance problems. Hence, quantum processors with at least 21 qubits (7 nodes times 3 colours) are necessary for the execution of such instances (see Sect. 3.1). More specifically, we consider a Rigetti-inspired 21-qubit processor and set p = 2 (two PS-mixing passes).

5.2 Results
The Python version of the proposed greedy randomized search (GRS) algorithm compiles a QAOA circuit with the following choices: (i) a one-hot encoding to represent the graph-coloring problems [7], and (ii) a decomposition procedure for the QAOA blocks based on the identification of odd and even MIXXY gates [9,22], as explained in Sect. 3.2. Figure 4 compares the proposed GRS algorithm with the SABRE compiler available in Qiskit (SabreSwap), launched according to its three different
heuristics (basic, lookahead, and decay). The algorithms are compared with respect to the depth of the compiled circuits (the circuit's depth represents the longest path in the compiled circuit graph). For each algorithm, a CPU time limit of 10 seconds is imposed on each run. The results in Fig. 4 make it clear that GRS outperforms SABRE in all of the latter's execution modes. One possible explanation for the superiority of GRS is its capability to better exploit the commutativity rules of the gates in the QAOA-based Graph Coloring quantum circuit instances. Indeed, our algorithm imposes no particular order on the selection of the WN, PS, and MIXXY macro-gates as the solution is built, beyond the precedence constraints originally present in the input quantum circuit, contained in the TC0 set described in Sect. 3.2. As opposed to GRS, SABRE performs the SWAP-addition process reasoning directly on the circuit expressed in terms of basic gates, and it is not capable of changing the order of such gates after the circuit is loaded.
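As a side note, the qubit count of Sect. 5.1 follows directly from the one-hot encoding used in these instances, which dedicates one qubit per (node, colour) pair. A quick sketch of the index arithmetic (our naming, not the benchmark's code):

```python
def onehot_qubit(node, colour, k):
    """Qubit index of the (node, colour) pair under one-hot encoding with
    k colours; nodes and colours are numbered from 0."""
    return node * k + colour

def required_qubits(n_nodes, k):
    """Total qubits needed: one per (node, colour) pair."""
    return n_nodes * k

# 7 nodes with k = 3 colours need 21 qubits, matching the benchmark setup.
assert required_qubits(7, 3) == 21
```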
6 Conclusions
This study focused on quantum computing as an accelerator for the resolution of optimization problems. We have considered compilation techniques for Noisy Intermediate-Scale Quantum (NISQ) devices [18]. In particular, we have explored the Quantum Alternating Operator Ansatz (QAOA) framework [9] for solving optimization problems and studied the quantum circuits for the Graph Coloring reference problem. We have proposed a greedy randomized search (GRS) algorithm targeted at optimizing the compilation of quantum circuits, and defined an original benchmark set for testing compilation algorithms. On the basis of our empirical validation, the proposed GRS algorithm outperforms other compilation algorithms available in the Qiskit framework.

Acknowledgement. This work is the result of an Ariadna study, a joint collaborative research project with the Advanced Concepts Team (ACT) of the European Space Agency (ESA): Meta-Heuristic Algorithms for the Quantum Circuit Compilation Problem, ESA Contract No. 4000134995/21/NL/GLC/my.
References
1. Qiskit: an open-source framework for quantum computing (2021). https://doi.org/10.5281/zenodo.2573505
2. Baioletti, M., Rasconi, R., Oddi, A.: A novel ant colony optimization strategy for the quantum circuit compilation problem. In: Zarges, C., Verel, S. (eds.) EvoCOP 2021. LNCS, vol. 12692, pp. 1–16. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-72904-2_1
3. Chand, S., Singh, H.K., Ray, T., Ryan, M.: Rollout based heuristics for the quantum circuit compilation problem. In: 2019 IEEE Congress on Evolutionary Computation (CEC), pp. 974–981 (2019)
4. Cruz, D., et al.: Efficient quantum algorithms for GHZ and W states, and implementation on the IBM quantum computer. Adv. Quant. Technol. 2(5–6), 1900015 (2019)
5. Erdős, P., Rényi, A.: On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci. 5, 17–61 (1960)
6. Farhi, E., Goldstone, J., Gutmann, S.: A quantum approximate optimization algorithm. arXiv preprint arXiv:1411.4028 (2014)
7. Fuchs, F.G., Kolden, H.Ø., Aase, N.H., Sartor, G.: Efficient encoding of the weighted max k-cut on a quantum computer using QAOA. SN Comput. Sci. 2(2), 89 (2021). https://doi.org/10.1007/s42979-020-00437-z
8. Guerreschi, G.G., Park, J.: Gate scheduling for quantum algorithms. arXiv preprint arXiv:1708.00023 (2017)
9. Hadfield, S., Wang, Z., O'Gorman, B., Rieffel, E., Venturelli, D., Biswas, R.: From the quantum approximate optimization algorithm to a quantum alternating operator ansatz. Algorithms 12(2), 34 (2019)
10. Hart, J., Shogan, A.: Semi-greedy heuristics: an empirical study. Oper. Res. Lett. 6, 107–114 (1987)
11. Kole, A., Datta, K., Sengupta, I.: A heuristic for linear nearest neighbor realization of quantum circuits by swap gate insertion using n-gate lookahead. IEEE J. Emerg. Sel. Topics Circuits Syst. 6(1), 62–72 (2016). https://doi.org/10.1109/JETCAS.2016.2528720
12. Kole, A., Datta, K., Sengupta, I.: A new heuristic for n-dimensional nearest neighbor realization of a quantum circuit. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 37(1), 182–192 (2018). https://doi.org/10.1109/TCAD.2017.2693284
13. Li, G., Ding, Y., Xie, Y.: Tackling the qubit mapping problem for NISQ-era quantum devices. CoRR abs/1809.02573 (2018). https://arxiv.org/abs/1809.02573
14. Maslov, D., Falconer, S.M., Mosca, M.: Quantum circuit placement: optimizing qubit-to-qubit interactions through mapping quantum circuits into a physical experiment. In: Proceedings of the 44th Annual Design Automation Conference, DAC '07, pp. 962–965. ACM, New York, NY, USA (2007). https://doi.org/10.1145/1278480.1278717
15. Nau, D., Ghallab, M., Traverso, P.: Automated Planning: Theory & Practice. Morgan Kaufmann Publishers Inc., San Francisco (2004)
16. Oddi, A., Smith, S.: Stochastic procedures for generating feasible schedules. In: Proceedings of the 14th National Conference on AI (AAAI-97), pp. 308–314 (1997)
17. Oddi, A., Rasconi, R.: Greedy randomized search for scalable compilation of quantum circuits. In: van Hoeve, W.J. (ed.) Integration of Constraint Programming, Artificial Intelligence, and Operations Research, pp. 446–461. Springer International Publishing, Cham (2018). https://doi.org/10.1007/978-3-319-93031-2_32
18. Preskill, J.: Quantum computing in the NISQ era and beyond. Quantum 2, 79 (2018). https://doi.org/10.22331/q-2018-08-06-79
19. Resende, M.G., Werneck, R.F.: A hybrid heuristic for the p-median problem. J. Heuristics 10(1), 59–88 (2004)
20. Sete, E.A., Zeng, W.J., Rigetti, C.T.: A functional architecture for scalable quantum computing. In: 2016 IEEE International Conference on Rebooting Computing (ICRC), pp. 1–6 (2016). https://doi.org/10.1109/ICRC.2016.7738703
21. Venturelli, D., Do, M., Rieffel, E., Frank, J.: Temporal planning for compilation of quantum approximate optimization circuits. In: Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI-17, pp. 4440–4446 (2017). https://doi.org/10.24963/ijcai.2017/620
22. Wang, Z., Rubin, N.C., Dominy, J.M., Rieffel, E.G.: XY mixers: analytical and numerical results for the quantum alternating operator ansatz. Phys. Rev. A 101, 012320 (2020)
Toward a Heterogeneous Multi-robot Framework for Priority-Based Sanitization of Railway Stations
Riccardo Caccavale¹, Mirko Ermini², Eugenio Fedeli², Alberto Finzi¹, Vincenzo Lippiello¹, and Fabrizio Tavano¹,²(B)
¹ Università degli Studi di Napoli "Federico II", via Claudio 21, 80125 Naples, Italy
{riccardo.caccavale,alberto.finzi,vincenzo.lippiello}@unina.it
² Rete Ferroviaria Italiana, Piazza della Croce Rossa 1, 00161 Rome, Italy
{mi.ermini,e.fedeli}@rfi.it, [emailprotected]
Abstract. We present a new framework for the prioritized multi-robot sanitization of railway stations based on Deep Reinforcement Learning. The proposed framework allows us to define teams of robots having different sanitizing strategies/capabilities, e.g., faster robots rapidly sanitizing small areas in cooperation with slower but long-range ones. Here, robot-specific policies are defined in order to accommodate the different capabilities of the single agents, while two global metrics are defined to assess the performance of the overall team. This capability of managing heterogeneous teams is an important requirement for the infrastructure manager Rete Ferroviaria Italiana S.p.A., which plans to verify to what extent different technologies or different strategies can be combined to reduce costs or increase cleaning efficiency. We tested our framework on real data collected by the WiFi network of the main Italian railway station, Roma Termini, comparing its results with those of a similar Deep Reinforcement Learning system where homogeneous robots are employed.
Keywords: Heterogeneous multi-robot system · Deep reinforcement learning · Priority-based sanitization

1 Introduction
The work illustrated in this paper is motivated by a request from the Italian railway infrastructure manager Rete Ferroviaria Italiana, concerned about the spread of the Covid-19 disease in the common areas of railway stations. A recent study [15] shows that in train stations there is a high probability of being infected during a pandemic: passengers gathering in the corridors and on the platforms of stations, eating at restaurants, and getting on trains facilitate the transmission of diseases. The pandemic caused by SARS-CoV-2 has spawned a crisis that has affected the railway sector in a significant way [31], for example, by inducing people to prefer
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
A. Dovier et al. (Eds.): AIxIA 2022, LNAI 13796, pp. 387–401, 2023. https://doi.org/10.1007/978-3-031-27181-6_27
388
R. Caccavale et al.
cars instead of trains [4]. It is then strategic for infrastructure managers (such as the Italian Rete Ferroviaria Italiana) to deploy adequate and modern tools to prevent future contagion inside railway stations [26]. In the last two decades, we have seen the worldwide spread of diseases such as SARS-CoV, MERS-CoV, or COVID-19, with different waves [9,33]. In this regard, disinfectant robot technologies have proven very useful in fighting global pandemics [32] by reducing the number of people involved in the cleaning process and by optimizing sterilization. In this work, we propose a multi-robot sanitizing framework specific to teams of robots capable of cleaning human-populated environments [20] such as railway stations. Our main goal is to provide a strategy in which a team of autonomous and heterogeneous indoor cleaning robots executes the sanitizing activities by exploiting both the specific capabilities offered by the different robots and the information about human presence. The latter knowledge is stored in a shared heatmap representing the most populated areas, retrieved from the station's WiFi infrastructure. The proposed work is an extended version of our previous multi-robot sanitizing framework [3], where heterogeneous agents are now considered. Specifically, we extended the framework by allowing different robots with different features, such as cleaning range, shape of the cleaning area, or speed, to cooperate during the execution of the shared cleaning task. Analogously to [3], we propose a framework based on decentralized deep reinforcement learning, where each robot of the team performs its own learning process. Our claim is that the robot-specific policies produced by this approach permit cooperation between the heterogeneous robots without performance degradation of the overall team. The need to consider heterogeneous robots is relevant for the infrastructure manager. Different typologies of robots with different cleaning strategies can be deployed, for instance, by integrating several small (low-cost) robots, having limited sanitizing capabilities, as an alternative to bigger but more performing ones. The possibility to increase the number of robots in a team with different or less expensive models, without reducing the cleaning performance of the overall system, may be convenient in terms of costs, but also in terms of hardware and software maintenance [10], especially over prolonged usage periods [24]. In the literature, frameworks that simulate heterogeneous teams of robots are often considered and deployed in several different contexts [28]. For instance, in the pursuit-evasion class of games, robots with different capabilities cooperate to catch moving targets in a shared environment, and their pursuit strategies must be adapted with respect to the behavior of the opponent [30,34]. This case is similar to our domain, where clusters of people appear/disappear/move around the station and the robots' cleaning strategy should be adapted accordingly. The benefit of coordinated heterogeneous robots is also emphasized in [35], where different robots (aerial and ground vehicles) are deployed. In the cleaning context, several multi-robot frameworks have been proposed based on Coverage Path Planning (CPP) [11–14,17,21–23]. In these works, each robot is assigned to a specific area of the environment, executing fixed-shape paths (spirals, rectangles, etc.) to cover it. These methods are effective in providing a continuous cleaning service which maximizes the coverage and minimizes the
Heterogeneous Multirobot Framework for PriorityBased Sanitization
389
idleness of the agents, but priority-based cleaning is hardly considered. Priority issues are instead considered in Persistent CPP [16,19,25,29], where robots' paths have to be adjusted in order to ensure that static prioritized locations are visited within a pre-specified time period. These approaches often consider static priorities and a graph-based representation of the environment with only a limited number of nodes. Deep Q-Learning (DQN) methods for sanitization are considered in [11,22], but in a single-robot framework. In contrast, our approach is to dynamically update the behavior of a team of heterogeneous robots by considering the continuous evolution of the positions of the people and of the diffusion of contaminants in the map. Moreover, we are interested in finding a multi-robot sanitization strategy considering heterogeneous teams and high-resolution priorities in very large railway stations. For this reason, we proposed a solution based on Multi-Agent Reinforcement Learning [3] capable of adapting the cleaning strategy to the continuous changes in a very large dynamic environment. In this work, our main contribution is the design of a heterogeneous framework where multiple mobile robots of different characteristics and typologies learn to cooperate during the execution of cleaning tasks. To evaluate the approach, we consider a very large crowded environment from a real railway station, exploiting WiFi information about the distribution of people in order to assess the performance of different heterogeneous teams of robots. We also propose an assessment of a heterogeneous robotic team in a real case study, using a one-day data recording of people's movements inside the Roma Termini station, retrieved from the Meraki Cisco Systems WiFi network. In this context, the empirical results collected show that the performance of the heterogeneous team is comparable to that of the homogeneous team working under the same conditions.
The rest of the paper is structured as follows. In Sect. 2, we describe the architecture of the proposed framework along with its main components and the overall learning process. In Sect. 3, we focus on the experiments about the convergence and performance of the proposed heterogeneous team in comparison with the homogeneous one. Finally, Sect. 4 concludes the paper and discusses future work.
2 The Architecture
The multi-robot DQN approach proposed in this work is an evolution of the decentralized client-server architecture presented in [3], where it is now possible to specify different characteristics for each single robot. In particular, the team is composed of k robots with different capabilities, each endowed with a robot-specific policy. The robots interact with a central system (server) that maintains/updates a shared representation of the station in the form of a heatmap whose hot spots are areas to be sanitized. Specifically, we represent the environment as a 2-dimensional grid-map whose grids outline 1 m² areas of the station. Grids are then associated with a priority level (the heatmap), which depends on the distribution of people in the station and indicates how risky the area is and how urgently the robots should sterilize it. The goal of the agents is to
Fig. 1. Graphical representation of the framework including multiple agents (left), each endowed with agent-specific experience replay buffers and networks, along with a single server (right) that exploits WiFi statistics to provide a heatmap of priorities (red to yellow spots) for the agents. (Color figure online)
suitably navigate the grid-map, cleaning the traversed grids in the process, thus minimizing the risky areas and reducing the level of priority on the overall map. We formalize our domain as a distributed multi-robot Deep Q-Learning problem [1] where a set of agent-specific policies (π1, ..., πk) should be found for the k robots in order to guide each agent toward the cleaning targets. More formally, we define M as the grid-map of the station, X as the set of all obstacle-free grids in the map, S as the set of possible heatmaps (i.e., priority distributions) on the map M, and A as the set of actions available for a single agent, where ai ∈ A drives agent i from the current grid to an adjacent one. The aim is to find, for each robot i, a suitable policy πi : S × X → A associating the agent position xi ∈ X and the distribution of priority in the map s ∈ S to robot-specific actions ai ∈ A, driving the agent from the current grid to the next grid to be sanitized. A representation of the overall architecture is depicted in Fig. 1. The framework includes a team of different typologies of mobile cleaning robots. Every typology is characterized by a different dimension of the area that agents sanitize during their movements in the environment. Each robot communicates with a single shared WiFi server that is responsible for building a heatmap of the Roma Termini railway station. The server updates the heatmap considering the information about the agents' cleaning activities and the (anonymized) data on the location of people, which are used to define the risky areas to be sterilized. The role of each agent is to elaborate the heatmap by means of an agent-specific DQN and to update the local strategy πi considering its specific capabilities, the environmental settings and the priorities in the map.
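The action set A is not spelled out here, but given the 8-neuron output layer described in Sect. 2.2, a natural reading is one action per 8-connected neighbouring grid. A sketch under that assumption (the move encoding is ours):

```python
# The 8 compass moves on the grid-map (an assumption consistent with the
# 8-neuron output layer of the DQN described in Sect. 2.2).
ACTIONS = [(-1, -1), (-1, 0), (-1, 1),
           ( 0, -1),          ( 0, 1),
           ( 1, -1), ( 1, 0), ( 1, 1)]

def apply_action(pos, action_idx, grid_free):
    """Move the agent to the adjacent grid if it is free of obstacles.
    pos: (row, col); grid_free: (row, col) -> bool for obstacle-free grids."""
    dr, dc = ACTIONS[action_idx]
    r, c = pos[0] + dr, pos[1] + dc
    return (r, c) if grid_free(r, c) else pos  # blocked moves leave pos as is
```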
Fig. 2. Planimetry of the Roma Termini shared by Rete Ferroviaria Italiana (a) and the selected occupancy gridmap (b). (Color ﬁgure online)
2.1 Heatmap Definition and Update
The grid-map representing the environment is built from the real planimetry of the Roma Termini railway station, which has been provided to us by the Italian infrastructure manager Rete Ferroviaria Italiana. The area of the station selected for our experiments is depicted in Fig. 2 (yellow box). We selected this area because, on the one hand, it represents the indoor part of the station, where open air and wind cannot attenuate contamination, and, on the other hand, it includes the areas of the station where crowding is most likely. The selected sectors include: access gates for the railway lines, commercial activities such as shops and restaurants, ticket offices, waiting rooms, and luggage storage. Starting from this grid-map, we design a heatmap where populated areas are associated with colored spots (from red to yellow) representing the cleaning priority that the heterogeneous team should take into account during the sanitizing process. More specifically, the resulting heatmap has a dimension of 100 × 172 pixels and a resolution of 1 m² per pixel. During every step of the execution, each robot of the team decides the new position to reach in order to start the cleaning action, depending on its own specific capability. After a movement, each robot cleans, at a fixed cleaning rate (i.e., 4 pixels per step), a cleaning area of fixed dimensions and shape; each robot in the team has its assigned dimension and shape of the cleaning area. This area-cleaning process is simulated by holding the robot in the current pose for a certain number of steps, which depends on its cleaning rate. In our framework, the WiFi server constantly communicates with all the members of the team to update the shared heatmap. Specifically, the server updates the heatmap by removing the cleaning priorities of areas sanitized by the robots, while new priorities are added as colored spots at the positions of newly detected people.
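This server-side bookkeeping can be sketched in a few lines of numpy (a simplification of ours: real cleaning areas have robot-specific shapes and the spreading dynamics are handled separately):

```python
import numpy as np

def server_update(heatmap, cleaned_regions, people_positions, spot_priority=1.0):
    """Zero the priority of sanitized areas, then add hot spots for people.
    heatmap: 2-D array of priorities in [0, 1];
    cleaned_regions: list of boolean masks (one per robot cleaning area);
    people_positions: list of (row, col) grid cells with detected people."""
    for mask in cleaned_regions:
        heatmap[mask] = 0.0  # sanitized area -> priority removed
    for r, c in people_positions:
        heatmap[r, c] = max(heatmap[r, c], spot_priority)  # new hot spot
    return np.clip(heatmap, 0.0, 1.0)
```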
Furthermore, at every step of the execution, the server updates the priorities on the heatmap by simulating the natural spreading and attenuation of contamination over time. This effect is computed from the positions of people (clusters) by modeling the possible spreading of viruses or bacteria with a Gaussian dispersion model [7]. Specifically, we exploit the
periodic convolution of a Gaussian filter N(μ, σ²) every ψ steps, where μ, σ² and ψ are suitable parameters that can be regulated depending on the meters/pixels ratio, the timestep, and the considered typology of spreading (in this work we assume a setting inspired by the aerial diffusion of Covid-19 [27]). In our case, we set μ and σ according to the spreading parameters proposed in [2,8]. An exemplification of the evolution of a heatmap configuration is provided in Fig. 3. The convolution process acts at every step by incrementally reducing the magnitude of the elements of the heatmap matrix, while distributing the priority over a wider area. Notice that in Fig. 3 there are several black areas (0 priority): these are regions of space associated with the static obstacles of the environment (shops, rooms and walls inside the station). These areas are assumed to be always clean, hence unattractive for the robots. When an agent moves with an action ai ∈ A, it sends its new position to the WiFi server. The region of the heatmap in the neighborhood of the newly reached position, within the cleaning area assigned to the agent, is cleaned by the server, which sets the associated priority level to 0 when updating the heatmap.

2.2 Multi-agent Experience Replay and the Learning Process
In our framework, we propose a multi-agent variation of the experience replay method proposed in [1,3,18]. In particular, our training scheme exploits a Distributed Training Decentralized Execution (DTDE) approach [6], where each robot is independent during both the execution phase and the training phase, and its individual policy is updated by considering only its own experience, without explicit information exchange between robots. Our idea is to exploit this DTDE approach to allow robots of different types to cooperate in a heterogeneous team. Robot-specific capabilities are: the travelling speed of the robot in the map (denoted by the movement length in Table 1), the shape and dimensions of the areas that the robots are able to clean after each movement, and the time that the robot takes to clean the reached area (denoted by the cleaning speed in Table 1). In order to ensure that every robot learns from its own experience, each of the k agents is endowed with a specific replay buffer, along with specific target and main DQNs, which are synchronously updated with respect to the position of the agent and to the shared environment provided by the server (see Fig. 1). The target and the main networks are two identical convolutional neural networks composed of the following layers: the first layer is a 2D convolutional layer with 32 filters 8 × 8, strides (4, 4) and ReLU activation; the second is a 2D convolutional layer with 64 filters 4 × 4, strides (2, 2) and ReLU activation; the third is a 2D convolutional layer with 64 filters 3 × 3, strides (1, 1) and ReLU activation; the fourth is a flatten layer; the fifth is a dense layer of 512 neurons, still with ReLU activation; finally, the output layer is a dense layer composed of 8 neurons with linear activation. The input of the neural network is an image with 2 channels of dimensions 100 × 172 pixels.
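As a consistency check, the size of the flattened features feeding the 512-neuron dense layer can be derived from the standard valid-convolution formula ⌊(n − k)/s⌋ + 1 applied to the 100 × 172 input (a quick check of ours, not the authors' code):

```python
def conv_out(n, kernel, stride):
    """Output size of a valid (no-padding) convolution along one dimension."""
    return (n - kernel) // stride + 1

h, w = 100, 172
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:  # the three conv layers
    h, w = conv_out(h, kernel, stride), conv_out(w, kernel, stride)

flattened = 64 * h * w  # 64 filters in the last convolutional layer
print(h, w, flattened)  # 9 18 10368
```

So each agent's network flattens a 9 × 18 × 64 feature map into 10368 values before the two dense layers.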
The first channel contains the heatmap, represented as a matrix where each element is a real number in the interval [0, 1]: 1 is the maximum priority, while 0 means that no cleaning is needed. This matrix can be displayed as a color-coded
Heterogeneous Multirobot Framework for PriorityBased Sanitization
Table 1. Parameters of the framework

Actor               Parameter               Value
Exp. replay         Discount factor γ       0.99
                    Maximum                 1.0
                    Minimum                 0.1
                    Decay                   9 · 10⁻⁷
                    Replay buffer size      10⁴
                    Target network update   10⁴ steps
                    Main network update     4 steps
                    Batch size              32
WiFi server         Refresh period          60 steps
Cluster of people   Diameter                1 px
Long-range robot    Cleaning area           25 px
                    Cleaning speed          4 px/step
                    Movement length         2 px
                    Cleaning shape          Square
Mid-range robot     Cleaning area           17 px
                    Cleaning speed          4 px/step
                    Movement length         2 px
                    Cleaning shape          Hexagon
Short-range robot   Cleaning area           9 px
                    Cleaning speed          4 px/step
                    Movement length         1 px
                    Cleaning shape          Square
Spreading           Diameter                5 px
                    μ                       0
                    σ                       0.9
Environment         Dimensions              100 × 172 px
image (see the map in Fig. 3), where black pixels are associated with 0-priority areas, while colors from red to yellow denote increasingly higher priorities. The second channel x is a binary m × n matrix (100 × 172 pixels in our case) representing the position and size of the cleaning area of the robot in the heatmap: it is 1 for the portions of the environment that are currently in the range of the robot's cleaning effect, and 0 otherwise. To update the networks, we apply the Adam optimizer with learning rate α = 0.00025. A local reward function r_i is defined to permit each agent to evaluate its performance during the cleaning activity in the training process. The local reward function r_i is designed to give a benefit to the agents that reach prioritized areas of the environment (hot points), while there is a penalty if a robot meets a fixed obstacle or an already visited area (cold point) in the heatmap.

R. Caccavale et al.

Fig. 3. Generation of the heatmap from Meraki data. From left to right, the starting geo-referenced Meraki data (a) are converted into a robot-frame heatmap (b), which is then updated by the server through Gaussian convolution after 100 timesteps (c).

In this direction, we first introduce a cumulative priority function cp_i that summarizes the importance of a cleaned area,

    cp_i = Σ_{(j,l)} s_i(j, l) · x_i(j, l)                    (1)

i.e., the sum of the element-wise priorities from matrix s_i over the area sterilized by agent i (where x_i(j, l) = 1). This value is then exploited to define the reward r_i of agent i as follows:

    r_i = cp_i      if cp_i > 0;
          penalty   otherwise.                                (2)

Specifically, when an agent i sanitizes a priority area, the reward is equal to the cumulative value cp_i; otherwise, if no priority is associated with the cleaned area (i.e., cp_i = 0), a negative reward penalty < 0 is earned [5] (we empirically set penalty = −2 for our case studies). This way, agents receive a reward proportional to the importance of the sanitized area, while routes toward zero-priority areas, such as obstacles or clean regions, are discouraged. Notice that, in this framework, when the action of an agent leads to an obstacle (collision), no motion is performed. This behavior penalizes the agent (no further cleaning is performed), thus producing an indirect drive toward collision-free paths. We also define an overall reward function r = Σ_{i=1}^{k} r_i to summarize and evaluate the team performance, as illustrated in Fig. 4.
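Equations (1) and (2) translate directly into code. The following is a minimal NumPy sketch (the function names and the constant name are ours):

```python
import numpy as np

PENALTY = -2.0  # negative reward, empirically set to -2 in the text

def local_reward(s_i, x_i, penalty=PENALTY):
    """Eq. (1)-(2): cumulative priority over the sanitized cells,
    or a fixed penalty when the cleaned area carries no priority."""
    cp_i = float(np.sum(s_i * x_i))  # Eq. (1): element-wise priorities where x_i = 1
    return cp_i if cp_i > 0 else penalty

def team_reward(priority_maps, cleaning_masks):
    """Overall reward r = sum of r_i over the k agents."""
    return sum(local_reward(s, x) for s, x in zip(priority_maps, cleaning_masks))
```

For example, an agent whose cleaning mask covers cells with priorities 0.5 and 0.2 earns 0.7, while an agent whose mask covers only zero-priority cells earns the penalty of −2.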
3 Experiments
In this section, we show how the proposed heterogeneous multi-robot framework can be deployed in a realistic environment. As illustrated in the previous sections, we consider Roma Termini station (the largest and most populated Italian railway station) as the environment for our experiments. The station is endowed with several access points managed through a Cisco Systems Meraki WiFi network platform, which allows remote operators to monitor information about
the presence and the positions of mobile devices (smartphones) all over the station. This information is exploited by the system (WiFi Server) to estimate the distribution of people and then to update the heatmap shared by the heterogeneous team. An example of the distribution of people retrieved from the Meraki system can be found in Fig. 3(a). We assume that the WiFi Server receives an updated distribution of people every hour. The information from the Meraki system is then converted into a heatmap for the robots by associating each location with a priority value proportional to the density of people. Since the information from the Meraki system is geo-referenced, the retrieved values are finally rotated, translated and scaled in order to match the reference frame of the robots (see Fig. 3(b)). Thanks to the collaboration with Rete Ferroviaria Italiana, we obtained an entire day of recordings of the Meraki system (2 September 2021) to be exploited for our experiments. To assess the performance of the proposed heterogeneous framework, we compare it with a similar framework in which a homogeneous team is deployed. Specifically, we consider two teams, both composed of 4 robots: the first team (homogeneous) is composed of 4 mid-range sanitizing robots, while the second team (heterogeneous) is composed of 2 types of agents, namely 2 short-range robots and 2 long-range ones. The parameters (ranges and velocities) of these 3 categories are shown in Table 1. In our tests, all robots have the same cleaning speed, and the two teams have the same total cleaning area. The movement length of each robot, after the conclusion of the sanitization of its cleaning area, is equal to the radius of its own cleaning area. In the first case study, we compared the convergence of the two teams during the training phase by ran