# Research themes

My research focuses on the treatment of uncertainty in machine learning problems, with the aim of strengthening the aspects belonging to the fields of computer science and statistics.

The application of supervised machine learning methods in bioinformatics requires selecting, among non-positively labeled data, those items that represent reliable negative examples, that is, excluding entities on which no experiments have been conducted. In [Frasca and Malchiodi, 2017; Frasca and Malchiodi, 2016] this negative selection problem has been tackled using a ranking based on fuzzy membership functions, while [Frasca et al., 2017] proposes an encoding for the available data promoting the negative selection process in the problem of protein function prediction. Finally, a similar procedure has been proposed in [Frasca et al., 2017] for the problem of gene prioritization.
Machine learning models typically start from a labeled sample whose elements are processed homogeneously (that is, each element has the same importance). In [Malchiodi, 2008] the general model of data quality-based learning was proposed. In this model it is possible to associate with each of the available data items a numerical quantification of its importance relative to the remaining data. This model was applied to the problem of classification through Support Vector Machines, both in its linear [Apolloni and Malchiodi, 2006] and kernel-based version [Apolloni et al., 2007]. A first analysis of the performance of these applications has been undertaken both theoretically [Apolloni et al., 2007] and experimentally [Malchiodi, 2009]. Some preliminary applications in the bioinformatics field are described in [Malchiodi et al., 2010]. A similar approach has also been applied to the regression problem in [Apolloni et al., 2010; Malchiodi et al., 2009; Apolloni et al., 2005] and to unbalanced learning in [Malchiodi, 2013b].
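
As a minimal illustration of how per-item importance values can enter a linear Support Vector classifier, the sketch below scales the hinge loss of each example by a quality weight and minimizes the resulting objective by plain subgradient descent. This is only one common way of weighting examples, assumed here for illustration; it is not necessarily the formulation used in the cited papers, and the names `objective`, `train`, `q`, `C`, `lr` and `epochs` are hypothetical.

```python
def objective(w, b, X, y, q, C=1.0):
    """Return 0.5*||w||^2 + C * sum_i q_i * max(0, 1 - y_i*(w.x_i + b)).

    q[i] is an assumed quality weight for example i: a larger value makes
    a margin violation on that example more costly.
    """
    reg = 0.5 * sum(wj * wj for wj in w)
    loss = 0.0
    for xi, yi, qi in zip(X, y, q):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        loss += qi * max(0.0, 1.0 - margin)
    return reg + C * loss


def train(X, y, q, C=1.0, lr=0.01, epochs=200):
    """Subgradient descent on the weighted objective; returns the best (w, b) seen."""
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    best = (objective(w, b, X, y, q, C), list(w), b)
    for _ in range(epochs):
        gw, gb = list(w), 0.0  # the gradient of the regularizer is w itself
        for xi, yi, qi in zip(X, y, q):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:  # only margin violators contribute to the subgradient
                for j in range(d):
                    gw[j] -= C * qi * yi * xi[j]
                gb -= C * qi * yi
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
        val = objective(w, b, X, y, q, C)
        if val < best[0]:
            best = (val, list(w), b)
    return best[1], best[2]


# Toy data: the last example is given half the importance of the others.
X = [(2.0, 2.0), (3.0, 3.0), (-2.0, -2.0), (-3.0, -1.0)]
y = [1, 1, -1, -1]
q = [1.0, 1.0, 1.0, 0.5]
w, b = train(X, y, q)
```

Setting all entries of `q` to 1 recovers the usual homogeneous treatment of the sample, which makes the effect of the weights easy to compare.
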
Several types of learning algorithms have been designed, implemented and analyzed. In particular, [Malchiodi and Legnani, 2014] proposes an improvement of the support vector-based classification algorithms dealing both with partially labeled data and with uncertain labels, while [Malchiodi and Pedrycz, 2013] introduces a learning algorithm for membership functions of fuzzy sets.
The perception of computer science, in society and in primary and secondary education, is often linked almost exclusively to the introduction to specific technological tools rather than to the study and processing of information [Bellettini et al., 2014]. In order to sensitize teachers to a different approach to basic computer science education, a methodology based on interactive laboratories, which is currently being tested, has been proposed in [Bellettini et al., 2012; Bellettini et al., 2013; Bellettini et al., 2014; Bellettini et al., 2014].
The granular computing model, giving information a granular meaning and allowing its analysis and its processing at different abstraction levels, is described in [Apolloni et al., 2008], where its links with machine learning models are analysed. The effects of a fusion of these two models have been studied within the general field of regression, proposing new algorithms based on Support Vector Machines [Apolloni et al., 2008; Apolloni et al., 2006] or on local search techniques [Apolloni et al., 2005].
Bootstrap techniques are based on data resampling models with the aim of approximating the distribution of a population. A specialization of these techniques, initially proposed in [Apolloni et al., 2006] and subsequently refined in [Apolloni et al., 2009; Apolloni et al., 2007], outputs confidence regions for regression curves, avoiding the usual assumptions on the distribution of measurement drifts. The use of this technique to solve linear and nonlinear regression problems is shown in [Apolloni et al., 2008], while [Apolloni et al., 2007] describes some applications to the medical field.
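
The general resampling idea can be sketched with a textbook pairs bootstrap for a regression line, shown below under stated assumptions. It is not the specialized procedure of the cited papers: it yields a pointwise percentile band rather than a simultaneous confidence region, and the function names `fit_line` and `bootstrap_band`, as well as the parameters `n_boot`, `level` and `seed`, are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def bootstrap_band(xs, ys, grid, n_boot=1000, level=0.90, seed=0):
    """Pointwise percentile band for the fitted line at each point of grid.

    (x, y) pairs are resampled with replacement, so no distribution is
    assumed for the noise affecting the observations.
    """
    rng = random.Random(seed)
    n = len(xs)
    curves = []
    while len(curves) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if len(set(bx)) < 2:
            continue  # degenerate resample: the slope is undefined
        a, b = fit_line(bx, by)
        curves.append([a + b * g for g in grid])
    lo_k = int((1 - level) / 2 * n_boot)
    hi_k = int((1 + level) / 2 * n_boot) - 1
    band = []
    for j in range(len(grid)):
        vals = sorted(c[j] for c in curves)
        band.append((vals[lo_k], vals[hi_k]))
    return band
```

Calling `bootstrap_band(xs, ys, grid)` returns one `(lower, upper)` pair per grid point; joining these pairs along the grid gives a band around the fitted line.
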
The task of integrating under a single theoretical model instances of inference problems from statistics (point and interval estimation of distribution parameters) and computer science (estimation of approximation error in machine learning) is tackled in [Apolloni et al., 2006; Apolloni et al., 2005; Apolloni et al., 2002; Apolloni et al., 2002; Apolloni and Malchiodi, 2001; Malchiodi, 2000], building on previously obtained results on sample complexity [Apolloni and Malchiodi, 2001] and describing the Algorithmic Inference model. This model was used with the aim of estimating the risk in classification problems based on Support Vector Machines [Apolloni et al., 2007; Apolloni et al., 2005; Apolloni and Malchiodi, 2002; Apolloni and Malchiodi, 2001], learning confidence regions for regression lines avoiding the typical assumption requiring a Gaussian drift distribution [Apolloni et al., 2005; Apolloni et al., 2002], and learning confidence regions for the risk function of re-occurrence distribution times in particular cancer pathologies [Apolloni et al., 2007; Apolloni et al., 2005; Apolloni et al., 2002].
Systems for scientific computation can be used to run simulations and to analyze mathematical problems from an interactive and incremental point of view. To this end, such systems offer interesting cues for designing educational activities [Bulgheroni and Malchiodi, 2009; Malchiodi, 2008a]. A commercial system of this kind, thoroughly described in [Malchiodi, 2007], has been extended so as to solve purely computational aspects associated with information encoding [Malchiodi, 2006c], remote procedure invocation [Malchiodi, 2006b; Malchiodi, 2006], production of scientific documentation [Malchiodi, 2011], and solutions to optimization [Malchiodi, 2006a] and machine learning problems based on Support Vectors [Malchiodi et al., 2009; Malchiodi et al., 2009], as well as to perform software validation techniques [Malchiodi, 2013a]. The related code has been used to build the simulations in [Apolloni et al., 2007; Apolloni and Malchiodi, 2006]. Moreover, [Malchiodi, 2010a] describes a library handling machine learning problems within an open source system for scientific computation.
Hybrid learning systems are typically organized by coupling sub-symbolic modules (typically based on the neural networks paradigm) with symbolic ones (described in terms of logic circuits). Such a system, having as inputs a set of features describing the available data and extracting their boolean independent components, is described in [Apolloni et al., 2005; Apolloni et al., 2004]. These components, interpreted as truth values, are used to infer logical formulas describing in a symbolic way the relations among the original input data [Apolloni et al., 2006; Apolloni et al., 2003; Apolloni et al., 2002; Apolloni et al., 2000]. This system is applied in [Apolloni et al., 2004] to the problem of emotion recognition on the basis of voice signals, while [Apolloni et al., 2004; Apolloni et al., 2004; Apolloni et al., 2003; Apolloni et al., 2003; Apolloni et al., 2003] describe an application to the monitoring of awareness in car driving on the basis of biosignals, within the research project IST-2000-26091 ORESTEIA (mOdular hybRid artEfactS wiTh adaptivE functIonAlity, funded between 2001 and 2003 by the EC within the fifth framework programme, under the IST-FET initiative). Moreover, [Apolloni and Malchiodi, 2006; Apolloni et al., 2005] study two hybrid systems obtained through the integration of a fuzzy system for the measurement of quality in available data respectively with a linear Support Vector classifier and with a linear regression model.
Within computational learning theory, the structural risk minimization principle addresses the problem of balancing the complexity of a model with its accuracy in describing experimental data. This principle has been applied to classifiers based on logic expressions built in terms of disjunctive and conjunctive boolean normal forms. A simplification algorithm for such forms was developed in [Apolloni et al., 2006; Apolloni et al., 2005; Apolloni et al., 2003; Apolloni et al., 2002; Apolloni et al., 2002], focusing on the stochastic optimization of parameters in fuzzy sets describing the above-mentioned forms.
Within this subject the activities have been focused on the problem of modeling conflicting situations through an approach alternative to that of classical game theory. In particular, these conflicts were modeled in terms of approximating the solution to an NP-hard problem [Apolloni et al., 2006; Apolloni et al., 2003; Apolloni et al., 2002; Apolloni et al., 2002], applying the Algorithmic Inference model in order to assign limited computational resources to two players, subsequently extending this technique to team games [Apolloni et al., 2006]. This model is applied in [Apolloni et al., 2007; Apolloni et al., 2005] to the biological field, while [Apolloni et al., 2010] uses this approach with the aim of correctly dimensioning the running time for learning algorithms based on local error minimization.
The research project ORESTEIA (mOdular hybRid artEfactS wiTh adaptivE functIonAlity, funded between 2001 and 2003 by the EC within the fifth framework programme, under the IST-FET initiative) was grounded on the design, implementation and analysis of intelligent systems for pervasive and ubiquitous computing. These fields are characterized by highly specialized computers devoted to executing specific tasks. These special computers can be produced so as to significantly reduce their size and cost, so that they can be embedded in an environment. Focusing specifically on the awareness detection problem [Kasderidis et al., 2003], a prototype for the detection of driving awareness on the basis of biosignals [Apolloni et al., 2004; Apolloni et al., 2004; Apolloni et al., 2003; Apolloni et al., 2003; Apolloni et al., 2003] has been developed.
Within the research project PHYSTA (Principled Hybrid Systems: Theory and Applications, funded between 1998 and 2000 by the EC within the fourth framework programme, within the TMR initiative), the Algorithmic Inference model described in [Apolloni et al., 2006; Malchiodi, 2000] was applied to the problem of automatic classification of emotions on the basis of vocal signals [Apolloni et al., 2004; Apolloni et al., 2002]. The obtained results were presented at an international school on computational learning within the same research project.
The availability of hardware circuits able to directly process information with the aim of synthesizing it through estimators allows a remarkable shortening of running times. Their use implies a set of constraints basically linked to the architecture of the circuits themselves. The inference-among-gossips model, developed in [Malchiodi, 1996], has been applied within this scope with the aim of obtaining a family of estimators for Bernoulli populations directly implementable on pRAM boards [Apolloni et al., 1997]. The same model has been applied in [Apolloni et al., 2013] to the study of information exchange in social networks.