Handling labeled data

This yaplf recipe explains how examples and labeled examples are handled. It requires a basic knowledge of the Python programming language.

A light gray cell denotes one or more Python statements, while the dark gray cell that follows it contains the expected output of those statements. Statements can be executed either in a Python shell or in a Sage shell.

Examples, patterns and labels

Data in machine learning problems usually come in the form of several examples. In its simplest form, an example is merely a sequence of numeric values called a pattern, expressed as a Python list or tuple (including the special case of a sequence containing a single element). yaplf provides the class Example in package yaplf.data (like all other classes described in this section) to deal with generic examples. Instances are created by passing the corresponding pattern as the argument of the constructor:

from yaplf.data import Example
e=Example((1,4,3))

Examples instantiated as in the previous cell have a string representation consisting of the example pattern enclosed in angle brackets, as well as an accessible pattern field:

print e
print e.pattern
<(1, 4, 3)>
(1, 4, 3)

Remember that if a pattern consists of a single numeric value n inside a tuple, Python syntax requires a trailing comma; otherwise the expression (n) is simply evaluated as the numeric value n, which is not a sequence:

print Example((2,))
<(2,)>

Examples often represent particular instances of a given problem together with the corresponding solutions (or approximations of them). Machine learning problems dealing with such examples fall under the general umbrella of supervised learning: the whole process can be described by the metaphor of a teacher showing the solutions of particular instances of a problem, with the aim of letting the learner generalize so as to solve (even approximately) other instances of the same problem. In such cases we speak of labeled examples, each made up of a pattern (the problem instance, or a suitable encoding of it) and a label (the corresponding solution).

yaplf provides the class LabeledExample to represent labeled examples. This class behaves similarly to the class Example introduced above (the latter is actually the parent of the former). As the following cell shows, the differences are that the constructor takes the label as a second argument, the string representation includes the label, and the label is accessible through the label field:

from yaplf.data import LabeledExample
example = LabeledExample((1, 1), 1)
print example
print example.pattern
print example.label
<(1, 1), 1>
(1, 1)
1
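
The parent/child relationship described above can be illustrated with a minimal sketch. The two classes below are hypothetical stand-ins written only to reproduce the behaviour shown in the previous cells; they are not yaplf's actual source, and the names SketchExample and SketchLabeledExample are assumptions made for illustration.

```python
class SketchExample(object):
    """Hypothetical stand-in for Example; not yaplf's actual source."""

    def __init__(self, pattern):
        self.pattern = pattern

    def __str__(self):
        # The pattern enclosed in angle brackets.
        return '<{0}>'.format(self.pattern)


class SketchLabeledExample(SketchExample):
    """Hypothetical stand-in for LabeledExample: a pattern plus a label."""

    def __init__(self, pattern, label):
        # Reuse the parent constructor for the pattern, then add the label.
        SketchExample.__init__(self, pattern)
        self.label = label

    def __str__(self):
        # The pattern and the label, both inside the angle brackets.
        return '<{0}, {1}>'.format(self.pattern, self.label)


example = SketchLabeledExample((1, 1), 1)
print(example)
print(isinstance(example, SketchExample))
```

Under these assumptions a labeled example can be used wherever a plain example is expected, since it still exposes the pattern field.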

Examples are typically gathered in samples, described as sequences of either Example or LabeledExample objects. For instance, a sample describing the logical AND function on all possible pairs of bits uses lists (or tuples) of two binary elements as patterns and a single bit as label:

and_sample = (LabeledExample((1, 1), 1), LabeledExample((0, 0), 0),
  LabeledExample((0, 1), 0), LabeledExample((1, 0), 0))
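
A sample of this kind can also be generated rather than typed by hand. The following is a sketch under stated assumptions: StandInLabeledExample is a hypothetical namedtuple used only so that the code runs without yaplf installed (it has the same pattern and label fields, but not the same string representation), and generated_and_sample is an illustrative name. The patterns come out in the order produced by itertools.product, which differs from the order used above.

```python
from collections import namedtuple
from itertools import product

# Hypothetical stand-in (an assumption, not part of yaplf): it only mimics
# the pattern and label fields of LabeledExample.
StandInLabeledExample = namedtuple('StandInLabeledExample',
                                   ['pattern', 'label'])

# One labeled example per pair of bits; the label is the bitwise AND.
generated_and_sample = tuple(
    StandInLabeledExample((a, b), a & b)
    for a, b in product((0, 1), repeat=2))

# A sample is an ordinary sequence, so the usual idioms apply.
print(len(generated_and_sample))
print([ex.pattern for ex in generated_and_sample if ex.label == 1])
print([ex.label for ex in generated_and_sample])
```

With yaplf available, replacing the stand-in with LabeledExample in the generator expression should give an equivalent sample, since the constructor takes the pattern and the label in the same order.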

In other words, a sample is nothing but a Python sequence of a particular kind of object, and as such it can be accessed through the standard idioms:

print and_sample[1]
<(0, 0), 0>

Generating sample plots

Two-dimensional labeled samples (i.e., samples of LabeledExample objects whose patterns have two elements) can be viewed graphically through the classification_data_plot function (defined in package yaplf.data), which produces a plot where each example is represented by a bullet. The position and colour of each bullet are determined, respectively, by i) the example pattern, interpreted as a point in the two-dimensional Euclidean plane, and ii) particular properties of the example; the default rule uses black for examples whose label is 1 and gray otherwise.

If the Python code is executed in a Sage shell or notebook, classification_data_plot automatically returns a graphics object, which is rendered through a helper application (in the shell) or directly in the notebook:

from yaplf.data import classification_data_plot
classification_data_plot(and_sample)
[plot output]

When using a regular Python shell, the function returns a matplotlib object which can be saved to disk:

from yaplf.data import classification_data_plot
fig = classification_data_plot(and_sample)
fig.savefig('and-plot.png')

For the sake of visualization, the rest of this tutorial assumes that the code is run inside a Sage notebook, so that each graphic output is shown right after the cell which produces it.

The plot returned by classification_data_plot can be fine-tuned through the named arguments color_function, size_function and alpha_function, which control the colour, size and transparency of each bullet, respectively. Each of these arguments can be assigned a function taking a generic example as argument and returning the corresponding style. For instance, the following cell redraws the bitwise AND sample, using a rather transparent green for all the examples labeled by 1 and a more opaque yellow for the remaining sample elements; moreover, the bullet size is chosen according to the pattern position in the Euclidean plane:

classification_data_plot(and_sample,
  color_function=lambda x: ('green' if x.label==1 else 'yellow'),
  size_function=lambda x: (90 if x.pattern<(.5, .5) else 20),
  alpha_function = lambda x: (.4 if x.label == 1 else .8))
[plot output]
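
The size_function above compares the pattern with the tuple (.5, .5). Python compares tuples lexicographically, so the first components decide the result unless they are equal. The following plain-Python sketch (no yaplf needed; the names patterns and sizes are illustrative) shows which patterns of the AND sample receive the larger size:

```python
# Patterns of the AND sample, in the order used in the text.
patterns = [(1, 1), (0, 0), (0, 1), (1, 0)]

# Same rule as the size_function in the cell above. Tuple comparison is
# lexicographic: the first components are compared first.
sizes = {p: (90 if p < (.5, .5) else 20) for p in patterns}

for p in patterns:
    print('{0} -> {1}'.format(p, sizes[p]))
```

Note that (0, 1) gets the larger size even though its second component exceeds .5, because its first component already settles the comparison.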

The classification_data_plot function also works for labeled samples whose patterns have precisely three components. In this case each bullet will be positioned in the three-dimensional Euclidean space:

parity_sample=(LabeledExample((0, 0, 0), 0), LabeledExample((0, 0, 1), 1),
    LabeledExample((0, 1, 0), 1), LabeledExample((0, 1, 1), 0),
    LabeledExample((1, 0, 0), 1), LabeledExample((1, 0, 1), 0),
    LabeledExample((1, 1, 0), 0), LabeledExample((1, 1, 1), 1))
classification_data_plot(parity_sample,
    color_function=lambda x: ('green' if x.label==1 else 'yellow'),
    alpha_function = lambda x: (.2 if x.pattern[0] == 1 else 1),
    size_function = lambda x: (20 if x.pattern[1] == 1 else 40))
[plot output]
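
The labels in parity_sample are the parity of each three-bit pattern, that is, the number of ones modulo 2. As a sketch, the same pattern/label pairs can be generated with itertools.product; parity_pairs is an illustrative name, and plain tuples are used here so that the code runs without yaplf.

```python
from itertools import product

# Every three-bit pattern paired with its parity (number of ones modulo 2).
parity_pairs = [(bits, sum(bits) % 2) for bits in product((0, 1), repeat=3)]

for pattern, label in parity_pairs:
    print('{0} -> {1}'.format(pattern, label))
```

With yaplf available, each pair could presumably be wrapped as LabeledExample(pattern, label) to obtain a sample equivalent to the one written out above.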

When the above code is executed in a Sage notebook, the graphical result is produced using Jmol, and can thus be manipulated interactively.

When a Python shell is used instead, classification_data_plot behaves as before, returning a matplotlib object which can be saved to a file.