Friday, March 10, 2006

The Nature of the Problem

Structured Data

Chatting with a friend yesterday, it became apparent that what is meant by structured data is not clear. As yet, there is no set of sample data to illustrate what is meant by this.

What is clear is that the structured data is richer than simple multi-dimensional vectors. For example, it is hoped that the data structure or its description allows for the expression of orderings on attributes; short < medium < long.

An established process that works with rich and structured data instances is Inductive Logic Programming (ILP), a process for inducing descriptive statements for a set of logical clauses. An explanation of this working with the well known trains data set is given by this presentation by Peter Flach. Another example is the determination of carcinogenic compounds, and a worked example is presented here.

Minimal Expressiveness

As a general word of caution it should be noted that the language to describe the data or the constraints, should be minimally expressive. Too much expressiveness leads to weak results or uncessary computational complexity.

Distance Metric

The core of the problem then, is not the particular clustering algorithm to be used, but rather the notion of distance between instances of such rich or structured data. Such a metric must take into account the additional constraints such as orderings, or more general relations between attributes, and other constraints that can be expressed. In ILP the expression of such constraints is with logical clauses, the same language for desribing the data instances themselves.

So in general, what is required is a data structure and an associated representation language that allows for the general expression of relations or constraints over the elements of that data structure, as well as the usual value of those elements.