Allgemeine Joins in RapidMiner

(English version)

Der letzte Beitrag hat sich mit der Verbesserung von geographischen Joins in RapidMiner beschäftigt. Tatsächlich kann die beschriebene Methode aber auf alle Arten von Joins angewendet werden, die nicht nur für einen kleinen Nutzerkreis interessant sind.

Der eingebaute Join-Operator in RapidMiner erlaubt nur „ist gleich“-Vergleiche. In der Praxis sind aber auch andere Operationen nützlich: z. B. reguläre Ausdrücke, oder „kleiner gleich“.

Der Beispielprozess nutzt generierte Testdaten und wendet eine Liste von regulären Ausdrücken auf sie an. Wie schon bei den geographischen Joins wird dafür ein Scripting-Operator mit einem Groovy-Skript verwendet.

Das Skript muß wie gehabt ziemlich viel Datenverwaltung betreiben, um die resultierende Tabelle aus den beiden Eingangs-Tabelle zu bauen. Bei der Erstellung des Skripts war wichtig, daß es möglichst allgemein verwendbar ist. Das ist in Groovy mit Hilfe der Closure-Syntax möglich: die Join-Funktion wird am Anfang des Skripts als eine Art Konfigurationsvariable festgelegt. Somit sind im Skript nur drei Dinge festzulegen: die beiden Join-Attribute und der Ausdruck, der auf sie anzuwenden ist.

In diesem Beispiel:

es1AttName = "Person";
es2AttName = "regexp";

def joinFunc = { e1, e2 ->
    e1.matches(e2)
}

Die beiden Attributsvariablen aus Example Set 1 und 2 werden am Anfang angegeben. Hier müssen die Attributnamen exakt (inkl. Groß- und Kleinschreibung) angegeben werden.

Die Closure joinFunc nimmt zwei Parameter (e1 und e2) entgegen, das sind die Werte aus den genannten Attributen. Die Funktion prüft in diesem Fall, ob das Person-Attribut dem regulären Ausdruck aus dem zweiten Datensatz entspricht.

Andere Join-Kriterien könnten so lauten:

e1 <= e2 // kleiner-gleich-Vergleich
e1 == e2 || e2 == null // Vergleich mit optional fehlenden Daten

usw.

Für den Vergleich mehrerer Variablen muß das Skript dann doch angepaßt werden.

Das Skript verzichtet aus Performance-Gründen auf eine Prüfung, ob gleichnamige Attribute in beiden Datensätzen existieren. Im Zweifelsfall kann die Eindeutigkeit der Namen mit einem Rename by Replacing vor dem Join-Operator sichergestellt werden.

In SQL verwende ich immer wieder komplexe Joins – jetzt wird Ähnliches auch in RapidMiner möglich sein.

Generic Joins in RapidMiner

The last blog entry presented an improvement of geographic joins in RapidMiner. However, the described method can be applied to all kinds of joins, including those usable for a much larger group of analysts.

The Join operator in RapidMiner supports only the „is equal“ comparison. In reality, other operations like regular expression matching or „less or equal“ can be useful.

The example process uses generated data and joins a list of regular expressions with them. A Groovy script in an Execute Script operator is used, just like in the geographic joins.

Much data wrangling is done in the script as usual to build the resulting table from the incoming example sets. I developed the script with the goal of reusability. The closure syntax in Groovy makes this possible: the join function is specified in a kind of configuration variable at the top of the script. So to reuse the script you just need to set three variables: both join attributes and the expression to apply on them.

This is the configuration part in the example script:

es1AttName = "Person";
es2AttName = "regexp";

def joinFunc = { e1, e2 ->
    e1.matches(e2)
}

The attribute variables from ExampleSet 1 and 2 are given in the first two rows. It is important to specify them exactly (including capitalization).

The closure called joinFunc takes two parameters (e1 and e2), the values of the selected attributes. In this case the function executes a regular expression match on them.

Other possible join criteria:

e1 <= e2 // less-or-equal comparison
e1 == e2 || e2 == null // comparison that allows missing data

and so on.

The script needs to be changed for joins on multiple variables.

For performance reasons the script doesn‘t check for duplicate attribute names in the incoming data sets. If unsure, just use a Rename by Replace operator before the join to make sure the names are unique.

I‘m using complex joins in SQL all the time – it‘s time to make similar things possible in RapidMiner!

(English version)

Update 2020-02: GIS-Funktionalität ist jetzt als Erweiterung verfügbar.

Nach der Einführung und Installation geht es um eine konkrete Aufgabenstellung: den Import von Shapefiles.

Shapefile ist ein etabliertes Dateiformat, das zusätzlich zu den Geodaten auch eine Liste von Attributen der abgebildeten Objekte enthalten kann. Die geographischen Objekte können vom Typ Punkt (Point), Linie (LineString) und Polygon sein, jeweils als einzelne oder als Multi-Objekte (MultiPoint usw.).

Meistens bekommt man Shapefile in Form von Zip-Archiven, die die Shape-Datei selbst (*.shp), die Attributdatebank (*.dbf), die Information zur verwendeten Projektion (*.prj) und weitere enthalten.

Wer keine Shape-Datei zur Verfügung hat, findet z. B. bei Geofabrik oder auf Open-Data-Sites (AT, US) welche.

Der RapidMiner-Prozess fürs Einlesen der entpackten Datei ist gleichzeitig eine gute Einführung in die Datenstrukturen von RapidMiner.

Der Prozess steht hier zum Download bereit. Da der Dateiname als Makro im Prozesskontext (View/Show View/Context) definiert wird, steht dieser Prozess als fertiges „Element” zum Einbinden in eigene Prozesse zur Verfügung; der Filename-Parameter wird im aufrufenden Prozess unter „macros” eingetragen.

Nach dem Einlesen der Shape-Datei werden die Metadaten abgefragt (Layer.schema.fields), danach werden die einzelnen Elemente (Features in der GIS-Terminologie) gelesen:

def shp = new Shapefile("%{FILENAME}")
int fields = shp.schema.fields.size();
...
shp.schema.fields.each{f -&amp;gt;
  ...
  fieldMeta[fld] = f;
}
...
shp.features.each { f -&amp;gt;
  data = new Object[fields];
  fieldMeta.each{ attr -&amp;gt;
    data[fld] = f.get(attr.name).toString();
  }
  fld++;
}

Der RapidMiner-spezifische Teil erzeugt zuerst ein Array mit den Attributen, deren Namen und Datentypen aus den Fields des Shapefile-Schemas aufgebaut werden. Mit dem Attribut-Array wird dann ein ExampleTable erzeugt. Nach dem schrittweise  Befüllen des data-Arrays wird jedes Example mit einem DataRowFactory erzeugt und ans ExampleTable angefügt. Dieses wird am Ende in ein ExampleSet umgewandelt und als erster Output ausgegeben.
Attribute[] attributes= new Attribute[fields];

shp.schema.fields.each{f -&amp;gt;
 if (f.typ == "Long" || f.typ == "Integer") {
 attributes[fld] = AttributeFactory.createAttribute(f.name, Ontology.INTEGER);
 ...
}

MemoryExampleTable table = new MemoryExampleTable(attributes);

DataRowFactory ROW_FACTORY = new DataRowFactory(0);
...
shp.features.each { f -&amp;gt;
 data = new Object[fields];
 fld = 0;
 fieldMeta.each{ attr -&amp;gt;
  if (attr.typ == "Long" || attr.typ == "Integer" || attr.typ == "Double" || attr.typ == "Single") {
   data[fld] = f.get(attr.name);
  } else {
   data[fld] = f.get(attr.name).toString();
  }
  fld++;
 }
 DataRow row = ROW_FACTORY.create(data, attributes);
 table.addDataRow(row);
}

ExampleSet exampleSet = table.createExampleSet();
return exampleSet;

Das Ergebnis des Skript-Aufrufs ist ein normales RapidMiner-ExampleSet, dessen Metadaten aus dem Shapefile stammen. Numerische Attribute sind entsprechend konvertiert, sodaß der Datensatz ganz normal weiterverwendet werden kann. Handelt es sich um eine Geometrie aus Punkten, kann man sogar leicht die X- und Y-Koordinaten extrahieren und den Datensatz darstellen (in diesem Beispiel Orte in Ungarn):
Orte in Ungarn. Shapefile von Geofabrik in RapidMiner importiert.
Daten: © OpenStreetMap Contributors
Shapefile Import into RapidMiner
Update 2020-02: GIS functionality is now available in an extension.
After the introduction and installation we can start to work on an actual task: importing shapefiles.
Shapefile is a popular file format that is able to store geospatial data with additional attributes of each object. Points, LineStrings, Polygons and their Multi-versions (e. g. MultiPoint) are supported.
Usually a “shapefile” is a Zip archive that contains the shape file itself (*.shp), the database of attributes (*.dbf), the projection information (*.prj) and more.
If you don’t have a shapefile for testing yet, you can find some at Geofabrik and on Open Data sites (AT, US).
The RapidMiner process that reads the unpacked file is also a good introduction into the data structures of RapidMiner.
The process is available for download here. The file name is defined in the process context as a macro (View/Show View/Context), so you can use this process as a ready-to-use element in your own processes. Just include the file name to import in the macros entry of the Execute Process operator.
After reading the shapefile, the script determines the metadata of the fields (Layer.schema.fields) and reads the elements (called feature in GIS software).
def shp = new Shapefile("%{FILENAME}")
int fields = shp.schema.fields.size();
...
shp.schema.fields.each{f -&amp;gt;
  ...
  fieldMeta[fld] = f;
}
...
shp.features.each { f -&amp;gt;
  data = new Object[fields];
  fieldMeta.each{ attr -&amp;gt;
    data[fld] = f.get(attr.name).toString();
  }
  fld++;
}

The RapidMiner specific part creates an array for the attributes and reads their names and data types from the Fields of the shapefile Schema. An ExampleTable is created from the attribute array. After filling a data array step by step, it is converted to a data row using a DataRowFactory and appended to the ExampleTable. This table is converted to an ExampleSet and returned as the first output of the operator.
Attribute[] attributes= new Attribute[fields];

shp.schema.fields.each{f -&amp;gt;
 if (f.typ == "Long" || f.typ == "Integer") {
 attributes[fld] = AttributeFactory.createAttribute(f.name, Ontology.INTEGER);
 ...
}

MemoryExampleTable table = new MemoryExampleTable(attributes);

DataRowFactory ROW_FACTORY = new DataRowFactory(0);
...
shp.features.each { f -&amp;gt;
 data = new Object[fields];
 fld = 0;
 fieldMeta.each{ attr -&amp;gt;
  if (attr.typ == "Long" || attr.typ == "Integer" || attr.typ == "Double" || attr.typ == "Single") {
   data[fld] = f.get(attr.name);
  } else {
   data[fld] = f.get(attr.name).toString();
  }
  fld++;
 }
 DataRow row = ROW_FACTORY.create(data, attributes);
 table.addDataRow(row);
}

ExampleSet exampleSet = table.createExampleSet();
return exampleSet;

The script execution returns a normal RapidMiner ExampleSet with metadata and data from the shapefile. Numeric attributes are converted correctly. If you have a Point geometry, you can easily extract the X and Y coordinates and display the data in a scatterplot (in this example all places in Hungary).
Places in Hungary. Shapefile from Geofabrik imported into RapidMiner. 
Data: © OpenStreetMap Contributors

Data scientist – Berater für Analytik

Schlagwort-Archive: scripting

Generic Joins in RapidMiner

Allgemeine Joins in RapidMiner

Generic Joins in RapidMiner

GIS in RapidMiner (2) – Shapefile import

Shapefile Import into RapidMiner

Balázs Bárány, Freelancer