Tuesday, December 4, 2012

SQLite DB with Orange Data Mining

The following post has no agenda or moral. It's just a story about stuff I did at work and some code I published.

Not long ago I was asked to bring my data-mining knowledge back from the dead. The project is actually quite fascinating, something in the software security field, and it's a shame I cannot share many more details in this blog.
Anyway, since the budget is low and I tend to prefer open-source solutions, I installed Weka and Orange on my machine. Since the raw data to be processed is stored in a SQLite DB, and since I recalled having a better experience with Weka, I went with the former.
After a few hours of breaking it and trying to get it to produce some decent classification rules, I gave up on it. I guess the fact that I now love Python more than Java (it wasn't always like that) weighed heavily on my decision to stop trying to get Weka to work, so Orange it is.

Unlike my previous experience with Orange, back when I was a student, I figured I should go with the core framework this time - no UI, only the Python command line and scripting with Orange's extensive data-mining libraries.
After refreshing my memory with the tutorials, I felt comfortable with it, yet I realized Orange isn't capable of handling SQLite DBs. Instead, it uses some of the industry's common file formats for data-mining, and TSV. Yes, TSV. But this shouldn't stop me, right?
A few minutes later, I had my first SQLite-to-TSV converter up and running. You can find the sources on GitHub.
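
For the curious, here is a minimal sketch of what such a converter can look like. It is not the code from the repository, just an illustration of the idea using Python's standard sqlite3 and csv modules. The function name sqlite_table_to_tsv and its parameters are made up for this example, and it assumes you want one table dumped as-is with a single header row of column names. Orange's own tab-delimited format has further header conventions (types, class flags) that this sketch doesn't attempt to produce.

```python
import csv
import sqlite3


def sqlite_table_to_tsv(db_path, table, tsv_path):
    """Dump one SQLite table to a tab-separated file with a header row.

    Hypothetical helper for illustration; returns the number of data rows written.
    """
    conn = sqlite3.connect(db_path)
    try:
        # Table names cannot be bound as SQL parameters, so quote the identifier.
        quoted = '"' + table.replace('"', '""') + '"'
        cursor = conn.execute("SELECT * FROM " + quoted)

        # cursor.description holds one tuple per column; the name comes first.
        columns = [col[0] for col in cursor.description]

        rows_written = 0
        with open(tsv_path, "w", newline="") as f:
            writer = csv.writer(f, delimiter="\t", lineterminator="\n")
            writer.writerow(columns)
            for row in cursor:
                # Write NULLs as empty fields rather than the string "None".
                writer.writerow(["" if value is None else value for value in row])
                rows_written += 1
    finally:
        conn.close()
    return rows_written
```

Calling something like sqlite_table_to_tsv("data.db", "samples", "samples.tsv") (file and table names here are placeholders) would then give you a file to adapt for Orange. Going through the csv module rather than joining strings by hand means values containing tabs or quotes get escaped instead of silently breaking the columns.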

From here, everything was a breeze: some data fiddling, some tuning of the algorithms' thresholds, and results started to appear. Cool.