I've only used Weka a bit, and it appears to be an impressive machine learning toolkit. However, its API is quite a pain to deal with.
Consider the following code that I wrote, which translates from one representation of an example to Instance, used by Weka:
public static Instance getInstance(WeatherLabeledFeatureVector featureVector, Instances instances) {
Instance inst = new DenseInstance(instances.numAttributes());
for (int i = 0; i attInfo, int capacity) {
// ...
HashSet<string> names = new HashSet<string>();
// ...
}
That's a pain. It's almost always better to code to interfaces instead of implementations. What if I had a different List implementation I wanted to use? I would be out of luck. Moreover, that HashSet line could be rewritten as:
Set<string> names = Sets.newHashSet();
I know that would introduce a dependency on Guava, but Guava is so fantastically useful that it seems worth it.
Code that looks like this is repeated throughout:
if (instance.classAttribute().isNominal()) {
handleNominal(instance.classAttribute());
} else if (instance.classAttribute().isString()) {
handleString(instance.classAttribute());
}
Perhaps this design could be simplified through use of inheritance and overloading:
handle(instance.classAttribute());
// ...
public void handle(NominalAttribute attr) { }
public void handle(StringAttribute attr) { }
Using a bunch of ints instead of enums reduces type safety and readability:
/** combination rule: Average of Probabilities */
public static final int AVERAGE_RULE = 1;
/** combination rule: Product of Probabilities (only nominal classes) */
public static final int PRODUCT_RULE = 2;
/** combination rule: Majority Voting (only nominal classes) */
public static final int MAJORITY_VOTING_RULE = 3;
/** combination rule: Minimum Probability */
public static final int MIN_RULE = 4;
/** combination rule: Maximum Probability */
public static final int MAX_RULE = 5;
/** combination rule: Median Probability (only numeric class) */
public static final int MEDIAN_RULE = 6;
(Although the comments for what each rule mean are admirable.)
What is this, C? Why split variable declaration and assignments?
public Enumeration listOptions() {
Enumeration enm;
Vector result;
result = new Vector();
enm = super.listOptions();
// …
}
Ternary conditions are great:
if (dist[index] == 0)
result = Utils.missingValue();
else
result = index;
Becomes:
result = dist[index] == 0 ? Utils.missingValue() : index;
Also, I generally tend to always put brackets in control flow statements, so there's no risk of adding another line accidentally:
if (dist[index] == 0)
result = Utils.missingValue();
else
result = index;
foo = bar;
Throwing Exception is never a good idea. If you catch Exception, you may catch all sorts of issues that aren't expected:
/**
* ...
*
* @param forPredictionsPrinting varargs parameter that, if supplied, is
* expected to hold a weka.classifiers.evaluation.output.prediction.AbstractOutput
* object
* ...
* @throws Exception if model could not be evaluated
* successfully
*/
public double[] evaluateModel(Classifier classifier,
Instances data,
Object... forPredictionsPrinting) throws Exception { /* … */ }
This interface also takes arguments of type Object, but specifies that they must be of a specific type. The whole point of Java's static typing is that constraints like that can be encoded at compile time instead of run time.
Finally, classes fail to use get*()
as a naming convention for getters, which makes it harder to learn the API. If you're working with Google code, and you're not quite sure how to get what you're looking for, if you type obj.get
in your IDE, you can be confident that the autocomplete will show you all the getters.
Weka is a great way to quickly drop in some machine learning for a project. But coding to it is a bit of a pain that could easily be cleaned up through adherence to best practices for clean code.