Hello, I am using a linear SVM to train my model and fit a separating line through my data. However, my model always predicts 1 for every example. Here is my code:
print data_rdd.take(5)
[LabeledPoint(1.0, [1.9643,4.5957]),
 LabeledPoint(1.0, [2.2753,3.8589]),
 LabeledPoint(1.0, [2.9781,4.5651]),
 LabeledPoint(1.0, [2.932,3.5519]),
 LabeledPoint(1.0, [3.5772,2.856])]
from pyspark.mllib.classification import SVMWithSGD
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vectors
from sklearn.svm import SVC

# Each row of x_df is (features, label)
data_rdd = x_df.map(lambda x: LabeledPoint(x[1], x[0]))

model = SVMWithSGD.train(data_rdd, iterations=1000, regParam=1)

X = x_df.map(lambda x: x[0]).collect()
Y = x_df.map(lambda x: x[1]).collect()

# Predict each example on the driver, one at a time
pred = []
for i in X:
    pred.append(model.predict(i))
print pred
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
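For context, a linear SVM model turns a raw margin (weights · features + intercept) into a 0/1 label by comparing it with a threshold, which is 0.0 by default in MLlib's SVMModel. The sketch below mimics that rule in plain Python so it can be run without Spark. The weights here are hypothetical values chosen for illustration, not the ones the trained model actually has; the feature vectors are taken from the take(5) output above.

```python
def predict(weights, intercept, features, threshold=0.0):
    """Mimic a linear SVM decision rule: label is 1 when the margin exceeds the threshold."""
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    if threshold is None:
        # No threshold set: return the raw margin instead of a label
        return margin
    return 1 if margin > threshold else 0

# Hypothetical weights, for illustration only (not taken from the real model)
weights = [0.3, 0.2]
intercept = 0.0  # SVMWithSGD.train defaults to intercept=False

points = [[1.9643, 4.5957], [2.2753, 3.8589], [3.5772, 2.856]]

labels = [predict(weights, intercept, p) for p in points]
margins = [predict(weights, intercept, p, threshold=None) for p in points]

print(labels)
print(margins)
```

If all features are positive and the learned weights are positive with no intercept, every margin is positive and every prediction is 1. On the real model, calling model.clearThreshold() makes model.predict return the raw margins, which may help show how far the points sit from the boundary.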
I didn't specify a threshold, but you should be careful when splitting the data: use stratified sampling so you don't end up with only 1 labels in one split and only 0s in the other.
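A minimal sketch of a stratified split, in plain Python so it runs without a Spark cluster. The function name stratified_split, the 80/20 fraction, the seed, and the toy data are all assumptions for illustration; the idea is only that each label is split separately so both splits contain both classes.

```python
import random
from collections import defaultdict

def stratified_split(examples, train_fraction=0.8, seed=42):
    """Split (label, features) pairs so each label keeps roughly the same proportion in both splits."""
    by_label = defaultdict(list)
    for label, features in examples:
        by_label[label].append((label, features))

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)  # shuffle within the class before cutting
        cut = int(round(len(group) * train_fraction))
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test

# Toy data: 10 examples with label 1.0 and 10 with label 0.0
examples = [(1.0, [float(i), float(i) + 1.0]) for i in range(10)]
examples += [(0.0, [-float(i), -float(i) - 1.0]) for i in range(10)]

train, test = stratified_split(examples)
print(len(train))
print(len(test))
```

On an RDD of (label, features) pairs, Spark's sampleByKey with a per-label fractions dict serves a similar purpose, though the sampled sizes are approximate rather than exact.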