
e1071: cross-validation (CVE) reports perfect separation, hold-out returns a large error


I’m using the Adult dataset that can be found here: http://archive.ics.uci.edu/ml/datasets/Adult

After taking a sample of the dataset, I use the svm function from e1071 to estimate the accuracy of a linear kernel with 10-fold cross-validation.

library(e1071)

adult.df = read.csv("sample_adult.csv")
adult.df$X = NULL
Income = adult.df$Income...50k
summary(svm(formula=factor(Income)~., data=adult.df, type="C-classification",
            cost=1, kernel="linear", cross=10))

This returns:

Number of Classes:  2 

Levels: 
  <=50K  >50K

10-fold cross-validation on training data:

Total Accuracy: 100 
Single Accuracies:
 100 100 100 100 100 100 100 100 100 100 
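
One thing worth checking at this point, assuming adult.df still contains the Income...50k column when svm is called: the response is written as factor(Income), a variable that lives outside the data frame, so the `.` on the right-hand side may expand to every column of adult.df, the label included. A minimal sketch with a made-up three-row data frame (toy values, not the real sample) uses base R's terms() to show how the dot expands under both spellings:

```r
# Toy stand-in for adult.df; the values are invented for illustration.
toy <- data.frame(
  age = c(25, 40, 58),
  Income...50k = c("<=50K", ">50K", "<=50K")
)
Income <- toy$Income...50k  # label copied out of the data frame, as in the question

# Response refers to the outside copy: the label column stays among the predictors.
leaky <- attr(terms(factor(Income) ~ ., data = toy), "term.labels")

# Response refers to the column itself: `.` leaves that column out.
clean <- attr(terms(factor(Income...50k) ~ ., data = toy), "term.labels")

print(leaky)
print(clean)
```

If the label does end up among the predictors, a linear SVM can separate the two classes trivially, which would be consistent with 100% accuracy in every fold.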

However, I’ve implemented a holdout method of testing the accuracy (this is rather tailored to the dataset):

holdout <- function(data, params) {
  # randomize the dataset
  data <- data[sample(1:nrow(data)), ]
  # we use a 50/50 split: train on 50% of the data, test on the other 50%
  training.set = data[1:(nrow(data)/2), ]
  t.income = training.set$Income
  testing.set = data[(nrow(data)/2 + 1):nrow(data), ]
  # train a model on the training set
  model = NULL
  if (is.null(params$degree)) {
    model = svm(formula=t.income~., data=training.set,
                type=params$type, cost=params$cost,
                kernel=params$kernel, cross=10)
  } else {
    model = svm(formula=t.income~., data=training.set,
                type=params$type, cost=params$cost,
                kernel=params$kernel, degree=params$degree, cross=10)
  }
  print(summary(model))
  # test each point in the testing set
  wrong = 0
  for (i in 1:nrow(testing.set)) {
    prediction = predict(model, testing.set[i, ])
    if (prediction != training.set[i, length(training.set)]) {
      wrong = wrong + 1
    }
  }
  return(wrong/nrow(testing.set))
}
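
For comparison, here is a hedged sketch of the same hold-out idea with three changes: the label column is named on the left-hand side of the formula so that `.` cannot include it, the whole test set is predicted in one call, and predictions are compared with the labels of the testing rows rather than the training rows. The function name holdout_sketch and its label argument are hypothetical, and the code assumes the e1071 package is installed.

```r
# Sketch only: `holdout_sketch` and `label` are illustrative names, not part of e1071.
holdout_sketch <- function(data, label, params) {
  # Convert character columns to factors before splitting, so the training
  # and testing halves share the same factor levels.
  data[] <- lapply(data, function(col) if (is.character(col)) factor(col) else col)
  data[[label]] <- factor(data[[label]])  # classification needs a factor response

  idx <- sample(nrow(data))               # shuffled row indices
  n_train <- floor(nrow(data) / 2)        # 50/50 split
  train <- data[idx[seq_len(n_train)], , drop = FALSE]
  test  <- data[idx[-seq_len(n_train)], , drop = FALSE]

  # The label is named in the formula, so `.` expands to the other columns only.
  # Any entries in `params` (type, cost, kernel, degree, ...) are passed through.
  args <- c(list(formula = reformulate(".", response = label), data = train), params)
  model <- do.call(e1071::svm, args)

  # Predict all test rows at once and compare with the TEST labels.
  predicted <- predict(model, newdata = test)
  mean(as.character(predicted) != as.character(test[[label]]))
}
```

A call would look like holdout_sketch(adult.df, "Income...50k", list(type="C-classification", cost=1, kernel="linear")), assuming the label column is named Income...50k as in the code above. Passing params through do.call also removes the need for a separate branch when degree is supplied.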

If I run the holdout on the same SVM:

> holdout(adult.df, list(type="C-classification", cost=1, kernel="linear"))
...
10-fold cross-validation on training data:

Total Accuracy: 100 
Single Accuracies:
 100 100 100 100 100 100 100 100 100 100 



[1] 0.39
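
A rough sanity check on that figure, under two assumptions the post does not confirm: that the model predicts the test rows almost perfectly, and that about 24% of rows are >50K (the approximate share in the full Adult dataset; the sample may differ). The loop compares each prediction with training.set[i, ...] rather than with the matching row of testing.set, and comparing a label with the label of an unrelated row gives a mismatch rate near 2p(1 - p). The simulation below uses made-up labels only:

```r
set.seed(42)
p <- 0.24      # assumed share of the >50K class
n <- 100000    # number of simulated row pairs

# Two independent label vectors, standing in for "true test label"
# and "label of the training row with the same index".
test_labels  <- rbinom(n, 1, p)
train_labels <- rbinom(n, 1, p)

mismatch <- mean(test_labels != train_labels)  # simulated disagreement rate
expected <- 2 * p * (1 - p)                    # closed-form value

print(c(simulated = mismatch, expected = expected))
```

With p = 0.24 the closed-form value is about 0.36, in the same range as the 0.39 reported above, so the hold-out number may say more about the comparison than about the model.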

As you can see, the hold-out error and the cross-validation accuracy disagree completely. I think my holdout code is correct and that the problem lies in how I call the svm function. Any help would be appreciated.

Thank you!

