I am trying to classify documents using a bag-of-words (single-word) approach, with an SVM via the RTextTools package in R. My text files look like this:
TRAIN
Text,Label
nalyt appropri boundari boundedspac cauchi consid deriv....,1
altern andor appear climacter climactericwomen clinic......,2
...
TEST
Text,Label
algorithm approach asymptot avail axi belong branch........,1
advers anxieti avail benefit beyond call care clinic.......,2
...
The training and test sets have 850 and 993 rows, respectively. Here is my R script:
train <- read.csv("...")

library(RTextTools)
set.seed(123)

# Build the tf-idf document-term matrix for the training texts
dtMatrix <- create_matrix(train["Text"], weighting = tm::weightTfIdf)
container <- create_container(dtMatrix, train$Label, trainSize = 1:850, virgin = FALSE)

# Grid search over the SVM cost parameter via 5-fold cross-validation
cost_grid <- 10^c(-8, -4, -2, 0, 2, 4, 8)
best_c <- 0
best_acc <- 0
for (var in cost_grid) {
  svm <- cross_validate(container, 5, algorithm = "SVM", cost = var)
  print(var)
  print(svm$meanAccuracy)
  if (svm$meanAccuracy > best_acc) {
    best_acc <- svm$meanAccuracy
    best_c  <- var
  }
}
best_c
best_acc

predictionData <- read.csv("...")

# Train the final model with the best cost, then project the test texts
# onto the training vocabulary (originalMatrix = dtMatrix).
# Note: pass only the Text column, not the whole data frame,
# so the Label column is not tokenized into the matrix.
model <- train_model(container, "SVM", kernel = "linear", cost = best_c)
predMatrix <- create_matrix(predictionData["Text"], originalMatrix = dtMatrix)
predictionContainer <- create_container(predMatrix, labels = rep(0, 993),
                                        testSize = 1:993, virgin = FALSE)
results <- classify_model(predictionContainer, model)

predictionData$svm_label <- results$SVM_LABEL
success <- sum(predictionData$Label == predictionData$svm_label) / nrow(predictionData)
print(success)
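To go beyond a single accuracy number, a confusion matrix shows which class absorbs the errors. A minimal sketch, assuming the script above has already run so that `predictionData` holds both the true and predicted labels (note that `SVM_LABEL` is returned as a factor, so it is safest to convert it explicitly before comparing):

```r
# Cross-tabulate true vs. predicted labels; off-diagonal cells are errors
pred <- as.numeric(as.character(predictionData$svm_label))
conf <- table(True = predictionData$Label, Predicted = pred)
print(conf)

# Per-class recall: diagonal counts over row sums
print(diag(conf) / rowSums(conf))
```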
When I run this code I get good results. However, the data is extremely sparse; when I check predMatrix it reports:
Non-/sparse entries: 27656/19412694
Sparsity : 99%
Maximal term length: 167
Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
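Whether the sparsity matters can be checked empirically: `create_matrix` accepts a `removeSparseTerms` argument (passed through to tm) that drops terms absent from most documents, so the cross-validation can be re-run on a pruned matrix and compared. A sketch, assuming the same `train` data frame as above (the threshold 0.99 is an arbitrary choice for illustration):

```r
# removeSparseTerms = 0.99 drops terms with more than 99% sparsity,
# i.e. keeps only terms appearing in at least ~1% of documents
dtMatrixPruned <- create_matrix(train["Text"],
                                weighting = tm::weightTfIdf,
                                removeSparseTerms = 0.99)
dim(dtMatrixPruned)  # far fewer columns than the full vocabulary

containerPruned <- create_container(dtMatrixPruned, train$Label,
                                    trainSize = 1:850, virgin = FALSE)
cv <- cross_validate(containerPruned, 5, algorithm = "SVM", cost = 1)
print(cv$meanAccuracy)
```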
Does this mean that my results are not reliable? According to this publication, http://www.cs.cornell.edu/people/tj/publications/joachims_98a.pdf , SVMs do not necessarily depend on the number of features, so I did not apply any feature selection. Given what I have done, here is my question:
Does such extreme sparsity affect the reliability of my method? I am classifying texts from two scientific disciplines that are almost entirely unrelated, so, as expected, each field uses many distinct terms. I obtained an accuracy of around 0.90, but I am not sure how to assess this result. I have also tried the same procedure with noun phrases instead of single words and again obtained similarly good results.
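One cheap sanity check, independent of feature selection, is a label-permutation test: shuffle the training labels and re-run the cross-validation. If accuracy on shuffled labels stays high, the pipeline is leaking information; if it drops to roughly chance level (about 0.5 for two balanced classes), that supports the real result. A sketch, assuming the `dtMatrix` and `train` objects from the script above:

```r
set.seed(123)
# Shuffle labels to destroy any real text-label association
shuffledLabels <- sample(train$Label)
containerShuffled <- create_container(dtMatrix, shuffledLabels,
                                      trainSize = 1:850, virgin = FALSE)
cvShuffled <- cross_validate(containerShuffled, 5, algorithm = "SVM", cost = 1)
print(cvShuffled$meanAccuracy)  # should be close to chance level
```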
Note: The texts in both the training and test sets are stemmed and have stop words removed.
To sum it up: when extreme sparsity is present, should I apply feature selection even when an SVM is employed?