Approved
Selection of Negative Examples for Training an SVM Classifier
Robert Toth
Start: 2010-06-04
Presentation: 2011-06-10 13:15
Location: E:2311
Finished: 2011-11-02
Master's thesis:
Abstract
A Support Vector Machine (SVM) is an algorithm that can be trained to distinguish between data belonging to different categories, given some training examples. It is a very general method that can be applied to almost any type of information, as long as it is represented in a form the algorithm can work with; this thesis, however, is restricted to categorization of web pages. During training, the SVM needs pre-categorized texts from both categories. If the goal is to determine whether or not a text belongs to a certain category, one must also find good examples of texts that do not belong to the category – negative examples. The problem is to find such texts. The training examples define the separating hyperplane that will be used to categorize new texts; if they are chosen poorly, the performance of the classifier will suffer. The purpose of this thesis is to evaluate the impact that different strategies for selecting negative training examples have on the performance of linear SVM classifiers for text categorization. The thesis focuses on linear classifiers since they are easier to work with, faster, and have been shown to give high prediction accuracy for document categorization. I will work only with English texts, and the features will be stemmed single words with stop-words removed. Feature weighting will be based on TF-IDF. The tools LIBSVM and/or LIBLINEAR will most likely be used. For evaluation, a pre-categorized test set will be created manually.
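To illustrate the feature-weighting step mentioned in the abstract, the following is a minimal sketch of TF-IDF computed over a toy corpus of already-stemmed tokens with stop-words removed. The abstract only names TF-IDF, so the exact variant here (raw term frequency times log(N / df)) is an assumption for illustration, not necessarily the formulation used in the thesis.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute TF-IDF weights per document.

    TF is the raw term count in the document; IDF = log(N / df(t)),
    where N is the number of documents and df(t) the number of
    documents containing term t. This is one common formulation;
    the thesis does not specify which variant it uses.
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

# Hypothetical corpus of stemmed tokens (stop-words already removed)
docs = [
    ["svm", "train", "text"],
    ["svm", "categor", "text"],
    ["web", "page", "categor"],
]
w = tfidf(docs)
```

Note that a term occurring in every document gets IDF = log(1) = 0, so it contributes nothing to the feature vector – exactly the behaviour that makes removing stop-words beforehand less critical, though still useful for reducing dimensionality.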
Supervisor: Fredrik Andersson (EntireWeb Sweden AB)
Examiner: Anders Ardö (EIT)