Document Classification – Tourism Taxonomy

Automatic classification of texts from the Touristic corpus – TURK. 10 million running words, ~5000 documents.

Properties and most important findings:

  • 26 different categories, very unbalanced data,
  • The majority category covers only about 14% of the data,
  • No single feature by itself can accurately predict the category (~19% acc),
  • Features are highly dependent on one another, which violates the independence assumption of Naive Bayes (NB acc = ~28%),
  • SVM does the best job (~59% acc),
  • Biggest error on rare categories,
  • Smallest error on frequent categories (accuracy above 90% for some of them).
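
To make two of the figures above concrete, here is a minimal sketch of how a majority-category baseline and a per-category accuracy could be computed. It is not the code used in the experiments: the function names, the category names ("hotel", "transport", "spa") and the toy labels are all made up for illustration.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent category."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


def per_category_accuracy(gold, predicted):
    """For each gold category, the fraction of its documents labelled correctly."""
    total = Counter(gold)
    correct = Counter(g for g, p in zip(gold, predicted) if g == p)
    return {category: correct[category] / total[category] for category in total}


# Hypothetical toy data: one frequent, one medium and one rare category.
gold = ["hotel"] * 6 + ["transport"] * 3 + ["spa"] * 1
predicted = ["hotel"] * 6 + ["transport", "transport", "hotel"] + ["hotel"]

baseline = majority_baseline_accuracy(gold)
by_category = per_category_accuracy(gold, predicted)

print(f"majority baseline: {baseline:.2f}")
for category, accuracy in sorted(by_category.items()):
    print(f"{category}: {accuracy:.2f}")
```

In this toy example the rare category is always misclassified as the frequent one, which is the same pattern as in the findings above: errors concentrate on rare categories, while frequent ones are predicted well. The majority baseline is the number any classifier has to beat to be useful at all.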
