Document Classification – Tourism Taxonomy

jernej on March 16, 2024

Automatic classification of texts from the TURK tourism corpus: roughly 10 million running words in ~5000 documents.

Properties and most important findings:

26 categories with a very unbalanced distribution,
The majority category covers only about 14% of the data,
No single feature can accurately predict the category on its own (~19% accuracy),
Features are highly dependent on one another (Naive Bayes accuracy = ~28%),
SVM performs best (~59% accuracy),
The largest errors occur on rare categories,
The smallest errors occur on frequent categories (accuracy reaching more than 90%).
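
To make the majority-category baseline and the per-category error pattern concrete, here is a minimal sketch of how both numbers can be computed from gold and predicted labels. The category names and label lists are invented for illustration and are not TURK data; the helper names `majority_baseline` and `per_category_accuracy` are hypothetical as well.

```python
from collections import Counter


def majority_baseline(gold):
    """Accuracy obtained by always predicting the most frequent category."""
    counts = Counter(gold)
    return counts.most_common(1)[0][1] / len(gold)


def per_category_accuracy(gold, predicted):
    """Share of documents in each gold category that were labelled correctly."""
    total = Counter(gold)
    correct = Counter(g for g, p in zip(gold, predicted) if g == p)
    return {category: correct[category] / n for category, n in total.items()}


# Toy labels, invented for illustration only (not taken from the corpus).
gold = ["hotels"] * 6 + ["food"] * 3 + ["caves"] * 1
predicted = ["hotels"] * 6 + ["food", "food", "hotels"] + ["hotels"]

baseline = majority_baseline(gold)
overall = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
by_category = per_category_accuracy(gold, predicted)

print(f"majority baseline: {baseline:.2f}")
print(f"overall accuracy:  {overall:.2f}")
for category, accuracy in sorted(by_category.items()):
    print(f"{category:>8}: {accuracy:.2f}")
```

In this toy setup the frequent category is classified perfectly while the single rare document is missed, which mirrors the pattern reported above: overall accuracy can look reasonable even when rare categories are handled poorly, so per-category figures are worth reporting next to the overall score.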
