WPC BY TOPIC

About the project

The Internet grows every day. A large number of web pages are active at this moment, and more are published every day, so web page classification needs to be automated. Several approaches have already been proposed in this area. Most of them use only the text contained in the web pages and ignore their visual content. This work shows that visual content can improve the accuracy of classifications based on text alone.

Text features were extracted from the web pages using the term frequency-inverse document frequency (TF-IDF) method. Two types of visual features were extracted. The first are low-level features: color and edge histograms, Gabor features, and Tamura textures. The second are local SIFT features. Because the number of SIFT features is huge, a dictionary was built using the "Bag-of-Words" method, and this dictionary was used to extract the SIFT features of each web page. After extraction, the features were merged in every possible combination of the three feature types. The Chi-Square method, which selects the best features of a vector, was also applied to see whether it improved the classifications.

Four different classifiers were used. A multi-label classification was implemented, in which unknown web pages were given to the classifiers so that they could predict the main topic of each page. A binary classification was also implemented, in which only the visual features were used, each type separately, to determine whether a web page was a blog or a non-blog. The results are good and show that adding visual content to the text improves accuracy. The best result was obtained with the SVM classifier when only four categories were used: 98% accuracy, with only four of two hundred web pages misclassified.
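
As a rough illustration of the text-feature step, the sketch below computes TF-IDF weights in plain Python. It is not the project's code. The function name tf_idf, the whitespace tokenizer, the sample pages, and the exact weighting (tf = count / document length, idf = log(N / df)) are all assumptions made for this example, and the project may have used a different variant.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Assumed weighting (hypothetical, not confirmed by the project):
    tf = term count / document length, idf = log(N / document frequency).
    """
    # Naive whitespace tokenizer; real web pages would need HTML stripping.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical page texts, for illustration only.
pages = [
    "football match score football",
    "museum painting exhibition",
    "football stadium museum",
]
weights = tf_idf(pages)
```

Under this weighting, a term that appears in every page gets weight zero, while a term concentrated in a few pages gets a higher weight, which is what makes the vectors useful for telling topics apart.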

Authors:
Nuno Gonçalves
João Costa

Statistics

[Per-category statistics for Arts, Games, Health, News, Science, Shopping, Sports. Columns: Requests, Correct, Success rate. No values shown.]