P2 – Clustering

This is one of the pre-defined project ideas you can choose for your project.

Clustering Wikipedia articles

Modify your clustering system from Assignment 2 to use Wikipedia articles (90 articles about Programming, 90 about Games). The dataset can be downloaded from the Datasets page.

To use the dataset for clustering, you need to select a number of words and calculate the frequency of each of these words in each Wikipedia article. Using all words from the articles is not recommended, since the similarity calculations would then take a long time. You can, for example, use the following words:

language, programming, computer, software, hardware, data, player, online, system, development,
machine, console, developer, design, history, technology, standard, information, article, example
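
Counting the listed words in one article can be sketched as below. This is a minimal sketch, not a required implementation: the function name word_frequencies, the tokenisation rule (lowercase, letters only, whole-word matches, so "players" does not count as "player") and the sample text are assumptions made for illustration. The sample text is not taken from the dataset.

```python
import re
from collections import Counter

# The 20 example words from the list above.
WORDS = [
    "language", "programming", "computer", "software", "hardware",
    "data", "player", "online", "system", "development",
    "machine", "console", "developer", "design", "history",
    "technology", "standard", "information", "article", "example",
]

def word_frequencies(text, words=WORDS):
    """Return one count per listed word, in the same order as `words`."""
    # Assumption: a token is a run of letters a-z after lowercasing.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return [counts[w] for w in words]

# Made-up sample text, only to show the shape of the result.
sample = ("The player inserts a coin; the arcade machine is a "
          "computer system. Player two waits.")
freqs = word_frequencies(sample)
for word, count in zip(WORDS, freqs):
    if count:
        print(word, count)
```

Each article then becomes a vector with one entry per word, which is the input the clustering steps work on.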

The article Arcade_game would then have the following frequencies:



Grade Requirements
  • Read all articles about programming and games and convert each article to word frequencies using the word list above
  • Perform k-means clustering on the 180 articles using 2 clusters
  • Are the articles well separated into one cluster of gaming-related articles and one cluster of programming-related articles?
  • Perform hierarchical clustering on the 180 articles
  • Are articles about similar topics well separated into branches?
  • Generate your own word list of at least 100 words
  • Repeat k-means and hierarchical clustering using the new word list
  • Are the results better with the new word list?
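
The k-means step and the "well separated" check in the list above can be sketched as follows. This is a sketch under stated assumptions, not the Assignment 2 implementation: it uses Euclidean distance (swap in the distance measure from your own system, for example Pearson-based), a deterministic farthest-point seeding instead of random start centroids, and toy two-dimensional vectors in place of the real 180 word-frequency vectors. The helper names initial_centroids and cluster_composition are hypothetical.

```python
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def initial_centroids(vectors, k):
    # Assumption: farthest-point seeding. Start with the first vector, then
    # repeatedly add the vector farthest from its nearest chosen centroid.
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors, key=lambda v: min(euclidean(v, c) for c in centroids)
        )
        centroids.append(list(farthest))
    return centroids

def kmeans(vectors, k=2, max_iter=100):
    """Return a list with the cluster index assigned to each vector."""
    centroids = initial_centroids(vectors, k)
    assignment = None
    for _ in range(max_iter):
        # Assign every vector to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(v, centroids[c]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # No vector changed cluster, so we have converged.
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

def cluster_composition(assignment, labels, k=2):
    """Count how many articles of each known category land in each cluster."""
    return [
        Counter(lab for a, lab in zip(assignment, labels) if a == c)
        for c in range(k)
    ]

# Toy stand-ins for word-frequency vectors (not real article data).
vectors = [[9, 1], [8, 0], [10, 2], [1, 9], [0, 8], [2, 10]]
labels = ["Programming"] * 3 + ["Games"] * 3
assignment = kmeans(vectors, k=2)
composition = cluster_composition(assignment, labels)
for index, counts in enumerate(composition):
    print("cluster", index, dict(counts))
```

A cluster that contains almost only one category indicates good separation. Running the same check with the 20-word list and with your own word list gives a direct way to compare the two.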
