This is one of the pre-defined project ideas you can choose for your project.
Clustering Wikipedia articles
Modify your clustering system from Assignment 2 to use Wikipedia articles (90 articles about Programming, 90 about Games). The dataset can be downloaded from the Datasets page.
To use the dataset for clustering, you need to select a number of words and calculate the frequency of each of these words in every Wikipedia article. Using all words from the articles is not recommended, since the similarity calculations would then take a long time. You can, for example, use the following words:
language, programming, computer, software, hardware, data, player, online, system, development,
machine, console, developer, design, history, technology, standard, information, article, example
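As a rough illustration, counting the selected words in one article could be sketched as below. This is only a sketch under some assumptions: Python is used here although your Assignment 2 system may be written in another language, the article is assumed to be available as a plain text string, and the names `WORDS` and `word_frequencies` are made up for this example. Words are matched case-insensitively and exactly, so "players" is not counted as "player".

```python
import re
from collections import Counter

# The 20 example words suggested above.
WORDS = [
    "language", "programming", "computer", "software", "hardware",
    "data", "player", "online", "system", "development",
    "machine", "console", "developer", "design", "history",
    "technology", "standard", "information", "article", "example",
]


def word_frequencies(text, words=WORDS):
    """Return a list with the number of occurrences of each selected word.

    Assumption: a token is a run of the letters a-z after lowercasing,
    and only exact matches are counted (no stemming).
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    # A Counter returns 0 for words that never occur in the text.
    return [counts[w] for w in words]
```

Calling `word_frequencies(article_text)` for every article gives one vector of 20 counts per article, which can then be fed to the similarity calculations in your clustering system.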
The article Arcade_game would then have the following frequencies: