P4 – Web scraping

This is one of the pre-defined project ideas you can choose for your project.

In this project, you will use a web scraping library to download articles that your search engine from Assignment 3 can index and search.

If you use Python, the BeautifulSoup library is very powerful and easy to use. A quick start guide can be found here. For Java, you can check out HtmlUnit. A quick start guide can be found here.

When scraping a site such as Wikipedia, you usually start on one page and follow its outgoing links to find new pages, repeating this for each new page until you have collected enough pages.
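
The link-following idea can be sketched as a breadth-first crawl. This is only a minimal sketch, not a required design: the names crawl, extract_links, LinkCollector and allowed_prefix are made up for illustration, and fetch is assumed to be any function that takes a URL and returns the page's HTML as text (or None on failure). It uses Python's standard html.parser so that it runs without extra installs; BeautifulSoup could do the link extraction instead.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html):
    """Return absolute URLs (without #fragments) of all links, in page order."""
    collector = LinkCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute not in links:
            links.append(absolute)
    return links


def crawl(start_url, fetch, max_pages=200, allowed_prefix=""):
    """Breadth-first crawl starting at start_url.

    fetch(url) is assumed to return HTML text, or None if the download failed.
    Only links starting with allowed_prefix are followed.
    Returns a dict mapping each visited URL to its raw HTML.
    """
    pages = {}
    queue = deque([start_url])
    seen = {start_url}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # skip pages that could not be downloaded
        pages[url] = html
        for link in extract_links(url, html):
            if link.startswith(allowed_prefix) and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Passing an allowed_prefix such as the article path of the site you scrape keeps the crawl from wandering off to other sites, and max_pages stops it once enough pages have been collected.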

You can download pages from Wikipedia or from any other site.
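
Downloaded pages should be kept on disk so that they can be parsed later without fetching them again. The following is a minimal sketch using only Python's standard library. The function names (fetch, filename_for, save_raw_html), the User-Agent string, the one-second delay and the file naming scheme are all assumptions chosen for illustration. Check the terms of use and robots.txt of the site you scrape before running anything like this against it.

```python
import hashlib
import time
import urllib.request
from pathlib import Path


def fetch(url, delay_seconds=1.0):
    """Download one page and return its HTML as text, or None on failure."""
    time.sleep(delay_seconds)  # pause between requests to avoid hammering the site
    request = urllib.request.Request(
        url, headers={"User-Agent": "course-project-scraper"}  # hypothetical name
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except OSError:  # covers URLError, HTTPError and timeouts
        return None


def filename_for(url):
    """Build a file name from the last URL segment plus a short hash of the URL."""
    name = url.rstrip("/").rsplit("/", 1)[-1]
    safe = "".join(c if c.isalnum() or c in "-_" else "_" for c in name) or "index"
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]
    return f"{safe}_{digest}.html"


def save_raw_html(pages, directory):
    """Write each {url: html} entry to its own file and return the file paths."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    paths = []
    for url, html in pages.items():
        path = directory / filename_for(url)
        path.write_text(html, encoding="utf-8")
        paths.append(path)
    return paths
```

The short hash in the file name is there so that two URLs ending in the same segment do not overwrite each other.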


Grade Requirements
  • Scrape and store raw HTML for at least 200 pages
  • Parse the raw HTML files to generate a dataset similar to the Wikipedia dataset from Assignment 3
  • For each article, the dataset shall contain a file with all words in the article and another file with all outgoing links in the article
  • Use the dataset with your search engine from Assignment 3
  • Use both content-based ranking and PageRank to rank search results
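
The parsing step from the requirements above could look roughly like the sketch below. It assumes the raw pages were stored as .html files in one directory, and it assumes a dataset layout with a Words folder and a Links folder holding one file per article (words separated by spaces, one link per line). Check the actual format of your Assignment 3 dataset and adjust accordingly. The names ArticleParser, parse_article and write_dataset are made up for illustration. The standard html.parser is used here; BeautifulSoup would work just as well.

```python
import re
from html.parser import HTMLParser
from pathlib import Path


class ArticleParser(HTMLParser):
    """Collects visible text and link targets, ignoring script and style content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.text_parts.append(data)


def parse_article(html):
    """Return (list of lowercase words, sorted list of unique hrefs as written)."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    # Joining with spaces keeps words from adjacent elements apart.
    text = " ".join(parser.text_parts).lower()
    words = re.findall(r"\w+", text)
    links = sorted(set(parser.links))
    return words, links


def write_dataset(raw_dir, out_dir):
    """Create out_dir/Words/<name> and out_dir/Links/<name> for each raw file."""
    raw_dir, out_dir = Path(raw_dir), Path(out_dir)
    words_dir = out_dir / "Words"  # assumed folder names
    links_dir = out_dir / "Links"
    words_dir.mkdir(parents=True, exist_ok=True)
    links_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for raw_file in sorted(raw_dir.glob("*.html")):
        words, links = parse_article(raw_file.read_text(encoding="utf-8"))
        (words_dir / raw_file.stem).write_text(" ".join(words), encoding="utf-8")
        (links_dir / raw_file.stem).write_text("\n".join(links), encoding="utf-8")
        count += 1
    return count
```

Note that this sketch keeps every link on the page, including navigation links and links to pages you did not scrape. For PageRank you will probably want to keep only links that point to other articles in your dataset, written in the same form as the article names.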
