Home Page Detection
Olof Friman (D03)
The purpose of my thesis is to gain deeper knowledge of how to determine whether a webpage is a given company’s homepage. The basic approach is to analyze the webpage’s contents by keyword matching, but many other approaches are possible.
E-directories such as Eniro and Hitta.se sell “ad space” containing information on hundreds of thousands of domestic companies. The information often consists of street address, phone numbers, category and homepage address. The ad price depends, among other things, on the amount of information presented.
E-directory providers have an interest in letting their customers add more information, as this generates value for the customer as well as increased revenue for the E-directory.
E-directory providers can identify relevant information in different ways. Company web sites are valuable both in themselves and as a source of other relevant information, but identifying them manually is time-consuming and error-prone. A fully automated approach with high accuracy is preferred, but a semi-automatic approach is a good alternative when this is not within reach.
Apptus is currently developing a semi-automatic Home Page Detector with the “simple” task of finding a company’s homepage. The flow consists of the following steps:
- Generate candidate URLs.
- Evaluate and rank the candidates by their probability of being correct.
- Present the result to an operator, who verifies and selects the correct home page.
The thesis focuses on the second step, where the core problem is to determine the probability that a given webpage is the company’s homepage.
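As an illustration of the keyword-matching idea, the following minimal sketch scores each candidate page by the fraction of tokens from the company's directory entry (name, street address, city) that also occur in the page text, and ranks the candidates by that score. The function names, the example company and the URLs are hypothetical, and the score is a simple heuristic rather than a calibrated probability; it is not the method used in the Apptus detector.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lower-case the text and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def keyword_score(page_text: str, company_fields: list[str]) -> float:
    """Fraction of the company's field tokens that also occur on the page."""
    wanted: set[str] = set()
    for field in company_fields:
        wanted |= tokenize(field)
    if not wanted:
        return 0.0
    return len(wanted & tokenize(page_text)) / len(wanted)


def rank_candidates(pages: dict[str, str],
                    company_fields: list[str]) -> list[tuple[str, float]]:
    """Return (url, score) pairs sorted from best to worst match."""
    scored = [(url, keyword_score(text, company_fields))
              for url, text in pages.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical directory entry and candidate pages (assumptions for the example).
fields = ["Acme Bageri", "Storgatan 1", "Lund"]
pages = {
    "http://guide.example": "Lund city guide",
    "http://acme.example": "Welcome to Acme Bageri, Storgatan 1, Lund",
}
ranking = rank_candidates(pages, fields)
```

Here the page that mentions the company name, street and city is ranked first, while the city guide only matches on the city name. An operator would then be shown the candidates in this order.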
The goals of my thesis are to:
- Find new approaches to determining the probability of correctness.
- Implement and test these approaches.
- Suggest modifications that improve the accuracy of the “probability calculator”.
- Determine a good mixture of approaches.
- Deliver a demonstration program.
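One simple way to form a mixture of approaches is a weighted average of the individual scores. The sketch below assumes that every approach returns a score between 0 and 1 for a candidate page; the approach names and weights are hypothetical, and finding good weights (or a better way of combining the scores) is part of what the thesis sets out to investigate.

```python
def combine_scores(scores: dict[str, float],
                   weights: dict[str, float]) -> float:
    """Weighted average of per-approach scores for one candidate page.

    Approaches without an entry in `weights` get weight 0 and are ignored.
    """
    total_weight = sum(weights.get(name, 0.0) for name in scores)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(weights.get(name, 0.0) * score
                       for name, score in scores.items())
    return weighted_sum / total_weight


# Hypothetical approaches and weights (assumptions for the example).
weights = {"keywords": 3.0, "url_similarity": 1.0}
scores = {"keywords": 0.8, "url_similarity": 0.5}
combined = combine_scores(scores, weights)  # (3.0*0.8 + 1.0*0.5) / 4.0
```

Because the result stays in the same 0 to 1 range as its inputs, the combined score can be used directly to rank candidates.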
I will use random samples from a relatively large dataset (containing information on more than 10 000 companies) to test the different approaches and the combined solution.
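Testing on random samples could look roughly like the sketch below. It assumes that the dataset is available as a list of (company, known homepage URL) pairs and that an approach can be wrapped as a function returning its top-ranked URL for a company; both are assumptions made for the example, as are all names in the code.

```python
import random
from typing import Callable


def sample_accuracy(labelled: list[tuple[str, str]],
                    predict: Callable[[str], str],
                    sample_size: int,
                    seed: int = 0) -> float:
    """Share of sampled companies whose predicted URL equals the known one."""
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    sample = rng.sample(labelled, min(sample_size, len(labelled)))
    if not sample:
        return 0.0
    hits = sum(1 for company, true_url in sample
               if predict(company) == true_url)
    return hits / len(sample)


# Hypothetical stand-in data: 100 companies with known homepages.
data = [(f"company{i}", f"http://company{i}.example") for i in range(100)]


def always_right(company: str) -> str:
    """Toy predictor that happens to match the stand-in data exactly."""
    return f"http://{company}.example"


accuracy = sample_accuracy(data, always_right, sample_size=20)
```

Using the same seed for every approach means that all approaches are compared on the same sample of companies.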
Supervisors: Per Carlsson (Apptus) and Anders Ardö (EIT)