
Completed theses

Home Page Detection

Olof Friman (D03)

Start: 2007-03-12
Presentation: 2007-08-10 10:15:00
Location:
Completed: 2007-08-10

Summary

The purpose of my thesis is to gain deeper knowledge about how to determine whether a webpage is a given company’s homepage. The basic approach is to analyze the webpage’s contents by keyword matching, but many other approaches are possible.

Motivation

E-directories such as Eniro and Hitta.se sell “ad space” containing information on hundreds of thousands of domestic companies. The information often consists of street address, phone numbers, category and homepage address. The ad price is based on, among other things, the amount of information presented. E-directory providers therefore have an interest in letting their customers add more information, as this generates value for the customer as well as increased revenue for the E-directory company itself. E-directory providers can identify relevant information in different ways. Company web sites are valuable both in themselves and as a source of other relevant information, but identifying them manually is time-consuming and error-prone. A fully automated approach with high accuracy is preferred, but a semi-automatic approach is a good alternative when this is not within reach.

Method

Apptus is currently developing a semi-automatic Home Page Detector with the “simple” task of finding a company’s homepage. The flow contains the following steps:

  1. Generate candidate URLs.
  2. Evaluate and rank the candidates by their probability of being correct.
  3. Present the result to an operator, who verifies and selects the correct home page.
The thesis focuses on the second step, where the core problem is to determine the probability that a given webpage is the company’s homepage. The goal of my thesis is to:
  • Find new approaches to determine the probability of correctness.
  • Implement and test these approaches.
  • Suggest modifications that improve the accuracy of the “probability calculator”.
  • Determine a good mixture of approaches.
  • Deliver a demonstration program.
I will use random samples from a relatively large dataset (containing information about more than 10,000 companies) to test the different approaches and the combined solution.
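As a rough illustration of the second step, the keyword-matching idea could be sketched as follows. This is not the Apptus implementation. The function names, the example URLs and the scoring rule (the fraction of a company's keywords that occur on the page) are assumptions made for the example, and the score is a simple heuristic, not a calibrated probability.

```python
import re


def keyword_score(page_text: str, keywords: list[str]) -> float:
    """Return the fraction of keywords found as whole words in the page text."""
    if not keywords:
        return 0.0
    # Lowercase word tokens of the page; multi-word keywords are not handled here.
    tokens = set(re.findall(r"\w+", page_text.lower()))
    hits = sum(1 for kw in keywords if kw.lower() in tokens)
    return hits / len(keywords)


def rank_candidates(pages: dict[str, str], keywords: list[str]) -> list[tuple[str, float]]:
    """Return (url, score) pairs, best-scoring candidate first."""
    scored = [(url, keyword_score(text, keywords)) for url, text in pages.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical candidate pages for a made-up company, "Exempel Bageri" in Lund.
candidate_pages = {
    "http://www.example.org/": "Directory of bakeries in Lund and Malmö",
    "http://www.example.com/": "Exempel Bageri in Lund bakes fresh bread daily",
    "http://www.example.net/": "Cheap flights and hotels",
}
company_keywords = ["exempel", "bageri", "lund"]

for url, score in rank_candidates(candidate_pages, company_keywords):
    print(f"{score:.2f}  {url}")
```

The ranked list is what an operator would be shown in the third step. A mixture of approaches could be built in the same way by computing several such scores per candidate and combining them, for instance as a weighted sum.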

Supervisors: Per Carlsson (Apptus) and Anders Ardö (EIT)

Examiner:


Last updated: 2008-04-16 10:35:02
Webmaster: Daniel Sjöberg
Responsible publisher: Head of Department

Department of Electrical and Information Technology, LTH, Box 118, 221 00 Lund. Phone: 046-222 00 00