Project 6 - A web server

Introduction

In this project a web server will be constructed based on the implementation of TCP and the socket interface introduced earlier in the course. The web server will support browsing with authentication of user access. In addition, a dynamic web page will be implemented which may be updated in a web form.

Another interesting exercise in this project is the file system in an embedded system. The solution will be somewhat simplified due to the read-only memory available. In principle, the simple file system may be generated without knowledge about the details, but a general understanding will help in the implementation of the dynamic web page.

There is a lot of string manipulation involved in this project at present. Anyone unfamiliar with the standard C-language string library will have to learn about the basic functions available.

The solution must meet a number of requirements before the project is considered finished.

There is some administration in order to obtain the source code. Refer to the suggested design of the class hierarchy as well as advice on how to approach the completion of the skeleton code. An executable which may be loaded into the ETRAX unit is compiled, linked, and loaded in the same manner as in the previous overview of the system. Finally, you will have to test your solution.

Recommended reading

HTTP Made Really Easy is a practical guide to writing web clients and servers without previous experience.

The application-level protocol is described in RFC1945 covering the Hypertext Transfer Protocol HTTP/1.0. At least the introduction should be read by those unfamiliar with the HTTP protocol.

Read about the World Wide Web Consortium and search its home page for further references.

Suggested solution design

The web structure of the server

The directory ~inin/kurs/www contains a sample web structure which may be used as provided. Should you want to create your own web, consult the requirements in order to finish the project. The directory structure of the sample web is

/index.htm
    /dynamic/dynamic.htm
    /pict/aum.gif
    /pict/big.jpg
    /pict/small1.gif
    /pict/small2.gif
    /pict/small3.gif
    /pict/small4.gif
    /pict/small5.gif
    /private/private.htm

where three different subdirectories are present.

The files dynamic.htm and private.htm are created dynamically by the server and cannot be stored in the read-only memory of the server. Instead, they must be handled separately by the file system. The file private.htm contains a form which allows dynamic editing of the content of the file dynamic.htm. The form requires authentication prior to use. When text is entered into the form and submitted, the file dynamic.htm changes, and the new content should also appear in the form the next time the form is used.

The file system

The simple file system is based on unpacked LHarc archives and is implemented as a singleton class with two methods. The method readFile will need modifications in order to allow dynamic retrieval, and the method writeFile must be implemented in order to allow storage. The read-only memory of the server introduces a difficulty which may be solved by a linked list which holds the changes in the file system. A request for the file dynamic.htm in the readFile method then searches the list first, before the actual file system is consulted.
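
The linked-list idea can be sketched as follows. This is only an illustration under stated assumptions: the names OverlayNode and FileOverlay are hypothetical, and the real signatures of readFile and writeFile in fs.cc are not shown in this text, so the sketch keeps its own small interface.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical node that holds one file written at run time.
struct OverlayNode
{
  std::string name;     // File name, for example "dynamic.htm"
  std::string content;  // Most recently written content
  OverlayNode* next;    // Next node in the list, or 0
};

// Hypothetical list of changes kept in front of the read-only file system.
class FileOverlay
{
 public:
  FileOverlay() : head(0) {}

  ~FileOverlay()
  {
    while (head != 0)
    {
      OverlayNode* node = head;
      head = head->next;
      delete node;
    }
  }

  // Store a new file, or replace the content of one already in the list.
  void write(const char* name, const char* data, std::size_t length)
  {
    assert(name != 0 && data != 0);
    OverlayNode* node = find(name);
    if (node == 0)
    {
      node = new OverlayNode;
      node->name = name;
      node->next = head;
      head = node;
    }
    node->content.assign(data, length);
  }

  // Return the stored content, or 0 when the file is not in the list and
  // the caller should consult the read-only file system instead.
  const std::string* read(const char* name) const
  {
    assert(name != 0);
    const OverlayNode* node = find(name);
    return node != 0 ? &node->content : 0;
  }

 private:
  FileOverlay(const FileOverlay&);             // Not copyable
  FileOverlay& operator=(const FileOverlay&);  // Not assignable

  OverlayNode* find(const char* name) const
  {
    for (OverlayNode* node = head; node != 0; node = node->next)
    {
      if (node->name == name) return node;
    }
    return 0;
  }

  OverlayNode* head;
};
```

A readFile implementation along these lines would call read first and fall back to the archive only when the result is 0.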

The socket-based application of the server

Most of the implementation in the class SimpleApplication in the previous project may be reused when the class HTTPServer is defined and implemented. The method TCP::connectionEstablished will require modifications in order to start the new thread. In addition, the telnet ECHO port number used previously may have to be changed to the WWW port number 80.

One important difference in the HTTPServer compared with the SimpleApplication is a request buffer. Usually several segments must be received before a request is complete, since HTTP requests are often transmitted in more than one TCP segment.

Different parts of a response may be sent in more than one segment. For example,

mySocket->Write(aResponseLine);
mySocket->Write(aHeaderLine);
mySocket->Write(aFile);
mySocket->Close();

Details in the implementation

Introduction to HTTP requests

All requests are terminated with an empty line, <CRLF>. The token <CRLF> means ASCII 0x0D, carriage return, followed by 0x0A, line feed. A complete GET request has been received when two <CRLF> have been found sequentially.

GET / HTTP/1.0<CRLF>
From: user@domain.com<CRLF>
User-Agent: HTTPTool/1.0<CRLF>
<CRLF>
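
Because a request may arrive in several segments, the test for two sequential <CRLF> has to be made on the accumulated data. A minimal sketch is given below, assuming a hypothetical class RequestBuffer that is not part of the skeleton code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical request buffer that collects the payload of TCP segments.
class RequestBuffer
{
 public:
  // Append the payload of one received segment.
  void append(const char* data, std::size_t length)
  {
    assert(data != 0);
    buffer.append(data, length);
  }

  // True when the empty line that ends the request header has arrived.
  bool headerComplete() const
  {
    return buffer.find("\r\n\r\n") != std::string::npos;
  }

  // Everything received so far.
  const std::string& text() const { return buffer; }

 private:
  std::string buffer;
};
```

Searching the whole buffer, rather than only the latest segment, also covers the case where the terminating sequence is split between two segments.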

The response of the server will be a transfer of the resource if it is available

HTTP/1.0 200 OK<CRLF>
Content-Type: text/html<CRLF>
<CRLF>
<html><title>Home page</title><body>.....

The HEAD request is very similar to the GET but instead of returning the actual resource, the server returns response headers only which may be useful in order to check characteristics of a resource without actually downloading it.

A POST request is more complicated since data is sent after the <CRLF><CRLF> of the request. One of the header lines contains the token Content-Length: followed by the number of bytes of data in the request. It is the responsibility of the application to buffer the received data until the right number of bytes has arrived. Often, the data is transmitted in a number of segments after the request header has been received.

POST /path/script.cgi HTTP/1.0<CRLF>
From: user@domain.com<CRLF>
User-Agent: HTTPTool/1.0<CRLF>
Content-Type: application/x-www-form-urlencoded<CRLF>
Content-Length: 32<CRLF>
<CRLF>
home=Cosby&favorite+flavor=flies

A method called HTTPServer::contentLength is provided in the file http.cc. It finds the token Content-Length: and converts the string representation of the integer into an integer of type udword.
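
The buffering rule for POST requests can be sketched as below. The function parseContentLength is a hypothetical stand-in for HTTPServer::contentLength, whose exact signature is not shown here, and it assumes the header is spelled exactly Content-Length: as in the example.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical stand-in for HTTPServer::contentLength: return the value of
// the Content-Length header, or 0 when the header is absent.
unsigned long parseContentLength(const std::string& header)
{
  const std::string token = "Content-Length:";
  const std::string::size_type pos = header.find(token);
  if (pos == std::string::npos) return 0;
  // strtoul skips the blank after the colon and stops at <CRLF>.
  return std::strtoul(header.c_str() + pos + token.size(), 0, 10);
}

// True when the header and all announced data bytes have been received.
bool requestComplete(const std::string& request)
{
  const std::string::size_type headerEnd = request.find("\r\n\r\n");
  if (headerEnd == std::string::npos) return false;
  // Only the header part is searched, so data that happens to contain
  // the token cannot be mistaken for the header field.
  const std::string header = request.substr(0, headerEnd);
  const std::string::size_type received = request.size() - (headerEnd + 4);
  return received >= parseContentLength(header);
}
```

A GET request carries no Content-Length header, so it counts as complete as soon as the empty line has arrived.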

Below, a few examples of web-transactions are presented. Remember that a new connection is established in every new transaction and that it is the responsibility of the server to terminate the connection when the transaction is finished. Required parameters are denoted by <...> whereas optional parameters are shown within [...].

The standard GET-request

The client establishes a new connection to port 80 on the server and requests the first page by sending

GET / HTTP/1.0<CRLF>
[Header line 1]<CRLF>
[Header line 2]<CRLF>
<CRLF>

The server answers the client with

HTTP/1.0 200 OK<CRLF>
Content-Type: text/html<CRLF>
<CRLF>
<html><title>Home page</title><body>.....

The server closes the connection.

The GET-request with authentication

The client establishes a new connection to port 80 on the server and requests a private page by sending

GET /private/private.htm HTTP/1.0<CRLF>
[Header line 1]<CRLF>
[Header line 2]<CRLF>
<CRLF>

The server answers the client with

HTTP/1.0 401 Unauthorized<CRLF>
Content-Type: text/html<CRLF>
WWW-Authenticate: Basic realm="private"<CRLF>
<CRLF>
<html><head><title>401 Unauthorized</title></head><CRLF>
<body><h1>401 Unauthorized</h1></body></html>

The server closes the connection. At this point the web browser presents a dialog to the user in order to request a user name and a password.

The client establishes a second connection with the supplied user name and password

GET /private/private.htm HTTP/1.0<CRLF>
[Header line 1]<CRLF>
Authorization: Basic qWjfhjR124=<CRLF>
[Header line 3]<CRLF>
<CRLF>

The server answers the client with

HTTP/1.0 200 OK<CRLF>
Content-Type: text/html<CRLF>
<CRLF>
<html><title>Secret page</title><body>.....

The server closes the connection.

Details of the standard GET and HEAD requests

The implementation of the GET request requires a correct format of the path and file name. The root path / always translates into the path /index.htm.

Attempt to find the requested file in the file system. If it cannot be found by the server, the response

HTTP/1.0 404 Not found\r\n
Content-type: text/html\r\n
\r\n
<html><head><title>File not found</title></head>
<body><h1>404 Not found</h1></body></html>

must be sent. The characters \r\n are the C-language equivalent of the token <CRLF>, ASCII 0x0D, carriage return, followed by 0x0A, line feed. If the file is available, the response

HTTP/1.0 200 OK\r\n
Content-type: text/html\r\n
\r\n
<file content>

is sent. The content type header field is based on the extension of the file

File extension    Content-type:
.htm              text/html
.gif              image/gif
.jpg              image/jpeg

and the token <file content> should be replaced with the actual content of the requested resource. A reply to a HEAD request would not send anything after the header lines and the blank line

HTTP/1.0 200 OK\r\n
Content-type: text/html\r\n
\r\n
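
The mapping from file extension to content type, and the header lines shared by GET and HEAD replies, can be sketched as follows. The function names contentTypeFor and okHeader are hypothetical and only illustrate the table above.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical helper: map a file name to the value of the content type
// header field. Returns 0 for extensions that are not in the table.
const char* contentTypeFor(const char* fileName)
{
  assert(fileName != 0);
  const char* dot = std::strrchr(fileName, '.');   // Last '.' in the name
  if (dot == 0) return 0;
  if (std::strcmp(dot, ".htm") == 0) return "text/html";
  if (std::strcmp(dot, ".gif") == 0) return "image/gif";
  if (std::strcmp(dot, ".jpg") == 0) return "image/jpeg";
  return 0;
}

// Build the status line, the header line and the blank line of a
// successful reply. A HEAD reply ends here, a GET reply continues with
// the file content.
std::string okHeader(const char* contentType)
{
  assert(contentType != 0);
  std::string header = "HTTP/1.0 200 OK\r\n";
  header += "Content-type: ";
  header += contentType;
  header += "\r\n\r\n";
  return header;
}
```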

Details of the GET-request with authentication

Whenever a GET request is received by the server, a path that contains /private requires an authorisation check. The authorisation is performed in two steps.

Try to find the header field Authorization: Basic in the request. If it cannot be found, the response

HTTP/1.0 401 Unauthorized<CRLF>
Content-Type: text/html<CRLF>
WWW-Authenticate: Basic realm="private"<CRLF>
<CRLF>
<html><head><title>401 Unauthorized</title></head><CRLF>
<body><h1>401 Unauthorized</h1></body></html>

is sent back to the client. If the request contains the header field as in the example below,

GET /private/private.htm HTTP/1.0<CRLF>
From: user@domain.com<CRLF>
User-Agent: HTTPTool/1.0<CRLF>
Authorization: Basic qWjfhjR124=<CRLF>
<CRLF>

the characters qWjfhjR124= must be decoded with the provided method HTTPServer::decodeBase64 in the file http.cc. The result from the method is a string of the form user:password. Compare the string with a few invented users with passwords stored in the class HTTPServer and decide whether the resource should be sent.

When the user is authorised to request the resource, the file content is sent with the same response type as in a standard GET request. An unauthorised user must be sent another 401 Unauthorized response.
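
The two steps can be sketched as below. Everything here is an assumption made for illustration: decodeBase64Sketch stands in for the provided HTTPServer::decodeBase64, whose signature is not shown in this text, and the user:password pairs are invented.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Stand-in for HTTPServer::decodeBase64. Decodes standard Base64 and stops
// at the first padding character or character outside the alphabet.
std::string decodeBase64Sketch(const std::string& in)
{
  static const char alphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
  std::string out;
  unsigned int bits = 0;  // Accumulated bits, newest in the low end
  int count = 0;          // Number of bits not yet written to out
  for (std::string::size_type i = 0; i < in.size(); ++i)
  {
    const char* p = (in[i] != '\0') ? std::strchr(alphabet, in[i]) : 0;
    if (p == 0) break;    // '=' padding or end of the token
    bits = (bits << 6) | static_cast<unsigned int>(p - alphabet);
    count += 6;
    if (count >= 8)
    {
      count -= 8;
      out += static_cast<char>((bits >> count) & 0xff);
    }
  }
  return out;
}

// Step 1: extract "user:password" from the request, or return an empty
// string when no Authorization: Basic header field is present.
std::string credentialsOf(const std::string& request)
{
  const std::string token = "Authorization: Basic ";
  std::string::size_type pos = request.find(token);
  if (pos == std::string::npos) return "";
  pos += token.size();
  const std::string::size_type end = request.find("\r\n", pos);
  // substr clamps the length if no line end was found.
  return decodeBase64Sketch(request.substr(pos, end - pos));
}

// Step 2: compare with a few invented users.
bool isAuthorised(const std::string& request)
{
  static const char* const users[] = { "user:password", "guest:secret" };
  const std::string credentials = credentialsOf(request);
  for (std::size_t i = 0; i < sizeof(users) / sizeof(users[0]); ++i)
  {
    if (credentials == users[i]) return true;
  }
  return false;
}
```

When isAuthorised returns false, the 401 Unauthorized response is sent, whether the header field was missing or the credentials were wrong.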

Details of the POST request

In order to implement the dynamic web page the file system must support storage of new files. The updated content of the file /dynamic/dynamic.htm is transferred to the server with a POST request when the form in the resource /private/private.htm is submitted. The file content in the POST request is URL encoded and must be decoded with the provided method HTTPServer::decodeForm before the file is updated. A typical POST request is shown below

POST /private/private.htm HTTP/1.0<CRLF>
[Header line 1]<CRLF>
Content-Length: 339<CRLF>
[Header line n]<CRLF>
<CRLF>
dynamic.htm=%0D%3Chtml%3E%0D%3Chead%3E%0D%3Ctitle%3EDynamic+page%3C%2Ftitl
e%3E%0D%3C%2Fhead%3E%0D%3Cbody+b....

A method called HTTPServer::contentLength is provided in the file http.cc. It finds the token Content-Length: and converts the string representation of the integer into an integer of type udword. Use the value of the field in order to determine whether the entire request is received by the server.

The string dynamic.htm= is removed in the method HTTPServer::decodeForm.
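
For reference, URL decoding of form data can be sketched as follows. The functions urlDecodeSketch and decodeSingleField are hypothetical and are not the provided HTTPServer::decodeForm. They only illustrate the encoding: a + stands for a space and %XX stands for the byte with hexadecimal value XX.

```cpp
#include <cassert>
#include <cctype>
#include <cstdlib>
#include <string>

// Decode '+' into a space and %XX into the byte with hexadecimal value XX.
// A '%' that is not followed by two hexadecimal digits is copied unchanged.
std::string urlDecodeSketch(const std::string& in)
{
  std::string out;
  for (std::string::size_type i = 0; i < in.size(); ++i)
  {
    if (in[i] == '+')
    {
      out += ' ';
    }
    else if (in[i] == '%' && i + 2 < in.size()
             && std::isxdigit(static_cast<unsigned char>(in[i + 1]))
             && std::isxdigit(static_cast<unsigned char>(in[i + 2])))
    {
      const std::string hex = in.substr(i + 1, 2);
      out += static_cast<char>(std::strtol(hex.c_str(), 0, 16));
      i += 2;  // Skip the two hexadecimal digits
    }
    else
    {
      out += in[i];
    }
  }
  return out;
}

// Remove the field name and '=' from a single-field form body such as
// dynamic.htm=... and decode the remaining value.
std::string decodeSingleField(const std::string& body)
{
  const std::string::size_type eq = body.find('=');
  return urlDecodeSketch(eq == std::string::npos ? body
                                                 : body.substr(eq + 1));
}
```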

The response to a POST request is sent in the same manner as with GET requests

HTTP/1.0 200 OK\r\n
Content-type: text/html\r\n
\r\n

You can either include the content of the resource dynamic.htm or add a page which confirms the submission of the form and the update of the file

HTTP/1.0 200 OK\r\n
Content-type: text/html\r\n
\r\n
<html><head><title>Accepted</title></head>
<body><h1>The file dynamic.htm was updated successfully.</h1></body></html>

Creation of the file system

Unpacked LHarc archives are created with the command

lha ao5z1 <archive file> [file...]

on the server linus. The command a, add to archive, is complemented by the options

All files specified with [file...] are added to the archive named <archive file>. If no archive with that name exists, a new one is created.

The archive file is a binary file and may be translated into the C-language by the Perl-script command hex2char <archive file>. The result on stdout is redirected into a file called lhafile.bin which is included in the file fs.cc containing the file system class.

For example, the commands below create a file system of the sample web

cd kurs/www
lha ao5z1 ../src/lab6/wwwarc.lzh *
cd ../src/lab6
hex2char wwwarc.lzh > lhafile.bin

The paths and file names in the file system

An absolute path in the file system must be separated into the path and the file name. For example, the absolute path /dynamic/dynamic.htm is transformed into the path dynamic and the file name dynamic.htm. The byte 0xff, the integer -1, is used both as the delimiter between directories and as the termination token of the path. If the file is located directly in the root /, the path is set to NULL. For all other absolute paths the initial / must be removed. For example,

Absolute path      Path          File name
/pict/small.gif    pict<0xff>    small.gif
/index.htm         NULL          index.htm

An example of the transformation of the path name into something the file system will understand is presented below. The manual page for unfamiliar C-language functions may be read with the command man <function> on the computer <server name>. The method extractString is provided in the file http.cc.

The method findPathName expects a string like GET /private/private.htm HTTP/1.0

char*
HTTPServer::findPathName(char* str)
{
  char* firstPos = strchr(str, ' ');     // First space on line
  firstPos++;                            // Pointer to first /
  char* lastPos = strchr(firstPos, ' '); // Last space on line
  char* thePath = 0;                     // Result path
  if ((lastPos - firstPos) == 1)
  {
    // Is / only
    thePath = 0;                         // Return NULL
  }
  else
  {
    // Is an absolute path. Skip first /.
    thePath = extractString((char*)(firstPos+1),
                            lastPos-firstPos);
    if ((lastPos = strrchr(thePath, '/')) != 0)
    {
      // Found a path. Insert -1 as terminator.
      *lastPos = '\xff';
      *(lastPos+1) = '\0';
      while ((firstPos = strchr(thePath, '/')) != 0)
      {
        // Insert -1 as separator.
        *firstPos = '\xff';
      }
    }
    else
    {
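      // No directory part: a file in the root, such as /index.htm
      // (use delete [] instead if extractString allocates with new [])
      delete thePath; thePath = 0; // Return NULL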
    }
  }
  return thePath;
}
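
To check expected results without the skeleton code, the same transformation can be written in a self-contained form. The sketch below is an alternative built on std::string, not the course implementation. The names SplitPath and splitAbsolutePath are hypothetical, and an empty path string plays the role of NULL.

```cpp
#include <cassert>
#include <string>

// Hypothetical result type: the path with 0xff separators and terminator,
// and the file name. An empty path corresponds to NULL in the table.
struct SplitPath
{
  std::string path;
  std::string fileName;
};

// Split an absolute path such as /pict/small.gif into path and file name.
SplitPath splitAbsolutePath(const std::string& absolutePath)
{
  assert(!absolutePath.empty() && absolutePath[0] == '/');
  SplitPath result;
  const std::string::size_type lastSlash = absolutePath.rfind('/');
  result.fileName = absolutePath.substr(lastSlash + 1);
  if (lastSlash > 0)
  {
    // Copy the directories without the initial '/', keeping the final '/',
    // then replace every '/' with 0xff.
    result.path = absolutePath.substr(1, lastSlash);
    for (std::string::size_type i = 0; i < result.path.size(); ++i)
    {
      if (result.path[i] == '/') result.path[i] = '\xff';
    }
  }
  return result;
}
```

The translation of the root path / into /index.htm is left to the caller in this sketch.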

String manipulation in the server application

A string constant, or string literal, is a sequence of zero or more characters surrounded by double quotes, as in

"I am a string"

or

"" /* The empty string */

The quotes are not part of the string. Technically, a string constant is an array of characters. The internal representation of a string has a null character '\0' at the end. Thus, the physical storage required is one more than the number of characters written between the quotes. It is essential to be consistent with the terminating null character.
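
The extra byte for the terminating null character is easy to forget when storage is allocated. A short illustration follows, in which the helper duplicateString is a hypothetical name.

```cpp
#include <cassert>
#include <cstring>

// 13 characters between the quotes, 14 bytes of storage.
const char sample[] = "I am a string";

// The empty string still occupies one byte for the '\0'.
const char empty[] = "";

// Allocate room for the characters plus the terminating '\0' and copy.
char* duplicateString(const char* source)
{
  assert(source != 0);
  char* copy = new char[std::strlen(source) + 1];
  std::strcpy(copy, source);
  return copy;  // The caller releases the copy with delete []
}
```

Here sizeof(sample) counts the terminating null character whereas strlen(sample) does not.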

The standard string library is described in most books on the C-language and is also described in the man pages of UNIX systems. On linus, for example,

man strcpy

provides all the information relevant to string copying. The most common string functions are

char* strcpy(s, ct);
// Copy string ct to string s, including '\0', return string s.

char* strncpy(s, ct, n);
// Copy at most n characters of string ct to s, pad with '\0'
// if ct has fewer than n characters, return string s.

char* strcat(s, ct);
// Concatenate string ct to end of string s, return string s.

int strcmp(cs, ct);
// Compare string cs to string ct, return <0 if cs<ct, 0 if cs==ct, or >0 if cs>ct.

int strncmp(cs, ct, n);
// Compare at most n characters of string cs to string ct,
// return <0 if cs<ct, 0 if cs==ct, or >0 if cs>ct.

char* strchr(const char* cs, int c);
// Return pointer to first occurrence of c in cs or NULL if not present

char* strrchr(const char* cs, int c);
// Return pointer to last occurrence of c in cs or NULL if not present

char* strstr(const char* cs, const char* ct);
// Return pointer to first occurrence of string ct in cs, or NULL if not present
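
As an illustration of how these functions combine when a request is examined, consider the sketch below. The helper names isGetRequest and requestsPrivatePath are hypothetical, and the request is assumed to be null-terminated.

```cpp
#include <cassert>
#include <cstring>

// True when the request line starts with the method GET.
bool isGetRequest(const char* request)
{
  assert(request != 0);
  return std::strncmp(request, "GET ", 4) == 0;
}

// True when the path on the request line contains /private.
bool requestsPrivatePath(const char* request)
{
  assert(request != 0);
  const char* path = std::strchr(request, ' ');     // Space before the path
  if (path == 0) return false;
  ++path;                                           // First character of path
  const char* pathEnd = std::strchr(path, ' ');     // Space after the path
  const char* hit = std::strstr(path, "/private");
  // A match after the path, for example in a header line, does not count.
  return hit != 0 && pathEnd != 0 && hit < pathEnd;
}
```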

Source code, compilation, linking and loading

Make sure your present working directory is ~/kurs/src. Remove the subfolder lab6 if it exists in ~/kurs/src. Then, copy all files from your solution in project 5 with the command

cp -r lab5 lab6

and change your present working directory into ~/kurs/src/lab6. This project will be an extension of your solution in project 5. Add the skeleton of project 6 to your previous files with the command

cp -r ~inin/kurs/src/lab6/* .

There should be three new files in addition to the ones from project 5 in the directory lab6,

The class HTTPServer is not declared in a header file. The file http.hh must be created and defined by you. Make sure that your present working directory is ~/kurs/make and type the commands:

Testing the solution

Use the browser of your preference and connect to your web server by specifying the IP address only. Analyse the network traffic with the network monitor in order to learn more about the HTTP protocol. It might be very instructive to study commercially available web servers in the local network.

With the experience from previous projects you should be able to verify the requirements of the solution on your own.

The browser uses a cache in order to store web pages. Use the reload functionality when you get unexpected results. Occasionally, the browser stores states of failure as well and the computer will have to be restarted in order to browse the pages of your web. The authentication is only performed once when the private page is referred to the first time. The browser has to be restarted in order to perform the authentication of user name and password again.