Project 6 - A web server

Introduction

In this project a web server will be constructed based on the implementation of TCP and the socket interface introduced earlier in the course. The web server will support browsing with authentication of user access. In addition, a dynamic web page will be implemented which may be updated in a web form.

Another interesting exercise in this project is the file system in an embedded system. The solution will be somewhat simplified due to the read-only memory available. In principle, the simple file system may be generated without knowing the details, but a general understanding will help in the implementation of the dynamic web page.

There is a lot of string manipulation involved in this project. Anyone unfamiliar with the standard C-language string library will have to learn about the basic functions available.

The solution must meet the following requirements before the project is considered finished:

no introduction of memory leaks,
the web server supports the HTTP commands GET and HEAD in general browsing,
when the root directory of the server is requested, the first page must be provided,
the web server supports the HTTP command POST in order to allow generation of a dynamic web page,
the web server supports basic authentication of user access based on the directory names,
at least one large picture, approximately 200 kB large, is to be displayed correctly in the browser, and
one page should contain at least three different small pictures.

Some administrative steps are required in order to obtain the source code. Refer to the suggested design of the class hierarchy as well as advice on how to approach the completion of the skeleton code. An executable which may be loaded into the ETRAX unit is compiled, linked, and loaded in the same manner as in the previous overview of the system. Finally, you will have to test your solution.

Details in the implementation

Introduction to HTTP requests

Every request header is terminated by an empty line, that is, a line consisting of <CRLF> only. The token <CRLF> means ASCII 0x0D, carriage return, followed by 0x0A, line feed. A complete GET request has been received when two consecutive <CRLF> tokens have been found.

GET / HTTP/1.0<CRLF>
From: user@domain.com<CRLF>
User-Agent: HTTPTool/1.0<CRLF>
<CRLF>
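The termination rule above can be checked with a single search for two consecutive <CRLF> tokens. The sketch below is a minimal illustration and not part of the skeleton code. The function name headerLength is an assumption, and the received data is assumed to be stored in a null-terminated buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper, not part of the skeleton code.
// Returns the number of bytes up to and including the empty line that
// terminates the header, or 0 if the empty line has not arrived yet.
// The buffer is assumed to be null-terminated.
std::size_t headerLength(const char* request)
{
  assert(request != 0);
  const char* end = std::strstr(request, "\r\n\r\n");
  if (end == 0)
  {
    return 0;                                  // Header still incomplete
  }
  // Include the four bytes of <CRLF><CRLF> in the length.
  return static_cast<std::size_t>(end - request) + 4;
}
```

A return value of 0 means that the server should keep the data and wait for the next segment before it answers.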

The response of the server will be a transfer of the resource if it is available

HTTP/1.0 200 OK<CRLF>
Content-Type: text/html<CRLF>
<CRLF>
<html><title>Home page</title><body>.....

The HEAD request is very similar to the GET request, but instead of returning the actual resource the server returns the response headers only. This may be useful in order to check the characteristics of a resource without actually downloading it.

A POST request is more complicated since data will be sent after the <CRLF><CRLF> of the request. However, one of the header lines will contain the token Content-Length: followed by the number of data bytes in the request. It is the responsibility of the application to buffer the received data until the right number of bytes has arrived. Often, the data are transmitted in a number of segments after the request header has been received.

POST /path/script.cgi HTTP/1.0<CRLF>
From: user@domain.com<CRLF>
User-Agent: HTTPTool/1.0<CRLF>
Content-Type: application/x-www-form-urlencoded<CRLF>
Content-Length: 32<CRLF>
<CRLF>
home=Cosby&favorite+flavor=flies
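The buffering rule for POST requests may be sketched as a test that is repeated every time a new segment has been appended to the buffer. The function name postIsComplete and its parameters are assumptions made for this illustration. The sketch parses the header field itself with strtoul and assumes that the field is spelled exactly Content-Length: and that the buffer is null-terminated.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical helper, not part of the skeleton code.
// True when the header and the announced number of body bytes are both
// present among the first 'received' bytes of 'buffer'.
bool postIsComplete(const char* buffer, std::size_t received)
{
  assert(buffer != 0);
  const char* headerEnd = std::strstr(buffer, "\r\n\r\n");
  if (headerEnd == 0)
  {
    return false;                              // Header not complete yet
  }
  const char* field = std::strstr(buffer, "Content-Length:");
  if (field == 0 || field > headerEnd)
  {
    return false;                              // No length in the header
  }
  // strtoul skips the space between the colon and the digits.
  const unsigned long bodyLength =
    std::strtoul(field + std::strlen("Content-Length:"), 0, 10);
  const std::size_t headerLength =
    static_cast<std::size_t>(headerEnd - buffer) + 4;
  return received >= headerLength + bodyLength;
}
```

In the example request above the body home=Cosby&favorite+flavor=flies is exactly 32 bytes long, so the test succeeds only when the last byte of the body has arrived.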

Below, a few examples of web-transactions are presented. Remember that a new connection is established in every new transaction and that it is the responsibility of the server to terminate the connection when the transaction is finished. Required parameters are denoted by <...> whereas optional parameters are shown within [...].

The standard GET-request

The client establishes a new connection to port 80 on the server and requests the first page by sending

GET / HTTP/1.0<CRLF>
[Header line 1]<CRLF>
[Header line 2]<CRLF>
<CRLF>

The server answers the client with

HTTP/1.0 200 OK<CRLF>
Content-Type: text/html<CRLF>
<CRLF>
<html><title>Home page</title><body>.....

The server closes the connection.

The GET-request with authentication

The client establishes a new connection to port 80 on the server and requests a private page by sending

GET /private/private.htm HTTP/1.0<CRLF>
[Header line 1]<CRLF>
[Header line 2]<CRLF>
<CRLF>

The server answers the client with

HTTP/1.0 401 Unauthorized<CRLF>
Content-Type: text/html<CRLF>
WWW-Authenticate: Basic realm="private"<CRLF>
<CRLF>
<html><head><title>401 Unauthorized</title></head><CRLF>
<body><h1>401 Unauthorized</h1></body></html>

The server closes the connection. At this point the web browser presents a dialog to the user in order to request a user name and a password.

The client establishes a second connection with the supplied user name and password

GET /private/private.htm HTTP/1.0<CRLF>
[Header line 1]<CRLF>
Authorization: Basic qWjfhjR124=<CRLF>
[Header line 3]<CRLF>
<CRLF>

The server answers the client with

HTTP/1.0 200 OK<CRLF>
Content-Type: text/html<CRLF>
<CRLF>
<html><title>Secret page</title><body>.....

The server closes the connection.

Details of the standard GET and HEAD requests

The implementation of the GET request requires a correct format of the path and file name. The root path / always translates into the path /index.htm.

Attempt to find the requested file in the file system. If it cannot be found by the server, the response

HTTP/1.0 404 Not found\r\n
Content-type: text/html\r\n
\r\n
<html><head><title>File not found</title></head>
<body><h1>404 Not found</h1></body></html>

must be sent. The characters \r\n are the C-language equivalent of the token <CRLF>, ASCII 0x0D, carriage return, followed by 0x0A, line feed. If the file is available, the response

HTTP/1.0 200 OK\r\n
Content-type: text/html\r\n
\r\n
<file content>

is sent. The content type header field is based on the extension of the file

File extension	Content-type:
.htm	text/html
.gif	image/gif
.jpg	image/jpeg

and the token <file content> should be replaced with the actual content of the requested resource. A reply to a HEAD request would not send anything after the header lines and the blank line

HTTP/1.0 200 OK\r\n
Content-type: text/html\r\n
\r\n
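The mapping from file extension to content type, and the header shared by the GET and HEAD replies, may be sketched as below. The function names contentTypeOf and okHeader are assumptions, and std::string is used for brevity only. The target system may require plain character arrays instead.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical helper: map a file name to a Content-type value using the
// table above. Returns 0 for extensions the table does not list.
const char* contentTypeOf(const char* fileName)
{
  assert(fileName != 0);
  const char* dot = std::strrchr(fileName, '.');  // Last '.' starts extension
  if (dot == 0) return 0;
  if (std::strcmp(dot, ".htm") == 0) return "text/html";
  if (std::strcmp(dot, ".gif") == 0) return "image/gif";
  if (std::strcmp(dot, ".jpg") == 0) return "image/jpeg";
  return 0;
}

// Hypothetical helper: build the header of a 200 response. A GET reply
// appends the file content after this string, a HEAD reply sends it as is.
std::string okHeader(const char* contentType)
{
  assert(contentType != 0);
  std::string header = "HTTP/1.0 200 OK\r\n";
  header += "Content-type: ";
  header += contentType;
  header += "\r\n\r\n";
  return header;
}
```

Note that picture files contain bytes with the value zero. The file content must therefore be sent with an explicit length and never be handled with string functions such as strlen or strcat.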

Details of the GET-request with authentication

Whenever a GET request is received by the server, a path that contains /private requires authorisation. The authorisation is performed in two steps.

Try to find the header field Authorization: Basic in the request. If it cannot be found, the response

HTTP/1.0 401 Unauthorized<CRLF>
Content-Type: text/html<CRLF>
WWW-Authenticate: Basic realm="private"<CRLF>
<CRLF>
<html><head><title>401 Unauthorized</title></head><CRLF>
<body><h1>401 Unauthorized</h1></body></html>

is sent back to the client. If the request contains the header field as in the example below,

GET /private/private.htm HTTP/1.0<CRLF>
From: user@domain.com<CRLF>
User-Agent: HTTPTool/1.0<CRLF>
Authorization: Basic qWjfhjR124=<CRLF>
<CRLF>

the characters qWjfhjR124= must be decoded with the provided method HTTPServer::decodeBase64 in the file http.cc. The result from the method is a string of the form user:password. Compare the string with a few invented users with passwords stored in the class HTTPServer and decide whether the resource should be sent.

When the user is authorised to request the resource, the file content is sent with the same response type as in a standard GET request. An unauthorised user must be sent another 401 Unauthorized response.
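The two steps of the authorisation may be sketched as two small helpers. Both function names and the accounts in the table are invented for this illustration. The Base64 token returned by the first helper still has to be decoded with the provided method HTTPServer::decodeBase64 before it is passed to the second one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: return the Base64 token that follows the header
// field "Authorization: Basic ", or an empty string if the field is absent.
std::string basicToken(const char* request)
{
  assert(request != 0);
  static const char field[] = "Authorization: Basic ";
  const char* start = std::strstr(request, field);
  if (start == 0)
  {
    return std::string();                  // Step 1 fails: answer with 401
  }
  start += std::strlen(field);
  const char* end = std::strstr(start, "\r\n");
  if (end == 0)
  {
    end = start + std::strlen(start);      // Header line not terminated
  }
  return std::string(start, end - start);
}

// Hypothetical helper: compare a decoded "user:password" string with a
// table of invented accounts. The accounts below are examples only.
bool isKnownUser(const char* userAndPassword)
{
  assert(userAndPassword != 0);
  static const char* const accounts[] = { "alice:wonderland", "bob:builder" };
  const std::size_t count = sizeof(accounts) / sizeof(accounts[0]);
  for (std::size_t i = 0; i < count; ++i)
  {
    if (std::strcmp(userAndPassword, accounts[i]) == 0)
    {
      return true;                         // Step 2 succeeds: send the file
    }
  }
  return false;                            // Step 2 fails: answer with 401
}
```

Comparing the complete string user:password with strcmp avoids accepting a password that merely begins with the correct characters.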

Details of the POST request

In order to implement the dynamic web page the file system must support storage of new files. The updated content of the file /dynamic/dynamic.htm is transferred to the server with a POST request when the form in the resource /private/private.htm is submitted. The file content in the POST request is URL encoded and must be decoded with the provided method HTTPServer::decodeForm before the file is updated. A typical POST request is shown below

POST /private/private.htm HTTP/1.0<CRLF>
[Header line 1]<CRLF>
Content-Length: 339<CRLF>
[Header line n]<CRLF>
<CRLF>
dynamic.htm=%0D%3Chtml%3E%0D%3Chead%3E%0D%3Ctitle%3EDynamic+page%3C%2Ftitl
e%3E%0D%3C%2Fhead%3E%0D%3Cbody+b....

A method called HTTPServer::contentLength is provided in the file http.cc. It finds the token Content-Length: and converts the string representation of the integer into an integer of type udword. Use the value of the field in order to determine whether the entire request is received by the server.

The string dynamic.htm= is removed in the method HTTPServer::decodeForm.
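To give an idea of what URL decoding involves, a minimal sketch is shown below. It is not the provided method HTTPServer::decodeForm, and the function name urlDecode is an assumption. The sketch translates + into a space and %XX into the character with the hexadecimal code XX. It does not remove the string dynamic.htm= in front of the data.

```cpp
#include <cassert>
#include <cctype>
#include <cstdlib>
#include <string>

// Sketch of URL decoding. This is NOT the provided decodeForm method.
std::string urlDecode(const std::string& encoded)
{
  std::string decoded;
  for (std::string::size_type i = 0; i < encoded.size(); ++i)
  {
    const char c = encoded[i];
    if (c == '+')
    {
      decoded += ' ';                          // '+' encodes a space
    }
    else if (c == '%' && i + 2 < encoded.size()
             && std::isxdigit(static_cast<unsigned char>(encoded[i + 1]))
             && std::isxdigit(static_cast<unsigned char>(encoded[i + 2])))
    {
      // Two hexadecimal digits follow: convert them to one character.
      const char hex[3] = { encoded[i + 1], encoded[i + 2], '\0' };
      const long value = std::strtol(hex, 0, 16);
      assert(value >= 0 && value <= 255);
      decoded += static_cast<char>(value);
      i += 2;                                  // Skip the two digits
    }
    else
    {
      decoded += c;                            // Ordinary character
    }
  }
  return decoded;
}
```

For example, the fragment %3Chtml%3E in the request above decodes into <html>.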

The response to a POST request is sent in the same manner as with GET requests

HTTP/1.0 200 OK\r\n
Content-type: text/html\r\n
\r\n

You can either include the content of the resource dynamic.htm or add a page which confirms the submission of the form and the update of the file

HTTP/1.0 200 OK\r\n
Content-type: text/html\r\n
\r\n
<html><head><title>Accepted</title></head>
<body><h1>The file dynamic.htm was updated successfully.</h1></body></html>

Creation of the file system

Uncompressed LHarc archives are created with the command

lha ao5z1 <archive file> [file...]

on the server linus. The command a, add to archive, is complemented by the options

o, use LHarc compatible method,
z, do not compress the files, and
1, the integer one, which defines the header level.

All files specified with [file...] are added to the archive named <archive file>. If no archive with that name exists, a new one is created.

The archive file is a binary file and may be translated into the C-language by the Perl-script command hex2char <archive file>. The result on stdout is redirected into a file called lhafile.bin which is included in the file fs.cc containing the file system class.

For example, the commands below create a file system of the sample web

cd kurs/www
lha ao5z1 ../src/lab6/wwwarc.lzh *
cd ../src/lab6
hex2char wwwarc.lzh > lhafile.bin

The paths and file names in the file system

An absolute path in the file system must be separated into the path and the file name. For example, the absolute path /dynamic/dynamic.htm is transformed into the path dynamic and the file name dynamic.htm. The byte 0xff, which is -1 when interpreted as a signed character, serves both as the delimiter between directory names and as the termination token of the path. If the file is located in the root directory, as in / or /index.htm, the path is set to NULL. For all other absolute paths the initial / must be removed. For example,

Absolute path	Path	File name
/pict/small.gif	pict<0xff>	small.gif
/index.htm	NULL	index.htm

An example of the transformation of the path name into something the file system will understand is presented below. The manual page for unfamiliar C-language functions may be read with the command man <function> on the computer <server name>. The method extractString is provided in the file http.cc.

The method findPathName expects a string like GET /private/private.htm HTTP/1.0

char*
HTTPServer::findPathName(char* str)
{
  char* firstPos = strchr(str, ' ');     // First space on line
  firstPos++;                            // Pointer to first /
  char* lastPos = strchr(firstPos, ' '); // Last space on line
  char* thePath = 0;                     // Result path
  if ((lastPos - firstPos) == 1)
  {
    // Is / only
    thePath = 0;                         // Return NULL
  }
  else
  {
    // Is an absolute path. Skip first /.
    thePath = extractString((char*)(firstPos+1),
                            lastPos-firstPos);
    if ((lastPos = strrchr(thePath, '/')) != 0)
    {
      // Found a path. Insert -1 as terminator.
      *lastPos = '\xff';
      *(lastPos+1) = '\0';
      while ((firstPos = strchr(thePath, '/')) != 0)
      {
        // Insert -1 as separator.
        *firstPos = '\xff';
      }
    }
    else
    {
      // Is a file in the root directory, e.g. /index.htm.
      // delete [] assumes that extractString allocates with new [].
      delete [] thePath; thePath = 0;    // Return NULL
    }
  }
  return thePath;
}
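The method above returns the path only. The file name part of the request line may be extracted in a similar manner, as in the sketch below. The function name findFileName is an assumption and the skeleton code may solve this differently. The sketch returns a std::string for brevity and applies the rule that the root path / translates into index.htm to every path that ends with /, which is also an assumption.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical companion to findPathName. Expects a string like
// GET /pict/small.gif HTTP/1.0 and returns the file name small.gif.
std::string findFileName(const char* requestLine)
{
  assert(requestLine != 0);
  const char* first = std::strchr(requestLine, ' ');  // First space on line
  if (first == 0)
  {
    return std::string();                    // Malformed request line
  }
  ++first;                                   // Pointer to first /
  const char* last = std::strchr(first, ' ');         // Last space on line
  if (last == 0)
  {
    last = first + std::strlen(first);       // No protocol version given
  }
  // Walk backwards from the end until the character before is a /.
  const char* name = last;
  while (name > first && *(name - 1) != '/')
  {
    --name;
  }
  if (name == last)
  {
    return "index.htm";                      // Path ends with /
  }
  return std::string(name, last - name);
}
```

Because the result is returned by value, no memory has to be released by the caller, which makes it easier to fulfil the requirement on memory leaks.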

String manipulation in the server application

A string constant, or string literal, is a sequence of zero or more characters surrounded by double quotes, as in

"I am a string"

"" /* The empty string */

The quotes are not part of the string. Technically, a string constant is an array of characters. The internal representation of a string has a null character '\0' at the end. Thus, the physical storage required is one more than the number of characters written between the quotes. It is essential to be consistent with the terminating null character.
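The extra byte for the terminating null character matters whenever storage for a copy of a string is allocated. The sketch below shows the pattern. The function name duplicate is an assumption made for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: copy a string into freshly allocated storage.
// The caller must release the copy with delete [] to avoid a memory leak.
char* duplicate(const char* source)
{
  assert(source != 0);
  const std::size_t length = std::strlen(source);  // Characters before '\0'
  char* copy = new char[length + 1];               // One extra byte for '\0'
  std::strcpy(copy, source);                       // Copies the '\0' as well
  return copy;
}
```

The string "I am a string" has 13 characters according to strlen but occupies 14 bytes, which is also the value of sizeof("I am a string").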

The standard string library is described in most books on the C-language and is also described in the man pages of UNIX systems. On linus, e.g.

man strcpy

provides all the information relevant to string copying. The most common string functions are

char* strcpy(s, ct);
// Copy string ct to string s, including '\0', return string s.

char* strncpy(s, ct, n);
// Copy at most n characters of string ct to s, pad with '\0'
// if ct has fewer than n characters, return string s.
// Note that s is not terminated with '\0' if ct has n or more characters.

char* strcat(s, ct);
// Concatenate string ct to end of string s, return string s.

int strcmp(cs, ct);
// Compare string cs to string ct, return <0 if cs<ct, 0 if cs==ct, or >0 if cs>ct.

int strncmp(cs, ct, n);
// Compare at most n characters of string cs to string ct,
// return <0 if cs<ct, 0 if cs==ct, or >0 if cs>ct.

char* strchr(cs, c);
// Return pointer to first occurrence of c in cs or NULL if not present.

char* strrchr(cs, c);
// Return pointer to last occurrence of c in cs or NULL if not present.

char* strstr(cs, ct);
// Return pointer to first occurrence of string ct in cs, or NULL if not present.
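As an example of how the functions above may be used in the server, the sketch below identifies the HTTP command at the beginning of a request with strncmp. The enumeration and the function name methodOf are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical enumeration of the commands the server must support.
enum Method { METHOD_GET, METHOD_HEAD, METHOD_POST, METHOD_UNKNOWN };

// Hypothetical helper: identify the command at the start of a request.
// The space is included in the comparison so that a request beginning
// with e.g. GETX is not mistaken for a GET request.
Method methodOf(const char* request)
{
  assert(request != 0);
  if (std::strncmp(request, "GET ", 4) == 0)  return METHOD_GET;
  if (std::strncmp(request, "HEAD ", 5) == 0) return METHOD_HEAD;
  if (std::strncmp(request, "POST ", 5) == 0) return METHOD_POST;
  return METHOD_UNKNOWN;
}
```

The function strncmp is preferred over strcmp here since only the first few characters of the request are of interest.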

Source code, compilation, linking and loading

Make sure your present working directory is ~/kurs/src. Remove the subfolder lab6 if it exists in ~/kurs/src. Then, copy all files from your solution in project 5 with the command

cp -r lab5 lab6

and change your present working directory into ~/kurs/src/lab6. This project will be an extension of your solution in project 5. Add the skeleton of project 6 to your previous files with the command

cp -r ~inin/kurs/src/lab6/* .

There should be three new files in addition to the ones from project 5 in the directory lab6,

http.cc, a few basic support methods to be used in the class HTTPServer,
fs.hh, declarations of the operations supported in the simple file system, and
fs.cc, an almost complete skeleton of the file system.

The class HTTPServer is not declared in a header file. You must create the file http.hh and declare the class there. Make sure that your present working directory is ~/kurs/make and type the commands:

genmake -no-lint-files lab6, which creates a new makefile for the new target source code,
axmake -clean lab6, which removes all old object files created in previous compilations (very important as the make application could use outdated object files),
axmake lab6, which compiles and links the new target source code.

Testing the solution

Use the browser of your preference and connect to your web server by specifying the IP address only. Analyse the network traffic with the network monitor in order to learn more about the HTTP protocol. It might be very instructive to study commercially available web servers in the local network.

With the experience from previous projects you should be able to verify the requirements of the solution on your own.

The browser uses a cache in order to store web pages. Use the reload functionality when you get unexpected results. Occasionally, the browser stores states of failure as well and the computer will have to be restarted in order to browse the pages of your web. The authentication is only performed once, when the private page is requested for the first time. The browser has to be restarted in order to perform the authentication of user name and password again.
