How to get Started Writing an App
This note explains the overall moves to make when you are writing an application on top of the W3C
Reference Library (libwww). For details on how to use the library, you should look at the User's Guide.
It is a good idea to have a look at our example
applications:
This quick guide takes you through the basic steps necessary to create your own application with all
the power of the W3C Reference Library:
- Define preferences and application capabilities
- Binding File Suffixes and Data Formats
- Should I use the Internal SGML/HTML Parse Stream?
- Register a memory cache handler
- Register event handlers for user input
- Register a request termination handler
- Register a "Unknown Header" handler
- Register a select Timeout Function
- Issue requests to the Library
- Putting a request into a context
- Interrupt a request
- Logging and History Management
- User Messages
- When to free Objects
Define preferences and application capabilities
The HTInit module is a special module in that it defines
the capabilities of the Library. Think of libwww as a core that can do absolutely nothing on its
own. All the functionality comes from modules that are registered dynamically at runtime. This is
why it is so important to use the initialization function mentioned above. The library basically
boot-straps itself to find out what it is capable of doing.
When you unpack the library you already have a lot of functionality to plug in. You have protocol
modules that can access HTTP, FTP, Gopher, WAIS, NNTP, Telnet, TN3270, rlogin, services and the
local file system. You also have stream converters that handles HTML documents, plain text
documents, MIME documents and various conversions between these data formats. Furthermore you have
the possibility of defining presenters that use external viewers to present for example
images, audio files etc.
By default, the HTInit module contains functions to
enable all the functionality that the library provides, but you might want to change this, for
example if you have a special image viewer that you would like to use. The way to do this is either
to modify the HTInit module or to override HTInit with your own version. It's described in more
detail in the User's Guide what it means to override a Library module.
Both converters and presenters are stored in a linked list (the HTList object). This list can either be bound to a global
list of converters and presenters that are used in all requests or you can bind the lists
locally to a single request object in which case it is only used in this particular case.
Data Format Conversions
Converters are streams that are capable of converting directly from one data format to
another, for example from a HTML document to some kind of presentation format that the user can
actually see on the screen. This is the fastest way of presenting an object to the user as it
doesn't require external applications to step in and do the presentation.
extern void HTConverterInit (HTList * conversions);
External Presenters
Presenters are used to present a data format to the user by calling an external program,
for example a post script viewer. In practice presenters work by first saving the data object in
local file and then spawn off the external viewer. This is the reason why it is a big advantage to
use a converter instead as the presentation to the user then can happen as more and more of the data
object arrives over the network.
extern void HTPresenterInit (HTList * conversions);
Protocol Modules
By default the Library enables all available protocol modules but if this is desired then it can
easily be changed using the HTAccessInit
function. An example is if you want to make an
application that only can speak HTTP, then you can disable the initialization of all the other
protocol modules by simply taking them out of the HTAccessInit
function.
extern void HTAccessInit (void);
Binding File Suffixes and Data Formats
Not all protocols support the notion of describing the data format of a data object. An example if
FTP where only two modes are possible: text and binary. However this is not sufficient in order to
select an appropriate converter of presenter (both images and audio files are binary data object but
can in general not be handled by the same application). This is also a problem when accessing the
local file system which in most cases do not keep information about the data format.
The library solves this by looking into a table of bindings between a file suffix and the
corresponding data format. By default the Library defines a long list of the "most common file
suffixes" but the list is of course configurable! If you look into the HTInit module again, you will find a function called:
extern void HTFileInit (void);
you will find that it is just a question of adding new suffixes that are not mentioned. However, it
is not necessary to list all possible suffixes. Luckily the first couple of line in many data
objects tell what kind of data they contain and the Library uses this to try and guess unknown data
formats.
Should I use the Internal SGML/HTML Parse Stream?
The library contains code for presenting a data object in a text based application as is the case
with the Line Mode Browser, but as this is not generally sufficient GUI clients must overwrite the
data object presentation modules in the Library in order to take advantage of a more advanced
user-interface. Just to make it more flexible this can be done in three ways depending on how much
you want to do yourself. The three interfaces are illustrated below:

- SGML Stream Interface
- Use this interface if you do not want to use the SGML
parser in the Library and the following transition to a structured stream which is a subclass of
the stream class. This can all be set up with out overwriting any module but simply by defining the
converters not to use the SGML/HTML parsers.
- HTML Structured Stream Interface
- If the client has its own HTML parser then the interface is between the client HTML parser and
the Library SGML parser. The SGML parser is a general SGML parser which can be setup with a specific
DTD and it feeds the HTML parser with structured data. In this case, you will be emulating the HTML module, and generating a hypertext object from the
structured stream.
- HText Call Back Interface
- If the client wants to use the HTML parser in the Library then this is the interface to use for
the application. The hypertext object is parsed and the communication with the client is based on a
set of call-back functions in the HTML parser. The call-back functions are all defined as prototypes
in the HText module but the client must provide the
actual code that defines the presentation method used for a specific HTML tag. If you wish to
maintain the structure of the SGML file within your object, then the SGML interface will be a better
place to connect your code.
You are free to define the structure of the hypertext object (the declaration can be found in the HText module. You may want to define your own styles and
font definitions.
Register a memory cache handler
The library handles a persistent file cache which provides intermediate term storage of cacheable
objects. However the Library also has a notion of a memory cache but this is not maintained
by the Library itself but by the application. The reason for this is that the Library does not know
what kind of data format the the application is capable of storing or where they are stored, and it
also allows the application to control the memory management, that is how much space to use for
storing data objects in memory.
If the application wants a memory cache (which has an important performance impact) it must register
a memory cache handler that can make a decision whether a data object is in memory or not and if it
is stale or not. Every URL has an associated HTAnchor
object that stores all known information about that URL, for example the title, length of the
data object (if any) etc. The Library calls the memory handler on every request but before making a
decision about a data object, the memory cache manager can consult the requested anchor object to
find out if it has expired etc. A memory handler is registered and unregistered using the functions
typedef int HTMemoryCacheHandler (HTRequest * request, HTExpiresMode mode, char * message);
extern BOOL HTMemoryCache_register (HTMemoryCacheHandler * cbf);
extern BOOL HTMemoryCache_unRegister (void);
The handler can use two return codes which both are defined in the HTUtils modules.
HT_LOADED
- If the object was found and looked good
HT_ERROR
- If the object was not found, has expired etc.
An example of a memory handler can be found in the Line Mode Browser
in the GridText module. You can also see
where it is registered as the memory handler in the Library in the HTBrowse.c module
Register event handlers for user input
The Library has a very flexible event model that allows an application to either use it's own event
loop that the library can makes calls to or to use the event loop in the HTEvntrg module in the library. In most cases the library
event loop is sufficient which is the case for example on Mac, Window, and Unix platforms.
The application can register as many event handlers for as many channels that it can get input
on. In the case of the Line Mode Browser, input can only be sent via standard input, so only one
event handler is necessary. However if more than one channel is available then they can all be
associated with an event handler by registering them in the Library. An event handler is a function
of the type
typedef int HTEventCallBack (SOCKET sockfd, HTRequest * request, SockOps ops);
and they can be registered in two ways depending on the event channel:
- The event channel can be redirected to a BSD socket
- The event channel is connected to a TTY console (for example under NT)
In the first case use the following functions to register and unregister a handler:
extern int HTEvent_Register (SOCKET sockfd, HTRequest * request, SockOps ops,
HTEventCallBack *cbf, HTPriority priority);
extern int HTEvent_UnRegister (SOCKET sockfd, SockOps ops);
See the HTEvntrg module for more information on the
arguments. Likewise if you are on for example a Windows NT platform then use the following
functions:
extern int HTEvent_RegisterTTY (SOCKET sockfd, HTRequest * request, SockOps ops,
HTEventCallBack *cbf, HTPriority priority);
extern int HTEvent_UnRegisterTTY(SOCKET sockfd, SockOps ops);
The return code from an event handler is simple but very important. There are two allowed values:
- HT_OK
- If the handler could handle the event
- HT_ERROR
- If an internal error occured, for example a NULL argument
was passed etc. This code will result in that the event loop is
exit'ed and the application is ready to be terminated.
It helps understanding event handlers if you think of the Library as a coke machine (the ones with a
coin slot only - the ones which also has a bill slot work differently!) Event handlers are the coin
slot on top where you can request a service from the library. You are guaranteed that the machine
never returns any coins to you the same way - they always come out at the bottom of the
machine. This means that the Library always returns "Thanks, I have received your request - please
look for the response at the bottom" on any request started from within an event handler. In the
library case the result is always returned in the request termination handler that is the "bottom"
of the Library machine.
Register a request termination handler
The application can register a callback function that is called whenever a request has terminated,
regardless. The callback function can furthermore be registered to be called regardless of the
status code or on upon a specific codes. The set of status codes is:
HT_LOADED
- The request was successful
HT_INTERRUPTED
- The request was interrupted
HT_ERROR
- An error occured
HT_NO_DATA
- The request resulted in an object with no data, for example a telnet session
HT_RETRY
- The service was unavailable but the request can be repeated at at
later time. This later time can be obtained using the function
extern time_t HTRequest_retryTime (HTRequest * request);
Return codes are not used from termination handlers.
Register an "Unknown Header" Handler
If you have additional headers that you want to experiment with you can register a header parser to
handle all unknown headers. This handler is called whenever an unknown header is encountered when
parsing the object header. If the returns YES then the MIME parser stores the metainformation in the
anchor object if self so that it can be found at a later point in time. If the handler returns NO,
then the header is discarded.
The header passed to the call back function will be in normal NULL terminated C string without any
CRLF. The line will also be unwrapped so that it is easier to parse for the call back function.
typedef int HTMIMEHandler (HTRequest * request, char * header);
extern int HTMIME_register (HTMIMEHandler * cbf);
extern int HTMIME_unRegister (void);
Register a Select Timeout Function
You can register a call back function together with a timeout that is used in the select call of the
event loop. When the select() call times out, the call
back function is called. The registration can either be so that the call back always is
called when that the select call times out or only when Library sockets are in use.
typedef int HTEventTimeout (HTRequest *);
extern BOOL HTEvent_registerTimeout (struct timeval *tp, HTRequest * request, HTEventTimeout *tcbf, BOOL always);
Issue Requests to the Library
The Access module has a set of functions that works as
a user interface to the request manager. You can call
the request manager directly but often it is simpler to use the Access module. As mentioned, the request manager returns
immediately and the result of the request is handled back to the application using the termination
handlers.
What and When to Escape Strings
The Library's represents internally location strings as URLs which means that they always
are escaped. Only when it is required to access a resource are they unescaped, for example via
FTP. All location strings passed to the Library are also treated as URLs which means that they
must be escaped already. Hence in the case of the request methods mentioned aboce, all URL
arguments must be escaped if the resource is to be found.
Most often this is only a problem when people type in location strings directly, for example using
"goto location" etc. In the Escape module the Library
provides two functions for escaping location strings into URLs and unescaping URLs into location
strings:
typedef enum _HTURIEncoding {
URL_XALPHAS = 0x1,
URL_XPALPHAS = 0x2,
URL_PATH = 0x4
} HTURIEncoding;
extern char * HTEscape (CONST char * str, HTURIEncoding mask);
extern char * HTUnEscape (char * str);
Multiple Simultaneous Requests
Multiple simultaneous requests can be started at any time from within an event handler. It is simply
a question of calling one of the load functions in the Access module. The Library keeps a request queue where
all active requests are kept. The application can define the maximum number of open sockets that can
be open at any one time. This number is by default 6 but can be changed using the following
functions:
extern BOOL HTNet_setMaxSocket (int newmax);
extern int HTNet_maxSocket (void);
If there are no free sockets available the request is put into a pending queue and started as soon
as possible.
Putting a request into a Context
When multiple requests can be initiated simultaneously it is in general not possible to predict the
order that the results return to the application. This is a function of the size of the individual
data objects, net work speed etc. In order to keep track of ordering of the requests, for example in
the history list, it is necessary to put them into some kind of context. The Library provides the
hooks for maintaining a context together with the Request
object by the following methods:
typedef int HTRequestCallback (HTRequest * request, void *param);
extern void HTRequest_setCallback (HTRequest *request, HTRequestCallback *cb);
extern HTRequestCallback *HTRequest_callback (HTRequest *request);
The callback function can be passed an arbitrary pointer (the void part) which can describe the
context of the current request structure. If such context information is desired then it can be set
using the following methods:
extern void HTRequest_setContext (HTRequest *request, void *context);
extern void *HTRequest_context (HTRequest *request);
There is no limit to the definition of a context data object and the memory management if entirely
up to the application. The Request object simply carries the information around.
Interrupt a request
Any request can be interrupted from an event handler by calling either of the functions (the first
kills a specific request, the second is more radical and kills them all):
extern BOOL HTRequest_kill (HTRequest * request);
extern BOOL HTNet_killAll (void);
Logging and History Management
The library has a set of modules that are not called from within the library at all but can be used
by the application. Two examples are a History manager
and a Log manager that provides functionality for keeping
a history list of visited objects and to provide logging of the results. The application is free to
use there modules but does not have to.
User Messages
The library has a basic User Messages and Prompts
module which is intended for the Line Mode Browser. In case you are building a GUI client you would
probably want to write your own version using windows etc. This can be done by overriding the module
in the Library with your own that has the same set of functions as described in the User's Guide.
When to free Objects
This is a short set of recommendation on when to free objects in memory.
- Anchors
- Anchors are normally not freed before the application terminates as they have important
information about the part of the Web that the user has been in tough with. However, anchors
can be freed at any time if they are not used by an ongoing request
- Requests
- Request objects are only intended to live as long as a request is being executed. A request
object is a "information binder" that binds various data objects together as long as the request is
running. As the Library uses asynchronous network I/O many requests can be handled "in parallel" but
each request object only knows about one request. Therefore a request object can be free whenever a
it has terminated which is the case when the request termination handler is called.
- Streams
- Streams are used as the main media for data transportation in the Library. Therefore they are
normally hooked together in what is called stream chains where data can carry the data from the user
to the network and vice verse. As any operation on a stream ripples through the whole chain of
streams you should not free any streams on your own. The free method is initiated in the Library,
for example when all data has been transferred from a remote server and the connection closed.
- Memory Cache Objects
- As mentioned in the section on memory cache handlers the application can handle all data objects
in memory as it likes. Therefore the app is also free to delete any object at any time but of course
this is a balance between memory and performance - it takes a lot longer to get a data
object from a remote server than to keep it in memory!
Henrik Frystyk,
December 1995