Paul DuBois
dubois@primate.wisc.edu
Wisconsin Regional Primate Research Center
Revision date: 18 May 1997
Yet another troff-related program, with yet another set of misassumptions about how troff interprets its input, and its own set of deficiencies and bugs.
This document describes troffcvt ("troff convert"),
a program which assists in the process of converting troff
documents to other formats. troffcvt doesn't do the full
job of translation itself; rather, it is a preprocessor that turns
troff files into an intermediate format with a syntax that
is easier to interpret than the raw troff input language.
troffcvt is intended as a front end that supplies input
to a postprocessor which finishes the translation to produce output
in the target format. Since the job of writing a translator for
a given target format then need not include writing a troff-parser,
the burden on the translator writer is reduced. In a sense, troffcvt
is simply another sort of ditroff, one that produces a
different output language than does ditroff.
troffcvt started out as a sed script for converting
troff to RTF (Rich Text Format), but it quickly became
evident that that wasn't going to be a very simple job to do correctly.
It seemed the effort would be better justified by writing a more
general tool that would be useful in contexts other than that of RTF
production.
The source distribution contains some example translation methods
(simple postprocessors) you can look at. A standard troffcvt
output reader is included in the distribution; it can be configured
for use with your own postprocessors.
troffcvt has a number of significant shortcomings. It doesn't
do very well with input that has been passed through tbl,
eqn or pic. (For input containing tables, you can
use tblcvt rather than tbl to get better results.)
troff constructs that involve determination of motion or
sizes sometimes are calculated inaccurately since troffcvt
knows nothing of font metrics. Conditional construct processing
is problematic also, as are position-dependent traps. troffcvt
has other limitations; if those just listed are insufficient to
dissuade you from using it as the basis for a translator, see
the document troffcvt -- Notes, Bugs, Deficiencies.
troff files consist of text to be formatted interspersed
with markup requests indicating how the formatting is to be done.
The language implemented by troff is essentially an inverted
programming language, where document text comprises the comments
and markup requests provide the program indicating how to format
the comments. This language isn't especially easy to parse, which
may be why there are few tools for translating troff documents
into other formats. Most tools that do exist seem to use pattern
match-based transformations, rather than making any attempt to
actually understand the troff language. The purpose of
troffcvt is to make it easier to write troff-to-XXX
translators, for arbitrary XXX, by doing the hard work of turning
troff input into something easier to interpret. This means
part of the job is already done for postprocessor writers, who
can then concentrate on producing output in the desired target
format rather than on figuring out how to understand troff
files.
For example, point size might be set by some disgusting sequence
like this:
.ds x *a
.ds y *b
.nr \(\*x\(\*x 12
.ds \(\*y\(\*y \\n(\(\*x\(\*x
.ps \*(\(\*y\(\*y

This is digested by troffcvt and appears in the output in somewhat
simpler form:
\point-size 12

Caveat: The quality of translators obviously depends on the quality
of troffcvt's preprocessing, which is suspect. Nor is the situation
improved by the fact that various versions of troff sometimes do
different things with identical input. This makes it difficult for
troffcvt to do the "correct" thing in all cases, especially for
input files that have been tailored to work with (i.e., around bugs
in) a particular version of troff.
troffcvt produces output that preserves information about
the structure of a document (e.g., margins, page length) and its
contents (the text it contains). The goal is not to lay
out text on pages. That is left to postprocessors, which are expected
to lay out document content by interpreting structure information.
Postprocessors may use the structure and/or content to varying
extents. For instance, a postprocessor that simply extracts the
text would ignore the structure information. A postprocessor that
produces a summary of the structure (e.g., page layout information)
would ignore the text. Most postprocessors will fall somewhere
between these extremes.
Inevitably, a certain amount of information is lost. Usually this
results from not knowing all the characteristics of the output
device. For instance, no font metric information is used, so it's
not possible to determine the position on the current page, or
even to know what the current page number is.
An example of troffcvt operation is shown below. (The default
resolution of 432 units/inch is assumed.)
Input                           Output

.ps 14                          \point-size 14
.vs 16                          \spacing 96
.ce                             \center
.ft B                           \font B
troffcvt\-a troff converter     troffcvt
.ft                             @minus
                                a troff converter
                                \break
                                \adjust-full
                                \font R
troffcvt produces a mixture of control and text lines.
Control lines correspond to document structure. They consist of
a backslash character \ followed by a control word and
possibly some parameters for the control word, e.g., \space,
\font R, \page-length 4752. Text lines correspond
to document content, and are either plain text written literally
to the output, or begin with a "@" to indicate special
characters (for instance, @bullet for the bullet character "*"
or @alpha for the Greek letter alpha).
None of the control or special-text keywords overlap, but it's
still convenient to use different leading characters \
and @ to make it easier for simple filter programs to distinguish
between them. For example, the following command strips control
lines from a file containing troffcvt output:
% sed -e "/^\\/d" filename

troffcvt output is rife with troff-isms, such as \need and
\embolden. Little effort was made to map these to more general
document layout concepts since it's not clear what gain, if any,
there would be in doing so.
There are two steps to turning troff files into some other
format: run them through troffcvt to produce output in the
intermediate format, then run that output through a postprocessor
that produces the target format.
Probably the easiest way to get some idea of the relationship
of troffcvt's input and output is to run some troff
files through it and look at what comes out. When troffcvt
runs, it reads one or more action files to configure itself, then
processes input files according to the information in the action
files. These are text files containing symbolic actions that specify
what happens when requests occur. Action files are also used to
define special characters and to set processing parameters.
troffcvt doesn't have built-in knowledge about any troff
request. Stated another way, unless troffcvt is told how
to implement a given troff request by means of some action
file, it ignores that request. It also knows about very few of
the characters that have special meaning (by design, since these
vary from one version of troff to another). All of this
stuff has to be specified in an action file. By default, troffcvt
reads the action file actions when it runs. You can also
name additional action files on the command line using the -a
option.
The format of action files is simple. Blank lines are ignored.
Lines beginning with a "#" character are also ignored,
so you can use them to include comments. Actions are specified
on a line consisting of a leading keyword to indicate the action
type (imm or req), followed by an action list of
zero or more actions. (An action line may be continued to the
next line by putting a backslash at the end of the line.) Action
lists can be executed immediately at the time the action file
is read, or they can be associated with a request, to be executed
whenever the request occurs in the input.
Immediate actions consist of the word imm followed by an
action list that is executed as soon as it has been read. The
first imm line below sets the point size to 10 points and
vertical spacing to 12 points, whereas the second sets the font
to roman:
imm point-size 10 spacing 12p
imm font R

Request actions are similar but specify a request name, a set of
actions for parsing the request's arguments, and a set of actions
for processing those arguments after they have been parsed:
req request-name parsing-actions eol post-parsing-actions ...

request-name is the name of the request (without the leading
period). The parsing-actions section specifies how to parse the
arguments expected by the request. If parsing-actions is empty, no
request arguments are expected (or are to be ignored). The eol
keyword is mandatory. It signifies the end of the parsing actions
and causes troffcvt to skip to the end of the request line. (If
this were not done, the remaining part of the request line would be
read as a separate line to be processed.) The post-parsing-actions
section specifies what should happen after the request arguments
have been parsed. Typically this involves interpreting the request
arguments. If the post-parsing-actions section is empty, nothing is
done with the request (the request is ignored).
troffcvt associates each action name with the number of
arguments that should follow the action when it occurs in action
lists. When an action is performed, any arguments specified in
the action list are passed to it. For instance, the .so
request can be described like this:
req so parse-filename eol push-file $1

The parse-filename action parses the line on which the request
occurs to find a filename. This filename becomes the value of
argument 1, which can be referred to later as $1. push-file pushes
the file named by $1 on the input stack. Since $1 refers to the
first argument parsed from the .so request, if the request is
".so junk", then "push-file $1" becomes "push-file junk", and junk
becomes the current input file.
The two req lines below show how the .ps and .ce
requests can be defined:
req ps parse-absrel-num x point-size eol point-size $1
req ce parse-num x eol center $1

The actions to take when a .ps request occurs are: parse a number,
which can be an absolute setting or relative to the current point
size; skip to the end of the request line; set the point size using
the previously parsed number. The actions for .ce are to parse a
number, skip to the end of the request line, and cause the next
"$1" input lines to be centered.
"Missing" arguments are passed as empty strings. A reference
to $n is passed to the action as the empty string if no
n-th argument was present on the input request line. Suppose
the .ds request is defined like this:
req ds parse-name parse-string-value y eol define-string $1 $2

Then if the following input line occurs, the parse-string-value
action will find no string on the line, and the define-string
action will define xx as the empty string:
.ds xx

The language implemented by troff is expressive (if somewhat
unwieldy), so a large number of actions seem to be necessary to
allow requests to be specified properly. Descriptions of all actions
are given in the troffcvt Action Reference document.
If you don't like the actions file supplied with the troffcvt
distribution, you can modify it as necessary for your own purposes.
Specifying troffcvt's behavior in terms of symbolic actions
rather than hardwiring them into the code allows a good deal of
flexibility, because troffcvt's initial state and response
to requests can be modified without changing troffcvt itself.
For example, different versions of troff often know about
different sets of special characters; building the list at runtime
allows different versions to be accommodated. The initial page
layout can also be specified this way: the initial values of the
processing parameters are the same as those given in the Ossanna
manual, but you can change them. Thus you can set up layouts for
letter size, legal, A4, etc.
This method of configuring troffcvt also means you can
experiment quite easily with troffcvt's response to particular
requests.
Request, macro, string, and register definitions consist of two
parts: a name, and the underlying object to which the name points.
troffcvt allows groff-style aliases to be created,
such that referring to an alias name is the same as referring
to the original name. Aliases are implemented by creating multiple
names that all point to the same underlying object. The object
structure contains a reference count indicating how many names
point to it.
For example, when a macro is defined, a name is allocated and
pointed at a macro object structure that holds the macro contents
(the body of the macro). The reference count in the object structure
is set to one. If an alias to the macro is created, a new name
is created and made to point to the same object structure as the
original name. The reference count in the object structure is
set to two. When a request, macro, string, or register is removed,
the name is deallocated and the reference count is decremented.
If the count goes to zero, the underlying object is no longer
needed (no other names point to it), and the object structure
is deallocated as well.
The reference count also includes the number of times an object
is currently in use. When a request or macro is invoked or a string
reference occurs, the reference count of the underlying object
is incremented. When the request or macro terminates, or the end
of the string is reached, the count is decremented. This use of
the reference count keeps an underlying object from being deallocated
while it is still in use, even if its name is removed in the
meantime, as the following examples illustrate. Consider first a
macro that removes itself:

.de xx
.rm xx
..

When the macro is defined initially, the name xx is created and
made to point at a macro object, which is given a reference count
of one. Invoking the macro bumps the count to two; the .rm request
removes the name xx and drops the count back to one; and when the
macro finishes executing, the count drops to zero and the macro
object is deallocated. Now compare this with a macro that creates
an alias for itself before removing itself:
.de xx
.als yy xx
.rm xx
..

The reference count is set to one when the macro is defined, two
when the macro begins executing, three when the alias is created,
two when xx removes itself, and one when xx terminates. In this
case, however, since the reference count is still one when xx
terminates (the name yy still points to the underlying macro
object), the object is not deallocated.
Aliases provide a convenient way to implement the .rn request.
The new name is created as an alias for the existing name, and
thus points to the same underlying object. The old name is then
removed, but since the underlying object is now pointed to by
another name, it persists as it should.
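The bookkeeping just described amounts to a small reference-counting
scheme. Here is a minimal sketch in C of how it might look; this is
not the actual troffcvt source, and the structure and function names
(Object, Name, DefineMacro(), Alias(), RemoveName()) are illustrative
assumptions:

#include <stdlib.h>
#include <string.h>

/* Hypothetical structures; the real troffcvt code differs in detail. */
typedef struct Object {
    int   refCount;    /* names pointing here, plus active uses */
    char  *body;       /* macro or string contents, register value, etc. */
} Object;

typedef struct Name {
    char   *name;
    Object *obj;
} Name;

/* Define a new macro: one name, one object, reference count of one. */
Name *DefineMacro(const char *name, const char *body)
{
    Object *obj = malloc(sizeof(Object));
    Name   *nm  = malloc(sizeof(Name));

    obj->refCount = 1;
    obj->body = strdup(body);
    nm->name = strdup(name);
    nm->obj = obj;
    return nm;
}

/* Create an alias: a second name pointing at the same object. */
Name *Alias(const char *aliasName, Name *existing)
{
    Name *nm = malloc(sizeof(Name));

    nm->name = strdup(aliasName);
    nm->obj = existing->obj;
    nm->obj->refCount++;
    return nm;
}

/* Remove a name; deallocate the underlying object only when no other
 * names (or active invocations) still refer to it. */
void RemoveName(Name *nm)
{
    if (--nm->obj->refCount == 0) {
        free(nm->obj->body);
        free(nm->obj);
    }
    free(nm->name);
    free(nm);
}

Under this scheme, .rn falls out as described above: create the new
name with the alias operation, remove the old name, and the object
survives because its reference count never reaches zero.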
troff is commonly invoked with some sort of -mxx
flag (e.g., -man, -me, -mm, -ms),
so these need to be handled by troffcvt as well. There
are several ways of handling a macro package, some better than
others:
One way is to map each macro onto an \other control line that simply
announces its occurrence:

req LP eol output-control "\other para"
req IP eol output-control "\other indented-para"

This will cause \other para or \other indented-para to be written
to the output when instances of .LP or .IP occur in the input. A
postprocessor can recognize \other para and \other indented-para
and do something sensible with them.
Another way is to redefine individual macros in terms of simpler
requests, using push-string to supply a replacement body. Here the
-ms .AB macro is redefined:

req AB parse-macro-args eol \
    push-string ".br\n.ce\n\\fIABSTRACT\\fR\n.sp\n"

You need to understand something about how the macros are supposed
to work for this approach to be fruitful.
A third way is to let troffcvt read the macro package itself and
see how well it does:

% troffcvt -mxx myfile

This will tell you which macros troffcvt handles okay and which it
botches. With that information in hand, you can construct an action
file tc.mxx containing redefinitions for those macros that troffcvt
needs help with. Try out the action file like this:
% troffcvt -mxx -a tc.mxx myfile

By experimenting with tc.mxx, you can improve troffcvt's handling
of any document that uses the -mxx macro package.
Some of the examples shown above demonstrate how to redefine macros,
but do so by defining them using req lines. Thus, these
"macros" are actually treated by troffcvt as
requests. Before you redefine a macro as a request, be sure you
understand the following points:
For instance, if you want a name to be treated as a true macro
rather than as a request, you can push an actual macro definition
onto the input stream:

imm push-string ".de xx\n.tm this is macro xx\n..\n"

If you provide redefinitions that might get used in concert with
macro packages written for groff, here's something to watch out
for: before redefining a name for which a definition may have
already been read from the macro package file, it's prudent to
remove the name first, like this:
imm remove-name XX
req XX definition...

This is due to the way that groff implements macro packages.
Consider the -ms macros. These are supposed to be used such that
.TL, .AU, .AI, and .AB occur in order if they are used. To make
sure they aren't invoked out of order, the groff -ms definitions
initially create .AU, .AI, and .AB as aliases to a macro that
checks whether or not .TL has been invoked. When .TL is invoked,
it redefines the other macros appropriately with their "real"
definitions. Now, suppose that you handle -ms by reading the macro
package file and then redefining in an action file some of the
macros, such as .AI and .AB. If you simply provide a new definition
of .AI, what happens is that you also redefine all other names that
are aliased along with .AI. In other words, you also redefine .AU
and .AB! If you then redefine .AB, you also redefine .AU and .AI.
Removing a name before giving it a new definition avoids this
problem.
Suppose you normally format a document mydoc using a command
something like this:
% troff -ms mydoc

If you use .so mymacros in mydoc to read a file of macro
definitions, you may have a problem if you want to process mydoc
with troffcvt. In particular, if you want to redefine any of the
macros in mymacros for troffcvt's benefit, you won't be able to use
an action file to do so: the .so request in mydoc is processed
after the action files, so the definitions in mymacros take
precedence over your redefinitions. One solution is to remove the
.so request from mydoc and name mymacros on the command line
instead:

% troff -ms mymacros mydoc
% troffcvt -ms mymacros -a tc.mymacros mydoc

This will cause troffcvt to read, in order, the -ms macros, the
standard definitions of the macros in mymacros, the redefinitions
in the action file tc.mymacros, and mydoc.
Another solution is to leave the .so request in place, but bracket
each definition in mymacros so that it is skipped if the macro has
already been defined:

.if d xx .ig end_ignore
...macro definition here...
.end_ignore

Then you can format the document with troffcvt like this:
% troffcvt -ms -a tc.mymacros mydoc

When tc.mymacros is processed, it defines some or all of the macros
used in mymacros. When mydoc is read and the .so mymacros request
is processed, only those macros that were not already defined in
tc.mymacros will be defined.
Similar considerations apply if you define macros directly in
your troff source file. You won't be able to override them
in an action file because the definition in the troff source
file occurs later and will take precedence. To work around this,
put the macro definitions in a separate file and use the first
strategy described above, or use .if d as in the
second strategy.
ChIn() returns values of type XChar, which is
typedef'ed as an unsigned integer type. The return value
falls into the following ranges: ordinary (ASCII and 8-bit)
characters have values from 0 to 255, escape codes and
special-character codes have values of 512 and up, and the special
value endOfInput indicates that no more input is available.
.if \(xxstring\(xxstring\(xx stuff ...

A warning is written to stderr when characters are created this
way, so you can tell that a special-character definition is missing
from the action file (or that the input file contains an erroneous
special-character reference).
UnChIn() takes an XChar argument, which is usually
a value returned from ChIn(). UnChIn() pushes the
argument onto the input pushback stack, unpacking escape and special-character
codes into their original multiple-character input sequences.
Unpacking is done to prevent problems. Suppose an escaped or special
character is first seen in non-copy mode, then pushed back and
reread in copy mode. If the escape code or special-character code
itself were pushed back, the character wouldn't be reread in copy
mode properly.
Values for plain ASCII and 8-bit characters can be represented
in a single byte (as an unsigned character), but escape codes
and special-character codes cannot, since they begin at 512. This
is why the XChar type is wider than a single byte.
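The sketch below shows one way the coding just described could be
expressed in C. It is not the actual troffcvt declaration; the text
states only that XChar is an unsigned integer type and that escape
and special-character codes begin at 512, so the constant and
function names here are assumptions:

typedef unsigned int XChar;

#define FIRST_CODE 512u            /* escape/special codes start here */

/* Ordinary (ASCII and 8-bit) characters fit in a single byte. */
static int IsOrdinaryChar(XChar c)
{
    return c < 256u;
}

/* Escape codes and special-character codes cannot be written out as
 * raw bytes; they must be mapped back to sequences such as \(xx. */
static int IsCode(XChar c)
{
    return c >= FIRST_CODE;
}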
Special characters are disallowed in request arguments and escape
sequences that might be written back out directly. For instance,
.ft F is written out as \font F, so
F isn't allowed to contain special characters. A similar
restriction applies to diversion names.
Special-character names must consist entirely of printable ASCII
characters. They are not allowed to be composed of other special
characters, e.g., \(\(ts\(ts is disallowed.
Input may come from a file, a macro, a named string (created with
the define-string action, usually in response to a .ds
request), or an anonymous string (defined below under the description
of the AChIn() function). The bulk of input usually comes
from input files named on the command line, which are processed
in sequence. Inputs sources may be nested (e.g., a macro or string
may be referenced while reading a file). The current input is
suspended when another input source is interpolated into the input
stream, and is resumed when the interpolated source is exhausted.
ChIn() returns the next input character from the input
stream. Embedded newlines (introduced with a backslash character
\ at the end of a line) are deleted so that the following
input line appears contiguous with the current line to any higher-level
routines. Comments (introduced with \") are deleted up to
(but not including) the end of line character. For instance, this
makes:
text followed by comment\" this is the comment

appear to be:
text followed by comment

The handling of lines that begin with .\" happens properly; the
comment stripping makes the line look like a line beginning with a
control character but no request, so it is ignored. ChIn() also
manages encoding of escaped characters, and pushing of input sources
for number register, string, or macro argument references. Handling
of escape sequences differs depending on whether copy mode is in
effect or not.
Input characters accepted by the file-input routine are non-null
ASCII values (null bytes and bytes with bit 8 on are discarded).
Escaped characters (\x) and special-character references
(\(xx or \[xxx]) are converted
to escape codes and special-character codes as described above
under "Character Coding."
Input source pushing occurs automatically in ChIn() when
\n, \* or \$ occur (and also \w if
not in copy mode): the input source switches to a string representing
the value of the number register or string, the macro argument,
or the result of the width calculation. Higher level routines
also can cause the current input source to be pushed down, e.g.,
when a .so request occurs.
ChIn() is also used for the ugly task of processing multi-line
conditional input (bracketed with the \{ and \}
sequences). The conditional request processor saves the current
if-level when it sees a \{, bumps it up one, then processes
lines until the level drops back down to the saved value. ChIn()
notices \}, silently deletes it and decrements the if-level,
which is then noticed by the conditional processor. Pretty horrid.
UnChIn() is used to push characters back onto the input
stream. It understands how to push back escape codes and special-character
codes properly. It also understands how to push back multiple
characters (characters must be pushed in the reverse order from
that in which they were read).
ChIn0() returns the next raw (uninterpreted) character
from the input stream. If there are any pushed back characters
waiting to be reread, it returns the one most recently pushed.
Otherwise, it returns the next input character from the current
input source. If that source is exhausted, it resumes reading
from the previous source. When there is no more input, it returns
endOfInput. Input source unwinding is undetectable at
any level above ChIn0(), including ChIn().
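Here is a rough sketch in C of the logic ChIn0() is described as
implementing, with pushed-back characters handled ahead of the input
source stack. Apart from the names ChIn0() and endOfInput, the data
structures and names are assumptions, not the real troffcvt code:

#include <stddef.h>

typedef unsigned int XChar;
#define endOfInput ((XChar) ~0u)         /* hypothetical sentinel value */

/* One entry on the input source stack: a file, a macro or named
 * string, or an anonymous string, each with its own low-level read
 * function (FChIn(), MChIn(), or AChIn()). */
typedef struct Source {
    XChar         (*read)(struct Source *);
    struct Source *prev;                 /* source to resume when done */
} Source;

static Source *curSource = NULL;         /* top of the input source stack */

static XChar pushback[256];              /* pushed-back characters */
static int   nPushed = 0;

/* UnChIn() ultimately lands characters here, after unpacking escape
 * and special-character codes into their original input sequences;
 * since characters are pushed in reverse order, a stack returns them
 * in the order they were originally read. */
void PushBack(XChar c)
{
    pushback[nPushed++] = c;             /* overflow check omitted */
}

/* ChIn0(): pushed-back characters first, then the current source.
 * When a source is exhausted, silently resume the one beneath it;
 * the unwinding is invisible to callers. */
XChar ChIn0(void)
{
    if (nPushed > 0)
        return pushback[--nPushed];

    while (curSource != NULL) {
        XChar c = curSource->read(curSource);
        if (c != endOfInput)
            return c;
        curSource = curSource->prev;     /* unwind to the previous source */
    }
    return endOfInput;                   /* no input left anywhere */
}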
FChIn(), MChIn(), and AChIn() are the lowest
level input routines; they're called by ChIn0(). These
return a single character from a file, macro or named string,
or "anonymous" string input source. Each returns endOfInput
when the source is exhausted (which only means the current source
is done, not necessarily that all sources are done). EOF
is not returned because that is typically -1 (negative), and the
input routines return a value of type XChar, which is
unsigned.
FChIn() discards nulls. (It also converts CR or CRLF to
LF; this has nothing to do with troff, but allows text
files from MS-DOS or Macintosh machines to be read without requiring
you to convert line endings first.)
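A small sketch of the kind of filtering FChIn() is described as
doing follows. The stdio-based interface and the sentinel value are
assumptions; only the null-discarding and CR/CRLF conversion come
from the description above:

#include <stdio.h>

typedef unsigned int XChar;
#define endOfInput ((XChar) ~0u)         /* hypothetical sentinel value */

/* Read one character from a file source, discarding null bytes and
 * converting CR or CRLF line endings to LF so that MS-DOS and
 * Macintosh text files can be read directly. */
XChar FChIn(FILE *fp)
{
    int c;

    while ((c = getc(fp)) == '\0')
        ;                                /* discard null bytes */
    if (c == EOF)
        return endOfInput;
    if (c == '\r') {                     /* CR or CRLF becomes LF */
        int next = getc(fp);
        if (next != '\n' && next != EOF)
            ungetc(next, fp);
        return (XChar) '\n';
    }
    return (XChar) c;
}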
MChIn() reads the next character from a macro or string
definition. (Strings are implemented internally as macros without
arguments.)
AChIn() reads the next character from an anonymous string,
which is just some arbitrary string that is to be used as an input
source. For instance, when a number register reference (\n)
or width expression (\w) occur, the resulting value is
converted to a character string, which becomes the current input
until the string is completely read. References to macro arguments
are treated similarly; the argument value is retrieved and pushed
on the input stack. Another source of anonymous strings is the
push-string action, which can be used in action files to
push an arbitrary string onto the input stream. This is convenient
for processing certain requests. For instance, if you want to
redefine a macro, you can define the action for that macro to
be one that pushes alternative input. Here's an example that shows
how the .AB macro from the -ms macro package might
be redefined:
req AB parse-macro-args eol \
    push-string ".br\n.ce\n\\fIABSTRACT\\fR\n.sp\n"

One sticky problem occurs with the .nx request, usually processed
with the switch-file action. When .nx occurs, it might happen while
other files or macros are active. If the current input source is a
file there is no problem since the file pointer for that source is
simply switched to the new file. But if the request occurs in the
middle of a macro, it's less clear what should happen. Should the
macro continue to be processed? I elect to terminate macro sources
and unwind the source stack until a file is found, then switch the
file pointer of the file source. Possibly this is wrong; the troff
manual is ambiguous on this point. (Which may be why different
versions of troff behave differently in this situation.)
For handling the .ex request, the end-input action
is used; it sets a flag causing ChIn() to return endOfInput
forever after.
At the lowest output level, there are two calls. One is for writing
characters and it simply writes to the output file and dies if
there was an error. The other is for writing strings; it calls
the write-character routine for each character in the string.
The next level up manages the mechanics of collecting plain text
lines and interspersing them with special text and control lines.
The basic issues are insertion of spaces between successive output
text lines and making sure that special text and control lines
don't get written into the middle of a plain text line.
Control lines begin with a backslash character \. Any
plain text output line being collected is flushed so the control
string doesn't appear on the same line.
There are two kinds of text output: plain text lines, and special
text lines that indicate special characters (e.g., @backslash
for the \ character). Whenever text output (either kind)
is written, a check is made to see whether it's necessary to write
a preceding space first. A space is usually needed between consecutive
input text lines (exceptions are when centering or no-fill are
in effect, or if an input line ends with a \c). For special
text, any plain text output line being collected is flushed first,
so that the special text doesn't end up in the middle of it.
The output character set for text is such that most printable
ASCII characters appear as themselves, and others are written
out as special text lines. The characters tab, backspace, \,
and @ are printable but written as specials @tab,
@backspace, @backslash, and @at. The leader
character SOH is written as @leader.
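The sketch below illustrates this output discipline: plain text is
collected into a line, and a special text line forces any partial
plain text line out first. The buffering scheme and function names
are assumptions; only the @-names and the flushing behavior come
from the description above (the logic that inserts spaces between
consecutive input text lines is omitted):

#include <stdio.h>

static char textBuf[1024];               /* plain text line being collected */
static int  textLen = 0;

/* Flush any partially collected plain text line so that a control
 * line or special text line doesn't land in the middle of it. */
static void FlushText(void)
{
    if (textLen > 0) {
        fwrite(textBuf, 1, (size_t) textLen, stdout);
        putchar('\n');
        textLen = 0;
    }
}

/* Write one character of text output. Most printable ASCII passes
 * through as plain text; tab, backspace, backslash, "@", and the
 * leader character (SOH) are written as special text lines. */
void PutTextChar(int c)
{
    const char *special = NULL;

    switch (c) {
    case '\t':   special = "@tab";       break;
    case '\b':   special = "@backspace"; break;
    case '\\':   special = "@backslash"; break;
    case '@':    special = "@at";        break;
    case '\001': special = "@leader";    break;   /* SOH */
    }
    if (special != NULL) {
        FlushText();
        puts(special);                   /* special text on its own line */
    } else if (textLen < (int) sizeof(textBuf) - 1) {
        textBuf[textLen++] = (char) c;   /* overflow handling omitted */
    }
}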
troffcvt maintains a notion of input level. The level is
incremented each time a new input source begins and decremented
when the current source ends. A file interpolated with .so
is an input source, but so is a macro, a macro argument, a string,
or a number register. This helps avoid the problem of interpreting
something like this:
.if '\*[xx]'y' ...

when the string xx contains an apostrophe. troffcvt uses the input
level in such a way that troff constructs bounded by delimiters do
not consider the closing delimiter to be found unless it occurs at
the same input level as the opening delimiter. (If you simply look
at characters as they occur, then the apostrophe in the string
prematurely terminates the scan for the first of the strings to be
compared, and throws off the comparison.) The affected constructs
include:
.if 'x'y'
.tl 'left'center'right'
\b'abc...'
\h'N'
\l'Nc'
\L'Nc'
\o'abc...'
\v'N'
\w'string'

The input level also affects parsing of macro arguments that begin
with a double quote. Only a quote at the same input level as the
opening quote terminates the argument.
The behavior just described mimics how groff treats its
input, not how standard troff treats its input. In compatibility
mode, however, groff ignores the input level (and thus acts like
standard troff), and troffcvt does the
same. (Parsing routines that need to check the input level call
the ILevel() function. In compatibility mode this function
always returns zero, making all input appear to be at the same
level.)
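In other words, ILevel() might look something like the following
sketch; the variable names are assumptions:

/* Current nesting depth of input sources, and the compatibility-mode
 * flag; both names are illustrative. */
static int inputLevel = 0;
static int compatMode = 0;

int ILevel(void)
{
    /* In compatibility mode, all input appears to be at the same
     * (zero) level, mimicking standard troff. */
    return compatMode ? 0 : inputLevel;
}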
groff produces a quoted argument list when \$@ occurs
in the input. The groff documentation says that it processes
the list such that the quotes surrounding an argument appear at
the same input level, whereas the argument itself is processed
at a higher level. (This prevents the problems that would occur
if an argument contained a quote.) I take this to mean that the
quotes surrounding the arguments are at a level higher than the
context in which the \$@ occurs, and the arguments one
level higher than that, in case something like the following occurs
in a macro:
.xx "\\$@"

If the quotes produced by \$@ here were treated as being at the
same level as quotes in the surrounding text, the arguments to .xx
could be messed up.
troffcvt handles \$@ by constructing a string consisting
of a list of argument references that looks like this:

"\\$1" "\\$2" ... "\\$n"

Then the string is pushed on the input stack. This causes the quotes
to be processed a level higher than the surrounding text. When each
argument reference in the string is encountered, the value of the
argument is pushed on the stack, causing the reference to be
processed another level higher.
troffcvt handles \$* in a manner similar to \$@
except that no quotes are added to the string containing the list
of argument references.
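As a sketch, the argument-reference string might be built as shown
below. The function name and buffer handling are assumptions; the
form of the generated items (quoted for \$@, unquoted for \$*, with
the backslash escaping shown in the example above) is all that comes
from the description:

#include <stdio.h>

/* Build the string that gets pushed as an anonymous input source
 * when \$@ or \$* occurs.  Assumes buf is large enough. */
void BuildArgList(char *buf, size_t bufSize, int argCount, int quoted)
{
    size_t len = 0;
    int    i;

    buf[0] = '\0';
    for (i = 1; i <= argCount; i++) {
        /* "\\$1" "\\$2" ... for \$@;  \\$1 \\$2 ... for \$* */
        len += snprintf(buf + len, bufSize - len,
                        quoted ? "%s\"\\\\$%d\"" : "%s\\\\$%d",
                        i > 1 ? " " : "", i);
    }
    /* The caller pushes buf onto the input stack; the quotes are then
     * processed one level above the surrounding text, and each \$n
     * value another level higher still when the reference is reread. */
}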
Macro arguments consist of strings of non-white characters. Arguments
may be quoted to allow whitespace to be included. An argument
that begins with a double quote is parsed in quote mode until
a closing quote, and the leading and terminating quotes are stripped
off.
Double quotes in macro arguments are handled as follows:
Neither groff nor troffcvt has this problem, since
quotes in arguments occur at a higher level than the surrounding
text. (In compatibility mode, troffcvt uses the quote-stripping
behavior of standard troff.)