tc2html Notes
Paul DuBois
dubois@primate.wisc.edu
Wisconsin Regional Primate Research Center
Revision date: 9 March 1997
Introduction
tc2html is a postprocessor for converting troffcvt output to HTML. It's
used by the troff2html front end. This document describes how tc2html
works and some of the design issues involved in writing it.
In general, the goal of tc2html is that you should get reasonable HTML
output with no need for special treatment of the troff input file. The
most important thing is that you use a standard macro package. However,
there are some additional principles you can follow that will improve
the quality of the HTML that tc2html generates. For example, it's
possible to embed hypertext links in your troff source with a little
prior planning. Techniques for such things are discussed in the section
"Generating Better HTML." If you're not interested in implementation
details, you can skip directly to that section.
Output Format
tc2html reads output from troffcvt and produces an HTML document that
has the following general form:
or
markers.
However, if your document is marked up using macros from a macro package
such as -ms or -man, it's possible to get output from troffcvt that's
much more suitable for tc2html. The trick is to map troff requests to
HTML structure markers, rather than trying to guess the structure from
the low-level troffcvt output that normally results from those requests.
This is accomplished using the following strategy:
*
Extend the troffcvt output language by defining an \html control that
provides information to tc2html about structural elements within the
troffcvt output. For example, \html para indicates the beginning of a
paragraph.
*
Provide (in a troffcvt action file) a set of HTML-specific macros that
generate the appropriate \html controls for the various structural
elements. For example, .H*para generates \html para.
*
For the important structure-related macros in your macro package,
redefine them (in a troffcvt action file) so they're expressed in
terms of the HTML-specific macros. (It's possible, of course, to
redefine the macros from the macro package so they generate the \html
controls themselves. But having the \html controls available through a
set of macros allows the macros to be invoked directly in your
document. This is important for some HTML constructs that have no
troff analog, such as hyperlinks.)
Note that "extending" the troffcvt output language to include the \html
control is done using request definitions in an action file.
Source-level changes to troffcvt itself are not needed.
The effect of the strategy outlined above is to remap the macros in your
macro package from their usual actions onto actions that produce
document structure information that tc2html can recognize. For this to
work well, all the important structure-related macros in a macro package
must be redefined, so the redefinition files used for tc2html tend to be
more extensive than those used for other postprocessors. This is really
the source of most of the work involved in getting tc2html to function
well. Once a set of redefinitions is written for a given macro package,
translation from troff to HTML is a straightforward process that usually
generates fairly reasonable HTML.
Here's an example of how the strategy described above works in practice.
The .LP macro in the -ms macro package means "begin paragraph." But .LP
typically is implemented by executing several other requests (restore
font, margins, adjustment, spacing, point size, etc.), and the troffcvt
output you'd get by processing those requests really contains nothing
that specifically indicates a paragraph. To work around this, we use the
fact that tc2html interprets \html para as indicating a paragraph
beginning, and define a macro to generate that control:
req H*para eol output-control "html para"
Then we can redefine the .LP macro in terms of the .H*para macro:
req LP eol \
break center 0 fill adjust b font R \
push-string ".H*para\n"
The break, center, fill, adjust, and font actions cause troffcvt to adjust its
internal state to match the effect that the .LP macro normally has. The
call to .H*para results in \html para in the output, so that tc2html can
recognize the paragraph beginning.
The \html markers that tc2html recognizes are shown below:
\html title Begin document title
\html header N Begin level N header
\html header-end End header (any level)
\html para Begin paragraph
\html blockquote Begin block quote
\html blockquote-end End block quote
\html list Begin list
\html list-end End list
\html list-item Begin list item
\html display Begin display (preformatted text)
\html display-end End display
\html display-indent N Set display indent to N spaces
\html definition-term Begin definition list term
\html definition-desc Begin definition list description
\html shift-right Shift left margin right
\html shift-left Shift left margin left
\html anchor-href URL Begin HREF anchor for link to URL
\html anchor-name LABEL Begin NAME anchor with label LABEL
\html anchor-toc N Begin NAME anchor for level N TOC entry
\html anchor-end End anchor (any kind)
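As a rough illustration of how a postprocessor might recognize the controls listed above, here is a minimal Python sketch. It is not tc2html's actual code; the function name parse_html_control and the KNOWN_CONTROLS set are hypothetical, and the sketch assumes each control occupies a line of its own, with an optional argument following the control name.

```python
# Control names taken from the table above. The parsing approach itself
# is an assumption made for illustration, not tc2html's implementation.
KNOWN_CONTROLS = {
    "title", "header", "header-end", "para",
    "blockquote", "blockquote-end",
    "list", "list-end", "list-item",
    "display", "display-end", "display-indent",
    "definition-term", "definition-desc",
    "shift-right", "shift-left",
    "anchor-href", "anchor-name", "anchor-toc", "anchor-end",
}


def parse_html_control(line):
    """Return (name, argument) for a recognized html control line.

    Returns None for ordinary text lines, for other controls such as
    a space or break control, and for unknown html control names.
    """
    parts = line.strip().split(None, 2)
    if len(parts) < 2 or parts[0] != "\\html":
        return None
    name = parts[1]
    if name not in KNOWN_CONTROLS:
        return None
    # The argument (N, URL, or LABEL) is whatever follows the name.
    argument = parts[2] if len(parts) > 2 else None
    return name, argument
```

A caller could loop over troffcvt output lines, dispatch on the returned name, and treat a None result as text or as some other kind of control.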
The troff-level macros used to generate the \html controls are shown
below. These macros are defined in the action file actions-html:
.H*title Begin document title
.H*header N Begin level N header
.H*header*end End header (any level)
.H*para Begin paragraph
.H*bq Begin block quote
.H*bq*end End block quote
.H*list Begin list
.H*list*end End list
.H*list*item Begin list item
.H*disp Begin display (preformatted text)
.H*disp*end End display
.H*disp*indent N Set display indent to N spaces
.H*dterm Begin definition list term
.H*ddesc Begin definition list description
.H*shift*right Shift left margin right
.H*shift*left Shift left margin left
.H*ahref URL Begin HREF anchor for link to URL
.H*aname LABEL Begin NAME anchor with label LABEL
.H*atoc N Begin NAME anchor for level N TOC entry
.H*aend End anchor (any kind)
Note that since these names are longer than two characters, they cannot
be used in compatibility mode.
Invoking tc2html
The \html controls are defined in a file actions-html that you can
access on the troffcvt command line using -a actions-html. If you use a
macro package -mxx, you specify it on the command line, along with the
general and HTML-specific troffcvt redefinitions for that macro package;
these are in the action files tc.mxx and tc.mxx-html. Thus, to translate
a file that you'd normally process using -ms, the command would look
like this:
% troffcvt -a actions-html -ms -a tc.ms -a tc.ms-html myfile.ms \
| tc2html > myfile.html
That's pretty ugly, of course; it's better to use a wrapper script like
troff2html that supplies the necessary options for you:
% troff2html -ms myfile.ms > myfile.html
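To make the role of such a wrapper concrete, here is a minimal Python sketch that assembles the troffcvt command line and pipes it into tc2html. It is not the actual troff2html script; the function names build_troffcvt_command and convert are hypothetical, and the sketch assumes that troffcvt and tc2html are on the search path and that the action files follow the tc.mxx and tc.mxx-html naming pattern described above. It leaves out any table-of-contents postprocessing.

```python
import subprocess


def build_troffcvt_command(macro_flag, input_file):
    """Build the troffcvt argument list for a macro package flag like -ms."""
    if not macro_flag.startswith("-m") or len(macro_flag) < 3:
        raise ValueError("expected a macro package flag such as -ms or -man")
    # "-ms" names the action files tc.ms and tc.ms-html.
    name = macro_flag[1:]
    return [
        "troffcvt",
        "-a", "actions-html",
        macro_flag,
        "-a", "tc." + name,
        "-a", "tc." + name + "-html",
        input_file,
    ]


def convert(macro_flag, input_file, output_file):
    """Run troffcvt | tc2html, writing the HTML to output_file."""
    with open(output_file, "wb") as out:
        troffcvt = subprocess.Popen(
            build_troffcvt_command(macro_flag, input_file),
            stdout=subprocess.PIPE,
        )
        tc2html = subprocess.Popen(
            ["tc2html"], stdin=troffcvt.stdout, stdout=out
        )
        # Close our copy of the pipe so troffcvt sees a broken pipe
        # if tc2html exits early.
        troffcvt.stdout.close()
        tc2html.wait()
        troffcvt.wait()
```

Calling convert("-ms", "myfile.ms", "myfile.html") would then correspond to the longer pipeline shown earlier.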
Implementation of Various HTML Constructs
This section provides some specifics on how several troff concepts are
turned into HTML elements. It should be considered illustrative rather
than exhaustive.
Document Titles
Title macros are implemented in terms of .H*title, which generates an
\html title control. When tc2html sees this control, it goes into
document HEAD collection mode. If the document contains a title, the
\html title line must be the first \html control that tc2html sees.
Should any other \html control or document text occur first, tc2html
assumes no title is present. Any leading document whitespace (\space or
\break lines) occurring prior to the title is skipped.
The title is terminated by the next \html line with a structural marker,
such as \html para. The title text is used to produce the TITLE in the
document HEAD part and the initial header in the document BODY part.
\space and \break lines within the title do not terminate title text
collection; instead, they are turned into spaces in the title and into
<P> and <BR> in the initial header. Consider the following troff input
(using -ms macros):
.TL
My
.sp
Title
.LP
This is a line.
This is converted by troffcvt into the following:
\html title
My
\space
Title
\break
\html para
This is a line.
The output from troffcvt is converted in turn by tc2html into this HTML:
Title
This is a line.
-T title may be specified on the tc2html or troff2html command line to
specify a title explicitly. It overrides the title in the document if
there is one.
Standard Paragraphs
The "standard" paragraph is a paragraph with the first line flush left.
There is no mechanism for writing paragraphs with an indented first
line; they're treated simply as standard paragraphs.
The standard paragraph is implemented in terms of .H*para, which
generates an \html para control. This is turned by tc2html into <P>. In
the document BODY part, \space is also interpreted as a paragraph
marker, but during document title collection, \space is treated as
described above under "Document Titles."
Indented Paragraphs
Indented paragraphs (with or without a hanging tag) are implemented
using definition lists (
Para 3
Right and Left Shifts
In troff, the left margin can be shifted right and left, e.g., as is
done with the -ms and -man packages using .RS and .RE. HTML has no good
way of shifting the margin, so shifts are performed using
...). Tabstops are respected within displays, although they must be
approximated since character widths are unknown. tc2html assumes 10
characters/inch for determining the width of tabstops.
Display macros are implemented in terms of .H*disp and .H*disp*end.
Preformatted text in HTML has no additional indent relative to the left
margin, but troff displays often are indented a bit. To handle this,
.H*disp*indent N can be used to set the display indent to N spaces.
.H*disp, .H*disp*end, and .H*disp*indent generate \html display,
\html display-end, and \html display-indent controls. The first two of
these are converted by tc2html into <PRE> and </PRE>.
\html display-indent generates no output itself, but causes tc2html to
add spaces to the beginning of each line of a display.
Centered and right-justified displays are not implemented. They're
treated as regular displays.
Tables
If your input document has tables written in the tbl language,
preprocess the document with tblcvt rather than with tbl. Your output
will look better that way.
Table cell borders are hard to do well. In tbl you can put a border on
any cell boundary, but in HTML a table has either no borders or borders
around every cell. Currently, tc2html puts borders around every cell.
Font Handling
Fonts are handled in tc2html by means of a table that associates four
tags with each font name. The first two tags are used to turn the font
on and off in normal text. The second two tags are used to turn the font
on and off in displays. This table is read at runtime from the
html-fonts file. Here's an example of what the file might look like:
R "" "" "" ""
I
B
BI
C "" ""
CW "" ""
CI
CB
CBI
The difference between the tags for regular text and display text is
that, since browsers implicitly switch the font to monospaced font in
displays, the only thing that can be done for font changes there is to
change the style attributes.
The initial font when tc2html begins is R (roman). When a font change
occurs, the previous font is terminated by writing its end tag, and then
the new font's begin tag is written. Using the font table just shown,
this input:
\font R
abc
\font I
def
\font CW
ghi
\font R
jkl
becomes this output:
abcdefghijkl
Tabs
Tabs are ignored except in displays. Adding extra space to tab over has
no effect in regular paragraphs anyway, because browsers typically
collapse runs of spaces.
Right-justified and centered tabs are treated as left-justified tabs.
That is, they're completely botched.
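The treatment of tabs inside displays can be sketched roughly as follows. This is an illustration in Python, not tc2html's code: the function name expand_tabs is hypothetical, tab stop positions are assumed to be given in inches from the left margin, and the handling of a tab beyond the last stop (a single space) is an assumption. The 10 characters/inch figure and the display indent are taken from the text above.

```python
CHARS_PER_INCH = 10  # character width tc2html assumes for tab stops


def expand_tabs(line, tabstops_inches, indent=0):
    """Replace tabs with spaces so text lands on approximated tab stops.

    tabstops_inches lists tab stop positions in inches; indent is the
    display indent in spaces, added at the start of the line.
    """
    # Convert tab stop positions from inches to character columns.
    stops = sorted(round(t * CHARS_PER_INCH) for t in tabstops_inches)
    out = []
    col = 0
    for ch in line:
        if ch != "\t":
            out.append(ch)
            col += 1
            continue
        # Move to the first stop strictly beyond the current column.
        target = next((s for s in stops if s > col), None)
        if target is None:
            # Assumption: past the last stop, a tab becomes one space.
            target = col + 1
        out.append(" " * (target - col))
        col = target
    return " " * indent + "".join(out)
```

Every tab is treated as left-justified here, which matches the limitation noted above for right-justified and centered tabs.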
Generating Better HTML
This section describes how you can embed hypertext links in your troff
source and how to produce a table of contents containing clickable links
to the main sections of your document.
Generating Hypertext Links
The \html controls used to generate hypertext links are:
\html anchor-href URL
\html anchor-name LABEL
\html anchor-end
The first two controls generate opening <A HREF> and <A NAME> tags; the
third generates a closing </A> tag.
To embed hypertext links in your troff source, you can use the macros
.H*ahref and .H*aend, or .H*aname and .H*aend. To write an HREF link,
the troff source looks like this:
.H*ahref http://www.some.host/some/path
hypertext link
.H*aend
The resulting HTML looks like this:
hypertext link
To write a NAME link, the troff source looks like this:
.H*aname my-name
name link
.H*aend
The resulting HTML looks like this:
name link
Section-header macros are usually redefined to generate a NAME anchor
for the table of contents, so don't surround a section header with
anchor-generating macros. You'll end up with nested anchors, which
tc2html disallows. You can generate a NAME link for a section (e.g., so
that you can refer to it using a specific name) as long as you don't
write the link like this:
.H*aname better-html
.SH "Generating Better HTML"
.H*aend
Instead, write it like this:
.H*aname better-html
.H*aend
.SH "Generating Better HTML"
Unfortunately, some browsers don't seem able to jump to NAME anchors
unless there is some text between the <A NAME> and </A> tags.
You can't make a section header a hypertext link. You'd have to put the
header (which generates a NAME link for the TOC) between the .H*ahref
and .H*aend macros, which would result in nested anchors.
Generating a Table of Contents
Putting a table of contents (TOC) into an HTML document requires some
postprocessing of the tc2html output. The TOC entries can't be written
to the beginning of the document because they're not all known until the
input has been read entirely.
The approach adopted with tc2html is as follows:
*
Write a marker to the document indicating the desired TOC position.
You do this using a special macro, described below.
*
Collect TOC entries in memory as the input is processed.
*
Write the TOC contents as a list near the end of the document.
*
Run tc2html-toc, a script that examines the HTML document and moves
the TOC contents to the location indicated by the TOC position marker.
If you run tc2html directly, you must also run tc2html-toc directly. If
you use troff2html, tc2html-toc is run for you automatically.
The \html controls used to generate TOC entries are:
\html anchor-toc N
\html anchor-end
Text occurring between \html anchor-toc and \html anchor-end pairs is
written to the output, but it's also collected and remembered. When
tc2html encounters end of file on its input, it writes the TOC entries
to the output between two other HTML comments:
TOC entries
If you want to generate a TOC entry explicitly in your troff source, use
.H*atoc and .H*aend. For example:
.H*atoc 1
My TOC Entry
.H*aend
The argument to .H*atoc is the TOC entry level (1, 2, 3, ...).
It's unnecessary to invoke TOC macros directly if the section-header
macros in your macro package are redefined to invoke the TOC macros for
you. For example, the .SH macro for the -ms package is redefined like
this in the tc.ms-html action file:
req SH parse-macro-args eol \
break fill adjust b \
push-string ".H*atoc 1\n" \
push-string ".H*header 2\n" \
push-string "$1\n" \
push-string ".H*header*end\n" \
push-string ".H*aend\n"
To specify the TOC title and generate the TOC position marker, use the
.H*toc*title macro. Invoke it as shown below, passing the title of your
TOC as the first argument:
.H*toc*title "Table of Contents"
.H*toc*title writes the TOC title to the output followed by a special
HTML comment:
Table of Contents
The INSERT TOC HERE comment is used by tc2html-toc, along with the TOC
BEGIN and TOC END comments, to find the TOC entries and move them to the
desired location.
Action files that provide macro package redefinitions for tc2html can
try to place an advisory TOC location marker in the document. This is
used if you don't specify a location marker explicitly with
.H*toc*title:
For instance, the -man redefinitions put out this marker when the .TH
macro has been seen. The marker causes a TOC to be placed after the
title line and the first man page section, unless one is specified
explicitly. No TOC title is written with the advisory marker, however,
so the TOC will be "title-less."
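The kind of postprocessing that tc2html-toc performs can be sketched in Python as follows. This is not the real script: the function name move_toc is hypothetical, and the exact comment strings are assumptions built from the marker names given above. The sketch also assumes that each comment sits on a line of its own, and it ignores the advisory marker.

```python
def move_toc(
    html,
    begin="<!-- TOC BEGIN -->",      # assumed comment text
    end="<!-- TOC END -->",          # assumed comment text
    marker="<!-- INSERT TOC HERE -->",  # assumed comment text
):
    """Move the lines between begin and end to the marker position.

    The document is returned unchanged if any of the three comments
    is missing or if they appear in an unexpected order.
    """
    lines = html.splitlines()
    if begin not in lines or end not in lines:
        return html
    b = lines.index(begin)
    e = lines.index(end)
    if e < b:
        return html
    toc = lines[b + 1:e]
    # Drop the TOC block, including its two delimiting comments.
    rest = lines[:b] + lines[e + 1:]
    if marker not in rest:
        return html
    m = rest.index(marker)
    # Replace the position marker with the collected TOC lines.
    return "\n".join(rest[:m] + toc + rest[m + 1:])
```

Returning the input untouched when a comment is missing keeps the sketch safe to run on documents that have no TOC at all.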