tblcvt -- A troffcvt Preprocessor

Paul DuBois
dubois@primate.wisc.edu
Wisconsin Regional Primate Research Center

Revision date: 20 May 1997

ABSTRACT

tblcvt reads troff input and converts the tbl-related parts to a format that troffcvt can understand more easily than raw tbl output.

Introduction

This document describes tblcvt ("tbl convert"), a program that assists the process of using troffcvt to convert troff documents into other formats. It's assumed here that you're familiar with tbl. If you don't have the standard tbl documentation (Tbl - A Program to Format Tables, by M. E. Lesk), check the archive site from which you obtained the troffcvt distribution.

tblcvt exists because tables written in the tbl input language present a problem for troffcvt. troffcvt understands only the troff language and knows nothing of the tbl language, so input files containing tables need to be run through some sort of preprocessor before being given to troffcvt. In theory, you could run your troff files through tbl (since tbl generates output written in troff), and feed the result to troffcvt for processing. In practice, tbl output is generally arcane and incomprehensible, and troffcvt doesn't do a very good job with it.

The purpose of tblcvt is to convert the parts of troff input files that are intended for tbl into something that's easier for troffcvt to understand. This makes it more likely that troffcvt will generate output that its postprocessors will be able to put back together into something that looks like a table in the target format. Not every table will look great, but the tables in this document are simple enough that they should look reasonably good if the document is formatted with troff2html, troff2rtf, or unroff.

tblcvt is intended as a drop-in replacement for tbl. Suppose you'd normally format a document using a command like this:

    % tbl file ... | troff [options]

The analogous command using tblcvt and troffcvt looks something like this:

    % tblcvt file ... | troffcvt [options] | postprocessor

Or, if you use one of the front ends like troff2html that invoke troffcvt and the appropriate postprocessor for you, the command might look like this:

    % tblcvt file ... | troff2html [options]

If it seems that troffcvt or a front end is not reading the output from tblcvt, specify - after the option list to tell it explicitly to read the standard input after processing its other options:

    % tblcvt file ... | troffcvt [options] - | postprocessor
    % tblcvt file ... | troff2html [options] -

tblcvt Output Format

tblcvt ignores its input except for those parts between corresponding pairs of .TS (table start) and .TE (table end) requests. For each table, tblcvt digests its specification, figures out the table structure, and produces troff output that indicates the structure using a special set of requests. The output format has the property that it explicitly indicates the beginning and end of each table, each row within a table, and each cell within a row.

The general form of table information written by tblcvt looks like this:

    .T*table*begin [table options]
    .T*column*info [column 1 options]
    ...options for remaining columns...
    .T*row*begin
    .T*cell*info [cell layout options]
    ...options for remaining cells in row...
    .T*cell*begin [cell formatting options]
    ...cell contents...
    .T*cell*end
    ...remaining cells in row...
    .T*row*end
    ...remaining rows in table...
    .T*table*end

Shortcut requests are used in certain circumstances. If a cell is empty, tblcvt writes the single request .T*empty*cell rather than .T*cell*begin, .T*cell*end, and the cell data between them. Similarly, if a cell of the table matrix is part of the area spanned by an earlier cell, tblcvt writes .T*spanned*cell. If an entire row consists of a table-width line, tblcvt writes the single request .T*row*line rather than .T*row*begin, .T*row*end, and the cell information between them.

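To make the nesting of these requests concrete, here is a minimal sketch of a reader that groups tblcvt output into tables, rows, and cells. It is not part of the troffcvt distribution; the function name parse_tblcvt, the dictionary layout, and the sample argument values are all assumptions made for illustration. Request arguments are kept as uninterpreted strings, and any .T* request not listed in the general form or among the shortcuts above is simply skipped.

```python
# Hypothetical helper, not part of troffcvt: groups tblcvt requests by nesting.
CELL_SHORTCUTS = {".T*empty*cell": "empty", ".T*spanned*cell": "spanned"}


def parse_tblcvt(lines):
    """Return a list of tables; each table holds column info and rows."""
    tables, table, row, cell = [], None, None, None
    for line in lines:
        if not line.startswith(".T*"):
            # Non-request lines are cell contents when inside a cell and are
            # skipped otherwise (ordinary troff input outside tables).
            if cell is not None:
                cell["data"].append(line)
            continue
        name, *args = line.split()
        if name == ".T*table*begin":
            table = {"args": args, "columns": [], "rows": []}
        elif name == ".T*table*end":
            tables.append(table)
            table = None
        elif name == ".T*column*info":
            table["columns"].append(args)
        elif name == ".T*row*begin":
            row = {"kind": "cells", "info": [], "cells": []}
        elif name == ".T*row*end":
            table["rows"].append(row)
            row = None
        elif name == ".T*row*line":
            # Shortcut: the whole row is a table-width line.
            table["rows"].append({"kind": "line", "args": args})
        elif name == ".T*cell*info":
            row["info"].append(args)
        elif name == ".T*cell*begin":
            cell = {"kind": "data", "args": args, "data": []}
        elif name == ".T*cell*end":
            row["cells"].append(cell)
            cell = None
        elif name in CELL_SHORTCUTS:
            # Shortcut: a single request stands for the whole cell.
            row["cells"].append({"kind": CELL_SHORTCUTS[name]})
    return tables


# Made-up sample input; the argument values are illustrative only.
SAMPLE = """\
.T*table*begin 2 2 0 L n n n n
.T*column*info 0 3 n
.T*column*info 0 3 n
.T*row*begin
.T*cell*info L 1 1 T 0
.T*cell*info L 1 1 T 0
.T*cell*begin 0 0 0
left
.T*cell*end
.T*empty*cell
.T*row*end
.T*row*line 1
.T*table*end
""".splitlines()

tables = parse_tblcvt(SAMPLE)
```

Because every table, row, and cell is explicitly delimited, the reader needs no lookahead: each request either opens a container, closes one, or adds a complete item to the current one.
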
Note that since tblcvt output uses long request names, you can't use compatibility mode (-C option) with troffcvt or a troffcvt front end.

Table Beginning and Ending Requests

Each table begins with a .T*table*begin request, which has the following form:

    .T*table*begin rows cols header-rows align expand box allbox doublebox

rows and cols are the number of rows and columns in the table. (A row that draws a line is considered a data row.) For tables that are specified to have a header (using .TS H and .TH), tblcvt writes a non-zero value for the header-rows value. Otherwise header-rows is 0. align is L or C to indicate the table is left-justified or centered. expand is y if the table is expanded to the full line width, n otherwise. The box, allbox, and doublebox values are each y or n, depending on whether or not box, allbox, and doublebox were given in the table specification. (Note that allbox and doublebox both imply box.)

Each table is terminated by a .T*table*end request.

Column Information Requests

Following the .T*table*begin request, tblcvt writes one .T*column*info line for each column of the table, in the format:

    .T*column*info width sep equal

The column number is not specified; .T*column*info lines will appear in consecutive order. width is the minimum required width of the column. The value is non-zero if any entry in the given column specified a w option. If more than one entry specified w, the last one is used. If width is 0, no entry in the column specified w and the width is determined from the data values in the column. sep is the column separation value. The equal value is y if any entry in the column specified the e option, and n otherwise. All columns with an equal value of y should be made the same width.

Row Beginning and Ending Requests

If a table row does not consist of a table-width line, the row begins and ends with .T*row*begin and .T*row*end requests. Information for the individual cells is written between these two requests (see "Cell Information Requests").

If a row consists of a table-width single or double line, the .T*row*begin and .T*row*end requests are not used. Instead, the row is specified completely by a single .T*row*line request, written using one of the following forms:

    .T*row*line 1    Table-width single line
    .T*row*line 2    Table-width double line

Cell Information Requests

Between each pair of .T*row*begin and .T*row*end requests, tblcvt writes out the information for each cell (column) in the row. First a set of .T*cell*info lines is written, one for each cell. These requests provide basic layout parameters. Then the contents of the cells are written. For the usual case, a cell is written using .T*cell*begin and .T*cell*end requests, with the cell data appearing between the requests. Empty, spanned, or line-drawing cells are written using .T*empty*cell, .T*spanned*cell, and .T*cell*line requests. This means that cells begin with any of .T*cell*begin, .T*empty*cell, .T*spanned*cell, or .T*cell*line, and end with any of .T*cell*end, .T*empty*cell, .T*spanned*cell, or .T*cell*line.

The .T*cell*info request has the following form:

    .T*cell*info type vspan hspan vadjust border

The column number of the cell is not specified; .T*cell*info lines will appear in consecutive order.

type is the cell type:

    L    Left-justified
    R    Right-justified
    C    Centered
    N    Numeric (align to decimal point)
    A    Alphanumeric

vspan and hspan are the number of rows and columns spanned by the cell, including itself. Interpret these values as follows:

* If a cell spans no other cells, both vspan and hspan are 1.

* If a cell spans other cells vertically, vspan is greater than 1. If a cell is spanned vertically by a cell from above, vspan is zero.

* If a cell spans other cells horizontally, hspan is greater than 1. If a cell is spanned horizontally by a cell from the left, hspan is zero.

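As an illustration of these rules, the following sketch splits a .T*cell*info line into its five arguments and applies the interpretation above. The names CellInfo and parse_cell_info are hypothetical, and the sketch assumes that vspan and hspan are written as decimal integers. The vadjust and border arguments are kept as raw strings and are not interpreted here.

```python
from typing import NamedTuple


class CellInfo(NamedTuple):
    """Arguments of one .T*cell*info request (hypothetical container)."""
    ctype: str    # L, R, C, N, or A
    vspan: int    # assumed to be a decimal integer
    hspan: int    # assumed to be a decimal integer
    vadjust: str  # kept as written
    border: str   # kept as written

    @property
    def spanned_from_above(self):
        return self.vspan == 0

    @property
    def spanned_from_left(self):
        return self.hspan == 0

    @property
    def is_spanned(self):
        return self.spanned_from_above or self.spanned_from_left

    @property
    def spans_others(self):
        # A spanned cell may still carry a count greater than 1 in the
        # other direction, so rule out spanned cells first.
        return not self.is_spanned and (self.vspan > 1 or self.hspan > 1)


def parse_cell_info(line):
    """Parse '.T*cell*info type vspan hspan vadjust border'."""
    name, ctype, vspan, hspan, vadjust, border = line.split()
    if name != ".T*cell*info":
        raise ValueError("not a .T*cell*info request: " + line)
    return CellInfo(ctype, int(vspan), int(hspan), vadjust, border)


info = parse_cell_info(".T*cell*info L 2 1 T 0")
```

In this made-up example line, vspan is 2 and hspan is 1, so the cell is the top of a two-row vertical span and is not itself spanned.
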
If all you want to know is whether or not a cell is spanned, the product of vspan and hspan is zero if and only if the cell is spanned. If you need to know whether spanning is in a particular direction, you need to examine vspan and hspan individually. This is summarized in the following table.

                 hspan = 0            hspan > 0
    vspan = 0    spanned both ways    spanned from above
    vspan > 0    spanned from left    not spanned

vadjust is T if the cell contents should be vertically adjusted from the top, C if the contents should be vertically centered. vadjust is meaningful only for multiple-line cells.

border is the border value. If the value is 0, there is no border. Otherwise, the value is a bitmap with the following fields:

    Bits    Value    Meaning
    0-1     1        Left border, single line
            2        Left border, double line
    2-3     1        Right border, single line
            2        Right border, double line
    4-5     1        Top border, single line
            2        Top border, double line
    6-7     1        Bottom border, single line
            2        Bottom border, double line

The .T*cell*begin request has the following form:

    .T*cell*begin font ptsize vspace

font is the font to use for formatting the cell, 0 if no font was specified. ptsize is the point size to use for formatting the cell, 0 if no size was specified. vspace is the vertical spacing to use for formatting the cell, 0 if no spacing was specified.

The .T*cell*end request has no arguments:

    .T*cell*end

If a cell is empty or spanned or draws a line, the .T*cell*begin and .T*cell*end requests are not used. Instead, the cell is specified using one of the following requests:

* A cell that has no data is indicated by .T*empty*cell. This is functionally equivalent to:

      .T*cell*begin
      .T*cell*end

  except that it's not necessary to scan ahead to the second request to find out that the cell is empty.

* A cell that is spanned by a cell occurring earlier in the table is indicated by .T*spanned*cell. The .T*cell*info lines written at the beginning of the row contain the information necessary to determine whether the cell is spanned from the top or from the left or both. Note that a spanned cell may be spanned by a cell with data in it, an empty cell, or a line-drawing cell.

* A cell that draws a line is indicated by the .T*cell*line request, written using one of the following forms:

      .T*cell*line 0    Column-data-width single line
      .T*cell*line 1    Column-width single line
      .T*cell*line 2    Column-width double line

  A column-data-width line is a single line as wide as the contents of the column. It does not extend the full width of the column. This type of cell results from a \_ data value in the table specification.

troffcvt Handling of tblcvt Output

The .T*xxx requests are defined in the default actions file that troffcvt reads when it starts up. The actions for the requests cause troffcvt to perform a relatively simple mapping:

    tblcvt output               troffcvt output
    .T*table*begin arguments    \table-begin arguments
    .T*table*end                \table-end
    .T*column*info arguments    \table-column-info arguments
    .T*row*begin                \table-row-begin
    .T*row*end                  \table-row-end
    .T*cell*info arguments      \table-cell-info arguments
    .T*cell*begin arguments     \table-cell-begin
    .T*cell*end                 \table-cell-end
    .T*row*line argument        \table-row-line argument
    .T*cell*line argument       \table-cell-line argument
    .T*spanned*cell             \table-spanned-cell
    .T*empty*cell               \table-empty-cell

When a request written by tblcvt has arguments, the corresponding control written by troffcvt has arguments that are similar to, but not necessarily exactly the same as, those of the request. The primary exception is that the font, ptsize, and vspace arguments to .T*cell*begin are converted directly by the troffcvt actions file into font and size troff directives, then translated into the troffcvt intermediate language. The font and size controls appear in troffcvt output immediately following the \table-cell-begin control. See troffcvt Output Format and PostProcessor Writing for the exact format of the \table- controls.

In addition to the .T*xxx request names used by tblcvt, troffcvt uses the register names T*cell*ft, T*cell*ps, and T*cell*vs for internal purposes.

Calculating Spans

Table specifications may indicate that a table element spans multiple rows or columns, or both. However, not all spanning specifications are legal, and tblcvt tries to catch those that are malformed. The spanning constraints enforced by tblcvt are:

* A cell cannot span left if there is no column to span into, so the s format cannot be used in the first table column.

* A cell cannot span upward if there is no row to span into, so the ^ format cannot be used in the first table row. (tblcvt does allow ^ to be used on the first format line following .T&, since the column entry can span up into the last row of the previous section.)

* A spanned area must comprise a rectangular block of cells.

The smallest illegal table specifications that include spans are shown below; each of them violates one of the first two spanning constraints:

    .TS        .TS
    s .        ^ .
    data       data
    .TE        .TE

Assuming the first two constraints are satisfied, the smallest illegal table specifications that include spans are shown below (l is used here, but any non-spanning column type may be substituted):

    .TS        .TS
    l l        l s
    ^ s .      l ^ .
    data       data
    .TE        .TE

The first table is illegal by the following reasoning. The cells in the first column form a single vertically-spanned element. The second column could be part of that element if both cells spanned to the left, since the resulting spanned area would be rectangular. However, since only one of the cells spans to the left, the spanned area is L-shaped, which is illegal.

The second table is illegal by similar reasoning. The top two cells form a single element. The bottom two cells could be part of that element if they both spanned upward, but only one of them does.

tblcvt uses the strategy outlined below to determine the extent of spanned elements and to discover non-rectangularities in cell spanning. The strategy works by operating on a matrix with one column for each column specified in the format section of the table specification, and one row for each row of table data given in the data section of the specification. Working from left to right and top to bottom, each cell of the matrix is visited and the following checks are applied:

* If the current cell format is s or ^, it's illegal, because it should have been found to be part of some spanning block to the left or above. (This check also discovers s in the first table column and ^ in the first row.)

* If the current cell is not s or ^, determine how far it spans other cells vertically and horizontally. The vertical span includes the cell itself and as many consecutive ^-format cells as lie directly below it. The horizontal span is similar, but is determined by the number of consecutive s-format cells lying directly to the right of the cell. Denote the vertical and horizontal span counts by vspan and hspan.

* If vspan and hspan are both one, the cell stands alone and doesn't span any other cells. A standalone cell forms a 1 x 1 rectangle, so the cell is okay. Mark it as visited and go to the next unvisited cell.

* If vspan or hspan (or both) are greater than one, then for the span to be rectangular, the cell must span a block of vspan x hspan cells. We know that the cells directly to the right of and below the upper left corner cell are spanned properly, since they determine the span extents. Therefore, it's necessary only to examine the (vspan-1) x (hspan-1) block of cells to the immediate lower right of the corner cell. These cells must span into the left column or top row of the block. This condition is satisfied if all the cells are of type s or type ^, in any combination. (Proof left as an exercise for the reader...) If the condition is not satisfied, the table specification is illegal. Otherwise mark the cells in the spanned block as visited and proceed to the next unvisited cell.

Some tables are examined below to illustrate the strategy just described.

Example 1: The table shown below is illegal.

    .TS
    l s s l
    ^ s s s .
    data
    .TE

Beginning at the upper left, we see that the vertical and horizontal spans are 2 and 3. The remaining cells in this 2 x 3 block are the second and third cells in the second row. They both span into the first cell of the second row, so they are part of the span block. Therefore, the upper 2 x 3 block is okay.

The next unvisited cell is the fourth cell in the first row. This cell is a standalone cell, so it's okay.

The last unvisited cell is the fourth cell of the second row. This cell is a spanned cell, but it can't span into the block to the left without forming a non-rectangular block. The table specification is bad.

Example 2: Here's a table that appears at first glance as though it may be illegal. Is it?

    .TS
    l s s s
    ^ s ^ ^
    ^ ^ s s .
    data
    .TE

Beginning at the upper left, we see that the vertical and horizontal spans are 3 and 4. This means 12 cells should be in the span block. We know that the three s cells to the right of the corner cell and the two ^ cells below the corner cell are part of the block, so the next step is to examine the remaining 2 x 3 block at the lower right.

In the second row of the block, the second cell spans left into the first column (and is thus part of the span block), and the third and fourth cells span up into the first row (and are thus part of the span block).

In the third row, the second cell spans up into the second row (and is thus part of the span block since that row has already been determined to be part of the block), and the third and fourth cells span into the third cell (which, since that cell has just been determined to be part of the block, makes the last two cells part of the block as well).

Therefore, in spite of its unusual specification, the table is legal. It consists of a single 3 x 4 spanned entry.

Example 3: Span calculations are performed with separate matrices for vertical and horizontal spans that initially assume all spans are 1. Suppose we have a table specification that looks like this:

    .TS
    l s l
    l s l
    ^ s l .
    a1    a2
    b1    b2
          c
          d
    .TE

There are three format columns. There are three format rows but four data rows, so the last format line is used for the third and fourth data rows. The vertical and horizontal span matrices are 4 x 3, and start out like this (vertical on the left, horizontal on the right):

    1 1 1    1 1 1
    1 1 1    1 1 1
    1 1 1    1 1 1
    1 1 1    1 1 1

After calculating spans, the matrices end up like this:

    1 1 1    2 0 1
    3 3 1    2 0 1
    0 0 1    2 0 1
    0 0 1    2 0 1
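
The strategy and the three examples can be checked with a short sketch. This is not tblcvt's actual code; the function names expand_format and compute_spans are hypothetical, and format lines are given as strings of key letters without the trailing period. The way the matrices are filled in (the span count in the top row of a block for the vertical matrix and in the left column for the horizontal matrix, zero elsewhere in the block) is inferred from the result shown in Example 3.

```python
def expand_format(format_rows, n_data_rows):
    """Give each data row a format row, reusing the last format line."""
    rows = [fmt.split() for fmt in format_rows]
    while len(rows) < n_data_rows:
        rows.append(list(rows[-1]))
    return rows[:n_data_rows]


def compute_spans(fmt):
    """Return (vspan, hspan) matrices, or raise ValueError if illegal."""
    nrows, ncols = len(fmt), len(fmt[0])
    vspan = [[1] * ncols for _ in range(nrows)]
    hspan = [[1] * ncols for _ in range(nrows)]
    visited = [[False] * ncols for _ in range(nrows)]
    for r in range(nrows):
        for c in range(ncols):
            if visited[r][c]:
                continue
            if fmt[r][c] in ("s", "^"):
                # Should already belong to a block to the left or above.
                raise ValueError("bad span at row %d, column %d" % (r + 1, c + 1))
            v = 1  # the cell itself plus consecutive ^ cells below it
            while r + v < nrows and fmt[r + v][c] == "^":
                v += 1
            h = 1  # the cell itself plus consecutive s cells to its right
            while c + h < ncols and fmt[r][c + h] == "s":
                h += 1
            # Check the (v-1) x (h-1) block to the lower right of the corner.
            for i in range(1, v):
                for j in range(1, h):
                    if fmt[r + i][c + j] not in ("s", "^"):
                        raise ValueError("non-rectangular span at row %d, "
                                         "column %d" % (r + 1, c + 1))
            # Mark the block as visited and record the span counts.
            for i in range(v):
                for j in range(h):
                    visited[r + i][c + j] = True
                    vspan[r + i][c + j] = v if i == 0 else 0
                    hspan[r + i][c + j] = h if j == 0 else 0
    return vspan, hspan


# Example 3: three format lines, four data rows.
v3, h3 = compute_spans(expand_format(["l s l", "l s l", "^ s l"], 4))
```

For Example 3 this yields the same two matrices as shown above. Example 1 raises ValueError when the scan reaches the s in the fourth column of the second row, and Example 2 is accepted as a single 3 x 4 block.
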