Parse perl's pod files.

SYNOPSIS

THIS TK SNAPSHOT SHOULD BE REPLACED BY A CPAN MODULE

A module designed to simplify the job of parsing and formatting ``pods'', the documentation format used by perl5. This consists of several different functions to present and modify predigested pod files.

GUESSES

This is a work in progress, so I may have some stuff wrong, perhaps badly. Some of my more reaching guesses:

An =index paragraph should be split into lines, and each line placed inside an `X' formatting command which is then prepended to the next paragraph, like this:
```
  =index foo
  foo2
  foo3
  foo2!subfoo

  Foo!
```
Will become:
```
  X<foo>X<foo2>X<foo3>X<foo2!subfoo>Foo!
```
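A minimal sketch of that guess in Perl. The helper name index_to_x is made up for illustration and is not part of Pod::Parse; it assumes one index entry per line of the =index argument:

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Pod::Parse: turn the argument of an
# =index paragraph into a run of X<> codes and prepend it to the text
# of the paragraph that follows.
sub index_to_x {
    my ($index_arg, $next_paragraph) = @_;
    my @entries;
    for my $line (split /\n/, $index_arg) {
        $line =~ s/^\s+//;                 # trim leading whitespace
        $line =~ s/\s+$//;                 # trim trailing whitespace
        push @entries, $line if length $line;
    }
    return join('', map { "X<$_>" } @entries) . $next_paragraph;
}

print index_to_x("foo\nfoo2\nfoo3\nfoo2!subfoo", "Foo!"), "\n";
```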
A related change: an `X' command is to be used for indexing data. This implies that all formatters need to at least ignore the `X' command.
Inside an =command, no special significance is to be placed on the first line of the argument. Thus the following two paragraphs should be parsed identically:
```
 =item 1. ABC
 
 =item 1.
 ABC
```
Note that neither of these is identical to this:
```
 =item 1.
 
 ABC
```
which puts the "ABC" in a separate paragraph.
I actually violate this rule twice: in parsing =index commands, and in passing through the =pragma commands. I hope this makes sense.
I added the =comment command, which simply ignores the next paragraph.
I also added =pragma, which also ignores the next paragraph, but this time it gives the formatter a chance at doing something sinister with it.

POD CONVENTIONS

This module has two goals: first, to simplify the use of the pod format, and second, to codify it. While perlpod contains some information, it hardly gives the entire story. Here I present "the rules", or at least the rules as far as I've managed to work them out.

Paragraphs: The basic element

The fundamental "atom" of a pod file is the paragraph, where a paragraph is defined as the text up to the next completely blank line ("\n\n"). Any pod parser will read in paragraphs sequentially, deciding what to do with each based solely on the current state and on the text at the _beginning_ of the paragraph.
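
As a rough sketch of this rule, the following Perl splits a string of pod source into paragraphs. The function name split_paragraphs is an assumption made for illustration, and the sketch treats only truly empty lines as separators:

```perl
use strict;
use warnings;

# Split pod source into paragraphs. A paragraph ends at one or more
# completely blank lines. The name split_paragraphs is illustrative.
sub split_paragraphs {
    my ($source) = @_;
    $source =~ s/\A\n+//;    # ignore blank lines before the first paragraph
    return grep { length } split /\n{2,}/, $source;
}

my @paras = split_paragraphs("=head1 NAME\n\nSome text\nmore text\n\n  verbatim\n");
print scalar(@paras), " paragraphs\n";
```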

Commands: The method of communication

A paragraph that starts with the `=' symbol is assumed to be a special command. All of the alphanumeric characters directly after the `=' are assumed to be part of the name of the command, up to the first whitespace. Anything past that whitespace is considered "the argument", and the argument continues up to the end of the paragraph, regardless of newlines or other whitespace.
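
A minimal sketch of that split, assuming the paragraph has already been isolated. The name split_command is hypothetical, and trimming trailing whitespace from the argument is a choice made here, not something the rules above require:

```perl
use strict;
use warnings;

# Hypothetical helper: split a command paragraph into (name, argument).
# Returns an empty list if the paragraph is not a command.
sub split_command {
    my ($para) = @_;
    # Name: alphanumerics right after '='. Argument: everything after
    # the first run of whitespace, newlines included.
    if ($para =~ /\A=([A-Za-z0-9]+)(?:\s+(.*))?\z/s) {
        my ($name, $arg) = ($1, defined $2 ? $2 : '');
        $arg =~ s/\s+\z//;    # drop whitespace at the end of the paragraph
        return ($name, $arg);
    }
    return;
}

my ($name, $arg) = split_command("=over 4\n");
print "$name / $arg\n";
```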

Text: Commands that aren't Commands

A paragraph that doesn't start with `=' is treated as either of two types of text. If it starts with a space or tab, it is considered a verbatim paragraph, which will be printed out... verbatim. No formatting changes whatsoever may be done. (Actually, this isn't quite true, but I'll get back to that at a later date.) A paragraph that doesn't start with whitespace or `=' is assumed to consist of formatted text that can be molded as the formatter sees fit. Reformatting to fit margins, whatever, it's fair game. These paragraphs also can contain a number of different formatting codes, which verbatim paragraphs can't. These formatting codes are covered later.

=cut: The uncommand

There is one command that needs special mention: =cut. Anything after a paragraph starting with =cut is simply ignored by the formatter. In addition, any text before a valid command is equally ignored. Any valid `=' command will re-enable formatting. This fact is used to great benefit by Perl, which is glad to ignore anything between an `=' command and `=cut', so you can embed a pod document right inside a perl program, and neither will bother the other.
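
The paragraph rules so far can be sketched as a small state machine. This is only an illustration: the function name classify and the kind labels are assumptions, and it treats every `=' paragraph as a valid command:

```perl
use strict;
use warnings;

# Classify each paragraph and honour =cut. Returns a list of
# [kind, text] pairs. All names here are illustrative.
sub classify {
    my @paragraphs = @_;
    my $in_pod = 0;    # text before the first command is ignored
    my @out;
    for my $para (@paragraphs) {
        if ($para =~ /\A=([A-Za-z0-9]+)/) {
            if ($1 eq 'cut') { $in_pod = 0; next; }
            $in_pod = 1;    # any other command re-enables formatting
            push @out, ['command', $para];
        }
        elsif (!$in_pod)           { next }
        elsif ($para =~ /\A[ \t]/) { push @out, ['verbatim', $para] }
        else                       { push @out, ['text', $para] }
    }
    return @out;
}

my @kinds = map { $_->[0] }
    classify('print 1;', '=head1 NAME', 'Hello', '  code', '=cut', 'print 2;');
print "@kinds\n";
```

The two `print' paragraphs stand in for Perl code surrounding the embedded pod, and are dropped.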

Reference to paragraph commands

=cut

Ignore anything till the next paragraph starting with `='.

=head1

A top-level heading. Anything after the command (either on the same line or on further lines) is included in the heading, up until the end of the paragraph.

=head2

Secondary heading. Same as =head1, but different. No, there isn't a head3, head4, etc.

=over [N]

Start a list. The N is the number of characters to indent by. Not all formatters will listen to this, though. A good number to use is 4. While =over sounds like it should just be indentation, it's more complex than that. It actually starts a nested environment, specifically for the use of =item's. As this command recurses properly, you can use more than one; you just have to make sure they are closed off properly by =back commands.

=back

Ends the last =over block. Resets the indentation to whatever it was previously. Closes off the list of =item's.

=item

The point behind =over and =back. This command should only be used between them. The argument supplied should be consistent (within a list) to one of three types: enumeration, itemization, or description. To exemplify:

An itemized list



  =over 4
  
  =item *
  
  A bulleted item
  
  =item *
  
  Another bulleted item
 
  =back
  
An enumerated list



  =over 4
  
  =item 1.
  
  First item.
  
  =item 2.
  
  Second item.
  
  =back
  
A described list



  =over 4
  
  =item Item #1
  
  First item
  
  =item Item #2 (which isn't really like #1, but is the second).
  
  Second item
  
  =back  
  
  
If you aren't consistent about the arguments to =item, Pod::Parse will
complain.

=comment

Ignore this paragraph.

=pragma

Ignore this paragraph, as well, unless you know what you are doing.

=index

Undecided at this time, but probably magic involving X<>.

Reference to formatting directives

B<...>

Format text inside the brackets as bold.

I<...>

Format text inside the brackets as italics.

Z<>

Replace with a zero-width character. You'll probably figure out some uses for this.

And yet more that I haven't described yet...

USAGE

Parse

This function takes a list of files as an argument. If no argument is given, it defaults to the contents of @ARGV. Parse then reads through each file and returns the data as a list. Each element of this list will be a nested list containing data from a paragraph of the pod file. Elements pertaining to "=over" paragraphs will themselves contain the nested entries for all of the paragraphs within that list. Thus, it's easier to parse the output of Parse using a recursive parser. (Um, did that parse?)

It is highly recommended that you use the output of Simplify, not Parse, as it's simpler.

The output will consist of a list, where each element in the list matches one of these prototypes:

[0,0,0,0,$filename]: This is produced at the beginning of each file parsed, where $filename is the name of that file.
[-1,0,0,0,$filename]: End of same.
[1,$line,$pos,0,$verbatim]: This is produced for each paragraph of verbatim text. $verbatim is the text, $line is the line offset of the paragraph within the file, and $pos is the byte offset. (In all of the following elements, $pos and $line have identical meanings, so I'll skip explaining them each time.)
[2,$line,$pos,$level,$heading]: Produced by a =head1 or =head2 command. $level is either 1 or 2, and $heading is the argument.
[3,$line,$pos,0,$item]: $item is the argument from an =item paragraph.
[4,$line,$pos,0,$index]: $index is the argument from an =index paragraph.
[6,$line,$pos,0,$text]: Normal formatted text paragraph. $text is the text.
[7,$line,$pos,0,$pragma]: $pragma is the argument from a =pragma paragraph.
[8,$line,$pos,$indentation,$type,...]: This item is produced for each matching =over/=back pair. $indentation is the argument to =over, $type is 1 if the embedded =item's are bulleted, 2 if they are enumerated, 3 if they are text, and 0 if there are no items. The "..." indicates an unlimited number of further elements which are themselves nested arrays in exactly the format being described. In other words, a list item includes all the paragraphs inside the list inside itself. (Clear? No? Nevermind.)
[9,$line,$pos,0,$cut]: $cut contains the text from a =cut paragraph. You shouldn't need to use this, but I _suppose_ it might be necessary to do special breaks on a cut. I doubt it though. This one is "depreciated", as Larry put it. Or perhaps disappreciated.
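
To make the nesting concrete, here is a sketch of a recursive walk over data in the shape described above. The sample data is hand-built (the file name and offsets are made up), and the function name outline is an assumption. Real input would come from Parse():

```perl
use strict;
use warnings;

# Hand-built sample in the shape described above. Line and byte
# offsets are invented for the example.
my @parsed = (
    [0, 0, 0, 0, 'sample.pod'],
    [2, 1, 0, 1, 'NAME'],
    [8, 3, 12, 4, 1,
        [3, 5, 20, 0, '*'],
        [6, 7, 28, 0, 'A bulleted item'],
    ],
    [-1, 0, 0, 0, 'sample.pod'],
);

# Build an indented outline. For list elements (type 8) the fifth field
# is the list type and the nested paragraphs follow from index 5 onwards.
sub outline {
    my ($depth, @elements) = @_;
    my @lines;
    for my $el (@elements) {
        my ($type, $line, $pos, $extra, $data, @nested) = @$el;
        if ($type == 8) {
            push @lines, ('  ' x $depth) . "list (type $data)";
            push @lines, outline($depth + 1, @nested);
        }
        else {
            push @lines, ('  ' x $depth) . "type $type: $data";
        }
    }
    return @lines;
}

print "$_\n" for outline(0, @parsed);
```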

Simplify

This procedure takes as its input the convoluted output from Parse(), and outputs a much simpler array consisting of pairs of commands and arguments, designed to be easy (easier?) to parse in your pod formatting code.

It is used very simply by saying something like:



 @Pod = Simplify(Parse());
 
 # Commands and arguments alternate, so take them off in pairs.
 while ($cmd = shift @Pod) {
        $arg = shift @Pod;
        #...
 }

Where #... is the code that responds to any of the commands from the following list. Note that you are welcome to ignore any of the commands that you want to. Many contain duplicate information, or at least information that will go unused. A formatter based on this data can be quite simple indeed. (See pod2text for entirely too simple an example.)

Reference to Simplify commands

"filename": The argument contains the name of the pod file that is being parsed. These will be present at the start of each file. You should open an output file, output headers, etc., based on this, and not when you start parsing.
End of file: Each file will be ended before the next one begins, and after all files are done with. You can do end processing here. The argument is the same name as in "filename".
"setline": This gives you a chance to record the "current" input line, probably for debugging purposes. In this case, "current" means that the next command you see that was derived from an input paragraph will have started at the line given by the argument.
Byte offset: Same as "setline", but the byte offset in the input, instead of the line offset.
"pragma": The argument contains the text of a pragma command.
"text": The argument contains a paragraph of formatted text.
"verbatim": The argument contains a paragraph of verbatim text.
"cut": A =cut command was hit. You shouldn't really need to listen for this one.
"index": The argument contains an =index paragraph. (Note: Currently, =index commands are not fed through, but turned into X<> commands.)
"head1"
"head2": The argument contains the argument from a header command.
"setindent": If you are tracking indentation, use the argument to set the indentation level.
"listbegin": Start a list environment. The argument is the type of list (1, 2, 3 or 0).
"listend": Ends a list environment. Same argument as "listbegin".
"listtype": The argument is the type of list. You can just record the argument when you see one of these, instead of paying attention to "listbegin" and "listend".
"over": The argument is the indentation. It's probably better to listen to the "list..." commands.
"back": Ends an "over" list. The argument is the original indentation.
"item": The argument is the text of the =item command.

Note that all of these various commands you've seen are synchronized properly so you don't have to pay attention to all at once, but they are all output for your benefit. Consider the following example:



 listtype 2
 listbegin 2
 setindent 4
 over 4
 item 1.
 text Item #1
 item 2.
 text Item #2
 setindent 0
 listend 2
 back 0
 listtype 0
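
A toy formatter over that stream might look like the sketch below. The stream is written out by hand so the code runs without Pod::Parse, and the handler table and its layout choices (items at the current indentation, text two columns further in) are assumptions for illustration only:

```perl
use strict;
use warnings;

# The command/argument stream from the example above, written by hand.
my @Pod = (
    listtype => 2, listbegin => 2, setindent => 4, over => 4,
    item => '1.', text => 'Item #1',
    item => '2.', text => 'Item #2',
    setindent => 0, listend => 2, back => 0, listtype => 0,
);

my $indent = 0;
my @out;

# Handlers for the commands this toy formatter cares about. Everything
# else is silently skipped, which the synchronized output allows.
my %handler = (
    setindent => sub { $indent = $_[0] },
    item      => sub { push @out, (' ' x $indent) . $_[0] },
    text      => sub { push @out, (' ' x ($indent + 2)) . $_[0] },
);

while (@Pod) {
    my ($cmd, $arg) = splice @Pod, 0, 2;    # take one command/argument pair
    $handler{$cmd}->($arg) if exists $handler{$cmd};
}

print "$_\n" for @out;
```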
 
Normalize

This command is normally invoked by Parse, so you shouldn't need to deal with it. It just cleans up text a little, turning spare '<', '>', and '&' characters into HTML escapes (&lt;, etc.), as well as generating warnings for some pod formatting mistakes.

Normalize2

A little more aggressive formatting based on heuristics. Not applied by default, as it might confuse your own heuristics.

%Escapes

This hash is exported from Pod::Parse, and contains default ASCII translations for some common HTML escape sequences. You might like to use this as a basis for an %HTML_Escapes hash in your own formatter.
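
A sketch of how a formatter might apply such a table. The hash %My_Escapes and the function unescape are hypothetical stand-ins; the keys and values actually held in %Escapes are not shown here and may differ:

```perl
use strict;
use warnings;

# Hypothetical table mapping escape names to ASCII text. The real
# %Escapes exported by Pod::Parse may hold different entries.
my %My_Escapes = (
    'amp'  => '&',
    'lt'   => '<',
    'gt'   => '>',
    'quot' => '"',
);

# Replace each known &name; sequence and leave unknown ones untouched.
sub unescape {
    my ($text) = @_;
    $text =~ s/&(\w+);/exists $My_Escapes{$1} ? $My_Escapes{$1} : "&$1;"/ge;
    return $text;
}

print unescape('a &lt; b &amp;&amp; c &gt; d &bogus;'), "\n";
```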
