Pod::Parse - Parse perl's pod files.
THIS TK SNAPSHOT SHOULD BE REPLACED BY A CPAN MODULE
A module designed to simplify the job of parsing and formatting "pods", the
documentation format used by perl5. It consists of several different
functions to present and modify predigested pod files.
This is a work in progress, so I may have some stuff wrong, perhaps badly.
Some of my more reaching guesses:
- An =index paragraph should be split into lines, and each line placed inside
an `X' formatting command which is then prepended to the next paragraph,
like this:
=index foo
foo2
foo3
foo2!subfoo
Foo!
Will become:
X<foo>X<foo2>X<foo3>X<foo2!subfoo>Foo!
- A related change: that an `X' command is to be used for indexing data. This
implies that all formatters need to at least ignore the `X' command.
- Inside an =command, no special significance is to be placed on the first line
of the argument. Thus the following two lines should be parsed identically:
=item 1. ABC
=item 1.
ABC
Note that neither of these is identical to this:
=item 1.

ABC
which puts the "ABC" in a separate paragraph.
- I actually violate this rule twice: in parsing =index commands, and in
passing through the =pragma commands. I hope this makes sense.
- I added the =comment command, which simply ignores the next paragraph.
- I also added =pragma, which also ignores the next paragraph, but this time
it gives the formatter a chance at doing something sinister with it.
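The =index guess above can be sketched roughly like this. The helper name index_to_x is made up for illustration and is not part of Pod::Parse; it only shows the line-by-line wrapping in X<> and the prepending to the following paragraph.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Pod::Parse): turn the argument of an
# =index paragraph into a run of X<> codes, one per non-empty line.
sub index_to_x {
    my ($arg) = @_;
    my @entries;
    for my $line (split /\n/, $arg) {
        $line =~ s/^\s+//;    # trim leading whitespace
        $line =~ s/\s+$//;    # trim trailing whitespace
        push @entries, $line if length $line;
    }
    return join '', map { "X<$_>" } @entries;
}

my $index_arg = "foo\nfoo2\nfoo3\nfoo2!subfoo";
my $next_para = "Foo!";

# Prepend the generated codes to the paragraph that follows the =index.
my $result = index_to_x($index_arg) . $next_para;
print "$result\n";
```

A formatter that doesn't care about indexing can then strip or skip every X<...> it meets.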
This module has two goals: first, to simplify the usage of the pod format,
and second, to codify the pod format. While perlpod contains some
information, it hardly gives the entire story. Here I present "the rules",
or at least the rules as far as I've managed to work them out.
- Paragraphs: The basic element
- The fundamental "atom" of a pod file is the paragraph, where a paragraph is
defined as the text up to the next completely blank line ("\n\n"). Any pod
parser will read in paragraphs sequentially, deciding what to do with each
based solely on the current state and on the text at the _beginning_ of the
paragraph.
- Commands: The method of communication
- A paragraph that starts with the `=' symbol is assumed to be a special command.
All of the alphanumeric characters directly after the `=' are assumed to be
part of the name of the command, up to the first whitespace. Anything past that
whitespace is considered "the argument", and the argument continues up till
the end of the paragraph, regardless of newlines or other whitespace.
- Text: Commands that aren't Commands
- A paragraph that doesn't start with `=' is treated as either of two types of
text. If it starts with a space or tab, it is considered a verbatim
paragraph, which will be printed out... verbatim. No formatting changes
whatsoever may be done. (Actually, this isn't quite true, but I'll get back to
that at a later date.)
A paragraph that doesn't start with whitespace or `=' is assumed to consist of
formatted text that can be molded as the formatter sees fit. Reformatting to
fit margins, whatever, it's fair game. These paragraphs also can contain a
number of different formatting codes, which verbatim paragraphs can't. These
formatting codes are covered later.
- =cut: The uncommand
- There is one command that needs special mention: =cut. Anything after a
paragraph starting with =cut is simply ignored by the formatter. In
addition, any text before a valid command is equally ignored. Any valid
`=' command will reenable formatting. This fact is used to great benefit by
Perl, which is glad to ignore anything between an `=' command and `=cut', so
you can embed a pod document right inside a perl program, and neither will
bother the other.
- Reference to paragraph commands
- =cut
- Ignore anything till the next paragraph starting with `='.
- =head1
- A top-level heading. Anything after the command (either on the same line or
on further lines) is included in the heading, up until the end of the paragraph.
- =head2
- Secondary heading. Same as =head1, but different. No, there isn't a head3,
head4, etc.
- =over [N]
- Start a list. The N is the number of characters to indent by. Not all
formatters will listen to this, though. A good number to use is 4.
While =over sounds like it should just be indentation, it's more complex than
that. It actually starts a nested environment, specifically for the use of
=item's. As this command recurses properly, you can use more than one, you
just have to make sure they are closed off properly by =back commands.
- =back
- Ends the last =over block. Resets the indentation to whatever it was
previously. Closes off the list of =item's.
- =item
- The point behind =over and =back. This command should only be used between
them. Within a list, the argument supplied should consistently be one of
three types: enumeration, itemization, or description. To exemplify:
An itemized list
=over 4
=item *
A bulleted item
=item *
Another bulleted item
=back
An enumerated list
=over 4
=item 1.
First item.
=item 2.
Second item.
=back
A described list
=over 4
=item Item #1
First item
=item Item #2 (which isn't really like #1, but is the second).
Second item
=back
If you aren't consistent about the arguments to =item, Pod::Parse will
complain.
- =comment
- Ignore this paragraph.
- =pragma
- Ignore this paragraph, as well, unless you know what you are doing.
- =index
- Undecided at this time, but probably magic involving X<>.
- Reference to formatting directives
- B<...>
- Format text inside the brackets as bold.
- I<...>
- Format text inside the brackets as italics.
- Z<>
- Replace with a zero-width character. You'll probably figure out some uses
for this.
- And yet more that I haven't described yet...
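The paragraph rules above (blank-line separated paragraphs, `=' commands with an argument, verbatim text starting with whitespace, and =cut switching things off) can be sketched like this. It is only an illustration under those stated rules; classify_paragraphs and the sample input are made up here and are not how Pod::Parse itself is written.

```perl
use strict;
use warnings;

# Hypothetical helper, not Pod::Parse's own code: split pod source into
# paragraphs and classify each one by the way it begins.
sub classify_paragraphs {
    my ($pod) = @_;
    my $in_pod = 0;    # text before the first command is ignored
    my @out;
    for my $para (split /\n{2,}/, $pod) {
        $para =~ s/\n+\z//;
        next unless length $para;
        if ($para =~ /^=(\w+)\s*(.*)\z/s) {
            # Command name, then everything else as the argument,
            # newlines included.
            my ($name, $arg) = ($1, $2);
            if ($name eq 'cut') { $in_pod = 0; next }
            $in_pod = 1;
            push @out, ['command', $name, $arg];
        }
        elsif (!$in_pod) {
            next;    # between =cut and the next command: not pod
        }
        elsif ($para =~ /^[ \t]/) {
            push @out, ['verbatim', $para];
        }
        else {
            push @out, ['text', $para];
        }
    }
    return @out;
}

# Sample input built line by line (an assumption for this sketch).
my $pod = join "\n",
    'print "perl code, ignored";',
    '',
    '=head1 NAME',
    '',
    'Plain text with B<bold>.',
    '',
    '    verbatim line',
    '',
    '=item 1.',
    'ABC',
    '',
    '=cut',
    '',
    'more perl code, ignored',
    '';

for my $p (classify_paragraphs($pod)) {
    print join(' | ', @$p), "\n";
}
```

Note that "=item 1." followed directly by "ABC" comes out as a single command whose argument spans both lines, which is the behaviour described in the guesses earlier.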
Parse
This function takes a list of files as an argument. If no argument is given,
it defaults to the contents of @ARGV. Parse then reads through each file and
returns the data as a list. Each element of this list will be a nested list
containing data from a paragraph of the pod file. Elements pertaining to
"=over" paragraphs will themselves contain the nested entries for all of the
paragraphs within that list. Thus, it's easier to parse the output of Parse
using a recursive parser. (Um, did that parse?)
It is highly recommended that you use the output of Simplify, not Parse,
as it's simpler.
The output will consist of a list, where each element in the list matches
one of these prototypes:
- [0,0,0,0,$filename]
- This is produced at the beginning of each file parsed, where $filename is
the name of that file.
- [-1,0,0,0,$filename]
- End of same.
- [1,$line,$pos,0,$verbatim]
- This is produced for each paragraph of verbatim text. $verbatim is the text,
$line is the line offset of the paragraph within the file, and $pos is the
byte offset. (In all of the following elements, $pos and $line have identical
meanings, so I'll skip explaining them each time.)
- [2,$line,$pos,$level,$heading]
- Produced by a =head1 or =head2 command. $level is either 1 or 2, and $heading
is the argument.
- [3,$line,$pos,0,$item]
- $item is the argument from an =item paragraph.
- [4,$line,$pos,0,$index]
- $index is the argument from an =index paragraph.
- [6,$line,$pos,0,$text]
- Normal formatted text paragraph. $text is the text.
- [7,$line,$pos,0,$pragma]
- $pragma is the argument from a =pragma paragraph.
- [8,$line,$pos,$indentation,$type,...]
- This item is produced for each matching =over/=back pair. $indentation is
the argument to =over, $type is 1 if the embedded =item's are bulleted, 2 if
they are enumerated, 3 if they are text, and 0 if there are no items.
The "..." indicates an unlimited number of further elements which are
themselves nested arrays in exactly the format being described. In other
words, a list item includes all the paragraphs inside the list inside
itself. (Clear? No? Nevermind.)
- [9,$line,$pos,0,$cut]
- $cut contains the text from a =cut paragraph. You shouldn't need to use
this, but I _suppose_ it might be necessary to do special breaks on a cut. I
doubt it though. This one is "depreciated", as Larry put it. Or perhaps
disappreciated.
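A recursive walk over the structure described above might look something like this. The sample data is built by hand to match the prototypes listed here, so treat its values (and the walk and %type_name names) as assumptions for illustration; nothing below calls Pod::Parse itself.

```perl
use strict;
use warnings;

# Hand-built sample in the shape described above; real Parse() output may
# differ in detail, so the values here are assumptions.
my @parsed = (
    [0, 0, 0, 0, 'sample.pod'],
    [2, 1, 0, 1, 'NAME'],
    [6, 3, 12, 0, 'Some formatted text.'],
    [8, 5, 34, 4, 2,
        [3, 7, 43, 0, '1.'],
        [6, 9, 52, 0, 'First item.'],
        [3, 11, 65, 0, '2.'],
        [6, 13, 74, 0, 'Second item.'],
    ],
    [-1, 0, 0, 0, 'sample.pod'],
);

# Labels for the element type codes listed in the prototypes.
my %type_name = (
    0 => 'file-start', -1 => 'file-end', 1 => 'verbatim', 2 => 'heading',
    3 => 'item',        4 => 'index',    6 => 'text',     7 => 'pragma',
    8 => 'list',        9 => 'cut',
);

# Return one indented description line per element, recursing into the
# nested elements carried by type 8 (=over/=back) entries.
sub walk {
    my ($depth, @elements) = @_;
    my @lines;
    for my $el (@elements) {
        my ($type, $line, $pos, $extra, $data, @children) = @$el;
        my $name = exists $type_name{$type} ? $type_name{$type} : "unknown($type)";
        if ($type == 8) {
            push @lines, ('  ' x $depth) . "list (indent $extra, type $data)";
            push @lines, walk($depth + 1, @children);
        }
        else {
            push @lines, ('  ' x $depth) . "$name: $data";
        }
    }
    return @lines;
}

print "$_\n" for walk(0, @parsed);
```

Only the type 8 entries need the recursion; everything else is a flat five-element record.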
Simplify
This procedure takes as its input the convoluted output from Parse(), and
outputs a much simpler array consisting of pairs of commands and arguments,
designed to be easy (easier?) to parse in your pod formatting code.
It is used very simply by saying something like:
@Pod = Simplify(Parse());
while ($cmd = shift @Pod) {
    $arg = shift @Pod;
    #...
}
Where #... is the code that responds to any of the commands from the
following list. Note that you are welcome to ignore any of the commands that
you want to. Many contain duplicate information, or at least information
that will go unused. A formatter based on this data can be quite simple
indeed. (See pod2text for entirely too simple an example.)
Reference to Simplify commands
- filename
- The argument contains the name of the pod file that is being parsed. These
will be present at the start of each file. You should open an output file,
output headers, etc., based on this, and not when you start parsing.
- \*(N
- The end of the file. Each file will be ended before the next one begins, and
after all files are done with. You can do end processing here. The argument
is the same name as in "filename".
- setline
- This gives you a chance to record the "current" input line, probably for
debugging purposes. In this case, "current" means that the next command you
see that was derived from an input paragraph will start at the argument's
line in the file.
- \*(N
- Same as setline, but the byte offset in the input, instead of the line offset.
- pragma
- The argument contains the text of a pragma command.
- text
- The argument contains a paragraph of formatted text.
- verbatim
- The argument contains a paragraph of verbatim text.
- cut
- A =cut command was hit. You shouldn't really need to listen for this one.
- index
- The argument contains an =index paragraph. (Note: Current =index commands are
not fed through, but turned into X<> commands.)
- head1
- head2
- The argument contains the argument from a header command.
- setindent
- If you are tracking indentation, use the argument to set the indentation level.
- listbegin
- Start a list environment. The argument is the type of list (1, 2, 3, or 0).
- listend
- Ends a list environment. Same argument as listbegin.
- listtype
- The argument is the type of list. You can just record the argument when you
see one of these, instead of paying attention to listbegin & listend.
- over
- The argument is the indentation. It's probably better to listen to the
"list..." commands.
- back
- Ends an "over" list. The argument is the original indentation.
- item
- The argument is the text of the =item command.
Note that all of these various commands you've seen are synchronized properly
so you don't have to pay attention to all at once, but they are all output
for your benefit. Consider the following example:
listtype 2
listbegin 2
setindent 4
over 4
item 1.
text Item #1
item 2.
text Item #2
setindent 0
listend 2
back 0
listtype 0
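To show how little of that stream a formatter has to care about, here is a small sketch that feeds the example above through a table of handlers. The stream is typed in by hand rather than produced by Simplify(Parse()), and the %handler table and its output layout are assumptions made up for this illustration.

```perl
use strict;
use warnings;

# The command/argument stream from the example above, written out by hand
# instead of being produced by Simplify(Parse()).
my @pod = (
    listtype  => 2,
    listbegin => 2,
    setindent => 4,
    over      => 4,
    item      => '1.',
    text      => 'Item #1',
    item      => '2.',
    text      => 'Item #2',
    setindent => 0,
    listend   => 2,
    back      => 0,
    listtype  => 0,
);

my $indent = 0;
my @output;

# Handlers for the commands this toy formatter cares about; every other
# command is silently ignored.
my %handler = (
    setindent => sub { $indent = $_[0] },
    item      => sub { push @output, (' ' x $indent) . $_[0] },
    text      => sub { push @output, (' ' x ($indent + 2)) . $_[0] },
);

while (@pod) {
    my ($cmd, $arg) = splice @pod, 0, 2;
    my $code = $handler{$cmd} or next;
    $code->($arg);
}

print "$_\n" for @output;
```

This one listens to setindent only and skips over, back, and the list... commands entirely, which is the kind of picking and choosing the synchronized output is meant to allow.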
Normalize
This command is normally invoked by Parse, so you shouldn't need to deal
with it. It just cleans up text a little, turning spare '<', '>', and '&'
characters into HTML escapes (&lt;, etc.) as well as generating warnings for
some pod formatting mistakes.
Normalize2
A little more aggressive formatting based on heuristics. Not applied by
default, as it might confuse your own heuristics.
%Escapes
This hash is exported from Pod::Parse, and contains default ASCII
translations for some common HTML escape sequences. You might like to use this
as a basis for an %HTML_Escapes hash in your own formatter.
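A formatter-side table along those lines might be used like this. The entries and the key format (bare escape names without the & and ;) are assumptions for the sketch; check the real %Escapes before relying on either, and note that unescape is a made-up helper name.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the exported hash; the real %Escapes may hold
# different or additional entries, possibly keyed differently.
my %HTML_Escapes = (
    amp  => '&',
    lt   => '<',
    gt   => '>',
    quot => '"',
);

# Replace each known &name; sequence with its ASCII translation and leave
# unknown names untouched.
sub unescape {
    my ($text) = @_;
    $text =~ s/&(\w+);/exists $HTML_Escapes{$1} ? $HTML_Escapes{$1} : "&$1;"/ge;
    return $text;
}

print unescape('a &lt; b &amp;&amp; b &gt; c &nbsp;'), "\n";
```

Leaving unknown names alone keeps the text round-trippable if a later stage knows more escapes than this table does.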