HTML Parser Home Page
Class Summary | |
---|---|
Cursor | A bookmark in a page. |
InputStreamSource | A source of characters based on an InputStream such as from a URLConnection. |
Lexer | This class parses the HTML stream into nodes. |
Page | Represents the contents of an HTML page. |
PageAttribute | An attribute within a tag on a page. |
PageIndex | A sorted array of integers, the positions of the first characters of each line. |
Source | A buffered source of characters. |
Stream | Provides for asynchronous fetching from a stream. |
StringSource | A source of characters based on a String. |
The lexer package is the base level I/O subsystem.
The lexer package is responsible for reading characters from the HTML source and identifying the node lexemes. For example, the HTML code below would return the list of nodes shown:
<html><head><title>Humoresque</title></head>
<body bgcolor='silver'>
Passengers will please refrain from flushing toilets while the train is standing in the station. I love you!
<p>
We encourage constipation while the train is in the station If the train can't go then why should you.
</body>
</html>
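As a rough illustration of what identifying node lexemes means, the sketch below splits markup into alternating tag and text lexemes with a two-state scan. It is a toy, not the library's Lexer: the names ToyLexer and Lexeme are hypothetical, and it deliberately ignores comments, scripts and '>' characters inside quoted attribute values, all of which a real lexer has to handle.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy illustration only: splits markup into TAG and TEXT lexemes. Not the library's Lexer. */
final class ToyLexer {

    /** A lexeme with its kind and the half-open character range [start, end) it covers. */
    static final class Lexeme {
        final String kind;
        final int start;
        final int end;
        final String text;

        Lexeme(String kind, int start, int end, String text) {
            this.kind = kind;
            this.start = start;
            this.end = end;
            this.text = text;
        }
    }

    static List<Lexeme> lex(String html) {
        List<Lexeme> out = new ArrayList<>();
        int pos = 0;
        while (pos < html.length()) {
            int start = pos;
            if (html.charAt(pos) == '<') {
                // Tag state: consume up to and including the closing '>'.
                // An unterminated tag swallows the rest of the input.
                int close = html.indexOf('>', pos);
                pos = (close < 0) ? html.length() : close + 1;
                out.add(new Lexeme("TAG", start, pos, html.substring(start, pos)));
            } else {
                // Text state: consume until the next '<' or the end of input.
                int open = html.indexOf('<', pos);
                pos = (open < 0) ? html.length() : open;
                out.add(new Lexeme("TEXT", start, pos, html.substring(start, pos)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (Lexeme l : lex("<title>Humoresque</title>")) {
            System.out.println(l.kind + " [" + l.start + "," + l.end + ") " + l.text);
        }
    }
}
```

Each lexeme records its character range rather than only its text, which mirrors the idea that nodes refer back to positions in the page.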
Stream, Source, Page and Lexer
The package is arranged in four levels, Stream, Source, Page and Lexer, in the order of lowest to highest.
A Stream is raw bytes from the URLConnection or file. It has no intelligence. A Source is raw characters, hence it knows about the encoding scheme used and can be reset if a different encoding is detected after partially reading in the text. A Page provides characters from the source while maintaining the index of line numbers, and hence can be thought of as an array of strings corresponding to source file lines, but it doesn't actually store any text, relying on the buffering within the Source instead. The Lexer contains the actual lexeme parsing code. It reads characters from the page, keeping track of where it is with a Cursor and creates the array of nodes using various state machines.
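The idea of a page that indexes lines without storing their text can be sketched with a sorted array of line-start offsets, as the PageIndex summary above describes. The class below is a hypothetical, self-contained illustration (LineIndexSketch, row and column are names chosen here, not the library's API). It assumes lines end with '\n' and that offsets are non-negative.

```java
import java.util.Arrays;

/** Hypothetical sketch of a line index: sorted offsets of the first character of each line. */
final class LineIndexSketch {
    private final int[] lineStarts;

    LineIndexSketch(String text) {
        int[] starts = new int[text.length() + 1];
        int count = 0;
        starts[count++] = 0;                   // the first line always starts at offset 0
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\n') {
                starts[count++] = i + 1;       // the next line starts just after the newline
            }
        }
        lineStarts = Arrays.copyOf(starts, count);
    }

    /** Zero-based line containing the given character offset (offset must be >= 0). */
    int row(int offset) {
        int found = Arrays.binarySearch(lineStarts, offset);
        // On a miss, binarySearch returns -(insertionPoint) - 1;
        // the containing line is the one just before the insertion point.
        return found >= 0 ? found : -found - 2;
    }

    /** Zero-based column of the offset within its line. */
    int column(int offset) {
        return offset - lineStarts[row(offset)];
    }

    public static void main(String[] args) {
        LineIndexSketch index = new LineIndexSketch("<html>\n<body>\n</body>\n</html>");
        int offset = 8;                        // the 'b' of <body>
        System.out.println("row " + index.row(offset) + ", column " + index.column(offset));
    }
}
```

Because only the offsets are kept, a cursor can be a single integer position, and the row and column are recovered on demand by binary search while the characters themselves stay in the underlying buffer.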
The following are some design goals and 'invariants' within the package, if you are attempting to understand or modify it.
htmlparser.jar. In this way, simple parsing and output is handled with a jar file that is under 45 kilobytes, but anything beyond peephole manipulation, i.e. closing tag detection and other semantic reasoning, will need the full set of scanners, nodes and ancillary classes, which now stands at 210 kilobytes.
© 2006 Derrick Oswald Sep 17, 2006
HTML Parser is an open source library released under Common Public License.