Package org.htmlparser

The basic API classes which will be used by most developers when working with the HTML Parser.


Interface Summary
Node         Specifies the minimum requirements for nodes returned by the Lexer or Parser.
NodeFactory  This interface defines the methods needed to create new nodes.
NodeFilter   Implement this interface to select particular nodes.
Remark       This interface represents a comment in the HTML document.
Tag          This interface represents a tag (<xxx yyy="zzz">) in the HTML document.
Text         This interface represents a piece of the content of the HTML document.

Class Summary
Attribute                An attribute within a tag.
Parser                   The main parser class.
PrototypicalNodeFactory  A node factory based on the prototype pattern.

Package org.htmlparser Description

The Parser class is the main high-level class that provides simplified access to the contents of an HTML page. A wide range of methods is available to customize the operation of the Parser, as well as to access specific pieces of the page as Nodes.

The NodeFactory interface specifies what a developer must provide for the Parser or Lexer to generate nodes. Three types of nodes are required: Text, Remark and Tag. Tags contain lists of child nodes and attributes.

The only provided implementation of the NodeFactory interface is the PrototypicalNodeFactory, which holds example nodes and clones them as needed to satisfy the Parser's requests for nodes. By default, a Lexer is its own NodeFactory, returning new TextNode, RemarkNode and undifferentiated TagNode objects (see the nodes package), but when the Parser uses a Lexer it replaces this behaviour with a PrototypicalNodeFactory so that a rich set of specific tags is returned (see the tags package).
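
The prototype approach described above can be illustrated with a small, self-contained sketch. It does not use the library's classes: SketchTag, SketchLinkTag and SketchPrototypeFactory are hypothetical stand-ins invented for this example, and the real PrototypicalNodeFactory may differ in detail. The sketch only shows the idea of holding one example node per tag name, cloning it on request, and falling back to an undifferentiated tag when no prototype is registered.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a tag node; NOT the library's Tag interface.
class SketchTag implements Cloneable {
    private final String name;

    SketchTag(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }

    // Object.clone() preserves the runtime class, so subclasses clone correctly.
    @Override
    public SketchTag clone() {
        try {
            return (SketchTag) super.clone();
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // unreachable: the class is Cloneable
        }
    }
}

// Hypothetical specific tag type, standing in for the "rich set of specific tags".
class SketchLinkTag extends SketchTag {
    SketchLinkTag() {
        super("A");
    }
}

// Holds one example node per tag name and clones it when a node is requested.
class SketchPrototypeFactory {
    private final Map<String, SketchTag> prototypes = new HashMap<>();

    void register(SketchTag prototype) {
        prototypes.put(prototype.getName().toUpperCase(Locale.ROOT), prototype);
    }

    SketchTag createTag(String name) {
        String key = name.toUpperCase(Locale.ROOT);
        SketchTag prototype = prototypes.get(key);
        if (prototype == null) {
            // No prototype registered: return an undifferentiated tag.
            return new SketchTag(key);
        }
        return prototype.clone();
    }
}

class PrototypeFactoryDemo {
    public static void main(String[] args) {
        SketchPrototypeFactory factory = new SketchPrototypeFactory();
        factory.register(new SketchLinkTag());

        // A registered name yields a clone of the specific prototype.
        System.out.println(factory.createTag("a").getClass().getSimpleName());
        // An unregistered name yields the generic fallback.
        System.out.println(factory.createTag("div").getClass().getSimpleName());
    }
}
```

Cloning a stored example means the factory never needs to know the concrete class of each tag; registering a new prototype is enough to make the factory produce that type.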

The NodeFilter interface is used by the filtering code to determine whether a node meets certain criteria. Some generic examples of filters can be found in the filters package.
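
As a rough illustration of the filtering idea, the sketch below models a filter as an interface with a single accept-style method that answers yes or no for each node. SketchNode, SketchNodeFilter and FilterDemo are hypothetical names made up for this example, not the library's own types, and the select helper is an assumption about how filtering code might apply such a filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a parsed node; NOT the library's Node interface.
class SketchNode {
    final String kind; // "tag", "text" or "remark"
    final String text;

    SketchNode(String kind, String text) {
        this.kind = kind;
        this.text = text;
    }
}

// Hypothetical filter contract: accept or reject a single node.
interface SketchNodeFilter {
    boolean accept(SketchNode node);
}

class FilterDemo {
    // Returns, in original order, the nodes for which the filter answers true.
    static List<SketchNode> select(List<SketchNode> nodes, SketchNodeFilter filter) {
        List<SketchNode> matches = new ArrayList<>();
        for (SketchNode node : nodes) {
            if (filter.accept(node)) {
                matches.add(node);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<SketchNode> nodes = Arrays.asList(
                new SketchNode("tag", "a"),
                new SketchNode("text", "hello"),
                new SketchNode("remark", "a comment"),
                new SketchNode("tag", "img"));

        // Keep only tag nodes; the lambda implements the single accept method.
        SketchNodeFilter tagsOnly = node -> node.kind.equals("tag");

        for (SketchNode node : select(nodes, tagsOnly)) {
            System.out.println(node.text);
        }
    }
}
```

Because the contract is a single predicate, filters of this shape compose easily: a filter that combines two others only has to call both and combine the results.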

© 2005 Derrick Oswald
Jun 10, 2006

HTML Parser is an open source library released under LGPL.