org.htmlparser.tags (HTML Parser 2.0)

Package org.htmlparser.tags

The tags package contains specific tags.

Class Summary
AppletTag	AppletTag represents an <Applet> tag.
BaseHrefTag	BaseHrefTag represents a <Base> tag.
BodyTag	A Body Tag.
Bullet	A bullet tag.
BulletList	A bullet list tag.
CompositeTag	The base class for tags that have an end tag.
DefinitionList	A definition list tag (dl).
DefinitionListBullet	A definition list bullet tag (either DD or DT).
Div	A div tag.
DoctypeTag	The HTML Document Declaration Tag can identify <!DOCTYPE> tags.
FormTag	Represents a FORM tag.
FrameSetTag	Identifies a frame set tag.
FrameTag	Identifies a frame tag.
HeadingTag	A heading (h1 - h6) tag.
HeadTag	A head tag.
Html	An html tag.
ImageTag	Identifies an image tag.
InputTag	An input tag in a form.
JspTag	The JSP/ASP tags like <%...%> can be identified by this class.
LabelTag	A label tag.
LinkTag	Identifies a link tag.
MetaTag	A meta tag.
ObjectTag	ObjectTag represents an <Object> tag.
OptionTag	An option tag within a form.
ParagraphTag	A paragraph (p) tag.
ProcessingInstructionTag	The XML processing instructions like <?xml ...
ScriptTag	A script tag.
SelectTag	A select tag within a form.
Span	A span tag.
StyleTag	A StyleTag represents a <style> tag.
TableColumn	A table column tag.
TableHeader	A table header tag.
TableRow	A table row tag.
TableTag	A table tag.
TextareaTag	A text area tag within a form.
TitleTag	A title tag.

Package org.htmlparser.tags Description

The tags package contains specific tags.

This package has implementations of tags that have functionality beyond the capability of a generic tag. For example, the <META> tag has methods to get the CONTENT and NAME attributes (although this could be done with generic attribute manipulation) and an implementation of doSemanticAction that alters the lexer's encoding.

The classes in this package have been added in an ad-hoc fashion: the most useful ones have existed for a long time, while some obvious ones are rather new. Please feel free to add your own custom tags and register them with the PrototypicalNodeFactory; they will then be treated like any built-in tag. In fact, tags do not need to reside in this package.

Custom Tags

Creating custom tags is fairly straightforward. Simply copy one of the simpler tags you find in this package and alter it as follows.

If the tag can contain other nodes, e.g. <h1>My Heading</h1>, then it should derive from (i.e. be a subclass of) CompositeTag. In this way it will inherit the CompositeTagScanner, and nodes between the start and end tag will be gathered into the list of children. Most of the tags in this package derive from CompositeTag, which is why the nodes returned from the Parser are nested.

If it is a simple tag, e.g. <br>, then it should derive from TagNode. See for example MetaTag or ImageTag.

To be registered with PrototypicalNodeFactory.registerTag(org.htmlparser.Tag), and especially if it is a composite tag, the tag needs to implement getIds(), which returns the UPPERCASE list of names for the tag (usually only one), for example "HTML". If the tag can know which other tags cannot be contained within it, it should also implement getEnders(), which returns the list of other tags that should cause this tag to close itself, and getEndTagEnders(), which returns the list of end tags (i.e. </xxx>), other than its own name, that should cause this tag to close itself. When these 'ender' lists cause a tag to end before seeing its own end tag, a virtual end tag is created and 'inserted' at the location where the end tag should have been. These virtual end tags can be distinguished because their starting and ending locations are the same (i.e. they take up no character length in the HTML stream).

For example, the <OPTION> tag from a form can be prematurely ended by any of <INPUT>, <TEXTAREA>, <SELECT>, or another <OPTION> tag. These are the tags in the getEnders() list. It can also be prematurely ended by </SELECT>, </FORM>, </BODY>, or </HTML>. These are the tags in the getEndTagEnders() list.
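The effect of the two ender lists can be illustrated with a small standalone model. This is only a sketch of the idea, not the library's CompositeTagScanner: the class EnderSketch, the method closeOptions, the flat token list, and the "(virtual)" marker are all hypothetical names invented for illustration. Only the two lists for <OPTION> are taken from the paragraph above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model, independent of org.htmlparser.
class EnderSketch {
    // Start tags that prematurely end an open <OPTION> (the getEnders() list).
    static final Set<String> ENDERS =
        new HashSet<>(Arrays.asList("INPUT", "TEXTAREA", "SELECT", "OPTION"));
    // End tags, other than </OPTION>, that end it (the getEndTagEnders() list).
    static final Set<String> END_TAG_ENDERS =
        new HashSet<>(Arrays.asList("SELECT", "FORM", "BODY", "HTML"));

    // Copies the tokens to the output, inserting a marker wherever an
    // open <OPTION> is closed by an ender instead of its own end tag.
    static List<String> closeOptions(List<String> tokens) {
        List<String> out = new ArrayList<>();
        boolean open = false; // is an <OPTION> currently open?
        for (String t : tokens) {
            boolean isTag = t.startsWith("<");
            boolean isEnd = t.startsWith("</");
            String name = isTag
                ? t.substring(isEnd ? 2 : 1, t.length() - 1).toUpperCase()
                : null;
            if (open) {
                if (isEnd && name.equals("OPTION")) {
                    open = false; // closed by its real end tag
                } else if (isTag && (isEnd ? END_TAG_ENDERS : ENDERS).contains(name)) {
                    out.add("</OPTION>(virtual)"); // zero-length end tag
                    open = false;
                }
            }
            out.add(t);
            if (isTag && !isEnd && name.equals("OPTION")) {
                open = true;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
            "<SELECT>", "<OPTION>", "a", "<OPTION>", "b", "</SELECT>");
        System.out.println(closeOptions(tokens));
    }
}
```

In this model the first <OPTION> is closed by the second <OPTION> (an entry in the enders list), and the second is closed by </SELECT> (an entry in the end tag enders list). In both cases the marker is emitted before the token that triggered it, mirroring the description of a virtual end tag placed where the real one should have been.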

Other than that, any functionality is up to you. You should note that doSemanticAction() is called after the tag has been completely scanned (it has its children and end tag), but before its siblings further downstream have been scanned. If transformation is your purpose, this is the opportunity to alter the content, for example to set the link URL or lowercase the tag name.
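The ordering described above can be made concrete with a toy recursive scanner. This is a hedged sketch rather than the parser's actual code: the class SemanticActionOrder, its methods, and the log strings are hypothetical, and the only thing borrowed from the library is the name doSemanticAction, used here as a log label.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical model of scan order, independent of org.htmlparser.
class SemanticActionOrder {
    static boolean isStartTag(String t) {
        return t.startsWith("<") && !t.startsWith("</") && t.endsWith(">");
    }

    static String nameOf(String t) {
        return t.substring(1, t.length() - 1);
    }

    // Consumes tokens up to the matching end tag, recursing into children,
    // then records the semantic action for this tag.
    static void scan(String name, Iterator<String> tokens, List<String> log) {
        log.add("scan start " + name);
        while (tokens.hasNext()) {
            String t = tokens.next();
            if (t.equals("</" + name + ">")) {
                break; // own end tag reached
            }
            if (isStartTag(t)) {
                scan(nameOf(t), tokens, log); // child tag
            }
            // plain text tokens are simply skipped in this model
        }
        log.add("doSemanticAction " + name);
    }

    // Scans a flat token list and returns the order of events.
    static List<String> run(List<String> tokens) {
        List<String> log = new ArrayList<>();
        Iterator<String> it = tokens.iterator();
        while (it.hasNext()) {
            String t = it.next();
            if (isStartTag(t)) {
                scan(nameOf(t), it, log);
            }
        }
        return log;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
            "<A>", "<B>", "hi", "</B>", "</A>", "<C>", "</C>");
        for (String line : run(tokens)) {
            System.out.println(line);
        }
    }
}
```

In the resulting log, the action for A appears after the action for its child B, but before the scan of its sibling C begins. That is the window in which a tag can safely transform its own content.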