Class CompositeTagScanner

  extended by org.htmlparser.scanners.TagScanner
      extended by org.htmlparser.scanners.CompositeTagScanner
All Implemented Interfaces:
Serializable, Scanner
Direct Known Subclasses:
ScriptScanner, StyleScanner

public class CompositeTagScanner
extends TagScanner

The main scanning logic for nested tags. When asked to scan, this class gathers nodes into a hierarchy of tags.

See Also:
Serialized Form

Constructor Summary
CompositeTagScanner()
          Create a composite tag scanner.
Method Summary
protected  void addChild(Tag parent, Node child)
          Add a child to the given tag.
protected  Tag createVirtualEndTag(Tag tag, Lexer lexer, Page page, int position)
          Creates an end tag with the same name as the given tag.
protected  void finishTag(Tag tag, Lexer lexer)
          Finish off a tag.
 boolean isTagToBeEndedFor(Tag current, Tag tag)
          Determine if the current tag should be terminated by the given tag.
 Tag scan(Tag tag, Lexer lexer, NodeList stack)
          Collect the children.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Detail


public CompositeTagScanner()
Create a composite tag scanner.

Method Detail


public Tag scan(Tag tag,
                Lexer lexer,
                NodeList stack)
         throws ParserException
Collect the children.

An initial test is performed for an empty XML tag, in which case the start tag and end tag of the returned tag are the same and it has no children.

If it's not an empty XML tag, the lexer is repeatedly asked for subsequent nodes until an end tag is found or a node is encountered that matches the tag ender set or end tag ender set. In the latter case, a virtual end tag is created. Each node found that is not the end tag is added to the list of children. The end tag is special and not a child.

Nodes that also have a CompositeTagScanner as their scanner are recursed into, which provides the nested structure of an HTML page. This method operates in two possible modes, depending on a private boolean. It can recurse on the JVM stack, which has caused some overflow problems in the past, or it can use the supplied stack argument to nest scanning of child tags within itself. The former is left as an option in the code, mostly to help subsequent modifiers visualize what the internal nesting is doing.

Specified by:
scan in interface Scanner
Overrides:
scan in class TagScanner
Parameters:
tag - The tag this scanner is responsible for.
lexer - The source of subsequent nodes.
stack - The parse stack. May contain pending tags that enclose this tag.
Returns:
The resultant tag (may be unchanged).
Throws:
ParserException - if an unrecoverable problem occurs.
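
The stack-based mode described above can be illustrated with a minimal, self-contained sketch. It does not use the library: `TreeNode`, `StackScanSketch`, and the string token forms are hypothetical stand-ins, and the ender sets are simplified to one assumed rule (an end tag that matches an enclosing open tag also ends every tag nested inside it, and those inner tags are marked as virtually ended). It only shows how an explicit stack can replace recursion on the JVM stack.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical tree node; not the library's Node or Tag.
final class TreeNode {
    final String name;          // tag name, or the text itself for a text leaf
    final boolean text;         // true for text leaves
    final List<TreeNode> children = new ArrayList<>();
    boolean virtualEnd;         // set when the tag is closed without its own end tag

    TreeNode(String name, boolean text) {
        this.name = name;
        this.text = text;
    }

    @Override
    public String toString() {
        return text ? "\"" + name + "\"" : name + children;
    }
}

final class StackScanSketch {
    // Token forms assumed here: "<x>" opens x, "</x>" ends x,
    // "<x/>" is an empty XML tag, and anything else is text.
    static TreeNode scan(List<String> tokens) {
        TreeNode root = new TreeNode("ROOT", false);
        Deque<TreeNode> stack = new ArrayDeque<>();  // pending enclosing tags, innermost first
        stack.push(root);
        for (String t : tokens) {
            if (t.startsWith("</")) {
                String name = t.substring(2, t.length() - 1);
                boolean open = false;
                for (TreeNode n : stack) {
                    if (n != root && n.name.equals(name)) {
                        open = true;
                    }
                }
                if (open) {
                    // Tags nested inside the one being ended get a virtual end.
                    while (!stack.peek().name.equals(name)) {
                        stack.pop().virtualEnd = true;
                    }
                    stack.pop();  // the end tag itself is not added as a child
                }
                // A stray end tag with no open partner is dropped in this sketch.
            } else if (t.startsWith("<") && t.endsWith("/>")) {
                // Empty XML tag: no children, and nothing is pushed.
                stack.peek().children.add(new TreeNode(t.substring(1, t.length() - 2), false));
            } else if (t.startsWith("<")) {
                TreeNode tag = new TreeNode(t.substring(1, t.length() - 1), false);
                stack.peek().children.add(tag);
                stack.push(tag);  // nest instead of recursing
            } else {
                stack.peek().children.add(new TreeNode(t, true));
            }
        }
        while (stack.size() > 1) {
            stack.pop().virtualEnd = true;  // input ended with tags still open
        }
        return root;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("<div>", "<p>", "hi", "<br/>", "</div>", "tail");
        System.out.println(scan(tokens));
    }
}
```

In the example input, `</div>` arrives while `p` is still open, so `p` is closed with a virtual end and `div` with a real one. Because nesting depth is held in the `Deque`, deeply nested input grows the heap rather than the call stack.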


protected void addChild(Tag parent,
                        Node child)
Add a child to the given tag.

Parameters:
parent - The parent tag.
child - The child node.


protected void finishTag(Tag tag,
                         Lexer lexer)
                  throws ParserException
Finish off a tag. Perhaps add a virtual end tag. Set the end tag parent as this tag. Perform the semantic action.

Parameters:
tag - The tag to finish off.
lexer - A lexer positioned at the end of the tag.


protected Tag createVirtualEndTag(Tag tag,
                                  Lexer lexer,
                                  Page page,
                                  int position)
                           throws ParserException
Creates an end tag with the same name as the given tag.

Parameters:
tag - The tag to end.
lexer - The object containing the node factory.
page - The page the tag is on (virtually).
position - The offset into the page at which the tag is to be anchored.
Returns:
An end tag with the name '"/" + tag.getTagName()' and a start and end position at the given position. The fact that these positions are equal may be used to distinguish it as a virtual tag later on.
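
The equal-position convention can be shown with a small, self-contained sketch. `PositionedTag`, `VirtualEndTagSketch`, and `looksVirtual` are hypothetical names introduced for illustration only. They are not the library's `Tag`, `Lexer`, or `Page` types, and the lexer and page arguments are omitted.

```java
// Hypothetical stand-in for a tag with page offsets; not the library's Tag.
final class PositionedTag {
    final String rawName;   // "P" for a start tag, "/P" for an end tag
    final int start;        // offset of the first character
    final int end;          // offset just past the last character

    PositionedTag(String rawName, int start, int end) {
        this.rawName = rawName;
        this.start = start;
        this.end = end;
    }

    // Assumed heuristic: an end tag that occupies zero characters
    // cannot have come from the page text.
    boolean looksVirtual() {
        return rawName.startsWith("/") && start == end;
    }
}

final class VirtualEndTagSketch {
    // Build an end tag named "/" + the given tag's name, anchored at one offset.
    static PositionedTag createVirtualEndTag(PositionedTag tag, int position) {
        return new PositionedTag("/" + tag.rawName, position, position);
    }

    public static void main(String[] args) {
        PositionedTag open = new PositionedTag("P", 10, 13);    // "<P>" in the page
        PositionedTag real = new PositionedTag("/P", 40, 44);   // "</P>" in the page
        PositionedTag virtual = createVirtualEndTag(open, 40);
        System.out.println(virtual.rawName + " virtual=" + virtual.looksVirtual());
        System.out.println(real.rawName + " virtual=" + real.looksVirtual());
    }
}
```

An end tag read from the page spans at least a few characters, so its start and end offsets differ. Only a synthesized tag has both at the same offset.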


public final boolean isTagToBeEndedFor(Tag current,
                                       Tag tag)
Determine if the current tag should be terminated by the given tag. Examines the 'enders' or 'end tag enders' lists of the current tag for a match with the given tag. Which list is chosen depends on whether tag is an end tag ('end tag enders') or not ('enders').

Parameters:
current - The tag that might need to be ended.
tag - The candidate tag that might end the current one.
Returns:
true if the name of the given tag is a member of the appropriate list.
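
The list selection described above can be sketched without the library. In this minimal example, `SketchTag`, its fields, and the sample ender names are hypothetical stand-ins rather than the library's `Tag` interface, and the case-insensitive name comparison is an assumption.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a tag; not the library's Tag interface.
final class SketchTag {
    final String name;                // tag name without any leading slash
    final boolean endTag;             // true for an end tag
    final List<String> enders;        // start tag names that end this tag
    final List<String> endTagEnders;  // end tag names that end this tag

    SketchTag(String name, boolean endTag, List<String> enders, List<String> endTagEnders) {
        this.name = name;
        this.endTag = endTag;
        this.enders = enders;
        this.endTagEnders = endTagEnders;
    }
}

final class EnderSketch {
    // Pick the list by the kind of the candidate tag, then look for its name.
    static boolean isTagToBeEndedFor(SketchTag current, SketchTag tag) {
        List<String> candidates = tag.endTag ? current.endTagEnders : current.enders;
        for (String candidate : candidates) {
            if (candidate.equalsIgnoreCase(tag.name)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> none = Collections.emptyList();
        // Assumed example lists: a list item is ended by another LI, or by </UL> or </OL>.
        SketchTag li = new SketchTag("LI", false, Arrays.asList("LI"), Arrays.asList("UL", "OL"));
        SketchTag nextLi = new SketchTag("LI", false, none, none);
        SketchTag endUl = new SketchTag("UL", true, none, none);
        SketchTag startUl = new SketchTag("UL", false, none, none);
        System.out.println(isTagToBeEndedFor(li, nextLi));   // start tag, checked against enders
        System.out.println(isTagToBeEndedFor(li, endUl));    // end tag, checked against endTagEnders
        System.out.println(isTagToBeEndedFor(li, startUl));  // start tag UL is not in enders
    }
}
```

The same name can therefore give different answers depending on whether it arrives as a start tag or an end tag, as the last two calls show.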

© 2006 Derrick Oswald
Sep 17, 2006

HTML Parser is an open source library released under the Common Public License.