AbstractNode (HTML Parser 2.0)

org.htmlparser.nodes
Class AbstractNode

java.lang.Object
  org.htmlparser.nodes.AbstractNode

All Implemented Interfaces:: Serializable, Cloneable, Node

Direct Known Subclasses:: RemarkNode, TagNode, TextNode

public abstract class AbstractNode
extends Object
implements Node, Serializable

The abstract base class for all types of nodes (tags, text, remarks). This class provides the basic functionality to hold the Page, the starting and ending positions in the page, the parent, and the list of children.

See Also:: Serialized Form

Field Summary
`protected NodeList`	`children` The children of this node.
`protected Page`	`mPage` The page this node came from.
`protected int`	`nodeBegin` The beginning position of the node in the page.
`protected int`	`nodeEnd` The ending position of the node in the page.
`protected Node`	`parent` The parent of this node.

Constructor Summary
`AbstractNode(Page page, int start, int end)` Create an abstract node with the page positions given.

Method Summary
`abstract void`	`accept(NodeVisitor visitor)` Visit this node.
`Object`	`clone()` Clone this object.
`void`	`collectInto(NodeList list, NodeFilter filter)` Collect this node and its child nodes (if applicable) into the list parameter, provided the node satisfies the filtering criteria.
`void`	`doSemanticAction()` Perform the meaning of this tag.
`NodeList`	`getChildren()` Get the children of this node.
`int`	`getEndPosition()` Gets the ending position of the node.
`Node`	`getFirstChild()` Get the first child of this node.
`Node`	`getLastChild()` Get the last child of this node.
`Node`	`getNextSibling()` Get the next sibling to this node.
`Page`	`getPage()` Get the page this node came from.
`Node`	`getParent()` Get the parent of this node.
`Node`	`getPreviousSibling()` Get the previous sibling to this node.
`int`	`getStartPosition()` Gets the starting position of the node.
`String`	`getText()` Returns the text of the node.
`void`	`setChildren(NodeList children)` Set the children of this node.
`void`	`setEndPosition(int position)` Sets the ending position of the node.
`void`	`setPage(Page page)` Set the page this node came from.
`void`	`setParent(Node node)` Sets the parent of this node.
`void`	`setStartPosition(int position)` Sets the starting position of the node.
`void`	`setText(String text)` Sets the string contents of the node.
`String`	`toHtml()` Return the HTML for this node.
`abstract String`	`toHtml(boolean verbatim)` Return the HTML for this node.
`abstract String`	`toPlainTextString()` Returns a string representation of the node.
`abstract String`	`toString()` Return a string representation of the node.

Methods inherited from class java.lang.Object
`equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait`

Field Detail

mPage

protected Page mPage

The page this node came from.

nodeBegin

protected int nodeBegin

The beginning position of the node in the page.

nodeEnd

protected int nodeEnd

The ending position of the node in the page.

parent

protected Node parent

The parent of this node.

children

protected NodeList children

The children of this node.

Constructor Detail

AbstractNode

public AbstractNode(Page page,
                    int start,
                    int end)

Create an abstract node with the page positions given. Remember the page and start & end cursor positions.

Parameters:: page - The page this tag was read from.; start - The starting offset of this node within the page.; end - The ending offset of this node within the page.

Method Detail

clone

public Object clone()
             throws CloneNotSupportedException

Clone this object. Exposes java.lang.Object clone as a public method.

Specified by:: clone in interface Node
Overrides:: clone in class Object

Returns:: A clone of this object.
Throws:: CloneNotSupportedException - This shouldn't be thrown since the Node interface extends Cloneable.
See Also:: Cloneable
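
The idea of exposing the protected java.lang.Object clone as a public method can be sketched in plain Java. The class Span below is a hypothetical stand-in, not an HTML Parser class, and the sketch only assumes the usual Object.clone() behaviour of a shallow, field-by-field copy.

```java
class CloneSketch {
    /** Hypothetical stand-in for a node; not part of HTML Parser. */
    static class Span implements Cloneable {
        int begin;
        int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }

        /** Widens Object.clone() from protected to public. */
        @Override
        public Object clone() throws CloneNotSupportedException {
            return super.clone(); // shallow, field-by-field copy
        }
    }

    public static void main(String[] args) throws CloneNotSupportedException {
        Span original = new Span(3, 9);
        Span copy = (Span) original.clone();
        copy.end = 12; // changing the copy leaves the original untouched
        System.out.println(original.end + " " + copy.end); // prints 9 12
    }
}
```

Because Object.clone() is shallow, a class that relies on it unchanged shares any object references (such as a parent or a child list) between the original and the copy.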

toPlainTextString

public abstract String toPlainTextString()

Returns a string representation of the node. It allows a simple string transformation of a web page, regardless of node type.
Typical application code (for extracting only the text from a web page) would then be simplified to:

 // parser.elements () returns a NodeIterator and may throw ParserException
 for (NodeIterator e = parser.elements (); e.hasMoreNodes (); )
 {
     Node node = e.nextNode ();
     System.out.println (node.toPlainTextString ());
     // or do whatever processing you wish with the plain text string
 }

Specified by:: toPlainTextString in interface Node

Returns:: The 'browser' content of this node.

toHtml

public String toHtml()

Return the HTML for this node. This should be the sequence of characters encountered by the parser that caused this node to be created. The exception is broken nodes (tags and remarks) that were encountered and fixed, for which the returned text can differ from the original input. Applications reproducing HTML can use this method on nodes that are to be used or transferred as they were received or created.

Specified by:: toHtml in interface Node

Returns:: The sequence of characters that would cause this node to be returned by the parser or lexer.

toHtml

public abstract String toHtml(boolean verbatim)

Return the HTML for this node. This should be the exact sequence of characters encountered by the parser that caused this node to be created. The exception is broken nodes (tags and remarks) that were encountered and fixed, for which the returned text can differ from the original input. Applications reproducing HTML can use this method on nodes that are to be used or transferred as they were received or created.

Specified by:: toHtml in interface Node

Parameters:: verbatim - If true return as close to the original page text as possible.
Returns:: The (exact) sequence of characters that would cause this node to be returned by the parser or lexer.

toString

public abstract String toString()

Return a string representation of the node. Subclasses must define this method, which is typically used in the manner:

System.out.println(node)

Specified by:: toString in interface Node
Overrides:: toString in class Object

Returns:: A textual representation of the node suitable for debugging

collectInto

public void collectInto(NodeList list,
                        NodeFilter filter)

Collect this node and its child nodes (if applicable) into the list parameter, provided the node satisfies the filtering criteria.

This mechanism allows powerful filtering code to be written very easily, without handling the collection of embedded tags separately. For example, it is not possible to get all the links on a page by looking only at the top-level nodes, because many tags (such as form tags) can contain links embedded in them. The links could be extracted by checking whether the current node is a CompositeTag and then going through its children; this method provides a convenient way to do that.

With collectInto(), programs get a lot shorter. The code to extract all links from a page looks like:

 NodeList collectionList = new NodeList();
 NodeFilter filter = new TagNameFilter ("A");
 for (NodeIterator e = parser.elements(); e.hasMoreNodes();)
      e.nextNode().collectInto(collectionList, filter);

Thus, collectionList will hold all the link nodes, irrespective of how deep the links are embedded.

Another way to accomplish the same objective is:

 NodeList collectionList = new NodeList();
 NodeFilter filter = new TagClassFilter (LinkTag.class);
 for (NodeIterator e = parser.elements(); e.hasMoreNodes();)
      e.nextNode().collectInto(collectionList, filter);

This is slightly less specific because the LinkTag class may be registered for more than one node name, e.g. <LINK> tags too.

Specified by:: collectInto in interface Node

Parameters:: list - The node list to collect acceptable nodes into.; filter - The filter to determine which nodes are retained.
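
The collection behaviour described above can be illustrated with a simplified, self-contained model. SketchNode is a hypothetical stand-in for a node, and java.util.function.Predicate stands in for NodeFilter; neither is an HTML Parser type. The sketch also puts the recursion in a single class for brevity, which is an assumption, since this page does not say which library class walks the children.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class CollectIntoSketch {
    /** Hypothetical, simplified stand-in for a node with children. */
    static class SketchNode {
        final String name;
        final List<SketchNode> children = new ArrayList<>();

        SketchNode(String name, SketchNode... kids) {
            this.name = name;
            for (SketchNode kid : kids) {
                children.add(kid);
            }
        }

        /** Adds this node if it passes the filter, then recurses into its children. */
        void collectInto(List<SketchNode> list, Predicate<SketchNode> filter) {
            if (filter.test(this)) {
                list.add(this);
            }
            for (SketchNode child : children) {
                child.collectInto(list, filter);
            }
        }
    }

    public static void main(String[] args) {
        // One link at the top level and one embedded inside a form.
        SketchNode page = new SketchNode("HTML",
                new SketchNode("A"),
                new SketchNode("FORM", new SketchNode("P", new SketchNode("A"))));
        List<SketchNode> links = new ArrayList<>();
        page.collectInto(links, n -> n.name.equals("A"));
        System.out.println(links.size()); // prints 2
    }
}
```

The caller makes a single call on the root and receives matches from every depth, which is why the loops shown earlier need no special handling for embedded tags.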

getPage

public Page getPage()

Get the page this node came from.

Specified by:: getPage in interface Node

Returns:: The page that supplied this node.
See Also:: Node.setPage(org.htmlparser.lexer.Page)

setPage

public void setPage(Page page)

Set the page this node came from.

Specified by:: setPage in interface Node

Parameters:: page - The page that supplied this node.
See Also:: Node.getPage()

getStartPosition

public int getStartPosition()

Gets the starting position of the node.

Specified by:: getStartPosition in interface Node

Returns:: The start position.
See Also:: Node.setStartPosition(int)

setStartPosition

public void setStartPosition(int position)

Sets the starting position of the node.

Specified by:: setStartPosition in interface Node

Parameters:: position - The new start position.
See Also:: Node.getStartPosition()

getEndPosition

public int getEndPosition()

Gets the ending position of the node.

Specified by:: getEndPosition in interface Node

Returns:: The end position.
See Also:: Node.setEndPosition(int)

setEndPosition

public void setEndPosition(int position)

Sets the ending position of the node.

Specified by:: setEndPosition in interface Node

Parameters:: position - The new end position.
See Also:: Node.getEndPosition()

accept

public abstract void accept(NodeVisitor visitor)

Visit this node.

Specified by:: accept in interface Node

Parameters:: visitor - The visitor that is visiting this node.
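
Since accept is abstract, each concrete node type decides how to respond to a visitor. A minimal sketch of that double-dispatch arrangement follows; SketchVisitor, Leaf, TextLeaf, RemarkLeaf and the callback names are all hypothetical and do not reproduce the NodeVisitor API.

```java
class VisitorSketch {
    /** Hypothetical visitor with one callback per node kind. */
    interface SketchVisitor {
        void visitText(TextLeaf text);
        void visitRemark(RemarkLeaf remark);
    }

    /** Hypothetical abstract node: each subclass picks the callback to invoke. */
    static abstract class Leaf {
        final String content;

        Leaf(String content) {
            this.content = content;
        }

        abstract void accept(SketchVisitor visitor);
    }

    static class TextLeaf extends Leaf {
        TextLeaf(String content) {
            super(content);
        }

        @Override
        void accept(SketchVisitor visitor) {
            visitor.visitText(this);
        }
    }

    static class RemarkLeaf extends Leaf {
        RemarkLeaf(String content) {
            super(content);
        }

        @Override
        void accept(SketchVisitor visitor) {
            visitor.visitRemark(this);
        }
    }

    /** Gathers text content and skips remarks. */
    static class TextGatherer implements SketchVisitor {
        final StringBuilder out = new StringBuilder();

        @Override
        public void visitText(TextLeaf text) {
            out.append(text.content);
        }

        @Override
        public void visitRemark(RemarkLeaf remark) {
            // remarks are deliberately ignored
        }
    }

    public static void main(String[] args) {
        Leaf[] nodes = {
            new TextLeaf("Hello, "), new RemarkLeaf("todo"), new TextLeaf("world")
        };
        TextGatherer gatherer = new TextGatherer();
        for (Leaf node : nodes) {
            node.accept(gatherer);
        }
        System.out.println(gatherer.out); // prints Hello, world
    }
}
```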

getParent

public Node getParent()

Get the parent of this node. This will always return null when parsing without scanners, i.e. if semantic parsing was not performed. The object returned from this method can be safely cast to a CompositeTag.

Specified by:: getParent in interface Node

Returns:: The parent of this node, if it's been set, null otherwise.
See Also:: Node.setParent(org.htmlparser.Node)

setParent

public void setParent(Node node)

Sets the parent of this node.

Specified by:: setParent in interface Node

Parameters:: node - The node that contains this node. Must be a CompositeTag.
See Also:: Node.getParent()

getChildren

public NodeList getChildren()

Get the children of this node.

Specified by:: getChildren in interface Node

Returns:: The list of children contained by this node, if it's been set, null otherwise.
See Also:: Node.setChildren(org.htmlparser.util.NodeList)

setChildren

public void setChildren(NodeList children)

Set the children of this node.

Specified by:: setChildren in interface Node

Parameters:: children - The new list of children this node contains.
See Also:: Node.getChildren()

getFirstChild

public Node getFirstChild()

Get the first child of this node.

Specified by:: getFirstChild in interface Node

Returns:: The first child in the list of children contained by this node, null otherwise.

getLastChild

public Node getLastChild()

Get the last child of this node.

Specified by:: getLastChild in interface Node

Returns:: The last child in the list of children contained by this node, null otherwise.

getPreviousSibling

public Node getPreviousSibling()

Get the previous sibling to this node.

Specified by:: getPreviousSibling in interface Node

Returns:: The previous sibling to this node if one exists, null otherwise.

getNextSibling

public Node getNextSibling()

Get the next sibling to this node.

Specified by:: getNextSibling in interface Node

Returns:: The next sibling to this node if one exists, null otherwise.
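
One way sibling lookup can be derived from the parent and its list of children is sketched below. SketchNode is a hypothetical stand-in, and the lookup strategy (an identity search in the parent's child list) is an assumption made for illustration, not a description of the library's code.

```java
import java.util.ArrayList;
import java.util.List;

class SiblingSketch {
    /** Hypothetical stand-in for a node that knows its parent and children. */
    static class SketchNode {
        final String name;
        SketchNode parent;
        final List<SketchNode> children = new ArrayList<>();

        SketchNode(String name) {
            this.name = name;
        }

        /** Appends a child and records this node as its parent. */
        void add(SketchNode child) {
            child.parent = this;
            children.add(child);
        }

        /** Position of this node in its parent's child list, or -1 if there is none. */
        private int indexInParent() {
            if (parent == null) {
                return -1;
            }
            for (int i = 0; i < parent.children.size(); i++) {
                if (parent.children.get(i) == this) {
                    return i;
                }
            }
            return -1;
        }

        SketchNode getPreviousSibling() {
            int i = indexInParent();
            return i > 0 ? parent.children.get(i - 1) : null;
        }

        SketchNode getNextSibling() {
            int i = indexInParent();
            return (i >= 0 && i + 1 < parent.children.size())
                    ? parent.children.get(i + 1) : null;
        }
    }

    public static void main(String[] args) {
        SketchNode root = new SketchNode("root");
        SketchNode first = new SketchNode("first");
        SketchNode second = new SketchNode("second");
        SketchNode third = new SketchNode("third");
        root.add(first);
        root.add(second);
        root.add(third);
        System.out.println(second.getPreviousSibling().name + " "
                + second.getNextSibling().name); // prints first third
        System.out.println(root.getNextSibling()); // prints null (no parent)
    }
}
```

In this model a node without a parent has no siblings, which mirrors the note under getParent that the parent is null when semantic parsing was not performed.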

getText

public String getText()

Returns the text of the node.

Specified by:: getText in interface Node

Returns:: The text of this node. The default is null.
See Also:: Node.setText(java.lang.String)

setText

public void setText(String text)

Sets the string contents of the node.

Specified by:: setText in interface Node

Parameters:: text - The new text for the node.
See Also:: Node.getText()

doSemanticAction

public void doSemanticAction()
                      throws ParserException

Perform the meaning of this tag. The default action is to do nothing.

Specified by:: doSemanticAction in interface Node

Throws:: ParserException - Not used. Provides for subclasses that may want to indicate an exceptional condition.

HTML Parser is an open source library released under Common Public License.
