java.lang.Object org.htmlparser.lexer.Lexer
public class Lexer
This class parses the HTML stream into nodes. There are three major types of nodes (lexemes): Remark, Text and Tag.
Each time nextNode() is called, another node is returned until the stream is exhausted, at which point null is returned.
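The pull-style contract described above (call nextNode() until it returns null) can be illustrated with a self-contained toy. ToyLexer and lexAll are hypothetical names, not part of org.htmlparser; the class recognises the three lexeme kinds in the simplest possible way and returns labelled strings instead of Node objects, so it is a sketch of the calling pattern rather than of the real lexer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a lexer: NOT the org.htmlparser implementation.
class ToyLexer {
    private final String source;
    private int position;

    ToyLexer(String source) {
        this.source = source;
    }

    /** Returns the next lexeme as "KIND:text", or null once the input is exhausted. */
    String nextNode() {
        if (position >= source.length()) {
            return null;
        }
        int start = position;
        if (source.startsWith("<!--", start)) {
            // Remark: runs to the first "-->" or to the end of input.
            int end = source.indexOf("-->", start + 4);
            position = (end < 0) ? source.length() : end + 3;
            return "REMARK:" + source.substring(start, position);
        }
        if (source.charAt(start) == '<') {
            // Tag: runs to the first '>' or to the end of input.
            int end = source.indexOf('>', start);
            position = (end < 0) ? source.length() : end + 1;
            return "TAG:" + source.substring(start, position);
        }
        // Text: runs up to the next '<' or to the end of input.
        int end = source.indexOf('<', start);
        position = (end < 0) ? source.length() : end;
        return "TEXT:" + source.substring(start, position);
    }

    /** Drains the lexer using the "loop until null" idiom. */
    static List<String> lexAll(String html) {
        ToyLexer lexer = new ToyLexer(html);
        List<String> nodes = new ArrayList<>();
        for (String node; (node = lexer.nextNode()) != null; ) {
            nodes.add(node);
        }
        return nodes;
    }
}
```

With the real class the loop has the same shape: construct a Lexer, then call nextNode() repeatedly and stop at the first null.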
Field Summary | |
---|---|
protected Cursor | mCursor - The current position on the page. |
protected static int | mDebugLineTrigger - Line number to trigger on. |
protected NodeFactory | mFactory - The factory for new nodes. |
protected Page | mPage - The page lexemes are retrieved from. |
static boolean | STRICT_REMARKS - Process remarks strictly flag. |
static String | VERSION_DATE - The date of the version ("Jun 10, 2006"). |
static double | VERSION_NUMBER - The floating point version number (1.6). |
static String | VERSION_STRING - The display version ("1.6 (Release Build Jun 10, 2006)"). |
static String | VERSION_TYPE - The type of version ("Release Build"). |
Constructor Summary | |
---|---|
Lexer() | Creates a new instance of a Lexer. |
Lexer(Page page) | Creates a new instance of a Lexer. |
Lexer(String text) | Creates a new instance of a Lexer. |
Lexer(URLConnection connection) | Creates a new instance of a Lexer. |
Method Summary | |
---|---|
Remark | createRemarkNode(Page page, int start, int end) - Create a new remark node. |
Text | createStringNode(Page page, int start, int end) - Create a new string node. |
Tag | createTagNode(Page page, int start, int end, Vector attributes) - Create a new tag node. |
String | getCurrentLine() - Get the current line. |
int | getCurrentLineNumber() - Get the current line number. |
Cursor | getCursor() - Get the current scanning position. |
NodeFactory | getNodeFactory() - Get the current node factory. |
Page | getPage() - Get the page this lexer is working on. |
int | getPosition() - Get the current cursor position. |
static String | getVersion() - Return the version string of this parser. |
static void | main(String[] args) - Mainline for command line operation. |
protected Node | makeRemark(int start, int end) - Create a remark node based on the current cursor and the one provided. |
protected Node | makeString(int start, int end) - Create a string node based on the current cursor and the one provided. |
protected Node | makeTag(int start, int end, Vector attributes) - Create a tag node based on the current cursor and the one provided. |
Node | nextNode() - Get the next node from the source. |
Node | nextNode(boolean quotesmart) - Get the next node from the source. |
Node | parseCDATA() - Return CDATA as a text node. |
Node | parseCDATA(boolean quotesmart) - Return CDATA as a text node. |
protected Node | parseJsp(int start) - Parse a java server page node. |
protected Node | parsePI(int start) - Parse an XML processing instruction. |
protected Node | parseRemark(int start, boolean quotesmart) - Parse a comment. |
protected Node | parseString(int start, boolean quotesmart) - Parse a string node. |
protected Node | parseTag(int start) - Parse a tag. |
void | reset() - Reset the lexer to start parsing from the beginning again. |
protected void | scanJIS(Cursor cursor) - Advance the cursor through a JIS escape sequence. |
void | setCursor(Cursor cursor) - Set the current scanning position. |
void | setNodeFactory(NodeFactory factory) - Set the current node factory. |
void | setPage(Page page) - Set the page this lexer is working on. |
void | setPosition(int position) - Set the current cursor position. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
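public static final double VERSION_NUMBER
The floating point version number (1.6).

public static final String VERSION_TYPE
The type of version ("Release Build").

public static final String VERSION_DATE
The date of the version ("Jun 10, 2006").

public static final String VERSION_STRING
The display version ("1.6 (Release Build Jun 10, 2006)").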
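
The display version looks like a composition of the other three constants in the form "[floating point number] ([build-type] [build-date])". The sketch below shows that composition with the values listed on this page. VersionStringSketch and compose are hypothetical names, and how the library actually builds VERSION_STRING is not shown in this documentation.

```java
// Hypothetical illustration of the documented version-string format.
class VersionStringSketch {
    // Values copied from the field descriptions on this page.
    static final double VERSION_NUMBER = 1.6;
    static final String VERSION_TYPE = "Release Build";
    static final String VERSION_DATE = "Jun 10, 2006";

    /** Builds "[floating point number] ([build-type] [build-date])". */
    static String compose(double number, String type, String date) {
        return number + " (" + type + " " + date + ")";
    }
}
```

Calling compose(VERSION_NUMBER, VERSION_TYPE, VERSION_DATE) yields the same text as the documented display version.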

public static boolean STRICT_REMARKS
Process remarks strictly flag. If true, remarks are not terminated by ---> or --!>, i.e. more than two dashes. If false, a more lax (and closer to typical browser handling) remark parsing is used. Default true.
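
One reading of the strict and lax rules above can be sketched as a small terminator search. RemarkTerminator and findEnd are hypothetical names, and the logic is an interpretation of the flag description, not the library's state machine.

```java
// Hypothetical sketch of strict versus lax remark termination.
class RemarkTerminator {
    /**
     * Returns the index just past the remark terminator, or -1 if none is found.
     * bodyStart is the index right after the opening "<!--".
     */
    static int findEnd(String s, int bodyStart, boolean strict) {
        // Start two characters in so that both closing dashes lie inside the body.
        for (int i = bodyStart + 2; i < s.length(); i++) {
            if (s.charAt(i) != '>') {
                continue;
            }
            boolean twoDashes = s.charAt(i - 1) == '-' && s.charAt(i - 2) == '-';
            if (twoDashes) {
                // Strict mode rejects "--->" (more than two dashes before '>').
                boolean extraDash = i - 3 >= bodyStart && s.charAt(i - 3) == '-';
                if (!strict || !extraDash) {
                    return i + 1;
                }
            } else if (!strict && i - 3 >= bodyStart && s.startsWith("--!", i - 3)) {
                // Lax mode also accepts "--!>".
                return i + 1;
            }
        }
        return -1;
    }
}
```

For the input "<!-- a ---> b -->", the lax search stops after the first '>' while the strict search continues to the final "-->".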
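
protected Page mPage
The page lexemes are retrieved from.

protected Cursor mCursor
The current position on the page.

protected NodeFactory mFactory
The factory for new nodes.

protected static int mDebugLineTrigger
Line number to trigger on. This is tested in each nextNode() call, as a debugging aid. Alter this value and set a breakpoint on the guarded statement. Remember, these line numbers are zero based, while most editors are one based.
See Also:
nextNode()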
Constructor Detail |
---|
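public Lexer()
Creates a new instance of a Lexer.

public Lexer(Page page)
Creates a new instance of a Lexer.
Parameters:
page - The page with HTML text.

public Lexer(String text)
Creates a new instance of a Lexer.
Parameters:
text - The text to parse.

public Lexer(URLConnection connection) throws ParserException
Creates a new instance of a Lexer.
Parameters:
connection - The URL to parse.
Throws:
ParserException - If an error occurs opening the connection.
Method Detail |
---|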
public static String getVersion()
Return the version string of this parser.
Returns:
A string of the form "[floating point number] ([build-type] [build-date])".
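
public Page getPage()
Get the page this lexer is working on.

public void setPage(Page page)
Set the page this lexer is working on.
Parameters:
page - The page that nodes will be read from.

public Cursor getCursor()
Get the current scanning position.

public void setCursor(Cursor cursor)
Set the current scanning position.
Parameters:
cursor - The lexer's new cursor position.

public NodeFactory getNodeFactory()
Get the current node factory.

public void setNodeFactory(NodeFactory factory)
Set the current node factory.
Parameters:
factory - The node factory to be used by the lexer.

public int getPosition()
Get the current cursor position.

public void setPosition(int position)
Set the current cursor position.
Parameters:
position - The new character offset into the source.

public int getCurrentLineNumber()
Get the current line number.

public String getCurrentLine()
Get the current line.

public void reset()
Reset the lexer to start parsing from the beginning again. After a reset, the next call to nextNode() will return the first lexeme on the page.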

public Node nextNode() throws ParserException
Get the next node from the source.
Returns:
The next node, or null if no more lexemes are present.
Throws:
ParserException - If there is a problem with the underlying page.

public Node nextNode(boolean quotesmart) throws ParserException
Get the next node from the source.
Parameters:
quotesmart - If true, strings ignore quoted contents.
Returns:
The next node, or null if no more lexemes are present.
Throws:
ParserException - If there is a problem with the underlying page.

public Node parseCDATA() throws ParserException
Return CDATA as a text node. From the HTML 4.01 Specification, appendix B.3.2 (Specifying non-HTML data), under "Element content":
When script or style data is the content of an element (SCRIPT and STYLE), the data begins immediately after the element start tag and ends at the first ETAGO ("</") delimiter followed by a name start character ([a-zA-Z]); note that this may not be the element's end tag. Authors should therefore escape "</" within the content. Escape mechanisms are specific to each scripting or style sheet language.
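The strict rule quoted above (the data ends at the first "</" followed by a letter) can be sketched as a plain scan. StrictCdataEnd and find are hypothetical names and this is not the library's code; it only shows why an embedded "</b>" inside a script string ends the data early under the strict definition.

```java
// Hypothetical sketch of the strict CDATA end rule from HTML 4.01.
class StrictCdataEnd {
    /** Returns the index of the first ETAGO followed by a name start character, or -1. */
    static int find(String s, int from) {
        for (int i = from; i + 2 < s.length(); i++) {
            if (s.charAt(i) == '<' && s.charAt(i + 1) == '/' && isNameStart(s.charAt(i + 2))) {
                return i;
            }
        }
        return -1;
    }

    /** Name start characters are the ASCII letters [a-zA-Z]. */
    static boolean isNameStart(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }
}
```

For the text document.write('</b>');</script> the scan stops at the "</b" inside the quoted string, not at the real end tag, which is the situation the specification warns authors about.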
Returns:
The TextNode of the CDATA or null if none.
Throws:
ParserException - If a problem occurs reading from the source.

public Node parseCDATA(boolean quotesmart) throws ParserException
Return CDATA as a text node. Less rigid than parseCDATA(), this method provides for parsing CDATA that may contain quoted strings that have embedded ETAGO ("</") delimiters and skips single and multiline comments.
Parameters:
quotesmart - If true, the strict definition of CDATA is extended to allow for single or double quoted ETAGO ("</") sequences.
Returns:
The TextNode of the CDATA or null if none.
Throws:
ParserException - If a problem occurs reading from the source.
See Also:
parseCDATA()
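
A quote-aware variant of the end-of-CDATA search might look like the sketch below. QuoteSmartCdataEnd, find and skipQuoted are hypothetical names; the handling of backslash escapes and of // and /* */ comments assumes JavaScript-like content and is an assumption, not something this page specifies.

```java
// Hypothetical sketch of a "quotesmart" CDATA end search.
class QuoteSmartCdataEnd {
    /** Returns the index of the first unquoted, uncommented ETAGO plus letter, or -1. */
    static int find(String s, int from) {
        int n = s.length();
        int i = from;
        while (i < n) {
            char c = s.charAt(i);
            if (c == '"' || c == '\'') {
                i = skipQuoted(s, i);                 // jump past the quoted string
            } else if (s.startsWith("//", i)) {
                int eol = s.indexOf('\n', i);         // single-line comment
                i = (eol < 0) ? n : eol + 1;
            } else if (s.startsWith("/*", i)) {
                int close = s.indexOf("*/", i + 2);   // multiline comment
                i = (close < 0) ? n : close + 2;
            } else if (c == '<' && i + 2 < n && s.charAt(i + 1) == '/'
                    && isNameStart(s.charAt(i + 2))) {
                return i;
            } else {
                i++;
            }
        }
        return -1;
    }

    /** Returns the index just past the closing quote, or the string length if unterminated. */
    static int skipQuoted(String s, int open) {
        char quote = s.charAt(open);
        int i = open + 1;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '\\') {
                i += 2;                               // skip the escaped character
            } else if (c == quote) {
                return i + 1;
            } else {
                i++;
            }
        }
        return s.length();
    }

    static boolean isNameStart(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }
}
```

Skipping comments matters because an apostrophe inside a comment would otherwise open a quoted string that never closes.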

public Text createStringNode(Page page, int start, int end)
Create a new string node.
Specified by:
createStringNode in interface NodeFactory
Parameters:
page - The page the node is on.
start - The beginning position of the string.
end - The ending position of the string.

public Remark createRemarkNode(Page page, int start, int end)
Create a new remark node.
Specified by:
createRemarkNode in interface NodeFactory
Parameters:
page - The page the node is on.
start - The beginning position of the remark.
end - The ending position of the remark.

public Tag createTagNode(Page page, int start, int end, Vector attributes)
Create a new tag node.
Specified by:
createTagNode in interface NodeFactory
Parameters:
page - The page the node is on.
start - The beginning position of the tag.
end - The ending position of the tag.
attributes - The attributes contained in this tag.

protected void scanJIS(Cursor cursor) throws ParserException
Advance the cursor through a JIS escape sequence.
Parameters:
cursor - A cursor positioned within the escape sequence.
Throws:
ParserException - If a problem occurs reading from the source.

protected Node parseString(int start, boolean quotesmart) throws ParserException
Parse a string node. If the input stream is exhausted, null is returned.
Parameters:
start - The position at which to start scanning.
quotesmart - If true, strings ignore quoted contents.
Throws:
ParserException - If a problem occurs reading from the source.

protected Node makeString(int start, int end) throws ParserException
Create a string node based on the current cursor and the one provided.
Parameters:
start - The starting point of the node.
end - The ending point of the node.
Throws:
ParserException - If the nodefactory creation of the text node fails.

protected Node parseTag(int start) throws ParserException
Parse a tag.
From the HTML 4.01 Specification, W3C Recommendation 24 December 1999 http://www.w3.org/TR/html4/intro/sgmltut.html#h-3.2.2
3.2.2 Attributes
Elements may have associated properties, called attributes, which may
have values (by default, or set by authors or scripts). Attribute/value
pairs appear before the final ">" of an element's start tag. Any number
of (legal) attribute value pairs, separated by spaces, may appear in an
element's start tag. They may appear in any order.
In this example, the id attribute is set for an H1 element:
<H1 id="section1">
This is an identified heading thanks to the id attribute
</H1>
By default, SGML requires that all attribute values be delimited using
either double quotation marks (ASCII decimal 34) or single quotation
marks (ASCII decimal 39). Single quote marks can be included within the
attribute value when the value is delimited by double quote marks, and
vice versa. Authors may also use numeric character references to
represent double quotes (&#34;) and single quotes (&#39;).
For double quotes authors can also use the character entity reference
&quot;.
In certain cases, authors may specify the value of an attribute without
any quotation marks. The attribute value may only contain letters
(a-z and A-Z), digits (0-9), hyphens (ASCII decimal 45),
periods (ASCII decimal 46), underscores (ASCII decimal 95),
and colons (ASCII decimal 58). We recommend using quotation marks even
when it is possible to eliminate them.
Attribute names are always case-insensitive.
Attribute values are generally case-insensitive. The definition of each
attribute in the reference manual indicates whether its value is
case-insensitive.
All the attributes defined by this specification are listed in the
attribute index.
This method uses a state machine to scan the tag.
The starting point for the various components is stored in an array
of integers that match the initiation point for the states one-for-one,
i.e. bookmarks[0] is where state 0 began, bookmarks[1] is where state 1
began, etc.
Attributes are stored in a Vector having one slot for each whitespace or attribute/value pair.
The first slot is for the tag name (stored kind of like a standalone attribute).
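The unquoted-value rule in the quoted specification text can be checked with a few character comparisons. UnquotedAttributeValue and canOmitQuotes are hypothetical names, and treating the empty string as needing quotes is an assumption of this sketch; the lexer itself may accept a wider range of unquoted values.

```java
// Hypothetical check for the HTML 4.01 rule on unquoted attribute values.
class UnquotedAttributeValue {
    /** True if every character is a letter, digit, hyphen, period, underscore or colon. */
    static boolean canOmitQuotes(String value) {
        if (value.isEmpty()) {
            return false; // assumption: an empty value must be written as ""
        }
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            boolean letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
            boolean digit = c >= '0' && c <= '9';
            boolean punctuation = c == '-' || c == '.' || c == '_' || c == ':';
            if (!(letter || digit || punctuation)) {
                return false;
            }
        }
        return true;
    }
}
```

A value such as section1 passes, while a value containing a space or a slash does not and would need quotation marks.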
Parameters:
start - The position at which to start scanning.
Throws:
ParserException - If a problem occurs reading from the source.

protected Node makeTag(int start, int end, Vector attributes) throws ParserException
Create a tag node based on the current cursor and the one provided.
Parameters:
start - The starting point of the node.
end - The ending point of the node.
attributes - The attributes parsed from the tag.
Throws:
ParserException - If the nodefactory creation of the tag node fails.

protected Node parseRemark(int start, boolean quotesmart) throws ParserException
Parse a comment.
From the HTML 4.01 Specification, W3C Recommendation 24 December 1999 http://www.w3.org/TR/html4/intro/sgmltut.html#h-3.2.4
3.2.4 Comments
HTML comments have the following syntax:
<!-- this is a comment -->
<!-- and so is this one,
which occupies more than one line -->
White space is not permitted between the markup declaration
open delimiter ("<!") and the comment open delimiter ("--"),
but is permitted between the comment close delimiter ("--") and
the markup declaration close delimiter (">").
A common error is to include a string of hyphens ("---") within a comment.
Authors should avoid putting two or more adjacent hyphens inside comments.
Information that appears between comments has no special meaning
(e.g., character references are not interpreted as such).
Note that comments are markup.
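The whitespace rules and the advice about adjacent hyphens in the quoted text can be approximated with a regular expression. SpecComment and isWellFormed are hypothetical names, and the pattern is a simplification of the specification's comment syntax, not the logic parseRemark uses.

```java
import java.util.regex.Pattern;

// Hypothetical approximation of the HTML 4.01 comment syntax rules.
class SpecComment {
    // "<!--" with no space after "<!", a body without adjacent hyphens,
    // then "--", optional white space, and the closing ">".
    private static final Pattern COMMENT =
            Pattern.compile("<!--(?:(?!--).)*--\\s*>", Pattern.DOTALL);

    /** True if the whole string is a single comment in the simplified syntax. */
    static boolean isWellFormed(String s) {
        return COMMENT.matcher(s).matches();
    }
}
```

Under this pattern a comment closed with "-- >" is accepted, while "<! -- x -->" and a body containing "---" are rejected.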
This method uses a state machine to scan the comment.
All comment text (everything excluding the < and >) is included in the remark text. We allow terminators like --!> even though this isn't part of the spec.
Parameters:
start - The position at which to start scanning.
quotesmart - If true, strings ignore quoted contents.
Throws:
ParserException - If a problem occurs reading from the source.

protected Node makeRemark(int start, int end) throws ParserException
Create a remark node based on the current cursor and the one provided.
Parameters:
start - The starting point of the node.
end - The ending point of the node.
Throws:
ParserException - If the nodefactory creation of the remark node fails.

protected Node parseJsp(int start) throws ParserException
Parse a java server page node. If the input stream is exhausted, null is returned.
Parameters:
start - The position at which to start scanning.
Throws:
ParserException - If a problem occurs reading from the source.

protected Node parsePI(int start) throws ParserException
Parse an XML processing instruction. If the input stream is exhausted, null is returned.
Parameters:
start - The position at which to start scanning.
Throws:
ParserException - If a problem occurs reading from the source.

public static void main(String[] args) throws MalformedURLException, ParserException
Mainline for command line operation.
Parameters:
args - [0] The URL to parse.
Throws:
MalformedURLException - If the provided URL cannot be resolved.
ParserException - If the parse fails.
© 2006 Derrick Oswald Sep 17, 2006
HTML Parser is an open source library released under Common Public License. |