java.lang.Object
  org.htmlparser.lexer.Page
public class Page
Represents the contents of an HTML page. Contains the source of characters and an index of line positions (each entry is the position of the first character of a line, not of the line separator that precedes it).
Field Summary | |
---|---|
static String | DEFAULT_CHARSET: The default charset. |
static String | DEFAULT_CONTENT_TYPE: The default content type. |
static char | EOF: Character value when the page is exhausted. |
protected String | mBaseUrl: The base URL for this page. |
protected URLConnection | mConnection: The connection this page is coming from or null. |
protected static ConnectionManager | mConnectionManager: Connection control (proxy, cookies, authorization). |
protected PageIndex | mIndex: Character positions of the first character in each line. |
protected Source | mSource: The source of characters. |
protected String | mUrl: The URL this page is coming from. |
Constructor Summary | |
---|---|
Page() | Construct an empty page. |
Page(InputStream stream, String charset) | Construct a page from a stream encoded with the given charset. |
Page(Source source) | Construct a page from a source. |
Page(String text) | Construct a page from the given string. |
Page(String text, String charset) | Construct a page from the given string. |
Page(URLConnection connection) | Construct a page reading from a URL connection. |
Method Summary | |
---|---|
void | close(): Close the page by destroying the source of characters. |
int | column(Cursor cursor): Get the column number for a cursor. |
int | column(int position): Get the column number for a character position. |
URL | constructUrl(String link, String base): Build a URL from the link and base provided using non-strict rules. |
URL | constructUrl(String link, String base, boolean strict): Build a URL from the link and base provided. |
protected void | finalize(): Clean up this page, releasing resources. |
static String | findCharset(String name, String fallback): Look up a character set name. |
String | getAbsoluteURL(String link): Create an absolute URL from a relative link. |
String | getAbsoluteURL(String link, boolean strict): Create an absolute URL from a relative link. |
String | getBaseUrl(): Gets the baseUrl. |
char | getCharacter(Cursor cursor): Read the character at the given cursor position. |
String | getCharset(String content): Get a CharacterSet name corresponding to a charset parameter. |
URLConnection | getConnection(): Get the connection, if any. |
static ConnectionManager | getConnectionManager(): Get the connection manager all Parsers use. |
String | getContentType(): Try to extract the content type from the HTTP header. |
String | getEncoding(): Get the current encoding being used. |
String | getLine(Cursor cursor): Get the text line the position of the cursor lies on. |
String | getLine(int position): Get the text line the given position lies on. |
Source | getSource(): Get the source this page is reading from. |
String | getText(): Get all text read so far from the source. |
void | getText(char[] array, int offset, int start, int end): Put the text identified by the given limits into the given array at the specified offset. |
String | getText(int start, int end): Get the text identified by the given limits. |
void | getText(StringBuffer buffer): Put all text read so far from the source into the given buffer. |
void | getText(StringBuffer buffer, int start, int end): Put the text identified by the given limits into the given buffer. |
String | getUrl(): Get the URL for this page. |
void | reset(): Reset the page by resetting the source of characters. |
int | row(Cursor cursor): Get the line number for a cursor. |
int | row(int position): Get the line number for a character position. |
void | setBaseUrl(String url): Sets the baseUrl. |
void | setConnection(URLConnection connection): Set the URLConnection to be used by this page. |
static void | setConnectionManager(ConnectionManager manager): Set the connection manager to use. |
void | setEncoding(String character_set): Begins reading from the source with the given character set. |
void | setUrl(String url): Set the URL for this page. |
String | toString(): Display some of this page as a string. |
void | ungetCharacter(Cursor cursor): Return a character. |
Methods inherited from class java.lang.Object |
---|
clone, equals, getClass, hashCode, notify, notifyAll, wait, wait, wait |
Field Detail |
---|
public static final String DEFAULT_CHARSET
The default charset, "ISO-8859-1"; see RFC 2616 (http://www.ietf.org/rfc/rfc2616.txt?number=2616) section 3.7.1. Another alias is "8859_1".
public static final String DEFAULT_CONTENT_TYPE
public static final char EOF
protected String mUrl
The URL this page is coming from: either getConnection().getURL().toExternalForm() or the value given to setUrl().
protected String mBaseUrl
protected Source mSource
protected PageIndex mIndex
protected transient URLConnection mConnection
The connection this page is coming from, or null.
protected static ConnectionManager mConnectionManager
Constructor Detail |
---|
public Page()
public Page(URLConnection connection) throws ParserException
connection
- A fully conditioned connection. The connect()
method will be called so it need not be connected yet.
ParserException
- An exception object wrapping a number of
possible error conditions, some of which are outlined below.
public Page(InputStream stream, String charset) throws UnsupportedEncodingException
stream
- The source of bytes.
charset
- The encoding used. If null, defaults to DEFAULT_CHARSET.
UnsupportedEncodingException
- If the given charset is not supported.
public Page(String text, String charset)
text
- The HTML text.
charset
- Optional. The character set encoding that will be reported by getEncoding(). If charset is null the default character set is used.
public Page(String text)
Construct a page from the given string. The encoding reported by getEncoding() is DEFAULT_CHARSET.
text
- The HTML text.
public Page(Source source)
source
- The source of characters.
Method Detail |
---|
public static ConnectionManager getConnectionManager()
public static void setConnectionManager(ConnectionManager manager)
manager
- The new connection manager.
public String getCharset(String content)
content
- A text line of the form:
text/html; charset=Shift_JIS
which is applicable both to the HTTP header field Content-Type and the meta tag http-equiv="Content-Type". Note this method also handles non-compliant quoted charset directives such as:
text/html; charset="UTF-8"
and
text/html; charset='UTF-8'
See Also: findCharset(java.lang.String, java.lang.String), DEFAULT_CHARSET
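The parsing described for getCharset can be illustrated with a small stand-alone sketch. It is not the library's implementation: the class CharsetParamSketch, the method extractCharset and the regular expression are hypothetical names chosen for illustration, and unlike the documented method this sketch returns the raw parameter value without passing it through findCharset.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not org.htmlparser code.
class CharsetParamSketch {
    // Mirrors the documented DEFAULT_CHARSET value.
    static final String DEFAULT_CHARSET = "ISO-8859-1";

    // Accepts charset=value, charset="value" and charset='value'.
    private static final Pattern CHARSET = Pattern.compile(
        "charset\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^;\\s\"']+))",
        Pattern.CASE_INSENSITIVE);

    /** Pull the charset parameter out of a Content-Type style string. */
    static String extractCharset(String content) {
        if (content == null) {
            return DEFAULT_CHARSET;
        }
        Matcher matcher = CHARSET.matcher(content);
        if (!matcher.find()) {
            return DEFAULT_CHARSET;
        }
        // Exactly one of the three groups is set, depending on the quoting used.
        for (int group = 1; group <= 3; group++) {
            String value = matcher.group(group);
            if (value != null && !value.trim().isEmpty()) {
                return value.trim();
            }
        }
        return DEFAULT_CHARSET;
    }

    public static void main(String[] args) {
        System.out.println(extractCharset("text/html; charset=Shift_JIS"));
        System.out.println(extractCharset("text/html; charset=\"UTF-8\""));
        System.out.println(extractCharset("text/html; charset='UTF-8'"));
        System.out.println(extractCharset("text/html"));
    }
}
```

The same string form appears in both the HTTP Content-Type header and the meta http-equiv tag, so one routine can serve both callers.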
public static String findCharset(String name, String fallback)
Look up a character set name using java.nio.charset. This uses reflection so the code will still run under prior JDKs, but in that case the fallback is always returned.
name
- The name to look up. One of the aliases for a character set.
fallback
- The name to return if the lookup fails.
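As a rough illustration of an alias lookup with a fallback, the sketch below calls java.nio.charset.Charset directly. This is an assumption made for brevity: the documented method goes through reflection so that it still loads on JDKs without java.nio.charset. The names CharsetLookupSketch and lookupCharset are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical sketch, not org.htmlparser code.
class CharsetLookupSketch {
    /** Resolve an alias to its canonical charset name, or return fallback. */
    static String lookupCharset(String name, String fallback) {
        if (name == null) {
            return fallback; // Charset.forName(null) would throw
        }
        try {
            return Charset.forName(name).name();
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            // Syntactically invalid name, or no such charset in this JVM.
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(lookupCharset("8859_1", "ISO-8859-1"));
        System.out.println(lookupCharset("no-such-charset-xyz", "ISO-8859-1"));
    }
}
```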
public void reset()
public void close() throws IOException
IOException
- If destroying the source encounters an error.
protected void finalize() throws Throwable
Clean up this page, releasing resources. Calls close().
Overrides: finalize in class Object
Throwable
- if close() throws an IOException.
public URLConnection getConnection()
public void setConnection(URLConnection connection) throws ParserException
connection
- The connection to use. It will be connected by this method.
ParserException
- If the connect() method fails, or an I/O error occurs opening the input stream or the character set designated in the HTTP header is unsupported.
public String getUrl()
Get the URL for this page. This is only available if the page has a connection (getConnection() returns non-null), or the document base has been set via a call to setUrl().
Returns: The URL for this page, or null if there is no connection or the document base has not been set.
public void setUrl(String url)
url
- The new URL.
public String getBaseUrl()
Returns: The base URL, or null if not set.
public void setBaseUrl(String url)
url
- The base url for this page.
public Source getSource()
public String getContentType()
public char getCharacter(Cursor cursor) throws ParserException
cursor
- The position to read at.
ParserException
- If an IOException on the underlying source occurs, or an attempt is made to read characters in the future (the cursor position is ahead of the underlying stream).
public void ungetCharacter(Cursor cursor) throws ParserException
cursor
- The position to 'unread' at.
ParserException
- If an IOException on the underlying source occurs.
public String getEncoding()
public void setEncoding(String character_set) throws ParserException
Some magic happens here to obtain this result if characters have already been consumed from this page. Since a Reader cannot be dynamically altered to use a different character set, the underlying stream is reset, a new Source is constructed and a comparison made of the characters read so far with the newly read characters up to the current position. If a difference is encountered, or some other problem occurs, an exception is thrown.
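The re-read-and-compare step described above can be sketched in isolation. The example is a simplification under stated assumptions: the whole input is held in a byte array rather than a resettable stream, the names ReencodeSketch and switchEncoding are hypothetical, and an unchecked IllegalStateException stands in for the ParserException the documented method throws.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not org.htmlparser code.
class ReencodeSketch {
    /**
     * Decode bytes again with newCharset and check that the characters
     * already handed out (consumed) are unchanged under the new decoding.
     * Returns the characters that follow the consumed prefix.
     */
    static String switchEncoding(byte[] bytes, String consumed, Charset newCharset) {
        String redecoded = new String(bytes, newCharset);
        if (redecoded.length() < consumed.length()) {
            throw new IllegalStateException(
                "new encoding yields fewer characters than already consumed");
        }
        for (int i = 0; i < consumed.length(); i++) {
            if (redecoded.charAt(i) != consumed.charAt(i)) {
                throw new IllegalStateException("character mismatch at position " + i);
            }
        }
        return redecoded.substring(consumed.length());
    }

    public static void main(String[] args) {
        byte[] bytes = "<html>\u00e9".getBytes(StandardCharsets.UTF_8);
        // The ASCII prefix decodes identically under ISO-8859-1 and UTF-8,
        // so switching after six characters succeeds.
        String rest = switchEncoding(bytes, "<html>", StandardCharsets.UTF_8);
        System.out.println(rest.length());
    }
}
```

Switching is only safe while the consumed prefix decodes identically under both character sets, which is typically the case when only ASCII markup has been read before the charset declaration.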
character_set
- The character set to use to convert bytes into
characters.
ParserException
- If a character mismatch occurs between characters already provided and those that would have been returned had the new character set been in effect from the beginning. An exception is also thrown if the underlying stream cannot be reset and re-read in this way.
public URL constructUrl(String link, String base) throws MalformedURLException
link
- The (relative) URI.
base
- The base URL of the page, either from the <BASE> tag or, if none, the URL the page is being fetched from.
MalformedURLException
- If creating the URL fails.
See Also: constructUrl(String, String, boolean)
public URL constructUrl(String link, String base, boolean strict) throws MalformedURLException
link
- The (relative) URI.
base
- The base URL of the page, either from the <BASE> tag or, if none, the URL the page is being fetched from.
strict
- If true a link starting with '?' is handled according to RFC 2396, otherwise the common interpretation of a query appended to the base is used instead.
MalformedURLException
- If creating the URL fails.
public String getAbsoluteURL(String link)
link
- The relative portion of a URL.
public String getAbsoluteURL(String link, boolean strict)
link
- The relative portion of a URL.
strict
- If true a link starting with '?' is handled according to RFC 2396, otherwise the common interpretation of a query appended to the base is used instead.
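One plausible reading of the strict flag is sketched below. It is an illustration, not the library's code: the names UrlResolveSketch and resolve are hypothetical, the strict branch is delegated to java.net.URI (which implements RFC 2396 resolution), and the non-strict branch assumes that "a query appended to the base" means replacing any existing query on the base with the link.

```java
import java.net.URI;

// Hypothetical sketch, not org.htmlparser code.
class UrlResolveSketch {
    /** Resolve link against base; a leading '?' is special-cased when not strict. */
    static String resolve(String link, String base, boolean strict) {
        if (!strict && link.startsWith("?")) {
            // Non-strict: drop any query already on the base, then append the link.
            int query = base.lastIndexOf('?');
            String stem = (query >= 0) ? base.substring(0, query) : base;
            return stem + link;
        }
        // Strict, or an ordinary link: let the JDK apply RFC 2396 resolution.
        return URI.create(base).resolve(link).toString();
    }

    public static void main(String[] args) {
        String base = "http://example.com/dir/list.html?page=1";
        System.out.println(resolve("?page=2", base, false));
        System.out.println(resolve("?page=2", base, true));
        System.out.println(resolve("other.html", base, true));
    }
}
```

Running main prints the two interpretations of the same query-only link side by side, which makes the difference between the modes easy to inspect.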
public int row(Cursor cursor)
cursor
- The character offset into the page.
public int row(int position)
position
- The character offset into the page.
public int column(Cursor cursor)
cursor
- The character offset into the page.
public int column(int position)
position
- The character offset into the page.
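Row and column lookups of this kind can be built on a sorted index of line-start positions, as the class description suggests. The sketch below is an assumption-laden illustration rather than the PageIndex implementation: LineIndexSketch is a hypothetical name, only '\n' is treated as a line separator, and rows and columns are zero based.

```java
import java.util.Arrays;

// Hypothetical sketch, not org.htmlparser code.
class LineIndexSketch {
    // Sorted positions of the first character of each line; entry 0 is always 0.
    private final int[] lineStarts;

    LineIndexSketch(String text) {
        int[] starts = new int[text.length() + 1];
        int count = 0;
        starts[count++] = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\n') {
                starts[count++] = i + 1; // the next line begins after the separator
            }
        }
        lineStarts = Arrays.copyOf(starts, count);
    }

    /** Zero-based line number containing the given character offset (offset >= 0). */
    int row(int position) {
        int found = Arrays.binarySearch(lineStarts, position);
        // Not found: binarySearch returns -(insertionPoint) - 1, and the
        // containing line is the one just before the insertion point.
        return found >= 0 ? found : -found - 2;
    }

    /** Zero-based column of the given character offset within its line. */
    int column(int position) {
        return position - lineStarts[row(position)];
    }

    public static void main(String[] args) {
        LineIndexSketch index = new LineIndexSketch("ab\ncd\nef");
        System.out.println(index.row(4) + "," + index.column(4));
    }
}
```

Storing the first position of each line, instead of the separator position, means the column is a single subtraction and no special case is needed for multi-character separators that end in '\n'.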
public String getText(int start, int end) throws IllegalArgumentException
start
- The starting position, zero based.
end
- The ending position (exclusive, i.e. the character at the ending position is not included), zero based.
Returns: The text from start to end.
IllegalArgumentException
- If an attempt is made to get characters ahead of the current source offset (character position).
See Also: getText(StringBuffer, int, int)
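The half-open range and the guard against reading ahead of the source can be shown with a minimal stand-in. ConsumedTextSketch and its append method are hypothetical; the real class reads from a Source, whereas this sketch keeps the characters consumed so far in a StringBuilder. Rejecting a negative start or an end before start with the same exception is a choice made for this sketch only.

```java
// Hypothetical sketch, not org.htmlparser code.
class ConsumedTextSketch {
    // Characters read from the source so far.
    private final StringBuilder consumed = new StringBuilder();

    /** Simulate the source delivering more characters. */
    void append(String chars) {
        consumed.append(chars);
    }

    /** Text from start (inclusive) to end (exclusive), both zero based. */
    String getText(int start, int end) {
        if (start > consumed.length() || end > consumed.length()) {
            throw new IllegalArgumentException(
                "attempt to read ahead of the source offset " + consumed.length());
        }
        if (start < 0 || end < start) {
            // Sketch-only choice: treat malformed ranges the same way.
            throw new IllegalArgumentException("invalid range " + start + ".." + end);
        }
        return consumed.substring(start, end);
    }

    public static void main(String[] args) {
        ConsumedTextSketch page = new ConsumedTextSketch();
        page.append("<html><body>");
        System.out.println(page.getText(1, 5));
    }
}
```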
public void getText(StringBuffer buffer, int start, int end) throws IllegalArgumentException
buffer
- The accumulator for the characters.
start
- The starting position, zero based.
end
- The ending position (exclusive, i.e. the character at the ending position is not included), zero based.
IllegalArgumentException
- If an attempt is made to get characters ahead of the current source offset (character position).
public String getText()
See Also: getText(StringBuffer)
public void getText(StringBuffer buffer)
buffer
- The accumulator for the characters.
See Also: getText(StringBuffer,int,int)
public void getText(char[] array, int offset, int start, int end) throws IllegalArgumentException
array
- The array of characters.
offset
- The starting position in the array where characters are to be placed.
start
- The starting position, zero based.
end
- The ending position (exclusive, i.e. the character at the ending position is not included), zero based.
IllegalArgumentException
- If an attempt is made to get characters ahead of the current source offset (character position).
public String getLine(Cursor cursor)
cursor
- The position to calculate for.
public String getLine(int position)
position
- The position to calculate for.
public String toString()
Overrides: toString in class Object
© 2006 Derrick Oswald Sep 17, 2006
HTML Parser is an open source library released under Common Public License. |