java.lang.Object
  org.htmlparser.visitors.NodeVisitor
    org.htmlparser.beans.StringBean
public class StringBean extends NodeVisitor
Extract strings from a URL.
Text within <SCRIPT></SCRIPT> tags is removed.
The text within <PRE></PRE> tags is not altered.
The property Strings, which is the output property, is null until a URL is set. A typical usage is:

    StringBean sb = new StringBean ();
    sb.setLinks (false);
    sb.setReplaceNonBreakingSpaces (true);
    sb.setCollapse (true);
    sb.setURL ("http://www.netbeans.org"); // the HTTP is performed here
    String s = sb.getStrings ();

You can also use the StringBean as a NodeVisitor on your own parser, in which case you have to refetch your page if you change one of the properties, because changing a property resets the Strings property:

    StringBean sb = new StringBean ();
    Parser parser = new Parser ("http://cbc.ca");
    parser.visitAllNodesWith (sb);
    String s = sb.getStrings ();
    sb.setLinks (true);
    parser.reset ();
    parser.visitAllNodesWith (sb);
    String sl = sb.getStrings ();

According to Nick Burch, who contributed the patch, this is handy if you don't want StringBean to wander off and get the content itself, either because you already have it or because it's not on a website.
Field Summary | |
---|---|
protected StringBuffer | mBuffer: The buffer in which text is stored while traversing the HTML. |
protected boolean | mCollapse: If true, sequences of whitespace characters are replaced with a single space character. |
protected int | mCollapseState: The state of the collapse processing state machine. |
protected boolean | mIsPre: Set true when traversing a PRE tag. |
protected boolean | mIsScript: Set true when traversing a SCRIPT tag. |
protected boolean | mIsStyle: Set true when traversing a STYLE tag. |
protected boolean | mLinks: If true, the link URLs are embedded in the text output. |
protected Parser | mParser: The parser used to extract strings. |
protected PropertyChangeSupport | mPropertySupport: Bound property support. |
protected boolean | mReplaceSpace: If true, regular space characters are substituted for non-breaking spaces in the text output. |
protected String | mStrings: The strings extracted from the URL. |
static String | PROP_COLLAPSE_PROPERTY: Property name in event where the 'collapse whitespace' state changes. |
static String | PROP_CONNECTION_PROPERTY: Property name in event where the connection changes. |
static String | PROP_LINKS_PROPERTY: Property name in event where the 'embed links' state changes. |
static String | PROP_REPLACE_SPACE_PROPERTY: Property name in event where the 'replace non-breaking spaces' state changes. |
static String | PROP_STRINGS_PROPERTY: Property name in event where the URL contents change. |
static String | PROP_URL_PROPERTY: Property name in event where the URL changes. |
Constructor Summary | |
---|---|
StringBean(): Create a StringBean object. |
Method Summary | |
---|---|
void | addPropertyChangeListener(PropertyChangeListener listener): Add a PropertyChangeListener to the listener list. |
protected void | carriageReturn(): Appends a newline to the buffer if there isn't one there already. |
protected void | collapse(StringBuffer buffer, String string): Add the given text collapsing whitespace. |
protected String | extractStrings(): Extract the text from a page. |
boolean | getCollapse(): Get the current 'collapse whitespace' state. |
URLConnection | getConnection(): Get the current connection. |
boolean | getLinks(): Get the current 'include links' state. |
boolean | getReplaceNonBreakingSpaces(): Get the current 'replace non breaking spaces' state. |
String | getStrings(): Return the textual contents of the URL. |
String | getURL(): Get the current URL. |
static void | main(String[] args): Unit test. |
void | removePropertyChangeListener(PropertyChangeListener listener): Remove a PropertyChangeListener from the listener list. |
void | setCollapse(boolean collapse): Set the current 'collapse whitespace' state. |
void | setConnection(URLConnection connection): Set the parser's connection. |
void | setLinks(boolean links): Set the 'include links' state. |
void | setReplaceNonBreakingSpaces(boolean replace): Set the 'replace non breaking spaces' state. |
protected void | setStrings(): Fetch the URL contents. |
void | setURL(String url): Set the URL to extract strings from. |
protected void | updateStrings(String strings): Assign the Strings property, firing the property change. |
void | visitEndTag(Tag tag): Resets the state of the PRE and SCRIPT flags. |
void | visitStringNode(Text string): Appends the text to the output. |
void | visitTag(Tag tag): Appends a NEWLINE to the output if the tag breaks flow, and possibly sets the state of the PRE and SCRIPT flags. |
Methods inherited from class org.htmlparser.visitors.NodeVisitor |
---|
beginParsing, finishedParsing, shouldRecurseChildren, shouldRecurseSelf, visitRemarkNode |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final String PROP_STRINGS_PROPERTY
public static final String PROP_LINKS_PROPERTY
public static final String PROP_URL_PROPERTY
public static final String PROP_REPLACE_SPACE_PROPERTY
public static final String PROP_COLLAPSE_PROPERTY
public static final String PROP_CONNECTION_PROPERTY
protected PropertyChangeSupport mPropertySupport
protected Parser mParser
protected String mStrings
protected boolean mLinks
If true, the link URLs are embedded in the text output.
protected boolean mReplaceSpace
If true, regular space characters are substituted for non-breaking spaces in the text output.
protected boolean mCollapse
If true, sequences of whitespace characters are replaced with a single space character.
protected int mCollapseState
protected StringBuffer mBuffer
protected boolean mIsScript
Set true when traversing a SCRIPT tag.
protected boolean mIsPre
Set true when traversing a PRE tag.
protected boolean mIsStyle
Set true when traversing a STYLE tag.
Constructor Detail |
---|
public StringBean()
Create a StringBean object with the following defaults.
Links is set false, so text appears like a browser would display it, albeit without the colour or underline clues normally associated with a link.
ReplaceNonBreakingSpaces is set true, so that printing the text works, but the extra information regarding these formatting marks is available if you set it false.
Collapse is set true, so text appears compact like a browser would display it.
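As a rough illustration of what the ReplaceNonBreakingSpaces default means for the output, the sketch below substitutes a regular space for each non-breaking space. It is not StringBean's implementation: the class NbspSketch and its method are hypothetical names, and the sketch assumes that character references have already been decoded to the character '\u00a0'.

```java
// Hypothetical standalone sketch; not part of org.htmlparser.
class NbspSketch {
    /**
     * Returns the text with each non-breaking space replaced by a regular
     * space when replace is true, or the text unchanged when it is false.
     */
    static String replaceNonBreakingSpaces(String text, boolean replace) {
        return replace ? text.replace('\u00a0', '\u0020') : text;
    }

    public static void main(String[] args) {
        String raw = "10\u00a0km";
        // With replacement on, the result compares equal to plain "10 km".
        System.out.println(replaceNonBreakingSpaces(raw, true).equals("10 km"));
        // With replacement off, the non-breaking space is preserved.
        System.out.println(replaceNonBreakingSpaces(raw, false).equals(raw));
    }
}
```

With the flag off, the non-breaking space survives in the output, which is the "extra information" the constructor description refers to.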
Method Detail |
---|
protected void carriageReturn()
Appends a newline to the buffer if there isn't one there already.

protected void collapse(StringBuffer buffer, String string)
Add the given text collapsing whitespace. A state machine tracks the collapse state:
state 0: whitespace was the last emitted character
state 1: in whitespace
state 2: in word
A whitespace character moves us to state 1 and any other character moves us to state 2, except that state 0 stays in state 0 until a non-whitespace character arrives, and going from whitespace to word we emit a space before the character:
state \ input | whitespace | other-character |
---|---|---|
0 | 0 | 2 |
1 | 1 | space then 2 |
2 | 1 | 2 |
Parameters:
buffer - The buffer to append to.
string - The string to append.

protected String extractStrings() throws ParserException
Extract the text from a page.
Throws:
ParserException - If a parse error occurs.

protected void updateStrings(String strings)
Assign the Strings property, firing the property change.
Parameters:
strings - The new value of the Strings property.

protected void setStrings()
Fetch the URL contents.
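
The state table given for collapse can be sketched as a small standalone class. This is an illustration under stated assumptions rather than the library's source: CollapseSketch and isCollapsible are hypothetical names, StringBuilder stands in for StringBuffer, and the whitespace set is the one listed under getCollapse().

```java
// Hypothetical standalone sketch of the three-state collapse machine.
class CollapseSketch {
    // 0: whitespace was the last emitted character (also the start state)
    // 1: in whitespace, 2: in word
    private int state = 0;

    /** Whitespace characters as listed in the getCollapse() description. */
    static boolean isCollapsible(char c) {
        return c == '\u0020' || c == '\u0009' || c == '\u000C'
            || c == '\u200B' || c == '\r' || c == '\n';
    }

    /** Appends string to buffer, reducing each whitespace run to one space. */
    void collapse(StringBuilder buffer, String string) {
        for (int i = 0; i < string.length(); i++) {
            char c = string.charAt(i);
            if (isCollapsible(c)) {
                // State 0 stays in state 0; states 1 and 2 move to state 1.
                if (state != 0) {
                    state = 1;
                }
            } else {
                // Going from whitespace to word emits one space first.
                if (state == 1) {
                    buffer.append(' ');
                }
                state = 2;
                buffer.append(c);
            }
        }
    }

    public static void main(String[] args) {
        CollapseSketch sketch = new CollapseSketch();
        StringBuilder out = new StringBuilder();
        sketch.collapse(out, "  Hello \t\n world  ");
        sketch.collapse(out, "again");
        System.out.println(out); // prints "Hello world again"
    }
}
```

Because the state is kept in a field, as mCollapseState is, whitespace pending at the end of one text node still yields a single space before the first word of the next one.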
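
public void addPropertyChangeListener(PropertyChangeListener listener)
Add a PropertyChangeListener to the listener list.
Parameters:
listener - The PropertyChangeListener to be added.

public void removePropertyChangeListener(PropertyChangeListener listener)
Remove a PropertyChangeListener from the listener list.
Parameters:
listener - The PropertyChangeListener to be removed.

public String getStrings()
Return the textual contents of the URL.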
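
The bound-property pattern behind addPropertyChangeListener, removePropertyChangeListener and updateStrings can be sketched with java.beans.PropertyChangeSupport alone. The class BoundStringsSketch and the property name "strings" are assumptions made for illustration; the actual value of PROP_STRINGS_PROPERTY is not given on this page, and this is not StringBean's source.

```java
import java.beans.PropertyChangeListener;
import java.beans.PropertyChangeSupport;

// Hypothetical standalone sketch of a bean with a bound Strings property.
class BoundStringsSketch {
    // Assumed property name, for illustration only.
    static final String PROP_STRINGS = "strings";

    private final PropertyChangeSupport support = new PropertyChangeSupport(this);
    private String strings; // null until a value is assigned

    public void addPropertyChangeListener(PropertyChangeListener listener) {
        support.addPropertyChangeListener(listener);
    }

    public void removePropertyChangeListener(PropertyChangeListener listener) {
        support.removePropertyChangeListener(listener);
    }

    public String getStrings() {
        return strings;
    }

    /** Assigns the property and notifies listeners of the change. */
    public void updateStrings(String newValue) {
        String oldValue = strings;
        strings = newValue;
        // PropertyChangeSupport fires no event when old and new values
        // are equal and non-null.
        support.firePropertyChange(PROP_STRINGS, oldValue, newValue);
    }

    public static void main(String[] args) {
        BoundStringsSketch bean = new BoundStringsSketch();
        bean.addPropertyChangeListener(e ->
            System.out.println(e.getPropertyName() + ": "
                + e.getOldValue() + " -> " + e.getNewValue()));
        bean.updateStrings("first text");
        bean.updateStrings("second text");
    }
}
```

A listener registered this way is told each time the extracted text is reassigned, so it does not have to poll getStrings().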

public boolean getLinks()
Get the current 'include links' state.
Returns:
true if link text is included in the text extracted from the URL, false otherwise.

public void setLinks(boolean links)
Set the 'include links' state.
Parameters:
links - Use true if link text is to be included in the text extracted from the URL, false otherwise.

public String getURL()
Get the current URL.
Returns:
The current URL, or null if this property has not been set yet.

public void setURL(String url)
Set the URL to extract strings from.
Parameters:
url - The URL that text should be fetched from.

public boolean getReplaceNonBreakingSpaces()
Get the current 'replace non breaking spaces' state.
Returns:
true if non-breaking spaces (character '\u00a0', numeric character reference &#160; or character entity reference &nbsp;) are to be replaced with normal spaces (character '\u0020').

public void setReplaceNonBreakingSpaces(boolean replace)
Set the 'replace non breaking spaces' state.
Parameters:
replace - true if non-breaking spaces (character '\u00a0', numeric character reference &#160; or character entity reference &nbsp;) are to be replaced with normal spaces (character '\u0020').

public boolean getCollapse()
Get the current 'collapse whitespace' state. If true, this emulates the operation of browsers in interpreting text, where user agents should collapse input white space sequences when producing output inter-word space. See HTML specification section 9.1 White space http://www.w3.org/TR/html4/struct/text.html#h-9.1.
Returns:
true if sequences of whitespace (space '\u0020', tab '\u0009', form feed '\u000C', zero-width space '\u200B', carriage-return '\r' and NEWLINE '\n') are to be replaced with a single space.

public void setCollapse(boolean collapse)
Set the current 'collapse whitespace' state.
    setCollapse (getCollapse ());
Parameters:
collapse - If true, sequences of whitespace will be reduced to a single space.

public URLConnection getConnection()
Get the current connection.
Returns:
The current connection, or null if it hasn't been set or the parser hasn't been constructed yet.

public void setConnection(URLConnection connection)
Set the parser's connection.
Parameters:
connection - New value of property Connection.

public void visitStringNode(Text string)
Appends the text to the output.
Overrides:
visitStringNode in class NodeVisitor
Parameters:
string - The text node.

public void visitTag(Tag tag)
Appends a NEWLINE to the output if the tag breaks flow, and possibly sets the state of the PRE and SCRIPT flags.
Overrides:
visitTag in class NodeVisitor
Parameters:
tag - The tag to examine.

public void visitEndTag(Tag tag)
Resets the state of the PRE and SCRIPT flags.
Overrides:
visitEndTag in class NodeVisitor
Parameters:
tag - The end tag to process.

public static void main(String[] args)
Unit test.
Parameters:
args - Pass args[0] as the URL to process.
© 2006 Derrick Oswald Sep 17, 2006
HTML Parser is an open source library released under Common Public License. |