java.lang.Object org.htmlparser.filters.RegexFilter
public class RegexFilter
This filter accepts all string nodes matching a regular expression.
Because this searches Text nodes, it is only useful for finding small fragments of text, where the match is unlikely to be broken up by a tag. To find large fragments of text, convert the page to plain text with something like the StringBean and then apply the regular expression.
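The limitation described above can be illustrated without the parser itself. The following is a minimal sketch that uses only `java.util.regex` and plain strings as stand-ins for Text nodes; the class name `FragmentDemo`, its helper methods, and the sample markup are hypothetical and not part of the HTML Parser API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class FragmentDemo {
    // Year-only pattern, used here purely for illustration.
    static final Pattern YEAR = Pattern.compile("(19|20)\\d\\d");

    // True if any single fragment contains a match, mimicking a per-node search.
    static boolean anyFragmentMatches(List<String> fragments) {
        for (String fragment : fragments) {
            if (YEAR.matcher(fragment).find()) {
                return true;
            }
        }
        return false;
    }

    // True if the fragments, joined into one plain-text string, contain a match.
    static boolean joinedTextMatches(List<String> fragments) {
        return YEAR.matcher(String.join("", fragments)).find();
    }

    public static void main(String[] args) {
        // Hypothetical text fragments for markup such as: Released in 20<b>06</b>.
        List<String> fragments = Arrays.asList("Released in 20", "06", ".");
        System.out.println(anyFragmentMatches(fragments)); // false
        System.out.println(joinedTextMatches(fragments));  // true
    }
}
```

The tag splits the year across two fragments, so no single fragment matches, while the joined plain text does.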
For example, to look for dates use:

    (19|20)\d\d([- \\/.](0[1-9]|1[012])[- \\/.](0[1-9]|[12][0-9]|3[01]))?

as in:

    Parser parser = new Parser ("http://cbc.ca");
    RegexFilter filter = new RegexFilter ("(19|20)\\d\\d([- \\\\/.](0[1-9]|1[012])[- \\\\/.](0[1-9]|[12][0-9]|3[01]))?");
    NodeIterator iterator = parser.extractAllNodesThatMatch (filter).elements ();

which matches a date in yyyy-mm-dd format between 1900-01-01 and 2099-12-31, with a choice of five separators: a dash, a space, either kind of slash, or a period.
The year is matched by (19|20)\d\d, which uses alternation to allow either 19 or 20 as the first two digits. The round brackets are mandatory, because without them the alternation would split the whole expression rather than just the first two digits.
The month is matched by 0[1-9]|1[012], again enclosed by round brackets to keep the two options together. By using character classes, the first option matches a number between 01 and 09, and the second matches 10, 11 or 12.
The day is matched by the last part of the regex, which consists of three options. The first matches the numbers 01 through 09, the second 10 through 29, and the third matches 30 or 31.
The day and month are optional, but must occur together because of the ()? bracketing after the year.
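To see how the date expression behaves on plain strings, here is a small sketch using only `java.util.regex`. The class name `DateRegexDemo`, the helper `firstDate`, and the sample inputs are assumptions made for illustration; only the pattern itself comes from the text above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateRegexDemo {
    // The date pattern from the class description, written as a Java string literal.
    static final Pattern DATE = Pattern.compile(
        "(19|20)\\d\\d([- \\\\/.](0[1-9]|1[012])[- \\\\/.](0[1-9]|[12][0-9]|3[01]))?");

    // Returns the first match found in the text, or null if there is none.
    static String firstDate(String text) {
        Matcher matcher = DATE.matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstDate("Posted 2006-09-17 by the editor")); // 2006-09-17
        System.out.println(firstDate("Archive for 1999/12/31"));          // 1999/12/31
        System.out.println(firstDate("Copyright 2006"));                  // 2006
        System.out.println(firstDate("Issue 2006-13-01"));                // 2006 (month 13 is rejected)
        System.out.println(firstDate("No year here"));                    // null
    }
}
```

Because the month and day group is optional, an invalid month does not prevent a match; the expression falls back to the year alone. The pattern also does not check calendar validity or require both separators to be the same, so strings such as 2006-02-31 or 2006-09/17 match in full.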
Field Summary | |
---|---|
static int | FIND: Use find() match strategy. |
static int | LOOKINGAT: Use lookingAt() match strategy. |
static int | MATCH: Use matches() match strategy. |
protected Pattern | mPattern: The compiled regular expression to search for. |
protected String | mPatternString: The regular expression to search for. |
protected int | mStrategy: The match strategy. |
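The three strategy constants are named after methods of `java.util.regex.Matcher`. The sketch below shows how those three methods differ on the same inputs; it uses the standard library only, and the class name `StrategyDemo`, its helpers, and the sample strings are hypothetical.

```java
import java.util.regex.Pattern;

class StrategyDemo {
    // Year-only pattern, used here purely for illustration.
    static final Pattern YEAR = Pattern.compile("(19|20)\\d\\d");

    // matches(): the entire input must match the pattern.
    static boolean matchesWhole(String text) {
        return YEAR.matcher(text).matches();
    }

    // lookingAt(): the input must match starting at its beginning.
    static boolean matchesPrefix(String text) {
        return YEAR.matcher(text).lookingAt();
    }

    // find(): any subsequence of the input may match.
    static boolean matchesAnywhere(String text) {
        return YEAR.matcher(text).find();
    }

    public static void main(String[] args) {
        String[] samples = { "2006", "2006 edition", "since 2006" };
        for (String sample : samples) {
            System.out.println(sample
                + " matches=" + matchesWhole(sample)
                + " lookingAt=" + matchesPrefix(sample)
                + " find=" + matchesAnywhere(sample));
        }
        // 2006 matches=true lookingAt=true find=true
        // 2006 edition matches=false lookingAt=true find=true
        // since 2006 matches=false lookingAt=false find=true
    }
}
```

For text nodes that contain surrounding words, find() is the most permissive of the three, which fits its role as the default strategy of the one-argument constructor.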
Constructor Summary | |
---|---|
RegexFilter() | Creates a new instance of RegexFilter that accepts string nodes matching the regular expression ".*" using the FIND strategy. |
RegexFilter(String pattern) | Creates a new instance of RegexFilter that accepts string nodes matching a regular expression using the FIND strategy. |
RegexFilter(String pattern, int strategy) | Creates a new instance of RegexFilter that accepts string nodes matching a regular expression. |
Method Summary | |
---|---|
boolean | accept(Node node): Accept string nodes that match the regular expression. |
String | getPattern(): Get the search pattern. |
int | getStrategy(): Get the search strategy. |
void | setPattern(String pattern): Set the search pattern. |
void | setStrategy(int strategy): Set the search strategy. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final int MATCH
public static final int LOOKINGAT
public static final int FIND
protected String mPatternString
protected Pattern mPattern
protected int mStrategy
See Also: RegexFilter(String, int)
Constructor Detail |
---|
public RegexFilter()

public RegexFilter(String pattern)
Parameters:
pattern - The pattern to search for.

public RegexFilter(String pattern, int strategy)
Parameters:
pattern - The pattern to search for.
strategy - The type of match:
MATCH: use the matches() method, which attempts to match the entire input sequence against the pattern
LOOKINGAT: use the lookingAt() method, which attempts to match the input sequence, starting at the beginning, against the pattern
FIND: use the find() method, which scans the input sequence looking for the next subsequence that matches the pattern

Method Detail |
---|
public String getPattern()

public void setPattern(String pattern)
Parameters:
pattern - The pattern to set.

public int getStrategy()

public void setStrategy(int strategy)
Parameters:
strategy - The strategy to use. One of MATCH, LOOKINGAT or FIND.

public boolean accept(Node node)
Specified by:
accept in interface NodeFilter
Parameters:
node - The node to check.
Returns:
true if the regular expression matches the text of the node, false otherwise.
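
The relationship between the pattern, the strategy, and accept can be sketched with a small stand-in class. This is not the library's implementation: `TextRegexFilter` is a hypothetical class that works on plain strings instead of `Node` objects, its constant values are invented for the example, and rejecting unknown strategies with an exception is an assumption of this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TextRegexFilter {
    // Hypothetical constant values; the real values are not given on this page.
    static final int MATCH = 1;
    static final int LOOKINGAT = 2;
    static final int FIND = 3;

    private String patternString; // the regular expression as text
    private Pattern pattern;      // the compiled form, reused on every accept() call
    private int strategy;

    TextRegexFilter(String pattern, int strategy) {
        setPattern(pattern);
        setStrategy(strategy);
    }

    String getPattern() {
        return patternString;
    }

    void setPattern(String pattern) {
        this.patternString = pattern;
        this.pattern = Pattern.compile(pattern);
    }

    int getStrategy() {
        return strategy;
    }

    void setStrategy(int strategy) {
        // Assumption of this sketch: unknown strategies are rejected.
        if (strategy != MATCH && strategy != LOOKINGAT && strategy != FIND) {
            throw new IllegalArgumentException("illegal strategy (" + strategy + ")");
        }
        this.strategy = strategy;
    }

    // Applies the Matcher method that corresponds to the current strategy.
    boolean accept(String text) {
        Matcher matcher = pattern.matcher(text);
        switch (strategy) {
            case MATCH:
                return matcher.matches();
            case LOOKINGAT:
                return matcher.lookingAt();
            default:
                return matcher.find();
        }
    }

    public static void main(String[] args) {
        TextRegexFilter filter = new TextRegexFilter("(19|20)\\d\\d", FIND);
        System.out.println(filter.accept("since 2006")); // true
        filter.setStrategy(MATCH);
        System.out.println(filter.accept("since 2006")); // false
    }
}
```

Compiling the pattern once in setPattern, rather than on each accept call, mirrors the separate mPatternString and mPattern fields listed above.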
© 2006 Derrick Oswald Sep 17, 2006
HTML Parser is an open source library released under Common Public License. |