try
{
    ...
    parser.parse (...); // throws an EncodingChangeException
    ...
}
catch (EncodingChangeException ece)
{
    // ... do whatever is necessary to reset your state here
    try
    {
        // reset the parser
        parser.reset ();
        // try again with the encoding now in force
        parser.parse (...);
    }
    catch (ParserException pe)
    {
    }
}
catch (ParserException pe)
{
}
<form name="thisform" method="POST" action="/cgi-bin/whois.pl">
  <font face="arial,verdana,helvetica" size="2"> Search for : </font>
  <input type="text" Name="queryinput" size="20">
  <input type="submit"><br>
</form>
parser = new Parser ();
parser.setConnection (connection);
// ... do parser operations
import java.io.PrintWriter;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;

import org.htmlparser.beans.StringBean;

/**
 * WhoIs.java
 * Use POST to get information about an IP address from ws.arin.net.
 * Created on April 29, 2006, 11:06 PM
 */
public class WhoIs
{
    String mText; // text extracted from the response to the POST request

    /**
     * Creates a new instance of WhoIs.
     */
    public WhoIs (String ipaddress)
    {
        URL url;
        HttpURLConnection connection;
        StringBuffer buffer;
        PrintWriter out;
        StringBean bean;

        try
        {
            // from the 'action' (relative to the referring page)
            url = new URL ("http://ws.arin.net/cgi-bin/whois.pl");
            connection = (HttpURLConnection)url.openConnection ();
            connection.setRequestMethod ("POST");
            connection.setDoOutput (true);
            connection.setDoInput (true);
            connection.setUseCaches (false);

            // more or less of these may be required
            // see Request Header Definitions: http://www.ietf.org/rfc/rfc2616.txt
            connection.setRequestProperty ("Accept-Charset", "*");
            connection.setRequestProperty ("Referer", "http://ws.arin.net/cgi-bin/whois.pl");
            connection.setRequestProperty ("User-Agent", "WhoIs.java/1.0");

            buffer = new StringBuffer (1024);
            // 'input' fields separated by ampersands (&)
            buffer.append ("queryinput=");
            buffer.append (ipaddress);
            // etc.

            out = new PrintWriter (connection.getOutputStream ());
            out.print (buffer);
            out.close ();

            bean = new StringBean ();
            bean.setConnection (connection);
            mText = bean.getStrings ();
        }
        catch (Exception e)
        {
            mText = e.getMessage ();
        }
    }

    public String getText ()
    {
        return (mText);
    }

    /**
     * Program mainline.
     * @param args The ip address (dot notation) to look up.
     */
    public static void main (String[] args)
    {
        if (0 >= args.length)
            System.out.println ("Usage: java WhoIs <ipaddress>");
        else
            System.out.println (new WhoIs (args[0]).getText ());
    }
}
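The WhoIs example appends the field value to the POST body unchanged, which works for a dotted IP address but not for values containing spaces, ampersands or other reserved characters. The following is a minimal sketch of building the body with each name and value URL-encoded. It uses only the JDK; the class name FormBody and the second field are hypothetical and are not part of the HTML Parser library.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of the HTML Parser library. */
class FormBody
{
    /** URL-encodes a single name or value as UTF-8. */
    static String encodeValue (String value)
    {
        try
        {
            return (URLEncoder.encode (value, "UTF-8"));
        }
        catch (UnsupportedEncodingException uee)
        {
            // every Java platform is required to support UTF-8
            throw new IllegalStateException (uee);
        }
    }

    /** Joins the fields as name=value pairs separated by ampersands (&). */
    static String encode (Map<String, String> fields)
    {
        StringBuilder body = new StringBuilder ();
        for (Map.Entry<String, String> field : fields.entrySet ())
        {
            if (0 != body.length ())
                body.append ('&');
            body.append (encodeValue (field.getKey ()));
            body.append ('=');
            body.append (encodeValue (field.getValue ()));
        }
        return (body.toString ());
    }

    public static void main (String[] args)
    {
        Map<String, String> fields = new LinkedHashMap<String, String> ();
        fields.put ("queryinput", "192.0.2.1");
        fields.put ("comment", "a b&c"); // hypothetical second field
        // prints queryinput=192.0.2.1&comment=a+b%26c
        System.out.println (encode (fields));
    }
}
```

The resulting string could be written to the connection's output stream in place of the hand-built buffer.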
If you are using the Sun JVM, try using:
System.setProperty ("sun.net.client.defaultReadTimeout", "7000");
System.setProperty ("sun.net.client.defaultConnectTimeout", "7000");
This sets the socket connect and read timeouts to 7 seconds (the values are in milliseconds). When a timeout expires an I/O exception is thrown, which you will need to catch.
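As a per-connection alternative that does not depend on Sun-specific system properties, java.net.URLConnection has provided setConnectTimeout() and setReadTimeout() since Java 5. The sketch below assumes you open the connection yourself and then hand it to the parser with setConnection(); the class name Timeouts and the example URL are hypothetical.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

/** Hypothetical helper, not part of the HTML Parser library. */
class Timeouts
{
    /** Opens (but does not yet connect) a connection with both timeouts set. */
    static URLConnection open (String address, int millis) throws IOException
    {
        URLConnection connection = new URL (address).openConnection ();
        connection.setConnectTimeout (millis); // limit on establishing the connection
        connection.setReadTimeout (millis);    // limit on waiting for data once connected
        return (connection);
    }

    public static void main (String[] args) throws IOException
    {
        URLConnection connection = open ("http://example.com/", 7000);
        System.out.println (connection.getConnectTimeout ()); // prints 7000
        System.out.println (connection.getReadTimeout ());    // prints 7000
        // the connection could now be passed to parser.setConnection (connection)
    }
}
```

With these methods an expired timeout is reported as a java.net.SocketTimeoutException, which is a subclass of IOException.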
HTML authors often fail to close tags as the HTML standard requires, which makes it hard for the parser to decide where an element ends.
For this reason, not all possible tags are registered as composite tags (registration as a composite tag is what produces the 'parent/child' nesting relationship). A valid, less nested parse is considered better than a possibly invalid one.
You are free to add whatever nodes you like as composite nodes using the prototypical node factory paradigm. First create a class that derives from CompositeTag (copy and modify whichever existing tag is most like the one you want):
public class BoldTag extends CompositeTag
{
    private static final String[] mIds = new String[] {"B"};

    public BoldTag ()
    {
    }

    public String[] getIds ()
    {
        return (mIds);
    }

    public String[] getEnders ()
    {
        return (mIds);
    }

    public String[] getEndTagEnders ()
    {
        return (new String[0]);
    }
}
Then, register an instance of your node with a PrototypicalNodeFactory:
PrototypicalNodeFactory factory = new PrototypicalNodeFactory ();
factory.registerTag (new BoldTag ());
parser.setNodeFactory (factory);
The difficulty is detecting when the tag is missing the </B> it should have, so getEnders() and getEndTagEnders() should probably return longer lists of tag names. Enders are the names of opening tags that force an end tag to be generated, while EndTagEnders are the end tags (</xxx>) that force an end tag to be generated.
The parser sends warning and error messages to standard output by default. You might want to block these messages. To achieve this, use a different feedback object:
Parser parser = new Parser ("http://...", new DefaultParserFeedback (DefaultParserFeedback.QUIET));
The Parser class has a static member with just such a construction:
Parser parser = new Parser ("http://...", Parser.DEVNULL);
You can also switch the feedback to DEBUG mode, to get extra details.
Parser parser = new Parser ("http://...", new DefaultParserFeedback (DefaultParserFeedback.DEBUG));
To handle the feedback yourself, implement the ParserFeedback interface and its info(), warning() and error() methods.
The parser handles a tag ending with a slash as a normal Tag object. The Tag interface has a method, isEmptyXmlTag(), which returns true if the tag is such an empty XML tag (one with no end tag).
There is a JspTag class that handles "%", "%=" and "%@" tags, but not within tags or remarks. So the JSP tag inside <input type='<%= MyType %>'> would not be returned as a tag; it would instead be part of the text of the 'type' attribute. The same JSP tag within the text of the page would be returned as a JspTag.
Character positions are much easier to obtain than byte positions. Each tag returned by the parser or lexer has methods getStartPosition() and getEndPosition() which return the starting and ending character positions.
These can be converted to line and column numbers in a hypothetical text file using row() and column() methods on the Page object:
Page page = parser.getLexer ().getPage ();
int row = page.row (tag.getStartPosition ()); // note: zero based
int column = page.column (tag.getStartPosition ());
Converting a character position into a byte position depends on the character encoding used. For the ISO-8859-1 encoding the correspondence is one byte per character, but other encodings often use more than one byte per character. Perhaps the only safe way is to write all the characters, up to the character position of interest, to a suitably encoded writer on a stream, flush the writer and then examine the byte position of the underlying stream.
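The write-and-count approach just described can be sketched with JDK classes alone. This is an illustration under the assumption that you already hold the page text as a String; the class and method names are hypothetical and are not part of the HTML Parser library. Note that encodings which emit a byte order mark (such as "UTF-16") will include it in the count.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of the HTML Parser library. */
class ByteOffset
{
    /**
     * Returns the number of bytes needed to encode the first
     * charPosition characters of text in the given charset.
     */
    static int byteOffset (String text, int charPosition, Charset charset)
    {
        ByteArrayOutputStream stream = new ByteArrayOutputStream ();
        try
        {
            Writer writer = new OutputStreamWriter (stream, charset);
            writer.write (text, 0, charPosition); // characters before the position
            writer.flush ();                      // push encoded bytes to the stream
        }
        catch (IOException ioe)
        {
            // not expected when writing to an in-memory stream
            throw new IllegalStateException (ioe);
        }
        return (stream.size ());
    }

    public static void main (String[] args)
    {
        String text = "h\u00e9llo"; // 'e' with an acute accent at character position 1
        System.out.println (byteOffset (text, 2, StandardCharsets.UTF_8));      // prints 3
        System.out.println (byteOffset (text, 2, StandardCharsets.ISO_8859_1)); // prints 2
    }
}
```

In practice the character position would come from getStartPosition() or getEndPosition(), and the charset would be the encoding the page was actually read with.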