HTML Parser Frequently Asked Questions

Frequently Asked Questions

Why am I getting an EncodingChangeException?
How can I use POST to fetch a page?
Is there a way to force a timeout for delinquent pages?
Why aren't <B> and similar tags fully nested?
How can I block parser messages from appearing on stdout?
How does the parser deal with tags like <tag/>?
How is JSP parsed using the parser?
How do you find the byte offset from the beginning of a document for a tag?

Why am I getting an EncodingChangeException?

An EncodingChangeException is thrown to let you, the user, know that some nodes already handed out by the parser are incorrect according to an encoding directive in a <META> tag.

When a <META> tag with an encoding directive is encountered, the parser rescans the input up to the current position using the new encoding. If a different character results from interpreting the bytes with the new encoding, the exception is thrown.
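To see why a rescan can yield different characters, here is a minimal sketch that uses only the JDK, not the parser itself. The class and method names are hypothetical. It decodes the same bytes as ISO-8859-1 and as UTF-8 and reports whether the results differ, which is roughly the condition under which the exception would be raised.

```java
import java.nio.charset.StandardCharsets;

class EncodingMismatchDemo
{
    /** Returns true if the bytes decode to different text under the two encodings. */
    static boolean decodesDifferently (byte[] bytes)
    {
        String latin1 = new String (bytes, StandardCharsets.ISO_8859_1);
        String utf8 = new String (bytes, StandardCharsets.UTF_8);
        return (!latin1.equals (utf8));
    }

    public static void main (String[] args)
    {
        // pure ASCII reads the same in both encodings, so a rescan finds no change
        byte[] ascii = "<title>cafe</title>".getBytes (StandardCharsets.UTF_8);
        // the accented character occupies two bytes in UTF-8, so ISO-8859-1 sees two characters
        byte[] accented = "<title>caf\u00e9</title>".getBytes (StandardCharsets.UTF_8);
        System.out.println (decodesDifferently (ascii));    // false
        System.out.println (decodesDifferently (accented)); // true
    }
}
```

Pages that contain only ASCII before the <META> tag therefore never trigger the exception, even when the declared encoding differs from the default.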

If you are supplying the parser with your own input, as from a file, be sure to set the encoding if it is not the default (ISO-8859-1). You can do this on the Page, Lexer, or Parser objects.

If the parser is fetching the data for you, the problem is with the HTTP server, which should have sent the correct encoding as part of the Content-Type header string. Given that you have no control over the server, the only solution is to reattempt the parse with the new encoding.

After the exception is thrown, the parser has already set its encoding to the new value, so you should be able to just reset and reparse. See, for example, the handling in StringBean:

try
{
    // ... parser.parse (...) throws an EncodingChangeException ...
}
catch (EncodingChangeException ece)
{
    // ... do whatever is necessary to reset your state here ...
    try
    {
        // reset the parser
        parser.reset ();
        // try again with the encoding now in force
        parser.parse (...);
    }
    catch (ParserException pe)
    {
        // the second attempt failed too; handle or report it
    }
}
catch (ParserException pe)
{
    // some other parse failure; handle or report it
}

How can I use POST to fetch a page?

The standard HTTP request submitted by the parser is a GET. The usual request submitted by a form is a POST.

To illustrate how to use POST with the parser, we'll submit a form to the WHOIS database of the American Registry for Internet Numbers (ARIN).

Note: there is an equivalent GET form at http://ws.arin.net/whois.

See also:

RIPE http://www.ripe.net/perl/whois
APNIC http://www.apnic.net/apnic-bin/whois.pl
LACNIC http://lacnic.net/cgi-bin/lacnic/whois

On the ARIN web site, the page http://ws.arin.net/cgi-bin/whois.pl has the following FORM that asks for an IP address and returns the registry details:

<form name="thisform" method="POST" action="/cgi-bin/whois.pl">
<font face="arial,verdana,helvetica" size="2"> Search for : </font>
<input type="text" Name="queryinput" size="20">
<input type="submit"><br>
</form>

From this we determine that the METHOD is POST and the form should be submitted to /cgi-bin/whois.pl. This path is resolved relative to the page it is found on, so the form should be submitted to http://ws.arin.net/cgi-bin/whois.pl when the Submit input is clicked. The only INPUT element other than the Submit is a single text field named queryinput that is displayed 20 characters wide. Other types of input element are described in http://www.w3.org/TR/html4/interact/forms.html.

The basic operation is to pass a fully prepared HttpURLConnection connected to the POST target URL into the Parser, either in the constructor or via the setConnection() method. To condition the connection, use the setRequestMethod() method to set the POST operation, and the setRequestProperty() and other explicit method calls. Then write the input field(s) as an ampersand-separated string ("input1=value1&input2=value2&...") to a PrintWriter wrapped around the OutputStream returned by getOutputStream().
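Field values that contain spaces, ampersands or other reserved characters need to be URL-encoded before they are joined. The following is a minimal sketch of building such a request body with the JDK's URLEncoder. The FormBody class, its encode() method and the "note" field are hypothetical, and UTF-8 is an assumed form encoding.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class FormBody
{
    /** Joins the fields as name=value pairs separated by ampersands, URL-encoding each part. */
    static String encode (Map<String, String> fields)
    {
        StringBuilder body = new StringBuilder ();
        try
        {
            for (Map.Entry<String, String> field : fields.entrySet ())
            {
                if (0 != body.length ())
                    body.append ('&');
                body.append (URLEncoder.encode (field.getKey (), "UTF-8"));
                body.append ('=');
                body.append (URLEncoder.encode (field.getValue (), "UTF-8"));
            }
        }
        catch (UnsupportedEncodingException uee)
        {
            // every Java platform is required to support UTF-8
            throw new IllegalStateException (uee);
        }
        return (body.toString ());
    }

    public static void main (String[] args)
    {
        // a LinkedHashMap keeps the fields in insertion order
        Map<String, String> fields = new LinkedHashMap<String, String> ();
        fields.put ("queryinput", "192.0.2.1");
        fields.put ("note", "a b&c"); // hypothetical second field
        System.out.println (encode (fields)); // queryinput=192.0.2.1&note=a+b%26c
    }
}
```

The resulting string is what would be written to the connection's output stream before the connection is handed to the parser.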

The following sample program illustrates the principles using a StringBean, but the same code could be used with a Parser by replacing the last three lines in the try block with:

parser = new Parser ();
parser.setConnection (connection);
// ... do parser operations

import java.io.PrintWriter;
import java.net.HttpURLConnection;
import java.net.URL;

import org.htmlparser.beans.StringBean;

/**
 * WhoIs.java
 * Use POST to get information about an IP address from ws.arin.net.
 * Created on April 29, 2006, 11:06 PM
 */
public class WhoIs
{
    String mText; // text extracted from the response to the POST request

    /**
     * Creates a new instance of WhoIs.
     */
    public WhoIs (String ipaddress)
    {
        URL url;
        HttpURLConnection connection;
        StringBuffer buffer;
        PrintWriter out;
        StringBean bean;

        try
        {
            // from the 'action' (relative to the referring page)
            url = new URL ("http://ws.arin.net/cgi-bin/whois.pl");
            connection = (HttpURLConnection)url.openConnection ();
            connection.setRequestMethod ("POST");

            connection.setDoOutput (true);
            connection.setDoInput (true);
            connection.setUseCaches (false);

            // more or less of these may be required
            // see Request Header Definitions: http://www.ietf.org/rfc/rfc2616.txt
            connection.setRequestProperty ("Accept-Charset", "*");
            connection.setRequestProperty ("Referer", "http://ws.arin.net/cgi-bin/whois.pl");
            connection.setRequestProperty ("User-Agent", "WhoIs.java/1.0");

            buffer = new StringBuffer (1024);
            // 'input' fields separated by ampersands (&)
            buffer.append ("queryinput=");
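            // URL-encode the value in case it contains reserved characters
            buffer.append (java.net.URLEncoder.encode (ipaddress, "UTF-8"));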
            // etc.

            out = new PrintWriter (connection.getOutputStream ());
            out.print (buffer);
            out.close ();

            bean = new StringBean ();
            bean.setConnection (connection);
            mText = bean.getStrings ();
        }
        catch (Exception e)
        {
            mText = e.getMessage ();
        }

    }

    public String getText ()
    {
        return (mText);
    }

    /**
     * Program mainline.
     * @param args The ip address (dot notation) to look up.
     */
    public static void main (String[] args)
    {
        if (0 >= args.length)
            System.out.println ("Usage:  java WhoIs <ipaddress>");
        else
            System.out.println (new WhoIs (args[0]).getText ());
    }
}

Is there a way to force a timeout for delinquent pages?

If you are using the Sun JVM, try setting:

System.setProperty ("sun.net.client.defaultReadTimeout", "7000");
System.setProperty ("sun.net.client.defaultConnectTimeout", "7000");

in the mainline before starting your main application processing.

This sets the socket connect and read timeouts to 7 seconds. You will need to catch the I/O exceptions that are thrown when a timeout expires.
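The system properties above are specific to Sun's implementation and apply to every connection. As an alternative sketch, assuming Java 5 or later, the standard URLConnection methods setConnectTimeout() and setReadTimeout() set the limits on a single connection, which can then be handed to the parser with setConnection() as in the POST example. The TimeoutDemo class and its method are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;

class TimeoutDemo
{
    /** Prepares (but does not yet open) a connection with the given connect and read timeouts. */
    static URLConnection openWithTimeout (String address, int millis)
    {
        try
        {
            URLConnection connection = new URL (address).openConnection ();
            connection.setConnectTimeout (millis); // limit on establishing the socket
            connection.setReadTimeout (millis);    // limit on each blocking read
            return (connection);
        }
        catch (IOException ioe)
        {
            // includes MalformedURLException for a bad address
            throw new UncheckedIOException (ioe);
        }
    }

    public static void main (String[] args)
    {
        URLConnection connection = openWithTimeout ("http://www.example.com/", 7000);
        System.out.println (connection.getConnectTimeout ());
        System.out.println (connection.getReadTimeout ());
    }
}
```

When a timeout expires during the fetch, a java.net.SocketTimeoutException (a kind of IOException) is raised, so it still needs to be caught.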

Why aren't <B> and similar tags fully nested?

Authors are sometimes lazy and often fail to close some tags as required by the HTML standard. This causes some problems for the parser.

For this reason, not all possible tags are registered as composite tags; it is registration as a composite tag that generates the 'parent/child' nesting relationship. It is considered better to have a valid, less nested parse than a possibly invalid one.

You are free to add whatever nodes you like as composite nodes using the prototypical node factory paradigm. First create your class that derives from CompositeTag (copy and modify one of the existing tags that is most like your desired tag):

public class BoldTag extends CompositeTag
{
    private static final String[] mIds = new String[] {"B"};
    public BoldTag ()
    {
    }
    public String[] getIds ()
    {
        return (mIds);
    }
    public String[] getEnders ()
    {
        return (mIds);
    }
    public String[] getEndTagEnders ()
    {
        return (new String[0]);
    }
}

Then, register an instance of your node with a PrototypicalNodeFactory:

PrototypicalNodeFactory factory = new PrototypicalNodeFactory ();
factory.registerTag (new BoldTag ());
parser.setNodeFactory (factory);

The problem becomes detecting when the tag doesn't have a matching end tag like it should, so getEnders() and getEndTagEnders() should probably have a longer list of tag names. Enders are the tag names that force an end tag to be generated, while EndTagEnders are the end tags (</xxx>) that force an end tag to be generated.

How can I block parser messages from appearing on stdout?

The parser sends warning and error messages to standard output by default. You might want to block these messages. To achieve this, use a different feedback object:

Parser parser = new Parser ("http://...", new DefaultParserFeedback (DefaultParserFeedback.QUIET));

The Parser class has a static member with just such a construction:

Parser parser = new Parser ("http://...", Parser.DEVNULL);

You can also switch the feedback to DEBUG mode, to get extra details.

Parser parser = new Parser ("http://...", new DefaultParserFeedback (DefaultParserFeedback.DEBUG));

To handle the feedback yourself, implement the ParserFeedback interface by providing info(), warning() and error() methods.

How does the parser deal with tags like <tag/>?

The parser handles a tag ending with a slash as a normal Tag object. The Tag interface has a method, isEmptyXmlTag(), which returns true if the tag is such an empty XML tag (one with no end tag).

How is JSP parsed using the parser?

There is a JspTag class that handles "%", "%=" and "%@" tags, but not within tags or remarks. So the JSP tag within the tag <input type='<%= MyType %>'> would not be returned as a tag; it would instead be part of the text of the 'type' attribute. The same JSP tag within the text of the page would be returned as a JspTag.

How do you find the byte offset from the beginning of a document for a tag?

Character positions are much easier to obtain than byte positions. Each tag returned by the parser or lexer has methods getStartPosition() and getEndPosition() which return the starting and ending character positions.

These can be converted to line and column numbers in a hypothetical text file using row() and column() methods on the Page object:

Page page = parser.getLexer ().getPage ();
int row = page.row (tag.getStartPosition ()); // note: zero based
int column = page.column (tag.getStartPosition ());

Converting a character position into a byte position is dependent on the character encoding used. For the ISO-8859-1 encoding, the correspondence is one byte per character, but for other encodings, often more than one byte is used per character. Perhaps the only safe way is to write all the characters, up to the character position of interest, to a suitably encoded writer on a stream, flush the writer and then examine the byte position of the underlying stream.
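The approach described above can be sketched with the JDK alone. The ByteOffset class and its method are hypothetical names, and the sketch assumes you already hold the document text as a String. It ignores any byte order mark in the original file, and encodings such as UTF-16 may emit one of their own.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteOffset
{
    /** Returns the number of bytes needed to encode the first charPosition characters of text. */
    static int byteOffset (String text, int charPosition, Charset charset)
    {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream ();
        try
        {
            Writer writer = new OutputStreamWriter (bytes, charset);
            // write only the characters that precede the position of interest
            writer.write (text, 0, charPosition);
            writer.flush ();
        }
        catch (IOException ioe)
        {
            throw new UncheckedIOException (ioe);
        }
        // the size of the underlying stream is the byte position
        return (bytes.size ());
    }

    public static void main (String[] args)
    {
        String text = "caf\u00e9 <b>bold</b>";
        int position = text.indexOf ("<b>"); // character position 5
        System.out.println (byteOffset (text, position, StandardCharsets.ISO_8859_1)); // 5
        System.out.println (byteOffset (text, position, StandardCharsets.UTF_8));      // 6
    }
}
```

The character position passed in would be the value returned by getStartPosition() for the tag of interest.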