How to Use the HTML Parser libraries
Step 1: Java
Make sure that a Java Development Kit (JDK) is installed, not
just a Java Runtime Environment (JRE). If you are working in an IDE (Integrated
Development Environment), this is usually taken care of for you. If you are
using just a command line, you should see usage information when you type:
javac
Java versions greater than 1.2 are supported for the parser, and Java 1.1 for
the lexer. You can check your version with the command:
java -version
If you are using Java 5, you may need to pass the option "-source 1.3" to javac
to avoid some warnings.
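The version can also be checked from inside a program. The following is a minimal sketch, not part of the HTML Parser distribution; the class name JavaVersionCheck is made up for illustration, and it relies only on the standard "java.version" and "java.specification.version" system properties.

```java
// Minimal sketch: report which Java version the program is running on.
// JavaVersionCheck is a hypothetical name, not part of the HTML Parser libraries.
class JavaVersionCheck
{
    /** Returns the value of the standard "java.version" system property. */
    static String runtimeVersion ()
    {
        return System.getProperty ("java.version");
    }

    public static void main (String[] args)
    {
        System.out.println ("Java runtime version: " + runtimeVersion ());
        System.out.println ("Java specification version: "
            + System.getProperty ("java.specification.version"));
    }
}
```

Running it with the same java command you intend to use for the parser shows which runtime that command actually picks up.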
Step 2: Setting the CLASSPATH
To use the HTML Parser you will need to add htmlparser.jar and htmllexer.jar to the classpath.
If you are using an IDE, add both jars
to the list of jars/libraries used by your project.
NetBeans
- Right click on your project in the Projects Window (Ctrl-1) and choose Properties.
- In the Project Properties pane choose the Libraries view.
- Select the Compile tab.
- Click the Add Jar/Folder button.
- Browse to <htmlp_dir>/lib (where <htmlp_dir> is the
directory where you unzipped the distribution: xxx/HTMLParserProject-2.0),
select the htmlparser.jar and htmllexer.jar files and click OK.
Eclipse
- Right click on your project in the Package Explorer Window (Shift-Alt-Q + P) and choose Properties.
- In the Properties pane choose the Java Build Path view.
- Select the Libraries tab.
- Click the Add External Jars button.
- Browse to <htmlp_dir>/lib (where <htmlp_dir> is the
directory where you unzipped the distribution: xxx/HTMLParserProject-2.0),
select the htmlparser.jar and htmllexer.jar files and click OK.
Command Line
You can either add the jars to the CLASSPATH environment variable, or specify
them with the -classpath option each time on the command line:
Windows
set CLASSPATH=[htmlp_dir]\lib\htmlparser.jar;[htmlp_dir]\lib\htmllexer.jar;%CLASSPATH%
where [htmlp_dir] is the directory where you unzipped the distribution:
xxx\HTMLParserProject-2.0, or use:
javac -classpath [htmlp_dir]\lib\htmlparser.jar;[htmlp_dir]\lib\htmllexer.jar MyProgram.java
Linux
export CLASSPATH=[htmlp_dir]/lib/htmlparser.jar:[htmlp_dir]/lib/htmllexer.jar:$CLASSPATH
where [htmlp_dir] is the directory where you unzipped the distribution:
xxx/HTMLParserProject-2.0, or use:
javac -classpath [htmlp_dir]/lib/htmlparser.jar:[htmlp_dir]/lib/htmllexer.jar MyProgram.java
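The only differences between the Windows and Linux forms are the directory separator and the separator between classpath entries (";" versus ":"). As a rough illustration, the sketch below builds the same classpath string for whichever platform it runs on, using the standard File.separator and File.pathSeparator constants. The class name ClasspathBuilder and its method are hypothetical helpers, and the default directory is only a placeholder for wherever you unzipped the distribution.

```java
import java.io.File;

// Minimal sketch: build the classpath value for the two HTML Parser jars
// using the separators of the current platform.
// ClasspathBuilder is a hypothetical helper, not part of the distribution.
class ClasspathBuilder
{
    /** htmlpDir is the directory where the distribution was unzipped. */
    static String buildClasspath (String htmlpDir)
    {
        String lib = htmlpDir + File.separator + "lib" + File.separator;
        // File.pathSeparator is ";" on Windows and ":" on Linux.
        return lib + "htmlparser.jar" + File.pathSeparator + lib + "htmllexer.jar";
    }

    public static void main (String[] args)
    {
        // Placeholder default; pass your own directory as the first argument.
        String dir = args.length > 0 ? args[0] : "HTMLParserProject-2.0";
        System.out.println (buildClasspath (dir));
    }
}
```

The printed value can be pasted after -classpath or assigned to the CLASSPATH environment variable.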
Step 3: Import Necessary Classes
Whatever classes you use from the HTML Parser libraries will need to be
imported by your program. For example, the simplest usage is:
import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

class Test
{
    public static void main (String[] args)
    {
        try
        {
            // args[0] names the resource to parse (a URL or a file).
            Parser parser = new Parser (args[0]);
            // A null filter means no filtering: all nodes are returned.
            NodeList list = parser.parse (null);
            // Print the parsed nodes back out as HTML.
            System.out.println (list.toHtml ());
        }
        catch (ParserException pe)
        {
            pe.printStackTrace ();
        }
    }
}
Note that the import statements could also have been written:
import org.htmlparser.*;
import org.htmlparser.util.*;
Step 4: Compile & Run
Within an IDE the compile and execute steps are usually combined.
NetBeans
- From the Run menu select Run Main Project (F6).
Eclipse
- From the Run menu select Run..., browse to the main class, and click the Run button.
Command Line
The above program, saved in a file called Test.java, can be compiled and run with the
commands below, where [url] is the URL or file to parse (the program reads it from args[0]):
Windows
javac -classpath [htmlp_dir]\lib\htmlparser.jar;[htmlp_dir]\lib\htmllexer.jar Test.java
java -classpath .;[htmlp_dir]\lib\htmlparser.jar;[htmlp_dir]\lib\htmllexer.jar Test [url]
Linux
javac -classpath [htmlp_dir]/lib/htmlparser.jar:[htmlp_dir]/lib/htmllexer.jar Test.java
java -classpath .:[htmlp_dir]/lib/htmlparser.jar:[htmlp_dir]/lib/htmllexer.jar Test [url]
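If the run step reports that a class cannot be found, the classpath is the usual suspect. The sketch below is one way to check it, assuming nothing beyond the standard library: it asks the class loader whether org.htmlparser.Parser (the class imported in Step 3) is visible and prints the effective classpath. The class name ParserOnClasspath is hypothetical.

```java
// Minimal sketch: check whether a class is visible on the current classpath.
// ParserOnClasspath is a hypothetical helper, not part of the distribution.
class ParserOnClasspath
{
    /** Returns true if the named class can be found, without initializing it. */
    static boolean isLoadable (String className)
    {
        try
        {
            Class.forName (className, false, ParserOnClasspath.class.getClassLoader ());
            return true;
        }
        catch (ClassNotFoundException e)
        {
            return false;
        }
    }

    public static void main (String[] args)
    {
        String name = "org.htmlparser.Parser";
        System.out.println (name
            + (isLoadable (name) ? " was found" : " was NOT found on the classpath"));
        System.out.println ("java.class.path = " + System.getProperty ("java.class.path"));
    }
}
```

Compile and run it with the same -classpath value you use for Test; if the class is reported as not found, compare the printed java.class.path with the location of htmlparser.jar.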