java.lang.Object
  org.htmlparser.parserapplications.SiteCapturer
public class SiteCapturer
Save a web site locally. Illustrative program to save a web site's contents locally. It was created to demonstrate URL rewriting in its simplest form. It uses customized tags in the NodeFactory to alter the URLs. This program has a number of limitations:
Field Summary | |
---|---|
protected boolean | mCaptureResources - If true, save resources locally too; otherwise, leave resource links pointing to the original page. |
protected HashSet | mCopied - The set of resources already copied. |
protected NodeFilter | mFilter - The filter to apply to the nodes retrieved. |
protected HashSet | mFinished - The set of pages already captured. |
protected ArrayList | mImages - The list of resources to copy. |
protected ArrayList | mPages - The list of pages to capture. |
protected Parser | mParser - The parser to use for processing. |
protected String | mSource - The web site to capture. |
protected String | mTarget - The local directory to capture to. |
protected int | TRANSFER_SIZE - Copy buffer size. |
Constructor Summary | |
---|---|
SiteCapturer() | Create a web site capturer. |
Method Summary | |
---|---|
void | capture() - Perform the capture. |
protected void | copy() - Copy a resource (image) locally. |
protected String | decode(String raw) - Unescape a URL to form a file name. |
boolean | getCaptureResources() - Getter for property captureResources. |
NodeFilter | getFilter() - Getter for property filter. |
String | getSource() - Getter for property source. |
String | getTarget() - Getter for property target. |
protected boolean | isHtml(String link) - Returns true if the link contains text/html content. |
protected boolean | isToBeCaptured(String link) - Returns true if the link is one we are interested in. |
static void | main(String[] args) - Mainline to capture a web site locally. |
protected String | makeLocalLink(String link, String current) - Converts a link to local. |
protected void | process(NodeFilter filter) - Process a single page. |
void | setCaptureResources(boolean capture) - Setter for property captureResources. |
void | setFilter(NodeFilter filter) - Setter for property filter. |
void | setSource(String source) - Setter for property source. |
void | setTarget(String target) - Setter for property target. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
protected String mSource
protected String mTarget
protected ArrayList mPages
protected HashSet mFinished
protected ArrayList mImages
protected HashSet mCopied
protected Parser mParser
protected boolean mCaptureResources
If true, save resources locally too; otherwise, leave resource links pointing to the original page.
protected NodeFilter mFilter
protected final int TRANSFER_SIZE
Constructor Detail |
---|
public SiteCapturer()
Method Detail |
---|
public String getSource()
public void setSource(String source)
source - New value of property source.
public String getTarget()
public void setTarget(String target)
target - New value of property target.
public boolean getCaptureResources()
If true, the images and other resources referenced by the site and within the base URL tree are also copied locally to the target directory. If false, the image links are left 'as is', still referring to the original site.
public void setCaptureResources(boolean capture)
capture - New value of property captureResources.
public NodeFilter getFilter()
public void setFilter(NodeFilter filter)
filter - New value of property filter.
protected boolean isToBeCaptured(String link)
Returns true if the link is one we are interested in.
link - The link to be checked.
Returns true if the link has the source URL as a prefix and doesn't contain '?' or '#'; the former because we won't be able to handle server-side queries in the static target directory structure, and the latter because presumably the full page with that reference has already been captured previously. This performs a case-insensitive comparison, which is cheating really, but it's cheap.
protected boolean isHtml(String link) throws ParserException
Returns true if the link contains text/html content.
link - The URL to check for content type.
Returns true if the HTTP header indicates the type is "text/html".
ParserException - If the supplied URL can't be read from.
protected String makeLocalLink(String link, String current)
link - The link to make relative.
current - The current page URL, or empty if it's an absolute URL that needs to be converted.
protected String decode(String raw)
raw - The escaped URI.
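This page does not show how decode unescapes its argument. As a minimal sketch only, assuming the escapes are ordinary %XX sequences whose bytes are read as UTF-8 (an assumption, not something this documentation confirms), the conversion could look like the following. The class name DecodeSketch is hypothetical and the code uses only the JDK.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone sketch; not the SiteCapturer implementation.
class DecodeSketch {
    /** Replace each well-formed %XX escape with its byte, then read all bytes as UTF-8. */
    static String decode(String raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            if (c == '%' && i + 2 < raw.length()) {
                int hi = Character.digit(raw.charAt(i + 1), 16);
                int lo = Character.digit(raw.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    out.write((hi << 4) | lo); // one decoded byte
                    i += 3;
                    continue;
                }
            }
            // Not a valid escape: copy the character through unchanged.
            int cp = raw.codePointAt(i);
            byte[] bytes = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            out.write(bytes, 0, bytes.length);
            i += Character.charCount(cp);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(decode("my%20page.html"));
    }
}
```

Malformed escapes such as a trailing '%' are passed through as-is in this sketch. It also does nothing about characters that are illegal in file names on a given platform, which a real capture tool may need to handle.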
protected void copy()
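copy() is documented only as copying a resource locally, with TRANSFER_SIZE as the copy buffer size. A buffered stream copy of that kind might look like the sketch below. The class name CopySketch, the stream parameters, and the value 4096 are all assumptions made for illustration; the real method takes no arguments and this page does not state the actual buffer size.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical stand-alone sketch; not the SiteCapturer implementation.
class CopySketch {
    // Assumed value; the documentation only says "Copy buffer size."
    static final int TRANSFER_SIZE = 4096;

    /** Copy everything from in to out in TRANSFER_SIZE chunks; returns the byte count. */
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[TRANSFER_SIZE];
        long total = 0;
        int read;
        // read() returns -1 at end of stream
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
            total += read;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "example resource bytes".getBytes("UTF-8");
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(data), sink);
        System.out.println(copied + " bytes copied");
    }
}
```

In a capture tool the input would typically be a URL connection's stream and the output a file under the target directory, both closed by the caller once the copy finishes.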
protected void process(NodeFilter filter) throws ParserException
filter - The filter to apply to the collected nodes.
ParserException - If a parse error occurs.
public void capture()
public static void main(String[] args) throws MalformedURLException, IOException
args - The command line arguments. There are three arguments: the web site to capture, the local directory to save it to, and a flag (true or false) to indicate whether resources such as images and video are to be captured as well. These are requested via dialog boxes if not supplied.
MalformedURLException - If the supplied URL is invalid.
IOException - If an error occurs reading the page or resources.
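For reference, the link test described under isToBeCaptured above (the source URL as a case-insensitive prefix, and no '?' or '#' in the link) can be sketched with plain string operations. The class name CaptureFilterSketch and the explicit source parameter are hypothetical; the documented method takes only the link and presumably consults the mSource field.

```java
import java.util.Locale;

// Hypothetical stand-alone sketch; not the SiteCapturer implementation.
class CaptureFilterSketch {
    /**
     * True if link starts with source (ignoring case) and contains
     * neither a query marker '?' nor a fragment marker '#'.
     */
    static boolean isToBeCaptured(String link, String source) {
        String lowerLink = link.toLowerCase(Locale.ROOT);
        String lowerSource = source.toLowerCase(Locale.ROOT);
        return lowerLink.startsWith(lowerSource)
                && link.indexOf('?') == -1
                && link.indexOf('#') == -1;
    }

    public static void main(String[] args) {
        String source = "http://example.com/docs/";
        System.out.println(isToBeCaptured("http://example.com/docs/a.html", source));
        System.out.println(isToBeCaptured("http://example.com/docs/a.html?x=1", source));
    }
}
```

Lower-casing the whole URL is the cheap shortcut the description admits to: paths are case-sensitive on many servers, so a stricter version would lower-case only the scheme and host before comparing.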
© 2006 Derrick Oswald Sep 17, 2006
HTML Parser is an open source library released under the Common Public License.