Class Acme.Spider
java.lang.Object
|
+----Acme.Spider
- public class Spider
- extends Object
- implements HtmlObserver, Enumeration
A web-robot class.
This is an Enumeration class that traverses the web starting at
a given URL. It fetches HTML files and parses them for new
URLs to look at. All files it encounters, HTML or otherwise,
are returned by the nextElement() method as a URLConnection.
The traversal is breadth-first, and by default it is limited to
files at or below the starting point - same protocol, hostname, and
initial directory.
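The traversal described above can be pictured with a small in-memory model. This is only a sketch, not the Acme.Spider source: the class name BreadthFirstSketch, the link map, and the example.com URLs are all made up for illustration, and a plain string-prefix test stands in for the "same protocol, hostname, and initial directory" rule.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical stand-in for the traversal; no network access is involved.
class BreadthFirstSketch {
    // Visit pages breadth-first from start, queueing only links that
    // begin with the starting URL and have not been queued before.
    static List<String> crawl(String start, Map<String, List<String>> links) {
        Queue<String> todo = new ArrayDeque<>();
        Set<String> done = new HashSet<>();
        List<String> visited = new ArrayList<>();
        todo.add(start);
        done.add(start);
        while (!todo.isEmpty()) {
            String url = todo.remove();
            visited.add(url);
            for (String link : links.getOrDefault(url, List.of())) {
                // Set.add returns false when the link was already recorded.
                if (link.startsWith(start) && done.add(link)) {
                    todo.add(link);
                }
            }
        }
        return visited;
    }

    // Builds a tiny made-up site and crawls it.
    static List<String> demo() {
        Map<String, List<String>> links = new HashMap<>();
        links.put("http://example.com/docs/", List.of(
            "http://example.com/docs/a.html",
            "http://example.com/docs/b.html",
            "http://example.com/other.html"));
        links.put("http://example.com/docs/a.html", List.of(
            "http://example.com/docs/sub/c.html",
            "http://example.com/docs/"));
        links.put("http://example.com/docs/b.html", List.of(
            "http://other.example.org/"));
        return crawl("http://example.com/docs/", links);
    }

    public static void main(String[] args) {
        for (String url : demo()) {
            System.out.println(url);
        }
    }
}
```

In this model the pages outside the starting directory are never queued, and the link back to the start page is ignored because it is already in the done set.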
Because of the security restrictions on applets, this is currently
only useful from applications.
Sample code:
Enumeration spider = new Acme.Spider( "http://some.site.com/whatever/" );
while ( spider.hasMoreElements() )
{
    URLConnection conn = (URLConnection) spider.nextElement();
    // Then do whatever you like with conn:
    URL thisUrl = conn.getURL();
    String thisUrlStr = thisUrl.toExternalForm();
    String mimeType = conn.getContentType();
    long changed = conn.getLastModified();
    InputStream s = conn.getInputStream();
    // Etc. etc. etc., your code here.
}
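As one possible way to fill in the "your code here" part, the sketch below counts the bytes behind a URLConnection. So that it runs without Acme.Spider or a network, it opens a connection to a temporary local file; the class name ConnectionSketch and the helpers countBytes and demo are hypothetical, and only standard java.net and java.nio calls are used.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper showing one thing to do with each connection.
class ConnectionSketch {
    // Read the connection's stream to the end and return the byte count.
    static long countBytes(URLConnection conn) throws IOException {
        long total = 0;
        byte[] buf = new byte[4096];
        try (InputStream in = conn.getInputStream()) {
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
        }
        return total;
    }

    // Stand-in for a spider element: a connection to a temporary file.
    static long demo() {
        try {
            Path tmp = Files.createTempFile("spider-sketch", ".html");
            try {
                Files.write(tmp, "<html></html>".getBytes(StandardCharsets.US_ASCII));
                URLConnection conn = tmp.toUri().toURL().openConnection();
                System.out.println(conn.getURL().toExternalForm());
                return countBytes(conn);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo() + " bytes");
    }
}
```

With a real spider, countBytes would be called on each element returned by nextElement() instead of on a file connection.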
There are also a few methods you can override via a subclass (doThisUrl,
brokenLink, and reportError) to control things like the search limits, what
gets done with broken links, and how errors are reported.
Sample applications that use Acme.Spider:
- WebList - make a list of the files in a web subtree
- WebCopy - copy a remote web subtree to the local disk
- WebGrep - grep a web subtree for a pattern
- See Also:
- HtmlScanner, NoRobots
-
done
- Hash-table of URLs already done.
-
err
- The error stream.
-
todo
- Queue of URLs waiting to be examined.
-
Spider()
- Constructor with no size limits, and the default error stream.
-
Spider(int, int)
- Constructor with size limits.
-
Spider(int, int, PrintStream)
- Constructor with size limits.
-
Spider(PrintStream)
- Constructor with no size limits.
-
Spider(String)
- Constructor with a single URL and no size limits, and the default
error stream.
-
Spider(String, PrintStream)
- Constructor with a single URL and no size limits.
-
addObserver(HtmlObserver)
- Add an extra observer to the scanners we make.
-
addUrl(String)
- Add a URL to the to-do list.
-
brokenLink(String, String, String)
- This method can be overridden by a subclass if you want to change
the broken link policy.
-
doThisUrl(String, int, String)
- This method can be overridden by a subclass if you want to change
the search policy.
-
gotAHREF(String, URL, Object)
- Acme.HtmlObserver callback.
-
gotAREAHREF(String, URL, Object)
- Acme.HtmlObserver callback.
-
gotBASEHREF(String, URL, Object)
- Acme.HtmlObserver callback.
-
gotBODYBACKGROUND(String, URL, Object)
- Acme.HtmlObserver callback.
-
gotFRAMESRC(String, URL, Object)
- Acme.HtmlObserver callback.
-
gotIMGSRC(String, URL, Object)
- Acme.HtmlObserver callback.
-
gotLINKHREF(String, URL, Object)
- Acme.HtmlObserver callback.
-
hasMoreElements()
- Standard Enumeration method.
-
main(String[])
- Test program.
-
nextElement()
- Standard Enumeration method.
-
reportError(String, String, String)
- This method can be overridden by a subclass if you want to change
the error reporting policy.
-
setAuth(String)
- Set the authorization cookie.
err
protected PrintStream err
todo
protected Queue todo
done
protected Hashtable done
Spider
public Spider(PrintStream err)
- Constructor with no size limits.
- Parameters:
- err - the error stream
Spider
public Spider()
- Constructor with no size limits, and the default error stream.
Spider
public Spider(String urlStr,
PrintStream err) throws MalformedURLException
- Constructor with a single URL and no size limits.
- Parameters:
- urlStr - the URL to start off the enumeration
- err - the error stream
Spider
public Spider(String urlStr) throws MalformedURLException
- Constructor with a single URL and no size limits, and the default
error stream.
- Parameters:
- urlStr - the URL to start off the enumeration
Spider
public Spider(int todoLimit,
int doneLimit,
PrintStream err)
- Constructor with size limits.
This version lets you specify limits on the todo queue and the
done hash-table. If you are using Spider for a large, multi-site
traversal, then you may need to set these limits to avoid running
out of memory. Note that setting a todoLimit means the traversal
will not be complete - you may skip some URLs. And setting the
doneLimit means it may re-visit some pages.
Guesses at good values for an unlimited traversal: a todoLimit of 200000
and a doneLimit of 20000.
You want the doneLimit pretty small because the hash-table gets checked
for every URL, so it will be mostly in memory; the todo queue, on the
other hand, is only accessed at the front and back, and so will be
mostly paged out.
- Parameters:
- todoLimit - maximum number of URLs to queue for examination
- doneLimit - maximum number of URLs to remember having done already
- err - the error stream
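The trade-off between the two limits can be seen in a toy model. The sketch below is an assumption about how such limits might behave, not the Acme.Spider implementation: the class LimitSketch is hypothetical, it drops new URLs when the to-do queue is full, and it simply forgets everything when the done set is full (the real class may evict differently).

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical model of a bounded to-do queue and a bounded done set.
class LimitSketch {
    private final int todoLimit;
    private final int doneLimit;
    private final Queue<String> todo = new ArrayDeque<>();
    private final Set<String> done = new HashSet<>();
    int skipped = 0;

    LimitSketch(int todoLimit, int doneLimit) {
        this.todoLimit = todoLimit;
        this.doneLimit = doneLimit;
    }

    // Returns true if the URL was queued for examination.
    boolean offer(String url) {
        if (done.contains(url)) {
            return false;              // remembered, so not queued again
        }
        if (todo.size() >= todoLimit) {
            skipped++;                 // queue full: this URL is never visited
            return false;
        }
        if (done.size() >= doneLimit) {
            done.clear();              // assumed policy: forget old URLs
        }
        done.add(url);
        todo.add(url);
        return true;
    }

    // Next URL to examine, or null when the queue is empty.
    String next() {
        return todo.poll();
    }

    public static void main(String[] args) {
        LimitSketch smallTodo = new LimitSketch(2, 100);
        smallTodo.offer("u1");
        smallTodo.offer("u2");
        System.out.println("u3 queued: " + smallTodo.offer("u3"));

        LimitSketch smallDone = new LimitSketch(100, 2);
        smallDone.offer("u1");
        smallDone.offer("u2");
        smallDone.offer("u3");
        System.out.println("u1 queued again: " + smallDone.offer("u1"));
    }
}
```

In the first case the third URL is skipped, which is the incomplete traversal the text warns about; in the second case u1 has been forgotten and is queued a second time, which is the re-visit.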
Spider
public Spider(int todoLimit,
int doneLimit)
- Constructor with size limits.
- Parameters:
- todoLimit - maximum number of URLs to queue for examination
- doneLimit - maximum number of URLs to remember having done already
addUrl
public synchronized void addUrl(String urlStr) throws MalformedURLException
- Add a URL to the to-do list.
setAuth
public synchronized void setAuth(String auth_cookie)
- Set the authorization cookie.
Syntax is userid:password.
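The userid:password syntax matches the credentials used by HTTP Basic authentication. Assuming that is what the cookie is for (this page does not say so), the sketch below shows how such a string is normally turned into an Authorization header value. The class AuthSketch and the method basicAuthHeader are hypothetical and are not part of Acme.Spider.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper; assumes the cookie is used for HTTP Basic auth.
class AuthSketch {
    // Base64-encode "userid:password" and prefix the scheme name.
    static String basicAuthHeader(String authCookie) {
        byte[] raw = authCookie.getBytes(StandardCharsets.UTF_8);
        return "Basic " + Base64.getEncoder().encodeToString(raw);
    }

    public static void main(String[] args) {
        // Credentials taken from the example in the HTTP Basic auth RFC.
        System.out.println(basicAuthHeader("Aladdin:open sesame"));
    }
}
```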
addObserver
public synchronized void addObserver(HtmlObserver observer)
- Add an extra observer to the scanners we make. Multiple observers
get called in the order they were added.
Alternatively, if you want to add a different observer to each
scanner, you can cast the input stream to a scanner and call
its add routine, like so:
InputStream s = conn.getInputStream();
Acme.HtmlScanner scanner = (Acme.HtmlScanner) s;
scanner.addObserver( this );
doThisUrl
protected boolean doThisUrl(String thisUrlStr,
int depth,
String baseUrlStr)
- This method can be overridden by a subclass if you want to change
the search policy. The default version only does URLs that start
with the same string as the base URL. An alternate version might
instead go by the search depth.
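The two policies mentioned above might look roughly like this. The methods below reuse the doThisUrl parameter list but live in a hypothetical stand-alone class, SearchPolicySketch, so that the example runs without Acme.Spider; MAX_DEPTH is an arbitrary value chosen for illustration.

```java
// Hypothetical stand-alone versions of two search policies.
class SearchPolicySketch {
    // Arbitrary cut-off for the depth-based policy.
    static final int MAX_DEPTH = 2;

    // Prefix policy, as described for the default: only URLs that
    // start with the same string as the base URL are followed.
    static boolean prefixPolicy(String thisUrlStr, int depth, String baseUrlStr) {
        return thisUrlStr.startsWith(baseUrlStr);
    }

    // Alternate policy: follow any URL, but only down to MAX_DEPTH.
    static boolean depthPolicy(String thisUrlStr, int depth, String baseUrlStr) {
        return depth <= MAX_DEPTH;
    }

    public static void main(String[] args) {
        String base = "http://example.com/docs/";
        System.out.println(prefixPolicy("http://example.com/docs/a.html", 1, base));
        System.out.println(prefixPolicy("http://example.com/other.html", 1, base));
        System.out.println(depthPolicy("http://elsewhere.example.org/", 3, base));
    }
}
```

In a real subclass, the body of either method would go into an override of doThisUrl.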
brokenLink
protected void brokenLink(String fromUrlStr,
String toUrlStr,
String errmsg)
- This method can be overridden by a subclass if you want to change
the broken link policy. The default version reports the broken
link on the error stream. An alternate version might attempt to
send mail to the owner of the page with the broken link.
reportError
protected void reportError(String fromUrlStr,
String toUrlStr,
String errmsg)
- This method can be overridden by a subclass if you want to change
the error reporting policy. The default version reports the error
on the error stream. An alternate version might ignore the error.
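Overriding brokenLink and reportError follows the usual subclassing pattern. Because Acme.Spider is not available here, the sketch below uses a hypothetical base class, ReporterSketch, with the same two method signatures; the message format it prints is invented. The subclass collects broken links in a list and ignores other errors.

```java
import java.io.PrintStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the two overridable reporting hooks.
class ReporterSketch {
    protected final PrintStream err;

    ReporterSketch(PrintStream err) {
        this.err = err;
    }

    // Default behaviour: write the broken link to the error stream.
    protected void brokenLink(String fromUrlStr, String toUrlStr, String errmsg) {
        err.println("Broken link in " + fromUrlStr + " to " + toUrlStr + ": " + errmsg);
    }

    // Default behaviour: write the error to the error stream.
    protected void reportError(String fromUrlStr, String toUrlStr, String errmsg) {
        err.println("Error in " + fromUrlStr + " at " + toUrlStr + ": " + errmsg);
    }
}

// Alternate policy: remember broken links, stay silent about errors.
class CollectingReporter extends ReporterSketch {
    final List<String> broken = new ArrayList<>();

    CollectingReporter(PrintStream err) {
        super(err);
    }

    @Override
    protected void brokenLink(String fromUrlStr, String toUrlStr, String errmsg) {
        broken.add(fromUrlStr + " -> " + toUrlStr + " (" + errmsg + ")");
    }

    @Override
    protected void reportError(String fromUrlStr, String toUrlStr, String errmsg) {
        // Deliberately ignored.
    }

    public static void main(String[] args) {
        CollectingReporter r = new CollectingReporter(System.err);
        r.brokenLink("http://example.com/a.html", "http://example.com/gone.html", "404");
        r.reportError("http://example.com/a.html", "http://example.com/slow.html", "timeout");
        System.out.println(r.broken);
    }
}
```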
hasMoreElements
public synchronized boolean hasMoreElements()
- Standard Enumeration method.
nextElement
public synchronized Object nextElement()
- Standard Enumeration method.
gotAHREF
public void gotAHREF(String urlStr,
URL contextUrl,
Object clientData)
- Acme.HtmlObserver callback.
gotIMGSRC
public void gotIMGSRC(String urlStr,
URL contextUrl,
Object clientData)
- Acme.HtmlObserver callback.
gotFRAMESRC
public void gotFRAMESRC(String urlStr,
URL contextUrl,
Object clientData)
- Acme.HtmlObserver callback.
gotBASEHREF
public void gotBASEHREF(String urlStr,
URL contextUrl,
Object clientData)
- Acme.HtmlObserver callback.
gotAREAHREF
public void gotAREAHREF(String urlStr,
URL contextUrl,
Object clientData)
- Acme.HtmlObserver callback.
gotLINKHREF
public void gotLINKHREF(String urlStr,
URL contextUrl,
Object clientData)
- Acme.HtmlObserver callback.
gotBODYBACKGROUND
public void gotBODYBACKGROUND(String urlStr,
URL contextUrl,
Object clientData)
- Acme.HtmlObserver callback.
main
public static void main(String args[])
- Test program. Shows URLs, file sizes, etc. at the ACME Java site.
ACME Java ACME Labs