grab4j manual

Before you start
Web-grabbing with grab4j
The Java-driven approach
The JavaScript-driven approach

Before you start

To use grab4j in your Java software, make the grab4j.jar file visible to your application by adding it to the CLASSPATH. Since the grab4j library depends on third-party jars (placed in the lib directory of the distribution package), you also have to add them to the CLASSPATH.

The grab4j library requires a Java runtime environment J2SE 1.4 or later.

Web-grabbing with grab4j

Grabbing information from an online HTML document is a three-step operation:

Retrieving the document.
Parsing the document.
Extracting information from the document representation by running a grabbing-logic routine.

The main issue concerns the grabbing logic. Once the document is retrieved and parsed, it can be accessed and explored through an object representation. A document representation contains both useful and useless information. You may be interested in just a title, a text and a date, but a web page usually also contains a navigation bar, a site menu and other elements that are useless in the context of the grabbing operation. So you have to run a grabbing-logic routine, whose goal is to extract only the information you need and discard the rest.

The grab4j library allows you to retrieve and parse any online web page. It also gives you all the tools you need to build and run your grabbing-logic routine. The library lets you choose between two grabbing approaches. The first one is Java-driven: get the document representation and navigate its contents by calling the grab4j methods. You can extract all you need. Compile your classes, package your application and let it run. Some time later, however, the structure of the online page could change, and your grabbing-logic routine will probably stop working. If that happens, you have to write a new grabbing-logic routine, re-compile your application, re-package it and re-distribute or re-deploy it. Very boring, isn't it?

This is where the JavaScript-driven approach comes in. Instead of writing the grabbing-logic routine in Java, write it in JavaScript and put the code in a separate file. The grab4j library can load and run the JavaScript routine. If the structure of the online page changes, write a new grabbing-logic script and replace the previous file. No rebuild is needed.

The Java-driven approach

Use the it.sauronsoftware.grab4j.html.HTMLDocumentFactory class to build a document representation.

HTMLDocument doc = HTMLDocumentFactory.buildDocument("http://www.host.com/page.html");

The returned it.sauronsoftware.grab4j.html.HTMLDocument object is the representation of the parsed document. You can search within its elements with the methods getElements(), getElementCount(), getElement(), getElementById(), getElementsByAttribute() and getElementsByTag(). Advanced search capabilities are provided by the searchElements() and searchElement() methods, and by the it.sauronsoftware.grab4j.html.search.Criteria class.

The elements in the document are represented by instances of the it.sauronsoftware.grab4j.html.HTMLElement class. You can often cast a generic HTMLElement reference to a more specific one, such as HTMLText, HTMLTag, HTMLImage and HTMLLink.

Please refer to the library javadocs for more details.

The JavaScript-driven approach

The it.sauronsoftware.grab4j.WebGrabber class lets you grab a web page with a single static call:

URL pageUrl = new URL("http://www.host.com/page.html");
File scriptFile = new File("grabbing-logic.js");
Object result = WebGrabber.grab(pageUrl, scriptFile);

The document is fetched and parsed, and then the grabbing-logic script is run. The result set by the script is returned to the caller.

Please refer to the library javadocs for more details about the WebGrabber class.

The grabbing script

The grabbing script must be ECMAScript compliant. You can use every standard ECMAScript function, object or constant, such as parseInt(), isNaN(), Math and Infinity.

Global references

In addition to the ECMAScript built-ins, you also receive a pair of global grabbing-related variables.

document

This is the input variable for your script. It is the representation of the retrieved document. You can call this object's methods to extract the information you need.

titleElement = document.getElementById("title");

result

This is the output variable for your script. When the work is done, set it to a reference to the result of your grabbing activity. The referenced value will be returned, as a Java object, to the calling routine.

result = myResult;
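
Put together, the two variables give a script its shape: read from document, write to result. The sketch below wraps that logic in a function so that it can also be tried outside grab4j. The grabTitle name and the fakeDocument stand-in are illustrative assumptions, not part of the library, and the element with id "title" is assumed to be a tag, so that getInnerText() is available on it.

```javascript
// Grabbing logic written as a function so it can be tried outside grab4j.
// In a real grabbing script, `document` is supplied as a global and the
// returned value would be assigned to the global `result`.
function grabTitle(document) {
  var titleElement = document.getElementById("title");
  // getElementById() returns null when no element has the given id.
  if (titleElement === null) {
    return null;
  }
  return titleElement.getInnerText();
}

// Hypothetical stand-in for the document representation (illustration only).
var fakeDocument = {
  getElementById: function (id) {
    if (id === "title") {
      return { getInnerText: function () { return "Hello"; } };
    }
    return null;
  }
};

var result = grabTitle(fakeDocument);
```

In an actual script the body of grabTitle() would sit at the top level, reading the global document and assigning the global result.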

Global functions

In addition to the ECMAScript built-ins, you also receive some global grabbing-related functions.

print(<string>)

It sends a string to the standard output channel. Useful for debugging.

print("hello world");
print(document.getElementCount());

openDocument(<string>)

It retrieves and parses the document at the URL given as parameter, and returns a reference to its object representation. This function allows the script to open and grab other documents besides the one given with the document reference, which can work as a starting point. Suppose the document parsed and passed to the script contains just a list of links to other documents with the real contents you are interested in. In that case you will find the openDocument() function very useful!

var doc2 = openDocument("http://www.anothersite.com/anotherpage.html");
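
The "list of links" scenario can be sketched as follows, using only calls described in this manual (getElementsByTag(), getLinkURL(), openDocument(), getElementById(), getInnerText()). The function takes document and openDocument as parameters so that it can be tried outside grab4j. The collectLinkedTitles name, the fake objects and the "title" id are assumptions made for the illustration.

```javascript
// Opens every document linked from an index page and collects the text of
// the element with id "title" found in each of them.
function collectLinkedTitles(document, openDocument) {
  var titles = [];
  var links = document.getElementsByTag("a");
  for (var i = 0; i < links.length; i++) {
    // getLinkURL() is documented only for <a> tags with a valid href.
    var linked = openDocument(links[i].getLinkURL());
    var title = linked.getElementById("title");
    if (title !== null) {
      titles.push(title.getInnerText());
    }
  }
  return titles;
}

// Hypothetical stand-ins, only to try the function without grab4j.
var fakeTitles = {
  "http://www.host.com/a.html": "First",
  "http://www.host.com/b.html": "Second"
};
function fakeLink(url) {
  return { getLinkURL: function () { return url; } };
}
function fakeOpenDocument(url) {
  return {
    getElementById: function (id) {
      if (id !== "title") {
        return null;
      }
      return { getInnerText: function () { return fakeTitles[url]; } };
    }
  };
}
var fakeIndex = {
  getElementsByTag: function (tag) {
    if (tag !== "a") {
      return [];
    }
    return [fakeLink("http://www.host.com/a.html"), fakeLink("http://www.host.com/b.html")];
  }
};

var titles = collectLinkedTitles(fakeIndex, fakeOpenDocument);
```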

encodeEntities(<string>)

This one takes a string and encodes it as HTML: all reserved or troublesome characters in the given string are replaced with HTML entities. The new string with the encoded entities is returned to the caller.

var str = encodeEntities("<test>");

decodeEntities(<string>)

This one takes a string and decodes all the HTML entities in it. A new string with the decoded entities is generated and returned to the caller.

var str = decodeEntities("&lt;test&gt;");
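
To make the effect of the two functions concrete, here is a minimal sketch of entity encoding and decoding limited to four characters. It is not the grab4j implementation: the function names and the reduced entity table are assumptions, and the library may handle many more entities.

```javascript
// Illustrative only: handles just &, <, > and the double quote.
function encodeBasicEntities(text) {
  return text
    .replace(/&/g, "&amp;")   // first, so the entities added below are not re-encoded
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

function decodeBasicEntities(text) {
  return text
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, "\"")
    .replace(/&amp;/g, "&");  // last, so "&amp;lt;" decodes to "&lt;" and not to "<"
}

var encoded = encodeBasicEntities("<test>");
var decoded = decodeBasicEntities(encoded);
```

The ordering of the ampersand replacement is the one detail worth noting: encoding handles it first and decoding handles it last, which keeps a round trip lossless.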

Document methods

Of course, you can explore and search a document representation (the one brought by the document reference, or another one obtained with a call to the openDocument() function) by calling the following methods:

getElementCount()

It returns the number of the first-level elements in the document.

var n = document.getElementCount();

getElement(<integer>)

It returns the first-level element at the given index.

for (var i = 0; i < document.getElementCount(); i++) {
  var el = document.getElement(i);
  // ...
}

getElementById(<string>)

It recursively explores the element tree, starting from the top-level elements, searching for the first element whose "id" attribute has the given value. If no element is found, it returns null.

var el = document.getElementById("title");

getElementsByTag(<string>)

It recursively explores the element tree, starting from the top-level elements, searching for occurrences of the given tag. It returns an array of element references. If no element is found, it returns a zero-length array.

var elements = document.getElementsByTag("h1");
for (var i = 0; i < elements.length; i++) {
  // ...
}

getElementsByAttribute(<string>, <string>)

It recursively explores the element tree, starting from the top-level elements, selecting those that have the given attribute with the given value. It returns an array of element references. If no element is found, it returns a zero-length array.

var elements = document.getElementsByAttribute("align", "center");
for (var i = 0; i < elements.length; i++) {
  // ...
}

searchElements(<string>)

It searches recursively inside the document elements, returning an array with the elements matched by the given criteria. If no element is found, it returns a zero length array.

var elements = document.searchElements("html/body/.../img(src=*.jpg)");
for (var i = 0; i < elements.length; i++) {
  // ...
}

Search criteria are explained later.

searchElement(<string>)

It searches recursively inside the document elements, returning the first element matched by the given criteria. If no element is found, it returns null.

var el = document.searchElement("html/body/ul/li");

Search criteria are explained later.

Element methods

Every element representation gives you the following methods.

getElementCount()

It returns the number of the sub-elements owned by the current element.

var n = el.getElementCount();

getElement(<integer>)

It returns the sub-element at the given index.

for (var i = 0; i < el.getElementCount(); i++) {
  var el2 = el.getElement(i);
  // ...
}
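
Since getElementCount() and getElement() are available on the document and on every element, the same loop can be applied recursively to walk the whole tree. The sketch below counts all the elements under a node. The countDescendants and fakeElement names are illustrative, and fakeElement is only a stand-in that mimics the two documented methods.

```javascript
// Counts every element below `node`, at any depth. `node` can be the document
// or any element: both expose getElementCount() and getElement(index).
function countDescendants(node) {
  var total = 0;
  for (var i = 0; i < node.getElementCount(); i++) {
    var child = node.getElement(i);
    total += 1 + countDescendants(child);
  }
  return total;
}

// Hypothetical stand-in for grab4j elements, only to try the function standalone.
function fakeElement(children) {
  return {
    getElementCount: function () { return children.length; },
    getElement: function (index) { return children[index]; }
  };
}

var tree = fakeElement([
  fakeElement([fakeElement([]), fakeElement([])]),
  fakeElement([])
]);
var total = countDescendants(tree);
```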

getElementById(<string>)

It recursively explores the element tree, starting from the children of the current element, searching for the first element whose "id" attribute has the given value. If no element is found, it returns null.

var el2 = el.getElementById("title");

getElementsByTag(<string>)

It recursively explores the element tree, starting from the children of the current element, searching for occurrences of the given tag. It returns an array of element references. If no element is found, it returns a zero-length array.

var elements = el.getElementsByTag("h1");
for (var i = 0; i < elements.length; i++) {
  // ...
}

getElementsByAttribute(<string>, <string>)

It recursively explores the element tree, starting from the children of the current element, selecting those that have the given attribute with the given value. It returns an array of element references. If no element is found, it returns a zero-length array.

var elements = el.getElementsByAttribute("align", "center");
for (var i = 0; i < elements.length; i++) {
  // ...
}

searchElements(<string>)

It searches recursively within the current element children, returning an array with the elements matched by the given criteria. If no element is found, it returns a zero length array.

var elements = el.searchElements("table/tr/td/img(src=*.jpg)");
for (var i = 0; i < elements.length; i++) {
  // ...
}

Search criteria are explained later.

searchElement(<string>)

It searches recursively within the current element children, returning the first element matched by the given criteria. If no element is found, it returns null.

var el2 = el.searchElement("ul/li");

Search criteria are explained later.

getPreviousElement()

It returns the previous element in the current element group, or null if the current element is the first one.

var p = el.getPreviousElement();

getNextElement()

It returns the next element in the current element group, or null if the current element is the last one.

var n = el.getNextElement();

getParentElement()

It returns the parent element of the current one, or null if the current element is a root element.

var p = el.getParentElement();
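
The navigation methods can be chained in a loop. As a minimal sketch, the function below collects every element that follows a given one in its group by calling getNextElement() until it returns null. The followingSiblings name and the fake group are assumptions made for the illustration.

```javascript
// Returns the elements that come after `el` in its group, in order.
function followingSiblings(el) {
  var siblings = [];
  var next = el.getNextElement();
  while (next !== null) {
    siblings.push(next);
    next = next.getNextElement();
  }
  return siblings;
}

// Hypothetical stand-ins that mimic getNextElement() on a flat group.
function fakeMember(group, index, name) {
  return {
    name: name,
    getNextElement: function () {
      return index + 1 < group.length ? group[index + 1] : null;
    }
  };
}
function fakeGroup(names) {
  var group = [];
  for (var i = 0; i < names.length; i++) {
    group.push(fakeMember(group, i, names[i]));
  }
  return group;
}

var group = fakeGroup(["h1", "p", "ul"]);
var rest = followingSiblings(group[0]);
```

The same pattern works backwards with getPreviousElement(), and upwards with getParentElement() to build the list of ancestors.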

Tag methods

If you get an element that represents an HTML tag, you can also call:

getTagName()

It returns the current tag name.

var tagname = el.getTagName();

getAttribute(<string>)

It returns the value of the given attribute, or null if no attribute with the supplied name is found.

var attrValue = el.getAttribute("align");

isEmpty()

It returns true if the tag has no contents.

var empty = el.isEmpty();

getInnerText()

It extracts the tag contents as plain text.

var text = el.getInnerText();

getInnerHTML()

It returns the HTML code in the tag contents.

var html = el.getInnerHTML();

getOuterHTML()

It returns the HTML code with the tag and its contents.

var html = el.getOuterHTML();

getLinkURL()

Available only if the tag is <a> and it has a valid href attribute. It extracts and returns the link URL. While getAttribute("href") returns a "raw" value, this one checks the attribute value and returns it as an absolute URL. You can use it with the openDocument() function to load any linked document.

var elements = document.getElementsByTag("a");
for (var i = 0; i < elements.length; i++) {
  print(elements[i].getLinkURL());
}
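
The difference between the "raw" attribute value and an absolute URL can be illustrated with the standard URL class found in Node.js and in browsers. This is only a sketch of the idea of resolving a link against the address of the page that contains it: it is not how grab4j computes its result, the URL class is not necessarily available inside a grabbing script, and the toAbsoluteUrl name and the sample addresses are made up for the example.

```javascript
// Resolves a raw href value against the address of the containing page.
function toAbsoluteUrl(pageUrl, rawHref) {
  return new URL(rawHref, pageUrl).href;
}

var page = "http://www.host.com/news/index.html";
var sibling = toAbsoluteUrl(page, "item.html");       // relative to the page directory
var parent = toAbsoluteUrl(page, "../about.html");    // climbs one directory up
var external = toAbsoluteUrl(page, "http://www.anothersite.com/anotherpage.html");
```

A raw value such as "item.html" is of little use to openDocument() on its own, which is why the absolute form returned by getLinkURL() is the convenient one to pass along.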

getImageURL()

Available only if the tag is <img> and it has a valid src attribute. It extracts and returns the image source URL. While getAttribute("src") returns a "raw" value, this one checks the attribute value and returns it as an absolute URL.

var elements = document.getElementsByTag("img");
for (var i = 0; i < elements.length; i++) {
  print(elements[i].getImageURL());
}

Search criteria

The string representation of a search criterion is split into several parts, separated by slash characters:

token1/token2/token3

Each token is used to recognize a tag or a set of tags. The general model is the following:

tagNamePattern[index](attribute1=valuePattern1)(attribute2=valuePattern2)(...)

The first element in the token model is the tag name pattern, which selects the wanted tag(s). It is a wildcard pattern: the star character can be used to match any sequence of characters.

A first simple example:

html/body/div

This criterion finds all the "div" elements whose parent is the "body" tag, which in turn is inside an "html" tag.

A wildcard example:

html/body/*

This criterion finds all the elements whose parent is the "body" tag, within the "html" one.

Another one:

html/body/h*

This criterion finds all the elements whose parent is the "body" tag and whose name starts with the letter "h", such as "h1", "h2", "h3" and so on.
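
The behaviour of the star wildcard can be sketched by translating a pattern into a regular expression. This is an illustration of wildcard matching in general, not the grab4j implementation: the function names are made up, and the matching is assumed to be case sensitive, which this manual does not specify.

```javascript
// Builds a regular expression equivalent to a star-wildcard pattern.
function wildcardToRegExp(pattern) {
  // Escape the regular-expression metacharacters, leaving the star alone.
  var escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  // Each star matches any sequence of characters, including an empty one.
  return new RegExp("^" + escaped.replace(/\*/g, ".*") + "$");
}

function matchesWildcard(pattern, value) {
  return wildcardToRegExp(pattern).test(value);
}

var headingMatches = matchesWildcard("h*", "h1");
var divMatches = matchesWildcard("h*", "div");
```

Note that a pattern like h* is purely textual: under this reading it also accepts names such as "hr", not only the heading tags.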

Using the index selector:

html/body/div[1]

This criterion returns the second "div" element whose parent is the "body" tag. Note that the lowest index value is 0, just like in arrays.

html/body/h*[2]

This criterion returns the third element whose parent is the "body" tag and whose name starts with the letter "h".

Using attribute selector(s):

html/body/div(id=d1)

This one searches for divs with an attribute called "id", whose value is exactly "d1".

The star wildcard is admitted in the value part of the selector:

html/body/div(id=*)

This one searches for divs with an attribute called "id", regardless of its value.

html/body/div(id=d*)

This one searches for divs with an attribute called "id", whose value starts with the "d" letter.

More attribute selectors can be combined together:

html/body/div(id=d*)(align=left)

An index selector and two attribute selectors in this example:

html/body/div[1](id=d*)(align=left)

This will search for the second "div" tag, inside the "html"-"body" sequence, whose attribute "id" has a value starting with "d" and whose attribute "align" is exactly "left".
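
To show how the parts of the token model fit together, the sketch below splits a single token into its tag name pattern, optional index and attribute selectors. It is a reading aid rather than the parser used by grab4j: the parseToken name and the shape of the returned object are assumptions, and escape sequences and the recursive token are not given any special treatment.

```javascript
// Splits one token of the form tagNamePattern[index](attr=valuePattern)...
function parseToken(token) {
  var match = /^([^\[\(]+)(?:\[(\d+)\])?((?:\([^=()]+=[^()]*\))*)$/.exec(token);
  if (match === null) {
    throw new Error("Unrecognized token: " + token);
  }
  var attributes = [];
  var selector = /\(([^=()]+)=([^()]*)\)/g;
  var part;
  // Walk through the "(name=valuePattern)" groups one by one.
  while ((part = selector.exec(match[3])) !== null) {
    attributes.push({ name: part[1], valuePattern: part[2] });
  }
  return {
    tagNamePattern: match[1],
    // The index selector is optional: null means "no index given".
    index: match[2] === undefined ? null : parseInt(match[2], 10),
    attributes: attributes
  };
}

var parsed = parseToken("div[1](id=d*)(align=left)");
```

A whole criterion would be handled by splitting it on the slash character and parsing each token in turn.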

Search criteria admit a special token, called the "recursive deep token" and represented by a sequence of three dots.

html/body/.../table

This criterion will search for tables inside the body of the document, regardless of whether they are placed directly under the "body" tag or not. This is, of course, a recursive search within the body sub-elements. The criterion will return all the tables like the following

<html><body><table>...

but it will also return all the ones like

<html><body><div><div><table>...
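
The effect of the recursive deep token can be sketched on a tree of plain objects. The code below supports only exact tag names and the three-dot token, with no wildcards, indexes or attribute selectors, and it makes no claim about how grab4j orders its results or handles edge cases. The searchByTags and node names and the tree layout are assumptions made for the illustration.

```javascript
// Matches a list of tokens against a hypothetical tree of { tag, children }.
function searchByTags(nodes, tokens) {
  var found = [];
  if (tokens.length === 0) {
    return found;
  }
  var head = tokens[0];
  var rest = tokens.slice(1);
  var i;
  if (head === "...") {
    // Zero levels skipped: the remaining tokens may match right here.
    found = found.concat(searchByTags(nodes, rest));
    // One or more levels skipped: retry the same tokens one level deeper.
    for (i = 0; i < nodes.length; i++) {
      found = found.concat(searchByTags(nodes[i].children, tokens));
    }
    return found;
  }
  for (i = 0; i < nodes.length; i++) {
    if (nodes[i].tag !== head) {
      continue;
    }
    if (rest.length === 0) {
      found.push(nodes[i]); // last token matched: this node is a result
    } else {
      found = found.concat(searchByTags(nodes[i].children, rest));
    }
  }
  return found;
}

function node(tag, children) {
  return { tag: tag, children: children || [] };
}

var direct = node("table");
var nested = node("table");
var page = [node("html", [
  node("head"),
  node("body", [direct, node("div", [node("div", [nested])])])
])];

var deep = searchByTags(page, "html/body/.../table".split("/"));
var shallow = searchByTags(page, "html/body/table".split("/"));
```

Here deep contains both tables, while shallow contains only the one placed directly under the "body" tag.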

Reserved characters can be escaped with the sequence <xx>, where xx is the hexadecimal code of the escaped character.
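
As a rough illustration of the escape notation, the sketch below converts a character to its <xx> form and back. It assumes exactly two hexadecimal digits, accepted in either case, as the "xx" placeholder suggests. The function names are made up, and the set of characters that grab4j treats as reserved is not listed here, so the choice of what to escape is left to the caller.

```javascript
// Turns a single character into its <xx> escape sequence.
function escapeCharacter(ch) {
  var hex = ch.charCodeAt(0).toString(16);
  // Pad to two digits so that low character codes keep the <xx> shape.
  return "<" + (hex.length < 2 ? "0" + hex : hex) + ">";
}

// Replaces every <xx> sequence in the text with the character it stands for.
function decodeEscapes(text) {
  return text.replace(/<([0-9a-fA-F]{2})>/g, function (whole, hex) {
    return String.fromCharCode(parseInt(hex, 16));
  });
}

var escapedSlash = escapeCharacter("/");
var restored = decodeEscapes("a<2f>b");
```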

Java classes within the script

Since the script is executed by a Java environment, you can gain access to any Java class from its code.

If the class is in the java.* package hierarchy, you can import it as follows:

importClass(<class>);

For example:

importClass(java.util.ArrayList);

The other package hierarchies can be imported as follows:

importClass(Packages.<class>);

For example:

importClass(Packages.it.sauronsoftware.grab4j.examples2.Item);

To import a package in the java.* hierarchy:

importPackage(<package>);

For example:

importPackage(java.util);

The other package hierarchies can be imported as follows:

importPackage(Packages.<package>);

For example:

importPackage(Packages.it.sauronsoftware.grab4j.examples2);

Once a Java class has been imported you can use it in the usual way:

importClass(java.util.ArrayList);
var list = new ArrayList();
// ...

Examples

Some working examples can be found in the examples directory within the distribution package.