
Html

The Html module parses and queries HTML-formatted files and remote pages. It also contains helper methods for encoding and decoding HTML.

Requires Manatee v1.29 or greater

This version of the Html module cannot be used with Manatee v1.28 or earlier.

Loading data

The methods load and loadFrom load and parse an HTML document. Both return an HtmlDoc object, which can be used for querying and extracting information.

javascript
// Load html from a string
var doc = Html.load("<html><body>Hello, world!</body></html>");
// Load html from a URL
doc = Html.loadFrom("http://sirenia.eu");

Html.encode

Use this method to encode a string, replacing characters such as < and non-ASCII Unicode characters with their HTML-encoded counterparts.

js
var encoded = Html.encode("1 < 2");
// encoded is now "1 &lt; 2"
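To illustrate the idea behind entity encoding, here is a minimal plain JavaScript sketch. It is not Manatee's implementation: escapeMarkup is a hypothetical helper that handles only the five markup-significant characters, whereas Html.encode may cover more (for instance non-ASCII characters).

```javascript
// Hypothetical helper, for illustration only; not part of the Html module.
// The ampersand is replaced first so that entities produced by the later
// replacements are not encoded a second time.
function escapeMarkup(s) {
  return String(s)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

var escaped = escapeMarkup("1 < 2");
// escaped is now "1 &lt; 2"
```

Inside a flow, prefer Html.encode itself; the sketch only shows why the result is safe to embed in markup.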

Html.decode

Decodes an HTML-encoded string. HTML5 named entities are also decoded.

js
var decoded = Html.decode("1 &lt; 2 = &angst;");
// decoded is now "1 < 2 = Å"

HtmlDoc

The HtmlDoc object returned from Html.load and Html.loadFrom offers two primary ways to query and extract information from the HTML document it represents: running an XPath query, or converting the HTML to JSON.

XPath

The xpath method queries the HtmlDoc with a given XPath query. All innerText values are HTML-decoded strings.

javascript
var d = Html.load("<html><body>Hello</body></html>");
var body = d.xpath("//body");
Debug.showDialog(body.innerText); // shows "Hello"

Converting to json

Converting the html to json is done with the .json() method. Each node in the resulting tree of objects has the following properties:

  • attrs is an object containing the attributes of the html node
  • children is an array of child json nodes
  • innerText is a textual representation of the contents of the node (html decoded)
  • tagName is the name of the original html node

Each node also has xpath, querySelector and querySelectorAll methods, which query the subtree of that json node in the same way as on the HtmlDoc object.

javascript
var d = Html.load("<html><body>Hello</body></html>");
var json = d.json();
Debug.showDialog(json.tagName);
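The node shape described above lends itself to a recursive walk. The following is a minimal sketch that uses a hand-built object in place of the result of d.json(), so that it also runs outside Manatee. collectTagNames is a hypothetical helper, not part of the Html module, and the sample tree is invented for illustration.

```javascript
// Hand-built stand-in for the result of d.json(); in a flow the object
// would come from Manatee. Only the documented properties are used.
var node = {
  tagName: "html",
  attrs: {},
  innerText: "Hello",
  children: [
    { tagName: "body", attrs: { "class": "main" }, innerText: "Hello", children: [] }
  ]
};

// Hypothetical helper: depth-first list of the tag names in a subtree.
function collectTagNames(n, out) {
  out = out || [];
  out.push(n.tagName);
  (n.children || []).forEach(function (child) {
    collectTagNames(child, out);
  });
  return out;
}

var names = collectTagNames(node);
// names is now ["html", "body"]
```

The same pattern works for collecting attrs or innerText values, since every node in the tree has the same four properties.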

The json() function takes an optional options object. With includeTextNodes set to true, the result also includes #text nodes.

javascript
var d = Html.load("<html><body>He<br>llo</body>");
var json = d.json({ includeTextNodes: true });

This allows for better reconstruction of the original html using the html() function (for example after modifying the tree).

javascript
var d = Html.load("<html><body>Hello</body>");
var json = d.json();
// Now we get back the original html (if possible)
var html = json.html();

QuerySelectorAll

Use the querySelectorAll method to query the HtmlDoc using CSS selectors.

js
// We'll assume we have a `HtmlDoc` object in `d`
var myClassDivs = d.querySelectorAll("div.myClass");

QuerySelector

The querySelector method works like querySelectorAll, except that it returns only the first hit.
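Because only a single hit is returned, it can be convenient to wrap the lookup in a small helper. The sketch below rests on two assumptions that the text above does not confirm: that a query without a match yields a falsy value such as null, and that a hit exposes innerText. firstText and stubDoc are hypothetical names; the stub stands in for an HtmlDoc so that the example runs outside Manatee.

```javascript
// Hypothetical helper: text of the first match, or a fallback when nothing matches.
// Assumes a miss yields a falsy value and a hit exposes innerText.
function firstText(doc, selector, fallback) {
  var hit = doc.querySelector(selector);
  return hit ? hit.innerText : fallback;
}

// Stub standing in for an HtmlDoc; in a flow, pass the object
// returned by Html.load or Html.loadFrom instead.
var stubDoc = {
  querySelector: function (selector) {
    return selector === "div.myClass" ? { innerText: "Hello" } : null;
  }
};

var found = firstText(stubDoc, "div.myClass", "");
var missing = firstText(stubDoc, "span.other", "(none)");
// found is now "Hello", missing is now "(none)"
```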

Table

The table(...) function extracts JavaScript objects from html tables.

Given the table:

html
<table id="myTable">
  <thead>
    <tr><td>A</td></tr>
  </thead>
  <tbody>
    <tr><td>100</td></tr>
    <tr><td>200</td></tr>
  </tbody>
</table>

We can use the table function as follows:

js
// Assume we have the html already loaded in `d`
var t = d.table("#myTable");
// and now we can query the contents of the table as follows
var firstRowFirstColumn = t[0]["A"];

If the table does not have header information, the function returns a two-dimensional array instead (an array of rows, each an array of cell values).
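The two result shapes can be pictured with plain arrays. This is a sketch of the idea, not Manatee's implementation: rowsToObjects is a hypothetical helper, and representing the cell values as strings is an assumption.

```javascript
// Hypothetical helper: combine header names and row cells the way the
// text describes. With a header, each row becomes an object keyed by
// header name; without one, the rows are returned as an array of arrays.
function rowsToObjects(header, rows) {
  if (!header || header.length === 0) {
    return rows;
  }
  return rows.map(function (cells) {
    var obj = {};
    header.forEach(function (name, i) {
      obj[name] = cells[i];
    });
    return obj;
  });
}

var withHeader = rowsToObjects(["A"], [["100"], ["200"]]);
// withHeader[0]["A"] is "100"

var withoutHeader = rowsToObjects([], [["100"], ["200"]]);
// withoutHeader[1][0] is "200"
```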

We can also pass an object to pinpoint the header and/or the body of the table. This is useful when the header is in one location and the data is somewhere else, which is often the case for scrollable tables.

html
<table id="myTableHeader">
  <thead>
    <tr><th>A</th></tr>
  </thead>
</table>
<table id="myTableBody">
  <tbody>
    <tr><td>100</td></tr>
    <tr><td>200</td></tr>
  </tbody>
</table>

The table can then be extracted as follows:

js
// Assume we have the html already loaded in `d`
var t = d.table(
  {
    headerAt: "#myTableHeader thead tr th",
    rowAt: "#myTableBody tbody tr"
  }
);
// and now we can (again) query the contents of the table as follows
var firstRowFirstColumn = t[0]["A"];

The headerAt selector needs to point out the individual header elements, typically th elements, while the rowAt selector must point out the tr elements in the table.