# Html
The `Html` module can be used to parse and query HTML-formatted files and remote pages. It also contains encoding/decoding helper methods for HTML.
Requires Manatee v2.0 or greater
This version of the Html module cannot be used with Manatee v1.29 or earlier.
# Loading data
The `load` and `loadFrom` methods can be used to load and parse an HTML document. They both return an `HtmlDoc` object which can be used for querying and extracting information.
// Load html from a string
var doc = Html.load("<html><body>Hello, world!</body></html>");
// Load html from a URL
doc = Html.loadFrom("http://sirenia.eu");
# Html.encode
Use this method to encode a string so that Unicode characters, reserved characters such as `<`, and the like are replaced with their HTML-encoded counterparts.
var encoded = Html.encode("1 < 2");
// encoded is now "1 &lt; 2"
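To illustrate the kind of replacement involved, here is a minimal plain-JavaScript sketch. The function `escapeHtmlSketch` is hypothetical and is not how `Html.encode` is implemented. It only handles the five characters that are special in HTML markup, whereas `Html.encode` may replace more (for example non-ASCII characters).

```javascript
// Hypothetical stand-in, NOT the Manatee implementation of Html.encode.
// It only escapes the five characters that are special in HTML markup.
function escapeHtmlSketch(text) {
  var replacements = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&#39;"
  };
  return String(text).replace(/[&<>"']/g, function (ch) {
    return replacements[ch];
  });
}

var encodedSketch = escapeHtmlSketch("1 < 2");
// encodedSketch is now "1 &lt; 2"
```

In a flow you would call `Html.encode` directly; the sketch is only meant to show what "HTML-encoded counterpart" means.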
# Html.decode
Decodes an HTML-encoded string. HTML5 named entities are also included in the decoding.
var decoded = Html.decode("1 &lt; 2 &equals; &Aring;");
// decoded is now "1 < 2 = Å"
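The reverse direction can be sketched the same way. The function `decodeHtmlSketch` and the table `NAMED_ENTITIES_SKETCH` are hypothetical and cover only a handful of named entities; the HTML5 entity table that `Html.decode` is described as supporting is far larger, and numeric entities are left out here.

```javascript
// Hypothetical stand-in, NOT the Manatee implementation of Html.decode.
// Only a few named entities are listed; the HTML5 table is much larger.
var NAMED_ENTITIES_SKETCH = {
  lt: "<",
  gt: ">",
  amp: "&",
  quot: '"',
  equals: "=",
  Aring: "\u00C5"
};

function decodeHtmlSketch(text) {
  return String(text).replace(/&([A-Za-z][A-Za-z0-9]*);/g, function (match, name) {
    var known = Object.prototype.hasOwnProperty.call(NAMED_ENTITIES_SKETCH, name);
    // Unknown entities are left untouched rather than guessed at.
    return known ? NAMED_ENTITIES_SKETCH[name] : match;
  });
}

var decodedSketch = decodeHtmlSketch("1 &lt; 2 &equals; &Aring;");
// decodedSketch is now "1 < 2 = Å"
```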
# Html.btoa
Encode a string to base64.
var encoded = Html.btoa("Hello, world!");
// encoded is now "SGVsbG8sIHdvcmxkIQ=="
# Html.atob
Decode a base64 encoded string.
var decoded = Html.atob("SGVsbG8sIHdvcmxkIQ==");
// decoded is now "Hello, world!"
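`Html.btoa` and `Html.atob` are inverses of each other, which can be checked with a small round-trip helper. The helper `roundTrips` and the stand-in `nodeCodec` are hypothetical: `nodeCodec` uses the Node.js `Buffer` API only so the sketch can run outside Manatee, and it assumes UTF-8 text. In a flow you would pass `Html` itself as the codec.

```javascript
// Hypothetical helper: checks that decoding an encoded value gives the input back.
// `codec` is any object with btoa/atob methods; in a Manatee flow that would be Html.
function roundTrips(codec, text) {
  return codec.atob(codec.btoa(text)) === text;
}

// Stand-in codec so the sketch also runs in Node.js (an assumption, not Manatee code).
var nodeCodec = {
  btoa: function (s) { return Buffer.from(s, "utf8").toString("base64"); },
  atob: function (s) { return Buffer.from(s, "base64").toString("utf8"); }
};

var roundTripOk = roundTrips(nodeCodec, "Hello, world!");
// roundTripOk is now true
```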
# HtmlDoc
The `HtmlDoc` object returned from `Html.load` and `Html.loadFrom` has two primary methods for querying and extracting information from the HTML document it represents: the first is an XPath query and the second is converting the HTML to JSON.
# XPath
The `xpath` method can be used to query the `HtmlDoc` with a given XPath query. All `innerText` values are HTML-decoded strings.
var d = Html.load("<html><body>Hello</body>");
var body = d.xpath("//body");
Debug.showDialog(body.innerText); // shows "Hello"
# Converting to json
Converting the HTML to JSON is done with the `.json()` method. Each node in the resulting tree of objects has the following properties:
- `attrs` is an object containing the attributes of the HTML node
- `children` is an array of child JSON nodes
- `innerText` is a textual representation of the contents of the node (HTML-decoded)
- `tagName` is the name of the original HTML node
Each node also has `xpath`, `querySelector` and `querySelectorAll` methods, which can be used to query the subtree of the JSON node in the same way as for the `HtmlDoc` object.
var d = Html.load("<html><body>Hello</body>");
var json = d.json();
Debug.showDialog(json.tagName);
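Because every node exposes `tagName` and `children`, the tree can be walked with ordinary recursion. The sketch below is a hypothetical helper, `collectTagNames`, run against a hand-built object (`sampleNode`) that mirrors the node shape described above. In a flow the node would come from `d.json()`; the exact contents of `sampleNode` are an assumption made for illustration.

```javascript
// Hypothetical helper: collects tag names depth-first from a JSON node tree.
function collectTagNames(node, out) {
  out = out || [];
  out.push(node.tagName);
  (node.children || []).forEach(function (child) {
    collectTagNames(child, out);
  });
  return out;
}

// Hand-built stand-in for the result of d.json(); the values are illustrative only.
var sampleNode = {
  tagName: "html",
  attrs: {},
  innerText: "Hello",
  children: [
    { tagName: "body", attrs: {}, innerText: "Hello", children: [] }
  ]
};

var tagNames = collectTagNames(sampleNode);
// tagNames is now ["html", "body"]
```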
The `json()` function can also include `#text` nodes.
var d = Html.load("<html><body>He<br>llo</body>");
var json = d.json({ includeTextNodes: true });
Including text nodes allows for better reconstruction of the original HTML using the `html()` function (perhaps after modifying the nodes).
var d = Html.load("<html><body>Hello</body>");
var json = d.json();
// Now we get back the original html (if possible)
var html = json.html();
# QuerySelectorAll
Use the `querySelectorAll` method to query the `HtmlDoc` using CSS selectors.
// We'll assume we have a `HtmlDoc` object in `d`
var myClassDivs = d.querySelectorAll("div.myClass");
# QuerySelector
The `querySelector` method works like `querySelectorAll`, except that it returns only the first hit.
# Table
The `table(...)` function can be used to extract JavaScript objects from HTML tables.
Given the table:
<table id="myTable">
<thead>
<tr><th>A</th></tr>
</thead>
<tbody>
<tr><td>100</td></tr>
<tr><td>200</td></tr>
</tbody>
</table>
We can use the `table` function as follows:
// Assume we have the html already loaded in `d`
var t = d.table("#myTable");
// and now we can query the contents of the table as follows
var firstRowFirstColumn = t.rows[0]["A"];
If the table does not have header information, the function returns an array of arrays instead.
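The two row shapes can be sketched in plain JavaScript. The function `buildRowsSketch` is hypothetical and is not Manatee's `table()` code. It takes header texts and cell texts directly instead of parsing HTML, and it assumes cell values are strings.

```javascript
// Hypothetical sketch of the two row shapes; not Manatee's table() implementation.
// headers: array of header texts (may be empty); cellRows: array of arrays of cell texts.
function buildRowsSketch(headers, cellRows) {
  if (!headers || headers.length === 0) {
    // No header information: each row stays a plain array of cells.
    return cellRows.map(function (cells) { return cells.slice(); });
  }
  // With headers: each row becomes an object keyed by header text.
  return cellRows.map(function (cells) {
    var row = {};
    headers.forEach(function (header, i) { row[header] = cells[i]; });
    return row;
  });
}

var withHeader = buildRowsSketch(["A"], [["100"], ["200"]]);
// withHeader[0]["A"] is "100"
var withoutHeader = buildRowsSketch([], [["100"], ["200"]]);
// withoutHeader[1][0] is "200"
```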
We can also use an object to pinpoint the header and/or the body of the table. This is useful when the header is in one location while the data is somewhere else, which is often the case for scrollable tables.
<table id="myTableHeader">
<thead>
<tr><th>A</th></tr>
</thead>
</table>
<table id="myTableBody">
<tbody>
<tr><td>100</td></tr>
<tr><td>200</td></tr>
</tbody>
</table>
The table can then be extracted as follows:
// Assume we have the html already loaded in `d`
var t = d.table(
{
tableAt: "#myTableBody",
headerAt: "#myTableHeader thead tr th",
rowAt: "#myTableBody tbody tr",
cellAt: "td"
}
);
// and now we can (again) query the contents of the table as follows
var firstRowFirstColumn = t.rows[0]["A"];
The `headerAt` selector must match the individual header elements, typically `th` elements, while the `rowAt` selector must match the `tr` elements in the table.
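If several pages use the same split layout, the options object can be built from the two element ids. The helper `splitTableOptions` below is hypothetical. It assumes the header lives in a `thead` and the data in a `tbody`, as in the example above, and uses only the option names shown there.

```javascript
// Hypothetical helper: builds the options object for a table whose header and
// body live in two separate <table> elements, identified by their ids.
function splitTableOptions(headerId, bodyId) {
  return {
    tableAt: "#" + bodyId,
    headerAt: "#" + headerId + " thead tr th",
    rowAt: "#" + bodyId + " tbody tr",
    cellAt: "td"
  };
}

var tableOptions = splitTableOptions("myTableHeader", "myTableBody");
// In a flow, with the html loaded in `d`: var t = d.table(tableOptions);
```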