Html
The Html module can be used to parse and query html formatted files and remote pages. It also contains helper methods for encoding and decoding html.
Requires Manatee v1.29 or greater
This version of the Html module cannot be used with Manatee v1.28 or earlier.
Loading data
The methods load and loadFrom can be used to load and parse an html document. They both return an HtmlDoc object, which can be used for querying and extracting information.
// Load html from a string
var doc = Html.load("<html><body>Hello, world!</body></html>");
// Load html from a URL
doc = Html.loadFrom("http://sirenia.eu");
Html.encode
Use this method to encode a string, replacing unicode and other special characters with their html encoded counterparts.
var encoded = Html.encode("1 < 2");
// encoded is now "1 &lt; 2"
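To make the effect of encoding concrete, here is a minimal plain-JavaScript sketch of the general technique. It is not Manatee's implementation: the name encodeHtmlSketch is hypothetical, and the set of characters it replaces (the markup-significant characters plus anything above U+007E, written as numeric character references) is an assumption about what Html.encode covers.

```javascript
// Hypothetical stand-in that illustrates html encoding; NOT the Html.encode implementation.
var NAMED_REPLACEMENTS = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;" };

function encodeHtmlSketch(text) {
  // Match markup-significant characters, surrogate pairs, or any character above U+007E.
  var pattern = /[&<>"]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[^\x00-\x7E]/g;
  return text.replace(pattern, function (match) {
    if (NAMED_REPLACEMENTS.hasOwnProperty(match)) {
      return NAMED_REPLACEMENTS[match];
    }
    // Everything else becomes a decimal numeric character reference.
    return "&#" + match.codePointAt(0) + ";";
  });
}

var encodedSketch = encodeHtmlSketch("1 < 2"); // "1 &lt; 2"
```

The encoded form is what you would place inside html markup so that the text is displayed literally instead of being interpreted as tags.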
Html.decode
Decodes an already html encoded string. html5 named entities are also decoded.
var decoded = Html.decode("1 &lt; 2 = &Aring;");
// decoded is now "1 < 2 = Å"
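The reverse direction can be sketched in the same way. The following is a simplified illustration, not the Html.decode implementation: decodeHtmlSketch and NAMED_ENTITIES are hypothetical names, and the entity table lists only a handful of entries, whereas a real decoder covers the full html5 named entity list.

```javascript
// Hypothetical stand-in that illustrates html decoding; NOT the Html.decode implementation.
// Only a few named entities are listed here for illustration.
var NAMED_ENTITIES = { amp: "&", lt: "<", gt: ">", quot: '"', Aring: "\u00C5", angst: "\u00C5" };

function decodeHtmlSketch(text) {
  var pattern = /&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);/g;
  return text.replace(pattern, function (match, body) {
    if (body.charAt(0) === "#") {
      // Numeric character reference, hexadecimal (&#xC5;) or decimal (&#197;).
      var isHex = body.charAt(1).toLowerCase() === "x";
      var code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      return code <= 0x10FFFF ? String.fromCodePoint(code) : match;
    }
    // Named entity: leave unknown names untouched.
    return NAMED_ENTITIES.hasOwnProperty(body) ? NAMED_ENTITIES[body] : match;
  });
}

var decodedSketch = decodeHtmlSketch("1 &lt; 2 = &Aring;"); // "1 < 2 = Å"
```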
HtmlDoc
The HtmlDoc object returned from Html.load and Html.loadFrom has two primary methods for querying and extracting information from the html document it represents: the first is an XPath query and the second is a conversion of the html to json.
XPath
The xpath method can be used to query the HtmlDoc with a given XPath query. All innerText values are html decoded strings.
var d = Html.load("<html><body>Hello</body>");
var body = d.xpath("//body");
Debug.showDialog(body.innerText); // shows "Hello"
Converting to json
Converting the html to json is done with the .json() method. Each node in the resulting tree of objects has the following properties:
attrs is an object containing the attributes of the html node
children is an array of child json nodes
innerText is a textual representation of the contents of the node (html decoded)
tagName is the name of the original html node
It also has xpath, querySelector and querySelectorAll methods, which can be used to query the subtree of the json node in the same way as for the HtmlDoc object.
var d = Html.load("<html><body>Hello</body>");
var json = d.json();
Debug.showDialog(json.tagName);
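Because every node carries the same four properties, the tree can be walked with ordinary recursion. The sketch below uses a hand-written object in the documented shape instead of a real d.json() result, so it runs without Manatee; the exact values Manatee produces (for example the casing of tagName or the root node) are assumptions, and collectTagNames is a hypothetical helper.

```javascript
// Hand-written object in the documented node shape. The real output of d.json()
// may differ in details, so treat the concrete values as assumptions.
var sampleNode = {
  tagName: "html",
  attrs: {},
  innerText: "Hello",
  children: [
    { tagName: "body", attrs: { "class": "main" }, innerText: "Hello", children: [] }
  ]
};

// Hypothetical helper: depth-first walk that collects every tagName in the subtree.
function collectTagNames(node, acc) {
  acc = acc || [];
  acc.push(node.tagName);
  for (var i = 0; i < node.children.length; i++) {
    collectTagNames(node.children[i], acc);
  }
  return acc;
}

var tagNames = collectTagNames(sampleNode); // ["html", "body"]
```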
The json() function takes an optional options object. With includeTextNodes set to true, the result can also include #text nodes.
var d = Html.load("<html><body>He<br>llo</body>");
var json = d.json({ includeTextNodes: true });
This allows for better reconstruction of the original html using the html() function (perhaps after modifying the json).
var d = Html.load("<html><body>Hello</body>");
var json = d.json();
// Now we get back the original html (if possible)
var html = json.html();
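The following sketch illustrates why text nodes help with reconstruction. It does not call Manatee: the tree is written by hand, and it assumes that text nodes appear as children whose tagName is "#text" and whose innerText holds the text, which the documentation above does not spell out. toHtmlSketch is a hypothetical serializer, not the html() function.

```javascript
// Hand-written tree for <body>He<br>llo</body> as it might look with includeTextNodes: true.
// Assumption: text nodes are children with tagName "#text" and their text in innerText.
var bodyWithText = {
  tagName: "body", attrs: {}, innerText: "Hello",
  children: [
    { tagName: "#text", attrs: {}, innerText: "He", children: [] },
    { tagName: "br", attrs: {}, innerText: "", children: [] },
    { tagName: "#text", attrs: {}, innerText: "llo", children: [] }
  ]
};

// Elements that never have a closing tag.
var VOID_TAGS = { br: true, hr: true, img: true, input: true, meta: true, link: true };

function escapeText(s) {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Hypothetical serializer (NOT the html() function): rebuilds markup from the tree.
function toHtmlSketch(node) {
  if (node.tagName === "#text") {
    return escapeText(node.innerText);
  }
  var html = "<" + node.tagName;
  for (var name in node.attrs) {
    if (node.attrs.hasOwnProperty(name)) {
      html += " " + name + '="' + escapeText(String(node.attrs[name])).replace(/"/g, "&quot;") + '"';
    }
  }
  html += ">";
  if (VOID_TAGS.hasOwnProperty(node.tagName)) {
    return html;
  }
  for (var i = 0; i < node.children.length; i++) {
    html += toHtmlSketch(node.children[i]);
  }
  return html + "</" + node.tagName + ">";
}

var rebuilt = toHtmlSketch(bodyWithText); // "<body>He<br>llo</body>"
```

Under the same assumption, a tree without text nodes would list only the br element as a child of body, so the position of the line break inside the text could not be recovered from the children alone.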
QuerySelectorAll
Use the querySelectorAll method to query the HtmlDoc using CSS selectors.
// We'll assume we have a `HtmlDoc` object in `d`
var myClassDivs = d.querySelectorAll("div.myClass");
QuerySelector
The querySelector method works like querySelectorAll, except that it returns only the first match.
Table
The table(...) function can be used to extract js objects from html tables.
Given the table:
<table id="myTable">
<thead>
<tr><td>A</td></tr>
</thead>
<tbody>
<tr><td>100</td></tr>
<tr><td>200</td></tr>
</tbody>
</table>
We can use the table function as follows:
// Assume we have the html already loaded in `d`
var t = d.table("#myTable");
// and now we can query the contents of the table as follows
var firstRowFirstColumn = t[0]["A"];
If the table does not have header information, the function returns an array of arrays instead.
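The two result shapes can be sketched in plain JavaScript. tableSketch below is a hypothetical helper that works on cell texts that have already been extracted; it only mirrors the documented shape of the result and is not how table(...) is implemented. That cell values come back as strings is also an assumption.

```javascript
// Hypothetical helper mirroring the documented result shape of table(...).
// It takes already-extracted cell texts, not html.
function tableSketch(headers, rows) {
  if (!headers || headers.length === 0) {
    // No header information: return the rows as an array of arrays.
    return rows.map(function (cells) { return cells.slice(); });
  }
  // With headers: one object per row, keyed by header text.
  return rows.map(function (cells) {
    var row = {};
    for (var i = 0; i < headers.length; i++) {
      row[headers[i]] = cells[i];
    }
    return row;
  });
}

var withHeader = tableSketch(["A"], [["100"], ["200"]]);
var firstCellByName = withHeader[0]["A"];   // "100"

var withoutHeader = tableSketch(null, [["100"], ["200"]]);
var firstCellByIndex = withoutHeader[0][0]; // "100"
```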
We can also use an object to pinpoint the header and/or the body of the table. This is useful when the header is in one location while the data is somewhere else, which is often the case for scrollable tables.
<table id="myTableHeader">
<thead>
<tr><th>A</th></tr>
</thead>
</table>
<table id="myTableBody">
<tbody>
<tr><td>100</td></tr>
<tr><td>200</td></tr>
</tbody>
</table>
Now do this:
// Assume we have the html already loaded in `d`
var t = d.table(
{
headerAt: "#myTableHeader thead tr th",
rowAt: "#myTableBody tbody tr"
}
);
// and now we can (again) query the contents of the table as follows
var firstRowFirstColumn = t[0]["A"];
The headerAt selector needs to point out the individual header elements, typically th elements, while the rowAt selector must point out the tr elements in the table.