A pure Ruby HTML5-compliant parser with XPath 1.0 traversal
Gammo provides a pure Ruby HTML5-compliant parser and XPath support for traversing the DOM tree it builds. Its implementation of the HTML5 parsing algorithm conforms to the WHATWG specification. Given an HTML string, Gammo parses it and builds a DOM tree using the tokenization and tree-construction algorithms defined in the WHATWG parsing algorithm. All of this is provided without any external dependencies.
The name Gammo is inspired by Gumbo, but a gammo is a fried tofu fritter made with vegetables.

    require 'gammo'
    require 'open-uri'

    parser = URI.open('https://google.com') { |f| Gammo.new(f.read) }
    document = parser.parse #=> #<Gammo::Node::Document>

    puts document.xpath('//title').first.inner_text #=> 'Google'

`Gammo::Tokenizer` implements the WHATWG tokenization algorithm. You can get tokens in order by calling `Gammo::Tokenizer#next_token`.
Here is a simple example that runs only the tokenizer.

    def dump_for(token)
      puts "data: #{token.data}, class: #{token.class}"
    end

    # Sample input (assumed): a doctype followed by two start tags.
    tokenizer = Gammo::Tokenizer.new('<!doctype html><input type="button"><frameset>')
    dump_for tokenizer.next_token #=> data: html, class: Gammo::Tokenizer::DoctypeToken
    dump_for tokenizer.next_token #=> data: input, class: Gammo::Tokenizer::StartTagToken
    dump_for tokenizer.next_token #=> data: frameset, class: Gammo::Tokenizer::StartTagToken
    dump_for tokenizer.next_token #=> data: end of string, class: Gammo::Tokenizer::ErrorToken

The parser described below depends on this tokenizer: it applies the WHATWG parsing algorithm, in order, to the tokens that the tokenizer extracts.
The tokens generated by the tokenizer will be categorized into one of the following types:
Token type | Description |
---|---|
`Gammo::Tokenizer::ErrorToken` | Represents an error token; it usually means end-of-string. |
`Gammo::Tokenizer::TextToken` | Represents a text token like "foo", which is the inner text of an element. |
`Gammo::Tokenizer::StartTagToken` | Represents a start tag token like `<a>`. |
`Gammo::Tokenizer::EndTagToken` | Represents an end tag token like `</a>`. |
`Gammo::Tokenizer::SelfClosingTagToken` | Represents a self-closing tag token like `<img />`. |
`Gammo::Tokenizer::CommentToken` | Represents a comment token like `<!-- comment -->`. |
`Gammo::Tokenizer::DoctypeToken` | Represents a doctype token like `<!doctype html>`. |
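To make the token categories above concrete, here is a deliberately tiny stand-in tokenizer built on Ruby's `StringScanner`. It is not Gammo's implementation and covers only a sliver of the WHATWG algorithm; `ToyToken` and `toy_tokens` are hypothetical names used purely for illustration.

```ruby
require 'strscan'

# Hypothetical stand-in for a token: a kind (mirroring the table above) and its data.
ToyToken = Struct.new(:kind, :data)

# Splits a small, well-behaved HTML string into tokens.
# A simplified sketch, not the WHATWG tokenization algorithm.
def toy_tokens(html)
  scanner = StringScanner.new(html)
  tokens = []
  until scanner.eos?
    if scanner.scan(/<!--(.*?)-->/m)
      tokens << ToyToken.new(:comment, scanner[1].strip)
    elsif scanner.scan(/<!doctype\s+([^>]*)>/i)
      tokens << ToyToken.new(:doctype, scanner[1])
    elsif scanner.scan(%r{</([a-zA-Z][a-zA-Z0-9]*)\s*>})
      tokens << ToyToken.new(:end_tag, scanner[1].downcase)
    elsif scanner.scan(%r{<([a-zA-Z][a-zA-Z0-9]*)[^>]*?(/?)>})
      kind = scanner[2] == '/' ? :self_closing_tag : :start_tag
      tokens << ToyToken.new(kind, scanner[1].downcase)
    elsif scanner.scan(/[^<]+/)
      tokens << ToyToken.new(:text, scanner.matched)
    else
      # A stray "<" that starts no recognizable construct is kept as text.
      tokens << ToyToken.new(:text, scanner.getch)
    end
  end
  # Like the error token in the table, a final token marks end-of-string.
  tokens << ToyToken.new(:error, 'end of string')
end

toy_tokens('<p>foo</p>').each { |token| puts "#{token.kind}: #{token.data}" }
```

The branches are ordered from most to least specific (comment, doctype, end tag, start tag, text) so that `<!--` and `<!doctype` are never mistaken for ordinary tags.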
`Gammo::Parser` implements the tree-construction stage on top of the tokenization described above.
After a successful parse, the parser exposes the root document through its `document` accessor (the same object that `Gammo::Parser#parse` returns). From `document`, you can traverse the DOM tree constructed by the parser.

    require 'gammo'
    require 'pp'

    document = Gammo.new('').parse

    def dump_for(node, strm)
      strm << node.to_h
      return unless node && (child = node.first_child)
      while child
        dump_for(child, (strm.last[:children] ||= []))
        child = child.next_sibling
      end
      strm
    end

    pp dump_for(document, [])

Currently, it is not possible to traverse the DOM tree with CSS selectors as in Nokogiri, although Gammo plans to implement this in the future. Traversal with XPath 1.0 is available on an experimental basis.
The nodes generated by the parser will be categorized into one of the following types:
Node type | Description |
---|---|
`Gammo::Node::Error` | Represents an error node; it usually means end-of-string. |
`Gammo::Node::Text` | Represents a text node like "foo", which is the inner text of an element. |
`Gammo::Node::Document` | Represents the root document type. It's always returned by `Gammo::Parser#document`. |
`Gammo::Node::Element` | Represents any HTML element, like `<p>`. |
`Gammo::Node::Comment` | Represents a comment like `<!-- foo -->`. |
`Gammo::Node::Doctype` | Represents a doctype like `<!doctype html>`. |
Some nodes, such as `Gammo::Node::Element` and `Gammo::Node::Document`, hold pointers to related nodes, which are accessible through methods such as `Gammo::Node#next_sibling` and `Gammo::Node#first_child`. In addition, APIs such as `Gammo::Node#append_child` and `Gammo::Node#remove_child`, which perform the operations defined in the DOM Living Standard, are also provided.
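The pointer structure described above can be sketched with a small stand-in class. This is not Gammo's node implementation: `ToyNode` is a hypothetical name, and the `last_child` and `previous_sibling` pointers are assumptions added so that appending and removing stay simple.

```ruby
# Hypothetical stand-in node; Gammo's real node classes are richer than this.
class ToyNode
  attr_reader :name
  attr_accessor :parent, :first_child, :last_child, :previous_sibling, :next_sibling

  def initialize(name)
    @name = name
  end

  # Appends child as the last child, wiring up parent and sibling pointers.
  def append_child(child)
    raise ArgumentError, 'node already has a parent' if child.parent

    child.parent = self
    child.previous_sibling = last_child
    if last_child
      last_child.next_sibling = child
    else
      self.first_child = child
    end
    self.last_child = child
  end

  # Detaches child, reconnecting the neighbours on either side of it.
  def remove_child(child)
    raise ArgumentError, 'not a child of this node' unless child.parent.equal?(self)

    if child.previous_sibling
      child.previous_sibling.next_sibling = child.next_sibling
    else
      self.first_child = child.next_sibling
    end
    if child.next_sibling
      child.next_sibling.previous_sibling = child.previous_sibling
    else
      self.last_child = child.previous_sibling
    end
    child.parent = child.previous_sibling = child.next_sibling = nil
    child
  end

  # Yields each child by following first_child, then next_sibling.
  def each_child
    return enum_for(:each_child) unless block_given?

    child = first_child
    while child
      following = child.next_sibling # read first so the block may remove child
      yield child
      child = following
    end
  end
end

list = ToyNode.new('ul')
items = %w[a b c].map { |name| ToyNode.new(name) }
items.each { |item| list.append_child(item) }
list.remove_child(items[1])
puts list.each_child.map(&:name).inspect
```

Walking children with `first_child` and `next_sibling` is the same loop shape used by the `dump_for` example earlier in this document.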
Currently, XPath 1.0 is the only way to traverse the DOM tree built by Gammo. CSS selector support is also planned, but there is no ETA.
Gammo has its own lexer and parser for XPath 1.0, exposed as a helper on the DOM tree it builds. Here is a simple example:

    # Sample input (assumed): a document containing one button input.
    document = Gammo.new('<!doctype html><input type="button">').parse
    node_set = document.xpath('//input[@type="button"]') #=> "<Gammo::XPath::NodeSet>"

    node_set.length #=> 1
    node_set.first #=> "<Gammo::Node::Element>"

Since this is implemented from scratch, Gammo provides XPath support as a highly experimental feature. Please file an issue if you find bugs.
Before going into the details of XPath support, let's have a look at a few simple examples. Given a sample HTML text and its DOM tree:
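
    # Assumed markup: only the two text lines survive from the original sample;
    # the tags and list items are filled in so the examples below have matches.
    document = Gammo.new(<<-EOS).parse
    <h1>namusyaka.com</h1>
    <p>Here is a sample web site.</p>
    <ul id="links">
      <li>first link</li>
      <li>second link</li>
    </ul>
    EOS
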
The following XPath expression gets all `li` elements and prints their text contents:

    document.xpath('//li').each do |elm|
      puts elm.inner_text
    end

The following XPath expression gets all `li` elements under the `ul` element that has the `id=links` attribute:

    document.xpath('//ul[@id="links"]/li').each do |elm|
      puts elm.inner_text
    end

The following XPath expression gets each text node of each `li` element under the `ul` element that has the `id=links` attribute:

    document.xpath('//ul[@id="links"]/li/text()').each do |elm|
      puts elm.data
    end

In Gammo, the axis specifier indicates the direction of navigation within the DOM tree. Here is the list of axes; as the table shows, Gammo supports all of them.
Full Syntax | Abbreviated Syntax | Supported | Notes |
---|---|---|---|
`ancestor` | | yes | |
`ancestor-or-self` | | yes | |
`attribute` | `@` | yes | `@abc` is the alias for `attribute::abc` |
`child` | | yes | `abc` is short for `child::abc` |
`descendant` | | yes | |
`descendant-or-self` | `//` | yes | `//` is the alias for `/descendant-or-self::node()/` |
`following` | | yes | |
`following-sibling` | | yes | |
`namespace` | | yes | |
`parent` | `..` | yes | `..` is the alias for `parent::node()` |
`preceding` | | yes | |
`preceding-sibling` | | yes | |
`self` | `.` | yes | `.` is the alias for `self::node()` |
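The abbreviations in the Notes column can be read as rewrite rules. The following sketch applies them to a location path by plain string manipulation. It is only an illustration of the equivalences, not how Gammo's XPath lexer works; `expand_step` and `expand_path` are hypothetical helpers, and the sketch assumes a path without predicates.

```ruby
# Expands one abbreviated step into its full-syntax form.
def expand_step(step)
  case step
  when '.'  then 'self::node()'
  when '..' then 'parent::node()'
  when /\A@/ then "attribute::#{step.delete_prefix('@')}"
  when /::/ then step # already names an axis explicitly
  else "child::#{step}"
  end
end

# Expands every abbreviation in a predicate-free location path.
def expand_path(path)
  # "//" abbreviates "/descendant-or-self::node()/".
  path = path.gsub('//', '/descendant-or-self::node()/')
  absolute = path.start_with?('/')
  steps = path.split('/').reject(&:empty?).map { |step| expand_step(step) }
  (absolute ? '/' : '') + steps.join('/')
end

puts expand_path('//ul/li/@id')
puts expand_path('../p/text()')
```

Handling predicates such as `[@id="links"]` would need a real tokenizer, because `/` and `@` may also appear inside string literals.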
Node tests consist of specific node names or more general expressions. Although syntax such as `:` should work for specifying a namespace prefix in XPath, Gammo does not support it yet, as it is not a core feature of HTML5.
Full Syntax | Supported | Notes |
---|---|---|
`text()` | yes | Finds a node of type text, e.g. `hello` in `<p>hello <a href="https://hello">world</a></p>` |
`comment()` | yes | Finds a node of type comment, e.g. `<!-- comment -->` |
`node()` | yes | Finds any node at all. |
Also note that `processing-instruction` is not supported, and there is no plan to support it.
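- `/`, `//` and `[]` are used in path expressions.
- `|` forms the union of two node sets.
- `and` and `or` are the boolean operators.
- `+`, `-`, `*`, `div` and `mod` are the numeric operators.
- `=`, `!=`, `<`, `>`, `<=` and `>=` are the comparison operators.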
XPath 1.0 defines four data types (node-set, string, number, boolean) and a set of functions for each. Gammo supports only some of these functions, so check the tables below before relying on one.
Node set functions:

Function Name | Supported | Specification |
---|---|---|
`last()` | yes | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-last |
`position()` | yes | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-position |
`count(node-set)` | yes | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-count |
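In XPath 1.0, `position()` is the 1-based index of the node currently being tested by a predicate, and `last()` is the number of nodes in that context. The sketch below models this with a plain Ruby array standing in for a node set; `filter_with_context` is a hypothetical helper and not part of Gammo.

```ruby
# Keeps the nodes for which the block is truthy. The block receives the node,
# its 1-based context position (position()) and the context size (last()).
def filter_with_context(nodes)
  size = nodes.length
  nodes.select.with_index(1) { |node, position| yield(node, position, size) }
end

items = %w[first second third] # stand-in for the nodes matched by //li

# Models //li[position() = last()]
last_item = filter_with_context(items) { |_node, position, last| position == last }

# Models //li[position() < 3]
first_two = filter_with_context(items) { |_node, position, _last| position < 3 }

# Models count(//li)
total = items.length

puts last_item.inspect, first_two.inspect, total
```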
String functions:

Function Name | Supported | Specification |
---|---|---|
`string(object?)` | yes | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-string |
`concat(string, string, string*)` | yes | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-concat |
`starts-with(string, string)` | yes | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-starts-with |
`contains(string, string)` | yes | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-contains |
`substring-before(string, string)` | yes | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-substring-before |
`substring-after(string, string)` | yes | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-substring-after |
`substring(string, number, number?)` | no | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-substring |
`string-length(string?)` | no | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-string-length |
`normalize-space(string?)` | no | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-string-normalize-space |
`translate(string, string, string)` | no | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-string-translate |
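For readers unfamiliar with the XPath 1.0 definitions, the string functions marked "yes" above behave roughly like the following plain-Ruby models. This is a sketch of the semantics given in the recommendation, not Gammo's code; the `XPathStringModel` module and its method names are hypothetical.

```ruby
# Plain-Ruby models of some XPath 1.0 string functions.
module XPathStringModel
  module_function

  # concat takes two or more strings and joins them.
  def concat(first, second, *rest)
    [first, second, *rest].join
  end

  def starts_with(string, prefix)
    string.start_with?(prefix)
  end

  def contains(string, part)
    string.include?(part)
  end

  # Text before the first occurrence of separator, or "" if it does not occur.
  def substring_before(string, separator)
    string.include?(separator) ? string.partition(separator).first : ''
  end

  # Text after the first occurrence of separator, or "" if it does not occur.
  def substring_after(string, separator)
    string.partition(separator).last
  end
end

puts XPathStringModel.substring_before('1999/04/01', '/')
puts XPathStringModel.substring_after('1999/04/01', '/')
```

The date string is the example used in the recommendation itself, which gives `1999` and `04/01` as the results.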
Boolean functions:

Function Name | Supported | Specification |
---|---|---|
`boolean(object)` | yes | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-boolean |
`not(boolean)` | yes | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-not |
`true()` | yes | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-true |
`false()` | yes | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-false |
`lang(string)` | no | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-lang |
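The conversion rules behind `boolean()` are easy to get wrong, especially for strings. The following sketch models them in plain Ruby, using an `Array` as a stand-in for a node set. The helper names `xpath_boolean` and `xpath_not` are hypothetical, and this is a reading of the XPath 1.0 rules rather than Gammo's implementation.

```ruby
# Models XPath 1.0's boolean() conversion for Ruby stand-ins of the four types.
def xpath_boolean(object)
  case object
  when true, false then object
  # A number is true unless it is zero (positive or negative) or NaN.
  when Numeric then !(object.zero? || (object.respond_to?(:nan?) && object.nan?))
  # A string is true unless it is empty, so even "false" converts to true.
  when String then !object.empty?
  # A node set is true unless it is empty.
  when Array then !object.empty?
  else raise ArgumentError, "unsupported type: #{object.class}"
  end
end

# not() negates its argument after conversion to boolean.
def xpath_not(object)
  !xpath_boolean(object)
end

puts xpath_boolean('false')
puts xpath_boolean([])
```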
Number functions:

Function Name | Supported | Specification |
---|---|---|
`number(object?)` | no | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-number |
`sum(node-set)` | no | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-sum |
`floor(number)` | no | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-floor |
`ceiling(number)` | yes | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-ceiling |
`round(number)` | no | https://www.w3.org/TR/1999/REC-xpath-19991116/#function-round |
CSS selector support: TBD.
Gammo does not prioritize performance. It is therefore not suitable for performance-sensitive applications (e.g. parsing synchronously while handling an incoming request from an end user). Instead, the goal is to work well in batch processing such as crawlers. Gammo's highest priority is making HTML easy to parse without depending on native extensions or external gems.
Gammo was developed with reference to the following software.
The gem is available as open source under the terms of the MIT License.