Gammo - A pure-Ruby HTML5 parser


Gammo provides a pure Ruby HTML5-compliant parser, plus XPath support for traversing the DOM tree it builds. Its implementation of the HTML5 parsing algorithm conforms to the WHATWG specification. Given an HTML string, Gammo parses it and builds a DOM tree using the tokenization and tree-construction algorithms defined in the WHATWG parsing algorithm. All of this is provided without any external dependencies.

The name Gammo is inspired by Gumbo, but gammo itself is a fried tofu fritter made with vegetables.

require 'gammo'
require 'open-uri'

parser = URI.open('https://google.com') { |f| Gammo.new(f.read) }
document = parser.parse #=> #<Gammo::Node::Document>

puts document.xpath('//title').first.inner_text #=> 'Google'

Overview

Features

Tokenization

Gammo::Tokenizer implements the tokenization algorithm in WHATWG. You can get tokens in order by calling Gammo::Tokenizer#next_token.

Here is a simple example that runs only the tokenizer.

require 'gammo'

# Prints the data and class of a single token.
def dump_for(token)
  puts "data: #{token.data}, class: #{token.class}"
end

tokenizer = Gammo::Tokenizer.new('<!doctype html><input type="button"><frameset>')
dump_for tokenizer.next_token #=> data: html, class: Gammo::Tokenizer::DoctypeToken
dump_for tokenizer.next_token #=> data: input, class: Gammo::Tokenizer::StartTagToken
dump_for tokenizer.next_token #=> data: frameset, class: Gammo::Tokenizer::StartTagToken
dump_for tokenizer.next_token #=> data: end of string, class: Gammo::Tokenizer::ErrorToken

The parser described below depends on this tokenizer: it applies the WHATWG parsing algorithm to the tokens, in order, as the tokenizer extracts them.
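
To make the pull-based flow concrete, here is a minimal sketch of a tokenizer with a next_token-style interface. ToyToken, ToyTokenizer, and collect_toy_tokens are hypothetical names invented for this illustration. The sketch recognizes only doctypes, simple start and end tags, and text. It is not Gammo's implementation and does not follow the WHATWG state machine.

```ruby
require 'strscan'

# Hypothetical toy token: kind is a symbol, data is the tag name or text.
ToyToken = Struct.new(:kind, :data)

# A deliberately tiny pull tokenizer. It is NOT Gammo::Tokenizer; it only
# illustrates the "call next_token until the error token" pattern.
class ToyTokenizer
  def initialize(html)
    @scanner = StringScanner.new(html)
  end

  # Returns the next token, or an :error token once the input is exhausted.
  def next_token
    return ToyToken.new(:error, 'end of string') if @scanner.eos?

    if @scanner.scan(/<!doctype\s+([^>\s]+)[^>]*>/i)
      ToyToken.new(:doctype, @scanner[1].downcase)
    elsif @scanner.scan(%r{</([a-zA-Z][^\s>]*)[^>]*>})
      ToyToken.new(:end_tag, @scanner[1].downcase)
    elsif @scanner.scan(/<([a-zA-Z][^\s>\/]*)[^>]*>/)
      ToyToken.new(:start_tag, @scanner[1].downcase)
    else
      # Everything up to the next "<" is text; a stray "<" is kept as text.
      ToyToken.new(:text, @scanner.scan(/<?[^<]*/))
    end
  end
end

# A consumer (such as a tree builder) pulls tokens until the error token.
def collect_toy_tokens(html)
  tokenizer = ToyTokenizer.new(html)
  tokens = []
  loop do
    token = tokenizer.next_token
    break if token.kind == :error
    tokens << token
  end
  tokens
end

collect_toy_tokens('<!doctype html><p>hi</p>').each do |token|
  puts "#{token.kind}: #{token.data}"
end
```

The consumer never sees the whole input at once. It only asks for the next token, which is the same shape of interaction the parser has with the tokenizer.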

Token types

The tokens generated by the tokenizer will be categorized into one of the following types:

Token type                             Description
Gammo::Tokenizer::ErrorToken           Represents an error token; it usually means end-of-string.
Gammo::Tokenizer::TextToken            Represents a text token like "foo", which is the inner text of an element.
Gammo::Tokenizer::StartTagToken        Represents a start tag token like <a>.
Gammo::Tokenizer::EndTagToken          Represents an end tag token like </a>.
Gammo::Tokenizer::SelfClosingTagToken  Represents a self-closing tag token like <img />.
Gammo::Tokenizer::CommentToken         Represents a comment token like <!-- comment -->.
Gammo::Tokenizer::DoctypeToken         Represents a doctype token like <!doctype html>.

Parsing

Gammo::Parser implements the tree-construction stage, which consumes the tokens produced by the tokenization described above.

After a successful parse, the parser exposes the root document through its document accessor (the same object that Gammo::Parser#parse returns). From the document accessor, you can traverse the DOM tree constructed by the parser.

require 'gammo'
require 'pp'

document = Gammo.new('').parse

# Recursively dumps node and its descendants into strm as nested hashes.
def dump_for(node, strm)
  strm << node.to_h
  # Leaf nodes have nothing more to add; always hand back the accumulator.
  return strm unless (child = node.first_child)
  while child
    dump_for(child, (strm.last[:children] ||= []))
    child = child.next_sibling
  end
  strm
end

pp dump_for(document, [])
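
The same first_child / next_sibling walk can be wrapped in an enumerator, so that callers can use map, select, and similar methods over the tree. The following is a sketch on a hypothetical stand-in node (WalkNode) that models only those two pointers. It assumes nothing about Gammo beyond the two accessors already used in dump_for above.

```ruby
# Hypothetical stand-in for a DOM node; only the two pointers used by the
# traversal are modeled. This is not Gammo::Node.
WalkNode = Struct.new(:name, :first_child, :next_sibling)

# Yields root and every node below it in document order (pre-order).
# Without a block, returns an Enumerator over the same sequence.
def each_node(root, &block)
  return enum_for(:each_node, root) unless block

  yield root
  child = root.first_child
  while child
    each_node(child, &block)
    child = child.next_sibling
  end
end

# Build <html><head></head><body><p></p></body></html> by hand.
p_node = WalkNode.new('p')
body   = WalkNode.new('body', p_node)
head   = WalkNode.new('head', nil, body)
html   = WalkNode.new('html', head)

puts each_node(html).map(&:name).join(' > ')
```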

Notes

Gammo cannot yet traverse the DOM tree with CSS selectors the way Nokogiri can. XPath 1.0 traversal is available as an experimental feature, described under DOM Tree Traversal, and CSS selector support is planned.

Node

The nodes generated by the parser will be categorized into one of the following types:

Node type              Description
Gammo::Node::Error     Represents an error node; it usually means end-of-string.
Gammo::Node::Text      Represents a text node like "foo", which is the inner text of an element.
Gammo::Node::Document  Represents the root document type. It's always returned by Gammo::Parser#document.
Gammo::Node::Element   Represents any HTML element, like <p>.
Gammo::Node::Comment   Represents a comment like <!-- foo -->.
Gammo::Node::Doctype   Represents a doctype like <!doctype html>.

Some nodes, such as Gammo::Node::Element and Gammo::Node::Document, hold pointers to related nodes, available through accessors such as Gammo::Node#next_sibling and Gammo::Node#first_child. In addition, APIs such as Gammo::Node#append_child and Gammo::Node#remove_child, which perform the operations defined in the DOM living standard, are also provided.
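
As an illustration of what such operations have to keep consistent, here is a minimal sketch of append and remove on a sibling-linked tree. LinkedNode is a hypothetical class written for this example, not Gammo::Node. Unlike the DOM standard, which moves an already attached node, this sketch simply rejects one.

```ruby
# Hypothetical toy node. It models only the pointers that append_child and
# remove_child must keep in sync.
class LinkedNode
  attr_reader :name
  attr_accessor :parent, :first_child, :last_child,
                :previous_sibling, :next_sibling

  def initialize(name)
    @name = name
  end

  # Appends child as the last child of this node.
  def append_child(child)
    raise ArgumentError, 'child is already attached' if child.parent

    child.parent = self
    child.previous_sibling = last_child
    if last_child
      last_child.next_sibling = child
    else
      self.first_child = child
    end
    self.last_child = child
    child
  end

  # Detaches child from this node and clears the child's own pointers.
  def remove_child(child)
    raise ArgumentError, 'not a child of this node' unless child.parent.equal?(self)

    if child.previous_sibling
      child.previous_sibling.next_sibling = child.next_sibling
    else
      self.first_child = child.next_sibling
    end
    if child.next_sibling
      child.next_sibling.previous_sibling = child.previous_sibling
    else
      self.last_child = child.previous_sibling
    end
    child.parent = child.previous_sibling = child.next_sibling = nil
    child
  end

  # Names of the children, following next_sibling from first_child.
  def child_names
    names = []
    node = first_child
    while node
      names << node.name
      node = node.next_sibling
    end
    names
  end
end

ul = LinkedNode.new('ul')
ul.append_child(LinkedNode.new('li-a'))
middle = ul.append_child(LinkedNode.new('li-b'))
ul.append_child(LinkedNode.new('li-c'))
ul.remove_child(middle)
p ul.child_names
```

Removing the middle child has to repair two links at once: the next_sibling of the node before it and the previous_sibling of the node after it.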

DOM Tree Traversal

Currently, XPath 1.0 is the only way to traverse the DOM tree built by Gammo. CSS selector support is planned but has no ETA.

XPath 1.0 (experimental)

Gammo has its own lexer and parser for XPath 1.0, exposed as a helper on the DOM tree built by Gammo. Here is a simple example:

document = Gammo.new('<!doctype html><input type="button"><frameset>').parse
node_set = document.xpath('//input[@type="button"]') #=> "<Gammo::XPath::NodeSet>"

node_set.length #=> 1
node_set.first #=> "<Gammo::Node::Element>"

Because the XPath implementation is written entirely from scratch, Gammo offers it as a very experimental feature. Please file an issue if you find bugs.

Example

Before going into the details of XPath support, let's look at a few simple examples. Given a sample HTML text and its DOM tree:

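document = Gammo.new(<<~EOS).parse
  <!DOCTYPE html>
  <html>
  <body>
  <h1>namusyaka.com</h1>
  <p>Here is a sample web site.</p>
  <ul id="links">
    <li>hello</li>
    <li>world</li>
  </ul>
  </body>
  </html>
EOS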

The following XPath expression gets all li elements and prints their text contents:

document.xpath('//li').each do |elm|
  puts elm.inner_text
end

The following XPath expression gets all li elements under the ul element that has the id=links attribute:

document.xpath('//ul[@id="links"]/li').each do |elm|
  puts elm.inner_text
end

The following XPath expression gets the text node of each li element under the ul element that has the id=links attribute:

document.xpath('//ul[@id="links"]/li/text()').each do |elm|
  puts elm.data
end

Axis Specifiers

In Gammo, an axis specifier indicates the navigation direction within the DOM tree. The following table lists the axes; Gammo supports all of them.

Full Syntax         Abbreviated Syntax  Supported  Notes
ancestor                                yes
ancestor-or-self                        yes
attribute           @                   yes        @abc is an alias for attribute::abc
child                                   yes        abc is short for child::abc
descendant                              yes
descendant-or-self  //                  yes        // is an alias for /descendant-or-self::node()/
following                               yes
following-sibling                       yes
namespace                               yes
parent              ..                  yes        .. is an alias for parent::node()
preceding                               yes
preceding-sibling                       yes
self                .                   yes        . is an alias for self::node()
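
To show what a few of these axes select, here is a sketch that computes them on a small hand-built tree. AxisNode and its methods are hypothetical and exist only for this illustration; Gammo's own axis handling lives inside its XPath engine and is not shown here. Reverse axes (ancestor, preceding-sibling) are returned nearest first, matching XPath proximity positions.

```ruby
# Hypothetical toy node used only to illustrate axis semantics.
class AxisNode
  attr_reader :name, :children
  attr_accessor :parent

  def initialize(name, *children)
    @name = name
    @children = children
    children.each { |child| child.parent = self }
  end

  # ancestor:: the parent, the grandparent, and so on up to the root.
  def ancestors
    parent ? [parent, *parent.ancestors] : []
  end

  # descendant:: all nodes below this one, in document order.
  def descendants
    children.flat_map { |child| [child, *child.descendants] }
  end

  # following-sibling:: later children of the same parent.
  def following_siblings
    return [] unless parent

    siblings = parent.children
    siblings.drop(siblings.index { |n| n.equal?(self) } + 1)
  end

  # preceding-sibling:: earlier children of the same parent, nearest first.
  def preceding_siblings
    return [] unless parent

    siblings = parent.children
    siblings.take(siblings.index { |n| n.equal?(self) }).reverse
  end
end

hello = AxisNode.new('li-hello')
world = AxisNode.new('li-world')
ul    = AxisNode.new('ul', hello, world)
body  = AxisNode.new('body', AxisNode.new('h1'), ul)
AxisNode.new('html', AxisNode.new('head'), body)

p world.ancestors.map(&:name)          # ancestor::node()
p body.descendants.map(&:name)         # descendant::node()
p hello.following_siblings.map(&:name) # following-sibling::node()
p world.preceding_siblings.map(&:name) # preceding-sibling::node()
```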

Node Test

Node tests consist of specific node names or more general expressions. Although the : syntax should work for specifying a namespace prefix in XPath, Gammo does not support it yet because namespaces are not a core feature of HTML5.

Full Syntax  Supported  Notes
text()       yes        Finds a node of type text, e.g. hello in <p>hello <a href="https://hello">world</a></p>
comment()    yes        Finds a node of type comment, e.g. <!-- comment -->
node()       yes        Finds any node at all.

Also note that processing-instruction is not supported, and there is no plan to support it.

Operators

  • The /, // and [] operators are used in path expressions.
  • The union operator | forms the union of two node sets.
  • The boolean operators: and, or
  • The arithmetic operators: +, -, *, div and mod
  • The comparison operators: =, !=, <, >, <=, >=
Functions

XPath 1.0 defines four data types (node set, string, number, boolean), along with various functions that operate on them. Gammo supports only some of these functions, so check the tables below before relying on one.

Node Set Functions

Function Name                       Supported  Specification
last()                              yes        https://www.w3.org/TR/1999/REC-xpath-19991116/#function-last
position()                          yes        https://www.w3.org/TR/1999/REC-xpath-19991116/#function-position
count(node-set)                     yes        https://www.w3.org/TR/1999/REC-xpath-19991116/#function-count

String Functions

Function Name                       Supported  Specification
string(object?)                     yes        https://www.w3.org/TR/1999/REC-xpath-19991116/#function-string
concat(string, string, string*)     yes        https://www.w3.org/TR/1999/REC-xpath-19991116/#function-concat
starts-with(string, string)         yes        https://www.w3.org/TR/1999/REC-xpath-19991116/#function-starts-with
contains(string, string)            yes        https://www.w3.org/TR/1999/REC-xpath-19991116/#function-contains
substring-before(string, string)    yes        https://www.w3.org/TR/1999/REC-xpath-19991116/#function-substring-before
substring-after(string, string)     yes        https://www.w3.org/TR/1999/REC-xpath-19991116/#function-substring-after
substring(string, number, number?)  no         https://www.w3.org/TR/1999/REC-xpath-19991116/#function-substring
string-length(string?)              no         https://www.w3.org/TR/1999/REC-xpath-19991116/#function-string-length
normalize-space(string?)            no         https://www.w3.org/TR/1999/REC-xpath-19991116/#function-string-normalize-space
translate(string, string, string)   no         https://www.w3.org/TR/1999/REC-xpath-19991116/#function-string-translate

Boolean Functions

Function Name                       Supported  Specification
boolean(object)                     yes        https://www.w3.org/TR/1999/REC-xpath-19991116/#function-boolean
not(object)                         yes        https://www.w3.org/TR/1999/REC-xpath-19991116/#function-not
true()                              yes        https://www.w3.org/TR/1999/REC-xpath-19991116/#function-true
false()                             yes        https://www.w3.org/TR/1999/REC-xpath-19991116/#function-false
lang()                              no         https://www.w3.org/TR/1999/REC-xpath-19991116/#function-lang

Number Functions

Function Name                       Supported  Specification
number(object?)                     no         https://www.w3.org/TR/1999/REC-xpath-19991116/#function-number
sum(node-set)                       no         https://www.w3.org/TR/1999/REC-xpath-19991116/#function-sum
floor(number)                       no         https://www.w3.org/TR/1999/REC-xpath-19991116/#function-floor
ceiling(number)                     yes        https://www.w3.org/TR/1999/REC-xpath-19991116/#function-ceiling
round(number)                       no         https://www.w3.org/TR/1999/REC-xpath-19991116/#function-round
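
For string functions listed as unsupported, one possible workaround is to select nodes with a supported expression and then apply the equivalent logic in Ruby to the resulting strings (for example, values obtained from inner_text). The sketch below follows the XPath 1.0 definitions of normalize-space, translate, and string-length. The module and method names are hypothetical, and nothing here is part of Gammo.

```ruby
# Plain-Ruby equivalents of three XPath 1.0 string functions.
# The module and method names are hypothetical; nothing here calls Gammo.
module XPathStringHelpers
  module_function

  # normalize-space: strip leading and trailing whitespace and collapse runs
  # of XML whitespace (space, tab, CR, LF) into a single space.
  def normalize_space(str)
    str.gsub(/[ \t\r\n]+/, ' ').gsub(/\A | \z/, '')
  end

  # translate: replace each character found in from with the character at
  # the same index in to. Characters of from that have no counterpart in to
  # are removed. If a character occurs twice in from, the first one wins.
  def translate(str, from, to)
    from_chars = from.chars
    to_chars = to.chars
    str.each_char.map do |ch|
      index = from_chars.index(ch)
      index.nil? ? ch : to_chars[index].to_s
    end.join
  end

  # string-length: the number of characters (not bytes) in the string.
  def string_length(str)
    str.length
  end
end

puts XPathStringHelpers.normalize_space("  hello \n  world\t")
puts XPathStringHelpers.translate('bar', 'abc', 'ABC')
puts XPathStringHelpers.string_length('gammo')
```

Ruby's String#tr is deliberately avoided for translate, because tr gives special meaning to characters such as - and ^ that XPath treats literally.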

CSS Selector

TBD.

Performance

Gammo does not prioritize performance. It is therefore not suitable for very performance-sensitive applications (e.g. parsing synchronously while handling an incoming request from an end user). Instead, the goal is to work well in batch processing such as crawlers. Gammo's highest priority is making HTML easy to parse without depending on native extensions or external gems.

References

Gammo was developed with reference to the following software.

  • x/net/html: I've been working on this package, and it gave me a strong reason to build Gammo.
  • Blink: Blink's tree construction made a great impression on me.
  • html5lib-tests: Gammo relies on this test suite.

License

The gem is available as open source under the terms of the MIT License.

Release History

  • v0.2.0
    • XPath 1.0 support #4
  • v0.1.0
    • Initial Release
