tanakh

easy-scraper

An HTML scraping library focused on ease of use.

In this library, matching patterns are described as HTML DOM trees. You can write patterns intuitively and extract the desired content easily.

Example

use easy_scraper::Pattern;

let doc = r#"
<ul>
    <li>1</li>
    <li>2</li>
    <li>3</li>
</ul>
"#;

let pat = Pattern::new(r#"
<ul>
    <li>{{foo}}</li>
</ul>
"#).unwrap();

let ms = pat.matches(doc);

assert_eq!(ms.len(), 3);
assert_eq!(ms[0]["foo"], "1");
assert_eq!(ms[1]["foo"], "2");
assert_eq!(ms[2]["foo"], "3");

Syntax

DOM Tree

DOM trees are valid patterns. You can write placeholders in DOM trees.

    <ul>
        <li>{{foo}}</li>
    </ul>

A pattern matches if it is a subset of the document.

If the document is:

    <ul>
        <li>1</li>
        <li>2</li>
        <li>3</li>
    </ul>

then these trees are subsets of it:

    <ul>
        <li>1</li>
    </ul>

    <ul>
        <li>2</li>
    </ul>

    <ul>
        <li>3</li>
    </ul>

So the match result is:

[
    { "foo": "1" },
    { "foo": "2" },
    { "foo": "3" },
]

Child

Child nodes are matched against any descendants because of the subset rule.

For example, this pattern

    <div>
        <li>{{id}}</li>
    </div>

matches against this document:

    <div>
        <ul>
            <li>1</li>
        </ul>
    </div>

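Siblings

To avoid useless matches, siblings are restricted to match only consecutive children of the same parent.

For example, this pattern

    <ul>
        <li>{{foo}}</li>
        <li>{{bar}}</li>
    </ul>

does not match this document, because the two list items do not share a parent:

    <ul>
        <li>123</li>
        <div>
            <li>456</li>
        </div>
    </ul>

And for this document,

    <ul>
        <li>1</li>
        <li>2</li>
        <li>3</li>
    </ul>

the match results are: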

    [
        { "foo": "1", "bar": "2" },
        { "foo": "2", "bar": "3" },
    ]
    

{ "foo": "1", "bar": "3" } is not included, because those elements are not consecutive children.

You can allow other nodes between siblings by writing ... in the pattern.

    <ul>
        <li>{{foo}}</li>
        ...
        <li>{{bar}}</li>
    </ul>

The match result for this pattern is:

    [
        { "foo": "1", "bar": "2" },
        { "foo": "1", "bar": "3" },
        { "foo": "2", "bar": "3" },
    ]
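
The difference between the two sibling rules can be modeled on a flat list of children. The following is a minimal sketch in plain Rust, not easy-scraper's implementation: the helper names `consecutive_pairs` and `pairs_with_gap` are hypothetical, and the contents of the `<li>` elements are represented as plain strings.

```rust
/// Pairs of adjacent items. Models the default rule: sibling patterns
/// must match consecutive children of the same parent.
/// (Hypothetical helper, not part of easy-scraper.)
fn consecutive_pairs<'a>(children: &[&'a str]) -> Vec<(&'a str, &'a str)> {
    children.windows(2).map(|w| (w[0], w[1])).collect()
}

/// Pairs (foo, bar) with any number of items in between. Models a
/// `...` written between the two sibling patterns.
/// (Hypothetical helper, not part of easy-scraper.)
fn pairs_with_gap<'a>(children: &[&'a str]) -> Vec<(&'a str, &'a str)> {
    let mut out = Vec::new();
    for i in 0..children.len() {
        for j in (i + 1)..children.len() {
            out.push((children[i], children[j]));
        }
    }
    out
}
```

Under these assumptions, for the children 1, 2, 3 the first function yields the pairs (1, 2) and (2, 3), while the second also yields (1, 3), mirroring the two match results above.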
    

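If you want to match siblings as a subsequence instead of a consecutive run, you can use the subseq pattern.

For this document,

    <table>
        <tr><th>AAA</th><td>aaa</td></tr>
        <tr><th>BBB</th><td>bbb</td></tr>
        <tr><th>CCC</th><td>ccc</td></tr>
        <tr><th>DDD</th><td>ddd</td></tr>
        <tr><th>EEE</th><td>eee</td></tr>
    </table>

this pattern matches:

    <table subseq>
        <tr><th>AAA</th><td>{{a}}</td></tr>
        <tr><th>BBB</th><td>{{b}}</td></tr>
        <tr><th>DDD</th><td>{{d}}</td></tr>
    </table>

The match result is: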

    [
        {
            "a": "aaa",
            "b": "bbb",
            "d": "ddd"
        }
    ]
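
Subsequence matching keeps the order of the pattern rows but allows unmatched rows in between. As a rough illustration in plain Rust (a sketch, not the library's code), the hypothetical function `match_subseq` below treats each table row as a (header, value) pair and returns only the first match it finds.

```rust
/// Returns the values for the headers listed in `pattern` if they occur
/// in `rows` in the same order, with gaps allowed between them.
/// (Hypothetical helper, not part of easy-scraper.)
fn match_subseq<'a>(rows: &[(&str, &'a str)], pattern: &[&str]) -> Option<Vec<&'a str>> {
    let mut values = Vec::new();
    let mut remaining = rows.iter();
    for wanted in pattern {
        // Skip rows until the wanted header appears; fail if none is left.
        let row = remaining.find(|row| row.0 == *wanted)?;
        values.push(row.1);
    }
    Some(values)
}
```

With the five rows AAA to EEE and the headers AAA, BBB, DDD, this sketch returns aaa, bbb, ddd. Asking for DDD followed by AAA fails, because a subsequence must preserve order.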
    

Attribute

You can specify attributes in patterns. An attribute pattern matches when the pattern's attributes are a subset of the document's attributes.

This pattern

    <div class="attr1">
        {{foo}}
    </div>

matches this document:

    <div class="attr1 attr2">
        Hello
    </div>

You can also write placeholders in attributes.

    <a href="{{url}}">{{title}}</a>

The match result for this document

    <a href="https://www.google.com">Google</a>
    <a href="https://www.yahoo.com">Yahoo</a>

is:

    [
        { "url": "https://www.google.com", "title": "Google" },
        { "url": "https://www.yahoo.com", "title": "Yahoo" },
    ]
    

Partial text-node pattern

You can write placeholders at arbitrary positions in a text node.

    <ul>
        <li>A: {{a}}, B: {{b}}</li>
    </ul>

The match result for this document

    <ul>
        <li>A: 1, B: 2</li>
        <li>A: 3, B: 4</li>
        <li>A: 5, B: 6</li>
    </ul>

is:

    [
        { "a": "1",  "b": "2" },
        { "a": "3",  "b": "4" },
        { "a": "5",  "b": "6" },
    ]
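
To make the idea of a partial text-node pattern concrete, here is a small sketch in plain Rust that matches one text string against one template. It is not how easy-scraper is implemented; `match_template` is a hypothetical name, and the sketch assumes that placeholders are separated by literal text and does not ignore whitespace the way the library does.

```rust
use std::collections::BTreeMap;

/// Matches `text` against a template such as "A: {{a}}, B: {{b}}".
/// Returns the placeholder bindings, or None if a literal part differs.
/// (Hypothetical helper, not part of easy-scraper.)
fn match_template(template: &str, text: &str) -> Option<BTreeMap<String, String>> {
    let mut bindings = BTreeMap::new();
    let mut tpl = template;
    let mut rest = text;
    loop {
        match tpl.find("{{") {
            // No placeholder left: the remaining literal must equal the rest.
            None => return if tpl == rest { Some(bindings) } else { None },
            Some(open) => {
                // The literal before the placeholder must prefix the text.
                rest = rest.strip_prefix(&tpl[..open])?;
                let after_open = &tpl[open + 2..];
                let close = after_open.find("}}")?;
                let name = &after_open[..close];
                tpl = &after_open[close + 2..];
                // The placeholder captures text up to the next literal,
                // or up to the end if no literal follows.
                let next_literal = match tpl.find("{{") {
                    Some(i) => &tpl[..i],
                    None => tpl,
                };
                let end = if next_literal.is_empty() {
                    rest.len()
                } else {
                    rest.find(next_literal)?
                };
                bindings.insert(name.to_string(), rest[..end].to_string());
                rest = &rest[end..];
            }
        }
    }
}
```

Matching the template "A: {{a}}, B: {{b}}" against "A: 1, B: 2" binds a to 1 and b to 2, which corresponds to the first entry of the result above.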
    

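You can also write placeholders at partial positions in attribute values.

    <ul>
        <a href="/users/{{userid}}">{{username}}</a>
    </ul>

The match result for this document

    <ul>
        <a href="/users/foo">Foo</a>
        <a href="/users/bar">Bar</a>
        <a href="/users/baz">Baz</a>
    </ul>

is: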

    [
        { "userid": "foo",  "username": "Foo" },
        { "userid": "bar",  "username": "Bar" },
        { "userid": "baz",  "username": "Baz" },
    ]
    

Whole subtree pattern

The pattern {{var:*}} matches a whole subtree and captures it as a string.

    <div>{{body:*}}</div>

The match result for this document

    <div>
        Hello
        <span>hoge</span>
        World
    </div>

is:

    [
        { "body": "Hello<span>hoge</span>World" }
    ]
    

White-space

Whitespace is ignored in almost all positions.

Restrictions

• A whole-subtree pattern must be the only child of its parent node.

This is valid:

    <div>
        {{foo:*}}
    </div>

These are invalid:

    <div>
        hoge {{foo:*}}
    </div>

    <ul>
        <li></li>
        {{foo:*}}
        <li></li>
    </ul>

License: MIT
