easy-scraper

by tanakh

Easy scraping library

MIT License

An HTML scraping library focused on ease of use.

In this library, matching patterns are described as HTML DOM trees. You can write patterns intuitively and extract the desired content easily.

Usage

Add this to your Cargo.toml:

[dependencies]
easy-scraper = "0.1"

Example

use easy_scraper::Pattern;

let pat = Pattern::new(r#"
<ul>
    <li>{{foo}}</li>
</ul>
"#).unwrap();

let ms = pat.matches(r#"

    <ul>
        <li>1</li>
        <li>2</li>
        <li>3</li>
    </ul>

"#);

assert_eq!(ms.len(), 3);
assert_eq!(ms[0]["foo"], "1");
assert_eq!(ms[1]["foo"], "2");
assert_eq!(ms[2]["foo"], "3");

Syntax

DOM Tree

Any DOM tree is a valid pattern. You can write placeholders in DOM trees.

<ul>
    <li>{{foo}}</li>
</ul>

A pattern matches if it is a subset of the document.

If the document is:

<ul>
    <li>1</li>
    <li>2</li>
    <li>3</li>
</ul>

then these trees are subsets of it:

<ul>
    <li>1</li>
</ul>

<ul>
    <li>2</li>
</ul>

<ul>
    <li>3</li>
</ul>

So the match result is:

[
    { "foo": "1" },
    { "foo": "2" },
    { "foo": "3" },
]

Child

Because of the subset rule, a child node in a pattern matches any descendant in the document.

For example, this pattern

<div>
    <li>{{id}}</li>
</div>

matches against this document.

<div>
    <ul>
        <li>1</li>
    </ul>
</div>
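
The descendant rule can be illustrated without the library. The following is a minimal sketch, not easy-scraper's implementation: the `Node` type, the `sample_doc` helper, and the `descendant_texts` function are hypothetical, and the sample tree is an assumed document in which an item sits one level deeper than a direct child.

```rust
/// A minimal element tree, a hypothetical stand-in for a parsed HTML document.
struct Node {
    name: &'static str,
    text: &'static str,
    children: Vec<Node>,
}

/// Build an assumed sample document: <div><ul><li>1</li></ul></div>
fn sample_doc() -> Node {
    Node {
        name: "div",
        text: "",
        children: vec![Node {
            name: "ul",
            text: "",
            children: vec![Node { name: "li", text: "1", children: vec![] }],
        }],
    }
}

/// Collect the text of every descendant of `node` whose tag name is `name`,
/// no matter how deeply it is nested.
fn descendant_texts(node: &Node, name: &str, out: &mut Vec<&'static str>) {
    for child in &node.children {
        if child.name == name {
            out.push(child.text);
        }
        // Recurse so that grandchildren and deeper nodes are also considered.
        descendant_texts(child, name, out);
    }
}

fn main() {
    let mut ids = Vec::new();
    descendant_texts(&sample_doc(), "li", &mut ids);
    println!("{:?}", ids);
}
```

Here the `li` is a grandchild of the `div`, yet it is still found, which mirrors how a pattern child is allowed to bind to a deeper node.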

Siblings

To avoid useless matches, siblings in a pattern match only consecutive children of the same parent.

For example, this pattern

<ul>
    <li>{{foo}}</li>
    <li>{{bar}}</li>
</ul>

does not match this document.

<ul>
    <li>123</li>
    <div>
        <li>456</li>
    </div>
</ul>

And for this document,

<ul>
    <li>1</li>
    <li>2</li>
    <li>3</li>
</ul>

the match results are:

[
    { "foo": "1", "bar": "2" },
    { "foo": "2", "bar": "3" },
]

{ "foo": "1", "bar": "3" } is not included, because those items are not consecutive children.

You can allow other nodes between siblings by writing ... in the pattern.

<ul>
    <li>{{foo}}</li>
    ...
    <li>{{bar}}</li>
</ul>

The match result for this pattern is:

[
    { "foo": "1", "bar": "2" },
    { "foo": "1", "bar": "3" },
    { "foo": "2", "bar": "3" },
]
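
The difference between the two result lists above can be reproduced with a small stand-alone sketch. This is only an illustration of the pairing rule, not how easy-scraper is implemented: it models the three list items as a slice of strings, and the helper names `consecutive_pairs` and `gapped_pairs` are hypothetical.

```rust
/// Pairs of adjacent items: models a pattern with two sibling placeholders.
fn consecutive_pairs<'a>(items: &[&'a str]) -> Vec<(&'a str, &'a str)> {
    items.windows(2).map(|w| (w[0], w[1])).collect()
}

/// Every pair (i, j) with i < j: models the same pattern with `...`
/// written between the two siblings.
fn gapped_pairs<'a>(items: &[&'a str]) -> Vec<(&'a str, &'a str)> {
    let mut out = Vec::new();
    for i in 0..items.len() {
        for j in (i + 1)..items.len() {
            out.push((items[i], items[j]));
        }
    }
    out
}

fn main() {
    let items = ["1", "2", "3"];
    println!("{:?}", consecutive_pairs(&items));
    println!("{:?}", gapped_pairs(&items));
}
```

With the items 1, 2, 3, the first function yields two pairs and the second yields three, matching the result lists shown for the two patterns.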
    

Attribute

You can specify attributes in patterns. An attribute pattern matches when the pattern's attributes are a subset of the document's attributes.

This pattern

{{foo}}

matches this document.

Hello

You can also write placeholders in attributes.

<a href="{{url}}">{{title}}</a>

The match result for

<a href="https://www.google.com">Google</a>
<a href="https://www.yahoo.com">Yahoo</a>

this document is:

[
    { "url": "https://www.google.com", "title": "Google" },
    { "url": "https://www.yahoo.com", "title": "Yahoo" },
]
    

Partial text-node pattern

You can write placeholders at arbitrary positions in a text node.

<ul>
    <li>A: {{a}}, B: {{b}}</li>
</ul>

The match result for

<ul>
    <li>A: 1, B: 2</li>
    <li>A: 3, B: 4</li>
    <li>A: 5, B: 6</li>
</ul>

this document is:

[
    { "a": "1",  "b": "2" },
    { "a": "3",  "b": "4" },
    { "a": "5",  "b": "6" },
]
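
To make the idea of placeholders inside a text node concrete, here is a minimal stand-alone sketch of matching such a template against plain text. It is an assumption-laden illustration, not easy-scraper's code: the `Seg` type and the functions `parse_pattern` and `match_text` are hypothetical, and the sketch binds each placeholder to the shortest text that lets the following literal match.

```rust
use std::collections::BTreeMap;

/// One piece of a text pattern: literal text or a named placeholder.
#[derive(Debug, PartialEq)]
enum Seg {
    Lit(String),
    Var(String),
}

/// Split a pattern such as "A: {{a}}, B: {{b}}" into segments.
fn parse_pattern(pat: &str) -> Vec<Seg> {
    let mut segs = Vec::new();
    let mut rest = pat;
    while let Some(open) = rest.find("{{") {
        // `len` is the offset of "}}" relative to the opening "{{".
        let Some(len) = rest[open..].find("}}") else { break };
        if open > 0 {
            segs.push(Seg::Lit(rest[..open].to_string()));
        }
        segs.push(Seg::Var(rest[open + 2..open + len].trim().to_string()));
        rest = &rest[open + len + 2..];
    }
    if !rest.is_empty() {
        segs.push(Seg::Lit(rest.to_string()));
    }
    segs
}

/// Match `text` against the segments and return the placeholder bindings.
fn match_text(segs: &[Seg], text: &str) -> Option<BTreeMap<String, String>> {
    let mut out = BTreeMap::new();
    let mut rest = text;
    let mut i = 0;
    while i < segs.len() {
        match &segs[i] {
            Seg::Lit(l) => {
                rest = rest.strip_prefix(l.as_str())?;
                i += 1;
            }
            Seg::Var(name) => match segs.get(i + 1) {
                // A placeholder followed by a literal captures up to that literal.
                Some(Seg::Lit(next)) => {
                    let at = rest.find(next.as_str())?;
                    out.insert(name.clone(), rest[..at].to_string());
                    rest = &rest[at + next.len()..];
                    i += 2;
                }
                // Otherwise (last segment, or another placeholder next)
                // the placeholder captures everything that is left.
                _ => {
                    out.insert(name.clone(), rest.to_string());
                    rest = "";
                    i += 1;
                }
            },
        }
    }
    if rest.is_empty() { Some(out) } else { None }
}

fn main() {
    let segs = parse_pattern("A: {{a}}, B: {{b}}");
    for text in ["A: 1, B: 2", "A: 3, B: 4", "A: 5, B: 6"] {
        println!("{:?}", match_text(&segs, text));
    }
}
```

Each of the three texts binds `a` and `b` to the values listed in the match result above; a text that does not start with `A: ` produces no match.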
    

You can also write placeholders in attribute positions.

    
    
    

    Match result for

    
    
    

    this document is:

    [
        { "userid": "foo",  "username": "Foo" },
        { "userid": "bar",  "username": "Bar" },
        { "userid": "baz",  "username": "Baz" },
    ]
    

Whole subtree pattern

The pattern {{var:*}} matches a whole subtree as a string.

{{body:*}}

The match result for

    Hello
    hoge
    World

this document is:

[
    { "body": "HellohogeWorld" }
]
    

White-space

Whitespace is ignored in almost all positions.

Restrictions

• A whole-subtree pattern must be the only child of its parent node.

This is valid:

{{foo:*}}

These are invalid:

hoge {{foo:*}}
• {{foo:*}}
