=======
extruct
=======

.. image:: https://github.com/scrapinghub/extruct/workflows/build/badge.svg?branch=master
   :target: https://github.com/scrapinghub/extruct/actions
   :alt: Build Status

.. image:: https://img.shields.io/codecov/c/github/scrapinghub/extruct/master.svg?maxAge=2592000
   :target: https://codecov.io/gh/scrapinghub/extruct
   :alt: Coverage report

.. image:: https://img.shields.io/pypi/v/extruct.svg
   :target: https://pypi.python.org/pypi/extruct
   :alt: PyPI Version

extruct is a library for extracting embedded metadata from HTML markup.

Currently, extruct supports:

* `W3C's HTML Microdata`_
* `embedded JSON-LD`_
* `Microformat`_ via `mf2py`_
* `Facebook's Open Graph`_
* (experimental) `RDFa`_ via `rdflib`_
* `Dublin Core Metadata (DC-HTML-2003)`_

.. _W3C's HTML Microdata: http://www.w3.org/TR/microdata/
.. _embedded JSON-LD: http://www.w3.org/TR/json-ld/#embedding-json-ld-in-html-documents
.. _RDFa: https://www.w3.org/TR/html-rdfa/
.. _rdflib: https://pypi.python.org/pypi/rdflib/
.. _Microformat: http://microformats.org/wiki/Main_Page
.. _mf2py: https://github.com/microformats/mf2py
.. _Facebook's Open Graph: http://ogp.me/
.. _Dublin Core Metadata (DC-HTML-2003): https://www.dublincore.org/specifications/dublin-core/dcq-html/2003-11-30/

The microdata algorithm is a revisit of `this Scrapinghub blog post`_ showing how to use EXSLT extensions.

.. _this Scrapinghub blog post: http://blog.scrapinghub.com/2014/06/18/extracting-schema-org-microdata-using-scrapy-selectors-and-xpath/

Installation
------------

::

    pip install extruct

Usage
-----

All-in-one extraction
+++++++++++++++++++++

The simplest way to use extruct is to call ``extruct.extract(htmlstring, base_url=base_url)`` with some HTML string and an optional base URL.

Let's try this on a webpage that uses all the supported syntaxes (RDFa with `ogp`_).

First, fetch the HTML using python-requests and then feed the response body to ``extruct``::

    import extruct
    import requests
    import pprint
    from w3lib.html import get_base_url

    pp = pprint.PrettyPrinter(indent=2)
    r = requests.get('https://www.optimizesmart.com/how-to-use-open-graph-protocol/')
    base_url = get_base_url(r.text, r.url)
    data = extruct.extract(r.text, base_url=base_url)

    pp.pprint(data)
    { 'dublincore': [ { 'elements': [ { 'URI': 'http://purl.org/dc/elements/1.1/description',
                                        'content': 'What is Open Graph Protocol '
                                                   'and why you need it? Learn to '
                                                   'implement Open Graph Protocol '
                                                   'for Facebook on your website. '
                                                   'Open Graph Protocol Meta Tags.',
                                        'name': 'description'}],
                        'namespaces': {},
                        'terms': []}],
      'json-ld': [ { '@context': 'https://schema.org',
                     '@id': '#organization',
                     '@type': 'Organization',
                     'logo': 'https://www.optimizesmart.com/wp-content/uploads/2016/03/optimize-smart-Twitter-logo.jpg',
                     'name': 'Optimize Smart',
                     'sameAs': [ 'https://www.facebook.com/optimizesmart/',
                                 'https://uk.linkedin.com/in/analyticsnerd',
                                 'https://www.youtube.com/user/optimizesmart',
                                 'https://twitter.com/analyticsnerd'],
                     'url': 'https://www.optimizesmart.com/'}],
      'microdata': [ { 'properties': {'headline': ''},
                       'type': 'http://schema.org/WPHeader'}],
      'microformat': [ { 'children': [ { 'properties': { 'category': ['specialized-tracking'],
                                                         'name': [ 'Open Graph '
                                                                   'Protocol for '
                                                                   'Facebook '
                                                                   'explained with '
                                                                   'examples\n'
                                                                   '\n'
                                                                   'Specialized '
                                                                   'Tracking\n'
                                                                   '\n'
                                                                   '\n'
                                                                   (...)
                                                                   'Follow '
                                                                   '@analyticsnerd\n'
                                                                   '!function(d,s,id){var '
                                                                   "js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, "
                                                                   "'script', "
                                                                   "'twitter-wjs');"]},
                                         'type': ['h-entry']}],
                         'properties': { 'name': [ 'Open Graph Protocol for '
                                                   'Facebook explained with '
                                                   'examples\n'
                                                   (...)
                                                   'Follow @analyticsnerd\n'
                                                   '!function(d,s,id){var '
                                                   "js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, "
                                                   "'script', 'twitter-wjs');"]},
                         'type': ['h-feed']}],
      'opengraph': [ { 'namespace': {'og': 'http://ogp.me/ns#'},
                       'properties': [ ('og:locale', 'en_US'),
                                       ('og:type', 'article'),
                                       ( 'og:title',
                                         'Open Graph Protocol for Facebook '
                                         'explained with examples'),
                                       ( 'og:description',
                                         'What is Open Graph Protocol and why you '
                                         'need it? Learn to implement Open Graph '
                                         'Protocol for Facebook on your website. '
                                         'Open Graph Protocol Meta Tags.'),
                                       ( 'og:url',
                                         'https://www.optimizesmart.com/how-to-use-open-graph-protocol/'),
                                       ('og:site_name', 'Optimize Smart'),
                                       ('og:updated_time', '2018-03-09T16:26:35+00:00'),
                                       ( 'og:image',
                                         'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg'),
                                       ( 'og:image:secure_url',
                                         'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg')]}],
      'rdfa': [ { '@id': 'https://www.optimizesmart.com/how-to-use-open-graph-protocol/#header',
                  'http://www.w3.org/1999/xhtml/vocab#role': [ { '@id': 'http://www.w3.org/1999/xhtml/vocab#banner'}]},
                { '@id': 'https://www.optimizesmart.com/how-to-use-open-graph-protocol/',
                  'article:modified_time': [ { '@value': '2018-03-09T16:26:35+00:00'}],
                  'article:published_time': [ { '@value': '2010-07-02T18:57:23+00:00'}],
                  'article:publisher': [ { '@value': 'https://www.facebook.com/optimizesmart/'}],
                  'article:section': [{'@value': 'Specialized Tracking'}],
                  'http://ogp.me/ns#description': [ { '@value': 'What is Open '
                                                                'Graph Protocol '
                                                                'and why you need '
                                                                'it? Learn to '
                                                                'implement Open '
                                                                'Graph Protocol '
                                                                'for Facebook on '
                                                                'your website. '
                                                                'Open Graph '
                                                                'Protocol Meta '
                                                                'Tags.'}],
                  'http://ogp.me/ns#image': [ { '@value': 'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg'}],
                  'http://ogp.me/ns#image:secure_url': [ { '@value': 'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg'}],
                  'http://ogp.me/ns#locale': [{'@value': 'en_US'}],
                  'http://ogp.me/ns#site_name': [{'@value': 'Optimize Smart'}],
                  'http://ogp.me/ns#title': [ { '@value': 'Open Graph Protocol for '
                                                          'Facebook explained with '
                                                          'examples'}],
                  'http://ogp.me/ns#type': [{'@value': 'article'}],
                  'http://ogp.me/ns#updated_time': [ { '@value': '2018-03-09T16:26:35+00:00'}],
                  'http://ogp.me/ns#url': [ { '@value': 'https://www.optimizesmart.com/how-to-use-open-graph-protocol/'}],
                  'https://api.w.org/': [ { '@id': 'https://www.optimizesmart.com/wp-json/'}]}]}

Select syntaxes
+++++++++++++++

It is possible to select which syntaxes to extract by passing a list with the desired ones. Valid values: 'microdata', 'json-ld', 'opengraph', 'microformat', 'rdfa' and 'dublincore'. If no list is passed, all syntaxes will be extracted and returned::

    r = requests.get('http://www.songkick.com/artists/236156-elysian-fields')
    base_url = get_base_url(r.text, r.url)
    data = extruct.extract(r.text, base_url, syntaxes=['microdata', 'opengraph', 'rdfa'])

    pp.pprint(data)
    { 'microdata': [],
      'opengraph': [ { 'namespace': { 'concerts': 'http://ogp.me/ns/fb/songkick-concerts#',
                                      'fb': 'http://www.facebook.com/2008/fbml',
                                      'og': 'http://ogp.me/ns#'},
                       'properties': [ ('fb:app_id', '308540029359'),
                                       ('og:site_name', 'Songkick'),
                                       ('og:type', 'songkick-concerts:artist'),
                                       ('og:title', 'Elysian Fields'),
                                       ( 'og:description',
                                         'Find out when Elysian Fields is next '
                                         'playing live near you. List of all '
                                         'Elysian Fields tour dates and concerts.'),
                                       ( 'og:url',
                                         'https://www.songkick.com/artists/236156-elysian-fields'),
                                       ( 'og:image',
                                         'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg')]}],
      'rdfa': [ { '@id': 'https://www.songkick.com/artists/236156-elysian-fields',
                  'al:ios:app_name': [{'@value': 'Songkick Concerts'}],
                  'al:ios:app_store_id': [{'@value': '438690886'}],
                  'al:ios:url': [ { '@value': 'songkick://artists/236156-elysian-fields'}],
                  'http://ogp.me/ns#description': [ { '@value': 'Find out when '
                                                                'Elysian Fields is '
                                                                'next playing live '
                                                                'near you. List of '
                                                                'all Elysian '
                                                                'Fields tour dates '
                                                                'and concerts.'}],
                  'http://ogp.me/ns#image': [ { '@value': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg'}],
                  'http://ogp.me/ns#site_name': [{'@value': 'Songkick'}],
                  'http://ogp.me/ns#title': [{'@value': 'Elysian Fields'}],
                  'http://ogp.me/ns#type': [{'@value': 'songkick-concerts:artist'}],
                  'http://ogp.me/ns#url': [ { '@value': 'https://www.songkick.com/artists/236156-elysian-fields'}],
                  'http://www.facebook.com/2008/fbmlapp_id': [ { '@value': '308540029359'}]}]}

Uniform
+++++++

Another option is to uniform the output of the microformat, opengraph, microdata, dublincore and json-ld syntaxes to the following structure::

    {'@context': 'http://example.com',
     '@type': 'example_type',
     /* all the other properties in keys here */
     }
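For illustration, here is roughly what uniforming amounts to for an Open Graph item (a sketch only; the ``to_uniform`` helper and sample data below are made up for this example, not extruct's internal code):

```python
# Sketch: convert a raw Open Graph item -- a namespace dict plus
# (property, value) tuples -- into the flat uniform structure shown above.
def to_uniform(og_item):
    uniform = {'@context': og_item['namespace']}
    for prop, value in og_item['properties']:
        if prop == 'og:type':
            uniform['@type'] = value  # og:type maps to @type in the uniform shape
        else:
            uniform[prop] = value
    return uniform

raw = {
    'namespace': {'og': 'http://ogp.me/ns#'},
    'properties': [('og:type', 'article'), ('og:title', 'Elysian Fields')],
}
print(to_uniform(raw))
# {'@context': {'og': 'http://ogp.me/ns#'}, '@type': 'article', 'og:title': 'Elysian Fields'}
```

This also shows why uniforming is lossless for opengraph output: every raw property survives, only the container shape changes.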

To do so, set ``uniform=True`` when calling ``extract``; it's ``False`` by default for backward compatibility. Here is the same example as before, but with uniform set to ``True``::

    r = requests.get('http://www.songkick.com/artists/236156-elysian-fields')
    base_url = get_base_url(r.text, r.url)
    data = extruct.extract(r.text, base_url, syntaxes=['microdata', 'opengraph', 'rdfa'], uniform=True)

    pp.pprint(data)
    { 'microdata': [],
      'opengraph': [ { '@context': { 'concerts': 'http://ogp.me/ns/fb/songkick-concerts#',
                                     'fb': 'http://www.facebook.com/2008/fbml',
                                     'og': 'http://ogp.me/ns#'},
                       '@type': 'songkick-concerts:artist',
                       'fb:app_id': '308540029359',
                       'og:description': 'Find out when Elysian Fields is next '
                                         'playing live near you. List of all '
                                         'Elysian Fields tour dates and concerts.',
                       'og:image': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg',
                       'og:site_name': 'Songkick',
                       'og:title': 'Elysian Fields',
                       'og:url': 'https://www.songkick.com/artists/236156-elysian-fields'}],
      'rdfa': [ { '@id': 'https://www.songkick.com/artists/236156-elysian-fields',
                  'al:ios:app_name': [{'@value': 'Songkick Concerts'}],
                  'al:ios:app_store_id': [{'@value': '438690886'}],
                  'al:ios:url': [ { '@value': 'songkick://artists/236156-elysian-fields'}],
                  'http://ogp.me/ns#description': [ { '@value': 'Find out when '
                                                                'Elysian Fields is '
                                                                'next playing live '
                                                                'near you. List of '
                                                                'all Elysian '
                                                                'Fields tour dates '
                                                                'and concerts.'}],
                  'http://ogp.me/ns#image': [ { '@value': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg'}],
                  'http://ogp.me/ns#site_name': [{'@value': 'Songkick'}],
                  'http://ogp.me/ns#title': [{'@value': 'Elysian Fields'}],
                  'http://ogp.me/ns#type': [{'@value': 'songkick-concerts:artist'}],
                  'http://ogp.me/ns#url': [ { '@value': 'https://www.songkick.com/artists/236156-elysian-fields'}],
                  'http://www.facebook.com/2008/fbmlapp_id': [ { '@value': '308540029359'}]}]}

NB: the rdfa output is not uniformed yet.

Returning HTML node
+++++++++++++++++++

It is also possible to get a reference to the HTML node for every extracted metadata item. This feature is currently supported only by the microdata syntax.

To use it, just set the ``return_html_node`` option of the ``extract`` method to ``True``. As a result, an additional key ``htmlNode`` will be included in the result for every item. Each node is of ``lxml.etree.Element`` type::

    r = requests.get('http://www.rugpadcorner.com/shop/no-muv/')
    base_url = get_base_url(r.text, r.url)
    data = extruct.extract(r.text, base_url, syntaxes=['microdata'], return_html_node=True)

    pp.pprint(data)
    { 'microdata': [ { 'htmlNode': <Element ... at 0x...>,
                       'properties': { 'description': 'KEEP RUGS FLAT ON CARPET!\n'
                                                      'Not your thin sticky pad, '
                                                      'No-Muv is truly the best!',
                                       'image': ['', ''],
                                       'name': ['No-Muv', 'No-Muv'],
                                       'offers': [ { 'htmlNode': <Element ... at 0x...>,
                                                     'properties': { 'availability': 'http://schema.org/InStock',
                                                                     'price': 'Price: $45'},
                                                     'type': 'http://schema.org/Offer'},
                                                   { 'htmlNode': <Element ... at 0x...>,
                                                     'properties': { 'availability': 'http://schema.org/InStock',
                                                                     'price': '(Select Size/Shape for Pricing)'},
                                                     'type': 'http://schema.org/Offer'}],
                                       'ratingValue': ['5.00', '5.00']},
                       'type': 'http://schema.org/Product'}]}
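Since the returned nodes are regular lxml elements, you can, for example, serialize one back to markup while debugging to see exactly which tag an item came from. A sketch (the sample ``<div>`` below is a hypothetical stand-in for a real ``htmlNode`` value):

```python
import lxml.etree

# Hypothetical stand-in for item['htmlNode'] from the output above.
node = lxml.etree.fromstring(
    '<div itemscope="" itemtype="http://schema.org/Offer">'
    '<span>Price: $45</span></div>'
)
# Serialize the element back to markup.
markup = lxml.etree.tostring(node).decode()
print(markup)
```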

Single extractors
-----------------

You can also use each extractor individually. See below.

Microdata extraction
++++++++++++++++++++

::

    import pprint
    pp = pprint.PrettyPrinter(indent=2)

    from extruct.w3cmicrodata import MicrodataExtractor

    # example from http://www.w3.org/TR/microdata/#associating-names-with-items
    html = """<!DOCTYPE HTML>
    <html>
     <head>
      <title>Photo gallery</title>
     </head>
     <body>
      <h1>My photos</h1>
      <figure itemscope itemtype="http://n.whatwg.org/work" itemref="licenses">
       <img itemprop="work" src="http://www.example.com/images/house.jpeg" alt="A white house, boarded up, sits in a forest.">
       <figcaption itemprop="title">The house I found.</figcaption>
      </figure>
      <figure itemscope itemtype="http://n.whatwg.org/work" itemref="licenses">
       <img itemprop="work" src="http://www.example.com/images/mailbox.jpeg" alt="Outside the house is a mailbox. It has a leaflet inside.">
       <figcaption itemprop="title">The mailbox.</figcaption>
      </figure>
      <footer>
       <p id="licenses">All images licensed under the <a itemprop="license"
        href="http://www.opensource.org/licenses/mit-license.php">MIT license</a>.</p>
      </footer>
     </body>
    </html>"""

    mde = MicrodataExtractor()
    data = mde.extract(html)
    pp.pprint(data)
    [{'properties': {'license': 'http://www.opensource.org/licenses/mit-license.php',
                     'title': 'The house I found.',
                     'work': 'http://www.example.com/images/house.jpeg'},
      'type': 'http://n.whatwg.org/work'},
     {'properties': {'license': 'http://www.opensource.org/licenses/mit-license.php',
                     'title': 'The mailbox.',
                     'work': 'http://www.example.com/images/mailbox.jpeg'},
      'type': 'http://n.whatwg.org/work'}]

JSON-LD extraction
++++++++++++++++++

::

    import pprint
    pp = pprint.PrettyPrinter(indent=2)

    from extruct.jsonld import JsonLdExtractor

    html = """<!DOCTYPE HTML>
    <html>
     <head>
      <title>Some Person Page</title>
     </head>
     <body>
      <h1>This guys</h1>
      <script type="application/ld+json">
      {
        "@context": "http://schema.org",
        "@type": "Person",
        "name": "John Doe",
        "jobTitle": "Graduate research assistant",
        "affiliation": "University of Dreams",
        "additionalName": "Johnny",
        "url": "http://www.example.com",
        "address": {
          "@type": "PostalAddress",
          "streetAddress": "1234 Peach Drive",
          "addressLocality": "Wonderland",
          "addressRegion": "Georgia"
        }
      }
      </script>
     </body>
    </html>"""

    jslde = JsonLdExtractor()

    data = jslde.extract(html)
    pp.pprint(data)
    [{'@context': 'http://schema.org',
      '@type': 'Person',
      'additionalName': 'Johnny',
      'address': {'@type': 'PostalAddress',
                  'addressLocality': 'Wonderland',
                  'addressRegion': 'Georgia',
                  'streetAddress': '1234 Peach Drive'},
      'affiliation': 'University of Dreams',
      'jobTitle': 'Graduate research assistant',
      'name': 'John Doe',
      'url': 'http://www.example.com'}]
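Conceptually, JSON-LD extraction boils down to collecting ``<script type="application/ld+json">`` blocks and parsing each one as JSON. A stdlib-only sketch of that idea (illustrative only, not extruct's implementation, which also handles edge cases such as HTML comments inside scripts):

```python
import json
from html.parser import HTMLParser

class JsonLdSniffer(HTMLParser):
    """Collect and parse every application/ld+json script in a page."""

    def __init__(self):
        super().__init__()
        self.in_jsonld = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        # Only scripts explicitly typed as JSON-LD count.
        if tag == 'script' and dict(attrs).get('type') == 'application/ld+json':
            self.in_jsonld = True

    def handle_endtag(self, tag):
        if tag == 'script':
            self.in_jsonld = False

    def handle_data(self, data):
        if self.in_jsonld and data.strip():
            self.items.append(json.loads(data))

sniffer = JsonLdSniffer()
sniffer.feed('<html><head><script type="application/ld+json">'
             '{"@type": "Person", "name": "John Doe"}'
             '</script></head></html>')
print(sniffer.items)  # [{'@type': 'Person', 'name': 'John Doe'}]
```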

RDFa extraction (experimental)
++++++++++++++++++++++++++++++

::

    import pprint
    pp = pprint.PrettyPrinter(indent=2)
    from extruct.rdfa import RDFaExtractor  # you can ignore the warning about html5lib not being available
    INFO:rdflib:RDFLib Version: 4.2.1
    /home/paul/.virtualenvs/extruct.wheel.test/lib/python3.5/site-packages/rdflib/plugins/parsers/structureddata.py:30: UserWarning: html5lib not found! RDFa and Microdata parsers will not be available.
      'parsers will not be available.')

    html = """<html>
     <head>
       ...
     </head>
     <body prefix="dc: http://purl.org/dc/terms/ schema: http://schema.org/">
       <div resource="/alice/posts/trouble_with_bob" typeof="schema:BlogPosting">
          <h2 property="dc:title">The trouble with Bob</h2>
          ...
          <h3 property="dc:creator schema:creator" resource="#me">Alice</h3>
          <div property="schema:articleBody">
            <p>The trouble with Bob is that he takes much better photos than I do:</p>
          </div>
         ...
       </div>
     </body>
    </html>
    """

    rdfae = RDFaExtractor()
    pp.pprint(rdfae.extract(html, base_url='http://www.example.com/index.html'))
    [{'@id': 'http://www.example.com/alice/posts/trouble_with_bob',
      '@type': ['http://schema.org/BlogPosting'],
      'http://purl.org/dc/terms/creator': [{'@id': 'http://www.example.com/index.html#me'}],
      'http://purl.org/dc/terms/title': [{'@value': 'The trouble with Bob'}],
      'http://schema.org/articleBody': [{'@value': '\n'
                                                   ' The trouble with Bob '
                                                   'is that he takes much better '
                                                   'photos than I do:\n'
                                                   ' '}],
      'http://schema.org/creator': [{'@id': 'http://www.example.com/index.html#me'}]}]

You'll get a list of expanded JSON-LD nodes.

Open Graph extraction
+++++++++++++++++++++

::

    import pprint
    pp = pprint.PrettyPrinter(indent=2)

    from extruct.opengraph import OpenGraphExtractor

    html = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="https://www.w3.org/1999/xhtml" xmlns:og="https://ogp.me/ns#">
     <head>
      <title>Himanshu's Open Graph Protocol</title>
      <meta property="og:title" content="Himanshu's Open Graph Protocol"/>
      <meta property="og:type" content="article"/>
      <meta property="og:url" content="https://www.eventeducation.com/test.php"/>
      <meta property="og:image" content="https://www.eventeducation.com/images/982336weddingdayandouanth.jpg"/>
      <meta property="og:site_name" content="Event Education"/>
      <meta property="og:description" content="Event Education provides free courses on event planning and management to event professionals worldwide."/>
     </head>
     <body>
     </body>
    </html>"""

    opengraphe = OpenGraphExtractor()
    pp.pprint(opengraphe.extract(html))
    [{"namespace": {"og": "http://ogp.me/ns#"},
      "properties": [["og:title", "Himanshu's Open Graph Protocol"],
                     ["og:type", "article"],
                     ["og:url", "https://www.eventeducation.com/test.php"],
                     ["og:image", "https://www.eventeducation.com/images/982336weddingdayandouanth.jpg"],
                     ["og:site_name", "Event Education"],
                     ["og:description", "Event Education provides free courses on event planning and management to event professionals worldwide."]]}]

Microformat extraction
++++++++++++++++++++++

::

    import pprint
    pp = pprint.PrettyPrinter(indent=2)

    from extruct.microformat import MicroformatExtractor

    html = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="https://www.w3.org/1999/xhtml" xmlns:og="https://ogp.me/ns#">
     <head>
      <title>Himanshu's Open Graph Protocol</title>
      <meta property="og:title" content="Himanshu's Open Graph Protocol"/>
     </head>
     <body>
      <article class="h-entry">
       <h1 class="p-name">Microformats are amazing</h1>
       <p>Published by <a class="p-author h-card" href="http://example.com">W. Developer</a>
          on <time class="dt-published" datetime="2013-06-13 12:00:00">13<sup>th</sup> June 2013</time></p>
       <p class="p-summary">In which I extoll the virtues of using microformats.</p>
       <div class="e-content">
        <p>Blah blah blah</p>
       </div>
      </article>
     </body>
    </html>"""

    microformate = MicroformatExtractor()
    data = microformate.extract(html)
    pp.pprint(data)
    [{"type": ["h-entry"],
      "properties": {"name": ["Microformats are amazing"],
                     "author": [{"type": ["h-card"],
                                 "properties": {"name": ["W. Developer"],
                                                "url": ["http://example.com"]},
                                 "value": "W. Developer"}],
                     "published": ["2013-06-13 12:00:00"],
                     "summary": ["In which I extoll the virtues of using microformats."],
                     "content": [{"html": "\n<p>Blah blah blah</p>\n",
                                  "value": "\nBlah blah blah\n"}]}}]

DublinCore extraction
+++++++++++++++++++++

::

>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>> from extruct.dublincore import DublinCoreExtractor
>>> html = '''<head profile="http://dublincore.org/documents/2008/08/04/dc-html/">
... <title>Expressing Dublin Core in HTML/XHTML meta and link elements</title>
... <link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />
... <link rel="schema.DCTERMS" href="http://purl.org/dc/terms/" />
... <meta name="DC.title" lang="en" content="Expressing Dublin Core
... in HTML/XHTML meta and link elements" />
... <meta name="DC.creator" content="Andy Powell, UKOLN, University of Bath" />
... <meta name="DCTERMS.issued" scheme="DCTERMS.W3CDTF" content="2003-11-01" />
... <meta name="DC.identifier" scheme="DCTERMS.URI"
...       content="http://dublincore.org/documents/dcq-html/" />
... <link rel="DCTERMS.replaces" hreflang="en"
...       href="http://dublincore.org/documents/2000/08/15/dcq-html/" />
... <meta name="DCTERMS.abstract" content="This document describes how
... qualified Dublin Core metadata can be encoded
... in HTML/XHTML  elements" />
... <meta name="DC.format" scheme="DCTERMS.IMT" content="text/html" />
... <meta name="DC.type" scheme="DCTERMS.DCMIType" content="Text" />
... <meta name="DCTERMS.modified" content="2001-07-18" />
... <meta name="DC.Date.modified" content="2001-07-18" />
... </head>'''
>>> dublinlde = DublinCoreExtractor()
>>> data = dublinlde.extract(html)
>>> pp.pprint(data)
[ { 'elements': [ { 'URI': 'http://purl.org/dc/elements/1.1/title',
                    'content': 'Expressing Dublin Core\n'
                               'in HTML/XHTML meta and link elements',
                    'lang': 'en',
                    'name': 'DC.title'},
                  { 'URI': 'http://purl.org/dc/elements/1.1/creator',
                    'content': 'Andy Powell, UKOLN, University of Bath',
                    'name': 'DC.creator'},
                  { 'URI': 'http://purl.org/dc/elements/1.1/identifier',
                    'content': 'http://dublincore.org/documents/dcq-html/',
                    'name': 'DC.identifier',
                    'scheme': 'DCTERMS.URI'},
                  { 'URI': 'http://purl.org/dc/elements/1.1/format',
                    'content': 'text/html',
                    'name': 'DC.format',
                    'scheme': 'DCTERMS.IMT'},
                  { 'URI': 'http://purl.org/dc/elements/1.1/type',
                    'content': 'Text',
                    'name': 'DC.type',
                    'scheme': 'DCTERMS.DCMIType'}],
    'namespaces': { 'DC': 'http://purl.org/dc/elements/1.1/',
                    'DCTERMS': 'http://purl.org/dc/terms/'},
    'terms': [ { 'URI': 'http://purl.org/dc/terms/issued',
                 'content': '2003-11-01',
                 'name': 'DCTERMS.issued',
                 'scheme': 'DCTERMS.W3CDTF'},
               { 'URI': 'http://purl.org/dc/terms/abstract',
                 'content': 'This document describes how\n'
                            'qualified Dublin Core metadata can be encoded\n'
                            'in HTML/XHTML  elements',
                 'name': 'DCTERMS.abstract'},
               { 'URI': 'http://purl.org/dc/terms/modified',
                 'content': '2001-07-18',
                 'name': 'DC.Date.modified'},
               { 'URI': 'http://purl.org/dc/terms/modified',
                 'content': '2001-07-18',
                 'name': 'DCTERMS.modified'},
               { 'URI': 'http://purl.org/dc/terms/replaces',
                 'href': 'http://dublincore.org/documents/2000/08/15/dcq-html/',
                 'hreflang': 'en',
                 'rel': 'DCTERMS.replaces'}]}]

Command Line Tool
-----------------

extruct provides a command line tool that allows you to fetch a page and extract the metadata from it directly from the command line.

Dependencies
++++++++++++

The command line tool depends on ``requests``, which is not installed by default when you install extruct. In order to use the command line tool, install extruct with the ``cli`` extra requirements::

    pip install extruct[cli]

Usage
+++++

::

    extruct "http://example.com"

Downloads "http://example.com" and outputs the Microdata, JSON-LD, RDFa, Open Graph and Microformat metadata to ``stdout``.

Supported Parameters
++++++++++++++++++++

By default, the command line tool will try to extract all the supported metadata formats from the page (currently Microdata, JSON-LD, RDFa, Open Graph and Microformat). If you want to restrict the output to just one or a subset of those, you can pass their names as a list via the ``--syntaxes`` argument.

For example, this command extracts only Microdata and JSON-LD metadata from "http://example.com"::

    extruct "http://example.com" --syntaxes microdata json-ld

NB: the syntax names passed must be among these: microdata, json-ld, rdfa, opengraph, microformat.

Development version
-------------------

::

    mkvirtualenv extruct
    pip install -r requirements-dev.txt

Tests
-----

Run tests in the current environment::

    py.test tests

Use tox_ to run tests with different Python versions::

    tox

.. _tox: https://testrun.org/tox/latest/
.. _ogp: https://ogp.me/
