Select syntaxes
+++++++++++++++
You can choose which syntaxes to extract by passing a list of them in the ``syntaxes`` argument. Valid values are 'microdata', 'json-ld', 'opengraph', 'microformat', 'rdfa' and 'dublincore'. If no list is passed, all syntaxes are extracted and returned::

    r = requests.get('http://www.songkick.com/artists/236156-elysian-fields')
    base_url = get_base_url(r.text, r.url)
    data = extruct.extract(r.text, base_url, syntaxes=['microdata', 'opengraph', 'rdfa'])
    pp.pprint(data)
    { 'microdata': [],
      'opengraph': [ { 'namespace': { 'concerts': 'http://ogp.me/ns/fb/songkick-concerts#',
                                      'fb': 'http://www.facebook.com/2008/fbml',
                                      'og': 'http://ogp.me/ns#'},
                       'properties': [ ('fb:app_id', '308540029359'),
                                       ('og:site_name', 'Songkick'),
                                       ('og:type', 'songkick-concerts:artist'),
                                       ('og:title', 'Elysian Fields'),
                                       ( 'og:description',
                                         'Find out when Elysian Fields is next '
                                         'playing live near you. List of all '
                                         'Elysian Fields tour dates and concerts.'),
                                       ( 'og:url',
                                         'https://www.songkick.com/artists/236156-elysian-fields'),
                                       ( 'og:image',
                                         'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg')]}],
      'rdfa': [ { '@id': 'https://www.songkick.com/artists/236156-elysian-fields',
                  'al:ios:app_name': [{'@value': 'Songkick Concerts'}],
                  'al:ios:app_store_id': [{'@value': '438690886'}],
                  'al:ios:url': [ { '@value': 'songkick://artists/236156-elysian-fields'}],
                  'http://ogp.me/ns#description': [ { '@value': 'Find out when '
                                                                'Elysian Fields is '
                                                                'next playing live '
                                                                'near you. List of '
                                                                'all Elysian '
                                                                'Fields tour dates '
                                                                'and concerts.'}],
                  'http://ogp.me/ns#image': [ { '@value': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg'}],
                  'http://ogp.me/ns#site_name': [{'@value': 'Songkick'}],
                  'http://ogp.me/ns#title': [{'@value': 'Elysian Fields'}],
                  'http://ogp.me/ns#type': [{'@value': 'songkick-concerts:artist'}],
                  'http://ogp.me/ns#url': [ { '@value': 'https://www.songkick.com/artists/236156-elysian-fields'}],
                  'http://www.facebook.com/2008/fbmlapp_id': [ { '@value': '308540029359'}]}]}

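
In this output each Open Graph item stores its ``properties`` as a list of ``(name, value)`` tuples rather than a dict, which leaves room for a property such as ``og:image`` to appear more than once. The sketch below is not part of extruct: ``group_properties`` is a hypothetical helper, and the sample item is a shortened copy of the output above. It groups the tuples by name so that values can be looked up directly.

```python
from collections import defaultdict


def group_properties(og_item):
    """Group the (name, value) pairs of one Open Graph item by name."""
    grouped = defaultdict(list)
    for name, value in og_item["properties"]:
        grouped[name].append(value)
    return dict(grouped)


# Shortened copy of the 'opengraph' entry shown above.
og_item = {
    "namespace": {"og": "http://ogp.me/ns#"},
    "properties": [
        ("og:type", "songkick-concerts:artist"),
        ("og:title", "Elysian Fields"),
    ],
}

properties = group_properties(og_item)
print(properties["og:title"])  # ['Elysian Fields']
```

Every value is kept in a list, even when a name occurs only once, so callers do not have to special-case repeated properties.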
Uniform
+++++++
Another option is to normalize the output of the microformat, opengraph, microdata, dublincore and json-ld syntaxes to the following uniform structure::

    {'@context': 'http://example.com',
     '@type': 'example_type',
     /* All other properties as keys here */
     }

To do so, set ``uniform=True`` when calling ``extract``. It is ``False`` by default for backward compatibility. Here is the same example as before, but with ``uniform`` set to ``True``::

    r = requests.get('http://www.songkick.com/artists/236156-elysian-fields')
    base_url = get_base_url(r.text, r.url)
    data = extruct.extract(r.text, base_url, syntaxes=['microdata', 'opengraph', 'rdfa'], uniform=True)
    pp.pprint(data)
    { 'microdata': [],
      'opengraph': [ { '@context': { 'concerts': 'http://ogp.me/ns/fb/songkick-concerts#',
                                     'fb': 'http://www.facebook.com/2008/fbml',
                                     'og': 'http://ogp.me/ns#'},
                       '@type': 'songkick-concerts:artist',
                       'fb:app_id': '308540029359',
                       'og:description': 'Find out when Elysian Fields is next '
                                         'playing live near you. List of all '
                                         'Elysian Fields tour dates and concerts.',
                       'og:image': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg',
                       'og:site_name': 'Songkick',
                       'og:title': 'Elysian Fields',
                       'og:url': 'https://www.songkick.com/artists/236156-elysian-fields'}],
      'rdfa': [ { '@id': 'https://www.songkick.com/artists/236156-elysian-fields',
                  'al:ios:app_name': [{'@value': 'Songkick Concerts'}],
                  'al:ios:app_store_id': [{'@value': '438690886'}],
                  'al:ios:url': [ { '@value': 'songkick://artists/236156-elysian-fields'}],
                  'http://ogp.me/ns#description': [ { '@value': 'Find out when '
                                                                'Elysian Fields is '
                                                                'next playing live '
                                                                'near you. List of '
                                                                'all Elysian '
                                                                'Fields tour dates '
                                                                'and concerts.'}],
                  'http://ogp.me/ns#image': [ { '@value': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg'}],
                  'http://ogp.me/ns#site_name': [{'@value': 'Songkick'}],
                  'http://ogp.me/ns#title': [{'@value': 'Elysian Fields'}],
                  'http://ogp.me/ns#type': [{'@value': 'songkick-concerts:artist'}],
                  'http://ogp.me/ns#url': [ { '@value': 'https://www.songkick.com/artists/236156-elysian-fields'}],
                  'http://www.facebook.com/2008/fbmlapp_id': [ { '@value': '308540029359'}]}]}

NB: the rdfa structure is not uniformed yet.
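
Comparing the two outputs suggests how an Open Graph item is reshaped: ``namespace`` becomes ``@context``, the ``og:type`` property becomes ``@type``, and the remaining properties become top-level keys. The following is a minimal sketch of that mapping under those assumptions, not extruct's actual implementation; ``to_uniform`` is a hypothetical name.

```python
def to_uniform(og_item):
    """Flatten one non-uniform Open Graph item into the uniform layout."""
    uniform = {"@context": og_item["namespace"]}
    for name, value in og_item["properties"]:
        if name == "og:type":
            uniform.setdefault("@type", value)
        else:
            # Sketch-level choice: a repeated name keeps its first value.
            uniform.setdefault(name, value)
    return uniform


# Shortened copy of the non-uniform 'opengraph' entry.
og_item = {
    "namespace": {"og": "http://ogp.me/ns#"},
    "properties": [
        ("og:type", "songkick-concerts:artist"),
        ("og:title", "Elysian Fields"),
    ],
}

uniform_item = to_uniform(og_item)
print(uniform_item["@type"])  # songkick-concerts:artist
```

The flat layout is easier to consume, but it can hold only one value per key, which is why the sketch has to pick a rule for repeated names.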
Returning HTML node
+++++++++++++++++++
It is also possible to get a reference to the HTML node of every extracted metadata item.
This feature is supported only by the microdata syntax.

To use it, set the ``return_html_node`` option of ``extract`` to ``True``.
As a result, an additional key, "nodeHtml", is included in the result for every
item. Each node is of type ``lxml.etree.Element``::

    r = requests.get('http://www.rugpadcorner.com/shop/no-muv/')
    base_url = get_base_url(r.text, r.url)
    data = extruct.extract(r.text, base_url, syntaxes=['microdata'], return_html_node=True)

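
Element objects cannot be serialized by the standard ``json`` module, so results that carry HTML nodes need those nodes removed before they are dumped. A minimal sketch, assuming the key is named "nodeHtml" as stated above (it is a parameter, so another name can be passed); ``strip_nodes`` is a hypothetical helper, the item is made up, and ``object()`` stands in for a real element so that the example runs without lxml or network access.

```python
import json


def strip_nodes(value, key="nodeHtml"):
    """Return a copy of `value` without any `key` entries, at any depth."""
    if isinstance(value, dict):
        return {k: strip_nodes(v, key) for k, v in value.items() if k != key}
    if isinstance(value, list):
        return [strip_nodes(v, key) for v in value]
    return value


# Hypothetical microdata item; object() stands in for an lxml element.
item = {
    "type": "http://schema.org/Product",
    "properties": {"name": "No-Muv"},
    "nodeHtml": object(),
}

serialized = json.dumps(strip_nodes([item]))
print(serialized)
```

The helper returns new containers and leaves the original result untouched, so the node references stay available for further processing.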

    import pprint
    pp = pprint.PrettyPrinter(indent=2)
    from extruct.rdfa import RDFaExtractor  # you can ignore the warning about html5lib not being available
    INFO:rdflib:RDFLib Version: 4.2.1
    /home/paul/.virtualenvs/extruct.wheel.test/lib/python3.5/site-packages/rdflib/plugins/parsers/structureddata.py:30: UserWarning: html5lib not found! RDFa and Microdata parsers will not be available.
      'parsers will not be available.')
    html = """
    ...
    ... ...
    ...
    ...
    ...
    ...
    The trouble with Bob
    ... ...
    ...
    Alice
    ...
    ...
    The trouble with Bob is that he takes much better photos than I do:
    ...
    ... ...
    ...
    ...
    ...
    ... """
    rdfae = RDFaExtractor()
    pp.pprint(rdfae.extract(html, base_url='http://www.example.com/index.html'))
    [{'@id': 'http://www.example.com/alice/posts/trouble_with_bob',
      '@type': ['http://schema.org/BlogPosting'],
      'http://purl.org/dc/terms/creator': [{'@id': 'http://www.example.com/index.html#me'}],
      'http://purl.org/dc/terms/title': [{'@value': 'The trouble with Bob'}],
      'http://schema.org/articleBody': [{'@value': '\n'
                                                   '        The trouble with Bob '
                                                   'is that he takes much better '
                                                   'photos than I do:\n'
                                                   '      '}],
      'http://schema.org/creator': [{'@id': 'http://www.example.com/index.html#me'}]}]

You'll get a list of expanded JSON-LD nodes.
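
Each expanded node maps full property IRIs to lists of dicts that hold either an ``@value`` (a literal) or an ``@id`` (a reference to another node), as the output above shows. The sketch below reads values out of such a node; ``values_of`` is a hypothetical helper, not part of extruct, and the sample node is a shortened copy of that output.

```python
def values_of(node, predicate):
    """Return the literals and IRIs stored under `predicate` in one node."""
    results = []
    for entry in node.get(predicate, []):
        if "@value" in entry:
            results.append(entry["@value"])
        elif "@id" in entry:
            results.append(entry["@id"])
    return results


# Shortened copy of the expanded node shown above.
node = {
    "@type": ["http://schema.org/BlogPosting"],
    "http://purl.org/dc/terms/title": [{"@value": "The trouble with Bob"}],
    "http://schema.org/creator": [{"@id": "http://www.example.com/index.html#me"}],
}

print(values_of(node, "http://purl.org/dc/terms/title"))  # ['The trouble with Bob']
```

A missing predicate yields an empty list, so callers can iterate over the result without checking for the key first.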
Open Graph extraction
+++++++++++++++++++++

::

    import pprint
    pp = pprint.PrettyPrinter(indent=2)
    from extruct.opengraph import OpenGraphExtractor
    html = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    ...
    ...

extruct provides a command line tool that allows you to fetch a page and
extract the metadata from it directly from the command line.
Dependencies
++++++++++++
The command line tool depends on ``requests``, which is not installed by default
when you install extruct. In order to use the command line tool, you can
install extruct with the ``cli`` extra requirements::

    pip install extruct[cli]

Usage
+++++
::

    extruct "http://example.com"

Downloads "http://example.com" and outputs the Microdata, JSON-LD, RDFa, Open Graph
and Microformat metadata to ``stdout``.
Supported Parameters
++++++++++++++++++++
By default, the command line tool will try to extract all the supported
metadata formats from the page (currently Microdata, JSON-LD, RDFa, Open Graph
and Microformat). If you want to restrict the output to just one or a subset of
those, you can pass their names as a list through the ``syntaxes`` argument.
For example, this command extracts only Microdata and JSON-LD metadata from
"http://example.com"::