
Parsing HTML at the command line

5.7K Stars 189 Forks MIT License 102 Commits 67 Opened issues


pup is a command line tool for processing HTML. It reads from stdin, prints to stdout, and allows the user to filter parts of the page using CSS selectors.

Inspired by jq, pup aims to be a fast and flexible way of exploring HTML from the terminal.


Install

Direct downloads are available through the releases page.

If you have Go installed on your computer, just run go get.

go get github.com/ericchiang/pup

If you're on OS X, use Homebrew to install (no Go required).

brew install https://raw.githubusercontent.com/EricChiang/pup/master/pup.rb

Quick start

$ curl -s https://news.ycombinator.com/

Ew, HTML. Let's run that through some pup selectors:

$ curl -s https://news.ycombinator.com/ | pup 'table table tr:nth-last-of-type(n+2) td.title a'

Okay, how about only the links?

$ curl -s https://news.ycombinator.com/ | pup 'table table tr:nth-last-of-type(n+2) td.title a attr{href}'

Even better, let's grab the titles too:

$ curl -s https://news.ycombinator.com/ | pup 'table table tr:nth-last-of-type(n+2) td.title a json{}'

Basic Usage

$ cat index.html | pup [flags] '[selectors] [display function]'
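The pattern above can be tried end to end on a throwaway page. This is a minimal sketch, assuming pup is on your PATH; the file name sample.html and its markup are made up for illustration and do not come from the pup project.

```shell
#!/bin/sh
# Hypothetical sample page (file name and markup are illustrative only).
cat > sample.html <<'EOF'
<html>
<head><title>Sample page</title></head>
<body>
<h1 id="top">Hello</h1>
<a href="/a">first</a>
<a href="/b">second</a>
</body>
</html>
EOF

# Guard so the script degrades gracefully when pup is not installed.
if command -v pup >/dev/null 2>&1; then
  # [flags] = --indent 2, [selectors] = h1#top, no display function:
  # prints the selected element as HTML.
  cat sample.html | pup --indent 2 'h1#top'

  # No flags, [selectors] = a, [display function] = attr{href}:
  # prints one attribute value per selected element.
  cat sample.html | pup 'a attr{href}'
else
  echo "pup not found on PATH" >&2
fi
```

The selectors and the display function share a single quoted argument, which keeps the shell from interpreting characters such as # and { }.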


Examples

Download a webpage with wget.

$ wget http://en.wikipedia.org/wiki/Robots_exclusion_standard -O robots.html

Clean and indent

By default pup will fill in missing tags and properly indent the page.

$ cat robots.html
# nasty looking HTML
$ cat robots.html | pup --color
# cleaned, indented, and colorful HTML

Filter by tag

$ cat robots.html | pup 'title'

<title>
 Robots exclusion standard - Wikipedia, the free encyclopedia
</title>

Filter by id

$ cat robots.html | pup 'span#See_also'

 See also

Filter by attribute

$ cat robots.html | pup 'th[scope="row"]'

 Exclusion standards

 Related marketing topics

 Search marketing related topics

 Search engine spam




Pseudo Classes

CSS selectors have a group of specifiers called "pseudo classes" which are pretty cool. pup implements a majority of the relevant ones.

Here are some examples.

$ cat robots.html | pup 'a[rel]:empty'

$ cat robots.html | pup ':contains("History")'



$ cat robots.html | pup ':parent-of([action="edit"])'

  Edit links

For a complete list, view the implemented selectors section.

+, >, and ,

These are intermediate characters that declare special instructions. For instance, a comma (,) allows pup to specify multiple groups of selectors.

$ cat robots.html | pup 'title, h1 span[dir="auto"]'

 Robots exclusion standard - Wikipedia, the free encyclopedia

 Robots exclusion standard
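To see the three characters side by side, here is a small sketch on a made-up page. It assumes pup is installed and that pup follows the standard CSS meaning of each combinator; the file name combinators.html and its markup are hypothetical.

```shell
#!/bin/sh
# Hypothetical markup, made up to exercise the three combinators.
cat > combinators.html <<'EOF'
<html><body>
<div id="menu">
<p>intro</p>
<ul><li>one</li><li>two</li></ul>
</div>
<h2>Title</h2>
<p>first paragraph after the heading</p>
<p>second paragraph</p>
</body></html>
EOF

if command -v pup >/dev/null 2>&1; then
  # ',' groups selectors: elements matching either h2 or li.
  cat combinators.html | pup 'h2, li'

  # '>' restricts the match to direct children, here the <p>
  # that sits directly inside div#menu (standard CSS semantics).
  cat combinators.html | pup 'div#menu > p'

  # '+' selects the sibling immediately following the left-hand
  # match, here the first <p> after the <h2> (standard CSS semantics).
  cat combinators.html | pup 'h2 + p'
else
  echo "pup not found on PATH" >&2
fi
```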

Chain selectors together

When combining selectors, the HTML nodes selected by the previous selector will be passed to the next ones.

$ cat robots.html | pup 'h1#firstHeading'

Robots exclusion standard

$ cat robots.html | pup 'h1#firstHeading span'

 Robots exclusion standard

Implemented Selectors

For further examples of these selectors head over to MDN.

pup '.class'
pup '#id'
pup 'element'
pup 'selector + selector'
pup 'selector > selector'
pup '[attribute]'
pup '[attribute="value"]'
pup '[attribute*="value"]'
pup '[attribute~="value"]'
pup '[attribute^="value"]'
pup '[attribute$="value"]'
pup ':empty'
pup ':first-child'
pup ':first-of-type'
pup ':last-child'
pup ':last-of-type'
pup ':only-child'
pup ':only-of-type'
pup ':contains("text")'
pup ':nth-child(n)'
pup ':nth-of-type(n)'
pup ':nth-last-child(n)'
pup ':nth-last-of-type(n)'
pup ':not(selector)'
pup ':parent-of(selector)'

You can mix and match selectors as you wish.

cat index.html | pup 'element#id[attribute="value"]:first-of-type'

Display Functions

Non-HTML selectors which affect the output type are implemented as functions which can be provided as a final argument.


text{}

Print all text from selected nodes and children in depth first order.

$ cat robots.html | pup '.mw-headline text{}'
About the standard
Nonstandard extensions
Crawl-delay directive
Allow directive
Universal "*" match
Meta tags and headers
See also
External links
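Because text{} prints plain lines, its output is easy to feed into ordinary shell loops. The sketch below numbers the headings to form a rough outline. The toc function is a hypothetical helper, not part of pup, and the final pipeline assumes both pup and the robots.html file from the earlier wget step are present.

```shell
#!/bin/sh
# toc: read one heading per line on stdin and print a numbered outline.
# Empty lines in the input are skipped.
toc() {
  n=0
  while IFS= read -r line; do
    [ -z "$line" ] && continue
    n=$((n + 1))
    printf '%d. %s\n' "$n" "$line"
  done
}

# Only run the pup pipeline when pup and the sample page are available.
if command -v pup >/dev/null 2>&1 && [ -f robots.html ]; then
  cat robots.html | pup '.mw-headline text{}' | toc
fi
```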


attr{attrkey}

Print the values of all attributes with a given key from all selected nodes.

$ cat robots.html | pup '.catlinks div attr{id}'
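Attribute values such as href often come out root-relative, so a common next step is to turn them into absolute URLs. This is a rough sketch under stated assumptions: absolutize is a hypothetical helper, not a pup feature. It handles only root-relative (/path) and protocol-relative (//host/path) values and passes everything else through untouched.

```shell
#!/bin/sh
# absolutize BASE: read hrefs on stdin (one per line), print absolute URLs.
# BASE is a scheme plus host with no trailing slash, e.g. http://example.org
absolutize() {
  base=$1
  while IFS= read -r href; do
    case $href in
      # Protocol-relative: reuse the scheme taken from BASE.
      //*) printf '%s:%s\n' "${base%%:*}" "$href" ;;
      # Root-relative: prepend BASE.
      /*)  printf '%s%s\n' "$base" "$href" ;;
      # Anything else (absolute URLs, fragments, ...) is left as-is.
      *)   printf '%s\n' "$href" ;;
    esac
  done
}

# Only run the pup pipeline when pup and the sample page are available.
if command -v pup >/dev/null 2>&1 && [ -f robots.html ]; then
  cat robots.html | pup 'a attr{href}' | absolutize http://en.wikipedia.org
fi
```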


json{}

Print HTML as JSON.

$ cat robots.html | pup 'div#p-namespaces a'



$ cat robots.html | pup 'div#p-namespaces a json{}'
[
 {
  "accesskey": "c",
  "href": "/wiki/Robots_exclusion_standard",
  "tag": "a",
  "text": "Article",
  "title": "View the content page [c]"
 },
 {
  "accesskey": "t",
  "href": "/wiki/Talk:Robots_exclusion_standard",
  "tag": "a",
  "text": "Talk",
  "title": "Discussion about the content page [t]"
 }
]

Use the -i / --indent flag to control the indent level.

$ cat robots.html | pup -i 4 'div#p-namespaces a json{}'
[
    {
        "accesskey": "c",
        "href": "/wiki/Robots_exclusion_standard",
        "tag": "a",
        "text": "Article",
        "title": "View the content page [c]"
    },
    {
        "accesskey": "t",
        "href": "/wiki/Talk:Robots_exclusion_standard",
        "tag": "a",
        "text": "Talk",
        "title": "Discussion about the content page [t]"
    }
]

If the selectors return only one element, the results will be printed as a JSON object, not a list.

$ cat robots.html | pup --indent 4 'title json{}'
{
    "tag": "title",
    "text": "Robots exclusion standard - Wikipedia, the free encyclopedia"
}

Because there is no universal standard for converting HTML/XML to JSON, a method has been chosen which hopefully fits. The goal is simply to get the output of pup into a more consumable format.
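Since pup takes its inspiration from jq, the two pair naturally: json{} output can be piped straight into a jq filter. The sketch below assumes jq is installed alongside pup. links_tsv is a hypothetical helper name, and the filter accepts both shapes described above, a list or a single object.

```shell
#!/bin/sh
# links_tsv: read json{} output on stdin and print "text<TAB>href"
# for each element. Handles a JSON list as well as a single object.
links_tsv() {
  jq -r 'if type == "array" then .[] else . end | [.text, .href] | @tsv'
}

# Only run the pipeline when both tools and the sample page are available.
if command -v pup >/dev/null 2>&1 &&
   command -v jq >/dev/null 2>&1 &&
   [ -f robots.html ]; then
  cat robots.html | pup 'div#p-namespaces a json{}' | links_tsv
fi
```

Normalizing the input with the type check means the same filter keeps working whether a selector happens to match one element or several.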



Flags

Run pup --help for a list of further options.
