arno/pup

mirror of https://github.com/ericchiang/pup synced 2025-04-17 21:49:01 +00:00

Parsing HTML at the command line

Go to file

ericchiang 2fb2dc084d woops		2014-10-11 14:03:44 -04:00
selector	negative selectors fixed	2014-09-19 16:37:06 -04:00
tests	tests added	2014-09-14 22:20:05 -04:00
.gitignore	dist removed from history	2014-09-14 16:16:21 -04:00
display.go	woops	2014-10-11 14:03:44 -04:00
LICENSE	license added	2014-09-01 13:36:10 -04:00
main.go	json{} displayer added	2014-10-11 13:06:13 -04:00
printing.go	escape html strings	2014-09-28 12:26:31 -04:00
README.md	json{} displayer added	2014-10-11 13:06:13 -04:00
release.sh	release script added	2014-09-28 17:44:44 -04:00

README.md

pup

pup is a command line tool for processing HTML. It reads from stdin, prints to stdout, and allows the user to filter parts of the page using CSS selectors.

Inspired by jq, pup aims to be a fast and flexible way of exploring HTML from the terminal.

Upcoming talk

I'll be giving a talk on pup on October 15th at 11:00am (PDT) for Codementor! RSVP on the event page and join the hangout to discuss, pup, Go, and command-line tools.

Install

Direct download are available on the releases page.

Or if you can run go get to download via git.

go get github.com/ericchiang/pup

Quick start

$ curl -s https://news.ycombinator.com/

Ew, HTML. Let's run that through some pup selectors:

$ curl -s https://news.ycombinator.com/ | pup 'td.title a[href^=http] attr{href}'

Even better, let's grab the titles too:

$ curl -s https://news.ycombinator.com/ | pup 'td.title a[href^=http] json{}'

Basic Usage

$ cat index.html | pup [flags] [selectors] [optional display function]

$ pup < index.html [flags] [selectors] [optional display function]

Examples

Download a webpage with wget.

$ wget http://en.wikipedia.org/wiki/Robots_exclusion_standard -O robots.html

####Clean and indent

By default pup will fill in missing tags and properly indent the page.

$ cat robots.html
# nasty looking HTML
$ cat robots.html | pup --color
# cleaned, indented, and colorful HTML

####Filter by tag

$ pup < robots.html title
<title>
 Robots exclusion standard - Wikipedia, the free encyclopedia
</title>

####Filter by id

$ pup < robots.html span#See_also
<span class="mw-headline" id="See_also">
 See also
</span>

####Chain selectors together

The following two commands are (somewhat) equivalent.

$ pup < robots.html table.navbox ul a | tail

$ pup < robots.html table.navbox | pup ul | pup a | tail

Both produce the ouput:

</a>
<a href="/wiki/Stop_words" title="Stop words">
 Stop words
</a>
<a href="/wiki/Poison_words" title="Poison words">
 Poison words
</a>
<a href="/wiki/Content_farm" title="Content farm">
 Content farm
</a>

Because pup reconstructs the HTML parse tree, funny things can happen when piping two commands together. I'd recommend chaining commands rather than pipes.

####Limit print level

$ pup < robots.html table -l 2
<table class="metadata plainlinks ambox ambox-content" role="presentation">
 <tbody>
  ...
 </tbody>
</table>
<table style="background:#f9f9f9;font-size:85%;line-height:110%;max-width:175px;">
 <tbody>
  ...
 </tbody>
</table>
<table cellspacing="0" class="navbox" style="border-spacing:0;">
 <tbody>
  ...
 </tbody>
</table>

####Slices

Slices allow you to do simple {start:end:by} operations to limit the number of nodes selected for the next round of selection.

Provide one number for a simple index.

$ pup < robots.html a slice{0}
<a id="top">
</a>

You can provide an end to limit the number of nodes selected.

$ # {:3} is the same as {0:3}
$ pup < robots.html a slice{:3}
<a id="top">
</a>
<a href="#mw-navigation">
 navigation
</a>
<a href="#p-search">
 search
</a>

Implemented Selectors

For further examples of these selectors head over to MDN.

cat index.html | pup .class
# '#' indicates comments at the command line so you have to escape it
cat index.html | pup \#id
cat index.html | pup element
cat index.html | pup [attribute]
cat index.html | pup [attribute=value]
# Probably best to quote enclose wildcards
cat index.html | pup '[attribute*=value]'
cat index.html | pup [attribute~=value]
cat index.html | pup [attribute^=value]
cat index.html | pup [attribute$=value]

You can mix and match selectors as you wish.

cat index.html | pup element#id[attribute=value]

Display Functions

Non-HTML selectors which effect the output type are implemented as functions which can be provided as a final argument.

`text{}`

Print all text from selected nodes and children in depth first order.

$ cat robots.html | pup .mw-headline text{}
History
About the standard
Disadvantages
Alternatives
Examples
Nonstandard extensions
Crawl-delay directive
Allow directive
Sitemap
Host
Universal "*" match
Meta tags and headers
See also
References
External links

`attr{attrkey}`

Print the values of all attributes with a given key from all selected nodes.

$ pup < robots.html a attr{href} | head
#mw-navigation
#p-search
/wiki/MediaWiki:Robots.txt
//en.wikipedia.org/robots.txt
/wiki/Wikipedia:What_Wikipedia_is_not#NOTHOWTO
//en.wikipedia.org/w/index.php?title=Robots_exclusion_standard&action=edit
//meta.wikimedia.org/wiki/Help:Transwiki
//en.wikiversity.org/wiki/
//en.wikibooks.org/wiki/
//en.wikivoyage.org/wiki/

`json{}`

Print HTML as JSON.

$ cat robots.html  | pup div#p-namespaces a
<a href="/wiki/Robots_exclusion_standard" title="View the content page [c]" accesskey="c">
 Article
</a>
<a href="/wiki/Talk:Robots_exclusion_standard" title="Discussion about the content page [t]" accesskey="t">
 Talk
</a>

$ cat robots.html  | pup div#p-namespaces a json{}
[
 {
  "attrs": {
   "accesskey": "c",
   "href": "/wiki/Robots_exclusion_standard",
   "title": "View the content page [c]"
  },
  "tag": "a",
  "text": "Article"
 },
 {
  "attrs": {
   "accesskey": "t",
   "href": "/wiki/Talk:Robots_exclusion_standard",
   "title": "Discussion about the content page [t]"
  },
  "tag": "a",
  "text": "Talk"
 }
]

Use the -i / --indent flag to control the intent level.

$ cat robots.html  | pup --indent 4 div#p-namespaces a json{}
[
    {
        "attrs": {
            "accesskey": "c",
            "href": "/wiki/Robots_exclusion_standard",
            "title": "View the content page [c]"
        },
        "tag": "a",
        "text": "Article"
    },
    {
        "attrs": {
            "accesskey": "t",
            "href": "/wiki/Talk:Robots_exclusion_standard",
            "title": "Discussion about the content page [t]"
        },
        "tag": "a",
        "text": "Talk"
    }
]

If the selectors only return one element the results will be printed as a JSON object, not a list.

$ cat robots.html  | pup --indent 4 title json{}
{
    "tag": "title",
    "text": "Robots exclusion standard - Wikipedia, the free encyclopedia"
}

Because there is no universal standard for converting HTML/XML to JSON, a method has been chosen which hopefully fits. The goal is simply to get the output of pup into a more consumable format.

Flags

-c --color         print result with color
-f --file          file to read from
-h --help          display this help
-i --indent        number of spaces to use for indent or character
-n --number        print number of elements selected
-l --limit         restrict number of levels printed
--version          display version

TODO

Add more tests!