# pup

pup is a command line tool for processing HTML. It reads from stdin,
prints to stdout, and allows the user to filter parts of the page using
[CSS selectors](http://www.w3schools.com/cssref/css_selectors.asp).

Inspired by [jq](http://stedolan.github.io/jq/), pup aims to be a
fast and flexible way of exploring HTML from the terminal.

Feature requests and feedback on argument design are welcome; feel free to
open an issue if you'd like to comment.

## Install

Direct downloads are available on the
[releases page](https://github.com/EricChiang/pup/releases).

Alternatively, you can run `go get` to download via git.

    go get github.com/ericchiang/pup

## Quick start

```bash
$ curl http://www.pro-football-reference.com/years/2013/games.htm
```

Ew, HTML. Let's run that through some pup selectors:

```bash
$ curl http://www.pro-football-reference.com/years/2013/games.htm | \
    pup table#games 'a[href*=boxscores]' attr{href}
```

## Basic Usage

```bash
$ cat index.html | pup [selectors and flags]
```

or

```bash
$ pup < index.html [selectors and flags]
```
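
To try either form without downloading anything, the sketch below writes a tiny page to a scratch file and feeds it to pup on stdin. The page contents and the `sample` variable are made up for illustration, and the snippet assumes `pup` is on your `PATH`; if it is not, it only prints a note.

```bash
# Write a small, made-up HTML page to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<html>
<head><title>Sample page</title></head>
<body>
<p id="intro">Hello, <a href="/about">pup</a></p>
</body>
</html>
EOF

# Run pup only if it is installed; "title" is a plain tag selector.
if command -v pup >/dev/null 2>&1; then
    pup < "$sample" title
else
    echo "pup not found on PATH; install it first" >&2
fi
```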

## Examples

Download a webpage with wget.

```bash
$ wget http://en.wikipedia.org/wiki/Robots_exclusion_standard -O robots.html
```

#### Clean and indent

By default, pup will fill in missing tags and properly indent the page.

```bash
$ cat robots.html
# nasty looking HTML
$ cat robots.html | pup --color
# cleaned, indented, and colorful HTML
```

#### Filter by tag

```bash
$ pup < robots.html title
<title>
Robots exclusion standard - Wikipedia, the free encyclopedia
</title>
```

#### Filter by id

```bash
$ pup < robots.html span#See_also
<span class="mw-headline" id="See_also">
See also
</span>
```

#### Chain selectors together

The following two commands are (somewhat) equivalent.

```bash
$ pup < robots.html table.navbox ul a | tail
```

```bash
$ pup < robots.html table.navbox | pup ul | pup a | tail
```

Both produce the output:

```bash
</a>
<a href="/wiki/Stop_words" title="Stop words">
Stop words
</a>
<a href="/wiki/Poison_words" title="Poison words">
Poison words
</a>
<a href="/wiki/Content_farm" title="Content farm">
Content farm
</a>
```

Because pup reconstructs the HTML parse tree, funny things can
happen when piping two pup commands together. I'd recommend chaining
selectors in a single command rather than piping one pup into another.

#### Limit print level

```bash
$ pup < robots.html table -l 2
<table class="metadata plainlinks ambox ambox-content" role="presentation">
<tbody>
...
</tbody>
</table>
<table style="background:#f9f9f9;font-size:85%;line-height:110%;max-width:175px;">
<tbody>
...
</tbody>
</table>
<table cellspacing="0" class="navbox" style="border-spacing:0;">
<tbody>
...
</tbody>
</table>
```

## Implemented Selectors

For further examples of these selectors, head over to
[w3schools](http://www.w3schools.com/cssref/css_selectors.asp).

```bash
cat index.html | pup .class
# '#' indicates comments at the command line so you have to escape it
cat index.html | pup \#id
cat index.html | pup element
cat index.html | pup [attribute]
cat index.html | pup [attribute=value]
# Probably best to enclose wildcards in quotes
cat index.html | pup '[attribute*=value]'
cat index.html | pup [attribute~=value]
cat index.html | pup [attribute^=value]
cat index.html | pup [attribute$=value]
```

You can mix and match selectors as you wish.

```bash
cat index.html | pup element#id[attribute=value]
```
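
The escaping and quoting advice in the selector list comes from the shell, not from pup. The sketch below uses a stand-in function, `show_args` (hypothetical, not part of pup), that reports the arguments it receives, so you can see what a program would actually be handed for each spelling of a selector. It assumes a bash-like shell with default globbing settings, and the scratch directory and the file named `x` exist only for the demonstration.

```bash
# Hypothetical stand-in for pup: report how many arguments arrived and what they were.
show_args() {
    echo "$# argument(s):" "$@"
}

# A '#' at the start of a word begins a comment, so the selector is dropped.
show_args #id

# Escaping or quoting keeps it. A '#' inside a word needs no escaping.
show_args \#id '#id' span#id

# Brackets are glob characters. In a directory that contains a file named "x",
# the unquoted selector is replaced by that file name; the quoted one is not.
workdir=$(mktemp -d)
touch "$workdir/x"
cd "$workdir" || exit 1
show_args [href*=boxscores]
show_args '[href*=boxscores]'
```

When no file happens to match, the shell usually passes an unquoted bracket selector through unchanged, which is why the unquoted forms often appear to work. Quoting every selector that contains `[`, `*`, or a leading `#` avoids the surprise.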

## Functions

Non-HTML selectors which affect the output type are implemented as functions
which can be provided as a final argument.

As of now, `text{}` and `attr{attrkey}` are the only implemented functions.

#### `text{}`

Print all text from selected nodes and children in depth first order.

```bash
$ cat robots.html | pup .mw-headline text{}
History
About the standard
Disadvantages
Alternatives
Examples
Nonstandard extensions
Crawl-delay directive
Allow directive
Sitemap
Host
Universal "*" match
Meta tags and headers
See also
References
External links
```
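
Because the sample output has one heading per line, it can be passed to ordinary line-oriented tools. The sketch below is only an illustration: it copies a few lines from the sample output into a variable (`headings`, a name chosen here) instead of calling pup, so it runs without pup installed.

```bash
# Sample lines copied from the text{} output above. With pup installed they
# would come from: pup < robots.html .mw-headline text{}
headings=$(printf '%s\n' 'History' 'Crawl-delay directive' 'Allow directive' 'Sitemap')

# Count the headings, then pick out the ones that mention directives.
printf '%s\n' "$headings" | wc -l
printf '%s\n' "$headings" | grep -i 'directive'
```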

#### `attr{attrkey}`

Print the values of all attributes with a given key from all selected nodes.

```bash
$ pup < robots.html a attr{href} | head
#mw-navigation
#p-search
/wiki/MediaWiki:Robots.txt
//en.wikipedia.org/robots.txt
/wiki/Wikipedia:What_Wikipedia_is_not#NOTHOWTO
//en.wikipedia.org/w/index.php?title=Robots_exclusion_standard&action=edit
//meta.wikimedia.org/wiki/Help:Transwiki
//en.wikiversity.org/wiki/
//en.wikibooks.org/wiki/
//en.wikivoyage.org/wiki/
```
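
The `href` values come back exactly as written in the page, so they mix same-page fragments, root-relative paths, and protocol-relative URLs. A minimal sketch of one way to normalize them follows. The `absolutize` helper is hypothetical (it is not part of pup), it relies only on `grep` and `sed`, and the sample input is copied from the output above so the snippet runs without pup installed.

```bash
# Hypothetical helper: turn href values read from stdin into absolute URLs.
# $1 is the site origin, for example http://en.wikipedia.org
absolutize() {
    origin=$1
    scheme=${origin%%://*}
    # Drop same-page fragments, then fix protocol-relative and root-relative links.
    grep -v '^#' | sed -e "s|^//|${scheme}://|" -e "s|^/|${origin}/|"
}

# Sample input copied from the attr{href} output above.
printf '%s\n' \
    '#mw-navigation' \
    '/wiki/MediaWiki:Robots.txt' \
    '//en.wikipedia.org/robots.txt' |
    absolutize http://en.wikipedia.org
```

With pup installed, the same helper would sit at the end of the pipeline, for example `pup < robots.html a attr{href} | absolutize http://en.wikipedia.org`.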

## Flags

```bash
-c --color         print result with color
-f --file          file to read from
-h --help          display this help
-i --indent        number of spaces to use for indent or character
-n --number        print number of elements selected
-l --limit         restrict number of levels printed
--version          display version
```

## TODO:

* Print as json function `json{}`