# pup

pup is a command line tool for processing HTML. It reads from stdin,
prints to stdout, and allows the user to filter parts of the page using
[CSS selectors](http://www.w3schools.com/cssref/css_selectors.asp).

Inspired by [jq](http://stedolan.github.io/jq/), pup aims to be a
fast and flexible way of exploring HTML from the terminal.

Feature requests and feedback on the argument design are welcome; feel free
to open an issue if you'd like to comment.

## Install

Direct downloads are available on the [releases page](
https://github.com/EricChiang/pup/releases).

Or, if you can run `go get`, use it to download via git.

	go get github.com/ericchiang/pup

## Quick start

```bash
$ curl http://www.pro-football-reference.com/years/2013/games.htm 
```

Ew, HTML. Let's run that through some pup selectors:

```bash
$ curl http://www.pro-football-reference.com/years/2013/games.htm | \
pup table#games 'a[href*=boxscores]' attr{href}
```
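
The single quotes around `a[href*=boxscores]` matter: square brackets are glob characters, so an unquoted selector can be replaced by a matching filename before pup ever sees it. A minimal sketch of the effect, assuming a Bash-like shell with default globbing options; the scratch directory and the file name `ah` are made up for the demonstration, and `echo` stands in for pup:

```bash
# Create a scratch directory containing one file whose name matches the pattern.
demo_dir=$(mktemp -d)
touch "$demo_dir/ah"

# Unquoted: the shell treats [href*=boxscores] as a character class and
# expands the word to the matching filename.
unquoted=$(cd "$demo_dir" && echo a[href*=boxscores])

# Quoted: the selector reaches the command unchanged.
quoted=$(cd "$demo_dir" && echo 'a[href*=boxscores]')

echo "unquoted: $unquoted"
echo "quoted:   $quoted"

rm -r "$demo_dir"
```

With the file present, the unquoted form turns into `ah`, while the quoted form stays the selector itself.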

## Basic Usage

```bash
$ cat index.html | pup [selectors and flags]
```

or

```bash
$ pup < index.html [selectors and flags]
```

## Examples

Download a webpage with wget.

```bash
$ wget http://en.wikipedia.org/wiki/Robots_exclusion_standard -O robots.html
```

#### Clean and indent

By default pup will fill in missing tags and properly indent the page.

```bash
$ cat robots.html
# nasty looking HTML
$ cat robots.html | pup --color
# cleaned, indented, and colorful HTML
```

#### Filter by tag
```bash
$ pup < robots.html title
<title>
 Robots exclusion standard - Wikipedia, the free encyclopedia
</title>
```

#### Filter by id
```bash
$ pup < robots.html span#See_also
<span class="mw-headline" id="See_also">
 See also
</span>
```

#### Chain selectors together

The following two commands are (somewhat) equivalent.

```bash
$ pup < robots.html table.navbox ul a | tail
```

```bash
$ pup < robots.html table.navbox | pup ul | pup a | tail
```

Both produce the output:

```bash
</a>
<a href="/wiki/Stop_words" title="Stop words">
 Stop words
</a>
<a href="/wiki/Poison_words" title="Poison words">
 Poison words
</a>
<a href="/wiki/Content_farm" title="Content farm">
 Content farm
</a>
```

Because pup reconstructs the HTML parse tree, piping one pup command
into another can produce unexpected results. I'd recommend chaining
selectors in a single command rather than using pipes.

#### Limit print level

```bash
$ pup < robots.html table -l 2
<table class="metadata plainlinks ambox ambox-content" role="presentation">
 <tbody>
  ...
 </tbody>
</table>
<table style="background:#f9f9f9;font-size:85%;line-height:110%;max-width:175px;">
 <tbody>
  ...
 </tbody>
</table>
<table cellspacing="0" class="navbox" style="border-spacing:0;">
 <tbody>
  ...
 </tbody>
</table>
```

## Implemented Selectors

For further examples of these selectors head over to [w3schools](
http://www.w3schools.com/cssref/css_selectors.asp).

```bash
cat index.html | pup .class
# A leading '#' starts a shell comment, so escape (or quote) it
cat index.html | pup \#id
cat index.html | pup element
# Quote bracket selectors so the shell does not expand them as glob patterns
cat index.html | pup '[attribute]'
cat index.html | pup '[attribute=value]'
```
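
The escaping rule only applies when `#` begins a word; in the middle of a word, as in `span#See_also`, the shell leaves it alone. A small sketch using `echo` in place of pup to show what the command would actually receive, assuming Bash's default comment handling:

```bash
# A word starting with '#' begins a comment, so echo receives no arguments.
# The closing parenthesis sits on its own line so the comment cannot swallow it.
bare=$(
  echo #id
)

# Escaping or quoting passes the selector through intact.
escaped=$(echo \#id)
quoted=$(echo '#id')

# A '#' inside a word is not a comment, so element#id needs no escaping.
midword=$(echo span#See_also)

printf 'bare=[%s] escaped=[%s] quoted=[%s] midword=[%s]\n' \
  "$bare" "$escaped" "$quoted" "$midword"
```

Only the bare form loses the selector; the other three arrive as written.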

You can mix and match selectors as you wish.

```bash
cat index.html | pup 'element#id[attribute=value]'
```

## Functions

Non-HTML selectors which affect the output type are implemented as functions,
which can be provided as a final argument.

As of now, `text{}` and `attr{attrkey}` are the only implemented functions.

#### `text{}`

Print all text from selected nodes and children in depth first order.

```bash
$ cat robots.html | pup .mw-headline text{}
History
About the standard
Disadvantages
Alternatives
Examples
Nonstandard extensions
Crawl-delay directive
Allow directive
Sitemap
Host
Universal "*" match
Meta tags and headers
See also
References
External links
```

#### `attr{attrkey}`

Print the values of all attributes with a given key from all selected nodes.

```bash
$ pup < robots.html a attr{href} | head
#mw-navigation
#p-search
/wiki/MediaWiki:Robots.txt
//en.wikipedia.org/robots.txt
/wiki/Wikipedia:What_Wikipedia_is_not#NOTHOWTO
//en.wikipedia.org/w/index.php?title=Robots_exclusion_standard&action=edit
//meta.wikimedia.org/wiki/Help:Transwiki
//en.wikiversity.org/wiki/
//en.wikibooks.org/wiki/
//en.wikivoyage.org/wiki/
```
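
Since `attr{}` prints one value per line, its output is easy to post-process with standard tools. A rough sketch, using a few lines copied from the output above as stand-in input rather than invoking pup; the `https` scheme prepended to protocol-relative links is an assumption:

```bash
# Sample lines copied from the attr{href} output above. In practice this
# would come from: pup < robots.html a attr{href}
hrefs='#mw-navigation
#p-search
/wiki/MediaWiki:Robots.txt
//en.wikipedia.org/robots.txt
/wiki/Wikipedia:What_Wikipedia_is_not#NOTHOWTO
//en.wikiversity.org/wiki/'

# Keep only site-relative article links (a single leading slash, then "wiki/").
wiki_links=$(printf '%s\n' "$hrefs" | grep '^/wiki/')

# Turn protocol-relative links ("//host/...") into absolute URLs.
# Assumption: the hosts are reachable over https.
absolute=$(printf '%s\n' "$hrefs" | grep '^//' | sed 's|^//|https://|')

printf '%s\n' "$wiki_links"
printf '%s\n' "$absolute"
```

The same filters can be appended directly to a pup pipeline in place of the `hrefs` variable.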

## Flags

```bash
-c --color         print result with color
-f --file          file to read from
-h --help          display this help
-i --indent        number of spaces to use for indent or character
-n --number        print number of elements selected
-l --limit         restrict number of levels printed
--version          display version
```

## TODO:

* Print as json function `json{}`