mirror of https://github.com/ericchiang/pup synced 2024-11-24 08:58:08 +00:00
# pup

pup is a command line tool for processing HTML. It reads from stdin,
prints to stdout, and allows the user to filter parts of the page using
[CSS selectors](http://www.w3schools.com/cssref/css_selectors.asp).

Inspired by [jq](http://stedolan.github.io/jq/), pup aims to be a
fast and flexible way of exploring HTML from the terminal.

Feature requests and feedback on the argument design are welcome; feel free
to open an issue if you'd like to comment.

## Install

Direct downloads are available on the [releases page](https://github.com/EricChiang/pup/releases).

Or, if you have Go installed, you can run `go get` to download via git.

    go get github.com/ericchiang/pup

## Quick start
```bash
$ curl http://www.pro-football-reference.com/years/2013/games.htm
```
Ew, HTML. Let's run that through some pup selectors:
```bash
$ curl http://www.pro-football-reference.com/years/2013/games.htm | \
pup table#games 'a[href*=boxscores]' attr{href}
```
## Basic Usage
```bash
$ cat index.html | pup [selectors and flags]
```
or
```bash
$ pup < index.html [selectors and flags]
```
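
To try either form without downloading anything, a small sample page is enough. The sketch below writes one to a temporary file and runs both invocation styles only if `pup` is on the `PATH`. The markup and the `sample` variable are made up for illustration and do not come from the pup repository.

```bash
# Write a small sample page to a temporary file.
# The markup is illustrative only.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<html>
<head><title>Sample page</title></head>
<body>
<div id="main" class="content">
<a href="/first">First</a>
<a href="/second">Second</a>
</div>
</body>
</html>
EOF

# Run pup only if it is installed. Both forms feed the same file to stdin.
if command -v pup >/dev/null 2>&1; then
  cat "$sample" | pup title
  pup < "$sample" title
fi
```

The file at `$sample` can stand in for `index.html` in the selector examples further down.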
## Examples
Download a webpage with wget.
```bash
$ wget http://en.wikipedia.org/wiki/Robots_exclusion_standard -O robots.html
```

#### Clean and indent

By default pup will fill in missing tags and properly indent the page.

```bash
$ cat robots.html
# nasty looking HTML
$ cat robots.html | pup --color
# cleaned, indented, and colorful HTML
```

#### Filter by tag

```bash
$ pup < robots.html title
<title>
Robots exclusion standard - Wikipedia, the free encyclopedia
</title>
```

#### Filter by id

```bash
$ pup < robots.html span#See_also
<span class="mw-headline" id="See_also">
See also
</span>
```

#### Chain selectors together

The following two commands are (somewhat) equivalent.

```bash
$ pup < robots.html table.navbox ul a | tail
```
```bash
$ pup < robots.html table.navbox | pup ul | pup a | tail
```

Both produce the output:

```bash
</a>
<a href="/wiki/Stop_words" title="Stop words">
Stop words
</a>
<a href="/wiki/Poison_words" title="Poison words">
Poison words
</a>
<a href="/wiki/Content_farm" title="Content farm">
Content farm
</a>
```

Because each pup invocation re-parses its input and reconstructs the HTML
parse tree, filling in missing tags along the way, funny things can happen
when piping one pup command into another. I'd recommend chaining selectors
in a single command rather than using pipes.
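
One way to check whether the two forms agree for a particular page is to capture both outputs and compare them. This is only a sketch: it assumes `robots.html` from the wget step is in the current directory, and the `result` variable is a name chosen here for illustration. It makes no claim about which outcome you will see.

```bash
page=robots.html   # the file downloaded with wget above

if command -v pup >/dev/null 2>&1 && [ -f "$page" ]; then
  # Same selectors, once chained in a single command and once piped.
  chained=$(pup < "$page" table.navbox ul a)
  piped=$(pup < "$page" table.navbox | pup ul | pup a)
  if [ "$chained" = "$piped" ]; then
    result=identical
  else
    result=different
  fi
else
  # pup is not installed or the page has not been downloaded yet.
  result=skipped
fi
echo "chained vs piped: $result"
```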

#### Limit print level

```bash
$ pup < robots.html table -l 2
<table class="metadata plainlinks ambox ambox-content" role="presentation">
<tbody>
...
</tbody>
</table>
<table style="background:#f9f9f9;font-size:85%;line-height:110%;max-width:175px;">
<tbody>
...
</tbody>
</table>
<table cellspacing="0" class="navbox" style="border-spacing:0;">
<tbody>
...
</tbody>
</table>
```
## Implemented Selectors
For further examples of these selectors head over to [w3schools](
http://www.w3schools.com/cssref/css_selectors.asp).
```bash
cat index.html | pup .class
# '#' indicates comments at the command line so you have to escape it
cat index.html | pup \#id
cat index.html | pup element
cat index.html | pup [attribute]
cat index.html | pup [attribute=value]
```
You can mix and match selectors as you wish.
```bash
cat index.html | pup element#id[attribute=value]
```
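
The escaping note above is about the shell, not pup, so it can be checked without pup installed. The sketch below uses a hypothetical helper, `show_args`, that prints how many arguments a command received and what they were. It shows that a leading `#` starts a comment, that a `#` in the middle of a word is passed through, and that an unquoted `[...]` is a glob pattern that can be replaced by a matching file name.

```bash
# Hypothetical helper: print the argument count, then each argument in <>.
show_args() {
  out="$#"
  for arg in "$@"; do
    out="$out <$arg>"
  done
  printf '%s\n' "$out"
}

# A leading '#' starts a comment, so no argument is passed.
# prints: 0
show_args #id

# Escaping or quoting the '#' passes the selector through.
# prints: 1 <#id>
show_args \#id
# prints: 1 <#id>
show_args '#id'

# A '#' inside a word is not a comment, and {href} is not expanded.
# prints: 1 <span#See_also>
show_args span#See_also
# prints: 1 <attr{href}>
show_args attr{href}

# Unquoted brackets are a glob: with a file named 'a' in the current
# directory, [attribute] is replaced by that file name.
tmpdir=$(mktemp -d)
# prints: 1 <a>
( cd "$tmpdir" && touch a && show_args [attribute] )
# prints: 1 <[attribute]>
( cd "$tmpdir" && show_args '[attribute]' )
rm -r "$tmpdir"
```

Quoting any selector that contains `[`, `]`, or `*` avoids surprises when a file in the working directory happens to match.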
## Functions
Non-HTML selectors which affect the output type are implemented as functions
which can be provided as a final argument.
As of now, `text{}` and `attr{attrkey}` are the implemented functions.
#### `text{}`
Print all text from selected nodes and children in depth first order.
```bash
$ cat robots.html | pup .mw-headline text{}
History
About the standard
Disadvantages
Alternatives
Examples
Nonstandard extensions
Crawl-delay directive
Allow directive
Sitemap
Host
Universal "*" match
Meta tags and headers
See also
References
External links
```
#### `attr{attrkey}`
Print the values of all attributes with a given key from all selected nodes.
```bash
$ pup < robots.html a attr{href} | head
#mw-navigation
#p-search
/wiki/MediaWiki:Robots.txt
//en.wikipedia.org/robots.txt
/wiki/Wikipedia:What_Wikipedia_is_not#NOTHOWTO
//en.wikipedia.org/w/index.php?title=Robots_exclusion_standard&action=edit
//meta.wikimedia.org/wiki/Help:Transwiki
//en.wikiversity.org/wiki/
//en.wikibooks.org/wiki/
//en.wikivoyage.org/wiki/
```
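
The values above come out exactly as written in the page, so they are a mix of fragments (`#...`), site-relative paths (`/...`), and protocol-relative URLs (`//...`). If absolute URLs are needed, the output can be post-processed with standard tools. The following is a minimal sketch, not part of pup: `absolutize` is a hypothetical helper, and the base URL is assumed to be the page fetched with wget earlier.

```bash
# Hypothetical helper: rewrite href values read from stdin as absolute URLs.
# Assumes the page was fetched from the wget URL used in the Examples section.
absolutize() {
  base_scheme='http:'
  base_host='http://en.wikipedia.org'
  base_page="$base_host/wiki/Robots_exclusion_standard"
  # Order matters: handle '//' before the single leading '/'.
  sed -e "s|^//|$base_scheme//|" \
      -e "s|^/|$base_host/|" \
      -e "s|^#|$base_page#|"
}

# Demonstration on three values taken from the output above.
printf '%s\n' \
  '#mw-navigation' \
  '/wiki/MediaWiki:Robots.txt' \
  '//en.wikipedia.org/robots.txt' | absolutize
# prints:
# http://en.wikipedia.org/wiki/Robots_exclusion_standard#mw-navigation
# http://en.wikipedia.org/wiki/MediaWiki:Robots.txt
# http://en.wikipedia.org/robots.txt
```

With pup installed, the same helper could sit at the end of the pipeline, for example `pup < robots.html a attr{href} | absolutize`.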
## Flags
```bash
-c --color         print result with color
-f --file          file to read from
-h --help          display this help
-i --indent        number of spaces to use for indent or character
-n --number        print number of elements selected
-l --limit         restrict number of levels printed
--version          display version
```
## TODO:
* Print as json function `json{}`