Mirror of https://github.com/ericchiang/pup (synced 2024-10-31 20:58:59 +00:00)
Update README.md
parent 1b27120ea8
commit 8ca1d4c5fb
101
README.md
@ -1,13 +1,108 @@
# pup

`pup` is a command line tool for processing HTML. It reads from stdin,
prints to stdout, and allows the user to filter parts of the page using
[CSS selectors](http://www.w3schools.com/cssref/css_selectors.asp).

Inspired by [`jq`](http://stedolan.github.io/jq/), `pup` aims to be a
fast and flexible way of exploring HTML from the terminal.

## Install

    go get github.com/ericchiang/pup

## Examples

Download a webpage with `wget`. _Please exercise restraint when using any
automated request tool._

```bash
$ wget http://en.wikipedia.org/wiki/Robots_exclusion_standard -O robots.html
```

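To try the filters without fetching a live page, a small local file can stand in for `robots.html`. This is only a sketch: `sample.html` and its contents are made up for illustration, and the `pup` call is skipped when `pup` is not on the `PATH`.

```bash
# Write a tiny, deliberately sloppy HTML page (hypothetical content, not from Wikipedia).
cat > sample.html <<'EOF'
<html><head><title>Sample page</title></head>
<body><span id="See_also">See also</span>
<ul><li><a href="/wiki/Stop_words">Stop words</a>
<li><a href="/wiki/Content_farm">Content farm</a></ul>
</body></html>
EOF

# Run a tag filter only if pup is installed.
if command -v pup >/dev/null 2>&1; then
  pup < sample.html title
fi
```

Substituting `sample.html` for `robots.html` in the commands in this README keeps experimentation offline.
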
### Clean and indent

By default, `pup` fills in missing tags and properly indents the page.

```bash
$ cat robots.html
# nasty looking html
$ cat robots.html | pup
# cleaned and indented html
```

### Filter by tag
```
$ pup < robots.html title
<title>
 Robots exclusion standard - Wikipedia, the free encyclopedia
</title>
```

### Filter by id
```
$ pup < robots.html span#See_also
<span class="mw-headline" id="See_also">
 See also
</span>
```

### Chain selectors together

The following two commands are equivalent.

```
$ pup < robots.html table.navbox ul a | tail
```

```
$ pup < robots.html table.navbox | pup ul | pup a | tail
```

Both produce the output:

```
</a>
<a href="/wiki/Stop_words" title="Stop words">
 Stop words
</a>
<a href="/wiki/Poison_words" title="Poison words">
 Poison words
</a>
<a href="/wiki/Content_farm" title="Content farm">
 Content farm
</a>
```

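One way to check that the single-invocation and piped forms agree is to compare their full output rather than just the tail. The sketch below assumes `pup` is installed and `robots.html` exists in the current directory; `same_output` is a hypothetical helper name, not part of `pup`.

```bash
# same_output CMD1 CMD2: succeed when both command strings print the same stdout
# (trailing newlines are ignored, because command substitution strips them).
same_output() {
  [ "$(sh -c "$1")" = "$(sh -c "$2")" ]
}

# Compare the two forms; skipped entirely if pup or the file is missing.
if command -v pup >/dev/null 2>&1 && [ -f robots.html ]; then
  if same_output 'pup < robots.html table.navbox ul a' \
                 'pup < robots.html table.navbox | pup ul | pup a'; then
    echo "outputs match"
  else
    echo "outputs differ"
  fi
fi
```
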
### How many nodes are selected by a filter?
```
$ pup < robots.html a -n
283
```

### Limit print level
```
$ pup < robots.html table -l 2
<table class="metadata plainlinks ambox ambox-content" role="presentation">
 <tbody>
  ...
 </tbody>
</table>
<table style="background:#f9f9f9;font-size:85%;line-height:110%;max-width:175px;">
 <tbody>
  ...
 </tbody>
</table>
<table cellspacing="0" class="navbox" style="border-spacing:0;">
 <tbody>
  ...
 </tbody>
</table>
```

## TODO:

* Attribute CSS selectors.
* Take input from file (-f)
* Set max print level flag (-l)
* Print attribute value rather than html ({href})
* Print result as JSON (--json)
* Print colorfully