mirror of
https://github.com/ericchiang/pup
synced 2024-11-24 00:48:36 +00:00
Parsing HTML at the command line
pup
pup is a command line tool for processing HTML. It reads from stdin,
prints to stdout, and allows the user to filter parts of the page using
CSS selectors.
Inspired by jq, pup aims to be a fast and flexible way of exploring HTML
from the terminal.
Install
go get github.com/ericchiang/pup
Basic Usage
$ cat index.html | pup [selectors and flags]
or
$ pup < index.html [selectors and flags]
Examples
Download a webpage with wget.
$ wget http://en.wikipedia.org/wiki/Robots_exclusion_standard -O robots.html
Clean and indent
By default pup will fill in missing tags and properly indent the page.
$ cat robots.html
# nasty looking HTML
$ cat robots.html | pup --color
# cleaned, indented, and colorful HTML
Filter by tag
$ pup < robots.html title
<title>
Robots exclusion standard - Wikipedia, the free encyclopedia
</title>
Filter by id
$ pup < robots.html span#See_also
<span class="mw-headline" id="See_also">
See also
</span>
Chain selectors together
The following two commands are equivalent. (NOTE: pipes do not work with the
--color flag.)
$ pup < robots.html table.navbox ul a | tail
$ pup < robots.html table.navbox | pup ul | pup a | tail
Both produce the output:
</a>
<a href="/wiki/Stop_words" title="Stop words">
Stop words
</a>
<a href="/wiki/Poison_words" title="Poison words">
Poison words
</a>
<a href="/wiki/Content_farm" title="Content farm">
Content farm
</a>
How many nodes are selected by a filter?
$ pup < robots.html a -n
283
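To make the idea behind -n concrete, here is a minimal Go sketch that counts start tags with a given name. It is an illustration only and not pup's implementation: it assumes the standard library's encoding/xml decoder in its lenient mode, which copes with simple, mostly well-formed markup but is not a full HTML parser. The name CountElements is hypothetical.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// CountElements (hypothetical helper) returns how many start tags named
// tag appear in markup. Tokenizing stops at the first error, which
// includes reaching the end of the input.
func CountElements(markup, tag string) int {
	d := xml.NewDecoder(strings.NewReader(markup))
	// Lenient settings that the encoding/xml docs suggest for HTML input.
	d.Strict = false
	d.AutoClose = xml.HTMLAutoClose
	d.Entity = xml.HTMLEntity

	count := 0
	for {
		tok, err := d.Token()
		if err != nil {
			return count
		}
		if start, ok := tok.(xml.StartElement); ok && start.Name.Local == tag {
			count++
		}
	}
}

func main() {
	page := `<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>`
	fmt.Println(CountElements(page, "a"))
}
```

On real-world pages with unclosed or misnested tags this sketch may stop early, which is why a dedicated HTML parser is the better choice in practice.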
Limit print level
$ pup < robots.html table -l 2
<table class="metadata plainlinks ambox ambox-content" role="presentation">
<tbody>
...
</tbody>
</table>
<table style="background:#f9f9f9;font-size:85%;line-height:110%;max-width:175px;">
<tbody>
...
</tbody>
</table>
<table cellspacing="0" class="navbox" style="border-spacing:0;">
<tbody>
...
</tbody>
</table>
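The output above suggests that anything deeper than the limit is collapsed into a single "..." line. A minimal Go sketch of that behavior follows. The Node type and Render function are hypothetical and are not taken from pup's source; they only illustrate depth-limited printing of a tree.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a hypothetical, minimal element tree used for illustration.
type Node struct {
	Tag      string
	Children []*Node
}

// Render returns n as indented markup. Elements at levels 0..limit-1 are
// printed; any children below that are replaced by a single "..." line.
func Render(n *Node, level, limit int) string {
	indent := strings.Repeat(" ", level)
	var b strings.Builder
	b.WriteString(indent + "<" + n.Tag + ">\n")
	if level+1 >= limit {
		// The children would exceed the limit, so elide them.
		if len(n.Children) > 0 {
			b.WriteString(indent + " ...\n")
		}
	} else {
		for _, c := range n.Children {
			b.WriteString(Render(c, level+1, limit))
		}
	}
	b.WriteString(indent + "</" + n.Tag + ">\n")
	return b.String()
}

func main() {
	table := &Node{Tag: "table", Children: []*Node{
		{Tag: "tbody", Children: []*Node{{Tag: "tr"}, {Tag: "tr"}}},
	}}
	fmt.Print(Render(table, 0, 2))
}
```

With a limit of 2, the table and tbody elements are printed and the rows inside tbody are elided, mirroring the shape of the example output.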
Implemented Selectors
For further examples of these selectors head over to w3schools.
cat index.html | pup .class
# '#' indicates comments at the command line so you have to escape it
cat index.html | pup \#id
cat index.html | pup element
cat index.html | pup [attribute]
cat index.html | pup [attribute=value]
You can mix and match selectors as you wish.
cat index.html | pup element#id[attribute=value]
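One way to picture a mixed selector such as element#id[attribute=value] is as a tag name, an id, a class, and an attribute test bundled together. The Go sketch below splits a selector into those parts with regular expressions. It is a simplified illustration under stated assumptions, not pup's actual parser: the Selector type and ParseSelector function are hypothetical, and only the first id, class, and attribute are recorded.

```go
package main

import (
	"fmt"
	"regexp"
)

// Selector is a hypothetical decomposition of one compound selector.
type Selector struct {
	Tag   string // element name, e.g. "span"
	ID    string // value after '#'
	Class string // value after '.'
	Attr  string // attribute name inside [...]
	Value string // attribute value, empty for [attribute]
}

// Each pattern picks out one part of a compound selector.
var (
	tagRe   = regexp.MustCompile(`^[A-Za-z][A-Za-z0-9]*`)
	idRe    = regexp.MustCompile(`#([\w-]+)`)
	classRe = regexp.MustCompile(`\.([\w-]+)`)
	attrRe  = regexp.MustCompile(`\[([\w-]+)(?:=([^\]]*))?\]`)
)

// ParseSelector splits s into its tag, id, class, and attribute parts.
func ParseSelector(s string) Selector {
	var sel Selector
	// Handle the [attribute] part first and strip it, so that a '#' or
	// '.' inside an attribute value is not mistaken for an id or class.
	if m := attrRe.FindStringSubmatch(s); m != nil {
		sel.Attr, sel.Value = m[1], m[2]
		s = attrRe.ReplaceAllString(s, "")
	}
	sel.Tag = tagRe.FindString(s)
	if m := idRe.FindStringSubmatch(s); m != nil {
		sel.ID = m[1]
	}
	if m := classRe.FindStringSubmatch(s); m != nil {
		sel.Class = m[1]
	}
	return sel
}

func main() {
	fmt.Printf("%+v\n", ParseSelector("element#id[attribute=value]"))
}
```

A node would then match the selector only if every non-empty field matches, which is what lets the parts be combined freely.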
Flags
-c --color     print result with color
-f --file      file to read from
-h --help      display this help
-i --indent    number of spaces to use for indent or character
-n --number    print number of elements selected
-l --limit     restrict number of levels printed
--version      display version
TODO:
- Print attribute value rather than html ({href})
- Print result as JSON (--json)