arno/pup

mirror of https://github.com/ericchiang/pup synced 2025-04-17 21:49:01 +00:00

Parsing HTML at the command line


Eric Chiang 6915c6abb9 Update README.md		2014-09-01 20:56:02 -04:00
selector	attribute selectors added	2014-09-01 16:39:26 -04:00
.gitignore	Initial commit	2014-09-01 12:54:45 -04:00
LICENSE	license added	2014-09-01 13:36:10 -04:00
main.go	attribute selectors added	2014-09-01 16:39:26 -04:00
printing.go	cleaned up code and add comments	2014-09-01 15:07:42 -04:00
README.md	Update README.md	2014-09-01 20:56:02 -04:00

pup

pup is a command line tool for processing HTML. It reads from stdin, prints to stdout, and allows the user to filter parts of the page using CSS selectors.

Inspired by jq, pup aims to be a fast and flexible way of exploring HTML from the terminal.

Install

go get github.com/ericchiang/pup

Basic Usage

$ cat index.html | pup [selectors and flags]

$ pup < index.html [selectors and flags]

Examples

Download a webpage with wget.

$ wget http://en.wikipedia.org/wiki/Robots_exclusion_standard -O robots.html

### Clean and indent

By default pup will fill in missing tags and properly indent the page.

$ cat robots.html
# nasty looking HTML
$ cat robots.html | pup --color
# cleaned, indented, and colorful HTML
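
To give a feel for what "clean and indent" means, here is a minimal sketch in Go. It is not pup's implementation: `indentHTML` is a hypothetical helper, and it leans on the standard library's `encoding/xml` decoder in lenient mode as a stand-in for a real HTML parser, so it only copes with markup whose tags are all closed.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"html"
	"io"
	"strings"
)

// indentHTML re-prints markup with one tag or text node per line,
// indented by nesting depth. Hypothetical helper, not part of pup.
func indentHTML(src, indent string) (string, error) {
	dec := xml.NewDecoder(strings.NewReader(src))
	dec.Strict = false          // tolerate HTML-style attributes
	dec.Entity = xml.HTMLEntity // accept &nbsp; and other HTML entities

	var out strings.Builder
	depth := 0 // number of currently open elements
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return out.String(), nil
		}
		if err != nil {
			return "", err
		}
		pad := strings.Repeat(indent, depth)
		switch t := tok.(type) {
		case xml.StartElement:
			out.WriteString(pad + "<" + t.Name.Local)
			for _, a := range t.Attr {
				fmt.Fprintf(&out, " %s=\"%s\"", a.Name.Local, html.EscapeString(a.Value))
			}
			out.WriteString(">\n")
			depth++
		case xml.EndElement:
			depth--
			out.WriteString(strings.Repeat(indent, depth) + "</" + t.Name.Local + ">\n")
		case xml.CharData:
			// Skip whitespace-only text; print the rest on its own line.
			if text := strings.TrimSpace(string(t)); text != "" {
				out.WriteString(pad + html.EscapeString(text) + "\n")
			}
		}
	}
}

func main() {
	out, err := indentHTML(`<p class="intro">Hello <b>world</b></p>`, " ")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```

Unlike pup, this sketch does not fill in missing tags; unclosed elements make the decoder return an error.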

### Filter by tag

$ pup < robots.html title
<title>
 Robots exclusion standard - Wikipedia, the free encyclopedia
</title>

### Filter by id

$ pup < robots.html span#See_also
<span class="mw-headline" id="See_also">
 See also
</span>

### Chain selectors together

The following two commands are equivalent. (NOTE: pipes do not work with the --color flag)

$ pup < robots.html table.navbox ul a | tail

$ pup < robots.html table.navbox | pup ul | pup a | tail

Both produce the output:

</a>
<a href="/wiki/Stop_words" title="Stop words">
 Stop words
</a>
<a href="/wiki/Poison_words" title="Poison words">
 Poison words
</a>
<a href="/wiki/Content_farm" title="Content farm">
 Content farm
</a>
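
One way to read this equivalence is that each selector filters the nodes kept by the previous one. The toy model below is an assumption made for illustration, not pup's code: `Node`, `find`, and `selectAll` are hypothetical names, and selectors are reduced to plain tag names.

```go
package main

import "fmt"

// Node is a hypothetical, stripped-down element: a tag and its children.
type Node struct {
	Tag      string
	Children []*Node
}

// find returns every node at or below roots whose tag matches,
// in document order. It models a single `pup tag` stage.
func find(roots []*Node, tag string) []*Node {
	var out []*Node
	var walk func(n *Node)
	walk = func(n *Node) {
		if n.Tag == tag {
			out = append(out, n)
		}
		for _, c := range n.Children {
			walk(c)
		}
	}
	for _, r := range roots {
		walk(r)
	}
	return out
}

// selectAll applies each tag filter to the result of the previous one,
// modelling several selectors given to one invocation.
func selectAll(roots []*Node, tags ...string) []*Node {
	for _, tag := range tags {
		roots = find(roots, tag)
	}
	return roots
}

func main() {
	doc := []*Node{
		{Tag: "table", Children: []*Node{
			{Tag: "ul", Children: []*Node{{Tag: "a"}, {Tag: "a"}}},
		}},
		{Tag: "a"}, // outside the table, so it is filtered out
	}
	single := selectAll(doc, "table", "ul", "a")
	piped := find(find(find(doc, "table"), "ul"), "a")
	fmt.Println(len(single), len(piped))
}
```

In this model the piped form is the same fold written out by hand, which is why both forms select the same links.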

### How many nodes are selected by a filter?

$ pup < robots.html a -n
283
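
Counting comes down to tallying matching elements instead of printing them. A rough sketch, assuming a plain tag selector and closed tags: `countTag` is a hypothetical helper built on the standard library's `encoding/xml` decoder, not the code pup runs.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// countTag returns how many elements named tag occur in src.
// Hypothetical helper; tag names are compared case-insensitively.
func countTag(src, tag string) (int, error) {
	dec := xml.NewDecoder(strings.NewReader(src))
	dec.Strict = false          // tolerate HTML-style attributes
	dec.Entity = xml.HTMLEntity // accept HTML entities such as &nbsp;

	n := 0
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return n, nil
		}
		if err != nil {
			return n, err
		}
		// Each opening tag with the requested name is one selected node.
		if start, ok := tok.(xml.StartElement); ok && strings.EqualFold(start.Name.Local, tag) {
			n++
		}
	}
}

func main() {
	page := `<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>`
	n, err := countTag(page, "a")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(n)
}
```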

### Limit print level

$ pup < robots.html table -l 2
<table class="metadata plainlinks ambox ambox-content" role="presentation">
 <tbody>
  ...
 </tbody>
</table>
<table style="background:#f9f9f9;font-size:85%;line-height:110%;max-width:175px;">
 <tbody>
  ...
 </tbody>
</table>
<table cellspacing="0" class="navbox" style="border-spacing:0;">
 <tbody>
  ...
 </tbody>
</table>
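
Judging from the output above, the limit keeps the first levels of each selected element and collapses everything deeper into a single `...` line. The sketch below reproduces that shape under stated assumptions: `printLimited` is a hypothetical helper, attributes are dropped for brevity, and `encoding/xml` stands in for a real HTML parser.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"html"
	"io"
	"strings"
)

// printLimited prints at most limit levels of nesting and replaces
// deeper content with one "..." line per parent. Hypothetical helper.
func printLimited(src string, limit int) (string, error) {
	dec := xml.NewDecoder(strings.NewReader(src))
	dec.Strict = false
	dec.Entity = xml.HTMLEntity

	var out strings.Builder
	depth := 0      // number of currently open elements
	elided := false // "..." already printed for the current parent
	line := func(s string) {
		out.WriteString(strings.Repeat(" ", depth) + s + "\n")
	}
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return out.String(), nil
		}
		if err != nil {
			return "", err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if depth < limit {
				line("<" + t.Name.Local + ">")
				elided = false // a new parent may elide its own children
			} else if depth == limit && !elided {
				line("...")
				elided = true
			}
			depth++
		case xml.EndElement:
			depth--
			if depth < limit {
				line("</" + t.Name.Local + ">")
			}
		case xml.CharData:
			text := strings.TrimSpace(string(t))
			if text == "" {
				continue // ignore whitespace-only text
			}
			if depth < limit {
				line(html.EscapeString(text))
			} else if depth == limit && !elided {
				line("...")
				elided = true
			}
		}
	}
}

func main() {
	page := `<table><tbody><tr><td>1</td></tr><tr><td>2</td></tr></tbody></table>`
	out, err := printLimited(page, 2)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```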

Implemented Selectors

For further examples of these selectors, head over to w3schools.

cat index.html | pup .class
# a '#' at the start of a word begins a shell comment, so you have to escape it
cat index.html | pup \#id
cat index.html | pup element
cat index.html | pup [attribute]
cat index.html | pup [attribute=value]

You can mix and match selectors as you wish.

cat index.html | pup element#id[attribute=value]
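
A combined selector can be thought of as several tests that must all pass on the same element. The sketch below splits such a string into its parts and checks them against one element. The `Selector` type, `parseSelector`, and `Matches` are hypothetical names introduced for illustration and may differ from how pup parses selectors; only a single class per selector is handled.

```go
package main

import (
	"fmt"
	"strings"
)

// Selector is a hypothetical parsed form of one combined selector.
type Selector struct {
	Tag   string            // "" matches any element
	ID    string            // from "#id"
	Class string            // from ".class"
	Attrs map[string]string // "" value means the attribute only has to exist
}

// parseSelector splits a string such as element#id[attribute=value].
func parseSelector(s string) (Selector, error) {
	sel := Selector{Attrs: map[string]string{}}
	i := strings.IndexAny(s, "#.[") // the tag runs up to the first marker
	if i < 0 {
		sel.Tag = s
		return sel, nil
	}
	sel.Tag, s = s[:i], s[i:]
	for s != "" {
		switch s[0] {
		case '#', '.':
			end := strings.IndexAny(s[1:], "#.[")
			if end < 0 {
				end = len(s) - 1
			}
			if s[0] == '#' {
				sel.ID = s[1 : 1+end]
			} else {
				sel.Class = s[1 : 1+end]
			}
			s = s[1+end:]
		case '[':
			end := strings.IndexByte(s, ']')
			if end < 0 {
				return sel, fmt.Errorf("unclosed '[' in %q", s)
			}
			key, val, _ := strings.Cut(s[1:end], "=")
			sel.Attrs[key] = val
			s = s[end+1:]
		default:
			return sel, fmt.Errorf("unexpected input %q", s)
		}
	}
	return sel, nil
}

// Matches reports whether an element with this tag and these attributes
// passes every test in the selector.
func (sel Selector) Matches(tag string, attrs map[string]string) bool {
	if sel.Tag != "" && sel.Tag != tag {
		return false
	}
	if sel.ID != "" && attrs["id"] != sel.ID {
		return false
	}
	if sel.Class != "" {
		found := false
		for _, c := range strings.Fields(attrs["class"]) {
			found = found || c == sel.Class
		}
		if !found {
			return false
		}
	}
	for k, want := range sel.Attrs {
		got, ok := attrs[k]
		if !ok || (want != "" && got != want) {
			return false
		}
	}
	return true
}

func main() {
	sel, err := parseSelector("span#See_also")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	attrs := map[string]string{"class": "mw-headline", "id": "See_also"}
	fmt.Println(sel.Matches("span", attrs))
}
```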

Flags

-c --color         print result with color
-f --file          file to read from
-h --help          display this help
-i --indent        number of spaces to use for indent, or a character to indent with
-n --number        print number of elements selected
-l --limit         restrict number of levels printed
--version          display version

TODO:

Print attribute value rather than html ({href})
Print result as JSON (--json)