Parsing HTML at the command line

pup

pup is a command line tool for processing HTML. It reads from stdin, prints to stdout, and allows the user to filter parts of the page using CSS selectors.

Inspired by jq, pup aims to be a fast and flexible way of exploring HTML from the terminal.

Install

go get github.com/ericchiang/pup

Examples

Download a webpage with wget. Please exercise restraint when using any automated request tool.

$ wget http://en.wikipedia.org/wiki/Robots_exclusion_standard -O robots.html

###Clean and indent

By default, pup fills in missing tags and indents the page properly.

$ cat robots.html
# nasty looking HTML
$ cat robots.html | pup --color
# cleaned, indented, and colorful HTML

###Filter by tag

$ pup < robots.html title
<title>
 Robots exclusion standard - Wikipedia, the free encyclopedia
</title>

###Filter by id

$ pup < robots.html span#See_also
<span class="mw-headline" id="See_also">
 See also
</span>

###Chain selectors together

The following two commands are equivalent. (NOTE: pipes do not work with the --color flag)

$ pup < robots.html table.navbox ul a | tail
$ pup < robots.html table.navbox | pup ul | pup a | tail

Both produce the output:

</a>
<a href="/wiki/Stop_words" title="Stop words">
 Stop words
</a>
<a href="/wiki/Poison_words" title="Poison words">
 Poison words
</a>
<a href="/wiki/Content_farm" title="Content farm">
 Content farm
</a>
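
One way to read the equivalence of the two commands: each selector in the chain searches only beneath the nodes the previous selector kept, which is also what happens when each stage's output is piped into the next. The Go sketch below shows that idea on a tiny hand-built tree. The `Node` type and the `descendants`, `Chain` and `sampleDoc` functions are assumptions made for this example, not pup's code, and selectors are reduced to plain tag names.

```go
package main

import "fmt"

// Node is a hypothetical, minimal element tree used only for this sketch.
type Node struct {
	Tag      string
	Text     string
	Children []*Node
}

// descendants returns every strict descendant of n with the given tag,
// in document order.
func descendants(n *Node, tag string) []*Node {
	var out []*Node
	for _, c := range n.Children {
		if c.Tag == tag {
			out = append(out, c)
		}
		out = append(out, descendants(c, tag)...)
	}
	return out
}

// Chain applies the tags one after another: each step searches only
// beneath the nodes selected by the previous step.
func Chain(root *Node, tags ...string) []*Node {
	current := []*Node{root}
	for _, tag := range tags {
		var next []*Node
		for _, n := range current {
			next = append(next, descendants(n, tag)...)
		}
		current = next
	}
	return current
}

// sampleDoc builds a small document with one link outside the table
// and two links inside a list within the table.
func sampleDoc() *Node {
	return &Node{Tag: "body", Children: []*Node{
		{Tag: "a", Text: "outside"},
		{Tag: "table", Children: []*Node{
			{Tag: "ul", Children: []*Node{
				{Tag: "a", Text: "Stop words"},
				{Tag: "a", Text: "Poison words"},
			}},
		}},
	}}
}

func main() {
	for _, n := range Chain(sampleDoc(), "table", "ul", "a") {
		fmt.Println(n.Text)
	}
}
```

In this sketch the link outside the table is dropped at the first step, so only the two links inside the list remain.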

###How many nodes are selected by a filter?

$ pup < robots.html a -n
283

###Limit print level

$ pup < robots.html table -l 2
<table class="metadata plainlinks ambox ambox-content" role="presentation">
 <tbody>
  ...
 </tbody>
</table>
<table style="background:#f9f9f9;font-size:85%;line-height:110%;max-width:175px;">
 <tbody>
  ...
 </tbody>
</table>
<table cellspacing="0" class="navbox" style="border-spacing:0;">
 <tbody>
  ...
 </tbody>
</table>
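
The output above suggests that `-l 2` prints two levels of each selected element and replaces anything deeper with `...`. A minimal Go sketch of such depth-limited printing follows. It is an illustration under that assumption rather than pup's own printing code: the `Node` type and the `Render` function are hypothetical, and attributes and text nodes are left out.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a hypothetical, minimal element tree used only for this sketch.
type Node struct {
	Tag      string
	Children []*Node
}

// render writes n indented by one space per depth level. Children that
// would sit at or beyond the limit are replaced by a single "..." line.
func render(b *strings.Builder, n *Node, depth, limit int) {
	indent := strings.Repeat(" ", depth)
	fmt.Fprintf(b, "%s<%s>\n", indent, n.Tag)
	if len(n.Children) > 0 {
		if depth+1 >= limit {
			fmt.Fprintf(b, "%s ...\n", indent)
		} else {
			for _, c := range n.Children {
				render(b, c, depth+1, limit)
			}
		}
	}
	fmt.Fprintf(b, "%s</%s>\n", indent, n.Tag)
}

// Render returns n printed with at most limit levels of elements.
func Render(n *Node, limit int) string {
	var b strings.Builder
	render(&b, n, 0, limit)
	return b.String()
}

func main() {
	table := &Node{Tag: "table", Children: []*Node{
		{Tag: "tbody", Children: []*Node{
			{Tag: "tr", Children: []*Node{{Tag: "td"}}},
		}},
	}}
	fmt.Print(Render(table, 2))
}
```

With a limit of 2, the sketch prints the `table` and `tbody` tags and collapses the rows into `...`, mirroring the shape of the output shown above.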

TODO:

  • Attribute CSS selectors.
  • Print attribute value rather than html ({href})
  • Print result as JSON (--json)