From 60b1b5b9a36b1753aa142424d25bf200333b8801 Mon Sep 17 00:00:00 2001
From: Eric Chiang
Date: Mon, 1 Sep 2014 13:29:04 -0400
Subject: [PATCH] Update README.md

---
 README.md | 101 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 98 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 4c3633a..780547f 100644
--- a/README.md
+++ b/README.md
@@ -1,13 +1,108 @@
 # pup
 
+`pup` is a command line tool for processing HTML. It reads from stdin,
+prints to stdout, and allows the user to filter parts of the page using
+[CSS selectors](http://www.w3schools.com/cssref/css_selectors.asp).
+
+Inspired by [`jq`](http://stedolan.github.io/jq/), `pup` aims to be a
+fast and flexible way of exploring HTML from the terminal.
+
 ## Install
 
     go get github.com/ericchiang/pup
 
+## Examples
+
+Download a webpage with `wget`. _Please exercise restraint when using any
+automated request tool._
+
+```bash
+$ wget http://en.wikipedia.org/wiki/Robots_exclusion_standard -O robots.html
+```
+
+### Clean and indent
+
+By default, `pup` will fill in missing tags and properly indent the page.
+
+```bash
+$ cat robots.html
+# nasty looking html
+$ cat robots.html | pup
+# cleaned and indented html
+```
+
+### Filter by tag
+```
+$ pup < robots.html title
+<title>
+ Robots exclusion standard - Wikipedia, the free encyclopedia
+</title>
+```
+
+### Filter by id
+```
+$ pup < robots.html span#See_also
+
+ See also
+
+```
+
+### Chain selectors together
+
+The following two commands are equivalent.
+
+```
+$ pup < robots.html table.navbox ul a | tail
+```
+
+```
+$ pup < robots.html table.navbox | pup ul | pup a | tail
+```
+
+Both produce the output:
+
+```
+
+
+ Stop words
+
+
+ Poison words
+
+
+ Content farm
+
+```
+
+### How many nodes are selected by a filter?
+```
+$ pup < robots.html a -n
+283
+```
+
+### Limit print level
+```
+$ pup < robots.html table -l 2
+
+
+  ...
+
+
+
+
+  ...
+
+
+
+
+  ...
+
+
+```
+
 ## TODO:
 
-* Attribute css selector.
-* Take input from file (-f)
-* Set max print level flag (-l)
+* Attribute css selectors.
 * Print attribute value rather than html ({href})
 * Print result as JSON (--json)
+* Print colorfully