# pup
pup is a command line tool for processing HTML. It reads from stdin,
prints to stdout, and allows the user to filter parts of the page using
[CSS selectors](https://developer.mozilla.org/en-US/docs/Web/Guide/CSS/Getting_started/Selectors).

Inspired by [jq](http://stedolan.github.io/jq/), pup aims to be a
fast and flexible way of exploring HTML from the terminal.

## Install
Direct downloads are available through the [releases page](https://github.com/EricChiang/pup/releases/latest).

If you have Go installed on your computer just run `go get`.

    go get github.com/ericchiang/pup

If you're on OS X, use [Homebrew](http://brew.sh/) to install (no Go required).

    brew install https://raw.githubusercontent.com/EricChiang/pup/master/pup.rb

## Quick start
```bash
$ curl -s https://news.ycombinator.com/
```
Ew, HTML. Let's run that through some pup selectors:
```bash
$ curl -s https://news.ycombinator.com/ | pup 'table table tr:nth-last-of-type(n+2) td.title a'
```
Okay, how about only the links?
```bash
$ curl -s https://news.ycombinator.com/ | pup 'table table tr:nth-last-of-type(n+2) td.title a attr{href}'
```
Even better, let's grab the titles too:
```bash
$ curl -s https://news.ycombinator.com/ | pup 'table table tr:nth-last-of-type(n+2) td.title a json{}'
```
## Basic Usage
```bash
$ cat index.html | pup [flags] '[selectors] [display function]'
```
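
As a rough illustration of the template above, the sketch below feeds a small made-up page to pup. The HTML snippet, the `p.intro` selector, and the choice of `--indent 2` are assumptions picked for the example; only the flag and display function names come from this README.

```bash
# A tiny, made-up page so the example does not depend on a downloaded file.
html='<html><head><title>Demo</title></head><body><p class="intro">Hello</p><p>World</p></body></html>'

# The parts of the usage template: selectors, then an optional display function.
selectors='p.intro'
display='text{}'

if command -v pup >/dev/null 2>&1; then
  # Flags go before the quoted argument; the display function comes last inside it.
  printf '%s\n' "$html" | pup --indent 2 "$selectors $display"
else
  echo 'pup is not on PATH; see the Install section' >&2
fi
```

Keeping the selectors and the display function in one quoted argument stops the shell from interpreting characters such as `{}` or `>`.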
## Examples
Download a webpage with wget.
```bash
$ wget http://en.wikipedia.org/wiki/Robots_exclusion_standard -O robots.html
```
#### Clean and indent

By default pup will fill in missing tags and properly indent the page.
```bash
$ cat robots.html
# nasty looking HTML
$ cat robots.html | pup --color
# cleaned, indented, and colorful HTML
```
#### Filter by tag

```bash
$ cat robots.html | pup 'title'
<title>
Robots exclusion standard - Wikipedia, the free encyclopedia
</title>
```
#### Filter by id

```bash
$ cat robots.html | pup 'span#See_also'
<span class="mw-headline" id="See_also">
See also
</span>
```
#### Filter by attribute

```bash
$ cat robots.html | pup 'th[scope="row"]'
<th scope="row" class="navbox-group">
Exclusion standards
</th>
<th scope="row" class="navbox-group">
Related marketing topics
</th>
<th scope="row" class="navbox-group">
Search marketing related topics
</th>
<th scope="row" class="navbox-group">
Search engine spam
</th>
<th scope="row" class="navbox-group">
Linking
</th>
<th scope="row" class="navbox-group">
People
</th>
<th scope="row" class="navbox-group">
Other
</th>
```
#### Pseudo Classes

CSS selectors have a group of specifiers called ["pseudo classes"](https://developer.mozilla.org/en-US/docs/Web/CSS/Pseudo-classes) which are pretty
cool. pup implements a majority of the relevant ones.

Here are some examples.

```bash
$ cat robots.html | pup 'a[rel]:empty'
<a rel="license" href="//creativecommons.org/licenses/by-sa/3.0/" style="display:none;">
</a>
```
```bash
$ cat robots.html | pup ':contains("History")'
<span class="toctext">
History
</span>
<span class="mw-headline" id="History">
History
</span>
```
```bash
$ cat robots.html | pup ':parent-of([action="edit"])'
<span class="wb-langlinks-edit wb-langlinks-link">
<a action="edit" href="//www.wikidata.org/wiki/Q80776#sitelinks-wikipedia" text="Edit links" title="Edit interlanguage links" class="wbc-editpage">
Edit links
</a>
</span>
```
For a complete list, view the [implemented selectors](#implemented-selectors)
section.
#### `+`, `>`, and `,`

These characters sit between selectors and change how they combine. As in
CSS, `+` matches an adjacent sibling and `>` matches a direct child, while a
comma `,` lets pup take multiple groups of selectors.

```bash
$ cat robots.html | pup 'title, h1 span[dir="auto"]'
<title>
Robots exclusion standard - Wikipedia, the free encyclopedia
</title>
<span dir="auto">
Robots exclusion standard
</span>
```
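
The comma is shown above; `>` and `+` can be sketched in the same way. The HTML fragment and both selectors below are made up for illustration, and the comments describe standard CSS combinator behavior, which pup is assumed to follow here.

```bash
# Hypothetical fragment: a nav list, then a heading followed by two paragraphs.
html='<div id="nav"><ul><li>one</li><li>two</li></ul></div><h2>Title</h2><p>first</p><p>second</p>'

child_selector='div#nav > ul'   # ul elements that are direct children of div#nav
sibling_selector='h2 + p'       # a p element that immediately follows an h2

if command -v pup >/dev/null 2>&1; then
  printf '%s\n' "$html" | pup "$child_selector"
  printf '%s\n' "$html" | pup "$sibling_selector text{}"
else
  echo 'pup is not on PATH; see the Install section' >&2
fi
```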
#### Chain selectors together

When combining selectors, the HTML nodes selected by the previous selector will
be passed to the next ones.
```bash
$ cat robots.html | pup 'h1#firstHeading'
<h1 id="firstHeading" class="firstHeading" lang="en">
<span dir="auto">
Robots exclusion standard
</span>
</h1>
```
```bash
$ cat robots.html | pup 'h1#firstHeading span'
<span dir="auto">
Robots exclusion standard
</span>
```
## Implemented Selectors
For further examples of these selectors head over to [MDN](
https://developer.mozilla.org/en-US/docs/Web/CSS/Reference).
```bash
pup '.class'
pup '#id'
pup 'element'
pup 'selector + selector'
pup 'selector > selector'
pup '[attribute]'
pup '[attribute="value"]'
pup '[attribute*="value"]'
pup '[attribute~="value"]'
pup '[attribute^="value"]'
pup '[attribute$="value"]'
pup ':empty'
pup ':first-child'
pup ':first-of-type'
pup ':last-child'
pup ':last-of-type'
pup ':only-child'
pup ':only-of-type'
pup ':contains("text")'
pup ':nth-child(n)'
pup ':nth-of-type(n)'
pup ':nth-last-child(n)'
pup ':nth-last-of-type(n)'
pup ':not(selector)'
pup ':parent-of(selector)'
```
You can mix and match selectors as you wish.
```bash
cat index.html | pup 'element#id[attribute="value"]:first-of-type'
```
## Display Functions
Non-HTML selectors which affect the output type are implemented as functions
which can be provided as a final argument.

#### `text{}`
Print all text from selected nodes and children in depth first order.
```bash
$ cat robots.html | pup '.mw-headline text{}'
History
About the standard
Disadvantages
Alternatives
Examples
Nonstandard extensions
Crawl-delay directive
Allow directive
Sitemap
Host
Universal "*" match
Meta tags and headers
See also
References
External links
```
#### `attr{attrkey}`
Print the values of all attributes with a given key from all selected nodes.
```bash
$ cat robots.html | pup '.catlinks div attr{id}'
mw-normal-catlinks
mw-hidden-catlinks
```
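
Since `attr{}` prints one value per line, its output is easy to post-process in a shell loop. The sketch below turns relative `href` values into absolute URLs. To stay self-contained it uses a hard-coded stand-in for pup's output (values borrowed from the `json{}` section of this README), and the `base` variable is an assumption based on the host of the page downloaded earlier.

```bash
# Stand-in for: cat robots.html | pup 'div#p-namespaces a attr{href}'
hrefs='/wiki/Robots_exclusion_standard
/wiki/Talk:Robots_exclusion_standard'

# Assumed base URL: the host robots.html was fetched from.
base='http://en.wikipedia.org'

# Read one attribute value per line and prefix it with the base URL.
urls=$(printf '%s\n' "$hrefs" | while IFS= read -r href; do
  printf '%s%s\n' "$base" "$href"
done)

printf '%s\n' "$urls"
```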
#### `json{}`
Print HTML as JSON.
```bash
$ cat robots.html | pup 'div#p-namespaces a'
<a href="/wiki/Robots_exclusion_standard" title="View the content page [c]" accesskey="c">
Article
</a>
<a href="/wiki/Talk:Robots_exclusion_standard" title="Discussion about the content page [t]" accesskey="t">
Talk
</a>
```
```bash
$ cat robots.html | pup 'div#p-namespaces a json{}'
[
{
"accesskey": "c",
"href": "/wiki/Robots_exclusion_standard",
"tag": "a",
"text": "Article",
"title": "View the content page [c]"
},
{
"accesskey": "t",
"href": "/wiki/Talk:Robots_exclusion_standard",
"tag": "a",
"text": "Talk",
"title": "Discussion about the content page [t]"
}
]
```
Use the `-i` / `--indent` flag to control the indent level.
```bash
$ cat robots.html | pup -i 4 'div#p-namespaces a json{}'
[
    {
        "accesskey": "c",
        "href": "/wiki/Robots_exclusion_standard",
        "tag": "a",
        "text": "Article",
        "title": "View the content page [c]"
    },
    {
        "accesskey": "t",
        "href": "/wiki/Talk:Robots_exclusion_standard",
        "tag": "a",
        "text": "Talk",
        "title": "Discussion about the content page [t]"
    }
]
```
If the selectors only return one element the results will be printed as a JSON
object, not a list.
```bash
$ cat robots.html | pup --indent 4 'title json{}'
{
    "tag": "title",
    "text": "Robots exclusion standard - Wikipedia, the free encyclopedia"
}
```
Because there is no universal standard for converting HTML/XML to JSON, a
method has been chosen which hopefully fits. The goal is simply to get the
output of pup into a more consumable format.
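
One way to consume this output is to pipe it into jq. Because a single match is printed as an object and several matches as a list, a small normalizing filter helps. The sketch below assumes jq is installed and uses JSON copied from the examples above in place of a live pup call; the `normalize` filter is an illustrative helper, not part of pup.

```bash
# JSON as printed by the json{} examples above; in practice it would come from
#   cat robots.html | pup 'div#p-namespaces a json{}'
links='[{"accesskey":"c","href":"/wiki/Robots_exclusion_standard","tag":"a","text":"Article","title":"View the content page [c]"},{"accesskey":"t","href":"/wiki/Talk:Robots_exclusion_standard","tag":"a","text":"Talk","title":"Discussion about the content page [t]"}]'
title='{"tag":"title","text":"Robots exclusion standard - Wikipedia, the free encyclopedia"}'

# Emit a stream of objects whether the input is a list or a single object.
normalize='if type == "array" then .[] else . end'

if command -v jq >/dev/null 2>&1; then
  hrefs=$(printf '%s' "$links" | jq -r "$normalize | .href")
  title_text=$(printf '%s' "$title" | jq -r "$normalize | .text")
  printf '%s\n' "$hrefs" "$title_text"
else
  echo 'jq is not on PATH' >&2
fi
```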
## Flags
Run `pup --help` for a list of further options.