diff --git a/.gitignore b/.gitignore index d50cd8c..5054208 100644 --- a/.gitignore +++ b/.gitignore @@ -1,2 +1,3 @@ dist/ testpages/* +tests/test_results.txt diff --git a/README.md b/README.md index afe2205..56b7820 100644 --- a/README.md +++ b/README.md @@ -28,25 +28,25 @@ $ curl -s https://news.ycombinator.com/ Ew, HTML. Let's run that through some pup selectors: ```bash -$ curl -s https://news.ycombinator.com/ | pup 'td.title a[href^=http] attr{href}' +$ curl -s https://news.ycombinator.com/ | pup 'table table tr:nth-last-of-type(n+2) td.title a' ``` -Even better, let's grab the titles too: +Okay, how about only the links? ```bash -$ curl -s https://news.ycombinator.com/ | pup 'td.title a[href^=http] json{}' +$ curl -s https://news.ycombinator.com/ | pup 'table table tr:nth-last-of-type(n+2) td.title a attr{href}' ``` -## Basic Usage +Even better, let's grab the titles too: ```bash -$ cat index.html | pup [flags] [selectors] [optional display function] +$ curl -s https://news.ycombinator.com/ | pup 'table table tr:nth-last-of-type(n+2) td.title a json{}' ``` -or +## Basic Usage ```bash -$ pup < index.html [flags] [selectors] [optional display function] +$ cat index.html | pup [flags] '[selectors] [display function]' ``` ## Examples @@ -69,123 +69,133 @@ $ cat robots.html | pup --color ``` ####Filter by tag + ```bash -$ pup < robots.html title +$ cat robots.html | pup 'title' Robots exclusion standard - Wikipedia, the free encyclopedia ``` ####Filter by id + ```bash -$ pup < robots.html span#See_also +$ cat robots.html | pup 'span#See_also' See also ``` -####Chain selectors together - -The following two commands are (somewhat) equivalent. +####Filter by attribute ```bash -$ pup < robots.html table.navbox ul a | tail +$ cat robots.html | pup 'th[scope="row"]' + + Exclusion standards + + + Related marketing topics + + + Search marketing related topics + + + Search engine spam + + + Linking + + + People + + + Other + ``` -```bash -$ pup < robots.html table.navbox | pup ul | pup a | tail -``` +####Pseudo Classes + +CSS selectors have a group of specifiers called ["pseudo classes"]( +https://developer.mozilla.org/en-US/docs/Web/CSS/Pseudo-classes) which are pretty +cool. pup implements a majority of the relevant ones them. -Both produce the ouput: +Here are some examples. ```bash - - - Stop words - - - Poison words - - - Content farm +$ cat robots.html | pup 'a[rel]:empty' + ``` -Because pup reconstructs the HTML parse tree, funny things can -happen when piping two commands together. I'd recommend chaining -commands rather than pipes. - -####Limit print level - ```bash -$ pup < robots.html table -l 2 - - - ... - - - - - ... - -
- - - ... - - +$ cat robots.html | pup ':contains("History")' + + History + + + History + ``` -####Slices +For a complete list, view the [implemented selectors](#Implemented Selectors) +section. -Slices allow you to do simple `{start:end:by}` operations to limit the number of nodes -selected for the next round of selection. +####Chain selectors together -Provide one number for a simple index. +When combining selectors, the HTML nodes selected by the previous selector will +be passed to the next ones. ```bash -$ pup < robots.html a slice{0} - - +$ cat robots.html | pup 'h1#firstHeading' +

+ + Robots exclusion standard + +

``` -You can provide an end to limit the number of nodes selected. - ```bash -$ # {:3} is the same as {0:3} -$ pup < robots.html a slice{:3} - - - - navigation - - - search - +$ cat robots.html | pup 'h1#firstHeading span' + + Robots exclusion standard + ``` ## Implemented Selectors -For further examples of these selectors head over to [MDN](https://developer.mozilla.org/en-US/docs/Web/CSS/Reference). +For further examples of these selectors head over to [MDN]( +https://developer.mozilla.org/en-US/docs/Web/CSS/Reference). ```bash -cat index.html | pup .class -# '#' indicates comments at the command line so you have to escape it -cat index.html | pup \#id -cat index.html | pup element -cat index.html | pup [attribute] -cat index.html | pup [attribute=value] -# Probably best to quote enclose wildcards -cat index.html | pup '[attribute*=value]' -cat index.html | pup [attribute~=value] -cat index.html | pup [attribute^=value] -cat index.html | pup [attribute$=value] +cat index.html | pup '.class' +cat index.html | pup '#id' +cat index.html | pup 'element' +cat index.html | pup 'selector + selector' +cat index.html | pup 'selector > selector' +cat index.html | pup '[attribute]' +cat index.html | pup '[attribute="value"]' +cat index.html | pup '[attribute*="value"]' +cat index.html | pup '[attribute~="value"]' +cat index.html | pup '[attribute^="value"]' +cat index.html | pup '[attribute$="value"]' +cat index.html | pup ':empty' +cat index.html | pup ':first-child' +cat index.html | pup ':first-of-type' +cat index.html | pup ':last-child' +cat index.html | pup ':last-of-type' +cat index.html | pup ':only-child' +cat index.html | pup ':only-of-type' +cat index.html | pup ':contains("text")' +cat index.html | pup ':nth-child(n)' +cat index.html | pup ':nth-of-type(n)' +cat index.html | pup ':nth-last-child(n)' +cat index.html | pup ':nth-last-of-type(n)' ``` You can mix and match selectors as you wish. ```bash -cat index.html | pup element#id[attribute=value] +cat index.html | pup 'element#id[attribute="value"]:first-of-type' ``` ## Display Functions @@ -198,7 +208,7 @@ which can be provided as a final argument. Print all text from selected nodes and children in depth first order. ```bash -$ cat robots.html | pup .mw-headline text{} +$ cat robots.html | pup '.mw-headline text{}' History About the standard Disadvantages @@ -221,17 +231,9 @@ External links Print the values of all attributes with a given key from all selected nodes. ```bash -$ pup < robots.html a attr{href} | head -#mw-navigation -#p-search -/wiki/MediaWiki:Robots.txt -//en.wikipedia.org/robots.txt -/wiki/Wikipedia:What_Wikipedia_is_not#NOTHOWTO -//en.wikipedia.org/w/index.php?title=Robots_exclusion_standard&action=edit -//meta.wikimedia.org/wiki/Help:Transwiki -//en.wikiversity.org/wiki/ -//en.wikibooks.org/wiki/ -//en.wikivoyage.org/wiki/ +$ cat robots.html | pup '.catlinks div attr{id}' +mw-normal-catlinks +mw-hidden-catlinks ``` #### `json{}` @@ -239,7 +241,7 @@ $ pup < robots.html a attr{href} | head Print HTML as JSON. ```bash -$ cat robots.html | pup div#p-namespaces a +$ cat robots.html | pup 'div#p-namespaces a' Article @@ -249,25 +251,21 @@ $ cat robots.html | pup div#p-namespaces a ``` ```bash -$ cat robots.html | pup div#p-namespaces a json{} +$ cat robots.html | pup 'div#p-namespaces a json{}' [ { - "attrs": { - "accesskey": "c", - "href": "/wiki/Robots_exclusion_standard", - "title": "View the content page [c]" - }, + "accesskey": "c", + "href": "/wiki/Robots_exclusion_standard", "tag": "a", - "text": "Article" + "text": "Article", + "title": "View the content page [c]" }, { - "attrs": { - "accesskey": "t", - "href": "/wiki/Talk:Robots_exclusion_standard", - "title": "Discussion about the content page [t]" - }, + "accesskey": "t", + "href": "/wiki/Talk:Robots_exclusion_standard", "tag": "a", - "text": "Talk" + "text": "Talk", + "title": "Discussion about the content page [t]" } ] ``` @@ -275,25 +273,21 @@ $ cat robots.html | pup div#p-namespaces a json{} Use the `-i` / `--indent` flag to control the intent level. ```bash -$ cat robots.html | pup --indent 4 div#p-namespaces a json{} +$ cat robots.html | pup -i 4 'div#p-namespaces a json{}' [ { - "attrs": { - "accesskey": "c", - "href": "/wiki/Robots_exclusion_standard", - "title": "View the content page [c]" - }, + "accesskey": "c", + "href": "/wiki/Robots_exclusion_standard", "tag": "a", - "text": "Article" + "text": "Article", + "title": "View the content page [c]" }, { - "attrs": { - "accesskey": "t", - "href": "/wiki/Talk:Robots_exclusion_standard", - "title": "Discussion about the content page [t]" - }, + "accesskey": "t", + "href": "/wiki/Talk:Robots_exclusion_standard", "tag": "a", - "text": "Talk" + "text": "Talk", + "title": "Discussion about the content page [t]" } ] ``` @@ -302,7 +296,7 @@ If the selectors only return one element the results will be printed as a JSON object, not a list. ```bash -$ cat robots.html | pup --indent 4 title json{} +$ cat robots.html | pup --indent 4 'title json{}' { "tag": "title", "text": "Robots exclusion standard - Wikipedia, the free encyclopedia" @@ -324,29 +318,3 @@ output of pup into a more consumable format. -l --limit restrict number of levels printed --version display version ``` - -## TODO - -Add more selectors: - -``` -div > p -div + p - -p:contains - -p:empty - -p:first-child -p:first-of-type -p:last-child -p:last-of-type - -p:nth-child(2) -p:nth-last-child(2) -p:nth-last-of-type(2) -p:nth-of-type(2) - -p:only-of-type -p:only-child -``` diff --git a/display.go b/display.go index 2d08f62..e81a336 100644 --- a/display.go +++ b/display.go @@ -7,15 +7,147 @@ import ( "strings" "code.google.com/p/go.net/html" + "code.google.com/p/go.net/html/atom" + "github.com/fatih/color" + "github.com/mattn/go-colorable" ) +func init() { + color.Output = colorable.NewColorableStdout() +} + type Displayer interface { - Display(nodes []*html.Node) + Display([]*html.Node) } -type TextDisplayer struct { +func ParseDisplayer(cmd string) error { + attrRe := regexp.MustCompile(`attr\{([a-zA-Z\-]+)\}`) + if cmd == "text{}" { + pupDisplayer = TextDisplayer{} + } else if cmd == "json{}" { + pupDisplayer = JSONDisplayer{} + } else if match := attrRe.FindAllStringSubmatch(cmd, -1); len(match) == 1 { + pupDisplayer = AttrDisplayer{ + Attr: match[0][1], + } + } else { + return fmt.Errorf("Unknown displayer") + } + return nil +} + +// Is this node a tag with no end tag such as or
? +// http://www.w3.org/TR/html-markup/syntax.html#syntax-elements +func isVoidElement(n *html.Node) bool { + switch n.DataAtom { + case atom.Area, atom.Base, atom.Br, atom.Col, atom.Command, atom.Embed, + atom.Hr, atom.Img, atom.Input, atom.Keygen, atom.Link, + atom.Meta, atom.Param, atom.Source, atom.Track, atom.Wbr: + return true + } + return false } +var ( + // Colors + tagColor *color.Color = color.New(color.FgCyan) + tokenColor = color.New(color.FgCyan) + attrKeyColor = color.New(color.FgMagenta) + quoteColor = color.New(color.FgBlue) + commentColor = color.New(color.FgYellow) +) + +type TreeDisplayer struct { +} + +func (t TreeDisplayer) Display(nodes []*html.Node) { + for _, node := range nodes { + t.printNode(node, 0) + } +} + +// Print a node and all of it's children to `maxlevel`. +func (t TreeDisplayer) printNode(n *html.Node, level int) { + switch n.Type { + case html.TextNode: + s := html.EscapeString(n.Data) + s = strings.TrimSpace(s) + if s != "" { + t.printIndent(level) + fmt.Println(s) + } + case html.ElementNode: + t.printIndent(level) + if pupPrintColor { + tokenColor.Print("<") + tagColor.Printf("%s", n.Data) + } else { + fmt.Printf("<%s", n.Data) + } + for _, a := range n.Attr { + val := html.EscapeString(a.Val) + if pupPrintColor { + fmt.Print(" ") + attrKeyColor.Printf("%s", a.Key) + tokenColor.Print("=") + quoteColor.Printf(`"%s"`, val) + } else { + fmt.Printf(` %s="%s"`, a.Key, val) + } + } + if pupPrintColor { + tokenColor.Println(">") + } else { + fmt.Println(">") + } + if !isVoidElement(n) { + t.printChildren(n, level+1) + t.printIndent(level) + if pupPrintColor { + tokenColor.Print("") + } else { + fmt.Printf("\n", n.Data) + } + } + case html.CommentNode: + t.printIndent(level) + if pupPrintColor { + commentColor.Printf("\n", n.Data) + } else { + fmt.Printf("\n", n.Data) + } + t.printChildren(n, level) + case html.DoctypeNode, html.DocumentNode: + t.printChildren(n, level) + } +} + +func (t TreeDisplayer) printChildren(n *html.Node, level int) { + if pupMaxPrintLevel > -1 { + if level >= pupMaxPrintLevel { + t.printIndent(level) + fmt.Println("...") + return + } + } + child := n.FirstChild + for child != nil { + t.printNode(child, level) + child = child.NextSibling + } +} + +func (t TreeDisplayer) printIndent(level int) { + for ; level > 0; level-- { + fmt.Print(pupIndentString) + } +} + +// Print the text of a node +type TextDisplayer struct{} + func (t TextDisplayer) Display(nodes []*html.Node) { for _, node := range nodes { if node.Type == html.TextNode { @@ -31,6 +163,7 @@ func (t TextDisplayer) Display(nodes []*html.Node) { } } +// Print the attribute of a node type AttrDisplayer struct { Attr string } @@ -47,18 +180,16 @@ func (a AttrDisplayer) Display(nodes []*html.Node) { } } -type JSONDisplayer struct { -} +// Print nodes as a JSON list +type JSONDisplayer struct{} // returns a jsonifiable struct func jsonify(node *html.Node) map[string]interface{} { vals := map[string]interface{}{} if len(node.Attr) > 0 { - attrs := map[string]string{} for _, attr := range node.Attr { - attrs[attr.Key] = html.EscapeString(attr.Val) + vals[attr.Key] = html.EscapeString(attr.Val) } - vals["attrs"] = attrs } vals["tag"] = node.DataAtom.String() children := []interface{}{} @@ -99,7 +230,7 @@ func (j JSONDisplayer) Display(nodes []*html.Node) { node := nodes[0] if node.Type != html.DocumentNode { jsonNode := jsonify(nodes[0]) - data, err = json.MarshalIndent(&jsonNode, "", indentString) + data, err = json.MarshalIndent(&jsonNode, "", pupIndentString) } else { children := []*html.Node{} child := node.FirstChild @@ -114,38 +245,10 @@ func (j JSONDisplayer) Display(nodes []*html.Node) { for _, node := range nodes { jsonNodes = append(jsonNodes, jsonify(node)) } - data, err = json.MarshalIndent(&jsonNodes, "", indentString) + data, err = json.MarshalIndent(&jsonNodes, "", pupIndentString) } if err != nil { panic("Could not jsonify nodes") } fmt.Printf("%s", data) } - -var ( - // Display function helpers - displayMatcher *regexp.Regexp = regexp.MustCompile(`\{[^\}]*\}$`) - textFuncMatcher = regexp.MustCompile(`^text\{\}$`) - attrFuncMatcher = regexp.MustCompile(`^attr\{([^\}]*)\}$`) - jsonFuncMatcher = regexp.MustCompile(`^json\{([^\}]*)\}$`) -) - -func NewDisplayFunc(text string) (Displayer, error) { - if !displayMatcher.MatchString(text) { - return nil, fmt.Errorf("Not a display function") - } - switch { - case textFuncMatcher.MatchString(text): - return TextDisplayer{}, nil - case attrFuncMatcher.MatchString(text): - matches := attrFuncMatcher.FindStringSubmatch(text) - if len(matches) != 2 { - return nil, fmt.Errorf("") - } else { - return AttrDisplayer{matches[1]}, nil - } - case jsonFuncMatcher.MatchString(text): - return JSONDisplayer{}, nil - } - return nil, fmt.Errorf("Not a display function") -} diff --git a/main.go b/main.go deleted file mode 100644 index e85ea4a..0000000 --- a/main.go +++ /dev/null @@ -1,202 +0,0 @@ -package main - -import ( - "code.google.com/p/go.net/html" - "code.google.com/p/go.net/html/charset" - "fmt" - "github.com/ericchiang/pup/selector" - "io" - "os" - "strconv" - "strings" -) - -const VERSION string = "0.3.2" - -var ( - // Flags - attributes []string = []string{} - inputStream io.ReadCloser = os.Stdin - indentString string = " " - maxPrintLevel int = -1 - printNumber bool = false - printColor bool = false - displayer Displayer = nil -) - -// Print to stderr and exit -func Fatal(format string, args ...interface{}) { - fmt.Fprintf(os.Stderr, format, args...) - fmt.Fprintf(os.Stderr, "\n") - os.Exit(1) -} - -// Print help to stderr and quit -func PrintHelp() { - helpString := `Usage - - pup [flags] [selectors] [optional display function] - -Version - - %s - -Flags - - -c --color print result with color - -f --file file to read from - -h --help display this help - -i --indent number of spaces to use for indent or character - -n --number print number of elements selected - -l --limit restrict number of levels printed - --version display version` - Fatal(helpString, VERSION) -} - -// Process command arguments and return all non-flags. -func ProcessFlags(cmds []string) []string { - var i int - var err error - defer func() { - if r := recover(); r != nil { - Fatal("Option '%s' requires an argument", cmds[i]) - } - }() - nonFlagCmds := make([]string, len(cmds)) - n := 0 - for i = 0; i < len(cmds); i++ { - cmd := cmds[i] - switch cmd { - case "-a", "--attr": - attributes = append(attributes, cmds[i+1]) - i++ - case "-c", "--color": - printColor = true - case "-f", "--file": - filename := cmds[i+1] - inputStream, err = os.Open(filename) - if err != nil { - Fatal(err.Error()) - } - i++ - case "-h", "--help": - PrintHelp() - os.Exit(1) - case "-i", "--indent": - indentLevel, err := strconv.Atoi(cmds[i+1]) - if err == nil { - indentString = strings.Repeat(" ", indentLevel) - } else { - indentString = cmds[i+1] - } - i++ - case "-n", "--number": - printNumber = true - case "-l", "--limit": - maxPrintLevel, err = strconv.Atoi(cmds[i+1]) - if err != nil { - Fatal("Argument for '%s' must be numeric", - cmds) - } - i++ - case "--version": - Fatal(VERSION) - default: - if cmd[0] == '-' { - Fatal("Unrecognized flag '%s'", cmd) - } - nonFlagCmds[n] = cmds[i] - n++ - } - } - return nonFlagCmds[:n] -} - -// Split a string while ignoring strings. -func QuoteSplit(input string) []string { - last := 0 - split := []string{} - inQuote := false - quoteChar := ' ' - escapeNext := false - for i, c := range input { - if escapeNext { - escapeNext = false - continue - } - switch c { - case ' ': - if !inQuote { - if last < i { - split = append(split, input[last:i]) - } - last = i + 1 - } - case '"', '\'': - if inQuote { - if c == quoteChar { - inQuote = false - } - } else { - inQuote = true - quoteChar = c - } - case '\\': - escapeNext = true - } - } - if last < len(input) { - split = append(split, input[last:len(input)]) - } - return split -} - -// pup -func main() { - args := QuoteSplit(strings.Join(os.Args[1:], " ")) - cmds := ProcessFlags(args) - cr, err := charset.NewReader(inputStream, "") - if err != nil { - fmt.Fprintf(os.Stderr, err.Error()) - os.Exit(2) - } - root, err := html.Parse(cr) - if err != nil { - fmt.Fprintf(os.Stderr, err.Error()) - os.Exit(2) - } - inputStream.Close() - if len(cmds) == 0 { - t := TreeDisplayer{indentString} - t.Display([]*html.Node{root}) - os.Exit(0) - } - selectors := make([]selector.Selector, len(cmds)) - for i, cmd := range cmds { - // if this is the last element, check for a function like - // text{} or attr{} - if i+1 == len(cmds) { - d, err := NewDisplayFunc(cmd) - if err == nil { - displayer = d - selectors = selectors[0 : len(cmds)-1] - break - } else { - displayer = TreeDisplayer{indentString} - } - } - selectors[i], err = selector.NewSelector(cmd) - if err != nil { - Fatal("Selector parse error: %s", err) - } - } - currNodes := []*html.Node{root} - for _, selector := range selectors { - currNodes = selector.Select(currNodes) - } - if printNumber { - fmt.Println(len(currNodes)) - } else { - displayer.Display(currNodes) - } -} diff --git a/parse.go b/parse.go new file mode 100644 index 0000000..f61a43f --- /dev/null +++ b/parse.go @@ -0,0 +1,135 @@ +package main + +import ( + "fmt" + "io" + "os" + "strconv" + "strings" +) + +var ( + pupIn io.ReadCloser = os.Stdin + pupMaxPrintLevel int = -1 + pupPrintColor bool = false + pupIndentString string = " " + pupDisplayer Displayer = TreeDisplayer{} +) + +func PrintHelp(w io.Writer, exitCode int) { + helpString := `Usage + pup [flags] [selectors] [optional display function] +Version + %s +Flags + -c --color print result with color + -f --file file to read from + -h --help display this help + -i --indent number of spaces to use for indent or character + -n --number print number of elements selected + -l --limit restrict number of levels printed + --version display version +` + fmt.Fprintf(w, helpString, VERSION) + os.Exit(exitCode) +} + +func ParseArgs() []string { + cmds := ProcessFlags(os.Args[1:]) + return ParseCommands(strings.Join(cmds, " ")) +} + +// Process command arguments and return all non-flags. +func ProcessFlags(cmds []string) []string { + var i int + var err error + defer func() { + if r := recover(); r != nil { + fmt.Fprintf(os.Stderr, "Option '%s' requires an argument", cmds[i]) + os.Exit(2) + } + }() + nonFlagCmds := make([]string, len(cmds)) + n := 0 + for i = 0; i < len(cmds); i++ { + cmd := cmds[i] + switch cmd { + case "-c", "--color": + pupPrintColor = true + case "-f", "--file": + filename := cmds[i+1] + pupIn, err = os.Open(filename) + if err != nil { + fmt.Fprintf(os.Stderr, "%s\n", err.Error()) + os.Exit(2) + } + i++ + case "-h", "--help": + PrintHelp(os.Stdout, 0) + case "-i", "--indent": + indentLevel, err := strconv.Atoi(cmds[i+1]) + if err == nil { + pupIndentString = strings.Repeat(" ", indentLevel) + } else { + pupIndentString = cmds[i+1] + } + i++ + case "-l", "--limit": + pupMaxPrintLevel, err = strconv.Atoi(cmds[i+1]) + if err != nil { + fmt.Fprintf(os.Stderr, "Argument for '%s' must be numeric\n", cmd) + os.Exit(2) + } + i++ + case "--version": + fmt.Println(VERSION) + os.Exit(0) + default: + if cmd[0] == '-' { + fmt.Fprintf(os.Stderr, "Unrecognized flag '%s'", cmd) + os.Exit(2) + } + nonFlagCmds[n] = cmds[i] + n++ + } + } + return nonFlagCmds[:n] +} + +// Split a string with awareness for quoted text +func ParseCommands(cmdString string) []string { + cmds := []string{} + last, next, max := 0, 0, len(cmdString) + for { + // if we're at the end of the string, return + if next == max { + if next > last { + cmds = append(cmds, cmdString[last:next]) + } + return cmds + } + // evalute a rune + c := cmdString[next] + switch c { + case ' ': + if next > last { + cmds = append(cmds, cmdString[last:next]) + } + last = next + 1 + case '\'', '"': + // for quotes, consume runes until the quote has ended + quoteChar := c + for { + next++ + if next == max { + fmt.Fprintf(os.Stderr, "Unmatched open quote (%c)\n", quoteChar) + os.Exit(2) + } + if cmdString[next] == quoteChar { + break + } + } + } + next++ + } +} diff --git a/printing.go b/printing.go deleted file mode 100644 index e22632a..0000000 --- a/printing.go +++ /dev/null @@ -1,128 +0,0 @@ -package main - -import ( - "fmt" - "strings" - - "code.google.com/p/go.net/html" - "code.google.com/p/go.net/html/atom" - - "github.com/fatih/color" - "github.com/mattn/go-colorable" -) - -var ( - // Colors - tagColor *color.Color = color.New(color.FgCyan) - tokenColor = color.New(color.FgCyan) - attrKeyColor = color.New(color.FgMagenta) - quoteColor = color.New(color.FgBlue) - commentColor = color.New(color.FgYellow) -) - -func init() { - color.Output = colorable.NewColorableStdout() -} - -// Is this node a tag with no end tag such as or
? -// http://www.w3.org/TR/html-markup/syntax.html#syntax-elements -func isVoidElement(n *html.Node) bool { - switch n.DataAtom { - case atom.Area, atom.Base, atom.Br, atom.Col, atom.Command, atom.Embed, - atom.Hr, atom.Img, atom.Input, atom.Keygen, atom.Link, - atom.Meta, atom.Param, atom.Source, atom.Track, atom.Wbr: - return true - } - return false - -} - -type TreeDisplayer struct { - IndentString string -} - -func (t TreeDisplayer) Display(nodes []*html.Node) { - for _, node := range nodes { - t.printNode(node, 0) - } -} - -func (t TreeDisplayer) printChildren(n *html.Node, level int) { - if maxPrintLevel > -1 { - if level >= maxPrintLevel { - t.printIndent(level) - fmt.Println("...") - return - } - } - child := n.FirstChild - for child != nil { - t.printNode(child, level) - child = child.NextSibling - } -} - -func (t TreeDisplayer) printIndent(level int) { - for ; level > 0; level-- { - fmt.Print(indentString) - } -} - -// Print a node and all of it's children to `maxlevel`. -func (t TreeDisplayer) printNode(n *html.Node, level int) { - switch n.Type { - case html.TextNode: - s := html.EscapeString(n.Data) - s = strings.TrimSpace(s) - if s != "" { - t.printIndent(level + 1) - fmt.Println(s) - } - case html.ElementNode: - t.printIndent(level) - if printColor { - tokenColor.Print("<") - tagColor.Printf("%s", n.Data) - } else { - fmt.Printf("<%s", n.Data) - } - for _, a := range n.Attr { - if printColor { - fmt.Print(" ") - attrKeyColor.Printf("%s", a.Key) - tokenColor.Print("=") - val := html.EscapeString(a.Val) - quoteColor.Printf(`"%s"`, val) - } else { - val := html.EscapeString(a.Val) - fmt.Printf(` %s="%s"`, a.Key, val) - } - } - if printColor { - tokenColor.Println(">") - } else { - fmt.Print(">\n") - } - if !isVoidElement(n) { - t.printChildren(n, level+1) - t.printIndent(level) - if printColor { - tokenColor.Print("") - } else { - fmt.Printf("\n", n.Data) - } - } - case html.CommentNode: - t.printIndent(level) - if printColor { - commentColor.Printf("\n", n.Data) - } else { - fmt.Printf("\n", n.Data) - } - t.printChildren(n, level) - case html.DoctypeNode, html.DocumentNode: - t.printChildren(n, level) - } -} diff --git a/pup.go b/pup.go new file mode 100644 index 0000000..961553f --- /dev/null +++ b/pup.go @@ -0,0 +1,74 @@ +package main + +import ( + "fmt" + "os" + + "code.google.com/p/go.net/html" + "code.google.com/p/go.net/html/charset" +) + +// _=,_ +// o_/6 /#\ +// \__ |##/ +// ='|--\ +// / #'-. +// \#|_ _'-. / +// |/ \_( # |" +// C/ ,--___/ + +var VERSION string = "0.3.3" + +func main() { + // process flags and arguments + cmds := ParseArgs() + + // Determine the charset of the input + cr, err := charset.NewReader(pupIn, "") + if err != nil { + fmt.Fprintf(os.Stderr, err.Error()) + os.Exit(2) + } + + // Parse the input and get the root node + root, err := html.Parse(cr) + if err != nil { + fmt.Fprintf(os.Stderr, err.Error()) + os.Exit(2) + } + + // Parse the selectors + selectorFuncs := []SelectorFunc{} + funcGenerator := Select + var cmd string + for len(cmds) > 0 { + cmd, cmds = cmds[0], cmds[1:] + if len(cmds) == 0 { + if err := ParseDisplayer(cmd); err == nil { + continue + } + } + switch cmd { + case "*": + continue + case "+": + funcGenerator = SelectFromChildren + case ">": + funcGenerator = SelectNextSibling + default: + selector, err := ParseSelector(cmd) + if err != nil { + fmt.Fprintf(os.Stderr, "Selector parsing error: %s\n", err.Error()) + os.Exit(2) + } + selectorFuncs = append(selectorFuncs, funcGenerator(selector)) + funcGenerator = Select + } + } + + currNodes := []*html.Node{root} + for _, selectorFunc := range selectorFuncs { + currNodes = selectorFunc(currNodes) + } + pupDisplayer.Display(currNodes) +} diff --git a/pup.rb b/pup.rb index 8514fa6..0d6f587 100644 --- a/pup.rb +++ b/pup.rb @@ -2,14 +2,14 @@ require 'formula' class Pup < Formula homepage 'https://github.com/EricChiang/pup' - version '0.3.2' + version '0.3.3' if Hardware.is_64_bit? - url 'https://github.com/EricChiang/pup/releases/download/v0.3.2/pup_darwin_amd64.zip' - sha1 '9d5ad4c0b78701b1868094bf630adbbd26ae1698' + url 'https://github.com/EricChiang/pup/releases/download/v0.3.3/pup_darwin_amd64.zip' + sha1 'e5a74c032abd8bc81e4a12b06d0c071343811949' else - url 'https://github.com/EricChiang/pup/releases/download/v0.3.2/pup_darwin_386.zip' - sha1 '21487bc5abdac34021f25444ab481e267bccbd72' + url 'https://github.com/EricChiang/pup/releases/download/v0.3.3/pup_darwin_386.zip' + sha1 'cd7d18cae7d8bf6af8bdb04c963156a1b217dfcb' end def install diff --git a/selector.go b/selector.go new file mode 100644 index 0000000..75d4307 --- /dev/null +++ b/selector.go @@ -0,0 +1,585 @@ +package main + +import ( + "bytes" + "fmt" + "regexp" + "strconv" + "strings" + "text/scanner" + + "code.google.com/p/go.net/html" +) + +type Selector interface { + Match(node *html.Node) bool +} + +type SelectorFunc func(nodes []*html.Node) []*html.Node + +func Select(s Selector) SelectorFunc { + // have to define first to be able to do recursion + var selectChildren func(node *html.Node) []*html.Node + selectChildren = func(node *html.Node) []*html.Node { + selected := []*html.Node{} + for child := node.FirstChild; child != nil; child = child.NextSibling { + if s.Match(child) { + selected = append(selected, child) + } else { + selected = append(selected, selectChildren(child)...) + } + } + return selected + } + return func(nodes []*html.Node) []*html.Node { + selected := []*html.Node{} + for _, node := range nodes { + selected = append(selected, selectChildren(node)...) + } + return selected + } +} + +// Defined for the '>' selector +func SelectNextSibling(s Selector) SelectorFunc { + return func(nodes []*html.Node) []*html.Node { + selected := []*html.Node{} + for _, node := range nodes { + for ns := node.NextSibling; ns != nil; ns = ns.NextSibling { + if ns.Type == html.ElementNode { + if s.Match(ns) { + selected = append(selected, ns) + } + break + } + } + } + return selected + } +} + +// Defined for the '+' selector +func SelectFromChildren(s Selector) SelectorFunc { + return func(nodes []*html.Node) []*html.Node { + selected := []*html.Node{} + for _, node := range nodes { + for c := node.FirstChild; c != nil; c = c.NextSibling { + if s.Match(c) { + selected = append(selected, c) + } + } + } + return selected + } +} + +type PseudoClass func(*html.Node) bool + +type CSSSelector struct { + Tag string + Attrs map[string]*regexp.Regexp + Pseudo PseudoClass +} + +func (s CSSSelector) Match(node *html.Node) bool { + if node.Type != html.ElementNode { + return false + } + if s.Tag != "" { + if s.Tag != node.DataAtom.String() { + return false + } + } + for attrKey, matcher := range s.Attrs { + matched := false + for _, attr := range node.Attr { + if attrKey == attr.Key { + if !matcher.MatchString(attr.Val) { + return false + } + matched = true + break + } + } + if !matched { + return false + } + } + if s.Pseudo == nil { + return true + } + return s.Pseudo(node) +} + +// Parse a selector +// e.g. `div#my-button.btn[href^="http"]` +func ParseSelector(cmd string) (selector CSSSelector, err error) { + selector = CSSSelector{ + Tag: "", + Attrs: map[string]*regexp.Regexp{}, + Pseudo: nil, + } + var s scanner.Scanner + s.Init(strings.NewReader(cmd)) + err = ParseTagMatcher(&selector, s) + return +} + +// Parse the initial tag +// e.g. `div` +func ParseTagMatcher(selector *CSSSelector, s scanner.Scanner) error { + tag := bytes.NewBuffer([]byte{}) + defer func() { + selector.Tag = tag.String() + }() + for { + c := s.Next() + switch c { + case scanner.EOF: + return nil + case '.': + return ParseClassMatcher(selector, s) + case '#': + return ParseIdMatcher(selector, s) + case '[': + return ParseAttrMatcher(selector, s) + case ':': + return ParsePseudo(selector, s) + default: + if _, err := tag.WriteRune(c); err != nil { + return err + } + } + } +} + +// Parse a class matcher +// e.g. `.btn` +func ParseClassMatcher(selector *CSSSelector, s scanner.Scanner) error { + var class bytes.Buffer + defer func() { + regexpStr := `(\A|\s)` + regexp.QuoteMeta(class.String()) + `(\s|\z)` + selector.Attrs["class"] = regexp.MustCompile(regexpStr) + }() + for { + c := s.Next() + switch c { + case scanner.EOF: + return nil + case '.': + return ParseClassMatcher(selector, s) + case '#': + return ParseIdMatcher(selector, s) + case '[': + return ParseAttrMatcher(selector, s) + case ':': + return ParsePseudo(selector, s) + default: + if _, err := class.WriteRune(c); err != nil { + return err + } + } + } +} + +// Parse an id matcher +// e.g. `#my-picture` +func ParseIdMatcher(selector *CSSSelector, s scanner.Scanner) error { + var id bytes.Buffer + defer func() { + regexpStr := `^` + regexp.QuoteMeta(id.String()) + `$` + selector.Attrs["id"] = regexp.MustCompile(regexpStr) + }() + for { + c := s.Next() + switch c { + case scanner.EOF: + return nil + case '.': + return ParseClassMatcher(selector, s) + case '#': + return ParseIdMatcher(selector, s) + case '[': + return ParseAttrMatcher(selector, s) + case ':': + return ParsePseudo(selector, s) + default: + if _, err := id.WriteRune(c); err != nil { + return err + } + } + } +} + +// Parse an attribute matcher +// e.g. `[attr^="http"]` +func ParseAttrMatcher(selector *CSSSelector, s scanner.Scanner) error { + var attrKey bytes.Buffer + var attrVal bytes.Buffer + hasMatchVal := false + matchType := '=' + defer func() { + if hasMatchVal { + var regexpStr string + switch matchType { + case '=': + regexpStr = `^` + regexp.QuoteMeta(attrVal.String()) + `$` + case '*': + regexpStr = regexp.QuoteMeta(attrVal.String()) + case '$': + regexpStr = regexp.QuoteMeta(attrVal.String()) + `$` + case '^': + regexpStr = `^` + regexp.QuoteMeta(attrVal.String()) + case '~': + regexpStr = `(\A|\s)` + regexp.QuoteMeta(attrVal.String()) + `(\s|\z)` + } + selector.Attrs[attrKey.String()] = regexp.MustCompile(regexpStr) + } else { + selector.Attrs[attrKey.String()] = regexp.MustCompile(`^.*$`) + } + }() + // After reaching ']' proceed + proceed := func() error { + switch s.Next() { + case scanner.EOF: + return nil + case '.': + return ParseClassMatcher(selector, s) + case '#': + return ParseIdMatcher(selector, s) + case '[': + return ParseAttrMatcher(selector, s) + case ':': + return ParsePseudo(selector, s) + default: + return fmt.Errorf("Expected selector indicator after ']'") + } + } + // Parse the attribute key matcher + for !hasMatchVal { + c := s.Next() + switch c { + case scanner.EOF: + return fmt.Errorf("Unmatched open brace '['") + case ']': + // No attribute value matcher, proceed! + return proceed() + case '$', '^', '~', '*': + matchType = c + hasMatchVal = true + if s.Next() != '=' { + return fmt.Errorf("'%c' must be followed by a '='", matchType) + } + case '=': + matchType = c + hasMatchVal = true + default: + if _, err := attrKey.WriteRune(c); err != nil { + return err + } + } + } + // figure out if the value is quoted + c := s.Next() + inQuote := false + switch c { + case scanner.EOF: + return fmt.Errorf("Unmatched open brace '['") + case ']': + return proceed() + case '"': + inQuote = true + default: + if _, err := attrVal.WriteRune(c); err != nil { + return err + } + } + if inQuote { + for { + c := s.Next() + switch c { + case '\\': + // consume another character + if c = s.Next(); c == scanner.EOF { + return fmt.Errorf("Unmatched open brace '['") + } + case '"': + switch s.Next() { + case ']': + return proceed() + default: + return fmt.Errorf("Quote must end at ']'") + } + } + if _, err := attrVal.WriteRune(c); err != nil { + return err + } + } + } else { + for { + c := s.Next() + switch c { + case scanner.EOF: + return fmt.Errorf("Unmatched open brace '['") + case ']': + // No attribute value matcher, proceed! + return proceed() + } + if _, err := attrVal.WriteRune(c); err != nil { + return err + } + } + } +} + +// Parse the selector after ':' +func ParsePseudo(selector *CSSSelector, s scanner.Scanner) error { + if selector.Pseudo != nil { + return fmt.Errorf("Combined multiple pseudo classes") + } + var b bytes.Buffer + for s.Peek() != scanner.EOF { + if _, err := b.WriteRune(s.Next()); err != nil { + return err + } + } + cmd := b.String() + var err error + switch { + case cmd == "empty": + selector.Pseudo = func(n *html.Node) bool { + return n.FirstChild == nil + } + case cmd == "first-child": + selector.Pseudo = firstChildPseudo + case cmd == "last-child": + selector.Pseudo = lastChildPseudo + case cmd == "only-child": + selector.Pseudo = func(n *html.Node) bool { + return firstChildPseudo(n) && lastChildPseudo(n) + } + case cmd == "first-of-type": + selector.Pseudo = firstOfTypePseudo + case cmd == "last-of-type": + selector.Pseudo = lastOfTypePseudo + case cmd == "only-of-type": + selector.Pseudo = func(n *html.Node) bool { + return firstOfTypePseudo(n) && lastOfTypePseudo(n) + } + case strings.HasPrefix(cmd, "contains("): + selector.Pseudo, err = parseContainsPseudo(cmd[len("contains("):]) + if err != nil { + return err + } + case strings.HasPrefix(cmd, "nth-child("), + strings.HasPrefix(cmd, "nth-last-child("), + strings.HasPrefix(cmd, "nth-last-of-type("), + strings.HasPrefix(cmd, "nth-of-type("): + if selector.Pseudo, err = parseNthPseudo(cmd); err != nil { + return err + } + default: + return fmt.Errorf("%s not a valid pseudo class", cmd) + } + return nil +} + +// :first-of-child +func firstChildPseudo(n *html.Node) bool { + for c := n.PrevSibling; c != nil; c = c.PrevSibling { + if c.Type == html.ElementNode { + return false + } + } + return true +} + +// :last-of-child +func lastChildPseudo(n *html.Node) bool { + for c := n.NextSibling; c != nil; c = c.NextSibling { + if c.Type == html.ElementNode { + return false + } + } + return true +} + +// :first-of-type +func firstOfTypePseudo(node *html.Node) bool { + if node.Type != html.ElementNode { + return false + } + for n := node.PrevSibling; n != nil; n = n.PrevSibling { + if n.DataAtom == node.DataAtom { + return false + } + } + return true +} + +// :last-of-type +func lastOfTypePseudo(node *html.Node) bool { + if node.Type != html.ElementNode { + return false + } + for n := node.NextSibling; n != nil; n = n.NextSibling { + if n.DataAtom == node.DataAtom { + return false + } + } + return true +} + +func parseNthPseudo(cmd string) (PseudoClass, error) { + i := strings.IndexRune(cmd, '(') + if i < 0 { + // really, we should never get here + return nil, fmt.Errorf("Fatal error, '%s' does not contain a '('", cmd) + } + pseudoName := cmd[:i] + // Figure out how the counting function works + var countNth func(*html.Node) int + switch pseudoName { + case "nth-child": + countNth = func(n *html.Node) int { + nth := 1 + for sib := n.PrevSibling; sib != nil; sib = sib.PrevSibling { + if sib.Type == html.ElementNode { + nth++ + } + } + return nth + } + case "nth-of-type": + countNth = func(n *html.Node) int { + nth := 1 + for sib := n.PrevSibling; sib != nil; sib = sib.PrevSibling { + if sib.Type == html.ElementNode && sib.DataAtom == n.DataAtom { + nth++ + } + } + return nth + } + case "nth-last-child": + countNth = func(n *html.Node) int { + nth := 1 + for sib := n.NextSibling; sib != nil; sib = sib.NextSibling { + if sib.Type == html.ElementNode { + nth++ + } + } + return nth + } + case "nth-last-of-type": + countNth = func(n *html.Node) int { + nth := 1 + for sib := n.NextSibling; sib != nil; sib = sib.NextSibling { + if sib.Type == html.ElementNode && sib.DataAtom == n.DataAtom { + nth++ + } + } + return nth + } + default: + return nil, fmt.Errorf("Unrecognized pseudo '%s'", pseudoName) + } + + nthString := cmd[i+1:] + i = strings.IndexRune(nthString, ')') + if i < 0 { + return nil, fmt.Errorf("Unmatched '(' for psuedo class %s", pseudoName) + } else if i != len(nthString)-1 { + return nil, fmt.Errorf("%s(n) must end selector", pseudoName) + } + number := nthString[:i] + + // Check if the number is 'odd' or 'even' + oddOrEven := -1 + switch number { + case "odd": + oddOrEven = 1 + case "even": + oddOrEven = 0 + } + if oddOrEven > -1 { + return func(n *html.Node) bool { + return n.Type == html.ElementNode && countNth(n)%2 == oddOrEven + }, nil + } + // Check against '3n+4' pattern + r := regexp.MustCompile(`([0-9]+)n[ ]?\+[ ]?([0-9])`) + subMatch := r.FindAllStringSubmatch(number, -1) + if len(subMatch) == 1 && len(subMatch[0]) == 3 { + cycle, _ := strconv.Atoi(subMatch[0][1]) + offset, _ := strconv.Atoi(subMatch[0][2]) + return func(n *html.Node) bool { + return n.Type == html.ElementNode && countNth(n)%cycle == offset + }, nil + } + // check against 'n+2' pattern + r = regexp.MustCompile(`n[ ]?\+[ ]?([0-9])`) + subMatch = r.FindAllStringSubmatch(number, -1) + if len(subMatch) == 1 && len(subMatch[0]) == 2 { + offset, _ := strconv.Atoi(subMatch[0][1]) + return func(n *html.Node) bool { + return n.Type == html.ElementNode && countNth(n) >= offset + }, nil + } + // the only other option is a numeric value + nth, err := strconv.Atoi(nthString[:i]) + if err != nil { + return nil, err + } else if nth <= 0 { + return nil, fmt.Errorf("Argument to '%s' must be greater than 0", pseudoName) + } + return func(n *html.Node) bool { + return n.Type == html.ElementNode && countNth(n) == nth + }, nil +} + +// Parse a :contains("") selector +// expects the input to be everything after the open parenthesis +// e.g. for `contains("Help")` the argument would be `"Help")` +func parseContainsPseudo(cmd string) (PseudoClass, error) { + var s scanner.Scanner + s.Init(strings.NewReader(cmd)) + switch s.Next() { + case '"': + default: + return nil, fmt.Errorf("Malformed 'contains(\"\")' selector") + } + textToContain := bytes.NewBuffer([]byte{}) + for { + r := s.Next() + switch r { + case '"': + // ')' then EOF must follow '"' + if s.Next() != ')' { + return nil, fmt.Errorf("Malformed 'contains(\"\")' selector") + } + if s.Next() != scanner.EOF { + return nil, fmt.Errorf("'contains(\"\")' must end selector") + } + text := textToContain.String() + contains := func(node *html.Node) bool { + for c := node.FirstChild; c != nil; c = c.NextSibling { + if c.Type == html.TextNode { + if strings.Contains(c.Data, text) { + return true + } + } + } + return false + } + return contains, nil + case '\\': + s.Next() + case scanner.EOF: + return nil, fmt.Errorf("Malformed 'contains(\"\")' selector") + default: + if _, err := textToContain.WriteRune(r); err != nil { + return nil, err + } + } + } +} diff --git a/selector/selector.go b/selector/selector.go deleted file mode 100644 index f09f754..0000000 --- a/selector/selector.go +++ /dev/null @@ -1,345 +0,0 @@ -package selector - -import ( - "code.google.com/p/go.net/html" - "fmt" - "regexp" - "strconv" - "strings" -) - -// A CSS Selector -type BasicSelector struct { - Name *regexp.Regexp - Attrs map[string]*regexp.Regexp -} - -type Selector interface { - Select(nodes []*html.Node) []*html.Node -} - -type selectorField int - -const ( - ClassField selectorField = iota - IDField - NameField - AttrField -) - -// Parse an attribute command to a key string and a regexp -func parseAttrField(command string) (attrKey string, matcher *regexp.Regexp, - err error) { - - attrSplit := strings.Split(command, "=") - matcherString := "" - switch len(attrSplit) { - case 1: - attrKey = attrSplit[0] - matcherString = ".*" - case 2: - attrKey = attrSplit[0] - attrVal := attrSplit[1] - if len(attrKey) == 0 { - err = fmt.Errorf("No attribute key") - return - } - attrKeyLen := len(attrKey) - switch attrKey[attrKeyLen-1] { - case '~': - matcherString = fmt.Sprintf(`\b%s\b`, attrVal) - case '$': - matcherString = fmt.Sprintf("%s$", attrVal) - case '^': - matcherString = fmt.Sprintf("^%s", attrVal) - case '*': - matcherString = fmt.Sprintf("%s", attrVal) - default: - attrKeyLen++ - matcherString = fmt.Sprintf("^%s$", attrVal) - } - attrKey = attrKey[:attrKeyLen-1] - default: - err = fmt.Errorf("more than one '='") - return - } - matcher, err = regexp.Compile(matcherString) - return -} - -// Set a field of this selector. -func (s *BasicSelector) setFieldValue(f selectorField, v string) error { - if v == "" { - return nil - } - switch f { - case ClassField: - r, err := regexp.Compile(fmt.Sprintf(`\b%s\b`, v)) - if err != nil { - return err - } - s.Attrs["class"] = r - case IDField: - r, err := regexp.Compile(fmt.Sprintf("^%s$", v)) - if err != nil { - return err - } - s.Attrs["id"] = r - case NameField: - r, err := regexp.Compile(fmt.Sprintf("^%s$", v)) - if err != nil { - return err - } - s.Name = r - case AttrField: - // Attribute fields are a little more complicated - keystring, matcher, err := parseAttrField(v) - if err != nil { - return err - } - s.Attrs[keystring] = matcher - } - return nil -} - -// Convert a string to a selector. -func NewSelector(s string) (Selector, error) { - // A very simple test for a selector function - if strings.Contains(s, "{") { - return parseSelectorFunc(s) - } - - // Otherwise let's evaluate a basic selector - attrs := map[string]*regexp.Regexp{} - selector := BasicSelector{nil, attrs} - nextField := NameField - start := 0 - for i, c := range s { - switch c { - case '.': - if nextField == AttrField { - continue - } - err := selector.setFieldValue(nextField, s[start:i]) - if err != nil { - return selector, err - } - nextField = ClassField - start = i + 1 - case '#': - if nextField == AttrField { - continue - } - err := selector.setFieldValue(nextField, s[start:i]) - if err != nil { - return selector, err - } - nextField = IDField - start = i + 1 - case '[': - err := selector.setFieldValue(nextField, s[start:i]) - if err != nil { - return selector, err - } - nextField = AttrField - start = i + 1 - case ']': - if nextField != AttrField { - return selector, fmt.Errorf( - "']' must be preceeded by '['") - } - err := selector.setFieldValue(nextField, s[start:i]) - if err != nil { - return selector, err - } - start = i + 1 - } - } - err := selector.setFieldValue(nextField, s[start:]) - if err != nil { - return selector, err - } - return selector, nil -} - -func (sel BasicSelector) Select(nodes []*html.Node) []*html.Node { - selected := []*html.Node{} - for _, node := range nodes { - selected = append(selected, sel.FindAllChildren(node)...) - } - return selected -} - -// Find all nodes which match a selector. -func (sel BasicSelector) FindAllChildren(node *html.Node) []*html.Node { - selected := []*html.Node{} - child := node.FirstChild - for child != nil { - childSelected := sel.FindAll(child) - selected = append(selected, childSelected...) - child = child.NextSibling - } - return selected -} - -// Find all nodes which match a selector. May return itself. -func (sel BasicSelector) FindAll(node *html.Node) []*html.Node { - selected := []*html.Node{} - if sel.Match(node) { - return []*html.Node{node} - } - child := node.FirstChild - for child != nil { - childSelected := sel.FindAll(child) - selected = append(selected, childSelected...) - child = child.NextSibling - } - return selected -} - -// Does this selector match a given node? -func (sel BasicSelector) Match(node *html.Node) bool { - if node.Type != html.ElementNode { - return false - } - if sel.Name != nil { - if !sel.Name.MatchString(strings.ToLower(node.Data)) { - return false - } - } - matchedAttrs := []string{} - for _, attr := range node.Attr { - matcher, ok := sel.Attrs[attr.Key] - if !ok { - continue - } - if !matcher.MatchString(attr.Val) { - return false - } - matchedAttrs = append(matchedAttrs, attr.Key) - } - for k := range sel.Attrs { - attrMatched := false - for _, attrKey := range matchedAttrs { - if k == attrKey { - attrMatched = true - } - } - if !attrMatched { - return false - } - } - return true -} - -type SliceSelector struct { - Start int - LimitStart bool - End int - LimitEnd bool - By int - N int -} - -func (sel SliceSelector) Select(nodes []*html.Node) []*html.Node { - var start, end, by int - selected := []*html.Node{} - nNodes := len(nodes) - switch { - case !sel.LimitStart: - start = 0 - case sel.Start < 0: - start = nNodes + sel.Start - default: - start = sel.Start - } - switch { - case sel.N == 1: - end = start + 1 - case !sel.LimitEnd: - end = nNodes - case sel.End < 0: - end = nNodes + sel.End - default: - end = sel.End - } - by = sel.By - if by == 0 { - return selected - } - if by > 0 { - for i := start; i < nNodes && i < end; i = i + by { - selected = append(selected, nodes[i]) - } - } else { - for i := end - 1; i >= 0 && i >= start; i = i + by { - selected = append(selected, nodes[i]) - } - } - return selected -} - -// expects input to be the slice only, e.g. "9:4:-1" -func parseSliceSelector(s string) (sel SliceSelector, err error) { - sel = SliceSelector{ - Start: 0, - End: 0, - By: 1, - LimitStart: false, - LimitEnd: false, - } - split := strings.Split(s, ":") - n := len(split) - sel.N = n - if n > 3 { - err = fmt.Errorf("too many slices") - return - } - var value int - if split[0] != "" { - value, err = strconv.Atoi(split[0]) - if err != nil { - return - } - sel.Start = value - sel.LimitStart = true - } - if n == 1 { - sel.End = sel.Start + 1 - sel.LimitEnd = true - return - } - if split[1] != "" { - value, err = strconv.Atoi(split[1]) - if err != nil { - return - } - sel.End = value - sel.LimitEnd = true - } - if n == 2 { - return - } - if split[2] != "" { - value, err = strconv.Atoi(split[2]) - if err != nil { - return - } - sel.By = value - } - return -} - -func parseSelectorFunc(s string) (Selector, error) { - switch { - case strings.HasPrefix(s, "slice{"): - if !strings.HasSuffix(s, "}") { - return nil, fmt.Errorf( - "slice func must end with a '}'") - } - s = strings.TrimPrefix(s, "slice{") - s = strings.TrimSuffix(s, "}") - return parseSliceSelector(s) - } - return nil, fmt.Errorf("%s is an invalid function", s) -} diff --git a/tests/README.md b/tests/README.md index 7a4cd22..dd01c3c 100644 --- a/tests/README.md +++ b/tests/README.md @@ -2,18 +2,6 @@ A simple set of tests to help maintain sanity. -The tests themselves are written in Python and can be run using the [nose]( -https://nose.readthedocs.org/en/latest/) tool. +These tests don't actually test functionality -Install with: -```bash -$ pip install nose -``` - -Run the following command from either the base directory or this one to perform -the tests: - -```bash -$ nosetests -``` diff --git a/tests/cmds.txt b/tests/cmds.txt new file mode 100644 index 0000000..56c192d --- /dev/null +++ b/tests/cmds.txt @@ -0,0 +1,40 @@ +#footer +#footer li +#footer li + a +#footer li + a attr{title} +#footer li > li +table li +table li:first-child +table li:first-of-type +table li:last-child +table li:last-of-type +table a[title="The Practice of Programming"] +table a[title="The Practice of Programming"] text{} +json{} +text{} +.after-portlet +.after +:empty +td:empty +.navbox-list li:nth-child(1) +.navbox-list li:nth-child(2) +.navbox-list li:nth-child(3) +.navbox-list li:nth-last-child(1) +.navbox-list li:nth-last-child(2) +.navbox-list li:nth-last-child(3) +.navbox-list li:nth-child(n+1) +.navbox-list li:nth-child(3n+1) +.navbox-list li:nth-last-child(n+1) +.navbox-list li:nth-last-child(3n+1) +:only-child +.navbox-list li:only-child +.summary +[class=summary] +[class="summary"] +#toc +#toc li + a +#toc li + a text{} +#toc li + a json{} +#toc li + a + span +#toc li + span +#toc li > li diff --git a/tests/expected_output.txt b/tests/expected_output.txt new file mode 100644 index 0000000..db4cd5f --- /dev/null +++ b/tests/expected_output.txt @@ -0,0 +1,40 @@ +c00fef10d36c1166cb5ac886f9d25201b720e37e #footer +a7bb8dbfdd638bacad0aa9dc3674126d396b74e2 #footer li +06fb3f64027084bbe52af06951aeea7d2750dcff #footer li + a +da788c5138342d0228bfc86976bd9202419e43a6 #footer li + a attr{title} +8edb39cbf74ed66687c4eb4e0abfe36153c186a8 #footer li > li +a92e50c09cd56970625ac3b74efbddb83b2731bb table li +505c04a42e0084cd95560c233bd3a81b2c59352d table li:first-child +505c04a42e0084cd95560c233bd3a81b2c59352d table li:first-of-type +66950e746590d7f4e9cfe3d1adef42cd0addcf1d table li:last-child +66950e746590d7f4e9cfe3d1adef42cd0addcf1d table li:last-of-type +0a37d612cd4c67a42bd147b1edc5a1128456b017 table a[title="The Practice of Programming"] +0d3918d54f868f13110262ffbb88cbb0b083057d table a[title="The Practice of Programming"] text{} +48a00be8203baea11bf9672bf10b183cca98c3b2 json{} +74fa5ddf9687041ece8aed54913cc2d1a6e7101c text{} +e4f7358fbb7bb1748a296fa2a7e815fa7de0a08b .after-portlet +da39a3ee5e6b4b0d3255bfef95601890afd80709 .after +5b3020ba03fb43f7cdbcb3924546532b6ec9bd71 :empty +3406ca0f548d66a7351af5411ce945cf67a2f849 td:empty +30fff0af0b1209f216d6e9124e7396c0adfa0758 .navbox-list li:nth-child(1) +a38e26949f047faab5ea7ba2acabff899349ce03 .navbox-list li:nth-child(2) +d954831229a76b888e85149564727776e5a2b37a .navbox-list li:nth-child(3) +d314e83b059bb876b0e5ee76aa92d54987961f9a .navbox-list li:nth-last-child(1) +1f19496e239bca61a1109dbbb8b5e0ab3e302b50 .navbox-list li:nth-last-child(2) +1ec9ebf14fc28c7d2b13e81241a6d2e1608589e8 .navbox-list li:nth-last-child(3) +52e726f0993d2660f0fb3ea85156f6fbcc1cfeee .navbox-list li:nth-child(n+1) +0b20c98650efa5df39d380fea8d5b43f3a08cb66 .navbox-list li:nth-child(3n+1) +52e726f0993d2660f0fb3ea85156f6fbcc1cfeee .navbox-list li:nth-last-child(n+1) +972973fe1e8f63e4481c8641d6169c638a528a6e .navbox-list li:nth-last-child(3n+1) +a55bb21d37fbbd3f8f1b551126795fbc733451fe :only-child +44c99f6ad37b65dc0893cdcb1c60235d827ee73e .navbox-list li:only-child +641037814e358487d1938fc080e08f72a3846ef8 .summary +641037814e358487d1938fc080e08f72a3846ef8 [class=summary] +641037814e358487d1938fc080e08f72a3846ef8 [class="summary"] +613bf65ac4042b6ee0a7a47f08732fdbe1b5b06b #toc +dbc580de40eeb8448f0dbe1b98d74cf799a6868b #toc li + a +6a2c6153bce7945b88d7c818fe11aaae232725b3 #toc li + a text{} +91f36a7072f0740b170eeaac01b0be00ec528664 #toc li + a json{} +0cd687baaf08605bf6a68e3c285c5e8a41e0c9b2 #toc li + a + span +da39a3ee5e6b4b0d3255bfef95601890afd80709 #toc li + span +5d6e3ed3cfe310cde185cbfe1bba6aa7ec2a7f8d #toc li > li diff --git a/tests/index.html b/tests/index.html new file mode 100644 index 0000000..ec35afe --- /dev/null +++ b/tests/index.html @@ -0,0 +1,1494 @@ + + + + +Go (programming language) - Wikipedia, the free encyclopedia + + + + + + + + + + + + + + + + + + + + + + + +
+
+
+ + +
+
+
+

Go (programming language)

+
+
From Wikipedia, the free encyclopedia
+
+
+ Jump to: navigation, search +
+
Not to be confused with Go! (programming language), an agent-based language released in 2003.
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Go
Golang.png
Paradigm(s)compiled, concurrent, imperative, structured
Designed byRobert Griesemer
+Rob Pike
+Ken Thompson
DeveloperGoogle Inc.
Appeared in2009; 5 years ago (2009)
Stable releaseversion 1.3.3[1] / 1 October 2014; 39 days ago (2014-10-01)
Typing disciplinestrong, static, inferred, nominal
Major implementationsgc (8g, 6g, 5g), gccgo
Influenced byC, occam, Limbo, Modula, Newsqueak, Oberon, Pascal,[2] Python[citation needed]
Implementation languageC, Go, Asm
OSLinux, Mac OS X, FreeBSD, NetBSD, OpenBSD, MS Windows, Plan 9[3]
LicenseBSD-style[4] + Patent grant[5]
Filename extension(s).go
Websitegolang.org
+

Go, also commonly referred to as golang, is a programming language initially developed at Google[6] in 2007 by Robert Griesemer, Rob Pike, and Ken Thompson.[2] It is a statically-typed language with syntax loosely derived from that of C, adding garbage collection, type safety, some dynamic-typing capabilities, additional built-in types such as variable-length arrays and key-value maps, and a large standard library.

+

The language was announced in November 2009 and is now used in some of Google's production systems.[7] Go's "gc" compiler targets the Linux, Mac OS X, FreeBSD, NetBSD, OpenBSD, Plan 9, and Microsoft Windows operating systems and the i386, amd64, and ARM processor architectures.[8] A second compiler, gccgo, is a GCC frontend.[9][10]

+

+ +

+

History[edit]

+

Ken Thompson states that, initially, Go was purely an experimental project. Referring to himself along with the other original authors of Go, he states:[11]

+
+

When the three of us [Thompson, Rob Pike, and Robert Griesemer] got started, it was pure research. The three of us got together and decided that we hated C++. [laughter] ... [Returning to Go,] we started off with the idea that all three of us had to be talked into every feature in the language, so there was no extraneous garbage put into the language for any reason.

+
+

The history of the language before its first release, back to 2007, is covered in the language's FAQ.[12]

+

Language design[edit]

+

Go is recognizably in the tradition of C, but makes many changes aimed at conciseness, simplicity, and safety. The following is a brief overview of the features which define Go (for more information see the language specification):

+ +

Syntax[edit]

+

Go's syntax includes changes from C aimed at keeping code concise and readable. The programmer needn't specify the types of expressions, allowing just i := 3 or s := "some words" to replace C's int i = 3; or char* s = "some words";. Semicolons at the end of lines aren't required. Functions may return multiple, named values, and returning a result, err pair is the conventional way a function indicates an error to its caller in Go.[a] Go adds literal syntaxes for initializing struct parameters by name, and for initializing maps and slices. As an alternative to C's three-statement for loop, Go's range expressions allow concise iteration over arrays, slices, strings, and maps.

+

Types[edit]

+

Go adds some basic types not present in C for safety and convenience:

+
    +
  • Slices (written []type) point into an array of objects in memory, storing a pointer to the start of the slice, a length, and a capacity specifying when new memory needs to be allocated to expand the array. Slice contents are passed by reference, and their contents are always mutable.
  • +
  • Go's immutable string type typically holds UTF-8 text (though it can hold arbitrary bytes as well).
  • +
  • map[keytype]valtype provides a hashtable.
  • +
  • Go also adds channel types, which support concurrency and are discussed below, and interfaces, which replace virtual inheritance and are discussed in Interface system section.
  • +
+

Structurally, Go's type system has a few differences from C and most C derivatives. Unlike C typedefs, Go's named types are not aliases for each other, and rules limit when different types can be assigned to each other without explicit conversion.[19] Unlike in C, conversions between number types are explicit; to ensure that doesn't create verbose conversion-heavy code, numeric constants in Go represent abstract, untyped numbers.[20] Finally, in place of non-virtual inheritance, Go has a feature called type embedding in which one object can contain others and pick up their methods.

+

Package system[edit]

+

In Go's package system, each package has a path (e.g., "compress/bzip2" or "code.google.com/p/go.net/html") and a name (e.g., bzip2 or html). References to other packages' definitions must always be prefixed with the other package's name, and only the capitalized names from other modules are accessible: io.Reader is public but bzip2.reader is not.[21] The go get command can retrieve packages stored in a remote repository such as Github or Google Code, and package paths often look like partial URLs for compatibility.[22]

+

Concurrency: goroutines, channels, and select[edit]

+

Go provides facilities for writing concurrent programs that share state by communicating.[23][24][25] Concurrency refers not only to multithreading and CPU parallelism, which Go supports, but also to asynchrony: letting slow operations like a database or network-read run while the program does other work, as is common in event-based servers.[26]

+

Go's concurrency-related syntax and types include:

+
    +
  • The go statement, go func(), starts a function in a new light-weight process, or goroutine
  • +
  • Channel types, chan type, provide a type-safe, synchronized, optionally buffered channels between goroutines, and are useful mostly with two other facilities: +
      +
    • The send statement, ch <- x sends x over ch
    • +
    • The receive operator, <- ch receives a value from ch
    • +
    • Both operations block until the other goroutine is ready to communicate
    • +
    +
  • +
  • The select statement uses a switch-like syntax to wait for communication on any of a set of channels[27]
  • +
+

From these tools one can build concurrent constructs like worker pools, pipelines (in which, say, a file is decompressed and parsed as it downloads), background calls with timeout, "fan-out" parallel calls to a set of services, and others.[28] Channels have also found uses further from the usual notion of interprocess communication, like serving as a concurrency-safe list of recycled buffers,[29] implementing coroutines (which helped inspire the name goroutine),[30] and implementing iterators.[31]

+

While the communicating-processes model is favored in Go, it isn't the only one: memory can be shared across goroutines (see below), and the standard sync module provides locks and other primitives.[32]

+

Race condition safety[edit]

+

There are no restrictions on how goroutines access shared data, making race conditions possible. Specifically, unless a program explicitly synchronizes via channels or other means, writes from one goroutine might be partly, entirely, or not at all visible to another, often with no guarantees about ordering of writes.[33] Furthermore, Go's internal data structures like interface values, slice headers, and string headers are not immune to race conditions, so type and memory safety can be violated in multithreaded programs that modify shared instances of those types without synchronization.[34][35]

+

Idiomatic Go minimizes sharing of data (and thus potential race conditions) by communicating over channels, and a race-condition tester is included in the standard distribution to help catch unsafe behavior. Still, it is important to realize that while Go provides building blocks that can be used to write correct, comprehensible concurrent code, arbitrary code isn't guaranteed to be safe.

+

Some concurrency-related structural conventions of Go (channels and alternative channel inputs) are derived from Tony Hoare's communicating sequential processes model. Unlike previous concurrent programming languages such as occam or Limbo (a language on which Go co-designer Rob Pike worked[36]), Go does not provide any built-in notion of safe or verifiable concurrency.[33]

+

Interface system[edit]

+

In place of virtual inheritance, Go uses interfaces. An interface declaration is nothing but a list of required methods: for example, implementing io.Reader requires a Read method that takes a []byte and returns a count of bytes read and any error.[37] Code calling Read needn't know whether it's reading from an HTTP connection, a file, an in-memory buffer, or any other source.

+

Go's standard library defines interfaces for a number of concepts: input sources and output sinks, sortable collections, objects printable as strings, cryptographic hashes, and so on.

+

Go types don't declare which interfaces they implement: having the required methods is implementing the interface. In formal language, Go's interface system provides structural rather than nominal typing.

+

The example below uses the io.Reader and io.Writer interfaces to test Go's implementation of SHA-256 on a standard test input, 1,000,000 repeats of the character "a". RepeatByte implements an io.Reader yielding an infinite stream of repeats of a byte, similar to Unix /dev/zero. The main() function uses RepeatByte to stream a million repeats of "a" into the hash function, then prints the result, which matches the expected value published online.[38] Even though both reader and writer interfaces are needed to make this work, the code needn't mention either; the compiler infers what types implement what interfaces:

+
+
+
+package main
+ 
+import (
+    "fmt"
+    "io"
+    "crypto/sha256"
+)
+ 
+type RepeatByte byte
+ 
+func (r RepeatByte) Read(p []byte) (n int, err error) {
+    for i := range p {
+        p[i] = byte(r)
+    }
+    return len(p), nil
+}
+ 
+func main() {
+    testStream := RepeatByte('a')
+    hasher := sha256.New()
+    io.CopyN(hasher, testStream, 1000000)
+    fmt.Printf("%x", hasher.Sum(nil))
+}
+
+
+

(Run or edit this example online.)

+

Also note type RepeatByte is defined as a byte, not a struct. Named types in Go needn't be structs, and any named type can have methods defined, satisfy interfaces, and act, for practical purposes, as objects; the standard library, for example, stores IP addresses in byte slices.[39]

+

Besides calling methods via interfaces, Go allows converting interface values to other types with a run-time type check. The language constructs to do so are the type assertion,[40] which checks against a single potential type, and the type switch,[41] which checks against multiple types.

+

interface{}, the empty interface, is an important corner case because it can refer to an item of any concrete type, including primitive types like string. Code using the empty interface can't simply call methods (or built-in operators) on the referred-to object, but it can store the interface{} value, try to convert it to a more useful type via a type assertion or type switch, or inspect it with Go's reflect package.[42] Because interface{} can refer to any value, it's a limited way to escape the restrictions of static typing, like void* in C but with additional run-time type checks.

+

Interface values are stored in memory as a pointer to data and a second pointer to run-time type information.[43] Like other pointers in Go, interface values are nil if uninitialized.[44] Unlike in environments like Java's virtual machine, there is no object header; the run-time type information is only attached to interface values. So, the system imposes no per-object memory overhead for objects not accessed via interface, similar to C structs or C# ValueTypes.

+

Go does not have interface inheritance, but one interface type can embed another; then the embedding interface requires all of the methods required by the embedded interface.[45]

+

Omissions[edit]

+

Go deliberately omits certain features common in other languages, including generic programming, assertions, pointer arithmetic, and inheritance. After initially omitting exceptions, the language added the panic/recover mechanism, but it is only meant for rare circumstances.[46][47][48]

+

The Go authors express an openness to generic programming, explicitly argue against assertions and pointer arithmetic, while defending the choice to omit type inheritance as giving a more useful language, encouraging heavy use of interfaces instead.[2]

+

Conventions and code style[edit]

+

The Go authors and community put substantial effort into molding the style and design of Go programs:

+
    +
  • Indentation, spacing, and other surface-level details of code are automatically standardized by the go fmt tool. go vet and golint do additional checking automatically.
  • +
  • Tools and libraries distributed with Go suggest standard approaches to things like API documentation (godoc[49]), testing (go test), building (go build), package management (go get), and so on.
  • +
  • Syntax rules require things that are optional in other languages, for example by banning cyclic dependencies, unused variables or imports, and implicit type conversions.
  • +
  • The omission of certain features (for example, functional-programming shortcuts like map and C++-style try/finally blocks) tends to encourage a particular explicit, concrete, and imperative programming style.
  • +
  • Core developers write extensively about Go idioms, style, and philosophy, in the Effective Go document and code review comments reference, presentations, blog posts, and public mailing list messages.
  • +
+

When adapting to the Go ecosystem after working in other languages, differences in style and approach can be as important as low-level language and library differences.

+

Language tools[edit]

+

Go includes the same sort of debugging, testing, and code-vetting tools as many language distributions. The Go distribution includes go vet, which analyzes code searching for common stylistic problems and mistakes. A profiler, unit testing tool, gdb debugging support, and a race condition tester are also in the distribution. The Go distribution includes its own build system, which requires only information in the Go files themselves, no separate build files.

+

There is an ecosystem of third-party tools that add to the standard distribution, such as gocode, which enables code autocompletion in many text editors, goimports (by a Go team member), which automatically adds/removes package imports as needed, errcheck, which detects code that might unintentionally ignore errors, and more. Plugins exist to add language support in widely used text editors, and at least one IDE, LiteIDE, targets Go in particular.

+

Examples[edit]

+

Hello world[edit]

+

Here is a Hello world program in Go:

+
+
+
+package main
+ 
+import "fmt"
+ 
+func main() {
+    fmt.Println("Hello, World")
+}
+
+
+

(Run or edit this example online.)

+

Echo[edit]

+

This imitates the Unix echo command in Go:[50]

+
+
+
+package main
+ 
+import (
+    "flag"
+    "fmt"
+    "strings"
+)
+ 
+func main() {
+    var omitNewline bool
+    flag.BoolVar(&omitNewline, "n", false, "don't print final newline")
+    flag.Parse() // Scans the arg list and sets up flags.
+ 
+    str := strings.Join(flag.Args(), " ")
+    if omitNewline {
+        fmt.Print(str)
+    } else {
+        fmt.Println(str)
+    }
+}
+
+
+

File Read[edit]

+
+
+
+// Reading and writing files are basic tasks needed for
+// many Go programs. First we'll look at some examples of
+// reading files.
+ 
+package main
+ 
+import (
+    "bufio"
+    "fmt"
+    "io"
+    "io/ioutil"
+    "os"
+)
+ 
+// Reading files requires checking most calls for errors.
+// This helper will streamline our error checks below.
+func check(e error) {
+    if e != nil {
+        panic(e)
+    }
+}
+ 
+func main() {
+ 
+    // Perhaps the most basic file reading task is
+    // slurping a file's entire contents into memory.
+    dat, err := ioutil.ReadFile("/tmp/dat")
+    check(err)
+    fmt.Print(string(dat))
+ 
+    // You'll often want more control over how and what
+    // parts of a file are read. For these tasks, start
+    // by `Open`ing a file to obtain an `os.File` value.
+    f, err := os.Open("/tmp/dat")
+    check(err)
+ 
+    // Read some bytes from the beginning of the file.
+    // Allow up to 5 to be read but also note how many
+    // actually were read.
+    b1 := make([]byte, 5)
+    n1, err := f.Read(b1)
+    check(err)
+    fmt.Printf("%d bytes: %s\n", n1, string(b1))
+ 
+    // You can also `Seek` to a known location in the file
+    // and `Read` from there.
+    o2, err := f.Seek(6, 0)
+    check(err)
+    b2 := make([]byte, 2)
+    n2, err := f.Read(b2)
+    check(err)
+    fmt.Printf("%d bytes @ %d: %s\n", n2, o2, string(b2))
+ 
+    // The `io` package provides some functions that may
+    // be helpful for file reading. For example, reads
+    // like the ones above can be more robustly
+    // implemented with `ReadAtLeast`.
+    o3, err := f.Seek(6, 0)
+    check(err)
+    b3 := make([]byte, 2)
+    n3, err := io.ReadAtLeast(f, b3, 2)
+    check(err)
+    fmt.Printf("%d bytes @ %d: %s\n", n3, o3, string(b3))
+ 
+    // There is no built-in rewind, but `Seek(0, 0)`
+    // accomplishes this.
+    _, err = f.Seek(0, 0)
+    check(err)
+ 
+    // The `bufio` package implements a buffered
+    // reader that may be useful both for its efficiency
+    // with many small reads and because of the additional
+    // reading methods it provides.
+    r4 := bufio.NewReader(f)
+    b4, err := r4.Peek(5)
+    check(err)
+    fmt.Printf("5 bytes: %s\n", string(b4))
+ 
+    // Close the file when you're done (usually this would
+    // be scheduled immediately after `Open`ing with
+    // `defer`).
+    f.Close()
+ 
+}
+
+
+

[51][52]

+

Notable users[edit]

+

Some notable open-source applications in Go include:

+ +

Other companies and sites using Go (generally together with other languages, not exclusively) include:[53][54]

+ +

Libraries[edit]

+

Go's open-source libraries include:

+
    +
  • Go's standard library, which covers a lot of fundamental functionality: +
      +
    • Algorithms: compression, cryptography, sorting, math, indexing, and text and string manipulation.
    • +
    • External interfaces: I/O, network clients and servers, parsing and writing common formats, running system calls, and interacting with C code.
    • +
    • Development tools: reflection, runtime control, debugging, profiling, unit testing, synchronization, and parsing Go.
    • +
    +
  • +
  • Third-party libraries with more specialized tools: + +
  • +
+

Some sites help index the libraries outside the Go distribution:

+ +

Community and conferences[edit]

+
    +
  • Gopher Academy, Gopher Academy is a group of developers working to educate and promote the golang community.
  • +
  • Golangprojects.com, lists programming jobs and projects where companies are looking for people that know Go
  • +
  • GopherCon The first Go conference. Denver, Colorado, USA April 24-26 2014
  • +
  • dotGo European conference. Paris, France October 10 2014
  • +
  • GopherConIndia The first Go conference in India. Bangalore Feb. 19-21 2015
  • +
+

Reception[edit]

+

Go's initial release led to much discussion.

+

Michele Simionato wrote in an article for artima.com:[62]

+
+

Here I just wanted to point out the design choices about interfaces and inheritance. Such ideas are not new and it is a shame that no popular language has followed such particular route in the design space. I hope Go will become popular; if not, I hope such ideas will finally enter in a popular language, we are already 10 or 20 years too late :-(

+
+

Dave Astels at Engine Yard wrote:[63]

+
+

Go is extremely easy to dive into. There are a minimal number of fundamental language concepts and the syntax is clean and designed to be clear and unambiguous. Go is still experimental and still a little rough around the edges.

+
+

Ars Technica interviewed Rob Pike, one of the authors of Go, and asked why a new language was needed. He replied that:[64]

+
+

It wasn't enough to just add features to existing programming languages, because sometimes you can get more in the long run by taking things away. They wanted to start from scratch and rethink everything. ... [But they did not want] to deviate too much from what developers already knew because they wanted to avoid alienating Go's target audience.

+
+

Go was named Programming Language of the Year by the TIOBE Programming Community Index in its first year, 2009, for having a larger 12-month increase in popularity (in only 2 months, after its introduction in November) than any other language that year, and reached 13th place by January 2010,[65] surpassing established languages like Pascal. As of August 2014, its ranking had dropped to 38th in the index, placing it lower than COBOL and Fortran.[66] Go is already in commercial use by several large organizations.[67]

+

Regarding Go, Bruce Eckel has stated:[68]

+
+

The complexity of C++ (even more complexity has been added in the new C++), and the resulting impact on productivity, is no longer justified. All the hoops that the C++ programmer had to jump through in order to use a C-compatible language make no sense anymore -- they're just a waste of time and effort. Now, Go makes much more sense for the class of problems that C++ was originally intended to solve.

+
+

Mascot[edit]

+

Go's mascot is a gopher designed by Renée French, who also designed Glenda, the Plan 9 Bunny. The logo and mascot are licensed under Creative Commons Attribution 3.0 license.[69]

+

Naming dispute[edit]

+

On the day of the general release of the language, Francis McCabe, developer of the Go! programming language (note the exclamation point), requested a name change of Google's language to prevent confusion with his language.[70] The issue was closed by a Google developer on 12 October 2010 with the custom status "Unfortunate" and with the following comment: "there are many computing products and services named Go. In the 11 months since our release, there has been minimal confusion of the two languages."[71]

+

See also[edit]

+ +
+ + + + + +
Portal iconFree software portal
+
+

Notes[edit]

+
+
    +
  1. ^ Usually, exactly one of the result and error values has a value other than the type's zero value; sometimes both do, as when a read or write can only be partially completed, and sometimes neither, as when a read returns 0 bytes. See Semipredicate problem: Multivalued return.
  2. +
+
+

References[edit]

+
This article incorporates material from the official Go tutorial, which is licensed under the Creative Commons Attribution 3.0 license.
+
+
    +
  1. ^ "Go 1.3.3 is released". golang-nuts group. 1 October 2014. Retrieved 5 November 2014. 
  2. +
  3. ^ a b c "Language Design FAQ". golang.org. 16 January 2010. Retrieved 27 February 2010. 
  4. +
  5. ^ "Go Porting Efforts". Go Language Resources. cat-v. 12 January 2010. Retrieved 18 January 2010. 
  6. +
  7. ^ "Text file LICENSE". The Go Programming Language. Google. Retrieved 5 October 2012. 
  8. +
  9. ^ "Additional IP Rights Grant". The Go Programming Language. Google. Retrieved 5 October 2012. 
  10. +
  11. ^ Kincaid, Jason (10 November 2009). "Google’s Go: A New Programming Language That’s Python Meets C++". TechCrunch. Retrieved 18 January 2010. 
  12. +
  13. ^ "Go FAQ: Is Google using Go internally?". Retrieved 9 March 2013. 
  14. +
  15. ^ "Installing Go". golang.org. The Go Authors. 11 June 2010. Retrieved 11 June 2010. 
  16. +
  17. ^ "FAQ: Implementation". golang.org. 16 January 2010. Retrieved 18 January 2010. 
  18. +
  19. ^ "Installing GCC: Configuration". Retrieved 3 December 2011. "Ada, Go and Objective-C++ are not default languages" 
  20. +
  21. ^ Andrew Binstock (18 May 2011). "Dr. Dobb's: Interview with Ken Thompson". Retrieved 7 February 2014. 
  22. +
  23. ^ "Frequently Asked Questions (FAQ) - The Go Programming Language". Golang.org. Retrieved 2014-03-27. 
  24. +
  25. ^ Pike, Rob. "The Go Programming Language". YouTube. Retrieved 1 Jul 2011. 
  26. +
  27. ^ Rob Pike (10 November 2009). The Go Programming Language (flv) (Tech talk). Google. Event occurs at 8:53. 
  28. +
  29. ^ Download and install packages and dependencies - go - The Go Programming Language; see godoc.org for addresses and documentation of some packages
  30. +
  31. ^ godoc.org and, for the standard library, golang.org/pkg
  32. +
  33. ^ Rob Pike, on The Changelog podcast
  34. +
  35. ^ Rob Pike, Less is exponentially more
  36. +
  37. ^ Assignability - the Go Language Specification
  38. +
  39. ^ Constants - the Go Language Specification
  40. +
  41. ^ "A Tutorial for the Go Programming Language". The Go Programming Language. Google. Retrieved 10 March 2013. "In Go the rule about visibility of information is simple: if a name (of a top-level type, function, method, constant or variable, or of a structure field or method) is capitalized, users of the package may see it. Otherwise, the name and hence the thing being named is visible only inside the package in which it is declared." 
  42. +
  43. ^ Download and install packages and dependencies - go - The Go Programming Language
  44. +
  45. ^ Share by communicating - Effective Go
  46. +
  47. ^ Andrew Gerrand, Share memory by communicating
  48. +
  49. ^ Andrew Gerrand, Codewalk: Share memory by communicating
  50. +
  51. ^ For more discussion, see Rob Pike, Concurrency is not Parallelism
  52. +
  53. ^ The Go Programming Language Specification. This deliberately glosses over some details in the spec: close, channel range expressions, the two-argument form of the receive operator, unidrectional channel types, and so on.
  54. +
  55. ^ Concurrency patterns in Go
  56. +
  57. ^ John Graham-Cumming, Recycling Memory Buffers in Go
  58. +
  59. ^ tree.go
  60. +
  61. ^ Ewen Cheslack-Postava, Iterators in Go
  62. +
  63. ^ sync - The Go Programming Language
  64. +
  65. ^ a b "The Go Memory Model". Google. Retrieved 5 January 2011. 
  66. +
  67. ^ Russ Cox, Off to the Races
  68. +
  69. ^ Rob Pike (October 25, 2012). "Go at Google: Language Design in the Service of Software Engineering". Google, Inc.  "There is one important caveat: Go is not purely memory safe in the presence of concurrency."
  70. +
  71. ^ Brian W. Kernighan, A Descent Into Limbo
  72. +
  73. ^ Reader - io - The Go Programming Language
  74. +
  75. ^ SHA-256 test vectors, set 1, vector #8
  76. +
  77. ^ src/pkg/net/ip.go
  78. +
  79. ^ Type Assertions - The Go Language Specification
  80. +
  81. ^ Type switches - The Go Language Specification
  82. +
  83. ^ reflect.ValueOf(i interface{}) converts an interface{} to a reflect.Value that can be further inspected
  84. +
  85. ^ "Go Data Structures: Interfaces". Retrieved 15 November 2012. 
  86. +
  87. ^ Interface types - The Go Programming Language Specification
  88. +
  89. ^ "Effective Go — Interfaces and methods & Embedding". Google. Retrieved 28 November 2011. 
  90. +
  91. ^ Panic And Recover, Go wiki
  92. +
  93. ^ Release notes, 30 March 2010
  94. +
  95. ^ "Proposal for an exception-like mechanism". golang-nuts. 25 March 2010. Retrieved 25 March 2010. 
  96. +
  97. ^ Commentary - Effective Go
  98. +
  99. ^ "A Tutorial for the Go Programming Language". golang.org. 16 January 2010. Retrieved 18 January 2010. 
  100. +
  101. ^ https://gobyexample.com/reading-files.  Missing or empty |title= (help)
  102. +
  103. ^ http://golang.org/pkg/os/.  Missing or empty |title= (help)
  104. +
  105. ^ Erik Unger, The Case For Go
  106. +
  107. ^ Andrew Gerrand, Four years of Go, The Go Blog
  108. +
  109. ^ dl.google.com: Powered by Go
  110. +
  111. ^ Matt Welsh, Rewriting a Large Production System in Go
  112. +
  113. ^ David Symonds, High Performance Apps on Google App Engine
  114. +
  115. ^ Patrick Lee, Open Sourcing Our Go Libraries, 7 July 2014.
  116. +
  117. ^ John Graham-Cumming, Go at CloudFlare
  118. +
  119. ^ John Graham-Cumming, What we've been doing with Go
  120. +
  121. ^ Peter Bourgon, Go at SoundCloud
  122. +
  123. ^ Simionato, Michele (15 November 2009). "Interfaces vs Inheritance (or, watch out for Go!)". artima. Retrieved 15 November 2009. 
  124. +
  125. ^ Astels, Dave (9 November 2009). "Ready, Set, Go!". engineyard. Retrieved 9 November 2009. 
  126. +
  127. ^ Paul, Ryan (10 November 2009). "Go: new open source programming language from Google". Ars Technica. Retrieved 13 November 2009. 
  128. +
  129. ^ "Google's Go Wins Programming Language Of The Year Award". jaxenter. Retrieved 5 December 2012.  |first1= missing |last1= in Authors list (help)
  130. +
  131. ^ "TIOBE Programming Community Index for August 2014". TIOBE Software. August 2014. Retrieved 22 August 2014. 
  132. +
  133. ^ "Organizations Using Go". 
  134. +
  135. ^ Bruce Eckel (27 August 2011). "Calling Go from Python via JSON-RPC". Retrieved 29 August 2011. 
  136. +
  137. ^ "FAQ — The Go Programming Language". Golang.org. Retrieved 2013-06-25. 
  138. +
  139. ^ Claburn, Thomas (11 November 2009). "Google 'Go' Name Brings Accusations Of Evil'". InformationWeek. Retrieved 18 January 2010. 
  140. +
  141. ^ "Issue 9 - go — I have already used the name for *MY* programming language". Google Code. Google Inc. Retrieved 12 October 2010. 
  142. +
+
+

External links[edit]

+ + + + + + + + + + + + + + + + + + + + + +
+
+
+
+
+

Navigation menu

+ +
+ +
+ + +
+
+ + + +
+
+
+ + + + + + +
+
+ + + + + + + diff --git a/tests/run.py b/tests/run.py new file mode 100755 index 0000000..67a13e0 --- /dev/null +++ b/tests/run.py @@ -0,0 +1,14 @@ +#!/usr/bin/env python + +from __future__ import print_function +from hashlib import sha1 +from subprocess import Popen, PIPE, STDOUT + +data = open("index.html", "r").read() + +for line in open("cmds.txt", "r"): + line = line.strip() + p = Popen(['pup', line], stdout=PIPE, stdin=PIPE, stderr=PIPE) + h = sha1() + h.update(p.communicate(input=data)[0]) + print("%s %s" % (h.hexdigest(), line)) diff --git a/tests/test b/tests/test new file mode 100755 index 0000000..e8753eb --- /dev/null +++ b/tests/test @@ -0,0 +1,4 @@ +#!/bin/bash + +python run.py > test_results.txt +diff expected_output.txt test_results.txt diff --git a/tests/tests.py b/tests/tests.py deleted file mode 100644 index 813a945..0000000 --- a/tests/tests.py +++ /dev/null @@ -1,56 +0,0 @@ - -from subprocess import Popen, PIPE, STDOUT - -example_data = """ - - - - -
- -

Some other data

-
- - -""" - -# run a pup command as a subprocess -def run_pup(args, input_html): - cmd = ["pup"] - cmd.extend(args) - p = Popen(cmd, stdout=PIPE, stdin=PIPE, stderr=PIPE) - stdout_data = p.communicate(input=input_html)[0] - p.wait() - return stdout_data - -# simply count the number of lines returned by this pup command -def run_pup_count(args, input_html): - pup_output = run_pup(args, input_html) - lines = [l for l in pup_output.split("\n") if l] - return len(lines) - -def test_class_selector(): - assert run_pup_count([".nav"], example_data) == 3 - -def test_attr_eq(): - assert run_pup_count(["[class=nav]"], example_data) == 0 - -def test_attr_pre(): - assert run_pup_count(["[class^=nav]"], example_data) == 3 - assert run_pup_count(["[class^=clearfix]"], example_data) == 0 - -def test_attr_post(): - assert run_pup_count(["[class$=nav]"], example_data) == 0 - assert run_pup_count(["[class$=clearfix]"], example_data) == 3 - -def test_attr_func(): - result = run_pup(["div", "attr{class}"], example_data).strip() - assert result == "" - result = run_pup(["div", "div", "attr{class}"], example_data).strip() - assert result == "nav clearfix" - -def test_text_func(): - result = run_pup(["p", "text{}"], example_data).strip() - assert result == "Some other data" diff --git a/wikipedia.org b/wikipedia.org new file mode 100644 index 0000000..b6bbb52 --- /dev/null +++ b/wikipedia.org @@ -0,0 +1,591 @@ + + + + +Wikipedia, the free encyclopedia + + + + + + + + + + + + + + + + + + + + + + + + +
+
+
+ + +
+
+
+

Main Page

+
+
From Wikipedia, the free encyclopedia
+
+
+ Jump to: navigation, search +
+
+ + + + + + +
+ + + + +
+
Welcome to Wikipedia,
+ +
4,640,388 articles in English
+
+
+ + + + + +
+ + + + + + +
+ + + + + + + + + + + + + +
+

From today's featured article

+
+
+
Carl Hans Lody
+

Carl Hans Lody (1877–1914) was a reserve officer of the Imperial German Navy who spied in the United Kingdom at the start of the First World War. While working for a shipping line, he agreed to spy for German naval intelligence, and was sent to Edinburgh in late August. He spoke fluent English, and spent a month posing as an American tourist while reporting on British naval movements and coastal defences. He had not been given any espionage training and was detected almost immediately, as he sent his communications in plain English and German to a known German intelligence address in Sweden. By the end of September 1914, a rising spy panic in Britain led to foreigners coming under suspicion; he attempted to go into hiding in Ireland but was quickly caught. Tried in a public court martial in London, he made no attempt to deny his guilt, declaring that he had acted out of patriotic motives. His courage on the witness stand attracted admiration in Britain and Germany. He was sentenced to death by firing squad and on 6 November 1914 he became the first person in nearly 170 years to be executed at the Tower of London. Under the Nazi regime, he was acclaimed as a German national hero. (Full article...)

+

Recently featured: Gough Whitlam – Bonshō – Bivalvia

+ +
+
+

Did you know...

+
+
+

From Wikipedia's new and recently improved content:

+
+

Robert J. Healey

+
+ + +
+
+
+ + + + + + + + + + + + + +
+

In the news

+
+
+
One World Trade Center
+ +

Ongoing: Ebola outbreak Islamic State of Iraq and the Levant
+Recent deaths: Acker Bilk Michael Sata

+
+
+

On this day...

+
+
+

November 6: Guru Nanak Gurpurab (Sikhism, 2014); Constitution Day in the Dominican Republic (1844); Finnish Swedish Heritage Day in Finland

+
+

Portrait of George Eliot by François D'Albert Durade

+
+ +

More anniversaries: November 5 November 6 November 7

+ +
It is now November 6, 2014 (UTC) – Reload this page
+
+
+
+ + + + +
+ + + + + + + +
+

Today's featured picture

+
+
+ + + + + + + +
Cereals
+

A cereal is a grass cultivated for the edible components of its grain, composed of the endosperm, germ, and bran. In their natural form (as in whole grain), they are a rich source of vitamins, minerals, carbohydrates, fats, oils, and protein, but when refined the remaining endosperm is mostly carbohydrate.

+

Pictured here are oats and barley, together with some products made from them.

+Photograph: Agricultural Research Service, United States Department of Agriculture + +
+
+
+
+
+

Other areas of Wikipedia

+
    +
  • Community portal – Bulletin board, projects, resources and activities covering a wide range of Wikipedia areas.
  • +
  • Help desk – Ask questions about using Wikipedia.
  • +
  • Local embassy – For Wikipedia-related communication in languages other than English.
  • +
  • Reference desk – Serving as virtual librarians, Wikipedia volunteers tackle your questions on a wide range of subjects.
  • +
  • Site news – Announcements, updates, articles and press releases on Wikipedia and the Wikimedia Foundation.
  • +
  • Village pump – For discussions about Wikipedia itself, including areas for technical issues and policies.
  • +
+
+
+

Wikipedia's sister projects

+

Wikipedia is hosted by the Wikimedia Foundation, a non-profit organization that also hosts a range of other projects:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+
+

Wikipedia languages

+ + +
+ + + + + +
+
+
+
+
+

Navigation menu

+ +
+ +
+ + +
+
+ + + +
+
+
+ + + + + + +
+
+ + + + + + +