NAME ==== DOM::Tiny - A lightweight, self-contained DOM parser/manipulator SYNOPSIS ======== use DOM::Tiny; # Parse my $dom = DOM::Tiny.parse('<div><p id="a">Test</p><p id="b">123</p></div>'); # Find say $dom.at('#b').text; say $dom.find('p').map(*.text).join("\n"); say $dom.find('[id]').map(*.attr('id')).join("\n"); # Iterate $dom.find('p[id]').reverse.map({ .<id>.say }); # Loop for $dom.find('p[id]') -> $e { say $e<id>, ':', $e.text; } # Modify $dom.find('div p')[*-1].append('<p id="c">456</p>'); $dom.find(':not(p)').map(*.strip); # Render say "$dom"; DESCRIPTION =========== DOM::Tiny is a smallish, relaxed pure-Perl HTML/XML DOM parser. It is relatively robust owing mostly to the enormous test suite inherited from its progenitor. The HTML/XML parsing is very forgiving and the CSS parser supports a reasonable subset of CSS3 for selecting elements in the DOM tree. This module started as a port of Mojo::DOM58 from Perl 5, but maintaining compatibility with that library is not a major aim of this project. In fact, features of Perl 6 render certain aspects of Mojo::DOM58 completely redundant. For example, the collection system that provides custom features such as `map`, `each`, `reduce`, etc. are completely unnecessary in Perl 6. The built-in syntax is as simple or simpler to use and safer in every case. NODES AND ELEMENTS ================== When we parse an HTML/XML fragment, it gets turned into a tree of nodes. <!DOCTYPE html> <html> <head><title>Hello</title></head> <body>World!</body> </html> There are currently the following different kinds of nodes: Root, Text, Tag, Raw, PI, Doctype, Comment, and CDATA. These can also be grouped into the following roles: DocumentNode (anything but Root), Node (all kinds), HasChildren (Root and Tag), and TextNode (includes Text, CDATA, and Raw). Root |- Doctype (html) +- Tag (html) |- Tag (head) | +- Tag (title) | +- Text (Hello) +- Tag (body) +- Text (World!) While all node types are represented as DOM::Tiny objects, some methods like `attr` and `namespace` only apply to elements. Under normal circumstances you will probably never need to use these objects directly, but they are available in case you have some special need. These objects are all defined in the [DOM::Tiny::HTML](DOM::Tiny::HTML) namespace. If you want to import the short names, they are exported by default by that compilation unit: { use DOM::Tiny; my $t = DOM::Tiny::HTML::Text.new(:text<Hello>); } { use DOM::Tiny; use DOM::Tiny::HTML; my $t = Text.new(:text<Hello>); } CASE SENSITIVITY ================ DOM::Tiny defaults to HTML semantics. That means all tags and attribute names are automatically lowercased at parse time. Selectors will, therefore, need to be lowercase to match anything as matching is still case-sensitive. # HTML semantics my $dom = DOM::Tiny.parse('<P ID="greeting">Hi!</P>'); say $dom.at('p[id]').text; If an XML declaration is found at the start of the snippet to parse, the parser will automatically switch into XML mode and everything becomes case-sensitive. # XML semantics my $dom = DOM::Tiny.parse('<?xml version="1.0"?><P ID="greeting">Hi!</P>'); say $dom.at('P[ID]').text; XML detection can also be disabled or forced by explicitly setting the `:xml` flag as needed. # Force XML semantics my $dom = DOM::Tiny.parse('<P ID="greeting">Hi!</P>', :xml); say $dom.at('P[ID]').text; # Force HTML semantics $dom = DOM::Tiny.parse('<P ID="greeting">Hi!</P>', :!xml); say $dom.at('p[id]').text; SELECTORS ========= DOM::Tiny uses a CSS selector engine found in [DOM::Tiny::CSS](DOM::Tiny::CSS). We try to support all all CSS selectors that make sense for a standalone parser. * - Any element. my $all = $dom.find('*'); E - An element of type E. my $title = $dom.at('title'); E[foo] ------ An E element with a foo attribute. my $links = $dom.find('a[href]'); E[foo="bar"] ------------ An E element whose foo attribute value is exactly equal to bar. my $case_sensitive = $dom.find('input[type="hidden"]'); my $case_sensitive = $dom.find('input[type=hidden]'); E[foo="bar" i] -------------- An E element whose foo attribute value is exactly equal to any case-permutation of bar. my $case_insensitive = $dom.find('input[type="hidden" i]'); my $case_insensitive = $dom.find('input[type=hidden i]'); my $case_insensitive = $dom.find('input[class~="foo" i]'); This selector is part of Selectors Level 4. The "i" modifier may be added to any attribute selector to make what is normally an exact match to one that matches any case-permutation. E[foo~="bar"] ------------- An E element whose foo attribute value is a list of whitespace-separated values, one of which is exactly equal to bar. my $foo = $dom.find('input[class~="foo"]'); my $foo = $dom.find('input[class~=foo]'); E[foo^="bar"] ------------- An E element whose foo attribute value begins exactly with the string bar. my $begins_with = $dom.find('input[name^="f"]'); my $begins_with = $dom.find('input[name^=f]'); E[foo$="bar"] ------------- An E element whose foo attribute value ends exactly with the string bar. my $ends_with = $dom.find('input[name$="o"]'); my $ends_with = $dom.find('input[name$=o]'); E[foo*="bar"] ------------- An E element whose foo attribute value contains the substring bar. my $contains = $dom.find('input[name*="fo"]'); my $contains = $dom.find('input[name*=fo]'); E:root ------ An E element, root of the document. my $root = $dom.at(':root'); E:nth-child(n) -------------- An E element, the n-th child of its parent. my $third = $dom.find('div:nth-child(3)'); my $odd = $dom.find('div:nth-child(odd)'); my $even = $dom.find('div:nth-child(even)'); my $top3 = $dom.find('div:nth-child(-n+3)'); E:nth-last-child(n) ------------------- An E element, the n-th child of its parent, but counting backwards from the end. my $third = $dom.find('div:nth-last-child(3)'); my $odd = $dom.find('div:nth-last-child(odd)'); my $even = $dom.find('div:nth-last-child(even)'); my $bottom3 = $dom.find('div:nth-last-child(-n+3)'); E:nth-of-type(n) ---------------- An E element, the n-th sibling of its type. my $third = $dom.find('div:nth-of-type(3)'); my $odd = $dom.find('div:nth-of-type(odd)'); my $even = $dom.find('div:nth-of-type(even)'); my $top3 = $dom.find('div:nth-of-type(-n+3)'); E:nth-last-of-type(n) --------------------- An E element, the n-th sibling of its type, counting backwards from the end. my $third = $dom.find('div:nth-last-of-type(3)'); my $odd = $dom.find('div:nth-last-of-type(odd)'); my $even = $dom.find('div:nth-last-of-type(even)'); my $bottom3 = $dom.find('div:nth-last-of-type(-n+3)'); E:first-child ------------- An E element, first child of its parent. my $first = $dom.find('div p:first-child'); E:last-child ------------ An E element, last child of its parent. my $last = $dom.find('div p:last-child'); E:first-of-type --------------- An E element, first sibling of its type. my $first = $dom.find('div p:first-of-type'); E:last-of-type -------------- An E element, last sibling of its type. my $last = $dom.find('div p:last-of-type'); E:only-child ------------ An E element, only child of its parent. my $lonely = $dom.find('div p:only-child'); E:only-of-type -------------- An E element, only sibling of its type. my $lonely = $dom.find('div p:only-of-type'); E:empty ------- An E element that has no children (including text nodes, meaning the element does not even contain whitespace). my $empty = $dom.find(':empty'); E:checked --------- A user interface element E which is checked (for instance a radio-button or checkbox). my $input = $dom.find(':checked'); E.warning --------- An E element whose class is "warning". my $warning = $dom.find('div.warning'); E#myid ------ An E element with an "id" attribute equal to "myid". Basically, a shorthand for `E[id=foo]`. my $foo = $dom.at('div#foo'); E:not(s) -------- An E element that does not match simple selector s. my $others = $dom.find('div p:not(:first-child)'); E F --- An F element descendant of an E element. my $headlines = $dom.find('div h1'); E > F ----- An F element child of an E element. my $headlines = $dom.find('html > body > div > h1'); E + F ----- An F element immediately preceded by an E element. my $second = $dom.find('h1 + h2'); E ~ F ----- An F element preceded by an E element. my $second = $dom.find('h1 ~ h2'); E, F, G ------- Elements of type E, F and G. my $headlines = $dom.find('h1, h2, h3'); E[foo=bar][bar=baz] ------------------- An E element whose attributes match all following attribute selectors. my $links = $dom.find('a[foo^=b][foo$=ar]'); OPERATORS AND COERCIONS ======================= You can use array subscripts and hash subscripts with DOM::Tiny. Using this class as an array or hash, though, is not recommended as several of the standard methods for these do not work as expected. Array ----- You may use array subscripts as a shortcut for calling `children`: my $third-child = $dom[2]; Hash ---- You may use hash subscripts as a shortcut for calling `attr`: my $id = $dom<id>; Str --- If you convert the DOM::Tiny object to a string using `Str`, `~`, or putting it in a string, it will render the markup. my $html = "$dom"; METHODS ======= Construction, Parsing, and Rendering ------------------------------------ ### method new method new(DOM::Tiny:U: Bool :$xml) returns DOM::Tiny:D Constructs a DOM::Tiny object with an empty DOM tree. Setting the optional `$xml` flag guarantees XML mode. Setting it to a false guarantees HTML mode. If it is unset, DOM::Tiny will select a mode based upon the parsed text, defaulting to HTML. ### method deep-clone method deep-clone(DOM::Tiny:D:) returns DOM::Tiny:D Returns a deep-cloned copy of the current DOM::Tiny object and its children. Any change to the origin will not impact the copy and vice versa. ### method parse method parse(DOM::Tiny:U: Str $ml, Bool :$xml) returns DOM::Tiny:D method parse(DOM::Tiny:D: Str $ml, Bool :$xml) returns DOM::Tiny:D Parses the given string, `$ml`, as HTML or XML based upon the `$xml` flag or autodetection if the flag is not given. If called on an existing DOM::Tiny object, the newly parsed tree will replace the previous tree. ### method render method render(DOM::Tiny:D:) returns Str:D This renders the current node and all its content back to a string and returns it. The format of the markup is determined by the current `xml` setting. ### method Str method Str(DOM::Tiny:D:) returns Str:D This is a synonym for `render`. ### method xml method xml(DOM::Tiny:D:) is rw returns Bool:D This is the boolean flag determining how the node was parsed and how it will be rendered. Finding and Filtering Nodes --------------------------- ### method at method at(DOM::Tiny:D: Str:D $selector) returns DOM::Tiny Given a CSS selector, this will return the first node matching that selector or Nil. ### method find method find(DOM::Tiny:D: Str:D $selector) Returns all nodes matching the given CSS `$selector` within the current node. ### method matches method matches(DOM::Tiny:D: Str:D $selector) returns Bool:D Returns `True` if the current node matches the given `$selector` or `False` otherwise. Tag Details ----------- ### postcircumfix:<{}> method postcircumfix:<{}>(DOM::Tiny:D: Str:D $k) is rw You may use the `.{}` operator as a shortcut for calling the `attr` method and getting attributes on a tag. You may also use the `:exists` and `:delete` adverbs. ### method hash method hash(DOM::Tiny:D:) returns Hash This is a synonym for `attr`, when it is called with no arguments. ### method all-text method all-text(DOM::Tiny:D: Bool :$trim = False) returns Str Pulls the text from all nodes under the current item in the DOM tree and returns it as a string. This is identical to calling `text` with the `:recurse` flag set to `True`. The `:trim` flag may be set to true, which will cause all trimmable space to be clipped from the returned text (i.e., text not in an RCDATA tag like `title` or `textarea` and not in a `pre` tag). ### method attr multi method attr(DOM::Tiny:D:) returns Hash:D multi method attr(DOM::Tiny:D: Str:D $name) returns Str multi method attr(DOM::Tiny:D: Str:D $name, Str() $value) returns DOM::Tiny:D multi method attr(DOM::Tiny:D: Str:D $name, Nil) returns DOM::Tiny:D multi method attr(DOM::Tiny:D: *%values) returns DOM::Tiny:D The `attr` multi-method provides a getter/setter for attributes on the current tag. If the current node is not a tag, this is basically a no-op and will silently do nothing. With no arguments, the method returns the attributes of the tag as a [Hash](Hash). With a single string argument, it returns the value of the named attribute or Nil. With two string arguments, it will set the value of the named attribute and return the current node. With a string argument and a `Nil`, it will delete the attribute and return the current node. Given one or more named arguments, the named values will be set to the given values and the current node will be returned. ### method content multi method content(DOM::Tiny:D:) returns Str:D multi method content(DOM::Tiny:D: DOM::Tiny:D $tree) returns DOM::Tiny:D multi method content(DOM::Tiny:D: Str() $ml, Bool :$xml = $!xml) returns DOM::Tiny:D This multi-method works with the content of the node, something like `innerHTML` in the standard DOM. Given no arguments, it returns the markup within the element rendered to a string. If the node is empty or has no markup, it will return an empty string. Given a DOM::Tiny, the tree within that object will replace the content of the current node. If the current node cannot have children, then this is a no-op and will silently do nothing. This returns the current node. Given a string, the string will be parsed into HTML or XML (based upon the value of the `:xml` named argument, which defaults to the setting for the current node), and the generated node tree will be used to replace the content of the current node. The current node is returned. ### method namespace method namespace(DOM::Tiny:D:) returns Str Returns the namespace URI of the current tag or Str if it has no namespace. Returns Nil in all other cases (i.e., the current node is not a tag). ### method tag multi method tag(DOM::Tiny:D:) returns Str multi method tag(DOM::Tiny:D: Str:D $tag) returns DOM::Tiny:D If the current node is a tag, both versions of this multi-method are no-ops that silently do nothing. If no arguments are passed, the name of the tag is returned. If a single string is passed, the name of the tag is changed to the given string and the current node is returned. ### method text method text(DOM::Tiny:D: Bool :$trim = False, Bool :$recurse = False) returns Str This returns the text content of the current node. For a text node, this returns the text of the node itself. For a tag or the root, this will return the text of all of the immediate text node children of the current node concatenated together. If the argument named `:recurse` is passed, this method will return the text of all descendants rather than just the immediate children. This is the same as calling `all-text`. If the argument named `:trim` is passed, this method will compress all breaking space into single spaces while concatenating all the text together. ### method type method type(DOM::Tiny:D:) returns Node:U This method returns the type of node that is wrapped within the current DOM::Tiny object. This will be one of the following types: Root The root node of the tree. Tag Markup tag nodes within the tree. Text A regular text node. CDATA A CDATA text node. Comment A comment node. Doctype A DOCTYPE tag element. PI An XML processing instruction. This is also used to represent the XML declaration even though it is technically not a PI. Raw A special raw text node, used to represent the text inside of script and style tags. In addition to these types, you may also want to make use of the following roles, which help group the node types together: Node All nodes, including the root implement this role. DocumentNode All nodes that have a parent have this role, i.e., all but the root. HasChildren Only the nodes that have children have this role, so just Tag and Root. TextNode All nodes that contain text have this role. This includes Text, CDATA, and Raw. Each of these classes and roles are exported by `DOM::Tiny` by default. If you prevent these from being exported, you will need to use their full name, which are each prefixed with `DOM::Tiny::HTML::`. For example, `Tag` has the full name `DOM::Tiny::HTML::Tag` and `TextNode` as the full name `DOM::Tiny::HTML::TextNode`. ### method val method val(DOM::Tiny:D) returns Str Returns the value of the tag. Returns `Nil` if the current tag has no notion of value or if the current node is not a tag. Value is computed as follows, based on the tag name: * * **option**: If the option tag has a `value` attribute, that is the option's value. Otherwise, the option's text is used. * * **input**: The `value` attribute is used. * * **button**: The `value` attribute is used. * * **textarea**: The text content of the tag is used as the value. * * **select**: The value of the currently selected option is used. If no option is marked as selected, the select tag has no value. If the select tag has the `multiple` attribute set, then this returns all the selected values. * * Anything else will return `Nil` for the value. Tree Navigation --------------- ### method postcircumfix:<[]> method postcircumfix:<[]>(DOM::Tiny:D: Int:D $i) is rw The `.[]` can be used in place of `child-nodes` to retrieve children of the current root or tag from the DOM. The `:exists` and `:delete` adverbs also work. ### method list method list(DOM::Tiny:D:) returns List This is a synonym for `child-nodes`. ### method ancestors method ancestors(DOM::Tiny:D: Str $selector?) returns Seq Returns a sequence of ancestors to the current object as `DOM::Tiny` objects. This will return an empty sequence for the root or any node that no longer has a parent (such as may be the case for a recently removed node). ### method child-nodes method child-nodes(DOM::Tiny:D: Bool :$tags-only = False) If the current node has children (i.e., a tag or root), this method returns all of the children. If the `:tags-only` flag is set, it returns only the children that are tags. If the current node has no children or is not able to have children, an empty list will be returned. ### method children method children(DOM::Tiny:D: Str $selector?) If the current node has children, this method returns only the tags that are children of the current node. The `$selector` may be set to a CSS selector to filter the children returned. Only those matching the selector will be returned. If the current node has no children or is not able to have children, an empty list will be returned. ### method descendant-nodes method descendant-nodes(DOM::Tiny:D:) Returns all the descendants of the current node or an empty list if none or the node cannot have descendants. They are returned in depth-first order. ### method following method following(DOM::Tiny:D: Str $selector?) Returns all sibling tags of the current node that come after the current node. ### method following-nodes method following-nodes(DOM::Tiny:D:) Returns all sibling nodes of the current node that come after the current node. ### method next method next(DOM::Tiny:D:) returns DOM::Tiny Returns the next sibling tag of the current node. If there is no such sibling, it returns `Nil`. ### method next-node method next-node(DOM::Tiny:D:) returns DOM::Tiny Returns the next sibling node of the current node. If there is no such sibling, it returns `Nil`. ### method parent method parent(DOM::Tiny:D:) returns DOM::Tiny Returns the parent of the current node. If the current node is the root, this method returns `Nil` instead. ### method preceding method preceding(DOM::Tiny:D: Str $selector?) Retursn all siblings of the current node that are tags that come before the current node. A `$selector` may be given to filter the returned tags. ### method preceding-nodes method preceding-nodes(DOM::Tiny:D:) Returns all siblings nodes of the current node that precede the current node. ### method previous method previous(DOM::Tiny:D:) returns DOM::Tiny Returns the previous sibling tag of the current node. If there is no such sibling, it returns `Nil`. ### method previous-node method previous-node(DOM::Tiny:D:) returns DOM::Tiny Returns the previous sibling node of the current node. If there is no such sibling, it returns `Nil`. ### method root method root(DOM::Tiny:D:) returns DOM::Tiny:D Returns the root node of the tree. Tree Modification ----------------- ### method append method append(DOM::Tiny:D: Str() $ml, Bool :$xml = $!xml) returns DOM::Tiny:D Appends the given markup content immediately after the current node. The `:xml` flag may be set to determine whether the given markup should be parsed as XML or HTML (with the default being whatever the current document is being treated as). If the current node is the root (i.e., `$dom.type ~~ Root`), this operation is a no-op. It will silently do nothing. Returns the current node. ### method append-content method append-content(DOM::Tiny:D: Str() $ml, Bool :$xml = $!xml) returns DOM::Tiny:D If this is the root or a tag (i.e., `$dom.type ~~ Root|Tag`), the given markup will be parsed and appended to the end of the root's or tag's children. If this is a text node (i.e., `$dom.type ~~ TextNode`), then the markup will be appended to the text node parent's children. Otherwise this is a no-op and will silently do nothing. The `:xml` flag may be used to specify the format for the markup being parsed, defaulting to the setting for the current document. Returns the node whose children have been modified. ### method prepend method prepend(DOM::Tiny:D: Str() $ml, Bool :$xml = $!xml) returns DOM::Tiny:D Appends the given markup content immediately before the current node. The `:xml` flag may be set to tell the parser to parse in XML mode or not (with the default being whatever is set for the current node). If the current node is the root, this operation is a no-op and will silently do nothing. This method will return the current node. ### method prepend-content method prepend-content(DOM::Tiny:D: Str() $ml, Bool :$xml = $!xml) returns DOM::Tiny:D Appends the given markup content at the beginning of the current node's children, if it is the root or a tag. (This is a no-op that silently does nothing unless the current node is the root or a tag.) The `:xml` flag sets whether to parse the `$ml` as XML or not, with the default being the xml mode flag set on the current node. This method returns the current node. ### method remove method remove(DOM::Tiny:D:) returns DOM::Tiny:D Removes the current node from the tree and returns the parent node. If this node is the root, then the tree is emptied and the current node (i.e., the root) is returned. ### method replace method replace(DOM::Tiny:D: DOM::Tiny:D $tree) returns DOM::Tiny:D method replace(DOM::Tiny:D: Str() $ml) returns DOM::Tiny:D The current node is replaced with the tree or markup given. If the current node is the root, the current node is returned. Otherwise, the original parent of this node, which has been replaced with the new tree, is returned. ### method strip method strip(DOM::Tiny:D:) returns DOM::Tiny:D If the current node is a tag, the tag is removed from the tree and its content moved up into the current node's original parent. This will then return the original node's parent. If the current node is anything else, this is a no-op that will silently do nothing and return the current node. ### method wrap method wrap(DOM::Tiny:D: Str:D $ml, Bool :$xml = $!xml) returns DOM::Tiny:D The given markup in `$ml` is parsed according to the format given by the `:xml` flag (defaulting to whatever the `xml` setting is for the current node). The current node is put within the innermost tag of the given markup. The current node is returned. This is a no-op and will silently do nothing if the current node is the root. ### method wrap-content method wrap-content(DOM::Tiny:D: Str:D $ml, Bool :$xml = $!xml) returns DOM::Tiny:D This is a no-op and will silently do nothing unless the current node is the root or a tag. The given markup in `$ml` is parsed. The parsing proceeds as XML if the `:xml` flag is set or HTML otherwise (with the default being whatever the `xml` flag is set to on the current node). The content of the current node is then placed within the innermost tag of the parsed markup and that parsed markup replaces the content of the current node. CAVEATS ======= This software is beta quality. It has been ported from a mature code base and survived many uses, but it has still only had a small number of bugs reported and fixed. It has a large test suite, but much of that has been ported from the Perl 5 module and is not necessarily specific to the kinds of bugs this port has. There has also been very little done to optimize the code or even to check to make sure it performs well in how it utilizes CPU and memory. As of the v0.5.0, this project is committed to the following signals regarding changes to this software in the future: * The major version number ("1" in "1.2.3") will be incremented whenever a documented feature changes in a way that is not backwards compatible. * The minor version number ("2" in "1.2.3") will be incremented whenever new features or added or any backwards compatible change is made to an undocumented feature or some other significant change is made to the project. * The patch number ("3" in "1.2.3") will be incremented whenever any other change is made (e.g., documentation, testing, minor bug fixes, etc.) Semantic versioning is not a perfect system as it is not always crystal clear what distinguishes "bug fix" from "new feature" or "backwards compatible change" until after the fact, but I will try to do my best. Any change thought to break backwards compatibility will be tagged with "BREAKING CHANGE" in `Changes`. AUTHOR AND COPYRIGHT ==================== Copyright 2008-2016 Sebastian Riedel and others. Copyright 2016 Andrew Sterling Hanenkamp for the port to Perl 6. This is free software, licensed under: The Artistic License 2.0 (GPL Compatible)