The R package XML for parsing and manipulating XML documents in R is no longer actively maintained, but it is still widely used.
The R package xml2 is an actively maintained, more recent alternative.
This file documents useful resources and steps for moving from XML to xml2.
- Collected at r-lib/xml2#246
- jennybc/googlesheets#102
- https://github.com/quanteda/quanteda/pull/1364/files
- https://github.com/quanteda/readtext/pull/128/files
- https://github.com/andrie/sss/commit/8787dd01dc6784bf1fc0f0301ce5d318adad7f70
- https://github.com/cpsievert/rdom/pull/14/files
- https://github.com/USGS-R/geoknife/pull/362/files and https://github.com/USGS-R/geoknife/commit/76b4b47297b7e2f2cd96fb783ec2f35db2ac2f8b
- https://github.com/mikldk/ryacas/commit/e31954259b4ee9e23b0566566b94a9cad7e32ab8 (pretty straightforward example, regex-able)
- https://github.com/cloudyr/aws.alexa/commit/5f5441f1ac8bd4f1f409e2e61ae3ab154962f6ec

The itdepends package helps with finding all usages of XML, see https://speakerdeck.com/jimhester/it-depends?slide=38

    devtools::install_github("jimhester/itdepends")
    library("itdepends")
    itdepends::dep_locate("XML")


| XML | xml2 | Comment |
|---|---|---|
| `XML::getNodeSet(doc = <document object>, path = "<XPath expression>")` or `XML::xpathApply(...)` | `xml2::xml_find_all(..)` and `xml2::xml_find_first(..)` (formerly `xml2::xml_find_one(..)`, now deprecated) with `x = <node>, xpath = "<XPath 1.0 expression>"` | Find matching nodes |
| `XML::htmlTreeParse(<path>, asText = <treat file as text>)` | `xml2::read_html(<path, URL, connection, or literal xml>)` | Parse HTML document |
| `XML::isXMLString("<string>")` | No direct equivalent, can try to parse... | Heuristically determine if string is XML |
| `XML::toString.XMLNode(<node>)` | `as.character(<document or node>)` | Convert object to character |
| `XML::xmlAttrs(node = <node object>)` | `xml2::xml_attrs(x = <document, node, or node set>)` | Get the attributes of a node, both return a named character vector. |
| `XML::xmlApply(X = <node>)` and `XML::xmlSApply(..)` | functions `xml2::xml_attrs(..)` and `xml2::xml_contents(..)` are vectorized | Apply function to each child of a node |
| `XML::xmlChildren(x = <node object>)[["<name of the sub-node>"]]` | `xml2::xml_child(x = <node>, search = <number, or name of the sub-node>)` (only elements) and `xml2::xml_contents(..)` for all nodes | Get sub-nodes of a node |
| `XML::xmlElementsByTagName(el = <node object>, name = "<name to match>")` | `xml2::xml_find_all(x = <document, node, node set>, xpath = "<name to match>")` | Retrieve children matching tag name (children/sub-elements) |
| `XML::xmlGetAttr(node = <node object>, name = "<attribute name>", default = "<default>")` | `xml2::xml_attr(x = <document, node, or node set>, attr = "<attribute name>")` | Get value of a node's attribute |
| `XML::xmlName(node = <node object>)` | `xml2::xml_name(x = <document, node, or node set>)` | Get name of a node |
| `XML::xmlParse(..)` | `xml2::read_xml(..)` | Unexposed method in XML ? |
| `XML::xmlParseDoc(file = <file name> or "<xml content>", asText = !file.exists(file))` | `xml2::read_xml(x = <string, connection, URL, or raw vector>)` | Parse XML document |
| `XML::xmlParseString(content = "<string>")` | `xml2::read_xml(x = <string, connection, URL, or raw vector>)` | Convenience function to turn an XML string into a node/tree |
| `XML::xmlRoot(x = <node object>)` | `xml2::xml_root(x = <document, node, or node set>)` | Get top-level node |
| `XML::xmlSize(obj = <node or document object>)` | `xml2::xml_length(..)` | Note that `xml_length(..)` does not need to go to the root first, i.e. `XML::xmlSize(XML::xmlRoot(old)) == xml2::xml_length(new)` |
| `XML::xmlToList(node = <xml node or document>)` | `xml2::as_list(x = <document, node, or node set>)` | Convert to R-like list; difference: `as_list` does not drop the root element |
| `XML::xmlTreeParse(file = <file name> or "<xml content>", asText = !file.exists(file))` | | Parse XML document |
| `if (!is.null(<node object>[["<child name>"]])) {` | `if (!inherits(xml2::xml_child(<node object>, "<child name>"), "xml_missing")) {` | Check whether a child node exists |
| `XML::xmlValue(<node object>)` | `xml2::xml_text(x = <document, node, or node set>)` | Get/Set contents of a leaf node |

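As a rough illustration of the reading functions in the table, the following sketch parses a small document and queries it with xml2 only. The sample XML and all variable names are made up for this example; the comments name the XML function each call is meant to stand in for.

```r
library(xml2)

# Hypothetical sample document, invented for illustration
doc <- read_xml('<catalog>
  <book id="b1" lang="en"><title>First</title></book>
  <book id="b2"><title>Second</title></book>
</catalog>')

root_name  <- xml_name(xml_root(doc))  # XML::xmlName(XML::xmlRoot(..))
n_children <- xml_length(doc)          # XML::xmlSize(XML::xmlRoot(..))

books  <- xml_find_all(doc, "//book")  # XML::getNodeSet
ids    <- xml_attr(books, "id")        # XML::xmlGetAttr, vectorised over the node set
langs  <- xml_attr(books, "lang")      # NA where the attribute does not exist
titles <- xml_text(xml_find_all(doc, "//book/title"))  # XML::xmlValue

first_attrs <- xml_attrs(books[[1]])   # XML::xmlAttrs, a named character vector

# Child existence check: a missing child is an "xml_missing" object, not NULL
has_title <- !inherits(xml_child(books[[1]], "title"), "xml_missing")
has_price <- !inherits(xml_child(books[[1]], "price"), "xml_missing")
```

Because `xml_attr(..)` and `xml_text(..)` accept node sets, most `sapply(..)` loops from XML-based code can be dropped.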
Common snippets

| XML | xml2 | Comment |
|---|---|---|
| `if (!is.null(XML::xmlChildren(x = obj)[[<node name>]]))` | `if (!inherits(xml2::xml_find_first(x = obj, xpath = <node name>), "xml_missing"))` | Check if element exists. |
| `if (!is.null(XML::xmlAttrs(node = obj)[["href"]]))` | `if (!is.na(xml2::xml_attr(x = obj, attr = "href")))` | Check for a potentially non-existent attribute. |

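The two checks could be wrapped in small helpers so that call sites stay readable. This is only a sketch: `has_element()` and `has_attribute()` are hypothetical names, not part of xml2, and the sample document is invented.

```r
library(xml2)

# Hypothetical helper: TRUE if the XPath matches at least one node below `node`
has_element <- function(node, xpath) {
  !inherits(xml_find_first(node, xpath), "xml_missing")
}

# Hypothetical helper: TRUE if the attribute is present on `node`
has_attribute <- function(node, attr) {
  !is.na(xml_attr(node, attr))
}

node <- read_xml('<a href="https://example.org"><b/></a>')

b_exists     <- has_element(node, "./b")
c_exists     <- has_element(node, "./c")
href_exists  <- has_attribute(node, "href")
title_exists <- has_attribute(node, "title")
```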

| XML | xml2 | Comment |
|---|---|---|
| `XML::addAttributes(node = <node object>, ..., .attrs = <character vector with attribute names>, append = <replace or add>)` | `xml2::xml_set_attrs(x = <document, node, node set>, value = <named character vector>)` to set multiple attributes and overwrite existing ones, or `xml2::xml_set_attr(x = <node>, attr = <name>, value = <value>)` to append a single attribute | Add attributes to a node; in xml2 no re-assigning the object is needed, i.e. no `doc <- XML::addAttributes(node = doc, ...)` |
| `XML::addChildren(node = <node object>, kids = list())` | `xml2::xml_add_child(.x = <document or nodeset>, .value = <document, node or nodeset>)` | Add child nodes to a node |
| `XML::saveXML(doc = <xml document object>, file = "<file name>")` | `xml2::write_xml(x = <document or node>, file = <path or connection>)` | Write XML document to a file; for a string, use `as.character(..)` in xml2 |
| `XML::xmlNamespaceDefinitions(x = <node>)` | `xml2::xml_ns(x = <document, node, or node set>)` | Get namespace definitions from a node |
| `XML::xmlNode(name = "<node name>")` | `xml2::xml_new_document() %>% xml2::xml_add_child("<node name>")` or (preferred in docs) `xml2::xml_new_root("<node name>")` | Create a new node |
| `XML::xmlValue(..)` | `xml2::xml_text(x = <document, node, or node set>)` | Get/Set contents of a leaf node |

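A minimal sketch of the writing side, assuming an invented element and attribute vocabulary: it builds a document, sets attributes in place, and serialises it both to a string and to a temporary file.

```r
library(xml2)

doc <- xml_new_root("observations")        # instead of XML::xmlNode
xml_add_child(doc, "observation", "21.5")  # instead of XML::addChildren; unnamed argument becomes the text
obs <- xml_child(doc, "observation")

# Both setters modify the node in place, so no re-assignment is needed
xml_set_attr(obs, "id", "o1")                    # single attribute
xml_set_attrs(obs, c(id = "o1", unit = "degC"))  # replaces the full set of attributes

xml_string <- as.character(doc)  # serialise to a string

out_file <- tempfile(fileext = ".xml")
write_xml(doc, out_file)         # instead of XML::saveXML(doc = .., file = ..)
reread <- read_xml(out_file)
```

Fetching the child again with `xml_child(..)` keeps the sketch independent of what `xml_add_child(..)` returns.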

| XML | xml2 | Comment |
|---|---|---|
| `XMLAbstractDocument` | `xml_document` | .. |
| `XMLAbstractNode`, `XMLCommentNode`, `XMLTextNode`, ... | `xml_node` | .. |
| ? | `xml_missing` | .. |

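To see how the classes show up in practice, the sketch below inspects a few objects with `inherits(..)`. Since xml2 uses one `xml_node` class for all node kinds, `xml_type(..)` is used to tell text and comment nodes apart. The sample document is invented.

```r
library(xml2)

doc <- read_xml("<root><child>text<!-- note --></child></root>")
child <- xml_child(doc, "child")
missing_node <- xml_find_first(doc, "./nope")  # no match yields an "xml_missing" object

doc_is_document  <- inherits(doc, "xml_document")
doc_is_node      <- inherits(doc, "xml_node")   # documents also inherit from xml_node
child_is_node    <- inherits(child, "xml_node")
missing_is_class <- inherits(missing_node, "xml_missing")

# Node kinds are distinguished by type rather than by class
content_types <- xml_type(xml_contents(child))
```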
The following steps were applied in switching from XML to xml2 for the package sos4R.
This was not a "clean" process, but it hopefully provides useful input for others doing the switch.
Ideally the lessons learned on what can be "regex-ed" and what needs manual interaction go into the above tables at a later stage.
- Make sure all functions use named parameters and the package prefix, using the following regular expressions
  - `addAttributes\((?!node)` replaced with `XML::addAttributes(node =`
  - `addChildren\(node` replaced with `XML::addChildren(node`
  - `getNodeSet\((?!doc)` replaced with `XML::getNodeSet(doc =`
  - `isXMLString\((?!str)` replaced with `XML::isXMLString(str =`
  - `saveXML\((?!doc)` replaced with `XML::saveXML(doc =`
  - `xmlAttrs\((?!node)` replaced with `XML::xmlAttrs(node =`
  - `xmlChildren\((?!x)` replaced with `XML::xmlChildren(x =`
  - `xmlElementsByTagName` replaced with `XML::xmlElementsByTagName`
  - `xmlGetAttr\((?!node)` replaced with `XML::xmlGetAttr(node =`
  - `xmlName\((?!node)` replaced with `XML::xmlName(node =`
  - `xmlNode\((?!name)` and `xmlNode\(name =` replaced with `XML::xmlNode(name =`
  - `xmlParse\(` replaced with `XML::xmlParse(file =`
  - `xmlParseDoc\((?!file)` replaced with `XML::xmlParseDoc(file =`
  - `xmlParseString\(` replaced with `XML::xmlParseString(content =`
  - `xmlRoot\((?!x)` replaced with `XML::xmlRoot(x =`
  - `xmlSize\(` replaced with `XML::xmlSize(obj =`
  - `xmlToList\(` replaced with `XML::xmlToList(node =`
  - `xmlTreeParse\(` replaced with `XML::xmlTreeParse(file =`
  - `xmlValue\((?!x)` replaced with `XML::xmlValue(x =`
- `Imports:` XML instead of `Depends:`
- Run tests, skipping the ones unrelated to XML handling
- Commit:
- Do the switch (parsing functions first; all searches in files `*.R`; files in `/sandbox/` ignored for manual corrections; order driven by running a basic parsing test and seeing where it fails next)
  - `XML::xmlParseDoc`
    - Replace `XML::xmlParseDoc(file =` with `xml2::read_xml(x =` (26 occurrences)
    - Fix parameters
      - drop `, asText = TRUE` by replacing it with an empty string (11 occurrences)
      - turn `options` into vector with strings
      - replace `c(XML::NOERROR, XML::RECOVER)` with `SosDefaultParsingOptions()`
      - use `xmlParseOptions` everywhere
  - `XML::xmlParseString`
    - Replaced manually by simplifying the implementation of `encodeXML` for signature `"character"`
  - `XML::xmlParse`
    - Replaced single occurrence manually and refactored method `parseFile`
  - `XML::xmlRoot`
    - Replace `XML::xmlRoot` with `xml2::xml_root` (25 occurrences)
  - `XML::xmlName`
    - Replace `XML::xmlName(node =` with `xml2::xml_name(x =` (30 occurrences)
    - Manually added `, ns = SosAllNamespaces()` later to have names with prefix
  - `XML::xmlAttrs`
    - Replace `XML::xmlAttrs(node =` with `xml2::xml_attrs(x =` (3 occurrences)
    - Fix further occurrences manually by searching for `xmlAttrs` (must have slipped by before)
    - `xml2::xml_attrs(x = obj)[["href"]]` does not work: if the attribute `href` does not exist, there will be a "subscript out of bounds" error. Use `xml2::xml_attr(..)` instead:
      - Search for `xml2::xml_attrs\(x = (.*)\[\[` and fix manually to `xml2::xml_attr(x = obj, attr = "<attribute name>")` and update subsequent `is.null(..)` checks to use `is.na(..)`
  - `XML::xmlGetAttr`
    - Replace `XML::xmlGetAttr\(node = (.*), name =` with `xml2::xml_attr(x = $1, attr =` (55 occurrences)
    - Manually fix the ones spread across multiple lines and the ones with missing `name =`; indentation can be fixed or newlines removed at the same time
    - Manually fix where `xmlGetAttr` was used with `lapply(..)` or `sapply(..)`
  - `XML::xmlValue`
    - Replace `XML::xmlValue\(x =` with `xml2::xml_text(x =` (45 occurrences)
  - `XML::xmlChildren`
    - Replace `XML::xmlChildren\(x =` with `xml2::xml_children(x =` (22 occurrences)
    - The common pattern `XML::xmlChildren(x = obj)[[gmlTimeInstantName]]` does not work because `xml2::xml_children(..)` does not return a named list. Need to run `xml2::xml_find_all(x = obj, xpath = gmlTimeInstantName)` or `xml2::xml_find_first(..)` then. Search for `xml2::xml_children\(x = (.*)\[\[` to fix those manually (10 results)
      - `..find_first` returns missing node: `is.na(xml2::xml_find_first(x, "f"))` or `inherits(xml2::xml_find_first(x, "f"), "xml_missing")`
      - `..find_all` returns (potentially empty) nodeset: `length(xml2::xml_find_all(x, "f"))`
  - Replaced occurrences of class `XMLAbstractNode` and `XMLInternalDocument` for slots in S4 classes with `ANY` and set the default prototype to `xml2::xml_missing()`; will have to handle stuff manually around these classes
    - Opened issue about this in `xml2` repo: r-lib/xml2#248
  - Add `SosAllNamespaces()` and add namespaces to all the `xxxName` constants in `R/Constants.R`
  - `test_exceptionreports.R` complete
  - `test_sams.R` added and parsing fixed
  - `XML::getNodeSet` manually switched to `xml2::xml_find_all(..)` and `xml2::xml_find_first(..)`, because XPath-based getting of sub-nodes with `xml2` also requires proper namespaces and some handling can be simplified because of vectorised `xml2::xml_text(..)`.
  - `XML::xmlSize`
    - Updated single occurrence manually
  - `XML::saveXML`
    - Replaced `XML::saveXML(doc =` with `xml2::write_xml(x =` (6 occurrences), no parameters in `saveXML` besides `doc` and `file` were used
  - Update `NAMESPACE` to import `xml2` and not `XML`
  - Parsing tests of `test_sensors.R` work
  - `XML::isXMLString`
    - Replace with own function using simple regex test: `grepl("^<(.*)>$", "...")`
  - get rid of `.filterXmlChildren` and `.filterXmlOnlyNoneTexts` manually using `xml2::xml_child(..)`, `xml2::xml_find_first(..)` or `xml2::xml_find_all(..)`
    - also remove all `".noneText"` objects (and by that fix all occurrences of `xmlTagName`)
    - `is.na(xml2::` > fix using `is.na(..)` (regex, 16 occurrences)
  - must fix all `obj[[` because subsetting with `[[` does not work with xml2 (107 occurrences at this point!)
    - trying to automate by replacing `obj\[\[(.*?)\]\]` with `xml2::xml_child(x = obj, search = $1, ns = SosAllNamespaces())`
    - revert the changes in summary functions where `obj[[..]]` was used (file `PrintShowStructureSummary-methods.R`)
    - does not work for multiple subsets, e.g. `obj[["elementCount"]][["Count"]][["value"]]` > search for `SosAllNamespaces())[[` and fix manually to use XPath (4 occurrences)
    - re-check occurrences of `.children[[`
    - `is.null\(\.` with some XML object, should be `is.na(..)` which picks up on `"xml_missing"` objects
    - New tests added for...
      - `parseOwsRange`
      - `parseSosFilter_Capabilities`
      - `parseOwsServiceIdentification`
      - `parseTime`
      - `parseSosObservationOffering` (also for 2.0.0)
    - fix tests in `test_sensors.R`
  - [Continue with encoding functions]
  - `XML::addAttributes`
    - switched manually because sometimes `.attrs` is used, which is replaced with `xml2::xml_set_attrs()`, and sometimes not (single `...`), which is replaced with `xml2::xml_set_attr()`; the `_set_attr` variants operate directly on the object (no need to re-assign), and often statements are multi-line (18 occurrences)
    - get rid of `.sos100_NamespaceDefinitionsForAll`
  - `XML::xmlNode` and `XML::addChildren`
    - manually switched to `xml2::xml_new_root("<node name>")` and `xml2::xml_add_child("<node name>")`
    - `attrs` parameter replaced with `xml2::xml_set_attrs()`
    - r-lib/xml2#239 is a problem
    - `XML::addChildren` with `"append = TRUE"` replaced with a for loop and `xml2::xml_add_child(..)`
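The two-stage regex approach from the steps above can be tried out on a few lines of code held in a character vector. This is only a sketch: the sample lines are invented, and in practice the replacements were presumably run with an editor's search and replace rather than from R. Note that R's `gsub(..)` needs `perl = TRUE` for the negative lookahead and uses `\\1` instead of `$1` for the captured group.

```r
# Hypothetical source lines, invented for illustration
code <- c(
  'attrs <- xmlAttrs(obj)',
  'attrs <- xmlAttrs(node = obj)',
  'id <- xmlGetAttr(obj, name = "id")'
)

# Stage 1: add the package prefix and name the first argument.
# The negative lookahead skips calls whose first argument is already named.
step1 <- gsub("xmlAttrs\\((?!node)", "XML::xmlAttrs(node = ", code, perl = TRUE)
step1 <- gsub("xmlGetAttr\\((?!node)", "XML::xmlGetAttr(node = ", step1, perl = TRUE)

# Stage 2: switch the now uniform calls to xml2.
step2 <- gsub("XML::xmlAttrs(node = ", "xml2::xml_attrs(x = ", step1, fixed = TRUE)
step2 <- gsub("XML::xmlGetAttr\\(node = (.*), name =",
              "xml2::xml_attr(x = \\1, attr =", step2, perl = TRUE)

print(step2)
```

The second sample line is left untouched by both stages, because its first argument was already named but the call had no prefix. Such cases need an extra pattern, as the list does for `xmlNode\(name =`.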

Regexes fall short for the actual switch because statements span multiple lines and because the functions do not return the same kind of result.
In particular, the extensively used subsetting with `[[` does not work the same way anymore.
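As an illustration of the `[[` problem, a chained subset such as `obj[["elementCount"]][["Count"]][["value"]]` can be expressed as one namespaced XPath in xml2. The sketch below uses an invented document and namespace URI, and `xml_ns(doc)` stands in for a helper like `SosAllNamespaces()`.

```r
library(xml2)

# Hypothetical namespaced document; prefix and URI are invented for illustration
doc <- read_xml('<swe:field xmlns:swe="http://example.org/swe">
  <swe:elementCount><swe:Count><swe:value>3</swe:value></swe:Count></swe:elementCount>
</swe:field>')

# Stand-in for a package-wide namespace helper
ns <- xml_ns(doc)

# XML: obj[["elementCount"]][["Count"]][["value"]]
value_node <- xml_find_first(doc, "./swe:elementCount/swe:Count/swe:value", ns = ns)
value <- xml_text(value_node)

# A path without a match gives an "xml_missing" object or an empty node set, never NULL
missing_node <- xml_find_first(doc, "./swe:nope", ns = ns)
is_missing   <- inherits(missing_node, "xml_missing")
n_matches    <- length(xml_find_all(doc, "./swe:nope", ns = ns))

# Names come without prefix unless the namespaces are passed
plain_name    <- xml_name(value_node)
prefixed_name <- xml_name(value_node, ns = ns)
```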
Hello,
Your blog is very useful, thank you for that.
However, I wanted to let you know that XML is back after all this time with a new version: https://cran.r-project.org/web/packages/XML/index.html.
Best regards,