DOM NodeList iteration performance problem

I am having a performance problem with a module of my project. The code actually does something really easy. It just iterates over a NodeList and overwrites an attribute of each element. At first, I thought this slowness was related to the size of my XML document. I found out later that it has more to do with the underlying NodeList implementation. Here is the problematic code.

NodeList nl = doc.getElementsByTagName("SumInterval");
int length = nl.getLength();
 
for(int i=0; i<length; i++){
    Element elem = (Element)nl.item(i);
    elem.setAttribute("sum-def", "MON-SUN");
}

The document size is 16 MB and my target elements sit at the bottom of the file. For quicker testing round-trip I will reduce the number of my target elements to just 6 and the time spent for this loop will be around 1.3 secs (in my real use cases there are thousands of my target elements and the execution time is almost a minute).

A reasonable guess for this slowness is that the document is quite big so it is takes some time to search for the matched elements. The strange thing is that if I comment out the line that set the new attribute and run it again now the execution time will be just 203 ms. The loop still need to search for the same element with same number of iteration so how come the execution time reduce significantly. It must be something about the method to set the attribute then.

Oracle JDK internally uses Apache Xerces as DOM implementation so I tried downloading the Apache Xerces 2 Java and did some debugging. The first interesting thing is that the getElementsByTagName() method actually doesn’t do anything much. It just creates a NodeList implementation instance with a root node that searching will perform against and the name of target elements.

public NodeList getElementsByTagName(String tagname) {
   return new DeepNodeListImpl(this,tagname);
}

The task of searching for matched elements is in DeepNodeListImpl class and it will happen when I try to access NodeListelement with method item() or getLength().  The search is just a normal DOM traversal walking up and down the tree to find elements with a specific name. The matched elements will be stored in a list in the DeepNodeListImpl instance (the variable “nodes” show below).

public DeepNodeListImpl(NodeImpl rootNode, String tagName) {
    this.rootNode = rootNode;
    this.tagName  = tagName;
    nodes = new ArrayList();
}

The search algorithm itself is quite fast enough as I can see from the execution time when I comment out the line that set the new attribute value. The algorithm is optimized in the way that it will search for elements just enough it need to be. If the 5th element is retrieved, it will search for 5 matched elements and cached it in the list. If then the 10th elements is fetched then it will picking up where it left off and search for 5 more elements so it doesn’t need to start from the root node again. The problem is the effect of modifying DOM document while iterating over NodeList.

According to DOM specification, NodeList must be “live”; the changes anywhere in the document are reflected in every NodeList. The operation of setting new attribute value will set a flag in document instance indicating that this document has changed. NodeList will check this flag every time it tries to retrieve next element.

If the document is changed then NodeList will clear its current list of matched element and start searching all over again from the root node. The root node in my case is the document itself so it will cause a lot of overhead if there are thousands of elements sit before my target and the list of matched elements is keeping destroyed in every round of the iteration.

public class DeepNodeListImpl implements NodeList {
    public Node item(int index) {
        Node thisNode;
 
        //Tree changed. Do it all from scratch!
 
        if (rootNode.changes() != changes) {
            nodes   = new ArrayList();
            changes = rootNode.changes();
        }
        // the rest of searching algorithm
    }
}

I don’t really know if this is inevitable due to the nature of DOM specification or there are other implementations that could handle this more efficiently. The even more confused thing is that although setting new attribute value is considered to be operation that change DOM document, removing the attribute is not. If I replace the problematic line with elem.removeAttribute(“sum-def”) then the flag will not be set. I am trying to get more information about this.

My workaround right now is to iterate NodeList and put matched elements in a new ArrayList then modify them outside NodeList.

NodeList nl = doc.getElementsByTagName("SumInterval");
 
//store the mactched elements in a new linkedList
int length = nl.getLength();
List elemList = new ArrayList(length);
for(int i=0; i<length; i++){
    elemList.add( (Element)nl.item(i) ) ;
}
 
//perform DOM modification
for(Element elem : elemList){
    elem.setAttribute("sum-def", "MON-SUN");
}