Parse XML using DOM in Java

XML and DOM

XML stands for Extensible Markup Language. The Document Object Model (DOM) is a standardized, tree-structured representation of HTML and XML documents. Here we show how to use the Java DOM parser to process XML documents.

The following is an example XML document. It is from Google’s news feed, which aggregates news headlines.

<rss version="2.0">
  <channel>
    <item>
      <title>Toyota in $1.1 Billion Gas-Pedal ... </title>
      <category>Top Stories</category>
      <pubDate>Thu, 27 Dec 2012 03:23:03 GMT</pubDate>
    </item>
    <item>
      ...
    </item>
  </channel>
</rss>

The following diagram shows the DOM tree structure.

+--------------+
| Root Element |
|    <rss>     |
+--------------+
       |
+--------------+
|    Element   |
|   <channel>  |
+--------------+
       |-----------------------|
+--------------+       +--------------+
|    Element   |       |    Element   |
|    <item>    |       |    <item>    |            ....
+--------------+       +--------------+
       |
+--------------+
|    Element   |
|    <title>   |       ......
+--------------+
       |
+--------------+
|     Text     |
| "Toyota ..." |
+--------------+

An up-to-date version of this XML file can be found at this link:
http://news.google.com/?output=rss.
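
To make the tree in the diagram above concrete, here is a minimal sketch that parses a small XML string and prints one line per node, indented by depth. The class name DomTreeWalk, the helper methods, and the SAMPLE string are assumptions made for illustration; SAMPLE is a shortened stand-in for the feed, not its real content.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomTreeWalk {
    // Hypothetical stand-in for the feed: a single item, written without
    // indentation so the tree contains no whitespace-only text nodes.
    static final String SAMPLE =
        "<rss version=\"2.0\"><channel><item>"
        + "<title>Toyota ...</title>"
        + "<category>Top Stories</category>"
        + "</item></channel></rss>";

    // Parses an XML string into a DOM Document.
    // Checked exceptions are wrapped so callers stay short.
    static Document parse(String xml) {
        try {
            DocumentBuilder b =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = b.parse(new InputSource(new StringReader(xml)));
            doc.getDocumentElement().normalize();
            return doc;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Appends one line for this node, then recurses into its children.
    static void describe(Node node, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            out.append("Element <").append(node.getNodeName()).append(">");
        } else if (node.getNodeType() == Node.TEXT_NODE) {
            out.append("Text \"").append(node.getNodeValue()).append("\"");
        } else {
            out.append("Other ").append(node.getNodeName());
        }
        out.append('\n');

        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            describe(children.item(i), depth + 1, out);
        }
    }

    // Returns the indented outline of the whole tree, starting at the root.
    static String dump(String xml) {
        StringBuilder out = new StringBuilder();
        describe(parse(xml).getDocumentElement(), 0, out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump(SAMPLE));
    }
}
```

For SAMPLE, the outline mirrors the diagram: rss, then channel, then item, then title with its text node. If the input is indented like the feed excerpt above, the parser also keeps the line breaks and spaces between elements as extra text nodes, which the diagram leaves out.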

Using Java DOM Parser

The following Java program ParseXMLDOM.java reads the XML feed, goes through each “item”, and prints out its “title”.

import javax.xml.parsers.*;
import org.w3c.dom.*;

public class ParseXMLDOM 
{ 
    public static void main(String[] args)
    {
        String url = "http://news.google.com/?output=rss";   
        try
        {
            DocumentBuilderFactory f = 
                    DocumentBuilderFactory.newInstance();
            DocumentBuilder b = f.newDocumentBuilder();
            Document doc = b.parse(url);

            doc.getDocumentElement().normalize();
            System.out.println ("Root element: " + 
                        doc.getDocumentElement().getNodeName());
      
            // loop through each item
            NodeList items = doc.getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++)
            {
                Node n = items.item(i);
                if (n.getNodeType() != Node.ELEMENT_NODE)
                    continue;
                Element e = (Element) n;

                // get the "title" element in this item (only one expected)
                NodeList titleList = 
                                e.getElementsByTagName("title");
                if (titleList.getLength() == 0)
                    continue;   // skip items that have no title
                Element titleElem = (Element) titleList.item(0);

                // get the text node in the title (only one expected)
                Node titleNode = titleElem.getFirstChild();
                if (titleNode != null)   // an empty <title/> has no child
                    System.out.println(titleNode.getNodeValue());
            }
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }
}

The following is a sample output:

Root element: rss
Toyota in $1.1 Billion Gas-Pedal Settlement - Wall Street Journal
George H.W. Bush battles fever - Boston Herald
......

Notes

The method getElementsByTagName of Element returns a NodeList of all descendant elements with a given tag name.
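
The "all descendants" behavior matters because the same tag name can appear at several levels. The sketch below assumes a feed whose channel carries its own title in addition to the item titles; the SAMPLE string, the class name DescendantLookup, and the helper methods are hypothetical and only serve the illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class DescendantLookup {
    // Hypothetical feed: one <title> directly under <channel>,
    // plus one <title> inside each of the two <item> elements.
    static final String SAMPLE =
        "<rss version=\"2.0\"><channel>"
        + "<title>Feed title</title>"
        + "<item><title>First headline</title></item>"
        + "<item><title>Second headline</title></item>"
        + "</channel></rss>";

    // Parses an XML string into a DOM Document.
    static Document parse(String xml) {
        try {
            DocumentBuilder b =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return b.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Counts elements with the given tag anywhere in the document.
    static int countInDocument(Document doc, String tag) {
        return doc.getElementsByTagName(tag).getLength();
    }

    // Counts elements with the given tag below the first <item> only.
    static int countInFirstItem(Document doc, String tag) {
        Element firstItem = (Element) doc.getElementsByTagName("item").item(0);
        return firstItem.getElementsByTagName(tag).getLength();
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        System.out.println("titles in document:   "
                + countInDocument(doc, "title"));
        System.out.println("titles in first item: "
                + countInFirstItem(doc, "title"));
    }
}
```

Searching from the document finds three titles here, while searching from the first item finds one. This is why the program above calls getElementsByTagName on each item element rather than on the document when it looks for the item's title.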

Element is a special kind of (subinterface of) Node. In the code above, “titleElem” is an element which has a text node “titleNode” as its only child.
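
The distinction between an element and its text node can be checked directly. The following sketch is an illustration under assumed inputs (the XML strings, the class name TitleText, and the helper firstTitle are hypothetical). It also uses getTextContent, a standard Node method, as an alternative way to read the title that does not depend on a child text node being present.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TitleText {
    // Parses an XML string into a DOM Document.
    static Document parse(String xml) {
        try {
            DocumentBuilder b =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return b.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Returns the text of the first <title>, or null if there is no <title>.
    static String firstTitle(String xml) {
        NodeList titles = parse(xml).getElementsByTagName("title");
        if (titles.getLength() == 0) {
            return null;
        }
        Element titleElem = (Element) titles.item(0);
        // getTextContent concatenates the text below the element and
        // yields an empty string when the element has no children.
        return titleElem.getTextContent();
    }

    public static void main(String[] args) {
        Document doc = parse("<item><title>Example headline</title></item>");
        Element titleElem = (Element) doc.getElementsByTagName("title").item(0);
        Node titleNode = titleElem.getFirstChild();

        // The element itself carries no value; its text node does.
        System.out.println("element value:   " + titleElem.getNodeValue());
        System.out.println("text node value: " + titleNode.getNodeValue());
        System.out.println("text content:    " + titleElem.getTextContent());
    }
}
```

For a title that holds a single text node, the text node's value and the element's text content are the same string. The two approaches differ for an empty title: there is no child node to read, whereas getTextContent still returns an empty string.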

When parsing, the Java DOM parser loads the whole XML document into memory and builds a document tree. This makes it a poor fit for large XML files, but it is quick and easy to use for small ones.

References

Document Object Model (Wikipedia)
javax.xml.parsers.DocumentBuilder
org.w3c.dom.Document
org.w3c.dom.Element
org.w3c.dom.Node
