Parse XML using DOM in Java
XML and DOM
XML stands for Extensible Markup Language. The Document Object Model (DOM) is a standardized tree representation of HTML and XML documents. This post shows how to use the Java DOM parser to process XML documents.
The following is an example XML document. It is from Google’s news feed, which aggregates news headlines.
<rss version="2.0">
  <channel>
    <item>
      <title>Toyota in $1.1 Billion Gas-Pedal ... </title>
      <category>Top Stories</category>
      <pubDate>Thu, 27 Dec 2012 03:23:03 GMT</pubDate>
    </item>
    <item>
      ...
    </item>
  </channel>
</rss>
The following diagram shows the DOM tree structure.
            +--------------+
            | Root Element |
            |    <rss>     |
            +--------------+
                   |
            +--------------+
            |   Element    |
            |  <channel>   |
            +--------------+
       |-----------------------|
+--------------+        +--------------+
|   Element    |        |   Element    |
|    <item>    |        |    <item>    | ....
+--------------+        +--------------+
       |
+--------------+
|   Element    |
|   <title>    | ......
+--------------+
       |
+--------------+
|     Text     |
| "Toyota ..." |
+--------------+
An up-to-date version of this XML file can be found at
http://news.google.com/?output=rss.
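To make the tree structure concrete, here is a minimal sketch that parses a shortened copy of the sample feed from a string and prints one line per node, indented by depth. The class name DomTreeDump and the helpers walk and render are made up for this illustration, and whitespace-only text nodes are skipped so that the output matches the diagram.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomTreeDump {
    // Hypothetical helper: append one line per element or text node,
    // indented by two spaces per level of depth.
    static void walk(Node node, int depth, StringBuilder out) {
        StringBuilder indent = new StringBuilder();
        for (int i = 0; i < depth; i++) indent.append("  ");

        if (node.getNodeType() == Node.ELEMENT_NODE) {
            out.append(indent).append("Element <")
               .append(node.getNodeName()).append(">\n");
        } else if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (text.isEmpty()) return; // skip whitespace between tags
            out.append(indent).append("Text \"").append(text).append("\"\n");
        }

        // Recurse into the children (text nodes have none).
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), depth + 1, out);
        }
    }

    // Hypothetical helper: parse an XML string and render its tree.
    static String render(String xml) throws Exception {
        DocumentBuilder b =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Node root = b.parse(new InputSource(new StringReader(xml)))
                     .getDocumentElement();
        StringBuilder out = new StringBuilder();
        walk(root, 0, out);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Shortened copy of the sample feed, so no network access is needed.
        String xml = "<rss version=\"2.0\"><channel><item>"
                   + "<title>Toyota ...</title>"
                   + "<category>Top Stories</category>"
                   + "</item></channel></rss>";
        System.out.print(render(xml));
    }
}
```

Each element appears one level below its parent, and the title text shows up as a separate Text node under the title element rather than as a property of it.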
Using Java DOM Parser
The following Java program ParseXMLDOM.java reads the XML feed, goes through each “item”, and prints out its “title”.
import javax.xml.parsers.*;
import org.w3c.dom.*;

public class ParseXMLDOM {
    public static void main(String[] args) {
        String url = "http://news.google.com/?output=rss";
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            DocumentBuilder b = f.newDocumentBuilder();
            Document doc = b.parse(url);
            doc.getDocumentElement().normalize();
            System.out.println("Root element: "
                + doc.getDocumentElement().getNodeName());

            // loop through each item
            NodeList items = doc.getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                Node n = items.item(i);
                if (n.getNodeType() != Node.ELEMENT_NODE)
                    continue;
                Element e = (Element) n;

                // get the "title elem" in this item (only one)
                NodeList titleList = e.getElementsByTagName("title");
                Element titleElem = (Element) titleList.item(0);

                // get the "text node" in the title (only one)
                Node titleNode = titleElem.getChildNodes().item(0);
                System.out.println(titleNode.getNodeValue());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
The following is a sample output:
Root element: rss
Toyota in $1.1 Billion Gas-Pedal Settlement - Wall Street Journal
George H.W. Bush battles fever - Boston Herald
......
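The program above assumes that every item has a title and that every title has a text child. If either is missing, item(0) returns null and the program throws a NullPointerException. The sketch below shows one possible defensive variant. The class TitleText and its helpers textOf and parse are hypothetical names introduced for this example, and the inline XML is invented test data rather than real feed content.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class TitleText {
    // Hypothetical helper: text of the first descendant element with the
    // given tag name, or "" if there is no such element or it has no text.
    static String textOf(Element parent, String tag) {
        NodeList matches = parent.getElementsByTagName(tag);
        if (matches.getLength() == 0) return "";
        Node first = matches.item(0).getFirstChild();
        if (first == null || first.getNodeType() != Node.TEXT_NODE) return "";
        return first.getNodeValue();
    }

    // Hypothetical helper: parse an XML string into a DOM Document.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Invented sample: one channel-level title, one item with a title,
        // and one item whose title is empty.
        Document doc = parse("<channel><title>Feed</title>"
            + "<item><title>Story</title></item>"
            + "<item><title/></item></channel>");
        Element channel = doc.getDocumentElement();

        // Counts every descendant named "title": the channel title
        // and both item titles, not only direct children.
        System.out.println(channel.getElementsByTagName("title").getLength());

        NodeList items = channel.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            // The empty title yields "" instead of a NullPointerException.
            System.out.println(textOf((Element) items.item(i), "title"));
        }
    }
}
```

Because the lookup covers all descendants, calling it on the channel element would also match titles nested inside items. Calling it on each item, as the original program does, keeps the search local to that item.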
Notes
The method getElementsByTagName of Element returns a NodeList of all descendant elements with a given tag name.
Element is a special kind of (subinterface of) Node. In the code above, “titleElem” is an element which has a text node “titleNode” as its only child.
When parsing, the Java DOM parser loads the whole XML document into memory and builds a document tree. This makes it a poor fit for large XML files, but it is quick and easy for small ones.
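For comparison, a streaming parser reads the document one event at a time and never builds the full tree. The sketch below uses StAX (javax.xml.stream), which is a different API from the DOM parser used in this post, to collect item titles. The class StreamTitles and the method itemTitles are hypothetical names, and the code assumes each title contains only text.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StreamTitles {
    // Hypothetical helper: collect the text of every <title> that
    // appears inside an <item>, reading the input as a stream of events.
    static List<String> itemTitles(Reader in) throws Exception {
        XMLStreamReader r =
            XMLInputFactory.newInstance().createXMLStreamReader(in);
        List<String> titles = new ArrayList<>();
        boolean inItem = false;
        while (r.hasNext()) {
            int event = r.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                if (r.getLocalName().equals("item")) {
                    inItem = true;
                } else if (inItem && r.getLocalName().equals("title")) {
                    // Reads the text content and moves to the end tag.
                    titles.add(r.getElementText());
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && r.getLocalName().equals("item")) {
                inItem = false;
            }
        }
        r.close();
        return titles;
    }

    public static void main(String[] args) throws Exception {
        // Invented sample input; a real feed would be read from a stream.
        String xml = "<rss><channel><title>Feed</title>"
                   + "<item><title>A</title></item>"
                   + "<item><title>B</title></item>"
                   + "</channel></rss>";
        for (String t : itemTitles(new StringReader(xml))) {
            System.out.println(t);
        }
    }
}
```

The trade-off is that the code must track its own position in the document (here with the inItem flag), whereas the DOM version can navigate the tree freely once it is loaded.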
References
Document Object Model (Wikipedia)
javax.xml.parsers.DocumentBuilder
org.w3c.dom.Document
org.w3c.dom.Element
org.w3c.dom.Node