Java: Read Large XML by Using StAX
The following code shows how to read an XML file in Java. It uses the StAX API, which reads XML files sequentially. If reading a large XML file throws an OutOfMemoryError, the code below should solve the problem. Because it reads the file sequentially rather than loading it whole, it can process very large XML files, such as 10 GB or 20 GB, which makes it a scalable solution.
Problem
From the following XML file, get the "id" attribute and the first "thetext" value of each item. This is related to the general problem of getting the first matching element from an XML file.
In the following sections, I give the complete code to parse the XML file with StAX and explain it briefly.
<?xml version="1.0" encoding="UTF-8"?>
<config>
  <item id="1">
    <mode>1</mode>
    <long_desc isprivate="0">
      <who name="Andy">[email protected]</who>
      <bug_when>2001-10-10 21:34:46 -0400</bug_when>
      <thetext> Setup a project</thetext>
    </long_desc>
    <long_desc isprivate="0">
      <who name="Mike">[email protected]</who>
      <bug_when>2001-10-10 21:34:46 -0400</bug_when>
      <thetext>- Setup</thetext>
    </long_desc>
    <long_desc isprivate="0">
      <who name="Gary">[email protected]</who>
      <bug_when>2001-10-10 21:34:46 -0400</bug_when>
      <thetext>project</thetext>
    </long_desc>
  </item>
  <item id="2">
    <mode>2</mode>
    <long_desc isprivate="0">
      <who name="John">[email protected]</who>
      <bug_when>2001-10-10 21:34:46 -0400</bug_when>
      <thetext> Setup a project</thetext>
    </long_desc>
    <long_desc isprivate="0">
      <who name="Bill">[email protected]</who>
      <bug_when>2001-10-10 21:34:46 -0400</bug_when>
      <thetext>- Setup</thetext>
    </long_desc>
    <long_desc isprivate="0">
      <who name="Rick">[email protected]</who>
      <bug_when>2001-10-10 21:34:46 -0400</bug_when>
      <thetext>project</thetext>
    </long_desc>
  </item>
</config>
Solution
Complete Code:
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;

import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Holds the first "thetext" value seen inside the current item.
class Item {
    private String firstText = null;

    public void setFirstText(String str) {
        firstText = str;
    }

    public String getFirstText() {
        return firstText;
    }
}

public class Main {
    public static void main(String[] args) throws IOException, XMLStreamException {
        // First create a new XMLInputFactory
        XMLInputFactory inputFactory = XMLInputFactory.newInstance();
        // Optional: deliver adjacent text as a single Characters event
        // inputFactory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);

        // Set up a new eventReader; the stream is closed when the try block exits
        try (InputStream in = new FileInputStream("/usa/xiwang/Desktop/config")) {
            XMLEventReader eventReader = inputFactory.createXMLEventReader(in);
            Item item = null;

            while (eventReader.hasNext()) {
                XMLEvent event = eventReader.nextEvent();

                if (event.isStartElement()) {
                    StartElement startElement = event.asStartElement();
                    String name = startElement.getName().getLocalPart();

                    if (name.equals("item")) {
                        // reach the start of an item
                        item = new Item();
                        System.out.println("--start of an item");
                        // attribute
                        Iterator<Attribute> attributes = startElement.getAttributes();
                        while (attributes.hasNext()) {
                            Attribute attribute = attributes.next();
                            if (attribute.getName().getLocalPart().equals("id")) {
                                System.out.println("id = " + attribute.getValue());
                            }
                        }
                    } else if (name.equals("thetext") && item != null) {
                        // data: getElementText() reads up to the matching end tag,
                        // so an empty <thetext/> yields "" instead of a cast error
                        String text = eventReader.getElementText();
                        if (item.getFirstText() == null) {
                            System.out.println("thetext: " + text);
                            item.setFirstText(text);
                        }
                    }
                } else if (event.isEndElement()) {
                    // reach the end of an item; compare strings with equals(), not ==
                    if (event.asEndElement().getName().getLocalPart().equals("item")) {
                        System.out.println("--end of an item\n");
                        item = null;
                    }
                }
            }
            eventReader.close();
        }
    }
}
To get only the first "thetext" content, the code creates an Item object when the reader reaches the start of an "item" element, and sets that object's firstText member only while the member is still null. This makes sure that only the first "thetext" in each item is handled; later ones in the same item are skipped.
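The same first-match idea can be tried on a small in-memory document before pointing it at a large file. The following is only a sketch under a few assumptions: the class name FirstTextSketch, the helper firstTexts, and the shortened sample XML are made up for illustration, and a boolean flag stands in for the Item object used in the complete code.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

class FirstTextSketch {

    // Hypothetical helper: maps each item's id to its first "thetext" content.
    static Map<String, String> firstTexts(String xml) throws XMLStreamException {
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        Map<String, String> result = new LinkedHashMap<>();
        String currentId = null; // id of the item being read, null outside an item
        boolean found = false;   // true once the current item's first text is stored

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                StartElement start = event.asStartElement();
                String name = start.getName().getLocalPart();
                if (name.equals("item")) {
                    Attribute id = start.getAttributeByName(new QName("id"));
                    currentId = (id == null) ? "" : id.getValue();
                    found = false;
                } else if (name.equals("thetext") && currentId != null && !found) {
                    // Only the first "thetext" of the current item reaches this line.
                    result.put(currentId, reader.getElementText());
                    found = true;
                }
            } else if (event.isEndElement()
                    && event.asEndElement().getName().getLocalPart().equals("item")) {
                currentId = null;
            }
        }
        reader.close();
        return result;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Shortened sample, assumed for illustration only.
        String xml = "<config>"
                + "<item id=\"1\"><long_desc><thetext>first</thetext></long_desc>"
                + "<long_desc><thetext>second</thetext></long_desc></item>"
                + "<item id=\"2\"><long_desc><thetext>third</thetext></long_desc></item>"
                + "</config>";
        System.out.println(firstTexts(xml)); // prints {1=first, 2=third}
    }
}
```

Here getAttributeByName replaces the attribute loop, which is a little shorter when only one attribute is needed. Whether a flag or a small object is the better fit probably depends on how much per-item state you need to keep.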
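In brief, the StAX API is convenient to use, but it takes some time to understand how the event-driven model works and why it uses less memory. The reader hands over one event at a time instead of building a tree of the whole document, so memory use does not grow with file size the way it does with a DOM parse.
Oracle's Java EE tutorial covers StAX in more detail: http://download.oracle.com/javaee/5/tutorial/doc/bnbec.html#bnbeh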
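To see what "one event at a time" looks like, it can help to print the events the reader produces for a tiny document. This is a minimal sketch, not part of the solution above: the class name EventTraceSketch and the helper trace are hypothetical, and coalescing is switched on here so that each run of text arrives as a single event.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

class EventTraceSketch {

    // Hypothetical helper: lists the element and text events in document order.
    static List<String> trace(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Assumption for this sketch: merge adjacent text into one Characters event.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLEventReader reader = factory.createXMLEventReader(new StringReader(xml));

        List<String> events = new ArrayList<>();
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                events.add("start:" + event.asStartElement().getName().getLocalPart());
            } else if (event.isEndElement()) {
                events.add("end:" + event.asEndElement().getName().getLocalPart());
            } else if (event.isCharacters()) {
                events.add("text:" + event.asCharacters().getData());
            }
            // Other event types (start and end of document, comments) are skipped.
        }
        reader.close();
        return events;
    }

    public static void main(String[] args) throws XMLStreamException {
        List<String> events = trace("<item id=\"1\"><thetext>hello</thetext></item>");
        // prints [start:item, start:thetext, text:hello, end:thetext, end:item]
        System.out.println(events);
    }
}
```

The list is collected only to make the order visible. In real streaming code each event would be handled and dropped as it arrives, which is where the memory saving comes from.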