Java: Read Large XML by Using StAX

The following code shows how to read an XML file in Java with the StAX API, which reads XML sequentially instead of loading the whole document into memory. If reading a large XML file gives you an OutOfMemoryError, the code below should solve the problem. Because the file is processed one event at a time, this approach scales to very large XML files, such as 10 GB or 20 GB.

Problem

From the following XML file, get the "id" and the first "thetext" of each item. This is related to the general question of how to get the first elements in an XML file.

In the following section, I give the complete code to parse the XML file by using StAX, and explain it briefly.

<?xml version="1.0" encoding="UTF-8"?>
<config>
	<item id="1">
		<mode>1</mode>
		<long_desc isprivate="0">
			<who name="Andy">[email protected]</who>
			<bug_when>2001-10-10 21:34:46 -0400</bug_when>
			<thetext> Setup a project</thetext>
		</long_desc>
 
		<long_desc isprivate="0">
			<who name="Mike">[email protected]</who>
			<bug_when>2001-10-10 21:34:46 -0400</bug_when>
			<thetext>- Setup</thetext>
		</long_desc>
		<long_desc isprivate="0">
			<who name="Gary">[email protected]</who>
			<bug_when>2001-10-10 21:34:46 -0400</bug_when>
			<thetext>project</thetext>
		</long_desc>
 
	</item>
 
	<item id="2">
		<mode>2</mode>
		<long_desc isprivate="0">
			<who name="John">[email protected]</who>
			<bug_when>2001-10-10 21:34:46 -0400</bug_when>
			<thetext> Setup a project</thetext>
		</long_desc>
 
		<long_desc isprivate="0">
			<who name="Bill">[email protected]</who>
			<bug_when>2001-10-10 21:34:46 -0400</bug_when>
			<thetext>- Setup</thetext>
		</long_desc>
 
		<long_desc isprivate="0">
			<who name="Rick">[email protected]</who>
			<bug_when>2001-10-10 21:34:46 -0400</bug_when>
			<thetext>project</thetext>
		</long_desc>
 
	</item>
</config>

Solution

Complete Code:

import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.InputStream;
import java.util.Iterator;
import javax.xml.stream.FactoryConfigurationError;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.EndElement;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;
 
class Item{
	private String firstText = null;
 
	public void setFirstText(String str){
		firstText =  str;
	}
 
	// Returns null until the first text has been set
	public String getFirstText(){
		return firstText;
	}
}
 
public class Main {
	public static void main(String[] args) throws FileNotFoundException,
			XMLStreamException, FactoryConfigurationError {
		// First create a new XMLInputFactory
		XMLInputFactory inputFactory = XMLInputFactory.newInstance();
 
		// Optional: ask the parser to merge adjacent character data into one event
		// inputFactory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
 
		// Setup a new eventReader
		InputStream in = new FileInputStream("/usa/xiwang/Desktop/config");
		XMLEventReader eventReader = inputFactory.createXMLEventReader(in);
 
		Item item = null;
 
		while (eventReader.hasNext()) {
			XMLEvent event = eventReader.nextEvent();
 
			//reach the start of an item
			if (event.isStartElement()) {
 
				StartElement startElement = event.asStartElement();
 
				if (startElement.getName().getLocalPart().equals("item")) {
					item = new Item();
					System.out.println("--start of an item");
					// attribute
					Iterator<Attribute> attributes = startElement.getAttributes();
					while (attributes.hasNext()) {
						Attribute attribute = attributes.next();
						if (attribute.getName().toString().equals("id")) {
							System.out.println("id = " + attribute.getValue());
						}
					}
				}
 
				// data: handle only the first "thetext" inside the current item
				if (item != null
						&& startElement.getName().getLocalPart().equals("thetext")
						&& item.getFirstText() == null) {
					event = eventReader.nextEvent();
					// an empty element is followed by its end tag, not by characters
					if (event.isCharacters()) {
						String text = event.asCharacters().getData();
						System.out.println("thetext: " + text);
						item.setFirstText(text);
					}
					continue;
				}
			}
 
			//reach the end of an item
			if (event.isEndElement()) {
				EndElement endElement = event.asEndElement();
				// compare string contents with equals(), not references with ==
				if (endElement.getName().getLocalPart().equals("item")) {
					System.out.println("--end of an item\n");
					item = null;
				}
			}
 
		}
	}
}

To get only the first "thetext" of each item, the code creates an Item object when it reads the start of an "item" element, and stores a value in the object's firstText member only while that member is still null. Any later "thetext" in the same item is ignored, which makes sure that only the first "thetext" data is used. The object is dropped again at the end of the item.
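The same idea can be sketched as a small self-contained program that collects the results instead of printing them. This is only an illustration under a few assumptions: the class name FirstTextSketch, the method firstTexts, and the short in-memory XML string are made up for the example, and it reads the text with XMLEventReader.getElementText() rather than with a single nextEvent() call as the code above does.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical class name, used only for this sketch
class FirstTextSketch {

	// Returns a map from each item's "id" to the text of its first "thetext"
	static Map<String, String> firstTexts(String xml) throws XMLStreamException {
		Map<String, String> result = new LinkedHashMap<>();
		XMLEventReader reader = XMLInputFactory.newInstance()
				.createXMLEventReader(new StringReader(xml));
		String currentId = null; // id of the item being read, null outside an item
		boolean found = false;   // true once the current item's first text is stored

		while (reader.hasNext()) {
			XMLEvent event = reader.nextEvent();
			if (event.isStartElement()) {
				StartElement start = event.asStartElement();
				String name = start.getName().getLocalPart();
				if (name.equals("item")) {
					Attribute id = start.getAttributeByName(new QName("id"));
					currentId = (id == null) ? "" : id.getValue();
					found = false;
				} else if (name.equals("thetext") && currentId != null && !found) {
					// getElementText() reads the text up to the matching end tag
					result.put(currentId, reader.getElementText());
					found = true;
				}
			} else if (event.isEndElement()
					&& event.asEndElement().getName().getLocalPart().equals("item")) {
				currentId = null;
			}
		}
		reader.close();
		return result;
	}

	public static void main(String[] args) throws XMLStreamException {
		String xml = "<config>"
				+ "<item id=\"1\"><long_desc><thetext>first</thetext></long_desc>"
				+ "<long_desc><thetext>second</thetext></long_desc></item>"
				+ "<item id=\"2\"><long_desc><thetext>other</thetext></long_desc></item>"
				+ "</config>";
		System.out.println(firstTexts(xml)); // prints {1=first, 2=other}
	}
}
```

For a real file, the StringReader would be replaced by a FileInputStream, as in the complete code above. The memory used then depends on the number of items collected in the map, not on the size of the file.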

In brief, the StAX API is convenient to use, but it still takes some time to understand how the event-driven model works and why it uses less memory.
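A rough way to see the memory point is a loop that keeps nothing but a counter. The sketch below uses the StAX cursor API (XMLStreamReader) instead of the event API used in the complete code, because it makes the "pull one event, then forget it" pattern easy to see. The class name StreamingCountSketch, the method countElements, and the sample XML are assumptions made for the example.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical class name, used only for this sketch
class StreamingCountSketch {

	// Counts elements with the given local name; only the counter is kept
	static int countElements(String xml, String localName) throws XMLStreamException {
		XMLStreamReader reader = XMLInputFactory.newInstance()
				.createXMLStreamReader(new StringReader(xml));
		int count = 0;
		try {
			while (reader.hasNext()) {
				// next() advances to the next event and returns its type
				if (reader.next() == XMLStreamConstants.START_ELEMENT
						&& reader.getLocalName().equals(localName)) {
					count++;
				}
			}
		} finally {
			reader.close();
		}
		return count;
	}

	public static void main(String[] args) throws XMLStreamException {
		String xml = "<config><item id=\"1\"/><item id=\"2\"/><item id=\"3\"/></config>";
		System.out.println(countElements(xml, "item")); // prints 3
	}
}
```

The application asks for one event at a time and the code above keeps no reference to earlier events, so the program's own state does not grow with the input. A DOM parser, by contrast, builds the whole document tree in memory before the application can read any of it.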

Here is the Java EE tutorial from Oracle: http://download.oracle.com/javaee/5/tutorial/doc/bnbec.html#bnbeh

  • ryanlr

    Corrected. Thanks.

  • guest

    Wait,
    “startElement.getName().getLocalPart() == “item” ”
    Really?