
Java OutOfMemoryError Analysis–A Case Study

May 26, 2014 at 12:44 am | Blog

 

This topic came to me while I was doing a stress test on our Java EE project. The application was supposed to support parsing an XML file containing more than 10,000 records; with 10,000 records the file is about 26 MB on disk. When I tested an XML file with 6,000 records, I got the famous OutOfMemoryError: java.lang.OutOfMemoryError: Java heap space. That file is about 15 MB on disk and holds roughly 15,000,000 characters (on disk, each character occupies a single byte in this encoding).

Then I looked into the code, which had been there for a while before I joined the team. The exception was thrown at line 4 of the following code.

public static Document parse(String xml) throws Exception {
        if (xml == null)
                return null;
        xml = xml.replaceFirst("iso8859-1", "iso-8859-1"); // line 4: the OOM was thrown here
        DocumentBuilder builder = null;
        Document doc = null;
        // dbFactory is presumably a ThreadLocal<DocumentBuilderFactory> field (declaration not shown)
        if (dbFactory.get() == null) {
                dbFactory.set(DocumentBuilderFactory.newInstance());
        }
        dbFactory.get().setNamespaceAware(true);
        builder = dbFactory.get().newDocumentBuilder();
        InputStream is = new ByteArrayInputStream(xml.getBytes());
        doc = builder.parse(is);
        return doc;
}

Line 4 corrects a wrong encoding name to the right one. It uses String's replaceFirst method, which returns another string: because String is immutable, a new String object is created on the heap instead of the existing one being modified.

                xml = xml.replaceFirst("iso8859-1", "iso-8859-1");
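
The immutability is easy to see in isolation (a toy snippet of my own, not from the project):

String original = "<?xml version=\"1.0\" encoding=\"iso8859-1\"?><root/>";
String corrected = original.replaceFirst("iso8859-1", "iso-8859-1");
System.out.println(original == corrected); // prints false: a brand-new String was allocated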

As we know, a String object itself holds an array of characters:

private final char value[];

So basically, the application throws the OutOfMemoryError while allocating memory for that array of characters. In Java each character takes two bytes, so in our case the memory needed for this string is about 30 MB (even more, in fact, due to the object overhead of the String maintaining these characters).
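
The back-of-the-envelope arithmetic (my own sketch, using the sizes from this case) looks like this:

long chars = 15000000L;               // characters in the 15 MB file
long oldString = chars * 2;           // ~30 MB: Java chars are 2 bytes each (UTF-16)
long newString = chars * 2;           // replaceFirst allocates a second ~30 MB copy
System.out.println((oldString + newString) / (1024 * 1024) + " MB live at once"); // ~57 MB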

Why did the OutOfMemoryError happen? Was it because there was not enough free heap space, or because there was no contiguous block of space?

Before answering that question, we should prepare ourselves with some knowledge of garbage collection and the JVM heap.

The JVM heap is divided into several regions: the young generation, the tenured generation (old generation), and the permanent generation. Different garbage collectors are applied to the different heap spaces.

The young generation is where almost all new objects are allocated and aged. It is further divided into the eden space and the survivor space, and the survivor space itself into S0 and S1. More specifically, new objects are allocated in eden. When eden fills up, a minor garbage collection is triggered: referenced objects are moved to the first survivor space (S0), and unreferenced objects are discarded as eden is cleared. At the next minor GC the same thing happens for eden, except that this time surviving objects are moved to the second survivor space (S1). In addition, objects that survived the previous minor GC in S0 have their age incremented and are moved to S1 as well. Once all surviving objects are in S1, both S0 and eden are cleared. Notice that we now have objects of different ages in the survivor space. S0 and S1 swap roles at each minor GC.
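
You can watch minor collections happen with a toy program (my own sketch, not from the project). Run it with the standard -verbose:gc flag and the JVM prints a line for each minor GC:

public class MinorGcDemo {
        public static void main(String[] args) {
                // fill eden with short-lived arrays; each time eden fills up,
                // a minor GC runs and -verbose:gc logs it
                for (int i = 0; i < 100000; i++) {
                        byte[] garbage = new byte[64 * 1024]; // becomes unreferenced immediately
                }
        }
}

Run it as: java -verbose:gc MinorGcDemo. Because the arrays die young, they never reach a survivor space, let alone the old generation.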

The old generation is used to store long-lived objects. Typically, a tenuring threshold is set for young-generation objects; when an object reaches that age, it is moved from the survivor space into the old generation. Large objects may also be allocated in the tenured space directly, since the young generation is relatively small compared with it. When the old generation fills up, it triggers a major collection, which involves the entire object heap.

The permanent generation is used to store metadata required by the JVM to describe the classes and methods used in the application. It is populated by the JVM at runtime based on the classes in use by the application. In addition, Java SE library classes and methods may be stored here.

There are three different garbage collectors available in the Java HotSpot VM:

The Serial GC
The Parallel GC
The Concurrent Mark Sweep (CMS) Collector

The serial collector uses a single thread to perform all garbage collection work. The parallel collector performs minor collections in parallel, though it uses a serial algorithm for major collections. Since JDK 5, a server-class JVM uses the parallel collector by default. The CMS collector performs most of its work concurrently with the application. The major difference between CMS and the other two is that CMS does not compact memory: the serial and parallel collectors move surviving objects to one end of the space and leave the other end as a single free block, so there is no fragmentation.
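
If you want to choose a collector explicitly instead of taking the default, HotSpot accepts these standard flags:

java -XX:+UseSerialGC ... (the serial collector)
java -XX:+UseParallelGC ... (the parallel collector)
java -XX:+UseConcMarkSweepGC ... (the CMS collector)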

You can use the following command to see which garbage collector your JVM uses:

java -XX:+PrintCommandLineFlags -version

In my case, the JVM was using the default parallel GC. At the beginning, I suspected the OOM occurred because there was no contiguous block of memory for the char[] in the tenured generation, even though enough total memory was still free. However, after checking the GC setup, CMS was not configured, so that assumption does not hold: the parallel collector compacts the heap, leaving its free space contiguous. It had to be the size of the heap; there was simply not enough free space to allocate the large string.
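
Two standard HotSpot options are handy for confirming this kind of diagnosis (MyApp below is just a placeholder for your main class). The first prints the effective heap limits; the second raises them:

java -XX:+PrintFlagsFinal -version | grep -i heapsize
java -Xms256m -Xmx512m MyApp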

Getting back to my code, the fix for the OOM is quite obvious: don't allocate another copy of the string just to correct the encoding name, because the parser gives us a chance to set the encoding of the byte stream explicitly. The InputSource documentation explains:

The SAX parser will use the InputSource object to determine how to read XML input. If there is a character stream available, the parser will read that stream directly, disregarding any text encoding declaration found in that stream. If there is no character stream, but there is a byte stream, the parser will use that byte stream, using the encoding specified in the InputSource or else (if no encoding is specified) autodetecting the character encoding using an algorithm such as the one in the XML specification.

For us it is a byte stream, so we can tell the SAX parser to use "iso-8859-1" explicitly. The fixed code follows:

public static Document parse(String xml) throws Exception {
        if (xml == null)
                return null;
        DocumentBuilder builder = null;
        Document doc = null;
        if (dbFactory.get() == null) {
                dbFactory.set(DocumentBuilderFactory.newInstance());
        }
        dbFactory.get().setNamespaceAware(true);
        builder = dbFactory.get().newDocumentBuilder();
        InputStream is = new ByteArrayInputStream(xml.getBytes());
        InputSource source = new InputSource(is);
        source.setEncoding("iso-8859-1"); // tell the parser the encoding instead of rewriting the string
        doc = builder.parse(source);
        return doc;
}

After the fix, I successfully tested the parser with an XML file of 7,000 records.

However, the OOM happened again when I tested 8,000 records. This time, the exception was thrown from the line that converts the string into a byte array:

InputStream is = new ByteArrayInputStream(xml.getBytes());

The code converts the XML string into a byte array and then wraps it in a ByteArrayInputStream for the SAX parser. This is the same kind of problem we had at the beginning: the XML string already occupies more than 30 MB, and getBytes() allocates yet another large buffer on top of it (roughly 15 MB here, one byte per character with a single-byte encoding). Having the whole XML in memory and then making another copy of it makes no sense; it is purely a waste of memory.

The way to resolve this issue is to provide a character stream to the SAX parser instead of a byte stream. This is the final version of the code:

public static Document parse(String xml) throws Exception {
        if (xml == null)
                return null;
        DocumentBuilder builder = null;
        Document doc = null;
        if (dbFactory.get() == null) {
                dbFactory.set(DocumentBuilderFactory.newInstance());
        }
        dbFactory.get().setNamespaceAware(true);
        builder = dbFactory.get().newDocumentBuilder();
        // instead of copying the string into a byte InputStream, use a character stream
        // InputStream is = new ByteArrayInputStream(xml.getBytes());
        StringReader reader = new StringReader(xml);
        InputSource source = new InputSource(reader);
        // with a character stream available, the parser reads it directly,
        // so no encoding needs to be set
        // source.setEncoding("iso-8859-1");
        doc = builder.parse(source);
        return doc;
}

The solution may look plain, but the process of analyzing the problem, diving into it, and picking up the background knowledge was enjoyable and genuinely helpful. That is what we call learning from failure.

The method takes a String only because the API was already designed that way. Otherwise, we could start from an InputStream and parse the XML file directly, saving even more memory.
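
For illustration, a version that parses straight from a stream might look like this (a sketch under the assumption that we may change the method signature; it is not the project's actual API):

public static Document parse(InputStream in, String encoding) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        InputSource source = new InputSource(in); // stream the bytes; no giant String or byte[] copy
        source.setEncoding(encoding);             // e.g. "iso-8859-1"
        return builder.parse(source);
}

Called as parse(new FileInputStream("records.xml"), "iso-8859-1"), it never buffers the document as a String, although the resulting DOM tree itself still lives in memory.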

References:
Java Garbage Collection Basics
Memory Management in the JVM

 

Oracle Implicit Data Type Conversion

May 6, 2014 at 4:58 pm | Blog

 

While I was debugging an issue in one of our Java EE projects, I found this topic quite interesting.

The exception the QA got was: "ORA-01830: date format picture ends before converting entire input string."

The stored procedure that threw this exception has a WHERE clause like this:

WHERE PROCESSING_DATE = P_DATE_TIME

P_DATE_TIME is the parameter passed in to this stored procedure, and its value is something like 30-Apr-2014 00:00. It is obviously a string in the timestamp format 'dd-MON-yyyy hh24:mi'.

PROCESSING_DATE is a column defined with the DATE type:

Name                          NULL?    TYPE
----------------------------- -------- --------------
PROCESSING_DATE                        DATE

So in the above clause, WHERE PROCESSING_DATE = P_DATE_TIME, Oracle tries to implicitly convert '30-Apr-2014 00:00' to a DATE using the session's NLS_DATE_FORMAT. You can check that setting with:

SELECT value FROM nls_session_parameters WHERE parameter = 'NLS_DATE_FORMAT';

In my case it is DD-MON-RR.
So an input value of 30-Apr-2014 causes no issue; Oracle implicitly converts the varchar to a DATE (30/Apr/2014, 30-April-2014, and 30-Apr-14 all work as well, since Oracle is lenient about delimiters and month spellings). But '30-Apr-2014 00:00' fails: the DD-MON-RR format picture is exhausted before the trailing ' 00:00' has been consumed, which is exactly what ORA-01830 is complaining about.
The best approach, however, is to convert the varchar to a DATE explicitly.
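
You can reproduce both behaviors directly (a sketch using DUAL; nothing project-specific):

-- raises ORA-01830: the DD-MON-RR picture ends before ' 00:00' is consumed
SELECT TO_DATE('30-Apr-2014 00:00', 'DD-MON-RR') FROM dual;

-- succeeds: the mask covers the entire input string
SELECT TO_DATE('30-Apr-2014 00:00', 'dd-MON-yyyy hh24:mi') FROM dual;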


Oracle recommends that you specify explicit conversions, rather than rely on implicit or automatic conversions, for these reasons:

• SQL statements are easier to understand when you use explicit datatype conversion functions.

• Implicit datatype conversion can have a negative impact on performance, especially if the datatype of a column value is converted to that of a constant rather than the other way around (see the example after this list).

• Implicit conversion depends on the context in which it occurs and may not work the same way in every case. For example, implicit conversion from a datetime value to a VARCHAR2 value may return an unexpected year depending on the value of the NLS_DATE_FORMAT parameter.

• Algorithms for implicit conversion are subject to change across software releases and among Oracle products. Behavior of explicit conversions is more predictable.
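
To illustrate the performance point (T and CHAR_COL are hypothetical names): when a VARCHAR2 column is compared with a number, Oracle converts the column, effectively rewriting the predicate as TO_NUMBER(CHAR_COL) = 123, which prevents a normal index on CHAR_COL from being used:

-- implicit: the column is converted, so an index on char_col is ignored
SELECT * FROM t WHERE char_col = 123;

-- explicit, typed literal: the index on char_col remains usable
SELECT * FROM t WHERE char_col = '123';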

To resolve this issue, the where clause should be:

WHERE PROCESSING_DATE = TO_DATE(P_DATE_TIME, 'dd-MON-yyyy hh24:mi')

This explicitly converts the input parameter to a DATE, and nothing else needs to change.