How do I de-duplicate strings?

In this Q&A, we'll go over how Strings get duplicated in Java and how to avoid duplication.

String duplication and de-duplication

Let's first go over how String objects are duplicated. new String() creates a new String object every time, even when the character content is identical. See the code snippet below:

String s = new String("abc");
String s1 = new String("abc");

assertEquals("Reference inequality", false, s==s1);
assertEquals("String equality", true, s.equals(s1));

Java has a String pool. The String.intern() method checks whether an equal String is already in the pool. If it is, intern() returns the pooled instance; if not, the String is added to the pool and a reference to it is returned.

String s2 = new String("abcd").intern();
String s3 = new String("abcd").intern();
assertEquals("Reference equality", true, s2==s3);
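String literals are placed in the pool automatically, so intern() on a heap copy returns the same instance as the literal. A minimal sketch of that behaviour follows; the class and method names are made up for illustration.

```java
// Illustrative only: class and method names are hypothetical.
class InternLiteralDemo {

    // Two identical literals resolve to the same pooled instance.
    static boolean literalsShareReference() {
        String a = "abcd";
        String b = "abcd";
        return a == b;
    }

    // A heap copy is a different object, but intern() maps it back to the pooled literal.
    static boolean internReturnsPooledLiteral() {
        String literal = "abcd";
        String copy = new String("abcd");
        return copy != literal && copy.intern() == literal;
    }

    public static void main(String[] args) {
        System.out.println(literalsShareReference());      // prints true
        System.out.println(internReturnsPooledLiteral());  // prints true
    }
}
```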

String pool performance

The pool behaves like a map of WeakReference objects. In Java 7 and later, a String in the pool becomes eligible for garbage collection once nothing else references it.
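To make the "map of weak references" analogy concrete, here is a rough user-level sketch built on WeakHashMap. It is not the JVM's implementation of the String pool; WeakStringPool and canonical() are hypothetical names used only for this example.

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical user-level pool; an analogy, NOT the JVM's internal String table.
class WeakStringPool {
    // Keys are held weakly by WeakHashMap. The value is wrapped in a WeakReference
    // so that the map entry does not keep its own key strongly reachable.
    private final Map<String, WeakReference<String>> pool = new WeakHashMap<>();

    // Returns the pooled instance equal to s, adding s to the pool if none exists.
    synchronized String canonical(String s) {
        WeakReference<String> ref = pool.get(s);
        String existing = (ref == null) ? null : ref.get();
        if (existing != null) {
            return existing;
        }
        pool.put(s, new WeakReference<>(s));
        return s;
    }

    synchronized int size() {
        return pool.size();
    }
}
```

Once the caller drops every reference to a pooled String, the entry can be cleared by the garbage collector, which mirrors the behaviour described above.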

In Java 7u40 and later, including Java 8, the default String pool size is 60013 buckets. Note that it's a prime number. That is by design: a prime bucket count tends to spread hash codes more evenly across the buckets, which keeps lookups and insertions fast.
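The general argument for a prime table size can be illustrated with a toy calculation. The sketch below is not how the JVM hashes Strings; it only assumes that a bucket is chosen as hash modulo table size and that the hash codes happen to follow a regular stride. BucketSpread and bucketsUsed are hypothetical names.

```java
import java.util.HashSet;
import java.util.Set;

// Toy illustration only; the JVM's real String table hashing is different.
class BucketSpread {

    // Counts the distinct buckets hit when the hashes 0, stride, 2*stride, ...
    // are reduced modulo tableSize.
    static int bucketsUsed(int tableSize, int stride, int count) {
        Set<Integer> used = new HashSet<>();
        for (int i = 0; i < count; i++) {
            used.add((i * stride) % tableSize);
        }
        return used.size();
    }

    public static void main(String[] args) {
        // Hashes that are multiples of 16 collapse into very few buckets of a 64-slot table...
        System.out.println(bucketsUsed(64, 16, 1000)); // prints 4
        // ...but reach every bucket of a 61-slot (prime) table.
        System.out.println(bucketsUsed(61, 16, 1000)); // prints 61
    }
}
```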

Java lets you size and inspect the String pool with JVM arguments. Use -XX:StringTableSize to set the String pool size and -XX:+PrintStringTableStatistics to view its statistics.
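A small load program is a convenient way to try these flags. The sketch below interns a number of distinct Strings; StringTableLoad is a hypothetical class name, and the invocation in the comment simply combines the two flags mentioned above.

```java
// Hypothetical load generator for experimenting with the String pool flags.
class StringTableLoad {

    // Interns `count` distinct strings and returns how many resolved to the
    // same pooled instance when interned a second time.
    static int internDistinct(int count) {
        int pooled = 0;
        for (int i = 0; i < count; i++) {
            String first = ("key-" + i).intern();
            String second = ("key-" + i).intern();
            if (first == second) {
                pooled++;
            }
        }
        return pooled;
    }

    public static void main(String[] args) {
        // Example invocation after compiling (the size value is the default quoted in the text):
        //   java -XX:StringTableSize=60013 -XX:+PrintStringTableStatistics StringTableLoad
        System.out.println(internDistinct(100_000));
    }
}
```

Running it with different -XX:StringTableSize values and comparing the printed statistics gives a feel for how the bucket count affects the pool.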


De-duplication without String.intern()

String.intern() is quite useful when the app developer has a good understanding of which Strings need to be interned. If that is not well known, Java provides an option to de-duplicate Strings by setting JVM command line arguments: -XX:+UseG1GC -XX:+UseStringDeduplication.

String de-duplication requires the G1 garbage collector (GC). In Java 8 it cannot be used with the parallel or concurrent mark sweep (CMS) collectors. Use the -XX:+PrintStringDeduplicationStatistics option to check String de-duplication stats.

This feature is available from Java 8u20. See JEP 192 for how de-duplication is done and how it might impact GC pause performance.
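Unlike intern(), de-duplication does not merge String objects. As JEP 192 describes, it makes equal Strings share one backing character array, so == between them stays false. The sketch below builds the kind of duplicate Strings the collector can work on; DedupCandidates is a hypothetical name, and whether and when de-duplication actually happens depends on the collector.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo that produces many equal but distinct String objects.
class DedupCandidates {

    // Builds `count` separate String objects with the same characters.
    // toCharArray() copies the characters, so each String starts with its own backing array.
    static List<String> duplicates(String value, int count) {
        List<String> list = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            list.add(new String(value.toCharArray()));
        }
        return list;
    }

    public static void main(String[] args) {
        // Example Java 8 invocation after compiling (flags from the text):
        //   java -XX:+UseG1GC -XX:+UseStringDeduplication -XX:+PrintStringDeduplicationStatistics DedupCandidates
        List<String> held = duplicates("abcd", 1_000_000);
        // Request a few collections so the Strings survive long enough to be considered.
        for (int i = 0; i < 5; i++) {
            System.gc();
        }
        // The String objects stay distinct even if their backing arrays end up shared.
        System.out.println(held.get(0) == held.get(1)); // prints false
    }
}
```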
