Sets
Set<E> interface
Sometimes, we want to work with data while we do not care about the order elements were added or how often they occur.
In such cases, we should use a Set
instead of a List
. The Set interface describes functionality related to sets,
similar to the concept of a set in mathematics. That means that a Set
either contains or does not contain an
object/element, but that these elements are not organized sequentially, nor that it is possible to have multiple copies
of the same element/object in the set. In Java Set
objects, duplicate element are filtered using the equals
method.
For example, when finding repeated values in a set of data or when keeping track of whether we have seen something before,
a Set
is very useful, as adding multiple objects that are equal according to their equals
implementation will make
sure only a single one of them is contained in the set.
Set data structures are thus different from lists. They specialize in checking efficiently whether an element is in the set, without any notion of position or indices on these elements.
The Set<E>
interface extends the Collection<E>
interface, but does not add any methods on top of it.
The semantics, i.e. meaning, of a class that implements the Set
interface is that there are no repeated values.
Since a Set
behaves differently from a List
, we can interpret some operations in the Collection<E>
interface
as operations on mathematical sets:
- The set union (𝑆 ∪ 𝑇) can be performed via the addAll method:
public boolean addAll(Collection<? extends E> arg0);
- The set intersection (𝑆 ∩ 𝑇) can be performed via the retainAll method:
public boolean retainAll(Collection<?> arg0);
- The set difference (S \ 𝑇) can be performed via the removeAll method:
public boolean removeAll(Collection<?> arg0);
- Checking for subsets (S ⊆ 𝑇) can be done via the containsAll method:
public boolean containsAll(Collection<?> arg0);
Since sets do not have a notion of position or sequence, it is not possible to iterate over the elements of the set via indices. Here's how to go through the elements of a set.
Set<String> set = new HashSet<>();
set.add("one");
set.add("one");
set.add("two");
for (String element: set) {
System.out.println(element);
}
one two
Note that Set
does not assume or guarantee anything about the order of a set of elements.
A Set
only knows which elements are contained in the Set
, and which elements are
not.
If you iterate over a Set
, the elements may show up in a completely different order
than the order in which you added them, and for some types of Set
it may even
seem different every time you run your program. This is very different from List
objects, where the order of the elements is dictated by their indices.
HashSet<E> class
As an example, the set interface is implemented by HashSet
. In order to check if an element is already in the set, you could iterate over all elements in the set.
However, this require a lot of work, as in the worst case all elements in the set have to be checked, which can become very slow for large datasets.
The HashSet<E>
class will be much more efficient by clever use of the contract between equals()
and hashCode()
by making sure only a small number of elements
need to be considered and compared.
The basic idea is that the HashSet
maintains a list of m buckets.
When a new element o is added, it is added to the bucket with index o.hashCode() % m
.
When we check whether an element is already in the Set
, we only need to compare the element to other elements in its bucket.
Remember that if two objects are equal according to their equals
method, it must be the case that their hashCode
is equal.
As a result, the bucket with the same index will be used when two objects are equal according to equals
, and we know for
sure that objects stored in the other buckets can never be equal according to their equals
methods, assuming they respect
the rules in the contract between equals
and hashCode
.
Here is a visual example from the slides that shows how a HashSet
with six objects is organized in the memory:
The HashSet
class has the following two constructors:
public HashSet()
public HashSet(Collection<? extends E> c)
If the hash codes of the objects added to a HashSet
are distributed uniformly, they will distribute nicely over the different buckets with high probability, and it will be very efficient. To work properly, HashSet
really depends on the contracts for hashCode()
and equals()
. Calling the contains() method on a HashSet
with a 10,000 elements will be a lot faster than calling it on a List
, approximately 100 to 1,000 times faster.
SortedSet<E> interface
The HashSet<E>
stores its elements in a more or less random order, depending on the current number of buckets and the hashCodes of the objects.
If there is some order available, as defined by an implementation of the Comparable
interface (natural order) or the Comparator
interface,
we may want to obtain the minimum or maximum element, a subset of all objects in the set greater than some object fromElement
or a subset smaller than some object toElement
.
The SortedSet<E>
interface is a sub-interface of Set
that defines the following additional methods:
E first();
E last();
SortedSet<E> headSet(E toElement);
SortedSet<E> tailSet(E fromElement);
SortedSet<E> subSet(E fromElement, E toElement);
Since the HashSet
interface only implements the Set
interface, but not the SortedSet
interface, these methods are not available on a HashSet
.
However, HashSet
is not the only Set
offered in the Collections framework.
TreeSet<E> class
TreeSet<E>
implements the SortedSet<E>
interface and requires some order of the elements of type E
. This can be the natural order, if E implements Comparable<E>
, or it can be provided by passing a Comparator<E>
to the constructor.
The TreeSet
stores its elements in the nodes of a binary tree, such that the following properties hold at all time:
- All nodes in the left subtree of the node hold smaller elements
- All nodes in the right subtree of the node hold greater elements
The advantage of this way to organize the data is that if the tree is balanced, i.e. for each node the subtree on its left side contains roughly as many nodes as the subtree on its right side, in every step half of the remaining elements in the tree can be ignored. The number of steps needed to detect if an element is present or not then becomes the base-2 logarithm of the number of elements stored in the tree. This results in a huge speed up: searching a balanced tree of a million element requires that we only check 20 elements. While these details are beyond the scope of this course, clever tricks exist to make sure that the tree will be kept relatively balanced when elements are inserted or removed.
Below is an example of how a TreeSet
could organize seven integer values in the computer's memory with this approach. In practice, the value variable in each node would be a reference to some object,
which is then compared using the Comparable
/Comparator
implementation.
The TreeSet<E>
class has the advantage that iterating over the elements happens according to the specified order, and that minimum/maximum elements and ranged subsets can be selected efficiently.
However, in practice, HashSet
is faster (unless you end up in the very unlikely case that all objects end up in the same bucket). However, HashSet
s have random order, while a TreeSet
has a reliable order,
as defined by the Comparable
/Comparator
used by the TreeSet
. Therefore, iterating over a TreeSet<String>
would return all the elements in an alphabetical order, regardless of the order in which the elements
were inserted. TreeSet
does require the order defined in the compareTo
/compare
methods to be consistent with equals
in order to work correctly.
Three of TreeSet
’s constructors are:
public TreeSet()
public TreeSet(Comparator<? super E> comparator)
public TreeSet(Collection<? extends E> c)
Test your knowledge
In this quiz, you can test your knowledge on the subjects covered in this chapter.
What is the difference between a Set
and List
? Is there a difference in the
methods that can be called on these two types?
Will the following code run without problems? If yes, what will it print. If not, what is the problem?
Set<String> mySet = Set.of("hello", "there", "set");
for (int i=0; i < mySet.size(); i++) {
System.out.println(mySet.get(i));
}
What is the difference between a HashSet
and a TreeSet
? What assumptions are needed to make
sure a HashSet
or a TreeSet
work correctly?
Suppose you get a set of String
objects containing different words, and you want to find all
the words that are between the words "d"
and "k"
according to an alphabetical order. Would
you choose to use a HashSet
or a TreeSet
to find these words?
What methods can you use to compute a set difference, a set union and a set intersection?