Week5

Sets

Set<E> interface

Sometimes, we want to work with data while we do not care about the order elements were added or how often they occur. In such cases, we should use a Set instead of a List. The Set interface describes functionality related to sets, similar to the concept of a set in mathematics. That means that a Set either contains or does not contain an object/element, but that these elements are not organized sequentially, nor that it is possible to have multiple copies of the same element/object in the set. In Java Set objects, duplicate element are filtered using the equals method. For example, when finding repeated values in a set of data or when keeping track of whether we have seen something before, a Set is very useful, as adding multiple objects that are equal according to their equals implementation will make sure only a single one of them is contained in the set.

Set data structures are thus different from lists. They specialize in checking efficiently whether an element is in the set, without any notion of position or indices on these elements.

The Set<E> interface extends the Collection<E> interface, but does not add any methods on top of it. The semantics, i.e. meaning, of a class that implements the Set interface is that there are no repeated values. Since a Set behaves differently from a List, we can interpret some operations in the Collection<E> interface as operations on mathematical sets:

  • The set union (𝑆 ∪ 𝑇) can be performed via the addAll method: public boolean addAll(Collection<? extends E> arg0);
  • The set intersection (𝑆 ∩ 𝑇) can be performed via the retainAll method: public boolean retainAll(Collection<?> arg0);
  • The set difference (S \ 𝑇) can be performed via the removeAll method: public boolean removeAll(Collection<?> arg0);
  • Checking for subsets (S ⊆ 𝑇) can be done via the containsAll method: public boolean containsAll(Collection<?> arg0);

Since sets do not have a notion of position or sequence, it is not possible to iterate over the elements of the set via indices. Here's how to go through the elements of a set.

Set<String> set = new HashSet<>();
set.add("one");
set.add("one");
set.add("two");

for (String element: set) {
    System.out.println(element);
}
Sample output

one two

Note that Set does not assume or guarantee anything about the order of a set of elements. A Set only knows which elements are contained in the Set, and which elements are not. If you iterate over a Set, the elements may show up in a completely different order than the order in which you added them, and for some types of Set it may even seem different every time you run your program. This is very different from List objects, where the order of the elements is dictated by their indices.

HashSet<E> class

As an example, the set interface is implemented by HashSet. In order to check if an element is already in the set, you could iterate over all elements in the set. However, this require a lot of work, as in the worst case all elements in the set have to be checked, which can become very slow for large datasets. The HashSet<E> class will be much more efficient by clever use of the contract between equals() and hashCode() by making sure only a small number of elements need to be considered and compared.

The basic idea is that the HashSetmaintains a list of m buckets. When a new element o is added, it is added to the bucket with index o.hashCode() % m. When we check whether an element is already in the Set, we only need to compare the element to other elements in its bucket. Remember that if two objects are equal according to their equals method, it must be the case that their hashCode is equal. As a result, the bucket with the same index will be used when two objects are equal according to equals, and we know for sure that objects stored in the other buckets can never be equal according to their equals methods, assuming they respect the rules in the contract between equals and hashCode.

Here is a visual example from the slides that shows how a HashSet with six objects is organized in the memory:

In the picture six buckets are depicted. Object 1 and 4 are in the first bucket, having the same Hash. Object 1 contains string aab and object 4 contains aabaab. The other buckets either contain one or zero objects, with all different hash codes and (slightly) different string elements.

The HashSet class has the following two constructors:

public HashSet()
public HashSet(Collection<? extends E> c)

If the hash codes of the objects added to a HashSet are distributed uniformly, they will distribute nicely over the different buckets with high probability, and it will be very efficient. To work properly, HashSet really depends on the contracts for hashCode() and equals(). Calling the contains() method on a HashSet with a 10,000 elements will be a lot faster than calling it on a List, approximately 100 to 1,000 times faster.

SortedSet<E> interface

The HashSet<E> stores its elements in a more or less random order, depending on the current number of buckets and the hashCodes of the objects. If there is some order available, as defined by an implementation of the Comparable interface (natural order) or the Comparator interface, we may want to obtain the minimum or maximum element, a subset of all objects in the set greater than some object fromElement or a subset smaller than some object toElement. The SortedSet<E> interface is a sub-interface of Set that defines the following additional methods:

E first();
E last();
SortedSet<E> headSet(E toElement);
SortedSet<E> tailSet(E fromElement);
SortedSet<E> subSet(E fromElement, E toElement);

Since the HashSet interface only implements the Set interface, but not the SortedSet interface, these methods are not available on a HashSet. However, HashSet is not the only Set offered in the Collections framework.

TreeSet<E> class

TreeSet<E> implements the SortedSet<E> interface and requires some order of the elements of type E. This can be the natural order, if E implements Comparable<E>, or it can be provided by passing a Comparator<E> to the constructor. The TreeSet stores its elements in the nodes of a binary tree, such that the following properties hold at all time:

  • All nodes in the left subtree of the node hold smaller elements
  • All nodes in the right subtree of the node hold greater elements

The advantage of this way to organize the data is that if the tree is balanced, i.e. for each node the subtree on its left side contains roughly as many nodes as the subtree on its right side, in every step half of the remaining elements in the tree can be ignored. The number of steps needed to detect if an element is present or not then becomes the base-2 logarithm of the number of elements stored in the tree. This results in a huge speed up: searching a balanced tree of a million element requires that we only check 20 elements. While these details are beyond the scope of this course, clever tricks exist to make sure that the tree will be kept relatively balanced when elements are inserted or removed.

Below is an example of how a TreeSet could organize seven integer values in the computer's memory with this approach. In practice, the value variable in each node would be a reference to some object, which is then compared using the Comparable/Comparator implementation.

The TreeSet<Integer%gt; object contains the reference to the root node of the TreeSet and an integer value of the size. A node has a pointer to a node object with a smaller value, a pointer to a node object with a greater value and a pointer to the integer value of the node itself. In this example, the root node has two children and they also have two children each. If you go to the left from one node, you obtain node objects with smaller values. On the right, however, are nodes with greater values.

The TreeSet<E> class has the advantage that iterating over the elements happens according to the specified order, and that minimum/maximum elements and ranged subsets can be selected efficiently. However, in practice, HashSet is faster (unless you end up in the very unlikely case that all objects end up in the same bucket). However, HashSets have random order, while a TreeSet has a reliable order, as defined by the Comparable/Comparator used by the TreeSet. Therefore, iterating over a TreeSet<String> would return all the elements in an alphabetical order, regardless of the order in which the elements were inserted. TreeSet does require the order defined in the compareTo/compare methods to be consistent with equals in order to work correctly.

Three of TreeSet’s constructors are:

public TreeSet()
public TreeSet(Comparator<? super E> comparator)
public TreeSet(Collection<? extends E> c)
Exercise

Test your knowledge

In this quiz, you can test your knowledge on the subjects covered in this chapter.

What is the difference between a Set and List? Is there a difference in the methods that can be called on these two types?


Will the following code run without problems? If yes, what will it print. If not, what is the problem?

Set<String> mySet = Set.of("hello", "there", "set");
for (int i=0; i < mySet.size(); i++) {
    System.out.println(mySet.get(i));
}

What is the difference between a HashSet and a TreeSet? What assumptions are needed to make sure a HashSet or a TreeSet work correctly?


Suppose you get a set of String objects containing different words, and you want to find all the words that are between the words "d" and "k" according to an alphabetical order. Would you choose to use a HashSet or a TreeSet to find these words?


What methods can you use to compute a set difference, a set union and a set intersection?

You have reached the end of this section! Continue to the next section: