Java作业 | COM S 228, Fall 2020 Programming Project 5

本次Java作业是实现一个完整的哈希表

COM S 228, Fall 2020Programming Project 5

Generating a Perfect Hash Table Implementation

Problem Overview

Graphs are the most general of the common data structures; consider that a scalar object is a degenerate list,which is a degenerate tree, which is itself a degenerate graph, or in other words, every data structure we’ve discussed this semester can be represented as a graph. Due to their extreme generality, it’s rare that you’ll use a generic “graph” data structure. The limited set of things you might do with a list make two generic list implementations—an array based- and a linked-structure—good enough for most implementations, and it’s only if you are doing, e.g., low-level system programming that you might have a good reason to write a custom, one-off list. But there is so much variation in graphs that it is perhaps impossible to write a good and useful generic graph implementation.

Hash tables provide a means to get constant-time look-up performance on a collection. Usually, the data comes to us at runtime, often in an on-line fashion. In this case, we take what foreknowledge we can to inform best practices in hash table construction, choosing a hash function, etc., to build a data structure with as close to optimal performance as possible; however, if the keys in the table are known beforehand, it’s always possible to construct a perfect hash table, a table in which every key maps to a unique slot.There are many algorithms to construct such a table and the associated hash function; almost all of them are implemented using graphs. For this project, you will be implementing a perfect hash table construction algorithm1

This algorithm generates a pair of tables of random numbers. Keys are hashed with the values in each these tables, giving two integers (one for each table). These integers are then taken as node indices in a graph, with an edge between them. So long as the resultant graph—after hashing all of the keys—contains no cycles, the algorithm is ready to generate the hash table as a function of the tables (the T tables) used in the graph generation and a function g() on those tables.

Algorithm Example

NOTE: This algorithm looks pretty intimidating at first glance. With some careful reading and workingthrough of examples, you should find it’s actually fairly straightforward. That said, you don’t necessarily have to understand it at all! If you find yourself getting anxious about all this “scary-looking stuff”, jump down to the Requirements sections, read that, then come back here with a more relaxed mindset. It is very common for professional programmers to be required to implement solutions to problems that they don’t actually understand, so if you find yourself doing that, here or elsewhere, please know that it is normal. This is not to advocate for ignorance, however–it’s always better to understand something than not to understand it–only to recognize that sometimes the cost-benefit of understanding is too high! The example below was generated by hand, and uses an optimization that would be very difficult to implement in program control. We note that our original keys (the number names from “one” to “ten”) 1The algorithm is presented in: Zbigniew J. Czech, George Havas, and Bohdan S. Majewski, “An optimal algorithm for generating minimal perfect has functions”, Information Processing Letters, 43(5):257-264, October 1992. are unambiguously differentiable by the first two letters. Thus, we can simplify the presentation by using only the first two letters of the keys. In your Java program, you will always use the entirety of every word; however, the method that hashes the words is already implemented for you, so you don’t need to concern yourself with it unless you are endeavoring to fully understand the implementation.Our keys and the “sub-keys” that we’re using in this example follow:

Key Minimum Unique Sub-key

one on

two tw

three th

four fo

five fi

six si

seven se

eight ei

nine ni

ten te

A modulus must be chosen, giving a maximum number in our tables, and a maximum number of nodes in the resultant graph. The paper authors choose a value that is one more than twice the number of keys, so we do too. This choice has implications on the runtime of the algorithm (smaller moduli increase the probability of cycles in the resultant graph, forcing a restart of the algorithm). If the modulus is too small, the algorithm will enter an infinite loop.

Next, we populate our tables, T1 and T2, with random numbers less than the modulus. Each of the two tables will have the same number of rows as the length l of the sub-key. The i

th row, 0 <= i <= l − 1,defines a separate mapping from letters at the i

th position (increasing from left to right) within the sub-key to random numbers.

e f h i n o s t w

8 15 10 2 10 19 13 0 13

3 5 0 5 5 6 19 5 6

e f h i n o s t w

20 2 4 19 11 15 3 8 11

16 4 2 6 19 10 18 4 18

These tables define a function that we will use to transform our keys into nodes in a graph. For each key,

we apply the function twice, once each for T1 and T2. Index the tables by the letter position and the letter

value, add the numbers found in the tables, and return a result modulo the modulus2

. We get two values per

key, one for T1, and one for T2. These values are taken as node IDs in our graph, and an edge runs between

them corresponding with the index of the key in the input vector, so “one” corresponds with index zero,

“two” with index one, etc.

If you look carefully at the hash function defined in the project source template, you will notice that we always use tables

of size 4 × 64. We’re doing things slightly differently there, to reduce code complexity. Instead of indexing with letter position,

we index with letter position mod 4, and instead of letter value, we use letter value mod 64. The example in this document uses

a simpler (and better) approach that is easy to do when hand tuning, but difficult to automate. The fundamental structure of the

algorithm is unchanged in either case.

Applying our table to the first key, “one”, and recall that we are using only the first two letters in this

example (and that we always use the entirety of every word in our program code), we have T1(0, o) ==

19 and T1(1, n) == 5. 19 + 5 ≡ 3 mod 21, so one end of our “one” edge is node 3. The other end is

given by T2(0, o) == 15 and T2(1, n) == 19, and 15 + 19 ≡ 13 mod 21. So the “one” edge runs from

node 3 to node 13.

The following table gives the calculation for the entire input set:

key T1(0,letter) T1(1,letter) sum mod 21 T2(0,letter) T2(1,letter) sum mod 21

one 19 5 3 15 19 13

two 0 6 6 8 18 5

three 0 0 0 8 2 10

four 15 6 0 2 10 12

five 15 5 20 2 6 8

six 13 5 18 3 6 9

seven 13 3 16 3 16 19

eight 8 5 13 20 6 5

nine 10 5 15 11 6 17

ten 0 3 3 8 16 3

The T1 sum column is also called u(key) and the T2 sum is v(key). We’ll be referring to them by these

names later in the algorithm.

Now our graph is built. The next step is to check if the graph is suitable to build our hash table. A

graph is suitable if it contains no cycles. Cycle detection is usually done with a depth first search, but in this

example, we’ll inspect by eye; in program control we would call a method for this check, one that you will

be implementing!

Look at the final row in the above table and note that the “ten” edge is a self loop; it both begins and

ends on node 3. Therefore, we need to discard this graph and the associated tables. We generate a new set

of random tables:

e f h i n o s t w

1 13 16 3 3 7 10 1 17

19 11 17 6 20 3 0 17 14

e f h i n o s t w

7 5 6 13 20 8 19 10 16

8 3 0 1 13 0 17 14 4

and calculate a new graph:

key T1(0,letter) T1(1,letter) sum mod 21 T2(0,letter) T2(1,letter) sum mod 21

one 7 20 6 8 13 0

two 1 14 15 10 4 14

three 1 17 18 10 0 10

four 13 3 16 5 0 5

five 13 6 19 5 1 6

six 10 6 16 19 1 20

seven 10 19 8 19 8 6

eight 1 6 7 7 1 8

nine 3 6 9 20 1 0

ten 1 19 20 10 8 18

Applying our cycle-detecting eyeballs, we see that this graph has no loops:

four (3)

nine (8)

one (0)

five (4)

seven (6)

eight (7)

two (1)

three (2)

six (5)

ten (9)

The final step is to fill the array that defines the g() function (as named in the paper; “g” does not stand

for “graph”). g() maps edges in our graph to indices in our original input array. We do this by traversing

the connected components of the graph. We always start with the lowest-numbered, unvisited vertex. When

traversing through a node where you have a choice of which direction to travel next, the choice does not

matter.

We start with node zero and assign g(0) = 0. Then we traverse the neighbors of node 0; since neighbor

order doesn’t matter, let’s do node 9 next. g(9) = 8 − g(0) mod 10 ≡ 8, where the 8 in the subtraction

comes from the edge value and 10 is the number of keys in the input vector. Going back to node 1 and

heading the other direction, we have g(6) = 0 − g(0) mod 10 ≡ 0; again, the 0 in the subtraction is the

edge value. Continuing to node 8, g(8) = 6 − g(6) mod 10 ≡ 6. Node 7, g(7) = 7 − g(8) mod 10 ≡ 1.

And node 19, g(19) = 4 − g(6) mod 10 ≡ 4.

This completes the first connected subgraph. Let’s move to the next subgraph. The lowest-numbered,

unvisited node is node 5. We assign g(5) = 0, and proceed as above. Node 16, g(16) = 3 − g(5)

mod 10 ≡ 3. Node 20, g(20) = 5 − g(16) mod 10 ≡ 2. Node 18, g(18) = 9 − g(20) mod 10 ≡ 7. And

node 10 completes this subgraph, g(10) = 2 − g(18) mod 10 ≡ 5.

There is one more subgraph, containing nodes 14 and 15. Start with the smaller, g(14) = 0, and

g(15) = 1 − g(14) mod 10 ≡ 1.

And here is g:

index 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

g 0 0 0 0 0 0 0 1 6 8 5 0 0 0 0 1 3 0 7 4 2

Positions in this array that were not assigned in the last step do not matter. We choose to fill them in with

zeros.

As stated above, u(key) is the first “sum mod 21” in the final graph calculation table above, and v(key)

is the second such column. Our hash function is given by:

HASH(key) = (g (u (key)) + g (v (key))) mod 10

thus:

HASH(“seven”) = g(8) + g(6) mod 10

= 6 + 0 mod 10

= 6

which is the seventh element, since arrays are zero indexed; and

HASH(“three”) = g(18) + g(10) mod 10

= 7 + 5 mod 10

= 12 mod 10

= 2