Java作业 | COM S 228, Fall 2020 Programming Project 5

本次Java作业是实现一个完整的哈希表

COM S 228, Fall 2020Programming Project 5

Generating a Perfect Hash Table Implementation

Problem Overview

Graphs are the most general of the common data structures; consider that a scalar object is a degenerate list,which is a degenerate tree, which is itself a degenerate graph, or in other words, every data structure we’ve discussed this semester can be represented as a graph. Due to their extreme generality, it’s rare that you’ll use a generic “graph” data structure. The limited set of things you might do with a list make two generic list implementations—an array based- and a linked-structure—good enough for most implementations, and it’s only if you are doing, e.g., low-level system programming that you might have a good reason to write a custom, one-off list. But there is so much variation in graphs that it is perhaps impossible to write a good and useful generic graph implementation.

Hash tables provide a means to get constant-time look-up performance on a collection. Usually, the data comes to us at runtime, often in an on-line fashion. In this case, we take what foreknowledge we can to inform best practices in hash table construction, choosing a hash function, etc., to build a data structure with as close to optimal performance as possible; however, if the keys in the table are known beforehand, it’s always possible to construct a perfect hash table, a table in which every key maps to a unique slot.There are many algorithms to construct such a table and the associated hash function; almost all of them are implemented using graphs. For this project, you will be implementing a perfect hash table construction algorithm1 

 This algorithm generates a pair of tables of random numbers. Keys are hashed with the values in each these tables, giving two integers (one for each table). These integers are then taken as node indices in a graph, with an edge between them. So long as the resultant graph—after hashing all of the keys—contains no cycles, the algorithm is ready to generate the hash table as a function of the tables (the T tables) used in the graph generation and a function g() on those tables.

Algorithm Example

NOTE: This algorithm looks pretty intimidating at first glance. With some careful reading and workingthrough of examples, you should find it’s actually fairly straightforward. That said, you don’t necessarily have to understand it at all! If you find yourself getting anxious about all this “scary-looking stuff”, jump down to the Requirements sections, read that, then come back here with a more relaxed mindset. It is very common for professional programmers to be required to implement solutions to problems that they don’t actually understand, so if you find yourself doing that, here or elsewhere, please know that it is normal. This is not to advocate for ignorance, however–it’s always better to understand something than not to understand it–only to recognize that sometimes the cost-benefit of understanding is too high! The example below was generated by hand, and uses an optimization that would be very difficult to implement in program control. We note that our original keys (the number names from “one” to “ten”) 1The algorithm is presented in: Zbigniew J. Czech, George Havas, and Bohdan S. Majewski, “An optimal algorithm for generating minimal perfect has functions”, Information Processing Letters, 43(5):257-264, October 1992. are unambiguously differentiable by the first two letters. Thus, we can simplify the presentation by using only the first two letters of the keys. In your Java program, you will always use the entirety of every word; however, the method that hashes the words is already implemented for you, so you don’t need to concern yourself with it unless you are endeavoring to fully understand the implementation.Our keys and the “sub-keys” that we’re using in this example follow:

Key Minimum Unique Sub-key

one on

two tw

three th

four fo

five fi

six si

seven se

eight ei

nine ni

ten te

A modulus must be chosen, giving a maximum number in our tables, and a maximum number of nodes in the resultant graph. The paper authors choose a value that is one more than twice the number of keys, so we do too. This choice has implications on the runtime of the algorithm (smaller moduli increase the probability of cycles in the resultant graph, forcing a restart of the algorithm). If the modulus is too small, the algorithm will enter an infinite loop.

Next, we populate our tables, T1 and T2, with random numbers less than the modulus. Each of the two tables will have the same number of rows as the length l of the sub-key. The i

th row, 0 <= i <= l − 1,defines a separate mapping from letters at the i

th position (increasing from left to right) within the sub-key to random numbers.

T1

e f h i n o s t w

8 15 10 2 10 19 13 0 13

3 5 0 5 5 6 19 5 6

T2

e f h i n o s t w

20 2 4 19 11 15 3 8 11

16 4 2 6 19 10 18 4 18

These tables define a function that we will use to transform our keys into nodes in a graph. For each key,

we apply the function twice, once each for T1 and T2. Index the tables by the letter position and the letter

value, add the numbers found in the tables, and return a result modulo the modulus2

. We get two values per

key, one for T1, and one for T2. These values are taken as node IDs in our graph, and an edge runs between

them corresponding with the index of the key in the input vector, so “one” corresponds with index zero,

“two” with index one, etc.

2

If you look carefully at the hash function defined in the project source template, you will notice that we always use tables

of size 4 × 64. We’re doing things slightly differently there, to reduce code complexity. Instead of indexing with letter position,

we index with letter position mod 4, and instead of letter value, we use letter value mod 64. The example in this document uses

a simpler (and better) approach that is easy to do when hand tuning, but difficult to automate. The fundamental structure of the

algorithm is unchanged in either case.

Applying our table to the first key, “one”, and recall that we are using only the first two letters in this

example (and that we always use the entirety of every word in our program code), we have T1(0, o) ==

19 and T1(1, n) == 5. 19 + 5 ≡ 3 mod 21, so one end of our “one” edge is node 3. The other end is

given by T2(0, o) == 15 and T2(1, n) == 19, and 15 + 19 ≡ 13 mod 21. So the “one” edge runs from

node 3 to node 13.

The following table gives the calculation for the entire input set:

key T1(0,letter) T1(1,letter) sum mod 21 T2(0,letter) T2(1,letter) sum mod 21

one 19 5 3 15 19 13

two 0 6 6 8 18 5

three 0 0 0 8 2 10

four 15 6 0 2 10 12

five 15 5 20 2 6 8

six 13 5 18 3 6 9

seven 13 3 16 3 16 19

eight 8 5 13 20 6 5

nine 10 5 15 11 6 17

ten 0 3 3 8 16 3

The T1 sum column is also called u(key) and the T2 sum is v(key). We’ll be referring to them by these

names later in the algorithm.

Now our graph is built. The next step is to check if the graph is suitable to build our hash table. A

graph is suitable if it contains no cycles. Cycle detection is usually done with a depth first search, but in this

example, we’ll inspect by eye; in program control we would call a method for this check, one that you will

be implementing!

Look at the final row in the above table and note that the “ten” edge is a self loop; it both begins and

ends on node 3. Therefore, we need to discard this graph and the associated tables. We generate a new set

of random tables:

T1

e f h i n o s t w

1 13 16 3 3 7 10 1 17

19 11 17 6 20 3 0 17 14

T2

e f h i n o s t w

7 5 6 13 20 8 19 10 16

8 3 0 1 13 0 17 14 4

and calculate a new graph:

key T1(0,letter) T1(1,letter) sum mod 21 T2(0,letter) T2(1,letter) sum mod 21

one 7 20 6 8 13 0

two 1 14 15 10 4 14

three 1 17 18 10 0 10

four 13 3 16 5 0 5

five 13 6 19 5 1 6

six 10 6 16 19 1 20

seven 10 19 8 19 8 6

eight 1 6 7 7 1 8

nine 3 6 9 20 1 0

ten 1 19 20 10 8 18

Applying our cycle-detecting eyeballs, we see that this graph has no loops:

9

0

6

19

8

7

15

14

5

16

20

18

10

four (3)

nine (8)

one (0)

five (4)

seven (6)

eight (7)

two (1)

three (2)

six (5)

ten (9)

The final step is to fill the array that defines the g() function (as named in the paper; “g” does not stand

for “graph”). g() maps edges in our graph to indices in our original input array. We do this by traversing

the connected components of the graph. We always start with the lowest-numbered, unvisited vertex. When

traversing through a node where you have a choice of which direction to travel next, the choice does not

matter.

We start with node zero and assign g(0) = 0. Then we traverse the neighbors of node 0; since neighbor

order doesn’t matter, let’s do node 9 next. g(9) = 8 − g(0) mod 10 ≡ 8, where the 8 in the subtraction

comes from the edge value and 10 is the number of keys in the input vector. Going back to node 1 and

heading the other direction, we have g(6) = 0 − g(0) mod 10 ≡ 0; again, the 0 in the subtraction is the

edge value. Continuing to node 8, g(8) = 6 − g(6) mod 10 ≡ 6. Node 7, g(7) = 7 − g(8) mod 10 ≡ 1.

And node 19, g(19) = 4 − g(6) mod 10 ≡ 4.

This completes the first connected subgraph. Let’s move to the next subgraph. The lowest-numbered,

unvisited node is node 5. We assign g(5) = 0, and proceed as above. Node 16, g(16) = 3 − g(5)

mod 10 ≡ 3. Node 20, g(20) = 5 − g(16) mod 10 ≡ 2. Node 18, g(18) = 9 − g(20) mod 10 ≡ 7. And

node 10 completes this subgraph, g(10) = 2 − g(18) mod 10 ≡ 5.

There is one more subgraph, containing nodes 14 and 15. Start with the smaller, g(14) = 0, and

g(15) = 1 − g(14) mod 10 ≡ 1.

And here is g:

index 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

g 0 0 0 0 0 0 0 1 6 8 5 0 0 0 0 1 3 0 7 4 2

Positions in this array that were not assigned in the last step do not matter. We choose to fill them in with

zeros.

As stated above, u(key) is the first “sum mod 21” in the final graph calculation table above, and v(key)

is the second such column. Our hash function is given by:

HASH(key) = (g (u (key)) + g (v (key))) mod 10

thus:

HASH(“seven”) = g(8) + g(6) mod 10

= 6 + 0 mod 10

= 6

which is the seventh element, since arrays are zero indexed; and

HASH(“three”) = g(18) + g(10) mod 10

= 7 + 5 mod 10

= 12 mod 10

= 2

©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 222,000评论 6 515
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 94,745评论 3 399
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 168,561评论 0 360
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 59,782评论 1 298
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 68,798评论 6 397
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 52,394评论 1 310
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 40,952评论 3 421
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 39,852评论 0 276
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 46,409评论 1 318
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 38,483评论 3 341
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 40,615评论 1 352
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 36,303评论 5 350
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 41,979评论 3 334
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 32,470评论 0 24
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 33,571评论 1 272
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 49,041评论 3 377
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 45,630评论 2 359