Find the Duplicate number

File: FindDuplicate.py

Author: Keith Schwarz (htiek@cs.stanford.edu)

An algorithm for solving the following (classic) hard interview problem:

"You are given an array of integers of length n, where each element ranges

from 0 to n - 2, inclusive. Prove that at least one duplicate element must

exist, and give an O(n)-time, O(1)-space algorithm for finding some

duplicated element. You must not modify the array elements during this

process."

This problem (reportedly) took CS legend Don Knuth twenty-four hours to solve

and I have only met one person (Keith Amling) who could solve it in less time

than this.

The first part of this problem - proving that at least one duplicate element

must exist - is a straightforward application of the pigeonhole principle.

If the values range from 0 to n - 2, inclusive, then there are only n - 1

different values. If we have an array of n elements, one must necessarily be

duplicated.

The second part of this problem - finding the duplicated element subject to

the given constraints - is much harder. To solve this, we're going to need a

series of nonobvious insights that transform the problem into an instance of

something entirely different.

The main trick we need to use to solve this problem is to notice that because

we have an array of n elements ranging from 0 to n - 2, we can think of the

array as defining a function f from the set {0, 1, ..., n - 1} onto itself.

This function is defined by f(i) = A[i]. Given this setup, a duplicated

value corresponds to a pair of indices i != j such that f(i) = f(j). Our

challenge, therefore, is to find this pair (i, j). Once we have it, we can

easily find the duplicated value by just picking f(i) = A[i].

But how are we to find this repeated value? It turns out that this is a

well-studied problem in computer science called cycle detection. The general

form of the problem is as follows. We are given a function f. Define the

sequence x_i as

x_0 = k (for some k)

x_1 = f(x_0)

x_2 = f(f(x_0))

...

x_{n+1} = f(x_n)

Assuming that f maps from a domain into itself, this function will have one

of three forms. First, if the domain is infinite, then the sequence could be

infinitely long and nonrepeating. For example, the function f(n) = n + 1 on

the integers has this property - no number is ever duplicated. Second, the

sequence could be a closed loop, which means that there is some i so that

x_0 = x_i. In this case, the sequence cycles through some fixed set of

values indefinitely. Finally, the sequence could be "rho-shaped." In this

case, the sequence looks something like this:

x_0 -> x_1 -> ... x_k -> x_{k+1} ... -> x_{k+j}

^ |

| |

+-----------------------+

That is, the sequence begins with a chain of elements that enters a cycle,

then cycles around indefinitely. We'll denote the first element of the cycle

that is reached in the sequence the "entry" of the cycle.

For our particular problem of finding a duplicated element in the array,

consider the sequence formed by starting at position n - 1 and then

repeatedly applying f. That is, we start at the last position in the array,

then go to the indicated index, repeating this process. My claim is that

this sequence is rho-shaped. To see this, note that it must contains a cycle

because the array is finite and after visiting n elements, we necessarily

must visit some element twice. This is true no matter where we start off in

the array. Moreover, note that since the array elements range from 0 to

n - 2 inclusive, there is no array index that contains n - 1 as a value.

Consequently, when we leave index n - 1 after applying the function f one

time, we can never get back there. This means that n - 1 can't be part of a

cycle, but if we follow indices starting there we must eventually hit some

other node twice. The concatenation of the chain starting at n - 1 with the

cycle it hits must be rho-shaped.

Moreover, think about the node we encounter that starts at the entry of the

cycle. Since this node is at the entry of the cycle, there must be two

inputs to the function f that both result in that index being generated. For

this to be possible, it must be that there are indices i != j with

f(i) = f(j), meaning that A[i] = A[j]. Thus the index of the entry of the

cycle must be one of the values that is duplicated in the array.

There is a famous algorithm due to Robert Floyd that, given a rho-shaped

sequence, finds the entry point of the cycle in linear time and using only

constant space. This algorithm is often referred to as the "tortoise and

hare" algorithm, for reasons that will become clearer shortly.

The idea behind the algorithm is to define two quantities. First, let c be

the length of the chain that enters the cycle, and let l be the length of the

cycle. Next, let l' be the smallest multiple of l that's larger than c.

I claim that for any rho-shaped sequence l' defined above, that

x_{l'} = x_{2l'}

The proof is actually straightforward and very illustrative - it's one of my

favorite proofs in computer science. The idea is that since l' is at least

c, it must be contained in the cycle. Moreover, since l' is a multiple of

the length of the loop, we can write it as ml for some constant m. If we

start at position x_{l'}, which is inside the loop, then take l' more steps

forward to get to x_{2l'}, then we will just walk around the loop m times,

ending up right back where we started.

One key trick of Floyd's algorithm is that even if we don't explicitly know l

or c, we can still find the value l' in O(l') time. The idea is as follows.

We begin by keeping track of two values "slow" and "fast," both starting at

x_0. We then iteratively compute

slow = f(slow)

fast = f(f(fast))

We repeat this process until we find that slow and fast are equal to one

another. When this happens, we know that slow = x_j for some j, and

fast = x_{2j} for that same j. Since x_j = x_{2j}, we know that j must be at

least c, since it has to be contained in the cycle. Moreover, we know that j

must be a multiple of l, since the fact that x_j = x_{2j} means that taking j

steps while in the cycle ends up producing the same result. Finally, j must

be the smallest multiple of l greater than c, since if there were a smaller

multiple of l greater than c then we would have reached that multiple before

we reached j. Consequently, we must have that j = l', meaning that we can

find l' without knowing anything about the length or shape of the cycle!

To complete the construction, we need to show how to use our information

about l' to find the entry to the cycle (which is at position x_c). To do

this, we start off one final variable, which we call "finder," at x_0. We

then iteratively repeat the following:

finder = f(finder)

slow = f(slow)

until finder = slow. We claim that (1) the two will eventually hit each

other, and (2) they will hit each other at the entry to the cycle. To see

this, we remark that since slow is at position x_{l'}, if we take c steps

forward, then we have that slow will be at position x_{l' + c}. Since l' is

a multiple of the loop length, this is equivalent to taking c steps forward,

then walking around the loop some number of times back to where you started.

In other words, x_{l' + c} = x_c. Moreover, consider the position of the

finder variable after c steps. It starts at x_0, so after c steps it will be

at position x_c. This proves both (1) and (2), since we've shown that the

two must eventually hit each other, and when they do they hit at position x_c

at the entry to the cycle.

The beauty of this algorithm is that it uses only O(1) external memory to

keep track of two different pointers - the slow pointer, and then the fast

pointer (for the first half) and the finder pointer (for the second half).

But on top of that, it runs in O(n) time. To see this, note that the time

required for the slow pointer to hit the fast pointer is O(l'). Since l' is

the smallest multiple of l greater than c, we have two cases to consider.

First, if l > c, then this is l. Otherwise, if l < c, then we have that

there must be some multiple of l between c and 2c. To see this, note that

in the range c and 2c there are c different values, and since l < c at least

one of them must be equal to 0 mod l. Finally, the time required to find the

start of the cycle from this point is O(c). This gives a total runtime of at

most O(c + max{l, 2c}). All of these values are at most n, so this algorithm

runs in time O(n).

def findArrayDuplicate(array):
assert len(array) > 0

# The "tortoise and hare" step.  We start at the end of the array and try
# to find an intersection point in the cycle.
slow = len(array) - 1
fast = len(array) - 1

# Keep advancing 'slow' by one step and 'fast' by two steps until they
# meet inside the loop.
while True:
    slow = array[slow]
    fast = array[array[fast]]

    if slow == fast:
        break

# Start up another pointer from the end of the array and march it forward
# until it hits the pointer inside the array.
finder = len(array) - 1
while True:
    slow   = array[slow]
    finder = array[finder]

    # If the two hit, the intersection index is the duplicate element.
    if slow == finder:
        return slow
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 218,036评论 6 506
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 93,046评论 3 395
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 164,411评论 0 354
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 58,622评论 1 293
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 67,661评论 6 392
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 51,521评论 1 304
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 40,288评论 3 418
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 39,200评论 0 276
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 45,644评论 1 314
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 37,837评论 3 336
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 39,953评论 1 348
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 35,673评论 5 346
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 41,281评论 3 329
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 31,889评论 0 22
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 33,011评论 1 269
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 48,119评论 3 370
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 44,901评论 2 355

推荐阅读更多精彩内容