Content-Based Similarity Search in Large-Scale DNA Data Storage Systems

Authors

Callista Bee 1 *, Yuan-Jyue Chen 2 , David Ward 1 , Xiaomeng Liu 1 , Georg Seelig 1,3 , Karin
Strauss 2 *, Luis Ceze 1 *.
Affiliations:
1 Paul G. Allen School of Computer Science & Engineering, University of Washington.
2 Microsoft.
3 Department of Electrical & Computer Engineering, University of Washington.
*Correspondence to: kstwrt@cs.washington.edu, kstrauss@microsoft.com,
luisceze@cs.washington.edu.

Abstract

  • Here, we demonstrate content-based retrieval from a DNA database by learning a mapping from images to DNA sequences such that an encoded query image will retrieve visually similar images from the database via DNA hybridization.
  • We encoded and synthesized a database of 1.6 million images and queried it with a variety of images, showing that each query retrieves a sample of the database in which visually similar images appear at a rate much greater than chance.
  • We compare our results with several algorithms for similarity search in electronic systems, and demonstrate that our molecular approach is competitive with state-of-the-art electronics.

One Sentence Summary:

Learned encodings enable content-based image similarity search from a database of 1.6 million images encoded in synthetic DNA.

Main Text:

P1

  • Since the dawn of online social media, the amount of text, images, and video that the world produces has been growing exponentially, outpacing the capacity growth of traditional storage media (solid state, magnetic, optical).
  • Recent work has demonstrated data storage systems built on synthetic DNA (1–7), a medium with several orders of magnitude higher density and durability than current storage technologies.
  • An important feature of these systems is the ability to retrieve single documents without having to sequence a large, dense pool of data in molecular form.
  • To accomplish this, these systems leverage the specificity of DNA hybridization to implement key-based retrieval, where a short, predetermined sequence is used to retrieve the document associated with that sequence.

P2

  • Electronic database systems separate the concerns of search and storage by maintaining an index structure that is designed to facilitate content-based retrieval.
  • We follow this design principle by creating a DNA-based index that is optimized for similarity search, where an example document is used to retrieve similar documents from a database.
  • Augmenting the density and longevity of DNA data storage with the computational power of similarity search will enable applications that are infeasible for electronic systems.

P3

  • Document similarity is typically formulated as a geometric problem (13) where each document is converted to a vector in a high-dimensional feature space, with the property that neighboring feature vectors represent subjectively similar documents.

P4

  • Searching in a high-dimensional space is challenging for electronic systems.
  • Instead, efficient similarity search algorithms perform an approximate search that allows for errors in the results. We adopt that relaxation as well.
  • Rather than finding exact nearest neighbors, our goal is to maximize the number of near neighbors retrieved while minimizing the number of irrelevant results.
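
As a reference point for this relaxation, exact nearest neighbors can be computed by brute force. The sketch below is our own illustration (hypothetical function names) of the exact answer that approximate methods trade away for speed.

```python
import numpy as np

def exact_nearest_neighbors(query_vec, database_vecs, k=100):
    """Brute-force k-nearest neighbors by Euclidean distance: the exact answer
    that approximate similarity search only approximates."""
    dists = np.linalg.norm(database_vecs - query_vec, axis=1)
    return np.argsort(dists)[:k]   # indices of the k closest feature vectors
```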

P5

  • We want to encode feature vectors as DNA sequences such that single-stranded molecules created from an encoded target and the reverse-complement of an encoded query are likely to form stable hybridized structures when the query and target feature vectors are neighboring, but not when they are distant (Figure 1B).

  • Given such an encoding, we can construct an index where each document is associated with a single strand of DNA that contains the document’s ID along with its encoded feature vector (Figure 1C-D).

  • An encoded query (Figure 1E) can then be used as a hybridization probe to pull similar documents out of the index (Figure 1F-G).

  • The filtered index can be sequenced and decoded to recover the IDs of documents that are similar to the query.

  • These documents can then be retrieved from a key-based database and displayed to the user (Figure 1H).
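
The sketch below is a schematic of our own (placeholder field names) showing the division of labor: each index strand pairs a decodable document ID with the document's encoded feature sequence, and the IDs recovered after filtering and sequencing drive an ordinary key-based lookup.

```python
# Schematic index record: one single-stranded oligo per document.
index_record = {
    "doc_id": "IMG_0000042",         # unique, decodable barcode region
    "feature_seq": "ACGTTGCAACGT",   # learned DNA encoding of the feature vector
}

def lookup_similar(decoded_ids, key_value_store):
    # After the filtered index is sequenced and decoded, the recovered IDs are
    # used for key-based retrieval of the full documents (Figure 1H).
    return [key_value_store[doc_id] for doc_id in decoded_ids]
```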

P6

  • Designing a universal encoding from feature vectors to DNA sequences is difficult. Prior work (9) instead used machine learning to tailor the encoding to a particular dataset: the dataset is clustered with k-means, each cluster center is mapped to a known DNA sequence, and that sequence is assigned to every document in the cluster.

  • By reducing content-based retrieval to exact key-based retrieval, their approach sidesteps any issues with unwanted DNA hybridization.

  • However, once the cluster labels are chosen, they are static; any additional data added to the database must fit into the existing clusters, even if it would cause the cluster centers to change.

P7

  • Rather than using a codebook of pre-determined sequences, our technique leverages neural networks to map feature vectors directly onto fixed-length DNA sequences.
  • A key component of our technique is a differentiable model of DNA hybridization that is trained alongside the encoder. This allows us to optimize the encoder to minimize unwanted hybridization while preserving the flexibility of inexact matching.
  • To train the predictor, the reactions between pairs of sequences output by the encoder are simulated via NUPACK (17). We consider this simulation to be the “gold standard” for computing the outcome of a reaction, but it cannot be used directly to train the encoder, because its computation is not differentiable.

P8

  • We chose to focus first on encoding feature vectors derived from images.
  • Figure 2A illustrates the neural network that maps images to DNA sequences. First, we use the FC2 intermediate layer of VGG16 (14) to extract a feature vector.
  • A fully-connected neural network with a single hidden layer converts these feature vectors to “one-hot” DNA sequences, where each position is represented numerically by a four-channel vector (one channel for each possible base) whose components sum to one.
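
A minimal PyTorch sketch of this encoder head follows. It is our own illustration; the layer sizes follow the Materials and Methods, and everything else is assumed.

```python
import torch
import torch.nn as nn

SEQ_LEN, BASES = 80, "ACGT"   # 80-nt encoded feature region, one channel per base

# Single hidden layer, then a softmax over the four base channels at each position.
encoder = nn.Sequential(nn.Linear(4096, 2048), nn.ReLU(), nn.Linear(2048, SEQ_LEN * 4))

def encode_one_hot(feature_vec):
    logits = encoder(feature_vec).view(-1, SEQ_LEN, 4)
    return torch.softmax(logits, dim=-1)        # channels at each position sum to one

def to_dna(one_hot):
    # Read off a discrete sequence by taking the strongest channel at each position.
    return "".join(BASES[i] for i in one_hot[0].argmax(dim=-1).tolist())

one_hot = encode_one_hot(torch.randn(1, 4096))  # stand-in for a VGG16 FC2 feature vector
print(to_dna(one_hot))                          # an 80-character DNA string
```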


    Figure 2

P9

  • The hybridization predictor (Fig. 2B) takes a pair of one-hot sequences and predicts the yield of the hybridization reaction between the top sequence and the reverse-complement of the bottom sequence.

  • It first pre-processes the sequence pair with a sliding window operation that extracts local matches (Fig. S3B). Following local average pooling, a trainable convolutional layer enables the predictor to weigh different kinds of local matches differently.


  • Because we are not predicting where hybridization occurs, a global average pooling operation removes spatial relevance.

  • The prediction is then computed with a logistic regression (in our experience, this is more effective than a linear regression because most hybridization yields are either close to 0 or close to 1).
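
The sketch below captures this pipeline in PyTorch. It is our reading of the description above, not the authors' code: the exact local-match definition, window sizes, and filter counts are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class YieldPredictor(nn.Module):
    """Sketch of the hybridization-yield predictor: local matches, local average
    pooling, a trainable convolution, global average pooling, logistic readout."""
    def __init__(self, n_offsets=7, n_filters=16):
        super().__init__()
        self.n_offsets = n_offsets
        self.conv = nn.Conv1d(n_offsets, n_filters, kernel_size=1)  # weighs kinds of local matches
        self.readout = nn.Linear(n_filters, 1)                      # logistic regression head

    def local_match(self, a, b):
        # a, b: (batch, seq_len, 4) one-hot sequences. Score how well 3-mer windows
        # of `a` match 3-mer windows of `b` at small relative offsets.
        sims = []
        for off in range(-(self.n_offsets // 2), self.n_offsets // 2 + 1):
            shifted = torch.roll(b, shifts=off, dims=1)
            per_pos = (a * shifted).sum(dim=-1)                               # base-wise agreement
            sims.append(F.avg_pool1d(per_pos.unsqueeze(1), 3, stride=1).squeeze(1))
        return torch.stack(sims, dim=1)                                       # (batch, offsets, positions)

    def forward(self, a, b):
        x = self.local_match(a, b)
        x = F.avg_pool1d(x, kernel_size=4)     # local average pooling
        x = F.relu(self.conv(x))
        x = x.mean(dim=-1)                     # global average pooling removes position
        return torch.sigmoid(self.readout(x)).squeeze(-1)   # predicted yield in (0, 1)

a = torch.softmax(torch.randn(1, 80, 4), dim=-1)   # stand-ins for encoder outputs
b = torch.softmax(torch.randn(1, 80, 4), dim=-1)
print(YieldPredictor()(a, b))
```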

P10

  • Figure 2C outlines the training procedure, which alternates between encoder and predictor training phases.
  • In the encoder training phase, pairs of image feature vectors are compared to determine whether they are close enough to be deemed “similar”. The threshold for similarity is determined based on subjective evaluation of the feature space (see Figure S1 in the supplementary material).


  • The feature vectors are encoded independently to produce a pair of one-hot sequences, which are passed to the hybridization predictor.

  • If the predicted yield is low and the documents are similar, or the predicted yield is high and the documents are not similar, the encoder’s parameters are modified (via gradient descent) to correct the error.
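
One way to write this encoder-phase update is sketched below. It is a plausible reading of the procedure, not the authors' exact loss; the distance threshold of 75 comes from the Materials and Methods.

```python
import torch
import torch.nn.functional as F

def encoder_training_step(encode, predictor, feats_a, feats_b, optimizer, threshold=75.0):
    """Single encoder-phase update. `encode` maps feature vectors to soft one-hot
    sequences, `predictor` is the (frozen) hybridization-yield model, and
    `optimizer` holds only the encoder's parameters."""
    similar = (torch.norm(feats_a - feats_b, dim=1) <= threshold).float()
    yield_pred = predictor(encode(feats_a), encode(feats_b))
    # Low yield for similar pairs, or high yield for dissimilar pairs, is penalized.
    loss = F.binary_cross_entropy(yield_pred, similar)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```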


  • In the predictor training phase, the one-hot sequences output by the encoder are discretized and their reaction is simulated with NUPACK.
  • If the predicted yield differs from NUPACK’s simulation result, the predictor’s parameters are modified (via gradient descent) to correct the error.
  • Eventually, the predictor learns a model of hybridization that is good enough for the encoder to learn how to map feature vectors to DNA sequences, such that neighboring feature vectors are much more likely to result in pairs of sequences that hybridize (Figure 2D-E).
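
A corresponding sketch of the predictor-phase update follows (our own illustration; `nupack_yield` stands in for a wrapper around the NUPACK simulation and is not a real NUPACK API call).

```python
import torch
import torch.nn.functional as F

BASES = "ACGT"

def discretize(one_hot):
    # Collapse each soft one-hot sequence (batch, 80, 4) to a discrete DNA string.
    return ["".join(BASES[i] for i in s.argmax(dim=-1).tolist()) for s in one_hot]

def predictor_training_step(predictor, one_hot_a, one_hot_b, nupack_yield, optimizer):
    """Single predictor-phase update. `nupack_yield(seq_a, seq_b)` is assumed to
    return the simulated yield of seq_a hybridizing with the reverse-complement
    of seq_b; `optimizer` holds only the predictor's parameters."""
    target = torch.tensor([nupack_yield(a, b) for a, b in
                           zip(discretize(one_hot_a), discretize(one_hot_b))])
    pred = predictor(one_hot_a.detach(), one_hot_b.detach())   # no gradient into the encoder
    loss = F.mse_loss(pred, target.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```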

P11

  • After training, we used the encoder to transform each of the 1.6 million images in our database into a DNA sequence.
  • In addition to the encoded features, each image’s sequence contains a unique, decodable barcode that can identify the sequence without using reference-based alignment, as well as conserved regions to facilitate amplification and processing via PCR (polymerase chain reaction).
  • Each image’s sequence is short enough to fit on a single synthesizable DNA oligomer (see Figure S4 in the supplementary material).
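
Schematically, each database strand is assembled as shown below. The primer and barcode sequences are placeholders of our own, not the ones used in the experiments.

```python
# Conserved regions flank a per-image barcode and the 80-nt encoded feature region.
FWD_PRIMER = "ACGTACGTACGTACGTACGT"   # placeholder conserved PCR site
REV_PRIMER = "TGCATGCATGCATGCATGCA"   # placeholder conserved PCR site

def assemble_oligo(barcode: str, feature_seq: str) -> str:
    return FWD_PRIMER + barcode + feature_seq + REV_PRIMER

oligo = assemble_oligo(barcode="ATCGGTAC", feature_seq="ACGT" * 20)
print(len(oligo))   # short enough to fit on a single synthesizable oligomer
```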



P12

  • To conduct similarity search with an image, we ordered a biotinylated probe oligomer that contains the reverse complement of the query’s encoded feature sequence.
  • We annealed the probe with a sample of the database, and then separated the annealed target/query pairs from the database using streptavidin-conjugated magnetic beads.
  • We then used high-throughput sequencing to reveal which database sequences persisted in the separated mixture, and measured how frequently each of them occurred.

P13

  • Figure 3 shows the experimental results for three different query images.

  • If we consider images with sequencing read counts above a certain threshold to be “retrieved”, we can characterize the set of retrieved images for a variety of thresholds.

  • Figure 3A shows that higher read counts are associated with sets of images that are closer to the query in Euclidean distance.

  • We can quantitatively characterize the quality of a retrieved set by its recall of the 100 nearest neighbors; that is, the number of images in the set that are among the 100 most similar images to the query in the database.

  • Figure 3B shows that, as the read threshold increases, the total number of images in the retrieved set drops sharply before nearest-neighbor recall begins to suffer.

  • We can also visually inspect the retrieved set by sorting its contents and displaying the most similar images.

  • Figure 3C shows that, even with very aggressive filtering, the retrieved set still contains images that are relevant to the query.

  • Assuming the read count for each image is proportional to its concentration in the filtered mixture, this means the mixture could be diluted about 1000x, conserving sequencing resources while still retrieving relevant images.
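
The analysis behind these observations can be sketched as follows (our own code; `read_counts` maps decoded image IDs to sequencing read counts and `true_nn_ids` holds the query's 100 exact nearest neighbors).

```python
def recall_vs_threshold(read_counts, true_nn_ids, thresholds):
    """For each read-count threshold, report the size of the retrieved set and
    its recall of the 100 exact nearest neighbors (cf. Figure 3B)."""
    true_nn = set(true_nn_ids)
    rows = []
    for t in thresholds:
        retrieved = {doc for doc, count in read_counts.items() if count >= t}
        recall = len(retrieved & true_nn) / len(true_nn)
        rows.append((t, len(retrieved), recall))
    return rows
```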

P14

  • The performance of a similarity search algorithm can be summarized by the curve in Figure 3B, which measures the proportion of the database that must be retrieved and sorted to achieve a particular 100-nearest neighbor recall.
  • The dashed line above the curve illustrates a “naive” algorithm that randomly samples the database: to retrieve half of the hundred nearest neighbors, it must retrieve half of the database.
  • The dashed-and-dotted line below the curve illustrates a perfect “oracle” algorithm: to retrieve half of the hundred nearest neighbors, it would retrieve exactly those 50 images from the 1.6 million in the database.
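
For concreteness, the two baselines reduce to simple arithmetic (a sketch under the stated database size of 1.6 million images and 100 target neighbors).

```python
def naive_fraction(recall):
    # Random sampling: expecting recall r of the 100 nearest neighbors requires
    # sampling the same fraction r of the whole database.
    return recall

def oracle_fraction(recall, n_neighbors=100, database_size=1_600_000):
    # A perfect index retrieves exactly the desired neighbors and nothing else.
    return recall * n_neighbors / database_size

print(naive_fraction(0.5))   # 0.5 of the database
print(oracle_fraction(0.5))  # 3.125e-05 of the database (just 50 images)
```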

P15

  • Figure 4 places the curve from Figure 3B in context alongside several state-of-the-art in silico algorithms, each benchmarked on the same database with the same queries we evaluated experimentally.

  • While there is room for improvement, our experimental performance is comparable to the state-of-the-art, indicating that DNA-based similarity search is a viable technique for searching the databases of the future.

Summary

  • This paper detailed the first demonstration of similarity search over digital information stored in DNA and compared its potential efficiency with that of electronic systems.
  • The results suggest that, as DNA data storage becomes more practical and scales to larger data sets, similarity search in DNA form is an attractive possibility compared to electronic systems.
  • Combining DNA data storage with similarity search support may offer a path to viable hybrid molecular-electronic computer systems.

Materials and Methods

Datasets

Feature Extraction

  • To extract image features, we processed each image with VGG16, a convolutional neural network designed for image classification.
  • The weights were loaded from the publicly available trained model and left unchanged during our processing.
  • We used the activations of FC2 (the second fully-connected layer) as 4096-dimensional feature vectors.
  • Figure S1 illustrates the relationship between subjective image similarity and feature vector Euclidean distance.
  • Pairs of images with Euclidean distance of 75 or less tend to be consistently similar, so during training we label these pairs as “similar” and all other pairs as “not similar”.
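
A sketch of this step using torchvision's pretrained VGG16 is shown below. This is our own code, not the authors' pipeline; a different VGG16 implementation may shift the distance scale slightly, so the threshold of 75 should be read as illustrative.

```python
import torch
from PIL import Image
from torchvision import models, transforms

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()
fc2 = torch.nn.Sequential(*list(vgg.classifier.children())[:5])   # through the second FC layer + ReLU

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def fc2_features(path):
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    x = torch.flatten(vgg.avgpool(vgg.features(x)), 1)
    return fc2(x).squeeze(0)          # 4096-dimensional feature vector

def is_similar(feat_a, feat_b, threshold=75.0):
    # Training label: Euclidean distance of 75 or less counts as "similar".
    return torch.norm(feat_a - feat_b).item() <= threshold
```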

Sequence Encoding

  • The sequence encoder is a fully-connected neural network. Its topology is depicted in Figure S2.


    Fig.S2
  • The 4096-dimensional FC2 vectors are fed into a 2048-dimensional hidden layer with a rectified linear activation, followed by an output layer with a “one-hot” sequence representation that is 80 nucleotides in length.

  • In this representation, each sequence position has four channels, one for each base. A softmax activation function is applied that forces each position’s channels to sum to 1.

  • A DNA sequence can be read off by picking the channel with
    the maximum activity at each position.

  • The one-hot representation can produce indeterminate bases (for example, if all four channels at a position have a value of 0.25).

  • Because of this, a regularization is applied during encoder training to minimize the entropy at each position. This encourages each position to have a well-defined maximum, which improves the accuracy of the yield predictor.
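
The entropy term can be written as follows (a minimal sketch; the regularization weight is our choice, not specified in the text).

```python
import torch

def position_entropy(one_hot, eps=1e-9):
    # Mean per-position entropy of the soft one-hot output (batch, 80, 4);
    # lower entropy means each position commits to a single base.
    p = one_hot.clamp_min(eps)
    return -(p * p.log()).sum(dim=-1).mean()

# e.g. total_loss = retrieval_loss + 0.1 * (position_entropy(seq_a) + position_entropy(seq_b))
```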

  • The yield predictor takes a pair of one-hot sequence representations and produces an estimate of the yield of the hybridization reaction between the first sequence and the reverse-
    complement of the second sequence.

  • It is structured as a convolutional neural network (Figure S3A). The network makes use of a novel local match layer (Figure S3B) that produces vectors of possible matches between each window of 3-mers. This encourages the predictor to make use of any unaligned matches between the two sequences.



Training
