Image - based approximate DNA storage system

Abstract

  • However, prior work simply investigated the feasibility of DNA storage storing different types of data and simply store images in DNA storage, which did not fully investigate the fault-tolerant potential of images in the DNA storage.

  • In this paper, we proposed a new image-based DNA system called IMG-DNA, which can efficiently store images in DNA storage with improved DNA storage robustness.

    • First, a new DNA architecture is proposed to fit JPEG-based images and improve the image’s robustness in DNA storage.
    • Moreover, barriers inserted in DNA sequences efficiently prevent error propagation in images of DNA storage.
    • The experimental results indicate that the proposed IMG-DNA achieves much higher fault-tolerant than prior work.

Introduce

The rapidly increased pace of digital volume motivates researchers to search for a storage medium with a real high density and long-term preservation.

Therefore, it is critical for emerging storage systems to
store such a massive amount of image data with a low cost.


Two challenges of many new emerging storage devices:

  • One is that the increased storage density cannot
    catch up with the rapid rate of increased digital volume.
  • The other is that the digital data can be reliably stored in these storage devices only for several years to at most tens of years.

Synthetic Deoxyribonucleic Acid (DNA) storage is
one promising storage medium for archiving due to its high storage density and long durability.


One issue of DNA storage is its high error rate. The write (synthesis) and read (sequencing) processes are error-prone.

For each base pair (one nucleotide), it may involve around 1% error rate [ 8 ]. To handle these errors, researchers use error-correction code (ECC) to recover errors resulting in a much high overhead. The space overhead may reach 15% or even higher.

Why choose image?

According to the high error rate in DNA storage, images might be well fitted to the DNA storage due to its properties of fault tolerance and large volume in social media.

Thus, DNA storage provides great potential to store a massive amount of image data to avoid high overhead induced by errors.


There are two types of prior work related to how to store image data in DNA storage.

  • One type of studies investigated the feasibility of DNA storage [9][8][10].

    • Images as one type of applications are stored in DNA.
    • They proposed different encoding schemes (i.e., converting digital data into DNA sequences), error-correction codes, and biochemical technologies for generally improving the efficiency of sequencing (read data out from DNA) or decreasing error rates.
    • However, they did not focus on the property of image data, which cannot achieve the best performance for image data in terms of storage density and error tolerance.
  • The other focused on approximately store images based on their property in different storage devices such as SSDs or Non-volatile memory [ 2 ][ 11 ][ 12 ].

    • For examples, Kuo et al. [ 11 ] proposed a
      new JPEG encoding scheme to improve the error tolerance of SSD based on the sensitivities of JPEG image data.
    • Fan et al. [ 12 ] proposed an adaptive-length image coding scheme based on the reliability of NAND flash memory to improve the density and reduce the total cost of JPEG-based storage.
    • However, due to the special properties of DNA storage (such as error propagation discussed in Section 3), those technologies cannot be efficiently applied in image-based DNA storage.

In this paper, we proposed a new scheme to efficiently store JPEG-based images in DNA storage, called IMG-DNA, which combining the special properties of DNA storage and images.

  • First, we propose a new DNA architecture by
    adding barriers to prevent error propagation in DNA storage. The barrier design is based on the encoding scheme and biochemical constraints in DNA storage, and thus it can be feasibly applied to DNA storage.

  • Then, according to the error sensitivities of coefficients in JPEG images, we separately store those coefficients in different DNA sequences, and proper internal index is designed to associate these coefficients with their corresponding images.

  • Moreover, to further improve the error tolerance of DNA storage, separate barrier schemes are applied to the different coefficients of JPEG-images according to their characteristics.

  • Finally, the IMG-DNA scheme improves the error tolerance capability compared to prior work.


Section 2: the background of DNA storage and JPEG- based images.
Section 3: demonstrates the motivation behind this work.

Section 4: introduces the implementation of
IMG-DNA scheme.

Section 5: shows the experimental results of IMG-DNA and compare them with some prior work.

Section 6: draw a conclusions

2 Background

  • 2.1 DNA Storage
    For DNA storage, as shown in Fig. 1, there are
    four major processes in the DNA storage system: encoding, synthesis, sequencing, and decoding.
  • The encoding and synthesis are the processes to write digital data into DNA storage.
  • The sequencing and decoding are the processes to read data out from DNA storage.
Figure 1:Basic steps of DNA storage.
Figure 2: DNA format within the same primer pair.
Figure 3: A rotating encoding to nucleotides avoids homopolymers (repetitions of the same nucleotide), which are error-prone.

  • 2.2 Fundamentals of JPEG-based Image
Figure 4: The encoding process of JPEG-based image.

3 Motivation

  • There are two properties of image-based data.

    • One is the large volume since image-based data type is explosively increased and generated from social media such as Facebook, Snapchat, and Instagram. The large volume increases the demand for high-capacity storage systems.
    • The other property of image-based data is fault tolerance. Many existing studies investigated images in approximate storage such as flash memory or non-volatile memory to reduce the storage overhead or improve the robustness of images [ 11 ][ 12 ][ 25 ].
  • Therefore, these two properties of images are well fitted to the DNA storage system, which is error-prone and has an extremely high density.

  • However, none of the previous studies focused on the robustness of images in DNA storage.
    Although some existing studies [ 8 ][ 9 ][ 14 ] implemented image-based binary data in DNA storage, all of them focused on the feasibility of DNA storage for different types of data including images.

Therefore, how to efficiently store images and improve the robustness of images in DNA storage is a critical and interesting research issue.


Moreover, due to the biochemical constraints as mentioned before, we should avoid long homopolymers in DNA sequences, which cause high error rates.

Therefore, the state-of-the-art encoding scheme [9][13][8][16] uses a rotating code manner as shown in Fig. 3 to avoid long homopolymers.


However, the rotating manner induces a phenomenon called error propagation (DNA-level).

In DNA storage, there are three types of errors (i.e., insertion, substitution, and deletion).

Any nucleotide changed in the middle of one DNA
strand has a significant influence on its following nucleotides.As a result, all subsequence data cannot be correctly read out due to one nucleotide error.

Figure 5: An example of deletion error for error propagation. The deletion error causes that one nucleotide ‘C’ is missed and all subsequence are decoded wrongly.

Additionally, although prior work [ 11 ][ 12 ] has mitigated the image-level error propagation for traditional storage devices, these techniques are not efficient for the DNA-level error propagation due to the special encoding manner in DNA storage.

Therefore, there is a need to find a proper architecture preventing such error propagation in DNA storage.

4 Aligorithm Design

In this section, we introduce the proposed IMG-DNA scheme to improve the robustness of image-based DNA storage.

The IMG-DNA scheme has three major optimizations compared to previous work.

  • The first one is that the separation of DC and AC coefficients at the DNA level.
  • The second one is ‘adding barrier’ to both AC and DC coefficients.
  • The third one is that the DC and AC coefficients are applied with different densities of barriers.

4.1 AC/DC Coefficient Separation at DNA Level

According to JPEG-based images, the DC and AC coefficients have different sensitivities to errors.

  • There are two reasons. One is that the DC coefficients have higher resolution than the AC coefficients during the quantization process since the human visual system is more sensitive to DC coefficients (low-frequency part).

  • The other reason is that the DC coefficients use differential pulse code modulation coding (DPCM) to compute its consecutive DC value for the next 8*8 block
    ( which is called error propagation phenomenon [ 11 ].)

4.2 Adding ‘Barriers’

We can use some ’barriers’ to prevent such error propagation.

However, the barriers are not easy to be added since the barriers should be distinguished from the payload.


The constraint of the number of consecutive identical nucleotides is less than four nucleotides [ 16 ].

The reason of using the barrier pattern based on ‘A’ is that the ‘AA’ and ‘AAA’ patterns have higher sequencing accuracies than other three types of patterns [ 26 ].

Since we predefine the PL, an error will be prevented
within its partition and will not affect other partitions.
So, one error causes the maximum errors within a partition size (i.e., PL).


Figure 6: Definition of Partition Length (PL) and Barrier Window (BW).

Moreover, substitution, deletion, and insertion errors may generate a pattern which is the same as the barrier pattern (i.e., AA). Misclassification of the barrier patterns may mitigate the effect of error propagation prevention.

To mitigate such a scenario, we propose another mechanism called Barrier Window (BW), which means that only a pattern ’AA’ locating in a barrier window will be regarded as a true barrier if detected. The pattern ’AA’ outside the barrier window will be treated as a normal payload. If there is no barrier pattern in the BW, it means that an error happens on the barrier pattern and changes the barrier to other patterns.

Thus, we increase the number of partitions by one and use the next barrier pattern to prevent the error propagation. In this case, the error will be propagated to the next partition but we can prevent it in two partitions.

We set BW to 12 bp, which means 5 bp on both left and right sides of the barrier pattern.

Since the probability of errors happening on the barrier pattern is much lower than that of other payloads (25 times less if the fixed-length takes 50 bp), it is highly possible that the error propagation can be prevented in one or two partitions.

Therefore, with adding barriers and Barrier Window, the IMG-DNA scheme can significantly prevent the error propagation in image-based DNA storage.

4.3 Asymmetric Barriers for AC/DC Coefficients

Therefore, there is a trade-off between the robustness and encoding density of DNA storage.


Since DC coefficients are more sensitive to errors, we assign a smaller PL value to DC coefficients to increase the robustness of DNA storage. Meanwhile, since AC coefficients have much larger size than that of DC, a larger PL value for AC coefficients can reduce the overhead of the scheme of adding barriers.

Moreover, compared with the data sizes of DC and AC coefficients in JPEG images, normally DC coefficients only occupy 4% - 8% of one image size [ 12 ].
Thus, the increased overhead of adding more barriers in DC coefficients is limited. According to our experiments, the capacity overheads of barriers in DC and AC coefficients are about 0.28% and 3% of image size with PL=20 bp for DC and PL=50 bp for AC, respectively.

4.4 Image-based DNA Storage Architecture

  • After JPEG-based encoding, we first separate DC and
    AC coefficients.

  • After that, the digital sequence will be encoded into DNA sequence.

  • Then, we insert the barrier pattern ’AA’ into DC and AC coefficients separately.

  • According to the different error sensitivities of DC and AC, we use different partition lengths for AC and DC DNA strands.

  • After inserting barriers, long DNA sequences will be chunked into small segments with a fixed length (e.g., 200 bp).

  • Then, each segment with a primer pair and its corresponding
    internal index are assembled. The internal index contains the type and address offset of payload.

image.png

4.5 Feasibility Discussion

In this study, the experiments are based on simulation.


A set of design rules for synthesis and sequencing efficiency are used based on commercial design rules [ 27 , 28 ] and previous studies [ 14 , 15 , 8 ]:

  • One is that the absence of long homopolymers (less than four nucleotides) [ 15 , 16 ].
  • The second one is that GC contents in a sequence should be around 40% - 60%.
  • The third one is DNA strand length smaller than 1000 bp.

5 Experimental Results

The default DNA strand length is 250 bp including a primer pair, payload, and index (the DNA strand lengths may be changed slightly due to the length of the index).

Two previous studies are used as baselines:

  • One is the straightforward implementation of DNA storage [ 8 ][ 14 ] (denoted by Raw-DNA ).
    (【8】Random access in large-scale dna data storage.
    【14】A dna-based archival storage system)

  • The other one is the approximate image schemes with the digital-level optimization [ 11 ] (denoted by Approx-IMG ).
    (【11】 Long-term jpeg data protection and recovery for nand flash-based solid-state storage.该文章用于磁盘存储中的jpeg文件优化))

  • We use 500 JPEG-based images from the ImageNet image dataset[29] for all experiments.

  • The error model is extracted from error distributions of the wet-lab DNA storage implementation [8].

5.1 Robustness of Image-based DNA System

We investigate the robustness of image-based DNA systems for the ImageNet data set by manually injecting errors from 0.1% to 2% following the distribution of the error model [ 8 ].

Figure 8

The AC and DC coefficients use the same DNA
strand length (250 bp) and different partition lengths (50 bp and 20 bp respectively).

  • Compared with these three schemes, the proposed IMG-DNA achieves a much higher fault tolerance ability than the other two with different error rates (2.6x -19.7x SSIM improvement).
    The reason is that the IMG-DNA considers both digit- and DNA- level error propagation prevention for images.
    Moreover, a graphic view of an image with different encoding schemes are used in Fig. 9.
    We can find that the proposed IMG-DNA achieves much close vision to the original picture and the other two baselines are vague on parts of the picture.
Figure 9

In summary, without any optimization, the propagation error in DNA storage can cause image corruption even with a small error in AC or DC data.
By using the proposed scheme, the error propagation is mitigated and finally results in little image quality degradation. The proposed scheme can assist those protection schemes and reduce the overhead of those ECCs (error-correction codes). For example, originally one error can propagate to ten errors.
Thus, to recover all those ten errors, a strong ECC should be used.
With the prevention of the error propagation, the ECC only needs to recover this one error.

5.2 Effect of Different Partition Lengths

We compare SSIM values by varying the PL values for AC and DC coefficients.

The IMG-DNA scheme with no barriers is normalized to 1 (denoted by ‘No Barrier’)


Figure 10

As shown in Fig. 10, as the PL values increase for AC and DC, the SSIM is decreased.
In other words, the robustness of image-based DNA storage becomes lower with larger PL values. Moreover, when injecting error rates increase, the SSIM difference between PL values become larger.

Therefore, for the cases with high error rates, it is always better to use a smaller PL value to prevent errors while may introduce a higher capacity overhead.

5.3 Errors on Different Coefficients

In this subsection, we investigate the effect of error injection on different coefficients.

As indicated in Figure 11, for all three schemes, as injecting different errors on AC and DC coefficients, injecting errors in DC coefficients has a much larger effect on the quality of images than that of AC coefficients. (模拟实验,直接往AC、DC部分引入错误)

Figure 11

If we compare different schemes, we can find that the proposed IMG-DNA can support much higher SSIM values for both AC and DC coefficients than the other two schemes. The reason is that the proposed scheme of adding barriers prevents error propagation in DNA storage and thus improves the overall robustness of image-based DNA storage systems.

5.4 Overhead Discussion

Overhead Discussion In this subsection, we discuss the overhead of the IMG-DNA scheme.the computation overhead of both schemes is much similar. IMG-DNA can introduce some capacity overhead
due to adding extra barriers.

5.4

In this subsection, we discuss the overhead of the IMG-DNA scheme.

As shown in Figure 12, we investigate the effect of adding barriers on the encoding density with different partition lengths.

Figure 12

6 Conclusion

In this paper, we proposed a new scheme to efficiently store JPEG-based images in DNA storage, called IMG-DNA with a new DNA architecture by adding barriers to prevent error propagation in DNA storage.

The barrier design is based on the encoding scheme and biochemical constraints in DNA storage.

Then, we separately store DC and AC coefficients in different DNA sequences and use separate barrier schemes to further improve the error tolerance of DNA

Finally, the IMG-DNA scheme improves SSIM value by 2.6x - 19.7x and thus enhance the robustness of DNA
storage compared to existing work.

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 219,366评论 6 508
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 93,521评论 3 395
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 165,689评论 0 356
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 58,925评论 1 295
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 67,942评论 6 392
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 51,727评论 1 305
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 40,447评论 3 420
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 39,349评论 0 276
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 45,820评论 1 317
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 37,990评论 3 337
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 40,127评论 1 351
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 35,812评论 5 346
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 41,471评论 3 331
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 32,017评论 0 22
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 33,142评论 1 272
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 48,388评论 3 373
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 45,066评论 2 355