In 1862, Gregor Mendel bred pea plants to study inheritance. Fast forward 100 years to 1962, when James Watson, Francis Crick, and Maurice Wilkins were awarded a Nobel Prize for discovering the structure of DNA. Today, advances in this field are spilling over into the most unlikely places.
As we enter the century of biotechnology, our ability to read, write, and edit DNA is disrupting everything from human health to manufacturing. The next disruption to take place could be in the world of data storage.
Tech giants including Facebook and Amazon and their millions of users generate petabytes of data on the Internet every second. Microsoft has been quietly working in the background to store this information in As, Ts, Cs, and Gs instead of 0s and 1s.
“Think of compressing all the information on the accessible Internet into a shoebox,” says Karin Strauss, a principal researcher at Microsoft. “With DNA data storage, that’s possible.”
Strauss is working with Luis Ceze, a professor of computer science and engineering at the University of Washington, to wield DNA for data storage and computing. Using synthetic DNA molecules, the team has successfully stored over one gigabyte of readable information, including various forms of media such as the top 100 books from Project Gutenberg, a high-definition OK Go music video, and the #MemoriesInDNA project.
The information density of DNA is remarkable — just one gram can store 215 petabytes, or 215 million gigabytes, of data. For context, the average hard drive in a laptop can house just one millionth of that amount.
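The unit conversion behind those figures can be checked with a few lines of arithmetic. This is only a sketch: it assumes decimal (SI) units, and the variable names are illustrative.

```python
# Decimal (SI) storage units, expressed in bytes (an assumption of this sketch).
GB = 10**9
PB = 10**15

# Density figure quoted above: 215 petabytes per gram of DNA.
bytes_per_gram = 215 * PB

# Petabytes to gigabytes: 215 PB is 215 million GB.
gigabytes_per_gram = bytes_per_gram // GB

# One millionth of a gram's capacity, the rough laptop-drive comparison.
laptop_drive_gb = gigabytes_per_gram // 1_000_000

print(gigabytes_per_gram)  # 215000000
print(laptop_drive_gb)     # 215
```

In other words, the comparison assumes a laptop drive of roughly 215 gigabytes.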
“We encode all data at a molecular level, making it as small as possible, and store it in a medium that will last for quite a while and not become obsolete, like floppy disks, because of its eternal relevance for life,” says Strauss.
The Rise of DNA Data Storage
|Year|Milestone|Team|Size|
|---|---|---|---|
|1988|“Microvenus”|Joe Davis with Harvard and UC Berkeley|28 base pairs|
|2011|Encoding 70 billion copies of Regenesis: How Synthetic Biology Will Reinvent Nature and Ourselves in DNA|Church Lab @ the Wyss Institute|5.27 MB|
|2016|OK Go Music Video, top 100 books in Project Gutenberg, and more|Microsoft Research/UW|Over 1 GB|
|2018|Storing and retrieving information with template-free polymerase|Molecular Assemblies|150 base pairs|
|2019|Writing “hello” with fully automated end-to-end DNA data storage and computing|Microsoft Research/UW|5 bytes|
|2019|Encoding all of Wikipedia (in English)|Catalog|16 GB|
|2019|Long Oligonucleotides|Twist Bioscience|300 base pairs|
Improved techniques for reading and writing DNA, including longer usable strands, have rapidly increased the amount of data that can be stored in DNA.
In addition to pioneering high-density data storage, Ceze and Strauss have also conducted a similarity search between images using DNA, and recently created the first fully automated DNA storage system that runs from writing to reading.
“We’re trying to make computers better with a systematic approach that finds great alternatives and solutions in nature,” says Strauss. “The computational approaches facilitated by working with DNA make it an even more attractive option for data storage,” adds Ceze. “We have the freedom to choose how to map bits to DNA sequences, creating redundancy and high tolerance to error when reading and writing DNA.”
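One simple way to picture the redundancy Ceze describes is to read the same strand several times and take a majority vote at each position. The sketch below is an illustration under stated assumptions, not the team's method: the `consensus` helper and the example reads are hypothetical, and it handles only substitution errors, not insertions or deletions.

```python
from collections import Counter

def consensus(reads):
    """Return the per-position majority base across equal-length reads."""
    return "".join(
        Counter(column).most_common(1)[0][0]
        for column in zip(*reads)
    )

# Three hypothetical reads of the same strand, each with one substitution error.
reads = [
    "ACGTACGA",  # last base misread
    "ACGTTCGT",  # fifth base misread
    "ACCTACGT",  # third base misread
]

print(consensus(reads))  # ACGTACGT
```

Because each error appears in only one of the three reads, the vote recovers the original sequence.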
How does this technology work? It’s surprisingly simple. Data is first translated from a code of 0s and 1s to As, Ts, Cs, and Gs. This genetic code is then synthesized into an actual molecule (with the help of Twist Bioscience for the Microsoft Research-UW team), and the “encoding” process is complete.
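The translation step can be sketched with a fixed two-bits-per-base table. The mapping below (00 to A, 01 to C, 10 to G, 11 to T) is a hypothetical choice made for illustration; the article does not specify the team's actual code, and practical schemes typically add constraints and error correction that this sketch omits.

```python
# Hypothetical mapping chosen for illustration: two bits per nucleotide.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def encode(data: bytes) -> str:
    """Turn bytes into a string of bases, two bits at a time."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(strand: str) -> bytes:
    """Invert encode(): map each base back to two bits, then regroup into bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

strand = encode(b"hello")
print(strand)          # CGGACGCCCGTACGTACGTT
print(decode(strand))  # b'hello'
```

With this mapping, each byte becomes four bases, so the five-byte message "hello" becomes a 20-base sequence that a synthesis provider could, in principle, turn into a physical molecule.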
Retrieving data is a bit more complex. Two steps, “processing” and “decoding,” must occur. Simulating random-access memory (RAM), a polymerase chain reaction (PCR, a common laboratory protocol for copying DNA) homes in on a targeted section of the sequence, which is then replicated, sequenced, decoded, and adjusted for errors to retrieve the original data. This targeted approach is efficient because it involves only the desired sequence, rather than the entire dataset.
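The random-access idea can be mimicked in software by treating a short prefix on each strand as an address. Everything in this sketch is an assumption made for illustration: the four-base primers, the single index base that orders chunks, the two-bits-per-base mapping, and the `retrieve` helper. Real PCR selects strands chemically rather than by string matching, and real primers are longer.

```python
BASES = "ACGT"  # hypothetical mapping: A=00, C=01, G=10, T=11

def decode(payload: str) -> bytes:
    """Map each base back to two bits and regroup the bits into bytes."""
    bits = "".join(f"{BASES.index(base):02b}" for base in payload)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

def retrieve(pool, primer):
    """Fetch one file from a mixed pool of strands."""
    # Stand-in for PCR: only strands that begin with the primer are "amplified".
    selected = [s[len(primer):] for s in pool if s.startswith(primer)]
    # The first base after the primer is a hypothetical chunk index.
    selected.sort(key=lambda s: BASES.index(s[0]))
    return b"".join(decode(s[1:]) for s in selected)

# Unordered pool holding two files; layout is primer + index base + payload.
pool = [
    "TTAG" + "C" + "CGGT",  # file 2, chunk 1
    "ACGT" + "C" + "CGGC",  # file 1, chunk 1
    "TTAG" + "A" + "CGTT",  # file 2, chunk 0
    "ACGT" + "A" + "CGGA",  # file 1, chunk 0
]

print(retrieve(pool, "ACGT"))  # b'hi'
print(retrieve(pool, "TTAG"))  # b'ok'
```

Only the strands carrying the requested primer are touched, which mirrors why the targeted approach avoids sequencing the entire dataset.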
The rise of DNA data storage, previously the stuff of science fiction, is being made possible by advances in biotechnology, particularly improvements in high-throughput DNA sequencing and synthesis. Also, because these bio-programmers control what materials enter their experiments, and their sequences do not need to be meticulously engineered to function within a living organism, there are fewer overhead costs compared to typical life science experiments. The journey has not been without roadblocks, however. Despite dramatic improvement, working with DNA can be slow and expensive. Further streamlining is still needed.
“Automation was, and is, one of our biggest challenges,” says Strauss. “It was great to have our first proof of concept converting information from bits, to DNA, and back to bits to prove that it was possible and also show what are our other challenges in automation, but some of the biotechnology aspects are quite new to some of us, so we’ve also been learning a lot there. The other significant challenges are continuing to increase throughput and decrease the cost for DNA sequencing and synthesis. There’s quite a bit of engineering left to get [us] to where we need to be.”
The interdisciplinary Microsoft and UW team sees value in its diverse background. “It’s extremely exciting that this is at the intersection of biotech and computing,” says Ceze. “These areas have been feeding off each other.”
“If the technology continues to advance the way we see it right now,” he says, “I think it’s conceivable that we will see DNA storage as a form of archival for the general public within the decade.”