The hour is late on a Friday night. You’re fast asleep, exhausted from many long hours in the laboratory. You and your colleagues at a startup biotech firm were finally able to get that automated assembly line up and running. Now, your lab is churning out thousands of microorganisms capable of producing fragrances and other consumer products. Your firm is set to make millions in revenue by the end of the year. You thank your lucky stars that you signed up for stock options like many of your colleagues when you took the job.
There’s a loud knock at the front door. Startled awake, you jump out of bed to see who might be visiting this late at night. You open the door to find two police officers serving you with a warrant for your arrest. A pit forms in your stomach. Not having much of a choice, you allow the officers to take you into custody and travel downtown to the police station. As you’re booked on charges of first-degree murder, you’re shocked to learn that the district attorney has identified you as the primary suspect in the violent death of a colleague. You’d been competing with him to develop a new microbe capable of producing consumer products, and the two of you were just about to unveil your individual creations. The winnings for the most lucrative microbe would include a major promotion and additional stock options. Your heart sinks at the terrible news; you know your intense dislike of him was well known to your other colleagues.
As the primary evidence in the case, the police found your DNA at the crime scene and on the body of the victim. The detective considers the presence of your DNA along with signs of a violent struggle to be the smoking gun for your role in his murder. Whilst at the police station, you suddenly realize you left your smartphone at work, and don’t have an immediate way to prove you weren’t at the lab at the time of the murder. You’d spent Friday evening at home alone watching a movie on a streaming service, oblivious to your missing smartphone. When you tell the detective, he doesn’t believe you accidentally misplaced your phone for that long. Its discovery at the crime scene will be cataloged as further evidence of your involvement.
After you spend a long night in jail, the judge denies you bail. That means you’ll spend the duration of your murder trial in prison, trying to stay alive amongst real criminals. Your lawyer says the time stamps on your streaming data aren’t enough to save you from staring down a jury of peers in the courtroom. Such evidence is circumstantial at best. The prosecution thinks it has an airtight case against you given your DNA at the crime scene, your access to the lab, and your clear motive for murder. If your lawyer can’t find any evidence to prove your innocence, you’re likely to spend many years in prison for a crime you didn’t commit.
Think this can’t happen to you? Think again.
Until the early 1990s, the world’s biodata existed solely in physical form: in stacks of patient files on a desk, lines of filing cabinets in a doctor’s office, rows of medical books filling library shelves, and racks of tubes in a laboratory. With the digital revolution, slowly but surely, more types and greater volumes of biodata, including human genomes, gene sequences, DNA from living organisms, and other human health-related information, are becoming digitized, i.e., converted into binary code that can be read, processed, and transmitted by computers and other electronics. The conversion of DNA from physical molecules into digital data will not only revolutionize medicine; it will also alter how we interact with the world around us. It will change how we think about our own DNA and the extent to which it offers proof of our identity or whereabouts. Before your lack of attention to the security of biodata turns your life into the stuff of a crime novel, you should probably consider the following three critical dimensions of biodata. Understanding each of them will be essential not only to protecting national security, but also to keeping you out of prison in the event you’re someday accused of murdering your colleague.
THE DIGITAL-TO-PHYSICAL DIMENSION
For most of the 20th century, DNA existed only in physical form — that is, as part of living organisms or as samples contained in test tubes within a laboratory environment. From 1990 to 2003, the Human Genome Project (HGP), funded by the US government, spurred rapid advancement of technologies for the reading and writing of DNA. Whereas gene sequencing technologies “read” a strand of DNA and convert the base pairs into binary code that can be processed by a computer, gene synthesis technologies use the digital code of a gene sequence to “write” or produce physical samples of DNA. Using gene synthesis, scientists can now transform two-dimensional digital data stored on a computer into three-dimensional physical material that exists in the world and forms the basis of living organisms. This digital-to-physical dimension represents a profound game-changer for science and the world we live in, and we are only beginning to see the preliminary effects.
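To make the idea of reading DNA into binary code concrete, here is a minimal sketch in Python. The two-bit mapping is an assumption chosen purely for illustration; it is not a claim about how any particular sequencer or file format stores its output, and the function names are hypothetical.

```python
# Illustrative only: an assumed 2-bit code for the four DNA bases.
BASE_TO_BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def encode(sequence: str) -> str:
    """'Read' a DNA string into a string of binary digits."""
    return "".join(BASE_TO_BITS[base] for base in sequence.upper())


def decode(bits: str) -> str:
    """Turn binary digits back into the base sequence a synthesizer would 'write'."""
    if len(bits) % 2 != 0:
        raise ValueError("bit string must contain two bits per base")
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


digital = encode("GATTACA")
recovered = decode(digital)
print(digital, recovered)
```

The round trip is the point: once the sequence exists as bits, it can be copied, transmitted, or stolen like any other file, and later turned back into a design for physical DNA.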
Today, scientists can “write” or “print” DNA from digital data with relatively few restrictions. Since the reading of the first whole human genome in 2003, both the cost and the technical hurdles of gene sequencing and gene synthesis have steadily decreased over time, making both technologies more accessible and increasing the number of possible applications. Gene synthesis technologies have progressed so far that researchers no longer need to send a digitized sequence to a large, centralized biomanufacturing plant for production. Instead, scientists can now purchase a benchtop synthesizer, similar to an at-home 3D printer. Today, it is fairly common for biotech firms to have in-house options to carry out the task of gene synthesis. That said, the length of the gene sequence or genome in question continues to serve as a barrier to what scientists are capable of doing today — the longer the genome synthesized, the greater the inaccuracies in the synthetic DNA.
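Why length matters can be sketched with simple probability. The snippet below assumes, purely for illustration, that each base is written incorrectly with the same small, independent probability; the error rate used is a made-up value, not a measured figure for any real synthesizer.

```python
def error_free_probability(length: int, per_base_error: float) -> float:
    """Chance that every base in a synthetic sequence of this length is correct,
    assuming independent errors at a fixed per-base rate (an assumption)."""
    return (1.0 - per_base_error) ** length


ASSUMED_ERROR_RATE = 0.001  # hypothetical: one mistake per thousand bases

for length in (100, 1_000, 10_000):
    chance = error_free_probability(length, ASSUMED_ERROR_RATE)
    print(f"{length:>6} bases: {chance:.4f} chance of a perfect copy")
```

Under these assumptions, the chance of a flawless copy shrinks geometrically as the sequence grows, which is one way to see why long genomes remain harder to synthesize than short genes.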
The profound implications of advances in synthetic biology can sound a bit like science fiction at times, but many new developments for the good of humanity could be right around the corner. On the positive side, gene sequencing technologies have become a huge asset in detecting and understanding the source of human diseases, many of which are caused by errors in genomes or by viral or bacterial infection. Meanwhile, scientists are using gene synthesis technologies to deepen their understanding of living organisms and are moving rapidly through the engineering design-build-test cycle, creating microorganisms that produce consumer goods that improve human life. For example, scientists can now engineer yeast to produce artemisinin, a valuable biochemical used in anti-malarial drugs, instead of harvesting it from rare plants.
On the negative side, the National Academies of Sciences, Engineering, and Medicine published a report in 2018 in which experts determined the re-creation of pathogenic viruses to be one of the top risks of synthetic biology. Instead of having to gain access to highly secured samples of smallpox in Russia or the United States, nefarious actors need only acquire the digitized genome of the virus. Using this digital information, they could synthesize smallpox DNA and boot it up in a cell for replication.
The potential for such a scenario might not be too far off. In 2017, scientists at the University of Alberta in Canada reported that they had assembled the genome of horsepox, a close relative of smallpox. The project, which entailed ordering DNA sequences of the virus from a biomanufacturer by mail and then stitching them together, took around six months and cost about $100,000, a price tag considered rather cheap by scientific standards. Unlike for viruses, major technological barriers continue to exist for the re-creation of a bacterium from scratch.
We now return to your hypothetical plight. Perhaps something can be done to get you out of prison based on what we’ve learned about the digital-to-physical dimension of biodata.
During your first consultation with your criminal lawyer, you explain how you must have been framed to take the fall for the murder by a fellow colleague. Many of your coworkers are skilled scientists and capable of printing copies of your DNA, assuming they gained access to your digital genome. They are well aware of your dislike for the victim, and everyone knows he was the frontrunner for winning the microbe competition. Each of them would equally benefit from the cancellation of your dead colleague’s non-transferable stock options and your failure to gain the promotion.
As you propose your alternate theory for the crime, your lawyer gives you a skeptical look. You attempt to explain further, but only make it worse. You mention there are several DNA synthesis machines at your laboratory that could have been used by anyone working there. It would be easy to identify other potential suspects for the murder based on the record log of people using the machine at the lab. You insist a colleague must have accessed your private health records containing your genome, and printed and planted your DNA on the crime scene. You demand to know if the police found any other DNA at the scene besides your own. When your lawyer informs you that the DNA of only one other individual was found at the crime scene, your hopes rise for a brief moment. Sadly, it was the DNA of the victim himself. Your lawyer agrees to take your off-the-wall theory back to the district attorney for consideration, but she warns you to manage your expectations. You’re likely still on the hook for this murder.
THE PRIVACY DIMENSION
Over the past decade, regular sequencing of human genomes has demonstrated the many genetic similarities across the world’s population. Despite the three billion base pairs found in the human genome, every individual is 99.9% identical to every other human on Earth. Only a tiny fraction of the human genome is unique to each person, which suggests that small genetic differences carry enormous power of variation. Since the cost of gene sequencing has dropped substantially, many more people are having their genomes sequenced each year. Some people are doing so for medical reasons, but many more are sending their DNA samples to consumer sequencing companies such as 23andMe or Ancestry to learn more about their family history. Most of these individuals are completely unaware of the privacy risks that lurk beneath their decision to get their DNA tested.
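Taking the two figures above at face value (three billion base pairs, 99.9% shared), a quick back-of-the-envelope calculation in Python shows how much room that "tiny bit" still leaves. The variable names are illustrative.

```python
GENOME_BASE_PAIRS = 3_000_000_000   # figure quoted in the text
SHARED_PER_THOUSAND = 999           # 99.9% identical, expressed per thousand

# Integer arithmetic avoids floating-point rounding on such large numbers.
differing_positions = GENOME_BASE_PAIRS * (1000 - SHARED_PER_THOUSAND) // 1000

print(f"{differing_positions:,} base pairs may differ between two people")
```

On these numbers, roughly three million positions can vary from one person to the next, which is more than enough to single out an individual.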
There are many important medical reasons for individuals to get their whole genome sequenced, and doctors will increasingly encourage patients to do so in the future. Studying the results of such tests, doctors will be able to get a complete picture of a patient’s potential medical issues, allowing for preventative action for a variety of illnesses. Certain known mutations increase the risk for cancer, such as the BRCA mutation. Women who learn in advance that they carry this mutation can opt to have a preventative mastectomy or hysterectomy, reducing their chance of getting breast or ovarian cancer. Screening for many diseases at once represents a key advantage of whole genome sequencing — it allows doctors to develop a preventative health plan for their patients. Additionally, because a person’s genome does not change over the course of their lifetime, testing centers can hold on to test results and inform individuals of new insights based on ongoing genetics research.
However, finding out more about family ancestry represents the more popular reason for individuals to have their genomes sequenced at the current time. Consumer sequencing companies such as 23andMe and Ancestry offer cheap DNA test kits that can be easily performed at home and submitted back to the company by mail. In return, these individuals receive comprehensive DNA profiles with rough estimates of their ethnic and geographic ancestry based on their genetics.
Hidden in the fine print, however, these individuals have agreed that the DNA samples submitted for testing no longer belong to them. Consumer-based DNA testing does not enjoy the same protections afforded to whole genome sequencing performed for medical reasons. In the latter case, the DNA profile becomes part of an individual’s medical record protected under US law. In contrast, under the terms of service signed by individuals who turn over their DNA to consumer sequencing companies, the company owns the sample indefinitely and can sell it to third parties. In 2018, 23andMe sold exclusive rights to mine data within its five million DNA profiles to GlaxoSmithKline for $300 million to assist the major pharmaceutical company with new drug discovery.
In the United States, there are no legislative protections to guard against the misuse of an individual’s genome unless it is part of a medical record. The Health Insurance Portability and Accountability Act of 1996 (HIPAA) protects any genetic information contained within medical records; for example, such data cannot be provided to an employer. Under the Genetic Information Nondiscrimination Act of 2008 (GINA), health insurance companies are not allowed to charge more money to individuals with a genetic predisposition for certain conditions. However, other institutions such as schools, universities, senior citizens’ homes, landlords, or mortgage lenders could gain access to such information to help reach decisions. Other companies, such as life insurance or disability insurance companies, can deny coverage based on genetic information. These implications were vividly portrayed by Hollywood in the 1997 classic film Gattaca, in which a young man with a genetic predisposition is held back from his dream job of becoming an astronaut.
Consumer sequencing companies may attempt to defend their decisions to allow a major pharmaceutical company to mine their aggregated data by claiming the database does not reveal identities and the data is therefore anonymized. However, scientists claim they can identify an entire population from a small base of samples. This has to do with the massive volumes of personal data that individuals share about themselves over the Internet. With just three additional pieces of basic information (zip code, date of birth, and gender), an individual’s full identity can be manually confirmed from public records. This is due to the uniqueness of seemingly innocuous pieces of information when they are combined. Fast forward to an era dominated by biotechnology and machine learning algorithms, and there will be nowhere for anyone to hide. The more we understand about how our DNA controls who we are, the harder it is to anonymize that data. Genetic data that is currently made available, whether in public, anonymized databases or flowing freely between private companies without any oversight, presents a major risk to privacy in the future.
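The re-identification point can be illustrated with a toy example. The records below are entirely made up, and the helper name is hypothetical; the sketch only shows how combining a few harmless-looking fields can make a record one of a kind in a dataset.

```python
from collections import Counter

# Fabricated records for illustration; no real people or data sources.
records = [
    {"zip": "20001", "dob": "1980-02-14", "gender": "F"},
    {"zip": "20001", "dob": "1980-02-14", "gender": "M"},
    {"zip": "20001", "dob": "1975-07-01", "gender": "F"},
    {"zip": "20002", "dob": "1975-07-01", "gender": "F"},
    {"zip": "20002", "dob": "1990-11-30", "gender": "M"},
    {"zip": "20002", "dob": "1990-11-30", "gender": "M"},
]


def unique_fraction(rows, keys=("zip", "dob", "gender")):
    """Share of rows whose combination of the given fields appears only once."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    unique = sum(1 for row in rows if counts[tuple(row[k] for k in keys)] == 1)
    return unique / len(rows)


print("zip only:          ", unique_fraction(records, ("zip",)))
print("zip + dob + gender:", unique_fraction(records))
```

In this toy dataset nobody is unique by zip code alone, yet most records become unique once date of birth and gender are added. A unique record in an "anonymized" dataset can then be matched against any public list that carries the same three fields plus a name.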
The plot thickens with the rise of online genealogy databases. A small subset of the more than 26 million people who have taken at-home DNA tests and received their DNA profiles have uploaded their results to a public database. Many of these individuals upload their genetic profiles to free databases such as GEDmatch without understanding the privacy implications for themselves and their distant family members. For one, hackers may gain access to information on relatives who did not wish to publicly release their data, exposing those relatives to the ultimate identity theft. In a recent data breach, hackers changed the settings of one million users so that their profiles in the genealogy database were opened to searches by law enforcement agencies.
Over the last few years, law enforcement agencies have been using genealogy databases to help solve criminal cases, especially those that have gone cold. Even if an individual did not directly upload their DNA profile to a publicly available database, DNA profiles from distant relatives can be used to identify potential suspects. To solve a crime, police officers can now do familial DNA searches, which allow them to search both public and police DNA records for a partial match from a crime scene. By uploading profiles for DNA found at crime scenes, detectives can match them to profiles of distant relatives that might help them identify their suspect. For example, in 2018, police finally discovered the identity of the serial killer and rapist known as the Golden State Killer. By identifying several of his distant relatives, they were able to narrow the suspect list and make an arrest. He was sentenced to life in prison.
Turning back to your criminal case, let’s see if the privacy dimension can help you out. After several weeks of racking your brain for a way to clear your name, you remember reading about police officers identifying suspects by searching online genealogy databases. You request another meeting with your lawyer. But when she arrives, there’s some more bad news about your case.
Your lawyer requested the log from your biotech firm. Based on the ID swipes, no one else at the laboratory used the machine on the day of the murder or even up to a week prior. You suggest that one of your colleagues must have either figured out a way to delete the log entry or to gain access to the gene synthesis machine over the network without physically being in the lab. Or perhaps, they’ve been planning the crime for months and printed your DNA well in advance of committing the murder. Feeling desperate, you argue that it could have just as easily been an anonymous hacker gaining access to the machine via a phishing attack or by encoding malware on DNA sequences. Once in the network, the hacker could have remotely synthesized your DNA and retrieved the sample back at the laboratory.
Shifting the topic, you ask your lawyer to run a familial search in an online genealogy database on the victim’s DNA. You know it’s a stretch, but perhaps the victim has a homicidal identical twin. That would explain why the real murderer left no identifying DNA behind. Your lawyer raises her eyebrow at all of your new theories and suggests you might have watched one too many Hollywood movies. That said, she agrees to take a look into them for you.
THE BIG DATA DIMENSION
Over the past decade, the life sciences have become increasingly data driven, a trend that promises to revolutionize the medical and public health fields. Despite the vital role of biodata, gene editing tools such as CRISPR-Cas9 dominate the headlines in the field of synthetic biology. However, in the absence of massive datasets containing biodata about living organisms, CRISPR acts like a hammer without a nail. To leverage the true potential of gene editing tools for improving human health and enabling precision medicine, scientists require “accurate and digitized knowledge about gene sequences and genomes of living organisms.” In short, they need access to reliable collections of big data and sophisticated machine learning tools to analyze massive volumes of biodata.
Whilst companies such as 23andMe and GlaxoSmithKline have clearly grasped the economic value of biodata, scientists and policymakers alike have failed to recognize both the strategic value and potential national security risks of large collections of biodata. As a result, they are squandering enormous opportunities and exacerbating the risks of biodata by failing to protect it.
Machine learning algorithms learn rules from patterns found in datasets and develop solutions to complex problems, including those related to biotech and health. Most types of machine learning tools depend on massive volumes of data for their initial training. Although the concept of big data is not a new one, the volumes and types of data that can be collected and analyzed now far exceed those of the first three decades of the digital revolution. Large collections of digital information have become increasingly powerful and valuable tools for predicting outcomes for human health and biotechnology.
In response to rapidly declining costs of sequencing and data storage, scientists have produced and stored growing volumes of biodata in online databases, accessible to anyone registered with an account. For example, GenBank, a national repository of gene sequence data overseen by the US National Institutes of Health, currently contains more than 650 billion base pairs from more than 200 million reported sequences. In an era of artificial intelligence, some experts consider big data collections like this to be the new oil: a strategic asset that will shape the world in unanticipated ways.
One of the greatest opportunities of big data for medicine and human health is precision medicine — i.e., the ability to tailor treatment to underlying genetic causes as opposed to treating surface level symptoms. However, DNA is not destiny. Genetics offers only a piece of the puzzle for influencing human health. Environment and lifestyle play important roles in how different genes are expressed in individuals. As precision medicine matures, doctors will be able to make use of more types of biodata including those from wearables which collect both health and lifestyle data — e.g., heart rate, movement, location. Using this information, doctors can not only assess and diagnose health problems, but they will also be able to predict future health issues and prescribe preventative measures.
The more biodata available across a large population, the more insights can be gained about human genetics and disease. As scientists are able to make better predictions about the impacts of genetics, lifestyle, and environment across an entire population, they will get closer to curing diseases and extending human life. In theory, the more biodata we have to improve human health, the stronger we become, that is, unless we fail to understand the incredible value of biodata and forget to protect access to it.
Malicious state and non-state actors can leverage the same big data as scientists to target specific populations and even particular individuals for harm. China is hoping to exploit the intersection of artificial intelligence and biotechnology for the battlefield. The Chinese government recognizes the strategic value of biodata and is currently collecting biodata with the aim of developing the world’s largest repository. To achieve this goal, China has bought up interest in sequencing companies in a bid to become a DNA superpower. Already in 2016, China owned more than half the world’s capacity for gene sequencing, including some lab capacity within the United States. The Beijing Genomics Institute (BGI) leads this effort and has formed partnerships with American companies, presumably giving the Chinese government access to biodata for American citizens. These early steps have positioned China to become a global leader in responding to the COVID-19 global pandemic and to leverage that role to gain access to even more biodata.
Perhaps the big data dimension can help you finally prove your innocence and get out of jail. Your lawyer visits you in prison and provides a report on the search results you requested from the genealogy database. Unfortunately, it’s not the answer you were looking for, but there is a small ray of hope. She informs you that although the victim does not have an identical twin, the database search turned up some familial matches that don’t make any sense. Your lawyer showed the results to the district attorney’s office and convinced the prosecution to investigate further. That led them to discover that the victim’s DNA closely matches people that are not at all related to him. The detective on the case is baffled and has taken new samples of DNA from the victim to run the tests again.
As you contemplate the shocking news, you remember reading an interesting article about the recipient of a bone marrow transplant. The individual worked for a police department and ran a long-term experiment on himself to test out a theory about his DNA. On a hunch, you ask your lawyer to request the district attorney compare the DNA profile of the victim to those profiles stored in the organ donor registry. You suggest that she also get the victim’s health records to see if your colleague had a bone marrow transplant. Your lawyer informs you that the district attorney will need a subpoena to gain access to that information, but she’ll ask the prosecution to consider filing one.
THE PROTECTION OF BIODATA IS CRITICAL
In theory, the more biodata scientists collect to improve human health, the stronger we become as a nation. That is, unless we fail to sufficiently protect that data from critical risks. The three dimensions of biodata (digital-to-physical, privacy, and big data) not only highlight its strategic value; they also demonstrate the growing risks to US national security in the absence of adequate protections against discrimination, error, unauthorized access, misuse, and theft.
In contrast to China, the US government does not yet appear to view biodata as a strategic asset since it is not close to doing enough to either systematically collect or secure it. For example, there are no national standards for how and where biodata should be stored. Biodata collections are available in unencrypted online databases, local servers and computers, and even in email chains. Many companies and labs that generate and access biodata do not have the resources, technical expertise, or motivation to secure it themselves. Until the US government adequately protects biodata against misuse and theft, such collections may present more risks than benefits.
Sitting in jail for several months now, you’ve learned firsthand about both the value and potential costs of biodata. Your lawyer visits you again, this time with some good news. To the surprise of the district attorney, the search of the organ donor registry and the victim’s health records turned up evidence of a new suspect.
It turns out that your colleague suffered from leukemia and underwent a bone marrow transplant over a decade ago. When the detective on your case redid the DNA tests, he discovered two different DNA profiles for your colleague. The detective consulted with scientists and learned that over time, the victim’s DNA began to change in different parts of his body (e.g., blood, saliva, semen, and skin) and became identical to that of his donor. Only his hair still contained his original DNA profile.
After further investigation, the detective also discovered that your colleague and his donor met up a few times prior to his murder. By coincidence, the donor also worked at a biotech firm at the cutting edge of consumer products produced by microbes. Apparently, your dead colleague stole an idea from his donor for submission to your biotech firm’s competition. When the donor found out about the theft, he planned and executed the murder of his bone marrow recipient, setting you up to take the fall. Your lawyer informs you that you’ll go free in a few hours, once the paperwork has been processed.
Natasha Bajema is the Founder and CEO of Nuclear Spin Cycle, LLC, a consulting firm specializing in national security, entertainment, and publishing. She has been an expert on national security issues for over 20 years, specializing in weapons of mass destruction (WMD), nuclear proliferation, terrorism, and emerging technologies. Natasha is a published fiction author and writes a science fiction mystery series called the Lara Kingsley Series.
Ronit Langer is a Scoville Fellow at the Carnegie Endowment for International Peace working on issues at the intersection of biotechnology and cybersecurity. She also currently serves as the After iGEM Global Ambassador Coordinator, leading a team of 26 ambassadors across the world. She has also represented iGEM as a delegate at the Biological Weapons Convention and the Geneva Disarmament Platform. She has a degree in computer science from MIT, where she worked on numerous projects in computational biology.