What Is Raw DNA Data? A Complete Beginner's Guide
If you have ever taken a DNA test from a service like 23andMe, AncestryDNA, or Helixline, your results go far beyond the colorful ancestry pie chart on your screen. Behind those visual reports lies something far more fundamental: your raw DNA data. This file is the actual output of your genetic test, and understanding what it is and how to use it opens up a world of possibilities for exploring your genome.
In this guide, we will break down exactly what raw DNA data is, what it looks like when you open the file, how different testing companies format it, how to download it, and what you can do with it once you have it in your hands. Whether you are a first-time DNA test customer or a genetics enthusiast looking to go deeper, this guide has everything you need.
Key Takeaway: Raw DNA data is a plain text file containing hundreds of thousands of rows, each representing a single genetic marker (SNP) in your genome. It records which two letters (alleles) you carry at each tested position on your chromosomes. This file is the foundation for all the ancestry, health, and trait reports you receive from your DNA testing company.
What Exactly Is Raw DNA Data?
When you spit into a tube or swab your cheek for a DNA test, the laboratory does not read your entire genome. Instead, it uses a technology called SNP genotyping to read specific positions in your DNA that are known to vary between people. These variable positions are called single nucleotide polymorphisms, or SNPs (pronounced "snips").
Your genome is made up of about 3.2 billion base pairs of DNA, but any two humans share roughly 99.9% of their DNA sequence. The remaining 0.1% works out to roughly 3 million positions, and a typical genome differs from the reference sequence at around 4 to 5 million sites once all kinds of variants are counted. This is where the interesting variation occurs. SNP genotyping chips typically test between 600,000 and 900,000 of these variable positions, selecting the ones that are most informative for ancestry, health, and trait analysis.
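The proportions in this paragraph are easy to check with a few lines of arithmetic. The sketch below uses only the round figures quoted in this guide; the variable names are illustrative, not part of any tool.

```python
GENOME_BASE_PAIRS = 3_200_000_000   # approximate genome size quoted above
SHARED_FRACTION = 0.999             # roughly 99.9% shared between any two people
CHIP_SNPS = (600_000, 900_000)      # typical range of SNPs on a genotyping chip

# 0.1% of the genome, expressed as a number of positions
differing_positions = round(GENOME_BASE_PAIRS * (1 - SHARED_FRACTION))

# Share of all genome positions that a chip reads directly
chip_fraction_low = CHIP_SNPS[0] / GENOME_BASE_PAIRS
chip_fraction_high = CHIP_SNPS[1] / GENOME_BASE_PAIRS

print(f"0.1% of the genome is about {differing_positions:,} positions")
print(f"A chip reads {chip_fraction_low:.3%} to {chip_fraction_high:.3%} of all positions")
```

In other words, a genotyping chip reads only a small fraction of a percent of the genome, chosen because those positions are known to vary.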
The raw DNA data file is simply the complete output of this genotyping process. It is a text file where each row represents one SNP and records:
- rsID: A unique reference identifier for the SNP (e.g., rs1426654). The "rs" stands for "Reference SNP" and these IDs are assigned by the dbSNP database maintained by the National Center for Biotechnology Information (NCBI).
- Chromosome: Which of the 22 autosomes (numbered 1-22) or sex chromosomes (X or Y) the SNP is located on.
- Position: The exact numerical coordinate (base pair position) on that chromosome where the SNP occurs, based on a reference genome build (usually GRCh37/hg19 or GRCh38/hg38).
- Genotype: The two alleles (letters) you carry at this position -- one inherited from your mother and one from your father. Represented as two letters such as AA, AG, GG, CT, or TT.
What Does Raw DNA Data Look Like?
When you open a raw DNA data file in a text editor, you will see something that looks like a very long spreadsheet without the gridlines. Here is an example of what a typical 23andMe-format raw data file looks like:
#
# This file contains raw genotype data, including data that is not used in
# 23andMe reports. This data has undergone a general quality review however
# only a subset of markers have been individually validated for accuracy.
#
# rsid chromosome position genotype
rs12564807 1 734462 AA
rs3131972 1 752721 AG
rs12124819 1 776546 AA
rs1110052 1 838555 GG
rs1815606 1 844113 AG
rs7537756 1 854250 AG
rs13302982 1 861808 GG
rs1799945 6 26091179 CG
rs1800562 6 26093141 GG
rs4988235 2 136608646 CT
Line-by-Line Explanation
Let us walk through several of these lines to understand exactly what the data is telling us:
- rs12564807 1 734462 AA -- This SNP (rs12564807) is located on chromosome 1 at position 734,462. The person carries two copies of the A allele (one from each parent). This is a homozygous genotype.
- rs3131972 1 752721 AG -- This SNP is also on chromosome 1, at position 752,721. The person carries one A allele and one G allele. This is a heterozygous genotype, meaning the person inherited different versions from each parent.
- rs1799945 6 26091179 CG -- This well-studied SNP is in the HFE gene on chromosome 6 and is associated with hereditary hemochromatosis (iron overload). The CG genotype means the person is a carrier of the H63D variant.
- rs4988235 2 136608646 CT -- This famous SNP near the LCT gene on chromosome 2 is associated with lactase persistence (the ability to digest lactose in adulthood). The CT genotype typically means the person can digest milk.
Understanding Genotypes: At every SNP position, you have two alleles. If both are the same (e.g., AA or GG), you are homozygous at that position. If they differ (e.g., AG or CT), you are heterozygous. The combinations you carry across many SNPs influence traits such as eye color, contribute to your disease risk, and form the basis of your ancestry composition estimate.
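To make the file layout and the homozygous/heterozygous distinction concrete, here is a minimal Python sketch that reads 23andMe-style lines and classifies each genotype. The names `SnpRecord`, `parse_raw_lines`, and `zygosity` are made up for this illustration, and the parser assumes four whitespace-separated columns with "#" comment lines, as in the sample above.

```python
from typing import Iterable, Iterator, NamedTuple


class SnpRecord(NamedTuple):
    rsid: str
    chromosome: str
    position: int
    genotype: str


def parse_raw_lines(lines: Iterable[str]) -> Iterator[SnpRecord]:
    """Yield one SnpRecord per data line, skipping '#' comments and blank lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rsid, chromosome, position, genotype = line.split()[:4]
        yield SnpRecord(rsid, chromosome, int(position), genotype)


def zygosity(genotype: str) -> str:
    """Classify a two-letter genotype as homozygous or heterozygous."""
    if len(genotype) != 2 or not set(genotype) <= set("ACGT"):
        return "other"  # no-calls, single-allele calls, insertion/deletion codes
    return "homozygous" if genotype[0] == genotype[1] else "heterozygous"


SAMPLE = """\
# rsid chromosome position genotype
rs12564807 1 734462 AA
rs3131972 1 752721 AG
rs4988235 2 136608646 CT
"""

for record in parse_raw_lines(SAMPLE.splitlines()):
    print(record.rsid, record.genotype, zygosity(record.genotype))
```

Splitting on any whitespace means the same function works whether the columns are separated by tabs (as in the real file) or by spaces (as in the sample printed in this guide).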
Raw Data Formats Across DNA Testing Companies
Not all DNA testing companies produce raw data in the same format. While the fundamental information is the same (rsID, chromosome, position, genotype), the file structure, column order, number of SNPs tested, and reference genome build can differ significantly. Understanding these differences is important when you want to upload your data to third-party analysis tools.
| Company | File Format | SNPs Tested | File Size | Genome Build | Compatibility |
|---|---|---|---|---|---|
| 23andMe (v5) | .txt (tab-separated) | ~640,000 | ~15 MB (unzipped) | GRCh37 (hg19) | Widely accepted by most third-party tools |
| AncestryDNA | .txt (tab-separated) | ~700,000 | ~18 MB (unzipped) | GRCh37 (hg19) | Accepted by most tools; some require format conversion |
| MyHeritage | .csv (comma-separated) | ~720,000 | ~19 MB (unzipped) | GRCh37 (hg19) | Good compatibility; CSV format differs slightly |
| FamilyTreeDNA | .csv (comma-separated) | ~700,000 | ~17 MB (unzipped) | GRCh37 (hg19) / GRCh38 | Widely compatible; some older files use build 36 |
| Helixline | .txt (tab-separated) | ~850,000 | ~22 MB (unzipped) | GRCh38 (hg38) | Compatible with all major tools; includes South Asian-specific markers |
| Whole Genome (VCF) | .vcf (VCF format) | 4,000,000+ | 1-5 GB | GRCh38 (hg38) | Research-grade; requires bioinformatics tools to process |
Key Differences Between Formats
While all raw data files contain essentially the same type of information, there are important differences to be aware of:
- Column Separators: 23andMe and AncestryDNA use tab-separated values (.txt files), while MyHeritage and FamilyTreeDNA use comma-separated values (.csv files). This affects how you open and process the files.
- Header Lines: 23andMe files begin with comment lines starting with "#". AncestryDNA files start with a header row followed by data. Helixline includes a metadata section with your kit ID, test date, and chip version before the data begins.
- Allele Representation: Most companies report both alleles as a pair (e.g., "AG"), but some formats, including AncestryDNA's, report them in two separate columns. Some companies use the forward strand, while others use the dbSNP strand, which can cause confusion when comparing data across platforms.
- Reference Genome Build: The "position" column is relative to a specific version of the human reference genome. Most consumer tests use GRCh37/hg19, but newer tests (including Helixline) are moving to GRCh38/hg38. The same SNP may have different position numbers in different builds.
- SNP Selection: Each company uses a different genotyping chip with a different selection of SNPs. While there is significant overlap (especially for common ancestry and health markers), some SNPs may be present in one company's data but absent from another's.
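The separator, header, and allele-column differences listed above can be smoothed over with a small normalizing reader. The following is a rough sketch, not any company's official parser: the function name `normalize` is hypothetical, and the three sample layouts (including the header names) are assumptions chosen to mirror the differences described in this section.

```python
import csv
from typing import Iterable, Iterator, Tuple

Row = Tuple[str, str, int, str]  # (rsid, chromosome, position, genotype)


def normalize(lines: Iterable[str]) -> Iterator[Row]:
    """Convert tab- or comma-separated raw data into uniform 4-tuples."""
    data_lines = [ln for ln in lines if ln.strip() and not ln.startswith("#")]
    if not data_lines:
        return
    delimiter = "\t" if "\t" in data_lines[0] else ","
    for fields in csv.reader(data_lines, delimiter=delimiter):
        fields = [f.strip() for f in fields]
        if len(fields) < 4 or not fields[2].isdigit():
            continue  # header row or malformed line
        if len(fields) >= 5:
            genotype = fields[3] + fields[4]  # alleles stored in separate columns
        else:
            genotype = fields[3]
        yield fields[0], fields[1], int(fields[2]), genotype


# Assumed example layouts (illustrative only)
tab_with_comments = ["# comment line", "rs3131972\t1\t752721\tAG"]
tab_split_alleles = ["rsid\tchromosome\tposition\tallele1\tallele2",
                     "rs3131972\t1\t752721\tA\tG"]
quoted_csv = ["RSID,CHROMOSOME,POSITION,RESULT",
              '"rs3131972","1","752721","AG"']

for layout in (tab_with_comments, tab_split_alleles, quoted_csv):
    print(list(normalize(layout)))
```

This sketch only unifies the file layout. It does not convert positions between genome builds or reconcile strand differences, so rows from different companies are best matched by rsID rather than by position.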
Understanding VCF Format
If you have had whole genome sequencing (WGS) rather than SNP genotyping, your raw data will be in VCF (Variant Call Format) rather than the simple four-column layout shown above. VCF is the standard bioinformatics file format for storing sequence variants. It is also plain text, but it is significantly more complex than SNP genotyping output:
- Much larger: A VCF file from whole genome sequencing can be 1 to 5 gigabytes, compared to 15 to 22 megabytes for SNP genotyping data.
- More variants: WGS detects 4 to 5 million variants, compared to 600,000 to 900,000 from genotyping chips.
- Additional data: VCF includes quality scores, read depth, filter status, and other information about the reliability of each variant call.
- Structural variants: Unlike most SNP chip data, which covers at most a limited set of insertion and deletion markers, VCF can record insertions, deletions, and other structural changes throughout the genome.
For most consumer genetic testing purposes, SNP genotyping raw data is sufficient. Whole genome sequencing provides more complete data but requires specialized tools and expertise to analyze effectively.
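To show how a VCF record relates to the two-letter genotypes in a chip file, here is a small sketch that decodes the genotype (GT) field of one VCF data line. It assumes a single-sample VCF with the standard tab-separated columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, sample); the function name and the example line are made up for illustration.

```python
def genotype_from_vcf_line(line: str) -> tuple:
    """Return (chrom, pos, variant_id, alleles) for the first sample in a VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt = fields[:5]
    format_keys = fields[8].split(":")
    sample_values = fields[9].split(":")
    gt = sample_values[format_keys.index("GT")]
    alleles = [ref] + alt.split(",")            # index 0 is REF, 1 and up are ALT alleles
    indices = gt.replace("|", "/").split("/")   # "0/1" or "0|1" becomes ["0", "1"]
    called = [alleles[int(i)] if i != "." else "." for i in indices]
    return chrom, int(pos), variant_id, called


# Made-up record: REF is A, ALT is G, and the sample carries one copy of each
example = "1\t1000\trs0000001\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:30"
print(genotype_from_vcf_line(example))
```

A GT value of 0/1 therefore corresponds to a heterozygous genotype such as AG in a chip file, while 1/1 would correspond to GG. Real VCF files also begin with "##" metadata lines and a "#CHROM" header row, which a full reader would need to skip.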
How to Download Your Raw DNA Data
Every major DNA testing company allows you to download your raw data. Here are step-by-step instructions for each platform:
Downloading from 23andMe
- Log in to your 23andMe account at 23andme.com
- Click on your name in the top-right corner and select Settings
- Scroll down to the 23andMe Data section
- Click Download Raw Data
- Re-enter your password for security verification
- Complete the two-step verification if enabled
- Select "Submit Request" -- 23andMe will email you when the file is ready
- Return to the same page and click the download link (available for 30 days)
- The file downloads as a .zip archive containing a single .txt file
Downloading from AncestryDNA
- Log in to your Ancestry account at ancestry.com
- Click the DNA tab in the top navigation
- Click Settings (gear icon) on your DNA home page
- Scroll to Download DNA Data under the Actions section
- Click Get Started
- Confirm your identity by re-entering your password
- Ancestry will send a confirmation email -- click the link in the email
- Return to the settings page and click Download DNA Raw Data
- The file downloads as a .zip archive containing a .txt file
Downloading from Helixline
- Log in to your Helixline account at helixline.in
- Navigate to your Dashboard
- Click My DNA in the sidebar menu
- Select Download Raw Data
- Verify your identity through password re-entry
- Choose your preferred format: Helixline native format or 23andMe-compatible format
- Click Download -- the file is generated immediately
- The file downloads as a .zip archive containing your raw data file
Important Note: Always store your downloaded raw DNA data file in a secure location on your computer or in an encrypted cloud storage service. This file contains sensitive genetic information. Treat it with the same care you would give to your medical records or financial documents.
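Because the download arrives as a .zip archive, one way to avoid leaving extra unencrypted copies on disk is to read the data directly from the archive. The sketch below uses Python's standard zipfile module; the function name and the tiny in-memory archive are hypothetical stand-ins for your real download.

```python
import io
import zipfile


def count_data_lines(zip_file) -> int:
    """Count non-comment lines in the first .txt or .csv member of a .zip archive."""
    with zipfile.ZipFile(zip_file) as archive:
        member = next(n for n in archive.namelist() if n.endswith((".txt", ".csv")))
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8")
            return sum(1 for line in text if line.strip() and not line.startswith("#"))


# Build a tiny in-memory archive so the example runs without a real download
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "genome_example.txt",
        "# comment\nrs3131972\t1\t752721\tAG\nrs12124819\t1\t776546\tAA\n",
    )
buffer.seek(0)
print(count_data_lines(buffer))
```

With a real download you would pass the path of the .zip file instead of the in-memory buffer. For files with a header row, the count includes that row.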
What Can You Do With Raw DNA Data?
Once you have downloaded your raw data file, you can use it for a wide range of analyses that go beyond what your original testing company provides. Here are the most popular use cases:
1. Health and Trait Analysis
Several third-party services accept raw DNA data uploads and provide detailed health and trait reports:
- Promethease: One of the most comprehensive health report tools. It cross-references your SNPs against the SNPedia database and generates a detailed report covering health risks, drug responses, and traits. Reports cost approximately $12 USD and are generated within minutes. Promethease reports are dense and technical, but they provide one of the most thorough analyses available from raw data.
- SelfDecode: Offers AI-powered health reports based on your raw DNA data. It provides actionable recommendations for diet, supplements, and lifestyle changes based on your genetic profile. SelfDecode is subscription-based and includes ongoing updates as new research is published.
- Codegen.eu: A free tool that checks your raw data for specific medically relevant SNPs, including BRCA1/2 variants, MTHFR status, Factor V Leiden, and other clinically significant genetic markers. It is a good starting point for a quick health overview.
- Genetic Genie: A free service that analyzes methylation and detoxification-related genes from your raw data. It provides reports on MTHFR, COMT, VDR, and other genes involved in methylation pathways.
2. Genetic Genealogy and Finding Relatives
Uploading your raw data to genetic genealogy databases can help you find relatives who tested with different companies:
- GEDmatch: The most popular free genetic genealogy platform. By uploading your raw data, you can find DNA matches from any testing company. GEDmatch also offers admixture calculators (like Dodecad, Eurogenes, and HarappaWorld) that provide alternative ancestry breakdowns. The HarappaWorld calculator is particularly valuable for Indian users as it was designed specifically for South Asian genetic diversity.
- FamilyTreeDNA: Accepts raw data uploads from other companies and adds you to their matching database. FamilyTreeDNA has a large database of users, particularly strong in Y-DNA and mtDNA analysis.
- DNA.Land: A free research-oriented platform originally launched by researchers at Columbia University and the New York Genome Center. It provides ancestry analysis and allows you to contribute to genetic research. DNA.Land's ancestry analysis uses its own reference populations and can provide a different perspective from your original test results.
- MyHeritage: Allows free raw data uploads and provides DNA matching and ethnicity estimates. MyHeritage has a large international database, which can be helpful for finding relatives abroad.
3. Ancestry and Population Analysis
Beyond the standard ancestry report from your testing company, raw data enables deeper population-level analysis:
- Admixture Calculators: Tools like GEDmatch's HarappaWorld or Dodecad K12b break down your ancestry into ancient genetic components (such as South Indian, Baloch, Caucasian, Northeast European, Southeast Asian). These calculators use different reference populations and algorithms from the major testing companies, often providing more granular detail for specific regions.
- PCA Plots: Principal Component Analysis plots allow you to visualize where your DNA falls relative to global reference populations. Tools like the Vahaduo G25 calculator enable you to model your ancestry as a mixture of ancient and modern reference populations.
- Haplogroup Deep Analysis: While your testing company provides basic haplogroup assignments, specialized tools can analyze your raw data for more detailed subclade information within your Y-DNA or mtDNA haplogroup.
4. Upload to Helixline for India-Specific Analysis
If you tested with another company but want Indian-specific insights, Helixline accepts raw data uploads from 23andMe, AncestryDNA, MyHeritage, and FamilyTreeDNA. When you upload your data to Helixline, you receive:
- India-Specific Ancestry Breakdown: Instead of a generic "South Asian" label, Helixline provides granular regional ancestry estimates across India's diverse populations.
- Haplogroup Analysis: Detailed Y-DNA and mtDNA haplogroup assignments with context about their frequency and history in the Indian subcontinent.
- Ancestry Composition in Indian Context: Your ANI (Ancestral North Indian), ASI (Ancestral South Indian), and other ancestral components mapped to India's genetic landscape.
- Wellness Insights: Health and wellness markers interpreted specifically for South Asian populations, which can differ significantly from European-centric analyses.
Privacy Considerations When Sharing Raw DNA Data
Your raw DNA data is among the most sensitive personal information you possess. Unlike a password or credit card number, you cannot change your DNA if it is compromised. Before sharing or uploading your raw data anywhere, carefully consider these privacy factors:
What Your DNA Data Can Reveal
- Health Predispositions: Your raw data may reveal genetic risk factors for conditions like Alzheimer's disease, certain cancers (BRCA variants), heart disease, and hundreds of other conditions. This information could theoretically be used by insurers or employers in jurisdictions without genetic discrimination protections.
- Family Secrets: DNA data can reveal non-paternity events (your biological father is not who you think), unknown half-siblings, secret adoptions, and other family surprises. These revelations can be emotionally significant.
- Ancestry and Ethnicity: Your genetic ancestry can reveal ethnic and geographic origins that may be personally sensitive, particularly in contexts where ethnic identity carries social or political implications.
- Relative Identification: Even if you keep your own data private, relatives who share their data can indirectly expose your genetic information. Law enforcement has used genetic genealogy databases to identify suspects through their relatives' DNA uploads.
Best Practices for DNA Data Privacy
- Read privacy policies carefully before uploading to any third-party service. Look for clear statements about data storage, sharing with third parties, and whether your data is used for research.
- Check deletion options: Only use services that allow you to delete your uploaded data and account at any time. Verify that deletion is genuine and not just hiding your data from view.
- Use strong passwords and enable two-factor authentication on any account that stores your genetic data.
- Be cautious with public databases: Some genealogy databases are publicly searchable. Understand the privacy settings available and choose the level of visibility you are comfortable with.
- Consider pseudonymous uploads: Some services allow you to upload under a pseudonym or alias. This provides a layer of privacy while still enabling analysis.
- Never post raw data on public forums or social media, even partial snippets. A small subset of SNPs can be enough to identify an individual or infer sensitive health information.
- Encrypt your stored files: If you store your raw data file on your computer or cloud storage, consider encrypting it. Most operating systems offer built-in encryption options.
How Helixline Handles Raw DNA Data
At Helixline, we take the security and privacy of your genetic data extremely seriously. Here is how we handle raw DNA data:
- Encryption at Rest and in Transit: All raw DNA data is encrypted with AES-256 while stored on our servers and is transmitted only over TLS 1.3 connections.
- Data Sovereignty: Your genetic data is stored on servers located in India, ensuring compliance with Indian data protection laws and keeping your data within the country's jurisdiction.
- No Third-Party Sharing: Helixline does not sell, share, or provide your raw DNA data to any third party without your explicit written consent. This includes insurance companies, employers, and law enforcement (except when legally compelled by court order).
- Right to Delete: You can request complete deletion of your raw DNA data and all derived results from our servers at any time through your account settings. Deletion is permanent and irreversible.
- Download Anytime: You can download your raw DNA data at any time from your Helixline dashboard, in your choice of Helixline native format or 23andMe-compatible format for use with third-party tools.
- Research Opt-In Only: If Helixline conducts genetic research, participation is always opt-in. Your data is never included in research datasets without your explicit consent, and research data is always de-identified and aggregated.
Helixline Privacy Promise: Your DNA belongs to you. We believe you should have full control over your genetic data, including the ability to download it, upload it elsewhere, or delete it permanently. We are custodians of your data, not owners.
Common Questions About Raw DNA Data File Structure
What Do the Chromosome Numbers Mean?
Humans have 23 pairs of chromosomes. In your raw data file, these are numbered as follows:
- Chromosomes 1-22: The autosomes (non-sex chromosomes). These are numbered roughly from largest (chromosome 1, with about 249 million base pairs) to smallest. Chromosome 22 has about 51 million base pairs, and chromosome 21 is in fact slightly smaller. Every person has two copies of each autosome.
- Chromosome X: One of the sex chromosomes. Women typically have two X chromosomes (XX), while men have one X and one Y (XY). In raw data, males will show only one allele for most X chromosome SNPs (hemizygous), while females show two.
- Chromosome Y: The male sex chromosome. Only males have a Y chromosome, so only males will have Y chromosome data in their raw file. If you are female, your raw data will not contain any Y chromosome entries.
- MT (Mitochondrial): Some raw data files also include mitochondrial DNA (mtDNA) SNPs, labeled as chromosome "MT" or "26". Mitochondrial DNA is inherited exclusively from your mother.
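Since chromosome labels vary between files, a quick tally per chromosome is a handy sanity check (for example, to confirm whether a file contains any Y or MT entries). This is a minimal sketch: the "26" alias for mitochondrial DNA comes from the description above, while the other numeric aliases are assumptions for illustration and should be checked against your own file.

```python
from collections import Counter

# "26" for MT is mentioned above; 23, 24 and 25 are assumed aliases (illustrative)
NUMERIC_ALIASES = {"23": "X", "24": "Y", "25": "XY", "26": "MT"}


def normalize_chromosome(label: str) -> str:
    """Map a chromosome label to a consistent upper-case name."""
    label = label.strip().upper()
    return NUMERIC_ALIASES.get(label, label)


def snps_per_chromosome(labels) -> Counter:
    """Count how many SNP rows fall on each chromosome."""
    return Counter(normalize_chromosome(label) for label in labels)


counts = snps_per_chromosome(["1", "1", "2", "X", "26", "mt"])
print(counts)
print("Has Y chromosome entries:", counts["Y"] > 0)
```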
What Do "No Call" or "--" Entries Mean?
You may notice some entries in your raw data that show "--" or "00" or "NC" instead of a normal genotype like "AA" or "AG". These are called no-calls and mean that the genotyping chip was unable to determine your genotype at that particular position. This can happen due to:
- Insufficient DNA sample quality at that particular probe
- Technical noise or ambiguous signal on the microarray
- The genotyping algorithm's quality threshold not being met
No-calls are normal and typically affect only 1-3% of the SNPs in your raw data. They do not indicate a problem with your DNA; they simply mean the measurement was not reliable enough to report. Most analyses will simply skip these positions.
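If you are curious about your own file's no-call rate, it can be estimated by counting the placeholder entries. The sketch below treats the three spellings mentioned above ("--", "00", "NC") as no-calls; the function name and sample list are illustrative.

```python
NO_CALL_CODES = {"--", "00", "NC"}  # the three spellings mentioned above


def no_call_rate(genotypes) -> float:
    """Fraction of genotype entries that are no-calls (0.0 for an empty input)."""
    genotypes = list(genotypes)
    if not genotypes:
        return 0.0
    missing = sum(1 for g in genotypes if g.strip().upper() in NO_CALL_CODES)
    return missing / len(genotypes)


sample = ["AA", "AG", "--", "GG", "CT", "00", "TT", "AA", "CC", "AG"]
print(f"No-call rate: {no_call_rate(sample):.1%}")
```

The ten-entry sample is far too small to be realistic; on a full file you would pass the genotype column of every row.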
What Is the "i" Prefix in Some SNP Names?
In 23andMe raw data files, you may notice SNP identifiers that start with "i" instead of "rs" (for example, i3000027). These are internal identifiers used by 23andMe for markers that fall into one of these groups:
- Do not have an assigned rsID in the dbSNP database
- Are custom markers designed by 23andMe for their genotyping chip
- Are probes targeting specific regions of interest that are not standard dbSNP entries
These "i-number" SNPs may not be recognized by all third-party analysis tools, as they are specific to 23andMe's platform. However, many popular tools like Promethease and GEDmatch can interpret them.
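Before uploading a file to a third-party tool, it can be useful to see how many markers use internal identifiers. A minimal sketch, assuming identifiers are either "rs" or "i" followed by digits (the function name is hypothetical):

```python
import re

RSID_PATTERN = re.compile(r"^rs\d+$")
INTERNAL_PATTERN = re.compile(r"^i\d+$")


def classify_identifier(snp_id: str) -> str:
    """Label an identifier as a dbSNP rsID, an internal 'i' ID, or unknown."""
    if RSID_PATTERN.match(snp_id):
        return "rsid"
    if INTERNAL_PATTERN.match(snp_id):
        return "internal"
    return "unknown"


for snp_id in ["rs1426654", "i3000027", "example123"]:
    print(snp_id, classify_identifier(snp_id))
```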
Understanding Genotype Notation
The genotype column in your raw data uses the four standard nucleotide letters (A, C, G, T), plus D and I in some files to mark deletions and insertions. Here is what each letter means and the possible genotypes you can encounter:
| Letter | Meaning | Description | Example Genotypes |
|---|---|---|---|
| A | Adenine | A purine base | AA, AG, AC, AT |
| G | Guanine | A purine base | GG, GA, GC, GT |
| C | Cytosine | A pyrimidine base | CC, CA, CG, CT |
| T | Thymine | A pyrimidine base | TT, TA, TG, TC |
| D | Deletion | One or more bases were deleted | DD, DI |
| I | Insertion | One or more bases were inserted | II, ID |
Most SNPs in your raw data will involve only two possible alleles (for example, A or G at a particular position). The three possible genotypes for such a SNP would be AA (homozygous for the A allele), AG (heterozygous), and GG (homozygous for the G allele). Which allele is considered "reference" and which is "alternate" depends on the reference genome.
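The three-genotype pattern follows directly from choosing two alleles without regard to order. The sketch below enumerates the possibilities and counts how many copies of a given allele a genotype carries; both function names are illustrative.

```python
from itertools import combinations_with_replacement


def possible_genotypes(alleles):
    """All unordered genotypes for the given alleles, letters sorted within each pair."""
    return ["".join(pair) for pair in combinations_with_replacement(sorted(alleles), 2)]


def allele_count(genotype, allele):
    """How many copies (0, 1 or 2) of `allele` a genotype carries."""
    return genotype.count(allele)


print(possible_genotypes("AG"))   # ['AA', 'AG', 'GG']
print(allele_count("AG", "G"))    # 1
```

For a SNP with n possible alleles there are n(n+1)/2 unordered genotypes, which gives 3 for the usual two-allele case.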
How Raw DNA Data Powers Your Reports
Understanding how testing companies transform your raw data into meaningful reports helps you appreciate both the power and the limitations of consumer genetic testing:
Ancestry Reports
Ancestry analysis works by comparing your genotypes at thousands of ancestry-informative markers (AIMs) against reference panels of populations with known geographic origins. The algorithm calculates the statistical likelihood that your DNA pattern at each marker came from various reference populations, then combines these probabilities across all markers to produce your ancestry composition percentages.
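The idea of combining per-marker likelihoods can be illustrated with a toy calculation. The sketch below is a deliberate simplification, not any company's actual algorithm: it assumes each person comes from a single reference population, that markers are independent, and that genotypes follow Hardy-Weinberg proportions. The population names and allele frequencies are made-up numbers.

```python
import math

# Hypothetical frequencies of the counted allele at three markers (made-up numbers)
REFERENCE_FREQUENCIES = {
    "population_A": [0.90, 0.20, 0.60],
    "population_B": [0.30, 0.70, 0.50],
}


def genotype_probability(copies: int, freq: float) -> float:
    """Hardy-Weinberg probability of carrying 0, 1 or 2 copies of an allele."""
    return math.comb(2, copies) * freq**copies * (1 - freq) ** (2 - copies)


def log_likelihood(copies_per_marker, freqs) -> float:
    """Sum of log genotype probabilities across markers for one population."""
    return sum(math.log(genotype_probability(c, f)) for c, f in zip(copies_per_marker, freqs))


def posterior(copies_per_marker, reference=REFERENCE_FREQUENCIES):
    """Normalized likelihoods across populations, assuming equal prior weight."""
    logs = {name: log_likelihood(copies_per_marker, freqs) for name, freqs in reference.items()}
    peak = max(logs.values())
    weights = {name: math.exp(value - peak) for name, value in logs.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


person = [2, 0, 1]  # copies of the counted allele at each of the three markers
print(posterior(person))
```

Working in log space keeps the numbers stable when thousands of markers are multiplied together. Real ancestry pipelines go further by modeling mixed ancestry along each chromosome rather than assigning a person to one population.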
Different companies use different reference panels and algorithms, which is why your ancestry results may vary between services. Helixline uses a reference panel that includes detailed representation of Indian subpopulations, providing more granular South Asian ancestry results than services that group all of India into a single category.
Health Reports
Health-related insights are derived by looking up specific SNPs in your raw data that have been associated with health conditions in published scientific research (genome-wide association studies, or GWAS). For each health-related SNP, the report checks which genotype you carry and references the scientific literature to determine what that genotype is associated with.
It is important to understand that most health-related SNPs identified through GWAS contribute only a small amount of risk for any given condition. Having a "risk" genotype does not mean you will develop a condition; it means your statistical probability may be slightly higher or lower than average.
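At its simplest, the lookup step is a dictionary match between the rsIDs in your file and a table of annotated SNPs. The sketch below uses the two health and trait SNPs already discussed in this guide; the table and function names are hypothetical, and a real report would draw on a curated database and interpret each genotype individually.

```python
# Annotations taken from the examples earlier in this guide
ANNOTATIONS = {
    "rs1799945": "HFE gene (H63D variant), associated with hereditary hemochromatosis",
    "rs4988235": "near the LCT gene, associated with lactase persistence",
}

NO_CALL_CODES = {"--", "00", "NC"}


def annotate(genotypes: dict) -> list:
    """Pair each annotated SNP found in the data with the genotype carried."""
    report = []
    for rsid, note in ANNOTATIONS.items():
        genotype = genotypes.get(rsid)
        if genotype is None or genotype in NO_CALL_CODES:
            continue  # SNP not on this chip, or no reliable call
        report.append(f"{rsid} ({genotype}): {note}")
    return report


my_data = {"rs1799945": "CG", "rs4988235": "--", "rs3131972": "AG"}
for line in annotate(my_data):
    print(line)
```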
Relative Matching
When two people upload their raw data to the same platform, the system compares their genotypes across all shared SNPs. If two people share long stretches of identical genotypes (called identical-by-descent segments, or IBD), it indicates they share a recent common ancestor. The total amount of shared DNA (measured in centimorgans, or cM) determines the likely relationship.
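A simplified version of this comparison checks, SNP by SNP, whether two people share at least one allele, and looks for the longest unbroken run. This is only a sketch of the idea under simplifying assumptions: it measures runs in SNP counts rather than centimorgans, a no-call breaks the run, and the function names and sample genotypes are made up.

```python
def shares_allele(genotype_a: str, genotype_b: str) -> bool:
    """True if two genotypes have at least one allele in common."""
    return bool(set(genotype_a) & set(genotype_b))


def longest_shared_run(person_a, person_b) -> int:
    """Length, in SNPs, of the longest run where both people share an allele."""
    best = current = 0
    for a, b in zip(person_a, person_b):
        if shares_allele(a, b):
            current += 1
            best = max(best, current)
        else:
            current = 0
    return best


# Genotypes at six consecutive SNPs on the same chromosome (made-up data)
person_a = ["AA", "AG", "CC", "TT", "GG", "CT"]
person_b = ["AG", "GG", "TT", "CT", "AG", "CC"]
print(longest_shared_run(person_a, person_b))  # 3
```

In the sample, the third SNP (CC versus TT) has no allele in common, so the longest run is the final three SNPs. Matching platforms look for runs spanning many hundreds of SNPs and convert their length to centimorgans using a genetic map.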
Frequently Asked Questions
What does raw DNA data look like?
Raw DNA data is a plain text file that you can open in any text editor or spreadsheet application. Each row contains four pieces of information: an rsID (a unique identifier like rs1234567), a chromosome number (1-22, X, Y, or MT), a position (numerical coordinate on the chromosome), and your genotype (two letters like AA, AG, or GG). A typical file contains 600,000 to 900,000 such rows, with comment lines at the top preceded by a "#" symbol. The file is usually 15 to 25 megabytes when unzipped and can be opened in Notepad, TextEdit, Excel, or any text editor.
Can I download my raw DNA data?
Yes, all major DNA testing companies provide the option to download your raw DNA data. With 23andMe, go to Settings, then 23andMe Data, then Download Raw Data. With AncestryDNA, open the DNA tab, click Settings, then choose Download DNA Data. With Helixline, navigate to your Dashboard, click My DNA, and select Download Raw Data. The process typically requires re-entering your password for security, and the file is usually ready within a few minutes. The download comes as a compressed .zip file containing your raw data as a text file. You own this data and have the right to download it at any time.
Is it safe to share raw DNA data?
Sharing raw DNA data carries meaningful privacy risks and should be done with careful consideration. Your DNA data can reveal sensitive health information (like BRCA gene variants), family secrets (non-paternity, unknown siblings), ethnic ancestry, and can potentially be used to identify you or your relatives. Only upload your data to reputable services with clear privacy policies, data encryption, and deletion options. Never post raw data on public forums or social media. That said, sharing data with trusted third-party tools like Promethease, GEDmatch, or Helixline can provide valuable insights when done thoughtfully. Always read the privacy policy and terms of service before uploading.
What can I do with raw DNA data?
Raw DNA data unlocks a wide range of analyses beyond your original testing company's reports. You can upload it to health analysis tools like Promethease for detailed health and trait reports, or to genetic genealogy platforms like GEDmatch to find DNA matches from people who tested with different companies. You can explore alternative ancestry calculators like HarappaWorld for more detailed South Asian ancestry breakdowns. You can check for carrier status on specific genetic conditions using tools like Codegen.eu. And if you tested with another company, you can upload your raw data to Helixline for India-specific ancestry, haplogroup, and wellness analysis tailored to South Asian genetics.
Conclusion
Your raw DNA data file is the most fundamental output of any genetic test. While the ancestry pie charts and health reports are easier to understand at a glance, the raw data file is where the real power lies. It is a portable record of hundreds of thousands of your genetic variants that you can take with you to any analysis platform, now or in the future.
Understanding what this file contains -- the rsIDs, chromosomes, positions, and genotypes that make up your unique genetic fingerprint -- empowers you to make informed decisions about how to use your genetic information. Whether you want to explore your deep ancestry through admixture calculators, find long-lost relatives through genetic genealogy, investigate health-related traits, or simply keep a copy of your genetic data for future use, the raw data file is your starting point.
As genetic science advances and new analysis tools are developed, having your raw DNA data on hand means you can always take advantage of the latest discoveries without needing to take another test. Your DNA does not change, but our ability to interpret it improves every year.
Ready to explore your DNA? Order your Helixline DNA kit or upload your existing raw data to get India-specific ancestry and wellness insights that go beyond the generic.