So, this one’s likely pretty niche, but I’m hoping someone here might know the answer.
I got genotype data for myself from 23andMe (don’t worry, I made them delete it before the acquisition) and AncestryDNA years ago, and more recently I’ve been looking into things like SNPs. I write code for a living, so with a little code and the raw data I can check which interesting SNPs I might have.
Something I’ve noticed recently is that for some SNPs, I’ve got alleles that aren’t listed as a possibility anywhere on the internet that I can find.
Just to take a random example, rs3746544, part of the SNAP25 gene. According to SNPedia, the available alleles are A and C with A being the major allele and C being the minor. So what is my genotype for that SNP?
[tootsweet@computer genome_raw_data]$ grep rs3746544 23andme_raw_data.txt ancestrydna_raw_data.txt
23andme_raw_data.txt:rs3746544 20 10287084 TT
ancestrydna_raw_data.txt:rs3746544 20 10287084 T T
[tootsweet@computer genome_raw_data]$
TT? There’s zero mention of “T” being an allele that you can have for rs3746544.
rs3746544 is very much not the only example. Just a few more among many:
- SNPedia says rs807701 has alleles C and T, but I have AA.
- SNPedia says rs25532 has alleles C and T, but I have AG.
- SNPedia says rs6265 has alleles A and G, but I have TT.
I’m hoping some of you folks know enough about genes to know what might be up with these examples. I’m sure it’s just simply something I don’t yet understand about genetics. Thanks in advance!
Edit: So I had a bit of a brain fart after writing this in a comment:
(Side note: oddly, in each of the 23 “mismatch” examples I mentioned, my genotype doesn’t have a single allele in common with the documented possible alleles for the SNP. For example, I don’t have any AT’s where the documented genotypes are AA, AC, and CC. My genotypes either match the documented alleles or have no alleles in common with them. Which seems even stranger.)
A’s match with T’s and C’s with G’s. I’m guessing when I get a “mismatch” like what I’m talking about, what 23andme or AncestryDNA is giving me is the complementary base pairs. So if I see a CT where the documented options are AA, AG, and GG, I should just consider my CT to be equivalent to an AG. (Because the T matches up with an A and the C matches up with a G.)
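Here’s a minimal sketch of that idea in Python. The function names and the labels it returns are just ones I made up for illustration, and it only handles plain A/C/G/T calls; anything else (like a `--` no-call) is reported as "no call".

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement_genotype(genotype):
    """Return the genotype as it would read on the other strand, e.g. 'CT' -> 'GA'."""
    return "".join(COMPLEMENT[base] for base in genotype.upper())


def classify(genotype, documented_alleles):
    """Compare an observed genotype with the documented alleles for a SNP.

    Returns one of the (hypothetical) labels:
    'same strand', 'opposite strand', 'unexplained', or 'no call'.
    """
    observed = genotype.upper()
    if not observed or any(base not in COMPLEMENT for base in observed):
        return "no call"
    if set(observed) <= set(documented_alleles):
        return "same strand"
    if set(complement_genotype(observed)) <= set(documented_alleles):
        return "opposite strand"
    return "unexplained"


# The four examples from this post: (rsid, my genotype, documented alleles).
examples = [
    ("rs3746544", "TT", {"A", "C"}),
    ("rs807701", "AA", {"C", "T"}),
    ("rs25532", "AG", {"C", "T"}),
    ("rs6265", "TT", {"A", "G"}),
]
for rsid, genotype, documented in examples:
    # Each of these four comes out as "opposite strand".
    print(rsid, genotype, classify(genotype, documented))
```

One caveat with this check: if the documented alleles are themselves a complementary pair (A/T or C/G), flipping the strand gives the same set of letters, so the comparison can’t tell which strand was reported.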
So I guess that means that sometimes the equipment that 23andme and AncestryDNA use reads the other side of the DNA strand from the one that’s documented in the literature. (This only seems to happen in about 16.5% of cases or thereabouts, at least according to my napkin math. In most cases, what 23andme and AncestryDNA report in the raw data matches the literature, so they must be measuring/reading/reporting the “same side” of the double helix that the literature talks about.)
At least that theory seems consistent with what I’m seeing. If anybody knows better, I definitely would appreciate any further input!
That said, it does seem kind of odd that any time 23andme reads the “other side” of the DNA molecule, so does AncestryDNA, and vice versa. That is, there don’t seem to be any cases where they disagree on my genotype for a given SNP. At least I haven’t seen any examples of that so far. I might have to do some searching now.
Edit 2: I’ve done a little more googling based on the first edit above and found this page. It seems 23andme always goes off of the so-called “+ strand” of the “Genome Reference Consortium Human Build 37” human reference genome. So maybe the 23 examples I’ve found so far are cases where at least some of the literature (or at least SNPedia and EUPedia, if not “the literature”) is based more off of what the “Genome Reference Consortium Human Build 37” considers the “- strand”. So maybe “the literature” (and/or SNPedia/EUPedia) uses a different reference genome? All this is still just a theory, but I definitely know more than I did a few minutes ago.
Edit 3: Some folks are suggesting that 23andMe and AncestryDNA may just not be accurate. As in, 23andMe and AncestryDNA may have a very high error rate when reading my genetic data. If that were the case, I wouldn’t expect the inaccuracies to “match” between the two raw data files. So, to test that hypothesis, I wrote a script to check my 23andMe raw data against my AncestryDNA raw data to see how often they disagree. The script is quite slow, but so far it has checked over 35,000 SNPs that are measured by both services and found 12 that disagree, for a disagreement rate of roughly 0.0343%. As I mentioned in another comment, the “mismatch” instances I’ve found make up about 16.5% of the SNPs I’ve checked. So read errors don’t seem to account for a very large percentage of these. I’m still leaning pretty heavily toward the “other strand” theory. Thanks again for everyone’s input!
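For anyone who wants to try the same comparison, here’s a rough sketch in Python. It is not the script I actually ran. It assumes the layouts shown in the grep output above (23andMe: rsid, chromosome, position, genotype; AncestryDNA: rsid, chromosome, position, allele1, allele2), assumes comment lines start with `#`, and only compares SNPs where both files have a two-letter A/C/G/T call. All the function names are made up. Loading each file into a dict once keeps every lookup cheap, which should avoid the slowness of rescanning a file per SNP.

```python
VALID_BASES = set("ACGT")


def parse_23andme(lines):
    """Map rsid -> genotype string (e.g. 'TT') from 4-column lines."""
    calls = {}
    for line in lines:
        fields = line.split()
        if line.startswith("#") or len(fields) != 4:
            continue  # skip comments, blanks, and anything unexpected
        rsid, _chrom, _pos, genotype = fields
        calls[rsid] = genotype.upper()
    return calls


def parse_ancestry(lines):
    """Map rsid -> genotype string from 5-column lines (alleles in separate columns)."""
    calls = {}
    for line in lines:
        fields = line.split()
        if line.startswith("#") or len(fields) != 5 or fields[0] == "rsid":
            continue  # skip comments, blanks, and a header row if present
        rsid, _chrom, _pos, allele1, allele2 = fields
        calls[rsid] = (allele1 + allele2).upper()
    return calls


def is_called(genotype):
    """True only for two-letter A/C/G/T calls; no-calls and indels are skipped."""
    return len(genotype) == 2 and set(genotype) <= VALID_BASES


def compare(calls_a, calls_b):
    """Return (number of SNPs compared, sorted list of rsids that disagree)."""
    compared = 0
    disagreements = []
    for rsid in calls_a.keys() & calls_b.keys():
        a, b = calls_a[rsid], calls_b[rsid]
        if not (is_called(a) and is_called(b)):
            continue
        compared += 1
        if sorted(a) != sorted(b):  # allele order doesn't matter: 'AG' == 'GA'
            disagreements.append(rsid)
    return compared, sorted(disagreements)


# Tiny in-memory demo using the lines from the grep output above.
demo_23andme = ["rs3746544\t20\t10287084\tTT"]
demo_ancestry = ["rs3746544\t20\t10287084\tT\tT"]
compared, disagreements = compare(
    parse_23andme(demo_23andme), parse_ancestry(demo_ancestry)
)
print(compared, disagreements)
```

To run it on real files, pass open file objects instead of the demo lists, e.g. `parse_23andme(open("23andme_raw_data.txt"))`. The disagreement rate is then `len(disagreements) / compared`; with my numbers that’s 12 / 35,000, or about 0.0343%.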


All I know is that RS232 has options for Parity and various bitstream options.