The data file had to be read into the database, and then the information from the database was used to determine inheritance codes.
We had 5000 samples of data associated with each “SNP ID” and we had over 1000 SNP IDs, making our data file over 5 million lines long. It was actually much messier looking than this, and I ended up processing each line and storing the results in a database. After I talked with my boss about this, he provided me the same data in a different format.
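As a rough illustration of the "process each line and store it in a database" step, here is a minimal sketch using Python's built-in `sqlite3` module. The line layout (SNP ID, sample ID, genotype call separated by whitespace), the table name `snp_calls`, and the function name `load_long_format` are all assumptions made up for this example; the real file was messier and is not reproduced here.

```python
import sqlite3


def load_long_format(lines, connection):
    """Store one database row per input line and return how many were stored.

    Assumed layout (hypothetical): each line holds a SNP ID, a sample ID,
    and a genotype call, separated by whitespace.
    """
    connection.execute(
        "CREATE TABLE IF NOT EXISTS snp_calls "
        "(snp_id TEXT, sample_id TEXT, genotype TEXT)"
    )
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        rows.append(tuple(fields))
    with connection:  # commit all inserts as one transaction
        connection.executemany("INSERT INTO snp_calls VALUES (?, ?, ?)", rows)
    return len(rows)


connection = sqlite3.connect(":memory:")
stored = load_long_format(
    ["SNP0001 Z1 A", "SNP0001 M100 B", "", "SNP0002 Z1 B"], connection
)
count_for_z1 = connection.execute(
    "SELECT COUNT(*) FROM snp_calls WHERE sample_id = ?", ("Z1",)
).fetchone()[0]
```

With one row per line, a 5-million-line file becomes a 5-million-row table, which is part of why the condensed format was attractive.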
This format really condensed the data file: from 800 MB to less than 15 MB, in fact. However, each “data point” is no longer “tagged”, so some additional preprocessing had to be done.
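To make the "untagged" problem concrete, here is a minimal sketch of what that preprocessing might look like. It assumes a wide, tab-delimited layout in which the first row lists the sample IDs and each later row holds a SNP ID followed by one call per sample. That layout and the function name `tag_data_points` are assumptions for illustration, not the actual file format.

```python
import csv
import io


def tag_data_points(handle):
    """Yield (snp_id, sample_id, call) tuples from a wide table.

    Assumed layout (hypothetical): the header row holds the sample IDs,
    and every later row holds a SNP ID followed by one call per sample,
    in the same column order as the header.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    sample_ids = header[1:]  # the first cell only labels the SNP ID column
    for row in reader:
        if not row:
            continue  # skip blank lines
        snp_id, calls = row[0], row[1:]
        if len(calls) != len(sample_ids):
            raise ValueError(
                f"{snp_id}: expected {len(sample_ids)} calls, got {len(calls)}"
            )
        # Re-attach the sample ID that the condensed format left implicit.
        for sample_id, call in zip(sample_ids, calls):
            yield snp_id, sample_id, call


example = "snp_id\tZ1\tM100\tBob\nSNP0001\tA\tB\tA\nSNP0002\tB\tB\tA\n"
records = list(tag_data_points(io.StringIO(example)))
```

The file stays small because each sample ID is written once in the header instead of once per data point; the tag is rebuilt from the column position only when it is needed.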
The sample IDs I showed you earlier each represented a different individual corn plant. Knowing the relationships among the different plants was required for processing the data. Here, since I’m a human familiar with the genetic system, I know that IBM stands for an Intermated B73 x Mo17 population. This is a simplified example of a manifest file. Z1, M100, and “Bob” are just made-up names, and any similarity to known names is purely coincidental. When you start looking at these, you see that the relationships were defined in multiple ways. There isn’t anything here that directly tells the computer that IBM, Mo17, and B73 are related. To take advantage of this information, I wrote a long series of rules. Well, the breakthrough came with the realization that I couldn’t keep this up forever. Instead of telling the computer how to understand these relationships, I decided to just tell the computer what the relationships are (next slide).
This is organized in a way that is simple for both humans and computer programs to understand.
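As one possible shape for such a file, here is a minimal sketch that states the relationships in INI style and reads them with Python's standard `configparser` module. The section layout, the `parents` key, and the function name `load_relationships` are assumptions for illustration. The IBM entry follows the B73 x Mo17 cross mentioned earlier, while the “Bob” entry is invented, like the name itself.

```python
import configparser

# Hypothetical configuration text: one section per population,
# with its parents listed explicitly.
CONFIG_TEXT = """
[IBM]
parents = B73, Mo17

[Bob]
parents = Z1, M100
"""


def load_relationships(text):
    """Return a dict mapping each population name to its list of parents."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    relationships = {}
    for population in parser.sections():
        raw = parser.get(population, "parents")
        relationships[population] = [name.strip() for name in raw.split(",")]
    return relationships


relationships = load_relationships(CONFIG_TEXT)
```

Nothing here tries to infer that IBM descends from B73 and Mo17 from the name; the file simply says so, and the program looks it up.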
Configuration files are great for some tasks that are easy for humans but more difficult to program. They are also great for things that are variable. Setting up the configuration file only takes minutes. If we don’t know what these relationships are to start with, then we’re in trouble anyway. Simple for humans ≠ simple for computers. Something else I didn’t put up here is that reducing your dependencies sure makes it easier to install.