Problem Set 2 - Ancestry

[Problem Set 2 PDF]

Overview

Data files and code templates for this problem set are available on Comet at:

/oasis/projects/nsf/csd524/mgymrek/data/ps2/
/oasis/projects/nsf/csd524/mgymrek/templates/ps2/

You should make a directory for problem set 2 in your working directory with subdirectories for code and results:

mkdir /oasis/projects/nsf/csd524/$USER/ps2
mkdir /oasis/projects/nsf/csd524/$USER/ps2/code
mkdir /oasis/projects/nsf/csd524/$USER/ps2/results
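If you would rather create the same layout from inside a Python script, a minimal sketch follows. The function name make_ps2_dirs and the PROJECT_ROOT constant are illustrative and not part of the course templates. Unlike the plain mkdir commands above, it also creates any missing parent directories and is safe to run more than once.

```python
from pathlib import Path

# Project root taken from the mkdir commands above.
PROJECT_ROOT = Path("/oasis/projects/nsf/csd524")


def make_ps2_dirs(user, root=PROJECT_ROOT):
    """Create <root>/<user>/ps2/code and <root>/<user>/ps2/results.

    Returns the path to the ps2 directory.
    """
    ps2 = Path(root) / user / "ps2"
    for sub in ("code", "results"):
        # parents=True creates ps2 (and any missing parents) as needed;
        # exist_ok=True means rerunning the function does not raise an error.
        (ps2 / sub).mkdir(parents=True, exist_ok=True)
    return ps2


# Example call on the cluster (hypothetical usage, not executed here):
#   import getpass
#   make_ps2_dirs(getpass.getuser())
```

The root argument exists only so the function can be tried out in a scratch directory before pointing it at the shared project space.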

Installing Python packages

Use the following command to install useful Python packages:

pip install --user scikit-learn pandas pyvcf
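To confirm that the installation worked, one option is a short check that looks up each package by its import name. This is a sketch rather than part of the course templates, and the names REQUIRED_MODULES and missing_modules are illustrative. Note that import names can differ from pip names: scikit-learn is imported as sklearn and PyVCF as vcf.

```python
import importlib.util

# Import names for the packages installed above
# (scikit-learn -> sklearn, PyVCF -> vcf).
REQUIRED_MODULES = ["sklearn", "pandas", "vcf"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level module names that Python cannot locate."""
    # find_spec returns None when a top-level module is not installed,
    # without actually importing the module.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Not found:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

If any module is reported as not found, rerun the pip command and make sure you are using the same Python interpreter that pip installed into.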

PS2 data

The data directory contains the following files you will use in the problem set:

To see how these files were created from the original 1000 Genomes file, see:

PS2 templates

The templates directory contains:

Using 23andMe data

To include your own 23andMe results in the data used for the PCA problem, see:

/oasis/projects/nsf/csd524/mgymrek/templates/ps2/preprocess_23andme.sh

and edit the paths appropriately.