# DTC Introduction to Programming

Practice Problems for Multidimensional Array Structures

1. The purpose of this exercise is to see how you might go about doing the preprocessing necessary to align two protein chains in three-dimensional space. There are many algorithms that can superimpose protein structures to find an alignment. Theseus is one such example.
2. Download the 4D12 Protein Data Bank file and save it to your working directory. We're going to do some text processing using multidimensional arrays in Python.
3. For some background, check out the PDB entry for 4D12 to learn a little bit about what this protein does. Also check out the PDB guide for coordinate files to learn more about what information is contained in PDB files. Note that the PDB format is a bit unusual in that spaces are important: each field sits at a fixed character position on the line, and fields are not delimited by tab or comma characters. See here for the character positions that correspond to each piece of information in the PDB file.
4. As you've seen from the PDB documentation, the lines in the PDB file that hold protein coordinates start with 'ATOM', and the chain identifier is the fifth field on the line (character column 22 in the fixed-width layout). Read the 4d12.pdb file into a Python array, take the lines (in order) that start with 'ATOM' and belong to chain A, and copy them into a new file called 4d12a.pdb. (Solution)
5. Proteins consist of amino acid residues (see Figure 1 below). Each residue has a side chain (which differs from one amino acid type to another) and a backbone that is common to all residues. The backbone consists of the repeating series N-Cα-C, where N is in an amino group, Cα is bonded to the side chain, and C is in a carboxyl group. Residues are linked by a peptide bond between the carboxyl group of one residue and the amino group of the next, so that the backbone has the repeating structure N-Cα-C-N-Cα-C- etc. Modify the code you wrote in the previous step so that it only writes the backbone atoms to a new file called 4d12a_backbone.pdb. Open 4d12a_backbone.pdb in a text editor. Can you see the N-Cα-C repeating structure? (Solution)
6. 5FRC has the same amino acid sequence as 4D12, but it was expressed in a different organism and the X-ray diffraction was done under different conditions. Download the 5FRC Protein Data Bank file and use your code from the previous two steps to make 5frca.pdb and 5frca_backbone.pdb.
7. We've now created 4d12a_backbone.pdb and 5frca_backbone.pdb files that we can use in our structural alignment!
8. Perform the structural alignment on the 5frca_backbone.pdb and 4d12a_backbone.pdb inputs that you created in the exercises. You will need to download Theseus from the Theseus webpage, where there are binary executables available for macOS, Windows, and Linux. Note that Theseus requires Muscle to do the multiple sequence alignment, so you'll need that too. You'll also need to edit the theseus_align shell script to tell it where to find the Theseus binary and the Muscle binary. Once you've done that, run theseus_align by running 'bash theseus_align -a1 -f 4d12a_backbone.pdb 5frca_backbone.pdb'. What is the root-mean-square deviation (RMSD) between the two structures?
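
The filtering in steps 4 and 5 can be sketched roughly as follows. This is one possible approach, not the linked solution. The function names (`filter_atom_lines`, `coordinates`, `write_filtered`) and the `BACKBONE` set are hypothetical, chosen for illustration. The sketch assumes the standard fixed-width layout: atom name in character columns 13-16, chain identifier in column 22, and x, y, z in columns 31-38, 39-46, and 47-54. It ignores alternate-location indicators, so check the output against your own file.

```python
# Backbone atom names as described in step 5 (an assumption of this sketch:
# only N, CA, and C are kept).
BACKBONE = {"N", "CA", "C"}


def filter_atom_lines(lines, chain="A", atom_names=None):
    """Return the ATOM lines for one chain, in their original order.

    If atom_names is given, keep only atoms whose name is in that set.
    """
    kept = []
    for line in lines:
        # Skip anything that is not an ATOM record or is too short to
        # contain coordinates.
        if not line.startswith("ATOM") or len(line) < 54:
            continue
        # Chain identifier: character column 22 (index 21).
        if line[21] != chain:
            continue
        # Atom name: character columns 13-16 (indices 12-15), space padded.
        if atom_names is not None and line[12:16].strip() not in atom_names:
            continue
        kept.append(line)
    return kept


def coordinates(atom_lines):
    """Build an N x 3 nested list (one [x, y, z] row per atom)."""
    return [
        [float(line[30:38]), float(line[38:46]), float(line[46:54])]
        for line in atom_lines
    ]


def write_filtered(in_path, out_path, chain="A", atom_names=None):
    """Read a PDB file, filter it, write the kept lines, return their count."""
    with open(in_path) as handle:
        lines = handle.readlines()
    kept = filter_atom_lines(lines, chain, atom_names)
    with open(out_path, "w") as handle:
        handle.writelines(kept)
    return len(kept)
```

With 4d12.pdb in the working directory, `write_filtered("4d12.pdb", "4d12a.pdb")` would cover step 4 and `write_filtered("4d12.pdb", "4d12a_backbone.pdb", atom_names=BACKBONE)` would cover step 5. Slicing by character position rather than calling `split()` is deliberate: adjacent PDB fields can run together with no space between them, which would shift the whitespace-separated field count.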