HECBioSim makes it easy for any UK-based bio-simulation scientist to get significant amounts of time on ARCHER, the UK National High Performance Computing Service. Used appropriately, this can make an enormous impact on the quantity and quality of the research that can be done. But learning how to use HPC resources, and analysing the huge amounts of data they can produce, can be a significant challenge.

In this course you will be introduced to Longbow, a Python tool created by the HECBioSim consortium that allows you to use the major molecular dynamics packages (AMBER, GROMACS, LAMMPS, NAMD) with ease from the comfort of your own desktop. Longbow eliminates the day-to-day tasks of setting up job submission files and having to deal with Linux terminals or FTP transfer programs to upload and download your files. Longbow is designed to be as simple as possible to install and get going, so that researchers can spend more time doing research.

 

Some useful links for future reference:

Longbow software page here

Longbow documentation here

Longbow support here


Setup Exercise 1 - Retrieve your ARCHER username and password from SAFE

Before we can start setting up Longbow we first need to retrieve our account details and test that we can SSH into the ARCHER system. Follow the steps below to retrieve your machine account details for ARCHER.

 

Step 1 - First go to the ARCHER SAFE login webpage here, and log in with your SAFE account credentials (you will have done this as asked in your registration email).

 

Step 2 - Once logged in, go to the "login accounts" option on the navigation menu at the top of the page. A drop-down menu should appear listing various options, including a list of user accounts you have set up. As part of the preparation for this workshop you were asked to request an account under e280, in which you will have created a username. Select this username from the menu.

 

Step 3 - You will be presented with a page listing various system resource metrics; towards the bottom right there are some buttons. We need the button labelled "View Login Account Password". When you click this button, a box will appear below asking for your SAFE password. Enter your SAFE password and click the button labelled "view". You will be shown your system-generated password for gaining entry to the ARCHER login nodes (make a note of this for the next step).

 

That's it, we can now move on to the next step and set up password-less SSH.

 

Setup Exercise 2 - Setting up password-less SSH

Before proceeding to install Longbow, it is important to ensure that you can get access to your ARCHER account, and then to set up password-less entry via SSH. This is because Longbow makes many calls to your remote machine in each Longbow session, and without this you would have to keep inputting your password, which would be cumbersome and annoying! To get started, please work through the following steps and make sure everything is working as it should before moving on.

 

Step 1 - The first thing we should do before setting up password-less SSH is to verify that SSH works, accept the key signature and change the default password. So have the username and password from the previous exercise handy and use them with SSH: open a terminal and type the following (replace "username" with the username from SAFE; you will be prompted for your password, which is the same one you looked up previously):

 

$ ssh username@login.archer.ac.uk

 

You may also be asked to accept the key signature; you should answer yes to this. You will then be asked to change your password; do this and make a note of the new password for the next step. You may then log out and return to your local machine terminal (type exit and press enter).
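If for any reason you are not prompted to change your password automatically, you can do so manually while logged in to ARCHER with the standard Linux command:

$ passwd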

 

Step 2 - We will now create an SSH key pair by typing the following into a terminal (the key needs to be created locally, so if you are still logged in to ARCHER then log out first):

 

$ ssh-keygen

 

You will see something like the following. You will be prompted to enter a file name and a passphrase; you can just press enter three times to accept the default file and a blank passphrase, since we want password-less SSH:

 

$ ssh-keygen 
Generating public/private rsa key pair.
Enter file in which to save the key (/home/juan/.ssh/id_rsa):
Created directory '/home/juan/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/juan/.ssh/id_rsa.
Your public key has been saved in /home/juan/.ssh/id_rsa.pub.
The key fingerprint is:
1a:13:f5:77:7c:32:ea:00:42:4c:99:a8:59:1d:6e:17 juan@trique-ponee
The key's randomart image is:
+--[ RSA 2048]----+
|          =++E   |
|        oo=. o . |
|   + = o . . = . |
|  o . + . . o +  |
|   o S . .       |
|    + o          |
|     . .         |
|                 |
|                 |
+-----------------+

 

Step 3 - We then need to copy the public part of the key pair to ARCHER; there is a utility that will do this automatically for you (replace "username" with the ARCHER username you retrieved in exercise 1; you will also be asked for a password, which is now the password you changed in step 1):

 

$ ssh-copy-id username@login.archer.ac.uk

 

Step 4 - Test that the SSH connection can be established without asking for a password (replace "username" with your ARCHER username):

 

$ ssh username@login.archer.ac.uk

 

You should now be logged into ARCHER without being asked for a password. If this is not the case, seek support from the tutors or see this guide for more information.

 

Setup Exercise 3 - Installation of Longbow

To install Longbow on the cluster machines here at Bristol we will use pip to do a local (--user) install, since you will not have root permissions for a system-wide one. If pip is not available on a machine, there is an alternate install using the setup.py script that is just as quick (this is all covered in the documentation). Follow the instructions below to get Longbow installed and tested.
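For reference, the alternate setup.py route looks roughly like this (a sketch: "longbow-source" stands in for wherever you unpacked the downloaded source release):

$ cd longbow-source
$ python setup.py install --user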

 

Step 1 - Install Longbow using pip.

 

$ pip install longbow --user

 

In this case, we need to use the --user flag to do a local install of Longbow, as you will not have root permissions to do a system-wide install.

 

You should see some output relating to the installation; this is normal. Once complete, we should run some tests to make sure everything works.

 

Step 2 - Testing. Before we go on we need to make sure everything is working. The following check should be performed:

 

First, check that Longbow can be launched; during testing of this workshop the paths on the cluster machines were not set up correctly (this may since have been resolved):

 

$ longbow --version

 

If the above command works you should see the Longbow version number printed to the terminal, in which case you can skip to the next exercise. If, however, you got a message along the lines of "executable not found", there is likely a problem with your paths: the local bin path was not added to the system path by the system administrators. We can resolve this now as follows.

 

Open up your local bashrc file with your favourite text editor

 

$ nano ~/.bashrc

 

Somewhere near the bottom of the file put the following line:

 

export PATH=$PATH:~/.local/bin

 

Then save and exit (in nano this is done by pressing ctrl+x, then 'y' followed by 'enter' to save). You will then need to re-source your bashrc:

 

$ source ~/.bashrc 

 

You should then try the Longbow version command again and make sure you get the version number output.

 

Setup Exercise 4 - Editing the hosts configuration file

Now that Longbow is set up and working, we need to tell it about the computing resource and the user account that we are going to use. To do this Longbow uses a configuration file called "hosts.conf". This file can store the details of many different compute resources, or different configurations for the same compute resource (for example, the same machine with different accounts). There is a hierarchy to the parameters used in files and on the command-line, which can be used to create complex setups; this is beyond the scope of this workshop and is covered in detail in the documentation.

For this workshop we will be creating just one machine profile. To do this, open up hosts.conf in your favourite editor; this file is created during installation in the ~/.Longbow directory:

 

$ nano ~/.Longbow/hosts.conf

 

You will see that this file already contains some information; this is simply a template to show new users what the format of the file looks like. Delete the information that is already in this file and replace it with the following (note that Longbow will default to the compute resource that appears first in this file if one is not specified at runtime). You should then replace all instances of "username" with the username that you created in SAFE prior to attending the workshop:

 

[Archer]
host = login.archer.ac.uk
user = username
remoteworkdir = /work/e280/e280/username/Longbow
account = e280-workshop
queue = R3704872
frequency = 120
maxtime = 01:00

 

That's it, you have created your first machine profile in Longbow!

You will notice that this file has a very specific structure: the ini format. The name in the square brackets denotes the name of the compute resource (you can use this in Longbow to select between compute resources; otherwise Longbow defaults to the first one in the file), and beneath it are listed the parameters that describe that resource. This is quite a powerful way to describe compute resources: it is possible to set up multiple entries for the same compute resource with different settings (such as different accounts or other parameters), or to set up compute resources at different facilities if you have accounts in lots of places. This file can also be used to set machine-specific job defaults, which can then be overridden by job files or command-line parameters (this is beyond the scope of this workshop, but is well covered in the documentation).
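As an illustration (with a hypothetical second project code, e999; only the [Archer] profile you just created is used in this workshop), a hosts.conf holding two profiles for the same machine might look like:

[Archer]
host = login.archer.ac.uk
user = username
remoteworkdir = /work/e280/e280/username/Longbow
account = e280-workshop

[Archer-other-project]
host = login.archer.ac.uk
user = username
remoteworkdir = /work/e999/e999/username/Longbow
account = e999

The second profile could then be selected at runtime using the resource parameter, which you will meet in Simulation Exercise 2.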


Simulation Exercise 1: Running a Single Job 

For this example we are going to run a simple job, launched straight from the command-line, so that you can see just how simple running simulations via Longbow can be. To get started we will make use of the 'Zanamivir Bound to H7N9 Neuraminidase' (part 2) example that you used in the previous session, except we are going to run it on ARCHER. Launching Longbow from the command-line always takes the following form:

 

$ longbow [longbow arguments] executable [executable arguments]

 

So launching Longbow in this way can almost be as simple as writing "longbow" in front of the normal command-line that you would use on your desktop machine. To get started with this example:

 

1. First copy the input files that you used for the "part 2" example in the previous session (they were in the directory called "complex") into a new directory, so you have a clean working space. These are the files "h7n9_zan.prmtop", "h7n9_zan.rst" and "mdconfig". If you deleted these files then you can re-download them from here, but you will need to re-modify them as guided here.

 

2. Once you have done this, change to this new directory in a terminal (using cd) and then we can launch Longbow:

 

$ longbow --log exercise1.log --verbose namd2 mdconfig

 

You will see Longbow go through lots of configuration and setup, and eventually pause shortly after submitting your job. This is normal: Longbow is designed to be light on logging (simulations that take months would otherwise produce massive log files). Longbow will periodically check on the status of your job and stage back your results so far; this is an excellent way to keep an eye on a simulation without having to log in to your compute machine and download files to check progress.

When we launch Longbow like this, the --log argument lets us rename the file that Longbow logs to (some programs default to a file called "log", and so does Longbow, so it is best to choose something meaningful). The --verbose flag simply instructs Longbow to echo basic log output to the terminal you launch from; Longbow is silent by default so that it can be launched "headless" from small institutional cluster machines. Another point to note is that in real-world simulations with Longbow, parameters given on the command-line override those in any configuration file.
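To make this precedence concrete: parameters that can appear in the configuration files can also be supplied as command-line flags of the same name (see the Longbow documentation for the full list). For example, to override the maxtime from hosts.conf for a single run, something like this should work:

$ longbow --log exercise1.log --verbose --maxtime 00:30 namd2 mdconfig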

Launching Longbow like this, without any configuration beyond hosts.conf, will pick up parameters either from hosts.conf or from internal defaults (for example, we didn't set the number of cores in hosts.conf, so Longbow defaults to 24, already a lot more than the 4 you used earlier). You will also notice that NAMD's ++ppn flag has been left out, and there is a good reason for this: on the remote machine, process placement is handled by the scheduler and job launcher, so NAMD's local parallelism flags are not needed.

All being well, this simulation should complete within 4-5 minutes or so (file transfer can take time if the computer cluster you are working on uses certain network file systems, but this will not happen on your own machine), and you will be notified upon completion.

 

Simulation Exercise 2: Let's make it bigger

OK, so you have seen how to quickly fire off a job to ARCHER using Longbow. Now we would like to make this simulation much longer, so that you can get a lot more information from it; this would have taken a prohibitively long time to run on the 4-core processor you used in the previous session. In this example we will modify the simulation input files and also use a job.conf file with Longbow to override the number of cores used for this simulation. To get going with this example:

 

1. Make a new copy of the input files as you did in example 1 in a new directory so you have a nice clean work space.

 

2. To make the simulation run for a much longer time we need to edit the mdconfig file (using nano), just as you did in the previous workshop. Change numsteps to 125000 and change the frequency of output file generation (DCDfreq) to 1000, so that you get 250 ps of simulation with 125 frames of output; the relevant lines are shown below.
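After editing, the two relevant lines of mdconfig should read something like this (NAMD keywords are case-insensitive, so match whatever spelling your file already uses):

numsteps 125000
DCDfreq 1000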

 

3. Now we need to create a job configuration file so that we can make some changes without modifying our default settings in hosts.conf. With your favourite text editor, create a job.conf in the same directory as your input files from step 1:

 

$ nano job.conf

 

Now enter the following structure:

 

[jobname]
executable = namd2
resource = Archer
cores = 48
modules = namd
executableargs = mdconfig

 

In the above file you will notice that we have provided the executable and its arguments, so we will not need to supply them on the command-line later. We have also explicitly specified ARCHER as the machine to run on; in this workshop we have only configured one machine, so we could easily leave this out, but in practice you will likely have access to many machines. We are also specifying that we wish to run on 48 processor cores (so we can do much more simulation in the time available). The modules parameter is used for loading Linux modules on the compute machine: system administrators often install many different configurations or versions of software and provide modules as a way to load them, and this parameter is how you use them with Longbow.
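If you are unsure which module names exist on the remote machine, you can log in and query the module system directly (standard environment-modules commands, run on ARCHER rather than locally):

$ ssh username@login.archer.ac.uk
$ module avail namd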

 

4. Now all we need to do is run it!

 

$ longbow --log exercise2.log --verbose --job job.conf 

 

You will again see Longbow do lots of configuration and setup, followed by submission to ARCHER. Longbow will then enter a monitoring stage, where it waits a few minutes between performing checks on your simulation and staging back the results so far; it will appear to be frozen between these checks, which is normal. You will be notified when your job is complete.

 

Before moving on:

Up next are some examples that showcase some of the more powerful use cases for Longbow; the jobs in this section are chosen simply to highlight this functionality. Please find the files for the next two examples here.

Once downloaded, extract the zip file to a place of your choice.

 

Simulation Exercise 3: Running Replica Jobs

Running replicates is very simple using Longbow. Replicates are useful when you have lots of very similar jobs that could all be launched from an identical command-line, provided the input files are named to allow this. Let's have a look at how such a job is configured and run.

 

1. Change into the 'replicate' directory inside the directory you just extracted. You will notice we have a number of files plus 5 subdirectories, rep1 - rep5. The three files placed in the top-level directory, ending in .prm, .pdb and .psf, are common to all replicate jobs; to prevent wasting lots of disk space, Longbow can detect files used in this fashion and treat them accordingly. Now look at one of the subdirectories:

 

$ ls ./rep1

 

You will see that in here we have more input files. For this example you should pretend that the files in each directory have different starting coordinates; in reality they don't, but in real-world simulations this is a common setup. The resulting layout is sketched below.
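The layout therefore looks something like this (the shared file names are illustrative apart from their extensions):

replicate/
    shared.prm    <- files common to every replica sit in the top level
    shared.pdb
    shared.psf
    rep1/         <- per-replica input files sit in here
    rep2/
    ...
    rep5/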

 

2. Submit the job:

 

$ longbow --log exercise3.log --verbose --replicates 5 namd2 example.in

 

For each replica, Longbow will first look in the current rep* subdirectory for an input file; if it isn't found there, Longbow will then look for it in the parent directory. Because of this, all of the input files will be found and used in the simulations in the correct manner. You can see this by inspecting the generated submit.pbs in another terminal.

Note that it is also possible to submit simulations with all input files in the parent folder and have Longbow generate the rep* directories; this is particularly useful when doing seeded simulations.

 

Simulation Exercise 4: Running Multijobs

One of the most powerful features of Longbow is its ability to rapidly and simultaneously submit multiple jobs and monitor them to completion. To run multiple jobs under one Longbow instance, the jobs should be specified in a job configuration file. Let's try this ourselves:

 

1. Change into the directory 'multi' in the examples directory you previously downloaded. You will see there are three subdirectories in this directory, called "gromacs", "lammps" and "namd". In this example we are going to submit three completely different jobs to ARCHER using three different MD codes.

 

2. Create a job configuration file and submit the job.

 

$ nano job.conf

 

Job files take the form of ini files, like the hosts.conf you edited earlier, so we need to set up three different sections with different parameters in each one:

 

[gromacs]
executable = mdrun_mpi
resource = Archer
cores = 48
modules = gromacs
executableargs = -deffnm example

[namd]
executable = namd2
resource = Archer
cores = 48
modules = namd
executableargs = example.in > example.out

[lammps]
executable = lmp_xc30
resource = Archer
cores = 48
modules = lammps/lammps-9Dec14
executableargs = -i example.in -sf opt

 

In this file we are specifying some information for each job separately. This idea could be used, for example, to run different jobs with different numbers of cores or on different machines (resource). Here, though, we are configuring each job to use a different program (executable), and thus different command-line parameters and formats (executableargs).

Parameters specified under a job in a job configuration file take precedence over the same parameter declared in the hosts.conf file, so this is a good way to provide job-specific overrides for one-offs.

 

3. Now launch the multijob and watch the fun happen!

$ longbow --log exercise4.log --verbose --job job.conf

HECBioSim Workshop - Going Large Course Material

HECBioSim makes it easy for any UK-based biosimulation scientist to get significant amounts of time on ARCHER, the UK National High Performance Computing Service. Used appropriately, this can make an enormous impact on the quantity and quality of the research that can be done. But learning how to use HPC resources, and analysing the huge amounts of data they can produce, can be a significant challenge.

In this course you will be introduced to Longbow, a Python tool created by the HECBioSim consortium that allows you to use the major molecular dynamics packages (AMBER, GROMACS, LAMMPS, NAMD) with ease from the comfort of your own desktop, and pypcazip, a flexible Python-based package for the analysis of large molecular dynamics trajectory data sets.

Before starting this workshop you will need working copies of the software; get them at the links below:

For Longbow download here

For pypcazip download here

 


 

Longbow Exercises

Exercise 1: Setting up a passwordless connection

Normally, setting up a password-less connection with a computer would involve the following two simple commands:

 

% ssh-keygen
% ssh-copy-id username@login.archer.ac.uk

 

See here for more details.

However, for simplicity today we are going to create password-less connections with ARCHER using a script we have prepared.

1. Do:

 

% cd ~

 

2. Then run the script by doing:

 

% ../shared/environment-setup.sh

 

When prompted, enter your ARCHER username and password.

3. Then test the passwordless connection by doing:

 

% ssh username@login.archer.ac.uk

 

You should connect to ARCHER without being prompted for a password.

4. Finally, check that the ~/Workshop directory now exists.

 

Exercise 2: Editing the hosts configuration file

Longbow uses a configuration file called hosts.conf to establish the details of the HPC machines to be used, what the user's credentials are on these machines, and other details such as which directory jobs should be run in on the remote HPC. Let's get this file from the shared directory and edit it.

1. Execute the following command:

 

% cp -r ~/Workshop/.Longbow ~

 

Please make sure it is the home directory (~) you copy to. If you now do:

 

% ls ~

 

You will not see the .Longbow directory we have just copied, because directories beginning with a “.” are hidden. However, if you execute:

 

% ls -a ~

 

You will see the .Longbow directory.

2. Now do:

 

% cd ~/.Longbow

 

And open the hosts.conf file with a text editor of your choice e.g.:

 

% vi hosts.conf

 

You will see the file is split into 2 sections, each headed by a string enclosed in brackets []. These strings are the names of computers. By default Longbow will use the computer at the top of the file unless told otherwise (more on this later).

3. Then edit the Archer section to reflect your own Archer account e.g:

 

[Archer]
host = login.archer.ac.uk
user = DBeckham
remoteworkdir = /work/e280/e280/DBeckham/Longbow
account = e280-workshop	

 

Then save the file.

 

Exercise 3: Run a single Amber job

Now we have edited the hosts configuration file, we are ready to submit our first job to an HPC using Longbow.

1. First let’s change directory

 

% cd ~/Workshop/

 

2. Now do:

 

% unzip LongbowExamples.zip
% cd ./LongbowExamples/Workshop/AmberSingle/
% ls

 

You will see that this directory contains three Amber input files.

3. Submit the job:

 

% longbow -verbose pmemd.MPI -O -i example.in -c example.min -p example.top -o example.out

 

Notice that to submit the job using Longbow, we simply had to place the word “longbow” in front of the ordinary Amber command. We also chose to run in Longbow verbose mode by specifying the -verbose flag.

4. Open another terminal and change directory to where we launched the job:

 

% cd ~/Workshop/LongbowExamples/Workshop/AmberSingle/

 

5. Do:

 

% ls

 

Notice that a file called submit.pbs has been created. Open the file in your favourite text editor and see that Longbow has created a PBS job submission script for you. Keep executing “ls” and eventually you will notice that result files start to be brought back to the desktop by Longbow.
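For orientation, an ARCHER submission script for a job like this will look roughly along the following lines (an illustrative sketch rather than Longbow's exact output; the job name, resource request and module depend on your configuration):

#!/bin/bash --login
#PBS -N myjob
#PBS -l select=1
#PBS -l walltime=00:20:00
#PBS -A e280-workshop

# load the MD code and launch it from the directory the job was submitted from
module load amber
cd $PBS_O_WORKDIR
aprun -n 24 pmemd.MPI -O -i example.in -c example.min -p example.top -o example.out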

 

Exercise 4: Run a single Amber job with a job configuration file

Whilst the hosts configuration file (hosts.conf) should be used primarily to provide information about the HPC we wish to use, the job configuration file should be used primarily to provide information about the job itself, such as the number of cores to use and the walltime. Let’s create a job configuration file and submit the same job as we did in Exercise 3.

1. Change directory:

 

% cd ~/Workshop/LongbowExamples/Workshop/AmberJobConf/
% ls

 

You will notice we have the same input files in here as in Exercise 3.

2. Let’s change the number of cores we wish to use for the job, and set the walltime too, by creating a job configuration file. Create a new file called job.conf:

 

% vi job.conf

 

In this file write the following text:

 

[myjob]
resource = Archer
cores = 48
maxtime = 00:05

 

The text in the brackets gives the job a name; the next line tells Longbow to use the HPC in hosts.conf called Archer; the next line specifies that we would like to use 48 cores; and the final line indicates the walltime should be 5 minutes (HH:MM format). Save the file.

3. Submit the job:

 

% longbow -verbose -job job.conf pmemd.MPI -O -i example.in -c example.min -p example.top -o example.out

 

4. Open another terminal and change directory to where we launched the job:

 

% cd ~/Workshop/LongbowExamples/Workshop/AmberJobConf/

 

5. Look at the submit.pbs file and notice the number of cores and walltime have been set by Longbow as per the instructions in the job configuration file.

 

Exercise 5: Running replicate Amber jobs

Running replicates is very simple using Longbow.

1. First let’s change directory:

 

% cd ~/Workshop/LongbowExamples/Workshop/AmberReplicates/
% ls

 

You will notice we have the example.top and example.in files in here that were used in Exercises 3 and 4. We also have 5 subdirectories, rep1 - rep5.

2. Look at one of the subdirectories:

 

% ls ./rep1

 

You will see that in here we have the example.min file that was used in previous exercises. In each of the rep* subdirectories, the example.min file contains different starting coordinates.

3. Submit the job:

 

% longbow -verbose -replicates 5 pmemd.MPI -O -i example.in -c example.min -p example.top -o example.out

 

For each replica, Longbow will first look in the current rep* subdirectory for an input file; if it isn’t found there, Longbow will then look for it in the parent directory. Because of this, example.top and example.in will be found and used in the simulations.

4. Open another terminal and change directory to where we launched the job:

 

% cd ~/Workshop/LongbowExamples/Workshop/AmberReplicates/

 

5. Look at the submit.pbs file and notice that “example.top” and “example.in” in the pmemd.MPI command are preceded by “../”. This feeds the global topology file and the global input file into all the simulations.

Running replicates when all the input files are identical is even easier because Longbow will create the rep* subdirectories on behalf of the user:

6. First let’s change directory:

 

% cd ~/Workshop/LongbowExamples/Workshop/AmberReplicatesIdenticalInputs/
% ls

 

You will notice that we have the usual 3 input files in this directory. In example.in the variable ig is set to -1; this seeds the random number generator from the system time, so each replica will begin with a different set of random velocities.
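The relevant setting sits in the &cntrl namelist of example.in and looks something like this (all other options elided):

&cntrl
  ...
  ig = -1,
  ...
/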

7. Submit the job:

 

% longbow -verbose -replicates 5 pmemd.MPI -O -i example.in -c example.min -p example.top -o example.out

 

8. Open another terminal and change directory to where we launched the job:

 

% cd ~/Workshop/LongbowExamples/Workshop/AmberReplicatesIdenticalInputs
% ls

 

You will notice that the rep* subdirectories have been created on behalf of the user. The trajectory files will be brought back to these rep* subdirectories by Longbow.

 

Exercise 6: Running Multijobs

One of the most powerful features of Longbow is its ability to simultaneously submit multiple jobs. To do this, more than one job should be specified in the job configuration file. Let’s try this ourselves.

1. First let’s change directory.

 

% cd ~/Workshop/LongbowExamples/Workshop/AmberMultiJob
% ls

 

You will see there are two subdirectories in this directory, one called “single” and another called “replicate”.

2. Create a job configuration file and submit the job.

 

% vi job.conf

 

In this file we need to enter 2 jobs: one called [single], the other called [replicate]. Using what you have learnt today, add Longbow parameters such that one job is a single job and the other is a replicate job. Then submit the job. If you get stuck, a possible solution is sketched below.
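One possible job.conf (values are illustrative: the section names match the two subdirectories, the replicates parameter marks the second job as a five-replica job as in Exercise 5, and the executable plus its arguments can still be given on the command-line as before):

[single]
resource = Archer
cores = 48
maxtime = 00:05

[replicate]
resource = Archer
cores = 48
maxtime = 00:05
replicates = 5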

 


 

pyPcazip Exercises

Hands on session on analysing MD data with Principal Component Analysis techniques

 

1. Introduction

In this session we will see how PCA can be used to assess convergence and sampling in MD and compare dynamics between different data sets.

You will be using a production version of the new pyPcazip toolkit, which is a re-implementation, by the ExTASY project (www.extasy-project.org), of the PCAZIP toolkit developed and distributed by the Laughton group at the University of Nottingham, UK, and the Orozco group at the University of Barcelona, Spain.

The pyPcazip toolkit is compatible with all major MD code formats: AMBER, CHARMM, GROMACS, NAMD, etc.

The data you will be using comes from MD simulations on the mouse Major Urinary Protein (MUP), which is a popular test-bed for the investigation of protein-ligand recognition (see for example Roy and Laughton, Biophys J, 2010, 99(1): 218-226).

This tutorial can also be found online at the official pyPcazip repository: https://bitbucket.org/ramonbsc/pypcazip/wiki/Tutorial

 

First, unpack the data archive:

% cd ~/Workshop
% tar -xvf data.tar.gz
% ls data
>>> apo+holo.ndx apoProteinTraj1.dcd holoProteinTraj1.dcd Protein.gro

 

In the ./data folder you will find:

  • A topology file for the protein: Protein.gro.
  • Trajectory files for the protein, generated in the absence and presence of the pyrazine ligand: apoProteinTraj1.dcd and holoProteinTraj1.dcd respectively.
  • An “index” file apo+holo.ndx – we will come back to this later.

 

2. Familiarisation.

If you want to have a quick look at the systems, load the .gro file into VMD. If you want to see more, load the trajectory files too!

 

3. Principal Component Analysis of the free protein using pyPcazip.

a) Using the pyPcazip package.

To get a quick guide to pyPcazip usage type:

 

% pyPcazip --help

 

b) Performing Principal Component Analysis.

The trajectory files have already been stripped of waters, ions, and the ligand; we could do PCA on the whole protein but analysis of just the dynamics of the C-alpha atoms may be a good place to start. So begin by running the PCA job itself.

Type (all on one line):

 

% pyPcazip -i data/apoProteinTraj1.dcd -t data/Protein.gro -s 'name CA' -p CA.pdb -o apo_CA.pcz -vv

 

Two files will be produced: CA.pdb is a PDB format file containing just the atoms selected for analysis (good for checking that your selection string gave you what you intended), and apo_CA.pcz is the PCA output file. This is a compressed binary file, readable using other tools from the pyPcazip package, so let’s do this next, using pyPczdump:

 

% pyPczdump -i apo_CA.pcz -n

 

gives basic information about the contents of the file. pyPczdump also allows many other sorts of analysis – for a quick guide type:

 

% pyPczdump --help

 

c) Looking at eigenvalues.

Now let’s output a list of the eigenvalues:

 

% pyPczdump -i apo_CA.pcz -l -o apo_CA_evals.dat

 

You can just scan through this in a terminal window, or feed it into your favourite graph-drawing package, e.g. gnuplot:

 

% gnuplot
gnuplot> plot 'apo_CA_evals.dat' w boxes
gnuplot> quit

 

d) Analysis of modes.

Clearly the majority of the dynamics of the free protein can be represented using a small number of collective modes (the eigenvectors). To get a feel for the sorts of motions these are, produce animations of them in the form of multi-model PDB format files, using the PDB file of the selection (CA.pdb, made in step b) above) as a ‘template’. Thus:

 

% pyPczdump -i apo_CA.pcz --anim 0 --pdb CA.pdb -o apo_CA_ev0.pdb

 

will produce an animation of the first (=0, Python numbers from zero!) collective mode, which can be visualised using, e.g., VMD.

(HINT: Graphics>Representations…>Drawing Method>Trace)

 

e) Analysis of convergence and sampling.

Having got a feel for the types of motion represented by each collective mode, we can see how these are sampled over the trajectory by outputting the time series of their projections, e.g.:

 

% pyPczdump -i apo_CA.pcz --proj 0 -o apo_CA_proj0.dat
% pyPczdump -i apo_CA.pcz --proj 1 -o apo_CA_proj1.dat

 

for modes 0 and 1. Again these can be visualised through, e.g. gnuplot.
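For example, with gnuplot again:

% gnuplot
gnuplot> plot 'apo_CA_proj0.dat' w lines, 'apo_CA_proj1.dat' w lines
gnuplot> quit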

 

f) Introducing the pczplot tool.

The process of running pyPczdump to produce an output file, and then turning this into some form of graph in a second step, is flexible but rather tedious. You have a chance to evaluate a trial version of an analysis tool that does both.

Type:

 

% pyPczplot apo_CA.pcz

 

The start-up screen shows a plot of the eigenvalues, and summary data about the .pcz file. Press ‘1’ to show the time-series for the projections along the first PC. Pressing the ‘i’ and ‘k’ keys takes you to higher/lower PCs.

Press ‘2’ to produce histograms of the data. A PC that is sampled in a perfectly harmonic way will give a Gaussian distribution with a variance equal to the eigenvalue (shown by the dotted line); you can see how well (or not) your data fits this model. Use the ‘j’ and ‘l’ keys to change the selected PC.
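In other words, a harmonically sampled PC should follow a Gaussian density whose variance is the eigenvalue $\lambda_i$ of mode $i$:

$$p(q_i) = \frac{1}{\sqrt{2\pi\lambda_i}} \exp\left(-\frac{q_i^2}{2\lambda_i}\right)$$

Deviations from this shape indicate anharmonic (for example, multi-basin) sampling.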

Press ‘3’ to produce a ‘map’ showing how the trajectory moves over a 2D-subspace defined by two principal modes. The two modes are selected using the j/l and i/k keys.

Press ‘4’ to convert the map into a 2D histogram. This can be presented in the form of a heat map or contour plot – press ‘c’ to toggle between them.

You will get a more complete flavour of what can be done graphically with results produced by pyPcazip in the pyPcazip GUI session shortly.

 

g) Further work.

Experiment with choosing different selections of atoms for the analysis (‘-s’ argument in step b)).

Repeat the whole process for the dataset from MUP bound to the pyrazine ligand (holo).
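Assuming the same C-alpha selection, the holo analysis of step b) would start with something like this (the output name holo_CA.pcz is chosen to match the comparison commands in the next section):

% pyPcazip -i data/holoProteinTraj1.dcd -t data/Protein.gro -s 'name CA' -o holo_CA.pcz -vv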

 

4. Comparative analysis of the dynamics of MUP in the presence and absence of the ligand.

a) Calculation of the dot product matrix using pyPczcomp.

If you visualise the first collective mode of MUP in the free state, and compare it with the same mode for MUP in the bound state, you should be able to see, even by eye, that they are not exactly the same. We can put this on a quantitative footing by calculating the dot product matrix between the eigenvectors found from PCA on the apo protein and those found from PCA on the ligand-bound form.

 

% pyPczcomp -i apo_CA.pcz holo_CA.pcz --quick

 

b) Combined PCA on both trajectories.

To analyse the relative sampling and convergence of both simulations within a common subspace, we need to perform PCA on them both together:

 

% pyPcazip -i data/apoProteinTraj1.dcd data/holoProteinTraj1.dcd -t data/Protein.gro -s 'name CA' -o apo+holo_CA.pcz -vv

 

(note how we give a list of trajectory files to be analysed in the ‘-i’ argument).

 

c) Visualisation of the data.

Use pyPczplot to visualise the results. In this case, also specify an index file, which provides information about which snapshots in the combined trajectory used for the PCA belong to which simulation:

 

% pyPczplot apo+holo_CA.pcz data/apo+holo.ndx

 

Using the 2D map option (key ‘4’) see how the apo simulation samples an area of PC1/PC2 space that is not sampled in the holo simulation.

 


pyPcazipgui Exercises

Example 1: Analysing an ensemble of replicas for sampling and convergence

pyPcazipGUI allows MD researchers to perform principal component analysis (PCA) and plot the results, making it simple to analyse the sampling and convergence of your replicate simulations.

We can demonstrate this with 10 replicates of pentaalanine:

1. First let’s unzip the pentaalanine data:

 

% unzip ~/Workshop/Ala5.zip

 

2. Now change directory into the Ala5 directory and launch pyPcazipgui:

 

% cd ~/Workshop/Ala5
% pyPcazipgui

 

3. When the GUI has loaded, click “Browse” to load a global topology file. Select “penta.top”.

4. Now click “Add row” 10 times to load each of the “pentalong.dcd” files in the rep* subdirectories. YOU DON’T NEED TO FILL IN THE OTHER FIELDS IN THE TABLE.

5. Now press “Save ensemble and go to PCA”. This will take you to the “PCA” tab where we will perform principal component analysis.

We are now ready to plot the principal components and investigate the sampling and convergence of the ensemble...

 

Initial investigations of the ensemble

 

1. Enter a PCZ output file and press “Run pyPcazip”. When this has finished it will automatically fill the “Select PCZ file” field. Now press “Plot data”.

2. You will see the below graph:

[Figure: eigenvalue plot]

 

 

The fact that the eigenvalues are not dramatically different in magnitude is a good sign. Drastically different eigenvalues can indicate that the analysis is suffering from artefacts due to periodic boundary conditions.

3. Now go to the “PC time series” tab and look at the individual replicas. Below are two examples:

 

[Figure: PC time series plots for two different replicas]

 

Notice that the time series are quite different from each other. This shows the replicates are independent of each other.

4. Next go to the “PC histograms” tab and compare different projections for different replicas. Below is the first projection (projection 0, left) for replica 1 and the fourth projection (projection 3, right) for replica 1.

 

[Figure: PC histograms for replica 1: projection 0 (left) and projection 3 (right)]

 

The first projection seems to be far from Gaussian, whereas the higher projections are more Gaussian-like in distribution. This might be what you expect: the higher-frequency projections are often better sampled. To investigate whether the deviation from Gaussian is simply due to poor sampling, let’s look at the distribution for projection 0 for all the replicas combined by selecting ‘All’ from the Replica drop-down list:

 

[Figure: histogram of projection 0 pooled over all replicas]

 

The pooled distribution also isn’t Gaussian. This suggests the distribution is not an artefact of poor sampling: the underlying free energy surface contains basins that are causing the distribution to deviate from Gaussian.

5. Next go to the “2D timeseries” tab. Looking through projection 0 vs projection 1 for the different replicas, it seems the free energy surface contains at least 2 basins that are being sampled in different ways by the different replicas. Below are two examples:

 

[Figure: 2D time series of projection 0 vs projection 1 for two different replicas]

 

Note that pressing “Animate timeseries” allows you to watch the simulation basin-hopping.

Note also that if you click on a point you are interested in, the frame number and filename are printed to the dialogue:

 

[Figure: clicking a point prints the frame number and filename to the dialogue]

 

6. Next go to the “2D histograms” tab. Flicking through the different replicas for projection 0 vs projection 1, we can see there is a consensus, as all replicas exhibit three primary minima; however, some of the replicates have sampled the different basins in drastically different measure:

 

[Figure: 2D histograms of projection 0 vs projection 1 for two different replicas]

 

The differences between the replicates mean that although we have converged on the three macrostates, the different replicates have clearly not converged on the same microstates.

We can also select replica ‘All’ and view the plot we currently have that is closest to the true underlying free energy landscape:

 

[Figure: 2D histogram of projection 0 vs projection 1 for all replicas combined]

 

7. Next go to the “2D clustering” tab. Continuing to look at all the replicates, we can see that three clusters have been found, reinforcing our earlier observation that there seem to be three primary macrostates.

 

[Figure: 2D clustering over all replicates, showing three clusters]

 

Note that it is possible to smooth the boundaries between the clusters by increasing the resolution. We can also cluster over more projections if we wish by increasing the number of projections.

8. Now go to the “Cluster sampling” tab. As we scroll through the different replicates we can see that in all cases the plots level out. This indicates that the replicates are sampling the three macrostates. However, the different replicates clearly spend differing amounts of time in each of the three clusters:

 

[Figure: cluster sampling plots for two different replicas]

 

This indicates that the replicates have not converged. This is investigated further in the following section.

How well converged is the ensemble?

We can investigate how well converged our ensemble is by splitting the ensemble in two and comparing the data. If the ensemble has converged, the 2D histograms of the two pooled halves should be very similar.

1. Go back to the “Define ensemble” tab. Before we do anything, let’s save the current ensemble to an album file. Once you have done that, rename the system of the first 5 rows of the table to “system1” and the second 5 rows to “system2”. If you wish you can also renumber the replicates of the second 5 rows, although this isn’t actually necessary.

2. Check the “Using pre-prepared PCZ file?” option. We can use the PCZ file we created last time because the content hasn’t changed, just our labelling. Press “Save ensemble and go to PCA”.

3. This will take you to the PCA tab. Select the PCZ file we created previously and press “Plot data”

4. Go to the “2D histograms” tab and select ‘All’ from the Replicas drop-down list for “system1”. Then change the system to “system2”. Below are the plots:

 

[Figure: pooled 2D histograms for system1 and system2]

 

As you can see, the distributions are quite different. This means we should run more or longer simulations in order to sample the system more fully.

But is it better to run more replicates or just longer simulations? We will investigate this in the following section...

Should more replicates, or alternatively longer simulations, be run to sample the system more substantially?

To investigate whether it is better to run longer simulations or to run more short replicates to sample the system more thoroughly, we can manipulate the systems table on the “Define ensemble” tab.

1. Go back to the “Define ensemble” tab and save the current system table to an album file if you wish.

2. Delete the final 5 rows of the system table. For rows 2-5, rename the system to “system2” but leave row 1 as “system1”. Then for rows 2-5 set “Frames to” to be 2499, un-check the “Using pre-prepared PCZ file” option and press “Save ensemble and go to PCA”.

3. This will take you to the “PCA” tab. We will need to create a new pcz file, so enter a different filename to the one entered previously and press “Run pyPcazip”.

4. When the calculation has finished, press “Plot data”.

5. Go to the “2D histograms” tab and select ‘All’ from the Replica drop-down list. Now compare the two systems:

 

[Figure: pooled 2D histograms for the two systems]

 

The two plots are very similar, so perhaps neither approach has an advantage over the other: it is just as good to run more short replicates as it is to restart the current simulations and run further dynamics. However, it may be more convenient to run more replicates given access to a large supercomputer.

Example 2: Is RMSD a reliable indicator of equilibration and convergence?

Researchers typically plot RMSD to determine whether a system has equilibrated. But can RMSD be trusted as a reliable indicator of equilibration and convergence?

We can investigate this with 10 replicates of a dynein light-chain LC8 from Drosophila:

1. First let’s unzip the data we will be looking at:

 

% unzip ~/Workshop/1rhw.zip

 

2. Now change directory into the 1rhw directory and launch pyPcazipgui:

 

% cd ~/Workshop/1rhw
% pyPcazipgui

 

3. When the GUI has loaded, click “Browse” to load a global topology file. Select “1rhw_prot.pdb”.

4. Press “Add row” and select ~/Workshop/1rhw/rep1/1rhw.md1.nc

5. Also load the equivalent files from the rep2 and rep3 subdirectories.

6. Give each of the 3 loaded files a different system name.

7. Press “Save ensemble and plot RMSD”.

8. Scrolling through the different systems, we can see that the RMSD of these replicates is very different, and in one case doesn’t seem to be plateauing:

 

[Figure: RMSD plots for three individual replicates]

 

9. Let’s take a look at what happens when we add them to the same ensemble. Go back to the “Define ensemble” tab, give each of them the same system name and press “Save ensemble and plot RMSD”.

 

[Figure: RMSD plot for the combined three-replicate ensemble]

 

Although the three RMSD plots individually looked different, the average RMSD plot (red dots) plateaus, suggesting the replicates are converging. Furthermore, the average-structure RMSD (green dots) of the replicates is lower, which means the average structure is closer to the reference point than the individual replicates. This means the more replicates we include the better, so let’s now include all 10 replicates.

10. Go back to the “Define ensemble” tab, add the other 7 replicates to the system and press “Save ensemble and plot RMSD”.

 

[Figure: RMSD plot for the full 10-replicate ensemble]

 

The plot seems to plateau after around 500 timeframes so let’s remove the first 500 frames from the whole ensemble.

11. Go back to the “Define ensemble” tab, enter 500 in the “Frames from” column for all rows of the systems table and press “Save ensemble and plot RMSD”.

 

[Figure: RMSD plot with the first 500 frames removed]

 

The graph now looks very stable and many scientists would conclude the ensemble has equilibrated.

12. Now progress to the “PCA” tab and let’s run principal component analysis by entering a filename and pressing “Run pyPcazip”. You will see the following plot:

 

[Figure: eigenvalue plot]

 

The fact that the eigenvalues are not dramatically different in magnitude is a good sign. Drastically different eigenvalues can indicate that the analysis is suffering from artefacts due to periodic boundary conditions.

13. Now explore the “PC time series”, “PC histograms”, “2D histograms” and “2D clustering” tabs and compare different replicates. You will notice the plots are drastically different from each other, indicating that the replicates have not converged. For example, below are two plots from the “2D histograms” tab for different replicates:

 

[Figure: 2D histograms for two different replicates]

 

14. Now go to the “Cluster sampling” tab and select Replica ‘All’ from the drop-down list. The percentage occupancy of each cluster vs timeframe index has not plateaued, showing that the RMSD graph plotted earlier was highly misleading, as the system clearly hasn't converged:

 

[Figure: cluster occupancy vs timeframe index for all replicas]