==========================================
Assembling reads with Velvet
==========================================
In this exercise we will learn how to perform an assembly with Velvet. Velvet
takes your reads as input and turns them into contigs. It consists of two
steps. In the first step, ``velveth``, the de Bruijn graph is created.
Afterwards the graph is traversed and contigs are created with ``velvetg``.
When constructing the de Bruijn graph, a *kmer* has to be specified. Reads are
cut up into pieces of length *k*, each representing a node in the graph, edges
represent an overlap (some de Bruijn graph assemblers do this differently, but
the idea is the same). The advantage of using kmer overlap instead of read
overlap is that the computational requirements grow with the number of unique
kmers instead of unique reads. A more detailled explanation can be found at
http://www.nature.com/nbt/journal/v29/n11/full/nbt.2023.html.


Pick a kmer
===========
Please work in pairs for this assignment. Every group can select a kmer of
their likings - pick a random one if you haven't developed a preference yet.
Write you and your partner's name down at a kmer on the
Google doc_ for this workshop.

.. _doc: https://docs.google.com/spreadsheet/ccc?key=0AvduvUOYAB-_dDdDSVhqUi1KQmJkTlZJcHVfMGI3a2c#gid=3 

velveth
=======
Create the graph data structure with ``velveth``. Again like we did with
``sickle``, first create a directory with symbolic links to the pairs that you
want to use::

    mkdir -p ~/asm-workshop/velvet
    cd ~/asm-workshop/velvet
    ln -s ../sickle/qtrim1.fastq pair1.fastq
    ln -s ../sickle/qtrim2.fastq pair2.fastq

The reads need to be interleaved for ``velveth``::

    shuffleSequences_fastq.pl pair1.fastq pair2.fastq pair.fastq

Run velveth over the kmer you picked (21 in this example)::

    velveth out_21 21 -fastq -shortPaired pair.fastq

Check what directories have been created::

    ls

velvetg
=======
To get the actual contigs you will have to run ``velvetg`` on the created
graph. You can vary options such expected coverage and the coverage cut-off if
you want, but we do not do that in this tutorial. We only choose not to do
scaffolding::

    velvetg out_21 -scaffolding no


assemstats
==========
After the assembly one wants to look at the length distributions of the
resulting assemblies. You can use the ``assemstats`` script for that::

    assemstats 100 out_*/contigs.fa

Try to find-out what each of the stats represent by varying the cut-off. One of
the most often used statistics in assembly length distribution comparisons is
the *N50 length*, a weighted median, where you weight each contig by it's
length. This way you assign more weight to larger contigs. Fifty procent of all
the bases in the assembly are contained in contigs shorter or equal to N50
length. Once you have gotten an idea of what it all the stats mean, it is time
to compare your results with the other attendees of this workshop. Generate the results and copy them to the doc_::

    assemstats 100 out_*/contigs.fa

Do the same for the cut-off at 1000 and add it to the doc_. Compare your kmer
against the others. If there are very little available yet, this would be an
ideal time to help out some other attendees or do the same exercise for a kmer
that has not been picked by somebody else yet. Please write down you and your
partners name again at the doc_ in that case.


**Question: What are the important length statistics? Do we prefer sum over
length? Should it be a combination?**

Think of a formula that could indicate the best preferred
length distribution where you express the optimization function in terms of the
column names from the doc_. For instance only ``n50_len`` or ``sum *
n50_len``.


Ray
=======================
Try to create an assembly with Ray over the same kmer. Ray is an assembler that
uses MPI to distribute the assembly over multiple cores and nodes. The latest
version of Ray was made to work well with metagenomics data as well::

    mkdir -p ~/asm-workshop/ray
    cd ~/asm-workshop/ray
    ln -s ../sickle/qtrim1.fastq pair1.fastq
    ln -s ../sickle/qtrim2.fastq pair2.fastq
    mpiexec -n 1 Ray -k 21 -p pair1.fastq pair2.fastq -o out_21

Add the ``assemstats`` results to the doc_ as you did for Velvet. There is a
separate tab for the Ray assemblies, compare the results with Velvet.

(Optional exercise) VelvetOptimiser
===================================
VelvetOptimiser_ is a script that runs Velvet multiple times and follows the
optimization function you give it. Use VelvetOptimiser_ to find the assembly
that gets the best score for the optimization function you designed in
`assemstats`_. It requires ``BioPerl``, which you can get on uppmax with
``module load BioPerl``.

.. _VelvetOptimiser: https://github.com/Victorian-Bioinformatics-Consortium/VelvetOptimiser