Commit e332e233 authored by Marc

default_repo again in project directory; error handling in database script

parent a1aae5c2
*.swp
*.pyc
docs/_build/
[0.2]
* Implemented cluster execution with SGE
* Aligned output format for homology tools
* Changed module manifest format to allow parameters for converters
* Ignore hidden files in modules directory
* Implement resolving of dbxrefs
* Introduce support for own repositories
[0.1]
* Implemented psot runner script
* Implemented psot configuration
* Implemented nextflow execution
* Added blastp vs swissprot module
* Added ghostx vs swissprot module
* Added signalp module
* Added exemplary profiles
The MIT License (MIT)
Copyright (c) 2017 SOaAS
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Protein Sequence Observation Tool
=================================
A tool that executes several bioinformatic tools for a set of protein
sequences. The results of all tools are converted into json documents, which
allows for simple post-processing.
Supported bioinformatic tools:
* hmmer
* blastp
* signalp
* ghostx
* targetp
Getting started for development (Setup)
---------------------------------------
Prerequisites:
* git
* python3
* nextflow
Install nextflow::

    curl -fsSL get.nextflow.io | bash

Check out the repository::

    git clone git@git.computational.bio.uni-giessen.de:SOaAS/psot.git

Set up a virtualenv for development and install psot in editable mode::

    # install in development environment
    virtualenv --python=python3 venv; source venv/bin/activate;
    pip3 install -e .
    # run tests
    python3 setup.py test
    # compile documentation
    python3 setup.py build_sphinx

At the moment the required `dbxref` module is not available via a public pip
repository. You have to check it out via git and install it into the same
virtual environment.

Use the application::

    psot analyze -f example/proteins.fas -o results -p fast

Run the tests::

    # all tests
    python3 setup.py test
    # single test module, e.g. test_repository
    python3 setup.py test -s tests.test_repository
FROM ubuntu:18.04
RUN apt-get update && \
apt-get install -y wget && \
apt-get install -y python3 &&\
apt-get install -y ncbi-blast+
ENV INSIDE_CONTAINER=TRUE
\ No newline at end of file
FROM ubuntu:18.04
COPY . /
RUN apt-get update
RUN apt-get -y install perl
RUN apt-get -y install gawk
CMD echo "Container is running"
\ No newline at end of file
FROM ubuntu:18.04
COPY helpers /
RUN apt-get update && \
apt-get install -y wget && \
apt-get install -y python3
RUN echo 'export PATH=/opt/conda/bin:$PATH' > /etc/profile.d/conda.sh && \
wget --quiet https://repo.continuum.io/miniconda/Miniconda2-4.0.5-Linux-x86_64.sh -O ~/miniconda.sh && \
/bin/bash ~/miniconda.sh -b -p /opt/conda && \
rm ~/miniconda.sh
ENV PATH=$PATH:/opt/conda/bin
RUN conda config --add channels r
RUN conda config --add channels bioconda
RUN conda upgrade conda
RUN conda install ghostx && conda update ghostx
ENV PATH=$PATH:/helpers
ENV INSIDE_CONTAINER=TRUE
\ No newline at end of file
#!/usr/bin/python3
import argparse
import fileinput

parser = argparse.ArgumentParser(description='Replaces fasta headers with unique numbers and saves a dictionary of both in tsv format. Caution: The original fasta file gets replaced in the process.')
parser.add_argument('--fasta', '-f', required=True, help='The fasta file')
parser.add_argument('--enum-headers', '-e', required=True, help='File to store enumerated headers in tsv format')
args = parser.parse_args()

fasta = args.fasta
headers_dict = {}
num = 1
# With inplace=True, everything printed inside the loop is written back to the fasta file.
with fileinput.FileInput(fasta, inplace=True) as f:
    for line in f:
        if line.startswith(">"):
            # Remember the original header and replace it with a running number.
            header = line.strip().lstrip('>')
            headers_dict[num] = header
            print(">{}".format(num))
            num += 1
        else:
            # Sequence lines are passed through unchanged.
            print(line, end='')

# Write the number-to-header mapping as tab-separated values.
enum_headers_file = args.enum_headers
with open(enum_headers_file, 'w') as o:
    for key in headers_dict:
        o.write("{}\t{}\n".format(key, headers_dict[key]))
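The script above only writes the mapping; the reverse step could look roughly like the following sketch. It assumes the tsv layout produced above (number, tab, original header). The function names `load_enumerated_headers` and `restore_headers` are hypothetical and not part of this commit.

```python
def load_enumerated_headers(lines):
    """Parse 'number<TAB>header' lines into a dict keyed by the number (as str)."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split only on the first tab so headers containing tabs stay intact.
        number, header = line.split("\t", 1)
        mapping[number] = header
    return mapping


def restore_headers(fasta_lines, mapping):
    """Yield fasta lines with numeric headers replaced by the original headers."""
    for line in fasta_lines:
        if line.startswith(">"):
            key = line[1:].strip()
            # Unknown keys are left as they are instead of raising an error.
            yield ">" + mapping.get(key, key) + "\n"
        else:
            yield line
```

Both functions accept any iterable of lines, so they work on open file objects as well as on lists of strings.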
FROM ubuntu:18.04
RUN apt-get update && \
apt-get install -y python3 && \
apt-get install -y hmmer
ENV INSIDE_CONTAINER=TRUE
\ No newline at end of file
FROM ubuntu:18.04
ADD tools.tar.gz /
RUN apt-get update
RUN apt-get -y install perl
RUN apt-get -y install gawk
RUN apt-get -y install python3
ENV PATH=$PATH:/chlorop-1.1:/signalp-4.1:/targetp-1.1:/tmhmm-2.0c/bin:/helpers
ENV INSIDE_CONTAINER=TRUE
#!/usr/bin/python3
import argparse
import fileinput

parser = argparse.ArgumentParser(description='Replaces fasta headers with unique numbers and saves a dictionary of both in tsv format. Caution: The original fasta file gets replaced in the process.')
parser.add_argument('--fasta', '-f', required=True, help='The fasta file')
parser.add_argument('--enum-headers', '-e', required=True, help='File to store enumerated headers in tsv format')
args = parser.parse_args()

fasta = args.fasta
headers_dict = {}
num = 1
# With inplace=True, everything printed inside the loop is written back to the fasta file.
with fileinput.FileInput(fasta, inplace=True) as f:
    for line in f:
        if line.startswith(">"):
            # Remember the original header and replace it with a running number.
            header = line.strip().lstrip('>')
            headers_dict[num] = header
            print(">{}".format(num))
            num += 1
        else:
            # Sequence lines are passed through unchanged.
            print(line, end='')

# Write the number-to-header mapping as tab-separated values.
enum_headers_file = args.enum_headers
with open(enum_headers_file, 'w') as o:
    for key in headers_dict:
        o.write("{}\t{}\n".format(key, headers_dict[key]))
FROM ubuntu:18.04
COPY . /
RUN apt-get update && apt-get -y install perl
CMD echo "Container is running"
\ No newline at end of file
FROM ubuntu:18.04
COPY . /
RUN apt-get update
RUN apt-get -y install perl
RUN apt-get -y install gawk
CMD echo "Container is running"
\ No newline at end of file
FROM ubuntu:18.04
COPY . /
RUN apt-get update && apt-get -y install perl
CMD echo "Container is running"
\ No newline at end of file
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line.
SPHINXOPTS =
SPHINXBUILD = sphinx-build
SPHINXPROJ = psot
SOURCEDIR = .
BUILDDIR = _build
# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
Concepts
========
PSOT is a system that executes bioinformatic tools on a file with protein
sequences and converts the results into easy to process json documents. It
contains a live mode that writes the results of already finished tools into
a directory, which can be polled and further processed, e.g. by a website
that displays results as they become ready.
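A consumer of the live directory could poll it along the lines of the following sketch. It assumes that each finished result appears as a separate `*.json` file in the live directory; the file naming and the function `poll_new_results` are assumptions made for illustration, not part of PSOT.

```python
import json
from pathlib import Path


def poll_new_results(live_dir, seen):
    """Return the parsed json documents that appeared since the last call.

    `seen` is a set of file names that is updated in place.
    """
    new_docs = {}
    for path in sorted(Path(live_dir).glob("*.json")):
        if path.name in seen:
            continue
        try:
            new_docs[path.name] = json.loads(path.read_text())
        except json.JSONDecodeError:
            # The file may still be in the process of being written;
            # it is retried on the next poll.
            continue
        seen.add(path.name)
    return new_docs
```

A website backend would call this function periodically with the same `seen` set and render whatever it returns.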
Vocabulary
----------
Module
  A module implements a bioinformatic tool and the corresponding json
  converter. It is defined in a module manifest.

Profile
  A profile is a set of modules that are executed during an execution of
  PSOT. Profiles can override default parameters of modules.

Repository
  A collection of profiles, modules, scripts and configurations.
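How a profile overrides module defaults can be pictured with the following sketch. The dictionary layout and the key name `parameters` are assumptions for illustration; the actual manifest and profile formats may differ.

```python
def resolve_parameters(module_manifest, profile_entry):
    """Combine module defaults with the overrides of a profile.

    Values from the profile win over the defaults of the module.
    """
    params = dict(module_manifest.get("parameters", {}))
    params.update(profile_entry.get("parameters", {}))
    return params


# Hypothetical module manifest and profile entry.
blastp_manifest = {"name": "blastp_swissprot", "parameters": {"evalue": 1e-5, "threads": 1}}
fast_profile_entry = {"name": "blastp_swissprot", "parameters": {"evalue": 1e-10}}

effective = resolve_parameters(blastp_manifest, fast_profile_entry)
```

Here `effective` keeps `threads` from the module defaults and takes `evalue` from the profile.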
Workflow
--------
1. Load all module manifests and profiles from all available repositories
2. Create an execution directory
3. Generate a nextflow script for the chosen profile in the execution directory
4. Run the nextflow script
5. Remove the execution directory
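Steps 2 to 5 of the workflow above can be sketched as follows. This is not the PSOT implementation: `generate_script` stands in for the script generator, the file name `main.nf` is an assumption, and loading of manifests and profiles (step 1) is left out.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_profile(profile_name, generate_script, run=subprocess.run):
    """Create an execution directory, run the generated script, clean up."""
    # Step 2: create an execution directory.
    execution_dir = Path(tempfile.mkdtemp(prefix="psot_"))
    try:
        # Step 3: generate the nextflow script for the chosen profile.
        script = execution_dir / "main.nf"
        script.write_text(generate_script(profile_name))
        # Step 4: run the nextflow script inside the execution directory.
        run(["nextflow", "run", str(script)], cwd=execution_dir, check=True)
    finally:
        # Step 5: remove the execution directory, even if the run failed.
        shutil.rmtree(execution_dir, ignore_errors=True)
    return execution_dir
```

The `run` parameter exists only so that the subprocess call can be replaced in tests; by default `nextflow` has to be on the `PATH`.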
Structure of the Nextflow Script
--------------------------------
1. Run all analyses in parallel
2. Convert all analyses in parallel
3. In live mode: generate a json document for each module and each sequence within the live directory
4. In retrieve mode: retrieve all information from the referenced databases
5. Join all json files into a single one containing all information
6. Split the large json file into separate files for each sequence
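The join and split steps (5 and 6) can be illustrated in plain Python. The sketch assumes that every converter yields a mapping from sequence id to result; this layout, the output file naming and the function names are assumptions and not taken from the generated nextflow script.

```python
import json
from pathlib import Path


def join_results(module_results):
    """Turn {module: {sequence_id: result}} into {sequence_id: {module: result}}."""
    joined = {}
    for module_name, per_sequence in module_results.items():
        for seq_id, result in per_sequence.items():
            joined.setdefault(seq_id, {})[module_name] = result
    return joined


def split_results(joined, output_dir):
    """Write one json file per sequence and return the written paths."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for seq_id, document in joined.items():
        path = output_dir / "{}.json".format(seq_id)
        path.write_text(json.dumps(document, indent=2))
        written.append(path)
    return written
```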
Loading of configuration artifacts
----------------------------------
Profiles, modules and configurations are organized in repositories. A
repository can contain the following elements:
* config.yaml (file)
* modules/ (directory with yaml files)
* profiles/ (directory with yaml files)
* scripts/ (directory with scripts for modules)
PSOT uses a repository search path to find bundled and own repositories. It can
be set by defining the environment variable `PSOT_REPOSITORIES` with a ':'
separated list of paths.
The repositories are loaded in the following order. Later repositories
overwrite values from previous repositories.
* default repository
* PSOT_REPOSITORIES
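The loading order can be sketched as follows. The helper names are hypothetical, and the shallow `dict.update` merge is an assumption about how values are overwritten; only the variable name `PSOT_REPOSITORIES` and the ':' separator come from the text above.

```python
import os


def repository_search_path(default_repo, environ=None):
    """Return the repositories in loading order: default first, then PSOT_REPOSITORIES."""
    environ = os.environ if environ is None else environ
    raw = environ.get("PSOT_REPOSITORIES", "")
    # Empty entries (e.g. from a trailing ':') are skipped.
    extra = [path for path in raw.split(":") if path]
    return [default_repo, *extra]


def merge_configs(configs):
    """Merge configurations in loading order; later values overwrite earlier ones."""
    merged = {}
    for config in configs:
        merged.update(config)
    return merged
```

With `PSOT_REPOSITORIES=/data/repo_a:/data/repo_b`, a value defined in `repo_b` would win over the same key in `repo_a` and in the default repository.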
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
# PSOT - protein sequence observation tool documentation build configuration file, created by
# sphinx-quickstart on Wed Jul 26 11:24:20 2017.
#
# This file is execfile()d with the current directory set to its
# containing dir.
#
# Note that not all possible configuration values are present in this
# autogenerated file.
#
# All configuration values have a default; values that are commented out
# serve to show the default.
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
# import os
# import sys
# sys.path.insert(0, os.path.abspath('.'))
# -- General configuration ------------------------------------------------
# If your documentation needs a minimal Sphinx version, state it here.
#
# needs_sphinx = '1.0'
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = ['sphinx.ext.autodoc']
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
#
# source_suffix = ['.rst', '.md']
source_suffix = '.rst'
# The master toctree document.
master_doc = 'index'
# General information about the project.
project = 'PSOT - protein sequence observation tool'
copyright = '2017, Lukas Jelonek'
author = 'Lukas Jelonek'
# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
version = '0.2'
# The full version, including alpha/beta/rc tags.
release = '0.2'
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = None
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# These patterns also affect html_static_path and html_extra_path
exclude_patterns = []
# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx'
# If true, `todo` and `todoList` produce output, else they produce nothing.
todo_include_todos = False
# -- Options for HTML output ----------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = 'sphinx_rtd_theme'
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
#
# html_theme_options = {}
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
# -- Options for HTMLHelp output ------------------------------------------
# Output file base name for HTML help builder.
htmlhelp_basename = 'PSOT-proteinsequenceobservationtooldoc'
# -- Options for LaTeX output ---------------------------------------------
latex_elements = {
# The paper size ('letterpaper' or 'a4paper').
#
# 'papersize': 'letterpaper',
# The font size ('10pt', '11pt' or '12pt').
#
# 'pointsize': '10pt',
# Additional stuff for the LaTeX preamble.
#
# 'preamble': '',
# Latex figure (float) alignment
#
# 'figure_align': 'htbp',
}
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
(master_doc, 'PSOT-proteinsequenceobservationtool.tex', 'PSOT - protein sequence observation tool Documentation',
'Lukas Jelonek', 'manual'),
]
# -- Options for manual page output ---------------------------------------
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
(master_doc, 'psot-proteinsequenceobservationtool', 'PSOT - protein sequence observation tool Documentation',
[author], 1)
]
# -- Options for Texinfo output -------------------------------------------
# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
(master_doc, 'PSOT-proteinsequenceobservationtool', 'PSOT - protein sequence observation tool Documentation',
author, 'PSOT-proteinsequenceobservationtool', 'One line description of project.',
'Miscellaneous'),
]
.. PSOT - protein sequence observation tool documentation master file, created by
   sphinx-quickstart on Wed Jul 26 11:24:20 2017.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.
Welcome to PSOT - protein sequence observation tool's documentation!
====================================================================
.. toctree::
   :maxdepth: 2
   :caption: Contents:

   concepts
   installation
   usage.rst
   modules
   profiles
   repositories
   results_format
Installation
============
Requirements
------------
In order to run PSOT on your machine you need:
* git
* python >= 3
* nextflow
* dbxref library
* the bioinformatic tools you want to use

  * blastp
  * hmmscan
  * signalp
  * ghostx
  * tmhmm
  * targetp

* the bioinformatic databases you want to use

  * PFAM
  * Swiss-Prot
Installation
------------
Install nextflow::

    curl -fsSL get.nextflow.io | bash

(Optional) create a virtual env::

    virtualenv --python=python3 venv
    source venv/bin/activate

Install dbxref and psot::

    pip3 install git+https://git.computational.bio.uni-giessen.de/SOaAS/dbxref.git
    pip3 install git+https://git.computational.bio.uni-giessen.de/SOaAS/psot.git

Install a default repository for your profiles and modules::

    TODO

Developer setup
---------------

Clone the git repository of PSOT::

    git clone git@git.computational.bio.uni-giessen.de:SOaAS/psot.git

Install it into a virtualenv::

    virtualenv --python=python3 venv; source venv/bin/activate;
    pip3 install git+https://git.computational.bio.uni-giessen.de/SOaAS/dbxref.git
    pip3 install -e .