.. WPRDC ETL Pipeline documentation master file, created by
   sphinx-quickstart on Wed Jan 13 22:53:01 2016.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

WPRDC Pipeline
==============

The WPRDC pipeline is a Python library for quickly building ETL pipelines. Schema and data validation are handled by a custom implementation of a Marshmallow :py:class:`~marshmallow.Schema`.

.. note::
    This project is in a **pre-alpha** stage: its API can, and likely will, change substantially between pre-release and release versions.

Example:

.. code-block:: python

    from marshmallow import fields
    import pipeline as pl

    class MySchema(pl.BaseSchema):
        some_field = fields.Integer()
        some_date = fields.DateTime(format='%Y-%m-%d')

    my_pipeline = pl.Pipeline('my_pipeline', 'An Example Pipeline') \
        .connect(pl.FileConnector, 'path/to/my.csv') \
        .extract(pl.CSVExtractor, firstline_headers=True) \
        .schema(MySchema) \
        .load(pl.Loader)

This pipeline connects to the file at ``path/to/my.csv``, extracts its data, validates each record against ``MySchema``, and loads the result into a ``LoadTarget``. Start the pipeline by calling ``my_pipeline.run()``, or schedule it from the command line.
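
To make the four stages concrete, the following is a rough standard-library sketch of the same connect, extract, validate, and load sequence. It is not how the library is implemented: ``SAMPLE_CSV``, ``validate_row``, and ``run_sketch`` are hypothetical names, an in-memory string stands in for ``path/to/my.csv``, and a plain list stands in for the load target.

.. code-block:: python

    import csv
    import io
    from datetime import datetime

    # Hypothetical stand-in for 'path/to/my.csv', kept in memory so the
    # sketch runs without any files on disk.
    SAMPLE_CSV = "some_field,some_date\n1,2016-01-13\n2,2016-01-14\n"


    def validate_row(row):
        """Convert one raw CSV row to typed values, mirroring MySchema's fields."""
        return {
            "some_field": int(row["some_field"]),
            "some_date": datetime.strptime(row["some_date"], "%Y-%m-%d"),
        }


    def run_sketch(source):
        """Extract rows from an open file-like source, validate, and load them."""
        reader = csv.DictReader(source)  # extract: the first line supplies headers
        loaded = []                      # assumed load target: a plain list
        for row in reader:
            loaded.append(validate_row(row))  # validate each row, then load it
        return loaded


    # connect: open the (in-memory) source and hand it to the remaining stages
    records = run_sketch(io.StringIO(SAMPLE_CSV))

A row whose values cannot be converted raises ``ValueError`` here; the library's Marshmallow-based schema presumably reports validation problems in its own way.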

As the job runs, its status is automatically recorded in a local SQLite database.
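
As an illustration of what recording job status in SQLite can look like, here is a minimal sketch built on Python's standard ``sqlite3`` module. The table name ``job_status``, its columns, and the helper functions are assumptions made for this example, not the library's actual schema or API.

.. code-block:: python

    import sqlite3
    from datetime import datetime, timezone


    def open_status_db(path=":memory:"):
        """Open the database and create the (hypothetical) status table."""
        conn = sqlite3.connect(path)
        conn.execute(
            "CREATE TABLE IF NOT EXISTS job_status ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "name TEXT NOT NULL, "
            "status TEXT NOT NULL, "
            "updated_at TEXT NOT NULL)"
        )
        return conn


    def record_status(conn, name, status):
        """Append one status row for the named job, stamped with UTC time."""
        conn.execute(
            "INSERT INTO job_status (name, status, updated_at) VALUES (?, ?, ?)",
            (name, status, datetime.now(timezone.utc).isoformat()),
        )
        conn.commit()


    def latest_status(conn, name):
        """Return the most recently recorded status for a job, or None."""
        row = conn.execute(
            "SELECT status FROM job_status WHERE name = ? ORDER BY id DESC LIMIT 1",
            (name,),
        ).fetchone()
        return row[0] if row else None


    conn = open_status_db()
    record_status(conn, "my_pipeline", "running")
    record_status(conn, "my_pipeline", "success")

Appending a row per status change, rather than updating a single row, keeps a history of each run that can be queried later.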

To schedule a job from the command line, use the built-in ``run_job`` command. If ``my_pipeline`` is defined in a file called ``jobs.py``, it can be started with:

.. code-block:: bash

    run_job jobs:my_pipeline

This command can be scheduled via cron to run at specific intervals.
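
The ``jobs:my_pipeline`` argument reads as a ``module:attribute`` pair. How ``run_job`` resolves it internally is not documented here; the sketch below shows one common way such a specifier can be resolved with ``importlib``. The function name ``resolve_job`` is hypothetical, and the demonstration uses the standard ``json`` module instead of a real ``jobs.py``.

.. code-block:: python

    import importlib


    def resolve_job(spec):
        """Resolve a 'module:attribute' string to the object it names."""
        module_name, sep, attr_name = spec.partition(":")
        if not sep or not module_name or not attr_name:
            raise ValueError("expected 'module:attribute', got %r" % (spec,))
        module = importlib.import_module(module_name)  # e.g. imports jobs.py
        return getattr(module, attr_name)              # e.g. jobs.my_pipeline


    # Demonstration with a standard-library module in place of jobs.py.
    dumps = resolve_job("json:dumps")

Under this assumption, ``resolve_job("jobs:my_pipeline").run()`` would start the pipeline defined in ``jobs.py``, provided that file is importable from the working directory.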

Guide:
------

.. toctree::
   :maxdepth: 1

   getting_started
   writing_pipelines
   monitoring
   api
   changelog