WPRDC Pipeline

The WPRDC pipeline is a python library that allows users to quickly build pipelines. Schema and data validation are handled by the a custom implementation of a Marshmallow Schema.

Note

This project is in a pre-alpha stage, meaning that its API can and will likely change fairly dramatically in pre-release and release versions.

Example:

from marshmallow import fields
import pipeline as pl

class MySchema(BaseSchema):
    some_field = fields.Integer()
    some_date = fields.DateTime(format='%Y-%m-%d')

my_pipeline = pl.Pipeline('my_pipeline', 'An Example Pipeline') \
    .connect(pl.FileConnector, 'path/to/my.csv')
    .extract(pl.CSVExtractor, firstline_headers=True) \
    .schema(MySchema) \
    .load(pl.Loader)

This pipeline connects to a file located a ‘path/to/my.csv’, extracts data from it, validates it according to the rules of MySchema, and loads it into a LoadTarget. The pipeline can be kicked off by calling my_pipeline.run(), or scheduled via command-line.

As the job runs, its status is automatically recorded in a local sqlite database.

To schedule a job via command-line, a built-in run_job is included. Let’s say that my_pipeline was stored in a file called jobs.py. It could be kicked off from the command line using the following:

run_job jobs:my_pipeline

This command can be scheduled via cron to run at specific intervals.