WPRDC Pipeline¶
The WPRDC pipeline is a python library that allows users to quickly build pipelines. Schema and data validation are handled by the a custom implementation of a Marshmallow Schema
.
Note
This project is in a pre-alpha stage, meaning that its API can and will likely change fairly dramatically in pre-release and release versions.
Example:
from marshmallow import fields
import pipeline as pl
class MySchema(BaseSchema):
some_field = fields.Integer()
some_date = fields.DateTime(format='%Y-%m-%d')
my_pipeline = pl.Pipeline('my_pipeline', 'An Example Pipeline') \
.connect(pl.FileConnector, 'path/to/my.csv')
.extract(pl.CSVExtractor, firstline_headers=True) \
.schema(MySchema) \
.load(pl.Loader)
This pipeline connects to a file located a ‘path/to/my.csv’, extracts data from it, validates it according to the rules of MySchema
, and loads it into a LoadTarget
. The pipeline can be kicked off by calling my_pipeline.run()
, or scheduled via command-line.
As the job runs, its status is automatically recorded in a local sqlite database.
To schedule a job via command-line, a built-in run_job
is included. Let’s say that my_pipeline
was stored in a file called jobs.py
. It could be kicked off from the command line using the following:
run_job jobs:my_pipeline
This command can be scheduled via cron to run at specific intervals.