Monitoring and Reporting

As pipelines run, they generate and store status information. The Status object specifies exactly what information is stored, but a few key fields deserve attention because other parts of the pipeline depend on them.

Automatic monitoring requires a SQLite database, as described in the Getting Started section. For testing and other purposes, however, you can either disable logging entirely (by setting the log_status flag to False) or send logging information to a different SQLite database than the one specified in the configuration (by setting conn to another sqlite3.Connection object); see the sketch following the note below.

Note:
If you pass your own SQLite connection, you are responsible for closing it after the pipeline runs and for ensuring that the status table is configured with the proper schema. Using the connection created via the configuration ensures both.
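For instance, a test might run the pipeline against a throwaway in-memory database. The following is a minimal sketch: run_pipeline stands in for however your pipeline is actually invoked, and the column names are guesses based on the fields documented in this section; the real schema is the one created via configuration.

```python
import sqlite3

# Hypothetical entry point standing in for your pipeline's real invocation,
# assumed to accept the log_status and conn options described above.
from my_project import run_pipeline

conn = sqlite3.connect(":memory:")  # throwaway database for a test run

# Assumed schema mirroring the fields documented in this section; the real
# status table created via configuration may differ.
conn.execute(
    """CREATE TABLE IF NOT EXISTS status (
        name TEXT,
        status TEXT,
        num_lines INTEGER,
        input_checksum TEXT
    )"""
)

try:
    run_pipeline(conn=conn)           # log status rows to our connection
    # run_pipeline(log_status=False)  # or disable status logging entirely
finally:
    conn.close()  # you own this connection, so you must close it
```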

status

status is a string that takes one of three values: new for a pipeline that has just started running, success for a pipeline that completed successfully, and error for a pipeline that failed. When a run ends in error, the text of the Exception that halted the pipeline is attached to the string.
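As a quick illustration, a monitoring script might read the most recent status row and flag failures. The table and column names below are assumptions based on the fields described in this section, as is the exact format of the error string.

```python
import sqlite3

conn = sqlite3.connect("status.db")  # path assumed from your configuration
row = conn.execute(
    "SELECT name, status FROM status ORDER BY rowid DESC LIMIT 1"
).fetchone()
conn.close()

if row is not None:
    name, status = row
    # The exception text is attached to the "error" string, so a prefix
    # check (an assumption about the exact format) catches failed runs.
    if status.startswith("error"):
        print(f"{name} failed: {status}")
```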

num_lines

This field holds the length of the data object that is ultimately loaded into the final Loader destination. It is useful for tracking dataset growth over time, and for forecasting whether different ETL pre-processing will be needed to keep up with that growth.
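For example, the num_lines values from a few successful runs are enough to estimate growth. As above, the table layout and pipeline name are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect("status.db")  # path assumed from your configuration
rows = conn.execute(
    "SELECT num_lines FROM status "
    "WHERE name = ? AND status = 'success' ORDER BY rowid",
    ("my_pipeline",),  # hypothetical pipeline name
).fetchall()
conn.close()

counts = [r[0] for r in rows]
if len(counts) >= 2 and counts[0]:
    growth = (counts[-1] - counts[0]) / counts[0]
    print(f"dataset grew {growth:.0%} over {len(counts)} successful runs")
```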

input_checksum

One of the goals of the Pipeline is to avoid processing the same input data twice. To that end, a checksum of the input file's contents is computed when the pipeline loads it. The checksum is an md5 hash of the file's contents, read in 8192-byte chunks (see checksum_contents(), sketched below).
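Based on that description, checksum_contents() looks roughly like the following; the actual implementation may differ in detail.

```python
import hashlib

def checksum_contents(path: str) -> str:
    """md5 the file's contents, read in 8192-byte chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            md5.update(chunk)
    return md5.hexdigest()
```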

When the pipeline runs again, it checks the status table for a previous run of the same name with an identical checksum. If one exists, it raises a custom DuplicateFileException and halts.
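The duplicate check amounts to something like the query below. DuplicateFileException is redefined here only to keep the sketch self-contained, and the table and column names remain assumptions.

```python
import sqlite3

class DuplicateFileException(Exception):
    """Stand-in for the library's exception of the same name."""

def raise_if_duplicate(conn: sqlite3.Connection, name: str, checksum: str) -> None:
    # Look for any prior run of this pipeline with an identical checksum.
    row = conn.execute(
        "SELECT 1 FROM status WHERE name = ? AND input_checksum = ? LIMIT 1",
        (name, checksum),
    ).fetchone()
    if row is not None:
        raise DuplicateFileException(
            f"{name} already processed input with checksum {checksum}"
        )
```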

Note:
DuplicateFileException is thrown before a new Status object is created to represent the run, so duplicate runs leave no record in the status table. If you see long gaps where you expect new pipeline runs, make sure your source data is actually being updated.