Tuesday, April 24, 2012

Batch File System - ideas.

Typical data processing follows the same steps: find a data source, parse the data, convert it to another data schema, then write it to a data sink.

The source of a file can be FTP, the local file system, TCP, an RSS feed, SQL, or any other method.

The input data needs to be parsed into a data structure, at least separating out the data records from each other.

The input schema can be word-based, line-based, fixed-field, or delimited, or it may need a dedicated parser to read it.  It can also be mixed, for example a delimited record in which one field needs to be further parsed as a set of fixed fields.
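A minimal sketch of the mixed case in Python. The pipe delimiter, the field position, and the 4/2/2 widths are assumptions picked for illustration, not part of any real feed.

```python
def parse_fixed(text, widths):
    """Split text into consecutive fixed-width fields, stripping padding."""
    fields, pos = [], 0
    for width in widths:
        fields.append(text[pos:pos + width].strip())
        pos += width
    return fields


def parse_record(line, delimiter="|", fixed_index=1, widths=(4, 2, 2)):
    """Split a delimited line, then expand one field as fixed-width subfields.

    The delimiter, fixed_index, and widths defaults are hypothetical.
    """
    fields = line.rstrip("\n").split(delimiter)
    fields[fixed_index] = parse_fixed(fields[fixed_index], widths)
    return fields


# A delimited record whose second field is a fixed-field date (YYYYMMDD).
record = parse_record("ACME|20120424|42.50")
```

Here the record comes back as a list whose second element is itself a list of the year, month, and day strings.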

Then there needs to be an output schema for the data format.  If the input and output formats are not the same, there needs to be a schema mapping from one format to the other.
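One way such a schema mapping could look, sketched in Python. The field names and converters are made up for illustration; a real mapping would come from configuration.

```python
# Hypothetical mapping: output field name -> (input field name, converter).
MAPPING = {
    "customer": ("name", str.strip),
    "amount_cents": ("amount", lambda s: round(float(s) * 100)),
}


def map_record(record, mapping):
    """Build an output record by converting each mapped input field."""
    return {
        out_name: convert(record[in_name])
        for out_name, (in_name, convert) in mapping.items()
    }


output = map_record({"name": " ACME ", "amount": "42.50"}, MAPPING)
```

Keeping the mapping as data rather than code means a new feed only needs a new table, not a new program.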

The data sink can be any of the same places as the data source: FTP, file, TCP, RSS, or any other supported method.
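Since sources and sinks share the same set of transports, one possible design is a registry keyed by URL scheme. The sketch below is Python and only implements a `file` reader; the `READERS` table and the `reader` and `fetch` names are hypothetical, and FTP, TCP, or RSS handlers would be registered the same way.

```python
from urllib.parse import urlsplit
from urllib.request import url2pathname

# Hypothetical registry: URL scheme -> function that returns the raw bytes.
READERS = {}


def reader(scheme):
    """Decorator that registers a reader function for one URL scheme."""
    def register(func):
        READERS[scheme] = func
        return func
    return register


@reader("file")
def read_file(url):
    """Read a local file named by a file:// URL."""
    path = url2pathname(urlsplit(url).path)
    with open(path, "rb") as handle:
        return handle.read()


def fetch(url):
    """Look up the reader for the URL's scheme and call it."""
    scheme = urlsplit(url).scheme
    read = READERS.get(scheme)
    if read is None:
        raise ValueError(f"no reader registered for scheme {scheme!r}")
    return read(url)
```

A matching table of writers would cover the sink side.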

Over this layer there needs to be a control and reporting layer that executes the process: open the data source, read the data file, parse the data file into the input schema, convert it to the output data schema, then write the data file out to its destination.

If the data files are going to be picked up from or delivered to remote systems, then they need to be scheduled for pickup and delivery.  This scheduling can also be used as is, to just pick up and deliver data between systems without any processing.

The control layer will execute the process and report on its status as it progresses.
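A rough sketch of such a control layer in Python, assuming each step is a function that takes the previous step's result. The `run_job` name, the step names, and the message format are illustrative only.

```python
def run_job(name, steps, report=print):
    """Run named steps in order, feeding each result to the next step.

    Reports status before and after every step and stops at the first
    step that raises. Returns (ok, result).
    """
    data = None
    for step_name, step in steps:
        report(f"{name}: {step_name} started")
        try:
            data = step(data)
        except Exception as exc:
            report(f"{name}: {step_name} failed: {exc}")
            return False, None
        report(f"{name}: {step_name} done")
    return True, data


# Stand-in steps; real ones would open a source and write to a sink.
steps = [
    ("read", lambda _: "a,1\nb,2\n"),
    ("parse", lambda text: [line.split(",") for line in text.splitlines()]),
    ("convert", lambda rows: [{"key": k, "value": int(v)} for k, v in rows]),
    ("write", lambda records: len(records)),
]
```

Passing a different `report` function, for example one that appends to a log file, changes where the status goes without touching the steps.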

As part of the system, the log file will be parsed and files generated for daily, weekly, and monthly reporting, and exception reporting will happen as directed.
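As an illustration of the log parsing, here is a small Python sketch that counts entries per day and level. It assumes log lines of the form "YYYY-MM-DD HH:MM:SS LEVEL message", which is a format invented for this example.

```python
from collections import Counter, defaultdict


def summarize_by_day(lines):
    """Count log entries per day and level.

    Assumes each line starts with a date, a time, and a level, separated
    by whitespace. Lines with fewer than three parts are skipped.
    """
    summary = defaultdict(Counter)
    for line in lines:
        parts = line.split(maxsplit=3)
        if len(parts) < 3:
            continue
        day, _time, level = parts[:3]
        summary[day][level] += 1
    return {day: dict(counts) for day, counts in summary.items()}


sample = [
    "2012-04-24 02:30:01 INFO picked up orders.csv",
    "2012-04-24 02:30:05 ERROR transfer failed",
    "2012-04-25 02:30:01 INFO picked up orders.csv",
]
daily = summarize_by_day(sample)
```

Weekly and monthly reports could be built the same way by grouping on a week number or on the first seven characters of the date.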


Batch processing of files in a production environment is a large part of the work that many information system specialists are required to do.  Every place I have worked has written its own custom batch processing system, each with different capabilities, error handling, and reporting.

There has to be a scheduling system to run everything.  This will be a flexible system that executes processes on given schedules, with exceptions to those times as well.
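A schedule with exceptions could be modeled roughly like this in Python. The `Schedule` class and its fields are hypothetical, and it only handles "these weekdays at this time, except on these dates".

```python
from datetime import date, datetime, time


class Schedule:
    """Run on given weekdays at a given time, except on listed dates."""

    def __init__(self, weekdays, at, skip_dates=()):
        self.weekdays = set(weekdays)      # 0 = Monday ... 6 = Sunday
        self.at = at                       # datetime.time, minute resolution
        self.skip_dates = set(skip_dates)  # datetime.date exceptions

    def is_due(self, now):
        """Return True if a job on this schedule should start at `now`."""
        return (
            now.weekday() in self.weekdays
            and now.date() not in self.skip_dates
            and (now.hour, now.minute) == (self.at.hour, self.at.minute)
        )


# Weekdays at 02:30, skipping one holiday.
nightly = Schedule(range(0, 5), time(2, 30), skip_dates=[date(2012, 12, 25)])
due = nightly.is_due(datetime(2012, 4, 24, 2, 30))
```

A main loop would wake once a minute and call `is_due` for every registered schedule.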

There will be a logging system that will log each action of the system.  Exceptions will be processed immediately and actions taken for each entry.  Reporting is also scheduled by the system itself using its own scheduling system.  This can be daily or weekly or monthly or quarterly, whatever you want.  The cleanup of the logs is also handled by the scheduling system.

This reporting system can be used by any other processes and systems, through a variety of methods.


I am thinking of combining scheduling, curl, XML and XSLT tools, and SQLite into one program to allow me to pick up files at set times, transform the data, save the data to a database or file, and transfer the file to another system, while archiving and logging everything it did.  A separate companion program could read the logs and perform actions (paging, emailing) based on events and daily log reports.
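To make the logging half of that idea concrete, here is a sketch using Python's built-in sqlite3 module. The table layout and the function names are assumptions; the companion program would poll something like `pending_alerts` and then page or email.

```python
import sqlite3


def open_log(path=":memory:"):
    """Open (or create) the log database. The schema here is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS log ("
        "id INTEGER PRIMARY KEY, "
        "ts TEXT DEFAULT CURRENT_TIMESTAMP, "
        "level TEXT NOT NULL, "
        "message TEXT NOT NULL)"
    )
    return conn


def log_event(conn, level, message):
    """Append one entry; the with-block commits on success."""
    with conn:
        conn.execute(
            "INSERT INTO log (level, message) VALUES (?, ?)", (level, message)
        )


def pending_alerts(conn, levels=("ERROR",)):
    """Return (level, message) rows that a companion program might act on."""
    placeholders = ",".join("?" for _ in levels)
    rows = conn.execute(
        f"SELECT level, message FROM log WHERE level IN ({placeholders}) "
        "ORDER BY id",
        tuple(levels),
    )
    return rows.fetchall()


conn = open_log()
log_event(conn, "INFO", "picked up orders.csv")
log_event(conn, "ERROR", "transfer failed")
alerts = pending_alerts(conn)
```

Using a file path instead of ":memory:" would let the main program and the companion program share the same log.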

Once I have examples of this working, I will look at making it work with my generic framework object as its first program.
