Politics, Power, and Science: Data Processing Pattern.

Sunday, April 29, 2012

I've done a lot of data processing over the years and have come to the following understanding of how data processing works at a general level. I plan to use this concept for several batch and message processing projects I want to build.

Data level

Data can come from many sources. The program has to open a file, a database connection, a serial port, a network port, or other device and begin reading in a stream of data. At this level the data is an almost meaningless stream of single bytes.
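
As a rough illustration of this level, here is a minimal Python sketch that reads raw bytes in small chunks. The read_chunks helper and the chunk size are my own assumptions for the example, and io.BytesIO stands in for whatever file, socket, or serial port the real program would open.

```python
import io

def read_chunks(stream, chunk_size=4):
    """Yield raw byte chunks from any binary stream until it is exhausted."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk

# io.BytesIO is a stand-in for a file, network port, or serial port.
source = io.BytesIO(b"Smith,John\nDoe,Jane\n")
chunks = list(read_chunks(source))
raw = b"".join(chunks)
```

At this point the chunk boundaries carry no meaning at all; a record can easily be split across two reads, which is why the next level is needed.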

Format level

These bytes are organized in a specific pattern known as a format. There are many different formats that the data can be organized around.

Fixed length. The fields appear in a strict order, each with a fixed length, so every record is as long as the sum of its fields. Often a special byte marks the end of the record, usually a newline or a carriage return, but with this format the record separator is optional. This is how the IP and TCP headers are laid out in a data packet at layers 3 and 4. Each byte, and even each bit, can have a specific positional meaning. A good sign that you are looking at this format: you set the data to be 80 columns wide in a text editor and suddenly the last names all line up down the page at column 20, each name followed by spaces until the next field starts, lined up at column 32.
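
A fixed-length record can be parsed by slicing at known offsets. The sketch below is only an example: the field names and the LAYOUT table are hypothetical, chosen so that the last name starts at column 20 and the next field at column 32, as in the text editor description.

```python
# Hypothetical layout: (field name, start, end), 0-based offsets, end exclusive.
LAYOUT = [
    ("account", 0, 19),
    ("last_name", 19, 31),   # column 20 in a 1-based text editor
    ("first_name", 31, 43),  # column 32 in a 1-based text editor
]

def parse_fixed(line):
    """Slice one fixed-length record into a dict, trimming the space padding."""
    return {name: line[start:end].rstrip() for name, start, end in LAYOUT}

# Build a 43-character sample record padded with spaces.
line = "A-1001".ljust(19) + "Smith".ljust(12) + "John".ljust(12)
record = parse_fixed(line)
```

Note that no delimiter is consulted anywhere; the position alone decides which field a character belongs to.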

Delimited. Each field is followed by a delimiter, typically a comma or a tab, and an end-of-record marker separates the records, usually a newline or a carriage return as in the fixed-length format above. The fields usually still have a maximum length or range of values, but this is not visible from the format itself. You can spot this format by the commas or tabs in the data; normally every record has the same number of them.
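
For delimited data, Python's standard csv module does the splitting. This is a minimal sketch with made-up sample data; the check that every record has the same field count as the header is my own addition, mirroring the fixed count of delimiters described above.

```python
import csv
import io

data = "id,last_name,first_name\n1,Smith,John\n2,Doe,Jane\n"

reader = csv.reader(io.StringIO(data))
header = next(reader)
rows = []
for row in reader:
    # Every record should carry the same number of fields as the header.
    if len(row) != len(header):
        raise ValueError(f"expected {len(header)} fields, got {len(row)}")
    rows.append(dict(zip(header, row)))
```

For tab-delimited data the same code should work with csv.reader(..., delimiter="\t").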

Mixed. A message, or record, can be a combination of the above. The fields can mostly be delimited with commas or tabs, but a few fields have contents with a fixed length. HL7 is an example of a mixed format.
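
To make the idea concrete, here is a small sketch of a mixed record. It is not real HL7; the three-field layout and the parse_mixed name are invented for illustration. The fields are split on a pipe delimiter, and the last field is a 14-character timestamp that is decoded purely by position.

```python
from datetime import datetime

def parse_mixed(line):
    """Split on the delimiter first, then decode the fixed-length field by position."""
    patient_id, name, stamp = line.split("|")
    # The timestamp has no delimiters of its own: characters 0-3 are the year,
    # 4-5 the month, 6-7 the day, then hour, minute, and second.
    when = datetime(
        int(stamp[0:4]), int(stamp[4:6]), int(stamp[6:8]),
        int(stamp[8:10]), int(stamp[10:12]), int(stamp[12:14]),
    )
    return {"patient_id": patient_id, "name": name, "when": when}

record = parse_mixed("12345|Smith|20120429103000")
```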

Grammar. This used to be much more difficult than it is now. Today it usually means XML. In the past people would create many different formats for data that was contextual in nature. If you are trying to parse text that comes from a command line, a language like English, or a program file written in C or Java, then your parser has to understand positional text whose meaning is determined by the initial state and the order of the commands.
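
A toy example of contextual parsing, assuming a made-up two-command language ("unit" and "length") that exists only for this sketch. The same "length" line means something different depending on the initial state and on which "unit" commands came before it.

```python
def interpret(lines, initial_unit="cm"):
    """Interpret commands in order; 'length' is read in whatever unit is current."""
    unit = initial_unit
    lengths = []
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        if tokens[0] == "unit" and len(tokens) == 2:
            unit = tokens[1]  # changes the meaning of later commands
        elif tokens[0] == "length" and len(tokens) == 2:
            lengths.append((float(tokens[1]), unit))
        else:
            raise ValueError(f"cannot parse: {line!r}")
    return lengths

result = interpret(["length 5", "unit in", "length 2"])
```

A real grammar for C, Java, or English is far larger, but the principle is the same: the parser carries state forward and each piece of text is read against it.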

Conceptual Layer

At this point you have read in the stream of bytes, given groups of those bytes meaning and stored the data into a record or other data object in your program. A reference to this data can be passed around to represent that stored set of meaning.
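
In Python, one way to sketch such a data object is a dataclass. The Person class and its fields are hypothetical; the point is only that the parsed meaning now lives in one object and that passing it around passes a reference, not a copy.

```python
from dataclasses import dataclass

@dataclass
class Person:
    last_name: str
    first_name: str
    role_code: int

def describe(person):
    """Any later layer can work with the record through its reference."""
    return f"{person.first_name} {person.last_name} (role {person.role_code})"

person = Person(last_name="Smith", first_name="John", role_code=3)
alias = person  # a second reference to the same object, not a copy
text = describe(alias)
```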

Translation and Routing

Often the data you received has fields in the wrong order, or you have a set of numbers from 1-5 that actually represents a user name. This layer takes the incoming data and creates a new record in the new format, transferring and transforming the data from one data object to the other. Or an XML file you parsed has 100 records that need to be pulled out of the object and sent to the next layer as 100 individual records. So this layer would have a loop that lets you take one inbound data object and create as many outbound objects as you need. A single message might be split into multiple outbound messages.
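
Both jobs of this layer can be sketched in a few lines. Everything named here is an assumption for the example: the USER_NAMES table, the field names, and the shape of the inbound batch.

```python
# Hypothetical lookup table: codes 1-5 stand for user names.
USER_NAMES = {1: "alice", 2: "bob", 3: "carol", 4: "dave", 5: "erin"}

def translate(inbound):
    """Build a new outbound record: reorder fields and replace the code with a name."""
    return {"user": USER_NAMES[inbound["user_code"]], "amount": inbound["amount"]}

def split_and_translate(batch):
    """Turn one inbound message holding many records into one outbound record each."""
    for inbound in batch["records"]:
        yield translate(inbound)

batch = {"records": [{"amount": 10, "user_code": 2}, {"amount": 7, "user_code": 5}]}
outbound = list(split_and_translate(batch))
```

Using a generator means the next layer can start on the first outbound record before the whole inbound message has been processed.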

Data Store

The data coming out of the translation layer will need to be mapped to a set of outbound data objects. One stream might go in one direction, while another set of messages goes to another table. This operation has to be tied to a database transaction, so that either all the data is applied to the database, or none of it is. Alternatively, you can have an exception log that others have to check and correct later.
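
The all-or-nothing behavior can be sketched with Python's built-in sqlite3 module. The payments table and the store_all function are assumptions for this example; the connection used as a context manager commits on success and rolls back if an exception escapes the block.

```python
import sqlite3

def store_all(conn, records):
    """Insert every record in one transaction; any failure rolls back all of them."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.executemany(
                "INSERT INTO payments (user, amount) VALUES (?, ?)", records
            )
        return True
    except sqlite3.IntegrityError:
        return False

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (user TEXT NOT NULL, amount INTEGER NOT NULL)")
conn.commit()

ok = store_all(conn, [("bob", 10), ("erin", 7)])
bad = store_all(conn, [("carol", 3), (None, 4)])  # second row violates NOT NULL
count = conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
```

The second batch fails on its second row, so the first row of that batch is rolled back as well and the table still holds only the two records from the first batch.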

Data View

In order to see this data you can map a view onto one or more data objects and see records in the data view. The data view can represent the underlying data objects in many ways. It also only has to retrieve what it needs to fill the current set of records in the view, so a data view for a million-element database might only have to load the first 10 elements. This data view could even be aliased across to another computer and would still only have to cache a little data to represent many records.
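
A minimal sketch of such a view, assuming a hypothetical DataView class that is handed a fetch(offset, limit) function. Here the fetch function fakes a million-record source with range(), which never materializes the full data set; in a real program it would be a database query.

```python
class DataView:
    """Expose one page of a large data set, fetching only the rows it shows."""

    def __init__(self, fetch, page_size=10):
        self.fetch = fetch          # callable(offset, limit) -> list of records
        self.page_size = page_size

    def page(self, number):
        """Return the records for the given zero-based page number."""
        return self.fetch(number * self.page_size, self.page_size)

loaded = []  # tracks how many records were actually pulled from the source

def fetch(offset, limit):
    # Stand-in for a database query; slicing a range stays lazy.
    rows = [f"record {i}" for i in range(1_000_000)[offset:offset + limit]]
    loaded.extend(rows)
    return rows

view = DataView(fetch)
first = view.page(0)
```

Only ten records are ever loaded to show the first page, no matter how large the underlying source is.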
