ARMR Data Base Structure

A data base is defined to be the basic modular component of your data. An ARMR data base is analogous to a single manual file. In practice an application typically requires several files; e.g. in an elementary personnel tracking system you might maintain a personnel file, a work record file, and a project file. A corresponding automated system would contain the following three ARMR data bases: a work record base, a project base, and an employee base. ARMR does not attempt to tie the individual data bases into a monolithic structure; instead, to satisfy the requirements of a given application, you will combine any number of individual bases, via the "Data Algebra" provided by the ARMR modules.

To us, "Data Algebra" implies a formal set of fundamental operations which can be applied to data bases. ARMR's Data Algebra contains, among other operations, data base transposition and data base substitutions. More about all this later. Before we introduce the various data base operations, we must return to the data base structure.

Again, an ARMR data base is simply a computerized representation of a single file. Beyond this, we do not maintain any other structures; i.e. file linkages, index arrays, key fields, and the like are not maintained on a permanent basis. Instead, structures like these are created "on the fly", in RAM only, when you process the data via an ARMR module. This approach gives you the flexibility to specify any conceivable processing view at run time. The overhead which would be incurred by storing such structures on a permanent basis is therefore eliminated.

A data base can have many data fields. Currently, ARMR allows up to 120 fields and up to 65,535 records per data base segment. This does not imply a limit of 65,535 records for the data base as a whole: in practice, one will typically divide a large file into smaller files of more manageable size.

Each field of a data base carries a specific data type. ARMR recognizes the following data types:


             Type Code      Corresponding Data Type       Samples
           --------------   ------------------------   -------------
                B           signed integer               -248, 500
           --------------   ------------------------   -------------
                R           real number                 3.279, 100.
           --------------   ------------------------   -------------
                D           integer calendar date        86/09/23
           --------------   ------------------------   -------------
                Z           zero-filled integer           000249
           --------------   ------------------------   -------------
                T           tabular data               Alpha string
           --------------   ------------------------   -------------
                E           external data              Alpha string
           --------------   ------------------------   -------------

A data base usually contains two classes of data: numeric data and alpha strings. Data types B, R, D, and Z define numeric data fields; these data are always stored in binary. The T and E types are used for alphanumeric strings. In an E type field, alpha strings are stored in traditional character form, one byte per character. The T type field, however, is the preferred field type for alpha strings. The cost benefits and utility of the tabular data field type are described later on.

The following is a sample employee file which you might store in an ARMR data base.


           ssn      last           first         rate   month dep  tally
        ----------- -------------- ------------ -----  ------ ---  -----
        327-12-3492 WONDERLAND     ALICE         30.0  01_JAN  y     1
        327-12-3492 WONDERLAND     ALICE         30.0  02_FEB  y     1
        327-12-3492 WONDERLAND     ALICE         30.0  03_MAR  y     1
        327-12-3492 WONDERLAND     ALICE         30.0  04_APR  y     1
        327-12-3492 WONDERLAND     ALICE         40.0  05_MAY  y     1
        390-32-2331 BLOW           JOE           20.5  01_JAN  x     1
        390-32-2331 BLOW           JOE           20.5  02_FEB  x     1
        390-32-2331 BLOW           JOE           20.5  03_MAR  y     1
        390-32-2331 BLOW           JOE           25.0  04_APR  y     1
        390-32-2331 BLOW           JOE           25.0  05_MAY  y     1
        729-49-9876 HALL           ANNIE         32.5  01_JAN  x     1
        729-49-9876 HALL           ANNIE         38.0  02_FEB  x     1
        729-49-9876 HALL           ANNIE         38.0  03_MAR  x     1
        729-49-9876 HALL           ANNIE         38.0  04_APR  x     1
        729-49-9876 HALL           ANNIE         38.0  05_MAY  x     1
        773-21-9321 DOE            JOHN          17.0  02_FEB  z     1
        773-21-9321 DOE            JOHN          18.0  03_MAR  z     1
        773-21-9321 DOE            JOHN          18.0  04_APR  z     1

                         Fig. 2. A Sample Employee Data Base

This file tracks employee social security number, last name, first name, hourly labor rate, and department code as a function of calendar month. "ssn", "last", "first", "rate", "month", "dep" and "tally" are the field names of our sample data base. The tally field is made to contain the number 1 at all times; its utility is explained later.

The employees tracked by our data base may migrate to any of the departments x, y, z; moreover, their rates may change in time. Rates and departments are tracked at monthly resolution. Therefore we keep one data base record per employee for each month in which the employee was with the company.

The n-tuplet structure

To us, an n-tuplet is a collection of attribute values. The attributes are nothing more than the data fields of a data base. In the interest of brevity, permit us to refer to the attribute collection simply as a vector; then, our sample data base has the following attribute vector:

              (ssn,last,first,rate,month,dep,tally)

In the abstract sense, a data base is a collection of data points in an n-dimensional space; each data base record represents the coordinates of a single data point! For example, for the sake of simplicity, let us project our data base to the (ssn,month) plane; then the 18 records of our data base are represented by the 18 points marked "*" in Figure 3.


          773-21-9321 |         *       *       *
          729-49-9876 | *       *       *       *       *
          390-32-2331 | *       *       *       *       *
          327-12-3492 | *       *       *       *       *
                      +-------------------------------------
                      01_JAN  02_FEB  03_MAR  04_APR  05_MAY

                 Fig. 3. Spatial Representation Of Employees Data

If we added another axis, say "dep" to this, then our 18 points would appear in three dimensional space. The "dep" axis would contain distinct positions representing department code values x, y, and z. In general we treat each field as a distinct axis. The unique field values are distinct points on the respective field's axis.
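
The projection onto the (ssn,month) plane can be sketched in a few lines of Python. This is only an illustration of the idea, not ARMR code; the variable names are ours, and the point set is typed in by hand from figure 2.

```python
# (ssn, month) coordinates of the 18 records in figure 2.
months = ["01_JAN", "02_FEB", "03_MAR", "04_APR", "05_MAY"]
full_year = ("327-12-3492", "390-32-2331", "729-49-9876")
points = {(ssn, m) for ssn in full_year for m in months}
points |= {("773-21-9321", m) for m in months[1:4]}  # FEB..APR only

# Each unique field value is one position on that field's axis.
ssn_axis = sorted({ssn for ssn, _ in points})

# Print one row per ssn, highest value first, marking occupied points.
for ssn in reversed(ssn_axis):
    marks = "  ".join("*" if (ssn, m) in points else " " for m in months)
    print(ssn, "|", marks)

print(len(points), "points on", len(ssn_axis), "x", len(months), "axes")
```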

The notion of a spatial representation is by no means new to computer science. If you have ever programmed in a higher level language, you are familiar with the multi dimensional array structure. Such structures are directly related to the spatial representation of data. For example, we could store our employee rate values in a five dimensional array as follows:

                      array: rate(4,4,4,5,3)

The array's five subscripts represent ssn, last, first, month, and dep, respectively; rate is the value stored in each slot. Note that in our example, as shown in figure 2, we have 4 unique values each for ssn, last, and first; 5 unique values for month; and 3 unique values for dep. The rate array as defined above can thus accommodate the records shown in figure 2. Unfortunately, the array's entries would be primarily empty: only 18 of the 4*4*4*5*3 = 960 array slots would contain nonzero data, so for our example 98% of the allocated memory space goes to waste.
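
The slot count and the wasted fraction can be checked with a short Python sketch. The axis sizes are the unique-value counts quoted for figure 2; the names used here are ours and have no counterpart in ARMR.

```python
from math import prod

# Unique-value counts per array axis, taken from figure 2.
axis_sizes = {"ssn": 4, "last": 4, "first": 4, "month": 5, "dep": 3}

total_slots = prod(axis_sizes.values())  # slots in the dense rate array
used_slots = 18                          # one slot per data base record
wasted = 1 - used_slots / total_slots    # fraction of slots left empty

print(total_slots, used_slots, round(wasted * 100))
```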

Dimensioned arrays waste considerable amounts of memory space. Moreover, they are practical in a static environment only. In an active data base, the coordinates which define the space are constantly changing; e.g. the introduction of a new ssn value means that the ssn coordinate must be modified. Because of this, we say that a multi-dimensional array structure is not practical for storing a data base. Nevertheless, we cannot ignore the spatial nature of data.

ARMR treats data records as attribute vectors whose elements, in the abstract sense, are subscripts to a multi dimensional array. In this way we store only the nonzero entries of a multi dimensional array. Moreover, we do not lose sight of the spatial nature of data.
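
As a rough analogy only (ARMR's internal storage is described later in this section), the idea of keeping just the occupied points can be modeled in Python with a dictionary keyed by the coordinate tuple. The three records are copied from figure 2; the dictionary is our own illustrative device.

```python
# A few records from figure 2, each an attribute vector
# (ssn, last, first, rate, month, dep, tally).
records = [
    ("327-12-3492", "WONDERLAND", "ALICE", 30.0, "01_JAN", "y", 1),
    ("327-12-3492", "WONDERLAND", "ALICE", 40.0, "05_MAY", "y", 1),
    ("773-21-9321", "DOE", "JOHN", 17.0, "02_FEB", "z", 1),
]

# Sparse equivalent of the dense array rate(ssn,last,first,month,dep):
# only occupied coordinates are stored; empty slots are simply absent.
rate = {
    (ssn, last, first, month, dep): r
    for ssn, last, first, r, month, dep, _tally in records
}

key = ("773-21-9321", "DOE", "JOHN", "02_FEB", "z")
print(rate[key])   # the rate stored at one occupied point
print(len(rate))   # entries stored: one per record, not one per slot
```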

Aside from the obvious savings in memory space, ARMR's n-tuplet structure provides us considerable flexibility. With ARMR you are never wired into a fixed data view. Instead, you will find that you are able to manipulate and view data from "all angles" with the greatest of ease.

Canonical Data Form

The ARMR modules adhere to certain rules whenever a data base is processed. Most important among these is the notion of a "canonical" data form. Conceptually we can represent an attribute vector of n fields as follows:

    (f1,f2,f3,...,fn)

Here f1 through fn are associated with the data base's field names. Let's extend this notation to include the individual data records. To do this we associate a subscript to the attributes. The attribute vector would then be written as follows:

    (f1[i],f2[i],f3[i],...,fn[i])

The subscript "i" denotes a record number. For a particular data base, "i" ranges from 1 to "nline"; where "nline" equals the number of data records stored in the base.

Two conditions must hold in order to realize the canonical form. These are as follows:

1) general data uniqueness

this implies that

    (f1[i],f2[i],f3[i],...,fn[i])

is not equal to

    (f1[j],f2[j],f3[j],...,fn[j])

for all possible pairs of i and j for which i does not equal j.

This simply means that we always discard exact duplicates; i.e. consistent with the spatial nature of data, two points can not occupy the identical spot in space.

2) inherent sort

this implies that

    (f1[k],f2[k],f3[k],...,fn[k])

is placed ahead of

    (f1[k+1],f2[k+1],f3[k+1],...,fn[k+1])

for k ranging from 1 to nline-1.

The ARMR modules are programmed to ensure that your data base retains, at all times, the canonical form as defined above; e.g. when filing a data base, ARMR will discard all duplicate records; moreover, the data records will be arranged in ascending ASCII sort on all data fields, before the data is written to your disk.
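
The two conditions can be sketched in Python as follows. This is a minimal illustration, not ARMR's implementation: the function name canonical is ours, and Python's left-to-right tuple comparison stands in for ARMR's sort (for ASCII strings the two orders agree; how ARMR orders numeric fields is not modeled here).

```python
def canonical(records):
    """Return the records without exact duplicates, in ascending order."""
    # set() enforces condition 1 (no two identical attribute vectors);
    # sorted() enforces condition 2, comparing field by field from the left.
    return sorted(set(records))

raw = [
    ("DOE", "02_FEB", "z"),
    ("BLOW", "01_JAN", "x"),
    ("DOE", "02_FEB", "z"),   # exact duplicate, discarded
    ("BLOW", "03_MAR", "y"),
]

for row in canonical(raw):
    print(row)
```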

The importance of the canonical form cannot be overemphasized. Because of it, you will be able to define report formats and summaries in a matter of seconds; e.g. the canonical form is applied directly to build a summary file. To build such a file you simply specify a subset of the data base fields and designate one or more fields as counters to be summarized. The left-to-right output field sequence defines the output sort. ARMR will retain all unique combinations of the selected fields and summarize the counters accordingly, as shown below:

                     input           output
                  -----------       ---------
                  f1 f2 f3 f4       f4 f2 f1+
                  -- -- -- --       -- -- ---
                   3  A Q1 Z        Y  F  30
                  15  A Q2 Z        Z  A  18
                  30  F Q3 Y

In the example above we drop "f3", summarize "f1" and rearrange the field sequence. Both input and output have ARMR's canonical form.
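
The summary operation in this example can be imitated with a small Python function. It is a sketch under our own naming (summarize, keys, counters are hypothetical, not ARMR commands), intended only to show how grouping on the selected fields and summing the counter reproduces the output columns.

```python
from collections import defaultdict

def summarize(records, fields, keys, counters):
    """Group records by `keys` (in output order) and sum each counter."""
    pos = {name: i for i, name in enumerate(fields)}
    totals = defaultdict(lambda: [0] * len(counters))
    for rec in records:
        group = tuple(rec[pos[k]] for k in keys)
        for n, c in enumerate(counters):
            totals[group][n] += rec[pos[c]]
    # Unique key combinations, sorted left to right: canonical form again.
    return [group + tuple(totals[group]) for group in sorted(totals)]

fields = ("f1", "f2", "f3", "f4")
data = [(3, "A", "Q1", "Z"), (15, "A", "Q2", "Z"), (30, "F", "Q3", "Y")]

# Drop f3, output f4 then f2, and sum f1.
result = summarize(data, fields, keys=("f4", "f2"), counters=("f1",))
print(result)   # [('Y', 'F', 30), ('Z', 'A', 18)]
```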

The Tabular Data Type

Earlier we listed the available data field types: signed integer, real number, calendar date, zero-filled integer, external data, and tabular data.

In our sample data base we use three of the six possible data types: "rate" was defined to be a real number; "tally" was defined as an integer; the remaining fields were defined to be "tabular".

The "tabular" data type plays an important role in our data base structure. The tabular type is typically assigned to non-arithmetic data fields. "rate" and "tally" contain data which at some time or other will be accessed as operands in arithmetic calculations; for this reason we defined them to be of a numeric type. The remaining fields of the employee data base contain reference labels. Reference labels are prime candidates for the tabular data type.

Typically, a "reference label" can be a name, color, size, shape, even an ssn, or any word that might be used to further describe and identify the particular object with which we are concerned. In a data base, many such labels are repeated; in the example above, the name of the employee is duplicated for each monthly record. Rather than duplicate the reference label each time, we conserve space in the machine by placing the name in a single "table" entry, assigning an index number to it, and then using the number to refer (this is where we get "reference label") to the single "tabular" entry in the "name table".

An ARMR table is a rectangular array of unique reference labels. The table entries are generated and are maintained automatically by the ARMR system.

The table entries are gathered and/or are modified automatically whenever you input, edit, or import data to your data base.

ARMR retains the tables' canonical form. This means that all entries of a given table are kept unique; moreover, the table entries are kept in ascending sort.

The current version of ARMR allows up to 20 distinct tables per data base. Tables are numbered 1 through 20. The tables are associated with one or more of your data base's fields; i.e. you can declare any field to be tabular. In doing so, you will explicitly associate the field with one of 20 possible tables.

The limit of 20 tables is not in conflict with the limit on the number of data fields; i.e. a single table can be shared by several fields. For this reason you can have more than 20 tabular fields in a given data base.

Our employee data base has 5 distinct tables. These are associated with ssn, last, first, month, and dep, respectively. Figure 4 shows the current contents of the tables; see also figure 2. The table entries are the unique occurrences of the various data strings associated with the five tabular fields.


         Table #1         Table #2        Table #3     Table #4     Table #5
      ---------------   --------------   ---------   ------------   --------
      seq     ssn       seq    last      seq first   seq  month     seq dep
      --- -----------   --- ----------   --- -----   ---  ------    --- ---
       1  327-12-3492    1  BLOW          1  ALICE    1   01_JAN     1   x
       2  390-32-2331    2  DOE           2  ANNIE    2   02_FEB     2   y
       3  729-49-9876    3  HALL          3  JOE      3   03_MAR     3   z
       4  773-21-9321    4  WONDERLAND    4  JOHN     4   04_APR
                                                      5   05_MAY

                       Fig. 4. Tables Of Employee Data Base
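
How a table such as Table #2 arises from the "last" column of figure 2 can be sketched in Python. The helper names build_table and encode are ours; ARMR maintains its tables automatically, and this sketch only mirrors the stated rules (unique entries, ascending sort, sequence numbers starting at 1).

```python
def build_table(values):
    """Unique entries in ascending order, as in figure 4."""
    return sorted(set(values))

def encode(value, table):
    """Return the 1-based sequence number of `value` in `table`."""
    return table.index(value) + 1

# The "last" column of figure 2: 5 + 5 + 5 + 3 = 18 records.
last_column = ["WONDERLAND"] * 5 + ["BLOW"] * 5 + ["HALL"] * 5 + ["DOE"] * 3

table2 = build_table(last_column)
print(table2)                        # ['BLOW', 'DOE', 'HALL', 'WONDERLAND']
print(encode("WONDERLAND", table2))  # 4
```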

Physical Data Base Structure

The data of an ARMR data base is stored as 3 files. One file contains the data base tables. A second contains the rectangular array of attribute combinations.

The third file is an ASCII file which you must set up to define the data base layout. The third file is called the "directory" file. More about the directory file in the next sub section.

We like to refer to the first file as the "tables" file, and the second as the "base" file.

The tables file of our employee data base contains the ASCII entries of the five tables as shown in figure 4.

The physical contents of the base file are shown in figure 5.

                   ssn last first rate  month  dep tally
                   --- ---- ----- ----- -----  --- -----
                    1    4    1   30.00   1     2    1
                    1    4    1   30.00   2     2    1
                    1    4    1   30.00   3     2    1
                    1    4    1   30.00   4     2    1
                    1    4    1   40.00   5     2    1
                    2    1    3   20.50   1     1    1
                    2    1    3   20.50   2     1    1
                    2    1    3   20.50   3     2    1
                    2    1    3   25.00   4     2    1
                    2    1    3   25.00   5     2    1
                    3    3    2   32.50   1     1    1
                    3    3    2   38.00   2     1    1
                    3    3    2   38.00   3     1    1
                    3    3    2   38.00   4     1    1
                    3    3    2   38.00   5     1    1
                    4    2    4   17.00   2     3    1
                    4    2    4   18.00   3     3    1
                    4    2    4   18.00   4     3    1

                       Fig. 5. Employee Base File

The base file contains a matrix of numbers.

Figure 5 demonstrates a typical "matrix" with rows and columns.

Each matrix row represents one data base record. The matrix columns are assigned to the various data base fields.

Numeric data are stored explicitly. For the tabular items we store tabular indexes.

Compare figure 2 with figure 5. Figure 2 shows the data as it would be displayed to you on the screen. Figure 5 shows how the equivalent information is stored inside the computer.

Keep in mind that ARMR has access to the tables file as well as to the base file. Let's illustrate how ARMR translates the internal data to create an equivalent external screen display.

The first base record is as follows:

                 ssn last first rate  month  dep tally
                 --- ---- ----- ----- -----  --- -----
                  1    4    1   30.00   1     2    1

The corresponding external form of this data is as follows:

           ssn      last           first         rate   month dep  tally
        ----------- -------------- ------------ -----  ------ ---  -----
        327-12-3492 WONDERLAND     ALICE         30.0  01_JAN  y     1

Rate and tally are numeric types; their numerical values are therefore stored explicitly in the base file. The remaining fields are tabular. Their external display values reside in the tables file. The base file merely contains indexes of the tabular entry sequence as shown below:

                   field  table indexes           display
                   ----- ----------------        -----------
                   ssn    entry 1 of ssn     =   327-12-3492
                   last   entry 4 of last    =   WONDERLAND
                   first  entry 1 of first   =   ALICE
                   month  entry 1 of month   =   01_JAN
                   dep    entry 2 of dep     =   y
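
The translation from internal to external form can be written out as a small Python sketch. The tables are copied from figure 4 and the record from figure 5; the decode function and the dictionary layout are our own illustration, not ARMR's file format.

```python
# Table contents from figure 4, keyed by the field each table serves.
tables = {
    "ssn":   ["327-12-3492", "390-32-2331", "729-49-9876", "773-21-9321"],
    "last":  ["BLOW", "DOE", "HALL", "WONDERLAND"],
    "first": ["ALICE", "ANNIE", "JOE", "JOHN"],
    "month": ["01_JAN", "02_FEB", "03_MAR", "04_APR", "05_MAY"],
    "dep":   ["x", "y", "z"],
}
fields = ("ssn", "last", "first", "rate", "month", "dep", "tally")

def decode(base_record):
    """Replace each tabular index by its table entry; keep numbers as is."""
    out = []
    for name, value in zip(fields, base_record):
        if name in tables:
            out.append(tables[name][value - 1])  # table indexes are 1-based
        else:
            out.append(value)                    # numeric field, stored explicitly
    return tuple(out)

first_record = decode((1, 4, 1, 30.0, 1, 2, 1))
print(first_record)
```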

Data Compression

ARMR is ultra conservative when it comes to saving your disk space. The storage techniques described above allow us to make optimum use of your disk storage.

"How much storage is required for a given data base?" There is no accurate answer to this question. A lot depends on the nature of the data stored in the data base.

Experience has taught us that ARMR data bases require from 1/3 to 1/2 the storage of their ASCII counterparts. To obtain the size of the ASCII counterpart we simply add up the external display widths of all fields. The width, or number of characters per record, thus obtained is then multiplied by the number of data records to obtain the ASCII space requirement. This space is then compared to the actual space required for the equivalent ARMR data base.

It is not too difficult to see why there should be a savings in storage space. The key to this is the tabular data type. If you look at our sample data base shown in Figure 2 you will note that there exists a fair amount of data redundancy. By moving the ASCII data strings to a table of unique entries we avoid having to store redundant strings. Lengthy strings are thus replaced by binary integers which are never longer than 2 bytes.

We must emphasize that data compression is not the primary justification for the tabular data field concept. As you will see later on, tabular field types are important data structures which enable us to implement certain operations in the Data Algebra provided by ARMR. Nevertheless, the cost savings realized through ARMR's data compression is significant enough to warrant special attention.

A Map Of The Data Base

ARMR's basemap module provides you with a report which shows the dimensions of all components of a given data base.

Figure 6 shows a data base map of our sample employee data base. Think of the data base and its associated tables as a collection of rectangles composed of horizontal rows and vertical columns as shown in figures 4 and 5 above. We define the dimensions of these rectangles by reporting their horizontal and vertical length. The vertical dimension is labeled "nline"; this is simply the record count. The horizontal dimension is measured in terms of bytes or character positions. We coined the abbreviation "cpl" to denote characters per line.

                               data structure map
                               ------------------

      base's filespec: employee.BAS            base's dimensions:  cpl=   13
     table's filespec: employee                                  nline=   18


                         display   no. of  table   table table   field field
     field  name   type   width   decimals number   cpl  nline   start width
     -----  ------ ----- -------  -------- ------  ----- -----   ----- -----
        1   ssn    TABLE    11                 1     11      4      1    2
        2   last   TABLE    14                 2     14      4      3    2
        3   first  TABLE    12                 3     12      4      5    2
        4   rate   RNUM      5        2                             7    4
        5   month  TABLE     6                 4      6      5     11    1
        6   dep    TABLE     1                 5      1      3     12    1
        7   tally  BNUM      1                                     13    1

                     Fig. 6. Data Structure Map Of Employee Base

The employee data base has a cpl of 13; its ASCII equivalent is 50 characters per line. The space required to store the data base is the area of the base plus the areas of its associated tables. For the employee data base the total space required is calculated as follows:

           base  T01  T02  T03  T04  T05     total
           ----- ---- ---- ---- ---- ----    -----
           18*13+11*4+14*4+12*4+ 6*5+ 1*3  =  415

    The equivalent ASCII rectangle's area is

    50*18 = 900.

    The compression ratio is 415/900 = 0.46

For the 18 record data base the ratio is near 1/2. The ratio is expected to decrease significantly as the number of records increases. This will happen because the growth rate of the tables typically decreases as the number of data base records increases.
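
The calculation, and the way the ratio falls as records are added, can be reproduced with a short Python sketch. The dimensions come from figure 6. The 1800-record case is purely hypothetical and assumes the tables do not grow at all, which is the most favorable case; in practice the tables grow too, only more slowly than the base.

```python
base_cpl = 13    # bytes per base record (figure 6)
ascii_cpl = 50   # sum of the display widths in figure 6
# (cpl, nline) of tables 1 through 5, from figure 6.
table_dims = [(11, 4), (14, 4), (12, 4), (6, 5), (1, 3)]
table_bytes = sum(cpl * nline for cpl, nline in table_dims)

def ratio(nline, table_bytes=table_bytes):
    """ARMR space divided by ASCII space for `nline` records."""
    armr = base_cpl * nline + table_bytes
    return armr / (ascii_cpl * nline)

print(table_bytes)              # 181
print(round(ratio(18), 2))      # 0.46, the sample data base
print(round(ratio(1800), 2))    # 0.26, hypothetical: tables held fixed
```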