To us, "Data Algebra" implies a formal set of fundamental operations which can be applied to data bases. ARMR's Data Algebra contains, among other operations, data base transposition and data base substitution. More about all this later. Before we introduce the various data base operations, we must return to the data base structure.

Again, an ARMR data base is simply a computerized representation of a single file. Beyond this, we do not maintain any other structures; i.e. file linkages, index arrays, key fields, and the like are not maintained on a permanent basis. Instead, structures like these are created "on the fly", in RAM only, when you process the data via an ARMR module. This approach gives you the flexibility to specify any conceivable processing view at run time. The overhead which would be incurred by storing such structures on a permanent basis is therefore eliminated.

A data base has any number of data fields. Currently, ARMR allows up to 120 fields per data base segment, and up to 65,535 records. This does not imply a data base limit of 65,535 records. In practice, one will typically divide up a large file into smaller files of more manageable sizes.

Each field of a data base carries a specific data type. ARMR recognizes the following data types:

Type Code       Corresponding Data Type   Samples
--------------  ------------------------  -------------
B               signed integer            -248, 500
R               real number               3.279, 100.
D               integer calendar date     86/09/23
Z               zero-filled integer       000249
T               tabular data              Alpha string
E               external data             Alpha string

A data base usually contains two classes of data, numeric data and alpha strings. Data types B, R, D, and Z define numeric data fields; these data are always stored in binary. T and E types are used for alpha numeric strings. Alpha strings are stored in traditional character form, as one byte per character in an E type field. However, the T type field is a preferred data field type to be used for alpha strings. The cost benefits and utility of the tabular data field type are described later on.

The following is a sample employee file which you might store in an ARMR data base.

ssn          last            first         rate   month   dep  tally
-----------  --------------  ------------  -----  ------  ---  -----
327-12-3492  WONDERLAND      ALICE          30.0  01_JAN  y        1
327-12-3492  WONDERLAND      ALICE          30.0  02_FEB  y        1
327-12-3492  WONDERLAND      ALICE          30.0  03_MAR  y        1
327-12-3492  WONDERLAND      ALICE          30.0  04_APR  y        1
327-12-3492  WONDERLAND      ALICE          40.0  05_MAY  y        1
390-32-2331  BLOW            JOE            20.5  01_JAN  x        1
390-32-2331  BLOW            JOE            20.5  02_FEB  x        1
390-32-2331  BLOW            JOE            20.5  03_MAR  y        1
390-32-2331  BLOW            JOE            25.0  04_APR  y        1
390-32-2331  BLOW            JOE            25.0  05_MAY  y        1
729-49-9876  HALL            ANNIE          32.5  01_JAN  x        1
729-49-9876  HALL            ANNIE          38.0  02_FEB  x        1
729-49-9876  HALL            ANNIE          38.0  03_MAR  x        1
729-49-9876  HALL            ANNIE          38.0  04_APR  x        1
729-49-9876  HALL            ANNIE          38.0  05_MAY  x        1
773-21-9321  DOE             JOHN           17.0  02_FEB  z        1
773-21-9321  DOE             JOHN           18.0  03_MAR  z        1
773-21-9321  DOE             JOHN           18.0  04_APR  z        1

Fig. 2. A Sample Employee Data Base

This file tracks employee social security number, first name, last name, department code, and hourly labor rates as a function of calendar month. "ssn", "last", "first", "rate", "month", "dep" and "tally" are the field names of our sample data base. The tally field is made to contain the number 1 at all times; its utility is explained later.

The employees tracked by our data base may migrate to any of the departments x, y, z; moreover, their rates may change in time. Rates and departments are tracked at monthly resolution. Therefore we keep one data base record per employee for each month in which the employee was with the company.

(ssn,last,first,rate,month,dep,tally)

In the abstract sense, a data base is a collection of data points in an n-dimensional space; each data base record represents the coordinates of a single data point! For example, for the sake of simplicity, let us project our data base to the (ssn,month) plane; then the 18 records of our data base are represented by the 18 points marked "*" in Figure 3.

773-21-9321 |          *      *      *
729-49-9876 |   *      *      *      *      *
390-32-2331 |   *      *      *      *      *
327-12-3492 |   *      *      *      *      *
            +-------------------------------------
              01_JAN 02_FEB 03_MAR 04_APR 05_MAY

Fig. 3. Spatial Representation Of Employees Data

If we added another axis, say "dep" to this, then our 18 points would appear in three dimensional space. The "dep" axis would contain distinct positions representing department code values x, y, and z. In general we treat each field as a distinct axis. The unique field values are distinct points on the respective field's axis.

The notion of a spatial representation is by no means new to computer science. If you have ever programmed in a higher level language, you are familiar with the multi dimensional array structure. Such structures are directly related to the spatial representation of data. For example, we could store our employee rate values in a five dimensional array as follows:

array: rate(4,4,4,5,3)

The array's five subscripts represent ssn, last, first, month, and dep, respectively. Note that in our example, as shown in figure 2, we have 4 unique values each for ssn, last, and first; 5 unique values for month; and 3 unique values for dep. The rate array as defined above can thus accommodate the records shown in figure 2. Unfortunately, the array's entries would be primarily empty: only 18 out of the 960 array slots would contain non zero data, so for our example 98% of the allocated memory space goes to waste.
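The waste figure can be checked with a few lines of arithmetic. The sketch below is written in Python purely for illustration; ARMR itself does not expose anything like this, and the variable names are our own. The axis sizes are the unique-value counts read off figure 2.

```python
from math import prod

# Unique value counts per array axis, read off the sample data in figure 2.
axis_sizes = {"ssn": 4, "last": 4, "first": 4, "month": 5, "dep": 3}
records = 18  # records actually present in the sample data base

slots = prod(axis_sizes.values())  # total cells in the dense array
unused = 1 - records / slots       # fraction of cells that stay empty

print(slots)            # 960
print(f"{unused:.1%}")  # 98.1%
```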

Dimensioned arrays waste considerable amounts of memory space. Moreover, they are practical in a static environment only. In an active data base, the space defining coordinates are constantly changing; e.g. the introduction of a new ssn value means that the ssn coordinate must be modified. Because of this, we say that a multi-dimensional array structure is not practical for storing a data base. Nevertheless, we can not therefore ignore the spatial nature of data.

ARMR treats data records as attribute vectors whose elements, in the abstract sense, are subscripts to a multi dimensional array. In this way we store only the non zero entries of a multi dimensional array. Moreover, we do not lose sight of the spatial nature of data.

Aside from the obvious savings in memory space, ARMR's n-tuplet structure provides us considerable flexibility. With ARMR you are never wired into a fixed data view. Instead, you will find that you are able to manipulate and view data from "all angles" with the greatest of ease.

(f1,f2,f3,...,fn)

Here f1 through fn are associated with the data base's field names. Let's extend this notation to include the individual data records. To do this we associate a subscript to the attributes. The attribute vector would then be written as follows:

(f1[i],f2[i],f3[i],...,fn[i])

The subscript "i" denotes a record number. For a particular data base, "i" ranges from 1 to "nline"; where "nline" equals the number of data records stored in the base.

Two conditions must hold in order to realize the canonical form. These are as follows:

1) general data uniqueness

this implies that

(f1[i],f2[i],f3[i],...,fn[i])

is not equal to

(f1[j],f2[j],f3[j],...,fn[j])

for all possible pairs of i and j for which i does not equal j.

This simply means that we always discard exact duplicates; i.e. consistent with the spatial nature of data, two points can not occupy the identical spot in space.

2) inherent sort

this implies that

(f1[k],f2[k],f3[k],...,fn[k])

is placed ahead of

(f1[k+1],f2[k+1],f3[k+1],...,fn[k+1])

for k ranging from 1 to nline-1.

The ARMR modules are programmed to ensure that your data base retains, at all times, the canonical form as defined above; e.g. when filing a data base, ARMR will discard all duplicate records; moreover, the data records will be arranged in ascending ASCII sort on all data fields, before the data is written to your disk.
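The two conditions can be illustrated with a minimal sketch. It is written in Python for illustration only and is not ARMR's own code; the function name "canonical" and the four-field sample records are our own assumptions. Records are modeled as tuples of strings, so ordinary tuple comparison gives an ascending left-to-right sort on all fields.

```python
def canonical(records):
    """Drop exact duplicates, then sort ascending on all fields, left to right."""
    return sorted(set(records))

raw = [
    ("390-32-2331", "BLOW", "JOE", "02_FEB"),
    ("327-12-3492", "WONDERLAND", "ALICE", "01_JAN"),
    ("390-32-2331", "BLOW", "JOE", "02_FEB"),  # exact duplicate, discarded
    ("327-12-3492", "WONDERLAND", "ALICE", "02_FEB"),
]
base = canonical(raw)

# Condition 1: no two records are identical.
unique = len(base) == len(set(base))
# Condition 2: record k is placed ahead of record k+1.
ordered = all(base[k] < base[k + 1] for k in range(len(base) - 1))
```

For plain ASCII strings, Python compares character codes, which matches the ascending ASCII sort described above.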

The importance of the canonical form can not be overemphasized. Because of it, you will be able to define report formats and summaries in a matter of seconds; e.g. the canonical form is applied directly to build a summary file. To build such a file you simply specify a subset of the data base fields and designate one or more fields as counters to be summarized. The left-to-right output field sequence defines the output sort. ARMR will retain all unique combinations of the selected fields and summarize the counters accordingly as shown below:

input          output
-----------    ---------
f1 f2 f3 f4    f4 f2 f1+
-- -- -- --    -- -- ---
 3 A  Q1 Z     Y  F   30
15 A  Q2 Z     Z  A   18
30 F  Q3 Y

In the example above we drop "f3", summarize "f1" and rearrange the field sequence. Both input and output have ARMR's canonical form.
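The summary example above can be reproduced with a short sketch. This is an illustration in Python, not ARMR's implementation; the function "summarize" and its parameters are hypothetical names chosen for this example. The selected fields form a key, the counter is summed per unique key, and the result is sorted on the key in left-to-right order.

```python
from collections import defaultdict

def summarize(records, keys, counter):
    """Keep unique combinations of `keys` and sum `counter` for each one."""
    totals = defaultdict(int)
    for rec in records:
        totals[tuple(rec[k] for k in keys)] += rec[counter]
    # Keys are unique, so sorting the items sorts on the output fields.
    return [key + (total,) for key, total in sorted(totals.items())]

rows = [
    {"f1": 3, "f2": "A", "f3": "Q1", "f4": "Z"},
    {"f1": 15, "f2": "A", "f3": "Q2", "f4": "Z"},
    {"f1": 30, "f2": "F", "f3": "Q3", "f4": "Y"},
]

# Drop f3, summarize f1, and output in the order f4, f2, f1.
summary = summarize(rows, keys=("f4", "f2"), counter="f1")
print(summary)  # [('Y', 'F', 30), ('Z', 'A', 18)]
```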

In our sample data base we use three of the seven possible data types; "rate" was defined to be a real number; "tally" was defined as an integer. The remaining fields were defined to be "tabular".

The "tabular" data type plays an important role in our data base structure. The tabular type is typically assigned to non arithmetic data fields; "rate" and "tally" contain data which at some time or other will be accessed as operands in arithmetic calculations. For this reason, we defined them to be of a numeric type. The remaining fields of the employee data base contain reference labels. Reference labels are prime candidates for the tabular data type.

Typically, a "reference label" can be a name, color, size, shape, even an ssn, or any word that might be used to further describe and identify the particular object with which we are concerned. For example, in our data base we may have many objects that are repeated, such as the example above, where the name of the employee is duplicated for each monthly record. Rather than duplicate the reference label each time, we conserve space in the machine: we assign an index number to the name, place the name in a single "table" entry, and then use the number to refer (this is where we get "reference label") to the single "tabular" entry in the "name table".

An ARMR table is a rectangular array of unique reference labels. The table entries are generated and are maintained automatically by the ARMR system.

The table entries are gathered and/or are modified automatically whenever you input, edit, or import data to your data base.

ARMR retains the tables' canonical form. This means that all entries of a given table are kept unique; moreover, the table entries are kept in ascending sort.

The current version of ARMR allows up to 20 distinct tables per data base. Tables are numbered 1 through 20. The tables are associated to one or more of your data base's fields; i.e. you can declare any field to be tabular. In doing so, you will explicitly associate the field with one of 20 possible tables.

The limit of 20 tables is not in conflict with the restriction of 80 data fields; i.e. a single table can be shared by several fields. For this reason you can have more than 20 tabular fields in a given data base.

Our employee data base has 5 distinct tables. These are associated with ssn, last, first, month, and dep respectively. Figure 4 shows the current contents of the tables. See also figure 2. The table entries are unique occurrences of the various data strings associated with the five tabular fields.

Table #1          Table #2         Table #3    Table #4     Table #5
---------------   --------------   ---------   ----------   --------
seq ssn           seq last         seq first   seq month    seq dep
--- -----------   --- ----------   --- -----   --- ------   --- ---
  1 327-12-3492     1 BLOW           1 ALICE     1 01_JAN     1 x
  2 390-32-2331     2 DOE            2 ANNIE     2 02_FEB     2 y
  3 729-49-9876     3 HALL           3 JOE       3 03_MAR     3 z
  4 773-21-9321     4 WONDERLAND     4 JOHN      4 04_APR
                                                 5 05_MAY

Fig. 4. Tables Of Employee Data Base
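How a table of this kind could be derived from a tabular field is sketched below. This is a Python illustration under our own assumptions, not ARMR's code; "build_table" and "encode" are hypothetical helper names. The table holds the unique labels in ascending sort, and a label's 1-based position is the "seq" number stored in its place.

```python
def build_table(values):
    """Unique labels in ascending sort; position + 1 is the 'seq' number."""
    return sorted(set(values))

def encode(value, table):
    """Return the 1-based table index that stands in for a label."""
    return table.index(value) + 1

# The "last" field of the 18 sample records, in record order.
last_names = ["WONDERLAND"] * 5 + ["BLOW"] * 5 + ["HALL"] * 5 + ["DOE"] * 3

last_table = build_table(last_names)
print(last_table)                        # ['BLOW', 'DOE', 'HALL', 'WONDERLAND']
print(encode("WONDERLAND", last_table))  # 4
```

Eighteen name strings collapse into four table entries; each record then carries only a small index.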

The third file is an ASCII file which you must set up to define the data base layout. The third file is called the "directory" file. More about the directory file in the next sub section.

We like to refer to the first file as the "tables" file, and the second as the "base" file.

The tables file of our employee data base contains the ASCII entries of the five tables as shown in figure 4.

The physical contents of the base file are shown in figure 5.

ssn  last  first  rate   month  dep  tally
---  ----  -----  -----  -----  ---  -----
  1     4      1  30.00      1    2      1
  1     4      1  30.00      2    2      1
  1     4      1  30.00      3    2      1
  1     4      1  30.00      4    2      1
  1     4      1  40.00      5    2      1
  2     1      3  20.50      1    1      1
  2     1      3  20.50      2    1      1
  2     1      3  20.50      3    2      1
  2     1      3  25.00      4    2      1
  2     1      3  25.00      5    2      1
  3     3      2  32.50      1    1      1
  3     3      2  38.00      2    1      1
  3     3      2  38.00      3    1      1
  3     3      2  38.00      4    1      1
  3     3      2  38.00      5    1      1
  4     2      4  17.00      2    3      1
  4     2      4  18.00      3    3      1
  4     2      4  18.00      4    3      1

Fig. 5. Employee Base File

The base file contains a matrix of numbers.

Figure 5 demonstrates a typical "matrix" with rows and columns.

Each matrix row represents one data base record. The matrix columns are assigned to the various data base fields.

Numeric data are stored explicitly. For the tabular items we store tabular indexes.

Compare figure 2. with figure 5. The first figure shows the data as it would be displayed to you on the screen. The current figure shows you how the equivalent information is stored inside the computer.

Keep in mind that ARMR has access to the tables file as well as to the base file. Let's illustrate how ARMR translates the internal data to create an equivalent external screen display.

The first base record is as follows:

ssn  last  first  rate   month  dep  tally
---  ----  -----  -----  -----  ---  -----
  1     4      1  30.00      1    2      1

The corresponding external form of this data is as follows:

ssn          last            first         rate   month   dep  tally
-----------  --------------  ------------  -----  ------  ---  -----
327-12-3492  WONDERLAND      ALICE          30.0  01_JAN  y        1

Rate and tally are numeric types; their numerical values are therefore stored explicitly in the base file. The remaining fields are tabular. Their external display values reside in the tables file. The base file merely contains indexes of the tabular entry sequence as shown below:

field   table indexes          display
-----   ----------------       -----------
ssn     entry 1 of ssn     =   327-12-3492
last    entry 4 of last    =   WONDERLAND
first   entry 1 of first   =   ALICE
month   entry 1 of month   =   01_JAN
dep     entry 2 of dep     =   y
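The translation from internal to external form amounts to one table lookup per tabular field. The sketch below shows the idea in Python; it is an illustration only, and the "decode" function and the dictionary layout are our own assumptions rather than ARMR's file format. The table contents are those of figure 4.

```python
# Table contents as listed in figure 4 (seq 1 is the first list entry).
tables = {
    "ssn": ["327-12-3492", "390-32-2331", "729-49-9876", "773-21-9321"],
    "last": ["BLOW", "DOE", "HALL", "WONDERLAND"],
    "first": ["ALICE", "ANNIE", "JOE", "JOHN"],
    "month": ["01_JAN", "02_FEB", "03_MAR", "04_APR", "05_MAY"],
    "dep": ["x", "y", "z"],
}

def decode(record, tables):
    """Replace each tabular index by its label; pass numeric fields through."""
    return {
        field: tables[field][value - 1] if field in tables else value
        for field, value in record.items()
    }

# First record of the base file (figure 5).
first_record = {"ssn": 1, "last": 4, "first": 1, "rate": 30.0,
                "month": 1, "dep": 2, "tally": 1}
external = decode(first_record, tables)
```

Here "external" holds 327-12-3492, WONDERLAND, ALICE, 30.0, 01_JAN, y and 1, which is the display form of the first record.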

"How much storage is required for a given data base?" There is no accurate answer to this question. A lot depends on the nature of the data stored in the data base.

Experience has taught us that ARMR data bases require from 1/3 to 1/2 the storage of their ASCII counterparts. To obtain the size of the ASCII counterpart we simply add up all of the external display widths. The width, or number of characters per record, thus obtained is then multiplied by the number of data records to obtain the ASCII space requirement. This space is then compared to the actual space required for the equivalent ARMR data base.

It is not too difficult to see why there should be a savings in storage space. The key to this is the tabular data type. If you look at our sample data base shown in Figure 2 you will note that there exists a fair amount of data redundancy. By moving the ASCII data strings to a table of unique entries we avoid having to store redundant strings. Lengthy strings are thus replaced by binary integers which are never longer than 2 bytes.

We must emphasize that data compression is not the primary justification for the tabular data field concept. As you will see later on, tabular field types are important data structures which enable us to implement certain operations in the Data Algebra provided by ARMR. Nevertheless, the cost savings realized through ARMR's data compression is significant enough to warrant special attention.

Figure 6 shows a data base map of our sample employee data base. Think of the data base and its associated tables as a collection of rectangles composed of horizontal rows and vertical columns as shown in figures 4 and 5 above. We define the dimensions of these rectangles by reporting their horizontal and vertical length. The vertical dimension is labeled "nline"; this is simply the record count. The horizontal dimension is measured in terms of bytes or character positions. We coined the abbreviation "cpl" to denote characters per line.

data structure map
------------------
base's filespec:  employee.BAS     base's dimensions: cpl=   13
table's filespec: employee                            nline= 18

                     display  no. of    table   table  table  field  field
field  name   type   width    decimals  number  cpl    nline  start  width
-----  ------ -----  -------  --------  ------  -----  -----  -----  -----
    1  ssn    TABLE       11                 1     11      4      1      2
    2  last   TABLE       14                 2     14      4      3      2
    3  first  TABLE       12                 3     12      4      5      2
    4  rate   RNUM         5         2                            7      4
    5  month  TABLE        6                 4      6      5     11      1
    6  dep    TABLE        1                 5      1      3     12      1
    7  tally  BNUM         1                                     13      1

Fig. 6. Data Structure Map Of Employee Base

The employee data base has a cpl of 13; its ASCII equivalent is 50 characters per line. The space required to store the data base is the area of the base plus the areas of its associated tables. For the employee data base the total space required is calculated as follows:

base    T01    T02    T03    T04   T05    total
-----   ----   ----   ----   ----  ----   -----
18*13 + 11*4 + 14*4 + 12*4 + 6*5 + 1*3  =  415

The equivalent ASCII rectangle's area is 50*18 = 900.
The compression ratio is 415/900 = 0.46

For the 18 record data base the ratio is near 1/2. The ratio is expected to decrease significantly as the number of records increases. This will happen because the growth rate of the tables typically decreases as the number of data base records increases.
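The same arithmetic can be written out as a short sketch, here in Python for illustration. The variable names are ours; the numbers are taken from the data structure map in figure 6.

```python
nline, cpl = 18, 13  # base dimensions from figure 6

# (cpl, nline) of tables 1 through 5: ssn, last, first, month, dep.
table_dims = [(11, 4), (14, 4), (12, 4), (6, 5), (1, 3)]

# External display widths of the seven fields, in field order.
display_widths = [11, 14, 12, 5, 6, 1, 1]

armr_area = nline * cpl + sum(c * n for c, n in table_dims)
ascii_area = sum(display_widths) * nline
ratio = armr_area / ascii_area

print(armr_area, ascii_area, round(ratio, 2))  # 415 900 0.46
```

Adding records grows the base term by 13 bytes each, against 50 for the ASCII form, while the table terms grow only when a new unique label appears.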