More and more, a company's operational and strategic work depends on data processing. It is now apparent that data is itself an important company resource. Since a company's information on its products, customers, suppliers, employees, competitors, and so on can provide a significant competitive advantage, we must start to manage data as a company resource, while at the same time improve the way it is handled relative to programming. It becomes clear that some sort of uniform system is needed. A relational database environment, with its supporting software and accompanying management techniques, provides the framework for treating data as a standardized, manageable, shareable resource.
The Good Old Days
We still tend to concentrate on the how (code), as opposed to the what (data and the structure of that data). Programming (coding) is the fun part for most of us; designing an application, especially the database, seems too much like work. The majority of us began our programming lives on a flat file system, the S/34 or S/36. Little, if any, thought was given to the design of our files -- the code was what mattered. With the advent of the S/38, and more recently the AS/400, we were given a built-in relational database. But all those years of thinking about the code and not the data have taken their toll. Even today, when we are all supposed to know better, most of our time and emphasis (when it comes to applications development) is still spent on the code.
MIS departments are an ever-expanding entity. Rules change as data processing needs grow. It becomes apparent that in order to keep up with large backlogs of requests for new applications and enhancements to existing applications, application development has to take on a more standard, structured form. The amount of time spent maintaining existing applications becomes unacceptable as we attempt to keep ahead of these requests. It's clear that the way we handle -- or mishandle -- data is a major factor in the application maintenance mess in which we find ourselves.
There are numerous ways the design of our data files impacts new development and maintenance to existing applications. Most shops have one or more of the following problems somewhere in their system:
o Different files store the same data in different formats.
o Different applications store the same data is different files. This stores data redundantly in several files.
o Some programs are written in such a manner that if the way that the data is stored changes, the programs that access that data have to be modified in order to continue working correctly.
o Data is all too often inaccurate, inconsistent, or not up to date. Changes in everything from access methods to tax tables require programming changes.
Most data processing departments find it difficult to maintain standards in both programming and data storage. Programmers tend to go their separate ways when other programs are not directly dependent on their work. Unfortunately, a lack of standardization makes it difficult for anyone -- often even the original programmer -- to understand a program well enough to make required maintenance changes to it later on.
Setting Up Standards
A careful and standardized look at the entire application development process is required when different applications share data files. We need to use consistent data descriptions and access methods in our programs that affect the database and keep in mind that whatever one programmer does affects others, too. Standardization in programming techniques and storage schemes only serves to make for a less error-prone environment. This won't happen overnight, but rather takes place as a gradual transition.
In the independent flat file data processing environment, choice of file organization and access method usually decides the way that the application program is written. The program logic is dependent on the way in which the data is stored. If for any reason the storage characteristics of the data change, the program must be modified, often extensively. This should be avoided, due to the time and expense involved in such efforts. There are many reasons file structures should change to reflect improved storage techniques: new hardware, new software and performance tuning, to name just a few. One reason for the massive amounts of time spent on program maintenance is the typical level of dependency between programs and data.
What we would like to be able to do is separate storage and access considerations from programming as much as possible. The AS/400's Database Management System (DBMS) is a major step in the right direction in solving certain aspects of this problem.
The use of independent flat files should be looked at as both a starting point and a stopgap measure. It is clear that the independent flat file method of data storage, combined with individual applications implemented in a vacuum, carries with it a host of undesirable side effects. The solution, as we shall see more and more, is the database approach.
A Relational Database
Dr. Edgar F. Codd of IBM published a paper in 1970 entitled, "A Relational Model of Data for Large Shared Data Banks." This paper was the beginning of the relational database.
A company needs to keep track of many different pieces of information. Among these are employees, inventory, financial statistics, etc. All of this data must be organized in a manner that allows for fast easy retrieval, while at the same time avoids redundant storage. At a conceptual level, a database is a collection of this information. It's a collection of the data concerning a real world facet of the company.
The Parts of a Relational Database
A relational database is a DBMS implementation that is made up of relations (files), also known as tables. These tables are implemented as flat files, but through the use of keys are combined together to give very complex and sophisticated views of the data. A graphic representation of the parts of a relational database is shown in 1. A relation is made up of tuples (records, rows). A tuple is the description of a particular set of facts in a relational database. For example, an employee master record, a magazine subscriber record. While a relation is a description of a group of entities, a tuple is the description of a particular entity. Tuples are made up of attributes (fields, columns). An attribute is a description of one particular fact concerning a tuple. Just as a tuple is a description of a particular entity, an attribute is a description of one fact about the entity. For example, an employee's social security number, a magazine subscriber's address. Each attribute has a domain. The domain of an attribute is the allowable values of that attribute, e.g., an employee's salary must be numeric and be in the range of $1 to $100. From this point on, when referring to a relation (file) I will use the term table, when referring to a tuple (record) I will use the term row and when referring to an attribute (field) I will use the term column.
A relational database is a DBMS implementation that is made up of relations (files), also known as tables. These tables are implemented as flat files, but through the use of keys are combined together to give very complex and sophisticated views of the data. A graphic representation of the parts of a relational database is shown in Figure 1. A relation is made up of tuples (records, rows). A tuple is the description of a particular set of facts in a relational database. For example, an employee master record, a magazine subscriber record. While a relation is a description of a group of entities, a tuple is the description of a particular entity. Tuples are made up of attributes (fields, columns). An attribute is a description of one particular fact concerning a tuple. Just as a tuple is a description of a particular entity, an attribute is a description of one fact about the entity. For example, an employee's social security number, a magazine subscriber's address. Each attribute has a domain. The domain of an attribute is the allowable values of that attribute, e.g., an employee's salary must be numeric and be in the range of $1 to $100. From this point on, when referring to a relation (file) I will use the term table, when referring to a tuple (record) I will use the term row and when referring to an attribute (field) I will use the term column.
Uniquely Qualified
One requirement of relational databases is that each row in a table must be unique. Each row must be identified by a unique set of columns, known as a key.
There may be many combinations of columns that can uniquely identify a row, these are known as candidate keys. The primary key is the one candidate key chosen to function as the main identifier for the rows of the table. The key of one table can appear in another table, this is know as a foreign key. A foreign key is a column or columns of one table in a database that is the primary key of another table in the database. This foreign key is what allows one table to relate to another table in a database, hence the term relational database. One particular issue that foreign keys are closely tied to is referential integrity. Referential integrity is the issue of whether a data value that serves as the foreign key of a row in one table has a matching primary key value in another table.
Putting it all Together
A relational database must be capable of certain operations in order to qualify for the name relational. Among these operations are three very important ones: Select, Project and Join.
The Select operation allows us to retrieve a table that is the subset of another table. When we select a table, we take a horizontal piece of the table. In other words, we take selected rows from a table. For example, select all employees that have a salary greater that $1,000 an hour. This would retrieve a table that contain only the employees that meet the selection criteria: salary greater than $1,000. On the AS/400 there are many ways to achieve this: logical files, OPNQRYF, SQL/400, etc.
To Project a table, we take only selected columns from the table. When we project a table, we take a vertical piece of the table. In other words, we take selected columns from a row. For instance, retrieve the salary of all employees. This would produce a table that contains only the salary column. When I say that these operations produce a table, it doesn't mean that this table is a permanent table. It means that to an application that result appears as a table, but this table may only exist for that particular run of the program.
The Join operation is the most important and complex of the operations in a relational database. A join allows for the implementation of the purpose of a relational database, the relating of two or more tables. A join is the merging of related records from two or more files using keys as the link. For example, joining the employee master table with the employee insurance table using the column employee number as the join column. In the employee master table, the column employee number is the primary key and in the employee insurance file, the column employee number is a foreign key.
These operations can be combined in many different ways to produce very complicated and sophisticated views of your data. For instance, join the employee master file with the employee insurance file, selecting only those employees that have been employed more than two years, giving their name, insurance type and years of service.
Relationships
The ways in which different entities relate to each other are known as associations or relationships. These relationships are show in 2. Unary associations are when one entity relates to only one other entity, such as an employee and the employee's social security number. There is a one for one (unary) association in each direction: one employee, one social security number. This is also known as a one-to-one relationship.
The ways in which different entities relate to each other are known as associations or relationships. These relationships are show in Figure 2. Unary associations are when one entity relates to only one other entity, such as an employee and the employee's social security number. There is a one for one (unary) association in each direction: one employee, one social security number. This is also known as a one-to-one relationship.
A relationship can also exist that has unary association in one direction and a multiple association in the other direction. This is known as a one-to-many relationship. An example would be the association between an employee and the company. An employee works for one company but the company has many employees working for it. The final type of relationship has multiple associations in both directions. This is known as a many-to-many relationship. An example would be users and AS/400 libraries. There are many users authorized to a library and many libraries authorized to a user.
Data Redundancy
Data stored in a data processing system mirrors the realities of a business. In a data processing system, data is redundant if a particular fact about the real world is stored more than once. Let's take a look at the associated problems of redundant data and data integrity. First we will consider them in the context of one file, then in the context of many files. A example of nonredundant and redundant tables is shown in 3.
Data stored in a data processing system mirrors the realities of a business. In a data processing system, data is redundant if a particular fact about the real world is stored more than once. Let's take a look at the associated problems of redundant data and data integrity. First we will consider them in the context of one file, then in the context of many files. A example of nonredundant and redundant tables is shown in Figure 3.
Suppose that we are a health insurance company which must keep track of our accounts and agents. The Agent file is keyed on the unique column, Agent Number, and includes, as one of its other fields, the column, Agent Name. The Account file is keyed on the unique column, Account Number, and has, as one of its other fields, the Agent Number, (the number of the agent who services this account). There is one record per agent in the Agent file and each record contains information about a particular agent, such as agent number, name, location, telephone extension, and so on. There's a similar situation for the data on each account in the Account file.
This kind of data storage is said to be nonredundant, because the detail data on each entity is stored only once. If someone comes along and wants to know the name of the agent who is identified by a particular agent number, they merely look up that agent's record in the Agent file, using Agent Number as the key, and read the name. Similarly, if someone wants to know the number of the agent who is responsible for a particular account, that person may access the Account file and look up the account's record, using the Account Number as the key. So far, so good.
Now, what if someone comes along and wants to know the name of the agent who services a particular account, identified by account number. Can that information be obtained from only one of the files? Clearly it cannot, since the account number information exists only in the Account file, and the agent name information exists only in the Agent file. The only way to answer this question is to first look up the account's record in the Account file, find the number of the agent on that account, use that number to find the person's record in the Agent file, and then finally find his or her name in the record. That kind of custom-made, multi-command, multifile access is error-prone in terms of programming, and expensive in terms of execution performance.
Ah, but if we knew that we were going to make such queries, why did we bother breaking the data up into two files in the first place? If it were all contained in one file, in a meaningful way, then there would be no costly multi-command, multifile accesses. Notice that we show the agent assigned to a particular account, just as we did in the two files, but now the agent's other information is also carried along. We can still answer the two original simple questions that we put to the files, but now, in addition, we can answer the tougher, previously multifile question, "What is the name of the agent on a particular account?" with just the one file and one command. Fine! Or is it?
In this combined file, where all of the fundamentally different kinds of information are intermingled, the particulars of a specific agent must be repeated for every account serviced.
This situation of data redundancy causes a number of problems. For one thing, the data takes up a lot more storage space. For another thing, it causes extra work to update. If an agent's name changes, the change must be recorded many times. The time consumed in doing this can be very costly, particularly if the records involved are scattered all over a disk. Also, there is the nagging doubt that each of the occurrences was located and updated, due to anything from programming bugs to execution time system failures. When that happens, it's called a data integrity problem.
There appears to be a trade-off, but trade-offs are hardly rare in the computer field. In the two files, we have no redundancy, but a query of the type that we've been looking at requires a multi-command, multifile access. In the combined file, we have eliminated the need for a multi-command, multifile access for that type of query, but we have introduced data redundancy. Neither of these situations is tolerable, for the reasons indicated previously.
Can we have our cake and eat it, too? Yes, and this question brings us to one of the key features of the database environment. A true integrated data management system; that is, a true database system, is one in which data can be held nonredundantly. It also allows for a multifile query that requires a mixture of different kinds of data to be specified in a single command from the highest level programming interface. Any system that does not have this property really should not be called a database system -- a "file management" system would be a more accurate term.
Another benefit of the integrated approach is the encouragement that it gives users and programmers to find new uses for the data in the database. The richness of having so many different portions of the organization's data interconnected leads to many more usage possibilities than when it was stored separately.
Departmental Data
There is another kind of redundancy that, while perhaps a bit more obvious than the one previously discussed, is not necessarily easier to combat, but is difficult for different reasons.
Certain data is needed by various departments in a company during the course of normal, everyday work. For example, account name address may be needed by the sales department, the accounts receivable department and the credit department. There will also generally be other pieces of information involving accounts that are needed by only one or two departments. Typically the solution to this multiple need is redundancy. The sales department (or accounts receivable and credit departments) has its own stored file that, among other things, contains the account name and address.
While it's true that this keeps various departments from competing for the data, it opens up a can of worms reminiscent of the problems that we encountered in single file redundancy. First of all, storing a piece of data three times takes three times the amount of storage as storing it once. In addition, every time the data must be updated, it must be changed in each file in which it resides. This wastes processing time, causes control nightmares and can potentially lead to integrity problems if all of the data is not updated correctly and at the same time. Furthermore, the question arises as to whose responsibility it is to update data in all of the different files -- files that are probably owned by widely separated departments.
Wouldn't it make more sense for such data to be stored just once, in a way that makes it accessible to all of the different groups that need it? This way, there would be one group in charge of updating it, little wasted space, few integrity problems, and so on.
The database approach includes the concept of a common shared database with controlled management and update responsibility.
Summing It All Up
A relational database offers many advantages over the traditional methods of data storage. It allows for the storage of complex pieces of data concerning the real world activities of a company in a flexible, nonredundant manner. It also provides a set of operations that allows that data to be accessed in a myriad of ways. This approach enables a company to look at its data in ways that were never before possible. With a relational approach to data storage, a company can begin to manage its data the same way it manages its inventory - - as an asset and not a liability.
Relational Database Design -- Part 1
Figure 1 Database components
Figure 1: Database Components DATABASE RELATION, TABLE, FILE TUPLE, ROW, RECORD ATTRIBUTE, COLUMN, FIELD DOMAIN
Relational Database Design -- Part 1
Figure 2 Entity relationships
Figure 2: Entity Relationships ONE <--------> ONE ONE <-------->> MANY MANY<<-------->> MANY
Relational Database Design -- Part 1
Figure 3 Redundant vs Nonredundant tables
Figure 3: Redundant vs. Nonredundant Tables NONREDUNDANT: Account Table: Account Account Account Agent Number Name Type Number Agent Table: Agent Agent Agent Agent Number Name Extension Location REDUNDANT: Account/Agent Table: Account Account Account Agent Agent Agent Number Name Type Name Extension Location
LATEST COMMENTS
MC Press Online