CS245   Summer 1994                                              Handout # 4
 
Assignment 2
due in class on Thursday July 7

A relational database system holds two relations, C (customers) and
A (accounts), with the following characteristics:

Relation C
  Tuples stored as fixed length, fixed format records, length = 1024 bytes
  Tuples contain key attribute C.N (customer number), length 20 bytes;
                 other fields and record header make up rest
  There are 5,000 C tuples
  There is an index on attribute C.N.

Relation A
  Tuples stored as fixed length, fixed format records, length = 250 bytes
  Tuples contain attribute A.N (number of customer who owns account),
                        length 20 bytes; other fields (+header) make up rest
  There are 17,000 A tuples
  There is an index on attribute A.N.

While the number of accounts associated with each customer vary,
we have been told that for evaluation purposes we may assume that:
      80% of the customers have exactly 3 accounts
      20% of the customers have exactly 5 accounts

The records are to be placed in a collection of 4 KB disk blocks
that have been reserved to exclusively hold C, or A, or both types of records.
(That is, there are no other types of records in the blocks we are
discussing in this problem.)
Each block uses up 20 bytes for its header; records are not spanned.)
Three disk organization strategies are being considered:

  [sequential]  All the C records are placed sequentially (ordered
                by customer name) in one subset of blocks;
                Records of type A are separate in another set of blocks;
                accounts are not ordered by customer name.

  [clustered]   For each customer record, the accounts for that customer
                (C.N = A.N) must reside in the same block.
                The customer records are not sequenced in any way.
                Assume you have 1000 blocks with 2 customer records each
                and the corresponding 3 + 5 = 8 account records.
                The rest of the blocks have the remaining 3000 customers
                (two per block) that only have 3 accounts.

  [random]      The records are placed as follows, without regard to C.N
                or A.N values, as follows:
                    2,125 blocks with two C and eight A records
                    the remaining C records in blocks with three records each.

We are also told there are four types of queries that constitute the vast
majority of the workload:

[probe]         Given a customer name, get his customer record
[ordered scan]  For all customers, in increasing customer name order,
                    get each customer record
[plain scan]    For all customers, in any order, get all customer records.
[join]          For all customers (any order) do:
                     Get customer record followed by all its accounts
                (This is exactly how join queries are processed.)

Problem 1:
For each of the storage strategies, compute how many total disk blocks
are needed for hold relations C and A. Briefly explain your answers.
Display your final results as explained below.

Problem 2:
For each query type and for each storage strategy,
estimate the number of disk blocks
that must be transferred from disk to execute the query.
Briefly explain each answer.  Display your final results as explained below.

Assume you only have one buffer page (4 KB) in memory;
thus, each time you need a block it counts as one IO
(unless the requests are consecutive).

Also, ignore any IOs due to index accesses.  For example,
say I am examining a customer with C.N = "Smith".
The index will point directly to the blocks that hold Smith's
accounts. In my computations, I only need to consider the IO's
for getting the actual C and A records, not for accessing the index.
(Another way to put this is that you may assume the index resides in memory.)

DISPLAYING FINAL ANSWERS:
In Problems 1 and 2 you are to compute a total of 15 values.
Please display them in a table like this:
(You also need to explain the derivation of the numbers, this is just a
summary.)

              | storage   ||  IOs for  |  IOs for  |  IOs for  |  IOs for  |
              | cost(Blks)||  [probe]  | [ord scan]| [pln scan]|  [join]   |
----------------------------------------------------------------------------
[sequential]  |           ||           |           |           |           |
----------------------------------------------------------------------------
[clustered]   |           ||           |           |           |           |
----------------------------------------------------------------------------
[random]      |           ||           |           |           |           |
----------------------------------------------------------------------------


HINT:  Do not concern yourself with events that are very unlikely.
For example, say you want to get the three account records for a particular
customer, and that these records are randomly placed on a large
number of disk blocks. There is some chance that the account
records you want happen to be in the same block (just by luck).
However, for this problem you may assume this probability is negligible
and it will take you exactly three IOs to get those three records.

END OF ASSIGNMENT 2