CS311 HW4, Fri Dec 5, 2003

**Problem 1**

*Questions below refer to disk with an actual (formatted) capacity of 8 gigabytes (2^33 bytes). The disk has 16 surfaces and each surface has 1024 tracks. The disk rotates at 7200 rpm. The average seek time is 9ms. The block size 8KB.*

(a) What is the capacity of a single track?

(a) What is the capacity of a single track?

**524288 bytes**

*(b) Suppose we are reading a file F that occupies exactly one entire track. How long does it take to read the entire file sequentially?*

**2.3 microseconds**

*(c) Suppose that you are designing a file system for a movie database. Each movie record has 10 fields that always occur and 30 optional fields that may or may not be relevant or known for a movie. Each field has a fixed size of 20 bytes. Assume that each optional field is relevant for a particular movie with probability p. You are considering two options:*

- a fixed-format record

- a variable-format record where all fields are tagged using a 2-byte tag

For what range of p values is the fixed format option better than variable format option according to the expected storage requirement?

- a fixed-format record

- a variable-format record where all fields are tagged using a 2-byte tag

For what range of p values is the fixed format option better than variable format option according to the expected storage requirement?

Where F is the number of optional fields:

If each field is tagged with a 2 byte tag:

- (20 bytes + 2(f)) + 20*(10+f)

- 20*40=800 bytes

20+2f+200+20f <= 800

220 + 22f <= 800

22f <= 580

f <= 26.36

26/30 is 86% probability. If p <= .86, it is better to use the variable field format, otherwise, it is better to use the fixed-field format, due to the extra space used by the tags.

**Problem 2 (Indexing, 20 points)**

*Suppose blocks hold either thirty records or 200 key-pointer pairs, but neither data nor index blocks are allowed to be more than 80% full. As a function of n, the number fo records, how many blocks do we need to hold a data file and:*

- A dense index?
- A sparse index?

**Answers:**

- Data blocks take n/30 blocks to store, and the index takes n/200. However, since blocks can only be 80% full, it takes 1.25 as many blocks. So, 23n/600 * 1.25 = 23n/400 blocks to store n blocks of data.
- Same as above, however, the index is sparse and therefore only takes n/600. 7n/200 * 1.25 = 7n/160.

Suppose that a block an hold either three records, 10 key-pointer pairs, or fifty pointers. Let there be secondary indexes on studioName and Year of the relation Movie, as in example 13.16. (Example 13.16: Movie(title, year, length, inColor, studioName, producerC#)). Suppose there are 51 Disney movies, and 101 movies made in 1995. Only one of these movies was a disney movie. Compute the number of disk I/O's needed to answer the query (SELECT title FROM Movie WHERE studioName = 'Disney' AND year =1995;) if we:

Suppose that a block an hold either three records, 10 key-pointer pairs, or fifty pointers. Let there be secondary indexes on studioName and Year of the relation Movie, as in example 13.16. (Example 13.16: Movie(title, year, length, inColor, studioName, producerC#)). Suppose there are 51 Disney movies, and 101 movies made in 1995. Only one of these movies was a disney movie. Compute the number of disk I/O's needed to answer the query (SELECT title FROM Movie WHERE studioName = 'Disney' AND year =1995;) if we:

- Do not use buckets, use the index on studioName to get the poitners to Disney movies, retrieve them, and select those that were made in 1995. Assume no two Disney movie records are on the same block.
- Proceed as in b, but starting with the index on year. Assume no two movies of 1995 are on the same block.

**Answers:**

- 1 disk I/O to get the index (on studioName). 51 disk I/Os: need to retrieve each disney movie, dump into memory. Check each movie in memory and output ones that were made in 1995. Total 52.
- 1 disk I/O to get the index (on year). 101 disk I/Os: need to retrieve each disney movie, dump into memory. Check each movie in memory and output ones that were made by Disney. total 102.

**Problem 3 (Indexing; B+ tree, 10 points)**

**Problem 4 (Execution, 35 points)**

*Give one pass algorithms for each of the following join like operators:*

- R antisemijoin S, assuming R fits in memory
- R antisemijoin S, assuming S fits in memory

**Answer:**

- Read R into memory. For each tuple t in S, compare to tuples in memory (R). If t does not match any tuple in memory, output the tuple.
- Read S into memory. For each tuple t in R, compare to tuples in memory (S). If it does not match any tuple in memory, output the tuple.

*Suppse B(r) = B(s) =10.000. What value of M would we need to compute R join S using the nested loop algorithm with no more than a) 100,000, b) 25,000, ) 15,000 I/Os.*

**Answer:**

- Disk I/Os for nested loop algorithm are:

B(S) + (B(S)B(R))/(M-1)

10,000 + (100000000/(m-1)) = a, b, c

a: 1112.11 rounds up to 1113

b: 6667.666 rounds up to 6668

c: 20001

*How much memory do we need to use a two-pass, sort-based algorithm for relations of 10,000 blocsk each if the operation is a binary operation such as join?*

**Answer:**

- sqrt(B(R) + B(S))

sqrt (20,000)

141.42, rounded up is 142 blocks of memory.

*Suppse B(r) = B(s) =10000, and M = 1000. What is the number of disk I/O's required for a hybrid hash join?*

**Answer:**

- Disk I/O for a hybrid Hash Join:

3-(2M/B(S)) * (B(R)+B(S))

(3 - .2) * 20000 = 56000 I/Os.

**Problem 5 (Optimization, 35 points)**

Question 1, part c:

- Oc=20(Y)

T(Y)/V(Y,c)

300/50

6

Question 1, Part d:

- If A is an attribute of R but not of S, then V(R join S, A) = V(R,A). This is the case in part d: Relation Y contains C but relation Z does not, resulting in V(Y join Z, c=20) is the same as V(Y,c), or 50.

Question 1, Part e:

- W x Y = T(W) * T(Y) = 30,000

Question 1, part f:

- If S = Oa<c(R), our estimate for T(S) is T(R)/3

T(Z)/3 = 400/3, approxmiately 134 tuples.

Question 2:

W | X | Y | Z | |

size: | 100 | 200 | 300 | 400 |

cost: | 0 | 0 | 0 | 0 |

Best Plan: | W | X | Y | Z |

W,X | W,Y | W,Z | X,Y | X,Z | Y,Z | |

Size | 334 | 30000 | 40000 | 600 | 80000 | 2400 |

cost: | 0 | 0 | 0 | 0 | 0 | 0 |

Best Plan: | W,X | W,Y | W,Z | X,Y | Z,X | Z,Y |

W,X,Y | W,X,Z | W,Y,Z | X,Y,Z | |

size: | 1670 | 2227 | 4800 | 2400 |

cost: | 334 | 334 | 2400 | 600 |

Best Plan: | (W,X), Y | (W,X),Z | (Y,Z),W | (X,Y),Z |

Relation | Cost |

((W,X),Y),Z | 2004 |

((W,X),Z),Y | 2561 |

((Y,Z),W),Z | 7200 |

((X,Y),Z),W | 3000 |

Best plan is W join X join Y join Z.

**Problem 6 (Transaction Management, 30 points)**

*The following is a sequence of undo/redo log records written by two transactions T and U:*

<START T>

<T,A,10,11>

<Start U>

<U,B,20,21>

<T,C,30,31>

<U,D,40,41>

<Commit U>

<T,E,50,51>

<Commit T>

Describe the action of the crovery manager including changes to both disk and the log, if there is a crash and the last log record to appear on disk is <Commit U>.

<START T>

<T,A,10,11>

<Start U>

<U,B,20,21>

<T,C,30,31>

<U,D,40,41>

<Commit U>

<T,E,50,51>

<Commit T>

Describe the action of the crovery manager including changes to both disk and the log, if there is a crash and the last log record to appear on disk is <Commit U>.

**Answer:**

- First redo all commited transactions.

<U,B,20,21>

<U,D,40,41>

Output B,D

Then undo all non-commited transactions.

<T,E,51,50>

<T,C,31,30>

<T,A,11,10>

Output E,C,A