
Commit f91a90f

readme finished
1 parent 93e2e05 commit f91a90f


4 files changed

+216
-1
lines changed

README.md

Lines changed: 216 additions & 1 deletion
@@ -1 +1,216 @@
1-
# CSC443_A2
1+
# Relational Data layout on disk
2+
We implemented a library of functions to store and maintain relational data on disk using the _heap file_ data structure. We implemented and experimented with two file formats, corresponding to a _row store_ and a _column store_.
3+
4+
Collaborated with [Playjasb2](https://github.com/Playjasb2).
5+
6+
----
7+
### Table of Contents
8+
1. [Record Serialization](#1-record-serialization)
9+
2. [Page layout](#2-page-layout)
10+
3. [Heap file](#3-heap-file)
11+
4. [Column Store](#4-column-store)
12+
13+
### 1 Record Serialization
14+
For record serialization, we assume that a record is a map from attribute names to values. The attribute names are stored as part of the schema, so they are not stored in the serialized record.
15+
16+
This means we can abstract a record as a tuple of values.
17+
18+
```C++
19+
#include <vector>
20+
typedef const char* V;
21+
typedef std::vector<V> Record;
22+
```
23+
24+
We implemented serialization of fixed-length records with the following functions:
25+
```C++
26+
/**
27+
* Compute the number of bytes required to serialize record
28+
*/
29+
int fixed_len_sizeof(Record *record);
30+
31+
/**
32+
* Serialize the record to a byte array to be stored in buf.
33+
*/
34+
void fixed_len_write(Record *record, void *buf);
35+
```
36+
37+
and a deserialization function as follows:
38+
```c++
39+
/**
40+
* Deserializes `size` bytes from the buffer, `buf`, and
41+
* stores the record in `record`.
42+
*/
43+
void fixed_len_read(void *buf, int size, Record *record);
44+
```
45+
46+
_We assumed that the schema contains a single table with 100 attributes of 10 bytes each, so every record in the table has the same fixed length (1000 bytes)._
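
To make the fixed-length format concrete, here is a minimal sketch of how the three functions could behave under the 10-bytes-per-attribute assumption. It is an illustration only and not necessarily the implementation in this repository; the constant `ATTRIBUTE_SIZE`, the zero-padding of short values, and the heap-allocated strings returned by `fixed_len_read` are assumptions made for the example.

```c++
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

typedef const char* V;
typedef std::vector<V> Record;

// Assumption taken from the schema note: every attribute occupies 10 bytes.
const int ATTRIBUTE_SIZE = 10;

// Bytes needed for a record: one fixed-width field per attribute.
int fixed_len_sizeof(Record *record) {
    return static_cast<int>(record->size()) * ATTRIBUTE_SIZE;
}

// Copy each attribute into its own 10-byte field.
void fixed_len_write(Record *record, void *buf) {
    char *out = static_cast<char *>(buf);
    for (std::size_t i = 0; i < record->size(); ++i) {
        // strncpy pads the rest of the field with '\0' when the value is short.
        std::strncpy(out + i * ATTRIBUTE_SIZE, (*record)[i], ATTRIBUTE_SIZE);
    }
}

// Rebuild a record from `size` bytes. Each value is copied into a newly
// allocated, NUL-terminated string owned by the caller (a simplification).
void fixed_len_read(void *buf, int size, Record *record) {
    assert(size % ATTRIBUTE_SIZE == 0);
    const char *in = static_cast<const char *>(buf);
    record->clear();
    for (int offset = 0; offset < size; offset += ATTRIBUTE_SIZE) {
        char *value = new char[ATTRIBUTE_SIZE + 1];
        std::memcpy(value, in + offset, ATTRIBUTE_SIZE);
        value[ATTRIBUTE_SIZE] = '\0';
        record->push_back(value);
    }
}
```

Because the schema fixes the field width, no per-field length or delimiter needs to be stored; the byte offset of attribute `i` is simply `i * ATTRIBUTE_SIZE`.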
47+
48+
### 2 Page layout
49+
50+
As we know, it is critical for all disk I/O to be done in units of blocks, known as _pages_. In this section, we experiment with storing serialized records in pages.
51+
52+
#### Storing fixed length records in pages
53+
We used a slotted directory based page layout to store fixed length records.
54+
```c++
55+
typedef struct {
56+
void *data;
57+
int page_size;
58+
int slot_size;
59+
} Page;
60+
```
61+
62+
Functions implemented:
63+
```c++
64+
/**
65+
* Initializes a page using the given slot size
66+
*/
67+
void init_fixed_len_page(Page *page, int page_size, int slot_size);
68+
69+
/**
70+
* Calculates the maximal number of records that fit in a page
71+
*/
72+
int fixed_len_page_capacity(Page *page);
73+
74+
/**
75+
* Calculate the free space (number of free slots) in the page
76+
*/
77+
int fixed_len_page_freeslots(Page *page);
78+
79+
/**
80+
* Add a record to the page
81+
* Returns:
82+
* record slot offset if successful,
83+
* -1 if unsuccessful (page full)
84+
*/
85+
int add_fixed_len_page(Page *page, Record *r);
86+
87+
/**
88+
* Write a record into a given slot.
89+
*/
90+
void write_fixed_len_page(Page *page, int slot, Record *r);
91+
92+
/**
93+
* Read a record from the page from a given slot.
94+
*/
95+
void read_fixed_len_page(Page *page, int slot, Record *r);
96+
```
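
The capacity of a page depends on how the slot directory is laid out, which the declarations above do not specify. The sketch below assumes one possible layout: record slots start at the front of the page and a bitmap with one bit per slot sits at the end. The helpers `bitmap_bytes` and `slot_bitmap` are hypothetical names introduced for this example.

```c++
#include <cassert>
#include <cstdlib>

typedef struct {
    void *data;
    int page_size;
    int slot_size;
} Page;

// Hypothetical helper: bytes needed for a bitmap with one bit per slot.
static int bitmap_bytes(int slots) { return (slots + 7) / 8; }

// Largest slot count such that the slots plus their bitmap fit in the page.
int fixed_len_page_capacity(Page *page) {
    int slots = page->page_size / page->slot_size;
    while (slots > 0 &&
           slots * page->slot_size + bitmap_bytes(slots) > page->page_size) {
        --slots;
    }
    return slots;
}

// Allocate a zeroed page; a zero bit in the bitmap means "slot is free".
void init_fixed_len_page(Page *page, int page_size, int slot_size) {
    assert(page_size > 0 && slot_size > 0);
    page->page_size = page_size;
    page->slot_size = slot_size;
    page->data = std::calloc(page_size, 1);
    assert(page->data != nullptr);
}

// Hypothetical helper: start of the bitmap stored at the end of the page.
static unsigned char *slot_bitmap(Page *page) {
    int capacity = fixed_len_page_capacity(page);
    return static_cast<unsigned char *>(page->data) +
           page->page_size - bitmap_bytes(capacity);
}

// Count the slots whose bitmap bit is still clear.
int fixed_len_page_freeslots(Page *page) {
    int capacity = fixed_len_page_capacity(page);
    unsigned char *bitmap = slot_bitmap(page);
    int free_slots = 0;
    for (int i = 0; i < capacity; ++i) {
        if (!(bitmap[i / 8] & (1u << (i % 8)))) ++free_slots;
    }
    return free_slots;
}
```

Under this assumed layout, a 4096-byte page with 1000-byte slots holds 4 records (4000 bytes of data plus 1 byte of bitmap), while a 1000-byte page cannot hold any, because the bitmap byte no longer fits.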
97+
98+
### 3 Heap file
99+
100+
With the functions above in place, we are ready to create and maintain _heap files_.
101+
A heap file is just a paginated file. Each page stores a series of records.
102+
```c++
103+
typedef struct {
104+
FILE *file_ptr;
105+
int page_size;
106+
} Heapfile;
107+
```
108+
109+
We assume the following way to assign unique identifiers to records in the heap file:
110+
- Every page `p` has an entry in the heap directory of `(page_offset, freespace)`. The page ID of `p` can be the index of its entry in the directory. We call this: `ID(p)`.
111+
- Every record `r` is stored at some slot in some page `p`. The record ID, `ID(r)`, is the concatenation of `ID(p)` and the slot index in `p`.
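
One way to realize the concatenation described above is to pack the page ID and the slot index into a single integer. This is only a sketch: the `RecordID` struct, the `PageID` typedef as `int`, the 32/32-bit split, and the helper names are assumptions for illustration, not the encoding used by the tools in this repository.

```c++
#include <cassert>
#include <cstdint>

typedef int PageID;  // assumed definition; the README uses PageID without defining it

// Hypothetical record identifier: a page ID plus a slot index within that page.
struct RecordID {
    PageID page_id;
    int slot;
};

// Pack ID(p) into the high 32 bits and the slot index into the low 32 bits.
std::uint64_t encode_record_id(RecordID rid) {
    assert(rid.page_id >= 0 && rid.slot >= 0);
    return (static_cast<std::uint64_t>(rid.page_id) << 32) |
           static_cast<std::uint32_t>(rid.slot);
}

// Split a packed identifier back into its page ID and slot index.
RecordID decode_record_id(std::uint64_t packed) {
    RecordID rid;
    rid.page_id = static_cast<PageID>(packed >> 32);
    rid.slot = static_cast<int>(packed & 0xFFFFFFFFu);
    return rid;
}
```

With such an encoding, locating a record takes one directory lookup for the page and one offset computation (`slot * slot_size`) inside it.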
112+
113+
#### Functions
114+
115+
We implemented a directory based heap file in which we have directory pages (organized as a linked list), and data pages that store records.
116+
```c++
117+
/**
118+
* Initialize a heapfile to use the file and page size given.
119+
*/
120+
void init_heapfile(Heapfile *heapfile, int page_size, FILE *file);
121+
122+
/**
123+
* Allocate another page in the heapfile. This grows the file by a page.
124+
*/
125+
PageID alloc_page(Heapfile *heapfile);
126+
127+
/**
128+
* Read a page into memory
129+
*/
130+
void read_page(Heapfile *heapfile, PageID pid, Page *page);
131+
132+
/**
133+
* Write a page from memory to disk
134+
*/
135+
void write_page(Page *page, Heapfile *heapfile, PageID pid);
136+
137+
```
138+
The central functionality of a heap file is the enumeration of records. We implemented the following record iterator class:
139+
```c++
140+
class RecordIterator {
141+
public:
142+
RecordIterator(Heapfile *heapfile);
143+
Record next();
144+
bool hasNext();
145+
};
146+
```
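
The core of such an iterator is skipping free slots and moving on to the next page when the current one is exhausted. The sketch below shows that traversal over a hypothetical in-memory stand-in (`SlotPage`, where an empty string marks a free slot) rather than over a real `Heapfile`, so the class name `SlotIterator` and its types are assumptions for illustration only.

```c++
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a data page: one entry per slot,
// where an empty string marks a free slot.
typedef std::vector<std::string> SlotPage;

class SlotIterator {
public:
    explicit SlotIterator(const std::vector<SlotPage> *pages)
        : pages_(pages), page_(0), slot_(0) {
        skipFreeSlots();
    }

    // True while the cursor points at an occupied slot.
    bool hasNext() const { return page_ < pages_->size(); }

    // Return the current record and advance to the next occupied slot.
    std::string next() {
        assert(hasNext());
        std::string value = (*pages_)[page_][slot_];
        ++slot_;
        skipFreeSlots();
        return value;
    }

private:
    // Move (page_, slot_) to the next occupied slot, or past the last page.
    void skipFreeSlots() {
        while (page_ < pages_->size()) {
            const SlotPage &current = (*pages_)[page_];
            while (slot_ < current.size() && current[slot_].empty()) ++slot_;
            if (slot_ < current.size()) return;
            ++page_;
            slot_ = 0;
        }
    }

    const std::vector<SlotPage> *pages_;
    std::size_t page_;
    std::size_t slot_;
};
```

A disk-backed version would follow the same pattern, with the page lookup replaced by a walk over the directory pages and a `read_page` call for each data page.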
147+
148+
#### File Operations
149+
Operations available are:
150+
151+
```bash
152+
# Build heap file from CSV file
153+
$ csv2heapfile <csv_file> <heapfile> <page_size>
154+
155+
# Print out all records in a heap file
156+
$ scan <heapfile> <page_size>
157+
158+
# Insert all records in the CSV file to a heap file
159+
$ insert <heapfile> <csv_file> <page_size>
160+
161+
# Update one attribute of a single record in the heap file given its record ID
162+
# <attribute_id> is the index of the attribute to be updated (e.g. 0 for the first attribute, 1 for the second attribute, etc.)
163+
# <new_value> will have the same fixed length (10 bytes)
164+
$ update <heapfile> <record_id> <attribute_id> <new_value> <page_size>
165+
166+
# Delete a single record in the heap file given its record ID
167+
$ delete <heapfile> <record_id> <page_size>
168+
```
169+
170+
### 4 Column Store
171+
A column-oriented DBMS (or column store) stores all values of a single attribute (column) together, rather than storing all values of a single record together. Hence, the main abstraction it uses is columns of data rather than rows of data. Most DBMSs are row-oriented, but depending on the workload, a column-oriented DBMS may provide better performance. Below is an illustration of a column store (left) and a row store (right).
172+
![](img/col.png)
173+
174+
Consider a query that returns all values of the attribute Name. In a column store, you only need to retrieve the Name values from the file. In contrast, with row-oriented storage, finding the Name value of a record necessarily retrieves all attribute values in that record, so returning all values of the Name attribute means reading the entire table. The drawback of a column store is that you need to do extra work to reconstruct a tuple. Suppose you wish to return the Name and Salary values from a table. To be able to reassemble records, a column-oriented DBMS stores the tuple id (record id) with each value in a column.
175+
176+
Column-oriented storage has advantages for queries that only access some of the attributes of a table. However, insertion and deletion may now require multiple page accesses, as each tuple is no longer stored on a single page.
177+
178+
We implemented a simplified version of a column store using the existing heap file implementation. Our implementation has a separate heap file for each column of a table.
179+
![](img/heap_Col.png)
180+
181+
We use the same fixed table schema (100 attributes, 10 bytes each). For each attribute, we created a separate heap file, named after the attribute id. We placed all attribute heap files in a single file directory. This is a simplification that makes it easier to keep track of which files belong to a relation. Think about the limitations of the simplification.
182+
183+
Tuple reconstruction: Different attributes of a tuple are stored in different heap files, so we need to reconstruct the tuple (or part of it) to get the result of a query. We can store the tuple id with each field. Two attribute values have the same tuple id if they belong to the same tuple.
184+
![](img/ex.png)
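
Tuple reconstruction by tuple id can be sketched as a merge of two columns. The example below works on a hypothetical in-memory view of two column heap files (`ColumnEntry`, `Column`, and `reconstruct_pairs` are names made up for illustration) and assumes each column is sorted by tuple id in ascending order.

```c++
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory view of one entry in a column heap file.
struct ColumnEntry {
    int tuple_id;
    std::string value;
};
typedef std::vector<ColumnEntry> Column;

// Pair up the values of two columns that share a tuple id.
// Assumption: both columns are sorted by tuple_id in ascending order.
std::vector<std::pair<std::string, std::string> >
reconstruct_pairs(const Column &a, const Column &b) {
    std::vector<std::pair<std::string, std::string> > rows;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i].tuple_id < b[j].tuple_id) {
            ++i;  // no matching value in b for this tuple id
        } else if (a[i].tuple_id > b[j].tuple_id) {
            ++j;  // no matching value in a for this tuple id
        } else {
            rows.push_back(std::make_pair(a[i].value, b[j].value));
            ++i;
            ++j;
        }
    }
    return rows;
}
```

If both heap files are filled in the same order as the CSV rows, the sorted-by-tuple-id assumption holds naturally and the merge needs only one pass over each column.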
185+
186+
#### Column-Oriented file operations
187+
188+
We implemented the following operations:
189+
```sh
190+
# Build a column store from CSV file
191+
# <colstore_name> should be a file directory to store the heap files
192+
$ csv2colstore <csv_file> <colstore_name> <page_size>
193+
194+
# Select a single attribute from the column store where parameter
195+
# <attribute_id> is the index of the attribute to be projected and returned (e.g. 0 for the first attribute, 1 for the second attribute, etc.)
196+
# the value of the attribute must be between <start> and <end>
197+
$ select2 <colstore_name> <attribute_id> <start> <end> <page_size>
198+
```
199+
This performs the following parameterized SQL query. Note that the selection predicate is on the same attribute that is returned by the query:
200+
```sql
201+
SELECT SUBSTRING(A, 1, 5) FROM T
202+
WHERE A >= start AND A <= end
203+
```
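
The logic of this query can be sketched over the values of a single column already loaded into memory. The function name `select_range_prefix` is hypothetical, and the example assumes that values are compared as plain byte strings in lexicographic order; reading the values from the attribute's heap file is left out.

```c++
#include <cstddef>
#include <string>
#include <vector>

// Return the first five characters of every value in [start, end].
// Assumption: values compare lexicographically as byte strings.
std::vector<std::string> select_range_prefix(
        const std::vector<std::string> &column,
        const std::string &start,
        const std::string &end) {
    std::vector<std::string> result;
    for (std::size_t i = 0; i < column.size(); ++i) {
        const std::string &value = column[i];
        if (value >= start && value <= end) {
            // SUBSTRING(A, 1, 5) is 1-based, so it maps to substr(0, 5).
            result.push_back(value.substr(0, 5));
        }
    }
    return result;
}
```

Because the predicate and the projection use the same attribute, only one column heap file has to be scanned and no tuple reconstruction is needed.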
204+
205+
Another function
206+
```sh
207+
# Select only a single attribute from the column store where parameter
208+
# <return_attribute_id> (B) is the index of an attribute to be projected and returned
209+
# <attribute_id> (A) is the index of (a possibly different) attribute whose value must be between <start> and <end>
210+
$ select3 <colstore_name> <attribute_id> <return_attribute_id> <start> <end> <page_size>
211+
```
212+
performs the following parameterized SQL query:
213+
```sql
214+
SELECT SUBSTRING(B, 1, 5) FROM T
215+
WHERE A >= start AND A <= end
216+
```

img/col.png (56.3 KB)

img/ex.png (83.4 KB)

img/heap_Col.png (85.8 KB)
