Version 4, 2021-01-27
https://github.com/LinuxAtDuke/intro-to-databases
Instructor
Andy Ingham (andy.ingham AT duke.edu)
Table of Contents
- Unit 1: Definitions, uses, and kinds
- Unit 2: Schema, data dictionary
- Unit 3: Architectures
- Unit 4: Data integrity, security, confidentiality
- Unit 5: Primer for the Intro to MySQL class (or for practical playing)
-
What are data?
- from least to most abstract: data > information > knowledge > wisdom
- https://en.wikipedia.org/wiki/Data
- "Research data is defined as the recorded factual material commonly accepted in the scientific community as necessary to validate research findings..." (OMB Circular 110 ; https://www.whitehouse.gov/sites/whitehouse.gov/files/omb/circulars/A110/2cfr215-0.pdf)
-
What is a "database"? What is a DBMS (database management system)?
-
Databases are primarily needed to structure and organize data so that:
- associations can be revealed (e.g., X is related to Y via shared attribute Z),
- patterns can be exposed (e.g., X is more prevalent than Y by magnitude Z),
- questions can be answered (e.g., X is the type of Y that exhibits Z),
- a repository of information can be stored (by retaining associated data elements)
-
Why aren't databases spreadsheets? Because, GENERALLY SPEAKING, databases ...
- are more powerful (can organize data in more complex ways) and, therefore, can provide better searching/discovery options
- provide better normalization (if designed well) and, therefore, are more performant as size increases ("scalable")
- provide more complex integrity checks (rules and constraints about data elements and associations)
- are more flexible (in how they can be configured)
- provide more granular options for things like access control and, therefore, better suited to shared usage by disparate groups
- provide more and better options for external integrations
- are more focused on searching/storing than tabulating/displaying
-
Kinds of databases ... there are quite a few (https://en.wikipedia.org/wiki/Database#Models ), but relational is probably still king
-
Relational (https://medium.com/@pocztarski/what-if-i-told-you-there-are-no-tables-in-relational-databases-13d31a2f9677 )
- PROS: highly normalizable (avoids data duplication); stable; mature; consistency after each transaction
- CONS: rigid
-
Object / document / NoSQL
- PROS: faster for "big" or distributed data; flexible
- CONS: less standardization for how to interact with data; requires high developer discipline; reasonable use cases are limited
-
GraphDBs
- PROS: powerful for specialized use cases (highly interconnected data)
- CONS: highly complex for both users and developers
-
Key-value stores:
- PROS: simple
- CONS: simple
-
Unit 2: Schema, data dictionary
-
Schema development is best done via an ER diagram and/or a whiteboard - consider these:
- what are the entities? (the "things" or "concepts" that form the basis of our data)
- what relationships do they have with one another?
- what are the important attributes of the entities?
- what are the data types and metadata (is NULL allowed? are there default values?) for those attributes?
- what will determine uniqueness in each table? (will the primary key be simple or compound?)
- what queries are users likely to run? (this will inform index creation)
- what indexes are needed? (to supplement the primary key)
-
Some (albeit simple and somewhat silly) examples:
- https://www.edrawsoft.com/templates/pdf/pet-store-er-diagram.pdf
- https://github.com/LinuxAtDuke/Intro-to-MySQL/blob/master/pet-store-example-schemaOwner.pdf
- https://github.com/LinuxAtDuke/Intro-to-MySQL/blob/master/pet-store-example-schemaPet.pdf
- https://github.com/LinuxAtDuke/Intro-to-MySQL/blob/master/pet-store-example-schemaPetClinic.pdf
- ESPECIALLY PROBLEMATIC: https://github.com/LinuxAtDuke/Intro-to-MySQL/blob/master/pet-store-example-schemaTreatments.pdf
- IN LIGHT OF THE ABOVE: https://dzone.com/articles/how-to-handle-a-many-to-many-relationship-in-datab
- https://github.com/LinuxAtDuke/Intro-to-MySQL/blob/master/pet-store-example-schemaPetStore.pdf
- https://www.safaribooksonline.com/library/view/learning-mysql/0596008643/ch04s04.html
- https://www.edrawsoft.com/templates/pdf/pet-store-er-diagram.pdf
-
A tutorial to help with schema development:
-
Fine-tuning of schema...
- referential integrity - data types consistent across linking fields (foreign keys) (https://en.wikipedia.org/wiki/Data_integrity )
- data types (https://dev.mysql.com/doc/refman/5.7/en/data-types.html) should be as prescriptive and compact as possible
- index creation should be done where needed, but not elsewhere
- index creation is always faster BEFORE data are loaded into the table
- verify that data are "reasonably" normalized (e.g., data generally de-duplicated)
-
Some examples
mysql>> describe LCL_genotypes;
Field Type Null Key Default Extra IID varchar(16) NO PRI NULL SNPpos varchar(512) NO PRI NULL rsID varchar(256) NO MUL NULL genotype varchar(512) NO NULL mysql>> describe phenotypes;
Field Type Null Key Default Extra LCL_ID varchar(16) NO PRI NULL phenotype varchar(128) NO PRI NULL phenotypic_value1 decimal(20,10) YES NULL phenotypic_value2 decimal(20,10) YES NULL phenotypic_value3 decimal(20,10) YES NULL phenotypic_mean decimal(20,10) YES NULL mysql>> describe snp;
Field Type Null Key Default Extra rsID varchar(256) NO PRI NULL Chromosome tinyint(3) unsigned NO NULL Position int(10) unsigned NO NULL Allele1 varchar(128) NO NULL Allele2 varchar(128) NO NULL DistanceToNearGene varchar(32) NO NULL Gene varchar(32) NO NULL SNPtype varchar(64) NO NULL
- suitability (or, So which database to choose? (SQLite vs Postgres vs MySQL vs ORACLE vs MSSQL ...))
- functionality
- store and retrieve; CRUD (Create, Read, Update, Delete)
- partitioning, replication, MVCC (Multi-Version Concurrency Control)
- stored procedures, triggers, views
- advanced data types + searching (e.g., XPath)
- "Hot" backups, point-in-time recovery
- security
- how secure are the defaults? (accounts, passwords, settings, allowed connections)
- how straightforward is (or how comfortable are you with) the DB administration?
- are patches regularly released? (remembering that updates are only useful if ACTUALLY APPLIED in a timely fashion)
- usability
- CLI or GUI
- documentation / community-support versus paid-support
- integration
- base OS may dictate options for add-ons or helper apps
- middleware / ODBC (Open Database Connectivity)
- reporting tools
- cost
- up-front (hardware, licensing, life-cycle capital costs)
- pay-as-you-go (cloud options)
- hosting / future development (bug fixes+enhancements+security)
- scalability (DBMS as well as underlying environments)
- storage needs
- transaction rates
- functionality
- integrity: ACID (Atomicity, Consistency, Isolation, and Durability) (https://en.wikipedia.org/wiki/ACID)
- Important for some obvious reasons:
- scientific validity
- potential basis for future work or discoveries
- provide confidence in information
- Important for some obvious reasons:
- (information) security: CIA (Confidentiality, Integrity, Availability) (https://www.tylercybersecurity.com/blog/fundamental-objectives-of-information-security-the-cia-triad)
- Important for some obvious reasons:
- protect intellectual property
- maintain confidentiality of data (which can help protect privacy)
- guard against unauthorized updates or deletions (to protect data integrity)
- provide on-going access for authorized people
- Important for some obvious reasons:
AKA: Creating a personal Linux VM (Lab 0 from the Intro to MySQL class)
- A brief tangent to discuss architecture == https://github.com/LinuxAtDuke/Intro-to-MySQL/blob/master/client-server-architecture.pdf
- Using a web browser, go to https://vcm.duke.edu/
- Login using your Duke NetId.
- Select "Reserve a VM" (near the middle of the page)
- On the next screen, select the "Lamp Stack" link from the list under "Linux Apps"
- If you encounter a pop-up window about SSH keys (which displays if you do not have them set up for your netID), you may need to select the less secure option until you've done that step (which is outside the scope of this class).
- After agreeing to the Terms of Use, the VCM web page will display the name of your VM along with available usernames. You must first connect to the University VPN (if not "on campus"), THEN initiate an ssh session as the Admin User (vcm) -- do this via the "Terminal" app on your Mac or via "PuTTY" (available at https://www.chiark.greenend.org.uk/~sgtatham/putty/ ) on your Windows machine.
Example (after establishing a University VPN session, if off campus): ssh vcm@vcm-1473.vm.duke.edu [Answering "yes" to "Are you sure you want to continue connecting (yes/no)?" and then entering the password behind "View Password" when prompted]