Introduction to databases

Version 4, 2021-01-27

https://github.com/LinuxAtDuke/intro-to-databases

Instructor

Andy Ingham (andy.ingham AT duke.edu)

Table of Contents

Unit 1: Definitions, uses, and kinds
Unit 2: Schema, data dictionary
Unit 3: Architectures
Unit 4: Data integrity, security, confidentiality
Unit 5: Primer for the Intro to MySQL class (or for practical playing)

Unit 1: Definitions, uses, and kinds

What are data?
- from least to most abstract: data > information > knowledge > wisdom
- https://en.wikipedia.org/wiki/Data
- "Research data is defined as the recorded factual material commonly accepted in the scientific community as necessary to validate research findings..." (OMB Circular 110 ; https://www.whitehouse.gov/sites/whitehouse.gov/files/omb/circulars/A110/2cfr215-0.pdf)
What is a "database"? What is a DBMS (database management system)?
- https://en.wikipedia.org/wiki/Database
Databases are primarily needed to structure and organize data so that:
- associations can be revealed (e.g., X is related to Y via shared attribute Z),
- patterns can be exposed (e.g., X is more prevalent than Y by magnitude Z),
- questions can be answered (e.g., X is the type of Y that exhibits Z),
- a repository of information can be stored (by retaining associated data elements)
Why aren't databases spreadsheets? Because, GENERALLY SPEAKING, databases ...
- are more powerful (can organize data in more complex ways) and, therefore, can provide better searching/discovery options
- provide better normalization (if designed well) and, therefore, are more performant as size increases ("scalable")
- provide more complex integrity checks (rules and constraints about data elements and associations)
- are more flexible (in how they can be configured)
- provide more granular options for things like access control and, therefore, better suited to shared usage by disparate groups
- provide more and better options for external integrations
- are more focused on searching/storing than tabulating/displaying
Kinds of databases ... there are quite a few (https://en.wikipedia.org/wiki/Database#Models ), but relational is probably still king
- Relational (https://medium.com/@pocztarski/what-if-i-told-you-there-are-no-tables-in-relational-databases-13d31a2f9677 )
  - PROS: highly normalizable (avoids data duplication); stable; mature; consistency after each transaction
  - CONS: rigid
- Object / document / NoSQL
  - PROS: faster for "big" or distributed data; flexible
  - CONS: less standardization for how to interact with data; requires high developer discipline; reasonable use cases are limited
- GraphDBs
  - PROS: powerful for specialized use cases (highly interconnected data)
  - CONS: highly complex for both users and developers
- Key-value stores:
  - PROS: simple
  - CONS: simple

Unit 2: Schema, data dictionary

Schema development is best done via an ER diagram and/or a whiteboard - consider these:
- what are the entities? (the "things" or "concepts" that form the basis of our data)
- what relationships do they have with one another?
- what are the important attributes of the entities?
- what are the data types and metadata (is NULL allowed? are there default values?) for those attributes?
- what will determine uniqueness in each table? (will the primary key be simple or compound?)
- what queries are users likely to run? (this will inform index creation)
- what indexes are needed? (to supplement the primary key)
Some (albeit simple and somewhat silly) examples:
- https://www.edrawsoft.com/templates/pdf/pet-store-er-diagram.pdf
- https://www.safaribooksonline.com/library/view/learning-mysql/0596008643/ch04s04.html
A tutorial to help with schema development:
- http://www.anchor.com.au/hosting/support/CreatingAQuickMySQLRelationalDatabase
Fine-tuning of schema...
- referential integrity - data types consistent across linking fields (foreign keys) (https://en.wikipedia.org/wiki/Data_integrity )
- data types (https://dev.mysql.com/doc/refman/5.7/en/data-types.html) should be as prescriptive and compact as possible
- index creation should be done where needed, but not elsewhere
- index creation is always faster BEFORE data are loaded into the table
- verify that data are "reasonably" normalized (e.g., data generally de-duplicated)

Some examples

mysql>> describe LCL_genotypes;

Field	Type	Null	Key	Default
IID	varchar(16)	NO	PRI	NULL
SNPpos	varchar(512)	NO	PRI	NULL
rsID	varchar(256)	NO	MUL	NULL
genotype	varchar(512)	NO		NULL

mysql>> describe phenotypes;

Field	Type	Null	Key	Default
LCL_ID	varchar(16)	NO	PRI	NULL
phenotype	varchar(128)	NO	PRI	NULL
phenotypic_value1	decimal(20,10)	YES		NULL
phenotypic_value2	decimal(20,10)	YES		NULL
phenotypic_value3	decimal(20,10)	YES		NULL
phenotypic_mean	decimal(20,10)	YES		NULL

mysql>> describe snp;

Field	Type	Null	Key	Default
rsID	varchar(256)	NO	PRI	NULL
Chromosome	tinyint(3) unsigned	NO		NULL
Position	int(10) unsigned	NO		NULL
Allele1	varchar(128)	NO		NULL
Allele2	varchar(128)	NO		NULL
DistanceToNearGene	varchar(32)	NO		NULL
Gene	varchar(32)	NO		NULL
SNPtype	varchar(64)	NO		NULL

Unit 3: Architectures

suitability (or, So which database to choose? (SQLite vs Postgres vs MySQL vs ORACLE vs MSSQL ...))
- functionality
  - store and retrieve; CRUD (Create, Read, Update, Delete)
  - partitioning, replication, MVCC (Multi-Version Concurrency Control)
  - stored procedures, triggers, views
  - advanced data types + searching (e.g., XPath)
  - "Hot" backups, point-in-time recovery
- security
  - how secure are the defaults? (accounts, passwords, settings, allowed connections)
  - how straightforward is (or how comfortable are you with) the DB administration?
  - are patches regularly released? (remembering that updates are only useful if ACTUALLY APPLIED in a timely fashion)
- usability
  - CLI or GUI
  - documentation / community-support versus paid-support
- integration
  - base OS may dictate options for add-ons or helper apps
  - middleware / ODBC (Open Database Connectivity)
  - reporting tools
- cost
  - up-front (hardware, licensing, life-cycle capital costs)
  - pay-as-you-go (cloud options)
- hosting / future development (bug fixes+enhancements+security)
- scalability (DBMS as well as underlying environments)
  - storage needs
  - transaction rates

Unit 4: Data integrity, security, confidentiality

integrity: ACID (Atomicity, Consistency, Isolation, and Durability) (https://en.wikipedia.org/wiki/ACID)
- Important for some obvious reasons:
  - scientific validity
  - potential basis for future work or discoveries
  - provide confidence in information
(information) security: CIA (Confidentiality, Integrity, Availability) (https://www.tylercybersecurity.com/blog/fundamental-objectives-of-information-security-the-cia-triad)
- Important for some obvious reasons:
  - protect intellectual property
  - maintain confidentiality of data (which can help protect privacy)
  - guard against unauthorized updates or deletions (to protect data integrity)
  - provide on-going access for authorized people

Unit 5: Primer for the Intro to MySQL class (or for practical playing)

AKA: Creating a personal Linux VM (Lab 0 from the Intro to MySQL class)

A brief tangent to discuss architecture == https://github.com/LinuxAtDuke/Intro-to-MySQL/blob/master/client-server-architecture.pdf

Using a web browser, go to https://vcm.duke.edu/
Login using your Duke NetId.
Select "Reserve a VM" (near the middle of the page)
On the next screen, select the "Lamp Stack" link from the list under "Linux Apps"
If you encounter a pop-up window about SSH keys (which displays if you do not have them set up for your netID), you may need to select the less secure option until you've done that step (which is outside the scope of this class).
After agreeing to the Terms of Use, the VCM web page will display the name of your VM along with available usernames. You must first connect to the University VPN (if not "on campus"), THEN initiate an ssh session as the Admin User (vcm) -- do this via the "Terminal" app on your Mac or via "PuTTY" (available at https://www.chiark.greenend.org.uk/~sgtatham/putty/ ) on your Windows machine.

Example (after establishing a University VPN session, if off campus): ssh vcm@vcm-1473.vm.duke.edu [Answering "yes" to "Are you sure you want to continue connecting (yes/no)?" and then entering the password behind "View Password" when prompted]

Name		Name	Last commit message	Last commit date
Latest commit History 51 Commits
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Introduction to databases

Unit 1: Definitions, uses, and kinds

Unit 2: Schema, data dictionary

Unit 3: Architectures

Unit 4: Data integrity, security, confidentiality

Unit 5: Primer for the Intro to MySQL class (or for practical playing)

AKA: Creating a personal Linux VM (Lab 0 from the Intro to MySQL class)

About

Uh oh!

Releases

Packages

Uh oh!

LinuxAtDuke/intro-to-databases

Folders and files

Latest commit

History

Repository files navigation

Introduction to databases

Unit 1: Definitions, uses, and kinds

Unit 2: Schema, data dictionary

Unit 3: Architectures

Unit 4: Data integrity, security, confidentiality

Unit 5: Primer for the Intro to MySQL class (or for practical playing)

AKA: Creating a personal Linux VM (Lab 0 from the Intro to MySQL class)

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Packages