Introduction to Relational Databases in SQL

Datacamp course notes on relational db in SQL.

Basic Concepts

Advantages of relational database:

  1. real-life entities become tables. Each table will only contain information/attributes that are directed related to that entity.
  2. reduced redundancy. No need to repeat the same person’s name multiple times if he shows up in multiple entities.
  3. data integrity by relationships. one - one, one - many, many - many

Dealing with Redundancy

Sometimes we may detect redundancy in one table, since it contains multiple entities and repetitive information. We can seperate those information into multiple tables by creating new tables, and then migrate the data into them.

  • Create Tables

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    CREATE TABLE table_name (
    column_a data_type,
    column_b data_type,
    column_c data_type
    );

    /*EXAMPLE*/
    CREATE TABLE weather (
    clouds text,
    temperature numeric,
    weather_station char(5)
    )
  • Add Columns
    We can add columns into table.

    1
    2
    3
    4
    5
    6
    ALTER TABLE table_name
    ADD COLUMN column_name data_type;

    /*EXAMPLE*/
    ALTER TABLE professors
    ADD COLUMN university_shortname text;
  • Change Column Name

    1
    2
    ALTER TABLE table_name
    RENAME COLUMN old_name TO new_name;
  • Drop a Column
    Sometimes we would find that the rest of the columns can already uniquely identify a row, so the column becomes redundant. We can drop that by the following operation.

    1
    2
    ALTER TABLE table_name
    DROP COLUMN column_name;
  • Migrate Data
    We can migrate data from one table to another.

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    /*university_professors --> organizations*/
    INSERT INTO organizations
    SELECT DISTINCT organization, organization_sector
    /*The above DISTINCT applies to all listed columns*/
    FROM university_professors;

    /*normal case below*/
    INSERT INTO table_name (column_a, column_b)
    VALUES ("value_a", "value_b");

    /*EXAMPLE*/
    INSERT INTO transactions (transaction_date, amount, fee)
    VALUES ('2018-24-09', 5454, '30');
  • Drop a Table
    After migrating the data, we can delete the redundant table by the following operation

    1
    DROP TABLE table_name;

Constraints

With the specified constraints, we can achieve better data quality.

  • Types

    1. Integrity constraints, e.g. data types on columns
      Implementing such a type on a column would disallow anything that doesn’t fit the structure specified.
    2. Key constraints, e.g. primary keys
    3. Referential integrity constraints, enforced through foreign keys
  • Why constraints?

    • Constraints give the data structure
    • Constraints help with consistency, and thus data quality
    • Data quality is a business advantage/ data science prerequisite
    • Enforcing is difficult, but PostgreSQL helps

Dealing with Data Types

  • Basics

    1. Enforced on columns (i.e. attributes)
    2. Define the so-called “domain” of a column
    3. Define what operations are possible
    4. Enfore consistent storage of values, e.g. the number of characters
  • Most common types

    • text: character strings of any length
    • varchar [(x)]: a maximum of n characters
    • char [(x)]: a fixed-length string of n characters
    • boolean: can only take three states, e.g. TRUE, FALSE and NULL (unknown)
  • Alter types after table creation

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    ALTER TABLE students
    ALTER COLUMN name
    TYPE varchar(128);

    ALTER TABLE students
    ALTER COLUMN average_grade
    TYPE integer
    -- Round to the nearest integer,
    -- so that it turns 5.54 into 6, not 5, before altering the type
    USING ROUND(average_grade);

    -- Truncate the string before converting the varchar to shorter length
    ALTER TABLE table_name
    ALTER COLUMN column_name
    TYPE varchar(x)
    USING SUBSTRING(column_name FROM 1 FOR x)
  • Using CAST() to convert data type on-the-fly

    1
    2
    SELECT temperature * CAST(wind_speed AS integer) AS wind_chill
    FROM weather;

Not NULL Constraint

  • Basics:
    • Disallow NULL values in a certain column
    • Must hold true for the current state
    • Must hold true for any future state

Two null values must not have the same meaning.

When creating a table:

1
2
3
4
5
6
CREATE TABLE students (
ssn integer not null,
lastname varchar(64) not null,
home_phone integer,
office_phone integer
);

When changing the constraint after the table has been created:

1
2
3
4
5
6
7
8
9
-- add constraint
ALTER TABLE students
ALTER COLUMN home_phone
SET NOT NULL;

-- remove constraint
ALTER TABLE students
ALTER COLUMN ssn
DROP NOT NULL;

Unique Constraint

  • Basics
    • Disallow duplicate values in a column
    • Must hold true for the current state
    • Must hold true for any future state

When creating a table:

1
2
3
CREATE TABLE table_name (
column_name UNIQUE
);

When adding the constraint after the table has been created:

1
2
3
ALTER TABLE table_name
ADD CONSTRAINT some_name UNIQUE(column_name);
-- some_name is the name of the constraint, which is arbitrary by you, but has to be set

Keys

What is a Key?

  • Attribute(s) that identify a record uniquely
  • As long as attributes can be removed: superkey (the combination of attributes can still uniquely identify each row after removing one or more attributes)
  • If no more attributes can be removed: minimal superkey or key (the combination of attributes can no longer uniquely identify each row after removing any attributes)
  • There can be multiple keys that satisfy the above minimal superkey requirement, and thus are all called candidate keys.

Primary Key

What is a Primary Key?

  • One primary key per database table, chosen from candidate keys
  • Uniquely identifies records, e.g. for referencing in other tables
  • Unique and not-null constraints both apply
  • Primary keys are time-invariant: choose columns wisely! They must hold for both the current data and future data.


  • Specifying Primary Key
    When creating a table:

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    -- set one attribute as primary key
    CREATE TABLE products (
    product_no integer PRIMARY KEY,
    name text,
    price numeric
    );

    -- set more than one attributes as primary key
    CREATE TABLE example (
    a integer,
    b integer,
    c integer,
    PRIMARY KEY (a, c)
    );

    When specifying primary key after creating the table

    1
    2
    3
    ALTER TABLE table_name
    ADD CONSTRAINT some_name PRIMARY KEY (column_name)
    -- have to give the constraint "some_name"

What is a Surrogate Key?
It is an artificial primary key, and we create it for the following reasons:

  1. Primary keys should be built from as few columns as possible
  2. Primary keys should never change over time
  • Ways to create a surrogate key

    1. Adding a surrogate key with serial data type
      Add a column called ‘id’, which is a series on incrementing numbers so that it can uniquely identify each row. (like row number)

      1
      2
      3
      4
      5
      6
      7
      8
      9
      10
      11
      12
      ALTER TABLE cars
      ADD COLUMN id serial PRIMARY KEY;

      -- Whenever you add a new record to the table,
      -- it will automatically get a number that does not exist yet
      INSERT INTO cars
      VALUES ('Volkswagen', 'Blitz', 'black'); -- insert another row and generates a new id with it

      -- If you try to specify an ID that already exists,
      -- the primary key constraint will prevent you from doing so
      INSERT INTO cars
      VALUES ('Opel', 'Astra', 'green', 1); -- throws error since 1 already exists
    2. Combining several columns into a new one

      1
      2
      3
      4
      5
      6
      7
      8
      9
      10
      11
      -- create a new column
      ALTER TABLE table_name
      ADD COLUMN column_c varchar(256);

      -- concat two existing columns together and set the result into the new column
      UPDATE table_name
      SET column_c = CONCAT(column_a, column_b);

      -- set the new column as surrogate primary key
      ALTER TABLE table_name
      ADD CONSTRAINT pk PRIMARY KEY (column_c);

Foreign Key

What is a Foregin Key?

  • A foreign key (FK) points to the primary key (PK) of another table
  • Domain of FK must be equal to domain of PK (same data type)
  • Each value of FK must exist in PK of the other table (FK constraint or “referential integrity“)
  • FKs are not actual keys (can have duplicates and NULL)

When creating table with foreign key:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
-- Create manufacturers table
CREATE TABLE manufacturers (
name varchar(255) PRIMARY KEY
);

INSERT INTO manufacturers
VALUES ('Ford'), ('VW'), ('GM');

-- create car table, and reference the columne manufacturer_name
-- to manufacturers table's name column,
-- which is manufacturer table's primary key
CREATE TABLE cars (
model varchar(255) PRIMARY KEY,
manufacturer_name integer REFERENCES manufacturers (name)
);

-- From now on, only cars with valid and existing manufacturers
-- may be entered into the car table,
-- due to the foreign key constraint.
INSERT INTO cars
VALUES ('Ranger', 'Ford'), ('Beetle', 'VW')

When specifying foreign keys to existing tables:

1
2
ALTER TABLE a 
ADD CONSTRAINT a_fkey FOREIGN KEY (b_id) REFERENCES b (id);

N:M relationship

How to implement this kind of relationship?

  • Create a table
  • Add foreign keys for every connected table
  • Add additional attributes
  • No primary key! Can possibly combine several attributes together to have some form of unique constraint in that table.
1
2
3
4
5
CREATE TABLE affiliations (
professor_id integer REFERENCES professors (id),
organization_id varchar(256) REFERENCES organization (id),
function varchar(256)
);

Referential Integrity

What is Referential Integrity?

  • A record referencing another table must refer to an existing record in that table.
  • Specified between two tables
  • Enforced through foreign keys

Referential Integrity Violations
Referential integrity from table A to table B is violated… (A has a foreign key that references from B)

  • …if a record in table B that is referenced from a record in table A is deleted.
  • …if a record in table A referencing a non-existing record from table B is inserted.
  • Foreign keys prevent violations!

Dealing with violations
ON DELETE…

  • …NO ACTION: Throw an error

    1
    2
    3
    4
    5
    6
    7
    CREATE TABLE a (
    id integer PRIMARY KEY,
    column_a varchar(64),
    ...,
    b_id integer REFERENCES b (id) ON DELETE NO ACTION
    );
    -- will throw an error if violation happens
  • …CASCADE: Delete all referencing records

    1
    2
    3
    4
    5
    6
    7
    CREATE TABLE a (
    id integer PRIMARY KEY,
    column_a varchar(64),
    ...,
    b_id integer REFERENCES b (id) ON DELETE CASCADE
    );
    -- cascade the deletion
  • …RESTRICT: Throw an error

  • …SET NULL: Set the referencing column to NULL
  • …SET DEFAULT: Set the referencing column to its default value

To change the default behavior ON DELETE NO ACTION, we have to delete the key constraint and then add a new one with a different ON DELETE behavior.

For deleting constraints, though, we need to know their name. This information is also stored in information_schema.

1
2
3
4
5
6
7
8
9
10
11
12
-- Identify the correct constraint name
SELECT constraint_name, table_name, constraint_type
FROM information_schema.table_constraints
WHERE constraint_type = 'FOREIGN KEY';

-- Drop the right foreign key constraint
ALTER TABLE affiliations
DROP CONSTRAINT affiliations_organization_id_fkey;

-- Add a new foreign key constraint from affiliations to organizations which cascades deletion
ALTER TABLE affiliations
ADD CONSTRAINT affiliations_organization_id_fkey FOREIGN KEY (organization_id) REFERENCES organizations (id) ON DELETE CASCADE;