Skip to content

SQL Basics overview

Praveen Kumar Anwla edited this page Jan 2, 2025 · 17 revisions

Q1: What is the difference between Row_number, Rank, Dense_rank in SQL?

Ans:In SQL, ROW_NUMBER(), RANK(), and DENSE_RANK() are window functions that assign a unique number to each row within a result set based on a specified ordering. However, there are differences in how they handle ties (cases where multiple rows have the same values). Here's an explanation of each function:

  1. ROW_NUMBER():

    • Assigns a unique number to each row based on the specified order.
    • Does not handle ties; it will always assign a unique number to each row, even if values are the same.
    • The result is a monotonically increasing sequence without gaps.
    SELECT
        ROW_NUMBER() OVER (ORDER BY column_name) AS row_num,
        other_columns
    FROM your_table;
  2. RANK():

    • Assigns a unique rank to each distinct row based on the specified order.
    • Handles ties by assigning the same rank to rows with equal values, and the next rank is skipped.
    • If multiple rows have the same values, they receive the same rank, and the next rank is adjusted accordingly.
    SELECT
        RANK() OVER (ORDER BY column_name) AS rank_num,
        other_columns
    FROM your_table;
  3. DENSE_RANK():

    • Similar to RANK(), assigns a unique rank to each distinct row based on the specified order.
    • Handles ties by assigning the same rank to rows with equal values, but the next rank is not skipped.
    • If multiple rows have the same values, they receive the same rank, and the next rank is not adjusted.
    SELECT
        DENSE_RANK() OVER (ORDER BY column_name) AS dense_rank_num,
        other_columns
    FROM your_table;

Here's a simple example to illustrate the difference:

SELECT
    column_name,
    ROW_NUMBER() OVER (ORDER BY column_name) AS row_num,
    RANK() OVER (ORDER BY column_name) AS rank_num,
    DENSE_RANK() OVER (ORDER BY column_name) AS dense_rank_num
FROM your_table;

In this example, ROW_NUMBER() will always assign a unique number, RANK() will handle ties by skipping ranks, and DENSE_RANK() will handle ties without skipping ranks.

Q1.1: Please share real-time examples of scenarios where each of these could be valuable.

Ans: Let's consider real-time examples for each of these scenarios:

  1. ROW_NUMBER():

    • Example Scenario: Imagine you have a dataset of employee salaries, and you want to assign a unique employee ID to each record based on their salary in ascending order. You don't want any gaps in the employee IDs, and each employee should have a distinct identifier.

      SELECT
          ROW_NUMBER() OVER (ORDER BY salary) AS employee_id,
          employee_name,
          salary
      FROM employees;

    In this case, ROW_NUMBER() ensures that each employee gets a unique identifier based on their salary.

  2. RANK():

    • Example Scenario: Consider a sports competition where participants receive scores, and you want to rank them based on their scores. If multiple participants have the same score, you want to assign the same rank to them but skip the next rank.

      SELECT
          RANK() OVER (ORDER BY score DESC) AS participant_rank,
          participant_name,
          score
      FROM participants;

    RANK() is useful here because it handles ties by skipping the next rank.

  3. DENSE_RANK():

    • Example Scenario: Suppose you are working with a sales dataset, and you want to rank products based on their sales revenue. If multiple products have the same revenue, you want to assign the same rank to them without skipping any ranks.

      SELECT
          DENSE_RANK() OVER (ORDER BY sales_revenue DESC) AS product_rank,
          product_name,
          sales_revenue
      FROM products;

    DENSE_RANK() is appropriate in this scenario, as it handles ties without skipping ranks.

These examples illustrate how each function can be valuable in different real-world scenarios, depending on how you want to handle ties and whether you need gapless or sequential numbering.

Q2: Where do we use SELF JOIN?

Ans: A self-join is a SQL query in which a table is joined with itself. This can be useful in various scenarios when you need to compare rows within the same table. Here are some common use cases for self-joins:

  1. Hierarchical Data:

    • When you have hierarchical data stored in a table, such as an organizational chart or a category hierarchy, and you need to retrieve relationships between different levels.
    -- Example: Retrieving employees and their managers in an organizational hierarchy
    SELECT 
        E.EmployeeID,
        E.EmployeeName,
        E.ManagerID,
        M.EmployeeName AS ManagerName
    FROM 
        Employee E
    JOIN 
        Employee M ON E.ManagerID = M.EmployeeID;
  2. Tracking Relationships:

    • When you have relationships between entities within the same table, like tracking relationships between users in a social network.
    -- Example: Retrieving friends in a social network
    SELECT 
        A.UserID,
        A.UserName,
        B.FriendID,
        B.FriendName
    FROM 
        Friends A
    JOIN 
        Friends B ON A.FriendID = B.UserID;
  3. Comparing Rows:

    • When you need to compare values across different rows within the same table based on certain conditions.
    -- Example: Comparing sales amounts for each product and finding products with higher sales than others
    SELECT 
        A.ProductID,
        A.ProductName,
        A.SalesAmount,
        B.ProductID AS CompareProductID,
        B.ProductName AS CompareProductName,
        B.SalesAmount AS CompareSalesAmount
    FROM 
        Sales A
    JOIN 
        Sales B ON A.SalesAmount > B.SalesAmount
    WHERE
        A.ProductID <> B.ProductID;

These are just a few examples, and the use of self-joins can vary based on specific requirements and the structure of the database. Self-joins are a powerful tool in SQL, providing flexibility when dealing with relationships and comparisons within the same dataset.

Example:

With the given data:

| ProductID | ProductName   | SalesAmount |
|-----------|---------------|-------------|
| 1         | Widget A      | 150         |
| 2         | Widget B      | 100         |
| 3         | Widget C      | 200         |
| 4         | Widget D      | 50          |

The expected output after executing the query would include comparisons where each product's sales are higher than another's, such as:

| ProductID | ProductName | SalesAmount | CompareProductID | CompareProductName | CompareSalesAmount |
|-----------|-------------|-------------|------------------|--------------------|-------------------|
| 1         | Widget A    | 150         | 2                | Widget B           | 100               |
| 1         | Widget A    | 150         | 4                | Widget D           | 50                |
| 3         | Widget C    | 200         | 1                | Widget A           | 150               |
| 3         | Widget C    | 200         | 2                | Widget B           | 100               |
| 3         | Widget C    | 200         | 4                | Widget D           | 50                |
| 2         | Widget B    | 100         | 4                | Widget D           | 50                |

Q3: You've a below table, and if you perform below query on below table then what will the output would look like?

Method 1::

select *,round((a2.timestamp-a1.timestamp), 3) as processing_time 
from Activity a1
join Activity a2 
on a1.machine_id=a2.machine_id and a1.process_id=a2.process_id
and a1.activity_type='start' and a2.activity_type='end'

:

Activity table:
+------------+------------+---------------+-----------+
| machine_id | process_id | activity_type | timestamp |
+------------+------------+---------------+-----------+
| 0          | 0          | start         | 0.712     |
| 0          | 0          | end           | 1.520     |
| 0          | 1          | start         | 3.140     |
| 0          | 1          | end           | 4.120     |
| 1          | 0          | start         | 0.550     |
| 1          | 0          | end           | 1.550     |
| 1          | 1          | start         | 0.430     |
| 1          | 1          | end           | 1.420     |
| 2          | 0          | start         | 4.100     |
| 2          | 0          | end           | 4.512     |
| 2          | 1          | start         | 2.500     |
| 2          | 1          | end           | 5.000     |
+------------+------------+---------------+-----------+

Ans: The provided SQL query calculates the processing time for each 'start' and 'end' activity pair by joining the Activity table with itself based on machine_id, process_id, and opposite activity_type ('start' and 'end'). The processing time is then rounded to three decimal places using the round function.

Here's what the output might look like:

+------------+------------+---------------+-----------+------------+------------+---------------+-----------+-----------------+
| machine_id | process_id | activity_type | timestamp | machine_id | process_id | activity_type | timestamp | processing_time |
+------------+------------+---------------+-----------+------------+------------+---------------+-----------+-----------------+
| 0          | 0          | start         | 0.712     | 0          | 0          | end           | 1.520     | 0.808           |
| 0          | 1          | start         | 3.140     | 0          | 1          | end           | 4.120     | 0.98            |
| 1          | 0          | start         | 0.550     | 1          | 0          | end           | 1.550     | 1.0             |
| 1          | 1          | start         | 0.430     | 1          | 1          | end           | 1.420     | 0.99            |
| 2          | 0          | start         | 4.100     | 2          | 0          | end           | 4.512     | 0.412           |
| 2          | 1          | start         | 2.500     | 2          | 1          | end           | 5.000     | 2.5             |
+------------+------------+---------------+-----------+------------+------------+---------------+-----------+-----------------+

The output includes all columns from both instances of the Activity table (aliased as a1 and a2), along with the calculated processing_time for each pair of 'start' and 'end' activities. Each row represents a combination of 'start' and 'end' activities with the corresponding processing time. The processing time is not rounded in the output because the round function is applied in the SELECT clause of the query.

Q4: You've got 2 tables called Table T1, and Table T2. Please perform the following joins and tell me the number of records in each of these joins in SQL :: Left Join, Right Join, Inner Join.

T1 ::
Col_1 
1 
2 
3 
4 
4
NULL

T2 ::
Col_1 
1 
1 
1 
2 
3 
3 
4 
5 
NULL
NULL

Ans:

Here's the result for all three joins in one table:

Row_Num| LeftJoinResult | RightJoinResult | InnerJoinResult |
1     | -------------- | --------------- | --------------- |
2     | 1              | 1               | 1               |
3     | 1              | 1               | 1               |
4     | 1              | 1               | 1               |
5     | 2              | 2               | 2               |
6     | 3              | 3               | 3               |
7     | 3              | 3               | 3               |
8     | 4              | 4               | 4               |
9     | 4              | 4               | 4               |
10    | NULL           | 5               | -               |
11    | -              | NULL            | -               |
12    | -              | NULL            | -               |

Note:

  • - indicates no corresponding value in that particular join.
  • The InnerJoinResult column includes only the matching values, and NULL values are excluded.

Q5: What is Cross join? Please share a real-time example of it.

Ans: A CROSS JOIN, also known as a Cartesian join, is a type of join in SQL where each row from one table is combined with every row from another table. This results in a Cartesian product of the two tables, meaning every possible combination of rows is included in the result set. While CROSS JOINs can be useful in some scenarios, they can also lead to large result sets, so they should be used with caution.

Here's a simple example using a hypothetical scenario where you have two tables: employees and departments. The goal is to retrieve all possible combinations of employees and departments:

### Input Data:

**Table: employees**

| employee_id | employee_name | department_id |
|-------------|---------------|----------------|
| 1           | John Doe       | 101            |
| 2           | Jane Smith     | 102            |
| 3           | Bob Johnson    | 101            |

**Table: departments**

| department_id | department_name |
|---------------|------------------|
| 101           | HR               |
| 102           | IT               |

### Expected Output Data:

**Result of CROSS JOIN:**

| employee_id | employee_name | department_id | department_name |
|-------------|---------------|----------------|------------------|
| 1           | John Doe       | 101            | HR               |
| 1           | John Doe       | 102            | IT               |
| 2           | Jane Smith     | 101            | HR               |
| 2           | Jane Smith     | 102            | IT               |
| 3           | Bob Johnson    | 101            | HR               |
| 3           | Bob Johnson    | 102            | IT               |

In the result set, each row represents a combination of an employee from the `employees` table and a department from the `departments` table, resulting in a Cartesian product.
-- Perform a CROSS JOIN to get all combinations of employees and departments
SELECT
    e.employee_id,
    e.employee_name,
    d.department_id,
    d.department_name
FROM
    employees e
CROSS JOIN
    departments d;

In this example:

- The `employees` table contains information about employees, including their `employee_id`, `employee_name`, and `department_id`.
- The `departments` table contains information about departments, including `department_id` and `department_name`.

The `CROSS JOIN` in the query combines every row from the `employees` table with every row from the `departments` table, resulting in a Cartesian product. The result includes all possible combinations of employees and departments.

Keep in mind that while CROSS JOINs have their use cases, they can lead to large result sets and are often avoided in practice. It's important to use them judiciously and consider their implications on performance.

Q 6: can you explain the difference between Cross JOIN and Full Outer Join?

Ans:

Cross Join

  • Definition: A Cross Join, also known as a Cartesian Join, produces the Cartesian product of the two tables involved. This means every row from the first table is paired with every row from the second table.
  • Result: If table A has m rows and table B has n rows, the result of a Cross Join will be a table with m * n rows. This type of join does not require any condition to join the two tables.
  • Use Case: Cross Joins are used when you need to combine each item of one set with all items of another set. It's less commonly used in practical scenarios but is helpful in creating comprehensive pairing combinations, like generating test data.

Full Outer Join

  • Definition: A Full Outer Join returns all records when there is a match in either the left (table A) or right (table B) table records. It also returns records from the left table and right table where there is no match with a fill of NULL values on either side where the match is absent.
  • Result: This join will return the union of a Left Outer Join and a Right Outer Join. Rows from both tables that do not meet the join condition are still included in the output, but with NULL values in the columns where data is missing.
  • Use Case: Full Outer Joins are beneficial when you need to maintain all records from both joined tables to see where data overlaps and where it does not (i.e., performing data reconciliations or completeness checks in datasets).

Key Differences

  • Output Complexity: Cross Joins produce a large set of combinations with no inherent relationship unless further filtered, whereas Full Outer Joins are typically used to display all potential matched and unmatched records between two datasets based on a specific condition or set of conditions.
  • Join Condition: Cross Joins do not use a join condition, leading to potentially massive result sets. Full Outer Joins generally use a condition to match rows, similar to other outer joins but ensure no data is excluded from either table.
  • Practical Usage: Full Outer Joins are more common in data analysis scenarios where understanding the presence or absence of matches is important. Cross Joins are less frequently used unless explicitly needed for comprehensive combinations.
Let's use a simple example with two small tables to illustrate the differences between a **Cross Join** and a **Full Outer Join**.

### Example Tables

#### `Products` Table:

| ProductID | ProductName   |
|-----------|---------------|
| 1         | Apple         |
| 2         | Banana        |

#### `Customers` Table:

| CustomerID | CustomerName  |
|------------|---------------|
| 101        | Alice         |
| 102        | Bob           |

### Cross Join
A **Cross Join** produces a Cartesian product, pairing every row from the first table with every row from the second table.

SQL Query:

SELECT *
FROM Products
CROSS JOIN Customers;
#### Result:
| ProductID | ProductName | CustomerID | CustomerName |
|-----------|-------------|------------|--------------|
| 1         | Apple       | 101        | Alice        |
| 1         | Apple       | 102        | Bob          |
| 2         | Banana      | 101        | Alice        |
| 2         | Banana      | 102        | Bob          |

**Explanation**: Each product is paired with each customer, resulting in 4 combinations (2 products x 2 customers = 4 combinations).

### Full Outer Join
A **Full Outer Join** combines the results of both left and right outer joins. The records from both tables will match where possible. Where there is no match, the result is still displayed, but with `NULL` values for the missing side.

SQL Query:

SELECT *
FROM Products
FULL OUTER JOIN Customers ON Products.ProductID = Customers.CustomerID;

(For demonstration, let’s pretend that ProductID could somehow match CustomerID to make this example feasible.)

#### Result:
| ProductID | ProductName | CustomerID | CustomerName |
|-----------|-------------|------------|--------------|
| 1         | Apple       | NULL       | NULL         |
| 2         | Banana      | NULL       | NULL         |
| NULL      | NULL        | 101        | Alice        |
| NULL      | NULL        | 102        | Bob          |

**Explanation**: Since `ProductID` does not match `CustomerID`, all rows from both tables appear, but they do not connect any products with customers. Products and customers are shown independently, with `NULL` values filling in where no matches occur.

### Key Differences Illustrated
- **Result Content**: The Cross Join creates every possible combination between the two tables, useful for scenarios requiring comprehensive pair analysis. The Full Outer Join shows all potential data from both tables with specific matches where possible and fills in gaps with `NULL` where no data exists to match on the joining condition.
- **Usage Context**: Cross Joins are not typically used without additional conditions unless a full combination set is specifically required. Full Outer Joins are useful when you want to ensure that no data from either table is omitted, providing a full picture of both matched and unmatched data.
Clone this wiki locally