## The basic concept of the dataflow
### The main relation type in dataflow
1. column to column dataflow: the data of the target column comes from the source column (fdd).
2. column to resultset (mainly the select list): the number of rows in the resultset is impacted by the source column (fdr).
3. resultset to resultset: the number of rows in a source table in the from clause impacts the number of rows in the target select list (fdr).
### Analyze dataflow in the different SQL elements - part 1
1. select list
2. where clause
3. function (case expression)
4. group by(aggregate function)
5. from clause
6. handling of `select *` (not finished yet)
#### 1. select list
```sql
SELECT a.empName "eName"
FROM scott.emp a
```
The data of the target column `"eName"` comes from `scott.emp.empName`, so we have a dataflow relation like this:
```
scott.emp.empName -> fdd -> "eName"
```
The result generated by the select list is called a `resultset`. It is like a virtual table that includes columns and rows.
#### 2. where clause
```sql
SELECT a.empName "eName"
FROM scott.emp a
Where sal > 1000
```
The total number of rows in the select list is impacted by the value of the column `sal` in the where clause. So we have a dataflow relation like this:
```
sal -> fdr -> resultset.PseudoRows
```
As you can see, we introduced a new pseudo column, `PseudoRows`, to represent the number of rows in the resultset.
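If the where clause contains more than one filter column, each column can be expected to produce its own `fdr` relation to the same pseudo column. A minimal sketch (the extra `deptno` predicate is added here only for illustration):
```sql
SELECT a.empName "eName"
FROM scott.emp a
Where sal > 1000 AND deptno = 10
```
```
sal -> fdr -> resultset.PseudoRows
deptno -> fdr -> resultset.PseudoRows
```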
#### 3. function
During dataflow analysis, a `function` plays a key role. It accepts arguments, which are usually columns, and generates a result that may be a scalar value or a set of values.
```sql
select round(salary) as sal from scott.emp
```
The relation of the `round` function in the above SQL is:
```
scott.emp.salary -> fdd -> round(salary) -> fdd -> sal
```
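When function calls are nested, each call can be treated as consuming the result of the inner one, so the `fdd` relations form a chain. A minimal sketch (the nested `abs` call is an illustrative assumption):
```sql
select round(abs(salary)) as sal from scott.emp
```
```
scott.emp.salary -> fdd -> abs(salary) -> fdd -> round(abs(salary)) -> fdd -> sal
```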
#### 4. group by and aggregate function
```sql
SELECT deptno, COUNT(*) num_emp, SUM(SAL) sal_sum
FROM scott.emp
Where city = 'NYC'
GROUP BY deptno
```
##### 4.1
Since `SUM()` is an aggregate function, the `deptno` column in the group by clause will be treated as an implicit argument of the `SUM()` function.
However, the `deptno` column doesn't directly contribute a value to the `SUM()` function as the column `SAL` does, so the relation type is `fdr`:
```
scott.emp.deptno -> fdr -> SUM(SAL) -> fdd -> sal_sum
```
The columns in the having clause have the same relation as the columns in the group by clause, as mentioned above.
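A minimal sketch of this having-clause rule, assuming a column such as `comm` that is referenced only in the having clause is treated exactly like a group-by column:
```sql
SELECT deptno, SUM(SAL) sal_sum
FROM scott.emp
GROUP BY deptno
HAVING AVG(comm) > 100
```
```
scott.emp.comm -> fdr -> SUM(SAL) -> fdd -> sal_sum
```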
##### 4.2
The value of the `SUM()` function is also affected by the total number of rows in the table `scott.emp`, so there is a relation like this:
```
scott.emp.pseudoRows -> fdr -> SUM(SAL) -> fdd -> sal_sum
```
The above rules apply to all aggregate functions, such as the `COUNT(*)` function in the SQL above.
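Applying the same two rules to the `COUNT(*)` function yields:
```
scott.emp.deptno -> fdr -> COUNT(*) -> fdd -> num_emp
scott.emp.pseudoRows -> fdr -> COUNT(*) -> fdd -> num_emp
```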
#### 5. From clause
If the resultset of a subquery or CTE is used in the from clause of the upper-level statement, then the impact on the lower-level resultset is transferred to the upper level.
```sql
WITH
cteReports (EmpID, FirstName, LastName, MgrID, EmpLevel)
AS
(
SELECT EmployeeID, FirstName, LastName, ManagerID, 1 -- resultset1
FROM Employees
WHERE ManagerID IS NULL
)
SELECT
FirstName + ' ' + LastName AS FullName, EmpLevel -- resultset2
FROM cteReports
```
In the CTE, there is an impact relation:
```
Employees.ManagerID -> fdr -> resultset1.pseudoRows
```
Since `cteReports` is used in the from clause of the upper-level statement, the impact carries on like this:
```
Employees.ManagerID -> fdr -> resultset1.pseudoRows -> fdd -> resultset2.pseudoRows
```
If we choose to ignore the intermediate resultset, the end-to-end dataflow is:
```
Employees.ManagerID -> fdr -> resultset2.pseudoRows
```
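The `fdd` relations transfer through the from clause in the same way. Following the select-list rule from section 1, the column lineage of this query would look like this:
```
Employees.FirstName -> fdd -> resultset1.FirstName -> fdd -> FullName
Employees.LastName -> fdd -> resultset1.LastName -> fdd -> FullName
```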
### Handle the dataflow chain
Every relation in the SQL is picked up by the tool and connected together to show the whole dataflow chain.
Sometimes we only need to see the end-to-end relation and ignore all the intermediate relations.
If we need to convert a fully chained dataflow to an `end to end` dataflow, we may consider the following rules:
1. A single dataflow chain with mixed relation types, fdd and fdr:
```
A -> fdd -> B -> fdr -> C -> fdd -> D
```
The rule is: if any `fdr` relation appears in the chain, the relation from `A -> D` will be considered to be of type `fdr`; otherwise, the end-to-end relation of `A -> D` is `fdd`.
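So, for the chain above, the end-to-end relation is:
```
A -> fdr -> D
```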
2. If there are multiple chains from `A -> D`:
```
A -> fdd -> B1 -> fdr -> C1 -> fdd -> D
A -> fdd -> B2 -> fdr -> C1 -> fdd -> D
A -> fdd -> B3 -> fdd -> C3 -> fdd -> D
```
If any chain consists solely of `fdd` relations, the final end-to-end relation should be `fdd`.
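Since the third chain contains only `fdd` relations, the end-to-end relation here is:
```
A -> fdd -> D
```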
### Analyze dataflow in the different SQL elements - part 2
#### 1. case expression
```sql
select
case when a.kamut=1 and b.teur IS null
then 'no locks'
when a.kamut=1
then b.teur
else 'locks'
end teur
from tbl a left join TT b on (a.key=b.key)
```
During dataflow analysis, a case expression is treated as a function. The columns used inside the case expression are treated like the arguments of a function.
So for the above SQL, the following relations are discovered:
```
tbl.kamut -> fdd -> teur
TT.teur -> fdd -> teur
```
#### 2. join condition
Columns in the join condition also affect the number of rows in the resultset of the select list, just like columns in the where clause do.
So the following relations will be discovered in the above SQL:
```
tbl.key -> fdr -> resultset.PseudoRows
TT.key -> fdr -> resultset.PseudoRows
```
#### 3. create view
```sql
create view vEmp(eName) as
SELECT a.empName "eName"
FROM scott.emp a
Where sal > 1000
```
From this query, you can see how the column `sal` in the where clause impacts the number of rows in the top-level view `vEmp`.
```
scott.emp.sal -> fdr -> resultset1.PseudoRows -> fdr -> vEmp.PseudoRows
```
So, from an end-to-end point of view, there will be an `fdr` relation between the column `sal` and the view `vEmp` like this:
```
scott.emp.sal -> fdr -> vEmp.PseudoRows
```
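The `fdd` relations flow into the view columns in the same way. Combining the select-list relation from section 1 of part 1 with the view definition gives the end-to-end column lineage:
```
scott.emp.empName -> fdd -> vEmp.eName
```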
#### 4. rename/swap table
```sql
alter table t2 rename to t3;
```
We also use `PseudoRows` to represent the relation when renaming a table; the relation type is `fdd`.
```
t2.PseudoRows -> fdd -> t3.PseudoRows
```
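For a table swap, here is a minimal sketch assuming Snowflake's `ALTER TABLE ... SWAP WITH` syntax; since a swap exchanges the two tables, the relation can be expected in both directions (the relations below are an assumption based on the rename rule):
```sql
alter table t2 swap with t3;
```
```
t2.PseudoRows -> fdd -> t3.PseudoRows
t3.PseudoRows -> fdd -> t2.PseudoRows
```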
#### 5. create external table (snowflake)
```sql
create or replace stage exttable_part_stage
url='s3://load/encrypted_files/'
credentials=(aws_key_id='1a2b3c' aws_secret_key='4x5y6z')
encryption=(type='AWS_SSE_KMS' kms_key_id = 'aws/key');
create external table exttable_part(
date_part date as to_date(split_part(metadata$filename, '/', 3)
|| '/' || split_part(metadata$filename, '/', 4)
|| '/' || split_part(metadata$filename, '/', 5), 'YYYY/MM/DD'),
timestamp bigint as (value:timestamp::bigint),
col2 varchar as (value:col2::varchar))
partition by (date_part)
location=@exttable_part_stage/logs/
auto_refresh = true
file_format = (type = parquet);
```
The data of the external table `exttable_part` comes from the stage `exttable_part_stage`:
```
exttable_part_stage (url='s3://load/encrypted_files/') -> fdd -> exttable_part(date_part,timestamp,col2)
```
#### 6. create external table (bigquery)
```sql
CREATE EXTERNAL TABLE dataset.CsvTable OPTIONS (
format = 'CSV',
uris = ['gs://bucket/path1.csv', 'gs://bucket/path2.csv']
);
```
The data of the external table `dataset.CsvTable` comes from the CSV files `gs://bucket/path1.csv` and `gs://bucket/path2.csv`:
```
file (uri='gs://bucket/path1.csv') -> fdd -> dataset.CsvTable
file (uri='gs://bucket/path2.csv') -> fdd -> dataset.CsvTable
```
![bigquery create external table](/images/bigquery-create-external-table.png)
#### 7. build data lineage for the foreign key in the create table statement
```sql
CREATE TABLE masteTable
(
masterColumn varchar(3) Primary Key
);
CREATE TABLE foreignTable
(
foreignColumn1 varchar(3) NOT NULL,
foreignColumn2 varchar(3) NOT NULL,
FOREIGN KEY (foreignColumn1) REFERENCES masteTable(masterColumn),
FOREIGN KEY (foreignColumn2) REFERENCES masteTable(masterColumn)
)
```
The data flow is:
```
masteTable.masterColumn -> fdd -> foreignTable.foreignColumn1
masteTable.masterColumn -> fdd -> foreignTable.foreignColumn2
```
#### 8. Hive load data
```sql
LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08_us.txt' INTO TABLE page_view PARTITION(date='2008-06-08', country='US')
```
The data flow is:
```
file (/tmp/pv_2008-06-08_us.txt) -> fdd -> page_view(date,country)
```
![hive load data](/images/hive-load-data.png)
#### 9. Hive INSERT OVERWRITE [LOCAL] DIRECTORY
```sql
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/pv_gender_sum'
SELECT pv_gender_sum.*
FROM pv_gender_sum;
```
The data flow is:
```
pv_gender_sum(*) -> fdd -> file ('/tmp/pv_gender_sum')
```
### The meaning of the letters in fdd, fdr
f: dataflow, d: data value, r: record set.
The first letter is always `f`, the second letter represents the source column, the third letter represents the target column, and the fourth is reserved.
* fdd: the data of the source column will be used in the target column.
* fdr: the data of the source column will impact the number of rows in the resultset of the select list, or will impact the result value of an aggregate function.