## The basic concept of the dataflow

### The main relation type in dataflow

1. Column to column dataflow: the data of the target column comes from the source column (fdd).
2. Column to resultset (mainly the select list): the number of rows in the resultset is impacted by the source column (fdr).
3. Resultset to resultset: the number of rows in a source table in the from clause impacts the number of rows in the target select list (fdr).

### Analyze dataflow in the different SQL elements - part 1

1. select list
2. where clause
3. function (case expression)
4. group by (aggregate function)
5. from clause
6. handling of select * (not finished yet)

#### 1. select list

```sql
SELECT a.empName "eName"
FROM scott.emp a
```

The data of the target column `"eName"` comes from `scott.emp.empName`, so we have a dataflow relation like this:

```
scott.emp.empName -> fdd -> "eName"
```

The result generated by the select list is called a `resultset`; it is like a virtual table that includes columns and rows.

#### 2. where clause

```sql
SELECT a.empName "eName"
FROM scott.emp a
WHERE sal > 1000
```

The total number of rows in the select list is impacted by the value of the column `sal` in the where clause, so we have a dataflow relation like this:

```
sal -> fdr -> resultset.PseudoRows
```

As you can see, we introduce a new pseudo column, `PseudoRows`, to represent the number of rows in the resultset.

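As a rough illustration only (the `Relation` class and its field names below are hypothetical, not the tool's actual data model), each discovered relation can be thought of as a simple source/type/target triple:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Relation:
    source: str     # e.g. a column such as "scott.emp.sal"
    rel_type: str   # "fdd" or "fdr"
    target: str     # a column, or a pseudo column such as "resultset.PseudoRows"

# The where-clause example above, expressed as one relation record:
where_impact = Relation(source="scott.emp.sal", rel_type="fdr", target="resultset.PseudoRows")
print(where_impact)
```
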
#### 3. function

During dataflow analysis, a `function` plays a key role. It accepts arguments, which are usually columns, and generates a result that may be a scalar value or a set of values.

```sql
select round(salary) as sal from scott.emp
```

The relations of the `round` function in the above SQL:

```
scott.emp.salary -> fdd -> round(salary) -> fdd -> sal
```

#### 4. group by and aggregate function

```sql
SELECT deptno, COUNT(*) num_emp, SUM(SAL) sal_sum
FROM scott.emp
WHERE city = 'NYC'
GROUP BY deptno
```

##### 4.1

Since `SUM()` is an aggregate function, the `deptno` column in the group by clause is treated as an implicit argument of the `SUM()` function.
However, the `deptno` column doesn't directly contribute a value to the `SUM()` function the way the `SAL` column does, so the relation type is `fdr`:

```
scott.emp.deptno -> fdr -> SUM(SAL) -> fdd -> sal_sum
```

The columns in the having clause have the same relation as the columns in the group by clause mentioned above.

##### 4.2

The value of the `SUM()` function is also affected by the total number of rows in the table `scott.emp`, so there is a relation like this:

```
scott.emp.pseudoRows -> fdr -> SUM(SAL) -> fdd -> sal_sum
```

The above rules apply to all aggregate functions, such as the `COUNT(*)` function in the SQL above.

#### 5. From clause

If the resultset of a subquery or CTE is used in the from clause of the upper-level statement, the impact of the lower-level resultset is transferred to the upper level.

```sql
WITH
cteReports (EmpID, FirstName, LastName, MgrID, EmpLevel)
AS
(
    SELECT EmployeeID, FirstName, LastName, ManagerID, 1 -- resultset1
    FROM Employees
    WHERE ManagerID IS NULL
)
SELECT
    FirstName + ' ' + LastName AS FullName, EmpLevel -- resultset2
FROM cteReports
```

In the CTE, there is an impact relation:

```
Employees.ManagerID -> fdr -> resultset1.pseudoRows
```

Since `cteReports` is used in the from clause of the upper-level statement, the impact carries on like this:

```
Employees.ManagerID -> fdr -> resultset1.pseudoRows -> fdd -> resultset2.pseudoRows
```

If we choose to ignore the intermediate resultset, the end-to-end dataflow is:

```
Employees.ManagerID -> fdr -> resultset2.pseudoRows
```

### Handle the dataflow chain

Every relation in the SQL is picked up by the tool and connected together to show the whole dataflow chain.
Sometimes we only need to see the end-to-end relation and ignore all the intermediate relations.

If we need to convert a fully chained dataflow into an `end to end` dataflow, we may consider the following rules:

1. A single dataflow chain with mixed relation types, fdd and fdr:

```
A -> fdd -> B -> fdr -> C -> fdd -> D
```

The rule is: if any `fdr` relation appears in the chain, the end-to-end relation `A -> D` is considered to be of type `fdr`; otherwise, the end-to-end relation `A -> D` is `fdd`.

2. If there are multiple chains from `A` to `D`:

```
A -> fdd -> B1 -> fdr -> C1 -> fdd -> D
A -> fdd -> B2 -> fdr -> C1 -> fdd -> D
A -> fdd -> B3 -> fdd -> C3 -> fdd -> D
```

The final relation should take its type from the `fdd` chain if one exists; in the example above, the third chain is purely `fdd`, so the end-to-end relation `A -> D` is `fdd`.

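To make these two rules concrete, here is a minimal sketch (in Python, with hypothetical helper names; the tool itself may implement this differently) of collapsing a fully chained dataflow into an end-to-end relation type:

```python
from typing import List

def reduce_chain(edge_types: List[str]) -> str:
    """Rule 1: any fdr edge makes the whole A -> ... -> D chain fdr."""
    return "fdr" if "fdr" in edge_types else "fdd"

def end_to_end_type(chains: List[List[str]]) -> str:
    """Rule 2: among parallel chains between the same endpoints, prefer fdd if any chain reduces to fdd."""
    reduced = [reduce_chain(chain) for chain in chains]
    return "fdd" if "fdd" in reduced else "fdr"

# The multiple-chain example above: two mixed chains and one purely fdd chain.
chains = [
    ["fdd", "fdr", "fdd"],  # A -> B1 -> C1 -> D
    ["fdd", "fdr", "fdd"],  # A -> B2 -> C1 -> D
    ["fdd", "fdd", "fdd"],  # A -> B3 -> C3 -> D
]
print(end_to_end_type(chains))  # prints "fdd"
```
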
### Analyze dataflow in the different SQL elements - part 2

#### 1. case expression

```sql
select
  case when a.kamut=1 and b.teur IS null then 'no locks'
       when a.kamut=1 then b.teur
       else 'locks'
  end teur
from tbl a left join TT b on (a.key=b.key)
```

During dataflow analysis, a case expression is treated as a function. The columns used inside the case expression are treated like the arguments of a function.
So for the above SQL, the following relations are discovered:

```
tbl.kamut -> fdd -> teur
TT.teur -> fdd -> teur
```

#### 2. join condition

Columns in the join condition also affect the number of rows in the resultset of the select list, just like columns in the where clause do.
So the following relations will be discovered in the above SQL:

```
tbl.key -> fdr -> resultset.PseudoRows
TT.key -> fdr -> resultset.PseudoRows
```

#### 3. create view

```sql
create view vEmp(eName) as
SELECT a.empName "eName"
FROM scott.emp a
WHERE sal > 1000
```

From this query, you can see how the column `sal` in the where clause impacts the number of rows in the top-level view `vEmp`:

```
scott.emp.sal -> fdr -> resultset1.PseudoRows -> fdr -> vEmp.PseudoRows
```

So, from an end-to-end point of view, there is a `fdr` relation between the column `sal` and the view `vEmp` like this:

```
scott.emp.sal -> fdr -> vEmp.PseudoRows
```

#### 4. rename/swap table

```sql
alter table t2 rename to t3;
```

We also use `PseudoRows` to represent the relation when a table is renamed; the relation type is `fdd`:

```
t2.PseudoRows -> fdd -> t3.PseudoRows
```

#### 5. create external table (snowflake)

```sql
create or replace stage exttable_part_stage
  url='s3://load/encrypted_files/'
  credentials=(aws_key_id='1a2b3c' aws_secret_key='4x5y6z')
  encryption=(type='AWS_SSE_KMS' kms_key_id = 'aws/key');

create external table exttable_part(
  date_part date as to_date(split_part(metadata$filename, '/', 3)
    || '/' || split_part(metadata$filename, '/', 4)
    || '/' || split_part(metadata$filename, '/', 5), 'YYYY/MM/DD'),
  timestamp bigint as (value:timestamp::bigint),
  col2 varchar as (value:col2::varchar))
  partition by (date_part)
  location=@exttable_part_stage/logs/
  auto_refresh = true
  file_format = (type = parquet);
```

The data of the external table `exttable_part` comes from the stage `exttable_part_stage`:

```
exttable_part_stage (url='s3://load/encrypted_files/') -> fdd -> exttable_part(date_part,timestamp,col2)
```

#### 6. create external table (bigquery)

```sql
CREATE EXTERNAL TABLE dataset.CsvTable OPTIONS (
  format = 'CSV',
  uris = ['gs://bucket/path1.csv', 'gs://bucket/path2.csv']
);
```

The data of the external table `dataset.CsvTable` comes from the CSV files `gs://bucket/path1.csv` and `gs://bucket/path2.csv`:

```
file (uri='gs://bucket/path1.csv') -> fdd -> dataset.CsvTable
file (uri='gs://bucket/path2.csv') -> fdd -> dataset.CsvTable
```

#### 7. build data lineage for the foreign key in the create table statement

```sql
CREATE TABLE masteTable
(
  masterColumn varchar(3) PRIMARY KEY
);

CREATE TABLE foreignTable
(
  foreignColumn1 varchar(3) NOT NULL,
  foreignColumn2 varchar(3) NOT NULL,
  FOREIGN KEY (foreignColumn1) REFERENCES masteTable(masterColumn),
  FOREIGN KEY (foreignColumn2) REFERENCES masteTable(masterColumn)
);
```

The data flow is:

```
masteTable.masterColumn -> fdd -> foreignTable.foreignColumn1
masteTable.masterColumn -> fdd -> foreignTable.foreignColumn2
```

#### 8. Hive load data

```sql
LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08_us.txt' INTO TABLE page_view PARTITION(date='2008-06-08', country='US')
```

The data flow is:

```
file (/tmp/pv_2008-06-08_us.txt) -> fdd -> page_view(date,country)
```

#### 9. Hive INSERT OVERWRITE [LOCAL] DIRECTORY

```sql
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/pv_gender_sum'
SELECT pv_gender_sum.*
FROM pv_gender_sum;
```

The data flow is:

```
pv_gender_sum(*) -> fdd -> file ('/tmp/pv_gender_sum')
```

### The meaning of the letters in fdd, fdr

The letters in `fdd` and `fdr` mean: f: dataflow, d: data value, r: record set.

The first letter is always f, the second letter represents the source column, and the third letter represents the target column; the fourth letter is reserved.

* fdd: the data of the source column will be used in the target column.
* fdr: the data of the source column will impact the number of rows in the resultset of the select list, or will impact the result value of an aggregate function.