sqlflow_public/doc/the complete data lineage.md

4.0 KiB

When analyzing data lineage, the complete model is always generated since all other higher level models are based on this model.

1. Types of entity

Table, view, column, process(SQL statement), resultset, function, variale, path.

2. Types of relation

fdd

The fdd relation means the data of the target entity comes from the source entity. Take this SQL query for example:

SELECT a.empName "eName"
FROM scott.emp a

the data of target column "eName" comes from scott.emp.empName, so we have a dataflow relation like this:

scott.emp.empName -> fdd -> "eName"

image.png

the result generated by the select list is called: resultset which ikes a virtual table includes columns and rows.

fdr

The fdr relation means the data of the source column will impact the row number of the resultset in the select list, or will impact the result value of an anggreate function.

SELECT a.empName "eName"
FROM scott.emp a
Where sal > 1000

The total number of rows in the select list is impacted by the value of column sal in the where clause. So we have a dataflow relation like this:

sal -> fdr -> resultset.PseudoRows

image.png

PseudoRows column

As you can see, we introduced a new pseudo column: PseudoRows to represents the number of rows in the resultset.

The fdr type dataflow is represented by a dash line. You can hide the fdr type dataflow by turn off the impact option in the SQLFlow.

image.png

You may find more examples about the fdr relation in the lineage in real SQL section.

join

The join relation build a link between 2 or more columns in the join condition. Striclty speaking, the relation is not a dataflow relation.

 select b.teur
 from tbl a left join TT b on (a.key=b.key)

A join relation will be created after analzye the above SQL. It indicates a join relation betwee tbl.key and TT.key.

image.png

3. Connect the entity using relation

When build relation between 2 entities, the source and target entity can be column to column, or, table to table.

column to column

This is the most often case that both entity in a relation are columns.

table to table

Sometimes, there will be a relation between 2 tables. For example, in an alter table rename SQL statement, a table to table relation will be created. Acutally, a table to table relation is represented by a column to column relation using the PseudoRows column.

Table to table relation is included in the complete lineage model by using PseudoRows for 2 reasons:

  1. This pseudo column to column relation will be used to generate a table-level lineage later if user need a table-level lineage model.
  2. If a column in this table is used in a column to column relation while the table itself is in a table to table relation, then, this pseudo column will make it possible for a single table to includes both the column to column relation and table to table relation.

take this SQL for example

create view v1 as select f1 from t2;
alter table t2 rename to t3;

column to column relation

image.png

As you can see, Table t2 involved in the column to column relation generated by the create view statement, It also involved in a table to table relation generated by the alter table statement. A single table t2 in the above digram show that it includes both the column to column relation and the table to table reation.

table-level lineage

With the table to table relation generated in the complete data lineage model, we can later use it to generate a table-level lineage like this:

image.png