The gaps in Snowflake internal lineage

  • Anton Revyako
    Anton Revyako

    Founder of dwh.dev

The gaps in Snowflake internal lineage

Recently, Snowflake Data Superhero Cristian Scutaru published an article where he argues that lineage tools are unnecessary and that Snowflake provides enough out-of-the-box features.

Main Points:

  • In the modern world, living without column-level lineage is impossible (agree)
  • Open-source is complicated [and useless] (agree)
  • Commercial solutions cost $1000 a month, and that's too much (disagree. $1000 is too cheap 🙂)
  • You can use built-in features and visualize everything in Graphviz by generating DOT annotations (disagree)

As an alternative to third-party solutions, the article suggests using the built-in ACCESS_HISTORY function. More details can be found here. Let's delve into this a bit more.

0) Not just "where," but also "how."

Internal Lineage doesn't work as you might expect. The examples in the article and documentation only discuss connections between two objects at most: inserting from one table to another, selecting from a table. But what happens when you use multiple dependent views?

CREATE OR REPLACE TABLE SOURCETABLE (id int)
    AS SELECT * FROM VALUES (1), (2), (3)
;
CREATE OR REPLACE VIEW VIEW1 AS SELECT id FROM SOURCETABLE;
CREATE OR REPLACE VIEW VIEW2 AS SELECT id FROM VIEW1;
SELECT * FROM VIEW2;

What do we see in ACCESS_HISTORY?

[
  {
    "columns": [
      {
        "columnName": "ID"
      }
    ],
    "objectDomain": "Table",
    "objectName": "TESTDB.PUBLIC.SOURCETABLE"
  }
]


[
  {
    "columns": [
      {
        "columnName": "ID"
      }
    ],
    "objectDomain": "View",
    "objectName": "TESTDB.PUBLIC.VIEW2"
  }
]

Oops… VIEW1 disappeared somewhere! I can list many reasons why such Lineage doesn't answer the questions teams usually have when they start thinking about it. Just imagine that you have dozens of such intermediate objects.

1) A lot of manual work

Okay, how about this:

CREATE OR REPLACE TABLE TARGETTABLE1 (id int);
CREATE OR REPLACE TABLE TARGETTABLE2 (id int);
INSERT INTO TARGETTABLE1 SELECT * FROM SOURCETABLE;
INSERT INTO TARGETTABLE2 SELECT * FROM TARGETTABLE1;

ACCESS_HISTORY will display 2 segments:

SOURCETABLE -> TARGETTABLE1
TARGETTABLE1 -> TARGETTABLE2

Great. Now, we just have to recursively go through all upstream objects to build the complete dependency graph…

2) This functionality is available starting from the ENTERPRISE edition

If you haven't yet started paying twice as much for the ENTERPRISE edition and are on the STANDARD edition, think hard whether it's worth it…

3) Mismatching with external systems

In addition to the function not showing anything happening with the data between the endpoints, you'll also not learn how these queries are connected with external systems. Which DBT model was used here? Which BI dashboard triggered this query?

4) No additional insights

Because we only see the beginning and the end, we can't see any inherited tags or comments either. We won't know if aggregation or any transformations were applied.

5) Visualization is a separate challenge that you might not want to tackle.

Graphviz is a great tool; we use it as a source for object placement in dwh.dev, but it's not what the average person wants to see, especially when we talk about at least a hundred objects on one graph. Especially if it's column-level lineage. Everything turns into a mash of thousands of columns, without search, interactive highlighting, or navigation. Conclusion

A final summary.

Snowflake claims that the total engineering resources spent on data quality issues are consistently between 30 to 50% according to most industry surveys.

If you are already paying for a cloud database and $100k+ annually for each data engineer on your team (and data engineers don't come alone :)), spending $1000-$3000 a month on data lineage (for example, on dwh.dev) would be the wisest investment 🙂