Sat Nov 25 2023

Startup Killers

Anton Revyako
Founder of dwh.dev

Amid the soap opera starring Sam Altman, everyone forgot that just a week earlier OpenAI held a developer conference that became a tombstone for numerous startups.

Of course, this isn't a unique occurrence. All major vendors do it. How many applications has Apple buried? Too many to count.

For instance, just a few days before the OpenAI conference, another major vendor dug some graves. On November 2nd, Snowflake held an online conference. Here is the program: Snowday agenda

A few presentations were enough to undercut a significant portion of the services built around Snowflake.

Everything took a hit: data quality, data security, data catalog, cost management, text2sql generators, and, of course, data lineage.

Data Quality

In my view, data quality-first services suffered the most. There's no longer any need to pay $50k a year for an interface that runs SQL queries via cron. In Snowflake, you can now do this:

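-- Define the metric (the function body is elided here).
CREATE DATA METRIC FUNCTION INVALID_EMAIL_COUNT (ARG_T TABLE(ARG_C1 STRING))
  RETURNS NUMBER AS
  ...
;

-- Evaluate the metrics attached to table t every 5 minutes.
ALTER TABLE t SET DATA_METRIC_SCHEDULE = 'USING CRON */5 * * * * UTC';
-- Attach the metric to the EMAIL column of table t.
ALTER TABLE t ADD DATA METRIC FUNCTION INVALID_EMAIL_COUNT ON (EMAIL);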

This gives you a table of results. You can then create an ALERT that reads from that table and emails you when something goes wrong.

With Snowflake now having Cortex Anomaly Detection and Time-Series Forecasting, complex metric-related tasks can be handled within Snowflake itself.

Data Security

The new Trust Center section contains information about account issues (account access settings), network policies, and the new DATABASE ROLE objects. There is also a new object, PRIVACY POLICY: a cool feature that adds noise to aggregation functions, adapts to the dataset's size, and guards against differencing attacks.

Data Catalog

Universal Search: AI-based metadata search in your database. The demo looks impressive: natural language search yields relevant results even without column comments.

Cost Management

A new section in the admin panel lets you view expense dynamics, the most expensive queries (grouped by hash, irrespective of query parameters' values), lists of rarely used materialized views, tables with clustering keys, and more.
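Grouping queries "by hash, irrespective of parameter values" can be illustrated with a minimal sketch. How Snowflake actually computes its hash was not described in the talk, so the normalization rules below (and the function name `parameterized_hash`) are assumptions made purely for illustration: literals are replaced with placeholders before hashing, so queries that differ only in their values land in the same group.

```python
import hashlib
import re


def parameterized_hash(sql: str) -> str:
    """Hash a query after replacing literal values with placeholders.

    Toy normalization (an assumption, not Snowflake's algorithm):
    string and numeric literals become '?', whitespace is collapsed,
    and the text is lowercased.
    """
    text = re.sub(r"'(?:[^']|'')*'", "?", sql)        # string literals
    text = re.sub(r"\b\d+(?:\.\d+)?\b", "?", text)    # numeric literals
    text = re.sub(r"\s+", " ", text).strip().lower()  # whitespace and case
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


q1 = "SELECT * FROM orders WHERE customer_id = 42 AND status = 'open'"
q2 = "select * from orders  where customer_id = 7 and status = 'closed'"
q3 = "SELECT * FROM orders WHERE region = 'EU'"

print(parameterized_hash(q1) == parameterized_hash(q2))  # same query shape
print(parameterized_hash(q1) == parameterized_hash(q3))  # different shape
```

With a grouping key like this, the cost of thousands of near-identical queries can be summed per query shape instead of per execution.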

Text2sql Generators

Snowflake Copilot :)
Code generation happens almost instantly.

Data Lineage

Tables now have a lineage tab (nothing is known yet about other objects). You can view object-to-object dependencies as a graph, expanding it one object at a time.

Details

Altogether, it's called Snowflake Horizon. The vendor's logic is clear: give clients maximum convenience without requiring additional purchases. Is that bad? No, it's good!

But not entirely. While all this looks great, it's not enough to make clients' pain disappear…

DATA METRIC FUNCTION lacks an interface, and setting up Slack notifications will require coding.
Universal Search lacks essential features present in data catalogs, like built-in chats for database objects :)
Cost management and optimization, as startups do it, goes further: rewriting queries (getespresso.ai) or tuning the warehouse (select.dev).
Snowflake Copilot generates poor queries that you then have to ask it to rewrite. It's unclear whether these queries are validated against reality.

And more on lineage:

Bird's-eye view only
Opening the next level of dependencies takes several seconds and can only be done for one object at a time
Object view. Columnar Lineage is shown as a table in a separate modal window and takes 5 seconds to load for 5 dependent objects across 3 downstream levels.
It's unclear where the relationships shown in the demo come from. CTAS? INSERT/UPDATE/MERGE? Over what time period? DYNAMIC TABLE only? Does it work for COPY INTO in PIPES? What about VIEWS?
The demo shows a rather specific case: applying PII tags (and the dependent masking policy) to columns of downstream objects by clicking in the interface. However, in Snowflake, tags don't automatically spread to dependent objects. If a tag is assigned to a column, nothing downstream is affected.

Is such functionality useful in the user interface? At first glance, yes. But could it happen that data used to flow one way, then changed? Could PII stop flowing from upstream? Why not? Anything can happen. Would tags set on downstream columns still make sense in a month, a quarter, a year? Nobody knows.

Suddenly, a new problem arises: regular review of tags and masking policies. The review is made harder by the fact that no comments can be attached when a tag or masking policy is set.
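The review problem described above can be sketched as a graph check: given the current lineage edges, which tagged columns are no longer fed by any PII source? This is only an illustration. The column names, the edge lists, and the helper functions `reachable_from` and `stale_tags` are all hypothetical, and it assumes you can export lineage as (upstream, downstream) column pairs.

```python
from collections import deque


def reachable_from(edges, sources):
    """Return every column reachable downstream from the given source columns."""
    graph = {}
    for upstream, downstream in edges:
        graph.setdefault(upstream, []).append(downstream)
    seen = set(sources)
    queue = deque(sources)
    while queue:  # plain breadth-first traversal
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def stale_tags(edges, pii_sources, tagged):
    """Tagged columns that no longer receive data from any PII source."""
    return sorted(set(tagged) - reachable_from(edges, pii_sources))


# Hypothetical lineage as (upstream column, downstream column) pairs.
edges_last_quarter = [
    ("raw.users.email", "stg.users.email"),
    ("stg.users.email", "mart.contacts.email"),
]
# Later the mart column was rebuilt from a hashed column instead.
edges_today = [
    ("raw.users.email", "stg.users.email"),
    ("stg.users.email_hash", "mart.contacts.email"),
]
tagged = ["stg.users.email", "mart.contacts.email"]

print(stale_tags(edges_last_quarter, ["raw.users.email"], tagged))  # []
print(stale_tags(edges_today, ["raw.users.email"], tagged))  # ['mart.contacts.email']
```

The check itself is cheap; the hard part is having trustworthy, up-to-date edges to feed it.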

How will all this look and perform when there are hundreds of downstream objects?

So, How Do Startups Survive Now?

You might think I'm nitpicking.

Many independent data lineage products offered the same features Snowflake has now, and that didn't bother them (though that's in the past now). After all, many users are satisfied with the information provided by DBT :)

It's also clear that Snowflake simply bolted an interface onto the lineage data they already had. It's crucial to understand: lineage in Snowflake is the result of dynamic analysis. Independent data lineage services don't have Snowflake's data, so they build their work on static analysis. Dynamic and static analysis are different approaches, not replacements for one another.
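To make the distinction concrete: static analysis derives lineage from the SQL text alone, without running anything, while dynamic analysis relies on what the engine actually executed. Below is a deliberately naive sketch of the static side. It is not how dwh.dev or any other product works; real tools need a full SQL parser, and this regex version (the names `TARGET`, `SOURCE`, and `static_lineage` are made up for the example) ignores CTEs, subqueries, quoted identifiers, and columns.

```python
import re

# Statement target: CREATE [OR REPLACE] TABLE/VIEW <name> or INSERT INTO <name>.
TARGET = re.compile(
    r"\b(?:create\s+(?:or\s+replace\s+)?(?:table|view)|insert\s+into)\s+([\w.]+)",
    re.IGNORECASE,
)
# Statement sources: any identifier that directly follows FROM or JOIN.
SOURCE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def static_lineage(sql: str):
    """Return (target, sorted sources) for a single statement, or None."""
    target = TARGET.search(sql)
    if target is None:
        return None
    sources = {name.lower() for name in SOURCE.findall(sql)}
    return target.group(1).lower(), sorted(sources)


sql = """
CREATE OR REPLACE TABLE mart.contacts AS
SELECT u.email, o.total
FROM stg.users u
JOIN stg.orders o ON o.user_id = u.id
"""

print(static_lineage(sql))  # ('mart.contacts', ['stg.orders', 'stg.users'])
```

Even this toy version works on code that has never been executed, which is exactly what query-history-based lineage cannot do.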

For data startups, it seems the focus should now shift to things that lie beyond Snowflake's capabilities, or that demand far more effort than extracting metrics from the query engine.

For instance, as soon as the need arises to view lineage from BI dashboards up to sources or to obtain columnar lineage for DBT, Snowflake won't be of much help.

Or take static analysis of SQL queries. Developing it is time-consuming and expensive (as can be seen across all dwh.dev competitors). It seems vendors' efforts will now focus on AI rather than static analysis.

Will it become more challenging for modern data stack startups to survive?
Yes, certainly.

Will it kill them?
The ones that thrived on basic functionality - undoubtedly.

And How About the Clients?

The vendor has provided good-enough functionality.
What's next is up to them :)