State of Data Engineering 2024 Q2

Data Engineering and AI

Chip Huyen, who came out of Stanford and is active in the AI space, recently wrote an article on what she learned from studying the 900 most popular open source AI tools.

https://huyenchip.com/2024/03/14/ai-oss.html

Image Credit: huyenchip’s blog

In data engineering, one of our primary uses of AI is really just prompt engineering.

Use Case 1: Data Migration

Before LLMs, when we did data migrations, we would first use the Amazon Schema Conversion Tool (SCT) to help convert source schemas to a new target schema.  Let us say we are going from SQL Server to Postgres, which is a major language change.

From there, the hard part begins: manually converting the SQL Server business logic code to Postgres.  Some converters do exist, and I assume they work by mapping one language grammar to another (fun fact – I almost pursued a PhD in compiler optimization, but bailed from the program).

Now we can use LLMs to convert a huge body of code from one source to a target using prompt engineering.  Despite all the new open source models out there, GPT-4 still seems to outperform the competition at this type of code conversion for the time being.

The crazy thing is that with LLMs, we can convert from practically any source language to any target language.  Java to C#, SQL to Spark SQL – they all work reasonably well.  In terms of predictions for our field, I see things progressing in phases:
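As a toy illustration, this is the shape of a tightly scoped conversion prompt; the template and function name here are my own sketch, not any vendor's API:

```python
def build_conversion_prompt(source_dialect, target_dialect, code):
    """Assemble a tightly constrained code-conversion prompt.
    Tighter prompt conditions tend to give more predictable output;
    this template is a hypothetical example."""
    parts = [
        f"Convert the following {source_dialect} code to {target_dialect}.",
        "Return only the converted code, with no commentary.",
        f"Source code:\n{code}",
    ]
    return "\n\n".join(parts)

prompt = build_conversion_prompt(
    "SQL Server T-SQL",
    "PostgreSQL",
    "SELECT TOP 10 * FROM dbo.orders;",
)
```

The string produced here would then be sent to the model of your choice; the point is that the conversion instructions, not the model call, carry most of the engineering effort.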

Phase 1 Now:

  1. Productivity gains from code conversion using LLMs
  2. Productivity gains in coding itself, with tools like Amazon CodeWhisperer, Amazon Q, or the LLM of your choice
  3. Productivity gains in learning a new language with LLMs
  4. Debugging stack traces by having LLMs analyze them

Phase 2: Near Future

  1. Tweaks to LLMs to make them more deterministic for prompt engineering.  We can already control creativity with the ‘temperature’ parameter, but we generally have to give very tight prompt conditions to get some code conversions to work.  In some of our experiments with SQL to Spark SQL, passing in the DDLs forced the LLMs to generate more accurate output.
  2. An interesting paper on chain-of-thought prompting (a series of intermediate reasoning steps) might help us move towards this.
    Arxiv paper here – https://arxiv.org/abs/2201.11903
  3. In latent.space’s latest newsletter, they cited a paper in which adding “Let’s think step by step” improved zero-shot reasoning from 17% to 79%.  If you happen to DM me and say that in an introduction, I will raise an eyebrow.
    latent.space citation link
  4. Using LLMs to create data quality tests based on schemas, or to create unit tests based on existing ETL code.
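To make the data-quality-test idea concrete: in practice an LLM would propose the checks, but a deterministic stand-in shows the shape of the output we would want it to produce (the rules and function name below are my own illustration):

```python
def dq_checks_from_schema(table, columns):
    """Derive simple data-quality SQL checks from a schema description.
    `columns` maps column names to types.  The rules (row count plus
    not-null per column) are illustrative assumptions, not the output
    of any real tool."""
    checks = [f"SELECT COUNT(*) > 0 AS has_rows FROM {table}"]
    for col in columns:
        checks.append(
            f"SELECT COUNT(*) = 0 AS no_nulls_{col} "
            f"FROM {table} WHERE {col} IS NULL"
        )
    return checks

checks = dq_checks_from_schema("orders", {"id": "INT", "total": "NUMERIC"})
```

An LLM prompted with the same DDL should be able to emit checks like these, plus smarter ones (ranges, referential integrity) that a rule-based generator would miss.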

Phase 3: Future

  1. The far, scary future is where we tell LLMs to create our data engineering systems for us.  Imagine telling one to ingest data from S3 into an Open Table Format (OTF) and to write business code on top of it.  I kind of don’t see this for at least 10ish years, though.

Open Table Format Wars – Continued


The OTF wars continue to rage with no end in sight.  As a refresher, there are three players:

  1. Apache Hudi – which came out of the Uber project
  2. Apache Iceberg – which came out of the Netflix project
  3. Databricks Deltalake.

As a reminder, OTFs provide features such as time travel, incremental ETL, deletion capability, and (depending on which one you use) schema-evolution-ish capability.

Perhaps one of the biggest subtle changes recently is that the OneTable project is now Apache XTable.

https://xtable.apache.org

Apache XTable is a framework for seamless cross-table interoperability between the OTFs.  I still think this is ahead of its time, because I haven’t seen any project that needs to combine multiple OTFs within one organization.  My prediction, though, is that in 5-10 years this will become a standard enabling vendor interoperability – but it will take a while.

Apache Hudi Updates

Apache Iceberg Updates

Lake Formation

Lake Formation still feels a bit weird to me: one part of it is blueprints, which we really don’t use, and the other part deals with access control.  It recently rolled out some new changes around OTF integration and ACLs.

Summary about state of the OTF Market

It is still kind of a mess, and there really aren’t any clear winners yet.  For each format you can also choose between the open source branch and a hosted provider such as Onehouse or Tabular.

The false promises of AWS announcements – S3 Express One Zone

Around re:Invent there is always a huge set of announcements, and one stood out: S3 Express One Zone.  This feature allows retrieval of data from S3 in single-digit milliseconds, with the tradeoff that storage lives in a single availability zone (so no HA).  You can imagine that if this actually works, datalakes could hypothetically start competing with databases, as we wouldn’t need to worry about the SLA penalty you usually get with S3.

Looking at the restrictions there are some pretty significant drawbacks.

https://docs.aws.amazon.com/athena/latest/ug/querying-express-one-zone.html

As you can see, Hudi isn’t supported (not sure why Iceberg tables aren’t listed), and Deltalake has partial support.  The other consideration is that this is a single zone, so you have to make sure there is a replicated bucket in a standard storage class.

I kind of feel that Amazon tests the waters by launching not-fully-formed products to get feedback from us.  Unfortunately, that makes us the guinea pigs.

TLDR – This service works for Glue jobs, but for OTFs, it is dead in the water for the time being.

Amazon Q 

I remember being in an AWS roundtable representing data consulting companies at Re:invent and a complaint from other firms was that Amazon had too many confusing products.  As we are all guinea pigs in their early services, Amazon Q is no exception. 

Product | Use Case | Features
Amazon Q for Business | Chatbot for internal enterprise data, managed by Amazon. No dev work required. | Chatbot
Amazon Q for Developers | Best for basic coding and coding with AWS-specific services. Broader coding is probably better with a foundational model. | Code completion (CodeWhisperer); chat (Amazon Q)

TLDR

  1. Amazon Q for Business is a managed product where you click to add data sources and get a chatbot
  2. Amazon Q for Developers contains code completion (CodeWhisperer) AND a chat in the IDE with, yes, Amazon Q again as the chat.  Confused yet?

Quicksight Q

I’d like to confuse you one more time with the history of Quicksight Q.  Pre-ChatGPT and the LLM craze, Quicksight Q went Generally Available (GA) in 2021, powered by machine learning.

https://aws.amazon.com/about-aws/whats-new/2021/09/amazon-quicksight-q-generally-available

After ChatGPT came out, Quicksight Q went back into Preview with LLM integration – but they kept the same name.

As you do your solutions architecture, you really need to keep in mind whether a service is in preview or GA.  Things in preview typically support only a couple of regions and don’t have production support.  If you are interested in a service in preview (like Amazon Q), it is advisable to wait a bit.

A Framework for Processing Uber Uber Large Sets of Data – Theseus


I show this diagram very often; as a refresher, a lot of the work we do in data engineering is in the yellow and red areas, and often involves OTFs.

Voltron Data, who created a GPU query engine called Theseus, put out benchmarks comparing Theseus against Spark:

https://voltrondata.com/benchmarks/theseus

Image Credit: Voltron Data’s blog
Their guidance is also quite interesting:

For less than 2TBs: We believe DuckDB and Arrow backed projects, DataFusion, and Polars make a lot of sense. This is probably the majority of datasets in the world and can be run most efficiently leveraging these state-of-the-art query systems.

For up to 30TBs: Well-known data warehouses like Snowflake, Google BigQuery, Databricks, and distributed processing frameworks like Spark and Trino work wonders at this scale.

For anything over 30TBs: This is where Theseus makes sense. Our minimum threshold to move forward requires 10TB queries (not datasets), but we prefer to operate when queries exceed 100TBs. This is an incredibly rare class of problem, but if you are feeling it, you know how quickly costs balloon, SLAs are missed, and tenuously the data pipeline is held together.

I mostly work in the AWS space, but it is interesting to peek at what innovations are going on outside that space.

The author of Apache Arrow also made this observation

  • ≤ 1TB — DuckDB, Snowflake, DataFusion, Athena, Trino, Presto, etc.
  • 1–10TB — Spark, Dask, Ray, etc.
  • > 10TB — hardware-accelerated processing (e.g., Theseus).
    (citation credit link)


You might ask what my guidance would be for the Amazon space:

  • < 100 gigabytes – your run-of-the-mill RDS or Aurora
  • 100 gigabytes to 30 TB – Redshift, or an OTF
  • > 30 TB – we haven’t really played in this space, but things like Apache Iceberg are probably better candidates

TLDR – you probably will never use Theseus, so this is just a fun article.

American Privacy Rights Act (APRA)

There was a bit of surprising news out of the US Congress: there is now draft legislation for national data privacy rights for Americans.  In the United States, data privacy has consisted of an odd patchwork of state-by-state legislation (like the CCPA in California or the Colorado Privacy Act).  The US is really quite behind here, as much of the rest of the world already has some type of privacy legislation.

https://energycommerce.house.gov/posts/committee-chairs-rodgers-cantwell-unveil-historic-draft-comprehensive-data-privacy-legislation


Here are some draft highlights

  1. Deletion Requests: Companies are required to delete personal data upon an individual’s request and must notify any third parties who have received this data to do the same.
  2. Third-Party Notifications: Companies must inform third parties of any deletion requests, ensuring that these third parties also delete the relevant data.
  3. Verification of Requests: Companies need to verify the identity of individuals who request data deletion or correction to ensure the legitimacy of these requests.
  4. Exceptions to Deletion: There are specific conditions under which a company may refuse a deletion request, such as legal restrictions, implications for data security, or if it would affect the rights of others.
  5. Technological and Cost Constraints: If it is technologically impossible or prohibitively expensive to comply with a deletion request, companies may decline the request but must provide a detailed explanation to the individual.
  6. Frequency and Cost of Requests: Companies can allow individuals to exercise their deletion rights free of charge up to three times per year; additional requests may incur a reasonable fee.
  7. Timely Response: Companies must respond to deletion requests within specified time frames, generally within 15 to 30 days, depending on whether they qualify as large data holders or not.
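The response-window and fee rules above can be sketched in code.  To be clear, this is my own toy model of the draft's requirements: the exact 15-vs-30-day split by entity type and the fee amount are placeholder assumptions, not the statutory text:

```python
from datetime import date, timedelta

def deletion_deadline(received, large_data_holder):
    """Response window sketch: the draft cites 15-30 day windows
    depending on large-data-holder status.  Which entity gets which
    window here is an assumption for illustration."""
    days = 15 if large_data_holder else 30
    return received + timedelta(days=days)

def request_fee(requests_this_year):
    """First three deletion requests per year are free; later ones
    may carry a 'reasonable fee' (amount is a placeholder)."""
    return 0 if requests_this_year <= 3 else 5.0

deadline = deletion_deadline(date(2024, 5, 1), large_data_holder=True)
```

If this passes, logic like this would have to live inside every covered company's data platform, which is part of why it matters to data engineers.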


Who is this applicable for?

  1. Large Data Holders: The Act defines a “large data holder” as a covered entity that, in the most recent calendar year, had annual gross revenue of not less than $250 million and, depending on the context, meets certain thresholds related to the volume of covered data processed. These thresholds include handling the covered data of more than 5 million individuals, 15 million portable connected devices identifying individuals, or 35 million connected devices that can be linked to individuals. Additionally, for handling sensitive covered data, the thresholds are more than 200,000 individuals, 300,000 portable connected devices, or 700,000 connected devices​​.
  2. Small Business Exemptions: The Act specifies exemptions for small businesses. A small business is defined based on its average annual gross revenues over the past three years not exceeding $40 million and not collecting, processing, retaining, or transferring the covered data of more than 200,000 individuals annually for purposes other than payment collection. Furthermore, all covered data for such purposes must be deleted or de-identified within 90 days unless retention is necessary for fraud investigations or consistent with a return or warranty policy. A small business also must not transfer covered data to a third party in exchange for revenue or other considerations​​.

A while back I worked on a data engineering project which was exposed to the European GDPR.  It was interesting because we had meetings with in-house counsel lawyers to discuss what kind of data policies they had in place.  One of the facets of GDPR which is similar here is the ‘right to remove data.’

We entered some gray areas: when talking with the lawyers, the debate was over exactly which data would be removed.  Removing customer data from a database or data lake is straightforward, but what if it is deeply nested in Amazon Glacier?

I don’t really have any great answers, but if this legislation actually does pan out, it makes a strong case for large companies to use OTFs for their data lakes; otherwise it would be extremely difficult to delete the data.

TLDR – if you are a solution architect, do ask clients what kind of data policy exposure they have.  If this legislation passes, pay attention when starting US-based projects to whether the final legislation applies to them.

Citation Link and Credit For Talking About This – Hard Fork Podcast

 

Everything Else

Glue: Observability

The AWS team recently put out a blog series on monitoring and debugging AWS Glue jobs using observability metrics.

https://aws.amazon.com/blogs/big-data/enhance-monitoring-and-debugging-for-aws-glue-jobs-using-new-job-observability-metrics-part-3-visualization-and-trend-analysis-using-amazon-quicksight/ (part 3)

DBT

  • The DBT team also released their 2024 state of analytics engineering (PDF here) –
  • TLDR, data quality is still of big concern
  • I’m surprised data mesh is still a thing, although it seems like it is only for big orgs according to the survey

AWS Exams:

AWS released a free training course on the Data Engineer Associate Exam

https://explore.skillbuilder.aws/learn/course/external/view/elearning/18546/exam-prep-standard-course-aws-certified-data-engineer-associate-dea-c01?trk=e6934e10-170d-4c94-bf7b-b88f95ed0f47&sc_channel=el

Also note the AWS Specialty Analytics and Database Specialty exams are being retired this month.

YADT (Yet Another Data Tool)
As if there weren’t enough tools on the market…

Devin:

Fortunately, uhh, I don’t think anyone on our team is named Devin, but this video has been making its rounds on the Internet as the first ‘AI software engineer’:

https://www.youtube.com/watch?v=fjHtjT7GO1c

Just remember, Devin hasn’t taken our jobs…. yet.

State of Data Engineering 2024 Q1

The current state of data engineering offers a plethora of options in the market, which can make selecting the right tool challenging.  We are approaching a period where the traditional boundaries between databases, datalakes, and data warehouses are overlapping.  As always, it is important to think about the business case first, then do the technology selection afterwards.

This diagram is simple, but merits some discussion.

Most companies with small or medium data can get away with simpler architectures: a standard database powering their business applications.  It is only when you get into big data and extremely large data that you want to start looking at more advanced platforms.

The Open Source Table Format Wars Revisited

A growing agreement is forming around the terminology used for Open Table Formats (OTF), also known as Open Source Table Formats (OSTF). These formats are particularly beneficial in scenarios involving big data or extremely large datasets, similar to those managed by companies like Uber and Netflix. Currently, there are three major contenders in the OTF space.

Platform | Link | Paid Provider
Apache Hudi | https://hudi.apache.org/ | https://onehouse.ai/
Apache Iceberg | https://iceberg.apache.org/ | https://tabular.io/
Databricks | https://docs.databricks.com/en/delta/index.html | Via hyperscaler

Several recent announcements from AWS lead me to believe more support for Apache Iceberg is coming to the AWS ecosystem.

AWS Glue Data Catalog now supports automatic compaction of Apache Iceberg tables

Every datalake eventually suffers from a small file problem.  What this means is that if you have too many files in a given S3 partition (aka file path), performance degrades substantially.  To alleviate this, compaction jobs are run to merge small files into bigger ones.  In managed paid platforms, this is done automatically for you, but on the open source platforms, developers are on the hook to do this themselves.
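The idea behind compaction can be sketched with a greedy planner that groups small files into batches near a target output size.  This is a toy of my own; real OTF compaction services (Hudi, Iceberg, Delta) handle sorting, deletes, and concurrency on top of this:

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedy bin-packing sketch of a small-file compaction job:
    group files into batches whose combined size stays near a
    target output file size."""
    batches, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb):
        if current and current_size + size > target_mb:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches

batches = plan_compaction([5, 10, 60, 70, 100], target_mb=128)
```

Each batch would become one rewritten output file, turning five small objects into three right-sized ones and cutting the per-file overhead on reads.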

I was surprised to read that if you use Apache Iceberg tables, developers no longer have to deal with compaction jobs.  Now to the second announcement:

Amazon Redshift announces general availability of support for Apache Iceberg 

If you are using Amazon Redshift, you can do federated queries without needing to go through the hassle of manually mounting data catalogs.

In this video, you can watch Amazon talk about Iceberg explicitly in their AWS storage session from re:Invent.

This generally leads me to believe that Apache Iceberg probably will be more integrated into the Amazon ecosystem in the near future.

Apache Hudi

Apache Hudi recently released version 0.14.0 which has some major changes such as Record Level Indexing

https://hudi.apache.org/releases/release-0.14.0/

https://aws.amazon.com/blogs/big-data/simplify-operational-data-processing-in-data-lakes-using-aws-glue-and-apache-hudi/

One Table 
 
Another kind of weird development, announced right before re:Invent, was OneTable.

https://onetable.dev/

Microsoft, the Hudi team, and the Databricks team got together to create a new standard that serves as an abstraction layer on top of an OTF.  This is odd to me, because not too many organizations have these data stacks concurrently deployed.

However probably in the next couple years as abstraction layers get created on top of OTFs, this will be something to watch.

Amazon S3 Express One Zone Storage Class

Probably one of the most important, but buried, pieces of news from re:Invent was the announcement of Amazon S3 Express One Zone.

https://aws.amazon.com/s3/storage-classes/express-one-zone/

With this, we can now have single-digit-millisecond access to data in S3, which leads to a weird question of datalakes encroaching onto database territory if they can now meet higher SLAs.  However, there are some caveats: region availability is limited, and it is a single zone, so think about your disaster recovery requirements.  This is one feature I would definitely watch.

Zero ETL Trends

Zero ETL is behind-the-scenes replication from Aurora, RDS, and DynamoDB into Redshift.  If you have a use case where Slowly Changing Dimensions (SCD) Type 1 is acceptable, these are all worth a look.  From my understanding, when replication occurs, there is no connection penalty on your Redshift cluster.

https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-aurora-postgresql-zero-etl-integration-redshift-public-preview/

https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-rds-mysql-zero-etl-integration-amazon-redshift-public-preview/

https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-dynamodb-zero-etl-integration-redshift/ 
 
Amazon OpenSearch Service zero-ETL integration with Amazon S3 preview now available 

https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-opensearch-zero-etl-integration-s3-preview/ 
 
AWS announces Amazon DynamoDB zero-ETL integration with Amazon OpenSearch Service 

https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-dynamodb-zero-etl-integration-amazon-opensearch-service/ 
 
AWS CloudTrail Lake data now available for zero-ETL analysis in Amazon Athena 

https://aws.amazon.com/about-aws/whats-new/2023/11/aws-cloudtrail-lake-zero-etl-anlysis-athena/

Spark/Glue/EMR Announcements

Glue Serverless Spark UI 

Now it is way easier to debug Glue jobs, as the Spark UI no longer has to be manually provisioned.

https://aws.amazon.com/blogs/big-data/introducing-aws-glue-studio-serverless-spark-ui-for-better-monitoring-and-troubleshooting/

Glue native connectors: Teradata, SAP HANA, Azure SQL, Azure Cosmos DB, Vertica, and MongoDB 

https://aws.amazon.com/about-aws/whats-new/2023/11/aws-glue-launches-native-connectivity-6-databases/

 
AWS Glue announces entity-level actions to manage sensitive data 
https://aws.amazon.com/about-aws/whats-new/2023/11/aws-glue-entity-level-actions-sensitive-data/ 

 
 
Glue now supports Gitlab and Bitbucket 

https://aws.amazon.com/about-aws/whats-new/2023/10/aws-glue-gitlab-bitbucket-git-integration-feature/

 
Trusted identity propagation 

Propagate OAuth 2.0 credentials to EMR.
 

https://docs.aws.amazon.com/singlesignon/latest/userguide/trustedidentitypropagation-overview.html

Databases

Announcing Amazon Aurora Limitless Database 

https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-aurora-limitless-database/

Conclusion

It is exciting to see the OTF ecosystem evolve. Apache Hudi is still a great and mature option, with Apache Iceberg now being more integrated with the AWS ecosystem.

Zero ETL has the potential to save your organization a ton of time if your data sources are supported by it.

Something to consider is that major shifts in data engineering occur every couple of months, so keep an eye on new developments, as they can have profound impacts on enterprise data strategies and operations.

State of Data Engineering 2023 Q3

As we roll towards the end of the year, data engineering has seen some changes as expected, but now everyone wants to see how Generative AI intersects with everything.  The fit is not completely natural, as Generative AI like ChatGPT is more of an NLP-type system, but there are a few interesting cases to keep an eye on.  Apache Iceberg is also one to watch now that there is more first-class Amazon integration.

Retrieval Augmented Generation (RAG) Pattern

One of the major use cases for data engineers to understand in Generative AI is the retrieval augmented generation (RAG) pattern.

There are quite a few articles on the web articulating this.

What is important to realize is that Generative AI only provides the lightweight wrapper interface to your system.  The RAG paradigm was created to help address context limitations: you vectorize your document repository, use some type of nearest-neighbors algorithm to find the relevant data, and pass it back to a foundation model.  Perhaps LLMs with newer, larger context windows (like 100k tokens) may address these problems.
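The nearest-neighbors retrieval step can be sketched in a few lines.  The 3-d vectors below are toy stand-ins for real embeddings, and the function names are my own; production systems use an embedding model and a vector store:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, doc_vecs, k=1):
    """Nearest-neighbor step of RAG: rank document-chunk embeddings
    by similarity to the query embedding and return the top-k chunk
    ids to stuff into the foundation model's context window."""
    ranked = sorted(doc_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [doc for doc, _ in ranked[:k]]

docs = {"refund policy": [0.9, 0.1, 0.0],
        "shipping times": [0.1, 0.9, 0.0]}
top = retrieve([0.8, 0.2, 0.0], docs, k=1)
```

The retrieved chunk text (not shown here) is what gets prepended to the user's question before the foundation model is called.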

At the end of the day, data engineers will be tasked more with chunking and vectorizing back-end systems, and debates will probably emerge in your organization over whether to roll your own solution or just use a SaaS to do it quickly.

Generative AI for Data Engineering?

One of the core problems with generative AI is eventually it will start hallucinating. I played around with asking ChatGPT to convert CSV to JSON, and it worked for about the first 5 prompts, but by the 6th prompt, it started to make up JSON fields which never existed. 
 
One thing I envision in the future is the ability to use LLMs to stitch together parts of data pipelines for data mapping and processing.  At the moment, though, hallucination makes this impractical.
 
There is some interesting research occurring where a team has combined a finite state machine (FSM) with LLMs to create deterministic JSON output.  That might not seem like a big deal, but if we can get deterministic outcomes from data generation, it will be worth a look:

https://github.com/normal-computing/outlines
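A lightweight alternative while constrained decoding matures is to validate the model's output after the fact: parse the JSON and reject any record containing fields outside the expected schema, catching exactly the made-up-field failure mode described above.  The function is my own sketch:

```python
import json

def validate_llm_json(raw, allowed_fields):
    """Guard against hallucinated fields: parse the model's JSON
    array and raise if any record contains keys outside the expected
    schema.  A post-hoc check, not FSM-constrained generation."""
    records = json.loads(raw)
    for rec in records:
        extra = set(rec) - set(allowed_fields)
        if extra:
            raise ValueError(f"hallucinated fields: {sorted(extra)}")
    return records

good = validate_llm_json('[{"id": 1, "name": "a"}]', {"id", "name"})
```

Rejected batches can be retried or routed to a dead-letter queue, which is usually good enough for pipeline work even without deterministic generation.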

So far, the use cases we see day to day are:

  1. Engineers using LLMs to help create SQL or Spark code scaffolds
  2. Creation of synthetic data – pass in a schema and ask an LLM to generate a data set for you to test
  3. Conversion of one schema to another, schema-ish.  This kind of works, but buyer beware

Apache Iceberg

Last year our organization did a proof of concept with Apache Iceberg, but one of the core problems was that Athena and Glue didn’t have any native support, so it was difficult to do anything.

However, on July 19, 2023, AWS quietly released Apache Iceberg & Athena integration into production.

Since then, AWS has finally started to treat Iceberg as a first class product with their documentation and resources 

Something to keep track of is that the team which founded Apache Iceberg founded a company called tabular.io, which provides hosted compute for Apache Iceberg workloads.  Their model is pretty interesting: you give Tabular access to your S3 buckets, and they deal with ingestion, processing, and file compaction for you.  They can even point at DMS CDC logs, create SCD Type 1 tables, and query SCD Type 2 via time travel in a couple of clicks, which is pretty fancy to me.

However if you choose to roll things out yourself, expect to handle engineering efforts similar to this

https://tabular.io/blog/cdc-merge-pattern/

The Open Source Table Format Wars Continue

One of the core criticisms of traditional datalakes is the difficulty of performing updates or deletes against them.  With that, we have three major players in the market for transactional datalakes.

Platform | Link | Paid Provider
Databricks | https://docs.databricks.com/en/delta/index.html | Via hyperscaler
Apache Hudi | https://hudi.apache.org/ | https://onehouse.ai/
Apache Iceberg | https://iceberg.apache.org/ | https://tabular.io/

What’s the difference between these three, you say?  Well, 70% of the major features are similar, but there are some divergent ones:

https://www.onehouse.ai/blog/apache-hudi-vs-delta-lake-vs-apache-iceberg-lakehouse-feature-comparison

Also don’t even consider AWS Governed Tables and focus on the top 3 if you have these use cases.

Redshift Serverless Updates 

There has been another major silent update: Redshift Serverless now requires only 8 RPUs to provision a cluster.  Before, it was 32 RPUs, which was a ridiculously high number.

Redshift Serverless cost (monthly): 8 RPUs × 12 hours × 0.36 USD × 30.5 days in a month = 1,054.08 USD

Ra3.xlplus provisioned, 1 node (monthly): 792.78 USD

So as you can see, provisioned is still cheaper, but look into Serverless if:

  • You know your cluster will be idle around 50% of the time
  • You don’t want to deal with the management headaches
  • You don’t need a public endpoint
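The serverless estimate above is straightforward to reproduce; the RPU-hour price is the figure quoted in this post, not a current quote:

```python
def monthly_serverless_cost(rpus, hours_per_day,
                            usd_per_rpu_hour=0.36, days=30.5):
    """Reproduce the Redshift Serverless monthly estimate:
    RPUs x active hours/day x price per RPU-hour x days."""
    return rpus * hours_per_day * usd_per_rpu_hour * days

cost = monthly_serverless_cost(8, 12)  # the 8-RPU, 12-hours-a-day case
```

At the old 32-RPU minimum the same workload would have cost four times as much, which is why this quiet change matters.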
 

DBT

Data Build Tool (dbt) has really been gaining a lot of popularity at the moment.  It is kind of weird watching this pendulum swing back and forth: many years ago we had these super big SQL scripts running on data warehouses, that went out of fashion, and now here we are again.
 

A super interesting thing that got released is a dbt-glue adapter. 

https://pypi.org/project/dbt-glue/

https://aws.amazon.com/blogs/big-data/build-your-data-pipeline-in-your-aws-modern-data-platform-using-aws-lake-formation-aws-glue-and-dbt-core/

 
That means you can now run dbt SQL processing on AWS Glue.

For those new to dbt feel free to check this out 

https://dbt.picturatechnica.com/

https://corpinfollc.atlassian.net/wiki/spaces/APS/pages/119138643968195/DBT+ETL+getdbt.com

Glue Docker Image

A kind of weird thing: I recently saw the ability to launch Glue as a local Docker image.  I haven’t personally tried it, but it is interesting.

https://aws.amazon.com/blogs/big-data/develop-and-test-aws-glue-version-3-0-jobs-locally-using-a-docker-container/

https://github.com/juanAmayaRamirez/docker_compose_glue4.0

https://aws.amazon.com/blogs/big-data/developing-aws-glue-etl-jobs-locally-using-a-container/

Zero ETL Intrigue

This is kind of an old feature, but Amazon rolled out, in preview, Zero ETL replication from Aurora MySQL 8.x to Redshift.

https://aws.amazon.com/about-aws/whats-new/2023/06/amazon-aurora-mysql-zero-etl-integration-redshift-public-preview/

This is pretty intriguing: SCD Type 1 views can be replicated without any of the work of pushing data through a datalake.  However, it is still in preview, so I can’t recommend it until it goes into general release.

State of Data Engineering 2023 Q2

When looking at data engineering for your projects, it is important to think about market segmentation.  In particular, you might think about it in four segments:

  • Small Data
  • Medium Data
  • Big Data
  • Lots and Lots of Data


Small Data – This refers to scenarios where companies have data problems (organization, modeling, normalization, etc), but don’t necessarily generate a ton of data. When you don’t have a lot of data, different tool sets are in use ranging from low code tools to simpler storage mechanisms like SQL databases.

 
Low Code Tools 

The market is saturated with low code tools, with an estimated 80-100 products available. Whether low code tools work for you depends on your use case. If your teams lack a strong engineering capacity, it makes sense to use a tool to help accomplish ETL tasks.

However, problems arise when customers need to do something outside the scope of the tool.

Medium Data– This refers to customers who have more data, making it sensible to leverage more powerful tools like Spark. There are several ways to solve the problem with data lakes, data warehouses, ETL, or reverse ETL.

Big Data – This is similar to medium data, but introduces the concepts of incremental ETL (aka transactional data lakes or lake houses). Customers in this space tend to have data in the hundreds of gigabytes to terabytes.

Transactional data lakes are essential because incremental ETL is challenging. For example, consider an Uber ride to the airport that costs $30. Later, you give a $5 tip, and now your trip costs $35. In a traditional database, you can run some ETL to update the record. However, Uber has tons of transactions worldwide, and they need a different way of dealing with the problem.
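The Uber tip example boils down to a keyed upsert, which is the core operation incremental ETL has to make cheap at scale.  A minimal in-memory sketch (OTFs do this as a MERGE over files, not Python dicts):

```python
def upsert(table, records, key="trip_id"):
    """Merge-style upsert: incoming records overwrite rows with
    matching keys instead of rewriting the whole dataset.  This is
    the essence of what OTF MERGE operations do over files."""
    index = {row[key]: row for row in table}
    for rec in records:
        # Merge the new fields over any existing row for this key.
        index[rec[key]] = {**index.get(rec[key], {}), **rec}
    return list(index.values())

trips = [{"trip_id": 1, "fare": 30}]
trips = upsert(trips, [{"trip_id": 1, "fare": 35}])  # the $5 tip lands
```

Doing this against billions of rows sitting in object storage, without rewriting every file, is exactly the problem Hudi, Iceberg, and Deltalake exist to solve.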

Introducing transactional data lakes requires more operational overhead, which should be taken into consideration.

Lots and Lots of Data – Customers in this space generate terabytes or petabytes of data a day. For example, Walmart creates 10 PB of data (!) a day.

https://medium.com/walmartglobaltech/lakehouse-at-fortune-1-scale-480bcb10391b

When customers are in this space, transactional data lakes with Apache Hudi, Apache Iceberg, and Databricks Deltalake are the main tools used.

Conclusion

The data space is large and crowded. With the small and lots of data sizes, the market segment is clear. However, the mid-market data space will probably take some time for winners to emerge.

Data Engineering Low Code Tools

In the data engineering space we have seen quite a few low code and no code tools pass through our radar. Low code tools have their own nuances as you will get to operationalize quicker, but the minute you need to customize something outside of the toolbox, you may run into problems. That’s when we usually deploy our custom development using things like Glue, EMR, or even transactional datalakes depending on your requirements.

This list is split into open source, ELT (reverse ETL), streaming, popular tools, and the rest of the tools. In the space, one thing I have been looking for is a first class open source product. I know that many of these products start as open source and end up releasing a managed version of the product. Personally of course I am all in for open source teams to make back their money somehow, but it would be ideal to have the platforms still contain an open source license.

One thing my team has been noticing is the traction dbt has been gaining in the market. It flips the traditional paradigm to ELT (Extract, Load, Transform), where everything is loaded into your data warehouse first and then transformed there.
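
The ELT pattern is easy to sketch: land the raw rows untouched, then run the transformation as SQL inside the warehouse. A hypothetical illustration using sqlite3 as a stand-in warehouse (the table and column names are made up; a dbt model is essentially the `SELECT` statement below, materialized as a table or view):

```python
import sqlite3

con = sqlite3.connect(":memory:")  # stand-in for the warehouse

# Extract + Load: raw data lands as-is, no upstream transformation.
con.execute("CREATE TABLE raw_orders (id INTEGER, amount REAL, status TEXT)")
con.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 30.0, "complete"), (2, 12.5, "cancelled"), (3, 7.5, "complete")],
)

# Transform: the model runs inside the warehouse, on already-loaded data.
con.execute(
    "CREATE TABLE fct_revenue AS "
    "SELECT SUM(amount) AS revenue FROM raw_orders WHERE status = 'complete'"
)
print(con.execute("SELECT revenue FROM fct_revenue").fetchone()[0])  # 37.5
```

What dbt adds on top of this core idea is version control, testing, lineage, and documentation around those SQL models.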

Another project I have been watching, on Zach Wilson's recommendation, is mage.ai. It is a pretty spiffy way of creating quick DAGs with executable Python notebooks. The team is pretty active in soliciting feedback on Slack, and it is one to watch for the future. Airbyte and Meltano are newer to me, and I hope to take some time to play with those tools. This list is by no means exhaustive, but let me know if there is anything I have missed.

Open Source Tools

Product: Airbyte
Description: Airbyte is an open-source data integration platform that allows users to replicate data from various sources and load it into different destinations. Its features include real-time data sync, robust data transformations, and automatic schema migrations.
Link: https://airbyte.io/
Github Link: https://github.com/airbytehq/airbyte
Cost: Free, with paid plans available
Release Date: 2020
Number of Employees: 11-50

Product: mage.ai
Description: mage.ai is an open-source data pipeline tool for integrating and transforming data. It lets engineers build, run, and orchestrate DAGs from interactive Python, SQL, and R notebooks.
Link: https://mage.ai/
Github Link: https://github.com/mage-ai
Cost: Open source
Release Date: 2020
Number of Employees: 11-50

Product: Meltano
Description: Meltano is an open-source data integration tool that allows users to build, run, and manage data pipelines using YAML configuration files. Its features include source and destination connectors, transformations, and orchestration.
Link: https://meltano.com/
Github Link: https://github.com/meltano/meltano
Cost: Free, with paid options available
Release Date: 2020
Number of Employees: 11-50

Product: Apache Nifi
Description: Apache Nifi is a web-based dataflow system that allows users to automate the flow of data between systems. Its features include a drag-and-drop user interface, data provenance, and support for various data sources and destinations.
Link: https://nifi.apache.org/
Github Link: https://github.com/apache/nifi
Cost: Free
Release Date: 2014
Number of Employees: N/A

Product: Apache Beam
Description: Apache Beam is an open-source, unified programming model for batch and streaming data processing. It provides a simple, portable API for defining and executing data processing pipelines, with support for various execution engines.
Link: https://beam.apache.org/
Github Link: https://github.com/apache/beam
Cost: Free
Release Date: N/A
Number of Employees: N/A

ELT

Product: dbt (data build tool)
Description: dbt is an open-source data transformation and modeling tool that enables analysts and engineers to transform their data into actionable insights. It provides a simple, modular way to manage data transformation pipelines in SQL, with features such as version control, documentation generation, and testing.
Link: https://www.getdbt.com/
Github Link: https://github.com/dbt-labs/dbt
Cost: Free, with paid options available for enterprise features and support
Release Date: 2016
Number of Employees: 51-200

Streaming

Product: Confluent
Description: Confluent is a cloud-native event streaming platform based on Apache Kafka that enables organizations to process, analyze, and respond to data in real-time. It provides a unified platform for building event-driven applications, with features such as data integration, event processing, and management tools.
Link: https://www.confluent.io/
Github Link: https://github.com/confluentinc
Cost: Free, with paid options available for enterprise features and support
Release Date: 2014
Number of Employees: 1001-5000

Popular Tools

Product: Fivetran
Description: Fivetran is a cloud-based data integration platform that automates the process of data pipeline building and maintenance. It provides pre-built connectors for over 150 data sources and destinations, with features such as data synchronization, transformation, and monitoring.
Link: https://fivetran.com/
Github Link: https://github.com/fivetran
Cost: Subscription-based, with a free trial available
Release Date: 2012
Number of Employees: 501-1000

Product: Alteryx
Description: Alteryx is an end-to-end analytics platform that enables users to perform data blending, advanced analytics, and machine learning tasks. It provides a drag-and-drop interface for building and deploying analytics workflows, with features such as data profiling, data quality, and data governance.
Link: https://www.alteryx.com/
Github Link: https://github.com/alteryx
Cost: Subscription-based, with a free trial available
Release Date: 1997
Number of Employees: 1001-5000

Product: Informatica
Description: Informatica is a data management platform that enables users to integrate, manage, and govern data across various sources and destinations. It provides a unified platform for data integration, quality, and governance, with features such as data profiling, data masking, and data lineage.
Link: https://www.informatica.com/
Github Link: https://github.com/informatica
Cost: Subscription-based, with a free trial available
Release Date: 1993
Number of Employees: 5001-10,000

Product: Matillion
Description: Matillion is a cloud-native ETL platform that enables users to extract, transform, and load data into cloud data warehouses. It provides a visual interface for building and deploying ETL workflows, with features such as data transformation, data quality, and data orchestration.
Link: https://www.matillion.com/
Github Link: https://github.com/matillion
Cost: Subscription-based, with a free trial available
Release Date: 2011
Number of Employees: 501-1000

Orchestration Tools

Product: Prefect
Description: Prefect is a modern data workflow orchestration platform that enables users to automate their data pipelines with Python. It provides a simple, Pythonic interface for defining and executing workflows, with features such as distributed execution, versioning, and monitoring.
Link: https://www.prefect.io/
Github Link: https://github.com/PrefectHQ/prefect
Cost: Free, with paid options available for enterprise features and support
Release Date: 2018
Number of Employees: 51-200

Product: Dagster
Description: Dagster is a data orchestrator and data integration testing tool that enables users to build and deploy reliable data pipelines. It provides a Python-based API for defining and executing pipelines, with features such as type-checking, validation, and monitoring.
Link: https://dagster.io/
Github Link: https://github.com/dagster-io/dagster
Cost: Free, with paid options available for enterprise features and support
Release Date: 2019
Number of Employees: 11-50

Product: Airflow
Description: Airflow is an open-source platform for creating, scheduling, and monitoring data workflows. It provides a Python-based API for defining and executing workflows, with features such as task dependencies, retries, and alerts.
Link: https://airflow.apache.org/
Github Link: https://github.com/apache/airflow
Cost: Free
Release Date: 2015
Number of Employees: N/A (maintained by the Apache Software Foundation)

Product: Azkaban
Description: Azkaban is an open-source workflow manager that enables users to create and run workflows on Hadoop. It provides a web-based interface for creating and scheduling workflows, with features such as task dependencies, notifications, and retries.
Link: https://azkaban.github.io/
Github Link: https://github.com/azkaban/azkaban
Cost: Free
Release Date: 2010
Number of Employees: N/A (maintained by the Azkaban Project)

Product: Luigi
Description: Luigi is an open-source workflow management system that enables users to build complex pipelines of batch jobs. It provides a Python-based API for defining and executing workflows, with features such as task dependencies, retries, and notifications.
Link: https://github.com/spotify/luigi
Github Link: https://github.com/spotify/luigi
Cost: Free
Release Date: 2012
Number of Employees: N/A (maintained by Spotify)

Product: Oozie
Description: Oozie is a workflow scheduler system for managing Hadoop jobs. It provides a web-based interface for defining and scheduling workflows, with features such as task dependencies, triggers, and notifications.
Link: https://oozie.apache.org/
Github Link: https://github.com/apache/oozie
Cost: Free
Release Date: 2009
Number of Employees: N/A (maintained by the Apache Software Foundation)

Tools

3forge – https://3forge.com/ – 3forge delivers software tools for creating financial applications and data delivery platforms.

Ab Initio Software – https://www.abinitio.com/ – Ab Initio Software provides a data integration platform for building large-scale data processing applications.

Adeptia – https://adeptia.com/ – Adeptia offers a cloud-based, self-service integration solution that allows users to easily connect and automate data flows across multiple systems and applications.

Aera – https://www.aeratechnology.com/ – Aera provides an AI-powered platform for enterprises to accelerate their digital transformation by automating and optimizing business processes.

Aiven – https://aiven.io/ – Aiven offers managed cloud services for open-source technologies such as Kafka, Cassandra, and Elasticsearch.

Ascend.io – https://ascend.io/ – Ascend.io provides a unified data platform that allows users to build, scale, and automate data pipelines across various sources and destinations.

Astera Software – https://www.astera.com/ – Astera Software offers a suite of data integration and management tools for businesses of all sizes.

Black Tiger – https://blacktiger.io/ – Black Tiger provides an open-source data pipeline framework that simplifies the process of building and deploying data pipelines.

Bryte Systems – https://www.brytesystems.com/ – Bryte Systems offers an AI-powered data platform that helps organizations manage their data operations more efficiently.

CData Software – https://www.cdata.com/ – CData Software provides a suite of drivers and connectors for integrating with various data sources and APIs.

Census – https://www.getcensus.com/ – Census offers an automated data syncing platform that allows businesses to keep their customer data up-to-date across various systems and applications.

CloverDX – https://www.cloverdx.com/ – CloverDX provides a data integration platform for building and managing complex data transformations.

Data Virtuality – https://www.datavirtuality.com/ – Data Virtuality offers a data integration platform that allows users to connect and query data from various sources using SQL.

Datameer – https://www.datameer.com/ – Datameer provides a data preparation and exploration platform that enables users to analyze large datasets quickly and easily.

DBSync – https://www.mydbsync.com/ – DBSync provides a cloud-based data integration platform for connecting and synchronizing data across various systems and applications.

Denodo – https://www.denodo.com/ – Denodo provides a data virtualization platform that allows users to access and integrate data from various sources in real-time.

Devart – https://www.devart.com/ – Devart offers a suite of database tools and data connectivity solutions for various platforms and technologies.

DQLabs – https://dqlabs.ai/ – DQLabs provides a self-service data management platform that automates the process of discovering, curating, and governing data assets.

eQ Technologic – https://www.eqtechnologic.com/ – eQ Technologic offers a data integration platform that enables users to extract, transform, and load data from various sources.

Equalum – https://equalum.io/ – Equalum provides a real-time data ingestion and processing platform that enables organizations to make data-driven decisions faster.

Etleap – https://etleap.com/ – Etleap offers a cloud-based data integration platform that simplifies the process of building and managing data pipelines.

Etlworks – https://www.etlworks.com/ – Etlworks provides a data integration platform that allows users to create and manage complex data transformations.

Harbr – https://harbr.com/ – Harbr is a data exchange platform that connects and facilitates secure data collaboration between organizations.

HCL Technologies (Actian) – https://www.actian.com/ – Actian provides hybrid cloud data analytics software solutions that enable organizations to extract insights from big data and act on them in real time.

Hevo Data – https://hevodata.com/ – Hevo Data provides a cloud-based data integration platform that enables companies to move data from various sources to a data warehouse or other destination in real time.

Hitachi Vantara – https://www.hitachivantara.com/ – Hitachi Vantara provides data management, analytics, and storage solutions for businesses across various industries.

HULFT – https://www.hulft.com/ – HULFT provides data integration and management solutions that enable businesses to streamline data transfer and reduce data integration costs.

ibi – https://www.ibi.com/ – ibi provides data and analytics software solutions that help organizations make data-driven decisions.

Impetus Technologies – https://www.impetus.com/ – Impetus Technologies provides data engineering and analytics solutions that enable businesses to extract insights from big data.

Infoworks – https://www.infoworks.io/ – Infoworks provides a cloud-native data engineering platform that automates the process of data ingestion, transformation, and orchestration.

insightsoftware – https://insightsoftware.com/ – insightsoftware provides financial reporting and enterprise performance management software solutions that help organizations improve their financial and operational performance.

Integrate.io – https://www.integrate.io/ – Integrate.io provides a cloud-based data integration platform that enables businesses to integrate and manage data from various sources.

Intenda – https://intenda.net/ – Intenda provides a data integration and analytics platform that enables businesses to unlock insights from their data.

IRI – https://www.iri.com/ – IRI provides data management and integration software solutions that enable businesses to integrate and manage data from various sources.

Irion – https://www.irion-edm.com/ – Irion provides a data management and governance platform that enables businesses to automate data quality and compliance processes.

K2view – https://www.k2view.com/ – K2view provides a data fabric platform that enables businesses to connect and manage data across various sources and applications.

Komprise – https://www.komprise.com/ – Komprise provides an intelligent data management platform that enables businesses to manage and optimize data across various storage tiers.

Minitab – https://www.minitab.com/ – Minitab is a statistical software package designed for data analysis and quality improvement.

Nexla – https://www.nexla.com/ – Nexla offers a data operations platform that automates the process of ingesting, transforming, and delivering data to various systems and applications.

OpenText – https://www.opentext.com/ – OpenText is a Canadian company that provides enterprise information management software.

Palantir – https://www.palantir.com/ – Palantir is an American software company that specializes in data analysis.

Precisely – https://www.precisely.com/ – Precisely provides data integrity, data integration, and data quality software solutions.

Primeur – https://www.primeur.com/ – Primeur is an Italian software company that offers products and services for data integration, managed file transfer, and digital transformation.

Progress – https://www.progress.com/ – Progress is an American software company that provides products for application development, data integration, and business intelligence.

PurpleCube – https://www.purplecube.ca/ – PurpleCube is a Canadian consulting company that specializes in data integration, data warehousing, and business intelligence.

Push – https://www.push.tech/ – Push is a French software company that provides products and services for data processing and analysis.

Qlik – https://www.qlik.com/ – Qlik provides business intelligence software that helps organizations visualize and analyze their data.

RELX (Adaptris) – https://www.adaptris.com/ – Adaptris, now a RELX company, offers data integration software that helps organizations connect systems and applications.

Rivery – https://rivery.io/ – Rivery is a cloud-based data integration platform that allows businesses to consolidate, transform, and automate data.

Safe Software – https://www.safe.com/ – Safe Software provides spatial data integration and spatial data transformation software.

Semarchy – https://www.semarchy.com/ – Semarchy provides a master data management platform that helps organizations consolidate and manage their data.

Sesame Software – https://www.sesamesoftware.com/ – Sesame Software offers data management solutions that simplify data integration, data warehousing, and data analytics.

SnapLogic – https://www.snaplogic.com/ – SnapLogic provides a cloud-based integration platform that enables enterprises to connect cloud and on-premise applications and data.

Software AG – https://www.softwareag.com/ – Software AG offers a platform that enables enterprises to integrate and optimize their business processes and systems.

Stone Bond Technologies – https://www.stonebond.com/ – Stone Bond Technologies offers a platform that enables enterprises to integrate data from various sources and systems.

Stratio – https://www.stratio.com/ – Stratio offers a platform that enables enterprises to process and analyze large volumes of data in real-time.

StreamSets – https://streamsets.com/ – StreamSets offers a data operations platform that enables enterprises to ingest, transform, and move data across systems and applications.

Striim – https://www.striim.com/ – Striim offers a real-time data integration and streaming analytics platform that enables enterprises to collect, process, and analyze data in real-time.

Suadeo – https://www.suadeo.com/ – Suadeo provides a platform that enables enterprises to integrate and manage their data from various sources.

Syniti – https://www.syniti.com/ – Syniti offers a data management platform that enables enterprises to integrate, enrich, and govern their data.

Talend – https://www.talend.com/ – Talend provides a cloud-based data integration platform that enables enterprises to connect, cleanse, and transform their data.

Tengu – https://tengu.io/ – Tengu offers a data engineering platform that enables enterprises to automate the process of ingesting, processing, and delivering data.

ThoughtSpot – https://www.thoughtspot.com/ – ThoughtSpot offers a cloud-based platform that enables enterprises to analyze their data in real-time.

TIBCO Software – https://www.tibco.com/ – TIBCO Software offers a platform that enables enterprises to integrate and optimize their business processes and systems.

Tiger Technology – https://www.tiger-technology.com/ – Tiger Technology offers a platform that enables enterprises to manage, move, and share their data across systems and applications.

Timbr.ai – https://timbr.ai/ – Timbr.ai provides a platform that enables enterprises to manage and process their data in real-time.

Upsolver – https://www.upsolver.com/ – Upsolver offers a cloud-native data integration platform that enables enterprises to process and analyze their data in real-time.

WANdisco – https://wandisco.com/ – WANdisco offers a platform that enables enterprises to replicate and migrate their data across hybrid and multi-cloud environments.

ZAP – https://www.zapbi.com/ – ZAP offers a data management platform that enables enterprises to integrate, visualize, and analyze their data.

Domo – https://www.domo.com/ – Domo is a cloud-native platform that gives data-driven teams real-time visibility into all the data and insights needed to drive business forward.

Dell Boomi – https://boomi.com/ – Dell Boomi is a Dell-acquired business unit that specializes in cloud-based integration, API management, and master data management.

Stitch – https://www.stitchdata.com/ – Stitch is a cloud-first, open-source platform for rapidly moving data. It allows users to integrate with over 100 data sources and automate data movement to a cloud data warehouse.

Sparkflows – https://sparkflows.io/ – Sparkflows is a low-code, drag-and-drop platform that enables organizations to build, deploy, and manage Big Data applications on Apache Spark.

Liquibase – https://www.liquibase.com/ – Liquibase is an open-source database-independent library for tracking, managing, and applying database schema changes.

Shipyard – https://shipyardapp.com/ – Shipyard is a cloud-based data orchestration platform that makes it easy to build, deploy, and monitor data workflows.

Flyway – https://flywaydb.org/ – Flyway is an open-source database migration tool that allows developers to evolve their database schema easily and reliably across different environments.
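
Liquibase and Flyway both revolve around the same mechanism: a history table records which versioned migrations have been applied, so re-running the tool only applies what is pending. A simplified sketch of that idea against sqlite3 (table and migration contents are made up; real tools add checksums, locking, and rollback):

```python
import sqlite3

# Ordered, versioned migrations -- analogous to Flyway's V1__, V2__ scripts.
migrations = [
    (1, "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)"),
    (2, "ALTER TABLE users ADD COLUMN email TEXT"),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE IF NOT EXISTS schema_history (version INTEGER PRIMARY KEY)")

def migrate(con, migrations):
    """Apply only the migrations whose version is not yet recorded."""
    applied = {v for (v,) in con.execute("SELECT version FROM schema_history")}
    for version, sql in sorted(migrations):
        if version not in applied:
            con.execute(sql)
            con.execute("INSERT INTO schema_history VALUES (?)", (version,))

migrate(con, migrations)
migrate(con, migrations)  # idempotent: already-applied versions are skipped
print(con.execute("SELECT COUNT(*) FROM schema_history").fetchone()[0])  # 2
```

That history table is why these tools can safely run in every environment on every deploy.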

Software Estimations Using Reference Class Forecasting

18 years ago, I'm sitting in my cubicle doing Java programming, and my tech lead comes up to chat about my next project. We discuss the details, and then she asks the dreaded question programmers fear: "how long will it take?" I stumble through a guesstimate based on my limited experience, and she goes along her merry way and plugs the number into a Gantt chart.

Even with the emergence of the Agile Manifesto, and the current paradigm of planning projects in 1-2 week sprints, businesses and customers still ask technologists how long a project will take.

The unfortunate thing about agile is that even though it is an ideal way to run a project, financial models rarely follow the methodology. Most statements of work are written with a time estimate for the project. There are exceptions where some customers pay for work two weeks at a time, but that is pretty rare.

Throughout my technical career, I have rarely seen any formalized software estimation model gain widespread use, so I was surprised to find a discussion of software project estimation while reading How Big Things Get Done. The opening chapters cover the challenges and successes of large architectural projects, from the Sydney Opera House (a problematic project) to the Guggenheim Bilbao (amazingly under budget).

The book proposes using reference class forecasting, which asks you to:

  1. Gather the actual durations of similar projects your organization has performed in the past
  2. Take the mean value
  3. Use that mean as an anchor for your current project

For example, if I were doing an application modernization from Hadoop to EMR and had no idea how long it would take, I would look for references to other projects of similar complexity. Let's say I had data on 10 previous projects and the mean came out to 6 months. Then 6 months would be my anchor point.
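
The anchoring calculation itself is trivially small; the durations below are invented for illustration:

```python
from statistics import mean

# Actual durations (months) of 10 similar past projects in the organization.
past_durations_months = [5, 7, 6, 8, 4, 6, 7, 5, 6, 6]

# Reference class forecasting: the mean of the reference class is the anchor.
anchor = mean(past_durations_months)
print(anchor)  # 6
```

The value of the method is not the arithmetic but the discipline of estimating from outcomes rather than from optimism.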

The book immediately points out that the biggest problem isn't the approach itself; it is obtaining the historical data on how long previous projects actually took. Think about it this way: out of all the projects you have ever estimated, have you ever compared the actuals to your forecast? I bet most of us haven't done these retros at all.

Some takeaways for me:

  1. If you are in a large organization and have done multiple projects, take the time to do retros on them and record in a spreadsheet what each project was, its tasks, its complexity, and the actual time it took to finish. Large companies hold this valuable data but rarely go through the exercise of capturing it. With it, some rudimentary reference class forecasting can start to replace subjective software estimates.
  2. If you are a small organization, or simply have no history of projects to use as a reference point, then unfortunately I think you are out of luck.

At the end of the day, I think the industry needs to get better at software estimation, and the only way is to develop some type of methodology and refine it over time.