State of Data Engineering 2023 Q3

As we roll toward the end of the year, data engineering is changing as expected, but now everyone wants to see how generative AI intersects with everything. The fit is not completely natural, since generative AI systems like ChatGPT are more NLP-style systems, but there are a few interesting cases to keep an eye on. Apache Iceberg is also one to watch now that there is first-class Amazon integration.

Retrieval Augmented Generation (RAG) Pattern

One of the major use cases for data engineers to understand in generative AI is the retrieval augmented generation (RAG) pattern.

There are quite a few articles on the web articulating this pattern.

What is important to realize is that generative AI provides only a lightweight wrapper interface over your system. The RAG paradigm was created to work around context-window limitations: you vectorize your document repository, use some type of nearest-neighbors search to find the relevant data, and pass it back to a foundation model as context. Perhaps LLMs with newer, larger context windows (like 100k tokens) will eventually address these limitations.

At the end of the day, data engineers will be tasked with chunking and vectorizing back-end systems, and debates will probably emerge in your organization over whether to roll your own solution or use a SaaS to do it quickly.
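Conceptually, the retrieval half of the pattern is small. Below is a toy sketch: the `embed` function here is a crude character-frequency stand-in for a real embedding model (an assumption for illustration only), and cosine similarity plays the nearest-neighbors role.

```python
import math

def embed(text):
    # Stub embedding: character-frequency vector over a-z.
    # A real system would call an embedding model instead.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    # Rank document chunks by similarity to the query and return
    # the top-k to pass to the foundation model as context.
    qv = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(qv, embed(c)), reverse=True)
    return ranked[:k]

chunks = [
    "invoices are stored in the billing schema",
    "the marketing team owns campaign data",
    "billing invoices are archived after 90 days",
]
context = retrieve("where are invoices stored?", chunks)
prompt = "Answer using this context:\n" + "\n".join(context)
```

In production the chunks would live in a vector store and the embedding calls would dominate the cost, but the shape of the pipeline is the same.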

Generative AI for Data Engineering?

One of the core problems with generative AI is that it will eventually start hallucinating. I played around with asking ChatGPT to convert CSV to JSON; it worked for the first five prompts or so, but by the sixth prompt it started to make up JSON fields that never existed.
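For a task this mechanical, a few lines of standard-library code are deterministic where an LLM is not, a point worth remembering before reaching for a model:

```python
import csv, io, json

def csv_to_json(csv_text):
    # Deterministic CSV-to-JSON: the output fields are exactly the
    # CSV header columns, so nothing can be "made up".
    reader = csv.DictReader(io.StringIO(csv_text))
    return json.dumps(list(reader))

csv_text = "id,name\n1,alice\n2,bob\n"
print(csv_to_json(csv_text))
# → [{"id": "1", "name": "alice"}, {"id": "2", "name": "bob"}]
```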
In the future, I envision using LLMs to stitch together parts of data pipelines for data mapping and processing. At the moment, hallucination makes that impractical.
There is some interesting research occurring where a team has paired a finite state machine (FSM) with LLMs to produce deterministic JSON output. That might not seem like a big deal, but if we can guarantee deterministic structure in generated data, it becomes much more interesting for pipelines.
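The core idea can be illustrated without an LLM at all: let the schema drive the structure and only ask the model for leaf values, so the output is valid and schema-shaped by construction. In this sketch, `fake_model` is a hypothetical stand-in for a real LLM call:

```python
import json

def fake_model(field_name, field_type):
    # Stand-in for an LLM: produces some value for a leaf field.
    return 0 if field_type == "int" else f"<{field_name}>"

def generate(schema, model=fake_model):
    # The schema (not the model) decides which keys exist and in what
    # shape -- analogous to constraining decoding with an FSM.
    out = {}
    for key, typ in schema.items():
        if isinstance(typ, dict):
            out[key] = generate(typ, model)
        else:
            out[key] = model(key, typ)
    return out

schema = {"user": {"id": "int", "name": "str"}, "active": "str"}
doc = generate(schema)
print(json.dumps(doc))  # always parses, always matches the schema's keys
```

Real constrained-decoding systems apply the same constraint token by token inside the model's sampler, but the guarantee is the same: the structure cannot drift.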

So far, the use cases we see day to day are:

1. Engineers using LLMs to help create SQL or Spark code scaffolds

2. Creation of synthetic data: pass in a schema and ask an LLM to generate a test data set for you

3. Conversion of one schema to another, sort of. This kind of works, but buyer beware
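For simple schemas, the synthetic-data use case doesn't strictly need an LLM either. A hedged sketch, generating test rows from a made-up column-to-type map:

```python
import random
import string

def synth_rows(schema, n, seed=42):
    # Generate n fake rows from a {column: type} schema.
    # The type names here are illustrative, not from any particular tool.
    rng = random.Random(seed)  # seeded for reproducible test data
    rows = []
    for _ in range(n):
        row = {}
        for col, typ in schema.items():
            if typ == "int":
                row[col] = rng.randint(0, 1000)
            elif typ == "float":
                row[col] = round(rng.uniform(0, 100), 2)
            else:  # fall back to a short random string
                row[col] = "".join(rng.choices(string.ascii_lowercase, k=8))
        rows.append(row)
    return rows

rows = synth_rows({"order_id": "int", "amount": "float", "sku": "str"}, 5)
```

Where an LLM does add value is realistic-looking values (names, addresses, plausible correlations), which a random generator like this won't give you.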

Apache Iceberg

Last year our organization did a proof of concept with Apache Iceberg, but one of the core problems was that Athena and Glue didn't have any native support, so it was difficult to do much.

However, on July 19, 2023, AWS quietly released a production integration between Apache Iceberg and Athena.

Since then, AWS has finally started to treat Iceberg as a first-class product in its documentation and resources.

Something to keep track of: the team that founded Apache Iceberg went on to found a company called Tabular, which provides hosted compute for Apache Iceberg workloads. Their model is pretty interesting: you give Tabular access to your S3 buckets, and they deal with ingestion, processing, and file compaction for you. They can even point at DMS CDC logs, create SCD Type 1 tables, and query SCD Type 2 via time travel, all in a couple of clicks, which is pretty fancy to me.

However, if you choose to roll things out yourself, expect to take on a comparable engineering effort in-house.

The Open Source Table Format Wars Continue

One of the core criticisms of traditional data lakes is the difficulty of performing updates or deletes against them. With that, we have three major players in the market for transactional data lakes.

| Platform | Link | Paid Provider |
| --- | --- | --- |
| Databricks Delta Lake | | Via hyperscaler |
| Apache Hudi | https://hudi.apache | |
| Apache Iceberg | | |

What's the difference between these three, you ask? About 70% of the major features are similar, but some features diverge.

Also, don't even consider AWS Governed Tables; focus on the top three if you have these use cases.

Redshift Serverless Updates 

There has been another major silent update: Redshift Serverless now requires only 8 RPUs to provision a workgroup. Previously the minimum was 32 RPUs, which was a ridiculously high number.

8 RPUs × 12 hours × 0.36 USD × 30.5 days in a month = 1,054.08 USD

Redshift Serverless cost (monthly): 1,054.08 USD

ra3.xlplus provisioned, 1 node (monthly): 792.78 USD
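The arithmetic behind the comparison can be sanity-checked in a few lines. The ra3.xlplus hourly rate of roughly 1.086 USD and the 730-hour month are my assumptions to reproduce the quoted figure; actual pricing varies by region:

```python
# Redshift Serverless: billed per RPU-hour while the workgroup is active.
rpus = 8
hours_per_day = 12
usd_per_rpu_hour = 0.36
days = 30.5

serverless_monthly = rpus * hours_per_day * usd_per_rpu_hour * days
print(f"{serverless_monthly:.2f}")  # → 1054.08

# Provisioned ra3.xlplus, 1 node, running all month. Assumes ~1.086
# USD/hour on-demand and a 730-hour month (check current pricing).
provisioned_monthly = 1.086 * 730
print(f"{provisioned_monthly:.2f}")  # → 792.78
```

Note the asymmetry in the comparison: the Serverless figure assumes the workgroup is active only 12 hours a day, while the provisioned node runs around the clock.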

So as you can see, provisioned is still cheaper, but look into Serverless if:

  • You know the cluster will sit idle at least 50% of the time

  • You don't want to deal with the management headaches

  • You don't need a public endpoint


dbt

dbt (data build tool) has really been gaining popularity at the moment. It is kind of funny watching this pendulum swing back and forth: many years ago we had these super big SQL scripts running on data warehouses, that went out of fashion, and now here we are again.

A super interesting thing that got released is a dbt-glue adapter.

That means you can now run dbt SQL processing on AWS Glue.

For those new to dbt, the official dbt documentation is a good place to start.

Glue Docker Image

A kind of weird thing: I recently saw that you can launch Glue as a local Docker image. I haven't personally tried it, but it is interesting.

Zero ETL Intrigue

This is kind of an old feature, but Amazon rolled out in preview a zero-ETL integration from MySQL 8.x (Aurora) to Redshift.

This is pretty intriguing: SCD Type 1 views can be replicated without any of the work of pushing data through a data lake. However, it is still in preview, so I can't recommend it until it reaches general availability.

State of Data Engineering 2023 Q2

When looking at data engineering for your projects, it is important to think about market segmentation. In particular, you might be able to think about it in four segments

  • Small Data
  • Medium Data
  • Big Data
  • Lots and Lots of Data

Small Data – This refers to scenarios where companies have data problems (organization, modeling, normalization, etc), but don’t necessarily generate a ton of data. When you don’t have a lot of data, different tool sets are in use ranging from low code tools to simpler storage mechanisms like SQL databases.

Low Code Tools 

The market is saturated with low code tools, with an estimated 80-100 products available. Whether low code tools work for you depends on your use case. If your teams lack a strong engineering capacity, it makes sense to use a tool to help accomplish ETL tasks.

However, problems arise when customers need to do something outside the scope of the tool.

Medium Data – This refers to customers who have more data, making it sensible to leverage more powerful tools like Spark. There are several ways to solve the problem: data lakes, data warehouses, ETL, or reverse ETL.

Big Data – This is similar to medium data, but introduces the concepts of incremental ETL (aka transactional data lakes or lakehouses). Customers in this space tend to have data in the hundreds of gigabytes to terabytes.

Transactional data lakes are essential because incremental ETL is challenging. For example, consider an Uber ride to the airport that costs $30. Later, you give a $5 tip, and now your trip costs $35. In a traditional database, you can simply run an update. However, Uber has tons of transactions worldwide and needs a different way of dealing with the problem at data lake scale.

Introducing transactional data lakes requires more operational overhead, which should be taken into consideration.
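The upsert these table formats provide can be sketched in plain Python. This is conceptually what a Hudi/Iceberg/Delta MERGE does, minus all the hard parts (file layout, concurrency, scale):

```python
def merge_upsert(table, updates, key="trip_id"):
    # SCD Type 1 style merge: incoming records overwrite matching
    # keys and new keys are inserted -- no history is kept.
    indexed = {row[key]: row for row in table}
    for row in updates:
        indexed[row[key]] = {**indexed.get(row[key], {}), **row}
    return list(indexed.values())

trips = [{"trip_id": 1, "fare": 30.0}]
late_tip = [{"trip_id": 1, "fare": 35.0}]  # the $5 tip arrives later
trips = merge_upsert(trips, late_tip)
print(trips)  # → [{'trip_id': 1, 'fare': 35.0}]
```

The whole difficulty of incremental ETL is doing this merge against billions of records sitting in immutable files on object storage, which is exactly the machinery the table formats provide.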

Lots and Lots of Data – Customers in this space generate terabytes or petabytes of data a day. For example, Walmart creates 10 PB of data (!) a day.

When customers are in this space, transactional data lakes with Apache Hudi, Apache Iceberg, and Databricks Deltalake are the main tools used.


The data space is large and crowded. At the small and very large ends, the market segmentation is clear. However, the mid-market data space will probably take some time for winners to emerge.

Data Engineering Low Code Tools

In the data engineering space we have seen quite a few low code and no code tools pass through our radar. Low code tools have their own nuances as you will get to operationalize quicker, but the minute you need to customize something outside of the toolbox, you may run into problems. That’s when we usually deploy our custom development using things like Glue, EMR, or even transactional datalakes depending on your requirements.

This list is split into open source, ELT/reverse ETL, streaming, popular tools, and the rest of the tools. One thing I have been looking for in this space is a first-class open source product. Many of these products start as open source and end up releasing a managed version. Personally, of course, I am all for open source teams making their money back somehow, but it would be ideal for the platforms to retain an open source license.

One thing my team has noticed is the traction dbt has been gaining in the market. It flips the paradigm a bit by doing ELT (Extract, Load, Transform), where everything is loaded into your data warehouse first and then transformed there.

Another project I have been watching, on Zach Wilson's recommendation, is a pretty spiffy way of creating quick DAGs with executable Python notebooks. The team behind it is pretty active soliciting feedback on Slack, and it is one to watch for the future. Airbyte and Meltano are newer to me, and I hope to take some time to play with those tools. This list is by no means exhaustive, so let me know if there is anything I have missed.

Open Source Tools

Product: Airbyte
Description: Airbyte is an open-source data integration platform that allows users to replicate data from various sources and load it into different destinations. Its features include real-time data sync, robust data transformations, and automatic schema migrations.
Github Link:
Cost: Free, with paid plans available
Release Date: 2020
Number of Employees: 11-50

Description: is a no-code AI platform that enables businesses to automate and optimize workflows. It includes features such as visual recognition, natural language processing, and predictive analytics, with a focus on e-commerce applications.
Github Link:
Cost: Open source
Release Date: 2020
Number of Employees: 11-50

Product: Meltano
Description: Meltano is an open-source data integration tool that allows users to build, run, and manage data pipelines using YAML configuration files. Its features include source and destination connectors, transformations, and orchestration.
Github Link:
Cost: Free, with paid options available
Release Date: 2020
Number of Employees: 11-50

Product: Apache Nifi
Description: Apache Nifi is a web-based dataflow system that allows users to automate the flow of data between systems. Its features include a drag-and-drop user interface, data provenance, and support for various data sources and destinations.
Github Link:
Cost: Free
Release Date: 2014
Number of Employees: N/A

Product: Apache Beam
Description: Apache Beam is an open-source, unified programming model for batch and streaming data processing. It provides a simple, portable API for defining and executing data processing pipelines, with support for various execution engines.
Github Link:
Cost: Free
Release Date: N/A
Number of Employees: N/A


Product: dbt (data build tool)
Description: dbt is an open-source data transformation and modeling tool that enables analysts and engineers to transform their data into actionable insights. It provides a simple, modular way to manage data transformation pipelines in SQL, with features such as version control, documentation generation, and testing.
Github Link:
Cost: Free, with paid options available for enterprise features and support
Release Date: 2016
Number of Employees: 51-200


Product: Confluent
Description: Confluent is a cloud-native event streaming platform based on Apache Kafka that enables organizations to process, analyze, and respond to data in real-time. It provides a unified platform for building event-driven applications, with features such as data integration, event processing, and management tools.
Github Link:
Cost: Free, with paid options available for enterprise features and support
Release Date: 2014
Number of Employees: 1001-5000

Popular Tools

Product: Fivetran
Description: Fivetran is a cloud-based data integration platform that automates the process of data pipeline building and maintenance. It provides pre-built connectors for over 150 data sources and destinations, with features such as data synchronization, transformation, and monitoring.
Github Link:
Cost: Subscription-based, with a free trial available
Release Date: 2012
Number of Employees: 501-1000

Product: Alteryx
Description: Alteryx is an end-to-end analytics platform that enables users to perform data blending, advanced analytics, and machine learning tasks. It provides a drag-and-drop interface for building and deploying analytics workflows, with features such as data profiling, data quality, and data governance.
Github Link:
Cost: Subscription-based, with a free trial available
Release Date: 1997
Number of Employees: 1001-5000

Product: Informatica
Description: Informatica is a data management platform that enables users to integrate, manage, and govern data across various sources and destinations. It provides a unified platform for data integration, quality, and governance, with features such as data profiling, data masking, and data lineage.
Github Link:
Cost: Subscription-based, with a free trial available
Release Date: 1993
Number of Employees: 5001-10,000

Product: Matillion
Description: Matillion is a cloud-native ETL platform that enables users to extract, transform, and load data into cloud data warehouses. It provides a visual interface for building and deploying ETL workflows, with features such as data transformation, data quality, and data orchestration.
Github Link:
Cost: Subscription-based, with a free trial available
Release Date: 2011
Number of Employees: 501-1000

Orchestration Tools


Product: Prefect
Description: Prefect is a modern data workflow orchestration platform that enables users to automate their data pipelines with Python. It provides a simple, Pythonic interface for defining and executing workflows, with features such as distributed execution, versioning, and monitoring.
Github Link:
Cost: Free, with paid options available for enterprise features and support
Release Date: 2018
Number of Employees: 51-200

Product: Dagster
Description: Dagster is a data orchestrator and data integration testing tool that enables users to build and deploy reliable data pipelines. It provides a Python-based API for defining and executing pipelines, with features such as type-checking, validation, and monitoring.
Github Link:
Cost: Free, with paid options available for enterprise features and support
Release Date: 2019
Number of Employees: 11-50

Product: Airflow
Description: Airflow is an open-source platform for creating, scheduling, and monitoring data workflows. It provides a Python-based API for defining and executing workflows, with features such as task dependencies, retries, and alerts.
Github Link:
Cost: Free
Release Date: 2015
Number of Employees: N/A (maintained by the Apache Software Foundation)

Product: Azkaban
Description: Azkaban is an open-source workflow manager that enables users to create and run workflows on Hadoop. It provides a web-based interface for creating and scheduling workflows, with features such as task dependencies, notifications, and retries.
Github Link:
Cost: Free
Release Date: 2010
Number of Employees: N/A (maintained by the Azkaban Project)

Product: Luigi
Description: Luigi is an open-source workflow management system that enables users to build complex pipelines of batch jobs. It provides a Python-based API for defining and executing workflows, with features such as task dependencies, retries, and notifications.
Github Link:
Cost: Free
Release Date: 2012
Number of Employees: N/A (maintained by Spotify)

Product: Oozie
Description: Oozie is a workflow scheduler system for managing Hadoop jobs. It provides a web-based interface for defining and scheduling workflows, with features such as task dependencies, triggers, and notifications.
Github Link:
Cost: Free
Release Date: 2009
Number of Employees: N/A (maintained by the Apache Software Foundation)


3forge – – 3forge delivers software tools for creating financial applications and data delivery platforms.

Ab Initio Software – – Ab Initio Software provides a data integration platform for building large-scale data processing applications.

Adeptia – – Adeptia offers a cloud-based, self-service integration solution that allows users to easily connect and automate data flows across multiple systems and applications.

Aera – – Aera provides an AI-powered platform for enterprises to accelerate their digital transformation by automating and optimizing business processes.

Aiven – – Aiven offers managed cloud services for open-source technologies such as Kafka, Cassandra, and Elasticsearch.

– – provides a unified data platform that allows users to build, scale, and automate data pipelines across various sources and destinations.

Astera Software – – Astera Software offers a suite of data integration and management tools for businesses of all sizes.

Black Tiger – – Black Tiger provides an open-source data pipeline framework that simplifies the process of building and deploying data pipelines.

Bryte Systems – – Bryte Systems offers an AI-powered data platform that helps organizations manage their data operations more efficiently.

CData Software – – CData Software provides a suite of drivers and connectors for integrating with various data sources and APIs.

Census – – Census offers an automated data syncing platform that allows businesses to keep their customer data up-to-date across various systems and applications.

CloverDX – – CloverDX provides a data integration platform for building and managing complex data transformations.

Data Virtuality – – Data Virtuality offers a data integration platform that allows users to connect and query data from various sources using SQL.

Datameer – – Datameer provides a data preparation and exploration platform that enables users to analyze large datasets quickly and easily.

DBSync – – DBSync provides a cloud-based data integration platform for connecting and synchronizing data across various systems and applications.

Denodo – – Denodo provides a data virtualization platform that allows users to access and integrate data from various sources in real-time.

Devart – – Devart offers a suite of database tools and data connectivity solutions for various platforms and technologies.

DQLabs – – DQLabs provides a self-service data management platform that automates the process of discovering, curating, and governing data assets.

eQ Technologic – – eQ Technologic offers a data integration platform that enables users to extract, transform, and load data from various sources.

Equalum – – Equalum provides a real-time data ingestion and processing platform that enables organizations to make data-driven decisions faster.

Etleap – – Etleap offers a cloud-based data integration platform that simplifies the process of building and managing data pipelines.

Etlworks – – Etlworks provides a data integration platform that allows users to create and manage complex data transformations.

Harbr – – Harbr is a data exchange platform that connects and facilitates secure data collaboration between organizations.

HCL Technologies (Actian) – – Actian provides hybrid cloud data analytics software solutions that enable organizations to extract insights from big data and act on them in real time.

Hevo Data – – Hevo Data provides a cloud-based data integration platform that enables companies to move data from various sources to a data warehouse or other destination in real time.

Hitachi Vantara – – Hitachi Vantara provides data management, analytics, and storage solutions for businesses across various industries.

HULFT – – HULFT provides data integration and management solutions that enable businesses to streamline data transfer and reduce data integration costs.

ibi – – ibi provides data and analytics software solutions that help organizations make data-driven decisions.

Impetus Technologies – – Impetus Technologies provides data engineering and analytics solutions that enable businesses to extract insights from big data.

Infoworks – – Infoworks provides a cloud-native data engineering platform that automates the process of data ingestion, transformation, and orchestration.

insightsoftware – – insightsoftware provides financial reporting and enterprise performance management software solutions that help organizations improve their financial and operational performance.

– – provides a cloud-based data integration platform that enables businesses to integrate and manage data from various sources.

Intenda – – Intenda provides a data integration and analytics platform that enables businesses to unlock insights from their data.

IRI – – IRI provides data management and integration software solutions that enable businesses to integrate and manage data from various sources.

Irion – – Irion provides a data management and governance platform that enables businesses to automate data quality and compliance processes.

K2view – – K2view provides a data fabric platform that enables businesses to connect and manage data across various sources and applications.

Komprise – – Komprise provides an intelligent data management platform that enables businesses to manage and optimize data across various storage tiers.

Minitab – – Minitab is a statistical software package designed for data analysis and quality improvement.

Nexla – – Nexla offers a data operations platform that automates the process of ingesting, transforming, and delivering data to various systems and applications.

OpenText – – OpenText is a Canadian company that provides enterprise information management software.

Palantir – – Palantir is an American software company that specializes in data analysis.

Precisely – – Precisely provides data integrity, data integration, and data quality software solutions.

Primeur – – Primeur is an Italian software company that offers products and services for data integration, managed file transfer, and digital transformation.

Progress – – Progress is an American software company that provides products for application development, data integration, and business intelligence.

PurpleCube – – PurpleCube is a Canadian consulting company that specializes in data integration, data warehousing, and business intelligence.

Push – – Push is a French software company that provides products and services for data processing and analysis.

Qlik – – Qlik provides business intelligence software that helps organizations visualize and analyze their data.

RELX (Adaptris) – – Adaptris, now a RELX company, offers data integration software that helps organizations connect systems and applications.

Rivery – – Rivery is a cloud-based data integration platform that allows businesses to consolidate, transform, and automate data.

Safe Software – – Safe Software provides spatial data integration and spatial data transformation software.

Semarchy – – Semarchy provides a master data management platform that helps organizations consolidate and manage their data.

Sesame Software – – Sesame Software offers data management solutions that simplify data integration, data warehousing, and data analytics.

SnapLogic – – SnapLogic provides a cloud-based integration platform that enables enterprises to connect cloud and on-premise applications and data.

Software AG – – Software AG offers a platform that enables enterprises to integrate and optimize their business processes and systems.

Stone Bond Technologies – – Stone Bond Technologies offers a platform that enables enterprises to integrate data from various sources and systems.

Stratio – – Stratio offers a platform that enables enterprises to process and analyze large volumes of data in real-time.

StreamSets – – StreamSets offers a data operations platform that enables enterprises to ingest, transform, and move data across systems and applications.

Striim – – Striim offers a real-time data integration and streaming analytics platform that enables enterprises to collect, process, and analyze data in real-time.

Suadeo – – Suadeo provides a platform that enables enterprises to integrate and manage their data from various sources.

Syniti – – Syniti offers a data management platform that enables enterprises to integrate, enrich, and govern their data.

Talend – – Talend provides a cloud-based data integration platform that enables enterprises to connect, cleanse, and transform their data.

Tengu – – Tengu offers a data engineering platform that enables enterprises to automate the process of ingesting, processing, and delivering data.

ThoughtSpot – – ThoughtSpot offers a cloud-based platform that enables enterprises to analyze their data in real-time.

TIBCO Software – – TIBCO Software offers a platform that enables enterprises to integrate and optimize their business processes and systems.

Tiger Technology – – Tiger Technology offers a platform that enables enterprises to manage, move, and share their data across systems and applications.

– – provides a platform that enables enterprises to manage and process their data in real-time.

Upsolver – – Upsolver offers a cloud-native data integration platform that enables enterprises to process and analyze their data in real-time.

WANdisco – – WANdisco offers a platform that enables enterprises to replicate and migrate their data across hybrid and multi-cloud environments.

ZAP – – ZAP offers a data management platform that enables enterprises to integrate, visualize, and analyze their data.

Domo – – Domo is a cloud-native platform that gives data-driven teams real-time visibility into all the data and insights needed to drive business forward.

Dell Boomi – – Dell Boomi is a business unit acquired by Dell that specializes in cloud-based integration, API management, and Master Data Management.

Stitch – – Stitch is a cloud-first, open-source platform for rapidly moving data. It allows users to integrate with over 100 data sources and automate data movement to a cloud data warehouse.

Sparkflows – – Sparkflows is a low-code, drag-and-drop platform that enables organizations to build, deploy, and manage Big Data applications on Apache Spark.

Liquibase – – Liquibase is an open-source database-independent library for tracking, managing, and applying database schema changes.

Shipyard – – Shipyard is a container management platform that makes it easy to deploy, manage, and monitor Docker containers.

Flyway – – Flyway is an open-source database migration tool that allows developers to evolve their database schema easily and reliably across different environments.

Software Estimations Using Reference Class Forecasting

Eighteen years ago I'm sitting in my cubicle doing Java programming, and my tech lead comes up to chat about my next project. We discuss the details, and then she asks the dreaded question programmers fear: "how long will it take?" I stumble through some guesstimate based on my limited experience, and she goes along her merry way and plugs the number into a Gantt chart.

Even with the emergence of the agile manifesto, and now the current paradigm of planning projects in 1-2 week sprints, businesses and customers still ask technologists how long a project will take.

The unfortunate thing about agile is that even though it is an ideal way to run a project, financial models rarely follow the methodology. Most statements of work are written with a time estimate for the project. There are exceptions where some customers pay for work two weeks at a time, but that is pretty rare.

Throughout my technical career I have rarely seen any formalized software estimation model gain broad adoption, so I was surprised, while reading How Big Things Get Done, to find a mention of software project estimation. The opening chapters cover the challenges and successes of large architectural projects, ranging from the Sydney Opera House (a problematic project) to the Guggenheim in Bilbao (amazingly under budget).

The book proposes using reference class forecasting, which asks you to:

  1. Gather the estimates and actual durations of all similar projects performed in the past in your organization
  2. Take the mean value
  3. Use that as an anchor for your current project

For example, if I were doing an application modernization from Hadoop to EMR and had no idea how long it would take, I would look for references to other projects of similar complexity. Say I had data on 10 previous projects and the mean came out to 6 months; then 6 months would be the anchor point.
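The mechanics are trivial once the historical data exists. A sketch with made-up durations (the numbers below are purely illustrative):

```python
from statistics import mean, stdev

# Hypothetical actual durations (months) of past projects of similar
# complexity -- collecting these numbers is the hard part.
past_durations = [5, 7, 4, 8, 6, 6, 5, 9, 6, 4]

anchor = mean(past_durations)
spread = stdev(past_durations)
print(f"anchor: {anchor:.1f} months (+/- {spread:.1f})")
```

Reporting the spread alongside the anchor is worth the extra line: it tells the customer how much the reference class itself varies, which is a more honest answer than a single number.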

The book immediately points out that the biggest problem isn't the approach itself; it is obtaining the historical data on how long previous projects actually took. Think about it this way: of all the projects you have ever estimated, how often have you compared the actuals to your forecast? I bet most of us haven't done these retros at all.

Some takeaways for me:

  1. If you are in a large organization and have done multiple projects, take the time to do a retro on them and record in a spreadsheet what each project was, its tasks, its complexity, and the actual time it took to finish. Large companies have this valuable data but unfortunately don't go through the exercise of capturing it. With it, some rudimentary reference class forecasting can replace subjective software estimates.
  2. If you are a small organization, or don't have a history of projects to use as a reference point, then unfortunately I think you are out of luck.

At the end of the day, I think the industry needs to get better at software estimation, and the only way is to develop some type of methodology and refine it over time.

West Coast Trail – The 75km/48 mile death hike

Author Note: This trip was taken in 2021, but updated in 2023 with updated details.

I'm not really sure where I get these crazy ideas, but a friend and I booked the West Coast Trail, a multi-day thru-hike on the west coast of Vancouver Island, which is accessible via ferry. Unfortunately the hike was canceled in 2020, but a friend and I fortunately got into the lottery and booked one of the most coveted start dates, July 2nd. July is typically the better month to go because you want as little precipitation as possible.

I have done a lot of hiking and cool trips, but never thru-hiking, where you start at one point and come out at another, carrying everything on your back including your food, tent, and supplies.

To prepare for the trail, there were pretty much two resources to read: the book Blisters and Bliss and a super valuable Facebook group.

From reading the group, everybody recommended either buying dehydrated food or making it yourself, the reason being that you don't want to carry real food because of the possibility of spoilage and the additional weight.

I bought the book from the backpacking chef, and decided to start experimenting. First thing I bought was a dehydrator.

There is a fan on top of the dehydrator, and you set the temperature and time. It typically runs for a long time, about 8-20 hours depending on the food. You fully cook whatever you are going to eat, let it cool a bit, then dehydrate it at 120-135 degrees for multiple hours.

After much experimenting I successfully dehydrated:
+ rice
+ beans
+ lentils
+ tofu (you have to freeze it first)
+ kale
+ ratatouille
+ thai curry paste
+ quinoa

I didn’t really like dehydrating meat such as chicken breast because, at the end of the day, it kind of tasted weird.

For the food, I packed each meal in its own ziplock bag.

In the end I made 7 meals consisting of:
+ japanese curry – tofu, kale, beans, ratatouille mix, textured vegetable protein
+ thai curry – instant rice noodles, thai curry paste, tofu, beans
+ lentils – green lentils, quinoa, salsa macha

For breakfast I packed oatmeal; for lunch, tortillas, PB&J, some parmesan crackers, and bars. Total weight: about 9-10 pounds.

Preparation #2: Packing

For the West Coast Trail, you want a backpack that is only about 20-30% of your body weight; the lighter the better. For me that meant about 30-40 pounds.

What a lot of people do for thru-hiking is weigh every item and put it in a website called LighterPack. It is basically a fancy online Excel spreadsheet.

During the pandemic, all sports gear in Vancouver was in short supply. I spent, uhh, a lot of pennies upgrading all of my gear: an ultralight 1.2 lb tent, a new jacket, a new sleeping pad, and a gravity filter. I couldn’t find the tent in Canada, so I bought it from REI in the States and asked my parents to ship it up.

Visualizing my gear one last time, I put everything in my bag for a final weigh-in and test.

The final weigh-in was about 34 lbs. If I count the hours I spent dehydrating, packing, and thinking about the trip, I easily spent at least 40 hours planning.

One app which was incredibly useful was Avenza Maps. With it, you are able to see where you are relative to the trail map that Parks Canada provides as a PDF. However, be aware that the map is not 100% updated to the latest routes, so use Avenza Maps only as a reference and cross-check against the physical map provided.

Trail Report Day 1: 75km —> 70km – 3.1 miles
AKA – The day I despise ginormous ladders

For the thru-hike there were two options, south to north or north to south. We opted to go south to north as it starts off super difficult, then slowly gets easier. Logistically, we spent a night in Victoria, and then got dropped off the trailhead in Port Renfrew. After a quick orientation we took a ferry across and this was the first thing we saw:

If there was anything to wake you up, it is a ladder two stories high. At this point I turned off my brain and went up really slowly.

I didn’t realize it at the time, but this trail was actually quite dangerous, because if you fall or slip, the consequences could be fatal. In hiking, there are some interesting terms, such as calling a trail ‘technical’.

When hikers call something technical, it means the terrain is more difficult and you aren’t simply walking on a dirt path: you may be scrambling over rocks, uneven trail, roots, etc.

This portion of the trail wasn’t too technical, but it had a lot of elevation. The hiking in this section took about 4.5 hours to get to the campsite.

On this hike, every campsite is by a beach, because glacial melt feeds rivers which flow into the ocean. This is important because you need to filter water at each site; carrying gallons of water for 7 days would be impossible!

At the campsite there was a mix of people finishing the trail and people starting it. It is pretty typical on any really big hike to inquire about trail conditions. We heard that many people had bailed out of the hike halfway because of the heat. I’m sure you heard about the ‘heat dome’ in the Pacific Northwest, when temperatures in Portland/Seattle/Vancouver hit 100°F and higher! Hiking in 100 degree weather would be brutal.

After we ate dinner, one of the ladies we had been talking to came back and asked if I was a doctor. She asked if I had hydrogen peroxide, said I looked familiar, and asked if I worked at the BC Women’s Hospital.

—— Aside
For some odd reason, people have pretty often asked me weird questions about my occupation. One time I was at Dallas Love Field Airport flying Southwest Airlines, waiting at my gate, and somebody asked me if I was a pilot.

I was just kind of puzzled, like, what makes me look like a pilot? It’s just kind of weird what people assume about you.

Another time I was yet again at the airport (this was pre-covid life where I used to travel twice a month), where someone asked if I was an athlete competing in the Olympics. As flattered as I was, that was again a pretty weird assumption to make. I distinctly recall wearing sweat pants and having a Bose headset on me.
—— End Aside

I knew I didn’t want to cramp up, but doing yoga stretches on the beach was near impossible, so I did them on the platform by the restroom.

I’m sure people were wondering who that crazy person was doing yoga at night.

Unfortunately/fortunately I was getting strong 5G reception from T-mobile from Washington. Most people had the true chance to disconnect, but uhh.. I was checking my e-mails before sleeping.

Trail Report Day 2: 70km —> 58km – 7.4 miles
AKA – The day I despise rocks

You would think sleeping by the beach is relaxing, but that is far from the case. I didn’t sleep that well, as the ocean was thundering in the middle of the night. I finally dug out my ear plugs and slept somewhat okay.

One of the things which was really beautiful, and which I couldn’t capture in photos, was that morning’s unique sunrise. On the left, where you see that bright light, is the sun. As time progressed, because of the cloud formation, all I could see was an expanding line of light over the horizon.

Brushing your teeth also has some special considerations: you brush and floss near the ocean and away from your campsite, because you don’t want any food bits near your tent attracting animals.

Again, this was one of those times where I just shut off my brain and prayed for safety the entire trek. This section would be rated uber technical.

Later on, in the Facebook group, I read about someone who slipped off a rock, fell, and had to be medevaced out. Looking back, it was a pretty dicey section.

We finally reached a section called Owen Point, where you could not cross unless tides were low enough.

While my friend was taking a picture, I witnessed someone attempt to cross before the tide was low enough and slip off a rock. Fortunately she was okay. After watching several people get hurt, we decided to really wait for the tides to be safe before crossing.

After the boulder section there was a super interesting coastal walk for quite a long time. The waves really shaped the geography of the land in a unique way.

However, walking on coastal shelves has its own problems: you need to be aware of what is slippery and what is not.

Certain spots looked like dead-body markings, but they were just dried salt, perhaps left where rocks had been moved?

Similar to Galiano Island, there were again so many interesting formations in the rocks.

After the coastal part, we reached KM 66 and went inland. The scenery changed back to forest.

At one point the trail turned pretty muddy, and as I was stepping off a slippery platform, I slipped right off and fell 4 feet off the log, right onto my back. Fortunately I landed on my backpack. I was pretty shaken up and extremely scared, but praise God, I had no injuries from that fall. Later on, I checked and nothing in my backpack had broken.

We stayed at a pretty small campsite for the night.

Trail Report Day 3: 58km —> 41km – 10 miles – Cullite to Cribs
AKA – The day I despise uneven coastal hiking and realized I forget stuff easily

Paranoia set in after falling off the log the day before. I was watching nearly every step I took.

We had a super long 10km walk along the beach. You would think walks along the beach are fun, but nope. First off, when you step, you sink into the sand. Second, you are kind of walking at a weird 45-degree slope where your left and right legs are uneven.

—— Aside: the grand debate about shoes
One of the topics debated quite heavily in the hiking community is whether to wear trail shoes or boots. For most of my hikes I have always worn trail shoes. The pros, I would say, are:

+ Lightweight
+ Dry quickly
+ You don’t develop blisters around your toes

I had always hiked in very hot areas, so I never had an issue with trail shoes. EXCEPT on this trail my shoes and socks got wet. My shoes never dried because of the mistiness and humidity of the trail, causing 2 blisters on the bottoms of my feet.

A lot of people say that boots protect your ankles, but I am of the view that having strong ankles protects your ankles. That means doing various lunges, step-ups, and light weights to strengthen your feet.

I learned later from the Facebook group that trail shoe wearers should bring a mineral-based cream to put on their feet when wet to avoid blisters.

Let’s just say at the end of the day I am still a trail shoe fan, but now open to perhaps waterproof-style shoes. Still not convinced about boots~
—— End Aside

After endless walking, we went through tide pools again, and there were quite a few dead crabs, washed-up kelp, and sea urchins. We even saw some green sand, which I had previously only seen in Hawaii.

After a long slog we finally arrived at a pretty nice beach campsite.

When you cook in the backcountry, it is quite different from regular cooking. You put your dehydrated food in a pot on a camping stove, add water, and bring it to a boil. Think of it as a healthier cup of noodles.

After dinner we chatted with a mom who was there with 5 kids (!). She mentioned that her husband had suffered a concussion 10 years ago and couldn’t do hikes like these. She really enjoyed talking with us because she wanted some adult time; all of her conversations were mainly jokes with the kids.

I then proceeded with my night routine and realized I couldn’t find my toothbrush. I started to panic when I realized I couldn’t find my toiletry bag at all. I had left it at the previous beach campsite *face palm*.

Furthermore, the repercussions would be big: I wouldn’t be able to brush or floss for 4 days!

I approached Cindy (the mom) as she was sitting with other people. I publicly explained my debacle, and Cindy gave me some toothpaste in a ziplock bag. I needed to floss around braces, and another lady had dental floss picks which were BRACES FRIENDLY. The odds of that were so small. I offered them chocolate, but they just said to pay it forward.

The bigger problem was that I still had no toothbrush, but from talking to some people, I learned I could probably pick one up at the next stop.

Late at night, I fell asleep to a chorus of frogs chirping. It was actually quite soothing after a stressful day.

Trail Report Day 4: 42km —> 33km – Cribs to Nitinat Narrows

—— Warning: the section below talks about poop
One of the things hikers and campers talk a lot about is poop: how you will poop, and where. On this trail there are outhouses, so all you have to bring is toilet paper, hand sanitizer, and soap.

It is important to time your poop schedule: you want to go to the bathroom in the morning and then in the evening, because if you need to go #2 in the middle of the day, it is extremely inconvenient, as you have to dig a hole.

My routine is pretty much: wake up and poop, eat breakfast, then poop one more time before heading out. Fortunately, I adhered to this routine throughout the hike.

Another huge issue is peeing in the middle of the night. You are warm in the tent, and you have to change, walk to the bathroom, then walk back. Imagine being at home and, instead of walking to your bathroom, having to walk to the building next door.

Many people try to alleviate this by doing a double pee: peeing at night, hanging around the restroom for 20 minutes, and peeing again.
—— End Poop Talk

This morning it didn’t rain, but the beach was EXTREMELY misty and everything got wet, which made packing up miserable. I was so out of it I thwacked myself in the eye with my tent pole, but fortunately everything was fine.

We trekked inland and the trail was extremely overgrown and extremely muddy. After 5 hours of hiking we passed by this really beautiful lily field.

I knew the first half of the trip would be brutal, so I booked a cabin halfway through. In the middle of the hike you have the opportunity to do something called ‘comfort camping’: there is a place where you can order and eat real food. Although the prices are exorbitant, every morsel was worth it.

We finally arrived at Nitinat Narrows, an area run by the Nitinat First Nation. The area consists of cabins for rent and a super popular food shack that pretty much everyone eats at.

It was odd that after only 2 days of dehydrated food, I was already craving real food. I got the halibut and baked potato and it was GLORIOUS.

Afterwards we met Doug, one of the caretakers of the property. He showed us to our room, and I was pleasantly surprised. I had seen pictures, but it was actually better in person.

After drying all of our stuff outside, we sat in the patio area with a group of 5. They were heading north to south and asked us for a bunch of tips on the difficult section.

Doug came by to talk about the land and his experiences there. He talked about how his family escaped residential schooling because his mom was white, but many others were taken away.

Residential schooling also occurred in the United States, but it is a pretty hot-button issue in Canada. In short, there is a long history of First Nations children (in the US, called Native Americans) being taken away from their families to be educated in government-run schools. You can imagine the trauma and the destruction of families this caused.

In the afternoon we were talking with 5 other guys, and Doug asked if we all wanted to take a boat out to pick up crabs from their crab traps!

We all headed into the boat with the DOG, who amazingly enjoyed the experience and was probably quite used to it. The crab traps are baited with fish heads, spread out in the lake, and picked up later.

There are regulations that crabs have to be a certain size or else they are thrown back, which makes sense from a sustainability perspective.

Trail Report Day 5: 32km —> 23km – Nitinat Narrows to Klanawa River
AKA – Approaching easy town

—— Aside: Hiking Debate #2 – poles or no poles
You would be surprised how many debates there are in the hiking community. This one is whether or not to bring hiking poles.

Hiking poles, to me, are insurance: if you slip, you have the opportunity to catch yourself.

For gear, my opinion is to buy higher-quality, more expensive gear, because if it breaks on the trail, you are out of luck for the rest of the trip. I remember buying Cascade hiking poles from Costco and having them break in the middle of a hike in Peru. That really was not a cool experience.

My vote: if the trail is remotely technical – yes, poles!
—— End Aside

After a refreshing night’s sleep, we headed out once again. There was some mud, some slippery boardwalks, and a lot of walking through twisted roots in the forest.

We made a brief stop at Tsusiat Falls, where we both jumped in for a swim. About 2km later, we arrived at a campsite where it was only the two of us.

After setting up camp, I explored the beach area.

Near the campsite I saw mussel shells and a ton of logs everywhere. I remember reading that during the winter, torrential storms come in and reshape the beach landscape; tons of logs had washed up on the beach.

Trail Report Day 6: 23km —> 0km – Pachena Bay
AKA – Let’s get out of here!

The trail started coastal again, with an endless slog of beach and tons of rocks and boulders. By this point I had developed two blisters from wet socks, so I was cautious. We arrived at the last campsite before the exit at 1pm and decided to just exit the park immediately. It was another 4 long hours, but then we were out!

The ending was super uneventful. We could barely find the parking lot, and there were no cheers or anyone there to even meet us.

At the end of the day, a lot of people have asked me: was the hike enjoyable or worth it?

I’ve been thinking about it a lot. My style of hiking is to hike to a super gorgeous viewpoint and take photos. The West Coast Trail, to me, was more a hike of endurance, as I had never done a thru-hike before.

Life revelations?

As I’ve told some people before, I usually don’t have any life revelations during really challenging hikes. I guess that’s a good sign?

As with most things in life, going outdoors is part preparation, part training, part luck, and all prayer.

Addendum: Here are the recipes I used for my trip
Dehydrated Recipes:

  • Black beans – 125F, 5 hours
  • Mayocoba beans – 125F, 5 hours
  • Ratatouille – 135F, 18 hours – need to break it up halfway, make sure all vegetables
  • Quinoa – 135F, 5-8 hours, fruit roll-up tray
  • Lentils – 135F, 8 hours


  • 50g
  • 3g chia
  • 6g barley
  • 10g blueberries

Japanese Curry (2x)

  • 66g dried rice
  • 4g kale
  • 50g lentils
  • 25g tofu
  • 15g tvp
  • 21g beans
  • 8g ratatouille
  • 1g spring onions
  • 10g dried mushrooms
  • 1/2 block Japanese curry
  • Furikake spice

Turmeric Curry

  • 1 package rice noodles
  • 2g curry packet
  • 50g tofu
  • 50g beans
  • 5g coconut milk powder
  • 10g tvp
  • fish sauce

Green Lentils

  • 100g lentils
  • 50g quinoa
  • 20g vegetables
  • Salsa macha
  • Raisins
  • Olive Oil