A Primer
Hi all,
So I recently conducted a data engineering mock interview, and there are a couple of things I want to touch upon:
Foremost, I want to point out that unlike most other interviews, data engineering really is the wild west: there are few books on the subject, and the process varies widely across companies. Organizations often emphasize one facet of data engineering over another; some seek strong SQL skills, others covet strong ETL pipeline designers, some want folks who understand lambda (combined batch and real-time/near-real-time layers) versus kappa (streaming-only) architectures, and a few really emphasize an understanding of performant distributed compute with platforms like Spark and Flink.
Still, it’s a good idea to try to understand what folks look for at a high level, and luckily, data engineering has some overlap with general system design principles (in fact, a question from Alex Xu’s Volume 2 book, Chapter 6, Ad-Click Event Aggregation, makes for a perfect data engineering interview question).
Data Engineering Case Study Questions
(some my own, some drawn from online sources 🙂):
ETL Pipelines:
Case Study: I want you to design an end-to-end solution. Construct a data pipeline for near-real-time ingestion of Netflix IoT data: clickstream or playback data. It should be designed for ad-hoc monitoring of select metrics. It operates at Netflix scale, and the data is geographically distributed. Your pipeline should be able to populate analytics databases for personas such as BAs (Business Analysts) and DAs (Data Analysts).
- The choice of metrics is up to you.
- You can either focus on a general solution, or delve into solutions built on targeted tools, technologies, and platforms of your choice! (A rough sketch of one possible shape follows the source link below.)
Question source = https://www.youtube.com/watch?v=53tcAZ6Qda8&t=603s
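To make the ask concrete, here is a minimal sketch of one possible shape for the near-real-time path, assuming PySpark Structured Streaming with the Kafka connector. The topic name, event schema, and S3 paths are my own illustrative assumptions, not part of the question; the parquet sink simply stands in for a layer an OLAP/query engine such as Athena could sit on top of.

```python
# Minimal sketch: near-real-time clickstream/playback aggregation.
# Assumes PySpark with the Kafka connector; all names below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("clickstream-ingest").getOrCreate()

# Assumed event schema for playback/clickstream events.
event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("region", StringType()),
    StructField("event_time", TimestampType()),
])

# Ingest raw events from a (hypothetical) Kafka topic.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "playback-events")
    .load()
)

events = raw.select(
    F.from_json(F.col("value").cast("string"), event_schema).alias("e")
).select("e.*")

# Windowed metric: events per region and type per minute, tolerant of late data.
metrics = (
    events.withWatermark("event_time", "5 minutes")
    .groupBy(F.window("event_time", "1 minute"), "region", "event_type")
    .count()
)

# Land results where a downstream analytics/OLAP layer can query them.
query = (
    metrics.writeStream.outputMode("append")
    .format("parquet")
    .option("path", "s3://example-bucket/metrics/")
    .option("checkpointLocation", "s3://example-bucket/checkpoints/")
    .start()
)

query.awaitTermination()
```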
Q2:
Feedback (Case #1)
- TC produced a flexible design and used the business context to justify specific metrics
- TC recognized performance bottlenecks in their pipeline, both upstream (on event collection) and downstream (on storing results to staging databases)
- TC provided different types of analytics for the OLAP/historical database (Athena)
- TC recognized use cases of storing events in a Data Lake
- TC recognized separate paths for ingestion and analytics, and kept them separate to account for performance
- TC developed a data pipeline with minimal components and minimal infrastructure
- TC engaged in a solid discussion of the push model versus the pull model in their pipeline’s data capture stage (see the sketch after this list)
- TC engaged in data modeling.
- TC mentioned multiple different technologies across pipeline stages and their associated trade-offs
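For readers who haven’t hit the push-versus-pull question before, here is a toy sketch of the distinction at the data capture stage. The Agent and Collector classes are purely hypothetical stand-ins, not any particular framework that came up in the interview.

```python
# Toy sketch: push vs. pull event capture. All names are hypothetical.
import queue
import random
import time


class Agent:
    """A device/edge process that produces playback events."""

    def __init__(self, agent_id: str):
        self.agent_id = agent_id
        self._local_buffer: list[dict] = []

    def record_event(self) -> dict:
        event = {
            "agent_id": self.agent_id,
            "ts": time.time(),
            "event_type": random.choice(["play", "pause", "stop"]),
        }
        self._local_buffer.append(event)
        return event

    # Pull model: the collector asks the agent for whatever it has buffered.
    def drain(self) -> list[dict]:
        events, self._local_buffer = self._local_buffer, []
        return events


class Collector:
    """Central ingestion point feeding the rest of the pipeline."""

    def __init__(self):
        self.ingest_queue: queue.Queue = queue.Queue()

    # Push model: agents (or a client-side SDK) send events to the collector
    # as they happen, which minimizes latency but cedes rate control.
    def receive(self, event: dict) -> None:
        self.ingest_queue.put(event)

    # Pull model: the collector polls agents on its own schedule, which gives
    # it back-pressure control and batching at the cost of added latency.
    def poll(self, agents: list["Agent"]) -> None:
        for agent in agents:
            for event in agent.drain():
                self.ingest_queue.put(event)


if __name__ == "__main__":
    collector = Collector()
    agents = [Agent(f"device-{i}") for i in range(3)]

    # Push: forwarded immediately.
    collector.receive(agents[0].record_event())

    # Pull: events accumulate locally until the collector polls.
    for agent in agents:
        agent.record_event()
    collector.poll(agents)

    print(f"Queued {collector.ingest_queue.qsize()} events")
```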
Feedback (Case #2)
- TC asked really good clarifying questions and properly understood the types of metrics they’d drill down into
- TC thought of really good customer metrics: onboarding, retention, resurrection, and churn rates.
- TC solidly started with data modeling – fact and dimension tables.
- TC showed how to create a cumulative fact table to compute 30/90-day rolling averages (see the sketch after this list)
- TC made a good justification for an OLAP database, and I could easily segue into the pipeline/ingestion portion of the problem
- TC understood how to employ distributed compute engines (DCEs) to solve the problem
- TC built a multi-stage ingestion pipeline for IoT telemetry data and delved into different components when I asked
- TC justified how to extend the architecture to real-time, not just batch
- TC answered how to handle large volumes and large-event cases, and how to employ strategies to reduce upstream ingestion load
- TC had really good discussions on log enrichments and canonical data in different pipeline phases.
- TC answered remaining design deep dives solidly
- TC really “drove the conversation” and the discussion ; I learned new ways of problem-solving and thinking from them.
- TC thought carefully about data quality and the internal dashboards they’d build to spot discrepancies and metrics crossing thresholds
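As a quick illustration of the cumulative fact table idea from the list above, here is a minimal sketch of computing 30/90-day rolling averages over a daily fact table. The pandas DataFrame and column names (activity_date, active_users) are hypothetical stand-ins for a warehouse table; the same rollup would typically be expressed in SQL over the fact table itself.

```python
# Minimal sketch: 30/90-day rolling averages over a daily fact table.
# The DataFrame and column names are hypothetical stand-ins.
import pandas as pd

# Daily fact table: one row per day with an aggregated measure.
daily_fact = pd.DataFrame(
    {
        "activity_date": pd.date_range("2024-01-01", periods=120, freq="D"),
        "active_users": range(1000, 1120),
    }
)

# Roll up over the daily grain using time-based windows.
daily_fact = daily_fact.set_index("activity_date").sort_index()
daily_fact["avg_30d"] = daily_fact["active_users"].rolling("30D").mean()
daily_fact["avg_90d"] = daily_fact["active_users"].rolling("90D").mean()

print(daily_fact.tail())
```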
