Data Engineering Interview Insights: Key Skills & Challenges

A Primer

Hi all,

So I recently conducted a data engineering mock interview, and there are a couple of things I want to touch upon:

Foremost, I want to point out that unlike most other interviews, data engineering really is the wild west: there are few books on the subject and the process varies wildly across companies. Organizations often emphasize one facet of data engineering over another; some seek strong SQL skills, others covet strong ETL pipeline designers, some want folks who understand lambda (parallel batch and real-time/near-real-time layers) versus kappa (a single streaming layer) architectures, and a few really emphasize an understanding of performant distributed compute on platforms like Spark and Flink.
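To make the lambda-versus-kappa distinction concrete, here is a minimal PySpark sketch (my own illustration, not from the interview; the paths, broker, and topic name are assumptions) computing the same click-count metric once as a batch job and once as a streaming job:

```python
# Illustrative sketch only: the same click-count metric computed as a batch
# job (lambda's batch layer) and as a streaming job (kappa style).
# Paths, broker, and topic are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lambda-vs-kappa-sketch").getOrCreate()

# Batch layer: periodically recompute the metric from data at rest.
batch_counts = (
    spark.read.parquet("s3://example-bucket/clickstream/")   # hypothetical path
    .groupBy("video_id")
    .count()
)

# Kappa style: compute the same metric continuously from the event stream.
stream_counts = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")        # hypothetical broker
    .option("subscribe", "clickstream")                       # hypothetical topic
    .load()
    .selectExpr("CAST(value AS STRING) AS video_id")          # toy parsing
    .groupBy("video_id")
    .count()
)
```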

Still, it’s a good idea to try to understand what folks look for at a high level, and luckily, data engineering has some overlap with general system design principles (in fact, a question from Alex Xu’s Volume 2 book, Chapter 6, Ad-Click Event Aggregation, makes for a perfect data engineering interview question).

Data Engineering Case Study Questions

(some my own, some adapted from online sources 🙂):

ETL Pipelines:

Case Study: I want you to design an end-to-end solution. Construct a data pipeline for near-real-time ingestion of Netflix IoT data: clickstream or playback data. It should be designed for ad-hoc monitoring of select metrics. It operates at Netflix scale, and the data is geographically distributed. Your pipeline should be able to populate analytics databases for personas such as BAs (Business Analysts) and DAs (Data Analysts).

  • The choice of metrics is up to you.
  • You can either focus on a general solution or delve into solutions built around specific tools, technologies, and platforms of your choice!

Question source: https://www.youtube.com/watch?v=53tcAZ6Qda8&t=603s
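For a sense of scope, here is a minimal sketch of one way the near-real-time ingestion stage could look, assuming Kafka as the event transport and Spark Structured Streaming as the processing engine; the topic name, event schema, chosen metric, and sink paths are all hypothetical choices for illustration, not part of the question:

```python
# Hedged sketch: near-real-time ingestion of playback events, assuming
# Kafka + Spark Structured Streaming. Topic, schema, and sink paths are invented.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("playback-ingestion-sketch").getOrCreate()

event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("video_id", StringType()),
    StructField("event_type", StringType()),      # e.g. play, pause, stop
    StructField("event_time", TimestampType()),
    StructField("region", StringType()),          # geo partition key
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
    .option("subscribe", "playback-events")              # hypothetical topic
    .load()
    .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# Example metric: plays per video per region over 1-minute tumbling windows,
# tolerating 5 minutes of late-arriving data.
plays_per_minute = (
    events.where(F.col("event_type") == "play")
    .withWatermark("event_time", "5 minutes")
    .groupBy(F.window("event_time", "1 minute"), "video_id", "region")
    .count()
)

# Land results somewhere BA/DA dashboards can query (here: partitioned Parquet
# as a stand-in for the OLAP/staging database).
query = (
    plays_per_minute.writeStream
    .outputMode("append")
    .format("parquet")
    .option("path", "s3://example-bucket/metrics/plays_per_minute/")
    .option("checkpointLocation", "s3://example-bucket/checkpoints/plays/")
    .start()
)
```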

Q2:

Feedback (Case #1)
  1. TC made for a flexible design and recognized the business context justifying the specific metrics chosen.
  2. TC recognized performance bottlenecks in their pipeline.
    1. They noticed them both upstream (at events collection) and downstream (when storing results to staging databases).
  3. TC provided different types of analytics for the OLAP/historical database – Athena.
  4. TC recognized use cases for storing events in a Data Lake (see the sketch after this list).
  5. TC recognized separate paths for ingestion and analytics, and kept them separate to account for performance.
  6. TC developed a data pipeline with minimal components and minimal infrastructure.
  7. TC engaged in a solid discussion of the push model versus the pull model in their pipeline’s data capture stage.
  8. TC engaged in data modeling.
  9. TC mentioned multiple different technologies across pipeline stages and their associated trade-offs.
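On the Data Lake point (item 4), here is a minimal sketch of what landing raw events might look like, again assuming Kafka and Spark Structured Streaming; the bucket layout and topic name are hypothetical:

```python
# Hedged sketch: land raw events untouched in cheap object storage,
# partitioned by ingestion date, so downstream analytics can be rebuilt later.
# Broker, topic, and paths are invented for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("raw-event-landing-sketch").getOrCreate()

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
    .option("subscribe", "clickstream")                   # hypothetical topic
    .load()
    .withColumn("ingest_date", F.to_date(F.col("timestamp")))
)

(
    raw.writeStream
    .format("parquet")
    .partitionBy("ingest_date")
    .option("path", "s3://example-lake/raw/clickstream/")
    .option("checkpointLocation", "s3://example-lake/checkpoints/raw_clickstream/")
    .start()
)
```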

Feedback (Case #2)
  1. TC asked really good clarifying questions and properly understood the types of metrics they’d drill down into.
  2. TC thought of really good customer metrics: onboarded, retained, resurrected, and churn rates.
  3. TC solidly started with data modeling – fact and dimension tables.
    1. TC showed how to create a cumulative fact table to compute 30/90-day rolling averages (see the sketch after this list).
  4. TC made a good justification for an OLAP database. I could easily segue into the pipeline/ingestion portion of the problem.
  5. TC understood how to employ distributed compute engines (DCEs) to solve the problem.
  6. TC built a multi-stage ingestion pipeline for IoT telemetry data and delved into different components when I asked.
  7. TC justified how to extend the architecture to real-time, not just batching.
  8. TC answered how to handle large volumes and large events, and how to employ strategies to reduce upstream ingestion.
  9. TC had really good discussions on log enrichment and canonical data in different pipeline phases.
  10. TC answered the remaining design deep dives solidly.
  11. TC really “drove the conversation” and the discussion; I learned new ways of problem-solving and thinking from them.
  12. TC is really thinking about data quality and the internal dashboards they’d present for noticing discrepancies and metrics crossing thresholds.
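As a concrete illustration of the cumulative/rolling-average point above, here is a minimal PySpark sketch (my own, with hypothetical table and column names, not the candidate’s actual solution) computing a 30-day rolling average over a daily-grain fact table:

```python
# Hedged sketch: 30-day rolling average of daily watch hours per user from a
# daily-grain fact table. Table and column names are invented for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("rolling-average-sketch").getOrCreate()

# Daily-grain fact table: one row per (user_id, activity_date).
daily_fact = spark.table("analytics.fact_user_activity_daily")  # hypothetical table

# Order the window by days elapsed so rangeBetween counts calendar days, not
# rows; this keeps the average correct even when a user has gaps in activity.
days = F.datediff(F.col("activity_date"), F.to_date(F.lit("1970-01-01")))
w30 = Window.partitionBy("user_id").orderBy(days).rangeBetween(-29, 0)

rolling = daily_fact.withColumn(
    "watch_hours_30d_avg", F.avg("watch_hours").over(w30)
)
# A 90-day average works the same way with rangeBetween(-89, 0).
```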