Lies, damned lies, and statistics – Mark Twain
Hi all,
I ran into this scenario at work¹: we needed to deploy and productionize a feature that would exhaustively read all records for a given data source. The biggest challenge at hand was figuring out an effective test, one that would be quick yet comprehensive enough. What strategy yields the most confidence that a feature works?
My feature integrates with multiple data sources (e.g. PostgreSQL, DB2, SQL Server), each with a large number of logical abstractions/subdomains under it (e.g. tables, columns, or schemas). The number of subdomains per source can reach 100+ (putting the estimate at a terabyte or more of on-disk data to process).
One-off testing in the local and pre-prod environments gives a visual level of confidence that a feature works, BUT there are cases where one-off testing is insufficient; it doesn’t give long-term confidence. There can be edge cases: a mistyped schema name, an intermittent network failure, or an unexpected configuration change. And even outside of failure modes, we humans don’t trust one-off tests immediately – most folks are suspicious and might think to themselves that the engineer was “just lucky” that day OR misinterpreted/misrepresented their tests.
But we can’t test every single subdomain for a source, for a plenitude of reasons. Such tests consume significant blocks of time (more than an hour per run). If a test fails, we would not only (a) have to re-execute it multiple times, but also (b) burn dev cycles on each re-execution and failure diagnosis.
Hmmm … can we establish an in-between goldilocks zone? If you recall your statistics classes, a sufficiently sized random sample drawn from a population can lend credence that a feature works. Out of a population of 1000, if I select, say, 25 members and demonstrate feature correctness on each, I establish enough credibility.
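To put a rough number on that intuition: if the feature silently fails on some fraction of subdomains, the chance that a random sample passes anyway shrinks fast as the sample grows. Here’s a minimal sketch – the defect rates and sample sizes are illustrative assumptions, not measurements from my system:

```python
# Probability that a random sample of n subdomains ALL pass,
# even though the feature actually fails on a fraction p of them.
# The smaller this number, the stronger the evidence an all-green sample gives.

def prob_all_pass(p: float, n: int) -> float:
    """Chance of n independent passes when each subdomain fails with probability p."""
    return (1 - p) ** n

# Illustrative: a hidden 10% defect rate, at a few sample sizes.
for n in (5, 10, 25):
    print(f"n={n:2d}: P(all pass | 10% defect rate) = {prob_all_pass(0.10, n):.3f}")
# n= 5: 0.590  -> a tiny sample can easily miss the bug
# n=10: 0.349
# n=25: 0.072  -> ~93% chance the sample would have caught it
```

Strictly speaking this assumes independent draws (sampling with replacement), but with a population around 1000 and a sample of 25, the correction for sampling without replacement is negligible.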
Fortuity had it that the following were available:
a. Feature testing could be scoped at the data source level and at the subdomain level (the table/the column).
b. There are three testing environments: local, pre-production, and production.
c. Engineers can execute their tests faster, and with more control, in the local and pre-production environments.
Alright, let me circle back to my team’s four data sources. We chose a handful of subdomains – five per source – for one-off test execution. That gives us a sample of 20 unique subdomains (four sources × five subdomains each). Twenty passing members gives confidence that the feature “does what it does”. This led us to a final testing strategy, doable within an hour by a single developer: execute 4 subdomain one-off tests in the local environment and 20 subdomain one-off tests in the pre-production environment.
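For the curious, here’s roughly how the sampling itself could look. The source names, subdomain catalog, and the `run_one_off_test` helper are all hypothetical stand-ins for whatever test harness you already have:

```python
import random

# Hypothetical catalog: each data source maps to its full list of subdomains.
CATALOG = {
    "postgres":  [f"pg_table_{i}" for i in range(120)],
    "db2":       [f"db2_table_{i}" for i in range(100)],
    "sqlserver": [f"mssql_table_{i}" for i in range(110)],
    "source_4":  [f"s4_table_{i}" for i in range(105)],
}

SAMPLE_PER_SOURCE = 5  # 4 sources * 5 subdomains = 20 one-off tests

def pick_sample(catalog: dict[str, list[str]], k: int, seed: int = 42):
    """Randomly pick k subdomains per source; a fixed seed keeps the run reproducible."""
    rng = random.Random(seed)
    return {source: rng.sample(subdomains, k) for source, subdomains in catalog.items()}

if __name__ == "__main__":
    for source, subdomains in pick_sample(CATALOG, SAMPLE_PER_SOURCE).items():
        for sub in subdomains:
            # run_one_off_test(source, sub)  # your existing one-off harness goes here
            print(f"would test {source}/{sub}")
```

Pinning the seed is a deliberate choice: it makes a failing sample reproducible across re-executions, which matters when a single run already costs the better part of an hour.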
- Credit to the discussions with other engineers for conjuring up an optimal testing strategy – some ideas were mine, others were theirs, and they converged on a good one at the end. Credit to the parties it’s due to (names redacted for journalistic integrity). ↩︎
