谁负责分析质量:数据仓库质量保证

相信您的眼睛和在仪表板上看到的内容

Wheely,我们严重依赖数据来制定运营和战略决策。从向合作伙伴支付每周奖金到扩展到其他城市和国家。

每个经理或产品负责人都非常了解自己所在的地区,任何偏离都会引起疑问。因此,对仪表板和指标的可靠性提出了更高的要求。我们的Google Analytics(分析)团队努力在发现问题之前发现并解决问题。 

如您所知,更容易预防,因此我决定以系统和积极的方式来解决该问题。而且,当然,我要做的第一件事是创建一个Slack通道,在其中配置了有关管道中任何错误的通知的传递。

对数据集市的相关性充满信心

, :

  • 10

  • 8

  • DWH

, QA :

  • ,

:

  • .yml freshness:

freshness:
   warn_after: {count: 4, period: hour}
   error_after: {count: 8, period: hour}
 loaded_at_field: "__etl_loaded_at"

  • SQL-:

select
 max({{ loaded_at_field }}) as max_loaded_at,
 {{ current_timestamp() }} as snapshotted_at
from {{ source }}
where {{ filter }}
  • :

, , :

  • (edge cases),

  • (bottleneck)

:

  • : , Out of Memory, Disk Full

  • SLA

:

  • , + ( )

  • CPU

  • - IO, network

.

:

  • ,

:

+pre-hook: "{{ logging.log_model_start_event() }}"
+post-hook: "{{ logging.log_model_end_event() }}"

, , . - , , , , PRIMARY KEY, FOREIGN KEY, NOT NULL, UNIQUE.

DWH . . .. , .

:

  • (NULL) , ?

  • (UNIQUE ID )?

  • (PRIMARY - FOREIGN KEYS)?

  • , (ACCEPTED VALUES)?

QA :

  • ,

:

  • .yml tests:

- name: dim_cars
     description: Wheely partners cars.
     columns:
         - name: car_id
           tests:
               - not_null
               - unique
         - name: status
           tests:
               - not_null
               - accepted_values:
                   values: ['deleted', 'unknown', 'active', 'end_of_life', 'pending', 'rejected'
                           , 'blocked', 'expired_docs', 'partner_blocked', 'new_partner']   

  • SQL-

-- NOT NULL test
select count(*) as validation_errors
from "wheely"."dbt_test"."dim_cars"
where car_id is null
 
-- UNIQUE test
select count(*) as validation_errors
from (
   select
       car_id
   from "wheely"."dbt_test"."dim_cars"
   where car_id is not null
   group by car_id
   having count(*) > 1
) validation_errors
 
-- ACCEPTED VALUES test
with all_values as (
   select distinct
       status as value_field
   from "wheely"."dbt_test"."dim_cars"
),
validation_errors as (
   select
       value_field
   from all_values
   where value_field not in (
       'deleted','unknown','active','end_of_life','pending','rejected','blocked','expired_docs','partner_blocked','new_partner'
   )
)
select count(*) as validation_errors
from validation_errors

-

- - , . -, .

:

  • ,

  • %

( ), .

QA :

  • , -.

:

  • SQL ,

  • SQL-

  • (PASSED) 0 , (FAILED) >= 1

Continuous Integration - DWH

, . DWH . . , , , PROD- PR Merge:

  • DEV- PROD-

  • (, Out of Memory)

- Continuous Integration (CI). !

:

  • master- PROD- DWH  .

:

  • CI (, PROD-, 7 )

  • feature- master

- DWH

( ) :

  • DWH ,

  • (, , ) --

, , (, ).

:

  • , () .

, :

  • , : , , (, , ), (, , ).

  • ,

  • DWH

drill-down :

, . , :

  • ,

  • Continuous Integration and Testing

  • ( )

, Wheely. , .

, , , «Data Engineer» OTUS, .

4 20:00 «Data Engineer». OTUS , .

:

  1. Data Build Tool - DBT

  2. The farm-to-table testing framework -

  3. Tests - Related reference docs - DBT,

  4. How to get started with data testing - dbt discourse

  5. Data testing: why you need it -

  6. Manual Work is a Bug - DRY




All Articles