How to Set Up Snowplow for Mobile Analytics

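Using Snowplow will reduce your analytics costs. This is the first article in the series, and it walks through the whole process of setting up event delivery from a mobile application to a Redshift database. In the next article we will take a closer look at how to build dashboards for viewing the collected data.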








Tristan Handy's article «The Startup Founder's Guide to Analytics» is an excellent introduction to the topic. A translation is available on Habr: https://habr.com/ru/post/346326/



The author recommends using Snowplow for analytics:



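“Migrate from your existing analytics and event tracking system to Snowplow Analytics. Snowplow does everything the paid tools do, but it is open source. You can host it yourself (and pay only for the EC2 instances), or pay Snowplow or Fivetran to host the event collector. If you do not make the switch now, you will miss out on collecting more detailed data and will face a massive Segment, Heap, or Mixpanel bill in the near future. Once you are past this stage, the paid tools can easily charge you $10,000 a month.”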



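Let's assume that sounds convincing enough. There is a wonderful guide by Simo Ahava on setting up Snowplow, which helped us a great deal when we first set up Snowplow.

We took that article as a basis, adapted it to use Snowplow with mobile applications, and updated some details, since a lot has changed in two years.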



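Getting started

For the whole process we will need:

  • A Linux/Unix command line (easily accessible through the Terminal app on Mac OS X).
  • A Git client. It is optional, but it makes cloning the Snowplow repository easy.
  • A new Amazon Web Services account, which comes with a free trial period for the first 12 months.
  • A credit card.
  • A domain name of your own (we use snowplow.denjoy.ru) whose DNS records you can edit.
  • An Android app with the Snowplow Tracker.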


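What are we building?

The pipeline consists of the following parts:

  • The tracker in the mobile app sends events to the Clojure Collector.
  • The collector runs as a web service on AWS Elastic Beanstalk, and traffic is routed to it through AWS Route 53.
  • The collector logs are stored in AWS S3.
  • An ETL (extract, transform, load) process runs on AWS EMR and writes its output back to S3.
  • The processed data is loaded into AWS Redshift.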



Step 0: Set up AWS IAM

What you need for this step:

  • An AWS account.



AMR



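To get started, you need to register with Amazon Web Services and create an IAM (Identity and Access Management) user that will be used for the setup.

To register, go to https://aws.amazon.com/ and create an account.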



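Create the user (IAM)

Once you have an account, the next task is to create an IAM user.

This IAM user will be used to set up Snowplow.

In the Services menu, select IAM.

  • Go to «Groups».
  • Click «Create new group».
  • Name the group snowplow-setup and click «Next step».
  • Skip the «Attach Policy» step by clicking «Next step».
  • Click «Create Group».

Now go to «Policy».

  • Click «Create Policy».
  • Open the JSON tab and paste the following policy: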


  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": [
          "acm:*",
          "autoscaling:*",
          "aws-marketplace:ViewSubscriptions",
          "aws-marketplace:Subscribe",
          "aws-marketplace:Unsubscribe",
          "cloudformation:*",
          "cloudfront:*",
          "cloudwatch:*",
          "ec2:*",
          "elasticbeanstalk:*",
          "elasticloadbalancing:*",
          "elasticmapreduce:*",
          "es:*",
          "iam:*",
          "rds:*",
          "redshift:*",
          "s3:*",
          "sns:*"
        ],
        "Resource": "*"
      }
    ]
  }


  • Click «Review policy».
  • Name the policy snowplow-setup-policy-infrastructure.
  • Click «Create Policy».

Go back to «Groups» and open the «snowplow-setup» group.

  • On the Permissions tab, click «Attach Policy».
  • Select snowplow-setup-policy-infrastructure and click «Attach Policy».


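Next, go to «Users» and click «Add user».

  • Name the user snowplow-setup.
  • Check «Programmatic access».
  • Click «Next: Permissions».
  • Choose «Add user to group», select snowplow-setup, and click «Next: Tags».
  • On the Tags step, click «Next: Review».
  • Click «Create user».

Once the user is created, its access keys are displayed, and this is the only time you can see them. To save them as a CSV file, click «Download .csv».

That's it, step 0 is done!

What we did in this step:

  • Created an AWS account.
  • Created the IAM user snowplow-setup with the permissions it needs.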


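Step 1: Set up the Clojure Collector

What you need for this step:

  • A domain name whose DNS records you can edit.

The Clojure Collector is a web endpoint that receives the events sent by the tracker. It is a web service that runs on Apache Tomcat and is hosted on AWS Elastic Beanstalk. The Clojure Collector writes the Tomcat logs to AWS S3, so every request sent to the Clojure Collector ends up as a log entry.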



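Download the Clojure Collector

First, download the WAR file of the Clojure Collector.

The file is named clojure-collector-1.X.X-standalone.war.

Once you have it, move on to Elastic Beanstalk.

In the AWS Services menu, select Elastic Beanstalk.

Check which AWS region is selected: it is easiest to keep every part of the Snowplow pipeline in the same region.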

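Create the Elastic Beanstalk application

  • Click «Create Application».
  • Give the application a name (for example, Snowplow Clojure Collector).
  • For Platform choose Tomcat, Tomcat 8.5 with Java 8 running on 64bit Amazon Linux.
  • Under Application Code, choose «Upload your code» and upload the WAR file.
  • Click «Create application».
  • Wait until the environment has been created.

To check that the Clojure Collector works, open the environment URL with /i appended in a browser. You should see a pixel, and on the Applications tab of the browser developer tools a cookie named sp.

Congratulations! The Clojure Collector is running.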



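Enable logging to S3

Writing the Tomcat logs to S3 is what the rest of the pipeline relies on: every HTTP request to the collector is recorded in these logs.

To write the logs to S3, change the Elastic Beanstalk configuration. Open Elastic Beanstalk in the AWS console.

  • Open the configuration of the environment.
  • Click «Edit» under «Software Configuration».
  • Under «S3 log storage», check «Rotate logs».

After this, the logs are copied to S3, where the ETL process will pick them up.

Click «Apply» and wait for the environment to update.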





Next, make the Elastic Beanstalk environment auto-scalable.

  • Go to «Configuration».
  • Under «Capacity», click «Edit».
  • Set «Environment Type» to «Load balanced» and apply the change.



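Set up a domain and SSL for Elastic Beanstalk

  • In the AWS Services menu, select «Route 53».
  • Click «Create hosted zone».
  • In Domain Name, enter your domain, for example snowplow.denjoy.ru. Choose «Public Hosted Zone» and click «Create hosted zone».
  • Open the new zone, find the NS record, and copy its values.
  • Go to the service where the NS records of your domain are managed; in our case it is Cloudflare.
  • Add the 4 NS records there.

From now on, NS lookups for snowplow.denjoy.ru are answered by the AWS name servers.

You can check that the change has propagated with a service such as https://dnschecker.org/.

Once the DNS change has propagated, go back to Route 53. The next task is to connect Route 53 to Elastic Beanstalk, so that when someone opens the URL snowplow.denjoy.ru, the AWS DNS routes the request to the Clojure Collector web service.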



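  • In the hosted zone, click «Create Record».
  • Choose «Simple Routing».
  • Click «Define simple record».
  • In the window that opens, leave «Record name» empty. In «Value/Route traffic to», choose «Alias to Elastic Beanstalk environment», and pick the region in the next field. In «Record type», choose the A record, then click «Define simple record» in the bottom corner of the window.
  • After the window closes, click «Create records».

Now, if you open http://snowplow.denjoy.ru/i in a browser, you should see the same pixel as when you opened the Clojure Collector page. Domain routing works!

But we are not done yet.

Set up HTTPS for the Clojure Collector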



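The AWS Load Balancer needs a (free) SSL certificate. Since the domain is managed in Route 53, AWS can validate it for you. To request the SSL certificate:

  • In the Services menu, open AWS Certificate Manager. Under «Provision certificates», click «Get started».
  • Choose «Request a public certificate».
  • Enter the domain name, for example snowplow.denjoy.ru, and click «Next».
  • Choose «DNS validation».
  • Skip the Tags step.
  • Click «Review», then «Confirm and request».
  • Expand the domain entry. To let AWS add the validation record itself, click «Create record in Route 53».
  • Click «Create».

After the record is created, click «Continue». Validation can take up to 30 minutes, so be patient!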



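Switch the Load Balancer to HTTPS

  • Go back to Elastic Beanstalk and open «Configuration».
  • Under «Load balancer», click «Edit».
  • Under «Listeners», click «Add listener».
  • Set Port to 443, set the protocol to HTTPS, choose the SSL certificate created above, and click «Add».
  • Click «Apply».

Done! The Snowplow Clojure Collector is now available over HTTPS.

What we did in this step:

  • Set up the Clojure Collector running on Elastic Beanstalk.
  • Routed our own domain to it with Amazon Route 53.
  • Issued an SSL certificate for the domain.
  • Enabled writing the Tomcat logs to S3.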


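Step 2: Set up the tracker

We test the collector with the Snowplow Android Tracker. Install the Tracker Demo app, launch it, and confirm the dialog with «Ok».

Enter the collector address, https://snowplow.denjoy.ru, choose HTTPS, and tap «Start». The app starts sending events.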



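The Clojure Collector running on Elastic Beanstalk writes the Tomcat logs to S3. To find them, open S3 in the AWS console.

In S3, open the bucket named elasticbeanstalk-region-id and go to resources/environments/logs/publish/(some ID)/(some ID). The first ID identifies the environment, for example e-ab12cd23ef, and the second identifies the instance, for example i-1234567890. The logs are stored there as gzip archives.

The files named like _var_log_tomcat8_rotated_localhost_access_log.txt123456789.gz are the access logs, and these are the files the ETL process will work with.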
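
When a log folder holds many files, it can help to filter the listing down to the rotated access logs. The following is a minimal sketch, assuming the file names follow the pattern shown above (the fixed prefix, a run of digits, then .gz). The function name and the second sample key are made up for illustration.

```python
import re

# Pattern for rotated Tomcat access logs, based on the example file name above:
# "_var_log_tomcat8_rotated_localhost_access_log.txt" + digits + ".gz".
ROTATED_ACCESS_LOG = re.compile(
    r"_var_log_tomcat8_rotated_localhost_access_log\.txt\d+\.gz$"
)


def rotated_access_logs(keys):
    """Return only the S3 keys that look like rotated Tomcat access logs."""
    return [key for key in keys if ROTATED_ACCESS_LOG.search(key)]


# Sample keys built from the example IDs in the text.
# The catalina file name is a hypothetical non-matching entry.
prefix = "resources/environments/logs/publish/e-ab12cd23ef/i-1234567890/"
sample_keys = [
    prefix + "_var_log_tomcat8_rotated_localhost_access_log.txt123456789.gz",
    prefix + "_var_log_tomcat8_rotated_catalina.out123456789.gz",
]

for key in rotated_access_logs(sample_keys):
    print(key)
```

The list of keys would normally come from an S3 listing; here it is hard-coded so the sketch runs on its own.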



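Open one of these files and look at the entries. The HTTP status should be 200, which means the Clojure Collector received the event successfully.

The event payload itself is sent as JSON.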

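Step 3: Set up the ETL process

What you need for this step:

  • A working Clojure Collector.
  • The IAM user created in step 0.

The ETL process runs on AWS Elastic MapReduce (EMR) and does the following:

  • Reads the raw Tomcat logs.
  • Enriches the events, for example with data derived from the IP address.
  • Validates the events against their JSON schema.
  • Prepares the data for loading into Amazon Redshift.

The ETL process keeps all of its data in S3 buckets, from the raw Tomcat logs to the final output.

The EmrEtlRunner Java application orchestrates the process. It starts the ETL job on Amazon Elastic MapReduce. Be patient: a run with all of its steps can take up to 60 minutes.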



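Install EmrEtlRunner

The ETL runner is a Unix application that you start from the command line. Download the archive named snowplow_emr_rXX, where XX is the release number. We used snowplow_emr_r117_biskupin.zip.

  • Unpack the ZIP archive; it contains the snowplow-emr-etl-runner file. Move it to a new working folder.
  • You also need the Snowplow GitHub repository, which contains the configuration samples and the SQL files used below.
  • To get it, open the folder that contains snowplow-emr-etl-runner and run: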


git clone https://github.com/snowplow/snowplow.git


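If you do not have Git, download the repository as an archive instead.

  • The working folder now contains the snowplow-emr-etl-runner file and the snowplow folder.
  • Create a config folder with a targets subfolder inside it.
  • Copy the following files:

    • snowplow/3-enrich/emr-etl-runner/config/config.yml.sample to config/config.yml.
    • snowplow/3-enrich/config/iglu_resolver.json to config/iglu_resolver.json.
    • snowplow/4-storage/config/targets/redshift.json to config/targets/redshift.json.

The resulting folder structure: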



  |-- snowplow-emr-etl-runner
  |-- snowplow
  | |-- -SNOWPLOW GIT REPO HERE-
  |-- config
  | |-- iglu_resolver.json
  | |-- config.yml
  | |-- targets
  | | |-- redshift.json
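
Before moving on, it may be worth checking that the working folder really matches the layout above. The following is a minimal sketch, assuming the script is run from the working folder. The function name is hypothetical and the list of paths is copied from the tree.

```python
from pathlib import Path

# Relative paths taken from the folder structure shown above.
REQUIRED_PATHS = [
    "snowplow-emr-etl-runner",
    "snowplow",
    "config/iglu_resolver.json",
    "config/config.yml",
    "config/targets/redshift.json",
]


def missing_paths(root):
    """Return the required paths that do not exist under `root`."""
    root = Path(root)
    return [p for p in REQUIRED_PATHS if not (root / p).exists()]


# An empty list means the layout is complete.
print(missing_paths("."))
```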
  


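Create an EC2 key pair

The ETL job runs on Amazon EC2 instances, so you need an Amazon EC2 key pair.

  • In the AWS Services menu, open EC2 and go to «Key Pairs».
  • Make sure the right region is selected.
  • Click «Create key pair».
  • Give the key pair a name. We called ours denjoy-snowplow.
  • Choose the pem format.
  • A file named <key pair name>.pem is downloaded; keep it in a safe place.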

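Create the S3 buckets

The ETL process stores all of its data in Amazon S3, so the next task is to create the buckets it uses.

The following locations are needed:

  • :raw:in: the raw collector logs. This is the elasticbeanstalk bucket that Elastic Beanstalk created for the Clojure Collector.
  • :processing: a temporary location for the logs being processed.
  • :archive: three folders, :raw (processed raw logs), :enriched (enriched data), and :shredded (shredded data).
  • :enriched: two folders, :good (events that were enriched successfully) and :bad (events that failed enrichment).
  • :shredded: two folders, :good (events that were shredded successfully) and :bad (events that failed shredding).
  • :log: the logs of the ETL process itself.

To create the buckets, select S3 in the AWS Services menu.

The :raw:in bucket already exists; its name starts with elasticbeanstalk-.

Now create a bucket for the data produced by ETL.

Click «Create bucket» and give it a name, for example denjoy-snowplow-data. Bucket names must be unique across all of S3, so a plain name like snowplow will not work. Click «Next» through the remaining steps, then click «Create bucket».

Open the new bucket.

Click «Create folder» and create these folders:

  • archive
  • shredded
  • enriched

Inside archive, create:

  • raw
  • enriched
  • shredded

Inside each of enriched and shredded, create:

  • good
  • bad

In the end, the structure looks like this: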



  |-- elasticbeanstalk-region-id
  |-- denjoy-snowplow-data
  | |-- archive
  | | |-- raw
  | | |-- enriched
  | | |-- shredded
  | |-- enriched
  | | |-- good
  | | |-- bad
  | |-- shredded
  | | |-- good
  | | |-- bad
  


Create one more S3 bucket, denjoy-snowplow-log. It will hold the logs of the ETL process.
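
The folder tree above can also be written down as data, which makes it easy to list every folder that has to exist in the data bucket. The following is a minimal sketch. The names LAYOUT and folder_keys are made up for illustration, and the code only prints the keys; it does not talk to S3.

```python
# Folder layout of the data bucket, as listed in the tree above.
LAYOUT = {
    "archive": ["raw", "enriched", "shredded"],
    "enriched": ["good", "bad"],
    "shredded": ["good", "bad"],
}


def folder_keys(layout):
    """Flatten the nested layout into S3-style folder keys ending in '/'."""
    keys = []
    for parent, children in layout.items():
        keys.append(f"{parent}/")
        keys.extend(f"{parent}/{child}/" for child in children)
    return keys


for key in folder_keys(LAYOUT):
    print(key)
```

Creating the folders by hand in the console, as described above, gives the same result; the list is just a checklist in code form.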



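Prepare to configure EmrEtlRunner

Now let's get ready to configure EmrEtlRunner. The settings live in the config.yml file that you copied from the snowplow repository into config/. You will need:

  • The access keys of the snowplow-setup user created in step 0 in AWS IAM.
  • The AWS command-line client. It can be installed with Python/pip or, on Mac OS X, with Homebrew. If you have Homebrew, run brew install awscli to install the AWS client.

Once awscli is installed, run aws configure and enter the access key ID, the secret access key, and the region, for example eu-west-1.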



  $ aws configure
  AWS Access Key ID: <enter your IAM user Access Key ID here>
  AWS Secret Access Key: <enter your IAM user Secret Access Key here>
  Default region name: <enter the region name, e.g. eu-west-1 here>
  Default output format: <just press enter>
  


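After aws configure, run aws emr create-default-roles. It creates the default roles that EmrEtlRunner needs in order to start EC2 instances.

Now we are ready to configure EmrEtlRunner!

Configure EmrEtlRunner

EmrEtlRunner is the snowplow-emr-etl-runner file.

EmrEtlRunner performs the ETL as a sequence of steps. One of them, step 13, is rdb_load, which we come back to when Redshift is set up.

EmrEtlRunner is configured in the config.yml file in the config folder. An example of a filled-in file: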



  aws:
    access_key_id: AKIAIBAWU2NAYME55123
    secret_access_key: iEmruXM7dSbOemQy63FhRjzhSboisP5TcJlj9123
    s3:
      region: eu-west-1
      buckets:
        assets: s3://snowplow-hosted-assets
        jsonpath_assets:
        log: s3://simoahava-snowplow-log
        raw:
          in:
            - s3://elasticbeanstalk-eu-west-1-375284143851/resources/environments/logs/publish/e-f4pdn8dtsg
          processing: s3://simoahava-snowplow-data/processing
          archive: s3://simoahava-snowplow-data/archive/raw
        enriched:
          good: s3://simoahava-snowplow-data/enriched/good
          bad: s3://simoahava-snowplow-data/enriched/bad
          errors:
          archive: s3://simoahava-snowplow-data/archive/enriched
        shredded:
          good: s3://simoahava-snowplow-data/shredded/good
          bad: s3://simoahava-snowplow-data/shredded/bad
          errors:
          archive: s3://simoahava-snowplow-data/archive/shredded
    emr:
      ami_version: 5.9.0
      region: eu-west-1
      jobflow_role: EMR_EC2_DefaultRole
      service_role: EMR_DefaultRole
      placement:
      ec2_subnet_id: subnet-d6e91a9e
      ec2_key_name: simoahava
      bootstrap: []
      software:
        hbase:
        lingual:
      jobflow:
        job_name: Snowplow ETL
        master_instance_type: m1.medium
        core_instance_count: 2
        core_instance_type: m1.medium
        core_instance_ebs:
          volume_size: 100
          volume_type: "gp2"
          volume_iops: 400
        ebs_optimized: false
        task_instance_count: 0
        task_instance_type: m1.medium
        task_instance_bid: 0.015
      bootstrap_failure_tries: 3
      configuration:
        yarn-site:
          yarn.resourcemanager.am.max-attempts: "1"
        spark:
          maximizeResourceAllocation: "true"
      additional_info:
  collectors:
    format: clj-tomcat
  enrich:
    versions:
      spark_enrich: 1.12.0
    continue_on_unexpected_error: false
    output_compression: NONE
  storage:
    versions:
      rdb_loader: 0.14.0
      rdb_shredder: 0.13.0
      hadoop_elasticsearch: 0.1.0
  monitoring:
    tags: {}
    logging:
      level: DEBUG
  


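The fields you need to change are described below.

:aws:access_key_id
The Access Key ID of the IAM user.

:aws:secret_access_key
The Secret Access Key of the IAM user.

:aws:s3:region
The region in which the S3 buckets are located.

:aws:s3:buckets:log
The S3 bucket for the ETL logs.

:aws:s3:buckets:raw:in
The location of the Tomcat logs.

:aws:s3:buckets:raw:processing
The location for the logs being processed.

:aws:s3:buckets:raw:archive
The archive of raw logs.

:aws:s3:buckets:enriched:good
The location for successfully enriched events.

:aws:s3:buckets:enriched:bad
The location for events that failed enrichment.

:aws:s3:buckets:enriched:errors
Left empty in the example.

:aws:s3:buckets:enriched:archive
The archive of enriched events.

:aws:s3:buckets:shredded:good
The location for successfully shredded events.

:aws:s3:buckets:shredded:bad
The location for events that failed shredding.

:aws:s3:buckets:shredded:errors
Left empty in the example.

:aws:s3:buckets:shredded:archive
The archive of shredded events.

:aws:emr:region
The region in which the EC2 instances will run.

:aws:emr:placement
Set only when not running in a VPC; otherwise leave it empty.

:aws:emr:ec2_subnet_id
The ID of the VPC subnet. You can look it up in the properties of the running EC2 instance, as described below.

:aws:emr:ec2_key_name
The name of the EC2 key pair.

:collectors:format
Set to clj-tomcat.

:monitoring:snowplow
Remove this block together with its fields (:method, :app_id and :collector).

A note on :aws:s3:buckets:raw:in: in the example above, the path ends at the environment folder (e-...) and does not include the instance folder (i-...).

To find the value for :aws:emr:ec2_subnet_id, open EC2 in the AWS Services menu. Go to «Instances» and select the running instance. Copy the «subnet» value into aws:emr:ec2_subnet_id.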



When everything is configured, run the following command from the folder that contains snowplow-emr-etl-runner.



./snowplow-emr-etl-runner run -c config/config.yml -r config/iglu_resolver.json


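If the run fails with the error

Invalid InstanceProfile: EMR_EC2_DefaultRole.

make sure the default EMR roles exist; they are created by the aws emr create-default-roles command.

When the ETL run finishes, its output is in S3.

ETL works, so it is time to set up AWS Redshift!

What we did in this step:

  • Installed and configured snowplow-emr-etl-runner.
  • Created the S3 buckets.
  • Ran the ETL process, which wrote its output to S3.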


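Step 4: Load the data into Redshift

What you need for this step:

  • A working ETL process.
  • The S3 buckets.
  • A GUI SQL client. We use Table Plus, but any client will do.

In this step we load the data into Redshift. Redshift is a data warehouse hosted on AWS. Unlike the raw Tomcat logs, the data in it can be queried with SQL. If you are not familiar with SQL, Codecademy has a course on SQL!

The plan:

  • Create a Redshift cluster.
  • Create the tables and users.
  • Configure EmrEtlRunner to load data into Redshift.

The tables are created from SQL files in the Snowplow repository, which you already cloned when installing EmrEtlRunner.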





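Create the cluster

In the AWS Services menu, select Amazon Redshift.

Check that the right region is selected, then click «Launch Cluster».

Give the cluster an identifier, for example snowplow-cluster. Name the database, for example snowplow.

For Node type choose dc2.large, and for Cluster type choose Single Node with 1 node.

Keep the default port (5439).

Set the master user name and password, and write them down: you will need them to connect.

When you are done, click «Create cluster».

Creating the cluster takes some time. When it is ready, it appears in the Redshift console.

Next, allow incoming connections to the cluster.

Go to «Clusters» and open the cluster you created.

On the «Properties» tab, under «Network and security», click the entry under VPC security groups (for example sg-c3f5c687).

This takes you to the EC2 console.

On the «Inbound rules» tab, add a rule that allows TCP connections on port 5439 from 0.0.0.0/0.

Go back to Amazon Redshift and copy the endpoint of the cluster.

Now open your SQL client; we use Table Plus. Click «Create new connection» and fill in:

  • Driver: Amazon Redshift (com.amazon.redshift.jdbc.Driver)
  • Host: endpoint
  • User: awsuser
  • Password: master_password
  • Database: snowplow

Click «Connect».

Run SELECT current_database(); with «Run current». If it returns the name of the database, the connection works!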





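First, create the tables for the events sent by the Android Tracker. The tables are defined in .sql files that contain DDL statements.

The .sql files are in the Snowplow repository.

Run atomic-def.sql in Table Plus. It creates the atomic schema and the atomic.events table.

Then run manifest-def.sql.

To check that the tables were created, run the following SQL query: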



SELECT * FROM pg_tables WHERE schemaname='atomic';


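Next, create three users:

  • storageloader, which the ETL process uses to load data.
  • power_user, which has full access for administration.
  • read_only, which can only query the data.

Run the following SQL statements, replacing $password with a password for each user.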



  CREATE USER storageloader PASSWORD '$password';
  GRANT USAGE ON SCHEMA atomic TO storageloader;
  GRANT INSERT ON ALL TABLES IN SCHEMA atomic TO storageloader;
  CREATE USER read_only PASSWORD '$password';
  GRANT USAGE ON SCHEMA atomic TO read_only;
  GRANT SELECT ON ALL TABLES IN SCHEMA atomic TO read_only;
  CREATE SCHEMA scratchpad;
  GRANT ALL ON SCHEMA scratchpad TO read_only;
  CREATE USER power_user PASSWORD '$password';
  GRANT ALL ON DATABASE snowplow TO power_user;
  GRANT ALL ON SCHEMA atomic TO power_user;
  GRANT ALL ON ALL TABLES IN SCHEMA atomic TO power_user;
  


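All 12 statements should complete successfully.

Finally, every table in the atomic schema must be owned by storageloader.

To do this, run the following query: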



  SELECT 'ALTER TABLE atomic.' || tablename ||' OWNER TO storageloader;'
  FROM pg_tables WHERE schemaname='atomic' AND NOT tableowner='storageloader';
  


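It returns one statement per table, of the form:

ALTER TABLE atomic.* OWNER TO storageloader;

Copy the returned statements and run them.

To check the result, run

SELECT * FROM pg_tables WHERE schemaname='atomic' AND tableowner='storageloader';

It should list every table in the schema.

Now, when EmrEtlRunner runs the ETL, it uses the storageloader user to load the data from S3 into Redshift.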
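
The ownership query above builds its ALTER TABLE statements by string concatenation. The same idea, sketched in Python, may make it clearer what the output looks like. The function name is hypothetical, "events" comes from atomic.events above, and "manifest" is an assumed table name used only for illustration.

```python
def owner_statements(tables, owner="storageloader", schema="atomic"):
    """Build one ALTER TABLE ... OWNER TO statement per table name."""
    return [f"ALTER TABLE {schema}.{table} OWNER TO {owner};" for table in tables]


# "events" is the table created by atomic-def.sql; "manifest" is an assumption.
for statement in owner_statements(["events", "manifest"]):
    print(statement)
```

This only generates text; the statements still have to be run in the SQL client, as described above.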



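Create an IAM role

EmrEtlRunner loads the data into Redshift with RDB Loader. For that to work, you need an IAM role that lets Redshift read from the S3 buckets.

  • In the AWS Services menu, open IAM.
  • Go to Roles and click «Create role».
  • Under «Select type of trusted entity», choose AWS service, then Redshift. Under «Select your use case», choose «Redshift — Customizable» and click «Next: Permissions».
  • Find AmazonS3ReadOnlyAccess and check it. Click «Next: Tags».
  • Click «Next: Review».
  • Give the role a name, for example RedshiftS3Access, and click «Create role».
  • Open the RedshiftS3Access role you just created and find its Role ARN. Copy it.
  • Go to Amazon Redshift and open the list of clusters.
  • Select the Snowplow cluster and click «Manage IAM roles».
  • Under «Available IAM roles», select the role, click «Add IAM role», then «Done».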

Configure the Redshift target

In step 3 we copied redshift.json into the config/targets/ folder.

Open redshift.json. It looks like this:



  {
    "schema": "iglu:com.snowplowanalytics.snowplow.storage/redshift_config/jsonschema/2-1-0",
    "data": {
      "name": "AWS Redshift enriched events storage",
      "host": "ADD HERE",
      "database": "ADD HERE",
      "port": 5439,
      "sslMode": "DISABLE",
      "username": "ADD HERE",
      "password": "ADD HERE",
      "roleArn": "ADD HERE",
      "schema": "atomic",
      "maxError": 1,
      "compRows": 20000,
      "sshTunnel": null,
      "purpose": "ENRICHED_EVENTS"
    }
  }
  


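Fill in the following fields:

  • host: the endpoint URL of the Redshift cluster
  • database: the name of the database
  • username: storageloader
  • password: the password of storageloader
  • roleArn: the ARN of the IAM role created above.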
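
It is easy to leave one of the "ADD HERE" placeholders in redshift.json by accident. The following is a minimal sketch of a check for that, assuming the file has the "data" object shown above. The function names are hypothetical, and the host value in the sample is a made-up placeholder, not a real endpoint.

```python
import json


def load_target(path):
    """Read a target file such as config/targets/redshift.json."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def unfilled_fields(config):
    """Return the keys under "data" that still hold the "ADD HERE" placeholder."""
    return sorted(k for k, v in config["data"].items() if v == "ADD HERE")


# A trimmed copy of the template; the host is an invented example value.
target = {
    "data": {
        "host": "example-cluster.example.invalid",
        "database": "snowplow",
        "username": "storageloader",
        "password": "ADD HERE",
        "roleArn": "ADD HERE",
    }
}

print(unfilled_fields(target))
```

For the real file, the check would be unfilled_fields(load_target("config/targets/redshift.json")); an empty list means every placeholder has been replaced.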



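Run EmrEtlRunner again

Now everything is ready to run EmrEtlRunner with loading into Redshift.

Run the following command (from the folder that contains snowplow-emr-etl-runner):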



./snowplow-emr-etl-runner run -c config/config.yml -r config/iglu_resolver.json -t config/targets


The run takes the logs from :raw:in (that is, the Tomcat logs), processes them, and loads the result into Redshift.

You can now connect as the read_only user and query the data.





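That is it: Snowplow is set up.

  • We created an Amazon account and pointed the DNS of our domain at AWS.
  • We set up the Clojure Collector, a web service that records HTTP requests in the Tomcat logs and stores them in S3 buckets.
  • We configured the ETL process, which processes the logs and stores the results in S3.
  • Finally, we loaded the ETL output into AWS Redshift.

The Snowplow Discourse forum is a good place to ask questions if something does not work.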