Teams, tools, and practices for scalable and resilient data value at Klarna Bank

Manage your data infrastructure like this:
(i) don’t drown the infra teams in domain data specifics
(ii) build robust low-latency lookup facilities to feed online services
(iii) always take stress out of the equation

At Klarna Bank we make online decisions on risk, fraud, and ID. Over a hundred data sources are processed by over a hundred analysts and over a hundred batch jobs. Three data infrastructure engineering teams operate and develop this data lake: a core team, an apps team, and a performance team. The total head count is less than a dozen.

To stay afloat, we’ve distilled the following practices: (i) the immutability and recomputation properties of the Lambda/Kappa architectures, (ii) continuously delivered and automated infrastructure, (iii) tooling that empowers producers and consumers of data to be accountable and self-sufficient, and (iv) proactively improving the efficiency of data users.
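
To make (i) concrete, here is a minimal Java sketch of the recomputation property; the names (Event, recompute) are illustrative only:

// Events are immutable, append-only facts; any derived view is a pure
// function of the full log and can be recomputed from scratch at any time.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class RecomputableView {

    record Event(String accountId, double amount) {} // a fact; never updated

    // Batch recomputation: fold over all data. Because inputs are immutable,
    // rerunning this after a bug fix deterministically repairs the view.
    static Map<String, Double> recompute(List<Event> log) {
        Map<String, Double> balances = new HashMap<>();
        for (Event e : log) {
            balances.merge(e.accountId(), e.amount(), Double::sum);
        }
        return balances;
    }
}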

We’ll talk about some of the practices and tools we have built during several years of running banking applications on Hortonworks Hadoop. Ecosystem components we’ll touch on include Kafka, Avro, Hive, Oozie, ELK, Ranger, and Ansible. Tools we have developed include HiveRunner and tooling for data import, along with continuous delivery of data pipelines.
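
For a flavour of HiveRunner, here is a sketch of a unit test in its JUnit 4 style; the schema, data, and expected value are made up for illustration, so check the project itself for the exact API:

// A HiveRunner-style test: an in-process HiveShell is injected by the runner,
// the test sets up a tiny schema, and a query result is asserted as strings.
import static org.junit.Assert.assertEquals;

import com.klarna.hiverunner.HiveShell;
import com.klarna.hiverunner.StandaloneHiveRunner;
import com.klarna.hiverunner.annotations.HiveSQL;
import java.util.List;
import org.junit.Before;
import org.junit.Test;
import org.junit.runner.RunWith;

@RunWith(StandaloneHiveRunner.class)
public class PaymentsAggregationTest {

    @HiveSQL(files = {})
    private HiveShell shell; // injected by the runner

    @Before
    public void setup() {
        shell.execute("CREATE DATABASE source_db");
        shell.execute("CREATE TABLE source_db.payments (id STRING, amount DOUBLE)");
        shell.execute("INSERT INTO source_db.payments VALUES ('a', 10.0), ('b', 32.0)");
    }

    @Test
    public void sumsAmounts() {
        List<String> result =
                shell.executeQuery("SELECT SUM(amount) FROM source_db.payments");
        assertEquals("42.0", result.get(0));
    }
}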

Speakers
Erik Zeitler, Senior Data Engineer, PhD, Klarna Bank
Per Ullberg, Lead Software Engineer, Klarna Bank

Teams, tools, and practices for scalable and resilient data value at Klarna Bank

[Slide deck: 39 slides; the extracted slide text is not recoverable.]

Editor's notes

  • Who are we?

    Online payment provider
    Over a decade's worth of experience in payments
  • Erik
  • Interaction is valued - questions are welcome - we'll see how far we get
  • Erik
  • Erik
  • Erik
  • Erik
  • Pelle (these notes repeat verbatim for the next four slides; see the sketch after this list)

    Linear projection
    Little domain logic
    Append-only
    Queryability
    Downstream performance
    Validate/invalidate data
    Route
    Normalize

    Single-source aggregation
    More domain knowledge
    Velocity, sessions, etc.
    May be incremental - a single source makes it easy to understand what has changed

    General data processing
    Massive domain knowledge
    F(all data) - multiple sources
    Big performance gains
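
Since the first tier is the easiest to pin down, here is a minimal Java sketch of a linear projection step; all names (RawEvent, NormalizedEvent, route) are illustrative, not our actual code:

// A stateless linear-projection step: validate, normalize, route, append.
// The heavier multi-source F(all data) work belongs to the later tiers.
import java.util.Optional;

public final class LinearProjection {

    record RawEvent(String source, String payload) {}
    record NormalizedEvent(String source, String payload, long ingestedAtMillis) {}

    // Validate: reject malformed input rather than letting it pass downstream.
    static boolean isValid(RawEvent e) {
        return e.source() != null && e.payload() != null && !e.payload().isBlank();
    }

    // Normalize: canonical casing and whitespace, stamped with ingestion time.
    // The output is appended, never updated, which keeps the step replayable.
    static NormalizedEvent normalize(RawEvent e) {
        return new NormalizedEvent(e.source().toLowerCase(), e.payload().strip(),
                System.currentTimeMillis());
    }

    // Route: pick a downstream destination (topic, table) from the source name.
    static String route(NormalizedEvent e) {
        return "normalized_" + e.source();
    }

    static Optional<NormalizedEvent> project(RawEvent e) {
        return isValid(e) ? Optional.of(normalize(e)) : Optional.empty();
    }
}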
  • Pelle
  • Pelle

    Transaction logs on Kafka

    Ingestion from the cloud via mirrored Kafka topics.

    Beware of your model! If all rows change every day… you gain nothing from using the change-capture log.

    Write in the cloud - read on-prem (see the consumer sketch below)
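
For illustration, a minimal on-prem consumer reading a mirrored topic with the standard Kafka Java client; the broker address, group id, and topic name are assumptions:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public final class MirroredTopicReader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "onprem-kafka:9092");   // assumption
        props.put("group.id", "datalake-ingest");              // assumption
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false"); // commit only after a durable write

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("transactions.mirror")); // assumed topic name
            while (true) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    // Land the record in the append-only ingest area here.
                    System.out.printf("%s@%d: %s%n", r.topic(), r.offset(), r.value());
                }
                consumer.commitSync(); // at-least-once: dedup happens downstream
            }
        }
    }
}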
  • Erik
  • Domain-agnostic infra teams
    Operate infra on a skeleton crew
    Storage
    Processing
    Frameworks
    Self-service infra
  • Redirect questions
    Producer requirements

  • Infra is no middleman
    Producer awareness
    “API discovery”
    This can be a conflict!
  • Provide self-service
    Build more tools
    Document
  • Slack support channel
  • Dedup (see the sketch below)
    Guaranteed delivery
    Binary to queryable
    Closed partitions
    Dataset discovery
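
A minimal sketch of the dedup idea under at-least-once delivery; all names are illustrative:

// Within a closed partition, keep the first occurrence of each event id
// and drop redeliveries. Dedup is only safe once the partition is closed,
// i.e. no more late arrivals are accepted for it.
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public final class PartitionDedup {

    record Event(String id, String body) {}

    static List<Event> dedup(List<Event> closedPartition) {
        Map<String, Event> firstSeen = new LinkedHashMap<>();
        for (Event e : closedPartition) {
            firstSeen.putIfAbsent(e.id(), e);
        }
        return List.copyOf(firstSeen.values());
    }
}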
  • Erik
  • Erik
  • Proactive professional services!

    I know this sounds like lame consultant speak.
    But we really needed this: we have to solve the kind of performance problems you saw on the previous slide before they become a problem.

    To this end, we needed tooling. We use ELK to store, search, and display performance data (a sketch of shipping one metric follows).
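
By way of illustration, a minimal sketch of pushing one performance document into Elasticsearch over its plain REST API (POST <index>/_doc); the index name, host, and document fields are assumptions:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public final class PerfMetricShipper {
    public static void main(String[] args) throws Exception {
        // One query-performance document; the fields are illustrative.
        String doc = """
                {"query_id": "q42", "runtime_ms": 183000, "user": "analyst_7"}
                """;
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/query-perf/_doc")) // assumed
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(doc))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}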
  • Optimization is worthless if the output is incorrect.

    We needed a validation tool that works at scale: the task is to compare two Hive databases, checking the output of the original queries against the output of the optimized queries.

    So we built one: Difftång. (A sketch of the core comparison idea follows.)
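
A minimal sketch of the comparison idea only, not Difftång itself; the JDBC URLs and query are placeholders, and the Hive JDBC driver must be on the classpath:

// Run the same query against two Hive databases over JDBC and diff the rows.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public final class TableDiff {

    static List<String> fetch(String jdbcUrl, String sql) throws Exception {
        List<String> rows = new ArrayList<>();
        try (Connection c = DriverManager.getConnection(jdbcUrl);
             Statement s = c.createStatement();
             ResultSet rs = s.executeQuery(sql)) {
            int cols = rs.getMetaData().getColumnCount();
            while (rs.next()) {
                StringBuilder row = new StringBuilder();
                for (int i = 1; i <= cols; i++) {
                    row.append(rs.getString(i)).append('\t');
                }
                rows.add(row.toString());
            }
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        String sql = "SELECT id, amount FROM result ORDER BY id"; // deterministic order
        List<String> original  = fetch("jdbc:hive2://hive:10000/original_db", sql);
        List<String> optimized = fetch("jdbc:hive2://hive:10000/optimized_db", sql);
        System.out.println(original.equals(optimized)
                ? "MATCH"
                : "DIFF: " + original.size() + " vs " + optimized.size() + " rows");
    }
}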
  • Pelle
