Editor’s note: Today we’re hearing from Google Cloud Platform (GCP) customer Otto Group data.works GmbH, a services organization holding one of the largest retail user data pools in the German-speaking area. They offer e-commerce and logistics SaaS solutions and conduct R&D with sophisticated AI applications for Otto Group subsidiaries and third parties. Read on to learn how Otto Group data.works GmbH recently migrated its on-premises big data Hadoop data lake to GCP and the lessons they learned along the way.
At Otto Group, our business intelligence unit decided to migrate our on-premises Hadoop data lake to GCP. Using managed cloud services for application development, machine learning, data storage and transformation instead of hosting everything on-premises has become popular for tech-focused business intelligence teams like ours. But actually migrating existing on-premises data warehouses and surrounding team processes to managed cloud services brings serious technical and organizational challenges.
The Hadoop data lake and the data warehouses it contains are essential to our e-commerce business. Otto Group BI aggregates anonymized data such as clickstreams, user interactions, product information, CRM data, and order transactions from more than 70 Otto Group online shops.
On top of this unique data pool, our agile, autonomous, and interdisciplinary product teams—consisting of data engineers, software engineers, and data scientists—develop machine learning-based recommendations, product image recognition, and personalization services. The many online retailers of the Otto Group, such as otto.de and aboutyou.de, integrate our services into their shops to enhance customer experience.
In this blog post, we’ll discuss the motivation that drove us to consider moving to a cloud provider, how we evaluated different cloud providers, why we decided on GCP, the strategy we use to move our on-premises Hadoop data lake and team processes to GCP, and what we have learned so far.
We started with an on-premises infrastructure consisting of a Hadoop cluster-based data lake design, as shown below. We used the Hadoop Distributed File System (HDFS) to stage click events, product information, and transaction and customer data from those 70 online shops, never deleting raw data.
Overview of the previous on-premises data lake
From there, pipelines of MapReduce jobs, Spark jobs, and Hive queries cleaned, filtered, joined, and aggregated the data into hundreds of relational Hive database tables at various levels of abstraction. That let us offer harmonized views of commonly used data items in a data hub to our product teams. On top of this data hub, each team's data engineers, scientists, and analysts independently performed further aggregations to produce their own application-specific data marts.
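To make this concrete, here is a minimal sketch of what one such aggregation step could look like as a PySpark job. The table and column names (staging.raw_click_events, datahub.clicks_per_product_daily, and so on) are hypothetical placeholders, not our actual schema.

```python
# Minimal sketch of one pipeline step: aggregating staged raw click events
# into a daily, per-product Hive table. Table and column names are
# hypothetical and stand in for the real data hub schema.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("clicks-per-product-daily")
    .enableHiveSupport()          # read from and write to the Hive metastore
    .getOrCreate()
)

raw_clicks = spark.table("staging.raw_click_events")   # hypothetical staging table

daily_clicks = (
    raw_clicks
    .filter(F.col("event_type") == "click")            # clean: keep click events only
    .withColumn("day", F.to_date("event_timestamp"))
    .groupBy("shop_id", "product_id", "day")           # aggregate per shop, product, day
    .agg(F.count("*").alias("clicks"))
)

# Write the result as a partitioned Hive table in the data hub layer.
(
    daily_clicks.write
    .mode("overwrite")
    .partitionBy("day")
    .saveAsTable("datahub.clicks_per_product_daily")
)
```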
Our purpose-built, open-source Hadoop job orchestrator, Schedoscope, handled the declarative, data-driven scheduling of these pipelines and managed metadata and data lineage.
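Schedoscope has its own view DSL, so the Python sketch below does not mirror its actual API; it only illustrates the general idea of declarative, data-driven scheduling that it embodies: each view declares its dependencies and its transformation, and a view is recomputed whenever an upstream dependency carries newer data. All names and the data-version mechanism here are illustrative assumptions.

```python
# Illustrative sketch of declarative, data-driven scheduling (not the
# Schedoscope API): each view declares its upstream dependencies and its
# transformation; the scheduler recomputes a view whenever any dependency
# has newer data than the view itself.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class View:
    name: str
    dependencies: List["View"] = field(default_factory=list)
    transform: Callable[[], None] = lambda: None
    last_materialized: int = 0          # e.g. a timestamp or data version


def materialize(view: View) -> int:
    """Materialize dependencies first, then the view itself if any
    dependency carries newer data."""
    newest_input = max((materialize(dep) for dep in view.dependencies), default=0)
    if newest_input > view.last_materialized:
        view.transform()                # run the Hive/Spark/MapReduce job
        view.last_materialized = newest_input
    return view.last_materialized


# Usage: a raw staging view (data version tracked externally) feeds an
# aggregated data hub view, which is recomputed because its input is newer.
raw_clicks = View("staging.raw_click_events", last_materialized=42)
daily_clicks = View(
    "datahub.clicks_per_product_daily",
    dependencies=[raw_clicks],
    transform=lambda: print("running aggregation job ..."),
)
materialize(daily_clicks)
```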
In addition, this infrastructure used a Redis cache and an Exasol EXAsolution main-memory relational database cluster for key-based lookups in web services and fast analytical data access, respectively. Schedoscope seamlessly mirrored Hive tables to the Redis cache and the Exasol databases as Hadoop processing finished.
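As an illustration of the Redis side of this, here is a minimal sketch of what mirroring a Hive table into a key-value cache for web-service lookups could look like. The table, key scheme, and host are hypothetical, and this shows the general idea rather than Schedoscope's built-in export mechanism.

```python
# Minimal sketch of mirroring a Hive table into Redis for key-based lookups
# from web services. Table, key scheme, and host are hypothetical; columns
# are assumed to be JSON-serializable.
import json

import redis
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
products = spark.table("datahub.product_features")      # hypothetical data hub table

r = redis.Redis(host="redis.internal", port=6379)        # hypothetical cache host

# Stream rows to the driver and write key/value pairs in batches.
pipe = r.pipeline()
for row in products.toLocalIterator():
    pipe.set(f"product:{row['product_id']}", json.dumps(row.asDict()))
    if len(pipe) >= 1000:                                 # flush in batches of 1000
        pipe.execute()
if len(pipe):
    pipe.execute()
```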
Our data scientists ran their IPython notebooks and trained their models on a cluster of GPU-equipped compute nodes. These models were then usually deployed as Dockerized Python web services on virtual machines offered by a traditional hosting provider.
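For a sense of what such a service looks like, here is a minimal Flask sketch of a model-serving web service of this kind. The model artifact, feature format, and endpoint are hypothetical placeholders, not one of our production services.

```python
# Minimal sketch of a model-serving web service of the kind deployed as a
# Docker container. Model file, feature format, and endpoint are
# hypothetical placeholders.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

with open("model.pkl", "rb") as f:        # hypothetical pre-trained model artifact
    model = pickle.load(f)


@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]        # e.g. a list of numeric features
    prediction = model.predict([features])[0]
    return jsonify({"prediction": float(prediction)})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```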
This on-premises setup allowed us to quickly grow a large and unique data lake. With Schedoscope's support for iterative, lean-ops rollout of new and modified data processing pipelines, we could operate this data lake with a small team. We developed sophisticated machine learning-driven web services for the Otto Group shops. These services let the shops cluster customers' purchase and return histories for fit prediction, improve search results through intelligent search-term expansion, sort product lists in a revenue-optimizing way, filter product reviews by topic and sentiment, and more.
However, as the data volume, number of data sources, and services connected to our data lake grew, we ran into various pain points that hampered our agility, including lack of team autonomy, operational complexity, technology limitations, and costs.
Let’s go through each of those pain points, along with how cloud could help us solve them.