Marmaray: A generic, scalable, and pluggable Hadoop data ingestion and dispersal framework

Connecting users worldwide on our platform all day, every day requires an enormous amount of data management.

When you consider the hundreds of operations and data science teams analyzing large sets of anonymous, aggregated data, using a variety of different tools to better understand and maintain the health of our dynamic marketplace, this challenge is even more daunting.

Three years ago, Uber adopted the open source Apache Hadoop framework as its data platform, making it possible to manage petabytes of data across computer clusters. However, given our many teams, tools, and data sources, we needed a way to reliably ingest and disperse data at scale throughout our platform.

Built and designed by our Hadoop Platform team, Marmaray is a plug-in-based framework built on top of the Hadoop ecosystem, where users can add support to ingest data from any source and disperse it to any sink, leveraging the power of Apache Spark. The name Marmaray comes from a tunnel in Turkey connecting Europe and Asia; similarly, we envisioned Marmaray within Uber as a pipeline connecting data from any source to any sink, depending on customer preference.

Data lakes often hold data of widely varying quality. Marmaray ensures that all ingested raw data conforms to an appropriate source schema, maintaining a high level of quality so that analytical results are reliable.

Data scientists can spend their time extracting useful insights from this data instead of handling data quality issues. At Uber, Marmaray ties a collection of systems and services together in a cohesive manner to move data reliably from any source to any sink.

While Marmaray realizes our vision of an any-source to any-sink data flow, we also found the need to build a completely self-service platform, providing users from a variety of backgrounds, teams, and levels of technical expertise with a frictionless onboarding experience. The open source nature of Hadoop allowed us to integrate it into our platform for large-scale data analytics. As we built Marmaray to facilitate data ingestion and dispersal on Hadoop, we felt it should also be turned over to the open source community. We hope that Marmaray will serve the data needs of other organizations, and that open source developers will broaden its functionality.

We need to ingest data from across our platform into our Hadoop data lake for our business analytics. The need for reliability at scale made it imperative that we re-architect our ingestion platform to ensure we could keep up with our pace of growth. Our previous data architecture required running and maintaining multiple data pipelines, each corresponding to a different production codebase, which proved cumbersome over time as the amount of data increased. Data sources such as MySQL, Kafka, and Schemaless contained raw data that needed to be ingested into Hive to support diverse analytical needs from teams across the company.

Each data source required understanding a different codebase and its associated intricacies as well as a different and unique set of configurations, graphs, and alerts.

Adding new ingestion sources became non-trivial, and the overhead for maintenance required that our Big Data ecosystem support all of these systems.

The on-call burden could be suffocating, with an overwhelming number of alerts some weeks. With the introduction of Marmaray, we consolidated our ingestion pipelines into a single source-agnostic pipeline and codebase, which is far more maintainable and resource-efficient. This single ingestion pipeline executes the same directed acyclic graph (DAG) job regardless of the source data store; at runtime, the ingestion behavior varies depending on the specific source, akin to the strategy design pattern, while a common, flexible configuration orchestrates the ingestion process and accommodates future needs and use cases.
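
To make the single-pipeline idea concrete, here is a minimal Java sketch of the strategy-style arrangement described above: one generic job runs the same sequence of steps, and only the source-specific read behavior is swapped in at runtime. The interface and class names are illustrative stand-ins, not Marmaray's actual API.

```java
import java.util.List;
import java.util.Map;

// Illustrative only: a strategy-style ingestion source, not Marmaray's real API.
interface IngestionSource {
    List<Map<String, Object>> readNextBatch(); // source-specific read logic
}

final class KafkaIngestionSource implements IngestionSource {
    @Override
    public List<Map<String, Object>> readNextBatch() {
        // A real job would poll Kafka here; this toy version returns an empty batch.
        return List.of();
    }
}

final class MySqlIngestionSource implements IngestionSource {
    @Override
    public List<Map<String, Object>> readNextBatch() {
        // A real job would query MySQL here; this toy version returns an empty batch.
        return List.of();
    }
}

// The same sequence of steps runs regardless of which source is plugged in.
final class IngestionJob {
    private final IngestionSource source;

    IngestionJob(IngestionSource source) {
        this.source = source;
    }

    void run() {
        List<Map<String, Object>> batch = source.readNextBatch(); // 1. read
        // 2. convert to a common format, 3. validate, 4. write to the sink ...
        System.out.println("Processed " + batch.size() + " records");
    }
}

public class StrategyPatternSketch {
    public static void main(String[] args) {
        // Source selection happens at runtime, for example from configuration.
        IngestionSource source = "kafka".equals(System.getProperty("source", "kafka"))
                ? new KafkaIngestionSource()
                : new MySqlIngestionSource();
        new IngestionJob(source).run();
    }
}
```

In Marmaray, the analogous choices are driven by the job configuration described earlier rather than a system property; the sketch only shows the shape of the pattern.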

Many of our internal data customers, such as the Uber Eats and Michelangelo Machine Learning platform teams, use Hadoop in concert with other tools to build and train their machine learning models, producing valuable derived datasets that drive efficiencies and improve user experiences. To maximize the usefulness of these derived datasets, the need arose to disperse this data to online datastores, which often have much lower latency semantics than the Hadoop ecosystem. Before we introduced Marmaray, each team was building its own ad hoc dispersal system.

This duplication of effort and creation of esoteric, non-universal features led to an inefficient use of engineering resources. Marmaray was envisioned, designed, and ultimately released to fulfill the need for a flexible, universal dispersal platform that would complete the Hadoop ecosystem by providing the means to transfer Hadoop data out to any online data store.

Many of our internal users need a guarantee that data generated from a source is delivered to a destination sink with a high degree of confidence.

These same users also need completeness metrics covering how reliably the data has been delivered to the final sink. In theory, this would mean that every record from the source reaches the sink; in practice, we aim to get as close to complete delivery as possible. When the number of records is very small, it is easy to run queries against the source and sink systems to validate that data has been delivered. At Uber, however, we ingest multiple petabytes of data and billions of messages per day, so running such queries is not feasible.

Marmaray uses a system to bucketize records via custom-authored accumulators in Spark, letting users monitor data delivery with minimum overhead.
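
The post does not include the accumulator code itself, but a minimal Spark sketch of the general bucketing technique might look like the following, with one counter per delivery bucket. The bucket names and the toy classification logic are assumptions made for illustration.

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.util.LongAccumulator;

import java.util.Arrays;

public class CompletenessMetricsSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("completeness-metrics-sketch")
                .master("local[*]")
                .getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Accumulators are aggregated on the driver with very little overhead per record.
        LongAccumulator delivered = spark.sparkContext().longAccumulator("delivered");
        LongAccumulator errors = spark.sparkContext().longAccumulator("errors");

        JavaRDD<String> records = jsc.parallelize(Arrays.asList("ok:1", "ok:2", "bad", "ok:3"));

        // Simulated "write to sink": well-formed records count as delivered,
        // malformed ones are bucketed as errors.
        records.foreach(record -> {
            if (record.startsWith("ok:")) {
                delivered.add(1);
            } else {
                errors.add(1);
            }
        });

        System.out.println("delivered=" + delivered.value() + ", errors=" + errors.value());
        spark.stop();
    }
}
```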

The LinkedIn team was kind enough to share knowledge and give a presentation about their Gobblin project and architecture, which was greatly appreciated.

Note: For an End to End example of how all our components tie together, please see com. Marmaray is a generic Hadoop data ingestion and dispersal framework and library.

It is a plug-in based framework built on top of the Hadoop ecosystem where support can be added to ingest data from any source and disperse to any sink leveraging the power of Apache Spark. Marmaray describes a number of abstractions to support the ingestion of any source to any sink.

They are described at a high-level below to help developers understand the architecture and design of the overall system. This system has been canonically used to ingest data into a Hadoop data lake and disperse data from a data lake to online data stores usually with lower latency semantics.

The framework was intentionally designed, however, to not be tightly coupled to just this particular use case and can move data from any source to any sink. The figure below illustrates a high level flow of how Marmaray jobs are orchestrated, independent of the specific source or sink. During this process, a configuration defining specific attributes for each source and sink orchestrates every step of the next job.

This includes figuring out the amount of data we need to process, i.e., the work units for the current run. The following sections give an overview of each of the major components that enable the job flow previously illustrated. The architecture diagram below illustrates the fundamental building blocks and abstractions in Marmaray that enable its overall job flow. These generic components facilitate the ability to add extensions to Marmaray, letting it support new sources and sinks. One of the major benefits of Avro data (GenericRecord) is that it is efficient both in its memory and network usage, as the binary encoded data can be sent over the wire with minimal schema overhead compared to JSON.

These benefits help our Spark jobs more efficiently handle data at a large scale. To support our any-source to any-sink architecture, we require that all ingestion sources define converters from their schema format to Avro and that all dispersal sinks define converters from the Avro schema to the native sink data model. Requiring that all converters either convert data to or from an AvroPayload format allows a loose and intentional coupling in our data model.

Once a source and its associated transformation have been defined, the source theoretically can be dispersed to any supported sink, since all sinks are source-agnostic and only care that the data is in the intermediate AvroPayload format.
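
As a rough illustration of that loose coupling, the sketch below defines a source and a sink purely in terms of Avro GenericRecords: any source that can produce records in the common format can be paired with any sink that consumes it. The AvroSource and AvroSink interfaces here are simplified stand-ins, not Marmaray's actual classes.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

import java.util.List;

// Simplified stand-ins for the any-source/any-sink idea; not Marmaray's real API.
interface AvroSource {
    List<GenericRecord> read(); // produces records in the common Avro format
}

interface AvroSink {
    void write(List<GenericRecord> payload); // consumes the common Avro format
}

public class AnySourceAnySinkSketch {
    public static void main(String[] args) {
        Schema schema = SchemaBuilder.record("Trip").fields()
                .requiredString("tripId")
                .requiredDouble("fare")
                .endRecord();

        // A toy "source" that emits one Avro record.
        AvroSource source = () -> {
            GenericRecord r = new GenericData.Record(schema);
            r.put("tripId", "trip-123");
            r.put("fare", 12.5);
            return List.of(r);
        };

        // A toy "sink" that just prints records; a real sink might write to
        // Cassandra or Hive, but it only ever sees GenericRecords.
        AvroSink sink = records -> records.forEach(System.out::println);

        sink.write(source.read());
    }
}
```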

The central component of our architecture is what we term the AvroPayload. One of the major benefits of Avro data (GenericRecord) is that once an Avro schema is registered with Spark, data is only sent during internode network transfers and disk writes, which are then highly optimized.

These benefits factor heavily in helping our Spark jobs handle data at large scale more efficiently. Avro includes a schema to specify the structure of the data being encoded, while also supporting schema evolution. For large data files, we take advantage of the fact that each record is encoded with the same schema and that this schema only needs to be defined once in the file, which reduces overhead.
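
A small, self-contained example of that property, using plain Apache Avro rather than anything Marmaray-specific: with DataFileWriter, the schema is written once into the file header, every appended record is binary-encoded against it, and readers recover the schema from the file itself.

```java
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

import java.io.File;
import java.io.IOException;

public class AvroFileSketch {
    private static final String SCHEMA_JSON =
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"long\"},"
          + "{\"name\":\"payload\",\"type\":\"string\"}]}";

    public static void main(String[] args) throws IOException {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
        File file = new File("events.avro");

        // The schema is stored once in the file header; records are binary-encoded.
        try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<>(schema))) {
            writer.create(schema, file);
            for (long i = 0; i < 3; i++) {
                GenericRecord record = new GenericData.Record(schema);
                record.put("id", i);
                record.put("payload", "event-" + i);
                writer.append(record);
            }
        }

        // Readers recover the schema from the file itself.
        try (DataFileReader<GenericRecord> reader =
                     new DataFileReader<>(file, new GenericDatumReader<>())) {
            for (GenericRecord record : reader) {
                System.out.println(record);
            }
        }
    }
}
```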


The primary function of ingestion and dispersal jobs is to perform transformations on input records from the source to ensure they are in the desired format before the data is written to the destination sink. Marmaray allows jobs to chain converters together to perform multiple transformations as needed, with the potential to also write to multiple sinks. A secondary but critical function of DataConverters is to produce error records with every transformation.

Before data is ingested into our Hadoop data lake, it is critical that all data conforms to a schema for analytical purposes and any data that is malformed, missing required fields, or otherwise deemed to have issues will be filtered out and written to error tables. This ensures a high level of data quality in our Hadoop data lake.

The converter acts on a single datum in the input schema format and does one of the following: returns an output record in the desired output schema format, writes the input record to the error table with an error message and other useful metadata, or discards the record.
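
The sketch below illustrates that contract: a converter either emits a record in the output format or routes the offending datum, together with an error message, toward an error table. The classes and the "missing required field" check are illustrative stand-ins, not Marmaray's actual DataConverter implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative stand-ins, not Marmaray's actual DataConverter classes.
final class ErrorRecord {
    final Map<String, Object> input;
    final String errorMessage;

    ErrorRecord(Map<String, Object> input, String errorMessage) {
        this.input = input;
        this.errorMessage = errorMessage;
    }
}

final class ConversionResult {
    final List<Map<String, Object>> outputRecords = new ArrayList<>();
    final List<ErrorRecord> errorRecords = new ArrayList<>();
}

final class SchemaConformingConverter {
    ConversionResult convert(List<Map<String, Object>> inputs) {
        ConversionResult result = new ConversionResult();
        for (Map<String, Object> datum : inputs) {
            // A record missing a required field goes to the error table instead
            // of the sink, keeping the data lake schema-conforming.
            if (!datum.containsKey("id")) {
                result.errorRecords.add(new ErrorRecord(datum, "missing required field: id"));
            } else {
                result.outputRecords.add(datum); // reshape to the output schema here
            }
        }
        return result;
    }
}

public class ConverterSketch {
    public static void main(String[] args) {
        Map<String, Object> good = new HashMap<>();
        good.put("id", 1);
        good.put("value", "ok");
        Map<String, Object> bad = new HashMap<>();
        bad.put("value", "no id present");

        ConversionResult result = new SchemaConformingConverter().convert(List.of(good, bad));
        System.out.println("converted=" + result.outputRecords.size()
                + " errors=" + result.errorRecords.size());
    }
}
```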

Error tables are written to by DataConverters as described in the previous section. In addition, once the owners have fixed the data and ensured it is schema-conforming, they can push the data back into the pipeline, where it can then be successfully ingested.

Marmaray moves data in mini-batches of configurable size. In order to calculate the amount of data to process, we introduced the concept of a WorkUnitCalculator.

At a very high level, a work unit calculator will look at the type of input source, the previously stored checkpoint, and calculate the next work unit or batch of work.

When calculating the next batch of data to process, a work unit can also take into account throttling information, for example the maximum amount of data to read or the number of messages to read from Kafka. Each WorkUnitCalculator returns an IWorkCalculatorResult, which includes the list of work units to process in the current batch as well as the new checkpoint state to persist if the job succeeds in processing the input batch.

We have also added functionality to calculate the cost of executing each work unit, for chargeback purposes.
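
A stripped-down sketch of the work unit idea: given the previously stored checkpoint (here, a single offset) and a throttling limit, the calculator returns the next range of work along with the checkpoint to persist if the batch succeeds. The class and field names are simplified assumptions, not the framework's actual WorkUnitCalculator and IWorkCalculatorResult contracts.

```java
// Illustrative sketch of a work unit calculator; names are simplified stand-ins.
final class Checkpoint {
    final long lastProcessedOffset;

    Checkpoint(long lastProcessedOffset) {
        this.lastProcessedOffset = lastProcessedOffset;
    }
}

final class WorkUnit {
    final long startOffset; // inclusive
    final long endOffset;   // exclusive

    WorkUnit(long startOffset, long endOffset) {
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }
}

final class WorkCalculatorResult {
    final WorkUnit workUnit;
    final Checkpoint nextCheckpoint; // persisted only if the batch succeeds

    WorkCalculatorResult(WorkUnit workUnit, Checkpoint nextCheckpoint) {
        this.workUnit = workUnit;
        this.nextCheckpoint = nextCheckpoint;
    }
}

final class OffsetRangeWorkUnitCalculator {
    private final long maxMessagesPerBatch; // throttling information

    OffsetRangeWorkUnitCalculator(long maxMessagesPerBatch) {
        this.maxMessagesPerBatch = maxMessagesPerBatch;
    }

    WorkCalculatorResult computeWorkUnits(Checkpoint previous, long latestAvailableOffset) {
        long start = previous.lastProcessedOffset;
        // Never read more than the throttling limit in a single mini-batch.
        long end = Math.min(latestAvailableOffset, start + maxMessagesPerBatch);
        return new WorkCalculatorResult(new WorkUnit(start, end), new Checkpoint(end));
    }
}

public class WorkUnitCalculatorSketch {
    public static void main(String[] args) {
        OffsetRangeWorkUnitCalculator calculator = new OffsetRangeWorkUnitCalculator(10_000);
        WorkCalculatorResult result =
                calculator.computeWorkUnits(new Checkpoint(42_000), 1_000_000);
        System.out.println("process offsets [" + result.workUnit.startOffset + ", "
                + result.workUnit.endOffset + "), then checkpoint "
                + result.nextCheckpoint.lastProcessedOffset);
    }
}
```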


At any given moment, riders around the world are requesting Uber trips; at that same instant, halfway around the world, residents of Mumbai might be ordering dinner through Uber Eats.

Data-driven insights about these interactions help us build products that provide rewarding and meaningful experiences to users worldwide.

Since eaters want their food delivered in a timely manner and riders want to wait a minimal amount of time for a pickup, our data must reflect events on the ground with as much immediacy as possible. As data comes into our data lake from a variety of sources, however, keeping it fresh at this scale represents a major challenge. While existing solutions that offer freshness measured in hours work for many companies, their data is far too stale for the real-time needs of Uber.

Additionally, the data size and scale of operations at Uber prevent such a solution from working reliably. To address these needs, we developed DBEvents, a change data capture system designed for high data quality and freshness. A change data capture (CDC) system can be used to determine which data has changed incrementally so that action can be taken, such as ingestion or replication.

DBEvents facilitates bootstrapping, that is, ingesting a snapshot of an existing table, as well as incremental, streaming updates. This solution manages petabytes of data and operates at a global scale, helping us give our internal data customers the best possible service. Historically, data ingestion at Uber began with us identifying the dataset to be ingested and then running a large processing job, with tools such as MapReduce and Apache Spark reading with a high degree of parallelism from a source database or table.

This process, referred to as a snapshot, generally took minutes to hours depending on the size of the dataset, which was not quick enough for the needs of our internal customers.

Every time a job began ingesting data, it fanned out parallel tasks, established parallel connections to an upstream table, such as one in MySQL, and pulled data, putting considerable pressure on the source. Strategies to reduce this pressure include using dedicated servers for extract, transform, and load (ETL), but that brings in other complications around data completeness, as well as extra hardware costs for a backup database server.

The time to take a database or table snapshot increases with the amount of data, and at some point it becomes impossible to satisfy the demands of the business. Since most databases only have part of their data updated with a limited number of new records being added on a daily basis, this snapshot process also results in an inefficient utilization of compute and storage resources, reading and writing the entire table data, including unchanged rows, over and over again.
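
To make the inefficiency concrete, the sketch below contrasts a full-table snapshot query with an incremental pull that reads only the rows modified since the last run, using plain JDBC against a hypothetical trips table with an updated_at column. This is only an illustration of the general idea behind change data capture; it is not how DBEvents itself is implemented, since DBEvents captures changes from the source rather than repeatedly querying tables, and the connection details here are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.time.Instant;

public class IncrementalPullSketch {
    public static void main(String[] args) throws SQLException {
        // Hypothetical connection details, table, and column names; assumes a
        // reachable MySQL instance and its JDBC driver on the classpath.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/rides", "reader", "secret")) {

            // Full snapshot: re-reads every row, including unchanged ones.
            try (PreparedStatement snapshot = conn.prepareStatement("SELECT * FROM trips")) {
                consume(snapshot.executeQuery());
            }

            // Incremental pull: only rows changed since the last checkpoint.
            Instant lastRunWatermark = Instant.parse("2019-01-01T00:00:00Z");
            try (PreparedStatement incremental = conn.prepareStatement(
                    "SELECT * FROM trips WHERE updated_at > ?")) {
                incremental.setTimestamp(1, Timestamp.from(lastRunWatermark));
                consume(incremental.executeQuery());
            }
        }
    }

    private static void consume(ResultSet rows) throws SQLException {
        int count = 0;
        while (rows.next()) {
            count++;
        }
        System.out.println("read " + count + " rows");
    }
}
```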

When we began designing DBEvents, we identified three business requirements for the resulting solution: freshness, quality, and efficiency. Freshness of data refers to how recently it was updated. Consider an update to a row in a MySQL table at time t1: the sooner that update is reflected in our data lake, the more valuable it is to latency-sensitive use cases. These use cases include fraud detection, where even the slightest delay can impact customer experience. For these reasons, we made data freshness a high priority in DBEvents. Data in a data lake is of no use if we cannot describe or make sense of it.

Imagine a situation where different upstream services have different schemas for different tables. Although each of these tables was created with a schema, these schemas evolved as their use cases changed.

Marmaray is an open source data ingestion and dispersal framework and library for Apache Hadoop, built on top of the Hadoop ecosystem.

Users can ingest data from any source and further distribute it to any sink, leveraging Apache Spark. Marmaray is responsible for ingesting raw data into a data lake with an appropriate source schema in order to obtain reliable analytical results.

Uber envisions Marmaray as a pipeline connecting raw data from different types of data sources to Hadoop or Hive, and further connecting both raw and derived datasets from Hive to a variety of sinks, depending on SLA, latency, and other customer requirements.

If malformed data is found during transformation, such as records with missing fields or other issues, it is written to error tables.

The Work Unit Calculator is responsible for creating the batches of data to process; it ensures that only a defined amount of data, or a defined number of Kafka messages, is read per batch. The Metadata Manager is responsible for storing the relevant metadata for a running job, such as checkpoint information (for example, partition offsets in the case of Kafka).
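
As a rough illustration of what such checkpoint metadata can look like, the sketch below persists Kafka partition offsets to a local properties file and reloads them on the next run. The file location and key format are assumptions for illustration only, not how Marmaray's Metadata Manager actually stores its state.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class CheckpointMetadataSketch {
    // Hypothetical location for the job's checkpoint state.
    private static final Path CHECKPOINT_FILE = Paths.get("marmaray-checkpoint.properties");

    static void saveOffsets(Map<Integer, Long> partitionOffsets) throws IOException {
        Properties props = new Properties();
        partitionOffsets.forEach((partition, offset) ->
                props.setProperty("kafka.partition." + partition, Long.toString(offset)));
        try (OutputStream out = Files.newOutputStream(CHECKPOINT_FILE)) {
            props.store(out, "checkpoint: last successfully processed offsets");
        }
    }

    static Map<Integer, Long> loadOffsets() throws IOException {
        Map<Integer, Long> offsets = new HashMap<>();
        if (Files.exists(CHECKPOINT_FILE)) {
            Properties props = new Properties();
            try (InputStream in = Files.newInputStream(CHECKPOINT_FILE)) {
                props.load(in);
            }
            props.forEach((key, value) -> offsets.put(
                    Integer.parseInt(key.toString().replace("kafka.partition.", "")),
                    Long.parseLong(value.toString())));
        }
        return offsets;
    }

    public static void main(String[] args) throws IOException {
        // Save the offsets reached by a successful batch, then restore them.
        saveOffsets(Map.of(0, 1_500L, 1, 1_420L));
        System.out.println("restored offsets: " + loadOffsets());
    }
}
```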

Both frameworks, Marmaray and Gobblin, handle job and task scheduling as well as metadata management. Marmaray can perform multiple transformations on data without storing the previously transformed data to HDFS.

However, the commit time of the previously persisted record is updated to the latest commit, which can lead downstream incremental ETL to count the record twice.
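
The point about chaining transformations without persisting intermediate data matches how Spark evaluates lazily: the short sketch below chains two map steps and nothing is written to HDFS in between, with data materialized only when the final action runs. This is generic Spark behavior rather than code from either framework.

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

import java.util.Arrays;
import java.util.List;

public class ChainedTransformationsSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("chained-transformations-sketch")
                .master("local[*]")
                .getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        JavaRDD<String> raw = jsc.parallelize(Arrays.asList(" trip-1 ", " trip-2 "));

        // Two chained transformations: neither produces intermediate files;
        // Spark only materializes data when an action (collect) runs.
        JavaRDD<String> trimmed = raw.map(String::trim);
        JavaRDD<String> upperCased = trimmed.map(String::toUpperCase);

        List<String> result = upperCased.collect();
        System.out.println(result);
        spark.stop();
    }
}
```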



