Uber Engineering recently published how it collects, standardises and uses data from the Uber Rider app. Rider data comprises all of a rider’s interactions with the Uber app, accounting for billions of events from Uber’s online systems every day. Uber uses this data to address top problem areas such as increasing funnel conversion and user engagement.
Data is crucial for our products. Data analytics help us provide a frictionless experience to the people who use our services. It also enables our engineers, product managers, data analysts, and data scientists to make informed decisions. The impact of data analysis can be seen on every screen of our app: what is displayed on the home screen, the order in which products are shown, what relevant messages are shown to users, what is stopping users from taking rides or signing up, and so on.
Uber captures the required data from two sources: the application (client) itself and the backend services the application uses. Client logs are either generated automatically by the platform (e.g., events such as user interactions with UI elements and impressions) or added manually by developers. Backend logs provide additional metadata that is either unavailable on the client or too expensive for a mobile device to compute. Each logged record carries a key that allows it to be joined with the corresponding mobile interaction, producing a unified view.
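The join between client events and backend records can be sketched as follows. This is a minimal illustration, not Uber’s implementation; the field name `join_key` and the record shapes are assumptions for the example.

```python
def unify(client_events, backend_records):
    """Join mobile interaction events with backend metadata on a shared
    key (here hypothetically named `join_key`), producing a unified view."""
    backend_by_key = {r["join_key"]: r for r in backend_records}
    unified = []
    for event in client_events:
        backend = backend_by_key.get(event["join_key"], {})
        # Merge backend metadata into the client event, keeping the key once.
        extra = {k: v for k, v in backend.items() if k != "join_key"}
        unified.append({**event, **extra})
    return unified
```

A client event with no matching backend record passes through unchanged, so the unified view never drops interactions that the backend has not (yet) logged.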
As Ravichandran and Verma note, “It’s important to have a standardised process for logging since hundreds of engineers are involved in adding or editing events.” To ensure that all events are consistent across platforms and carry standardised metadata, Uber defines Apache Thrift structs that event models must implement to define their payloads. The image below shows an example of a standard schema definition for an analytic event.
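Since the Thrift schema itself is not reproduced in text form here, the following Python dataclass sketches roughly what such a standardised event shape could look like: a fixed set of common metadata fields plus an event-specific payload. All field names are illustrative assumptions, not Uber’s actual schema.

```python
from dataclasses import dataclass, field
import time
import uuid

@dataclass
class AnalyticsEvent:
    """Illustrative analogue of a standardised analytics-event schema."""
    name: str        # e.g. "home_screen_impression" (hypothetical)
    surface: str     # screen or UI element where the event fired
    payload: dict    # event-specific data defined by the event model
    # Standard metadata stamped on every event:
    event_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp_ms: int = field(default_factory=lambda: int(time.time() * 1000))
```

The point of the shared struct is that every event, regardless of which team defines it, carries the same metadata envelope, so downstream consumers can process all events uniformly.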
An Event Processor receives the events and decides how they need to be processed and propagated further. To improve the signal-to-noise ratio, the Event Processor doesn’t propagate the events downstream unless the metadata and mapping for that event are available. The following diagram illustrates the end-to-end flow.
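The filtering behaviour described above can be sketched as a function that checks each event against a metadata registry and holds back anything unregistered. The registry shape and field names are assumptions for illustration.

```python
def process(events, metadata_registry):
    """Propagate only events whose metadata and mapping are registered;
    hold back the rest to keep the signal-to-noise ratio high."""
    propagated, held_back = [], []
    for event in events:
        meta = metadata_registry.get(event["name"])
        if meta is not None:
            # Attach the registered metadata before sending downstream.
            propagated.append({**event, **meta})
        else:
            held_back.append(event)
    return propagated, held_back
```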
Data collected is structured and copied over as offline datasets in Apache Hive. Ravichandran and Verma explain:
Offline datasets help us identify the problem areas (…) and measure the success of the solutions developed to address them. Huge, raw, offline datasets are hard to manage. Raw data gets enriched and modelled into tiered tables. In the process of enrichment, different datasets are joined together to make the data more meaningful.
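The tiering described in the quote, raw data enriched by joins and then modelled into summary tables, can be sketched in two steps. Dataset names, fields, and the aggregation are illustrative assumptions, not Uber’s actual tables.

```python
def enrich(raw_events, trips):
    """Enrichment tier: join raw events with a trips dataset to add
    context (here, a hypothetical `city` attribute)."""
    trips_by_id = {t["trip_id"]: t for t in trips}
    return [
        {**e, "city": trips_by_id[e["trip_id"]]["city"]}
        for e in raw_events
        if e["trip_id"] in trips_by_id
    ]

def model_daily_counts(enriched):
    """Modelled tier: aggregate enriched rows into a summary table
    keyed by (city, event name)."""
    counts = {}
    for row in enriched:
        key = (row["city"], row["name"])
        counts[key] = counts.get(key, 0) + 1
    return counts
```

In practice this kind of tiering would be expressed as Hive queries over the offline datasets; the Python here only illustrates the shape of the transformation.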
Ravichandran and Verma emphasise that since the data provides signals that drive business decisions, maintaining data integrity and quality is essential. Tests ensure that analytic events, when fired, carry the expected payload and arrive in the correct order. Anomaly detection verifies that data is logged and flowing as expected. In offline modelled tables, testing frameworks ensure data correctness, coverage, and consistency across tables. Each pipeline run triggers the configured tests to ensure that any data produced meets its quality SLA (Service-Level Agreement).
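One simple form an anomaly check on event flow could take is comparing today’s event volume against a trailing baseline; a large deviation suggests logging has broken or a release changed behaviour. The threshold and baseline choice are assumptions for this sketch.

```python
def volume_anomaly(trailing_counts, today_count, threshold=0.5):
    """Flag an anomaly when today's event volume deviates from the
    trailing average by more than `threshold` (as a fraction)."""
    baseline = sum(trailing_counts) / len(trailing_counts)
    deviation = abs(today_count - baseline) / baseline
    return deviation > threshold
```

Real anomaly detection on this kind of pipeline would typically account for seasonality and use statistical models rather than a fixed threshold, but the contract is the same: verify that data keeps flowing as expected before it feeds business decisions.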