Method for calculating de-identified trip data (open data) #571

@joanathan

Is your feature request related to a problem?

Trip data is useful for many reasons, including for planning and research purposes, such as those expressed in #136.

Trip data, however, carries a risk, as described in the Chicago TNP and Taxi Open Data Approach: “It has been recognized in scientific literature and news reports that even data without directly identifying attributes can be reidentified using other data sources. Specifically, data about an individual’s location at certain points in time can create a ‘fingerprint’ that can allow for re identification, as long as there is a separate dataset available containing parts of the fingerprint along with identifying fields.”

By design, MDS data does not include rider characteristics, e.g. the rider’s name, date of birth, zip code, phone number, gender identification, or any other attribute related to the individual. The exceptions are the location and time of trip-related events, and in-trip route trace telemetry.

To further protect against re-identification of MDS data, LADOT is implementing an approach similar to the one outlined and used by Chicago for TNP, taxi, and micromobility trip data. Here we propose a Trip Binning approach to further de-identify trip data, and a new API to serve it.

Describe the solution you'd like

Following the example of the submitted Metrics API #485, we propose a new de-identified trips API that could be implemented by either mobility Providers or regulating Agencies (based on data obtained from the Provider or Agency /trips endpoint). The single endpoint would be something like /trips/deidentified with a variety of search parameters.

The API would contain precise definitions for further de-identified trips. It could support private exchange between Providers and Agencies, or feed more public releases from either of them, e.g. to academic researchers.
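As a rough illustration of what calling the proposed endpoint could look like, here is a minimal sketch that only builds a request URL. The base URL and the `end_time` search parameter are hypothetical placeholders; this proposal does not yet define the search parameters.

```python
from urllib.parse import urlencode


def deidentified_trips_url(base_url, **params):
    """Build a URL for the proposed /trips/deidentified endpoint.

    The search parameters are passed through as-is; their names are
    hypothetical because the proposal has not defined them yet.
    """
    query = urlencode(sorted(params.items()))
    url = f"{base_url.rstrip('/')}/trips/deidentified"
    return f"{url}?{query}" if query else url


# Hypothetical base URL and parameter name, for illustration only.
print(deidentified_trips_url("https://example.com/mds", end_time="2020-01-01T08"))
```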

Is this a breaking change

  • No, not breaking

Impacted Spec

  • agency
  • provider

Describe alternatives you've considered

We considered having a single /trips endpoint whose permissions and scope determine whether it serves granular or further de-identified trip information. We chose to separate /trips from /trips/deidentified to create a clear delineation between the two.

Regarding methodologies for de-identifying trip information, cities have taken different approaches: rounding the start and end locations to a broader geography, binning the time into a larger interval, and/or applying k-anonymization. We went with a de-identification technique that combines the best aspects of these approaches.
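To make the three generic techniques above concrete, here is a minimal Python sketch of spatial rounding, temporal binning, and a k-anonymity filter. It is not the LADOT or Chicago method. The trip record layout and the parameter values (3 decimal places, 15-minute bins, k = 5) are assumptions chosen for illustration only.

```python
from collections import Counter
from datetime import datetime

# Assumed parameters, for illustration only; not taken from any city's policy.
DECIMALS = 3       # coordinate precision kept after rounding
BIN_MINUTES = 15   # width of the time bin; must divide 60 evenly
K = 5              # minimum number of trips sharing a bin


def round_point(lat, lon, decimals=DECIMALS):
    """Spatial rounding: coarsen a coordinate pair."""
    return (round(lat, decimals), round(lon, decimals))


def bin_time(ts, minutes=BIN_MINUTES):
    """Temporal binning: floor a datetime to the start of its interval."""
    return ts.replace(minute=ts.minute - ts.minute % minutes,
                      second=0, microsecond=0)


def deidentify(trips, k=K):
    """Coarsen each trip, then drop any bin shared by fewer than k trips."""
    binned = [
        (round_point(*t["start"]), round_point(*t["end"]), bin_time(t["start_time"]))
        for t in trips
    ]
    counts = Counter(binned)
    return [b for b in binned if counts[b] >= k]
```

Suppressing small bins is only one way to handle them. A bin with fewer than k trips could instead be merged into a coarser geography or a wider time window.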

We tested approaches that go beyond spatial and temporal rounding by applying k-anonymization methodologies, namely Trip Binning, which originated from the City of Chicago, and Point Fuzzing by Louisville. The results of the comparison study are available here.

Additional context

I'd welcome suggestions for better naming for /trips/deidentified.

Special thanks to @schnuerle for developing the Point Fuzzing technique and providing input to the comparison study, to @nicklucius for developing the Trip Binning approach for Chicago, and to @whereissean, @playground-julia, and Sam Jackson, who kick-started this work and have been relentless in developing a standard for trip de-identification.

We hope to push the privacy conversation a step further and stimulate discussion with more members of the OMF community.

Labels: Metrics (related to the Metrics API and related topics), Provider (specific to the Provider API), privacy (implications around privacy for the attention of the OMF Privacy Committee)