CSV export revision (Adding new fields and removing some) #2325

Description

The CSV export has not evolved in a long time, and some reusers need new data in it. This issue is for discussing which fields could be added to or removed from the export.

Fields to add:

  • ingredients_analysis_tags

    • description: this field is computed from the ingredients analysis and indicates whether the product is vegan, vegetarian, and/or contains palm oil
    • Example: ingredients_analysis_tags: ["en:palm-oil","en:non-vegan","en:vegetarian-status-unknown"]
    • rationale: it can help to check data quality
  • nutrient_levels_tags?

    • description: "It represents the traffic light system created by the UK FSA (Food Standards Agency). It is used by some manufacturers on a voluntary basis in Great Britain, but was rejected by the European Commission in 2010. On Open Food Facts, when the nutritional values are known, the traffic lights are displayed on the product pages. The formula defining the colors of the lights is described on a dedicated page of the Open Food Facts website: https://world.openfoodfacts.org/nutrient-levels"
    • Example: nutrient_levels_tags: ["en:fat-in-high-quantity","en:saturated-fat-in-high-quantity","en:sugars-in-high-quantity","en:salt-in-low-quantity"]
    • rationale: it can help to check data quality
  • product_quantity

    • description: "This is the normalized quantity of the product, in grams or milliliters (SI units)."
    • Example: product_quantity: "1500", computed from quantity: "1,5 L"; product_quantity: "320", computed from quantity: "2 x 160 g"
    • rationale: this data is easier to work with because Open Food Facts already does the computation. We already provide serving_quantity.
  • owner

    • description: "This is the owner of the product, i.e. the organization which sent the product's data to Open Food Facts. The list of owners can be found at https://world.openfoodfacts.org/owners"
    • Example: owner: "org-carrefour"
    • rationale: it can help to check data quality
  • data_quality_errors_tags

    • description: "Returns a list of all detected errors for the product."
    • Example: data_quality_errors_tags: ["en:nutrition-saturated-fat-greater-than-fat"] (speaks for itself).
    • rationale: 1. Open Food Facts reusers might want to exclude products with quality issues. 2. It can make it easier to build data quality tools. This concerns ~30K products as of 2022-10.
  • unique_scans_n (the number of unique scans of a product; present for ~33% of products)

    • description: "Returns an integer which represents the number of users who have scanned the product at least once. "Users" are identified by distinct IPs. This value is not computed in real time but once a year."
    • Example: unique_scans_n: "8".
    • rationale: this is a good proxy for identifying the most consumed products.
  • popularity_tags

    • description: "The popularity of a product is computed from its number of unique scans. The popularity_tags field groups products by level of popularity per year, either worldwide or in the countries where the product is popular."
    • Example: popularity_tags: ["top-50000-scans-2019","top-100000-scans-2019","at-least-5-scans-2019","at-least-10-scans-2019","top-75-percent-scans-2019","top-80-percent-scans-2019","top-85-percent-scans-2019","top-90-percent-scans-2019","top-50000-fr-scans-2019","top-100000-fr-scans-2019","top-country-fr-scans-2019","at-least-5-fr-scans-2019","at-least-10-fr-scans-2019"]
    • rationale: this field might be clearer than unique_scans_n, as the latter could suggest that the number is fresh, if not real-time, data
  • completeness

    • description: completeness is a float, between 0 and 1.1, measuring how complete the product is (the higher, the more complete). Currently, we check more than 10 items: image completeness, product_name, quantity, packaging, brands, categories, origins, emb_codes, expiration_date, ingredients_text, nutriments (or no_nutrition_data if it is on).
    • Example: completeness: 0.7625
  • last_image_t

    • description: the date (Unix timestamp) of the most recent image uploaded for the product.
    • Example: last_image_t: 1666661491
    • rationale: it can help to check data quality: if the product does not have any image, it should be impossible to fix data quality issues
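
To make the product_quantity proposal above more concrete, here is a minimal Python sketch of how a free-text quantity could be normalized. It is not the Open Food Facts implementation: the function name `normalize_quantity`, the `UNIT_FACTORS` table and the accepted formats are assumptions chosen only to cover the two examples given for that field.

```python
import re

# Hypothetical conversion factors to grams (mass) or milliliters (volume).
UNIT_FACTORS = {
    "mg": 0.001, "g": 1.0, "kg": 1000.0,
    "ml": 1.0, "cl": 10.0, "dl": 100.0, "l": 1000.0,
}

# Optional "<count> x" multiplier, a decimal value ("," or "."), then a unit.
QUANTITY_RE = re.compile(
    r"^\s*(?:(?P<count>\d+)\s*[xX×*]\s*)?"
    r"(?P<value>\d+(?:[.,]\d+)?)\s*"
    r"(?P<unit>[a-zA-Z]+)\s*$"
)


def normalize_quantity(quantity):
    """Return the quantity in grams or milliliters, or None if unparsable."""
    match = QUANTITY_RE.match(quantity)
    if match is None:
        return None
    unit = match.group("unit").lower()
    if unit not in UNIT_FACTORS:
        return None
    count = int(match.group("count") or 1)
    value = float(match.group("value").replace(",", "."))
    return count * value * UNIT_FACTORS[unit]


for raw in ("1,5 L", "2 x 160 g", "a dozen"):
    print(raw, "->", normalize_quantity(raw))
```

Shipping the precomputed value in the CSV would spare every reuser from writing and maintaining this kind of parser.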

Fields to delete?

  • created_t / last_modified_t and created_datetime / last_modified_datetime: should we keep both pairs? (51 Mb lost)
  • additives, which is empty
  • one of the three states* fields; states and states_tags are almost identical, the only difference being that states contains spaces
  • brands or brands_tags: the latter is only the normalized version of the former (lowercased, unaccented, with spaces and typographic signs replaced by a "-")
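
The redundancy mentioned in the list above can be illustrated with a short Python sketch showing that a reuser could rebuild the derived fields themselves. This is an approximation under stated assumptions, not the code used by Open Food Facts: `normalize_tag` and `unix_to_iso` are hypothetical helper names, the collapsing of repeated separators into one hyphen is a guess, and the ISO 8601 output format is assumed.

```python
import re
import unicodedata
from datetime import datetime, timezone


def normalize_tag(value):
    """Approximate a *_tags value: lowercase, unaccented, hyphen-separated."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", value)
    unaccented = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Assumption: each run of non-alphanumeric characters becomes one hyphen.
    hyphenated = re.sub(r"[^a-z0-9]+", "-", unaccented.lower())
    return hyphenated.strip("-")


def unix_to_iso(timestamp):
    """Convert a Unix timestamp (as in created_t) to an ISO 8601 UTC string."""
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


print(normalize_tag("Côte d'Or"))
print(unix_to_iso(1666661491))
```

If the real normalization rules are this simple, dropping one field of each pair costs reusers a few lines of code in exchange for a smaller file.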

Fields that could evolve:

Process:

  • build a new CSV file as a beta version, with a different name from the current one, and let people discuss it
  • when the beta version becomes stable/official, keep the old version available and tell everyone that it will be discontinued in xx months

Implementation

[to be completed]
