ISI-datamart RESTful API

There is a service running at dsbox02.isi.edu:9000.

RESTful APIs - Basics:

1. search
  • route: /new/search_data
  • methods: POST
  • body (form-data): two files, one for the query json, the other for the supplied data (csv, optional)
    • query: file, a json file following the query schema
    • data: file, a csv file (optional)
  • params:
    • max_return_docs: the maximum number of search results to return (default is 10)
    • return_named_entity: true or false, whether to return the named_entity (default is false)
  • example:
    curl -X POST \
      https://dsbox02.isi.edu:9000/new/search_data \
      -H 'content-type: multipart/form-data' \
      -F data=@datamart/example/fifa_example/fifa.csv \
      -F query=@datamart/example/fifa_example/fifa_query.json
    
  • sample response
    {
      "code": "0000",
      "message": "Success",
      "data": [
        {
          "summary": "STRING SUMMARY FOR THE DATASET",
          "score": 84.735825,
          "metadata": {},
          "datamart_id": "127860000"
        }, 
        ...
      ]
    }
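
The search response can be consumed directly as JSON. Below is a minimal sketch of ranking the returned candidates by score; the response dict mirrors the sample above, with a second, hypothetical entry added so the ranking step has something to compare:

```python
import json

# Response shaped like the sample above; the second entry is hypothetical,
# added only to make the ranking visible.
response_text = json.dumps({
    "code": "0000",
    "message": "Success",
    "data": [
        {"summary": "STRING SUMMARY FOR THE DATASET", "score": 84.735825,
         "metadata": {}, "datamart_id": "127860000"},
        {"summary": "a weaker match (hypothetical)", "score": 12.5,
         "metadata": {}, "datamart_id": "127860001"},
    ],
})

response = json.loads(response_text)
assert response["code"] == "0000", response["message"]

# Rank the candidate datasets by score and keep the best one.
best = max(response["data"], key=lambda d: d["score"])
print(best["datamart_id"])  # the id to pass on to /new/materialize_data
```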
    
2. materialize
  • route: /new/materialize_data
  • methods: GET
  • params: one required param, datamart_id, and an optional param to return only the first several rows
    • datamart_id: the datamart_id of the data you would like to materialize
    • first_n_rows: int, return only the first n rows of the dataset instead of all of them
  • example:
    curl -X GET \
      'https://dsbox02.isi.edu:9000/new/materialize_data?datamart_id=127860000&first_n_rows=10'
    
  • sample response
    {
      "code": "0000",
      "message": "Success",
      "data": "CSV RESULT HERE"
    }
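
The data field of a materialize response holds the entire CSV as one string. A sketch of parsing it with the standard library, using a stubbed response instead of a live call:

```python
import csv
import io
import json

# Stubbed response; a real one comes back from
# GET /new/materialize_data?datamart_id=...&first_n_rows=...
response_text = json.dumps({
    "code": "0000",
    "message": "Success",
    "data": "name,team\nMessi,Barcelona\nRonaldo,Juventus\n",
})

response = json.loads(response_text)
assert response["code"] == "0000", response["message"]

# The CSV arrives as a plain string; wrap it in StringIO before parsing.
rows = list(csv.DictReader(io.StringIO(response["data"])))
print(rows[0]["name"])  # → Messi
```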
    
3. join (augment)
  • route: /new/join_data

  • methods: POST

  • body (form-data): one file for the supplied dataset, one field for the datamart_id of the augment data, and two fields for the joining columns

    • left_data: file, a csv file, the supplied data provided by the user
    • right_data: text, the datamart_id of the data you would like to use for augmentation
    • left_columns: text, the join features in the left dataset, by column indices
    • right_columns: text, the join features in the right dataset, by column indices
    • left_meta: text, json for the metadata of the supplied data, following index_schema
      • useful when there are implicit_variables, e.g.:
      {
          "implicit_variables": [
              {
                  "name": "city",
                  "value": "New York",
                  "semantic_type": []
              }
          ]
      }
      
    • exact_match: text, whether to do an exact join or a fuzzy join; either true or false, default is false (fuzzy matching)
  • example:

    curl -X POST \
      https://dsbox02.isi.edu:9000/new/join_data \
      -H 'content-type: multipart/form-data' \
      -F left_data=@datamart/example/fifa_example/fifa.csv \
      -F right_data=127860000 \
      -F 'left_columns=[[3], [4]]' \
      -F 'right_columns=[[22], [24]]' \
      -F 'left_meta={"implicit_variables":[{"name":"city","value":"New York","semantic_type":[]}]}' \
      -F 'exact_match=true'
    

    *exact_match uses a pandas left merge, so it may return a table with more rows than the left dataset when multiple rows in the right dataset match one row in the left dataset

    *Fuzzy match returns a dataset with exactly the same number of rows as the left dataset
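
The row-count difference between the two modes can be seen without the service. A pure-Python sketch of a left-merge-style join on toy data, mimicking pandas' behavior on a one-to-many match (not the datamart's actual implementation):

```python
# Toy datasets joined on "city". The right dataset has two rows for "NY",
# so an exact (left-merge-style) join duplicates that left row, while an
# unmatched left row ("LA") is still kept with no right-side columns.
left = [{"city": "NY"}, {"city": "LA"}]
right = [{"city": "NY", "pop": 1}, {"city": "NY", "pop": 2}, {"city": "SF", "pop": 3}]

exact = []
for row in left:
    matches = [r for r in right if r["city"] == row["city"]]
    if matches:
        # One output row per (left, right) pair with equal keys.
        exact.extend({**row, **r} for r in matches)
    else:
        # Unmatched left rows survive the left merge unchanged.
        exact.append(dict(row))

print(len(left), len(exact))  # → 2 3: the exact join grew by one row
```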

  • sample response

    {
      "code": "0000",
      "message": "Success",
      "data": "CSV RESULT HERE",
      "matched_rows": [1, 3, 2, 0, null],
      "cover_ratio": 0.8
    }
    

    *matched_rows and cover_ratio are currently only available when exact_match is NOT used

    • matched_rows: which row (by index) in the right dataset is aligned to each row in the left dataset
      • e.g. [1, 3, 2, 0, null] here means:
        • left_rows[0] <-matched-> right_rows[1]
        • left_rows[1] <-matched-> right_rows[3]
        • left_rows[2] <-matched-> right_rows[2]
        • left_rows[3] <-matched-> right_rows[0]
        • left_rows[4] <--- nothing matched in the right dataset
    • cover_ratio: the fraction of rows in the left dataset that were augmented; in the example above it is 0.8 (4/5).
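
cover_ratio can be recomputed from matched_rows as the fraction of non-null entries. A short sketch reproducing the example above:

```python
# matched_rows from the sample response; None (null in the JSON) marks a
# left row with no match in the right dataset.
matched_rows = [1, 3, 2, 0, None]

# cover_ratio = matched left rows / total left rows
cover_ratio = sum(m is not None for m in matched_rows) / len(matched_rows)
print(cover_ratio)  # → 0.8

# Reconstruct the alignment described above.
for left_idx, right_idx in enumerate(matched_rows):
    if right_idx is None:
        print(f"left_rows[{left_idx}] <--- nothing matched")
    else:
        print(f"left_rows[{left_idx}] <-matched-> right_rows[{right_idx}]")
```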

Upload data to ISI-datamart:

If you would like to index a new dataset into ISI-datamart, there are two methods:

  1. By a single file:

    1. find the url for the data you would like to upload:
      • it can be a csv file, an excel file, an html page with tabular data, or a json file.
    2. construct a description json for the data like:
      {
         "title": "title for the dataset", 
         "description": "the description for the dataset",
         ...
         "materialization_arguments": {
             "url": "http://example.com/sample_csv.csv",
             "file_type": "csv"
         }
      }
      
      • The only required field is materialization_arguments.url; all the others are optional.
      • More available attributes can be found in index_schema
    3. call the /new/get_metadata_single_file API with the description json, and check the returned metadata
    4. send the confirmed metadata through /new/upload_metadata_list to finish indexing
  2. By an HTML page that contains many links to single files

    1. find the url of the html page containing the links to the datasets you would like to upload
      • ISI-datamart will extract the links and recognize whether each one points to a data file
      • if so, it will try to materialize each file and generate metadata
    2. call /new/get_metadata_extract_links with the url in the body json and check the returned metadata
    3. send the confirmed metadata through /new/upload_metadata_list to finish indexing
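
Step 2 of the single-file flow needs only materialization_arguments.url. A hedged sketch of a small builder that enforces this; the helper name and defaults are ours, not part of the datamart API:

```python
def build_description(url, file_type="csv", title=None, description=None):
    """Assemble the description json for /new/get_metadata_single_file.

    Only materialization_arguments.url is required; everything else is
    optional. (Hypothetical helper, not part of any datamart client.)
    """
    if not url:
        raise ValueError("materialization_arguments.url is required")
    desc = {"materialization_arguments": {"url": url, "file_type": file_type}}
    if title is not None:
        desc["title"] = title
    if description is not None:
        desc["description"] = description
    return desc

doc = build_description("http://example.com/sample_csv.csv", title="sample dataset")
print(sorted(doc))  # → ['materialization_arguments', 'title']
```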

RESTful APIs - Upload data:

1. get metadata for a single file
  • route: /new/get_metadata_single_file
  • methods: POST
  • body (json): the description json for the file, including the url
    • see step 2 of "By a single file" above
    • materialization_arguments.file_type can be one of csv, excel, html, table
  • params:
    • enable_two_ravens_profiler: if true, try to run the TwoRavens profiler on the dataset and append its output to the metadata (disabled by default)
  • example:
    curl -X POST \
      https://dsbox02.isi.edu:9000/new/get_metadata_single_file?enable_two_ravens_profiler=false \
      -H 'Content-Type: application/json' \
      -d '{
        "materialization_arguments": {
            "url": "https://www.w3schools.com/html/html_tables.asp",
            "file_type": "html"
        }
      }'
    
  • sample response:
    {
      "code": "0000",
      "message": "Success",
      "data": [   // a list of metadata objects; usually only one metadata object in the list
          {},     // when an excel file has many sheets there can be multiple metadata objects
        ...
       ]
    }
    
2. get metadata by link extraction from an HTML page
  • route: /new/get_metadata_extract_links
  • methods: POST
  • body (json): {"url": "http://example.page.with.many.csv.links"}
  • example:
    curl -X POST \
      https://dsbox02.isi.edu:9000/new/get_metadata_extract_links \
      -H 'Content-Type: application/json' \
      -d '{
        "url": "https://sample-videos.com/download-sample-xls.php"
    }'
    
  • sample response:
    {
      "code": "0000",
      "message": "Success",
      "data": [
          // each inner-list is for a link:
          [   // a list of metadata objects; usually only one metadata object in the list
              {},     // when an excel file has many sheets there can be multiple metadata objects
            ...
          ],
          [],
          ...
        ]
    }
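
Each inner list in the extract-links response corresponds to one link, which makes a flattening step convenient before further processing. A sketch (the helper is our own assumption; the upload endpoint itself also accepts the nested form):

```python
def flatten_metadata(data):
    """Flatten the list-of-lists metadata (one inner list per extracted
    link) into a single flat list, dropping links that yielded nothing.
    (Hypothetical helper, not part of the datamart API.)"""
    flat = []
    for per_link in data:
        flat.extend(per_link)
    return flat

# Shape of the sample response above: two links yielded metadata, one did not.
data = [[{"title": "sheet 1"}, {"title": "sheet 2"}], [], [{"title": "a table"}]]
print(len(flatten_metadata(data)))  # → 3
```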
    
3. upload metadata to ISI-datamart
  • route: /new/upload_metadata_list
  • methods: POST
  • body (json): one field, metadata
    • metadata: the metadata object, or list(metadata), or list(list(metadata))
  • example:
    curl -X POST \
      https://dsbox02.isi.edu:9000/new/upload_metadata_list \
      -H 'Content-Type: application/json' \
      -d '{
        "metadata":     {
          "datamart_id": 0,
          "title": "html tables",
          "url": "https://www.w3schools.com/html/html_tables.asp",
          "materialization": {
            "python_path": "general_materializer",
            "arguments": {
              "url": "https://www.w3schools.com/html/html_tables.asp",
              "file_type": "html",
              "index": 0
            }
          },
          "variables": [
            {
              "datamart_id": 1,
              "semantic_type": [],
              "name": "Company",
              "description": "column name: Company, dtype: object",
              "named_entity": [
                "Alfreds Futterkiste",
                "Centro comercial Moctezuma",
                "Ernst Handel",
                "Island Trading",
                "Laughing Bacchus Winecellars",
                "Magazzini Alimentari Riuniti"
              ]
            },
            {
              "datamart_id": 2,
              "semantic_type": [],
              "name": "Contact",
              "description": "column name: Contact, dtype: object",
              "named_entity": [
                "Maria Anders",
                "Francisco Chang",
                "Roland Mendel",
                "Helen Bennett",
                "Yoshi Tannamuri",
                "Giovanni Rovelli"
              ]
            },
            {
              "datamart_id": 3,
              "semantic_type": [],
              "name": "Country",
              "description": "column name: Country, dtype: object",
              "named_entity": [
                "Germany",
                "Mexico",
                "Austria",
                "UK",
                "Canada",
                "Italy"
              ]
            }
          ],
          "description": "Company : object, Contact : object, Country : object",
          "keywords": [
            "Company",
            "Contact",
            "Country"
          ]
        }
    }'
    
  • sample response:
    {
      "code": "0000",
      "message": "Success",
      "data": [   // successfully indexed metadata, with valid datamart_id assigned
          {},
        ...
       ]
    }
    

You can run your own Flask server:

conda activate datamart_env
python ../../datamart_web/webapp.py