Project Structure

Out-of-distribution detection with language supervision

  • play_with_clip.py: ID zero-shot classification and ID fine-tuning (with the image encoder). Currently we have three options:
    • evaluate the zero-shot performance of CLIP: call zero_shot_evaluation_CLIP(image_dataset_name, test_labels, ckpt)
    • fine-tune the CLIP image encoder and test it with a linear probe: call linear_probe_evaluation_CLIP(image_dataset_name)
    • play with skimage example images: call play_with_skimage()
  • eval_ood_detection.py: OOD detection with CIFAR-10, CIFAR-100, or ImageNet-1K as the ID dataset. Currently we support only one score:
    • MIS (maximum inner product score); see the sketch after this list.
  • play_with_clip.ipynb: various visualization methods for the trained CLIP model.
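
A minimal sketch of what zero-shot classification and the MIS score look like with a pre-trained CLIP model, assuming OpenAI's clip package; the label list and helper names (zero_shot_logits, mis_score) are illustrative, not the repo's actual API:

```python
# Sketch only: zero-shot classification and the maximum inner product score (MIS)
# with a pre-trained CLIP model. Assumes the `clip` package (github.com/openai/CLIP).
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

id_labels = ["plane", "car", "bird", "cat", "dog"]  # hypothetical ID label set
text_tokens = clip.tokenize([f"a photo of a {c}" for c in id_labels]).to(device)

@torch.no_grad()
def zero_shot_logits(pil_image):
    """Inner products between the image embedding and each ID text embedding."""
    image_input = preprocess(pil_image).unsqueeze(0).to(device)
    image_feat = model.encode_image(image_input)
    text_feat = model.encode_text(text_tokens)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    return image_feat @ text_feat.T  # shape: (1, num_id_labels)

@torch.no_grad()
def mis_score(pil_image):
    """Maximum inner product score: higher -> more ID-like, lower -> more OOD-like."""
    return zero_shot_logits(pil_image).max(dim=-1).values.item()
```

Zero-shot classification takes the argmax over these inner products; for OOD detection, the maximum value itself is thresholded (a low MIS flags the input as OOD).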

Week 2 Record

New dataset from ImageNet:

  • 10 classes from ImageNet-1K
    • location: inst-01 /nobackup/ImageNet
    • classes: n04552348 (plane), n04285008 (car/automobile), n01530575 (bird), n02123597 (cat), n02422699 (antelope), n02107574 (dog), n01641577 (frog), n01728572 (snake), n03095699 (ship), n03417042 (truck)
  • Tasks:
    • generate captions
    • fine-tune with a multi-modal contrastive loss (see the sketch below)
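
A sketch of the multi-modal contrastive loss we plan to fine-tune with, i.e. the standard symmetric CLIP-style InfoNCE objective over paired image/caption embeddings; this is a generic formulation, not this repo's training code:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/caption embeddings.

    image_feats, text_feats: (batch, dim) tensors where row i of both tensors
    comes from the same (image, caption) pair.
    """
    # L2-normalize so the inner product is a cosine similarity.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    logits = image_feats @ text_feats.T / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)

    # Matched pairs lie on the diagonal; contrast each image against all captions and vice versa.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return (loss_i2t + loss_t2i) / 2
```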

Week 1 Record

Q: What are desirable properties of pre-trained CLIP?

  • It recognizes objects (not the background)! -> CLIP has been shown to be robust to background shift
  • It associates image representations with label descriptions
  • If the true label is among the candidate labels, it assigns it high confidence

Q: Problems of pre-trained CLIP?

  • If the true label is not among the candidate labels, it can still be overconfident

Q: [detection side] Now that we have text embeddings, how do we design a better detection score?

  • feature-based approaches:
    • NeurIPS 2021 "cheating" approach (assumes OOD labels are known)
    • Inner product based
      • only using ID labels
      • find a fixed set of template labels?
    • KNN
    • Mahalanobis score (see the sketch after this list)
    • improved Mahalanobis score
      • challenge: there is a mismatch between the text and image feature spaces, even though they have the same dimension
  • logit-based approaches (to check whether text embeddings are useful):
    • based on pre-trained CLIP -> linear probe
    • based on a pre-trained ViT (image-only, no text-encoder pre-training) -> linear probe
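
For reference, a sketch of the feature-based Mahalanobis score mentioned above in its standard form (class-conditional Gaussians with a shared covariance fitted on ID image features); the function names are illustrative and not from this repo:

```python
import numpy as np

def fit_mahalanobis(id_feats, id_labels):
    """Fit class-conditional means and a shared covariance on ID image features.

    id_feats: (N, D) array of image features; id_labels: (N,) integer class labels.
    """
    classes = np.unique(id_labels)
    means = np.stack([id_feats[id_labels == c].mean(axis=0) for c in classes])
    centered = id_feats - means[np.searchsorted(classes, id_labels)]
    shared_cov = centered.T @ centered / len(id_feats)
    precision = np.linalg.pinv(shared_cov)
    return means, precision

def mahalanobis_score(feat, means, precision):
    """Negative minimum Mahalanobis distance to any class mean: higher -> more ID-like."""
    diffs = feat[None, :] - means                      # (C, D)
    dists = np.einsum("cd,de,ce->c", diffs, precision, diffs)
    return -dists.min()
```

The same statistics could be fitted on text features instead, but because of the text/image feature-space mismatch noted above, parameters fitted on one modality need not transfer to the other.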

Q: [fine-tuning side] Now that we have an ID dataset, how do we make CLIP aware of ID vs. OOD?

  • baseline #1: just train with the contrastive loss on ID data only
  • baseline #2: add a (K+1)-th "OOD" class and use an auxiliary outlier pool
    • challenge: CLIP is good at recognizing concrete objects; is it still good for abstract notions such as "OOD"?
  • baseline #3: the typical OE (Outlier Exposure) approach: push predictions on auxiliary outliers toward the uniform distribution over the ID classes (see the sketch after this list)
  • baseline #4: [grouping-based] similar to the MOS paper (say we have 8 big categories from ImageNet): some of them are ID (e.g. plants) and some of them are OOD (e.g. furniture). These categories are abstract, but still more specific than a single "OOD" placeholder. During fine-tuning, we can assign the ID data and the auxiliary outliers to those categories.
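
A sketch of what baseline #3 (the OE-style objective) could look like on top of K-class logits; the weight lam and the auxiliary outlier pool are assumptions for illustration, not settings fixed in this repo:

```python
import torch.nn.functional as F

def outlier_exposure_loss(id_logits, id_targets, outlier_logits, lam=0.5):
    """Cross-entropy on ID samples plus a term pushing auxiliary-outlier
    predictions toward the uniform distribution over the K ID classes.

    id_logits: (N_id, K); id_targets: (N_id,); outlier_logits: (N_out, K).
    lam: weight on the outlier term (illustrative default).
    """
    ce_id = F.cross_entropy(id_logits, id_targets)
    # Cross-entropy to the uniform target equals the negative mean log-softmax.
    ce_uniform = -F.log_softmax(outlier_logits, dim=-1).mean()
    return ce_id + lam * ce_uniform
```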