cloned and installed [Oscar](https://github.com/microsoft/Oscar) and
[scene\_graph\_benchmark](https://github.com/microsoft/scene_graph_benchmark) in the directory running the notebook
from (you can change these directories in the notebook).

# Week 2 Record

New dataset from ImageNet:
- 10 classes from ImageNet-1k
- location: inst-01 /nobackup/ImageNet
- classes: n04552348 (plane), n04285008 (car/automobile), n01530575 (bird), n02123597 (cat), n02422699 (antelope), n02107574 (dog), n01641577 (frog), n01728572 (snake), n03095699 (ship), n03417042 (truck)
- Tasks:
  - generate captions
  - fine-tune with a multi-modal contrastive loss (see the sketch below)
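
Below is a minimal sketch of that multi-modal contrastive objective (a CLIP-style symmetric InfoNCE loss). It assumes pre-extracted, L2-normalized features for matched image-caption pairs; the function and argument names are hypothetical:

```python
# Minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss.
# Assumes image_feats/text_feats are (N, d) L2-normalized embeddings of
# N matched image-caption pairs, so the diagonal entries are the positives.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
    logits = image_feats @ text_feats.t() / temperature   # (N, N) similarities
    targets = torch.arange(image_feats.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)       # match image -> caption
    loss_t2i = F.cross_entropy(logits.t(), targets)   # match caption -> image
    return (loss_i2t + loss_t2i) / 2
```
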
# Week 1 Record

Q: What are desirable properties of pre-trained CLIP?

- It recognizes objects (not the background)! -> CLIP has been shown to be robust to background shift
- It associates image representations with label descriptions (see the zero-shot sketch below)
- If the true label is among the candidate labels, it assigns it high confidence
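
For reference, the label-association property can be probed with a zero-shot sketch like the one below, using the openai/CLIP package; the image path and label subset are illustrative:

```python
# Zero-shot classification with pre-trained CLIP: compare the image feature
# against text embeddings of label descriptions and softmax the similarities.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

labels = ["plane", "car", "bird", "cat", "dog"]  # illustrative subset
text = clip.tokenize([f"a photo of a {c}" for c in labels]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # hypothetical image

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_feat @ txt_feat.t()).softmax(dim=-1)

print(dict(zip(labels, probs.squeeze(0).tolist())))  # per-label confidence
```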

Q: Problems of pre-trained CLIP?

- If the true label is not among the candidates (i.e., the input is OOD), it can still be overconfident

Q: [detection side] Now that we have text embeddings, how do we design a better detection score?

- feature-based approaches (sketches of two of these scores follow this list):
  - NeurIPS 2021 "cheating" approach (assumes OOD labels are known)
  - inner-product-based:
    - only using ID labels
    - find a fixed set of template labels?
  - KNN
  - Mahalanobis score
  - improved Mahalanobis score
  - challenge: there is a mismatch between the text and image feature spaces, even though they have the same dimension
- logit-based approaches (to check whether text embeddings are useful):
  - based on pre-trained CLIP -> linear probe
  - based on a pre-trained ViT (pre-trained without a text encoder) -> linear probe
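
As working sketches of two of the feature-based scores above (inner product and Mahalanobis), assuming CLIP features extracted and normalized as in the zero-shot snippet; all shapes and names are assumptions, not a settled design:

```python
# Sketches of two feature-based OOD scores on L2-normalized CLIP features.
import torch

def inner_product_score(img_feat, text_feats, temperature=0.01):
    """Max softmax-scaled cosine similarity to the ID label text embeddings.
    img_feat: (d,), text_feats: (K, d), both L2-normalized. Higher = more ID."""
    sims = text_feats @ img_feat                       # (K,) cosine similarities
    return (sims / temperature).softmax(dim=0).max().item()

def mahalanobis_score(img_feat, class_means, precision):
    """Negative minimal Mahalanobis distance to ID class means. Note: means and
    the shared covariance are estimated from ID *image* features; plugging in
    text embeddings directly is where the feature-space mismatch bites.
    class_means: (K, d); precision: (d, d) inverse of the shared covariance."""
    diffs = img_feat.unsqueeze(0) - class_means        # (K, d)
    d2 = torch.einsum("kd,de,ke->k", diffs, precision, diffs)  # squared distances
    return (-d2.min()).item()                          # higher = more ID
```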

Q: [fine-tuning side] Now that we have an ID dataset, how do we make CLIP aware of ID vs. OOD?

- baseline #1: just train with the contrastive loss on ID data only
- baseline #2: add a (K+1)-th "OOD" class and train it on an auxiliary outlier pool
  - challenge: CLIP is good at recognizing concrete objects; is it still good for abstract notions such as "OOD"?
- baseline #3: typical OE (Outlier Exposure) approach: use the uniform distribution as a proxy target for auxiliary outliers (see the sketch after this list)
- baseline #4: [grouping-based] similar to the MOS paper (say we have 8 big categories from ImageNet): some are ID (e.g., plants) and some are OOD (e.g., furniture). These categories are more abstract than individual classes but more specific than a single "placeholder" class. During fine-tuning, we can assign ID data and auxiliary outliers to those categories.
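
A sketch of the baseline #3 objective (standard Outlier Exposure: cross-entropy on ID data plus a uniform-target term on auxiliary outliers); the weight `lam` and the function name are hypothetical:

```python
# Sketch of an OE-style objective: standard cross-entropy on ID batches plus
# a term pushing predictions on auxiliary outliers toward the uniform distribution.
import torch
import torch.nn.functional as F

def oe_loss(id_logits, id_labels, ood_logits, lam=0.5):
    """id_logits: (N, K); id_labels: (N,); ood_logits: (M, K).
    lam weighs the outlier term (hypothetical default)."""
    ce = F.cross_entropy(id_logits, id_labels)
    # Cross-entropy to the uniform target over K classes, up to a constant:
    # averaging -log softmax over both the outlier batch and the K classes.
    uniform_term = -ood_logits.log_softmax(dim=1).mean()
    return ce + lam * uniform_term
```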
