- Multimodal language-image instruction resource: LLaVA.pdf
Data Generation
- GPT-4: Used for generating instruction-following data
- CLIP (Vision Encoder): This is for [Visual instruction tuning]
- Vicuna (Language Model)
GPT-assisted Visual Instruction Data Generation
Visual instruction tuning
