VAST: A vision-audio-subtitle-text omni-modality foundation model and dataset S Chen, H Li, Q Wang, Z Zhao, M Sun, X Zhu, J Liu Advances in Neural Information Processing Systems 36, 2024 | 52 | 2024 |
OPT: Omni-perception pre-trainer for cross-modal understanding and generation J Liu, X Zhu, F Liu, L Guo, Z Zhao, M Sun, W Wang, H Lu, S Zhou, J Zhang, ... arXiv preprint arXiv:2107.00249, 2021 | 39 | 2021 |
ChatBridge: Bridging modalities with large language model as a language catalyst Z Zhao, L Guo, T Yue, S Chen, S Shao, X Zhu, Z Yuan, J Liu arXiv preprint arXiv:2305.16103, 2023 | 30 | 2023 |
VL-Mamba: Exploring state space models for multimodal learning Y Qiao, Z Yu, L Guo, S Chen, Z Zhao, M Sun, Q Wu, J Liu arXiv preprint arXiv:2403.13600, 2024 | 17 | 2024 |
MM21 pre-training for video understanding challenge: Video captioning with pretraining techniques S Chen, X Zhu, D Hao, W Liu, J Liu, Z Zhao, L Guo, J Liu Proceedings of the 29th ACM International Conference on Multimedia, 4853-4857, 2021 | 6 | 2021 |
MAMO: Fine-grained vision-language representations learning with masked multimodal modeling Z Zhao, L Guo, X He, S Shao, Z Yuan, J Liu Proceedings of the 46th International ACM SIGIR Conference on Research and …, 2023 | 5 | 2023 |
MAMO: Masked multimodal modeling for fine-grained vision-language representation learning Z Zhao, L Guo, X He, S Shao, Z Yuan, J Liu arXiv preprint arXiv:2210.04183, 2022 | 4 | 2022 |
Beyond Literal Descriptions: Understanding and Locating Open-World Objects Aligned with Human Intentions W Wang, Y Zhang, X He, Y Yan, Z Zhao, X Wang, J Liu arXiv preprint arXiv:2402.11265, 2024 | 1 | 2024 |
SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models T Yue, J Cheng, L Guo, X Dai, Z Zhao, X He, G Xiong, Y Lv, J Liu Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern …, 2024 | 1 | 2024 |
OneDiff: A Generalist Model for Image Difference E Hu, L Guo, T Yue, Z Zhao, S Xue, J Liu arXiv preprint arXiv:2407.05645, 2024 | | 2024 |
Towards Event-oriented Long Video Understanding Y Du, K Zhou, Y Huo, Y Li, WX Zhao, H Lu, Z Zhao, B Wang, W Chen, ... arXiv preprint arXiv:2406.14129, 2024 | | 2024 |
Needle In A Video Haystack: A Scalable Synthetic Framework for Benchmarking Video MLLMs Z Zhao, H Lu, Y Huo, Y Du, T Yue, L Guo, B Wang, W Chen, J Liu arXiv preprint arXiv:2406.09367, 2024 | | 2024 |