<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta content="width=device-width, initial-scale=1.0" name="viewport">
<title>VLAA</title>
<meta content="" name="description">
<meta content="" name="keywords">
<!-- Favicons -->
<link href="assets/img/UCSC_icon.png" rel="icon">
<link href="assets/img/UCSC_icon.png" rel="apple-touch-icon">
<!-- Google Fonts -->
<link href="https://fonts.googleapis.com/css?family=Open+Sans:300,300i,400,400i,600,600i,700,700i|Raleway:300,300i,400,400i,500,500i,600,600i,700,700i|Poppins:300,300i,400,400i,500,500i,600,600i,700,700i" rel="stylesheet">
<!-- Vendor CSS Files -->
<link href="assets/vendor/fontawesome-free/css/all.min.css" rel="stylesheet">
<link href="assets/vendor/animate.css/animate.min.css" rel="stylesheet">
<link href="assets/vendor/bootstrap/css/bootstrap.min.css" rel="stylesheet">
<link href="assets/vendor/bootstrap-icons/bootstrap-icons.css" rel="stylesheet">
<link href="assets/vendor/boxicons/css/boxicons.min.css" rel="stylesheet">
<link href="assets/vendor/glightbox/css/glightbox.min.css" rel="stylesheet">
<link href="assets/vendor/remixicon/remixicon.css" rel="stylesheet">
<link href="assets/vendor/swiper/swiper-bundle.min.css" rel="stylesheet">
<link href="https://fonts.googleapis.com/css?family=Lato:100,300,400,700,900" rel="stylesheet">
<link rel="stylesheet" type="text/css" media="screen,print" href="assets/css_pub/style.css" />
<!-- <link href="assets/css_pub/bootstrap.min.css" rel="stylesheet" media="screen" /> -->
<link rel="icon" type="image/png" href="./images/logos/princeton.png">
<!-- Template Main CSS File -->
<link href="assets/css/style.css" rel="stylesheet">
<!-- =======================================================
* Template Name: Medilab - v4.7.1
* Template URL: https://bootstrapmade.com/medilab-free-medical-bootstrap-theme/
* Author: BootstrapMade.com
* License: https://bootstrapmade.com/license/
======================================================== -->
</head>
<body>
<!-- ======= Top Bar ======= -->
<div id="topbar" class="d-flex align-items-center fixed-top">
<div class="container d-flex justify-content-between">
<div class="contact-info d-flex align-items-center">
</div>
<div class="d-none d-lg-flex social-links align-items-center">
<a href="opening.html" class="envelope"><i class="bi-envelope"></i></a>
</div>
</div>
</div>
<!-- ======= Header ======= -->
<header id="header" class="fixed-top">
<div class="container d-flex align-items-center">
<h1 class="logo me-auto"><a href="index.html">VLAA lab</a></h1>
<nav id="navbar" class="navbar order-last order-lg-0">
<ul>
<li><a class="nav-link scrollto" href="index.html">Home</a></li>
<li><a class="nav-link scrollto" href="people.html">People</a></li>
<li><a class="nav-link scrollto active" href="publications.html">Publications</a></li>
<li><a class="nav-link scrollto" href="https://github.com/UCSC-VLAA">GitHub</a></li>
<li><a class="nav-link scrollto" href="https://huggingface.co/UCSC-VLAA">HuggingFace</a></li>
<li><a class="nav-link scrollto" href="opening.html">Opening</a></li>
</ul>
<i class="bi bi-list mobile-nav-toggle"></i>
</nav><!-- .navbar -->
</div>
</header><!-- End Header -->
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<div class="section-title">
<h2>Publications</h2>
</div>
<script>
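// Toggles the inline detail panel for a paper entry: clicking the same
// "bib"/"abstract" label again clears and collapses the panel; clicking a
// different label copies that hidden block's HTML into the shared per-paper
// div and styles it as a dotted, shaded box.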
function copy(dest, source) {
if(dest.source == source) {
dest.innerHTML = "";
dest.source = null;
dest.style.width="0px";
dest.style.border = "";
dest.style.padding = "0px";
}
else {
dest.innerHTML = source.innerHTML;
dest.source = source;
dest.style.width = "800px";
dest.style.padding = "10px";
dest.style.border = "2px dotted gray";
dest.style.background = "#F5F5F5";
dest.style.margin = "10px";
}
dest.blur();
}
</script>
<div class="container">
<!-- <h1> Papers</h1> -->
<br>
<!-- <p>(*: equal contribution)</p> -->
<details>
<summary><font size="5">Pre-print</font></summary>
<script>
let paper_count = 0
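// Writes one publication entry into the current list via document.write.
// Arguments that are null are simply skipped:
//   title, authors  - displayed text (title becomes a link when `link` is set)
//   conference      - venue string appended after the authors
//   link            - main URL wrapped around the title
//   bib, abstract   - hidden HTML blocks, shown on demand via copy()
//   arxiv_link, code, press, slides, talk - optional labeled links
//   msg             - optional italicized note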
function add_paper(title, authors, conference, link, bib, abstract, arxiv_link, code, press, slides, talk, msg) {
list_entry = "<li style=\"font-size:18px\">"
if (link != null)
list_entry += "<a href=\"" + link + "\">"
list_entry += "<b>" + title + "</b>"
if (link != null)
list_entry += "</a>"
list_entry += "<br>" + authors + ".<br>"
if (conference != null)
list_entry+= conference + ".</li>"
if (bib != null) {
list_entry += "<div id=\"bib" + paper_count + "\" style=\"display:none\">" + bib + "</div>"
list_entry += "<a href=\"javascript:copy(div" + paper_count + ",bib" + paper_count + ")\"> <span class=\"label label-success\">bib</span></a>"
}
if (abstract != null) {
list_entry += "<div id=\"abstract" + paper_count + "\" style=\"display:none\">" + abstract + "</div>"
list_entry += "<a href=\"javascript:copy(div" + paper_count + ",abstract" + paper_count + ")\"> <span class=\"label label-warning\">abstract</span></a>"
}
if (arxiv_link != null)
list_entry += " <a href=\"" + arxiv_link + "\"><span class=\"label label-primary\">arxiv</span></a>"
if (code != null)
list_entry += " <a href=\"" + code + "\"><span class=\"label label-danger\">code/models</span></a>"
if (press != null)
list_entry += " <a href=\"" + press + "\"><span class=\"label label-success\">press</span></a>"
if (slides != null)
list_entry += " <a href=\"" + slides + "\"><span class=\"label label-info\">slides/poster</span></a>"
if (talk != null)
list_entry += " <a href=\"" + talk + "\"><span class=\"label label-default\">talk</span></a>"
list_entry += "<br>"
if (msg != null)
list_entry += "<i>" + msg + "</i>"
list_entry += "<div id=\"div" + paper_count + "\" style=\"font-size:15px\"></div><br>"
document.write(list_entry)
paper_count += 1
}
// document.write("<h2>Preprint</h2>")
// document.write("<ul>")
document.write("</ul>")
// document.write("<h2>Preprint</h2>")
document.write("<ul><br>")
add_paper("Complex-Edit: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark",
"Siwei Yang, Mude Hui, Bingchen Zhao, Yuyin Zhou, Nataniel Ruiz, Cihang Xie",
null,
"https://arxiv.org/abs/2504.13143",
"@article{yang2025textttcomplexeditcotlikeinstructiongeneration,<br>" +
" title = {Complex-Edit: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark},<br>" +
" author = {Siwei Yang, Mude Hui, Bingchen Zhao, Yuyin Zhou, Nataniel Ruiz, Cihang Xie},<br>" +
" journal = {arXiv preprint arXiv:2504.13143},<br>" +
" year = {2025},<br>",
"We introduce Complex-Edit, a comprehensive benchmark designed to systematically evaluate instruction-based image editing models across instructions of varying complexity. To develop this benchmark, we harness GPT-4o to automatically collect a diverse set of editing instructions at scale. Our approach follows a well-structured ``Chain-of-Edit'' pipeline: we first generate individual atomic editing tasks independently and then integrate them to form cohesive, complex instructions. Additionally, we introduce a suite of metrics to assess various aspects of editing performance, along with a VLM-based auto-evaluation pipeline that supports large-scale assessments. Our benchmark yields several notable insights: 1) Open-source models significantly underperform relative to proprietary, closed-source models, with the performance gap widening as instruction complexity increases; 2) Increased instructional complexity primarily impairs the models' ability to retain key elements from the input images and to preserve the overall aesthetic quality; 3) Decomposing a complex instruction into a sequence of atomic steps, executed in a step-by-step manner, substantially degrades performance across multiple metrics; 4) A straightforward Best-of-N selection strategy improves results for both direct editing and the step-by-step sequential approach; and 5) We observe a ``curse of synthetic data'': when synthetic data is involved in model training, the edited images from such models tend to appear increasingly synthetic as the complexity of the editing instructions rises -- a phenomenon that intriguingly also manifests in the latest GPT-4o outputs.",
"https://arxiv.org/abs/2504.13143",
"https://github.com/UCSC-VLAA/Complex-Edit"
)
add_paper("MedSegFactory: Text-Guided Generation of Medical Image-Mask Pairs",
"Jiawei Mao, Yuhan Wang, Yucheng Tang, Daguang Xu, Kang Wang, Yang Yang, Zongwei Zhou, Yuyin Zhou",
null,
"https://arxiv.org/abs/2504.06897",
"@article{mao2025medsegfactory,<br>" +
" title = {MedSegFactory: Text-Guided Generation of Medical Image-Mask Pairs},<br>" +
" author = {Jiawei Mao, Yuhan Wang, Yucheng Tang, Daguang Xu, Kang Wang, Yang Yang, Zongwei Zhou, Yuyin Zhou},<br>" +
" journal = {arXiv preprint arXiv:2504.06897},<br>" +
" year = {2025},<br>",
"This paper presents MedSegFactory, a versatile medical synthesis framework that generates high-quality paired medical images and segmentation masks across modalities and tasks. It aims to serve as an unlimited data repository, supplying image-mask pairs to enhance existing segmentation tools. The core of MedSegFactory is a dual-stream diffusion model, where one stream synthesizes medical images and the other generates corresponding segmentation masks. To ensure precise alignment between image-mask pairs, we introduce Joint Cross-Attention (JCA), enabling a collaborative denoising paradigm by dynamic cross-conditioning between streams. This bidirectional interaction allows both representations to guide each other's generation, enhancing consistency between generated pairs. MedSegFactory unlocks on-demand generation of paired medical images and segmentation masks through user-defined prompts that specify the target labels, imaging modalities, anatomical regions, and pathological conditions, facilitating scalable and high-quality data generation. This new paradigm of medical image synthesis enables seamless integration into diverse medical imaging workflows, enhancing both efficiency and accuracy. Extensive experiments show that MedSegFactory generates data of superior quality and usability, achieving competitive or state-of-the-art performance in 2D and 3D segmentation tasks while addressing data scarcity and regulatory constraints.",
"https://arxiv.org/abs/2504.06897",
"https://github.com/jwmao1/MedSegFactory"
)
add_paper("A Comprehensive Analysis of Mamba for 3D Volumetric Medical Image Segmentation",
"Chaohan Wang, Yutong Xie, Qi Chen, Yuyin Zhou, Qi Wu",
null,
"https://arxiv.org/abs/2503.19308",
"@article{wang2025comprehensive,<br>" +
" title = {A Comprehensive Analysis of Mamba for 3D Volumetric Medical Image Segmentation},<br>" +
" author = {Chaohan Wang, Yutong Xie, Qi Chen, Yuyin Zhou, Qi Wu},<br>" +
" journal = {arXiv preprint arXiv:2503.19308},<br>" +
" year = {2025},<br>",
"Mamba, with its selective State Space Models (SSMs), offers a more computationally efficient solution than Transformers for long-range dependency modeling. However, there is still a debate about its effectiveness in high-resolution 3D medical image segmentation. In this study, we present a comprehensive investigation into Mamba's capabilities in 3D medical image segmentation by tackling three pivotal questions: Can Mamba replace Transformers? Can it elevate multi-scale representation learning? Is complex scanning necessary to unlock its full potential? We evaluate Mamba's performance across three large public benchmarks-AMOS, TotalSegmentator, and BraTS. Our findings reveal that UlikeMamba, a U-shape Mamba-based network, consistently surpasses UlikeTrans, a U-shape Transformer-based network, particularly when enhanced with custom-designed 3D depthwise convolutions, boosting accuracy and computational efficiency. Further, our proposed multi-scale Mamba block demonstrates superior performance in capturing both fine-grained details and global context, especially in complex segmentation tasks, surpassing Transformer-based counterparts. We also critically assess complex scanning strategies, finding that simpler methods often suffice, while our Tri-scan approach delivers notable advantages in the most challenging scenarios. By integrating these advancements, we introduce a new network for 3D medical image segmentation, positioning Mamba as a transformative force that outperforms leading models such as nnUNet, CoTr, and U-Mamba, offering competitive accuracy with superior computational efficiency. This study provides key insights into Mamba's unique advantages, paving the way for more efficient and accurate approaches to 3D medical imaging.",
"https://arxiv.org/abs/2503.19308",
"https://arxiv.org/abs/2503.19308"
)
add_paper("ViLBench: A Suite for Vision-Language Process Reward Modeling",
"Haoqin Tu, Weitao Feng, Hardy Chen, Hui Liu, Xianfeng Tang, Cihang Xie",
null,
"https://arxiv.org/abs/2503.20271",
"@article{tu2025vilbench,<br>" +
" title = {ViLBench: A Suite for Vision-Language Process Reward Modeling},<br>" +
" author = {Haoqin Tu, Weitao Feng, Hardy Chen, Hui Liu, Xianfeng Tang, Cihang Xie},<br>" +
" journal = {arXiv preprint arXiv:2503.20271},<br>" +
" year = {2025},<br>",
"Process-supervised reward models serve as a fine-grained function that provides detailed step-wise feedback to model responses, facilitating effective selection of reasoning trajectories for complex tasks. Despite its advantages, evaluation on PRMs remains less explored, especially in the multimodal domain. To address this gap, this paper first benchmarks current vision large language models (VLLMs) as two types of reward models: output reward models (ORMs) and process reward models (PRMs) on multiple vision-language benchmarks, which reveal that neither ORM nor PRM consistently outperforms across all tasks, and superior VLLMs do not necessarily yield better rewarding performance. To further advance evaluation, we introduce ViLBench, a vision-language benchmark designed to require intensive process reward signals. Notably, OpenAI's GPT-4o with Chain-of-Thought (CoT) achieves only 27.3% accuracy, indicating the benchmark's challenge for current VLLMs. Lastly, we preliminarily showcase a promising pathway towards bridging the gap between general VLLMs and reward models -- by collecting 73.6K vision-language process reward data using an enhanced tree-search algorithm, our 3B model is able to achieve an average improvement of 3.3% over standard CoT and up to 2.5% compared to its untrained counterpart on ViLBench by selecting OpenAI o1's generations. We release the implementations with our code, model, and data.",
"https://arxiv.org/abs/2503.20271",
"https://ucsc-vlaa.github.io/ViLBench/"
)
add_paper("Exploring the Vulnerabilities of Federated Learning: A Deep Dive into Gradient Inversion Attacks",
"Pengxin Guo, Runxi Wang, Shuang Zeng, Jinjing Zhu, Haoning Jiang, Yanran Wang, Yuyin Zhou, Feifei Wang, Hui Xiong, Liangqiong Qu",
null,
"https://arxiv.org/abs/2503.11514",
"@article{guo2025exploring,<br>" +
" title = {Exploring the Vulnerabilities of Federated Learning: A Deep Dive into Gradient Inversion Attacks},<br>" +
" author = {Pengxin Guo, Runxi Wang, Shuang Zeng, Jinjing Zhu, Haoning Jiang, Yanran Wang, Yuyin Zhou, Feifei Wang, Hui Xiong, Liangqiong Qu},<br>" +
" journal = {arXiv preprint arXiv:2503.11514},<br>" +
" year = {2025},<br>",
"Federated Learning (FL) has emerged as a promising privacy-preserving collaborative model training paradigm without sharing raw data. However, recent studies have revealed that private information can still be leaked through shared gradient information and attacked by Gradient Inversion Attacks (GIA). While many GIA methods have been proposed, a detailed analysis, evaluation, and summary of these methods are still lacking. Although various survey papers summarize existing privacy attacks in FL, few studies have conducted extensive experiments to unveil the effectiveness of GIA and their associated limiting factors in this context. To fill this gap, we first undertake a systematic review of GIA and categorize existing methods into three types, i.e., \textit{optimization-based} GIA (OP-GIA), \textit{generation-based} GIA (GEN-GIA), and \textit{analytics-based} GIA (ANA-GIA). Then, we comprehensively analyze and evaluate the three types of GIA in FL, providing insights into the factors that influence their performance, practicality, and potential threats. Our findings indicate that OP-GIA is the most practical attack setting despite its unsatisfactory performance, while GEN-GIA has many dependencies and ANA-GIA is easily detectable, making them both impractical. Finally, we offer a three-stage defense pipeline to users when designing FL frameworks and protocols for better privacy protection and share some future research directions from the perspectives of attackers and defenders that we believe should be pursued. We hope that our study can help researchers design more robust FL frameworks to defend against these attacks.",
"https://arxiv.org/abs/2503.11514",
"https://arxiv.org/abs/2503.11514"
)
add_paper("SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models",
"Hardy Chen, Haoqin Tu, Fali Wang, Hui Liu, Xianfeng Tang, Xinya Du, Yuyin Zhou, Cihang Xie",
null,
"https://arxiv.org/abs/2504.11468",
"@article{wang2025star,<br>" +
" title = {SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models},<br>" +
" author = {Hardy Chen, Haoqin Tu, Fali Wang, Hui Liu, Xianfeng Tang, Xinya Du, Yuyin Zhou, Cihang Xie},<br>" +
" journal = {arXiv preprint arXiv:2504.11468},<br>" +
" year = {2025},<br>",
"This work revisits the dominant supervised fine-tuning (SFT) then reinforcement learning (RL) paradigm for training Large Vision-Language Models (LVLMs), and reveals a key finding: SFT can significantly undermine subsequent RL by inducing ``pseudo reasoning paths'' imitated from expert models. While these paths may resemble the native reasoning paths of RL models, they often involve prolonged, hesitant, less informative steps, and incorrect reasoning. To systematically study this effect, we introduce VLAA-Thinking, a new multimodal dataset designed to support reasoning in LVLMs. Constructed via a six-step pipeline involving captioning, reasoning distillation, answer rewrite and verification, VLAA-Thinking comprises high-quality, step-by-step visual reasoning traces for SFT, along with a more challenging RL split from the same data source. Using this dataset, we conduct extensive experiments comparing SFT, RL and their combinations. Results show that while SFT helps models learn reasoning formats, it often locks aligned models into imitative, rigid reasoning modes that impede further learning. In contrast, building on the Group Relative Policy Optimization (GRPO) with a novel mixed reward module integrating both perception and cognition signals, our RL approach fosters more genuine, adaptive reasoning behavior. Notably, our model VLAA-Thinker, based on Qwen2.5VL 3B, achieves top-1 performance on Open LMM Reasoning Leaderboard among 4B scale LVLMs, surpassing the previous state-of-the-art by 1.8%. We hope our findings provide valuable insights in developing reasoning-capable LVLMs and can inform future research in this area.",
"https://arxiv.org/abs/2504.11468",
"https://github.com/UCSC-VLAA/VLAA-Thinking"
)
add_paper("MedReason: Eliciting Factual Medical Reasoning Steps in LLMs via Knowledge Graphs",
"Juncheng Wu, Wenlong Deng, Xingxuan Li, Sheng Liu, Taomian Mi, Yifan Peng, Ziyang Xu, Yi Liu, Hyunjin Cho, Chang-In Choi, Yihan Cao, Hui Ren, Xiang Li, Xiaoxiao Li, Yuyin Zhou",
null,
"https://arxiv.org/pdf/2504.00993",
"@article{wu2025medreason,<br>" +
" title = {MedReason: Eliciting Factual Medical Reasoning Steps in LLMs via Knowledge Graphs},<br>" +
" author = {Juncheng Wu, Wenlong Deng, Xingxuan Li, Sheng Liu, Taomian Mi, Yifan Peng, Ziyang Xu, Yi Liu, Hyunjin Cho, Chang-In Choi, Yihan Cao, Hui Ren, Xiang Li, Xiaoxiao Li, Yuyin Zhou},<br>" +
" journal = {arXiv preprint arXiv:2504.00993},<br>" +
" year = {2025},<br>",
"Medical tasks such as diagnosis and treatment planning require precise and complex reasoning, particularly in life-critical domains. Unlike mathematical reasoning, medical reasoning demands meticulous, verifiable thought processes to ensure reliability and accuracy. However, there is a notable lack of datasets that provide transparent, step-by-step reasoning to validate and enhance the medical reasoning ability of AI models. To bridge this gap, we introduce MedReason, a large-scale high-quality medical reasoning dataset designed to enable faithful and explainable medical problem-solving in large language models (LLMs). We utilize a structured medical knowledge graph (KG) to convert clinical QA pairs into logical chains of reasoning, or ``thinking paths'', which trace connections from question elements to answers via relevant KG entities. Each path is validated for consistency with clinical logic and evidence-based medicine. Our pipeline generates detailed reasoning for various medical questions from 7 medical datasets, resulting in a dataset of 32,682 question-answer pairs, each with detailed, step-by-step explanations. Experiments demonstrate that fine-tuning with our dataset consistently boosts medical problem-solving capabilities, achieving significant gains of up to 7.7% for DeepSeek-Ditill-8B. Our top-performing model, MedReason-8B, outperforms the Huatuo-o1-8B, a state-of-the-art medical reasoning model, by up to 4.2% on the clinical benchmark MedBullets. We also engage medical professionals from diverse specialties to assess our dataset's quality, ensuring MedReason offers accurate and coherent medical reasoning.",
"https://arxiv.org/pdf/2504.00993",
"https://github.com/UCSC-VLAA/MedReason"
)
add_paper("m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning with Large Language Models",
"Xiaoke Huang, Juncheng Wu, Hui Liu, Xianfeng Tang, Yuyin Zhou",
null,
"https://arxiv.org/abs/2504.00869",
"@article{huang2025m1,<br>" +
" title = {m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning with Large Language Models},<br>" +
" author = {Xiaoke Huang, Juncheng Wu, Hui Liu, Xianfeng Tang, Yuyin Zhou},<br>" +
" journal = {arXiv preprint arXiv:2504.00869},<br>" +
" year = {2025},<br>",
"Test-time scaling has emerged as a powerful technique for enhancing the reasoning capabilities of large language models. However, its effectiveness in medical reasoning remains uncertain, as the medical domain fundamentally differs from mathematical tasks in terms of knowledge representation and decision-making processes. In this paper, we provide the first comprehensive investigation of test-time scaling for medical reasoning and present m1, a simple yet effective approach that increases a model's medical reasoning capability at inference. Our evaluation across diverse medical tasks demonstrates that test-time scaling consistently enhances medical reasoning, enabling lightweight fine-tuned models under 10B parameters to establish new state-of-the-art performance, while our 32B model rivals previous 70B-scale medical LLMs. However, we identify an optimal reasoning token budget of approximately 4K, beyond which performance may degrade due to overthinking. Budget forcing, which extends test-time computation through iterative prompts, helps models double-check answers but does not necessarily improve the overall medical QA performance and, in some cases, even introduces errors into previously correct responses. Our case-by-case analysis identifies insufficient medical knowledge as a key bottleneck that prevents further performance gains through test-time scaling. We find that increasing data scale, improving data quality, and expanding model capacity consistently enhance medical knowledge grounding, enabling continued performance improvements, particularly on challenging medical benchmarks where smaller models reach saturation. These findings underscore fundamental differences between medical and mathematical reasoning in LLMs, highlighting that enriched medical knowledge, other than increased reasoning depth alone, is essential for realizing the benefits of test-time scaling.",
"https://arxiv.org/abs/2504.00869",
"https://github.com/UCSC-VLAA/m1"
)
add_paper("STAR-1: Safer Alignment of Reasoning LLMs with 1K Data",
"Zijun Wang, Haoqin Tu, Yuhan Wang, Juncheng Wu, Jieru Mei, Brian R. Bartoldson, Bhavya Kailkhura, Cihang Xie",
null,
"https://arxiv.org/abs/2504.01903",
"@article{wang2025star,<br>" +
" title = {STAR-1: Safer Alignment of Reasoning LLMs with 1K Data},<br>" +
" author = {Zijun Wang, Haoqin Tu, Yuhan Wang, Juncheng Wu, Jieru Mei, Brian R. Bartoldson, Bhavya Kailkhura, Cihang Xie},<br>" +
" journal = {arXiv preprint arXiv:2504.01903},<br>" +
" year = {2025},<br>",
"This paper introduces STAR-1, a high-quality, just-1k-scale safety dataset specifically designed for large reasoning models (LRMs) like DeepSeek-R1. Built on three core principles -- diversity, deliberative reasoning, and rigorous filtering -- STAR-1 aims to address the critical needs for safety alignment in LRMs. Specifically, we begin by integrating existing open-source safety datasets from diverse sources. Then, we curate safety policies to generate policy-grounded deliberative reasoning samples. Lastly, we apply a GPT-4o-based safety scoring system to select training examples aligned with best practices. Experimental results show that fine-tuning LRMs with STAR-1 leads to an average 40% improvement in safety performance across four benchmarks, while only incurring a marginal decrease (e.g., an average of 1.1%) in reasoning ability measured across five reasoning tasks. Extensive ablation studies further validate the importance of our design principles in constructing STAR-1 and analyze its efficacy across both LRMs and traditional LLMs.",
"https://arxiv.org/abs/2504.01903",
"https://github.com/UCSC-VLAA/STAR-1"
)
add_paper("EpiFoundation: A Foundation Model for Single-Cell ATAC-seq via Peak-to-Gene Alignment",
"Juncheng Wu, Changxin Wan, Zhicheng Ji, Yuyin Zhou, Wenpin Hou",
null,
"https://www.biorxiv.org/content/10.1101/2025.02.05.636688",
"@article{wu2025epifoundation,<br>" +
" title = {EpiFoundation: A Foundation Model for Single-Cell ATAC-seq via Peak-to-Gene Alignment},<br>" +
" author = {Juncheng Wu, Changxin Wan, Zhicheng Ji, Yuyin Zhou, Wenpin Hou},<br>" +
" journal = {bioRxiv},<br>" +
" year = {2025},<br>",
"Foundation models exhibit strong capabilities for downstream tasks by learning generalized representations through self-supervised pre-training on large datasets. While several foundation models have been developed for single-cell RNA-seq (scRNA-seq) data, there is still a lack of models specifically tailored for single-cell ATAC-seq (scATAC-seq), which measures epigenetic information in individual cells. The principal challenge in developing such a model lies in the vast number of scATAC peaks and the significant sparsity of the data, which complicates the formulation of peak-to-peak correlations. To address this challenge, we introduce EpiFoundation, a foundation model for learning cell representations from the high-dimensional and sparse space of peaks. Epi-Foundation relies on an innovative cross-modality pre-training procedure with two key technical innovations. First, EpiFoundation exclusively processes the non-zero peak set, thereby enhancing the density of cell-specific information within the input data. Second, EpiFoundation utilizes dense gene expression information to supervise the pre-training process, aligning peak-to-gene correlations. EpiFoundation can handle various types of downstream tasks, including cell-type annotation, batch correction, and gene expression prediction. To train and validate EpiFoundation, we curated MiniAtlas, a dataset of 100,000+ single cells with paired scRNA-seq and scATAC-seq data, along with diverse test sets spanning various tissues and cell types for robust evaluation. EpiFoundation demonstrates state-of-the-art performance across multiple tissues and diverse downstream tasks.",
"https://www.biorxiv.org/content/10.1101/2025.02.05.636688",
"https://github.com/UCSC-VLAA/EpiFoundation"
)
add_paper("MethylProphet: A Generalized Gene-Contextual Model for Inferring Whole-Genome DNA Methylation Landscape",
"Xiaoke Huang, Qi Liu, Yifei Zhao, Xianfeng Tang, Yuyin Zhou, Wenpin Hou",
null,
"https://www.biorxiv.org/content/10.1101/2025.02.05.636730",
"@article{huang2025methylprophet,<br>" +
" title = {MethylProphet: A Generalized Gene-Contextual Model for Inferring Whole-Genome DNA Methylation Landscape},<br>" +
" author = {Xiaoke Huang, Qi Liu, Yifei Zhao, Xianfeng Tang, Yuyin Zhou, Wenpin Hou},<br>" +
" journal = {bioRxiv},<br>" +
" year = {2025},<br>",
"DNA methylation (DNAm), an epigenetic modification, regulates gene expression, influences phenotypes, and encodes inheritable information, making it critical for disease diagnosis, treatment, and prevention. While human genome contains approximately 28 million CpG sites where DNAm can be measured, only 1–3% of these sites are typically available in most datasets due to complex experimental protocols and high costs, hindering insights from DNAm data. Leveraging the relationship between gene expression and DNAm offers promise for computational inference, but existing statistical, machine learning, and masking-based generative Transformers face critical limitations: they cannot infer DNAm at unmeasured CpGs or in new samples effectively. To overcome these challenges, we introduce MethylProphet, a gene-guided, context-aware Transformer model designed for DNAm inference. MethylProphet employs a Bottleneck MLP for efficient gene profile compression and a specialized DNA sequence tokenizer, integrating global gene expression patterns with local CpG context through a Transformer encoder architecture. Trained on whole-genome bisulfite sequencing data from ENCODE (1.6B training CpG-sample pairs; 322B tokens), MethylProphet demonstrates strong performance in hold-out evaluations, effectively inferring DNAm for unmeasured CpGs and new samples. In addition, its application to 10842 pairs of gene expression and DNAm samples at TCGA chromosome 1 (450M training CpGsample pairs; 91B tokens) highlights its potential to facilitate pan-cancer DNAm landscape inference, offering a powerful tool for advancing epigenetic research and precision medicine. All codes, data, protocols, and models are publicly available via https://github.com/xk-huang/methylprophet/.",
"https://www.biorxiv.org/content/10.1101/2025.02.05.636730",
"https://github.com/xk-huang/methylprophet/"
)
add_paper("Scaling Laws in Patchification: An Image Is Worth 50,176 Tokens And More",
"Feng Wang, Yaodong Yu, Guoyizhe Wei, Wei Shao, Yuyin Zhou, Alan Yuille, Cihang Xie",
null,
"https://arxiv.org/abs/2502.03738",
"@article{wang2025scaling,<br>" +
" title = {Scaling Laws in Patchification: An Image Is Worth 50,176 Tokens And More},<br>" +
" author = {Feng Wang, Yaodong Yu, Guoyizhe Wei, Wei Shao, Yuyin Zhou, Alan Yuille, Cihang Xie},<br>" +
" journal = {arXiv preprint arXiv:2502.03738},<br>" +
" year = {2025},<br>",
"Since the introduction of Vision Transformer (ViT), patchification has long been regarded as a de facto image tokenization approach for plain visual architectures. By compressing the spatial size of images, this approach can effectively shorten the token sequence and reduce the computational cost of ViT-like plain architectures. In this work, we aim to thoroughly examine the information loss caused by this patchification-based compressive encoding paradigm and how it affects visual understanding. We conduct extensive patch size scaling experiments and excitedly observe an intriguing scaling law in patchification: the models can consistently benefit from decreased patch sizes and attain improved predictive performance, until it reaches the minimum patch size of 1x1, i.e., pixel tokenization. This conclusion is broadly applicable across different vision tasks, various input scales, and diverse architectures such as ViT and the recent Mamba models. Moreover, as a by-product, we discover that with smaller patches, task-specific decoder heads become less critical for dense prediction. In the experiments, we successfully scale up the visual sequence to an exceptional length of 50,176 tokens, achieving a competitive test accuracy of 84.6% with a base-sized model on the ImageNet-1k benchmark. We hope this study can provide insights and theoretical foundations for future works of building non-compressive vision models. Code is available at https://github.com/wangf3014/Patch_Scaling.",
"https://arxiv.org/abs/2502.03738",
"https://github.com/wangf3014/Patch_Scaling"
)
add_paper("ARFlow: Autogressive Flow with Hybrid Linear Attention",
"Mude Hui, Rui-Jie Zhu, Songlin Yang, Yu Zhang, Zirui Wang, Yuyin Zhou, Jason Eshraghian, Cihang Xie",
null,
"https://arxiv.org/abs/2501.16085",
"@article{hui2025arflow,<br>" +
" title = {ARFlow: Autogressive Flow with Hybrid Linear Attention},<br>" +
" author = {Mude Hui, Rui-Jie Zhu, Songlin Yang, Yu Zhang, Zirui Wang, Yuyin Zhou, Jason Eshraghian, Cihang Xie},<br>" +
" journal = {arXiv preprint arXiv:2501.16085},<br>" +
" year = {2025},<br>",
"Flow models are effective at progressively generating realistic images, but they generally struggle to capture long-range dependencies during the generation process as they compress all the information from previous time steps into a single corrupted image. To address this limitation, we propose integrating autoregressive modeling -- known for its excellence in modeling complex, high-dimensional joint probability distributions -- into flow models. During training, at each step, we construct causally-ordered sequences by sampling multiple images from the same semantic category and applying different levels of noise, where images with higher noise levels serve as causal predecessors to those with lower noise levels. This design enables the model to learn broader category-level variations while maintaining proper causal relationships in the flow process. During generation, the model autoregressively conditions the previously generated images from earlier denoising steps, forming a contextual and coherent generation trajectory. Additionally, we design a customized hybrid linear attention mechanism tailored to our modeling approach to enhance computational efficiency. Our approach, termed ARFlow, under 400k training steps, achieves 14.08 FID scores on ImageNet at 128 * 128 without classifier-free guidance, reaching 4.34 FID with classifier-free guidance 1.5, significantly outperforming the previous flow-based model SiT's 9.17 FID. Extensive ablation studies demonstrate the effectiveness of our modeling strategy and chunk-wise attention design.",
"https://arxiv.org/abs/2501.16085"
)
add_paper("UD-Mamba: A pixel-level uncertainty-driven Mamba model for medical image segmentation",
"Weiren Zhao, Feng Wang, Yanran Wang, Yutong Xie, Qi Wu, Yuyin Zhou",
null,
"https://arxiv.org/abs/2502.02024",
"@article{zhao2025udmamba,<br>" +
" title = {UD-Mamba: A pixel-level uncertainty-driven Mamba model for medical image segmentation},<br>" +
" author = {Weiren Zhao, Feng Wang, Yanran Wang, Yutong Xie, Qi Wu, Yuyin Zhou},<br>" +
" journal = {arXiv preprint arXiv:2502.02024},<br>" +
" year = {2025},<br>",
"Recent advancements have highlighted the Mamba framework, a state-space model known for its efficiency in capturing long-range dependencies with linear computational complexity. While Mamba has shown competitive performance in medical image segmentation, it encounters difficulties in modeling local features due to the sporadic nature of traditional location-based scanning methods and the complex, ambiguous boundaries often present in medical images. To overcome these challenges, we propose Uncertainty-Driven Mamba (UD-Mamba), which redefines the pixel-order scanning process by incorporating channel uncertainty into the scanning mechanism. UD-Mamba introduces two key scanning techniques: 1) sequential scanning, which prioritizes regions with high uncertainty by scanning in a row-by-row fashion, and 2) skip scanning, which processes columns vertically, moving from high-to-low or low-to-high uncertainty at fixed intervals. Sequential scanning efficiently clusters high-uncertainty regions, such as boundaries and foreground objects, to improve segmentation precision, while skip scanning enhances the interaction between background and foreground regions, allowing for timely integration of background information to support more accurate foreground inference. Recognizing the advantages of scanning from certain to uncertain areas, we introduce four learnable parameters to balance the importance of features extracted from different scanning methods. Additionally, a cosine consistency loss is employed to mitigate the drawbacks of transitioning between uncertain and certain regions during the scanning process. Our method demonstrates robust segmentation performance, validated across three distinct medical imaging datasets involving pathology, dermatological lesions, and cardiac tasks.",
"https://arxiv.org/abs/2502.02024"
)
add_paper("Safety at Scale: A Comprehensive Survey of Large Model Safety",
"Xingjun Ma, Yifeng Gao, Yixu Wang, Ruofan Wang, Xin Wang, Ye Sun, Yifan Ding, Hengyuan Xu, Yunhao Chen, Yunhan Zhao, Hanxun Huang, Yige Li, Jiaming Zhang, Xiang Zheng, Yang Bai, Zuxuan Wu, Xipeng Qiu, Jingfeng Zhang, Yiming Li, Jun Sun, Cong Wang, Jindong Gu, Baoyuan Wu, Siheng Chen, Tianwei Zhang, Yang Liu, Mingming Gong, Tongliang Liu, Shirui Pan, Cihang Xie, Tianyu Pang, Yinpeng Dong, Ruoxi Jia, Yang Zhang, Shiqing Ma, Xiangyu Zhang, Neil Gong, Chaowei Xiao, Sarah Erfani, Bo Li, Masashi Sugiyama, Dacheng Tao, James Bailey, Yu-Gang Jiang",
null,
"https://arxiv.org/abs/2502.05206",
"@article{ma2025safety,<br>" +
" title = {Safety at Scale: A Comprehensive Survey of Large Model Safety},<br>" +
" author = {Xingjun Ma, Yifeng Gao, Yixu Wang, Ruofan Wang, Xin Wang, Ye Sun, Yifan Ding, Hengyuan Xu, Yunhao Chen, Yunhan Zhao, Hanxun Huang, Yige Li, Jiaming Zhang, Xiang Zheng, Yang Bai, Zuxuan Wu, Xipeng Qiu, Jingfeng Zhang, Yiming Li, Jun Sun, Cong Wang, Jindong Gu, Baoyuan Wu, Siheng Chen, Tianwei Zhang, Yang Liu, Mingming Gong, Tongliang Liu, Shirui Pan, Cihang Xie, Tianyu Pang, Yinpeng Dong, Ruoxi Jia, Yang Zhang, Shiqing Ma, Xiangyu Zhang, Neil Gong, Chaowei Xiao, Sarah Erfani, Bo Li, Masashi Sugiyama, Dacheng Tao, James Bailey, Yu-Gang Jiang},<br>" +
" journal = {arXiv preprint arXiv:2502.05206},<br>" +
" year = {2025},<br>",
"The rapid advancement of large models, driven by their exceptional abilities in learning and generalization through large-scale pre-training, has reshaped the landscape of Artificial Intelligence (AI). These models are now foundational to a wide range of applications, including conversational AI, recommendation systems, autonomous driving, content generation, medical diagnostics, and scientific discovery. However, their widespread deployment also exposes them to significant safety risks, raising concerns about robustness, reliability, and ethical implications. This survey provides a systematic review of current safety research on large models, covering Vision Foundation Models (VFMs), Large Language Models (LLMs), Vision-Language Pre-training (VLP) models, Vision-Language Models (VLMs), Diffusion Models (DMs), and large-model-based Agents. Our contributions are summarized as follows: (1) We present a comprehensive taxonomy of safety threats to these models, including adversarial attacks, data poisoning, backdoor attacks, jailbreak and prompt injection attacks, energy-latency attacks, data and model extraction attacks, and emerging agent-specific threats. (2) We review defense strategies proposed for each type of attacks if available and summarize the commonly used datasets and benchmarks for safety research. (3) Building on this, we identify and discuss the open challenges in large model safety, emphasizing the need for comprehensive safety evaluations, scalable and effective defense mechanisms, and sustainable data practices. More importantly, we highlight the necessity of collective efforts from the research community and international collaboration. Our work can serve as a useful reference for researchers and practitioners, fostering the ongoing development of comprehensive defense systems and platforms to safeguard AI models.",
"https://arxiv.org/abs/2502.05206",
"https://github.com/xingjunm/Awesome-Large-Model-Safety"
)
add_paper("Double Visual Defense: Adversarial Pre-training and Instruction Tuning for Improving Vision-Language Model Robustness",
"Zeyu Wang, Cihang Xie, Brian Bartoldson, Bhavya Kailkhura",
null,
"https://arxiv.org/abs/2501.09446",
"@article{wang2024double,<br>" +
" title = {Double Visual Defense: Adversarial Pre-training and Instruction Tuning for Improving Vision-Language Model Robustness},<br>" +
" author = {Zeyu Wang, Cihang Xie, Brian Bartoldson, Bhavya Kailkhura},<br>" +
" journal = {arXiv preprint arXiv:2501.09446},<br>" +
" year = {2025},<br>",
"This paper investigates the robustness of vision-language models against adversarial visual perturbations and introduces a novel \"double visual defense\" to enhance this robustness. Unlike previous approaches that resort to lightweight adversarial fine-tuning of a pre-trained CLIP model, we perform large-scale adversarial vision-language pre-training from scratch using web-scale data. We then strengthen the defense by incorporating adversarial visual instruction tuning. The resulting models from each stage, ΔCLIP and Δ2LLaVA, show substantially enhanced zero-shot robustness and set a new state-of-the-art in adversarial defense for vision-language models. For example, the adversarial robustness of ΔCLIP surpasses that of the previous best models on ImageNet-1k by ~20%. Similarly, compared to prior art, Δ2LLaVA brings a ~30% robustness improvement to image captioning task and a ~20% robustness improvement to visual question answering task. Furthermore, our models exhibit stronger zero-shot recognition capability, fewer hallucinations, and superior reasoning performance compared to baselines. Our project page is https://doublevisualdefense.github.io/.",
"https://arxiv.org/abs/2501.09446",
"https://doublevisualdefense.github.io/"
)
add_paper("CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions",
"Yanqing Liu, Xianhang Li, Zeyu Wang, Bingchen Zhao, Cihang Xie",
null,
"https://arxiv.org/abs/2411.16828",
"@article{liu2024clips,<br>" +
" title = {CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions},<br>" +
" author = {Yanqing Liu, Xianhang Li, Zeyu Wang, Bingchen Zhao, Cihang Xie},<br>" +
" journal = {arXiv preprint arXiv:2411.16828},<br>" +
" year = {2024},<br>",
"Previous works show that noisy, web-crawled image-text pairs may limit vision-language pretraining like CLIP and propose learning with synthetic captions as a promising alternative. Our work continues this effort, introducing two simple yet effective designs to better leverage richly described synthetic captions. Firstly, by observing a strong inverse effect in learning with synthetic captions -- the short synthetic captions can generally lead to MUCH higher performance than full-length ones -- we therefore fed only partial synthetic captions to the text encoder. Secondly, we incorporate an autoregressive captioner to mimic the recaptioning process -- by conditioning on the paired image input and web-crawled text description, the captioner learns to predict the full-length synthetic caption generated by advanced MLLMs. Experiments show that our framework significantly improves zero-shot performance in cross-modal retrieval tasks, setting new SOTA results on MSCOCO and Flickr30K. Moreover, such trained vision encoders can enhance the visual capability of LLaVA, showing strong improvements on a range of MLLM benchmarks. Our project page is https://ucsc-vlaa.github.io/CLIPS/.",
"https://arxiv.org/abs/2411.16828",
"https://ucsc-vlaa.github.io/CLIPS/"
)
add_paper("M-VAR: Decoupled Scale-wise Autoregressive Modeling for High-Quality Image Generation",
"Sucheng Ren, Yaodong Yu, Nataniel Ruiz, Feng Wang, Alan Yuille, Cihang Xie",
null,
"https://arxiv.org/abs/2411.10433",
"@article{ren2024mvar,<br>" +
" title = {M-VAR: Decoupled Scale-wise Autoregressive Modeling for High-Quality Image Generation},<br>" +
" author = {Sucheng Ren, Yaodong Yu, Nataniel Ruiz, Feng Wang, Alan Yuille, Cihang Xie},<br>" +
" journal = {arXiv preprint arXiv:2411.10433},<br>" +
" year = {2024},<br>",
"There exists recent work in computer vision, named VAR, that proposes a new autoregressive paradigm for image generation. Diverging from the vanilla next-token prediction, VAR structurally reformulates the image generation into a coarse to fine next-scale prediction. In this paper, we show that this scale-wise autoregressive framework can be effectively decoupled into intra-scale modeling, which captures local spatial dependencies within each scale, and inter-scale modeling, which models cross-scale relationships progressively from coarse-to-fine scales. This decoupling structure allows to rebuild VAR in a more computationally efficient manner. Specifically, for intra-scale modeling -- crucial for generating high-fidelity images -- we retain the original bidirectional self-attention design to ensure comprehensive modeling; for inter-scale modeling, which semantically connects different scales but is computationally intensive, we apply linear-complexity mechanisms like Mamba to substantially reduce computational overhead. We term this new framework M-VAR. Extensive experiments demonstrate that our method outperforms existing models in both image quality and generation speed. For example, our 1.5B model, with fewer parameters and faster inference speed, outperforms the largest VAR-d30-2B. Moreover, our largest model M-VAR-d32 impressively registers 1.78 FID on ImageNet 256x256 and outperforms the prior-art autoregressive models LlamaGen/VAR by 0.4/0.19 and popular diffusion models LDM/DiT by 1.82/0.49, respectively. Code is available at https://github.com/OliverRensu/MVAR.",
"https://arxiv.org/abs/2411.10433",
"https://github.com/OliverRensu/MVAR"
)
add_paper("Story-Adapter: A Training-free Iterative Framework for Long Story Visualization",
"Jiawei Mao, Xiaoke Huang, Yunfei Xie, Yuanqi Chang, Mude Hui, Bingjie Xu, Yuyin Zhou",
null,
"https://arxiv.org/abs/2410.06244",
"@article{mao2024storyadapter,<br>" +
" title = {Story-Adapter: A Training-free Iterative Framework for Long Story Visualization},<br>" +
" author = {Jiawei Mao, Xiaoke Huang, Yunfei Xie, Yuanqi Chang, Mude Hui, Bingjie Xu, Yuyin Zhou},<br>" +
" journal = {arXiv preprint arXiv:2410.06244},<br>" +
" year = {2024},<br>",
"Story visualization, the task of generating coherent images based on a narrative, has seen significant advancements with the emergence of text-to-image models, particularly diffusion models. However, maintaining semantic consistency, generating high-quality fine-grained interactions, and ensuring computational feasibility remain challenging, especially in long story visualization (i.e., up to 100 frames). In this work, we propose a training-free and computationally efficient framework, termed Story-Adapter, to enhance the generative capability of long stories. Specifically, we propose an iterative paradigm to refine each generated image, leveraging both the text prompt and all generated images from the previous iteration. Central to our framework is a training-free global reference cross-attention module, which aggregates all generated images from the previous iteration to preserve semantic consistency across the entire story, while minimizing computational costs with global embeddings. This iterative process progressively optimizes image generation by repeatedly incorporating text constraints, resulting in more precise and fine-grained interactions. Extensive experiments validate the superiority of Story-Adapter in improving both semantic consistency and generative capability for fine-grained interactions, particularly in long story scenarios. The project page and associated code can be accessed via https://jwmao1.github.io/storyadapter.",
"https://arxiv.org/abs/2410.06244",
"https://jwmao1.github.io/storyadapter"
)
// add_paper("Efficient MedSAMs: Segment Anything in Medical Images on Laptop",
// "Jun Ma, Feifei Li, Sumin Kim, Reza Asakereh, Bao-Hiep Le, Dang-Khoa Nguyen-Vu, Alexander Pfefferle, Muxin Wei, Ruochen Gao, Donghang Lyu, Songxiao Yang, Lennart Purucker, Zdravko Marinov, Marius Staring, Haisheng Lu, Thuy Thanh Dao, Xincheng Ye, Zhi Li, Gianluca Brugnara, Philipp Vollmuth, Martha Foltyn-Dumitru, Jaeyoung Cho, Mustafa Ahmed Mahmutoglu, Martin Bendszus, Irada Pflüger, Aditya Rastogi, Dong Ni, Xin Yang, Guang-Quan Zhou, Kaini Wang, Nicholas Heller, Nikolaos Papanikolopoulos, Christopher Weight, Yubing Tong, Jayaram K Udupa, Cahill J Patrick, Yaqi Wang, Yifan Zhang, Francisco Contijoch, Elliot McVeigh, Xin Ye, Shucheng He, Robert Haase, Thomas Pinetz, Alexander Radbruch, Inga Krause, Erich Kobler, Jian He, Yucheng Tang, Haichun Yang, Yuankai Huo, Gongning Luo, Kaisar Kushibar, Jandos Amankulov, Dias Toleshbayev, Amangeldi Mukhamejan, Jan Egger, Antonio Pepe, Christina Gsaxner, Gijs Luijten, Shohei Fujita, Tomohiro Kikuchi, Benedikt Wiestler, Jan S Kirschke, Ezequiel de la Rosa, Federico Bolelli, Luca Lumetti, Costantino Grana, Kunpeng Xie, Guomin Wu, Behrus Puladi, Carlos Martín-Isla, Karim Lekadir, Victor M Campello, Wei Shao, Wayne Brisbane, Hongxu Jiang, Hao Wei, Wu Yuan, Shuangle Li, Yuyin Zhou, Bo Wang",
// null,
// "https://arxiv.org/abs/2412.16085",
// "@article{ma2024efficient,<br>" +
// " title = {Efficient MedSAMs: Segment Anything in Medical Images on Laptop},<br>" +
// " author = {Jun Ma, Feifei Li, Sumin Kim, Reza Asakereh, Bao-Hiep Le, Dang-Khoa Nguyen-Vu, Alexander Pfefferle, Muxin Wei, Ruochen Gao, Donghang Lyu, Songxiao Yang, Lennart Purucker, Zdravko Marinov, Marius Staring, Haisheng Lu, Thuy Thanh Dao, Xincheng Ye, Zhi Li, Gianluca Brugnara, Philipp Vollmuth, Martha Foltyn-Dumitru, Jaeyoung Cho, Mustafa Ahmed Mahmutoglu, Martin Bendszus, Irada Pflüger, Aditya Rastogi, Dong Ni, Xin Yang, Guang-Quan Zhou, Kaini Wang, Nicholas Heller, Nikolaos Papanikolopoulos, Christopher Weight, Yubing Tong, Jayaram K Udupa, Cahill J Patrick, Yaqi Wang, Yifan Zhang, Francisco Contijoch, Elliot McVeigh, Xin Ye, Shucheng He, Robert Haase, Thomas Pinetz, Alexander Radbruch, Inga Krause, Erich Kobler, Jian He, Yucheng Tang, Haichun Yang, Yuankai Huo, Gongning Luo, Kaisar Kushibar, Jandos Amankulov, Dias Toleshbayev, Amangeldi Mukhamejan, Jan Egger, Antonio Pepe, Christina Gsaxner, Gijs Luijten, Shohei Fujita, Tomohiro Kikuchi, Benedikt Wiestler, Jan S Kirschke, Ezequiel de la Rosa, Federico Bolelli, Luca Lumetti, Costantino Grana, Kunpeng Xie, Guomin Wu, Behrus Puladi, Carlos Martín-Isla, Karim Lekadir, Victor M Campello, Wei Shao, Wayne Brisbane, Hongxu Jiang, Hao Wei, Wu Yuan, Shuangle Li, Yuyin Zhou, Bo Wang},<br>" +
// " journal = {arXiv preprint arXiv:2412.16085},<br>" +
// " year = {2024},<br>",
// "Promptable segmentation foundation models have emerged as a transformative approach to addressing the diverse needs in medical images, but most existing models require expensive computing, posing a big barrier to their adoption in clinical practice. In this work, we organized the first international competition dedicated to promptable medical image segmentation, featuring a large-scale dataset spanning nine common imaging modalities from over 20 different institutions. The top teams developed lightweight segmentation foundation models and implemented an efficient inference pipeline that substantially reduced computational requirements while maintaining state-of-the-art segmentation accuracy. Moreover, the post-challenge phase advanced the algorithms through the design of performance booster and reproducibility tasks, resulting in improved algorithms and validated reproducibility of the winning solution. Furthermore, the best-performing algorithms have been incorporated into the open-source software with a user-friendly interface to facilitate clinical adoption. The data and code are publicly available to foster the further development of medical image segmentation foundation models and pave the way for impactful real-world applications.",
// "https://arxiv.org/abs/2412.16085",
// )
add_paper("A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?",
"Yunfei Xie, Juncheng Wu, Haoqin Tu, Siwei Yang, Bingchen Zhao, Yongshuo Zong, Qiao Jin, Cihang Xie, Yuyin Zhou",
null,
"https://arxiv.org/abs/2409.15277",
"@article{xie2024preliminarystudyo1medicine,<br>" +
" title = {A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?},<br>" +
" author = {Yunfei Xie, Juncheng Wu, Haoqin Tu, Siwei Yang, Bingchen Zhao, Yongshuo Zong, Qiao Jin, Cihang Xie, Yuyin Zhou},<br>" +
" journal = {arXiv preprint arXiv:2409.15277},<br>" +
" year = {2024},<br>",
"Large language models (LLMs) have exhibited remarkable capabilities across various domains and tasks, pushing the boundaries of our knowledge in learning and cognition. The latest model, OpenAI's o1, stands out as the first LLM with an internalized chain-of-thought technique using reinforcement learning strategies. While it has demonstrated surprisingly strong capabilities on various general language tasks, its performance in specialized fields such as medicine remains unknown. To this end, this report provides a comprehensive exploration of o1 on different medical scenarios, examining 3 key aspects: understanding, reasoning, and multilinguality. Specifically, our evaluation encompasses 6 tasks using data from 37 medical datasets, including two newly constructed and more challenging question-answering (QA) tasks based on professional medical quizzes from the New England Journal of Medicine (NEJM) and The Lancet. These datasets offer greater clinical relevance compared to standard medical QA benchmarks such as MedQA, translating more effectively into real-world clinical utility. Our analysis of o1 suggests that the enhanced reasoning ability of LLMs may (significantly) benefit their capability to understand various medical instructions and reason through complex clinical scenarios. Notably, o1 surpasses the previous GPT-4 in accuracy by an average of 6.2% and 6.6% across 19 datasets and two newly created complex QA scenarios. But meanwhile, we identify several weaknesses in both the model capability and the existing evaluation protocols, including hallucination, inconsistent multilingual ability, and discrepant metrics for evaluation. We release our raw data and model outputs https://ucsc-vlaa.github.io/o1_medicine/ for future research.",
"https://arxiv.org/abs/2409.15277",
"https://ucsc-vlaa.github.io/o1_medicine/"
)
add_paper("Restorer: Removing Multi-Degradation with All-Axis Attention and Prompt Guidance",
"Jiawei Mao, Juncheng Wu, Yuyin Zhou, Xuesong Yin, Yuanqi Chang",
null,
"https://arxiv.org/abs/2406.12587",
"@article{mao2024restorer,<br>" +
" title = {Restorer: Removing Multi-Degradation with All-Axis Attention and Prompt Guidance},<br>" +
" author = {Mao, Jiawei and Wu, Juncheng and Zhou, Yuyin and Yin, Xuesong and Chang, Yuanqi},<br>" +
" journal = {arXiv preprint arXiv:2406.12587},<br>" +
" year = {2024}<br>}",
"There are many excellent solutions in image restoration. However, most methods require on training separate models to restore images with different types of degradation. Although existing all-in-one models effectively address multiple types of degradation simultaneously, their performance in real-world scenarios is still constrained by the task confusion problem. In this work, we attempt to address this issue by introducing Restorer, a novel Transformer-based allin-one image restoration model. To effectively address the complex degradation present in real-world images, we propose All-Axis Attention (AAA), a novel attention mechanism that simultaneously models long-range dependencies across both spatial and channel dimensions, capturing potential correlations along all axes. Additionally, we introduce textual prompts in Restorer to incorporate explicit task priors, enabling the removal of specific degradation types based on user instructions. By iterating over these prompts, Restorer can handle composite degradation in real-world scenarios without requiring additional training. Based on these designs, Restorer with one set of parameters demonstrates state-of-theart performance in multiple image restoration tasks compared to existing all-in-one and even single-task models. Additionally, Restorer is efficient during inference, suggesting the potential in real-world applications. Code will be available at https://github.com/Talented-Q/Restorer.",
"https://arxiv.org/abs/2406.12587",
"https://github.com/Talented-Q/Restorer."
)
add_paper("VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges",
"Yuxuan Wang, Cihang Xie, Yang Liu, Zilong Zheng",
null,
"https://arxiv.org/abs/2409.01071",
"@article{wang2024videollamb,<br>" +
" title = {VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges},<br>" +
" author = {Yuxuan Wang, Cihang Xie, Yang Liu, Zilong Zheng},<br>" +
" journal = {arXiv preprint arXiv:2409.01071},<br>" +
" year = {2024},<br>",
"Recent advancements in large-scale video-language models have shown significant potential for real-time planning and detailed interactions. However, their high computational demands and the scarcity of annotated datasets limit their practicality for academic researchers. In this work, we introduce VideoLLaMB, a novel framework that utilizes temporal memory tokens within bridge layers to allow for the encoding of entire video sequences alongside historical visual data, effectively preserving semantic continuity and enhancing model performance across various tasks. This approach includes recurrent memory tokens and a SceneTilling algorithm, which segments videos into independent semantic units to preserve semantic integrity. Empirically, VideoLLaMB significantly outstrips existing video-language models, demonstrating a 5.5 points improvement over its competitors across three VideoQA benchmarks, and 2.06 points on egocentric planning. Comprehensive results on the MVBench show that VideoLLaMB-7B achieves markedly better results than previous 7B models of same LLM. Remarkably, it maintains robust performance as PLLaVA even as video length increases up to 8 times. Besides, the frame retrieval results on our specialized Needle in a Video Haystack (NIAVH) benchmark, further validate VideoLLaMB's prowess in accurately identifying specific frames within lengthy videos. Our SceneTilling algorithm also enables the generation of streaming video captions directly, without necessitating additional training. In terms of efficiency, VideoLLaMB, trained on 16 frames, supports up to 320 frames on a single Nvidia A100 GPU with linear GPU memory scaling, ensuring both high performance and cost-effectiveness, thereby setting a new foundation for long-form video-language models in both academic and practical applications.",
"https://arxiv.org/abs/2409.01071",
"https://videollamb.github.io/"
)
add_paper("What If We Recaption Billions of Web Images with LLaMA-3",
"Xianhang Li, Haoqin Tu, Mude Hui, Zeyu Wang, Bingchen Zhao, Junfei Xiao, Sucheng Ren, Jieru Mei, Qing Liu, Huangjie Zheng, Yuyin Zhou, Cihang Xie",
null,
"https://arxiv.org/abs/2406.08478",
"@article{li2024recaption,<br>" +
" title = {What If We Recaption Billions of Web Images with LLaMA-3},<br>" +
" author = {Li, Xianhang and Tu, Haoqin and Hui, Mude and Wang, Zeyu and Zhao, Bingchen and Xiao, Junfei and Ren, Sucheng and Mei, Jieru and Liu, Qing and Zheng, Huangjie and Zhou, Yuyin and Xie, Cihang},<br>" +
" journal = {arXiv preprint arXiv:2406.08478},<br>" +
" year = {2024},<br>",
"Web-crawled image-text pairs are inherently noisy. Prior studies demonstrate that semantically aligning and enriching textual descriptions of these pairs can significantly enhance model training across various vision-language tasks, particularly text-to-image generation. However, large-scale investigations in this area remain predominantly closed-source. Our paper aims to bridge this community effort, leveraging the powerful and \textit{open-sourced} LLaMA-3, a GPT-4 level LLM. Our recaptioning pipeline is simple: first, we fine-tune a LLaMA-3-8B powered LLaVA-1.5 and then employ it to recaption 1.3 billion images from the DataComp-1B dataset. Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models. For discriminative models like CLIP, we observe enhanced zero-shot performance in cross-modal retrieval tasks. For generative models like text-to-image Diffusion Transformers, the generated images exhibit a significant improvement in alignment with users' text instructions, especially in following complex queries. Our project page is https://www.haqtu.me/Recap-Datacomp-1B/",
"https://arxiv.org/abs/2406.08478",
"https://www.haqtu.me/Recap-Datacomp-1B/"
)
add_paper("Medical Vision Generalist: Unifying Medical Imaging Tasks in Context",
"Sucheng Ren, Xiaoke Huang, Xianhang Li, Junfei Xiao, Jieru Mei, Zeyu Wang, Alan Yuille, Yuyin Zhou",
null,
"https://arxiv.org/abs/2406.05565",
"@article{ren2024medicalvision,<br>" +
" title = {Medical Vision Generalist: Unifying Medical Imaging Tasks in Context},<br>" +
" author = {Ren, Sucheng and Huang, Xiaoke and Li, Xianhang and Xiao, Junfei and Mei, Jieru and Wang, Zeyu and Yuille, Alan and Zhou, Yuyin},<br>" +
" journal = {arXiv preprint arXiv:2406.05565},<br>" +
" year = {2024},<br>",
"This study presents Medical Vision Generalist (MVG), the first foundation model capable of handling various medical imaging tasks -- such as cross-modal synthesis, image segmentation, denoising, and inpainting -- within a unified image-to-image generation framework. Specifically, MVG employs an in-context generation strategy that standardizes the handling of inputs and outputs as images. By treating these tasks as an image generation process conditioned on prompt image-label pairs and input images, this approach enables a flexible unification of various tasks, even those spanning different modalities and datasets. To capitalize on both local and global context, we design a hybrid method combining masked image modeling with autoregressive training for conditional image generation. This hybrid approach yields the most robust performance across all involved medical imaging tasks. To rigorously evaluate MVG's capabilities, we curated the first comprehensive generalist medical vision benchmark, comprising 13 datasets and spanning four imaging modalities (CT, MRI, X-ray, and micro-ultrasound). Our results consistently establish MVG's superior performance, outperforming existing vision generalists, such as Painter and LVM. Furthermore, MVG exhibits strong scalability, with its performance demonstrably improving when trained on a more diverse set of tasks, and can be effectively adapted to unseen datasets with only minimal task-specific samples. The code is available at https://github.com/OliverRensu/MVG.",
"https://arxiv.org/abs/2406.05565",
"https://github.com/OliverRensu/MVG"
)
add_paper("Fast-DDPM: Fast Denoising Diffusion Probabilistic Models for Medical Image-to-Image Generation",
"Hongxu Jiang, Muhammad Imran, Linhai Ma, Teng Zhang, Yuyin Zhou, Muxuan Liang, Kuang Gong, Wei Shao",
null,
"https://arxiv.org/abs/2405.14802",
"@article{jiang2024fastddpm,<br>" +
" title = {Fast Denoising Diffusion Probabilistic Models for Medical Image-to-Image Generation},<br>" +
" author = {Jiang, Hongxu and Imran, Muhammad and Ma, Linhai and Zhang, Teng and Zhou, Yuyin and Liang, Muxuan and Gong, Kuang and Shao, Wei},<br>" +
" journal = {arXiv preprint arXiv:2405.14802},<br>" +
" year = {2024},<br>",
"Denoising diffusion probabilistic models (DDPMs) have achieved unprecedented success in computer vision. However, they remain underutilized in medical imaging, a field crucial for disease diagnosis and treatment planning. This is primarily due to the high computational cost associated with (1) the use of large number of time steps (e.g., 1,000) in diffusion processes and (2) the increased dimensionality of medical images, which are often 3D or 4D. Training a diffusion model on medical images typically takes days to weeks, while sampling each image volume takes minutes to hours. To address this challenge, we introduce Fast-DDPM, a simple yet effective approach capable of improving training speed, sampling speed, and generation quality simultaneously. Unlike DDPM, which trains the image denoiser across 1,000 time steps, Fast-DDPM trains and samples using only 10 time steps. The key to our method lies in aligning the training and sampling procedures. We introduced two efficient noise schedulers with 10 time steps: one with uniform time step sampling and another with non-uniform sampling. We evaluated Fast-DDPM across three medical image-to-image generation tasks: multi-image super-resolution, image denoising, and image-to-image translation. Fast-DDPM outperformed DDPM and current state-of-the-art methods based on convolutional networks and generative adversarial networks in all tasks. Additionally, Fast-DDPM reduced training time by a factor of 5 and sampling time by a factor of 100 compared to DDPM. Our code is publicly available at: https://github.com/mirthAI/Fast-DDPM.",
"https://arxiv.org/abs/2405.14802",
"https://github.com/mirthAI/Fast-DDPM"
)
add_paper("VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models",
"Yuxuan Wang, Yueqian Wang, Dongyan Zhao, Cihang Xie, Zilong Zheng",
null,
"https://arxiv.org/abs/2406.16338",
"@article{wang2025videohallucer,<br>" +
" title = {VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models},<br>" +
" author = {Wang, Yuxuan and Wang, Yueqian and Zhao, Dongyan and Xie, Cihang and Zheng, Zilong},<br>" +
" journal = {arXiv preprint arXiv:2406.16338},<br>" +
" year = {2024}<br>",
"Recent advancements in Multimodal Large Language Models (MLLMs) have extended their capabilities to video understanding. Yet, these models are often plagued by 'hallucinations', where irrelevant or nonsensical content is generated, deviating from the actual video context. This work introduces VideoHallucer, the first comprehensive benchmark for hallucination detection in large video-language models (LVLMs). VideoHallucer categorizes hallucinations into two main types: intrinsic and extrinsic, offering further subcategories for detailed analysis, including object-relation, temporal, semantic detail, extrinsic factual, and extrinsic non-factual hallucinations. We adopt an adversarial binary VideoQA method for comprehensive evaluation, where pairs of basic and hallucinated questions are crafted strategically. By evaluating eleven LVLMs on VideoHallucer, we reveal that i) the majority of current models exhibit significant issues with hallucinations; ii) while scaling datasets and parameters improves models' ability to detect basic visual cues and counterfactuals, it provides limited benefit for detecting extrinsic factual hallucinations; iii) existing models are more adept at detecting facts than identifying hallucinations. As a byproduct, these analyses further instruct the development of our self-PEP framework, achieving an average of 5.38% improvement in hallucination resistance across all model architectures.",
"https://arxiv.org/abs/2406.16338",
"https://videohallucer.github.io/"
)
add_paper("RetinaRegNet: A Versatile Approach for Retinal Image Registration",
"Vishal Balaji Sivaraman, Muhammad Imran, Qingyue Wei, Preethika Muralidharan, Michelle R Tamplin, Isabella M Grumbach, Randy H Kardon, Jui-Kai Wang, Yuyin Zhou, Wei Shao",
null,
"https://arxiv.org/abs/2404.16017",
"@article{sivaraman2024retinaregnet,<br>" +
" title = {RetinaRegNet: A Versatile Approach for Retinal Image Registration},<br>" +
" author = {Sivaraman, Vishal Balaji and Imran, Muhammad and Wei, Qingyue and Muralidharan, Preethika and Tamplin, Michelle R and Grumbach, Isabella M and Kardon, Randy H and Wang, Jui-Kai and Zhou, Yuyin and Shao, Wei},<br>" +
" journal = {arXiv preprint arXiv:2404.16017},<br>" +
" year = {2024},<br>",
"We introduce the RetinaRegNet model, which can achieve state-of-the-art performance across various retinal image registration tasks. RetinaRegNet does not require training on any retinal images. It begins by establishing point correspondences between two retinal images using image features derived from diffusion models. This process involves the selection of feature points from the moving image using the SIFT algorithm alongside random point sampling. For each selected feature point, a 2D correlation map is computed by assessing the similarity between the feature vector at that point and the feature vectors of all pixels in the fixed image. The pixel with the highest similarity score in the correlation map corresponds to the feature point in the moving image. To remove outliers in the estimated point correspondences, we first applied an inverse consistency constraint, followed by a transformation-based outlier detector. This method proved to outperform the widely used random sample consensus (RANSAC) outlier detector by a significant margin. To handle large deformations, we utilized a two-stage image registration framework. A homography transformation was used in the first stage and a more accurate third-order polynomial transformation was used in the second stage. The model's effectiveness was demonstrated across three retinal image datasets: color fundus images, fluorescein angiography images, and laser speckle flowgraphy images. RetinaRegNet outperformed current state-of-the-art methods in all three datasets. It was especially effective for registering image pairs with large displacement and scaling deformations. This innovation holds promise for various applications in retinal image analysis. Our code is publicly available at https://github.com/mirthAI/RetinaRegNet.",
"https://arxiv.org/abs/2404.16017",
"https://github.com/mirthAI/RetinaRegNet"
)
add_paper("Audio-Visual LLM for Video Understanding",
"Hangxun Shu, Lei Zhang, Hao Jiang, Cihang Xie",
null,
"https://arxiv.org/abs/2312.06720",
"@article{shu2023audio,<br>" +
" title = {Audio-Visual LLM for Video Understanding},<br>" +
" author = {Shu, Fangxun and Zhang, Lei and Jiang, Hao and Xie, Cihang},<br>" +
" journal = {arXiv preprint arXiv:2312.06720},<br>" +
" year = {2023}<br>}",
"This paper presents Audio-Visual LLM, a Multimodal Large Language Model that takes both visual and auditory inputs for holistic video understanding. A key design is the modality-augmented training, which involves the integration of modality-specific tokens engineered to activate the appropriate visual and/or auditory encoder selectively. This mechanism is pivotal in enabling end-to-end joint training with video data at different modalities, including visual-only, audio-only, and audio-visual formats. Moreover, we introduce a high-quality video instruction dataset, derived from GPT-4. This dataset allows Audio-Visual LLM to adeptly process a variety of task-oriented video instructions, ranging from multi-turn conversations and audio-visual narratives to complex reasoning tasks. Extensive experiments demonstrate that Audio-Visual LLM impressively achieves strong zero-shot results across a range of video understanding tasks. For example, Audio-Visual LLM achieves an accuracy of 53.7% on MSRVTT-QA, outperforming non-LLM-based InterVideo by 6.6% and LLM-based Valley by 4.4%, respectively. Additionally, our Audio-Visual LLM also achieves competitive performance on audio tasks (e.g., AudioCaps).",
"https://arxiv.org/abs/2312.06720"
)
add_paper("Compress & Align: Curating Image-Text Data with Human Knowledge",
"Lei Zhang, Fangxun Shu, Sucheng Ren, Hao Jiang, Bingchen Zhao, Cihang Xie",
null,
"https://arxiv.org/abs/2312.06726",
"@article{zhang2023compress,<br>" +
" title = {Compress & Align: Curating Image-Text Data with Human Knowledge},<br>" +
" author = {Zhang, Lei and Shu, Fangxun and Ren, Sucheng and Zhao, Bingchen and Jiang, Hao and Xie, Cihang},<br>" +
" journal = {arXiv preprint arXiv:2312.06726},<br>" +
" year = {2023}<br>}",
"The massive growth of image-text data through web crawling inherently presents the challenge of variability in data quality. This paper introduces a novel algorithm, rooted in human knowledge, to compress this vast corpus of web-crawled image-text datasets to a compact and high-quality form. Our method unfolds in three major steps. First, we collect an image-text dataset, wherein each image is associated with multiple captions sourced from diverse origins. Then, to systemically capture human preferences regarding the best caption paired with each image, we establish a comprehensive set of both subjective and objective criteria for critically guiding the alignment assessment from labelers. Lastly, we train a reward model on the annotated dataset to internalize the nuanced human understanding of image-text alignment. The resulting reward model thus can act as a human-like referee to filter misaligned/low-quality image-text pairs. Extensive experiments demonstrate that we are able to secure (or even improve) model performance by compressing the image-text datasets up to ~90%. An impressive example is that, by aggressively reducing the total training sample from 130M to 15.5M (e.g., ~9x smaller), our BLIP-B/16 models still consistently show superior performance compared with the full-size-dataset counterpart on image-text retrieval (Flickr30K, COCO) by ~2.5% in Recall@1, and on image-captioning (Nocaps, COCO) by ~10.0% in CIDEr and ~2.7% in SPICE.",
"https://arxiv.org/abs/2312.06726"
)
// add_paper("3D TransUNet: Advancing Medical Image Segmentation through Vision Transformers",
// "Jieneng Chen, Jieru Mei, Xianhang Li, Yongyi Lu, Qihang Yu, Qingyue Wei, Xiangde Luo, Yutong Xie, Ehsan Adeli, Yan Wang, Matthew Lungren, Lei Xing, Le Lu, Alan Yuille, Yuyin Zhou",
// null,
// "https://arxiv.org/abs/2310.07781",
// "@article{chen2023transunet,<br>" +
// " title = {3D TransUNet: Advancing Medical Image Segmentation through Vision Transformers},<br>" +
// " author = {Chen, Jieneng and Mei, Jieru and Li, Xianhang and Lu, Yongyi and Yu, Qihang and Wei, Qingyue and Luo, Xiangde and Xie, Yutong and Adeli, Ehsan and Wang, Yan and Lungren, Matthew and Xing, Lei and Lu, Le and Yuille, Alan and Zhou, Yuyin},<br>" +
// " journal = {arXiv preprint arXiv:2310.07781},<br>" +
// " year = {2023}<br>}",
// "Medical image segmentation plays a crucial role in advancing healthcare systems for disease diagnosis and treatment planning. The u-shaped architecture, popularly known as U-Net, has proven highly successful for various medical image segmentation tasks. However, U-Net's convolution-based operations inherently limit its ability to model long-range dependencies effectively. To address these limitations, researchers have turned to Transformers, renowned for their global self-attention mechanisms, as alternative architectures. Our previous TransUNet, which leverages Transformers' self-attention to complement U-Net's localized information with the global context, is now extended to a 3D network. This is achieved by building upon the state-of-the-art nnU-Net architecture, fully exploring Transformers' potential in both the encoder and decoder design. We introduce a Transformer encoder for tokenizing image patches and a Transformer decoder for adaptively refining candidate regions. The Transformer encoder excels in multi-organ segmentation, while the Transformer decoder is more beneficial for small and challenging segmented targets such as tumor segmentation. Our extensive experiments showcase the significant potential of integrating a Transformer-based encoder and decoder into the u-shaped medical image segmentation architecture, with TransUNet outperforming competitors in various medical applications.",
// "https://arxiv.org/abs/2310.07781",
// "https://github.com/Beckschen/3D-TransUNet"
// )
// add_paper("BiomedGPT: A Unified and Generalist Biomedical Generative Pre-trained Transformer for Vision, Language, and Multimodal Tasks",
// "Kai Zhang, Jun Yu, Zhiling Yan, Yixin Liu, Eashan Adhikarla, Sunyang Fu, Xun Chen, Chen Chen, Yuyin Zhou, Xiang Li, Lifang He, Brian D Davison, Quanzheng Li, Yong Chen, Hongfang Liu, Lichao Sun",
// null,
// "https://arxiv.org/abs/2305.17100",
// "@article{zhang2023biomedgpt,<br>" +
// " title = {BiomedGPT: A Unified and Generalist Biomedical Generative Pre-trained Transformer for Vision, Language, and Multimodal Tasks},<br>" +
// " author = {Zhang, Kai and Yu, Jun and Yan, Zhiling and Liu, Yixin and Adhikarla, Eashan and Fu, Sunyang and Chen, Xun and Chen, Chen and Zhou, Yuyin and Li, Xiang and He, Lifang and Davison, Brian D and Li, Quanzheng and Chen, Yong and Liu, Hongfang and Sun, Lichao},<br>" +
// " journal = {arXiv preprint arXiv:2305.17100}<br>" +
// " year = {2023},<br>",
// "In this paper, we introduce a unified and generalist Biomedical Generative Pre-trained Transformer (BiomedGPT) model, which leverages self-supervision on large and diverse datasets to accept multi-modal inputs and perform a range of downstream tasks. Our experiments demonstrate that BiomedGPT delivers expansive and inclusive representations of biomedical data, outperforming the majority of preceding state-of-the-art models across five distinct tasks with 20 public datasets spanning over 15 unique biomedical modalities. Through the ablation study, we also showcase the efficacy of our multi-modal and multi-task pretraining approach in transferring knowledge to previously unseen data. Overall, our work presents a significant step forward in developing unified and generalist models for biomedicine, with far-reaching implications for improving healthcare outcomes.",
// "https://arxiv.org/abs/2305.17100",
// )
add_paper("Distribution Aligned Diffusion and Prototype-guided network for Unsupervised Domain Adaptive Segmentation",
"Haipeng Zhou, Lei Zhu, Yuyin Zhou",
// "arxiv, 2023",
null,
"https://arxiv.org/abs/2303.12313",
"@article{zhou2023distribution,<br>" +
" title={Distribution Aligned Diffusion and Prototype-guided network for Unsupervised Domain Adaptive Segmentation},<br>" +
" author={Zhou, Haipeng and Zhu, Lei and Zhou, Yuyin},<br>" +
" journal = {arXiv preprint arXiv:2303.12313},<br>" +
" year={2023}<br>}",
"The Diffusion Probabilistic Model (DPM) has emerged as a highly effective generative model in the field of computer vision. Its intermediate latent vectors offer rich semantic information, making it an attractive option for various downstream tasks such as segmentation and detection. In order to explore its potential further, we have taken a step forward and considered a more complex scenario in the medical image domain, specifically, under an unsupervised adaptation condition. To this end, we propose a Diffusion-based and Prototype-guided network (DP-Net) for unsupervised domain adaptive segmentation. Concretely, our DP-Net consists of two stages: 1) Distribution Aligned Diffusion (DADiff), which involves training a domain discriminator to minimize the difference between the intermediate features generated by the DPM, thereby aligning the inter-domain distribution; and 2) Prototype-guided Consistency Learning (PCL), which utilizes feature centroids as prototypes and applies a prototype-guided loss to ensure that the segmentor learns consistent content from both source and target domains. Our approach is evaluated on fundus datasets through a series of experiments, which demonstrate that the performance of the proposed method is reliable and outperforms state-of-the-art methods. Our work presents a promising direction for using DPM in complex medical image scenarios, opening up new possibilities for further research in medical imaging.",
"https://arxiv.org/abs/2303.12313"
)
add_paper("Bag of Tricks for FGSM Adversarial Training",
"Zichao Li, Li Liu, Zeyu Wang, Yuyin Zhou, Cihang Xie",
null,
"https://arxiv.org/abs/2209.02684",
"@article{li2022bag,<br>" +
" title = {Bag of Tricks for FGSM Adversarial Training},<br>" +
" author = {Li, Zichao and Liu, Li and Wang, Zeyu and Zhou, Yuyin and Xie, Cihang},<br>" +
" journal = {arXiv preprint arXiv:2209.02684},<br>" +
" year = {2022}<br>}",
"Adversarial training (AT) with samples generated by Fast Gradient Sign Method (FGSM), also known as FGSM-AT, is a computationally simple method to train robust networks. However, during its training procedure, an unstable mode of “catastrophic overfitting” has been identified in [Wong et al., 2020], where the robust accuracy abruptly drops to zero within a single training step. Existing methods use gradient regularizers or random initialization tricks to attenuate this issue, whereas they either take high computational cost or lead to lower robust accuracy. In this work, we provide the first study, which thoroughly examines a collection of tricks from three perspectives: Data Initialization, Network Structure, and Optimization, to overcome the catastrophic overfitting in FGSM-AT. Surprisingly, we find that simple tricks, i.e., a) masking partial pixels (even without randomness), b) setting a large convolution stride and smooth activation functions, or c) regularizing the weights of the first convolutional layer, can effectively tackle the overfitting issue. Extensive results on a range of network architectures validate the effectiveness of each proposed trick, and the combinations of tricks are also investigated. For example, trained with PreActResNet-18 on CIFAR-10, our method attains 49.8% accuracy against PGD-50 attacker and 46.4% accuracy against AutoAttack, demonstrating that pure FGSM-AT is capable of enabling robust learners. The code and models are publicly available at https://github. com/UCSC-VLAA/Bag-of-Tricks-for-FGSM-AT.",
"https://arxiv.org/abs/2209.02684",
"https://github. com/UCSC-VLAA/Bag-of-Tricks-for-FGSM-AT"
)
add_paper("The FELIX Project: Deep Networks To Detect Pancreatic Neoplasms",
"Yingda Xia, Qihang Yu, Linda Chu, Satomi Kawamoto, Seyoun Park, Fengze Liu, Jieneng Chen, Zhuotun Zhu, Bowen Li, Zongwei Zhou, Yongyi Lu, Yan Wang, Wei Shen, Lingxi Xie, Yuyin Zhou, Christopher Wolfgang, Ammar Javed, Daniel Fadaei Fouladi, Shahab Shayesteh, Jefferson Graves, Alejandra Blanco, Eva S Zinreich, Benedict Kinny-Köster, Kenneth Kinzler, Ralph H Hruban, Bert Vogelstein, Alan Yuille, Elliot K Fishman",
null,
"https://www.medrxiv.org/content/10.1101/2022.09.24.22280071v1",
"@article{xia2022felix,<br>" +
" title = {The FELIX Project: Deep Networks To Detect Pancreatic Neoplasms},<br>" +
" author = {Xia, Yingda and Yu, Qihang and Chu, Linda and Kawamoto, Satomi and Park, Seyoun and Liu, Fengze and Chen, Jieneng and Zhu, Zhuotun and Li, Bowen and Zhou, Zongwei and others},<br>" +
" journal = {medRxiv},<br>" +
" year = {2022},<br>",
"Tens of millions of abdominal images are performed with computed tomography (CT) in the U.S. each year but pancreatic cancers are sometimes not initially detected in these images. We here describe a suite of algorithms (named FELIX) that can recognize pancreatic lesions from CT images without human input. Using FELIX, >90% of patients with pancreatic ductal adenocarcinomas were detected at a specificity of >90% in patients without pancreatic disease. FELIX may be able to assist radiologists in identifying pancreatic cancers earlier, when surgery and other treatments offer more hope for long-term survival.",
"https://www.medrxiv.org/content/10.1101/2022.09.24.22280071v1"
)
// 2021 preprint
add_paper("Radfusion: Benchmarking performance and fairness for multimodal pulmonary embolism detection from ct and ehr",
"Yuyin Zhou, Shih-Cheng Huang, Jason Alan Fries, Alaa Youssef, Timothy J Amrhein, Marcello Chang, Imon Banerjee, Daniel Rubin, Lei Xing, Nigam Shah, Matthew P Lungren",
null,
"https://arxiv.org/abs/2111.11665",
"@article{zhou2021radfusion,<br>" +
" title = {Radfusion: Benchmarking performance and fairness for multimodal pulmonary embolism detection from ct and ehr},<br>" +
" author = {Zhou, Yuyin and Huang, Shih-Cheng and Fries, Jason Alan and Youssef, Alaa and Amrhein, Timothy J and Chang, Marcello and Banerjee, Imon and Rubin, Daniel and Xing, Lei and Shah, Nigam and others},<br>" +
" journal = {arXiv preprint arXiv:2111.11665},<br>" +
" year = {2021}<br>}",
"Despite the routine use of electronic health record (EHR) data by radiologists to contextualize clinical history and inform image interpretation, the majority of deep learning architectures for medical imaging are unimodal, i.e., they only learn features from pixel-level information. Recent research revealing how race can be recovered from pixel data alone highlights the potential for serious biases in models which fail to account for demographics and other key patient attributes. Yet the lack of imaging datasets which capture clinical context, inclusive of demographics and longitudinal medical history, has left multimodal medical imaging underexplored. To better assess these challenges, we present RadFusion, a multimodal, benchmark dataset of 1794 patients with corresponding EHR data and high-resolution computed tomography (CT) scans labeled for pulmonary embolism. We evaluate several representative multimodal fusion models and benchmark their fairness properties across protected subgroups, e.g., gender, race/ethnicity, age. Our results suggest that integrating imaging and EHR data can improve classification performance and robustness without introducing large disparities in the true positive rate between population groups.",
"https://arxiv.org/abs/2111.11665"
)
// add_paper("Transunet: Transformers make strong encoders for medical image segmentation",
// "Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan Yuille, Yuyin Zhou",
// // "arXiv, 2021",
// null,
// "https://arxiv.org/abs/2102.04306",
// "@article{chen2021transunet,<br>" +
// " title = {Transunet: Transformers make strong encoders for medical image segmentation},<br>" +
// " author = {Chen, Jieneng and Lu, Yongyi and Yu, Qihang and Luo, Xiangde and Adeli, Ehsan and Wang, Yan and Lu, Le and Yuille, Alan and Zhou, Yuyin},<br>" +
// " journal = {arXiv preprint arXiv:2102.04306},<br>" +
// " year = {2021}<br>}",
// "Medical image segmentation is an essential prerequisite for developing healthcare systems, especially for disease diagnosis and treatment planning. On various medical image segmentation tasks, the u-shaped architecture, also known as U-Net, has become the de-facto standard and achieved tremendous success. However, due to the intrinsic locality of convolution operations, U-Net generally demonstrates limitations in explicitly modeling long-range dependency. Transformers, designed for sequence-to-sequence prediction, have emerged as alternative architectures with innate global self-attention mechanisms, but can result in limited localization abilities due to insufficient low-level details. In this paper, we propose TransUNet, which merits both Transformers and U-Net, as a strong alternative for medical image segmentation. On one hand, the Transformer encodes tokenized image patches from a convolution neural network (CNN) feature map as the input sequence for extracting global contexts. On the other hand, the decoder upsamples the encoded features which are then combined with the high-resolution CNN feature maps to enable precise localization. We argue that Transformers can serve as strong encoders for medical image segmentation tasks, with the combination of U-Net to enhance finer details by recovering localized spatial information. TransUNet achieves superior performances to various competing methods on different medical applications including multi-organ segmentation and cardiac segmentation. Code and models are available at https://github.com/Beckschen/TransUNet.",
// "https://arxiv.org/abs/2102.04306",
// "https://github.com/Beckschen/TransUNet"
// )
add_paper("Can temporal information help with contrastive self-supervised learning?",
"Yutong Bai, Haoqi Fan, Ishan Misra, Ganesh Venkatesh, Yongyi Lu, Yuyin Zhou, Qihang Yu, Vikas Chandra, Alan Yuille",
null,
"https://arxiv.org/abs/2011.13046",
"@article{bai2020can,<br>" +
" title = {Can temporal information help with contrastive self-supervised learning?},<br>" +
" author = {Bai, Yutong and Fan, Haoqi and Misra, Ishan and Venkatesh, Ganesh and Lu, Yongyi and Zhou, Yuyin and Yu, Qihang and Chandra, Vikas and Yuille, Alan},<br>" +
" journal = {arXiv preprint arXiv:2011.13046},<br>" +
" year = {2020}<br>" ,
"Leveraging temporal information has been regarded as essential for developing video understanding models. However, how to properly incorporate temporal information into the recent successful instance discrimination based contrastive self-supervised learning (CSL) framework remains unclear. As an intuitive solution, we find that directly applying temporal augmentations does not help, or even impair video CSL in general. This counter-intuitive observation motivates us to re-design existing video CSL frameworks, for better integration of temporal knowledge. To this end, we present Temporal-aware Contrastive self-supervised learningTaCo, as a general paradigm to enhance video CSL. Specifically, TaCo selects a set of temporal transformations not only as strong data augmentation but also to constitute extra self-supervision for video understanding. By jointly contrasting instances with enriched temporal transformations and learning these transformations as self-supervised signals, TaCo can significantly enhance unsupervised video representation learning. For instance, TaCo demonstrates consistent improvement in downstream classification tasks over a list of backbones and CSL approaches. Our best model achieves 85.1% (UCF-101) and 51.6% (HMDB-51) top-1 accuracy, which is a 3% and 2.4% relative improvement over the previous state-of-the-art.",
"https://arxiv.org/abs/2011.13046"
)
add_paper("Smooth Adversarial Training",
"Cihang Xie, Mingxing Tan, Boqing Gong, Alan Yuille, Quoc Le",
null,
"https://arxiv.org/abs/2006.14536",
"@article{xie2020smooth,<br>" +
" title = {Smooth adversarial training},<br>" +
" author = {Xie, Cihang and Tan, Mingxing and Gong, Boqing and Yuille, Alan and Le, Quoc V},<br>" +
" journal = {arXiv preprint arXiv:2006.14536},<br>" +
" year = {2020}<br>}",
"It is commonly believed that networks cannot be both accurate and robust, that gaining robustness means losing accuracy. It is also generally believed that, unless making networks larger, network architectural elements would otherwise matter little in improving adversarial robustness. Here we present evidence to challenge these common beliefs by a careful study about adversarial training. Our key observation is that the widely-used ReLU activation function significantly weakens adversarial training due to its non-smooth nature. Hence we propose smooth adversarial training (SAT), in which we replace ReLU with its smooth approximations to strengthen adversarial training. The purpose of smooth activation functions in SAT is to allow it to f ind harder adversarial examples and compute better gradient updates during adversarial training. Compared to standard adversarial training, SAT improves adversarial robustness for “free”, i.e., no drop in accuracy and no increase in computational cost. For example, without introducing additional computations, SAT significantly enhances ResNet-50’s robustness from 33.0% to 42.3%, while also improving accuracy by 0.9% on ImageNet. SAT also works well with larger networks: it helps EfficientNet-L1 to achieve 82.2% accuracy and 58.6% robustness on ImageNet, outperforming the previous state-ofthe-art defense by 9.5% for accuracy and 11.6% for robustness. Models are available at https://github.com/ cihangxie/SmoothAdversarialTraining.",
"https://arxiv.org/abs/2006.14536",
"https://github.com/cihangxie/SmoothAdversarialTraining"
)
</script>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.0/jquery.min.js"></script>
</details>
<!-- --------------------------------------Publications -------------------->
<script>
paper_count = paper_count
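// `paper_count` is assumed to carry over from the publication entries rendered by the
// earlier <script> block on this page, so the bib/abstract toggle ids stay unique.
/**
 * Appends one publication entry to the current list via document.write.
 * Only `title` and `authors` are used by every call on this page; each later
 * argument may be null and is then simply skipped.
 *
 * @param {string}  title       Paper title (wrapped in a link when `link` is given).
 * @param {string}  authors     Comma-separated author list.
 * @param {?string} conference  Venue label, e.g. "CVPR, 2025"; null for preprints.
 * @param {?string} link        URL attached to the title.
 * @param {?string} bib         BibTeX string (with <br> line breaks) shown by the "bib" toggle.
 * @param {?string} abstract    Abstract text shown by the "abstract" toggle.
 * @param {?string} arxiv_link  URL for the "arxiv" badge.
 * @param {?string} code        URL for the "code/models" badge.
 * @param {?string} press       URL for the "press" badge.
 * @param {?string} slides      URL for the "slides/poster" badge.
 * @param {?string} talk        URL for the "talk" badge.
 * @param {?string} msg         Optional italic note rendered under the entry.
 *
 * The bib/abstract toggles rely on a `copy(target, source)` helper that is expected
 * to be defined elsewhere on this page.
 */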
function add_paper(title, authors, conference, link, bib, abstract, arxiv_link, code, press, slides, talk, msg) {
list_entry = "<li style=\"font-size:18px\">"
if (link != null)
list_entry += "<a href=\"" + link + "\">"
list_entry += "<b>" + title + "</b>"
if (link != null)
list_entry += "</a>"
list_entry += "<br>" + authors + ".<br>"
if (conference != null)
list_entry+= "<i><font color=\" #707070\">" + conference + "</font></i>.</li>"
if (bib != null) {
list_entry += "<div id=\"bib" + paper_count + "\" style=\"display:none\">" + bib + "</div>"
list_entry += "<a href=\"javascript:copy(div" + paper_count + ",bib" + paper_count + ")\"> <span class=\"label label-success\">bib</span></a>"
}
if (abstract != null) {
list_entry += "<div id=\"abstract" + paper_count + "\" style=\"display:none\">" + abstract + "</div>"
list_entry += "<a href=\"javascript:copy(div" + paper_count + ",abstract" + paper_count + ")\"> <span class=\"label label-warning\">abstract</span></a>"
}
if (arxiv_link != null)
list_entry += " <a href=\"" + arxiv_link + "\"><span class=\"label label-primary\">arxiv</span></a>"
if (code != null)
list_entry += " <a href=\"" + code + "\"><span class=\"label label-danger\">code/models</span></a>"
if (press != null)
list_entry += " <a href=\"" + press + "\"><span class=\"label label-success\">press</span></a>"
if (slides != null)
list_entry += " <a href=\"" + slides + "\"><span class=\"label label-info\">slides/poster</span></a>"
if (talk != null)
list_entry += " <a href=\"" + talk + "\"><span class=\"label label-default\">talk</span></a>"
list_entry += "<br>"
if (msg != null)
list_entry += "<i>" + msg + "</i>"
list_entry += "<div id=\"div" + paper_count + "\" style=\"font-size:15px\"></div><br>"
document.write(list_entry)
paper_count += 1
}
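// Illustrative call only (kept commented out, like the retired entries above): the title,
// authors, and URLs below are placeholders rather than a real publication. It simply shows
// the full argument order, including the trailing press/slides/talk/msg slots that the
// live entries on this page leave unused.
// add_paper("Example Paper Title",
//     "First Author, Second Author",
//     "Venue, 2025",
//     "https://example.org/paper",
//     "@article{example2025,<br> title = {Example Paper Title},<br> year = {2025}<br>}",
//     "One-sentence example abstract.",
//     "https://arxiv.org/abs/0000.00000",
//     "https://github.com/example/repo",
//     null,  // press
//     null,  // slides/poster
//     null,  // talk
//     "Optional note shown in italics under the entry"
// )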
document.write("</ul>")
document.write("<ul>")
document.write("</ul><br>")
document.write("<h1>2025</h1>")
document.write("<ul>")
add_paper("Mamba-R: Vision Mamba ALSO Needs Registers",
"Feng Wang, Jiahao Wang, Sucheng Ren, Guoyizhe Wei, Jieru Mei, Wei Shao, Yuyin Zhou, Alan Yuille, Cihang Xie",
"CVPR, 2025",
"https://arxiv.org/abs/2405.14858",
"@article{wang2024mambar,<br>" +
" title = {Mamba-R: Vision Mamba also needs registers},<br>" +
" author = {Wang, Feng and Wang, Jiahao and Ren, Sucheng and Wei, Guoyizhe and Mei, Jieru and Shao, Wei and Zhou, Yuyin and Yuille, Alan and Xie, Cihang},<br>" +
" journal = {CVPR},<br>" +
" year = {2025},<br>",
"Similar to Vision Transformers, this paper identifies artifacts also present within the feature maps of Vision Mamba. These artifacts, corresponding to high-norm tokens emerging in low-information background areas of images, appear much more severe in Vision Mamba -- they exist prevalently even with the tiny-sized model and activate extensively across background regions. To mitigate this issue, we follow the prior solution of introducing register tokens into Vision Mamba. To better cope with Mamba blocks' uni-directional inference paradigm, two key modifications are introduced: 1) evenly inserting registers throughout the input token sequence, and 2) recycling registers for final decision predictions. We term this new architecture Mamba-R. Qualitative observations suggest, compared to vanilla Vision Mamba, Mamba-R's feature maps appear cleaner and more focused on semantically meaningful regions. Quantitatively, Mamba-R attains stronger performance and scales better. For example, on the ImageNet benchmark, our base-size Mamba-R attains 82.9% accuracy, significantly outperforming Vim-B's 81.8%; furthermore, we provide the first successful scaling to the large model size (i.e., with 341M parameters), attaining a competitive accuracy of 83.2% (84.5% if finetuned with 384x384 inputs). Additional validation on the downstream semantic segmentation task also supports Mamba-R's efficacy.",
"https://arxiv.org/abs/2405.14858",
"https://wangf3014.github.io/mambar-page/"
)
add_paper("Causal Image Modeling for Efficient Visual Understanding",
"Feng Wang, Timing Yang, Yaodong Yu, Sucheng Ren, Guoyizhe Wei, Angtian Wang, Wei Shao, Yuyin Zhou, Alan Yuille, Cihang Xie",
"CVPR, 2025",
"https://arxiv.org/abs/2410.07599",
"@article{wang2024causalimagemodeling,<br>" +
" title = {Causal Image Modeling for Efficient Visual Understanding},<br>" +
" author = {Feng Wang, Timing Yang, Yaodong Yu, Sucheng Ren, Guoyizhe Wei, Angtian Wang, Wei Shao, Yuyin Zhou, Alan Yuille, Cihang Xie},<br>" +
" journal = {CVPR},<br>" +
" year = {2025},<br>",
"In this work, we present a comprehensive analysis of causal image modeling and introduce the Adventurer series models where we treat images as sequences of patch tokens and employ uni-directional language models to learn visual representations. This modeling paradigm allows us to process images in a recurrent formulation with linear complexity relative to the sequence length, which can effectively address the memory and computation explosion issues posed by high-resolution and fine-grained images. In detail, we introduce two simple designs that seamlessly integrate image inputs into the causal inference framework: a global pooling token placed at the beginning of the sequence and a flipping operation between every two layers. Extensive empirical studies demonstrate the significant efficiency and effectiveness of this causal image modeling paradigm. For example, our base-sized Adventurer model attains a competitive test accuracy of 84.0% on the standard ImageNet-1k benchmark with 216 images/s training throughput, which is 5.3 times more efficient than vision transformers to achieve the same result.",
"https://arxiv.org/abs/2410.07599",
"https://github.com/wangf3014/Adventurer"
)
add_paper("Generative Image Layer Decomposition with Visual Effects",
"Jinrui Yang, Qing Liu, Yijun Li, Soo Ye Kim, Daniil Pakhomov, Mengwei Ren, Jianming Zhang, Zhe Lin, Cihang Xie, Yuyin Zhou",
"CVPR, 2025",
"https://arxiv.org/abs/2411.17864",
"@article{yang2024generative,<br>" +
" title = {Generative Image Layer Decomposition with Visual Effects},<br>" +
" author = {Jinrui Yang, Qing Liu, Yijun Li, Soo Ye Kim, Daniil Pakhomov, Mengwei Ren, Jianming Zhang, Zhe Lin, Cihang Xie, Yuyin Zhou},<br>" +
" journal = {CVPR},<br>" +
" year = {2025},<br>",
"Recent advancements in large generative models, particularly diffusion-based methods, have significantly enhanced the capabilities of image editing. However, achieving precise control over image composition tasks remains a challenge. Layered representations, which allow for independent editing of image components, are essential for user-driven content creation, yet existing approaches often struggle to decompose image into plausible layers with accurately retained transparent visual effects such as shadows and reflections. We propose a generative framework for image layer decomposition which outputs photorealistic clean backgrounds and high-quality transparent foregrounds with faithfully preserved visual effects. To enable effective training, we first introduce a dataset preparation pipeline that automatically scales up simulated multi-layer data with synthesized visual effects. To further enhance real-world applicability, we supplement this simulated dataset with camera-captured images containing natural visual effects. Additionally, we propose a consistency loss which enforces the model to learn accurate representations for the transparent foreground layer when ground-truth annotations are not available. Our method achieves superior quality in layer decomposition, outperforming existing approaches in object removal and spatial editing tasks across several benchmarks and multiple user studies, unlocking various creative possibilities for layer-wise image editing. The project page is https://rayjryang.github.io/LayerDecomp.",
"https://arxiv.org/abs/2411.17864",
"https://rayjryang.github.io/LayerDecomp"
)
add_paper("HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing",
"Mude Hui, Siwei Yang, Bingchen Zhao, Yichun Shi, Heng Wang, Peng Wang, Cihang Xie, Yuyin Zhou",
"ICLR, 2025",
"https://arxiv.org/abs/2404.09990",
"@article{hui2024hqedit,<br>" +
" title = {HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing},<br>" +
" author = {Hui, Mude and Yang, Siwei and Zhao, Bingchen and Shi, Yichun and Wang, Heng and Wang, Peng and Xie, Cihang and Zhou, Yuyin},<br>" +
" journal = {ICLR},<br>" +
" year = {2025},<br>",
"This study introduces HQ-Edit, a high-quality instruction-based image editing dataset with around 200,000 edits. Unlike prior approaches relying on attribute guidance or human feedback on building datasets, we devise a scalable data collection pipeline leveraging advanced foundation models, namely GPT-4V and DALL-E 3. To ensure its high quality, diverse examples are first collected online, expanded, and then used to create high-quality diptychs featuring input and output images with detailed text prompts, followed by precise alignment ensured through post-processing. In addition, we propose two evaluation metrics, Alignment and Coherence, to quantitatively assess the quality of image edit pairs using GPT-4V. HQ-Edits high-resolution images, rich in detail and accompanied by comprehensive editing prompts, substantially enhance the capabilities of existing image editing models. For example, an HQ-Edit finetuned InstructPix2Pix can attain state-of-the-art image editing performance, even surpassing those models fine-tuned with human-annotated data. The project page is https://thefllood.github.io/HQEdit_web/",
"https://arxiv.org/abs/2404.09990",
"https://thefllood.github.io/HQEdit_web/"
)
add_paper("Autoregressive Pretraining with Mamba in Vision",
"Sucheng Ren, Xianhang Li, Haoqin Tu, Feng Wang, Fangxun Shu, Lei Zhang, Jieru Mei, Linjie Yang, Peng Wang, Heng Wang, Alan Yuille, Cihang Xie",
"ICLR, 2025",
"https://arxiv.org/abs/2406.07537",
"@article{ren2024autoregressive,<br>" +
" title = {Autoregressive Pretraining with Mamba in Vision},<br>" +
" author = {Ren, Sucheng and Li, Xianhang and Tu, Haoqin and Wang, Feng and Shu, Fangxun and Zhang, Lei and Mei, Jieru and Yang, Linjie and Wang, Peng and Wang, Heng and Yuille, Alan and Xie, Cihang},<br>" +
" journal = {ICLR},<br>" +
" year = {2025},<br>",
"The vision community has started to build with the recently developed state space model, Mamba, as the new backbone for a range of tasks. This paper shows that Mamba's visual capability can be significantly enhanced through autoregressive pretraining, a direction not previously explored. Efficiency-wise, the autoregressive nature can well capitalize on the Mamba's unidirectional recurrent structure, enabling faster overall training speed compared to other training strategies like mask modeling. Performance-wise, autoregressive pretraining equips the Mamba architecture with markedly higher accuracy over its supervised-trained counterparts and, more importantly, successfully unlocks its scaling potential to large and even huge model sizes. For example, with autoregressive pretraining, a base-size Mamba attains 83.2% ImageNet accuracy, outperforming its supervised counterpart by 2.0%; our huge-size Mamba, the largest Vision Mamba to date, attains 85.0% ImageNet accuracy (85.5% when finetuned with 384x384 inputs), notably surpassing all other Mamba variants in vision. The code is available at https://github.com/OliverRensu/ARM.",
"https://arxiv.org/abs/2406.07537",
"https://github.com/OliverRensu/ARM"
)
add_paper("MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine",
"Yunfei Xie, Ce Zhou, Lang Gao, Juncheng Wu, Xianhang Li, Hong-Yu Zhou, Sheng Liu, Lei Xing, James Zou, Cihang Xie, Yuyin Zhou",
"ICLR, 2025",
"https://arxiv.org/abs/2408.02900",
"@article{xie2024medtrinity,<br>" +
" title = {MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine},<br>" +
" author = {Yunfei Xie, Ce Zhou, Lang Gao, Juncheng Wu, Xianhang Li, Hong-Yu Zhou, Sheng Liu, Lei Xing, James Zou, Cihang Xie, Yuyin Zhou},<br>" +
" journal = {ICLR},<br>" +
" year = {2025},<br>",
"This paper introduces MedTrinity-25M, a comprehensive, large-scale multimodal dataset for medicine, covering over 25 million images across 10 modalities, with multigranular annotations for more than 65 diseases. These enriched annotations encompass both global textual information, such as disease/lesion type, modality, region-specific descriptions, and inter-regional relationships, as well as detailed local annotations for regions of interest (ROIs), including bounding boxes, segmentation masks. Unlike existing approach which is limited by the availability of image-text pairs, we have developed the first automated pipeline that scales up multimodal data by generating multigranular visual and texual annotations (in the form of image-ROI-description triplets) without the need for any paired text descriptions. Specifically, data from over 90 different sources have been collected, preprocessed, and grounded using domain-specific expert models to identify ROIs related to abnormal regions. We then build a comprehensive knowledge base and prompt multimodal large language models to perform retrieval-augmented generation with the identified ROIs as guidance, resulting in multigranular texual descriptions. Compared to existing datasets, MedTrinity-25M provides the most enriched annotations, supporting a comprehensive range of multimodal tasks such as captioning and report generation, as well as vision-centric tasks like classification and segmentation. This dataset can be utilized to support large-scale pre-training of multimodal medical AI models, contributing to the development of future foundation models in the medical domain.",
"https://arxiv.org/abs/2408.02900",
"https://yunfeixie233.github.io/MedTrinity-25M/"
)
add_paper("A New Federated Learning Framework Against Gradient Inversion Attacks",
"Pengxin Guo, Shuang Zeng, Wenhao Chen, Xiaodan Zhang, Weihong Ren, Yuyin Zhou, Liangqiong Qu",
"AAAI, 2025",
"https://arxiv.org/abs/2412.07187",
"@inproceedings{guo2023new,<br>" +
" title = {A New Federated Learning Framework Against Gradient Inversion Attacks},<br>" +
" author = {Guo, Pengxin and Zeng, Shuang and Chen, Wenhao and Zhang, Xiaodan and Ren, Weihong and Zhou, Yuyin and Qu, Liangqiong},<br>" +
" booktitle = {AAAI},<br>" +
" year = {2025}<br>}",
"Federated Learning (FL) aims to protect data privacy by enabling clients to collectively train machine learning models without sharing their raw data. However, recent studies demonstrate that information exchanged during FL is subject to Gradient Inversion Attacks (GIA) and, consequently, a variety of privacy-preserving methods have been integrated into FL to thwart such attacks, such as Secure Multi-party Computing (SMC), Homomorphic Encryption (HE), and Differential Privacy (DP). Despite their ability to protect data privacy, these approaches inherently involve substantial privacy-utility trade-offs. By revisiting the key to privacy exposure in FL under GIA, which lies in the frequent sharing of model gradients that contain private data, we take a new perspective by designing a novel privacy preserve FL framework that effectively ``breaks the direct connection'' between the shared parameters and the local private data to defend against GIA. Specifically, we propose a Hypernetwork Federated Learning (HyperFL) framework that utilizes hypernetworks to generate the parameters of the local model and only the hypernetwork parameters are uploaded to the server for aggregation. Theoretical analyses demonstrate the convergence rate of the proposed HyperFL, while extensive experimental results show the privacy-preserving capability and comparable performance of HyperFL. Code is available at https://github.com/Pengxin-Guo/HyperFL.",
"https://arxiv.org/abs/2412.07187",
"https://github.com/Pengxin-Guo/HyperFL"
)
add_paper("ARVideo: Autoregressive Pretraining for Self-Supervised Video Representation Learning",
"Sucheng Ren, Hongru Zhu, Chen Wei, Yijiang Li, Alan Yuille, Cihang Xie",
"TMLR, 2025",
"https://arxiv.org/abs/2405.15160",
"@article{ren2024arvideo,<br>" +
" title = {ARVideo: Autoregressive Pretraining for Self-Supervised Video Representation Learning},<br>" +
" author = {Ren, Sucheng and Zhu, Hongru and Wei, Chen and Li, Yijiang and Yuille, Alan and Xie, Cihang},<br>" +
" journal = {TMLR},<br>" +
" year = {2025}<br>}",
"This paper presents a new self-supervised video representation learning framework, ARVideo, which autoregressively predicts the next video token in a tailored sequence order. Two key designs are included. First, we organize autoregressive video tokens into clusters that span both spatially and temporally, thereby enabling a richer aggregation of contextual information compared to the standard spatial-only or temporal-only clusters. Second, we adopt a randomized spatiotemporal prediction order to facilitate learning from multi-dimensional data, addressing the limitations of a handcrafted spatial-first or temporal-first sequence order. Extensive experiments establish ARVideo as an effective paradigm for self-supervised video representation learning. For example, when trained with the ViT-B backbone, ARVideo competitively attains 81.2% on Kinetics-400 and 70.9% on Something-Something V2, which are on par with the strong benchmark set by VideoMAE. Importantly, ARVideo also demonstrates higher training efficiency, i.e., it trains 14% faster and requires 58% less GPU memory compared to VideoMAE.",
"https://arxiv.org/abs/2405.15160",
)
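// The two ARVideo design choices described above (token clusters that span space and
// time, and a randomized prediction order) are easy to picture with a small toy. The
// sketch below is illustrative only, not the ARVideo code: the cluster sizes `ct`/`cs`
// and the plain Fisher-Yates shuffle are assumptions made for this example.
function toyARVideoOrder(T, H, W, ct, cs) {
  // Group token coordinates (t, h, w) into clusters covering ct frames and a
  // cs-by-cs spatial window, so each cluster spans both space and time.
  const clusters = new Map();
  for (let t = 0; t < T; t++)
    for (let h = 0; h < H; h++)
      for (let w = 0; w < W; w++) {
        const key = `${Math.floor(t / ct)}-${Math.floor(h / cs)}-${Math.floor(w / cs)}`;
        if (!clusters.has(key)) clusters.set(key, []);
        clusters.get(key).push([t, h, w]);
      }
  // Visit the clusters in a randomized order instead of a fixed spatial-first or
  // temporal-first raster scan.
  const order = [...clusters.values()];
  for (let i = order.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [order[i], order[j]] = [order[j], order[i]];
  }
  return order.flat(); // token order an autoregressive predictor would follow
}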
add_paper("AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation",
"Zijun Wang, Haoqin Tu, Jieru Mei, Bingchen Zhao, Yisen Wang, Cihang Xie",
"TMLR, 2025",
"https://arxiv.org/abs/2410.09040",
"@article{wang2024attngcg,<br>" +
" title = {AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation},<br>" +
" author = {Zijun Wang, Haoqin Tu, Jieru Mei, Bingchen Zhao, Yisen Wang, Cihang Xie},<br>" +
" journal = {TMLR},<br>" +
" year = {2025}<br>}",
"This paper studies the vulnerabilities of transformer-based Large Language Models (LLMs) to jailbreaking attacks, focusing specifically on the optimization-based Greedy Coordinate Gradient (GCG) strategy. We first observe a positive correlation between the effectiveness of attacks and the internal behaviors of the models. For instance, attacks tend to be less effective when models pay more attention to system prompts designed to ensure LLM safety alignment. Building on this discovery, we introduce an enhanced method that manipulates models' attention scores to facilitate LLM jailbreaking, which we term AttnGCG. Empirically, AttnGCG shows consistent improvements in attack efficacy across diverse LLMs, achieving an average increase of ~7% in the Llama-2 series and ~10% in the Gemma series. Our strategy also demonstrates robust attack transferability against both unseen harmful goals and black-box LLMs like GPT-3.5 and GPT-4. Moreover, we note our attention-score visualization is more interpretable, allowing us to gain better insights into how our targeted attention manipulation facilitates more effective jailbreaking. We release the code at https://github.com/UCSC-VLAA/AttnGCG-attack.",
"https://arxiv.org/abs/2410.09040",
"https://github.com/UCSC-VLAA/AttnGCG-attack"
)
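// The AttnGCG abstract above describes augmenting the GCG objective with an attention
// term. The function below is a purely illustrative sketch of that kind of combined
// score, not the released attack code: the fields `targetLoss`, `attention`,
// `systemTokenIds`, and the linear weighting are all assumptions made for this example.
function toyAttnGCGObjective(candidate, attnWeight) {
  // candidate.targetLoss: the usual GCG loss on the target continuation
  // candidate.attention: attention scores over prompt tokens at the final position
  // candidate.systemTokenIds: indices of the system-prompt tokens
  const systemAttn = candidate.systemTokenIds
    .reduce((sum, i) => sum + candidate.attention[i], 0);
  // Lower is better for the suffix search: elicit the target response while steering
  // attention away from the safety-oriented system prompt.
  return candidate.targetLoss + attnWeight * systemAttn;
}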
add_paper("SPFormer: Enhancing Vision Transformer with Superpixel Representation",
"Jieru Mei, Liang-Chieh Chen, Alan Yuille, Cihang Xie",
"TMLR, 2025",
"https://arxiv.org/abs/2401.02931",
"@article{mei2024spformer,<br>" +
" title = {SPFormer: Enhancing Vision Transformer with Superpixel Representation},<br>" +
" author = {Mei, Jieru and Chen, Liang-Chieh and Yuille, Alan and Xie, Cihang},<br>" +
" journal = {TMLR},<br>" +
" year = {2025}<br>}",
"In this work, we introduce SPFormer, a novel Vision Transformer enhanced by superpixel representation. Addressing the limitations of traditional Vision Transformers' fixed-size, non-adaptive patch partitioning, SPFormer employs superpixels that adapt to the image's content. This approach divides the image into irregular, semantically coherent regions, effectively capturing intricate details and applicable at both initial and intermediate feature levels. SPFormer, trainable end-to-end, exhibits superior performance across various benchmarks. Notably, it exhibits significant improvements on the challenging ImageNet benchmark, achieving a 1.4% increase over DeiT-T and 1.1% over DeiT-S respectively. A standout feature of SPFormer is its inherent explainability. The superpixel structure offers a window into the model's internal processes, providing valuable insights that enhance the model's interpretability. This level of clarity significantly improves SPFormer's robustness, particularly in challenging scenarios such as image rotations and occlusions, demonstrating its adaptability and resilience.",
"https://arxiv.org/abs/2401.02931"
)
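// SPFormer's core move, as summarized above, is to pool features over content-adaptive
// superpixels instead of fixed square patches. The helper below is a toy illustration,
// not the SPFormer implementation: the per-pixel feature array and the precomputed
// superpixel label map are assumed inputs for this example.
function toySuperpixelTokens(pixelFeatures, superpixelIds) {
  // pixelFeatures: one feature vector per pixel; superpixelIds: one integer label per pixel.
  const sums = new Map();
  pixelFeatures.forEach((feat, p) => {
    const id = superpixelIds[p];
    if (!sums.has(id)) sums.set(id, { total: feat.map(() => 0), count: 0 });
    const entry = sums.get(id);
    feat.forEach((v, d) => { entry.total[d] += v; });
    entry.count += 1;
  });
  // One token per superpixel: the mean feature of the pixels it covers, so each token
  // corresponds to an irregular, semantically coherent region.
  return [...sums.values()].map(e => e.total.map(v => v / e.count));
}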
add_paper("AQA-Bench: An Interactive Benchmark for Evaluating LLMs' Sequential Reasoning Ability",
"Siwei Yang, Bingchen Zhao, Cihang Xie",
"TMLR, 2025",
"https://arxiv.org/abs/2402.09404",
"@article{yang2024aqabench,<br>" +
" title = {AQA-Bench: An Interactive Benchmark for Evaluating LLMs' Sequential Reasoning Ability},<br>" +
" author = {Yang, Siwei and Zhao, Bingchen and Xie, Cihang},<br>" +
" journal = {TMLR},<br>" +