Releases: allenai/olmocr
v0.1.58
v0.1.53
What's new
- Fixed git checks
Commits
08f7612 Bump version to v0.1.53 for release
58bdfa5 CI
25ec87b CI
c05e015 Hopefully CI runs now
15f9b8b Install poppler in CI
229da8c unused imports
32aa359 Formatting fix
0dcdbcc Update README.md
6583fb6 hfupload scripts
8297955 Making my parquets
51cfdbd Better converter
e369569 Update README.md
91eef27 Adding some gnarly 1 pager pdfs from kyle
87cb957 First pass at dataset builder script
6ed6f85 Generating parquets for hugging face
84c0c71 Merge branch 'main' of https://github.com/allenai/olmocr
7d67a59 Remove unused
6471f28 Random git ignores, remove unused code
c74d47a Pipeline fixes
04844b3 More beaker and docker fixes
9df86da Beaker fixes
cf6673c Pipeline fixes
7fbbb57 Remove mypy for now
d36e556 Hopefully fixes build
c69e0d6 More cleanup, removing dead adv anchor code
d4d711d Nicer glob handling for pipeline.py
84477b5 More formatting
e3d04ee Merge branch 'main' of https://github.com/allenai/olmocr into main
c37e545 running isort again
2c29533 Fixing most ruff errors
5690377 Ruff
fb40229 Isort and black update
cdb10a9 Python 3.11
dcaca8a Black formatting
4a1762d isort
0628d31 Some unit test cleanup
7d2403d More infos
8dd006d Merge branch 'main' of https://github.com/allenai/olmocr into main
04615d7 More logging on sglang server
0ccb99c readme
2e4ef95 Readme
2192505 Update README.md
9a1be7e Readme
496e162 Update README.md
b574766 Viewer and gitignore
86267d8 Viewer cleanup
a243c89 Update README.md
dbf6477 viewer fix
4c35105 More readme improvements
f16acec Readme improvements
dee494a Local file stuff
7882944 Local pdf support
dbe5487 Support stats feature later
48447b6 Can use remote s3 files, and local workspace now
50f9a6a Name refactor
e0afb93 Better check for separate sglang installation step
00e3aac Inference test for qwen2 and 2.5, work queue fixes, build current still broken
4d0d924 Merge branch 'main' of https://github.com/allenai/olmocr
b28aad6 More test docs
96ae2dd Refactoring
c606267 Cleaning up some unused code
d8c13d0 Readmes and version updates
b2894d0 Massive refactor from pdelfin to olmocr
7261bfc Update README.md
cbfc803 Merge pull request #27 from allenai/molmo
aa59d38 Merge branch 'main' of https://github.com/allenai/pdelfin
eacd044 csv output
201fec3 Config update
72d2fa2 Reviewing molmo training
0311b44 Some small updates
6586744 Building some data summary tools
c74e3d1 ELO stuff
18f72b4 New ELO building stuff finished up I think
50464c1 build elo v1
3a28955 Added ELO scores
a8d9a55 Fixes for elo
00f2a67 More elo scoring stuff
834e91c runelo start
ef4167d Test set script
683be68 Better error handling on expand_s3_glob
5e633e0 Merge branch 'main' of https://github.com/allenai/pdelfin
0d1fc08 Small fixes
2190f61 Merge branch 'main' of https://github.com/allenai/pdelfin
e2bbd0e Adding some long context stats
0b72eda Move form check into exception handler, don't mark the work item as done if it had an exception on it
fa318da New version with s3 fix in it
84c53c2 Merge branch 'main' of https://github.com/allenai/pdelfin
e9c3c21 Skipping files which are not found
3e33ce1 Ignores
37cdb9e Merge branch 'main' of https://github.com/allenai/pdelfin
1eda300 Dolma viewer niceties
fe04db8 Better error handling
35502bc Limit the number of retries on the server process
b3ca86a More robust to errors when reading logs which had caused freezes
d4f3cff More reliable weka
6872105 Merge branch 'main' of https://github.com/allenai/pdelfin
c93fc36 Missing import
dd17185 More things to try
46fe4ac Trying fixes for live lock
41accfe Error out if you see a broken process pool, might need a better check for this
a95487e Adding check for possible sglang livelock
cff9799 Moving to official sglang release
f8dcdf6 Better catching of httpx errors and retrying them
d6a0013 Faster init by caching pdf filter
a91befc Fix for fallback stuff
8c858a9 New version
66fff4f Merge branch 'main' of https://github.com/allenai/pdelfin
212d391 More conservative filtering
cb800d6 Merge branch 'main' of https://github.com/allenai/pdelfin into main
7dd2046 New version
af8ce51 Merge branch 'main' of https://github.com/allenai/pdelfin into main
9112d81 No keep alive connection to try to resolve sglang livelock
53a5104 Merge branch 'main' of https://github.com/allenai/pdelfin into main
67d11ec TODOs and client fix
3153aea Merge branch 'main' of https://github.com/allenai/pdelfin into main
9b8d58b Better stats and metadata
273a8b0 Logging fallback pages
b0acfa8 Adding support for fallback pages
204a4a8 Better stats
3ef4609 Fixing args
27d2352 Claude recommends httpx instead of aiohttp, seeing if that will help with straggler timeouts
4469f4b Version patch
9e2e09b More fixes
8793fc7 Adding more retries, and it was able to process more complicated books
2f55a3d fix
d4d4736 more gcs
e48d4be Fix
8c3b575 Gcs support better
9381bf8 docs
f287f24 Fixing a few stats things
e499413 Better work queue
04429b2 Basic work queue from claude
995b1d1 Fixes, mocking out queue into separate file
fcabb8e Handling more error cases
96984fc Fix a reliability issue
0af29f1 Adding page rotation
e2303f2 Running on l40s, fixing queue
68543d4 Adding stats
b4ca563 Decent set of todos for monday
2f1664f Stop everything on a Nan
eac3b10 allow weka from augusta through vpn
370dbba new build
9ce243e no weka on augusta
eefb045 Single cluster fix
2e1d0b6 Fix
748b095 Fix
80ba562 Fixing timeout situation
65763de Don't retry accessdenied errors
2c52664 Cleaner exit
77c82fd New version with aiohttp fixes
ae1e4bc More realistic results
770da2b Docker
bfe4211 Debugging timeout errors and other things
fd17652 Trying to make it faster
278422b Fixing one max context issue
62de9fe weka fix
9a1e82f Logging
fe0574c Cleanup code, s3 retries
2c7686f I think I have error handling better now
8217e49 Page calc
4eab90f Fixing bugs
b67d8e7 Fixing work queue population
827b77e Working on task groups
a58efea better logging
a9cf2e0 Allow setting beaker priority
41c8d55 exponential backoff
4dcf9ed more fixes
06331d7 Fix timeout
8e16780 Beaker stuff
4c3bf70 Beaker fixes
3172a1c Shuffling
fe3c9a2 Creds and other things
a3b6962 fix
83bb1dc Dockerfile fixes
6c9c785 Using version strings
9610eac Secrets management
39256c1 Beaker running
867e2c9 Docker builds
a091412 Starting to play with docker too
bce85e6 pipeline
a085e8c Beaker test
910c2eb Downloads from s3 based on hash
6598e2d Control http session at the worker level
fbacdd0 Stuff
ae9b1c4 Better stats
9ce28c0 Measuring metrics better now
193e521 Semaphore timeout
102c0e4 new version of sglang, server restarts, semaphore timeouts
918e2f3 Pipeline stuff
691cc5a A few items
4f2f4fd Quicker results by limited workers via semaphore while still utilizing gpu
6154095 Logging and perf stuff
ade3580 Fixes
732300a Some errors dealt with
24a9d23 Trying to get reliability up
fedda40 Small fixes
a9a94f2 Code to get stats
6b625b2 Bugfixes
9fb464c Refactoring to assemble docs
da1b23f Minor fixes
9ff107b Merge branch 'main' of https://github.com/allenai/pdelfin into main
299819e Reqs
9d51935 some cleanups
6590164 Starting to work
82ec249 Progress
37dc412 Working on script
e5fb7c0 Organization
ee72b36 Starting up server and workers async now
a39350e Reworking to be async
a103ce7 Some small things
b15bff6 Work queue coalescing
57186c7 Doing some more stuff
923231e exit handlers
051a7b4 Prepping work script
a65e12b Model download stuff
12a91ff Starting on a new approach
faf8659 Putting aside redis
3d6be3c Work queue sharing thing
75d4a0e Experimental beaker pipeline self organizing redis idea
a14febc sglang support for runeval
592cc50 More docs
03f5b25 Docs good now
d89ea6b docs
0362ce6 docs
b2b3f06 docs
46ccab3 More docs
93d7068 More docs
73bd961 Logger fix
3778228 More docs
ef2e4d6 Adding more docs
5ebc8cd Checkfix
9f010e6 Add check for poppler installation
be8fb28 Update README.md
426fda1 Removing some logs
500bd2d flash attn
d45b34f Trust remote code
cda0ad7 Config typo
cf3b377 train script
8f001bf Config updates
6a4a55f Hopefully working molmo HF trainer config
bede854 Starting to write molmo formatters
e65747e Some better logging
a0e0917 Merge branch 'main' of https://github.com/allenai/pdelfin into main
43aa4f2 Proper selection of LORA weights
bcb4794 Starting on molmo changes
232c445 Pipeline stability fixes hopefully and logging
ce2e4ba Applying rotation corrections
08d51b7 Adding some rotation retry control
7678f31 Fixing some reliability issues with the pipeline script
45269fa Switching to logging vs prints
a3e7654 Update all docs at once
062abff Adding some skip logic
8e6d0c6 switching to orjson, some better json error handling
48a3aff Reindexing
f13d0a5 List configs to list
ffe470b Fix
180dde0 dataprep sampling tests
64041bd Allow sampling different anchor text lens
6a22900 Allow for sampling anchor and other params
999f64d Adding empty anchor support
f8c5aac Some cleanup
a1a4798 Some crazy idea I had to simplify futures and memory limits
f6ac591 vllm benchmarker
4047258 Fixing one old bug to make update_static atomic
38dc5a2 Refactored to have a more efficient batchwriter, and also not allow too many running futures
d99096e Adding vllm profile script for reference
0a5c506 index
7c78676 Fix pipeline bug with indexing
31becaf S2orc dataset extractor
302eee3 Yay matches between birr and hf
f44dbd1 Small fixes
a482271 train more steps
c9ac48b Try to save at the last second only
9d35d3c Birr tokenization test
77f0b9f help text
7dbcbc1 Birr tests that don't do anything but help me understand the universe
492a3f6 Adding parameters for target image and anchor text sizes
1c8602c Removing rotation invalid ones to see what happens
dd4f967 Filter refactor
3ecbeae Trying save to s3 but with threaded saver
5ba78ed Fix
89fcff2 Fixing saving bug again
7d4cff5 Nice test for picking proper page in birrpipeline
a4d7620 Choosing proper page
529d51d Put LR back, need to save larger checkpoints to weka to prevent timeouts
e141c91 Try lora run higher LR
2826bca Yay all unit tests pass cleanly now too
124aaf5 Hmm, cant repro ...