- `process.py` contains a quick script for turning the `example_video` into a very small tfrecord for pretraining.
- The dataset is available for academic use; please contact Rowan for access. We probably cannot release the videos (for legal reasons and to protect privacy). What we are releasing are annotations with the following fields:
  - `denoised`: a list of spans of `noisyasr` text, each cleaned up with a finetuned Grover model (the output is `cleanasr`). The perplexity of the context is under `ctx_ppl`.
  - `info`: a dictionary of metadata about the YouTube video
  - `subtitles`: each word, along with the approximate timestamp of when it was said in the video
  - `_te`: time elapsed (this isn't needed at all)
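
As a rough illustration of how these fields might be consumed, here is a minimal Python sketch. Only the key names `denoised`, `info`, `subtitles`, `_te`, `noisyasr`, `cleanasr`, and `ctx_ppl` come from the description above. The nested layout, the `word`, `time`, and `title` keys, the sample values, and both helper functions are assumptions made up for this example, so check them against a real annotation before relying on them.

```python
from typing import Any, Dict, List

# Hypothetical annotation, invented for illustration only. The real records
# may nest these fields differently; `word`, `time`, and `title` are assumed names.
example_annotation: Dict[str, Any] = {
    "denoised": [
        {
            "noisyasr": "so today were gonna make pasta",
            "cleanasr": "So today we're going to make pasta.",
            "ctx_ppl": 12.5,
        },
        {
            "noisyasr": "first boil the water",
            "cleanasr": "First, boil the water.",
            "ctx_ppl": 9.1,
        },
    ],
    "info": {"title": "Example cooking video"},
    "subtitles": [
        {"word": "so", "time": 0.4},
        {"word": "today", "time": 0.7},
        {"word": "first", "time": 3.2},
    ],
    "_te": 0.01,  # time elapsed; not needed downstream
}


def clean_transcript(annotation: Dict[str, Any]) -> str:
    """Join the cleaned-up text of every denoised span, in order."""
    return " ".join(span["cleanasr"] for span in annotation["denoised"])


def words_between(annotation: Dict[str, Any], start: float, end: float) -> List[str]:
    """Return the words whose (assumed) timestamp lies in [start, end) seconds."""
    return [
        entry["word"]
        for entry in annotation["subtitles"]
        if start <= entry["time"] < end
    ]


print(clean_transcript(example_annotation))
print(words_between(example_annotation, 0.0, 1.0))
```

The two helpers mirror the two views the annotations offer: `denoised` for readable text, and `subtitles` for aligning individual words with video time.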