2023-02-14

Experiment 7

What's new

Lots of data. Notably:

  • More community-contributed chat logs: We're up to ~6GB in raw submissions.
  • Instruction-following data: Flan, chain-of-thought, and other user-contributed instruction-following datasets.
  • Other fiction/roleplay data: These are still being processed and, since they're much smaller in quantity, will be added to the training data later (possibly at a higher learning rate, too).
  • Knowledge grounding data: To teach the model how to take any sort of external knowledge manually fed in at inference time (e.g. world info on Kobold, long-term memories fetched from a vector search database, internet search results) and convert it into a conversational response.

Training

Overview

As with the previous supervised fine-tuning runs, good hyperparameters were again tricky to find. I ended up doing a few sweeps over the weekend to settle on a decent configuration. In the end, I went with the following (sketched in code after this list):

  • Effective batch size of 256
  • Learning rate of 0.98e-5
  • Constant learning rate, with 24 warmup steps
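
For concreteness, here's roughly what that configuration looks like expressed as Hugging Face TrainingArguments. This is a sketch, not the actual training script: the per-device batch size, gradient accumulation split, and eval cadence are assumptions chosen to hit the numbers above.

```python
from transformers import TrainingArguments

# Sketch only. Per-device batch size and accumulation steps are assumptions;
# they're just one way to reach an effective batch size of 256 on 4 GPUs:
# 4 GPUs * 8 per device * 8 accumulation steps = 256.
training_args = TrainingArguments(
    output_dir="checkpoints",
    per_device_train_batch_size=8,             # assumption
    gradient_accumulation_steps=8,             # assumption
    learning_rate=0.98e-5,
    lr_scheduler_type="constant_with_warmup",  # constant LR after warmup
    warmup_steps=24,
    evaluation_strategy="steps",               # assumption: periodic eval loss
    eval_steps=500,                            # assumption
)
```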

Another problem is that we are no longer dealing with a training set in the range of tens/hundreds of MBs. Instead, the processed training set (before tokenization) now clocks in at around 14GB.

Not counting all the small annoyances that come with handling that much data (transferring it over the network, processing time, memory limits, etc.), we now have the problem of training time: a rough estimate puts us at around a month with the current setup (4 RTX A6000s).

To make this sting a little less, I'll be splitting the training set into ten parts and gradually releasing "work in progress" checkpoints after training on each of them. This will also let us pivot on our data choices if, after testing the latest checkpoint, we notice any problems with the generations.
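
A minimal sketch of what that split could look like, assuming the processed training set is a single JSON-lines file with one example per line (the file names here are made up). Dealing examples out round-robin, rather than cutting the file into contiguous chunks, keeps each part's data mix roughly the same.

```python
import itertools

NUM_PARTS = 10

# Stream the file line by line so the full ~14GB never sits in memory.
parts = [open(f"train_part_{i + 1:02d}.jsonl", "w", encoding="utf-8")
         for i in range(NUM_PARTS)]
with open("train.jsonl", "r", encoding="utf-8") as f:
    # Deal examples out round-robin: part 1, part 2, ..., part 10, part 1, ...
    for line, part in zip(f, itertools.cycle(parts)):
        part.write(line)
for part in parts:
    part.close()
```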

That being the case, let's get started!

Part 1/10

Training notes:

  • Loss curves look good. Evaluation loss trending down consistently, from 1.438 to 1.203.
  • Towards the end (last ~60k training examples), evaluation loss seemed to stagnate and slightly bounce back up a few times.
    • This is about in line with numbers I've heard from other people (about 1GB of data for fine-tuning a 6B model), but I'm curious to see what more data does anyway.
  • Done! 682,701 training examples seen after approximately 2 days and 21 hours of runtime.

Testing notes:

Sampling settings (a generation sketch using these values follows this list):

  • Temperature: Usually 0.5 to 0.65, bumping up to 0.8-1.0 every now and again to see what happens.
  • Top P sampling: Either 0.9 or disabled (1.0).
    • Disabling this seems to give the model a little more freedom to be creative when coupled with a slightly higher temperature(?).
  • Repetition penalty: 1.0 and 1.1.
    • The model is surprisingly good at not repeating itself even at 1.0, but every now and again, in a more difficult situation, it'll fall back to asking a question it just asked, or something of the sort. Bumping to 1.1 seems to fix that in most cases.
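
For reference, these settings map directly onto the standard Hugging Face sampling parameters. Here's how a test generation might be run; the checkpoint name and prompt are placeholders, not the actual model or test inputs.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint name, not the actual model.
tokenizer = AutoTokenizer.from_pretrained("your-org/your-checkpoint")
model = AutoModelForCausalLM.from_pretrained("your-org/your-checkpoint")

inputs = tokenizer("You: Hi! How have you been?\n", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.6,         # usually 0.5-0.65; occasionally 0.8-1.0
    top_p=0.9,               # or 1.0 to disable top-p sampling
    repetition_penalty=1.1,  # 1.0 mostly works; 1.1 fixes the rare loop
    max_new_tokens=128,      # assumption
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```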

Results:

  • From what I could tell, for assistant-like characters this checkpoint performs much better than v2 or v6. Obviously it's no ChatGPT or GPT-JT, though: asking for more detail or for different formatting, for example, doesn't work at all.
  • Knowledge grounding data certainly had an effect: adding an extra turn before the response with Relevant Knowledge: [something here] results in the model rewording the knowledge into something more conversational (see the prompt sketch after this list).
    • Unfortunately this seems to be really hit or miss for anything that's not a stated fact.
    • For example: if you ask about a band and give the model a Wikipedia excerpt, it'll pick out some phrases to reply to your question with.
    • However, if you give it some information about the character/yourself as if it were a memory, it's a coin flip whether that will correctly guide the conversation, or whether the model will ignore it outright or get confused and treat it as conversation history.
  • The curse of the short responses is still here. If you don't take the time to write out a long opening message and some detailed example conversations, the model will quickly trend towards short generations, like talking to someone over an instant messaging app.
    • The messages themselves are actually fine and even seem more engaging than what v6 spits out, but anyone expecting more novel-like/story-driven dialogue or things of the sort will likely be disappointed.
    • To alleviate this, for part 2 I'm considering dropping training examples where the responses are too short (a filtering sketch follows below).
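
To make the knowledge grounding format concrete, here's a guess at what such a prompt could look like. Only the Relevant Knowledge: turn comes from this entry; the rest of the layout is assumed for illustration.

```python
# Hypothetical prompt layout; only the "Relevant Knowledge:" turn is
# described in this entry, the rest is assumed for illustration.
prompt = (
    "You: Who are Radiohead?\n"
    "Relevant Knowledge: Radiohead are an English rock band formed in "
    "Abingdon, Oxfordshire, in 1985.\n"
    "Character:"
)
# The model should reword the knowledge conversationally, e.g.:
# "Oh, Radiohead? They're a rock band from England; they got together
#  in Abingdon back in 1985."
```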
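
And a minimal sketch of the short-response filter being considered for part 2; the field name and length threshold are assumptions, since neither is specified here.

```python
MIN_RESPONSE_CHARS = 50  # assumption: no threshold is given in this entry

def keep(example: dict) -> bool:
    # Assumption: each training example stores the reply under "response".
    return len(example.get("response", "")) >= MIN_RESPONSE_CHARS

examples = [
    {"response": "Hi."},  # dropped: too short
    {"response": "Hello! I just got back from a long walk by the river, "
                 "and the weather was absolutely lovely."},  # kept
]
filtered = [ex for ex in examples if keep(ex)]
print(len(filtered))  # -> 1
```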

I'll release this as an experimental version over at HuggingFace, and get to work on part 2 of training.

Update: As of now, the v7 run has been aborted. Check the next logbook entry for more details.