# 2023-01-11

Logbook for the saga of trying to get a new version of the 6B released. Started on 2023-01-11.

## Context

Since this is the very first logbook, here's some historical context: as of now we've released 350M, 1.3B and 2.7B models, and our most recent release is a 6B.

The 6B Colab notebook was announced a couple(?) of days ago and there was apparently plenty of hype: it already became the 10th most downloaded conversational model of the month on HuggingFace, and it must have reminded people of our data collection efforts, since contributed data shot up from ~189MB to ~304MB in that timeframe.

We've also gotten plenty of feedback. The numbers show that the 6B is our most popular model by far (even summing up the downloads for all the other models, the 6B still has ~7x more), so for now I plan to iterate on the 6B and, once I have a good version of that, just attempt to replicate the configuration for the smaller models.

Notes about the new runs can be found below.

## Experiment 1

### What's new

On the data side:

- A bit of AllenAI's SODA dataset was added into our training data.
  - Specifically, rows where `relation` is `xAttr` were fed in, with `literal` being used as the persona data and `narrative` as the scenario (see the first sketch after this list).
  - This is an attempt to dilute some of the NSFW and possible nonsense dialogue present in the data contributed by the community, which as of now is the biggest part of our training data.
  - The SODA data was trimmed so as not to completely overpower our other data sources.
- Sentiment classification was run over the community-contributed data (I'll call this CC data from now on to save on typing lol) for data balancing purposes.
  - We got feedback that characters would blush and get nervous too often. As it turns out, if we drop all episodes where characters blush, get nervous or stutter, almost 50% of the training data disappears (ouch).
  - To try and cut back on this overrepresentation, episodes with blushing/nervousness had a 70% chance of being dropped, and episodes with stuttering had a 50% chance of being dropped (see the second sketch after this list).
- Basic deduping was done over the final training data.
  - A chunk of new CC data was actually old files with new dialogue appended at the end. This results in duplicate episodes, which are compounded by the data augmentation introduced in the last run.
  - To reduce the possible impact, episodes were deduped before being added to the training dataset.
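
For illustration, here's a minimal sketch of the SODA filtering. The `relation`, `literal`, `narrative` and `dialogue` columns come from AllenAI's `allenai/soda` dataset on the HuggingFace Hub, but the output keys (`persona`/`scenario`) are a hypothetical stand-in for our actual episode format:

```python
from datasets import load_dataset

# Pull AllenAI's SODA dataset from the HuggingFace Hub.
soda = load_dataset("allenai/soda", split="train")

# Keep only the rows where the relation is xAttr.
xattr_rows = soda.filter(lambda row: row["relation"] == "xAttr")

# `literal` becomes the persona string, `narrative` becomes the scenario.
def to_episode(row):
    return {
        "persona": row["literal"],
        "scenario": row["narrative"],
        "dialogue": row["dialogue"],
    }

soda_episodes = xattr_rows.map(to_episode)
```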
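
And a sketch of the probabilistic drop plus the basic dedup pass. The `labels` input stands in for whatever the sentiment classifier tags an episode with (the label names here are hypothetical); the drop chances are the 70%/50% figures above:

```python
import hashlib
import random

BLUSH_NERVOUS_DROP_CHANCE = 0.70
STUTTER_DROP_CHANCE = 0.50

def keep_episode(labels: set[str]) -> bool:
    # Probabilistically drop episodes with overrepresented behaviors.
    if {"blush", "nervous"} & labels:
        return random.random() >= BLUSH_NERVOUS_DROP_CHANCE
    if "stutter" in labels:
        return random.random() >= STUTTER_DROP_CHANCE
    return True

def dedupe(episodes: list[str]) -> list[str]:
    # Basic dedup: hash each episode's text, keep first occurrences only.
    seen, unique = set(), []
    for episode in episodes:
        digest = hashlib.sha256(episode.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(episode)
    return unique
```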

The dataset consisted of 74,825,728 tokens.

On the training side, I talked with some other people who have fine-tuned similarly-sized models and made some adjustments so our parameters resemble theirs more closely (see the sketch after this list):

- LR was significantly reduced: from 4e-5 to 5e-6
- Total batch size was reduced from 64 to 32
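
In HuggingFace `TrainingArguments` terms, the changed knobs look roughly like this (a sketch; the device count and batch split are assumptions, and the actual training setup may not use this API at all):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="6b-experiment-1",   # hypothetical output path
    learning_rate=5e-6,             # down from 4e-5
    per_device_train_batch_size=4,  # assuming 8 devices: 4 * 8 = 32 total
    gradient_accumulation_steps=1,  # no accumulation assumed
)
```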

### Results

The run finished in about five and a half hours, over 9,134 steps.

Because I'm a troglodyte of a researcher, there are no eval/test splits, hence no metrics. Instead, I manually ran inference on the model to try and observe its behaviors.

Sampling settings used:

`max_new_tokens=196, temperature=1, top_p=0.9, typical_p=1, repetition_penalty=1.05, top_k=0, penalty_alpha=0`
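
In `transformers` terms, that maps onto a `generate()` call along these lines (checkpoint path and prompt format are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "path/to/6b-checkpoint"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

prompt = "You: How are you?\nBot:"  # placeholder prompt format
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    do_sample=True,           # temperature/top_p only kick in when sampling
    max_new_tokens=196,
    temperature=1.0,
    top_p=0.9,
    typical_p=1.0,            # 1.0 disables typical sampling
    repetition_penalty=1.05,
    top_k=0,                  # 0 disables top-k filtering
    penalty_alpha=0,          # 0 disables contrastive search
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```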

Testing notes:

- Characters mess up their own names sometimes. For example, "Tsuki" will refer to themselves as "Tsukino" every now and again.
- Managed to run into a bug on the prototype UI: the last character of the user message sometimes gets appended to the beginning of the character's name in the response.
  - For example, if you say "How are you?", the response might be "? Bot: I'm doing well!"
- Personas are still kinda weak. Writing "I have very pale skin" and then swapping it out with "I have tanned skin", for example, will result in the character responding correctly most of the time when you ask them about it directly, but they'll still get it wrong a non-trivial number of times.
- Invalid markdown every now and again (an opening `*` with no closing one, for example). Easily fixable on the UI side of things; see the sketch below.
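
The asterisk case in particular is trivial to patch on the UI side; a minimal sketch:

```python
def close_dangling_asterisks(text: str) -> str:
    # An odd number of asterisks means an action/emphasis span was never
    # closed; appending one closing asterisk makes the markdown valid again.
    if text.count("*") % 2 == 1:
        return text + "*"
    return text

assert close_dangling_asterisks("*waves hello") == "*waves hello*"
assert close_dangling_asterisks("*waves* normally") == "*waves* normally"
```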

## Experiment 2

### What's new

- Another ~80MB of raw CC data has come in since the last experiment.
- I also got more feedback about character personas being weak and heavily influenced by the name of the character.
  - Non-CC data already has a nice amount of variety in character names, but CC data has a substantial number of dialogue examples coming from a much smaller set of characters, so that might bias things.
  - To work around that, I'm masking character names in the CC data on this run (see the sketch after this list).
- Separately, I noticed other redaction-related tokens that I wasn't handling properly in the CC data. Updated the code to handle the majority of them.
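
A rough sketch of the name masking (the placeholder token and the whole-word replacement approach are assumptions about the preprocessing, not the exact implementation):

```python
import re

NAME_PLACEHOLDER = "<CHARACTER>"  # hypothetical mask token

def mask_character_name(episode_text: str, character_name: str) -> str:
    # Replace whole-word occurrences of the character's name so the model
    # can't tie a persona to one specific name.
    pattern = re.compile(rf"\b{re.escape(character_name)}\b")
    return pattern.sub(NAME_PLACEHOLDER, episode_text)

print(mask_character_name("Tsuki: Hi, I'm Tsuki!", "Tsuki"))
# -> <CHARACTER>: Hi, I'm <CHARACTER>!
```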

89,620,480 tokens over 10,940 steps.

### Results

Mostly the same issues as the last experiment, but character personas seem to have improved noticeably. Even W++ actually works better than I expected.

## Experiment 3

### What's new

- More CC data
- Batch size slashed in half again (from 32 to 16)
- LR slightly down to 4e-6

98,935,808 tokens over 12,008 steps.

### Results

Haven't had the opportunity to test yet! Will push to HuggingFace and let the community test it out. Experiment 2 was pretty well-received and people are saying it's an improvement over the first release, so I'll probably release that as the new official version, which marks the end of this specific logbook.