I tried to install Salesforce/codegen25-7b-multi_P on my MacBook with Hugging Face transformers 4.45, and it failed with the following error:
.env/lib/python3.12/site-packages/transformers/tokenization_utils_base.py", line 1590, in __init__
raise AttributeError(f"{key} conflicts with the method {key} in {self.__class__.__name__}")
AttributeError: add_special_tokens conflicts with the method add_special_tokens in CodeGen25Tokenizer
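What seems to be happening: newer transformers versions reject any tokenizer init kwarg whose name shadows a method on the class, and the remote CodeGen25 tokenizer passes `add_special_tokens` as a kwarg. Here is a minimal, illustrative sketch of that guard (not the real transformers code):

```python
# Illustrative stand-in for the guard in newer tokenizer __init__ methods:
# any init kwarg whose name collides with a method raises AttributeError.
class DemoTokenizer:
    def add_special_tokens(self, tokens):
        """Method whose name collides with the init kwarg below."""
        pass

    def __init__(self, **kwargs):
        for key in kwargs:
            if hasattr(self.__class__, key):
                raise AttributeError(
                    f"{key} conflicts with the method {key} in {self.__class__.__name__}"
                )

# DemoTokenizer(add_special_tokens=True) raises the same style of AttributeError.
```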
Going back a few transformers versions gives this error instead:
codegen25-7b-multi/0bdf3f45a09e4f53b333393205db1388634a0e2e/tokenization_codegen25.py", line 149, in vocab_size
return self.encoder.n_vocab
^^^^^^^^^^^^
AttributeError: 'CodeGen25Tokenizer' object has no attribute 'encoder'. Did you mean: 'encode'?
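One plausible reading of this error is an init-order problem: the base class `__init__` in these transformers versions touches `vocab_size` before the subclass has assigned `self.encoder` (the tiktoken encoding). A simplified, hypothetical sketch of that failure mode:

```python
# Hypothetical sketch of the init-order failure (not the real transformers code):
# the base __init__ reads vocab_size before the subclass sets self.encoder.
class FakeBase:
    def __init__(self, **kwargs):
        _ = self.vocab_size  # base init reads the vocab size (simplified stand-in)

class FakeCodeGen25Tokenizer(FakeBase):
    def __init__(self):
        super().__init__()   # blows up here: self.encoder is not assigned yet
        self.encoder = None  # the tiktoken encoding would be set here

    @property
    def vocab_size(self):
        return self.encoder.n_vocab
```

Instantiating `FakeCodeGen25Tokenizer()` raises the same "object has no attribute 'encoder'" error.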
After zipping through older transformers versions I found a note in the release saying the model requires transformers 4.29.2. That version no longer compiles on my current Mac setup because of Rust, failing with this error:
error: could not compile `tokenizers` (lib) due to 1 previous error; 3 warnings emitted
Caused by:
process didn't exit successfully: `rustc --crate-name tokenizers --edition=2018 tokenizers-lib/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type lib
...
error: `cargo rustc --lib --message-format=json-render-diagnostics --manifest-path Cargo.toml --release -v --features pyo3/extension-module --crate-type cdylib -- -C 'link-args=-undefined dynamic_lookup -Wl,-install_name,@rpath/tokenizers.cpython-312-darwin.so'` failed with code 101
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for tokenizers
Failed to build tokenizers
The Solution
Here is a solution that worked for me. The RUSTFLAGS setting tells rustc to allow the invalid_reference_casting lint, which newer Rust compilers treat as a hard error and which is what breaks the old tokenizers build:
RUSTFLAGS="-A invalid_reference_casting" pip install transformers==4.33.2
Transformers 4.29.2 works as well. Then install torch.
And here is everything together:
virtualenv .env
source .env/bin/activate
RUSTFLAGS="-A invalid_reference_casting" HF_HOME=.cache pip install tiktoken==0.4.0 torch transformers==4.33.2
python test.py
where test.py would be the following:
from transformers import AutoTokenizer, AutoModelForCausalLM

# The custom CodeGen25 tokenizer ships inside the model repo,
# so trust_remote_code is required for the tokenizer.
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen25-7b-instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen25-7b-instruct")

# Complete a function definition from its signature.
text = "def hello_world():"
input_ids = tokenizer(text, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=128)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
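Since the whole fix hinges on exact version pins, a quick sanity check before running the script can save a model download. Here is a small hypothetical helper (`check_pin` is my own name, not from any library) using the standard library:

```python
# Hypothetical helper to confirm the version pins actually took effect
# inside the virtualenv before running test.py.
import importlib.metadata as md

def check_pin(pkg: str, wanted: str) -> bool:
    """Return True iff pkg is installed at exactly the wanted version."""
    try:
        return md.version(pkg) == wanted
    except md.PackageNotFoundError:
        return False

# e.g. check_pin("transformers", "4.33.2") and check_pin("tiktoken", "0.4.0")
```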