charts/models/values.yaml:

# nameOverride and fullnameOverride allow for custom naming
nameOverride: ""
fullnameOverride: ""

# appName defines the name of the application to be deployed, which is the name
# shared by the Deployment, Service and Ingress
appName: tgi

# instanceName defines the Dell instance in which the workload will run
instanceName: xe9680-nvidia-h100

# numReplicas defines the number of replicas for the Deployment; note that you
# need access on Dell to as many instances as replicas you define
numReplicas: 1

# modelId defines the Hugging Face Hub model ID for the model (case insensitive)
modelId: "meta-llama/meta-llama-3-70b-instruct"

# env contains the environment variables for Text Generation Inference; you can
# find them all at https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/launcher
env:
  # PORT defines the Text Generation Inference (TGI) port to be exposed, as well
  # as the port defined and exposed within the NGINX Ingress controller and the Service
  PORT: 80
  # NUM_SHARD defines the number of GPUs to use within each instance; note that
  # not every instance supports an arbitrary number of GPUs, although supported
  # values usually fall within the 1-8 range (1, 2, 4, 8)
  NUM_SHARD: 8
  # NOTE: by default TGI will pick the "best" configuration for arguments such as
  # `MAX_TOTAL_TOKENS`, `MAX_INPUT_TOKENS`, etc. in order to make the most of the
  # existing resources and benefit from all the available VRAM
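
# A minimal usage sketch, not part of the chart itself: the release name
# "tgi-llama" below is hypothetical, and the chart path is taken from this
# file's location (charts/models). Any of the values above can be overridden
# at install time with --set (nested keys such as env.NUM_SHARD use dot
# notation), e.g.:
#
#   helm install tgi-llama charts/models \
#     --set modelId="meta-llama/meta-llama-3-70b-instruct" \
#     --set numReplicas=1 \
#     --set env.NUM_SHARD=8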