Loading Datasets on the Hugging Face Hub: A Step-by-Step Guide

Part 1: Loading a Dataset to the Hugging Face Hub
Introduction
This part of the lesson walks through the process of uploading a custom dataset to the Hugging Face Hub. The Hugging Face Hub is a platform that lets engineers share datasets and models and collaborate on them.
Here, we will take an existing Python instruction dataset, convert it into the appropriate training format, and upload it to the Hugging Face Hub for public use. We format our data to match the Llama 3.2 chat template, making it ready for fine-tuning Llama 3.2 models.
Step 1: Installation and Authentication
First, we need to install the required libraries and authenticate with the Hugging Face Hub:
!pip install -q datasets
!huggingface-cli login
What happens here:
- datasets is the Hugging Face library for working with machine learning datasets
- The -q (quiet) flag reduces installation output messages
- huggingface-cli login will prompt you to enter your Hugging Face authentication token
- You can find your token under your Hugging Face account Settings → Access Tokens
After running this cell, you will be prompted to enter your token. This authenticates your session and allows you to push content to the Hub.
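If you prefer to authenticate from code instead of the CLI prompt, the huggingface_hub library provides a login() helper. Here is a minimal optional sketch (the token string is a placeholder you must replace with your own):
from huggingface_hub import login
# Authenticate this session with your Hugging Face access token (placeholder shown)
login(token="hf_your_token_here")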
Step 2: Load the Dataset and Define a Transformation Function
Next, we will load an existing dataset and define a function to transform it to match the Llama 3.2 chat format:
from datasets import load_dataset
# Load your complete custom dataset
dataset = load_dataset('Vezora/Tested-143k-Python-Alpaca')
# Define a function to transform the data
def transform_conversation(example):
    system_prompt = """
You are an expert Python coding assistant. Your role is to help users write clean, efficient, and bug-free Python code.
You have been trained on a diverse set of high-quality Python code samples, all of which passed rigorous
automated testing for functionality and performance.
Always follow best practices in Python programming, provide concise and readable solutions,
and ensure that your responses include informative comments when necessary.
When presented with a coding problem, first create a detailed pseudocode that outlines the
structure and logic of the solution step-by-step. Once the pseudocode is complete,
follow it to generate the actual Python code. This approach will help ensure
clarity and alignment with the desired logic before writing the code.
If asked to modify existing code, provide pseudocode highlighting the changes and
optimizations to be made, focusing on improvements related to performance, error handling,
and robustness. Remember to explain your thought process and rationale clearly for
any modifications or code suggestions you provide.
"""
    instruction = example['instruction'].strip()  # Access the instruction column
    output = example['output'].strip()            # Access the output column

    # Build the conversation string using the Llama 3.2 chat template special tokens
    formatted_text = (
        f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>
{system_prompt}
<|eot_id|><|start_header_id|>user<|end_header_id|>
{instruction}
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{output}<|eot_id|>"""
    )
    # (An earlier commented-out variant used the Llama 2 [INST] ... [/INST] template;
    #  Llama 3.2 uses the header/eot special tokens shown above instead.)
    return {'text': formatted_text}
What happens here:
- We load the Vezora/Tested-143k-Python-Alpaca dataset
- We define a transformation function that reformats each example into the Llama 3.2 chat format
- We include a detailed system prompt that gives the model context about its role as a Python coding assistant
- Special tokens such as <|begin_of_text|>, <|start_header_id|>, and <|eot_id|> mark the structure the Llama 3.2 models expect
- The function produces a well-formed conversation with system, user, and assistant messages
The system prompt is important because it defines the persona and behavioral expectations for the model. Here, we instruct the model to act as a Python coding assistant that follows best practices and provides well-reasoned solutions.
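Before mapping the whole dataset, it can help to sanity-check the template on a single record. This optional snippet (not part of the original code) assumes the dataset and function defined above:
# Format one example and preview the result to confirm the chat template looks right
sample = transform_conversation(dataset['train'][0])
print(sample['text'][:500])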
Step 3: Apply the Transformation to the Dataset
We now apply our transformation function to the entire dataset:
# Apply the transformation to the entire dataset
transformed_dataset = dataset['train'].map(transform_conversation)
What happens here:
- The map() method applies our transformation function to every example in the dataset
- This processes all 143,000 examples, converting them to the Llama 3.2 chat format
- The result is a new dataset with the same content, now properly formatted for Llama 3.2
This transformation is important because it puts the data into the specific template the Llama model family expects. Without it, the model would not recognize the different roles in a conversation (system, user, assistant) or where each message begins and ends.
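As an aside, map() accepts a few options that can be handy here. A hedged variant of the same call (not in the original notebook) that parallelizes the work and keeps only the new text column might look like this:
# Optional variant: run the transformation in parallel and drop the original columns
transformed_dataset = dataset['train'].map(
    transform_conversation,
    num_proc=4,                                    # use 4 worker processes
    remove_columns=dataset['train'].column_names,  # keep only the new 'text' field
)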
Step 4: Push the Dataset to the Hugging Face Hub
With our dataset prepared, we can now upload it to the Hugging Face Hub:
transformed_dataset.push_to_hub("Llama-3.2-Python-Alpaca-143k")
What happens here:
- push_to_hub() uploads our transformed dataset to the Hugging Face Hub
- "Llama-3.2-Python-Alpaca-143k" will be the name of your dataset repository
- This creates a new dataset repository under your username
- The dataset becomes publicly available for others to download and use
After running this cell, you will see progress bars showing the upload status. Once it finishes, you can visit the Hugging Face Hub to view your newly uploaded dataset, edit its dataset card, and share it with the community.
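Once the upload completes, anyone can pull the dataset straight from the Hub. A minimal sketch, assuming your own account name replaces the placeholder:
from datasets import load_dataset
# Load the published dataset back from the Hub (replace the username placeholder)
reloaded = load_dataset("your-username/Llama-3.2-Python-Alpaca-143k", split="train")
print(reloaded)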
This dataset is now ready for fine-tuning Llama 3.2 models on Python coding tasks, with properly formatted conversations that include system instructions, user queries, and assistant responses!
Part 2: Fine-Tuning and Uploading a Model to the Hugging Face Hub
Now that our dataset is prepared and uploaded, let's fine-tune a model on it and upload the result to the Hugging Face Hub.
Step 1: Install the Required Libraries
First, we need to install all the libraries required for fine-tuning large language models:
!pip install "unsloth[colab-new] @ git+
!pip install "git+
!pip install -U trl
!pip install --no-deps trl peft accelerate bitsandbytes
!pip install torch torchvision torchaudio triton
!pip install xformers
!python -m xformers.info
!python -m bitsandbytes
What this does: Installs Unsloth (a library for faster LLM fine-tuning), the latest version of transformers, TRL (for supervised fine-tuning), PEFT (for parameter-efficient fine-tuning), and related dependencies. The xformers and bitsandbytes libraries help with memory-efficient operation.
Step 2: Load the Dataset
Next, we load the dataset we prepared in the previous section:
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
import torch
from datasets import load_dataset
max_seq_length = 2048
dataset = load_dataset("nikhiljatiwal/Llama-3.2-Python-Alpaca-143k", split="train")
What this does: Sets the maximum sequence length for our model and loads our previously prepared Python coding dataset from the Hugging Face Hub.
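A quick optional check (not part of the original notebook) confirms the dataset arrived with the expected size and formatting:
# Inspect the loaded dataset: row count and a preview of the formatted text field
print(len(dataset))              # roughly 143k examples
print(dataset[0]['text'][:300])  # should begin with the Llama 3.2 special tokens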
Step 3: Load the Pre-Trained Model
We now load a quantized version of Llama 3.2:
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
max_seq_length = max_seq_length,
dtype = None,
load_in_4bit = True
)
What this does: Loads the 4-bit quantized version of the Llama 3.2 3B Instruct model from Unsloth's repository. Quantization reduces the memory footprint while retaining most of the model's capability.
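If you want to see the memory savings for yourself, you can check GPU usage right after loading. A small optional snippet, assuming a CUDA runtime such as a Colab T4:
import torch
# Report how much GPU memory the 4-bit model currently occupies
print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")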
Step 4: Configure PEFT (Parameter-Efficient Fine-Tuning)
We will configure the model for fine-tuning with LoRA (Low-Rank Adaptation):
model = FastLanguageModel.get_peft_model(
model,
r = 16,
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",],
lora_alpha = 16,
lora_dropout = 0, # Supports any, but = 0 is optimized
bias = "none", # Supports any, but = "none" is optimized
# [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
random_state = 3407,
use_rslora = False, # We support rank stabilized LoRA
loftq_config = None, # And LoftQ
max_seq_length = max_seq_length
)
What this does: Configures the model for parameter-efficient fine-tuning with LoRA. This method trains a small number of new parameters while keeping most of the original model frozen, enabling effective training with limited resources. We target the attention and MLP projection modules with a LoRA rank of 16.
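To see how small the trainable fraction really is, you can print the parameter counts. This assumes the object returned by get_peft_model exposes the standard PEFT helper, which is typically the case:
# Print trainable vs. total parameters (LoRA usually trains well under 1% of the model)
model.print_trainable_parameters()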
Step 5: Mount Google Drive for Storage
To ensure our trained model persists even if the Colab session times out:
from google.colab import drive
drive.mount("/content/drive")
What this does: Mounts your Google Drive so that checkpoints and the final model can be saved there.
Step 6: Configure and Start Training
We now configure and launch the training process:
trainer = SFTTrainer(
model = model,
train_dataset = dataset,
dataset_text_field = "text",
max_seq_length = max_seq_length,
tokenizer = tokenizer,
args = TrainingArguments(
per_device_train_batch_size = 2,
gradient_accumulation_steps = 4,
warmup_steps = 10,
# num_train_epochs = 1, # Set this for 1 full training run.
max_steps = 60,
learning_rate = 2e-4,
fp16 = not torch.cuda.is_bf16_supported(),
bf16 = torch.cuda.is_bf16_supported(),
logging_steps = 1,
optim = "adamw_8bit",
weight_decay = 0.01,
lr_scheduler_type = "linear",
seed = 3407,
output_dir = "/content/drive/My Drive/Llama-3.2-3B-Instruct-bnb-4bit"
),
)
trainer.train()
What this does: Creates a supervised fine-tuning trainer with our model, dataset, and training parameters. Training runs for 60 steps with a per-device batch size of 2 and gradient accumulation over 4 steps (an effective batch size of 8), at a learning rate of 2e-4. Model checkpoints are written to Google Drive.
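Because Colab sessions can disconnect mid-run, it is worth knowing that training can be resumed from checkpoints written to output_dir, provided checkpoints were actually saved (for example by setting save_steps in TrainingArguments). A hedged sketch using the standard Trainer API:
# Resume training from the most recent checkpoint in output_dir after a disconnect
trainer.train(resume_from_checkpoint=True)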
Step 7: Save the Fine-Tuned Model Locally
After training, we save our model locally:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")
What this does: Saves the fine-tuned LoRA adapters together with the tokenizer to a local directory.
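To run inference later from that local directory, Unsloth can load the saved adapters back on top of the base model. A minimal sketch, assuming the same environment as above (check the Unsloth documentation for the exact API in your version):
from unsloth import FastLanguageModel
# Reload the fine-tuned adapters from the local "lora_model" directory
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora_model",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)  # switch the model to inference mode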
Step 8: Upload the Model to the Hugging Face Hub
Finally, we upload our fine-tuned model to the Hugging Face Hub:
import os
from google.colab import userdata
HF_TOKEN = userdata.get('HF_WRITE_API_KEY')
model.push_to_hub_merged("nikhiljatiwal/Llama-3.2-3B-Instruct-code-bnb-4bit", tokenizer, save_method = "merged_16bit", token=HF_TOKEN)
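What this does: Reads a write-enabled token stored in Colab's user secrets, merges the LoRA adapters into the base model weights, and uploads the merged 16-bit model together with the tokenizer to the Hugging Face Hub. Once the upload finishes, the merged model can be loaded like any other Hub checkpoint; a brief illustrative snippet (using the repository id created above):
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load the merged, fine-tuned model back from the Hub for inference
tokenizer = AutoTokenizer.from_pretrained("nikhiljatiwal/Llama-3.2-3B-Instruct-code-bnb-4bit")
model = AutoModelForCausalLM.from_pretrained("nikhiljatiwal/Llama-3.2-3B-Instruct-code-bnb-4bit")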
In this guide, we walked through a complete workflow for customizing an AI model with Hugging Face tools. We transformed the Vezora/Tested-143k-Python-Alpaca Python coding dataset into the Llama 3.2 chat format with a specialized system prompt and uploaded it as "Llama-3.2-Python-Alpaca-143k". We then fine-tuned a Llama 3.2 model on it using resource-efficient techniques (4-bit quantization and LoRA) on modest compute. Finally, we shared both the dataset and the model on the Hugging Face Hub, making our Python coding assistant available to the community. This project shows how accessible AI development has become, enabling developers to create specialized models with relatively modest resources.
The Colab notebooks accompanying this guide are Llama_3_2_3B_Instruct_code (the fine-tuning notebook) and Llama_3_2_Python_Alpaca_143k (the dataset preparation notebook).
Nikhil is an intern consultant at MarktechPost. He is pursuing an integrated dual degree at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who explores applications in fields such as biomaterials and biomedical science. With a strong background in Materials Science, he is exploring new advancements and creating opportunities to contribute.
