The next AI frontier: Using VAEs to generate high-quality synthetic data

What is synthetic data?
Data generated by a computer intended to replicate or augment existing data.
Why is it useful?
We have all witnessed the success of ChatGPT, Llama and, most recently, DeepSeek. These language models are being used across society and have spawned many applications that bring us closer to artificial general intelligence.
Before getting too excited, or scared, depending on your perspective, we are also quickly approaching a limitation of these language models. According to a paper published by the research group Epoch [1], we are running out of data. They estimate that by 2028 we will have reached the upper limit of data available to train language models.
What happens if we run out of data?
Well, if we run out of data then we will have nothing new to train our language models on. These models will then stop improving. If we want to pursue artificial general intelligence, we are going to need new ways of improving AI without simply increasing the amount of real-world training data.
One potential saviour is synthetic data, which can augment existing data, and is already being used to improve the performance of models such as Gemini and DBRX.
Synthetic data beyond LLMs
Beyond helping to overcome the data shortage for large language models, synthetic data can be used in the following situations:
- Sensitive data – if we do not want to share or use sensitive attributes, synthetic data can be generated to mimic the properties of these features while preserving anonymity.
- Expensive data – if collecting data is expensive, we can generate a large amount of synthetic data from a small amount of real-world data.
- Lack of data – datasets are biased when there is a disproportionately low number of data points from a particular group. Synthetic data can be used to balance a dataset.
Imbalanced datasets
Imbalanced datasets can (*but not always*) be a problem as they may not contain enough information to successfully train a predictive model. For example, if a dataset contains many more men than women, our model may be biased towards recognising men and misclassify future female samples as men.
In this article we show the imbalance in the popular UCI Adult dataset [2], and how we can use a variational auto-encoder to generate synthetic data that improves classification on this example.
We first download the Adult dataset. This dataset contains features such as age, education and occupation, which can be used to predict the target outcome 'income'.
# Download dataset into a dataframe
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Standard UCI Machine Learning Repository location for the Adult dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
columns = [
"age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
"occupation", "relationship", "race", "sex", "capital-gain",
"capital-loss", "hours-per-week", "native-country", "income"
]
data = pd.read_csv(url, header=None, names=columns, na_values=" ?", skipinitialspace=True)
# Drop rows with missing values
data = data.dropna()
# Split into features and target
X = data.drop(columns=["income"])
y = data['income'].map({'>50K': 1, '<=50K': 0}).values
# Plot distribution of income
plt.figure(figsize=(8, 6))
plt.hist(data['income'], bins=2, edgecolor="black")
plt.title('Distribution of Income')
plt.xlabel('Income')
plt.ylabel('Frequency')
plt.show()
In the Adult dataset, income is a binary variable, representing people who earn above, and below, $50,000. We plot the distribution of income across the whole dataset. We can see that the dataset is heavily imbalanced, with the vast majority of people earning below $50,000.
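As a quick numeric check of this imbalance, we can also look at the class proportions directly with `value_counts`. A minimal sketch (the miniature series below is a made-up stand-in for the real `income` column, chosen to roughly match its real skew):

```python
import pandas as pd

# Made-up miniature of the Adult 'income' column, roughly matching its real skew
income = pd.Series(["<=50K"] * 76 + [">50K"] * 24)

# Fraction of samples in each class
proportions = income.value_counts(normalize=True)
print(proportions)
```

On the real dataset the same one-liner, applied to `data['income']`, quantifies exactly how skewed the classes are before any modelling.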

Despite this imbalance, we can still train a machine learning classifier on the Adult dataset, which we can use to determine whether unseen, or test, individuals should be classified as earning above, or below, 50K.
# Preprocessing: One-hot encode categorical features, scale numerical features
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
import seaborn as sns

numerical_features = ["age", "fnlwgt", "education-num", "capital-gain", "capital-loss", "hours-per-week"]
categorical_features = [
"workclass", "education", "marital-status", "occupation", "relationship",
"race", "sex", "native-country"
]
preprocessor = ColumnTransformer(
transformers=[
("num", StandardScaler(), numerical_features),
("cat", OneHotEncoder(), categorical_features)
]
)
X_processed = preprocessor.fit_transform(X)
# Convert to numpy array for PyTorch compatibility
X_processed = X_processed.toarray().astype(np.float32)
y_processed = y.astype(np.float32)
# Split dataset in train and test sets
X_model_train, X_model_test, y_model_train, y_model_test = train_test_split(X_processed, y_processed, test_size=0.2, random_state=42)
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_model_train, y_model_train)
# Make predictions
y_pred = rf_classifier.predict(X_model_test)
# Display confusion matrix
cm = confusion_matrix(y_model_test, y_pred)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt="d", cmap="YlGnBu", xticklabels=["Negative", "Positive"], yticklabels=["Negative", "Positive"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
The confusion matrix of our classifier shows that our model performs well despite the imbalance. Our model has an overall error rate of 16%, but for the positive class (income > 50K) the error rate is 36%.
This discrepancy shows that the model is biased towards the negative class: it frequently misclassifies individuals who earn more than 50K as earning below 50K.
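Both error rates can be read straight off a confusion matrix. A minimal sketch with illustrative counts (chosen to match the rates above, not the exact counts from this run):

```python
import numpy as np

# Illustrative confusion matrix: rows are actual, columns are predicted
# [[true negatives, false positives],
#  [false negatives, true positives]]
cm = np.array([[4500, 512],
               [522, 928]])

# Overall error: all misclassified samples over all samples
overall_error = (cm[0, 1] + cm[1, 0]) / cm.sum()
# Positive-class error: actual >50K samples predicted as <=50K
positive_error = cm[1, 0] / cm[1].sum()

print(f"overall error: {overall_error:.0%}")          # 16%
print(f"positive-class error: {positive_error:.0%}")  # 36%
```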
Below we show how we can use a variational autoencoder to generate synthetic positive-class data to balance this dataset. We then train the same model using the balanced synthetic dataset and evaluate its performance on the original test set.

How can we generate synthetic data?
There are many different methods for generating synthetic data. These include more traditional methods such as SMOTE and Gaussian noise, which generate new data by modifying existing data. Alternatively, generative models such as variational autoencoders or generative adversarial networks are predisposed to generating new data, as their architectures learn the distribution of real data.
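For comparison with the VAE approach used below, here is a minimal sketch of the Gaussian-noise style of augmentation, where new minority-class samples are created by perturbing real ones (the feature matrix here is random stand-in data, not the Adult dataset):

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in minority-class feature matrix: 5 samples, 3 features
X_minority = rng.normal(size=(5, 3))

# Draw 20 source rows at random and add small Gaussian perturbations
idx = rng.integers(0, len(X_minority), size=20)
X_synthetic = X_minority[idx] + rng.normal(scale=0.05, size=(20, 3))

print(X_synthetic.shape)  # (20, 3)
```

This is cheap and simple, but unlike a generative model it can only produce points in small neighbourhoods of existing samples.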
In this tutorial we use a variational autoencoder to generate our synthetic data.
Variational autoencoders
Variational autoencoders (VAEs) are great for synthetic data generation because they use real data to learn a continuous latent space. We can view this latent space as a magic bucket from which we can sample synthetic data that closely resembles the existing data. The continuity of this space is one of their big selling points, as it means the model can generate data beyond the examples it has seen.
A VAE consists of an encoder, which maps input data to a probability distribution (mean and variance), and a decoder, which reconstructs the data from the latent space.
Instead of sampling from the latent distribution directly, VAEs use a reparameterisation trick, where a random noise vector is scaled and shifted using the learned mean and variance, ensuring the latent representations stay smooth and continuous.
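The reparameterisation trick itself is only a couple of lines. A minimal numerical sketch in PyTorch (the batch size and latent dimension here are arbitrary):

```python
import torch

torch.manual_seed(0)

# Values the encoder would output for a batch of 4 with latent_dim = 2
mu = torch.zeros(4, 2)
logvar = torch.zeros(4, 2)  # log variance 0 means standard deviation 1

# z = mu + sigma * eps with eps ~ N(0, 1); sampling stays differentiable
# with respect to mu and logvar because the randomness lives only in eps
std = torch.exp(0.5 * logvar)
eps = torch.randn_like(std)
z = mu + eps * std

print(z.shape)  # torch.Size([4, 2])
```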
Below we build a BasicVAE class that implements this process with a simple architecture.
- The encoder compresses the input into a smaller, hidden representation, producing both a mean and a log variance that define a Gaussian distribution, i.e. our magic bucket. Instead of sampling directly, the model uses the reparameterisation trick to generate latent variables, which are then passed to the decoder.
- The decoder reconstructs the original data from these latent variables, ensuring the generated data preserves the characteristics of the real data.
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

class BasicVAE(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super(BasicVAE, self).__init__()
        # Encoder: Single small layer
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 8),
            nn.ReLU()
        )
        self.fc_mu = nn.Linear(8, latent_dim)
        self.fc_logvar = nn.Linear(8, latent_dim)
        # Decoder: Single small layer
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 8),
            nn.ReLU(),
            nn.Linear(8, input_dim),
            nn.Sigmoid()  # Outputs values in range [0, 1]
        )

    def encode(self, x):
        h = self.encoder(x)
        mu = self.fc_mu(h)
        logvar = self.fc_logvar(h)
        return mu, logvar

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def decode(self, z):
        return self.decoder(z)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar
Given our BasicVAE architecture, we next create our loss and model training functions.
def vae_loss(recon_x, x, mu, logvar):
    # Reconstruction loss: how closely the output matches the input
    recon_loss = nn.MSELoss()(recon_x, x)
    # KL divergence loss: pushes the latent distribution towards N(0, 1)
    kld_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kld_loss / x.size(0)

def train_vae(model, data_loader, epochs, learning_rate):
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    model.train()
    losses = []
    reconstruction_mse = []
    for epoch in range(epochs):
        total_loss = 0
        total_mse = 0
        for batch in data_loader:
            batch_data = batch[0]
            optimizer.zero_grad()
            reconstructed, mu, logvar = model(batch_data)
            loss = vae_loss(reconstructed, batch_data, mu, logvar)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
            # Compute batch-wise MSE for comparison
            mse = nn.MSELoss()(reconstructed, batch_data).item()
            total_mse += mse
        losses.append(total_loss / len(data_loader))
        reconstruction_mse.append(total_mse / len(data_loader))
        print(f"Epoch {epoch+1}/{epochs}, Loss: {total_loss:.4f}, MSE: {total_mse:.4f}")
    return losses, reconstruction_mse
combined_data = np.concatenate([X_model_train.copy(), y_model_train.copy().reshape(-1, 1)], axis=1)
# Train-test split
X_train, X_test = train_test_split(combined_data, test_size=0.2, random_state=42)
batch_size = 128
# Create DataLoaders
train_loader = DataLoader(TensorDataset(torch.tensor(X_train)), batch_size=batch_size, shuffle=True)
test_loader = DataLoader(TensorDataset(torch.tensor(X_test)), batch_size=batch_size, shuffle=False)
basic_vae = BasicVAE(input_dim=X_train.shape[1], latent_dim=8)
basic_losses, basic_mse = train_vae(
basic_vae, train_loader, epochs=50, learning_rate=0.001,
)
# Visualize results
plt.figure(figsize=(12, 6))
plt.plot(basic_mse, label="Basic VAE")
plt.ylabel("Reconstruction MSE")
plt.title("Training Reconstruction MSE")
plt.legend()
plt.show()
vae_loss consists of two components: the reconstruction loss, which measures how well the reconstructed data matches the original input using mean squared error (MSE), and the KL divergence loss, which ensures that the learned latent space follows a standard normal distribution.
train_vae optimises the VAE using the Adam optimiser over a number of epochs. During training, the model takes mini-batches of data, reconstructs them, and computes the loss using vae_loss. These errors are then corrected via backpropagation, where the model weights are updated. We train the model for 50 epochs and plot how the reconstruction error decreases over training.
We can see that our model quickly learns how to reconstruct our data, evidencing efficient learning.

Now that we have trained our BasicVAE to reconstruct the Adult dataset, we can use it to generate synthetic data. We want to generate more positive-class samples (individuals earning above 50K) in order to balance the classes and remove the bias from our model.
To do this we select all the samples from our VAE training data where income is the positive class (earning above 50K). We then encode these samples into the latent space. As we have only encoded samples of the positive class, this latent space will reflect the properties of the positive class, and we can sample from it to create synthetic data.
We sample 15,000 new latent vectors from this latent space and decode them back into the input data space as our synthetic data points.
# Collect the VAE training data (features plus income) into a dataframe
sample_df = pd.DataFrame(X_train)
# Create column names
col_number = sample_df.shape[1]
col_names = [str(i) for i in range(col_number)]
sample_df.columns = col_names
# Define the feature value to filter
feature_value = 1.0 # Specify the feature value - here we set the income to 1
# Set all income values to 1 : Over 50k
selected_samples = sample_df[sample_df[col_names[-1]] == feature_value]
selected_samples = selected_samples.values
selected_samples_tensor = torch.tensor(selected_samples, dtype=torch.float32)
basic_vae.eval()  # Set model to evaluation mode
with torch.no_grad():
    mu, logvar = basic_vae.encode(selected_samples_tensor)
    latent_vectors = basic_vae.reparameterize(mu, logvar)
# Compute the mean latent vector for this feature
mean_latent_vector = latent_vectors.mean(dim=0)
num_samples = 15000 # Number of new samples
latent_dim = 8
latent_samples = mean_latent_vector + 0.1 * torch.randn(num_samples, latent_dim)
with torch.no_grad():
    generated_samples = basic_vae.decode(latent_samples)
Now that we have generated positive-class synthetic data, we can combine it with the original training data to produce a balanced synthetic dataset.
new_data = pd.DataFrame(generated_samples.numpy())
# Create column names
col_number = new_data.shape[1]
col_names = [str(i) for i in range(col_number)]
new_data.columns = col_names
X_synthetic = new_data.drop(col_names[-1],axis=1)
y_synthetic = np.asarray([1 for _ in range(0,X_synthetic.shape[0])])
X_synthetic_train = np.concatenate([X_model_train, X_synthetic.values], axis=0)
y_synthetic_train = np.concatenate([y_model_train, y_synthetic], axis=0)
mapping = {1: '>50K', 0: '<=50K'}
map_function = np.vectorize(lambda x: mapping[x])
# Apply mapping
y_mapped = map_function(y_synthetic_train)
plt.figure(figsize=(8, 6))
plt.hist(y_mapped, bins=2, edgecolor="black")
plt.title('Distribution of Income')
plt.xlabel('Income')
plt.ylabel('Frequency')
plt.show()

We can now use our balanced dataset to retrain our random forest classifier. We can then evaluate this new model on the original test set to see how effective our synthetic data has been at reducing the model bias.
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_synthetic_train, y_synthetic_train)
# Make predictions
y_pred = rf_classifier.predict(X_model_test)
cm = confusion_matrix(y_model_test, y_pred)
# Create heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt="d", cmap="YlGnBu", xticklabels=["Negative", "Positive"], yticklabels=["Negative", "Positive"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
Our new classifier, trained on the balanced synthetic dataset, makes fewer errors on the original test set than our original classifier trained on the imbalanced dataset: the overall error rate is reduced to 14%.

However, we have not been able to reduce the class discrepancy by much: our positive-class error rate is still 36%. This could be due to the following reasons:
- We discussed how one of the benefits of VAEs is learning a continuous latent space. However, if the majority class dominates, the latent space may skew towards the majority class.
- The model may not properly learn distinct representations of the minority class due to the lack of data, making it difficult to sample accurately from that region.
In this tutorial we have introduced and built a BasicVAE architecture that can be used to generate synthetic data that improves classification accuracy on an imbalanced dataset.
Follow for future articles where I will show how to build improved VAE architectures that address the problems above with better sampling and more.
[1] Villalobos, P., Ho, A., Sevilla, J., Besiroglu, T., & Heim, L. (2024). Will we run out of data? Limits of LLM scaling based on human-generated data. arXiv preprint arXiv:2211.04325.
[2] Becker, B. & Kohavi, R. (1996). Adult [Dataset]. UCI Machine Learning Repository.



