How to Use Synthetic Data for a Portfolio Project


Image by author | Canva
Introduction
Finding real-world data can be a challenge because it is often private (access-restricted) or incomplete (missing features).
Generating synthetic data fixes this. You can control the size, complexity, and characteristics of the data so that it matches your requirements.
In this article, we will explore methods of generating synthetic data. Then we will build a portfolio project by exploring the data, training a machine learning model, and using AI to develop a complete Streamlit application.
How to Generate Synthetic Data
Synthetic data is typically generated randomly, with rules, through simulation, or with AI.
// Method 1: Random Data Generation
In random data generation, we use simple functions to create values without any underlying rules.
This is useful for testing, but it will not capture realistic relationships between features. We will use NumPy's random module and create a pandas DataFrame.
import numpy as np
import pandas as pd
np.random.seed(42)
df_random = pd.DataFrame({
    "feature_a": np.random.randint(1, 100, 5),
    "feature_b": np.random.rand(5),
    "feature_c": np.random.choice(["X", "Y", "Z"], 5)
})
df_random.head()
Here's the output.
// Method 2: Rule-Based Data Generation
Rule-based data generation is a more structured and more realistic approach than purely random generation. It follows an explicit formula or a set of rules, which makes the output predictable and consistent.
In our example, the size of a house is directly tied to its price. To show this clearly, we will build a dataset with just size and price, and define the relationship with a formula:
price = size × 300 + ε (random noise)
This way, you can see the relationship while keeping the data logical.
np.random.seed(42)
n = 5
size = np.random.randint(500, 3500, n)
price = size * 300 + np.random.randint(5000, 20000, n)
df_rule = pd.DataFrame({
    "size_sqft": size,
    "price_usd": price
})
df_rule.head()
Here's the output.
// Method 3: Simulation-Based Data Generation
Simulation-based data generation combines random variation with rules drawn from the real world. The mix produces datasets that behave like real ones.
What do we know about house prices?
- Larger homes usually cost more
- Some cities are pricier than others
- There is a base price
How do we create the dataset?
- Pick a city at random
- Draw a home size
- Set bedrooms between 1 and 5
- Compute the price with a clear rule
The price rule: we start with a base price, apply a city bump, and add size × rate.
price_usd = base_price × city_bump + sqft × rate
Here is the code.
import numpy as np
import pandas as pd
rng = np.random.default_rng(42)
CITIES = ["los_angeles", "san_francisco", "san_diego"]
# City price bump: higher means pricier city
CITY_BUMP = {"los_angeles": 1.10, "san_francisco": 1.35, "san_diego": 1.00}
def make_data(n_rows=10):
    city = rng.choice(CITIES, size=n_rows)
    # Most homes are near 1,500 sqft, some smaller or larger
    sqft = rng.normal(1500, 600, n_rows).clip(350, 4500).round()
    beds = rng.integers(1, 6, n_rows)
    base = 220_000
    rate = 350  # dollars per sqft
    bump = np.array([CITY_BUMP[c] for c in city])
    price = base * bump + sqft * rate
    return pd.DataFrame({
        "city": city,
        "sqft": sqft.astype(int),
        "beds": beds,
        "price_usd": price.round(0).astype(int),
    })
df = make_data()
df.head()
Here's the output.
// Method 4: AI-Powered Data Generation
To have AI generate your dataset, you need to be very clear about what you want. AI is powerful, but it works best when you give it simple, explicit rules.
In the prompt, we will include:
- Context: What is the data about?
- Features: What columns do we want?
  - city, neighborhood, sqft, bedrooms, bathrooms
- Relationships: How do the features connect?
  - Price depends on the city, sqft, bedrooms, and crime index
- Format: How should the AI respond?
Here's the prompt.
Generate Python code that creates a synthetic California housing dataset.
The dataset should contain 10,000 rows with the columns: city, neighborhood, latitude, longitude, sqft, bedrooms, bathrooms, lot_sqft, year_built, property_type, has_garage, condition, school_score, crime_index, dist_km_center, and price_usd.
Cities: Los Angeles, San Francisco, San Diego, San Jose, Sacramento.
The price should depend on a city premium, bedrooms, bathrooms, size, school score, crime index, and distance from the city center.
Add random noise, missing values, and a few outliers.
Return the result as a pandas DataFrame and save it to 'ca_housing_synth.csv'.
Let's use this prompt with ChatGPT.
It returns the dataset as a CSV file. Here is the process showing how ChatGPT generated it.
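ChatGPT's actual script is not reproduced here, but it follows a recognizable shape. The condensed sketch below is a hypothetical reconstruction: the coefficients, city premiums, coordinates, and neighborhood names are illustrative assumptions standing in for whatever the model chooses on a given run.

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000
CENTERS = {"Los Angeles": (34.05, -118.24), "San Francisco": (37.77, -122.42),
           "San Diego": (32.72, -117.16), "San Jose": (37.34, -121.89),
           "Sacramento": (38.58, -121.49)}
PREMIUM = {"Los Angeles": 1.25, "San Francisco": 1.60, "San Diego": 1.15,
           "San Jose": 1.40, "Sacramento": 1.00}  # illustrative city premiums

city = rng.choice(list(CENTERS), size=n)
clat = np.array([CENTERS[c][0] for c in city])
clon = np.array([CENTERS[c][1] for c in city])
lat = clat + rng.normal(0, 0.15, n)
lon = clon + rng.normal(0, 0.15, n)
dist_km_center = np.hypot((lat - clat) * 111, (lon - clon) * 92)  # rough degrees-to-km

df = pd.DataFrame({
    "city": city,
    "neighborhood": rng.choice([f"nbhd_{i}" for i in range(20)], size=n),
    "latitude": lat.round(5), "longitude": lon.round(5),
    "sqft": rng.normal(1600, 650, n).clip(350, 6000).round(),
    "bedrooms": rng.integers(1, 6, n),
    "bathrooms": rng.integers(1, 5, n),
    "lot_sqft": rng.normal(6000, 2500, n).clip(1000, 30000).round(),
    "year_built": rng.integers(1900, 2025, n),
    "property_type": rng.choice(["single_family", "condo", "townhouse"], size=n),
    "has_garage": rng.choice(["yes", "no"], size=n),
    "condition": rng.choice(["poor", "fair", "good", "excellent"], size=n),
    "school_score": rng.uniform(0, 10, n).round(1),
    "crime_index": rng.uniform(0, 10, n).round(1),
    "dist_km_center": dist_km_center.round(2),
})

# Price rule: city premium + size + rooms + lot + context - distance, plus noise
base = 150_000 * df["city"].map(PREMIUM)
df["price_usd"] = (base + df["sqft"] * 320 + df["bedrooms"] * 12_000
                   + df["bathrooms"] * 9_000 + df["lot_sqft"] * 4
                   + df["school_score"] * 8_000 - df["crime_index"] * 6_000
                   - df["dist_km_center"] * 2_500
                   + rng.normal(0, 40_000, n)).clip(50_000).round()

# Missing values in a couple of feature columns, and a few price outliers
for col in ["lot_sqft", "school_score"]:
    df.loc[rng.choice(n, 300, replace=False), col] = np.nan
out = rng.choice(n, 50, replace=False)
df.loc[out, "price_usd"] = df.loc[out, "price_usd"] * 3

df.to_csv("ca_housing_synth.csv", index=False)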
This is the most complex dataset we have created so far. Let's look at its first few rows.
Creating a Portfolio Project from Synthetic Data
We have used four different methods to generate synthetic data. Now we will use the AI-generated dataset to build a portfolio project.
First, we will explore the data and build a machine learning model. Next, we will visualize the results with an AI-built dashboard, and in the final step, we will see what it takes to deploy the model to production.
// Step 1: Exploring and Understanding the Dataset
We start exploring the data by reading it with pandas and displaying the first few rows.
df = pd.read_csv("ca_housing_synth.csv")
df.head()
Here's the output.
The dataset combines location fields (city, neighborhood, latitude, longitude) with physical details (size, rooms, year built) and context indicators (school score, crime index).
We have 15 feature columns plus the price_usd target, and some of them, like has_garage or dist_km_center, are engineered features.
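A few quick sanity checks are worth running before modeling, to confirm the generated data matches the prompt. Here is a minimal pass, assuming the column names listed in the prompt:

print(df.shape)            # expect (10000, 16)
print(df.dtypes)           # categorical vs. numeric columns
print(df.isna().sum())     # where the generator injected missing values
print(df["price_usd"].describe())                # sanity-check the target's range
print(df.groupby("city")["price_usd"].median())  # the city premium should be visible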
// Step 2: Building the Model
The next step is to create a machine learning model that predicts home prices.
We will follow these steps: define the numeric and categorical columns, split the data, build preprocessing pipelines for both column types, train a random forest regressor, evaluate it with MAE, RMSE, and R², and plot actual vs. predicted prices.
Here is the code.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.inspection import permutation_importance
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# --- Step 1: Define columns based on the generated dataset
num_cols = ["sqft", "bedrooms", "bathrooms", "lot_sqft", "year_built",
            "school_score", "crime_index", "dist_km_center", "latitude", "longitude"]
cat_cols = ["city", "neighborhood", "property_type", "condition", "has_garage"]
# --- Step 2: Split the data
X = df.drop(columns=["price_usd"])
y = df["price_usd"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# --- Step 3: Preprocessing pipelines
num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])
cat_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])
preprocessor = ColumnTransformer([
    ("num", num_pipe, num_cols),
    ("cat", cat_pipe, cat_cols)
])
# --- Step 4: Model
model = RandomForestRegressor(n_estimators=300, random_state=42, n_jobs=-1)
pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", model)
])
# --- Step 5: Train
pipeline.fit(X_train, y_train)
# --- Step 6: Evaluate
y_pred = pipeline.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # works across scikit-learn versions
r2 = r2_score(y_test, y_pred)
print(f"MAE: {mae:,.0f}")
print(f"RMSE: {rmse:,.0f}")
print(f"R²: {r2:.3f}")
# --- Step 7: (Optional) Permutation importance on a subset for speed;
# results land in pi.importances_mean, one score per input column
pi = permutation_importance(
    pipeline, X_test.iloc[:1000], y_test.iloc[:1000],
    n_repeats=3, random_state=42, scoring="r2"
)
# --- Step 8: Plot Actual vs Predicted
plt.figure(figsize=(6, 5))
plt.scatter(y_test, y_pred, alpha=0.25)
vmin, vmax = min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())
plt.plot([vmin, vmax], [vmin, vmax], linestyle="--", color="red")
plt.xlabel("Actual Price (USD)")
plt.ylabel("Predicted Price (USD)")
plt.title(f"Actual vs Predicted (MAE={mae:,.0f}, RMSE={rmse:,.0f}, R²={r2:.3f})")
plt.tight_layout()
plt.show()
Here's the output.
Evaluating the model:
- MAE (85,877 USD): on average, predictions are off by about $86K, which is reasonable given the price variability in the housing data
- RMSE (113,512 USD): large errors are penalized more heavily; the RMSE confirms the model keeps them under control
- R² (0.853): the model explains ~85% of the variance in home prices, showing strong predictive power on the synthetic data
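Two follow-ups are worth doing here: inspect the permutation importances computed in step 7, and save the trained pipeline so the dashboard in the next step can load it (the dashboard prompt below assumes a real_estate_model.pkl file). A minimal sketch, assuming joblib is available:

import joblib

# Permutation importance was computed on the raw input columns,
# so the scores line up with X_test.columns.
importances = pd.Series(pi.importances_mean, index=X_test.columns)
print(importances.sort_values(ascending=False).head(10))

# Persist the full preprocessing + model pipeline for the Streamlit app.
joblib.dump(pipeline, "real_estate_model.pkl")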
// Step 3: Building an Interactive Dashboard with Streamlit
In this step, we will present our work, including the EDA and model building, in a Streamlit dashboard. Why Streamlit? Because you can build a Streamlit dashboard quickly, and others can easily open it and interact with it.
Using Gemini CLI
To create the Streamlit app, we will use Gemini CLI.
Gemini CLI is an open-source coding agent. You can write code and build apps with it, and it is straightforward and free.
To install it, run the following command in your terminal.
npm install -g @google/gemini-cli
After installation, run this command to start it.
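gemini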
It will ask you to log in with your Google account, and then you will see the screen where we will build the Streamlit app.
Creating the Dashboard
To create the dashboard, we need to write a prompt tailored to our specific data and task. The prompt below describes everything the AI needs to build the Streamlit dashboard.
Build a Streamlit app for the California Real Estate dataset by using this dataset ( path-to-dataset )
Here is the dataset information:
• Domain: California housing — Los Angeles, San Francisco, San Diego, San Jose, Sacramento.
• Location: city, neighborhood, lat, lon, and dist_km_center (haversine to city center).
• Home features: sqft, beds, baths, lot_sqft, year_built, property_type, has_garage, condition.
• Context: school_score, crime_index.
• Target: price_usd.
• Price logic: city premium + size + rooms + lot size + school/crime + distance to center + property type + condition + noise.
• Files you have: ca_housing_synth.csv (data) and real_estate_model.pkl (trained pipeline).
The Streamlit app should have:
• A short dataset overview section (shape, column list, small preview).
• Sidebar inputs for every model feature except the target:
- Categorical dropdowns: city, neighborhood, property_type, condition, has_garage.
- Numeric inputs/sliders: lat, lon, sqft, beds, baths, lot_sqft, year_built, school_score, crime_index.
- Auto-compute dist_km_center from the chosen city using the haversine formula and that city’s center.
• A Predict button that:
- Builds a one-row DataFrame with the exact training columns (order-safe).
- Calls pipeline.predict(...) from real_estate_model.pkl.
- Displays Estimated Price (USD) with thousands separators.
• One chart only: What-if: sqft vs price line chart (all other inputs fixed to the sidebar values).
• Quality of life: cache model load, basic input validation, clear labels/tooltips, English UI.
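For reference, here is a condensed sketch of what the core of the generated app.py can look like. This is a hypothetical reconstruction, not Gemini's actual output: the city-center coordinates, category lists, and default values are illustrative assumptions, and it assumes the pipeline was saved with joblib as real_estate_model.pkl.

import joblib
import numpy as np
import pandas as pd
import streamlit as st

# Assumed city-center coordinates for the haversine feature.
CITY_CENTERS = {
    "Los Angeles": (34.05, -118.24), "San Francisco": (37.77, -122.42),
    "San Diego": (32.72, -117.16), "San Jose": (37.34, -121.89),
    "Sacramento": (38.58, -121.49),
}

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance in km between two (lat, lon) points.
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

@st.cache_resource  # load the model once per session
def load_model():
    return joblib.load("real_estate_model.pkl")

pipe = load_model()
st.title("California Housing Price Predictor")

sb = st.sidebar
city = sb.selectbox("City", list(CITY_CENTERS))
neighborhood = sb.text_input("Neighborhood", "Downtown")  # encoder ignores unseen values
property_type = sb.selectbox("Property type", ["single_family", "condo", "townhouse"])  # assumed categories
condition = sb.selectbox("Condition", ["poor", "fair", "good", "excellent"])  # assumed categories
has_garage = sb.selectbox("Has garage", ["yes", "no"])  # assumed encoding
lat0, lon0 = CITY_CENTERS[city]
lat = sb.number_input("Latitude", value=lat0, format="%.4f")
lon = sb.number_input("Longitude", value=lon0, format="%.4f")
sqft = sb.slider("Sqft", 350, 4500, 1500)
beds = sb.slider("Bedrooms", 1, 5, 3)
baths = sb.slider("Bathrooms", 1, 4, 2)
lot_sqft = sb.slider("Lot sqft", 1000, 20000, 5000)
year_built = sb.slider("Year built", 1900, 2024, 1990)
school_score = sb.slider("School score", 0.0, 10.0, 7.0)
crime_index = sb.slider("Crime index", 0.0, 10.0, 3.0)

if sb.button("Predict Price"):
    # One-row DataFrame with the exact training columns.
    row = pd.DataFrame([{
        "city": city, "neighborhood": neighborhood, "latitude": lat, "longitude": lon,
        "sqft": sqft, "bedrooms": beds, "bathrooms": baths, "lot_sqft": lot_sqft,
        "year_built": year_built, "property_type": property_type, "has_garage": has_garage,
        "condition": condition, "school_score": school_score, "crime_index": crime_index,
        "dist_km_center": haversine_km(lat, lon, lat0, lon0),  # auto-computed from the chosen city
    }])
    st.metric("Estimated Price (USD)", f"${pipe.predict(row)[0]:,.0f}")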
Next, Gemini will ask for your permission to create the file.
Allow it to continue. Once it finishes writing the code, it will automatically launch the Streamlit dashboard.
If it doesn't, go to the directory that contains the app.py file and run:
streamlit run app.py
This starts the Streamlit app.
Here is our finished dashboard.
If you click on the dataset overview, you can see a preview of the data.
Using the property features panel on the left, we can configure a property and predict its price. This part of the dashboard mirrors what we did while building the model, but interactively.
Let's choose Richmond, San Francisco, a single-family home in the best condition, 1,500 sqft, and click the "Predict Price" button:
The predicted price is $1.24M. You can also see the actual vs. predicted chart when you scroll down the app.
You can change many features in the left panel, such as the year built, the crime index, or the number of bathrooms.
// Step 4: Deploying the Model
The next step is deploying your model to production. To do that, you can follow these steps:
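In outline (assuming Streamlit Community Cloud, which the final thoughts mention): push app.py, ca_housing_synth.csv, real_estate_model.pkl, and a requirements.txt to a public GitHub repository, then create a new app at share.streamlit.io and point it at app.py. A minimal requirements.txt for this project might look like:

streamlit
pandas
numpy
scikit-learn
joblib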
Final Thoughts
In this article, we explored different methods of generating synthetic data: random, rule-based, simulation-based, and AI-powered. Next, we created a portfolio project, starting with data exploration and moving on to building a machine learning model.
We also used an open-source coding agent (Gemini CLI) to develop a dashboard that explores the dataset and predicts house prices based on selected features, including the number of bedrooms, the crime index, and the square footage.
Generating your own synthetic data lets you avoid privacy barriers, scale your examples, and move quickly without expensive data collection. The trade-off is that it can reflect your own assumptions and miss real-world quirks. If you want more inspiration, check out this collection of machine learning projects you can adapt for your portfolio.
Finally, we looked at how you can deploy your model to production using Streamlit Community Cloud. Keep following these steps and showcase your own portfolio project today!
Nate Rosidi is a data scientist and works in product strategy. He is also an educator and the founder of StrataScratch, a platform that helps data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.