
Create a Data Science ETL Pipeline in 30 Lines of Python


Image by author | Ideogram

Do you know that feeling when you have data scattered across different formats and sources, and you need to make sense of all of it? That is exactly the problem we solve today. Let's build an ETL pipeline that takes messy data and transforms it into something actually useful.

In this article, I will walk through building a pipeline that processes e-commerce transactions. Nothing fancy, just solid code that gets the job done.

We will extract data from a CSV file (the kind you might download from an e-commerce platform), clean it up, and save it to a proper database for analysis.

🔗 Link to the code on GitHub

What Is an Extract, Transform, Load (ETL) Pipeline?

Every ETL pipeline follows the same pattern. It takes data from one place (extract), cleans it up and makes it better (transform), and puts it somewhere useful (load).

ETL Pipeline | Image by author | diagrams.net (draw.io)

The process begins with the extract phase, where data is retrieved from various source systems such as databases, APIs, files, or streaming platforms. In this phase, the pipeline connects to the sources and pulls the relevant data, handling connections to disparate systems that may operate on different schedules and in different formats.

Next, the transform phase is the core processing stage, where the extracted data undergoes cleaning, validation, and restructuring. This step resolves data quality issues, applies business rules, performs calculations, and converts the data into the required format. Typical transformations include data type conversions, field mapping, aggregation, and the removal of duplicate or invalid records.

Finally, the load phase transfers the transformed data into the target system. This step can happen as a full load, where all existing data is replaced, or as an incremental load, where only new or modified data is added. The loading strategy depends on factors such as data volume, system performance requirements, and business needs.
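The full-load versus incremental-load distinction is easy to sketch with pandas and SQLite. The function below is illustrative and not part of the article's pipeline; the function name, database name, and key column are assumptions:

```python
import sqlite3
import pandas as pd

def incremental_load(df, db_name, table_name="transactions", key="transaction_id"):
    """Append only rows whose key is not already in the target table.

    A minimal incremental-load sketch; a full load would instead use
    df.to_sql(..., if_exists="replace") to overwrite the whole table.
    """
    conn = sqlite3.connect(db_name)
    try:
        try:
            existing = pd.read_sql(f"SELECT {key} FROM {table_name}", conn)
            new_rows = df[~df[key].isin(existing[key])]
        except Exception:
            # Table does not exist yet, so every row counts as new
            new_rows = df
        new_rows.to_sql(table_name, conn, if_exists="append", index=False)
        return len(new_rows)
    finally:
        conn.close()
```

Rerunning an incremental load with overlapping data only appends the genuinely new rows.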

Step 1: Extract

The "Extract" step is where we get our hands on the data. In the real world, you might be downloading a CSV report from an e-commerce platform, pulling from an FTP server, or fetching from an API. Here, we read from a local CSV file.

import pandas as pd

def extract_data_from_csv(csv_file_path):
    try:
        print(f"Extracting data from {csv_file_path}...")
        df = pd.read_csv(csv_file_path)
        print(f"Successfully extracted {len(df)} records")
        return df
    except FileNotFoundError:
        # Fall back to generated sample data if the file is missing
        print(f"Error: {csv_file_path} not found. Creating sample data...")
        csv_file = create_sample_csv_data()
        return pd.read_csv(csv_file)

Now that we have the raw data from its source, it is time to transform it.
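The fallback create_sample_csv_data() referenced above is not shown in the article. A minimal sketch of what such a helper might look like, with column names mirroring those the transform step expects (the sample values themselves are made up):

```python
import pandas as pd

def create_sample_csv_data(csv_file_path="raw_transactions.csv"):
    """Hypothetical helper: write a small sample transactions CSV.

    One row deliberately has a missing email so the transform step's
    dropna(subset=['customer_email']) has something to clean.
    """
    sample = pd.DataFrame({
        "transaction_id": [1, 2, 3],
        "customer_email": ["a@example.com", None, "c@example.com"],
        "price": [19.99, 250.00, 75.50],
        "quantity": [2, 1, 3],
        "transaction_date": ["2024-01-15", "2024-02-20", "2024-03-05"],
    })
    sample.to_csv(csv_file_path, index=False)
    return csv_file_path
```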

Step 2: Transform

This is where we make the data actually useful.

def transform_data(df):
    print("Transforming data...")
    
    df_clean = df.copy()
    
    # Remove records with missing emails
    initial_count = len(df_clean)
    df_clean = df_clean.dropna(subset=['customer_email'])
    removed_count = initial_count - len(df_clean)
    print(f"Removed {removed_count} records with missing emails")
    
    # Calculate derived fields
    df_clean['total_amount'] = df_clean['price'] * df_clean['quantity']
    
    # Extract date components
    df_clean['transaction_date'] = pd.to_datetime(df_clean['transaction_date'])
    df_clean['year'] = df_clean['transaction_date'].dt.year
    df_clean['month'] = df_clean['transaction_date'].dt.month
    df_clean['day_of_week'] = df_clean['transaction_date'].dt.day_name()
    
    # Create customer segments
    df_clean['customer_segment'] = pd.cut(df_clean['total_amount'], 
                                        bins=[0, 50, 200, float('inf')], 
                                        labels=['Low', 'Medium', 'High'])
    
    return df_clean

First, we drop rows with missing emails, because incomplete customer data is useless for most analyses.

Then we calculate total_amount by multiplying price by quantity. This seems obvious, but you would be surprised how often derived fields like this are missing from raw data.

Extracting the date components is genuinely useful. Instead of a single timestamp, we now have separate year, month, and day-of-week columns. That makes it much easier to analyze patterns like "do we sell more on weekends?"

The customer segmentation using pd.cut() is especially handy. It buckets spending amounts into tiers. Now, instead of just a purchase amount, we have segments that mean something to the business.
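As a quick illustration of how pd.cut() assigns segments, here are three sample amounts run through the same bins and labels the pipeline uses:

```python
import pandas as pd

# Same bins and labels as the pipeline's customer segmentation
amounts = pd.Series([25.0, 120.0, 500.0])
segments = pd.cut(amounts,
                  bins=[0, 50, 200, float("inf")],
                  labels=["Low", "Medium", "High"])
print(list(segments))  # ['Low', 'Medium', 'High']
```

Note that pd.cut() treats bins as right-inclusive by default, so an amount of exactly 50 still lands in "Low".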

Step 3: Load

In a real project, you might load into a data warehouse, send to an API, or push to cloud storage.

Here, we load our clean data into an SQLite database.

import sqlite3

def load_data_to_sqlite(df, db_name="ecommerce_data.db", table_name="transactions"):
    print(f"Loading data to SQLite database '{db_name}'...")
    
    conn = sqlite3.connect(db_name)
    
    try:
        # Replace the table on each run so the pipeline is rerunnable
        df.to_sql(table_name, conn, if_exists="replace", index=False)
        
        # Verify the load by counting the rows that landed
        cursor = conn.cursor()
        cursor.execute(f"SELECT COUNT(*) FROM {table_name}")
        record_count = cursor.fetchone()[0]
        
        print(f"Successfully loaded {record_count} records to '{table_name}' table")
        
        return f"Data successfully loaded to {db_name}"
        
    finally:
        conn.close()

Analysts can now run SQL queries, connect BI tools, and put this data to work for decision-making.

SQLite works well here because it requires no server setup and creates a single file you can easily share or back up. The if_exists="replace" parameter means you can rerun this pipeline as many times as you like without worrying about duplicate data.
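For example, once the pipeline has run, an analyst could compute monthly revenue straight from the SQLite file with pd.read_sql. This helper is illustrative and not part of the pipeline; it assumes the database and table names used above:

```python
import sqlite3
import pandas as pd

def monthly_revenue(db_name="ecommerce_data.db", table_name="transactions"):
    """Example analysis query: revenue per year/month from the loaded table."""
    conn = sqlite3.connect(db_name)
    try:
        return pd.read_sql(
            f"SELECT year, month, SUM(total_amount) AS revenue "
            f"FROM {table_name} GROUP BY year, month ORDER BY year, month",
            conn,
        )
    finally:
        conn.close()
```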

We also added a verification step after the load. Nothing is worse than thinking your data is safely stored, only to find an empty table later.

Running the ETL Pipeline

This function ties the extract, transform, and load steps together.

def run_etl_pipeline():
    print("Starting ETL Pipeline...")
    
    # Extract
    raw_data = extract_data_from_csv('raw_transactions.csv')
    
    # Transform  
    transformed_data = transform_data(raw_data)
    
    # Load
    load_result = load_data_to_sqlite(transformed_data)
    
    print("ETL Pipeline completed successfully!")
    
    return transformed_data

Notice how this brings everything together. Extract, transform, load, done. You can run it and immediately inspect your processed data.

You can find the complete code on GitHub.

Wrapping Up

This pipeline takes raw purchase data and turns it into something an analyst or data scientist can actually work with. You have clean records, calculated fields, and meaningful segments.

Each function does one thing well, and you can easily modify or extend any part without breaking the others.

Now try running it yourself. Then try adapting it to a different dataset or use case. Happy coding!

Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she is learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. She also creates engaging resource overviews and coding tutorials.

