Time-Series Feature Engineering with Python Itertools

# Introduction

Time series feature engineering does not follow the same rules as tabular data. Observations are not independent, row order is not arbitrary, and the most useful features are rarely read from a single row. You need to capture patterns over time such as rates of change, lag comparisons, deviations from a baseline, and more.

Building lags, sliding windows, and aggregations are all, at their core, iteration problems over an ordered sequence. Python's itertools module is naturally suited to this type of work. It is not a substitute for higher-level pandas abstractions like .rolling(), but it gives you the low-level building blocks to construct the features you need, with full control over the logic.

In this article, you will create seven categories of time series features using itertools. You will use each on a sample dataset.

You can find the code on GitHub.

# Creating a Sample Data Set

Before we start building features, let's create the sample sensor dataset that we'll be working with throughout the article.

import numpy as np
import pandas as pd
import itertools

np.random.seed(42)

periods = 168  # one week of hourly readings
index = pd.date_range(start="2024-03-01", periods=periods, freq="h")
hours = np.arange(periods)

# Temperature (°C): daily cycle + gradual drift + noise
temp_base = 3.5
temp_daily = 1.2 * np.sin(2 * np.pi * hours / 24)
temp_drift = 0.003 * hours
temp_noise = np.random.normal(0, 0.3, periods)
temperature = temp_base + temp_daily + temp_drift + temp_noise

# Humidity (%): inverse relationship with temperature + noise
humidity = 78 - 2.1 * (temperature - temp_base) + np.random.normal(0, 1.2, periods)

# Power draw (kW): peaks during business hours, higher on weekdays
day_of_week = index.dayofweek
business_hours = ((index.hour >= 8) & (index.hour <= 18)).astype(int)
weekend_factor = np.where(day_of_week >= 5, 0.6, 1.0)
power = (
    42.0
    + 18.0 * business_hours * weekend_factor
    + np.random.normal(0, 2.1, periods)
)

df = pd.DataFrame({
    "temperature_c": np.round(temperature, 3),
    "humidity_pct":  np.round(humidity, 2),
    "power_kw":      np.round(power, 2),
}, index=index)
df.index.name = "timestamp"

print(df.head(8))
print(f"\nShape: {df.shape}")

Output:

                     temperature_c  humidity_pct  power_kw
timestamp
2024-03-01 00:00:00          3.649         77.39     40.27
2024-03-01 01:00:00          3.772         76.52     41.33
2024-03-01 02:00:00          4.300         75.25     42.87
2024-03-01 03:00:00          4.814         74.26     40.82
2024-03-01 04:00:00          4.481         75.85     40.27
2024-03-01 05:00:00          4.604         76.09     42.51
2024-03-01 06:00:00          5.192         74.78     42.51
2024-03-01 07:00:00          4.910         76.03     40.94

Shape: (168, 3)

We now have 168 hours of readings from three sensors: temperature, humidity, and power draw. Let's build the features.

# 1. Generating Lag Features with `islice`

Lag features are the most fundamental time series feature: the value of the series a fixed number of steps in the past. For example, the values 1 step, 6 steps, or 24 steps back can each capture different patterns such as short-term fluctuations, recurring behavior within a day, and longer-term or seasonal trends.

Let's build the lag features for our sample dataset using islice:

sensor_readings = df["temperature_c"].tolist()
lag_offsets = [1, 6, 12, 24]

lag_features = {}
for lag in lag_offsets:
    lagged = list(itertools.islice(sensor_readings, 0, len(sensor_readings) - lag))
    # Pad the beginning with None to preserve index alignment
    lag_features[f"temp_lag_{lag}h"] = [None] * lag + lagged

lag_df = pd.DataFrame(lag_features, index=df.index)
lag_df["temperature_c"] = df["temperature_c"]

print(lag_df.iloc[24:30])

Output:

                     temp_lag_1h  temp_lag_6h  temp_lag_12h  temp_lag_24h  
timestamp
2024-03-02 00:00:00        2.831        2.082         3.609         3.649
2024-03-02 01:00:00        3.409        1.974         2.654         3.772
2024-03-02 02:00:00        3.919        2.960         2.425         4.300
2024-03-02 03:00:00        3.833        2.647         2.528         4.814
2024-03-02 04:00:00        4.542        2.986         2.205         4.481
2024-03-02 05:00:00        4.443        2.831         2.486         4.604

                     temperature_c
timestamp
2024-03-02 00:00:00          3.409
2024-03-02 01:00:00          3.919
2024-03-02 02:00:00          3.833
2024-03-02 03:00:00          4.542
2024-03-02 04:00:00          4.443
2024-03-02 05:00:00          4.659

islice(sensor_readings, 0, len(sensor_readings) - lag) lazily yields the readings shifted back by lag steps, without first copying the full list. The None padding at the front keeps every lag column aligned with the original index. This is important if you later drop NaNs for model training.
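
As a variation on the same idea, the shift can be kept lazy from end to end. The sketch below is an illustration, not the approach used above: `lagged_pairs` is a hypothetical helper that pads with `itertools.repeat`, joins the padding to the series with `itertools.chain`, and pairs each reading with its lagged value via `zip`. It assumes `series` is a re-iterable sequence such as a list, since the series is walked twice.

```python
import itertools

def lagged_pairs(series, lag):
    """Yield (current, lagged) pairs; lagged is None for the first `lag` steps.

    Hypothetical helper for illustration. Assumes `series` is a sequence
    (for example a list), not a one-shot iterator.
    """
    padded = itertools.chain(itertools.repeat(None, lag), series)
    # zip stops at the shorter input, so the trailing `lag` values are dropped
    return zip(series, padded)

readings = [3.6, 3.8, 4.3, 4.8, 4.5]
pairs = list(lagged_pairs(readings, lag=2))
print(pairs)
```

Each tuple holds the current reading and the reading from `lag` steps earlier, so a difference feature is a single subtraction per pair once the `None` rows are skipped.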

# 2. Building Rolling Window Features with `islice` and `accumulate`

A single lag value tells you what the sensor read at one point in the past. Rolling statistics tell you how the sensor has been behaving over a period of time, which is often more useful.

readings = df["temperature_c"].tolist()
window_size = 6  # 6-hour rolling window

rolling_features = []

for i in range(len(readings)):
    if i < window_size:
        rolling_features.append({
            "rolling_mean_6h": None,
            "rolling_std_6h":  None,
            "rolling_min_6h":  None,
            "rolling_max_6h":  None,
        })
        continue

    window = list(itertools.islice(readings, i - window_size, i))

    # Use accumulate to compute running sum for mean
    running_sum = list(itertools.accumulate(window))
    window_mean = running_sum[-1] / window_size
    window_mean_sq = sum(x**2 for x in window) / window_size

    rolling_features.append({
        "rolling_mean_6h": round(window_mean, 4),
        "rolling_std_6h":  round((window_mean_sq - window_mean**2) ** 0.5, 4),
        "rolling_min_6h":  round(min(window), 4),
        "rolling_max_6h":  round(max(window), 4),
    })

roll_df = pd.DataFrame(rolling_features, index=df.index)
roll_df["temperature_c"] = df["temperature_c"]

print(roll_df.iloc[6:12])

Output:

                     rolling_mean_6h  rolling_std_6h  rolling_min_6h  
timestamp
2024-03-01 06:00:00           4.2700          0.4256           3.649
2024-03-01 07:00:00           4.5272          0.4386           3.772
2024-03-01 08:00:00           4.7168          0.2929           4.300
2024-03-01 09:00:00           4.7372          0.2662           4.422
2024-03-01 10:00:00           4.6912          0.2728           4.422
2024-03-01 11:00:00           4.6095          0.3769           3.991

                     rolling_max_6h  temperature_c
timestamp
2024-03-01 06:00:00           4.814          5.192
2024-03-01 07:00:00           5.192          4.910
2024-03-01 08:00:00           5.192          4.422
2024-03-01 09:00:00           5.192          4.538
2024-03-01 10:00:00           5.192          3.991
2024-03-01 11:00:00           5.192          3.704

The accumulate call here computes the running total of the window, so the window sum is available as running_sum[-1] without calling sum() separately. For large datasets processed in a streaming manner, avoiding redundant passes over the same data pays off.
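
When only the window sum or mean is needed, `accumulate` can also be applied once to the whole series instead of once per window. The sketch below is an illustrative alternative, not the code used above: it builds prefix sums and subtracts two of them to get each window total. `rolling_mean_prefix` is a hypothetical helper, it relies on the `initial` argument of `accumulate` (Python 3.8+), and it keeps the same convention as the loop above, where the window covers the `window_size` readings before the current one.

```python
import itertools

def rolling_mean_prefix(series, window_size):
    """Rolling mean of the `window_size` readings before each position."""
    # prefix[k] is the sum of the first k readings, with prefix[0] == 0
    prefix = list(itertools.accumulate(series, initial=0))
    means = []
    for i in range(len(series)):
        if i < window_size:
            means.append(None)  # not enough history yet
            continue
        window_sum = prefix[i] - prefix[i - window_size]
        means.append(window_sum / window_size)
    return means

readings = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(rolling_mean_prefix(readings, window_size=3))
# [None, None, None, 2.0, 3.0, 4.0]
```

The trade-off is numerical: on very long series, differences of large prefix sums can lose a little floating-point precision compared with summing each window directly.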

# 3. Creating Seasonal Interaction Features with `product`

Many time series show layered seasonality, where several temporal cycles interact, such as hour of day, day of week, and operational shift. Interaction features that combine these dimensions can capture patterns that individual time components miss.

Now let's build the interaction features with product:

hours_of_day = list(range(24))
day_types = ["weekday", "weekend"]
operational_shifts = ["off_peak", "on_peak"]  # on_peak: 08:00–18:00

# Build a full lookup grid for all combinations
season_grid = list(itertools.product(hours_of_day, day_types, operational_shifts))
season_df = pd.DataFrame(season_grid, columns=["hour", "day_type", "shift"])

# Simulate expected baseline temperature per combination
np.random.seed(14)
season_df["baseline_temp_c"] = np.round(
    3.5
    + 0.8 * np.sin(2 * np.pi * season_df["hour"] / 24)
    + np.where(season_df["day_type"] == "weekend", 0.3, 0.0)
    + np.where(season_df["shift"] == "on_peak", 0.5, 0.0)
    + np.random.normal(0, 0.1, len(season_df)),
    3
)

print(season_df[season_df["hour"].isin([0, 8, 14, 20])].head(16).to_string(index=False))
print(f"\nTotal grid combinations: {len(season_df)}")

Output:

hour day_type    shift  baseline_temp_c
   0  weekday off_peak            3.655
   0  weekday  on_peak            4.008
   0  weekend off_peak            3.817
   0  weekend  on_peak            4.293
   8  weekday off_peak            4.325
   8  weekday  on_peak            4.601
   8  weekend off_peak            4.446
   8  weekend  on_peak            4.978
  14  weekday off_peak            3.370
  14  weekday  on_peak            3.628
  14  weekend off_peak            3.279
  14  weekend  on_peak            3.959
  20  weekday off_peak            2.726
  20  weekday  on_peak            3.256
  20  weekend off_peak            3.056
  20  weekend  on_peak            3.530

Total grid combinations: 96

This grid joins back to your main dataset as a baseline_temp_c feature per row, giving every reading its expected value in context. The deviation from that baseline, temperature_c - baseline_temp_c, is then a useful feature for anomaly detection.
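
The join and the deviation can be sketched with a plain dictionary lookup. This is a minimal illustration under stated assumptions, not the article's code: the `baseline` values are a hypothetical, noise-free version of the formula above, `deviation_from_baseline` is an invented helper name, and the weekend and on-peak rules mirror the ones used when the sample data was created (day of week 5 or 6, and 08:00 to 18:00).

```python
import itertools
import math

hours_of_day = range(24)
day_types = ["weekday", "weekend"]
shifts = ["off_peak", "on_peak"]

# Hypothetical, noise-free baseline for every (hour, day_type, shift) key
baseline = {
    (hour, day_type, shift): 3.5
    + 0.8 * math.sin(2 * math.pi * hour / 24)
    + (0.3 if day_type == "weekend" else 0.0)
    + (0.5 if shift == "on_peak" else 0.0)
    for hour, day_type, shift in itertools.product(hours_of_day, day_types, shifts)
}

def deviation_from_baseline(temperature_c, hour, weekday):
    """Reading minus its expected value; weekday uses Monday=0 ... Sunday=6."""
    day_type = "weekend" if weekday >= 5 else "weekday"
    shift = "on_peak" if 8 <= hour <= 18 else "off_peak"
    return temperature_c - baseline[(hour, day_type, shift)]

# A 4.9 °C reading at 09:00 on a Friday (weekday == 4)
print(round(deviation_from_baseline(4.9, hour=9, weekday=4), 3))
```

Because the shift is derived from the hour, only the grid rows that can actually occur are looked up; the remaining combinations stay in the grid unused.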

# 4. Computing Sliding Window Statistics with `tee`

Sometimes you need to process the same sequence through multiple statistical lenses at once, for example variability and rate of change, without iterating over the source multiple times. itertools.tee creates independent iterators from a single source, which is exactly what you need.

def sliding_window_stats(series, window_size):
    """Compute mean, range and rate-of-change over sliding windows using tee."""
    results = []
    it = iter(series)

    window = list(itertools.islice(it, window_size))
    if len(window) < window_size:
        return results

    results.append({
        "window_mean":    round(sum(window) / window_size, 4),
        "window_range":   round(max(window) - min(window), 4),
        "rate_of_change": round(window[-1] - window[0], 4),
    })

    for next_val in it:
        window = window[1:] + [next_val]

        # tee creates two independent iterators over the same window
        iter_a, iter_b = itertools.tee(iter(window))

        values_a = list(iter_a)
        values_b = list(iter_b)

        mean_val = sum(values_a) / window_size
        results.append({
            "window_mean":    round(mean_val, 4),
            "window_range":   round(max(values_b) - min(values_b), 4),
            "rate_of_change": round(window[-1] - window[0], 4),
        })

    return results

power_readings = df["power_kw"].tolist()
stats = sliding_window_stats(power_readings, window_size=8)

stats_df = pd.DataFrame(stats, index=df.index[7:])
stats_df["power_kw"] = df["power_kw"].iloc[7:].values

print(stats_df.iloc[0:8])

Output:

                     window_mean  window_range  rate_of_change  power_kw
timestamp
2024-03-01 07:00:00      41.4400          2.60            0.67     40.94
2024-03-01 08:00:00      43.7825         18.74           17.68     59.01
2024-03-01 09:00:00      46.1775         20.22           17.62     60.49
2024-03-01 10:00:00      47.9387         20.22           16.14     56.96
2024-03-01 11:00:00      49.9663         20.22           16.77     57.04
2024-03-01 12:00:00      52.2437         19.55           15.98     58.49
2024-03-01 13:00:00      54.3738         19.55           17.04     59.55
2024-03-01 14:00:00      56.6412         19.71           19.71     60.65

As shown, tee lets you hand the same window iterator to two separate consumers without re-iterating or copying the list yourself.
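
tee is most useful when the source is a one-shot iterator that cannot be rewound. The sketch below is an illustration rather than part of the pipeline above: `step_changes` is a hypothetical helper that duplicates a stream with `tee`, advances one copy by a single step, and zips the two copies to get the change between consecutive readings.

```python
import itertools

def step_changes(stream):
    """Yield the change between consecutive readings of any iterable."""
    current, following = itertools.tee(stream)
    next(following, None)  # advance the second copy by one step
    for prev, nxt in zip(current, following):
        yield nxt - prev

power = iter([40, 41, 43, 40, 59])  # one-shot iterator, e.g. a live feed
print(list(step_changes(power)))
# [1, 2, -3, 19]
```

The two copies stay one item apart, so tee only ever buffers a single reading here. On Python 3.10 and later, `itertools.pairwise` provides the same consecutive pairing directly.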

# 5. Combining Multi-Resolution Time Features with `chain`

Useful time series features often exist at multiple temporal resolutions: raw hourly readings, 6-hour averages, 24-hour averages, and calendar features such as the hour of the day. These usually live in separate arrays and need to be combined into a single clean feature set. Here's how you can use chain to assemble such features:

humidity = df["humidity_pct"].tolist()

def rolling_means(series, window):
    means = []
    for i in range(len(series)):
        if i < window:
            means.append(None)
        else:
            w = list(itertools.islice(series, i - window, i))
            means.append(round(sum(w) / window, 3))
    return means

rolling_6h       = rolling_means(humidity, 6)
rolling_24h      = rolling_means(humidity, 24)
hour_of_day      = df.index.hour.tolist()
is_business_hour = [1 if 8 <= h <= 18 else 0 for h in hour_of_day]

# chain assembles feature name list from logically grouped sublists
feature_names = list(itertools.chain(
    ["humidity_raw"],
    ["humidity_roll_6h", "humidity_roll_24h"],
    ["hour_of_day", "is_business_hour"],
))

multi_res_df = pd.DataFrame({
    name: vals for name, vals in zip(
        feature_names,
        [humidity, rolling_6h, rolling_24h, hour_of_day, is_business_hour]
    )
}, index=df.index)

print(multi_res_df.iloc[24:30])

Output:

                     humidity_raw  humidity_roll_6h  humidity_roll_24h  
timestamp
2024-03-02 00:00:00         78.45            79.622             78.055
2024-03-02 01:00:00         75.63            79.105             78.100
2024-03-02 02:00:00         77.51            78.190             78.062
2024-03-02 03:00:00         76.27            78.088             78.157
2024-03-02 04:00:00         74.96            77.805             78.240
2024-03-02 05:00:00         75.75            77.208             78.203

                     hour_of_day  is_business_hour
timestamp
2024-03-02 00:00:00            0                 0
2024-03-02 01:00:00            1                 0
2024-03-02 02:00:00            2                 0
2024-03-02 03:00:00            3                 0
2024-03-02 04:00:00            4                 0
2024-03-02 05:00:00            5                 0

chain here assembles the feature name list from logically grouped sublists: raw sensor readings, rolling aggregates, and calendar features. As your feature set grows to more sensor channels and more resolutions, chain keeps that organization easy to read and easy to extend.
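
When the groups are stored in a container rather than written out as separate arguments, `itertools.chain.from_iterable` does the same flattening. The sketch below assumes a hypothetical `feature_groups` dictionary; the group labels are illustrative only.

```python
import itertools

# Hypothetical grouping of the feature names used above
feature_groups = {
    "raw": ["humidity_raw"],
    "rolling": ["humidity_roll_6h", "humidity_roll_24h"],
    "calendar": ["hour_of_day", "is_business_hour"],
}

# Flatten the groups in insertion order into one list of column names
feature_names = list(itertools.chain.from_iterable(feature_groups.values()))
print(feature_names)
```

Adding a new sensor channel then means adding one entry to the dictionary, and the flattened list follows automatically.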

# 6. Computing Pairwise Temporal Correlations with `combinations`

In a multi-sensor setting, relationships between variables over time often carry signal that individual measurements cannot capture. For example, a simultaneous rise across two sensors may reveal emerging conditions or interactions that would not be apparent when each series is analyzed separately.

Features that reflect these joint dynamics can improve a model's ability to detect subtle patterns and dependencies. Let's compute rolling pairwise correlations using combinations:

sensor_cols = ["temperature_c", "humidity_pct", "power_kw"]
window_size = 12

pairwise_features = {}

for col_a, col_b in itertools.combinations(sensor_cols, 2):
    feature_name = f"corr_{col_a[:4]}_{col_b[:4]}_12h"
    correlations = []

    series_a = df[col_a].tolist()
    series_b = df[col_b].tolist()

    for i in range(len(series_a)):
        if i < window_size:
            correlations.append(None)
            continue

        win_a = list(itertools.islice(series_a, i - window_size, i))
        win_b = list(itertools.islice(series_b, i - window_size, i))

        mean_a = sum(win_a) / window_size
        mean_b = sum(win_b) / window_size

        cov   = sum((a - mean_a) * (b - mean_b) for a, b in zip(win_a, win_b)) / window_size
        std_a = (sum((a - mean_a)**2 for a in win_a) / window_size) ** 0.5
        std_b = (sum((b - mean_b)**2 for b in win_b) / window_size) ** 0.5

        corr = round(cov / (std_a * std_b), 4) if std_a > 0 and std_b > 0 else None
        correlations.append(corr)

    pairwise_features[feature_name] = correlations

corr_df = pd.DataFrame(pairwise_features, index=df.index)
print(corr_df.iloc[12:18])

Output:

                     corr_temp_humi_12h  corr_temp_powe_12h  
timestamp
2024-03-01 12:00:00             -0.6700             -0.2281
2024-03-01 13:00:00             -0.7208             -0.4960
2024-03-01 14:00:00             -0.7442             -0.6669
2024-03-01 15:00:00             -0.7678             -0.7076
2024-03-01 16:00:00             -0.8116             -0.7265
2024-03-01 17:00:00             -0.8368             -0.7482

                     corr_humi_powe_12h
timestamp
2024-03-01 12:00:00              0.5380
2024-03-01 13:00:00              0.6614
2024-03-01 14:00:00              0.7202
2024-03-01 15:00:00              0.7311
2024-03-01 16:00:00              0.7233
2024-03-01 17:00:00              0.7219
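
The loop above combines two ideas: enumerating each unordered pair once, and computing a population Pearson correlation per window. The sketch below separates them on a toy window so each part is easier to inspect. It is an illustration under assumptions: `pearson` is a hypothetical helper, and the readings are made up.

```python
import itertools

def pearson(xs, ys):
    """Population Pearson correlation; None if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = (sum((x - mean_x) ** 2 for x in xs) / n) ** 0.5
    std_y = (sum((y - mean_y) ** 2 for y in ys) / n) ** 0.5
    if std_x == 0 or std_y == 0:
        return None
    return cov / (std_x * std_y)

# Toy window of readings; the values are invented for illustration
window = {
    "temp":  [1.0, 2.0, 3.0, 4.0],
    "humid": [8.0, 6.0, 4.0, 2.0],  # falls as temp rises
    "power": [5.0, 5.0, 5.0, 5.0],  # constant, so correlation is undefined
}

# combinations yields each unordered pair exactly once
pair_corr = {
    f"corr_{a}_{b}": pearson(window[a], window[b])
    for a, b in itertools.combinations(window, 2)
}
print(pair_corr)
```

With n sensors, combinations produces n(n-1)/2 pairs, so three sensors give three correlation features and ten sensors would give 45.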

# 7. Accumulating Running Baselines with `accumulate`

A given reading can mean something different depending on when it occurs in the sequence. What matters is its deviation from a dynamic baseline: the running mean up to that point in time. Using a cumulative function such as accumulate, you can compute this running baseline without storing the entire history.

readings = df["temperature_c"].tolist()

running_sums   = list(itertools.accumulate(readings))
running_counts = list(itertools.accumulate([1] * len(readings)))
running_means  = [
    round(s / c, 4)
    for s, c in zip(running_sums, running_counts)
]

# Running max — highest temperature seen so far, useful for breach tracking
running_max = list(itertools.accumulate(readings, func=max))

deviation_from_baseline = [
    round(r - m, 4)
    for r, m in zip(readings, running_means)
]

baseline_df = pd.DataFrame({
    "temperature_c":           readings,
    "running_mean":            running_means,
    "running_max":             running_max,
    "deviation_from_baseline": deviation_from_baseline,
}, index=df.index)

print(baseline_df.iloc[20:28])

Output:

                     temperature_c  running_mean  running_max  
timestamp
2024-03-01 20:00:00          2.960        3.5857        5.192
2024-03-01 21:00:00          2.647        3.5430        5.192
2024-03-01 22:00:00          2.986        3.5188        5.192
2024-03-01 23:00:00          2.831        3.4902        5.192
2024-03-02 00:00:00          3.409        3.4869        5.192
2024-03-02 01:00:00          3.919        3.5035        5.192
2024-03-02 02:00:00          3.833        3.5157        5.192
2024-03-02 03:00:00          4.542        3.5524        5.192

                     deviation_from_baseline
timestamp
2024-03-01 20:00:00                  -0.6257
2024-03-01 21:00:00                  -0.8960
2024-03-01 22:00:00                  -0.5328
2024-03-01 23:00:00                  -0.6592
2024-03-02 00:00:00                  -0.0779
2024-03-02 01:00:00                   0.4155
2024-03-02 02:00:00                   0.3173
2024-03-02 03:00:00                   0.9896
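
The listing above materializes full lists for clarity. For a source that arrives one reading at a time, the same baseline can be produced lazily. The sketch below is one possible arrangement, not the article's code: `running_deviation` is a hypothetical generator that uses `tee` to feed the stream to both `accumulate` and the output, and `enumerate` to supply the running count.

```python
import itertools

def running_deviation(stream):
    """Yield (reading, running_mean, deviation) one reading at a time."""
    readings, totals_src = itertools.tee(stream)
    totals = itertools.accumulate(totals_src)  # running sum so far
    for count, (reading, total) in enumerate(zip(readings, totals), start=1):
        mean = total / count
        yield reading, mean, reading - mean

for row in running_deviation(iter([2.0, 4.0, 6.0, 4.0])):
    print(row)
# (2.0, 2.0, 0.0)
# (4.0, 3.0, 1.0)
# (6.0, 4.0, 2.0)
# (4.0, 4.0, 0.0)
```

As in the listing above, the running mean includes the current reading, so the first deviation is always zero.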

# Summary

Time series feature engineering is all about encoding context: how has this signal been behaving, compared to what we expect it to do? Each technique covered here is a different way to turn that question into a number the model can learn from.

Here's a summary of the patterns we've covered in this article:

| itertools Function | Time Series Feature | Example |
|---|---|---|
| `islice` | Lag features | Temperature 1h, 6h, 24h ago |
| `islice` + `accumulate` | Rolling window statistics | 6h mean, std, min, max |
| `product` | Seasonal interaction grid | Hour × day type × shift baseline |
| `tee` | Parallel window statistics | Mean + range + rate of change |
| `chain` | Multi-resolution feature assembly | Raw + rolling + calendar features |
| `combinations` | Pairwise cross-sensor correlations | Temp-humidity, temp-power rolling correlation |
| `accumulate` | Running baseline + deviation | Drift detection in historical context |

And because itertools works at the iterator level, all of these patterns integrate cleanly with streaming pipelines. Happy feature engineering!

Bala Priya C is an engineer and technical writer from India. She loves working at the intersection of mathematics, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she works to learn and share her knowledge with the engineering community by authoring tutorials, how-to guides, ideas, and more. Bala also creates engaging resource overviews and code tutorials.
