Ensuring Data Integrity with Cryptographic Hashing and the Ethereum Blockchain


In data science practice, teams often need access to a shared dataset that remains completely synchronized and immutable, e.g., in distributed machine learning environments where multiple teams rely on the exact same feature set.

In this article, I will walk through a simple, fee-free way to cryptographically hash a dataset of any size and store its hash immutably on the Ethereum blockchain, creating a permanent and verifiable record of dataset integrity.

This approach can also be extended to model weights, specific transformations that need to be implemented in a consistent manner, source code, or other data that needs to be immutable and verifiable.

🤔 Why Integrity Is Important

If you are at all familiar with data science as a practice, you already know about the importance of data integrity. Even small changes or errors in the input data can derail a project.

Modern machine learning models are very sensitive to their training data. A missing normalization step, a modified CSV file, shuffled rows, corrupted features, or a mismatch between training and validation datasets can produce very different results.

Integrity failures are hard to spot and often go unnoticed.

Models may appear to train and run normally, but metrics degrade, drift accumulates, or experiments cannot be reproduced. Integrity is doubly important when the team is distributed, perhaps across different organizations, and needs to work on different versions of the same problem.

🔐 Using a Cryptographic Hash as a “Source of Truth”

A cryptographic hash provides us with one of the simplest and most useful ways to ensure the integrity of data.

A short primer on cryptographic hashes

A hash function takes any amount of input data (bytes) and produces a fixed-length output known as a hash or digest. Cryptographic hashes are fundamental to computer science, as you may already know.

What matters is determinism:

Same data in → same hash

Even a single byte changed in the input data produces a completely different hash.
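Both properties are easy to check with Python's standard hashlib module. The snippet below is a minimal sketch; the helper name blake2b_hex and the sample byte strings are made up for illustration.

```python
import hashlib

def blake2b_hex(data: bytes) -> str:
    """Return the BLAKE2b digest of `data` as a hex string."""
    return hashlib.blake2b(data).hexdigest()

original = b"feature_a,feature_b\n1.0,2.0\n"
modified = b"feature_a,feature_b\n1.0,2.1\n"  # a single byte differs

# Determinism: hashing the same bytes twice gives the same digest.
assert blake2b_hex(original) == blake2b_hex(original)

# A one-byte change yields a different digest of the same fixed length
# (BLAKE2b defaults to 64 bytes, i.e. 128 hex characters).
assert blake2b_hex(original) != blake2b_hex(modified)
assert len(blake2b_hex(original)) == len(blake2b_hex(modified)) == 128
```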

Because of this property, hashes act as effectively unique fingerprints of data and are very useful for verifying integrity. There are many flavors of hash functions, and some are better suited to this task than others, as I will explain.

How does this apply to datasets?

Because of the determinism of the hash function, once it is applied to the dataset, we can quickly and reliably check if the dataset matches our expectations.

This is very valuable with large datasets shared by many groups and many companies, and carried from one version to the next. Team 1 in Research Group Alpha builds features 1-10, Team 2 in Research Group Zeta builds features 10-100, System X uses version Y, and so on.

We no longer need to ask for details; we simply compute a hash over the dataset and compare it to the hash recorded in the reference location. If it matches, OK. If not, something has changed.
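As a rough sketch of that check, the helper below streams a file through the hash function and compares the result to a published reference digest. The function names and the handling of an optional "0x" prefix are assumptions made for this example.

```python
import hashlib
import hmac

def file_digest_hex(path, algorithm="blake2b", chunk_size=1024 * 1024):
    """Stream a file through the hash so it never has to fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def matches_reference(path, reference_hex, algorithm="blake2b"):
    """Return True if the file's digest equals the published reference digest."""
    computed = file_digest_hex(path, algorithm)
    # Normalize case and strip an optional "0x" prefix before comparing.
    reference = reference_hex.lower().removeprefix("0x")
    return hmac.compare_digest(computed, reference)
```

A caller would pass the local file path and the reference digest, and treat a False result as a signal that the dataset has changed since the reference was published.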

Hashing scales well. Running a hash function over a 10 MB or a 10 TB dataset gives us the same kind of small, fixed-length string that can be shared, stored, or published.

🧐 Why Use Ethereum as an Immutable Store?

This is the really useful part of this article.

Ethereum, as you probably already know, is a blockchain. This gives us:

Immutability: transactions can never be changed
Distributed availability: always accessible without a central authority
Permanence: once written, a transaction remains accessible

But isn't Ethereum for transactions? Don't we need to write a complex smart contract for this special purpose?

You certainly can, but we don't need to.

The clever bit is using the rarely used input data field of an Ethereum transaction, sometimes called “calldata.”

But don't Ethereum transactions cost real money (gas, fees, etc.)?

That's true. In Ethereum, you are charged “gas” for each byte of input data. On mainnet, at a price of $2,000 per ETH, this would cost us between $0.04 and $0.10 per hash. This does not include the gas required for the transaction itself to be included in a block by a validator, which can be substantial depending on the current load of the network.
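To make the arithmetic concrete, here is a back-of-the-envelope sketch. It assumes the calldata rates from EIP-2028 (16 gas per non-zero byte, 4 gas per zero byte), a 21,000 gas base cost per transaction, and illustrative gas prices of 20 and 50 gwei. Those prices are my assumptions, not figures from this article, and protocol upgrades can change calldata pricing, so treat the output as an estimate.

```python
GWEI_PER_ETH = 1_000_000_000

def calldata_gas(data: bytes, gas_per_nonzero: int = 16, gas_per_zero: int = 4) -> int:
    """Gas charged for the calldata bytes alone (assumed EIP-2028 rates)."""
    zeros = data.count(0)
    return zeros * gas_per_zero + (len(data) - zeros) * gas_per_nonzero

def gas_cost_usd(gas: int, gas_price_gwei: float, eth_usd: float) -> float:
    """Convert an amount of gas to USD at a given gas price and ETH price."""
    return gas * gas_price_gwei / GWEI_PER_ETH * eth_usd

# Stand-in for a 64-byte BLAKE2b digest with no zero bytes (worst case).
digest = bytes([0xAB]) * 64
gas = calldata_gas(digest)              # 64 * 16 = 1024 gas

low = gas_cost_usd(gas, 20, 2000)       # about $0.04 at an assumed 20 gwei
high = gas_cost_usd(gas, 50, 2000)      # about $0.10 at an assumed 50 gwei
base = gas_cost_usd(21_000, 20, 2000)   # about $0.84 base transaction cost at 20 gwei

print(f"calldata: ${low:.4f} to ${high:.4f}, base transaction: ${base:.2f}")
```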

Let's make this smarter. 🦊

By publishing to a “testnet”, which most blockchains have, we can do this completely free of charge.

Sepolia, an Ethereum testnet, is rarely used by anyone other than smart contract developers. Sepolia ETH is free and publicly available from faucets.

This means we can create a practically unlimited number of transactions on a publicly accessible testnet (Sepolia, in Ethereum's case) for free!

As long as our input data is small enough, Sepolia offers a way to store an effectively unlimited number of records on a blockchain that has many of the same features as mainnet.*

* The Sepolia blockchain is not guaranteed to be permanent, but it should remain available for years. If you need absolute permanence, you will need to pay to use mainnet.

Remember, we don't store the actual data on the chain. Just a fingerprint.

⚙️ The Process

First, we need a way to reliably create transactions on Ethereum.

Despite seeming complicated, this is actually very simple. We don't need additional software or wallet technology. A wallet is nothing but an address paired with the private key used to sign transactions from it.

To create an Ethereum transaction, we build a Python dict with the required fields, sign it with our private key, and broadcast it to the network. A validator then takes the transaction from the “mempool” and includes it in a block.

As long as we supply all the required fields and the transaction is valid, it typically becomes part of the blockchain within about 12 seconds.

Step 1: Create an address and private key with eth_account (installed alongside web3.py) in a few lines of code

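from eth_account import Account

# Generate a new random private key and derive its address.
account = Account.create()

print("Address:", account.address)
# Keep the private key secret; never commit it to source control.
print("Private Key:", account.key.hex())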

Step 2: Get Sepolia ETH from a faucet. Plug in your address and wait about 12 seconds. Thanks, Google!

Step 3: Hash the dataset

As I said, some hash functions suit this process better than others. We can use SHA-256, but BLAKE2b is often faster in software. In fact, any cryptographic hash function will work.

Use this function to compute the digest.

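import hashlib
from pathlib import Path

def hash_dataset(dataset, algorithm="blake2b", chunk_size=1024 * 1024):
    """Hash a file path or an in-memory Python object and return the hex digest."""
    h = hashlib.new(algorithm)

    def update(obj):
        # A str or Path naming an existing file is streamed in chunks,
        # so the file never has to fit in memory.
        if isinstance(obj, (str, Path)) and Path(obj).exists():
            with open(obj, "rb") as f:
                while chunk := f.read(chunk_size):
                    h.update(chunk)
        elif isinstance(obj, bytes):
            h.update(obj)
        elif isinstance(obj, str):
            # Note: a str that is not an existing path is hashed as text,
            # so a mistyped file name will not raise an error.
            h.update(obj.encode("utf-8"))
        elif isinstance(obj, dict):
            # Sort keys so insertion order does not affect the digest.
            for k in sorted(obj.keys()):
                update(k)
                update(obj[k])
        elif isinstance(obj, (list, tuple)):
            for item in obj:
                update(item)
        elif isinstance(obj, set):
            # Sets are unordered: sort for a stable digest, falling back to
            # sorting by string form when items are not mutually comparable.
            try:
                for item in sorted(obj):
                    update(item)
            except TypeError:
                for item in sorted(obj, key=str):
                    update(item)
        elif hasattr(obj, "__iter__"):
            # Any other iterable is consumed in its own order.
            for item in obj:
                update(item)
        else:
            # Scalars (int, float, None, ...) are hashed via their repr.
            h.update(repr(obj).encode("utf-8"))

    update(dataset)
    return h.hexdigest()


digest = hash_dataset("hugedataset.parquet", algorithm="blake2b")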
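One caveat when hashing in-memory containers this way: items are fed to the hash back to back, so different structures can produce the same byte stream (for example, ["ab", "c"] and ["a", "bc"]). A common remedy is to prefix every item with a type tag and its length. The following is a minimal sketch of that idea, not a drop-in replacement for the function above; hash_items, _frame, and the tag bytes are hypothetical names chosen for this example.

```python
import hashlib
import struct

def _frame(tag: bytes, payload: bytes) -> bytes:
    """Prefix a payload with a 1-byte type tag and an 8-byte big-endian length."""
    return tag + struct.pack(">Q", len(payload)) + payload

def hash_items(obj, algorithm="blake2b") -> str:
    """Hash nested str/bytes/list/tuple values with unambiguous framing."""
    h = hashlib.new(algorithm)

    def update(o):
        if isinstance(o, bytes):
            h.update(_frame(b"b", o))
        elif isinstance(o, str):
            h.update(_frame(b"s", o.encode("utf-8")))
        elif isinstance(o, (list, tuple)):
            # Record the item count so nesting boundaries are part of the input.
            h.update(b"l" + struct.pack(">Q", len(o)))
            for item in o:
                update(item)
        else:
            raise TypeError(f"unsupported type: {type(o).__name__}")

    update(obj)
    return h.hexdigest()

# Without framing these two lists would hash the same bytes ("abc").
assert hash_items(["ab", "c"]) != hash_items(["a", "bc"])
```

Whatever scheme is chosen, every team has to use exactly the same one, since the digest depends on the framing as much as on the data.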

Step 4: Write, sign and publish the transaction with the hash of our dataset.

Using the web3.py library, we can format our transaction as a Python dict and publish it to the network.

We need a node provider to broadcast our transaction (we are not running our own node). Here we use Infura, but there are others, such as Alchemy.

Note that we prepend the hex prefix “0x” to the hash calculated from our dataset. We need to account for this prefix when we verify the hash.

from web3 import Web3

# Replace the placeholder with your provider's Sepolia RPC endpoint.
w3 = Web3(Web3.HTTPProvider("YOUR_SEPOLIA_RPC_URL"))
dataset_hash = "0x" + digest  # hex digest from Step 3

account = w3.eth.account.from_key("YOUR_PRIVATE_KEY")

tx = {
    "to": account.address,  # self-send (no contract required)
    "value": 0,  # no ETH transfer
    "gas": 50_000,
    "maxFeePerGas": w3.to_wei("20", "gwei"),
    "maxPriorityFeePerGas": w3.to_wei("2", "gwei"),
    "nonce": w3.eth.get_transaction_count(account.address),
    "chainId": 11155111,  # Sepolia testnet
    "data": dataset_hash,  # the dataset fingerprint travels as calldata
}

Sign it and send it. Here, we wait until the transaction is included in a block.

signed_tx = account.sign_transaction(tx)

# Newer eth-account releases expose `raw_transaction`; older ones use `rawTransaction`.
raw_tx = getattr(signed_tx, "raw_transaction", None) or signed_tx.rawTransaction
tx_hash = w3.eth.send_raw_transaction(raw_tx)
print("Broadcast tx hash:", tx_hash.hex())

# Wait for inclusion in a block
tx_receipt = w3.eth.wait_for_transaction_receipt(tx_hash)

print("Transaction included in block:", tx_receipt["blockNumber"])
print("Status:", tx_receipt["status"])  # 1 = success, 0 = failure

Make sure you save the transaction hash (tx_hash); you will need it to verify the dataset later.

Step 5: Create a metadata record to store next to our dataset

Here, we create a simple piece of metadata, which can be stored in a database (DynamoDB, MongoDB) or next to our data object directly (S3, Google Cloud Storage).

The metadata might look like this:

{
  "dataset_id": "feature_set_v42",
  "dataset_uri": "s3://ml-bucket/features/v42.parquet",
  "dataset_hash": "0x9f3c...ab21",
  "tx_hash": "0x7c1a...e91d",
  "timestamp_unix": 1730000000,
  "hash_algorithm": "blake2b",
  "creator": "0xabc123...",
  "notes": "normalized features"
}
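A record like this can be assembled with a small helper. The sketch below mirrors the fields shown above; the function name build_metadata and all placeholder values are hypothetical, and writing the record to a database or object store is left out.

```python
import json
import time

def build_metadata(dataset_id, dataset_uri, digest_hex, tx_hash, creator,
                   hash_algorithm="blake2b", notes="", timestamp_unix=None):
    """Assemble the record that is stored next to the dataset."""
    return {
        "dataset_id": dataset_id,
        "dataset_uri": dataset_uri,
        # Store the digest lowercase with exactly one "0x" prefix.
        "dataset_hash": "0x" + digest_hex.lower().removeprefix("0x"),
        "tx_hash": tx_hash,
        "timestamp_unix": int(time.time()) if timestamp_unix is None else timestamp_unix,
        "hash_algorithm": hash_algorithm,
        "creator": creator,
        "notes": notes,
    }

record = build_metadata(
    dataset_id="feature_set_v42",
    dataset_uri="s3://ml-bucket/features/v42.parquet",
    digest_hex="ab" * 64,       # placeholder digest, not a real hash
    tx_hash="0x" + "cd" * 32,   # placeholder transaction hash
    creator="0x" + "00" * 20,   # placeholder address
    notes="normalized features",
    timestamp_unix=1730000000,
)
print(json.dumps(record, indent=2))
```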

Step 6: Whenever you read the dataset, verify that its recomputed hash matches the hash recorded on-chain

The last step of the process includes three actions:

Fetch the Ethereum transaction
Extract the dataset hash from the calldata
Compare it with the locally recomputed hash

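from web3 import Web3

# Replace the placeholder with your provider's Sepolia RPC endpoint.
w3 = Web3(Web3.HTTPProvider("YOUR_SEPOLIA_RPC_URL"))

def verify_dataset(dataset_path, tx_hash):
    tx = w3.eth.get_transaction(tx_hash)

    # The calldata may come back as bytes or as a hex string, depending on the
    # library version; normalize it to a lowercase "0x"-prefixed hex string.
    raw_input = tx["input"]
    if isinstance(raw_input, (bytes, bytearray)):
        onchain_hash = "0x" + bytes(raw_input).hex()
    else:
        onchain_hash = str(raw_input).lower()
        if not onchain_hash.startswith("0x"):
            onchain_hash = "0x" + onchain_hash

    # hash_dataset is the function from Step 3; use the same algorithm as when publishing.
    computed_hash = "0x" + hash_dataset(dataset_path).lower()

    if computed_hash != onchain_hash:
        raise ValueError(f"Integrity FAILED: Local {computed_hash} != On-chain {onchain_hash}")

    print("Integrity check PASSED. Dataset matches the blockchain record.")
    return True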

That's all!

An important note: this does not prevent anyone from rewriting our metadata object. However, there are many ways to protect a small metadata record from modification internally, such as audited databases or S3 Object Lock.
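Because the on-chain hash cannot be rewritten, it can also be used to catch a tampered metadata record: compare the record's dataset_hash against the calldata before trusting it. The sketch below assumes a record shaped like the one in Step 5. The name verify_record and the two callables it takes are hypothetical hooks, used here so the logic can be exercised without a network connection.

```python
def verify_record(record, fetch_calldata, compute_digest):
    """Check a metadata record against the chain and the local dataset.

    fetch_calldata(tx_hash) -> bytes
        hypothetical hook, e.g. a wrapper around w3.eth.get_transaction
    compute_digest(uri, algorithm) -> hex digest without a "0x" prefix
        hypothetical hook, e.g. a wrapper around hash_dataset
    """
    onchain = "0x" + bytes(fetch_calldata(record["tx_hash"])).hex()
    local = "0x" + compute_digest(record["dataset_uri"], record["hash_algorithm"]).lower()
    recorded = record["dataset_hash"].lower()

    if recorded != onchain:
        raise ValueError("Metadata record does not match the on-chain hash")
    if local != onchain:
        raise ValueError("Dataset does not match the on-chain hash")
    return True
```

The first check fails if someone edited the record; the second fails if the dataset itself changed.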

Wrapping up

In summary, using a cryptographic hash to ensure the integrity of a dataset is a lightweight approach to a difficult problem.

Some natural extensions include using this method to verify model weights, or even hashing snippets of source code to verify that preprocessing has not changed.

Whether you're collaborating across distributed, open-source teams, building reproducible research, or simply creating an audit trail, a blockchain is a good, impartial notary for your data. You don't need to trust the infrastructure; you just need to trust the math.


nimda 4 hours ago
