Streaming Live Blockchain Data to Data Lakes: A Blockchain Dev’s Guide
In our previous overview of blockchain ETL data pipelines, we emphasized the data-richness of blockchains and the need for a technical process that makes blockchain data usable.
Data lakes can be considered a precursor, or foundational layer, to ETL pipelines, providing the infrastructure needed to handle and manage raw blockchain data efficiently.
In this article, we will explore data lakes, their importance for blockchain data streaming, some of their use cases in web3, and more.
What is a Data Lake?
A data lake is essentially a storehouse for every type of data an organization collects. It is a repository that is not bound by a fixed schema, so data can be stored in its raw form.
You can think of a data lake as a brain dump note.
Just like how you might dump all your thoughts and ideas onto a notepad when working on a project and later extract and organize what you need out of it, a data lake serves as a vast repository where all raw data, regardless of format or structure, is stored. This raw data can then be processed and refined as needed.
The architecture of a mainstream data lake can essentially be reduced to the following four layers:
- Ingestion layer: Responsible for importing data into the data lake from various sources.
- Storage layer: Stores the ingested data in its raw form, whether structured, semi-structured, or unstructured.
- Governance layer: Manages and maintains the integrity, quality, and security of the data within the data lake.
- Application layer: Serves the stored and processed data to business applications and analytics.
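As a rough sketch of how these responsibilities might be split in code, the hypothetical TypeScript interfaces below model one record's path through the four layers; the names (`RawRecord`, `IngestionLayer`, and so on) are illustrative and not tied to any particular product:

```typescript
// Hypothetical sketch of the four data lake layers as TypeScript interfaces.
// Names and shapes are illustrative only.

interface RawRecord {
  source: string;     // e.g., "ethereum-mainnet"
  payload: unknown;   // raw, schema-less data as ingested
  ingestedAt: string; // ISO timestamp
}

interface IngestionLayer {
  // Imports data into the lake from various sources (nodes, APIs, streams).
  ingest(source: string, payload: unknown): Promise<RawRecord>;
}

interface StorageLayer {
  // Persists raw, structured, semi-structured, or unstructured data.
  write(key: string, record: RawRecord): Promise<void>;
  read(key: string): Promise<RawRecord | undefined>;
}

interface GovernanceLayer {
  // Tracks integrity, quality, and security metadata for stored data.
  recordLineage(key: string, origin: string): void;
  checkAccess(user: string, key: string): boolean;
}

interface ApplicationLayer {
  // Serves processed data to analytics and business applications.
  query(question: string): Promise<unknown[]>;
}
```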
A key aspect of data lakes is that, unlike more rigid warehousing solutions, they can handle large volumes of the varied data derived from blockchain transactions and events.
Importance of Data Lakes for Blockchain Data
The rate at which blockchains produce data is accelerating rapidly, owing to mass-adoption catalysts like account abstraction. Moreover, the variety of data types is continually evolving with the introduction of new token standards and file types as the ecosystem advances toward greater practical usability.
Such a scenario calls for a flexible, adaptable solution that is not limited by file types or schemas, and that still allows compliant storage while preserving data integrity.
Scalability for Large Organizations
Traditional data streaming methods often fall short due to the high volume of blockchain data, unique data structures, and the need for real-time or near-real-time processing.
Data lakes solve this by allowing horizontal scalability powered by distributed storage systems like HDFS (Hadoop Distributed File System) or object storage like Amazon S3 or Azure Blob Storage. Blockchain-based product organizations can go a step further and use IPFS (InterPlanetary File System) for immutable data storage.
These distributed storage systems work by breaking down large datasets into smaller, more manageable chunks and distributing them across multiple nodes in a cluster.
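As a minimal sketch of what that looks like in practice, the snippet below uploads a batch of raw blockchain data to S3 using the AWS SDK for JavaScript v3; the `Upload` helper from `@aws-sdk/lib-storage` splits large objects into parts behind the scenes, echoing how distributed storage chunks large datasets. The bucket name, object key, and local file path are placeholders.

```typescript
// Minimal sketch: uploading a large batch of raw blockchain data to S3.
// The multipart Upload helper uploads the object in parts.
import { S3Client } from "@aws-sdk/client-s3";
import { Upload } from "@aws-sdk/lib-storage";
import { createReadStream } from "node:fs";

const s3 = new S3Client({ region: "us-east-1" });

async function uploadRawBatch(localPath: string, key: string): Promise<void> {
  const upload = new Upload({
    client: s3,
    params: {
      Bucket: "my-blockchain-data-lake", // placeholder bucket
      Key: key,                          // e.g., a path under raw/ethereum/logs/
      Body: createReadStream(localPath),
    },
    partSize: 8 * 1024 * 1024, // 8 MiB parts; each part is uploaded separately
    queueSize: 4,              // parts uploaded in parallel
  });
  await upload.done();
}

uploadRawBatch(
  "./logs-19000000-19000999.jsonl",
  "raw/ethereum/logs/19000000-19000999.jsonl",
).catch(console.error);
```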
Schema-on-Read for Flexibility
Further, traditional data warehousing uses a “schema-on-write” approach, wherein developers or data engineers have to rigidly define the schema before any data can be stored. This creates bottlenecks when data types change or evolve, because schemas then have to be modified or redefined.
Data lakes offer a viable solution.
By adopting a “schema-on-read” approach, data lakes provide the much-needed flexibility for data ingestion and adaptability for new features without being constrained by pre-defined structures.
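A minimal sketch of the difference, assuming raw events arrive as arbitrary JSON: nothing is enforced on write, and the schema lives entirely in the reader, which can change without migrating stored data. The field names mirror typical EVM log fields but are otherwise hypothetical.

```typescript
// Schema-on-read sketch: raw events are stored as unstructured JSON strings,
// and a schema is applied only at read time, when a specific analysis needs it.

interface TransferView {
  blockNumber: number;
  token: string;
  rawData: string;
}

// Write path: no schema enforced, anything JSON-serializable is accepted.
const rawZone: string[] = [];
function ingestRaw(event: unknown): void {
  rawZone.push(JSON.stringify(event));
}

// Read path: the schema lives in the reader, not the storage layer.
function readAsTransfers(): TransferView[] {
  return rawZone
    .map((line) => JSON.parse(line))
    .filter((e) => Array.isArray(e.topics) && e.topics.length > 0)
    .map((e) => ({
      blockNumber: Number(e.blockNumber),
      token: String(e.address),
      rawData: String(e.data),
    }));
}

// If the upstream event shape changes, only the reader needs to adapt;
// nothing already stored has to be migrated. Values below are illustrative.
ingestRaw({
  blockNumber: "0x1234",
  address: "0x1111111111111111111111111111111111111111",
  topics: ["0xddf252ad..."], // truncated placeholder topic
  data: "0x01",
});
console.log(readAsTransfers());
```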
Why Use Data Lakes for Storing On-chain Events?
Every smart contract interaction on an EVM chain can create “log” entries in the blockchain ledger. These logs are arbitrary pieces of data in the sense that no two logs need look the same; each one carries data relevant to the transaction or interaction taking place on-chain.
Among the various types of log data are “events”, a.k.a. blockchain events.
Events are specific, named occurrences within a smart contract, like a token transfer or a change in ownership.
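For illustration, here is roughly what an ERC-20 Transfer event looks like once it lands in the ledger as a raw log; the addresses and values below are placeholders, and the first topic is the keccak-256 hash of the event signature.

```typescript
// Illustrative shape of a raw EVM log for an ERC-20 Transfer event.
const rawLog = {
  address: "0x1111111111111111111111111111111111111111", // emitting token contract (placeholder)
  topics: [
    "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef", // keccak-256 of "Transfer(address,address,uint256)"
    "0x000000000000000000000000aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", // indexed: from (placeholder)
    "0x000000000000000000000000bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb", // indexed: to (placeholder)
  ],
  data: "0x0000000000000000000000000000000000000000000000000de0b6b3a7640000", // value: 1e18 (1 token with 18 decimals)
  blockNumber: 19000000,
  transactionHash: "0xcccc...", // truncated placeholder
};
```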
On-chain events pose unique storage challenges that data lakes can effectively address. Let’s look at a few of them.
Handling Lots of Varied Data
Depending on the complexity of the smart contract interactions, a blockchain can emit many events simultaneously, which could easily overwhelm a transactional data pipeline that goes from, say, a database to a data warehouse.
Further, events are encoded according to the contract’s ABI (Application Binary Interface). This requires specialized libraries and tools for decoding and parsing the data, adding an extra layer of complexity to the ingestion process.
Perhaps more important are the diversity of event types and the constantly changing schemas. With new standards and implementations, schemas are subject to change, and the on-chain event storage solution you’re using should be adaptable enough to accommodate them.
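As a hedged sketch of that decoding step, the snippet below uses ethers (v6) to parse a raw Transfer log with a one-line ABI fragment; a real pipeline would load the full contract ABI and handle many event types.

```typescript
// Sketch of decoding a raw EVM log with ethers (v6). The ABI fragment below
// covers only the ERC-20 Transfer event.
import { Interface } from "ethers";

const erc20Abi = [
  "event Transfer(address indexed from, address indexed to, uint256 value)",
];
const iface = new Interface(erc20Abi);

// `rawLog` would come from eth_getLogs or a subscription (see the earlier sketch).
function decodeTransfer(rawLog: { topics: string[]; data: string }) {
  const parsed = iface.parseLog({ topics: rawLog.topics, data: rawLog.data });
  if (!parsed) return null; // log does not match this ABI fragment
  const { from, to, value } = parsed.args;
  return { from, to, value: value.toString() };
}
```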
Real-Time Analysis Requirements
With use cases of microtransactions and gaming on the rise, there’s increasing demand for real-time or near real-time analysis of on-chain events. For example, decentralized exchanges need to react to market events instantly, and fraud detection systems need to identify suspicious transactions as they happen.
As data lakes are built on distributed storage systems (HDFS, S3, etc.), they provide horizontal scalability, which means that they are not limited in processing throughput as more nodes can be added on-demand.
Further, data lakes can kick off automated workflows based on event triggers. For example, they can be configured so that when a particular event is emitted on the blockchain, the data is transformed and updated in the ETL pipeline.
The combination of easy data access, horizontal scaling, and event-triggered automations can enable low latency for most data analyses.
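Here is a minimal sketch of such an event-triggered flow, using an ethers WebSocket connection to listen for Transfer events and hand each one to a downstream ingestion step; the endpoint URL, contract address, and `pushToLake()` helper are placeholders.

```typescript
// Sketch of event-triggered automation: listen for Transfer events over a
// WebSocket connection and hand each decoded event to the next pipeline step.
import { Contract, WebSocketProvider } from "ethers";

const provider = new WebSocketProvider("wss://example-node.invalid/ws"); // placeholder endpoint
const token = new Contract(
  "0x1111111111111111111111111111111111111111", // placeholder token address
  ["event Transfer(address indexed from, address indexed to, uint256 value)"],
  provider,
);

async function pushToLake(record: unknown): Promise<void> {
  // In a real pipeline this would write to the lake's ingestion layer (e.g., S3 or a queue).
  console.log("ingesting", record);
}

async function main() {
  // ethers re-emits each matching log as soon as the node reports it.
  await token.on("Transfer", async (from: string, to: string, value: bigint, event: any) => {
    await pushToLake({
      from,
      to,
      value: value.toString(),
      blockNumber: event.log.blockNumber,
    });
  });
}
main().catch(console.error);
```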
Flexibility in Data Formatting
ABIs (Application Binary Interfaces) are the critical bridge that lets smart contracts interact with the outside world. An ABI specifies how data structures and functions are encoded and decoded, facilitating interaction between smart contracts and external applications.
However, when dealing with multiple blockchains, the ABI itself is not standardized. This is somewhat ironic: the ABI is a standardization mechanism, yet there is no single ABI standard across a multi-chain ecosystem.
In simpler words, a specific blockchain (say, Ethereum) can have standardized ABI formats for certain types of contracts. For example, ERC-20 tokens on Ethereum follow a standard ABI that defines functions like transfer, balanceOf, and others. But across blockchains, there is no universal ABI.
This lack of standardization can lead to a wide variety of blockchain event structures. And as smart contracts are upgraded, event structures can change again.
That is why the schema-on-read approach of data lakes is particularly useful: engineers can focus on effective data ingestion before moving on to cleanup and other operations.
Streamlining Data Lake Ingestion with Pipelines
Data lakes and ETL pipelines complement each other.
The former serves as a centralized repository for storing vast volumes of structured, semi-structured, and unstructured blockchain data, while the latter automates the workflows that orchestrate the extraction, transformation, and loading (ETL) of this data into the data lake.
Integrated Architecture
The foundation of efficient data lake ingestion lies in a well-designed architecture, which starts with connecting different data sources.
Blockchains are the source of the data (logs, token balances, token transfers, etc.), which is stored on nodes. Since these nodes are the points from which data is extracted, they are also referred to as the data endpoints of a blockchain.
Most blockchains provide developers with the necessary APIs or RPC interfaces to interact with the node and query the blockchain data.
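As a small illustration, the sketch below calls a node's JSON-RPC interface directly with `fetch`, using the standard eth_blockNumber and eth_getLogs methods; the endpoint URL and contract address are placeholders.

```typescript
// Sketch of querying a node's JSON-RPC interface directly (Node 18+ for global fetch).
const RPC_URL = "https://example-node.invalid"; // placeholder node endpoint

async function rpc<T>(method: string, params: unknown[] = []): Promise<T> {
  const res = await fetch(RPC_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ jsonrpc: "2.0", id: 1, method, params }),
  });
  const body = (await res.json()) as { result: T };
  return body.result;
}

async function main() {
  // Latest block height, returned as a hex string.
  const latest = await rpc<string>("eth_blockNumber");

  // All logs emitted by a given contract in that block.
  const logs = await rpc<unknown[]>("eth_getLogs", [
    {
      address: "0x1111111111111111111111111111111111111111", // placeholder contract
      fromBlock: latest,
      toBlock: latest,
    },
  ]);
  console.log(parseInt(latest, 16), logs.length);
}
main().catch(console.error);
```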
This extracted raw data can be transformed with the help of transformers. Filtering, aggregation, enrichment, and other transformational operations are performed on the data before it is loaded into a data lake.
Note that a schema may or may not be applied at this stage. Data lakes provide the flexibility of not using a rigid schema.
Data lakes also include metadata catalogs to track information about the stored data, such as its origin, transformation history, and schema (if applied).
Automating Data Extraction and Storage
Automation is central to efficient and fast-paced data analytics. By eliminating complex manual processes of data extraction and management, automation helps organizations make better use of time and financial resources.
Beyond pulling data from endpoints on demand, event subscription models such as Ethereum’s eth_subscribe allow real-time listening for new blocks, transactions, or specific smart contract events. For multi-chain environments, connectors are set up using standardized APIs or SDKs (e.g., Web3.js or ethers.js for Ethereum).
The pipeline can also be configured to extract data from the nodes at regular intervals (by timeframe or block range) and process it in batches.
The extracted data is validated, normalized, and stored in cloud-based storage solutions like Amazon S3.
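Putting those steps together, here is a hedged sketch of one automated batch run: it pulls logs for a block range with ethers, lightly normalizes them, and lands the batch in S3 as JSON lines. The endpoint, bucket, and block range are placeholders.

```typescript
// Sketch of one automated batch run: extract logs for a block range,
// normalize them, and write the batch to S3 as JSON lines.
import { JsonRpcProvider } from "ethers";
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const provider = new JsonRpcProvider("https://example-node.invalid"); // placeholder endpoint
const s3 = new S3Client({ region: "us-east-1" });

async function runBatch(fromBlock: number, toBlock: number): Promise<void> {
  // Extract: all logs in the block range (a real pipeline would filter by address/topics).
  const logs = await provider.getLogs({ fromBlock, toBlock });

  // Validate + normalize: keep only well-formed entries, flatten to plain objects.
  const records = logs
    .filter((log) => log.topics.length > 0)
    .map((log) => ({
      blockNumber: log.blockNumber,
      address: log.address.toLowerCase(),
      topics: [...log.topics],
      data: log.data,
      txHash: log.transactionHash,
    }));

  // Load: one JSON-lines object per block range, partitioned by range in the key.
  await s3.send(
    new PutObjectCommand({
      Bucket: "my-blockchain-data-lake", // placeholder bucket
      Key: `raw/ethereum/logs/${fromBlock}-${toBlock}.jsonl`,
      Body: records.map((r) => JSON.stringify(r)).join("\n"),
      ContentType: "application/x-ndjson",
    }),
  );
}

// A scheduler (cron, Airflow, etc.) would call this per block range.
runBatch(19_000_000, 19_000_099).catch(console.error);
```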
By automating this entire workflow, organizations can reap the following benefits:
- Continuous data ingestion
- Low data processing latency
- High scalability and efficiency
- High data reliability and consistency
Real-Time vs. Batch Processing
Processing, here, refers to parsing, filtering, enrichment, normalization, cleaning, compression, error detection, and other operations performed on the extracted data.
When streaming on-chain data into data lakes, processing can be done in two ways:
- Real-time
- Batch
Before concluding which one is better, let’s take a look at some of their key differences in the table below.
| | Real-Time Processing | Batch Processing |
| --- | --- | --- |
| Data Ingestion | Continuous, as events occur on the blockchain | Periodic, at scheduled intervals (e.g., hourly, daily) |
| Latency | Low (near real-time) | High (can be minutes to hours) |
| Use Cases | Fraud detection, high-frequency trading, real-time analytics, monitoring | Historical analysis, reporting, machine learning, backfilling |
| Data Volume | Typically handles smaller, more frequent updates | Handles large volumes of data in bulk |
| Infrastructure | Highly demanding | Less demanding |
| Complexity | High | Low |
| Error Handling | Fast, but less thorough | Allows for more thorough error handling in post-processing |
As you can see, real-time and batch processing are distinct approaches rather than competing ones. In practice, organizations benefit from taking a hybrid approach, using different processing methods for different data types.
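As a small, hypothetical sketch of what that hybrid routing might look like in configuration, latency-sensitive data types are sent through the real-time path and everything else falls back to batch; the data type names are made up for illustration.

```typescript
// Hypothetical routing config for a hybrid setup.
type ProcessingMode = "real-time" | "batch";

const processingPlan: Record<string, ProcessingMode> = {
  dexTrades: "real-time",           // market events need immediate reaction
  suspiciousTransfers: "real-time", // fraud detection as transactions happen
  dailyTokenBalances: "batch",      // snapshot-style reporting
  historicalBackfill: "batch",      // large volumes, latency-insensitive
};

function routeEvent(dataType: string): ProcessingMode {
  return processingPlan[dataType] ?? "batch"; // default to batch for unknown types
}

console.log(routeEvent("dexTrades"), routeEvent("nftMints"));
```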
Conclusion
As the volume and complexity of on-chain data explodes, traditional data infrastructure just can’t keep up. That’s where blockchain data lakes shine.
Unlike rigid data warehouses, data lakes offer the flexibility, scale, and adaptability needed to handle the messy on-chain data. They don’t force structure where it doesn’t exist.
The main takeaway is that integration of data lakes into blockchain data architectures is a foundational requirement for growth today.