The project I'm currently working on requires me to articulate to people what a data warehouse is. Depending on how much those people want to know, I usually go on to explain the role of a data warehouse, and its importance within a business organising such a massive amount of data. Some of the teams within the business require a more in-depth understanding of the architecture behind the warehouse, which I’ll explain a bit later in this article.
A central place for data
A data warehouse is a fancy term for describing the central place we store data related to the business. A warehouse can form a more substantial part of business intelligence systems found in companies requiring greater insight into both internal and external data.
In essence, a data warehouse is a bunch of data pulled in, sometimes augmented and then stored in a Big Data database. This, of course, is a vast simplification of the process, one which I'll cover in more detail below.
Having a data warehouse is essential for many reasons, a primary one being ownership of historical data. Pulling in data from external data sources can be great for real-time reporting, but you’re then reliant on those third party systems existing for as long as you need access to the data within them. Making a copy of that data ensures you have a snapshot of that data from a specific point in time. Cataloguing data like this is essential to ensuring historical reporting reflects the state of the data when it's collected.
Another reason you may want to consider a data warehouse is that the structure of the data isn’t optimal for reporting. For example, you may want to process the data to roll-up rows of information into summaries. While processing the data, you may also choose to cleanse parts of the data where the source may have omissions surrounding specific fields, or you want to convert the values into something more useful in the context of a more extensive system.
A data warehouse is also useful when many data sources come together to add more context and colour to a bigger picture. For example, you may have a system which pulls in comments on a social media post. A data warehouse could combine that comment data with sentiment analysis data for each comment, giving a feel for how the audience felt about that particular post. This information (usually in the form of a score) could then be averaged across all comments to show the general reaction to posts on that social media channel.
How it works
Some data warehouses (and the one I currently work on) use an ETL Pipeline for processing data. ETL stands for Extract, Transform and Load, and describes the process the data goes through to end up in the warehouse. There are many ways to do this, from simple scripts to more complex tools that help visualise and structure jobs to pull data into place. Let’s look a bit deeper into those steps.
Extraction in modern (web-based) systems will hopefully be from an API. Extraction usually takes place on a schedule to ensure snapshots of that data are recorded consistently. Alternatively, data can be collected in response to an event happening (such as a record being created, often found in event-based systems). Continuing with the example of social media comments, the steps involved in extraction might be:
Query a posts API for a collection of posts
Store each post in a posts table
Iterate through each post in the collection and fetch all comments
Store each comment in a comments table
The tables mentioned above are known as “staging” tables, which store the raw data as pulled from the source. These tables could mimic the data structure defined in the source API, and they are mostly a carbon copy of that data that we now own for us to use however we wish.
There are further considerations during this collection phase, such as paginating over large result sets and retry logic for failed requests, but we’ll skip them for now. A cursor might be used to determine the last collection point for data, and the extraction should pick up from that point (usually a timestamp or auto-incrementing integer).
The next step is the transformation of data we collected during the Extract phase. This step isn’t always a requirement, and some data can be used immediately without transformation; however we may want to further process data to make it more useful later on. The details of the transformation step are dependent on the data, but again continuing with the social media post example above, we may:
Go through each post from the posts table
Find all comments related to that post
Working through each comment, create a score from 0 to 1 marking the sentiment in the comment text (for example “I love this” may be 0.9 but “This is rubbish” may be 0.3)
Once all comments have a score, save an average of these comments on the post in the posts table
Of course, this is a simplification of the process, but it illustrates the purpose of the transformation stage where we may be augmenting or adding detail to the raw data from the first step, making it more useful for the next step.
Now that we’ve extracted and transformed the data, it needs loading into a final location to use for reporting and data analysis. This process can be as simple as appending that data to an existing database table or updating existing records. It's important to consider how we load data into the production tables; at this stage duplicate or incorrectly referenced records can cause data integrity issues in the reporting.
From this point, that data may be consumed and visualised in reporting tools such as dashboards or other web front-ends (we're currently using Periscope, for example).
Pulling data in from so many external sources can sometimes feel like tracking a moving target. The nature of this particular beast means that APIs are continually changing, evolving and breaking to make the task of collecting data consistently even harder.
It’s essential to consider alerting surrounding the retrieval of data to ensure consistency and integrity of the data used in reports. Human intervention through manual sanity checks can be helpful to ensure the validity of the data during processing. In this respect, the role of a data warehouse is never “done”.
As the business evolves, the warehouse should evolve with it.
The decision to spin up a data warehouse revolves around a few factors such as access to data sources in the first place, as well as knowing how you want to visualise that data.
Ultimately though, it's about the story you want to tell. Good use of data in this context should start by thinking about how to highlight a particular goal or achievement, not first working backwards from the story the data tells.