Business Intelligence: The Data Analysis Phase
I think every IT Professional has a specialist subject which they feel most passionate about. Mine is data. If you’re building Business Intelligence or reporting systems, having good quality and well understood data is absolutely paramount. Unfortunately, I’ve lost count of the times when I’ve seen system developers jump straight for their tools - sometimes pushed by business managers and project managers who just want their systems delivered. They focus on building automated data feeds and reports, without truly understanding what the data is, and what it means to the business. This results in a lot of money being spent on something that doesn’t actually work - and which generates misleading reports that destroy everyone’s credibility.
Sometimes you need to go back to basics - and the most basic requirement for any Data Warehouse or BI system is data. Skimping on activities in the data analysis phase greatly increases project risk.
Lack of understanding about data means that analysts and developers may end up:
- Using data fields and values because they're convenient,
- Basing systems on their own data assumptions, without doing the "deep dive" to understand what that data really is. And,
- Glossing over obvious data quality problems on the basis that they're “something to be dealt with later”.
The tips below represent my recommended approach to the project’s Data Analysis Phase. Getting this right doesn’t need investment in fancy tools or specialist developers … and most of the people that you need for this exercise are probably already working for you. The techniques listed below are pretty basic, and should already be second nature for practically any experienced data analyst … but documenting them will hopefully benefit anyone new, or anyone looking to plan the Data Analysis activities.
1. Project Plan and Resources for the Data Analysis Phase
First and foremost, the Data Analysis phase needs to be planned for. Time and resources need to be dedicated to this activity in the project plan. This activity is separate from the user interview phase of the requirements gathering - and will need its own dedicated resources. It will need:
- Resources assigned (Data Analysts, Business Users, and IT Support)
- Reasonable period of time for analysis to be completed.
The points that follow should give you a better idea about specific skills and activities required. The ultimate aim is to give your Data Architect and developers something that they can work with. This will allow them to finalize the data model, and to cater for any special scenarios that might need to be handled within your ETL. Information from your data analysis will likely also benefit your Test Manager when defining scenarios and acceptance criteria for use by the test team.
2. Evaluate Organisational Data Complexity
Organisational data complexity will be one of the first indicators of what you’re dealing with - and how much time you might need for the fuller analysis. This is determined by the number of systems involved, and the variations of those systems.
- Are you bringing data across from a single ERP?
- Do you have multiple instances of the same ERP?
(e.g. used by different regions or divisions), or do you need to integrate data from multiple ERP products?
- Besides the ERP - are there cross-overs with other systems requiring data to be matched and merged? If so, how complex are these individual systems, and do you have any gut feel about quality and replication of data between these systems?
Back in NZ, data complexity for the projects and systems that I worked on was relatively simple. Although our bigger systems had a lot of data feeds (sometimes high-volume such as telecommunications call details or SMS records), each feed was limited to a specific subject area. This reduced our need to merge records from multiple sources, or apply overly complex logic. Our organisations tended to be smaller, with greater consistency. By contrast, the organisations that I’ve worked with since coming to Australia have been global. They’ve had different databases and systems in different countries, different languages, and much greater variation (and replication) due to companies having been built up through acquisitions.
3. Evaluate Data Structures and Data Integrity
Once you have a basic idea of the data complexity, and know what you’ve gotten yourself in for, the next step is to start gathering information about source systems and their data structures. This gives you a high level view to start figuring out what data is where.
Most systems should ideally have Entity-Relationship Diagrams (ERD) available, or maybe a Star Schema if there’s an existing reporting system.
- These will typically be held by the internal IT support teams. However:
- In cases where the system has been purchased from an external vendor (and is possibly still managed by that vendor) then you'll need to liaise with that vendor
If all else fails, try getting a list of the tables and data fields, highlighting the primary and foreign keys. If the support team or vendors can’t provide this to you, then it’s something that you can potentially generate for yourself … if you have access to the backend database. Common databases such as Oracle and SQL Server have hidden metadata or Data Dictionary tables which you can query to generate a list of tables and associated fields - including field types and short field descriptions (if recorded). Although this isn’t perfect, it gives you a starting point to begin piecing together your own ERDs. Additional information will also be revealed if you get access to existing SQL queries and reports - but be aware that this will be a time-consuming piece of detective work.
Once you have an idea about the data structures, try a few basic queries (using SQL or whatever toolset you have available). Try verifying joins on the key fields - looking for mismatches (i.e. key values that occur in one table but not another), duplicate values, or other inconsistencies.
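As a sketch of those join checks (shown here in SQLite with hypothetical customer/orders tables - your real schema and toolset will differ), an orphan-key check is simply a LEFT JOIN filtered down to the unmatched rows:

```python
import sqlite3

# Hypothetical tables for illustration - your real schema will differ.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER, total REAL);
    INSERT INTO customer VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO orders VALUES (10, 1, 99.0), (11, 3, 50.0);
""")

# Orphan check: order rows whose customer_id has no matching customer record.
orphans = con.execute("""
    SELECT o.order_id, o.customer_id
    FROM orders o
    LEFT JOIN customer c ON c.customer_id = o.customer_id
    WHERE c.customer_id IS NULL
""").fetchall()
print(orphans)  # -> [(11, 3)]

# Duplicate check: key values occurring more than once on the "one" side.
dups = con.execute("""
    SELECT customer_id, COUNT(*) FROM customer
    GROUP BY customer_id HAVING COUNT(*) > 1
""").fetchall()
print(dups)  # -> [] (no duplicates in this sample)
```

The same LEFT JOIN pattern works in reverse (customers with no orders), and across systems once you can link their keys.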
Data inconsistency is a greater problem when attempting to join data between systems. For example, you might have two systems containing employee data - but find that the recorded contact details and position titles are inconsistent between the two. This begs the question: which system do you trust?
- Data integration between systems (e.g. automated data feeds and data reconciliations) will reduce the scale of these issues, if implemented - but this integration hasn't been a priority for every organisation.
- For instance: I recently worked with an Australian company which had their Australian employee data manually input and independently maintained in six different systems. This didn't include employee data held in operational systems maintained by their overseas offices. Employees who'd been marked inactive in one system frequently still showed as active in the other systems.
4. Data Profiling
Having identified your data sources, the next step is data profiling. This means getting a more detailed picture about the quality and completeness of data contained within the systems you’re reviewing.
- Does it appear that all records are being captured - or does there appear to be missing data?
- When did the data capture appear to start?
- Are there gaps or periods of incomplete data?
- Are there other observable data issues or sudden jumps in trends? These might typically coincide with historic process changes or data migrations, and could impact on the reliability of reports generated from this historic data. Getting an early idea of these issues (and recording them in the metadata) will help you answer later questions from report writers, managers and in training. Hopefully, it will also reduce the risk of this data being misinterpreted in future.
My initial queries are typically:
- Does this data look complete? i.e.
Are all required fields captured in all records (or are there fields that are frequently skipped/incomplete)?
- How many records are there overall?
- Try counting record volumes by year (or month). Are these volumes reasonably consistent?
- When did data-capture appear to start?
- What volume changes have applied over time?
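The volume-by-year profiling above can be sketched with a simple GROUP BY (SQLite here, against a hypothetical sales table - substitute your own date field and granularity):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (sale_date TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)", [
    ("2019-03-01", 10.0), ("2019-07-15", 20.0),
    ("2020-01-10", 15.0),  # a suspiciously quiet year - worth investigating
    ("2021-02-02", 12.0), ("2021-11-30", 18.0),
])

# Record volumes by year - sudden dips or jumps flag gaps in data capture.
by_year = con.execute("""
    SELECT substr(sale_date, 1, 4) AS yr, COUNT(*) AS n
    FROM sales GROUP BY yr ORDER BY yr
""").fetchall()
print(by_year)  # -> [('2019', 2), ('2020', 1), ('2021', 2)]
```

Swapping `substr(sale_date, 1, 4)` for the first seven characters gives the month-level version of the same check.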
Data categories (which might be used to populate Dimensions in a Star Schema or cube) are especially important to delve into, since these have a major impact on reporting capability. They’re highly visible as a means to group, summarise and filter data.
- What categorisations are used?
- Which tables are used to define these categorisations (reference data)?
- Generate a frequency distribution
(e.g. Using a SQL SELECT Count(*) … GROUP BY)
- Based on this distribution, look to see how categorisations have been applied.
- Review categorisations to look for potential duplicates, ambiguities, and how definitions might have changed over time.
- Are there multiple levels of categorisation (i.e. sub-categories)? If so, you may need to repeat similar analysis for these.
- If you're looking at product categories for an internally-managed purchasing or procurement system, consider adding support for a globally-recognised product classification such as the United Nations Standard Products and Services Code (UNSPSC) in addition to your own internally-developed classifications.
- This is as simple as adding an extra column to your dimension and a heading-by-heading mapping between the internal and external classification (it isn't as accurate as an item-by-item mapping … but is much quicker).
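A minimal sketch of that extra-column approach, assuming a hypothetical internal category dimension and a hand-maintained heading-by-heading mapping table (the UNSPSC codes shown are illustrative only):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_product_category (category_id INTEGER PRIMARY KEY,
                                       internal_name TEXT);
    INSERT INTO dim_product_category VALUES
        (1, 'Laptops'), (2, 'Desk Chairs');

    -- Heading-by-heading mapping maintained by the analyst.
    CREATE TABLE unspsc_map (internal_name TEXT, unspsc_code TEXT);
    INSERT INTO unspsc_map VALUES
        ('Laptops', '43211503'), ('Desk Chairs', '56101504');
""")

# Add the external classification as an extra column on the dimension,
# populated from the mapping table.
con.execute("ALTER TABLE dim_product_category ADD COLUMN unspsc_code TEXT")
con.execute("""
    UPDATE dim_product_category
    SET unspsc_code = (SELECT m.unspsc_code FROM unspsc_map m
                       WHERE m.internal_name = dim_product_category.internal_name)
""")
rows = con.execute("""
    SELECT category_id, internal_name, unspsc_code
    FROM dim_product_category ORDER BY category_id
""").fetchall()
print(rows)
```

Reports can then group by either column - internal heading or external code - from the same dimension.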
I used this approach several years ago when developing categorisations for NZ judicial offence codes. The aim was to compare NZ criminal offending and conviction statistics with those for Australia and the UK. This meant that statisticians needed to be able to generate the NZ figures using the Australian and UK classifications … essentially comparing “apples with apples”. However, doing this was tricky since NZ, the Australian Federal/State Governments, and the UK have all independently drafted their own legislation and offence definitions. Fortunately, each country reports its own crime statistics and, in doing so, provides a clear description of the Offence Type definitions that it’s used for grouping and counting these offences. These definitions help analysts determine where the lines are drawn for offences that may involve elements of multiple offences - e.g. where “aggravated robbery” might fit under the category headings: Violence, Robbery, Theft etc. Having these definitions allowed Offence Type mappings to be generated between the NZ, Australian and UK classifications - and then loaded onto the offence type dimension so that the statisticians could readily generate their reports using whichever scheme they wanted.
Value (non-classification) Fields
For value fields, look at minimums/maximums and frequency distributions. If there are many distinct values, then group them by ranges.
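For example (SQLite, with a hypothetical invoice table), a quick spread check plus a CASE-based range bucketing:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE invoice (invoice_id INTEGER, amount REAL)")
con.executemany("INSERT INTO invoice VALUES (?, ?)",
                [(1, 15.0), (2, 230.0), (3, 4999.0), (4, 75.0), (5, 1200.0)])

# Min / max give you the overall spread (and flag out-of-range outliers).
spread = con.execute("SELECT MIN(amount), MAX(amount) FROM invoice").fetchone()
print(spread)  # -> (15.0, 4999.0)

# Too many distinct values for a simple GROUP BY? Bucket into ranges instead.
buckets = con.execute("""
    SELECT CASE WHEN amount < 100   THEN 'under 100'
                WHEN amount < 1000  THEN '100 - 999'
                ELSE                     '1000 and over' END AS band,
           COUNT(*) AS n
    FROM invoice GROUP BY band ORDER BY MIN(amount)
""").fetchall()
print(buckets)
```

The band boundaries here are invented; pick ranges that make sense for the field you're profiling.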
Consider Null Values, Defaults and Mandatory Fields
Analyse the incidence of null values. Also look to see if any of the fields have default values specified within the original source systems. The problem with default values is that they result in unrecorded data masquerading as real data. This forces us to ask: is this a genuinely recorded value, or is the value there because the data-entry operator didn’t have anything to enter (or maybe they just skipped this field)?
Including records with default values can significantly skew your reports with a large number of untrue values. Therefore you need to make a judgment call about the data validity. If there appears to be a large amount of invalid data:
- Consider options to identify and exclude this invalid data from analysis.
- Consider options to improve the quality of future data capture - such as changes to data-entry screens and rules in the source systems. These changes can be discussed and negotiated with the source system owners.

In addition to defaults, also consider the effect of mandatory fields. Although mandatory fields are supposed to ensure that data is captured, they can also trigger bad behaviours where users enter meaningless "junk" values simply as a means to complete the transaction or progress onto the next screen.
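A sketch of this null/default profiling, assuming a hypothetical employee table where 'UNKNOWN' and '000-0000' are the suspected default or junk values:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employee (emp_id INTEGER, dept TEXT, phone TEXT)")
con.executemany("INSERT INTO employee VALUES (?, ?, ?)", [
    (1, 'Sales',   '555-0101'),
    (2, None,      '000-0000'),   # null dept, junk phone
    (3, 'UNKNOWN', None),         # system default masquerading as data
    (4, 'Sales',   '555-0102'),
])

# Incidence of nulls and of suspected default / junk values per field.
row = con.execute("""
    SELECT COUNT(*)                                          AS total,
           SUM(CASE WHEN dept IS NULL THEN 1 ELSE 0 END)     AS dept_null,
           SUM(CASE WHEN dept = 'UNKNOWN' THEN 1 ELSE 0 END) AS dept_default,
           SUM(CASE WHEN phone = '000-0000' THEN 1 ELSE 0 END) AS phone_junk
    FROM employee
""").fetchone()
print(row)  # -> (4, 1, 1, 1)
```

High counts in any of these columns are exactly the judgment-call cases discussed above - exclude the records, flag them, or push a fix back to the source system.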
Data Source Analysis
The next factor that I look at in my data profiling is the impact that data sources have on the data being collected. For example - if you’re integrating data from different data sources in different countries, the data sources may have different reference data (due to different languages or standards), different text descriptions, and recording formats. As such, it’s important to review the data to identify patterns (and conversions) which might need to be factored into the data integration or standardisation rules.
Section 3 above briefly mentioned the concept of Master Data - i.e. identifying the “trusted source” for specific subject areas or data entities. Beyond that, try linking equivalent entities from the various sources together (as well as other related tables). Consider:
- Are the data sources compatible?
- Are there mismatches (data appearing in one system but not others, or appearing in both systems but inconsistently)?
- Compare equivalent categorisation/reference data fields applied across these multiple systems. Are they compatible/consistent?
5. Analysis for Dimension Tables
Developing the dimension tables is the responsibility of the Data or Solution Architect. Nonetheless, as a data or business analyst, I find it helpful to come up with my own conceptual framework. This helps me get my head around the data structures - and trial different approaches. This gives me the opportunity to test how my conceptual view fits with the reality of the business data that’s been captured.
Dimensions are often linked to the classification fields identified in section 4 above.
Consider Data Hierarchies and Groupings
Identify hierarchies in any categorisations that you’ve applied. For example: a group of products might be classified as vehicles - but under that, there might be more specific subcategories, such as cars, light utility vehicles, heavy trucks, loaders etc.
- Try to identify the subcategories under each main category.
- Do some subcategories need further subcategory levels defined?
- Try to map out the overall grouping hierarchy
- Identify if there's someone managing and maintaining these groupings. How are they kept up to date?
The dimension tables also need to contain attributes associated with that entity. For example, the customer dimension will likely need the customer address (potentially separate physical, mailing, and billing addresses), contacts and phone numbers. There might also be customer type classifications and territories recorded against each customer to help with filtering and grouping.
Consider Dimension Types
Take a close look at the Dimension and consider how often the data changes. For example - if you have a customer dimension then it’s likely that the customer address and contact information will change from time to time. It’s also likely that the hierarchical structure of corporate customers will change. If you have information within dimension records which is likely to change over time, consider the frequency of changes being made, and the implication for your future reporting.
In data modelling speak, this introduces the concept of the Slowly Changing Dimension (or SCD):
- For the customer contact example: Do you just want to know the current customer contact? This may suffice for most reporting since the current contact is the person that your organisation is likely to be dealing with. Hence, if the contact changes (noting that changes should be detected in data loads as data is refreshed from source systems), you might simply want to overwrite the old contact details with details for the new contact. This is quick and simple to implement - and will be sufficient for most future reporting. This is known as a Slowly Changing Dimension Type 1 (or SCD1). The only downside is that you can't easily go back to see who the customer contact was at some earlier point in time - for example, who the contact was at the time a particular order was placed (unless the contact is also recorded on the order record).
- If you want to retain the history of changes, or keep a point in time view of the data, then consider using a Slowly Changing Dimension Type 2 (SCD2). Rather than overwriting the contact value within the customer record, you simply create a new version of the record. This contains all details as per the previous version except it contains the current contact. The old customer record version is marked to indicate that it's no longer the latest version. This is usually done using an is_current flag field on the record, and/or adding fields with start/end dates highlighting the period for which the record was valid.
SCD2s are also important for hierarchies - since you may want to generate reports grouping a customer’s year-end spend into branches and subsidiaries as they existed at that time, irrespective of subsequent structural changes that have happened since year-end.
Other factors you should look at are Ragged Hierarchies (where the number of levels of categorisation and subcategorisation isn’t always consistent - for example, addresses in some countries have States, whereas other smaller countries don’t), and other types such as Conformed Dimensions, Junk Dimensions and role-playing dimensions. All of these will have an impact on the reporting data model to be developed in section 8 below.
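The SCD2 mechanics described above can be sketched in a few lines (SQLite, hypothetical customer dimension) - close off the current version, then insert the new one:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE dim_customer (
        surrogate_key INTEGER PRIMARY KEY AUTOINCREMENT,
        customer_id   INTEGER,   -- natural / business key
        contact_name  TEXT,
        valid_from    TEXT,
        valid_to      TEXT,      -- NULL while the row is current
        is_current    INTEGER)
""")
con.execute("INSERT INTO dim_customer "
            "(customer_id, contact_name, valid_from, valid_to, is_current) "
            "VALUES (42, 'Alice', '2023-01-01', NULL, 1)")

def scd2_update(con, customer_id, new_contact, change_date):
    """Close off the current version and insert a new one (SCD Type 2)."""
    con.execute("UPDATE dim_customer SET valid_to = ?, is_current = 0 "
                "WHERE customer_id = ? AND is_current = 1",
                (change_date, customer_id))
    con.execute("INSERT INTO dim_customer "
                "(customer_id, contact_name, valid_from, valid_to, is_current) "
                "VALUES (?, ?, ?, NULL, 1)",
                (customer_id, new_contact, change_date))

# The contact changes in a data refresh - both versions are now retained.
scd2_update(con, 42, 'Bob', '2024-06-01')
versions = con.execute("SELECT contact_name, is_current FROM dim_customer "
                       "ORDER BY surrogate_key").fetchall()
print(versions)  # -> [('Alice', 0), ('Bob', 1)]
```

Point-in-time reports then filter on the valid_from/valid_to range rather than the is_current flag.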
6. Identify Measures (and Fact Tables)
These are the value fields needed for generating the totals, subtotals and calculations requested by business users for their reports. Identifying the measures also helps you identify the Fact tables and the data granularity since these are all closely linked. These are also useful for the design of data cubes if these are to be used as the basis for your reporting.
Methods to identify the measures include:
- Review existing reports and KPIs.
- Interview users or business management to find out the indicators or numbers they're most interested in seeing.
- Analyse existing reports and spreadsheets to determine how (and what) data is extracted, and actual logic for existing calculations. This detailed analysis needs some level of technical expertise for reading SQL and reverse-engineering reports in whatever reporting tool has been used.
- Interview existing business managers and analysts to document and confirm the business logic that should be applied (compare this with the actual logic determined in the point above).
- Agree and document the logic that will be used going forward in Fact tables and reports to be developed.
7. Consider Summarised Data and Data Definitions
Besides the measures identified above, you might also want other tables containing data summarised in pre-defined groupings. Although it should be possible for users to generate their own summaries, having these will save time and be much quicker for users to run. It also saves system resource since the summaries don’t need to be recalculated each time a commonly-used report is run.
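For instance, a pre-summarised table can be rebuilt as part of each data load with a single CREATE TABLE … AS SELECT (SQLite here, against a hypothetical sales fact):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE fact_sales (sale_date TEXT, region TEXT, amount REAL);
    INSERT INTO fact_sales VALUES
        ('2024-01-05', 'North', 100.0), ('2024-01-20', 'North', 50.0),
        ('2024-01-11', 'South', 70.0),  ('2024-02-03', 'North', 30.0);
""")

# Materialise a pre-summarised table, refreshed as part of each data load,
# so commonly-run reports don't re-aggregate the detail every time.
con.executescript("""
    DROP TABLE IF EXISTS summary_sales_month_region;
    CREATE TABLE summary_sales_month_region AS
    SELECT substr(sale_date, 1, 7) AS month, region,
           COUNT(*) AS sale_count, SUM(amount) AS total_amount
    FROM fact_sales GROUP BY month, region;
""")
rows = con.execute("SELECT * FROM summary_sales_month_region "
                   "ORDER BY month, region").fetchall()
print(rows)
```

The month/region grouping is illustrative - pick the pre-defined groupings that your commonly-used reports actually need.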
Additionally, bring together the data definitions based on your analysis so far, and start getting them into a format which makes them accessible to users. This doesn’t necessarily require a complex metadata management system. A simple set of notes in a Microsoft Word document or a shared wiki can be a great start. If you have data being shared between organisations, share and discuss these definitions to ensure that you have a common understanding of what they mean.
8. Prototype Your Dimensional Models and Transformations
As per comments in section 5 above, this task belongs to the Solution or Data Architect … although small prototypes built by technically-capable Business Analysts can be helpful during the Data Analysis phase to trial and verify formulae or assumptions. This involves prototyping the proposed target data structures for Fact and Dimension tables (along with proposed transformation logic and calculations).
- Where possible, try to pre-calculate your measures (e.g. line-item price = quantity * product price) during the data load processing. These should be stored as simple data values within your Fact table records. These will save you the need to perform calculations at report runtime. It means that reports will be quicker to run, and it reduces the complexity (or replication of logic) within reports. Ideally, the only calculation logic that your reports should need to perform are group summaries such as totals and averages etc.
- Also remember that fact tables don’t need to be fully normalised. Therefore, if it makes sense to replicate dimensional attributes within a fact table (e.g. categorisations for filtering and grouping) then do so. This also reduces the work and data linkages needing to be performed at report runtime and (again) makes your reports much quicker to run and develop.
- Start by drawing simple Star Schema models to confirm your target Fact and Dimension tables and relationships. Then,
- Start prototyping the table structures, and trialling them with sample data. Where possible use samples taken from your source systems.
- Try generating the calculated fields and categorisation mappings using logic which you've identified so far. Finally,
- Try prototyping simple queries and reports which your business users have indicated that they want.
- Can your proposed structures generate the type of result needed?
- Do the results "look" to be right based on the sample data you've used?
- Are there data quality issues or refinements needed for the data transformation logic and calculations you've identified?
- How often will the data need to be refreshed and updated from the source system?
- What's the best means of getting data from the source system into your data warehouse?
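The prototyping steps above might be sketched as follows - a toy load which pre-calculates line_total and denormalises the product category onto the fact row (all table and field names are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE src_order_line (order_id INTEGER, product_id INTEGER,
                                 quantity INTEGER);
    CREATE TABLE src_product (product_id INTEGER, unit_price REAL,
                              category TEXT);
    INSERT INTO src_order_line VALUES (1, 100, 3), (1, 200, 2);
    INSERT INTO src_product VALUES (100, 9.50, 'Stationery'),
                                   (200, 25.00, 'Hardware');

    -- Target fact table: line_total is pre-calculated at load time, and the
    -- product category is deliberately denormalised onto the fact row.
    CREATE TABLE fact_order_line (order_id INTEGER, product_id INTEGER,
                                  quantity INTEGER, line_total REAL,
                                  category TEXT);
""")
con.execute("""
    INSERT INTO fact_order_line
    SELECT l.order_id, l.product_id, l.quantity,
           l.quantity * p.unit_price,    -- calculated once, at load time
           p.category
    FROM src_order_line l JOIN src_product p ON p.product_id = l.product_id
""")
rows = con.execute("SELECT line_total, category FROM fact_order_line "
                   "ORDER BY product_id").fetchall()
print(rows)  # -> [(28.5, 'Stationery'), (50.0, 'Hardware')]
```

Reports against this fact table only need to sum and group - no joins or arithmetic at runtime.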
Note: Although I’ve referred to Star Schemas and Dimensional Modelling, the outcome of your discussion with users (and prototype review) might be that your users really don’t want a Dimensional Model. Don’t try forcing them to use approaches that they really don’t want, or have difficulty using. I’ve encountered this several times - most notably with business analysts, statisticians, or report writers using the SAS suite. In these cases they often just wanted a set of simple “flat file” denormalised datamarts optimised for different reporting purposes. You have to remember to give customers what they want - no matter how much it goes against your training or instincts. Of course, I still keep the Star Schemas in reserve because eventually someone will want data cubes or derived reports.
9. Interview Business Users and Stakeholders about Data Issues and Observations in Your Prototypes
Now that your prototypes are built, review the data and sample reports with your business users (and the original data-entry staff) to identify issues and observations. In some cases, this may be the first time that anyone has looked at the data in detail. Doing this may reveal inconsistencies in data-entry practices, classifications or business procedures. Or, it may reveal issues with your data extracts, ETL process, or data assumptions.
Outcomes of this will be:
- Improved data extract, ETL routines, and reporting data structures.
- Data clean-up activities to improve (and fix) the quality of existing data in source systems.
- Improved data quality checks.
- Clarification of business process or data-entry instructions (improving the quality of data being entered into source systems).
- Additional detail in metadata notes, describing known (often historic) data issues, process descriptions, and clarified data definitions.
As part of your BI deployment, you may need to change practices for data capture in the source systems (e.g. making some fields mandatory, changing default values, refining data categorisations etc), develop data monitoring reports and KPIs, and incorporate data usage or definitions in your training and metadata notes.
10. Data Reconciliation and Quality Measures
Having been responsible for maintaining systems and investigating data issues, I place huge importance on having data reconciliations and quality checks built into your automated data loads. There’s nothing worse than
- Finding out some of your data is corrupted because there was a problem with the data extract you received from a critical source system … one day six weeks ago. Or,
- Uncovering data extract or loading issues that have been occurring for the past three weeks, which no-one has picked up till now.
Therefore, you need to be proactive in ensuring the accuracy and completeness of your data. Waiting for users to report errors just isn’t good enough. Your BI Support team needs measures in place to detect and correct errors and inconsistencies before users ever notice. These checks should be run as part of your ETL and should generate alerts or messages for your support team in the event that anything needs to be followed up. This won’t entirely eliminate the problem (since some problems may stem from business processes, data-entry, or other source system issues outside your control), but it will reduce your risk.
Building these checks is the responsibility of the original BI Project (or data feed team) since retrofitting them onto existing systems can be time consuming and expensive. However, if you’re in the unfortunate position of having inherited an existing system without these checks, you can still add them - although you may need to do this little by little due to the time, expense, and analysis involved.
Checks can include:
- Source system record counts and checksums provided with each data feed (compare these with equivalent record counts and checksums on initial import into staging database, and again with datamarts after the ETL has been run).
- Automated checks for orphan records or record mismatches.
- Automated checks for invalid values (e.g. out of range).
- Automated checks for unexpected duplicate values.
- Automated checks for new/unexpected reference data (master data) values.
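As a sketch of the record-count/checksum reconciliation, assuming the source system supplies control totals alongside each extract file (the file format and control structure here are invented for illustration):

```python
import csv
import hashlib
import io

# A hypothetical extract file, plus the control totals supplied with the feed
# (simulated here by computing them from the same content).
feed = "emp_id,name\n1,Alice\n2,Bob\n3,Carol\n"
control = {"record_count": 3,
           "checksum": hashlib.md5(feed.encode()).hexdigest()}

# On import: recompute both figures and compare BEFORE loading any data.
actual_count = sum(1 for _ in csv.DictReader(io.StringIO(feed)))
actual_checksum = hashlib.md5(feed.encode()).hexdigest()

problems = []
if actual_count != control["record_count"]:
    problems.append("record count mismatch")
if actual_checksum != control["checksum"]:
    problems.append("checksum mismatch")

# An empty list means the feed reconciles; anything else should raise an
# alert to the support team rather than wait for users to spot bad reports.
print(problems)  # -> []
```

The same comparison should be repeated after the ETL run, against the datamart record counts, to catch issues introduced during the load itself.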
Implementing basic checks will help improve your users’ confidence in your systems. Checksums and record counts can also help you deflect user criticisms that the data is “wrong”. In these cases you can prove that your data is consistent with the source system - getting users to liaise with the source system owners to ensure issues are fixed up there (data fixes needed for reporting should ideally be fixed first in the source system, then flow back into the data warehouse as part of the next data refresh).
Also consider how your automated processes detect issues and generate alerts to your support team. Automated emails are one way - but lineage tools and the development of a status dashboard are also worthwhile options. Besides your support team, this visibility of load status and data quality indicators will also be of interest to power users.
Some Useful References:
The following books are useful references for anyone wanting further information.
Aside from Amazon.com, electronic versions of these books are also available on Safari Books Online (which means that I have them all very conveniently accessible via my iPad).
The Data Warehouse Lifecycle Toolkit, 2nd Ed.
Author(s): Ralph Kimball; Margy Ross; Warren Thornthwaite; Joy Mundy; Bob Becker
This is the classic text for Data Warehousing. Aside from an introduction to Dimensional Modelling (Facts, Dimensions, Slowly Changing Dimensions etc), it contains invaluable advice and templates for the analysis, stakeholder management, and other tasks associated with any data warehousing/BI project. Ralph Kimball and Margy Ross’ other book (The Data Warehouse Toolkit) contains a more detailed description of the Dimensional Modelling approach, along with examples to cover specific scenarios they’ve encountered in their years of consulting. There’s also a range of other related books focusing on ETL and helping you leverage the specific capabilities of Microsoft SQL Server or Oracle BI products - although these are more technical and therefore less useful for Data Analysts.
Information Quality Applied: Best Practices for Improving Business Information, Processes, and Systems.
Author(s): Larry English
Detailed coverage about the importance of information quality, and the impact on the organisation if data can’t be trusted (albeit somewhat exaggerated IMHO, given that the company/project examples given didn’t fail solely due to data or information quality issues). Introduces the concept of Total Information Quality Management (TIQM) based on the classic W. Edwards Deming-inspired Total Quality Management approach.
Data Strategy
Author(s): Sid Adelman; Larissa T. Moss; Majid Abai
Very high-level (and, to be honest, I only read the first few chapters) - but a useful framework for helping to explain Data Strategy components to the uninitiated.
Master Data Management and Data Governance, 2nd Ed.
Author(s): Alex Berson; Larry Dubov
Much greater emphasis on the Master Data Management side of Data Quality.
To Improve BPM, Go Back to the (Data) Source, Business Performance Management Jan 2011
Author(s): Linda Imonti, KPMG
I'm a Business Intelligence and Data Warehousing consultant based in Brisbane, Australia. I've consulted on or managed several large BI systems in New Zealand, Australia and Latin America.
Please use the contact form on this site.
Or phone 04 5704 1640 (Australia)