Database analysis to Big Data analysis

In addition to looking at external sources of information, Market Intelligence can also be gathered by linking and cross-analysing existing database information. Many companies now have vast stores of information in their databases (though sometimes 'Big Data' might only be a few tens of thousands of records) covering customer behaviour over time, but often this is locked away and difficult to use. Increasingly, though, businesses are looking to unlock the power of this 'Big Data' with data science.

Database analysis, or increasingly Big Data analysis, is the process of extracting this data, cleaning it, potentially merging it with other data and performing statistical analysis to better understand customers and customer behaviour. In very large businesses such as retailers, there are often teams of analysts drilling into the database and producing live data for market segmentation, targeting of offers and a range of other business intelligence uses, or automated algorithmic systems using the database to target offers and adverts to different customers.

Database analysis extends across a wide range of skills from basic analysis to understand who your customers are and what they buy, to more complex statistical analysis, data-mining, propensity modelling or using tools such as text analysis or Geographic Information Systems (GIS) and real-time event driven marketing. Generally the aim is to understand and predict customers' needs and behaviours so as to be able to target marketing more effectively and provide better products in a more timely or convenient manner - a so called 360° view of the customer.

For instance, a common simple analysis approach for sales data is to try to understand customer information in terms of recency, frequency and value: looking to identify key groups of customers and mechanisms for increasing sales, clusters of customers who might have similar needs, and predictors of behaviour.

More advanced and complex analysis will use data to build predictive statistical models with the aim of increasing a customer's interest in purchasing, increasing the amount they spend or influencing their purchase behaviour.

For an analyst, the basic procedure for analysing database information, whether from a simple contact database or Big Data, is broadly the same: extract the data, clean it, merge it with other sources, analyse it and implement the findings.

Businesses usually have customer and contact databases, possibly CRM systems, loyalty schemes and web-based interaction and transaction data, but it is not uncommon for there to be many separate databases in the organisation, each holding different information. In some cases a unified database system is used, but it is more common to have operational databases that are lightly linked (eg by customer ID) and then need pulling together and unifying for analysis. The importance of maintaining the operational database means it is more common to do the analytics away from the live data. Increasingly, large online systems use fast noSQL databases for operations, which are then combined (via a map-reduce process) into hierarchical or analytical reports. However, some thought also needs to go into real-time use of intelligence, eg via automated algorithmic marketing, where data from the live database is used on the fly to generate offers or actions such as an 'also bought' option.
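Pulling lightly linked operational tables together for analysis can be sketched as a simple join on the shared customer ID. The table names and fields below are invented for illustration; real extracts would come from the operational systems themselves.

```python
# Sketch: unifying two hypothetical operational extracts on a shared customer ID.
customers = {
    "C001": {"name": "Acme Building Contractors", "postcode": "AB1 2CD"},
    "C002": {"name": "Smith & Sons", "postcode": "EF3 4GH"},
}
orders = [
    {"customer_id": "C001", "value": 120.0},
    {"customer_id": "C001", "value": 80.0},
    {"customer_id": "C002", "value": 200.0},
]

def unify(customers, orders):
    """Join order rows onto customer records for offline analysis."""
    unified = []
    for order in orders:
        cust = customers.get(order["customer_id"], {})
        unified.append({**order, **cust})
    return unified

snapshot = unify(customers, orders)
print(snapshot[0]["name"])  # Acme Building Contractors
```

In practice this join happens against a snapshot pulled away from the live system, so the analysis never interferes with day-to-day operations.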

Businesses also potentially have access to data scraped from websites, or from data providers giving access to social media feeds (these are not freely available), which can be mined and analysed for content and sentiment.

For on-going database analysis, automating as many of these tasks as possible becomes vital with a large dataset, both to ensure that the data is of the same quality for each run of analysis and to save time and effort repeating the same work with each data snapshot. Smaller data snapshots can be handled by hand, but anything over a few tens of thousands of records needs to be properly automated and documented.

Once you have analysed the data, it is then a question of what response you should make, but these are more issues about developing market strategies.

Extracting information

Many internal databases grow and develop through use and contingency, so the first stage, extracting the data, can be complicated: not just because of the scale of the task, but because the database itself is poorly documented, or data is missing or has been moved, particularly with long-standing or legacy systems and where an operational database is evolving over time. In theory you'd start with a clean design. In practice, fields get added, deleted or changed, new forms get added, external data gets merged in, sales and marketing campaigns shift the data or create subsets, and the data goes out of date just through natural attrition.

Surprisingly enough, obtaining internal database information can be one of the most time-consuming tasks in database analysis. This isn't necessarily because it is difficult, but because most information resides on transaction databases that are continually being used, updated and changed as orders come in and go out.

Often obtaining the data needs time to be scheduled and a spec to be written to pull the necessary information out without interfering with the day-to-day running of the database, so pulling a snapshot, checking it, then re-pulling can create some organisational headaches.

For web-based data such as social media feeds, the data can be collected via scraping or bought directly as a feed and may be easier and better structured than an internal dataset from the point of view of analysis.

Once the data has been extracted it is normally held in another database for analysis. In sophisticated systems this second database is also known as a data-warehouse, or for smaller amounts of data, a datamart. However, it is not necessary to have a data-warehouse in order to perform database analysis and much can be carried out on a PC using standard statistical tools such as R even up to a couple of million records.

Much initial database analysis is on an ad hoc basis - particularly in the exploratory phase of a project - so a single extract from one point in time is analysed and used for modelling. However, it is important that models and analysis are reviewed regularly. Using out-of-date models may be worse than using no models at all.


Once data has been obtained, it normally has to be cleaned for statistical analysis. Many databases build up inaccuracies and duplications over time: addresses change, postcodes are entered incorrectly, and records get duplicated, sometimes through mistaken data entry but more often because customers have changed and duplicate records have been created (in a typical business-to-business database, 20-25% of the data will be out of date after a year just because of people changing jobs). Similarly, text feeds need a level of processing to standardise the data and screen for potential problems.

Within an internal database, or when merging datasets, deduping is an important, but sometimes challenging task. Usually data is checked by name, address (eg postcode) and telephone number to see if there are duplications or matches between records. This task can be automated, but even automated systems are not 100% perfect and some judgement is required as to which record should be kept and which discarded. In smaller databases of up to 20,000 records some level of 'eyeballing' can be done to check the quality of the dedupe; in larger datasets this is much more difficult.
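A minimal sketch of automated dedupe checking, using fuzzy string matching on names within matching postcodes. The records and the similarity threshold are illustrative assumptions; a production dedupe would use more fields and a tuned matching strategy.

```python
import difflib

# Hypothetical records; in practice these come from the extracted snapshot.
records = [
    {"id": 1, "name": "Acme Building Contractors", "postcode": "AB1 2CD"},
    {"id": 2, "name": "ACME Building Contractors Ltd", "postcode": "AB1 2CD"},
    {"id": 3, "name": "Smith & Sons", "postcode": "EF3 4GH"},
]

def likely_duplicates(records, threshold=0.8):
    """Flag pairs whose postcode matches and whose names are similar."""
    pairs = []
    for i, a in enumerate(records):
        for b in records[i + 1:]:
            if a["postcode"] != b["postcode"]:
                continue
            ratio = difflib.SequenceMatcher(
                None, a["name"].lower(), b["name"].lower()).ratio()
            if ratio >= threshold:
                pairs.append((a["id"], b["id"], round(ratio, 2)))
    return pairs

pairs = likely_duplicates(records)
print(pairs)  # records 1 and 2 flagged as probable duplicates
```

Flagged pairs would then go for the human decision described above: which record to keep and which to discard.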

When looking at aspects such as purchase histories, the data often has to be grouped up and reclassified. For instance, each product on a database will have a separate product code, but for analysis several individual products may need to be grouped together.

The process of cleaning can be automated including de-duplication and cleaning up missing or bad data, but often there is an element of verification that needs doing by hand - often by examining smaller samples of data. Typically changes done for analysis should not be repeated on the main transactional database to avoid losing data integrity.
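An automated cleaning pass might look like the sketch below: standardising one field (here a UK-style postcode split, purely as an example) and queueing rows with missing data for hand verification, while leaving the source rows untouched so the transactional data keeps its integrity.

```python
# Illustrative cleaning pass on invented rows; field names are assumptions.
raw = [
    {"name": "Acme", "postcode": "ab12cd"},
    {"name": "Smith", "postcode": None},
]

def clean(rows):
    cleaned, to_review = [], []
    for row in rows:
        row = dict(row)  # work on a copy, never the transactional data
        pc = row.get("postcode")
        if not pc:
            to_review.append(row)  # queue for manual verification
            continue
        pc = pc.upper().replace(" ", "")
        row["postcode"] = pc[:-3] + " " + pc[-3:]  # outward/inward split
        cleaned.append(row)
    return cleaned, to_review

cleaned, to_review = clean(raw)
print(cleaned[0]["postcode"])  # AB1 2CD
```

Because the function is deterministic, running it on each data snapshot gives the repeatable, documented cleaning that routine extracts need.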

Where data extraction and analysis is a routine task as is often the case for large datawarehousing projects, cleaning will need to be fully automated to ensure accurate repetition of the cleaning process and for speed purposes.


Once the data is cleaned it can be merged with other data sources. For instance, many organisations actually have a number of different databases that need to be combined before analysis can take place. Alternatively, external sources can be used, for instance to allow for analysis by industrial classification.

Merging again is not entirely straightforward as some allowance may be necessary for the same customer having a different name on different databases. For instance Acme Building Contractors might also be known as ABC. Consequently, there may also be a second period of cleaning necessary once the data has been merged.

It is common on consumer sets to add classification data from external agencies such as the ACORN or MOSAIC geographic classification systems or to link in external data from consumer data companies such as Experian. These provide an additional layer of classification or segmentation data on top of the existing data.


There are many different types of analysis that can be carried out on the data from the database. The first part of any analysis is usually an exploratory stage just to see what's there. A very common simple approach is called Pareto Analysis which involves ranking customers by value and then breaking them into quintiles or deciles to see who the most valuable customers are and what their purchasing characteristics are. In text analysis it might be a simple word frequency count prior to any attempt at sentiment or concept analysis.
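A Pareto-style decile analysis can be sketched in a few lines: rank customers by value, cut them into ten equal groups and report each group's share of total revenue. The customer values here are randomly generated purely for illustration.

```python
import random

# Invented customer values drawn from a skewed distribution,
# sorted so decile 0 is the most valuable tenth of customers.
random.seed(1)
values = sorted((random.expovariate(1 / 100) for _ in range(100)), reverse=True)
total = sum(values)

decile_share = []
for d in range(10):
    chunk = values[d * 10:(d + 1) * 10]
    decile_share.append(round(100 * sum(chunk) / total, 1))

print(decile_share)  # the top decile holds the largest share of revenue
```

On real sales data the top decile commonly accounts for a disproportionate share of value, which is exactly what this kind of ranking is meant to expose.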

Standard transactional database measures are recency, frequency and value. So who bought in the last month, 3 months, 6 months? Who has bought once a year, twice a year, etc? How much did they spend? What was the difference between those spending a lot and those in the next category down (and so where uplift might be found)?
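These recency, frequency and value measures reduce to a simple per-customer summary of the transaction log. The transactions and dates below are invented for illustration.

```python
from datetime import date

# Hypothetical transaction rows: (customer ID, purchase date, amount).
transactions = [
    ("C001", date(2024, 5, 1), 120.0),
    ("C001", date(2024, 6, 15), 80.0),
    ("C002", date(2023, 11, 3), 200.0),
]
today = date(2024, 7, 1)

rfm = {}
for cust, when, amount in transactions:
    entry = rfm.setdefault(cust, {"recency_days": None, "frequency": 0, "value": 0.0})
    days = (today - when).days
    if entry["recency_days"] is None or days < entry["recency_days"]:
        entry["recency_days"] = days  # days since most recent purchase
    entry["frequency"] += 1          # number of purchases
    entry["value"] += amount         # total spend

print(rfm["C001"])  # {'recency_days': 16, 'frequency': 2, 'value': 200.0}
```

Banding these three numbers (eg bought in the last month / 3 months / 6 months) then gives the standard RFM groupings used for targeting.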

Increasingly businesses look to track customers and then look at customer journeys - what transactions happened when and how did a customer move from one transaction to the next.

A core aim of many types of analysis is to build a 'propensity model': a model of those customers who are most likely to act in a certain way - for instance those people who need contacting, those who would be most likely to respond to a particular communication, or those who are likely to leave or stop buying - enabling actions such as chasing up lapsed customers or making more offers to frequent buyers. In particular, you want to identify who the core purchasing groups are and who is most likely to respond to advertising or communications.
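A toy propensity model can be built as a logistic regression fitted by gradient descent. Everything below (the two features, the training data, the learning rate) is an illustrative assumption; real propensity models use many more variables and a proper modelling library.

```python
import math

# Invented training rows: (frequency, recency_days); y = 1 means the
# customer responded to the last mailing.
X = [(1, 200), (2, 90), (5, 30), (8, 10), (3, 60), (1, 150)]
y = [0, 0, 1, 1, 1, 0]

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Fit weights by simple stochastic gradient descent on log loss.
w, b, lr = [0.0, 0.0], 0.0, 0.01
for _ in range(2000):
    for (f, r), target in zip(X, y):
        p = sigmoid(w[0] * f + w[1] * (r / 100) + b)
        err = p - target
        w[0] -= lr * err * f
        w[1] -= lr * err * (r / 100)
        b -= lr * err

def propensity(freq, recency_days):
    """Estimated probability of responding to the next mailing."""
    return sigmoid(w[0] * freq + w[1] * (recency_days / 100) + b)

# A frequent, recent buyer should score higher than a lapsed one.
print(propensity(8, 10) > propensity(1, 200))  # True
```

The scores can then be used to rank customers and mail only those above a chosen propensity threshold.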

More sophisticated analysis starts to look at cross-sale rates (the % of buyers of screws that also bought nails), basket size and common product bundles and to look at statistical behaviours - this may involve analysing small groups of customers as if they are a single person, to be able to model or understand behaviour. Where there are large numbers of products (eg supermarkets), the range and number of combinations can make such analysis extremely complicated.
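The cross-sale rate in the screws-and-nails example reduces to counting how often two products co-occur in a basket. The baskets below are invented; with a supermarket-sized product range the number of pairs explodes, which is the complexity the paragraph above refers to.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets, each a set of products bought together.
baskets = [
    {"screws", "nails", "glue"},
    {"screws", "nails"},
    {"screws", "paint"},
    {"nails", "glue"},
]

pair_counts = Counter()
product_counts = Counter()
for basket in baskets:
    product_counts.update(basket)
    pair_counts.update(frozenset(p) for p in combinations(sorted(basket), 2))

# Cross-sale rate: share of screw buyers who also bought nails.
both = pair_counts[frozenset({"screws", "nails"})]
rate = both / product_counts["screws"]
print(f"{rate:.0%} of screw buyers also bought nails")  # 67%
```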

In particular, the data can be used for predictive modelling: for instance, looking to see which customers would be most likely to respond to a piece of direct mail or a new product launch. The aim is usually to reduce the size of your mailing whilst maximising the number of people who will respond, so reducing the cost per response and increasing the value per response.

In addition, the database data can be used for segmentation. A major difference between segmentation via the database and market research segmentation is that the results can be marked back onto the database, so that each customer is labelled with their segment. This means that if you need to contact or track a particular segment from the database this is entirely possible, whereas for market research you are typically taking a second-level guess. For basic sales-level databases the difficulty is that you are normally limited in the number of variables available for segmentation, but with Big Data there is more potential for a deeper segmentation approach.
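Database segmentation is often done with clustering; the hand-rolled k-means sketch below groups customers on two scaled variables (spend and frequency, say) so the resulting labels can be written back onto each customer record. The points, the choice of k and the variables are all illustrative assumptions.

```python
import random

# Invented, pre-scaled customer points (eg spend, frequency).
random.seed(0)
points = [(0.1, 0.2), (0.15, 0.25), (0.8, 0.9), (0.85, 0.8), (0.5, 0.1)]

def kmeans(points, k=2, iters=20):
    """Minimal k-means: assign each point to its nearest centre,
    then move each centre to the mean of its members."""
    centres = random.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k),
                      key=lambda c: (p[0] - centres[c][0]) ** 2
                                  + (p[1] - centres[c][1]) ** 2)
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centres[c] = (sum(m[0] for m in members) / len(members),
                              sum(m[1] for m in members) / len(members))
    return labels

segments = kmeans(points)
print(segments)  # each customer tagged with a segment label
```

Marking these labels back onto the database is what makes direct contact and tracking of a segment possible, unlike a research-only segmentation.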


Once the analysis is creating marketing insights, the next stage is implementation - that is, using the data and analysis to create change. If the analysis leads to new ideas and recommendations - for instance changing the product or website, or identifying unmet customer needs - these will be dealt with in a similar way to other forms of market intelligence or research. However, much is made of the ability to use Big Data to create customised experiences for the customer online, or to create uniquely targeted communications, products or recommendations. These experiences need to be built in and programmed, and the challenge with strategies involving large amounts of potential customisation (eg sending out customised reward vouchers for each customer) is that, because of the amount of customisation needed, implementation also has to be automated.

For instance a typical market research segmentation might be 4-5 segments each of which can be planned for and targeted. For a Big Data type segmentation there may be thousands of segments each of which needs to have tailored communication, specifically targeted offers and a system of response measurement and management. This need for multi-faceted implementation leads to the development of algorithmic and experimental marketing and the importance of bringing the analysis back to websites.

Experiment and research

A recurring view of Big Data is the idea that all the information you need is sitting in the databases, and that with proper analysis the business will be able to predict exactly what the customer wants and will do. Unfortunately, that is far from the truth. Big Data analysis can find relationships and correlations in the data and therefore help improve and optimise products and services, but the main problem with database data is that it is backwards looking - it tells you what customers have done. If a new competitor enters the market, or you launch a new product, there is no data about what will happen next. There is also an 'analytical delay': analysis, and finding useful insights, takes time, and by the time the analysis is finished the market may have moved on to new things.

For this reason, an alternative use of an ongoing stream of Big Data is small-scale live experimentation to test how people react to changes, offers and communications. Big Data gives the business enough resolution to run micro-tests, adjust and learn, and then implement at a much bigger scale. Designing experiments is often thought of in terms of simple A/B testing (which of A or B wins), but principles of conjoint analysis and experimental design are equally applicable in experimental marketing.
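Deciding whether A or B "wins" is usually a two-proportion significance test on the observed conversion counts. The counts below are invented; in practice they come from the live experiment.

```python
import math

def ab_test(conv_a, n_a, conv_b, n_b):
    """Two-proportion z statistic for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled conversion rate
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical test: offer A converted 120 of 2000, offer B 160 of 2000.
z = ab_test(conv_a=120, n_a=2000, conv_b=160, n_b=2000)
print(round(z, 2))  # |z| > 1.96 suggests a real difference at the 5% level
```

The same machinery extends to multi-variant designs, which is where the experimental design principles mentioned above come in.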

For help and advice on the effective use of database analysis and Big Data contact

