Understanding market research samples and sampling methods

Samples and sampling are the bedrock of market research, but there is not just one way to sample or one type of sample. Researchers choose how they sample based on a number of factors, including how easy it is to find the target population and how important it is to have a genuine random sample as opposed to a so-called convenience sample.

Sampling basics

Sampling, at its simplest, is a very straightforward process. You start with a list of people in the population you want to sample, and you select a sampling interval N according to the sample size you want to achieve. You randomise the order of the list, then pick every Nth record in the dataset and conduct a survey or questionnaire with that person. So, for instance, you have a list of 20,000 people. You want a sample of 100, so the sampling interval is 20,000/100 = 200 (a sampling fraction of 1 in 200). You pick every 200th person and there is your sample. Crucially, each person on the list has had an equal chance of being invited to take part (the equal probability of selection method, or EPSEM) and so the sample is fully random.
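The 1 in N selection described above can be sketched in a few lines of Python. This is a minimal illustration rather than a production routine: the function name systematic_sample, the seed value and the 20,000 integer IDs standing in for a real list of people are all assumptions made for the example.

```python
import random

def systematic_sample(records, sample_size, seed=None):
    """Draw a 1-in-N systematic sample from a list of records."""
    if not 0 < sample_size <= len(records):
        raise ValueError("sample_size must be between 1 and the list length")
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)                     # randomise the order of the list
    interval = len(shuffled) // sample_size   # N, e.g. 20,000 // 100 = 200
    start = rng.randrange(interval)           # random first record within the first N
    # Take every Nth record from the starting point, capped at the target size.
    return shuffled[start::interval][:sample_size]

# Hypothetical population: 20,000 IDs standing in for a list of people.
population = list(range(20_000))
sample = systematic_sample(population, 100, seed=1)
print(len(sample))  # 100
```

Passing a seed makes the draw repeatable, which is useful if the selection ever needs to be checked or rerun.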

For some projects it can be as simple as this process. A company with a list of willing customers eager to take part in a survey would be a good approximation to it. The only slight catch is to check that the list is randomised fairly and that you use the proper 1 in N method (usually with a seed number to pick the first record). It has been known for customers to be contacted sequentially from the first record on, until the sample size is met. Unfortunately, as database IDs tend to run from oldest to newest, you can end up with a sample made up entirely of old customers this way.

However, even this database method of picking a sample is not without its problems. If you are selling B2B, the likelihood is that your biggest customers will be larger businesses themselves. But the list of all your customers will be dominated by smaller customers, because the typical profile of a B2B customer database is that 20% of the customers make up 80% of the sales. If you pull a sample of 200 on the 1 in N basis, you would anticipate about 40 large customers and 160 small customers. That doesn't really reflect the sales profile, so the views of the larger customers might be drowned out in the analysis by the larger number of smaller customers.

For this reason, the sample needs to be stratified - that is, different groups need to be sampled separately - so a sample of large customers is drawn separately from a sample of small customers. This type of stratification can also be used to control the profile of the sample. In a fully random sample, the randomness means that at times the sample might contain a disproportionate number of questionnaires for one group or another. Stratifying and sampling within each stratum allows known profiles to be controlled for. To take the database example one stage further, it might mean dividing the database into geographic regions and then sampling within each region in order to ensure the sample matches the known profile of the database.
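As a rough sketch of stratified selection, the Python below groups a customer list into strata and draws a separate random sample from each. The customer records, the "size" field, the 20/80 split and the targets of 100 interviews per stratum are hypothetical figures chosen for illustration, not a recommended design.

```python
import random

def stratified_sample(records, stratum_of, targets, seed=None):
    """Draw a separate simple random sample from each stratum."""
    rng = random.Random(seed)
    strata = {}
    for record in records:
        # Group the records by the stratum they belong to.
        strata.setdefault(stratum_of(record), []).append(record)
    sample = {}
    for name, target in targets.items():
        members = strata.get(name, [])
        # Never ask for more records than the stratum contains.
        sample[name] = rng.sample(members, min(target, len(members)))
    return sample

# Hypothetical database: 1,000 customers, 1 in 5 flagged as large.
customers = [{"id": i, "size": "large" if i % 5 == 0 else "small"}
             for i in range(1000)]
picked = stratified_sample(customers, lambda c: c["size"],
                           {"large": 100, "small": 100}, seed=7)
print({name: len(rows) for name, rows in picked.items()})
# {'large': 100, 'small': 100}
```

Here the large customers are deliberately over-sampled relative to their share of the list, so any total figures would need weighting back to the database profile.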

In the database cases, the sample is drawn from a known list. But in most research cases there is no list to draw from - there is no known list of internet users, or mobile phone owners, or owners of a particular car, or drinkers of a particular beer. In these cases sampling moves from the theoretical purity of a 1 in N sample to something which balances purity of design with practicality of locating the people you wish to interview.

Random or pseudo-random samples

With no list to work with, if you really require a random sample (eg for Government statistics, or for measuring media use where people pay for a particular level of advertising exposure), telephone used to be one of the best methods. For random telephone samples the broad principle is that you set a computer to call numbers at random in order to make contact with individuals. In practice dialling purely random numbers isn't very efficient, but phone companies often allocate numbers in blocks, and in places like the US databases of these blocks existed. It was then a process of randomly selecting a block, then selecting a number within the block at random and then, if it was a household number, selecting an individual in the household at random. This was at least the principle before mobile phones and before homes with multiple phone lines. It is still used as a method, but once you have more than one line, or a line that might be turned on or off according to the weight of use, the quality of the randomness starts to diminish, albeit only slightly compared to other sampling methods. Essentially, people with more than one line are more likely to be called, and if your mobile phone is switched on more of the time, you are more likely to be called. Mobiles also cause problems because mobile numbers are used differently from fixed lines: you have no idea where the person receiving the call might be. They could be overseas, in which case they might be charged to receive your call. They could be driving, in which case you shouldn't be interviewing them.
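The block-then-number-then-individual selection can be illustrated with a short Python sketch. Everything specific here is an assumption for the example: the block prefixes are invented, each block is assumed to hold exactly 100 numbers (00 to 99), and the household members are placeholders.

```python
import random

def pick_phone_number(blocks, rng):
    """Two-stage selection: a random block, then a random number inside it."""
    block = rng.choice(blocks)      # stage 1: choose a block of numbers
    suffix = rng.randrange(100)     # stage 2: choose one of its 100 numbers
    return f"{block}{suffix:02d}"

def pick_household_member(adults, rng):
    """Final stage: choose one adult in the household at random."""
    return rng.choice(adults)

rng = random.Random(42)
# Hypothetical blocks; each prefix is assumed to cover 100 numbers.
blocks = ["555-01", "555-02", "555-07"]
number = pick_phone_number(blocks, rng)
respondent = pick_household_member(["adult A", "adult B", "adult C"], rng)
print(number, respondent)
```

Because the blocks in this sketch are all the same size, every number has the same chance of being dialled. It does nothing to correct for the problems noted above, such as households with more than one line.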

If you're still looking for a random sample and telephone is not applicable, face-to-face might be an option. It's not common in the United States because of the geography and distances involved, but it was common in the UK for major surveys like follow-ups to the government census or major health studies. In a face-to-face random survey you again need a list. In the UK this was the electoral roll. You would pick individuals at random from the electoral roll, then go and interview them, returning several times if they were out. Though very pure statistically, these types of survey are extremely labour intensive and so extremely expensive. For this reason alternatives were developed. The main ones split the country into geographic regions, sometimes down to blocks of 10-20 houses (the enumeration districts that are used to ensure full coverage for the census). Then, starting with a list of houses or small geographic areas, you would pick areas at random and allocate interviewers to visit the households in those areas. You would control for who was at home (eg employees were likely to be out during the day) by controlling the time of the interviews and who could be interviewed. This is still expensive, but at least manageable in terms of allocating an interviewing team, and it is still the dominant method for conducting face-to-face omnibus studies and media studies where randomness is important in order to properly measure survey-to-survey variations.

What about online research? In practice, unless you have your own database list, it's very difficult to pull a true random sample. And any sample drawn from a database is really only representative of that database. However, practice (eg YouGov opinion polls) and the current size of the panels seem to suggest that if you have a large proportion of the population on your lists you can draw a sample and get results like those from a random sample.

Convenience sampling and quota-based sampling

In the main, unless you have a detailed list to start with or can use random-digit dialling, fully randomised samples are expensive and difficult to obtain. For most categories - eg recent car buyers, users of fly spray, visitors to Bristol - a list simply isn't available. For some of these types of categories you can do a 'screen'. In a screen you take a random or pseudo-random sample and then use a screener questionnaire (also known as a recruitment questionnaire) to identify the core group you wish to interview. For groups that are a small part of a larger population, this can mean asking thousands of people for help in order to get just a few hundred responses.
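The arithmetic behind a screen is simple enough to sketch. In the Python below, incidence is the share of the screened population that qualifies and cooperation_rate is the share of qualifiers who go on to complete the interview. Both names and all the figures are assumptions for illustration; real rates vary widely by category and method.

```python
import math
from fractions import Fraction

def contacts_needed(target_interviews, incidence, cooperation_rate=1):
    """Estimate how many people must be screened to reach the target."""
    # Share of screened contacts expected to end in a completed interview.
    expected_yield = incidence * cooperation_rate
    return math.ceil(target_interviews / expected_yield)

# Hypothetical: 300 interviews wanted, 5% of people qualify.
print(contacts_needed(300, Fraction(5, 100)))                  # 6000
# Same target if only half of those who qualify agree to take part.
print(contacts_needed(300, Fraction(5, 100), Fraction(1, 2)))  # 12000
```

Fractions are used in place of floats only so that the illustrative figures come out as exact whole numbers.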

So instead, researchers will use what are known as 'convenience' samples. A convenience sample means finding people who fit the criteria, but not worrying about whether the sample is genuinely random. An example is stopping people in the street to ask them to take part in a survey (street interviewing). Here you can only interview people who are passing so you do not have a genuine random sample - for instance it's likely to be biased towards people not working, able-bodied people and often younger females during the day. Similarly, an online panel is, in reality, another form of convenience sample. The people who sign up for an online panel are not necessarily representative of the full range of views in the market because you don't know if there is a bias introduced by getting people to sign up (eg if you ran a survey on privacy, you might find panel respondents less concerned about privacy than those who have not signed up to a panel).

A classic convenience sample is a company's own customer lists. This introduces a natural bias towards the company and the company's products - it will not include many non-customers or people who reject the company's products. This can be acceptable within known limits, but it is something to be very careful of. This hidden type of bias comes into a lot of database and web analytics, as these internal sources can only provide information about the people who bought or who visited, and not those who didn't. The consequence is that it can be very difficult to say anything about why people don't become customers or don't spend a long time on the website.

Because of the hidden potentials for bias in convenience sampling, one method for control is to set quotas to ensure that a certain number of interviews are achieved in certain categories. This might include setting quotas by age, or working status, or socio-economic grade, but in business-to-business surveys might include the sector (the companies that do the most marketing are typically the least likely to do market research surveys - local government the most likely to take part), or size of the business.

A quota is then used to set a target and a limit on the number of interviews to be achieved. For instance, a minimum of 5 men under 25 and a maximum of 10 men aged 65+. If the quotas are set very tightly it can be very difficult to find the last few interviews, but if they are too loose the sample will tend towards the easy-to-find categories of respondents.
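A minimal sketch of how quota minimums and maximums might be tracked during fieldwork is shown below. The Quota class, the try_accept and shortfalls helpers and the cell names are all hypothetical, and the sketch assumes each respondent falls into exactly one predefined cell.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Quota:
    minimum: int = 0                # target: interviews still needed below this
    maximum: Optional[int] = None   # limit: no interviews accepted beyond this
    achieved: int = 0

def try_accept(quotas, cell):
    """Accept a respondent unless their quota cell is already full."""
    quota = quotas[cell]
    if quota.maximum is not None and quota.achieved >= quota.maximum:
        return False
    quota.achieved += 1
    return True

def shortfalls(quotas):
    """Interviews still required to meet each minimum."""
    return {cell: q.minimum - q.achieved
            for cell, q in quotas.items() if q.achieved < q.minimum}

quotas = {"men under 25": Quota(minimum=5), "men 65+": Quota(maximum=10)}
accepted = sum(try_accept(quotas, "men 65+") for _ in range(12))
for _ in range(3):
    try_accept(quotas, "men under 25")
print(accepted)            # 10
print(shortfalls(quotas))  # {'men under 25': 2}
```

The shortfall report is what shows which cells are still short, the 'last few interviews' that become hard to find when quotas are tight.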

Non-response bias

Adding quotas and setting interview targets doesn't make the sample random, but for reasons such as cost or speed it may be considered the best available sample for the job. Obviously the researcher needs to keep an eye on potential biases, but there is one more hidden potential bias, even with random samples. Imagine an individual has been chosen at random to take part. If that individual then declines to complete the survey, there is the potential that this introduces a non-response bias. In other words, how can you know that the people who don't take part are like the people who do take part in the survey? In some cases simply saying a survey is being carried out on behalf of, say, Epson will mean that customers who prefer HP may be less likely to take part. For this reason the decision to reveal or not reveal the sponsor of the survey could skew the results.

In some cases for governmental surveys, the question of non-response bias has been important enough for follow-up checks on those who did not respond. Instead of completing a full questionnaire the non-responders were asked a handful of the key questions. In general these suggested that the original non-responders were similar to those who took part in the survey at the start.

Other forms of sampling

In some cases obtaining a full sample can be extremely difficult and creative ways are needed to provide an answer to the research problem. A very famous case of this was at BMRB in the 1990s, looking into the effectiveness of advertising to counter the threat of AIDS/HIV. In this case a sample of gay men was vital, but it was extremely difficult to get any form of sample by conventional means. So instead a 'proxy sample' was used. Interviews were carried out in gay clubs and changes in opinions and behaviour were monitored over time. This use of a proxy for monitoring purposes is common. Even if the sample is biased, so long as the samples are consistent it may be possible to measure changes, even if these are not directly projectable to the population in question, and therefore to judge the success or otherwise of the advertising.

A second common problem is that the population to be researched may exist, but may not be easy to reach through an interviewer or a formal request to take part. An example is a survey among volleyball players we carried out for the English Volleyball Association. The group of volleyball players clearly exists, but rather than use an interviewer-led approach, a 'snowball' method was used. In other words, friends ask friends to complete the survey. Again there are obvious potential biases - the keener and more interested players are more likely to take part, as are the better-connected individuals. Snowball techniques have also been used to recruit difficult-to-reach groups like ex-teachers to help monitor campaigns to recruit people back into teaching.

For help and advice on sampling and sample design contact info@dobney.com

