Big Data is playing an increasingly prominent role in global merger control. But what do we mean by ‘Big Data’ in this context, how should advisors be using it, and at which points in the merger control process? This post is the first in a series of articles in which we will explore these questions.
First things first, though: let’s explain why Big Data matters in merger control reviews.
Quite simply, competition authorities are increasingly demanding large volumes of data in merger control reviews to assess the competitive effects of mergers. This includes internal documents (e.g. those relating to deal rationale, entry/expansion plans, and which rivals the merging parties track), but also detailed transaction, customer and bidding records (e.g. to capture data on competitive impacts and responses). Given the broad scope of data that might be requested, we find it useful to use the characteristics of Big Data to frame the “answers” to these requests and how the underlying data should be understood, used and interpreted. The old adage of “garbage in, garbage out” applies even more strongly today.
Next, let’s explore some definitions. Big Data refers to electronically stored information (ESI) and is often described using the so-called five Vs:
- Volume – The quantity of data that needs to be considered is so vast that ‘conventional’ analysis tools (for example, Microsoft Excel) are not practical, or indeed feasible, and therefore specialist data analytics hardware and software are required.
- Variety – The relevant data comes from multiple different sources and is stored in many different formats (more on this later), which need to be standardised and consolidated into a single usable view of the data.
- Veracity – The data is of variable quality – i.e. it is not always complete and accurate – and therefore any approach to analysis needs to take this into account, whether by avoiding the use of very dirty data, cleansing the data, or clearly articulating the limitations of any analysis that makes use of it.
- Value – Fundamentally, the data offers value to us if it tells us how the world was or currently is, how it might be in the future, and how events captured and described in the data may or may not be related to one another.
- Velocity – The rate at which the data flows, i.e. how quickly it is being produced. This is most relevant to real-time systems that need to ingest, process and present data on the fly.
While Velocity may apply in certain aspects of merger control (for example, the ability of firms to access or handle high velocity data being a barrier to entry into some markets), it is not a common feature of the Big Data we see in merger control cases. For this reason, we will swap this one out for our own fifth V…
Vintage (you will need to forgive us somewhat stretching the definition to fit our need for a word starting with V!). Merger control often requires us to obtain and analyse large quantities of historical data, typically reaching back three years but sometimes longer. This data is often what we describe as being “dusty”, lurking in the client’s metaphorical (or actual) IT basement and hidden inside systems that are no longer active or supported. This is a particular challenge and one that, to paraphrase Liam Neeson, requires a special set of skills.
Now that we have defined Big Data, we can start to break this unwieldy beast into more bite-sized chunks. We often talk about the Big Data landscape in terms of three buckets, which we explain through three common file formats:
- Structured data – At its most basic level, this is data that has been deliberately created and stored in a highly defined structure, most commonly a table like the one you would be used to seeing in a Microsoft Excel spreadsheet. The rows in the table represent records of specific events or objects (e.g. sales to customers, or the products or services that the company provides), whereas the columns provide data points relating to each event or object, such as the date and time an event occurred. These tables are commonly stored in databases, which allow you to build intricate structures that capture the complex relationships that can exist between events and objects (e.g. products and the sales of those products); the first sketch after this list gives a simple illustration. Because the data is highly defined and structured, it is a very efficient way of storing information. The challenge arises from the huge variety of different systems that have been developed over the years, all of which store data differently. In addition, those inputting data may not always have done so consistently.
- Unstructured data – This sits at the other end of the spectrum and can be most easily characterised as ‘natural language’ content produced by humans. While languages do have rules, we can express the same idea or description of an event or object in an almost endless number of ways. We encounter unstructured data every day in Microsoft Word documents or emails. Because unstructured data contains so much variation, it cannot be stored as efficiently as structured data. We also encounter the challenges of duplicates (e.g. different versions and email threads) and of identifying which documents are of interest (the classic needle-in-a-haystack problem). Additionally, data is produced in a context (for a purpose, to communicate with specific people, at a point in time), and sometimes “context is all”.
- Semi-structured data – Sitting in the middle of the spectrum, these are data sources that show elements of structure but need some coding wizardry to transform them into a more useful form. A common example is an Adobe PDF bank statement, which was produced by a computer using defined rules; as those rules are often no longer available, we have to ‘wrangle’ the data to transform it back into a table of bank transactions (the second sketch below gives a flavour of this).
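To make the structured bucket a little more concrete, here is a minimal, purely illustrative sketch. The table names, products and figures are invented for this post rather than drawn from any real case; the point is simply how rows record events, columns describe them, and relationships between tables (here, products and the sales of those products) can be expressed in a small relational database:

```python
import sqlite3

# Illustrative only: a toy relational structure, not a real client dataset.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE products (
        product_id INTEGER PRIMARY KEY,
        name       TEXT
    );
    CREATE TABLE sales (
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES products(product_id),
        customer   TEXT,
        sale_date  TEXT,   -- each row is a record of one event
        value_gbp  REAL    -- each column is a data point about that event
    );
""")
con.executemany("INSERT INTO products VALUES (?, ?)",
                [(1, "Widget A"), (2, "Widget B")])
con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
                [(1, 1, "Customer X", "2021-03-15", 1200.0),
                 (2, 2, "Customer Y", "2021-04-02", 850.0)])

# A join expresses the relationship between products and their sales.
for row in con.execute("""
        SELECT p.name, s.customer, s.sale_date, s.value_gbp
        FROM sales s JOIN products p ON p.product_id = s.product_id"""):
    print(row)
```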
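And here is an equally simplified flavour of the ‘wrangling’ needed for semi-structured data. The statement lines and the pattern used to parse them are hypothetical; real layouts vary enormously and typically require far more robust handling than a single regular expression:

```python
import re
from pprint import pprint

# Hypothetical lines of text extracted from a PDF bank statement.
raw_lines = [
    "15/03/2021  CARD PAYMENT  ACME SUPPLIES LTD      -1,200.00",
    "02/04/2021  BACS RECEIPT  CUSTOMER Y              850.00",
]

# A simple pattern inferring the implicit structure: date, description, amount.
pattern = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4})\s+(?P<description>.+?)\s+(?P<amount>-?[\d,]+\.\d{2})$"
)

transactions = []
for line in raw_lines:
    match = pattern.match(line)
    if match:  # skip lines that do not fit the inferred structure
        transactions.append({
            "date": match.group("date"),
            "description": match.group("description").strip(),
            "amount": float(match.group("amount").replace(",", "")),
        })

pprint(transactions)  # a table of transactions, recovered from loose text
```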
We’ve now completed our anatomy of Big Data. Next time, we will explore in more detail what data is relevant in merger control and whether it is a friend or foe.