Data Infrastructure & Governance Issues in M&A Due Diligence
Common Data Quality Problems in SaaS Companies
SaaS businesses often grapple with classic data quality issues that can undermine analytics and operations. Common problems include duplicate records, inconsistent data, incomplete or missing fields, and fragmented data silos across different systems . For example, customer information may be duplicated in CRM and billing databases, or key metrics (like ARR or churn) might be defined differently by separate teams, leading to inconsistencies. Outdated data is another concern – without rigorous data hygiene, SaaS companies accumulate stale information that no longer reflects reality. All these issues erode trust in reporting and require extra cleanup work before data can be reliably used for decision-making. Notably, even tech giants are not immune: IBM reports that poor data quality (duplicates, inconsistencies, missing data, etc.) causes missed opportunities and inefficiencies for enterprises leveraging large datasets . In an M&A due diligence context, buyers pay close attention to such data quality gaps. If a target’s KPIs or customer lists are built on flawed data, it raises red flags about the accuracy of the business’s performance metrics and can complicate integration post-deal.
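To make the "same metric, different numbers" problem concrete, here is a minimal sketch (hypothetical records and field names) showing how two teams can report materially different ARR from the same subscription data simply by disagreeing on whether churned customers and one-time fees count:

```python
# Hypothetical subscription records; field names are illustrative only.
subscriptions = [
    {"customer": "acme",    "mrr": 1_000, "status": "active",  "one_time_fees": 500},
    {"customer": "globex",  "mrr": 2_000, "status": "active",  "one_time_fees": 0},
    {"customer": "initech", "mrr": 750,   "status": "churned", "one_time_fees": 250},
]

# Team A: ARR = 12 x MRR of currently active subscriptions only.
arr_team_a = 12 * sum(s["mrr"] for s in subscriptions if s["status"] == "active")

# Team B: annualizes all booked MRR (including churned accounts) and adds
# one-time fees -- a common way ARR quietly gets inflated.
arr_team_b = 12 * sum(s["mrr"] for s in subscriptions) + sum(s["one_time_fees"] for s in subscriptions)

print(f"Team A ARR: ${arr_team_a:,}")  # Team A ARR: $36,000
print(f"Team B ARR: ${arr_team_b:,}")  # Team B ARR: $45,750
```

In diligence the fix is less about code than about a single, documented metric definition that every team computes against.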
Costs of Poor Data Quality (Decisions, Rework & Compliance)
The business costs of “bad data” are enormous and manifest in multiple ways: incorrect strategic decisions, productivity loss due to rework, compliance penalties, and more. Poor data quality directly leads to bad decisions and financial loss – one estimate pegs the annual impact at $3.1 trillion in the U.S. economy . For instance, 85% of companies blame “stale” (out-of-date) data for bad decision-making and lost revenue . Data teams spend excessive time on rework: one survey found an average 15 hours to resolve a single data incident, and with roughly 1 data quality issue per 10 tables per year, downtime adds up fast (e.g. an environment of 1000 tables might suffer ~100 issues requiring ~1,500 hours of remediation). Compliance and regulatory risks are equally high. Inadequate data can trigger audits or fines – for example, JPMorgan Chase was fined ~$350 million in 2024 for providing incomplete trading data to regulators , a direct consequence of poor data management. Below are a few concrete examples illustrating the heavy price of data quality failures:
Unity Technologies ($110M loss): In 2022, Unity’s ad-targeting product ingested corrupted data, derailing its ML models. The result was a $110 million hit to revenue-sharing, including costly model rebuilds and delayed feature launches . Unity’s stock plunged 37% as investors “lost faith” in management , showing how data errors can damage market value and require extensive rework.
Uber ($45M payout): A data miscalculation led Uber to take a higher commission from New York drivers than allowed. This went on for 2.5 years, forcing Uber to reimburse drivers with 9% interest – about $45 million in total paybacks . Beyond direct cost, it hurt Uber’s reputation (coming on the heels of a $20M FTC settlement over earnings claims) .
Samsung Securities (fat-finger error): In 2018, a trivial data entry mistake (entering “shares” instead of currency) caused Samsung’s brokerage arm to erroneously issue $105 billion in stock to employees. Although the error was caught within minutes, $187 million worth of phantom shares were sold off before trading halted . The fallout included a 12% stock drop (wiping ~$300M in market cap), loss of major clients (due to concerns over “poor safety measures”), a 6-month ban on new business, and the CEO’s resignation . This case shows how a single data lapse can cascade into multi-million-dollar losses and compliance sanctions.
Equifax (credit score fiasco): For three weeks in 2022, Equifax delivered incorrect credit scores to lenders for millions of people – over 300,000 scores were off by 20+ points, causing loan denials and wrong interest rates . The cause was a legacy “coding issue” in a data pipeline. Equifax faced a class-action lawsuit and a ~5% stock price hit once the errors came to light . Coming after a $700M breach settlement in 2017, this further damaged Equifax’s credibility.
Public Health England (COVID under-reporting): In 2020, PHE failed to report 15,841 positive COVID-19 cases because an outdated Excel template (XLS format) truncated rows when aggregating lab results . The immediate cost was public health risk – an estimated 50,000+ potentially exposed people were not contacted by tracers . This widely publicized error underscored how antiquated data processes can have life-or-death consequences (and led PHE to accelerate replacing its legacy data tools ).
These examples make clear that poor data quality isn’t just an IT inconvenience – it causes real financial damage, requires costly rework, invites legal liability, and can even threaten lives. Acquirers in M&A diligence therefore assess the target’s data quality rigor. If a company’s data errors have led to misreported KPIs, compliance issues, or excessive cleanup efforts, buyers will factor those risks into the deal (either through lower valuation or specific remediation plans).
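As a rough way to translate the incident figures cited above (about 1 data quality issue per 10 tables per year, ~15 hours to resolve each) into budget terms, here is a back-of-the-envelope sketch; the loaded hourly rate is an assumption for illustration only:

```python
def annual_data_downtime_cost(num_tables: int,
                              incidents_per_table_per_year: float = 0.1,
                              hours_per_incident: float = 15.0,
                              loaded_hourly_rate: float = 75.0) -> dict:
    """Rough annual rework estimate from the incident rates cited in the text.

    The 0.1 incidents/table/year and 15 hours/incident figures come from the
    survey data referenced above; the hourly rate is an assumption.
    """
    incidents = num_tables * incidents_per_table_per_year
    hours = incidents * hours_per_incident
    return {
        "incidents_per_year": round(incidents),
        "remediation_hours": round(hours),
        "labor_cost_usd": round(hours * loaded_hourly_rate),
    }

print(annual_data_downtime_cost(1_000))
# {'incidents_per_year': 100, 'remediation_hours': 1500, 'labor_cost_usd': 112500}
```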
Data Warehouse Pricing Models and Cost Optimization (Snowflake, BigQuery, Redshift)
Modern cloud data warehouses use different pricing models, each with its own cost structure and optimization tactics. In due diligence, buyers examine a SaaS target’s data warehouse spend to see if it’s cost-efficient or if there are optimization opportunities post-acquisition. Below is an overview of how Snowflake, Google BigQuery, and Amazon Redshift pricing works, along with cost optimization considerations for each:
Snowflake: Snowflake uses a usage-based, pay-per-second billing model. Compute is metered in credits (with 1 credit costing ~$2–$4 depending on region and edition) . Users create virtual warehouses of various sizes (XS to 6XL) and are charged credits based on warehouse runtime (auto-suspend can limit charges). Storage is billed separately at roughly $23 per TB per month (compressed) . This flexible model lets companies scale up or down on demand – ideal for spiky or unpredictable workloads, since you only pay for what you use. However, it requires active management to avoid runaway costs: e.g. rightsizing warehouses, using auto-suspend aggressively, and monitoring costly queries. Many firms have learned this the hard way. For example, Instacart saw its Snowflake costs swell to $28M in a year before implementing controls; with fine-grained monitoring and query optimization, they cut costs by 56% . Similarly, HelloFresh optimized its warehouse sizing and saved 30% on Snowflake compute spend . Optimization tips: Employ resource monitors and budgeting per team, use smaller warehouses with auto-scale for concurrency, optimize SQL to avoid scanning excess data, and archive infrequently used data to cheaper storage. Snowflake’s separation of storage/compute is powerful, but without governance, SaaS companies can overspend – so buyers will look at the target’s $/query efficiency and whether cost-saving best practices (e.g. using result caching, turning off idle clusters) are in place.
Google BigQuery: BigQuery offers a serverless, fully-managed warehouse with two pricing options: on-demand (pay per query) and flat-rate. In on-demand mode, you're charged by data processed – approximately $5 per terabyte scanned (with the first 1 TB per month free). Storage costs are around $20 per TB per month for active storage. This usage-based approach can be very cost-effective for companies with intermittent or light query workloads, as you incur costs only when running queries. For heavy or steady workloads, BigQuery offers flat-rate pricing (e.g. ~$10,000/month for a dedicated block of 500 slots, which provide a fixed amount of processing capacity). Optimization tips: Under on-demand, optimize queries to scan less data (partitioning and clustering tables by date or key can dramatically reduce bytes scanned). Use BI Engine or materialized views for frequent queries to cut down runtime. Monitor query cost per user/project – BigQuery provides per-query cost stats that teams should review (a dry-run cost-estimation sketch appears after this list). In due diligence, an acquirer will check if the target is using BigQuery's features to contain costs (like whether large tables are partitioned, or if they're unintentionally scanning entire datasets for simple analyses). If the target has already moved to a flat-rate plan, buyers will evaluate utilization – e.g. is the company fully using its slot capacity or paying for headroom? BigQuery is most cost-efficient when queries are optimized; otherwise, careless querying (e.g. a JOIN without pruning) can scan terabytes and rack up charges. For illustration, a simple ETL that processes 5 million rows daily might cost only about $25/month on BigQuery, versus an estimated $75/month on Snowflake for an equivalent workload, under certain assumptions. Such differences depend on workload pattern, but they highlight that cost optimization is about aligning usage with the right pricing model.
Amazon Redshift: Redshift uses a more traditional cluster-based pricing model. You provision compute nodes (in clusters) and pay for those instances by the hour, whether or not they are fully utilized. There are two main modes: on-demand (hourly rates per node) and reserved instances (commit to 1-3 year terms for deep discounts) . For example, a company might run a Redshift cluster of 4 nodes at an on-demand rate – this provides predictable capacity but if the cluster sits idle at night, that time is still billed. Reserved pricing can save significantly (often 30-50%+ off) if the workload is steady and long-term. Redshift’s pricing is more predictable for constant workloads (you pay a fixed amount for a given cluster size), but it requires sizing the cluster appropriately. Cost optimization tips: Ensure the cluster size (and node type) matches the workload – avoid over-provisioning huge clusters “just in case.” Use features like concurrency scaling or Redshift Serverless (newer offering) if workload varies; these allow adding extra processing only when needed or paying per query-second, respectively. Also, leverage compression and sort keys to improve query efficiency so that smaller nodes can handle the data. Turn off or pause clusters in non-production environments when not in use (Redshift now supports pausing a cluster to stop billing). In an M&A diligence review, a buyer will examine the target’s Redshift utilization metrics: e.g. if the CPU/utilization is low but costs are high, that signals room to optimize by downsizing the cluster or switching to on-demand/serverless usage. Storage in Redshift is tied to the nodes (except with RA3 managed-storage instances), so another optimization is using RA3 nodes which offload colder data to cheaper storage tiers automatically. Overall, Redshift can be very cost-effective at scale (especially with reserved pricing) but only if the environment is well-tuned. An anecdotal rule is that Redshift may be cheaper for consistent 24/7 heavy workloads, whereas Snowflake/BigQuery can save money for spikier or smaller workloads . Buyers will thus consider whether a SaaS target’s data warehouse choice is appropriate and if they’ve optimized within that model (e.g., a company sticking to on-demand nodes without reservations could be spending needlessly, which post-acquisition cost synergy efforts could address).
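As referenced in the BigQuery item above, on-demand costs are driven by bytes scanned, which can be checked before a query ever runs. The sketch below uses the google-cloud-bigquery client's dry-run mode; the project and table names are hypothetical, the $5/TB rate mirrors the figure cited above, and configured Google Cloud credentials are assumed:

```python
from google.cloud import bigquery

ON_DEMAND_USD_PER_TB = 5.0  # illustrative rate; check current BigQuery pricing

def estimate_query_cost(sql: str) -> float:
    """Dry-run a query to get the bytes it *would* scan, without running it."""
    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=job_config)  # returns immediately; no slots used
    tb_scanned = job.total_bytes_processed / 1e12
    return tb_scanned * ON_DEMAND_USD_PER_TB

# Hypothetical table; partition filters like this are what keep scans (and bills) small.
sql = """
    SELECT user_id, plan, mrr
    FROM `my_project.analytics.subscriptions`
    WHERE snapshot_date = '2025-01-01'
"""
print(f"Estimated cost: ${estimate_query_cost(sql):.4f}")
```

Running a check like this in code review or CI is a cheap way to catch the "JOIN without pruning" pattern described above before it hits the bill.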
Pricing illustration: According to one 2025 comparison, a simple data pipeline use-case (daily sync of ~1 million rows) might cost about $5/month on BigQuery’s on-demand plan, versus $15/month on Snowflake (using an extra-small warehouse for a minute a day) . At larger scale, say an hourly integration of 5 million rows, BigQuery might be ~$25 and Snowflake ~$75 per month . These are hypothetical scenarios, but they demonstrate how usage patterns influence cost on different platforms. During due diligence, such analyses are common – the acquirer will model the target’s workloads under various pricing schemes to identify potential savings. For example, if a SaaS target is on Snowflake and running many long-running transformations, an acquirer might explore if those could run cheaper on a different platform or be refactored for efficiency (to reduce compute hours). Conversely, if the target is on Redshift with heavy constant loads, the acquirer will verify they’ve maximized reserved instance discounts or will plan to negotiate those post-close.
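For readers who want to reproduce comparisons like the one above, here is a simplified monthly cost model for the three pricing schemes; every rate (credit price, per-TB scan charge, node-hour price) is an illustrative assumption rather than a quoted price, and real bills also depend on caching, concurrency, contract discounts, and data growth:

```python
# Simplified monthly cost model for the three pricing schemes discussed above.
# All rates below are illustrative assumptions, not quoted prices.

def snowflake_monthly(runtime_hours_per_day: float,
                      credits_per_hour: float = 1.0,   # an XS warehouse burns ~1 credit/hour
                      usd_per_credit: float = 3.0,
                      days: int = 30) -> float:
    """Usage-based: pay per second a virtual warehouse is running."""
    return runtime_hours_per_day * credits_per_hour * usd_per_credit * days

def bigquery_monthly(tb_scanned_per_day: float,
                     usd_per_tb: float = 5.0,
                     days: int = 30) -> float:
    """On-demand: pay per terabyte scanned by queries."""
    return tb_scanned_per_day * usd_per_tb * days

def redshift_monthly(nodes: int,
                     usd_per_node_hour: float = 0.25,  # e.g. a small dc2-class node
                     hours: int = 730) -> float:
    """Cluster-based: pay per node-hour whether or not the cluster is busy."""
    return nodes * usd_per_node_hour * hours

# Roughly mirroring the hourly-integration illustration above.
print(f"Snowflake: ${snowflake_monthly(runtime_hours_per_day=0.83):.0f}/mo")  # ~$75 (a few XS minutes per hourly run)
print(f"BigQuery:  ${bigquery_monthly(tb_scanned_per_day=0.167):.0f}/mo")     # ~$25 (~170 GB scanned per day)
print(f"Redshift:  ${redshift_monthly(nodes=2):.0f}/mo")                      # ~$365 (always-on 2-node cluster)
```

The point of a model like this in diligence is not precision but sensitivity: it shows quickly which pricing scheme the target's actual workload pattern favors.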
ETL/ELT Pipeline Failure Rates and Root Causes
Data pipeline reliability is a crucial aspect of data infrastructure due diligence. Frequent ETL/ELT failures can disrupt reporting and operations, so buyers want to know how often pipelines break and why. Industry telemetry shows that pipeline and integration failures are one of the most common data incident types, accounting for a significant share of data quality issues. In one analysis of millions of pipeline runs, 26.2% of data incidents were caused by pipeline execution faults (e.g. jobs that didn't run on schedule, failed tasks, broken dependencies or permissions). In other words, over a quarter of data quality problems trace back to ETL processes not functioning as intended. The root causes are diverse – below are the top categories identified and their prevalence:
Pipeline Execution Faults – 26.2%: These include scheduled jobs that never executed or were aborted, task errors in ETL workflows, and permission/credential issues that prevent data from flowing. For example, a nightly batch load might fail due to a code bug or a lost database connection, causing missing or partial data for that day. That this category tops the list at ~26% underscores that even in 2025, "broken pipelines" remain the #1 cause of data downtime. Best practices like robust scheduling, error alerting, and retry mechanisms are essential to minimize these failures (a minimal retry-and-alert sketch follows this list).
Real-World Data Anomalies – ~20%: A sizable portion of “incidents” are not technical failures at all but legitimate shifts in the data that trigger anomaly alarms . For instance, a sudden spike in user signups or a one-time event (e.g. holiday surge) might be flagged as an anomaly. Approximately one-fifth of incidents fell in this bucket, which reminds us that not all anomalies are errors – differentiating true issues from expected business changes is key. (Notably, another ~14.2% were intentional changes like backfills or schema updates that were known, further highlighting that ~34% of “incidents” weren’t really errors . Data teams benefit from tools to quickly identify these benign events to avoid wasted effort.)
Data Ingestion Disruptions – 16.6%: These are failures in getting data from source systems into the warehouse/data lake . Common culprits include API outages, file delivery delays, or broken connectors. For example, a SaaS company might ingest CRM data via an API – if the API credentials expired or the source system was down, the pipeline might load nothing. Connector outages or network issues can stop data flow, causing incomplete datasets. This category, along with pipeline faults, reflects the fragility of integrations: companies rely on many external data sources and a single break in the chain can propagate errors downstream.
Platform Instability – 15.2%: This refers to issues with the infrastructure platforms themselves . Examples are a cloud data warehouse going down or running out of compute (contention), or a storage service throttling requests. These incidents highlight how dependent data engineering is on third-party services being up and stable. For instance, if a company’s Redshift cluster runs out of disk or an AWS outage occurs, pipelines will fail despite the ETL code being fine. Buyers often check if the target has redundancy or monitoring in place for such platform issues, as cloud downtime can derail critical data processes.
Schema Drift – 7.8%: Changes to data schemas (tables, fields, data types) account for about 8% of incidents . In agile SaaS environments, upstream developers might add a column or change a field format in a source system, which then breaks the ETL mapping or data model expecting the old schema. For example, if an event tracking schema adds new event types, an ETL job not updated for it might fail or drop those records. Schema drift is a constant tension between agility and stability – without governance, rapid changes can “pull the rug out” from under downstream processes. During diligence, acquirers examine how well the target manages schema changes (e.g. do they have automated tests or observability to catch breaking changes early?).
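As noted in the pipeline execution faults item at the top of this list, retries plus escalation are the basic defense against transient job failures. Below is a minimal sketch using only the Python standard library; the alerting function is a placeholder for whatever paging or chat integration a team actually uses:

```python
import logging
import time
from typing import Callable

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def alert_on_call(message: str) -> None:
    """Placeholder: wire this to PagerDuty, Slack, email, etc."""
    log.error("ALERT: %s", message)

def run_with_retries(task: Callable[[], None],
                     task_name: str,
                     max_attempts: int = 3,
                     backoff_seconds: float = 30.0) -> bool:
    """Run an ETL task, retrying transient failures and alerting on exhaustion."""
    for attempt in range(1, max_attempts + 1):
        try:
            task()
            log.info("%s succeeded on attempt %d", task_name, attempt)
            return True
        except Exception as exc:  # in practice, catch narrower, retryable errors
            log.warning("%s failed on attempt %d: %s", task_name, attempt, exc)
            if attempt < max_attempts:
                time.sleep(backoff_seconds * attempt)  # linear backoff between retries
    alert_on_call(f"{task_name} failed after {max_attempts} attempts")
    return False

# Example: a nightly load that raises on a lost connection gets retried,
# then escalated instead of silently leaving a partial dataset behind.
if __name__ == "__main__":
    run_with_retries(lambda: None, "nightly_crm_load")
```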
In aggregate, these figures show that roughly one in four data incidents traces back to pipeline execution itself – and that, without meticulous management, failures are routine rather than exceptional. Pipeline failures lead to data downtime – periods when dashboards or models are fed incomplete/incorrect data. The business impact is significant: teams waste time firefighting and, in the worst case, decisions might be made on faulty data. One survey noted an average organization experiences about 1 data issue per 10 tables per year, and each takes many hours to fix, as mentioned. Thus, due diligence will focus on the target's pipeline reliability: How often do their ETLs fail? What monitoring and alerting exists? Are there single points of failure? For instance, if 30% of the target's daily jobs fail on first attempt and require manual reruns, that's a sign of immature data operations that the buyer may need to invest in post-acquisition. Conversely, a company with automated data observability (catching issues like late data or schema changes in real time) will be viewed favorably, as it mitigates one of the most common risks in data infrastructure. The goal for any merged entity will be to reduce these failure rates via better engineering practices – improving reliability directly reduces labor costs on rework and increases confidence in data used for business strategy.
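Schema drift and other silent breakages are exactly what lightweight observability checks are meant to catch before a load runs. Here is a minimal sketch of a schema "contract" check; the column names and types are hypothetical, and real observability tools go much further (freshness, volume, and distribution tests):

```python
# Minimal schema-drift check: compare the columns a pipeline expects against
# what the source currently exposes, and fail fast (or alert) before loading.
EXPECTED_SCHEMA = {          # hypothetical contract for a CRM export
    "account_id": "string",
    "plan": "string",
    "mrr": "numeric",
    "signup_date": "date",
}

def detect_schema_drift(current_schema: dict[str, str]) -> dict[str, list[str]]:
    """Return added, removed, and retyped columns relative to the contract."""
    added = [c for c in current_schema if c not in EXPECTED_SCHEMA]
    removed = [c for c in EXPECTED_SCHEMA if c not in current_schema]
    retyped = [c for c in EXPECTED_SCHEMA
               if c in current_schema and current_schema[c] != EXPECTED_SCHEMA[c]]
    return {"added": added, "removed": removed, "retyped": retyped}

# Example: the source team renamed `mrr` and added an event-type column.
drift = detect_schema_drift({
    "account_id": "string",
    "plan": "string",
    "monthly_recurring_revenue": "numeric",
    "signup_date": "date",
    "event_type": "string",
})
if any(drift.values()):
    print(f"Schema drift detected, halting load: {drift}")
```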
Business Intelligence Tool Sprawl and Associated Costs
Many organizations accumulate an over-abundance of BI and analytics tools over time – for example, a company might simultaneously use Tableau, Power BI, Looker, and custom SQL notebooks across different teams. This “BI tool sprawl” has significant hidden costs. In M&A, buyers are keen to identify if the target has a rationalized BI stack or if multiple redundant tools are generating unnecessary expense (and inconsistent analytics). Key issues and costs associated with tool sprawl include:
Redundant Licensing & Underutilized Tools: Maintaining multiple BI platforms means paying licensing fees for overlapping capabilities, many of which go unused. Research indicates about 30% of enterprise software licenses go entirely unused . BI is no exception – one company analysis found 15% of its 2,000 BI user licenses were completely inactive, translating to ~$210,000/year in wasted fees . When each department buys its own favorite dashboard tool, the organization ends up paying for far more seats than actually needed. An acquiring company will look at the target’s software subscriptions and likely see an opportunity to cut costs by eliminating duplicative BI tools or excess licenses. Tool consolidation can yield immediate savings; for instance, simply retiring unused BI licenses (or downgrading expensive creator licenses to viewer licenses for light users) can save six figures annually .
Low Adoption and Wasted Investment: A sprawling BI environment often indicates low overall adoption of any single source of truth. Employees get confused by multiple tools and many end up not using BI at all. Surveys show only about 25–29% of employees actively use their company’s BI/analytics tools . This means 70–75% of staff are not leveraging the dashboards or reports – essentially a huge chunk of the BI investment is yielding no return. The opportunity cost is high: if a company spends $3 million on BI software, data warehouse, and analysts, but only 25% of users engage with it, then ~$2.25 million of that spend is under-realized value (75% not utilized) . In due diligence, this low utilization would be a red flag – it suggests either the tools are too fragmented/complex or data trust is low. Post-merger, the combined entity would likely streamline the analytics stack to improve adoption. It’s well documented that simpler, unified BI environments drive higher usage. Metric Insights notes that a single “BI portal” or catalog layer can help by guiding users to one certified source, thereby improving trust and adoption (and allowing the reduction of excess tools) .
Higher Support, Training, and Maintenance Costs: Each additional BI tool introduces overhead in training users, maintaining systems, and supporting infrastructure. Different teams might each have BI developers customizing their preferred tool, leading to duplicated effort. There is also a human cost: analysts waste time reconciling conflicting numbers between systems and rebuilding logic in multiple places. One source described teams "bouncing between systems searching for 'correct' numbers" when metrics don't match, and having to fix or rebuild dashboards repeatedly across tools. Furthermore, onboarding new employees is slower when they must learn several BI interfaces. For example, if 100 new hires each need an extra 8 hours of training due to a fragmented BI stack, at a blended $100/hour, that's ~$80,000/year in added onboarding cost. Simplifying the toolset directly cuts these inefficiencies. Buyers might quantify savings from standardizing on one tool: fewer servers or subscriptions to pay, fewer training sessions, and one set of expertise to cultivate.
Conflicting Metrics and Decision Risks: Perhaps the biggest cost is qualitative – tool sprawl leads to multiple versions of the truth. Different dashboards may yield different figures for what should be the same KPI, undermining confidence. Business users end up arguing over whose report is correct instead of making decisions. This "strategic waste" is hard to measure but very real: if marketing's dashboard says 1,000 new signups and finance's says 950 (due to slight differences in data timing or definition), meetings get consumed by reconciliations. Decisions get delayed or based on whichever number people want to believe. The trust deficit caused by BI fragmentation can ultimately cost money in misallocated resources or missed opportunities. The Metric Insights report cites that bad or misused data (often a consequence of siloed BI systems) contributes to trillions in losses, and gives an example formula where even a 20% chance of using a bad dashboard for a $5M decision creates $1M of expected risk. It's no surprise then that C-level executives sometimes approve budgets for overlapping BI platforms simply because departments can't agree on a single source – a vicious cycle that further increases cost. Teams stuck in these inter-departmental data disputes fall into what is often dubbed "analysis paralysis." In M&A, acquirers will scrutinize whether the target has a well-governed BI environment (single source of truth) or whether the acquisition will bring a hoard of redundant reports that need rationalization. Often, as part of post-merger integration, companies invest in a unified semantic layer or data catalog to enforce consistency across reports. This not only reduces license costs but also improves decision-making speed by eliminating confusion.
In summary, BI tool sprawl is essentially a tax on efficiency and budget. It leads to paying multiple vendors for similar functionality, low overall usage of analytics, extra work for staff, and the risk of erroneous decisions from inconsistent data. During due diligence, a savvy buyer will catalog the target’s analytics tools and likely find opportunities to streamline. For example, if a SaaS company to be acquired is juggling 5 different BI tools, an acquirer might project that standardizing on one or two could save hundreds of thousands per year in licenses and labor. Indeed, one estimate found that summing up a few factors (unused licenses, decision errors, low adoption, training costs) showed the total cost of BI sprawl can be staggering – but also that eliminating this sprawl can unlock a strong ROI . Thus, rationalizing the BI environment is often an early post-merger synergy target, yielding both cost savings and a more unified data-driven culture.
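One way to make that "summing up a few factors" exercise concrete is a simple calculator like the sketch below, which plugs in the illustrative figures used earlier in this section; in a real diligence model these inputs would come from the target's actual license counts, spend, and adoption data:

```python
def bi_sprawl_annual_cost(unused_licenses: int, cost_per_license: float,
                          bi_spend: float, adoption_rate: float,
                          new_hires: int, extra_training_hours: float, hourly_rate: float,
                          decision_value: float, bad_dashboard_probability: float) -> dict:
    """Rough annual 'cost of sprawl' from the factors discussed in this section."""
    wasted_licenses = unused_licenses * cost_per_license          # idle seats
    under_realized_spend = bi_spend * (1 - adoption_rate)         # BI spend nobody uses
    onboarding_overhead = new_hires * extra_training_hours * hourly_rate
    decision_risk = decision_value * bad_dashboard_probability    # expected cost of a bad call
    total = wasted_licenses + under_realized_spend + onboarding_overhead + decision_risk
    return {
        "wasted_licenses": wasted_licenses,
        "under_realized_spend": under_realized_spend,
        "onboarding_overhead": onboarding_overhead,
        "decision_risk": decision_risk,
        "total": total,
    }

# Figures from the examples above: 300 idle licenses at ~$700 each, $3M BI spend
# at 25% adoption, 100 hires x 8 hrs x $100/hr, and a 20% chance on a $5M decision.
print(bi_sprawl_annual_cost(300, 700, 3_000_000, 0.25, 100, 8, 100, 5_000_000, 0.20))
```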
Data Governance Frameworks and Implementation Costs
A data governance framework is the blueprint for how an organization manages and protects its data assets. It defines the policies, processes, roles, and standards that ensure data is used properly, consistently, and securely across the enterprise. In practical terms, a governance framework covers things like who can access what data, how data is categorized and labeled, quality standards, retention policies, and compliance controls. The goal, as one guide puts it, is to treat data as a strategic asset – with clear ownership and accountability so that data remains reliable, secure, and valuable for the business . Common frameworks or models (e.g. DAMA-DMBOK, the Data Governance Institute framework, or PwC’s governance model) all rest on the triad of people, processes, and technology working together . This means a governance program typically establishes a governance committee or data steward roles (people), sets procedures for data quality checks, approval workflows, etc., and leverages tools like data catalogs, metadata management, and access control systems to enforce policies.
Implementing a robust data governance framework comes with costs in time and resources – but it’s increasingly seen as a necessary investment, especially in M&A where combining data from two companies can be chaotic without governance. The costs include software/tools (for cataloging, data lineage, policy management), personnel (often a data governance lead, plus part-time effort from data owners in various departments), and process overhead (time spent defining standards, doing data quality audits, etc.). However, these costs should be weighed against the potentially catastrophic costs of not governing data. An insightful comparison from OvalEdge: “An enterprise with ~25 data users could spend under $20,000 a year to implement data governance… Compare this with the $4.45 million average global cost of a data breach.” . In other words, a modest annual governance budget can dramatically reduce the risk of multimillion-dollar security incidents or compliance fines. Governance also prevents costly operational issues – for example, a formal framework would catch something like Unity’s faulty data ingestion (which caused the $110M loss) or Public Health England’s Excel error, before they become business crises .
Some tangible implementation costs to consider:
Technology: Data governance platforms or suites (e.g. Collibra, Alation, OvalEdge) often charge based on number of users or data volume. Pricing can range broadly – Atlan notes that governance tool costs vary, often scaling with number of data assets or connectors . For a mid-sized firm, software subscriptions might be tens of thousands per year. There are also costs to integrate these tools with existing systems. In due diligence, buyers will examine if the target has such a tool and if not, whether lack of governance tech has led to issues (manual processes can only scale so far).
Personnel and Training: A successful governance program typically needs dedicated roles like a Data Governance Manager, Data Stewards (often part-time roles in each domain), and involvement from IT and legal compliance teams. The cost here is in salaries or reallocating staff time. Additionally, employees across the organization may need training on new data policies (for instance, how to classify data or how to request access through proper channels). These “soft” costs are harder to pinpoint but are necessary. On the flip side, a well-run governance initiative can save time for analysts who currently spend hours resolving data definition conflicts or hunting for the right data – governance can provide a clear business glossary and single definitions, reducing that inefficiency .
Process Implementation: Establishing data quality processes (like regular data profiling, data correction workflows) and compliance checks (ensuring GDPR/CCPA mandates are followed) can involve initial project costs. For example, setting up a master customer index to eliminate duplicate customer records might be a significant one-time project, but afterwards it yields ongoing benefits in accuracy. Companies often undertake such projects in preparation for M&A to present a “clean house” to buyers.
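As a concrete illustration of what a master customer index project involves, here is a minimal matching sketch (normalize records, then key on email or name-plus-company); production implementations add fuzzy matching and survivorship rules, so this is only a simplified view:

```python
from collections import defaultdict

def normalize(record: dict) -> tuple:
    """Build a simple match key: lower-cased email, falling back to name + company."""
    email = (record.get("email") or "").strip().lower()
    if email:
        return ("email", email)
    name = (record.get("name") or "").strip().lower()
    company = (record.get("company") or "").strip().lower()
    return ("name_company", name, company)

def build_master_index(records: list[dict]) -> dict[tuple, list[dict]]:
    """Group CRM and billing records that appear to refer to the same customer."""
    index: dict[tuple, list[dict]] = defaultdict(list)
    for rec in records:
        index[normalize(rec)].append(rec)
    return index

# Hypothetical records from two systems describing the same customer.
records = [
    {"source": "crm",     "name": "Acme Corp",        "email": "Ops@Acme.com", "company": "Acme"},
    {"source": "billing", "name": "ACME Corporation", "email": "ops@acme.com", "company": "Acme"},
]
for key, group in build_master_index(records).items():
    if len(group) > 1:
        print(f"Duplicate cluster {key}: {[r['source'] for r in group]}")
```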
From a cost/benefit perspective, the "cost of inaction" on data governance can be dire. Without governance, companies risk data breaches, privacy violations, erroneous analytics, and lack of trust – any of which can derail an M&A or reduce its value. A ticking time bomb analogy is apt: you might save some money by not investing in governance now, but you accumulate risk that can explode later. High-profile examples underscore this: British Airways' GDPR penalty (a proposed £183M fine, ultimately reduced to £20M) was attributed in part to poor data governance (a breach that "could have been mitigated" with better controls). Capital One's 2019 breach cost it $100–150M and a stock drop, again tied to lapses in data security governance. Equifax's 2017 breach settlement of $700M and ongoing reputational damage also illustrate the point. On the other hand, governance can add positive value – it streamlines data integration in mergers, enhances analytics (because data is consistent and high-quality), and fosters innovation safely. Gartner has noted that "AI-ready data" depends on governance, predicting that through 2026 organizations will abandon 60% of AI projects that lack it.
In an M&A due diligence report, one might find a section on the target’s data governance maturity. If the target already has a well-implemented framework (e.g. a clear data owner for each domain, an access request process, documented data dictionary, etc.), it can reduce integration headaches. If not, the buyer often sets aside budget to establish governance in the combined entity. Importantly, governance is not a one-time cost but an ongoing discipline – yet its ROI is seen in risk mitigation and improved data-driven decision making. As one LinkedIn article quipped, we should “stop asking what data governance costs” and instead ask what it saves, because when done right it pays for itself many times over in prevented problems .
GDPR Compliance Gaps Related to Data Management
Regulatory compliance – particularly with data privacy laws like the EU’s GDPR (General Data Protection Regulation) – is a critical focus during due diligence. Buyers will assess if the target has any compliance gaps in how it manages personal data, because non-compliance can mean heavy fines or post-acquisition liabilities. Common GDPR-related data management gaps that companies struggle with include:
Incomplete Personal Data Inventory: GDPR requires knowing what personal data you have and where it’s stored (records of processing activities). A frequent gap is that companies lack a comprehensive data inventory. In fact, 63% of organizations struggle to maintain a complete inventory of personal data they process . If a company doesn’t have an accurate map of customer data in all its systems, it risks failing GDPR obligations like responding to Subject Access Requests or fulfilling data deletion (right to be forgotten) in all places. During diligence, a buyer might perform a data mapping audit. An example concern: if the target can’t quickly list all databases and third-party apps where EU customer data resides, that’s a red flag. It means potential undiscovered compliance issues. Addressing this gap often requires deploying privacy management tools (e.g. OneTrust, TrustArc) to automate data discovery – which is an extra cost the acquirer may have to incur .
Poor Consent Management: GDPR sets a high bar for obtaining and managing user consent for data processing. A common mistake is not gathering consent properly (e.g. using pre-ticked boxes, or not refreshing consents when scope changes) or not storing consent records. It’s noted that inadequate consent mechanisms are a leading cause of GDPR fines – about 57% of fines relate to consent violations . For example, a SaaS company might be collecting user data for marketing without a clear opt-in, or might be reusing data in ways users didn’t agree to. In due diligence, buyers will review privacy policies, signup flows, and how the target documents consent. If the target’s user base includes EU individuals and its consent practices are outdated, the buyer may require this to be fixed before close or adjust valuation for risk of penalties. (Notably, some high fines have come from consent issues – Google’s early GDPR fine of €50M was largely about insufficiently informed consent for personalized ads.) Proper consent management often requires both process and system – e.g. ensuring every marketing email list only has users who opted in, with timestamps of consent stored. Lack of that system is a gap that needs plugging.
Inadequate Access Controls and Data Security: GDPR mandates “appropriate technical and organizational measures” to protect personal data. On the technical side, common data protection gaps include insufficient access controls and lack of encryption of personal data . For instance, a company might leave customer data unencrypted in an S3 bucket, or have broad internal access where any employee can query production databases. These are big compliance no-nos – a breach of that data would likely be deemed negligence under GDPR (leading to higher fines). Best practices like role-based access control, need-to-know data permissions, and strong encryption for data at rest and in transit are expected. If due diligence finds, say, that the target’s customer database with EU user info isn’t encrypted or that an ex-employee still had active credentials (real examples that regulators frown upon), the buyer will push for immediate remediation. Many deals now include a cybersecurity and privacy audit for precisely this reason. In fact, one survey found 53% of companies had encountered a critical cybersecurity issue during M&A due diligence that jeopardized the deal . Buyers may also ask if the target has undergone any penetration testing or security audits. If not, it’s a gap indicating the company might not have been rigorously verifying its data security. From a process view, GDPR also requires controlling access via measures like pseudonymization – e.g. developers working with production data should use anonymized datasets, etc. Lack of such practices could be another gap.
Lack of Data Retention Policies: GDPR's storage limitation principle says personal data shouldn't be kept longer than necessary. Many companies, however, have no systematic deletion or retention schedules – they accumulate personal data indefinitely. This is a compliance gap: for example, retaining EU customer data from 10 years ago that's no longer needed could violate GDPR unless there's a legal reason. During diligence, an acquirer might ask: "Do you have data retention policies? Do you purge old personal data?" If the answer is no, it signals a potential liability. While harder to pin to a single statistic, enforcement cases show the risk is real – regulators have fined companies for retaining user records longer than necessary. Therefore, a buyer may plan to implement retention schedules post-merger to close this gap (a minimal sketch of timestamped consent records with a retention check follows this list).
Employee Awareness and Training Gaps: Even with policies on paper, GDPR compliance is heavily about operational behavior – employees handling data correctly. A lack of GDPR training is a common gap. In fact, 39% of businesses reported insufficient staff awareness as the biggest cause of data breaches . If employees don’t know, for example, that they shouldn’t email spreadsheets full of customer data or that they need to honor deletion requests promptly, mistakes will happen. During due diligence interviews, buyers sometimes ask about privacy training programs or check if any breaches in the target’s history were due to human error. A culture with poor data hygiene (like using personal Dropbox to store company data, or not locking down laptops) can indicate GDPR non-compliance risk. The cost to fix is investing in training and stricter policies, which the acquirer will factor in.
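To illustrate the consent and retention mechanics referenced in the items above, here is a minimal sketch of timestamped consent records with a purpose-specific retention check; the 730-day window, field names, and data structures are assumptions for illustration, not GDPR-mandated values:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

@dataclass
class ConsentRecord:
    """One user's opt-in for a specific purpose, with the timestamps audits expect."""
    user_id: str
    purpose: str                       # e.g. "marketing_email"
    granted_at: datetime
    withdrawn_at: Optional[datetime] = None

    def is_active(self) -> bool:
        return self.withdrawn_at is None

# Illustrative retention schedule (not a GDPR-mandated figure).
RETENTION = {"marketing_email": timedelta(days=730)}

def users_safe_to_email(consents: list) -> set:
    """Only users with an active, documented opt-in for marketing email."""
    return {c.user_id for c in consents
            if c.purpose == "marketing_email" and c.is_active()}

def records_past_retention(consents: list, now: datetime) -> list:
    """Flag records whose purpose-specific retention window has lapsed."""
    return [c for c in consents
            if c.purpose in RETENTION and now - c.granted_at > RETENTION[c.purpose]]

consents = [
    ConsentRecord("u1", "marketing_email", datetime(2023, 5, 1, tzinfo=timezone.utc)),
    ConsentRecord("u2", "marketing_email", datetime(2020, 1, 1, tzinfo=timezone.utc)),
]
as_of = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(users_safe_to_email(consents))            # both users have documented opt-ins
print(records_past_retention(consents, as_of))  # only u2 is past the 730-day window
```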
To illustrate the stakes: GDPR fines can be up to 4% of global annual revenue or €20 million (whichever is higher) for serious infringements. There have been notable fines for companies acquired or merging: the Marriott case is instructive. Marriott acquired Starwood and later discovered Starwood’s systems had been breached years prior; because that data wasn’t secured properly even post-acquisition, UK regulators fined Marriott £99 million (later reduced to £18M) under GDPR and specifically called out the need for due diligence on data protection during acquisitions . The Information Commissioner stated that companies must assess what personal data they are inheriting and how it’s protected, as part of M&A . This underscores that any gap in GDPR compliance at the target becomes the acquirer’s problem after closing.
In summary, GDPR-related data management gaps to watch for in M&A are: missing data inventories, sloppy consent and preference management, weak security controls (unencrypted or broadly accessible personal data), lack of data lifecycle management, and poor staff training on privacy. A thorough due diligence will include a privacy and cybersecurity audit to uncover these. If significant gaps are found, buyers may negotiate remediation measures (e.g. requiring the target to obtain missing consents or delete certain data sets before closing) or even adjust the deal value to account for potential fines. For instance, if a target has millions of EU users but no record of consent for email marketing to them, the acquiring company knows it might have to halt those campaigns (affecting revenue) or run a costly re-permissioning campaign. Thus, compliance gaps can translate directly into valuation adjustments. On the flip side, a target that demonstrably has strong data governance and GDPR compliance will give the buyer confidence, possibly even commanding a premium for lower risk. In today’s environment, data privacy diligence is as crucial as financial diligence, given how damaging a lapse can be.
Case Studies of Data Infrastructure Issues Discovered in Due Diligence
Real-world M&A transactions have illustrated how data infrastructure or governance issues can significantly impact deals – from price negotiations to post-merger headaches. Here are a few notable case studies:
Verizon’s Acquisition of Yahoo (2017): During deal negotiations, Yahoo disclosed – belatedly – that it had suffered two massive data breaches (in 2013 and 2014) compromising over 1.5 billion user accounts in total . These breaches (which involved stolen personal data, unencrypted security answers, and outdated hashing algorithms) had not been made public earlier. The revelations almost derailed the acquisition. In the end, Verizon proceeded but knocked $350 million off the purchase price (about a 7% discount) and agreed to split legal liabilities for the breaches . The final price was $4.48B instead of the initial $4.83B. This case vividly shows that undisclosed data security issues discovered in diligence directly translate to financial value loss. Verizon also had to invest in security remediation and faced PR fallout for inheriting Yahoo’s “privacy nightmare” . Importantly, the Yahoo case prompted regulators (SEC) to issue guidance that cybersecurity issues must be disclosed to investors, so acquirers now scrutinize this area even more .
Marriott’s Acquisition of Starwood (2016): Marriott purchased Starwood Hotels to form the world’s largest hotel chain (a $13.3B deal). While operational due diligence focused on integrating reservations and loyalty systems, a lurking data issue went unnoticed: Starwood’s guest reservation database had been breached in 2014, exposing approximately 383 million guest records (including sensitive info like passport numbers). Starwood had never detected the breach, so it was not flagged during the acquisition. It only came to light in 2018, two years post-close, when Marriott investigators found an unexpected data trail. The fallout was severe – Marriott faced regulatory probes, and in 2019 the UK ICO announced a £99 million (about $123M) GDPR fine – later reduced to £18.4M – for failing to protect EU customers’ data. The ICO explicitly stated that Marriott should have performed better due diligence on data security during the merger, and ensured proper safeguards for the acquired data. Beyond the fine, Marriott had to bear substantial costs: notifying customers, providing credit monitoring, and accelerating the purge of Starwood’s legacy IT systems. This case underscores that inadequate vetting of a target’s security posture or data management can lead to huge post-deal costs. It also highlights a lesson: integrating IT systems quickly post-merger is vital – Marriott had left Starwood’s systems running separately (including the vulnerable database) for too long. Now, buyers often ask for detailed cybersecurity assessments and may require representations & warranties specifically about data breaches.
Facebook (Meta) and Musical.ly (2016, abandoned): Not all data issues surface via breach – sometimes regulatory and privacy concerns halt a deal. In 2016, Facebook explored acquiring Musical.ly (the lip-sync video app that later became TikTok). After due diligence, Facebook walked away from the deal, chiefly due to data/privacy reasons . Musical.ly was a China-based app with a predominantly underage user base, raising red flags about COPPA (children’s privacy law) and data flows to China. Facebook feared that acquiring it would invite intense scrutiny over child data protection and potential censorship issues. Indeed, those concerns proved prescient: Musical.ly was acquired by ByteDance (renamed TikTok) and by 2019 U.S. regulators and CFIUS were investigating it for sending user data to China without consent and other privacy violations . TikTok later faced lawsuits and government restrictions over these data practices. This example shows an instance where a buyer proactively avoided an M&A because of data governance/regulatory risk. Essentially, Facebook could not get comfortable with the target’s data infrastructure – specifically, data residency and content moderation practices – and judged the risk (of fines or forced divestiture) too high. For M&A practitioners, it reinforces that issues like data localization, user consent, and compliance with laws (GDPR, COPPA, etc.) are now top-of-mind. A target heavily dependent on data about minors or operating in jurisdictions with stringent data laws will undergo extra due diligence. If the risks can’t be mitigated (say, by technical safeguards or restructuring the data flows), a deal might be shelved.
CFIUS Interventions (various, 2018–2020): In cross-border M&A, data concerns have even led governments to block deals. For example, in 2019 the U.S. government (CFIUS) forced China’s Kunlun Tech to sell Grindr, a dating app it had acquired, due to worries that personal data of U.S. citizens (especially LGBTQ individuals) could be misused or leveraged for blackmail. Similarly, Ant Financial’s proposed acquisition of MoneyGram collapsed in early 2018 after CFIUS refused to approve it over the security of Americans’ financial data. These are not due diligence failures per se (the risks were known), but they are case studies showing how data privacy/national security issues can entirely derail or reverse an acquisition. From a due diligence standpoint, if a target has sensitive personal data (health data, financial records, etc.) and the buyer is foreign, one must anticipate regulatory objections. In deal planning, parties sometimes proactively propose mitigation (e.g. keeping data in certain jurisdictions or bringing in third-party audits) to get approval. The takeaway is that data is now a strategic asset scrutinized by governments; thus, M&A due diligence must assess geopolitical and compliance aspects of data handling, not just the IT aspects.
In conclusion, these case studies illustrate that data infrastructure and governance issues have become pivotal in M&A outcomes. A major data breach or compliance gap can reduce a company’s value (Yahoo), result in huge fines post-acquisition (Starwood/Marriott), or even scuttle deals entirely (Facebook/TikTok). Modern due diligence practices have evolved accordingly: it’s standard to do rigorous IT/cyber due diligence, including hiring specialist firms to probe for vulnerabilities or data liabilities in the target. Buyers may negotiate specific indemnities for data breaches or set aside escrow funds for possible fines. They also craft detailed integration plans to merge IT systems and improve data governance on Day 1 of the combined company. The overarching lesson from the cited examples is clear – neglecting data governance is costly, whereas companies that enter M&A with well-documented, secure, and compliant data practices will find smoother, more valuable deals.


