|

CSV Cleanup Playbook: Deduplicate, Split, Normalize (Sheets & Excel)

There’s a special kind of dread that comes with opening a CSV export and seeing that someone entered phone numbers as “(555) 123-4567” in some rows, “+15551234567” in others, and “555.123.4567” in the rest. Or that “John Smith” and “john smith” and ” John Smith ” are apparently three different customers in your database.

Messy CSVs are one of those universal problems that every team deals with but nobody talks about. I’ve spent more time than I’d like to admit cleaning up exports from CRMs, ad platforms, and legacy systems. This playbook is the collection of formulas and techniques I reach for every time — organized by task so you can jump to whatever mess you’re dealing with today.

Everything here works in Google Sheets. I’ve included Excel equivalents where the syntax differs.

Before You Start: Two Rules That Save Time

Rule 1: Always work on a copy. Before you clean anything, duplicate the tab or save a copy of the file. I’ve seen people accidentally overwrite raw data with cleanup formulas and lose the original values. It takes five seconds to duplicate a tab and it can save you hours.

Rule 2: Clean in a separate column, not in place. Instead of overwriting column A with a cleaned version, add a new column next to it with the cleanup formula. Once you’ve verified the results look right, you can paste-as-values and delete the original. This gives you a side-by-side comparison and an easy undo path.

Deduplicating Rows

Simple dedup by a single column

If you just need unique rows based on one column (like email), the UNIQUE function handles it:

=UNIQUE(A2:D)

This returns all columns but drops rows where column A (the first column in the range) repeats. One catch: UNIQUE keeps the first occurrence of each duplicate. If you want to keep the most recent entry instead, sort your data by date (newest first) before applying UNIQUE.

In Excel, the equivalent is Data → Remove Duplicates, which gives you a dialog to pick which columns to check. The downside is that it modifies your data in place — another reason to work on a copy.

Dedup by multiple columns (composite key)

Sometimes a single column isn’t enough. You might have two “John Smith” entries that are genuinely different people at different companies. In that case, create a helper column that concatenates the fields you want to match on:

=A2 & "|" & B2

This creates values like john@example.com|Acme Corp. Run UNIQUE or COUNTIF against this composite key to find true duplicates. The pipe character (|) works as a separator because it almost never appears in real data — unlike commas or spaces, which would create false matches.

After deduping, you can delete the helper column.

Trimming Whitespace and Invisible Characters

Stray spaces are the most common cause of failed lookups. Your VLOOKUP returns #N/A even though the values look identical? There’s probably a trailing space or a non-breaking space character hiding in one of the cells.

The standard cleanup formula handles both:

=TRIM(CLEAN(A2))

CLEAN removes non-printable characters (ASCII 0–31 — things like line breaks, tabs, and null characters that sometimes sneak in from database exports). TRIM removes leading and trailing spaces and collapses multiple internal spaces into one.

There’s one character this combo misses: the non-breaking space (character 160), which shows up in data copied from web pages. To catch that too:

=TRIM(CLEAN(SUBSTITUTE(A2, CHAR(160), " ")))

This converts non-breaking spaces to regular spaces first, then lets TRIM handle the rest.

Splitting Names into First and Last

The simple approach works when names are consistently “First Last”:

=SPLIT(A2, " ")

This splits on spaces and puts each part in a separate column. The problem is that it falls apart with names like “Mary Jane Watson” or “Jean-Luc Picard” — you’ll get three columns for the first and two columns for the second.

A more reliable approach uses REGEXEXTRACT to grab the first word as the first name and everything after the first space as the last name:

=REGEXEXTRACT(A2, "^(\S+)")       // First name
=REGEXEXTRACT(A2, "\s(.+)$")      // Last name (everything after first space)

For bulk processing across an entire column:

=ARRAYFORMULA({REGEXEXTRACT(A2:A, "^(\S+)"), REGEXEXTRACT(A2:A, "\s(.+)$")})

This isn’t perfect for every edge case (prefixes like “Dr.” or suffixes like “Jr.” will trip it up), but it handles the 90% case well. For the exceptions, manual review is faster than trying to write a formula that handles every possible name format.

Excel equivalent: Use LEFT + FIND for first name and MID + FIND for last name, or TEXTSPLIT in Excel 365.

Normalizing Phone Numbers

Phone number formatting is a mess because every system exports them differently. The fastest way to normalize is to strip everything that isn’t a digit:

=REGEXREPLACE(A2, "[^0-9]", "")

This turns (555) 123-4567, 555.123.4567, and +1-555-123-4567 all into 15551234567 or 5551234567. From there, you have a consistent format you can compare across datasets.

If you need to add formatting back for display:

="(" & MID(A2,1,3) & ") " & MID(A2,4,3) & "-" & MID(A2,7,4)

For an entire column at once:

=ARRAYFORMULA(IF(A2:A="", "", REGEXREPLACE(A2:A, "[^0-9]", "")))

Excel equivalent: There’s no REGEXREPLACE in standard Excel. You can use a series of SUBSTITUTE functions to remove specific characters: =SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(A2,"(",""),")",""),"-",""). It’s ugly but it works for common formats.

Standardizing Case

Inconsistent capitalization causes failed matches and messy reports. Three formulas cover most situations:

=LOWER(A2) — Converts everything to lowercase. Best for email addresses and identifiers.

=UPPER(A2) — All caps. Useful for country codes, state abbreviations, etc.

=PROPER(A2) — Capitalizes the first letter of each word. Good for names, but watch out for cases like “McDonald” becoming “Mcdonald” or “iPhone” becoming “Iphone”. PROPER is a blunt tool — use it for a first pass, then spot-check the results.

Finding Records That Don’t Match (Anti-Join)

One of the most common cleanup tasks is comparing two lists and finding what’s in one but not the other. For example: “Which customers from our CRM export aren’t in our billing system?”

If list A is in column A and list B is in column C:

=FILTER(A2:A, ISNA(MATCH(A2:A, C2:C, 0)))

This returns every value in column A that doesn’t appear in column C. It’s an anti-join — the spreadsheet equivalent of a SQL LEFT JOIN ... WHERE B.key IS NULL.

To flag matches instead of filtering:

=ARRAYFORMULA(IF(A2:A="", "", IF(ISNUMBER(MATCH(A2:A, C2:C, 0)), "match", "new")))

This adds a “match” or “new” label next to each row, which is helpful when you want to review before deleting.

Preserving Leading Zeros in Exports

ZIP codes, product SKUs, and phone numbers often start with zeros. When you open a CSV in Sheets or Excel, those leading zeros get silently stripped — 07102 becomes 7102, and your data is now wrong.

Two ways to prevent this:

Before opening the CSV: In Google Sheets, use File → Import and format the relevant columns as “Text” during import. In Excel, use the Text Import Wizard and set those columns to “Text.”

After the fact: If the zeros are already gone and you know the expected length, pad them back:

=TEXT(A2, "00000")     // Forces 5 digits, adds leading zeros

One-Click Cleanup with Apps Script

If you do this regularly, a simple Apps Script can automate the boring parts. This script runs through every cell in the active sheet, strips non-printable characters, collapses whitespace, and trims:

function cleanupSheet() {
  const sh = SpreadsheetApp.getActiveSheet();
  const range = sh.getDataRange();
  let values = range.getValues();
  
  for (let r = 1; r < values.length; r++) {    // skip header row
    for (let c = 0; c < values[0].length; c++) {
      let v = values[r][c];
      if (typeof v === 'string') {
        v = v.replace(/[\u0000-\u001F\u007F]/g, '');  // remove control chars
        v = v.replace(/\u00A0/g, ' ');                  // non-breaking spaces
        v = v.replace(/\s+/g, ' ').trim();              // collapse and trim
        values[r][c] = v;
      }
    }
  }
  
  range.setValues(values);
  SpreadsheetApp.getActiveSpreadsheet().toast('Cleanup complete', 'Done', 3);
}

To use it: Extensions → Apps Script, paste the code, save, then run cleanupSheet from the script editor. The first time, it’ll ask for permission to access your spreadsheet.

This won’t handle phone normalization or deduplication — those require decisions about your data that a generic script can’t make. But for the mechanical cleanup (whitespace, invisible characters, basic sanitization), it saves you from manually applying TRIM(CLEAN()) to every column.

FAQ

What’s the best order to clean data in? I usually go: trim whitespace → standardize case → normalize formats (phones, dates) → deduplicate → compare/merge. Cleaning whitespace and case first means your dedup step catches more true duplicates.

Should I clean in Sheets or use a dedicated tool? For files under 50,000 rows, Sheets handles it fine. Above that, performance degrades — formulas take longer to recalculate and the UI gets sluggish. For large datasets, consider OpenRefine (free, excellent for bulk cleanup), Python with pandas, or a database import.

My CSV has encoding issues (weird characters like é instead of é). How do I fix that? This is a UTF-8 vs. Latin-1 encoding mismatch. Before opening in Sheets, try opening the file in a text editor (VS Code, Notepad++), re-saving it as UTF-8, then importing. In Sheets, you can also try File → Import → Upload and see if it auto-detects the encoding correctly.

Similar Posts