Common CSV Import Errors and How to Prevent Them
Earlier this month I sat down with every CSV import problem I could find: exports from German Excel, spreadsheets with a title across the top, rows with one cell too many, headers that appear twice, a required cell that held a single space. The plan was an audit. Build a file for each problem, run it through Updog, and confirm each one comes out handled.
The pattern that stayed with me is that the worst CSV import errors never show an error. The file imports, the grid looks clean, and one value is wrong in a way nobody notices for weeks. A crash gets reported the same day. A silent wrong value gets trusted.
CSV import errors are rarely parsing errors. In a SaaS product, the person uploading a file needs to match it to your schema, fix the invalid values, review the result, and submit it without getting stuck, and every step has its own ways to fail. Here are the issues from that audit, in the order a real file runs into them, and what an importer can do about each one.
Encoding issues and special characters
The first thing an importer touches is bytes, and the bytes do not announce their encoding. A file saved from an older Windows tool is often Windows-1252. Exports from Japanese or Chinese systems arrive as Shift-JIS or GB18030. Read any of them as UTF-8 and the names break: José turns into Jos� or José, depending on which side guessed wrong, and the person cannot understand why a file that looks fine in Excel looks broken in your app.
There is a quieter version of the same problem. Many files begin with a byte order mark, a few invisible bytes that declare the encoding. Leave it in place and it glues itself to the first header, so the column that looks like id on screen is an invisible character followed by id in memory, and it fails to match your id field every time.
The fix is boring and absolute: detect the encoding from the bytes, strip the byte order mark, and convert everything to UTF-8 before a single cell is parsed. Updog does this on every file, because no person should be asked what encoding their export uses.
The delimiter is not always a comma
Comma-separated values are separated by semicolons surprisingly often. In much of Europe the comma is the decimal mark, so Excel there writes CSV files with semicolons between values. Other tools export tabs or pipes. An importer hard-coded to commas reads a semicolon file as one giant column, and everything downstream fails at once.
Detection is the easy half: count candidate delimiters across the first lines, ignore anything inside quotes, and pick the one that appears consistently. The subtle half is a hint line that some tools write at the top of a file so that Excel will open it with the right delimiter:
sep=;
id;name;age
1;Alice;30 The directive is not part of the CSV standard. Excel honors it when opening a file, but parsing libraries generally do not: PapaParse, for example, has an open issue about it, and the line arrives in your rows as if it were data. On a file without a header row the damage is sharp: the literal text sep= becomes a column header and every value shifts one cell over. Updog drops a first line shaped exactly like sep= plus one character before anything else sees the rows.
Missing or wrong column headers
We picture a CSV as a grid with the column names on the first line. Real files open with a report title, a date, a blank row or two, and the real header three lines down. An importer that trusts row one turns the title into the header, and nothing maps after that.
Finding the header takes a heuristic: score the first rows on signals like all-text cells, uniqueness, and length, and pick the best. The trap inside that heuristic is a title in a single cell. Sales Report is short, unique, and all text, so it scores like a header on every content signal, and an importer that trusts the score takes the title and shifts everything under it. The signal that matters is structural: a real header spans the width of the table, and a title does not. Updog checks that shape before accepting any row, so a one-cell title never wins.
Two more header problems are worth designing for. The same column arrives under different names in every file, First Name in one, fname in the next, Contact First in a third, which is why column mapping with fuzzy matching is a standard step in any serious CSV import flow. And sometimes a header appears twice. A file with two id columns is legal, but an importer that identifies columns by header name lets the second id overwrite the first, and a whole column vanishes. Updog renames repeats to id and id (2), so both columns survive and the person decides what each one means during mapping. A silent loss becomes a visible question.
Rows that do not line up with the headers
This is the issue I would put at the top of any list, because its worst case is invisible. A ragged row is a data row whose cell count does not match the header. Too many cells usually means an unescaped delimiter inside a value, a last name entered as Van Der, Berg or an address with a comma in it. Too few usually means a value was omitted without an empty placeholder.
The naive importer pads short rows and truncates long ones, silently. Hidden inside that behavior is the shifted row: one missing cell early in the line slides every later value one column over. The email lands in the last name field, the age lands in the email field. Nothing crashes, the row imports, and the data sits under the wrong headers until someone notices.
Here is the uncomfortable part: from the row alone, a shift cannot be repaired. Given four values and five headers, no amount of counting tells you which column is the missing one. Every guess produces a different, plausible row. Import tools deal with this honestly by flagging the mismatched row and asking the person to fix it, and that is the right floor. But one signal sits unused: your schema. Updog tries the possible alignments of a ragged row and keeps only the ones that pass your validators. When exactly one survives, the structure has been recovered and the row imports correctly. When several survive, the row imports with the uncertain cells flagged, so the person answers a small, specific question instead of retyping a line. The strictness of your schema decides how often that works: an email validator pins the email column in place, while a schema of free-text fields gives the aligner little to prove with. Even then, turning an invisible shift into a visible flag is most of the value.
Date and number formats follow the locale
03/04/2026 is March 4 in New York and April 3 in London, and the file does not say which. Guess per value and you corrupt every date that happens to read both ways. The reliable approach decides per file: one value with a day above twelve settles the order for the whole file, and every date is stored in one canonical shape, ISO 2026-04-03, so nothing downstream has to guess again.
Numbers carry the same ambiguity. 1.234,56 and 1,234.56 are the same amount written in two locales, so the decimal format also has to be decided once per file, and decided by looking only at the columns mapped to number fields, because dots in text and ID columns will pollute the vote. Real files also dress numbers up: currency symbols, percent signs, accounting parentheses for negatives. $1,234.56 in a price column should land as 1234.56. An ambiguous decoration like the letter code in 100 kr should stay raw and get flagged, because removing letters is guessing.
The rule that survived our audit: normalization belongs to the column type. Stripping € from a value headed into a number field is cleanup. Rewriting 1,234 inside a free-text column is corruption. The same string is content in one column and costume in another, and only the declared type can tell the importer which.
Excel changed the data before the file existed
Some of the most reported CSV import issues happen before the upload. Excel converts what it recognizes at the moment of entry: a product code like 3/4 becomes a date, a nineteen-digit card number has every digit past the fifteenth turned into a zero, and the zip code 00123 is usually not text at all but the number 123 wearing a five-digit display format. Inside the workbook the stored value is 123. The zeros were never saved.
An importer has to be honest about this: it cannot recover information the spreadsheet never stored, and inventing the digits back would only manufacture confidence. What it can do is refuse to add damage of its own. Text cells must round-trip byte for byte, so an ID typed as text keeps its zeros. Date cells in an Excel file arrive as date objects, and converting them carelessly produces strings like Thu Jan 15 2026 00:00:00 GMT+0000, where the calendar day can even change with the timezone of the machine; the safe rendering is the plain ISO date, computed in UTC, so January 15 stays January 15 everywhere. And the one piece of guidance worth passing to your users costs nothing: identifiers belong in text cells.
Values that pass validation and should not
Required fields missing, invalid email addresses, values outside an allowed set, duplicate rows: this is the visible layer of CSV import validation, and any import tool should check all of it before the data reaches your system. Duplicates deserve both halves of the check: a uniqueness rule inside the file, and a merge key on import, so a row that matches an existing record updates it instead of landing twice. The audit taught me to also distrust the green results.
The sneakiest file in the test set had a required cell holding a single space. A required check that compares against the empty string passes it, and the trap runs deeper than it looks, because a typical cleanup pass rewrites non-breaking spaces into regular spaces and manufactures exactly the value that fools the check. Updog trims whitespace at the point where data enters, so a visually empty cell is stored as a true empty string and the required rule fails honestly. A green check over an empty cell is the same failure as the shifted row: validation telling you everything is fine when it is not.
Large CSV files are an architecture problem
A large CSV file upload fails differently: the tab freezes, memory runs out, or an arbitrary row limit sends the person away to split their file into pieces. None of those are parsing problems either. They come from reading the whole file into memory at once and rendering every row into the page.
The architecture that holds up is streaming the parse in chunks, moving heavy work like filtering into a worker, and drawing the grid on canvas so the page only pays for the rows on screen. We wrote about that approach in how Updog handles a million rows in the browser. The honest part of that number is that it is a ceiling, not a promise of infinity. The point is that a hundred-thousand-row customer file should feel ordinary.
The error experience decides whether the import finishes
Every issue above ends the same way: some cells need a person's attention. What happens next separates an import that finishes from one that gets abandoned. A message like "import failed on line 8,442" sends the person back to Excel to fix one cell and upload again, and the next attempt fails on line 9,107.
The shape that works is a spreadsheet-like editor inside the CSV import flow itself: every failing cell highlighted, filters that narrow a big file down to the rows that need work, and edits applied in place with validation rerunning as the person types. Updog flags problems instead of blocking on them, because someone fixing row twelve should not be locked out of row thirteen. We wrote a longer piece on why CSV import is more than parsing a file, and the editor is the heart of that argument.
A checklist before users submit data
If you are building or buying a CSV importer, this is the list I would test it against, one broken file per line:
- Detect the encoding from the bytes and strip the byte order mark before parsing.
- Detect the delimiter, and drop a first line that reads
sep=plus one character. - Find the header row instead of assuming row one, and rename duplicate headers instead of losing a column.
- Compare every row's cell count to the header, and never realign a mismatch silently.
- Settle date order and decimal format once per file, from the file, and store canonical values.
- Normalize numbers and dates only in columns typed for them, and keep text byte for byte.
- Trim whitespace at the boundary, so a required field cannot be satisfied by a space.
- Validate emails, ranges, allowed values, and duplicates before submission, and show every failure where it can be fixed.
- Test with a headerless file, a non-US Excel export, a file with a title row, and a file large enough to hurt.
The audit never finishes
Every safeguard in this post earns its place because one specific file, somewhere, does one specific strange thing, and the supply of reasonable files from unreasonable tools does not run out. That is what CSV import issues are like up close. The importer's job is to meet each file calmly: fix what it can prove, flag what it cannot, and never be silently wrong. We keep feeding Updog broken files so that when your users upload theirs, the import is the easiest part of their day.