April 13, 2024

Regular expressions are a great tool in a programmer’s toolbox. But they can’t do everything. And one of the things they can’t do is to reliably parse CSV (comma separated value) files. This is because a regular expression doesn’t store state. You need a state machine (or something equivalent) to parse a CSV file.

For example, consider this (very short) CSV file (3 double quotes + 1 comma + 3 double quotes):

""","""

This is correctly interpreted as:

quote to start the data value + escaped quote + comma + escaped quote + quote to end the data value

E.g. a single value of:

","

How each character is interpreted depends on what characters come before and after it. E.g. the first quote puts you into an ‘inside data’ state. The second quote puts you into a ‘might be an escape for the following character or might be end of data’ state. The third quote puts you back into an ‘inside data’ state.
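As a rough illustration of the states described above, here is a minimal sketch of a state-machine CSV parser in Python. The function name parse_csv and the state names are made up for this example, and it assumes rows end with a plain "\n" and that a doubled quote is the only escape. It is not a complete CSV implementation.

```python
def parse_csv(text, delimiter=",", quote='"'):
    """Parse CSV text into a list of rows using an explicit state machine.

    Illustrative sketch only: assumes "\n" row endings and doubled-quote escaping.
    """
    START, UNQUOTED, QUOTED, QUOTE_IN_QUOTED = range(4)
    rows, row, field = [], [], []
    state = START
    for ch in text:
        if state == QUOTED:
            if ch == quote:
                # Might be an escape for the next character, might be end of data.
                state = QUOTE_IN_QUOTED
            else:
                field.append(ch)  # commas and newlines are plain data here
            continue
        if state == QUOTE_IN_QUOTED and ch == quote:
            field.append(quote)  # doubled quote: an escaped quote
            state = QUOTED       # back 'inside data'
            continue
        # Here we are at the start of a value, in an unquoted value,
        # or just past the closing quote of a quoted value.
        if ch == delimiter:
            row.append("".join(field))
            field = []
            state = START
        elif ch == "\n":
            row.append("".join(field))
            field = []
            rows.append(row)
            row = []
            state = START
        elif ch == quote and state == START:
            state = QUOTED
        else:
            field.append(ch)
            state = UNQUOTED
    # Flush the last value if the text did not end with a newline.
    if state != START or field or row:
        row.append("".join(field))
        rows.append(row)
    return rows


print(parse_csv('""","""'))
```

The same character (a double quote) is handled three different ways depending on the current state, which is exactly the information a regular expression has no way to carry along.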

No matter how complicated a regex you come up with, it will always be possible to create a CSV file that your regex can’t correctly parse. And once the parsing goes wrong, everything after that point is likely to be garbage.

You can write a regex that can handle CSV files where you are guaranteed there are no commas, quotes or carriage returns in the data values. But commas, quotes and carriage returns in the data values are perfectly valid in CSV files. So it is only ever going to handle a subset of all the possible well-formed CSV files.
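To make that concrete, here is a small Python sketch comparing a naive regex split on commas with Python's standard csv module. The sample line is invented for illustration. The regex works only as long as no value contains a comma.

```python
import csv
import io
import re

# Hypothetical sample row: the second value contains a comma inside quotes.
line = 'Smith,"12 High Street, Swindon",UK'

# Naive approach: split on every comma, with no idea whether it is inside quotes.
naive = re.split(r",", line)

# A real CSV parser tracks quote state, so the embedded comma stays in the value.
proper = next(csv.reader(io.StringIO(line)))

print(naive)   # 4 pieces, the address is cut in two and the quotes are left in
print(proper)  # 3 values
```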

Note that you can parse a TSV (tab separated value) file with a regex, as TSV files are (generally!) not allowed to contain tabs or carriage returns in data and therefore don’t need escaping.
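A minimal sketch of that in Python, assuming the stricter kind of TSV where values never contain tabs or line breaks (the function name parse_tsv is made up for this example):

```python
import re


def parse_tsv(text):
    """Split TSV text into rows of values.

    Assumes no value contains a tab or a line break, so no state is needed.
    """
    lines = re.split(r"\r?\n", text)
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after a trailing newline
    return [re.split(r"\t", line) for line in lines]


print(parse_tsv('name\tcomment\nAndy\tsays "hi", then leaves\n'))
```

Commas and quotes in the data need no special treatment here, because neither has any meaning in this format.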

See also on Stack Overflow:

Using regular expressions to parse HTML: why not?