Scroll to navigation

CSVPRINTF(1) General Commands Manual CSVPRINTF(1)

NAME

csvprintfCSV file parser

SYNOPSIS

csvprintf [options] format


csvprintf -b [options]


csvprintf -j [options]


csvprintf -x [options]


csvprintf -X [options]


xml2csv [file.xml]

DESCRIPTION

csvprintf is a simple UNIX command line utility for parsing CSV files.

In the first form, csvprintf works like the printf(1) command line utility: you supply a printf(1) format string on the command line, and each row of the CSV file is split into arguments and formatted accordingly.

The format specifiers in the format string contain numeric or symbolic column accessors to specify which CSV column to format.

A numeric column accessor is a sequence of decimal digits followed by the $ character (the same accessor format supported by printf(1)). So for example, "%3$d" would format the third CSV column as a decimal value. In addition, the "%0$d" specifier will print the number of columns in the record.

When the -n flag is given, the first row is assumed to contain column names and is not output. This allows symbolic, instead of numeric, column accessors to be used. A symbolic column accessor is the column name enclosed in curly braces.

For example, if the first row is FirstName,Lastname,IdNum then the format string "%{IdNum}04d: %{LastName}s, %{FirstName}s" would be equivalent to the format string "%3$04d: %2$s, %1$s".

Specifying a column name that does not appear in the first row generates an error, so the use of symbolic column accessors adds an extra consistency check.

XML Mode

With -x, the entire file is converted into an XML document.

The document element is <csv>.

Each CSV row becomes a <row> element containing its individual column values as sub-elements.

The column value sub-elements are named <col1>, <col2>, etc.; with -i, the sub-elements use the column names read from the first row (with illegal characters replaced by underscores).

In XML mode, a character encoding must be assumed; see -e.

The xml2csv command can convert XML documents generated by csvprintf -x back into CSV.

JSON Mode

With -j, each row is converted into a JSON document.

This form is described by RFC 7464 and consists of concatenated JSON documents framed by ASCII RS and LF control characters, which is compatible with the jq(1) utility's --seq flag.

Normally each row is written as a string array; with -i, each row is written as an object, using column names for fields. An error occurs if two columns have the same name.

In JSON mode, a character encoding must be assumed; see -e.

Bash Mode

With -b, each row is converted into bash(1) variable assignment(s) which may be applied with the eval(1) command.

Normally the output just assigns ROW as an array of values. The resulting output can be used like this:

cat input.csv | csvprintf -b | while read -r LINE; do
    eval "${LINE}"
    echo "The first column is: ${ROW[0]}"
    echo "The second column is: ${ROW[1]}"
    ...
done

With -i, each column value is assigned to a separate variable whose name is the corresponding column name (with underscores replacing non-alphanumeric characters), and an error occurs if two variables have the same name.

So an input file like this:

"Last Name","First Name","Registered???"
"Washington","George","Y"
"Lincoln","Abe","N"

can be processed like this:

cat input.csv | csvprintf -bi -p ROW_ | while read -r LINE; do
    eval "${LINE}"
    echo "First name: ${ROW_First_Name}"
    echo "Last name: ${ROW_Last_Name}"
    echo "Registered: ${ROW_Registered___}"
done

Bash Mode Security Concerns

There are two security issues to be aware of when using Bash Mode.

First, the -i flag opens a security hole because Bash has several special variables like PATH, TMPDIR, etc., which could be overwritten by malicious input. To prevent this, csvprintf omits known Bash variables, but for tighter security use the -c flag to explicitly white-list the variables you need. In addition, use of the -p flag is always recommended in Bash Mode to help avoid namespace collisions.

Secondly, if the Bash Mode output is piped into while read then the -r flag must be used to prevent extraneous decoding of backslash escapes.

Input Encoding

In all modes, lines must be terminated by LF bytes or CR+LF byte pairs, and the separator and quote characters must be recognizable as single byte values. This parsing behavior is compatible with ASCII, ISO-8859-1, UTF-8, etc., but not multi-byte encodings such as UTF-16, which must be re-encoded (e.g., to UTF-8) first.

In normal and Bash modes, column values are copied from input to output bytewise without interpretation.

In XML and JSON modes, column values must be interpreted according to an assumed character encoding. This encoding defaults to ISO-8859-1 but can be changed with the -e flag.

OPTIONS

Convert each CSV row into a bash(1) variable assignment line.
colname
Specify a column to be included when using column names in XML, JSON, or Bash output.

Without this flag, all columns are included. When this flag is used one or more times, only the specified columns are included.

If any colname doesn't exist, an error occurs.

Specify input character encoding for XML or JSON mode.

By default, ISO-8859-1 is assumed.

Read CSV input from the specified file.

By default (or if ``-'' is specified), csvprintf reads from standard input.

Use column names read from the first record in the output.

In normal mode, or when used with the -x flag, this flag is equivalent to -n.

In JSON mode, output objects instead of arrays and use column names for the object fields.

In Bash mode, output a variable for each column instead of a single ROW array variable.

It's possible for a row to have more columns than the column header row did. In that case, csvprintf reverts to using col1, col2, etc., for any extra columns.

This flag implies -n.

Convert the input into a JavaScript Object Notation (JSON) text sequence document.
Assume the first CSV record contains column names and omit from the output.

In normal mode, enable symbolic column accessors.

Specify a common prefix (UTF-8 encoding) to use with all column names in the output.

This flag is ignored unless -i is specified.

Specify an alternate CSV column quote character. The usual backslash escape sequences are accepted.

The default quote character is double quote.

Specify an alternate CSV column separator character. The usual backslash escape sequences are accepted.

The default separator character is comma.

Output usage message and exit.
Output version information and exit.
Convert the input into an XML document.
Convert the input into an XML document using column names for value sub-elements.

This flag implies -n.

CSV FORMAT

csvprintf parses according to the format described by ``The Comma Separated Value (CSV) File Format'' (see below). In particular, quote characters must be escaped with an extra quote and whitespace surrounding column values is ignored.

EXIT STATUS

csvprintf will exit with a status 1 if invalid CSV input is detected. Otherwise, if an invocation of printf(1) fails, processing stops and that exit value is returned.

FILES

/usr/share/csvprintf/csv.xsl
XSL transform that converts XML back into CSV format.

BUGS

Under the hood, csvprintf invokes the printf(1) executable on each CSV row it parses, which makes it relatively slow.

SEE ALSO

printf(1), printf(3), jq(1).

csvprintf: Simple CSV file parser for the UNIX command line, https://github.com/archiecobbs/csvprintf.

The Comma Separated Value (CSV) File Format, http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm.

RFC 7464: JavaScript Object Notation (JSON) Text Sequences, https://datatracker.ietf.org/doc/html/rfc7464.

AUTHOR

Archie L. Cobbs ⟨archie.cobbs@gmail.com⟩

November 30, 2010 Linux 5.14.21-150500.55.52-default