CSVPRINTF(1) | General Commands Manual | CSVPRINTF(1) |
NAME¶
csvprintf
— CSV
file parser
SYNOPSIS¶
csvprintf |
[options] format |
csvprintf |
-b [options] |
csvprintf |
-j [options] |
csvprintf |
-x [options] |
csvprintf |
-X [options] |
xml2csv |
[file.xml] |
DESCRIPTION¶
csvprintf
is a simple UNIX command line
utility for parsing CSV files.
In the first form, csvprintf
works like
the printf(1) command line utility: you supply a
printf(1) format string on the command line, and each row
of the CSV file is split into arguments and formatted accordingly.
The format specifiers in the format string contain numeric or symbolic column accessors to specify which CSV column to format.
A numeric column accessor is a sequence of decimal digits followed by the $ character (the same accessor format supported by printf(1)). So for example, "%3$d" would format the third CSV column as a decimal value. In addition, the "%0$d" specifier will print the number of columns in the record.
When the -n
flag is given, the first row
is assumed to contain column names and is not output. This allows symbolic,
instead of numeric, column accessors to be used. A symbolic column accessor
is the column name enclosed in curly braces.
For example, if the first row is FirstName,Lastname,IdNum then the format string "%{IdNum}04d: %{LastName}s, %{FirstName}s" would be equivalent to the format string "%3$04d: %2$s, %1$s".
Specifying a column name that does not appear in the first row generates an error, so the use of symbolic column accessors adds an extra consistency check.
XML Mode¶
With -x
, the entire file is converted into
an XML document.
The document element is <csv>.
Each CSV row becomes a <row> element containing its individual column values as sub-elements.
The column value sub-elements are named
<col1>, <col2>,
etc.; with -i
, the sub-elements use the column names
read from the first row (with illegal characters replaced by
underscores).
In XML mode, a character encoding must be assumed; see
-e
.
The xml2csv
command can convert XML
documents generated by csvprintf -x
back into
CSV.
JSON Mode¶
With -j
, each row is converted into a JSON
document.
This form is described by RFC 7464 and consists of concatenated
JSON documents framed by ASCII RS and LF control characters, which is
compatible with the jq(1) utility's
--seq
flag.
Normally each row is written as a string array; with
-i
, each row is written as an object, using column
names for fields. An error occurs if two columns have the same name.
In JSON mode, a character encoding must be assumed; see
-e
.
Bash Mode¶
With -b
, each row is converted into
bash(1) variable assignment(s) which may be applied with
the eval(1) command.
Normally the output just assigns ROW as an array of values. The resulting output can be used like this:
cat input.csv | csvprintf -b | while read -r LINE; do eval "${LINE}" echo "The first column is: ${ROW[0]}" echo "The second column is: ${ROW[1]}" ... done
With -i
, each column value is assigned to
a separate variable whose name is the corresponding column name (with
underscores replacing non-alphanumeric characters), and an error occurs if
two variables have the same name.
So an input file like this:
"Last Name","First Name","Registered???" "Washington","George","Y" "Lincoln","Abe","N"
can be processed like this:
cat input.csv | csvprintf -bi -p ROW_ | while read -r LINE; do eval "${LINE}" echo "First name: ${ROW_First_Name}" echo "Last name: ${ROW_Last_Name}" echo "Registered: ${ROW_Registered___}" done
Bash Mode Security Concerns¶
There are two security issues to be aware of when using Bash Mode.
First, the -i
flag opens a security hole
because Bash has several special variables like PATH,
TMPDIR, etc., which could be overwritten by malicious
input. To prevent this, csvprintf
omits known Bash
variables, but for tighter security use the -c
flag
to explicitly white-list the variables you need. In addition, use of the
-p
flag is always recommended in Bash Mode to help
avoid namespace collisions.
Secondly, if the Bash Mode output is piped into
while read then the -r
flag
must be used to prevent extraneous decoding of backslash escapes.
Input Encoding¶
In all modes, lines must be terminated by LF bytes or CR+LF byte pairs, and the separator and quote characters must be recognizable as single byte values. This parsing behavior is compatible with ASCII, ISO-8859-1, UTF-8, etc., but not multi-byte encodings such as UTF-16, which must be re-encoded (e.g., to UTF-8) first.
In normal and Bash modes, column values are copied from input to output bytewise without interpretation.
In XML and JSON modes, column values must be interpreted according
to an assumed character encoding. This encoding defaults to ISO-8859-1 but
can be changed with the -e
flag.
OPTIONS¶
-b
- Convert each CSV row into a bash(1) variable assignment line.
-c
colname- Specify a column to be included when using column names in XML, JSON, or
Bash output.
Without this flag, all columns are included. When this flag is used one or more times, only the specified columns are included.
If any colname doesn't exist, an error occurs.
-e
- Specify input character encoding for XML or JSON mode.
By default, ISO-8859-1 is assumed.
-f
- Read CSV input from the specified file.
By default (or if ``-'' is specified),
csvprintf
reads from standard input. -i
- Use column names read from the first record in the output.
In normal mode, or when used with the
-x
flag, this flag is equivalent to-n
.In JSON mode, output objects instead of arrays and use column names for the object fields.
In Bash mode, output a variable for each column instead of a single ROW array variable.
It's possible for a row to have more columns than the column header row did. In that case,
csvprintf
reverts to using col1, col2, etc., for any extra columns.This flag implies
-n
. -j
- Convert the input into a JavaScript Object Notation (JSON) text sequence document.
-n
- Assume the first CSV record contains column names and omit from the
output.
In normal mode, enable symbolic column accessors.
-p
- Specify a common prefix (UTF-8 encoding) to use with all column names in
the output.
This flag is ignored unless
-i
is specified. -q
- Specify an alternate CSV column quote character. The usual backslash
escape sequences are accepted.
The default quote character is double quote.
-s
- Specify an alternate CSV column separator character. The usual backslash
escape sequences are accepted.
The default separator character is comma.
-h
- Output usage message and exit.
-v
- Output version information and exit.
-x
- Convert the input into an XML document.
-X
- Convert the input into an XML document using column names for value
sub-elements.
This flag implies
-n
.
CSV FORMAT¶
csvprintf
parses according to the format
described by ``The Comma Separated Value (CSV) File Format'' (see below). In
particular, quote characters must be escaped with an extra quote and
whitespace surrounding column values is ignored.
EXIT STATUS¶
csvprintf
will exit with a status 1 if
invalid CSV input is detected. Otherwise, if an invocation of
printf(1) fails, processing stops and that exit value is
returned.
FILES¶
- /usr/share/csvprintf/csv.xsl
- XSL transform that converts XML back into CSV format.
BUGS¶
Under the hood, csvprintf
invokes the
printf(1) executable on each CSV row it parses, which
makes it relatively slow.
SEE ALSO¶
csvprintf: Simple CSV file parser for the UNIX command line, https://github.com/archiecobbs/csvprintf.
The Comma Separated Value (CSV) File Format, http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm.
RFC 7464: JavaScript Object Notation (JSON) Text Sequences, https://datatracker.ietf.org/doc/html/rfc7464.
AUTHOR¶
Archie L. Cobbs ⟨archie.cobbs@gmail.com⟩
November 30, 2010 | Linux 6.4.0-150600.23.38-default |