<code><b>config</b></code> should be a <a href="#config">config object</a> as described below.
</li>
<li>
<code><b>before</b></code> is an optional callback that lets you inspect each file before parsing begins. Return:
<ul>
<li><code>"skip"</code> to skip parsing just that file.</li>
<li><code>false</code> to abort parsing this and all other files in the queue.</li>
<li>a config object to alter the options for parsing that file only</li>
<li>anything else, including <code>undefined</code>, to continue without any changes</li>
</ul>
</li>
<li>
<code><b>error</b></code> is executed when there is a problem getting the file ready to parse. (Parse errors are <i>not</i> reported here.) It receives an object that implements the <a href="https://developer.mozilla.org/en-US/docs/Web/API/DOMError">DOMError</a> interface, the File object at hand, and the &lt;input&gt; element from which the file was selected. Errors can occur before reading the file if:
<ul>
<li>the HTML file input element has no files chosen</li>
</ul>
</li>
<li>
<code><b>complete</b></code> is invoked when parsing a file completes. It receives the results of the parse (including parse errors), the File object, the &lt;input&gt; element from which the file was chosen, and the FileReader-generated event.
</li>
<li>
Read about <a href="#streaming">streaming</a> for large files.
</li>
</ul>
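<p>
As a rough sketch of how these callbacks might fit together: the size limit, the <code>.tsv</code> check, and the argument order of <code>before</code> are assumptions made for illustration and are not stated above.
</p>

```javascript
// Hypothetical size limit, chosen only for illustration.
const MAX_BYTES = 5 * 1024 * 1024;

// "before" callback. Assumption: it receives the File and the input element.
function before(file, inputElem) {
  if (file.size === 0) {
    return "skip";               // skip parsing just this file
  }
  if (file.size > MAX_BYTES) {
    return false;                // abort this and all other queued files
  }
  if (/\.tsv$/i.test(file.name)) {
    return { delimiter: "\t" };  // config override for this file only
  }
  // Falling through returns undefined: continue without any changes.
}

// "error" callback: the file may be missing, e.g. when no file was chosen.
function error(err, file, inputElem) {
  const name = file ? file.name : "(no file)";
  console.error("Could not read " + name + ": " + err.name);
}

// "complete" callback: parse errors arrive here, inside the results.
function complete(results, file, inputElem, event) {
  console.log(file.name + ": " + results.errors.length + " parse error(s)");
}

const options = {
  config: { header: true },
  before: before,
  error: error,
  complete: complete
};

// Assumed invocation with jQuery and the plugin loaded:
// $("input[type=file]").parse(options);
```

<p>
Keeping the callbacks as named functions makes each decision easy to check on its own, without selecting a real file.
</p>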
</div>
<div class="clear"></div>
<hr>
<h2 id="config">The config object</h2>
<p class="text-center">
Use a config object to specify the parser's behavior.
</p>
<hr>
<div class="grid-50">
<p>
Any time you invoke the parser, you may customize its behavior using a "config" object. It supports these properties:
</p>
<ul>
<li>
<code><b>delimiter</b></code> The delimiting character. Leave blank to auto-detect. If you specify a delimiter, it must be a string of length 1, and cannot be <code>\n</code>, <code>\r</code>, or <code>"</code>.
</li>
<li>
<code><b>header</b></code> If true, the first row of parsed data will be interpreted as column titles (fields). Fields are returned separately from the rows, and each data point will be keyed to its field name. If false, the parser simply returns an array of arrays, including the first row.
</li>
<li>
<code><b>dynamicTyping</b></code> If true, fields that are only numeric will be converted to a number type. If false, each parsed datum is returned as a string.
</li>
<li>
<code><b>preview</b></code> If greater than 0, only that many rows will be parsed.
</li>
<li>
<code><b>step</b></code> To use a stream, <a href="#step">define a callback function</a> here which receives the data, row-by-row, as each row is parsed. If parsing a file, step also receives the source file and file input element. Return <code>false</code> to abort the process.
</li>
</ul>
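<p>
The delimiter rules above can be checked before parsing. The following is a minimal sketch; <code>isValidDelimiter</code> and <code>buildConfig</code> are hypothetical helpers written for this example and are not part of Papa.
</p>

```javascript
// Defaults as listed under "Default config object".
const DEFAULT_CONFIG = {
  delimiter: "",
  header: true,
  dynamicTyping: true,
  preview: 0,
  step: undefined
};

// Hypothetical helper: true if the delimiter follows the documented rules.
function isValidDelimiter(delimiter) {
  if (delimiter === "") {
    return true; // blank means auto-detect
  }
  return typeof delimiter === "string" &&
    delimiter.length === 1 &&
    delimiter !== "\n" &&
    delimiter !== "\r" &&
    delimiter !== '"';
}

// Hypothetical helper: overlay custom settings on the defaults.
function buildConfig(overrides) {
  const config = Object.assign({}, DEFAULT_CONFIG, overrides);
  if (!isValidDelimiter(config.delimiter)) {
    throw new Error("Invalid delimiter: " + JSON.stringify(config.delimiter));
  }
  return config;
}

// A configuration with header row and dynamic typing turned off.
const fastConfig = buildConfig({ delimiter: ",", header: false, dynamicTyping: false });
console.log(fastConfig);
```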
</div>
<div class="grid-50">
<h3>Default config object</h3>
<br>
<code class="block">{
delimiter: "",
header: true,
dynamicTyping: true,
preview: 0,
step: undefined
}</code>
<br>
<h3>Notes</h3>
<ul>
<li>If using a header row, field names should be unique. Because each value is keyed to its field name, duplicate field names are problematic.</li>
<li>Dynamic typing comes at a slight performance cost, only noticeable for large files. If you don't need it, disable it.</li>
<li>Step through results by defining a "step" callback function that receives data from the parser after each row is parsed.</li>
</ul>
</div>
<div class="clear"></div>
<hr>
<h2 id="results">Parse results (output)</h2>
<hr>
<div class="grid-50">
<h3>Structure</h3>
<p>
Parse output is always an object like this:
</p>
<code class="block">{
results: // parse results
errors: // parse <a href="#errors">errors</a>, keyed by row
}</code>
</div>
<div class="grid-50">
<h3>Notes</h3>
<ul>
<li>
<code>results</code> will be an array of arrays if header row is <i>disabled</i>, or an array of objects if header row is <i>enabled</i>.
</li>
<li>
If no delimiter is specified and a delimiter cannot be auto-detected, an error keyed by "config" will be produced and a default delimiter will be chosen.
</li>
<li>
With a header row, the field count must be the same on each row or a FieldMismatch error will be produced for that row. (Without a header row, lines can have a variable number of fields without errors.)
</li>
</ul>
</div>
<div class="clear"></div>
<div class="grid-50">
<h3 id="example1">Example 1</h3>
<p>
With default config (header row and dynamic typing <b>enabled</b>):
</p>
<code class="block limit-height">{
"results": {
"fields": [
"Item",
"SKU",
"Cost",
"Quantity"
],
"rows": [
{
"Item": "Book",
"SKU": "ABC1234",
"Cost": 10.95,
"Quantity": 4
},
{
"Item": "Movie",
"SKU": "DEF5678",
"Cost": 29.99,
"Quantity": 3
}
]
},
"errors": {
"length": 0
}
}</code>
</div>
<div class="grid-45">
<ul>
<li>
With a header row, field names are returned as an array, separate from the rows of actual data which are returned as an array of objects. The values are keyed to their field names, which is much easier to work with than index positions.
</li>
<li>
Using the header row can degrade performance with really large inputs.
</li>
<li>
Notice how dynamic typing turned numeric values into Number types. This also comes at a slight performance hit (you'd only notice with really big inputs).
</li>
<li>
Because header row is enabled, errors will be raised for each row that has a different field count from the first (header) row.
</li>
</ul>
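<p>
To illustrate why keyed values are convenient, here is a small sketch that works on the Example 1 output. The <code>inventoryValue</code> helper is hypothetical and exists only for this example.
</p>

```javascript
// Output copied from Example 1 (header row and dynamic typing enabled).
const output = {
  results: {
    fields: ["Item", "SKU", "Cost", "Quantity"],
    rows: [
      { Item: "Book", SKU: "ABC1234", Cost: 10.95, Quantity: 4 },
      { Item: "Movie", SKU: "DEF5678", Cost: 29.99, Quantity: 3 }
    ]
  },
  errors: { length: 0 }
};

// Hypothetical helper: sum Cost * Quantity over all rows.
// No conversion is needed because dynamic typing already produced numbers.
function inventoryValue(rows) {
  return rows.reduce(function (sum, row) {
    return sum + row.Cost * row.Quantity;
  }, 0);
}

const total = inventoryValue(output.results.rows);
console.log(output.results.fields.join(", ")); // prints "Item, SKU, Cost, Quantity"
console.log(total.toFixed(2));                 // prints "133.77"
```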
</div>
<div class="clear"></div>
<div class="grid-50">
<h3>Example 2</h3>
<p>
With header row and dynamic typing <b>disabled</b>:
</p>
<code class="block limit-height">{
"results": [
[
"Item",
"SKU",
"Cost",
"Quantity"
],
[
"Book",
"ABC1234",
"10.95",
"4"
],
[
"Movie",
"DEF5678",
"29.99",
"3"
]
],
"errors": {
"length": 0
}
}</code>
</div>
<div class="grid-50">
<ul>
<li>
The results are returned as an array of arrays because the header row is disabled. You'll have to use plain ol' non-descript index positions to access the values.
</li>
<li>
Mismatching field counts will not produce <a href="#errors">errors</a> without a header row.
</li>
<li>
This is a fast configuration. With header row and dynamic typing disabled, you should see faster performance for large inputs.
</li>
</ul>
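<p>
A short sketch of working with the Example 2 output by index position. The column constants are hypothetical names introduced here for readability. Since dynamic typing is off, every value is a string and must be converted by hand.
</p>

```javascript
// Output copied from Example 2 (header row and dynamic typing disabled).
const headerless = {
  results: [
    ["Item", "SKU", "Cost", "Quantity"],
    ["Book", "ABC1234", "10.95", "4"],
    ["Movie", "DEF5678", "29.99", "3"]
  ],
  errors: { length: 0 }
};

// Hypothetical names for the index positions used below.
const COST = 2;
const QUANTITY = 3;

// In this sample the first row holds the titles, so skip it manually.
const dataRows = headerless.results.slice(1);

// Values are strings, so convert explicitly before doing arithmetic.
const totalQuantity = dataRows.reduce(function (sum, row) {
  return sum + Number(row[QUANTITY]);
}, 0);
const firstCost = parseFloat(dataRows[0][COST]);

console.log(totalQuantity); // prints 7
console.log(firstCost);     // prints 10.95
console.log(dataRows[0][QUANTITY] + dataRows[1][QUANTITY]); // prints "43" (string concatenation)
```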
</div>
<div class="clear"></div>
<hr>
<h2 id="errors">Parse errors</h2>
<hr>
<div class="grid-50">
<h3>Structure</h3>
<p>
Parse errors are returned in this format, keyed by the row number, alongside the "length" property (shown above) which is included for convenience:
</p>
<code class="block">{
type: "", // A generalization of the error
code: "", // Standardized error code
message: "", // Human-readable details
line: 0, // Line of original input
row: 0, // Row index of parsed data where error is
index: 0 // Character index within original input
}</code>
</div>
<div class="grid-45">
<h3>Notes</h3>
<ul>
<li>
If no delimiter is specified and a delimiter cannot be auto-detected, an error keyed by "config" will be produced and a default delimiter will be chosen.
</li>
<li>
With a header row, the field count must be the same on each row or a FieldMismatch error will be produced for that row. (Without a header row, lines can have a variable number of fields without errors.)
</li>
<li>
The <code>type</code> will be one of "Abort", "Quotes", "Delimiter", or "FieldMismatch".
</li>
<li>
The <code>code</code> may be:
<ul>
<li>ParseAbort</li>
<li>MissingQuotes</li>
<li>UnexpectedQuotes</li>
<li>UndetectableDelimiter</li>
<li>TooFewFields</li>
<li>TooManyFields</li>
</ul>
</li>
<li>
The <code>index</code> will be the character index across the entire input where the error occurred; it is <i>not</i> the index of the offending character on that <i>line</i>.
</li>
<li>
In the event of an error, the Parser makes its best attempt to continue parsing as correctly as possible. For example, if a header row is used and extra fields are found on a line, they will be put into an array keyed by the field name "__parsed_extra".
</li>
</ul>
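<p>
One way to walk the errors object is sketched below. The sample data is invented for illustration, and whether each row key holds a single error object or an array of them is an assumption, so the hypothetical <code>listErrors</code> helper accepts both.
</p>

```javascript
// Invented sample in the documented shape: errors keyed by row, plus "length".
// Assumption: each key holds an array of error objects.
const sampleErrors = {
  length: 2,
  "1": [{
    type: "FieldMismatch", code: "TooFewFields",
    message: "Too few fields", line: 2, row: 1, index: 30
  }],
  "3": [{
    type: "FieldMismatch", code: "TooManyFields",
    message: "Too many fields", line: 4, row: 3, index: 71
  }]
};

// Hypothetical helper: flatten the keyed errors into readable strings.
function listErrors(errors) {
  const found = [];
  Object.keys(errors).forEach(function (key) {
    if (key === "length") {
      return; // convenience property, not an error entry
    }
    // Accept either one error object or an array of them.
    const entries = Array.isArray(errors[key]) ? errors[key] : [errors[key]];
    entries.forEach(function (err) {
      found.push("row " + key + ": " + err.code + " (" + err.type + ") at index " + err.index);
    });
  });
  return found;
}

listErrors(sampleErrors).forEach(function (line) {
  console.log(line);
});
```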
</div>
<div class="clear"></div>
<hr>
<h2 id="streaming">Streaming files</h2>
<p class="text-center">
Papa can load and parse very large files by using streams.
</p>
<hr>
<div class="grid-60 prefix-20 suffix-20">
<h3>Can Papa load and parse huge text files?</h3>
<p>
Yes. By defining a <a href="#step">step</a> callback function, you're able to receive parsed results, row-by-row, as the data is collected. This dramatically reduces memory usage and prevents browsers from crashing.
</p>
<h3>What is a stream and when should I stream files?</h3>
<p>
A stream is a unique data structure which, given infinite time, gives you infinite space.
So if you're short on memory (as client computers often are), use a stream.
</p>
<h3>Wait, does that mean streaming takes more time?</h3>
<p>
Yes and no. Typically, when we gain speed, we pay with space. The opposite is true, too. Streaming uses significantly less memory with large inputs, but since the reading happens in chunks and results are processed at each row instead of at the very end, yes, it can be slower.
</p>
<p>
But consider the alternative: upload the file to a remote server, open and process it there using a (hopefully) fast and accurate parser, then compress it and have the client download the results. How long does it take you to upload a 500 MB or 1 GB file? Then consider that the server still has to open the file and read its contents, which is what the client would have done minutes ago. The server might parse it faster with natively-compiled binaries, but only if its resources are dedicated to the task and it isn't already parsing files for many other users.
</p>
<p>
So unless your clients have <a href="http://google.com/fiber">a fiber line</a> and you have a scalable cloud application, local parsing by streaming is nearly guaranteed to be faster.
</p>
<h3 id="step">How do I use the <code>step</code> function?</h3>
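<p>
A minimal sketch, under stated assumptions: the callback is set as the <code>step</code> property of the config object, and the data it receives is assumed to have the same shape as regular parse output but holding a single row. The <code>makeStep</code> helper and the row limit are hypothetical.
</p>

```javascript
// Hypothetical factory: builds a step callback that keeps running totals
// instead of storing rows, and aborts once rowLimit rows have been seen.
function makeStep(rowLimit) {
  const stats = { rows: 0, totalQuantity: 0 };

  // When parsing a file, step also receives the file and the input element.
  function step(data, file, inputElem) {
    // Assumption: with a header row, the current row is data.results.rows[0].
    const row = data.results.rows[0];
    stats.rows += 1;
    stats.totalQuantity += row.Quantity;
    if (stats.rows >= rowLimit) {
      return false; // returning false aborts the process
    }
  }

  return { step: step, stats: stats };
}

const tally = makeStep(1000);
const streamConfig = { header: true, dynamicTyping: true, step: tally.step };

// Simulated calls standing in for the parser, one per parsed row.
const fields = ["Item", "SKU", "Cost", "Quantity"];
tally.step({
  results: { fields: fields, rows: [{ Item: "Book", SKU: "ABC1234", Cost: 10.95, Quantity: 4 }] },
  errors: { length: 0 }
});
tally.step({
  results: { fields: fields, rows: [{ Item: "Movie", SKU: "DEF5678", Cost: 29.99, Quantity: 3 }] },
  errors: { length: 0 }
});

console.log(tally.stats.rows, tally.stats.totalQuantity); // prints 2 7
```

<p>
Only the running totals are kept, so memory use stays flat no matter how many rows pass through.
</p>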
<h3>How do I get all the results together after streaming?</h3>
<p>
You don't. Unless you assemble it manually. And really, don't do that... it defeats the purpose of using a stream. Just take the bits you need as they come through.
</p>
<h3>How big should a file be before streaming it?</h3>
<p>
In some very unscientific testing (with the fastest 2013 MacBook Pro), we were able to load files of about 250 MB for parsing in Chrome without crashing the tab. Beyond that, Chrome started to choke. Actual performance may vary widely. But keep in mind that file size may not be the only factor for choosing to stream.
</p>
<h3>Why <i>wouldn't</i> I stream the input?</h3>
<p>
Getting parsed results one row at a time is usually less convenient to work with; it's hard to see the big picture. (But the big picture might be <i>really</i> big.) As results stream in, you can tabulate stats or keep track of whatever you need to, but you wouldn't want to reassemble all the data...
</p>
<h3>Why should I stream large files, even if they fit in memory?</h3>
<p>
The space required by the parsed results is often much larger than that of the original input file. The convenience of JavaScript objects afforded by 64-bit pointers (to make each value quickly accessible) takes up a lot more space than globbing it together like a file does (at the cost of accessibility). In other words, the output may not fit in memory even if the input does.
</p>
<h3>Can I stream text without using a file?</h3>
<p>
Yes, though that's often not necessary. Input that comfortably fits in a textarea usually is small enough that it doesn't need to be streamed.
</p>
<h3>Does Papa use a true stream?</h3>
<p>
Papa uses HTML5's FileReader API to load files, which uses a stream to read in the data. FileReader doesn't technically allow us to hook into the underlying stream (other than providing occasional progress reports), but it does let us load the file in chunks/blobs. Don't worry about that though, because if you want to stream, you'll still get results, row-by-row, into your <a href="#step">step</a> function.
</p>
</div>
<div class="clear"></div>
<hr>
<h2>Dynamic typing</h2>
<p class="text-center">
Papa can convert numeric values to true numbers for you
</p>
<hr>
<div class="grid-60 prefix-20 suffix-20">
<h3>What is dynamic typing?</h3>
<p>
By default, parsed values are returned as strings. Dynamic typing is a feature built into Papa that converts numeric values to a Number type. When dynamic typing is enabled, parsed values that resemble a number will be converted to one.
</p>
<h3>Do I need dynamic typing?</h3>
<p>
If you're performing mathematical operations on the data, then yes, it'll be very helpful. (You'd probably rather add two numbers than concatenate them, right?)
</p>
<h3>What's the trade-off for using dynamic typing?</h3>
<p>
Performance, as usual. Each parsed value is matched against a regular expression to determine whether it is numeric. You probably won't notice the degraded performance except with very large inputs. Even then, it may not be significantly slower in many cases. But if you absolutely need the best performance possible, turn off dynamic typing (and header row).
</p>
<p>
If, for some reason, the data is padded by whitespace, the padding will be ignored. Within the actual data, however, whitespace is significant. For example, floats represented using scientific notation should not have spaces around the "e" character.
</p>
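<p>
The idea can be sketched as follows. The regular expression and the <code>toTyped</code> helper are illustrative assumptions and are not Papa's actual implementation.
</p>

```javascript
// Illustrative pattern (an assumption, not Papa's own): optional sign,
// digits with an optional fraction, and an optional exponent.
const NUMERIC = /^[-+]?(\d+\.?\d*|\.\d+)(e[-+]?\d+)?$/i;

// Hypothetical helper: convert values that resemble a number, leave the rest.
function toTyped(value) {
  const trimmed = value.trim(); // surrounding whitespace is ignored
  return NUMERIC.test(trimmed) ? parseFloat(trimmed) : value;
}

console.log(toTyped("10.95"));   // prints 10.95
console.log(toTyped("  4 "));    // prints 4
console.log(toTyped("1e5"));     // prints 100000
console.log(toTyped("1 e5"));    // prints "1 e5" (space inside the data is significant)
console.log(toTyped("ABC1234")); // prints "ABC1234"

// Why it matters for arithmetic:
console.log("4" + "3");                   // prints "43"
console.log(toTyped("4") + toTyped("3")); // prints 7
```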
</div>
<div class="clear"></div>
<hr>
<h2 id="contribute">Contribute</h2>
<p class="text-center">
Help make Papa better
</p>
<hr>
<div class="grid-60 prefix-20 suffix-20">
<h3>How to contribute</h3>
<p>
Please, feel free to <a href="https://github.com/mholt/jquery.parse">fork Papa on GitHub</a> and submit a <a href="https://github.com/mholt/jquery.parse/pulls">pull request</a>. Remember, the Parser component is <a href="https://github.com/mholt/jquery.parse/blob/master/tests.html">under test</a>, so if you're making changes to the actual parsing mechanisms, be sure to add a test case to validate your change.
</p>
<h3>Feedback</h3>
<p>
You can <a href="https://github.com/mholt/jquery.parse/issues?state=open">open an issue</a> on GitHub to ask questions or start a discussion, or you can hashtag <a href="https://twitter.com/search?q=%23PapaParse&src=typd&f=realtime">#PapaParse</a> on Twitter.