As mentioned in the introduction, this guide is mostly geared for client-side data analysis, but with a few augmentations, the same tools can be readily used server-side with Node.
If your data is large, this might in fact be your only option for doing the analysis in JavaScript. Trying to process large data in the browser can leave your users waiting a long time. No user will wait 5 minutes with a frozen browser, no matter how cool the analysis might be.
To get started with Node, ensure both node and npm, the Node package manager, are installed and available via the command line:
which node
# /usr/local/bin/node
which npm
# /usr/local/bin/npm
Your paths may be different than mine, but as long as which returns something, you should be good to go.
If node isn't installed on your machine, you can install it easily via a package manager.
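For example, on a Mac with Homebrew, or on Debian/Ubuntu with apt-get, one of the following should do the trick (exact package names can vary by platform and version):
brew install node
# or, on Debian/Ubuntu:
sudo apt-get install nodejs npm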
Create a new directory for your data analysis project. In this example, we have a directory with a sub-directory called data, which contains our animals.tsv file inside:
animals_analysis
|
- data
  |
  - animals.tsv
Next, we want to install our JavaScript tools, D3 and lodash. With Node, we can automate this process using npm. Inside your data analysis directory, run the following:
npm install d3
npm install lodash
You can see that npm creates a new sub-directory called node_modules by default, where your packages are installed. Everything is kept local, so you don't have to worry about problems with missing or out-of-date packages. Your analysis tools for each project are ready to go.
A package.json file can be useful for saving this kind of meta information about your project: dependencies, name, description, and so on. Check out npm's documentation for more information.
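As a rough sketch, a minimal package.json for this project might look something like this (the name, description, and version numbers are just illustrative placeholders):
{
  "name": "animals_analysis",
  "version": "0.1.0",
  "description": "Server-side analysis of animals.tsv",
  "dependencies": {
    "d3": "^3.5.0",
    "lodash": "^3.0.0"
  }
}
With a file like this in place, a plain npm install will pull in both dependencies at once.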
Now we create a separate JavaScript file to do our analysis in:
touch analyze.js
Inside this file, we first require our external dependencies.
var fs = require("fs");
var d3 = require("d3");
var _ = require("lodash");
We are requiring our locally installed d3 and lodash packages. Note how we assign them to variables, which are used to access their functions later in the code.
We also require the file system module. As we will see in a second, we need this to load our data - which is really the key difference between client-side and server-side use of these tools.
D3's data loading functionality is based on XMLHttpRequest, which is great in the browser, but Node does not have XMLHttpRequest. There are packages that work around this mismatch, but a more elegant solution is to just use Node's built-in file system functionality to load the data, and then D3 to parse it.
fs.readFile("data/animals.tsv", "utf8", function(error, data) {
data = d3.tsv.parse(data);
console.log(JSON.stringify(data));
});
fs.readFile is asynchronous and takes a callback function that is invoked when it has finished loading the data. Like our Queue example in client-side reading, the parameters of this callback start with error, which will be null unless there is an error. The data provided by readFile is the raw string contents of the file.
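It's good practice to check that error parameter before touching the data. A minimal sketch of what that might look like:
fs.readFile("data/animals.tsv", "utf8", function(error, data) {
  if (error) {
    // e.g. the file is missing or unreadable
    console.error(error);
    return;
  }
  // safe to parse from here on
  data = d3.tsv.parse(data);
  console.log(JSON.stringify(data));
});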
We can use d3.tsv.parse, which takes a string and converts it into an array of data objects - just like what we are used to on the client side!
From this point on, we can use d3 and lodash functionality to analyze our data.
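One thing to keep in mind: the parsed values are all strings, so numeric fields like avg_weight should be coerced with + before doing math on them. As a quick sketch, here is how we might combine that coercion with lodash's groupBy to bucket our animals by type (inside the read callback):
data.forEach(function(d) { d.avg_weight = +d.avg_weight; }); // coerce strings to numbers
var byType = _.groupBy(data, "type");
console.log(Object.keys(byType));
// => [ 'mammal', 'reptile' ]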
A full, but very simple script might look like this:
var fs = require("fs");
var d3 = require("d3");
var _ = require("lodash");
fs.readFile("data/animals.tsv", "utf8", function(error, data) {
data = d3.tsv.parse(data);
console.log(JSON.stringify(data));
var maxWeight = d3.max(data, function(d) { return +d.avg_weight; }); // coerce to number so d3.max compares numerically
console.log(maxWeight);
});
Since this is not running in a browser, we need to execute this script ourselves, much like you would with a script written in Ruby or Python. From the command line, we can simply run it with node to see the results.
node analyze.js
=> [{"name":"tiger","type":"mammal","avg_weight":"260"},{"name":"hippo","type":"mammal","avg_weight":"3400"},{"name":"komodo dragon","type":"reptile","avg_weight":"150"}]
3400
If the original data set is too big to work with directly, we can use Node to perform an initial pre-processing or filtering step and output the result to a new file to work with later. Node has fs.writeFile, which makes this easy. Inside the read callback, we can call it to write the data out.
// keep only animals heavier than 300 (coercing avg_weight to a number)
var bigAnimals = data.filter(function(d) { return +d.avg_weight > 300; });
var bigAnimalsString = JSON.stringify(bigAnimals);
fs.writeFile("big_animals.json", bigAnimalsString, function(err) {
  console.log("file written");
});
Running this should leave us with a big_animals.json file in our analysis folder.
This is fine if JSON is what you want, but often you want to output TSV or CSV files for further analysis. D3 to the rescue again!
D3 includes d3.csv.format (and the equivalent for TSV and other delimited formats), which converts our array of data objects into a string - perfect for writing to a file.
Let's use it to make a CSV of our big animals.
var bigAnimals = data.filter(function(d) { return +d.avg_weight > 300; });
var bigAnimalsString = d3.csv.format(bigAnimals);
fs.writeFile("big_animals.csv", bigAnimalsString, function(err) {
  console.log("file written");
});
Run this with the same node analyze.js command, and now you should have a lovely little big_animals.csv file in your directory. It even takes care of the headers for you:
name,type,avg_weight
hippo,mammal,3400
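If you'd rather have tab-separated output, swapping in d3.tsv.format should work the same way:
var bigAnimalsString = d3.tsv.format(bigAnimals);
fs.writeFile("big_animals.tsv", bigAnimalsString, function(err) {
  console.log("file written");
});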
Now even BIG data is no match for us - using the power of JavaScript!