Fork me on GitHub

Learn JS Data

Data cleaning, manipulation, and wrangling in JavaScript

Combining Data

Note: This task was very generously contributed by Timo Grossenbacher - Geographer and Data Specialist extraordinaire. Thanks very much Timo!

Often, you have to combine two or more different data sets because they contain complementary information. Or, for example, the data come in chunks from the server and need to be reassembled on the client side.

Combining or merging data may involve one of the following tasks:

Combine data sets by one or more common attributes

Let's say we have a list of the following articles:

var articles = [{
    "id": 1,
    "name": "vacuum cleaner",
    "weight": 9.9,
    "price": 89.9,
    "brand_id": 2
}, {
    "id": 2,
    "name": "washing machine",
    "weight": 540,
    "price": 230,
    "brand_id": 1
}, {
    "id": 3,
    "name": "hair dryer",
    "weight": 1.2,
    "price": 24.99,
    "brand_id": 2
}, {
    "id": 4,
    "name": "super fast laptop",
    "weight": 400,
    "price": 899.9,
    "brand_id": 3
}];

And of the following brands:

var brands = [{
    "id": 1,
    "name": "SuperKitchen"
}, {
    "id": 2,
    "name": "HomeSweetHome"
}];

As you can see, in each article, brand_id points to a particular brand, whose details are saved in another data set - which can be considered a lookup table in this case. This is often how separate data schemes are stored in a server-side database. Also note that the last article in the list has a brand_id for which no brand is stored in brands.

What we want to do now is to combine both datasets, so we can reference the brand's name directly from an article. There are several ways to achieve this.

Using native Array functions

We can implement a simple join (left outer join in database terms) using native, i.e., already existing Array functions as follows. The method presented here modifies the articles array in place by adding a new key-value-pair for brand.

articles.forEach(function(article) {
    var result = brands.filter(function(brand) {
        return brand.id === article.brand_id;
    });
    delete article.brand_id;
    article.brand = (result[0] !== undefined) ? result[0].name : null;
});
console.log(articles);
=> [{
    "id": 1,
    "name": "vacuum cleaner",
    "weight": 9.9,
    "price": 89.9,
    "brand": "HomeSweetHome"
}, {
    "id": 2,
    "name": "washing machine",
    "weight": 540,
    "price": 230,
    "brand": "SuperKitchen"
}, {
    "id": 3,
    "name": "hair dryer",
    "weight": 1.2,
    "price": 24.99,
    "brand": "HomeSweetHome"
}, {
    "id": 4,
    "name": "super fast laptop",
    "weight": 400,
    "price": 899.9,
    "brand": null
}];

First, we loop over each article, where we take its brand_id to look up the corresponding brand using the native filter function. Note that this function returns an array and we expect it to have only one element. In case there is no corresponding brand, result[0] will be undefined, and in order to prevent an error (something like result[0] is undefined), we use the ternary operator.

Also, as we no longer need brand_id after the lookup has been done, we can safely delete it.

If we want to join by more than one attribute, we can modify the filter function to achieve this. Hypothetically, this might look something like:

innerArray.filter(function(innerArrayItem) {
    return innerArrayItem.idA === outerArrayItem.idA &&
        innerArrayItem.idB === outerArrayItem.idB;
});

Using a generic and more efficient approach

A more generic, and also more performant version of a join is proposed below (abbreviated from this StackOverflow answer). Its output is equivalent to the one of the above method.

function join(lookupTable, mainTable, lookupKey, mainKey, select) {
    var l = lookupTable.length,
        m = mainTable.length,
        lookupIndex = [],
        output = [];
    for (var i = 0; i < l; i++) { // loop through l items
        var row = lookupTable[i];
        lookupIndex[row[lookupKey]] = row; // create an index for lookup table
    }
    for (var j = 0; j < m; j++) { // loop through m items
        var y = mainTable[j];
        var x = lookupIndex[y[mainKey]]; // get corresponding row from lookupTable
        output.push(select(y, x)); // select only the columns you need
    }
    return output;
};

Because above defined function creates an index for the lookupTable (in our case brands) in the first iteration, it runs considerably faster than the previously shown method. Also, via a callback, it allows us to directly define which keys (or "attributes") we want to retain in the resulting, joined array (output). It is used like so:

var result = join(brands, articles, "id", "brand_id", function(article, brand) {
    return {
        id: article.id,
        name: article.name,
        weight: article.weight,
        price: article.price,
        brand: (brand !== undefined) ? brand.name : null
    };
});
console.log(result);
=> [{
    "id": 1,
    "name": "vacuum cleaner",
    "weight": 9.9,
    "price": 89.9,
    "brand": "HomeSweetHome"
}, {
    "id": 2,
    "name": "washing machine",
    "weight": 540,
    "price": 230,
    "brand": "SuperKitchen"
}, {
    "id": 3,
    "name": "hair dryer",
    "weight": 1.2,
    "price": 24.99,
    "brand": "HomeSweetHome"
}, {
    "id": 4,
    "name": "super fast laptop",
    "weight": 400,
    "price": 899.9,
    "brand": null
}];

Note that we don't modify articles in place but create a new array.

Add together rows from different data sets

Let's say we want to load a huge data set from the server, but because of network performance reasons, we load it in three chunks and reassemble it on the client side. Using Queue.js, as illustrated in reading in data, we get the data and immediately combine it. For this, we can use D3's merge to combine the single arrays one after another. In database terms, this operation is called "union".

queue()
    .defer(d3.csv, "/data/big_data_1.csv")
    .defer(d3.csv, "/data/big_data_2.csv")
    .defer(d3.csv, "/data/big_data_3.csv")
    .await(combine);

function combine(error, big_data_1, big_data_2, big_data_3) {
    if (error) {
        console.log(error);
    }
    console.log(d3.merge([big_data_1, big_data_2, big_data_3]));
}
=> [{"a": "1", "b": "2"},{"a": "3", "b": "4"},{"a": "5", "b": "6"}]
This code is using d3.js

Note that the argument passed to d3.merge must be an array itself, which is why we use the square brackets.

Combine attributes from different data sets

In the last case, we have two or more data sets that contain attributes describing the same observations, or conceptual entities, and they need to be combined. This implies that all data sets have the same length. For example, dataset_1 below contains two observations of attribute type and attribute model, while dataset_2 contains the same two entities, but observed through attributes price and weight.

var dataset_1 = [{
    'type': 'boat',
    'model': 'Ocean Queen 2000'
}, {
    'type': 'car',
    'model': 'Ferrari'
}];
var dataset_2 = [{
    'price': 23202020,
    'weight': 5656.9
}, {
    'price': 59988,
    'weight': 1.9
}];

So in both data sets we essentially have separate information about the same conceptual entities, thus it makes sense to "merge" them, for which we can use lodash's merge function:

var result = _.merge(dataset_1, dataset_2);
console.log(result);
=> [{
    'type': 'boat',
    'model': 'Ocean Queen 2000',
    'price': 23202020,
    'weight': 5656.9
}, {
    'type': 'car',
    'model': 'Ferrari',
    'price': 59988,
    'weight': 1.9
}];
This code is using lodash

Next Task

Summarizing Data