Combining Multiple JSON Files Into A Single JSON File

Photo by Joshua Aragon on Unsplash

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is based on a subset of the JavaScript language (the way objects are written in JavaScript). JSON is often used when data is sent from a server to a web page, so processing JSON data is a routine part of data analytics.

In this article, I will walk through an example of JSON data processing. The goal is to combine a folder of JSON files, each containing a single observation, into one JSON file. The advantage of this kind of processing is that it can significantly shrink your data and simplify its structure, making it easier for others to use.

Dataset

Imagine you received a folder of movie datasets, all in JSON format. You need to run an ETL process on them, i.e. clean them and store them in a data warehouse. But then you find that the folder is quite large.

Image by Author

If you keep the data in this form and apply the usual pandas data processing in Python, you are likely to run into an error like this:

Image by Author

The error shows that you ran out of RAM while executing the code. Fortunately, you can shrink the datasets by combining the individual movie JSON files into a single movies JSON file. A single JSON file can also act as a simple data warehouse for reproducible data analysis.

ETL Process

This is what a movie JSON file in the folder looks like.

We can see that it is a single-line dictionary describing one movie. Some of the fields are relevant for data analysis and some are not. We want to combine all movie observations into a single JSON file by picking the necessary fields of each movie and writing them on one line, then writing the next movie on the next line, and so on. Before we dump the data, though, we need to transform some fields, such as genres and spoken_language. The whole extract, transform, and load process can be done in fewer than 40 lines of code (possibly even fewer; leave a comment if you find a shorter way!). The code is as follows.
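A minimal sketch of such a script might look like the following. It is an illustration under assumptions rather than the exact code from the repository: the helper names flatten_names and combine_movies are hypothetical, and it assumes that genres and spoken_language are lists of dictionaries that each carry a name key.

```python
import json
import os


def flatten_names(items):
    """Turn a list like [{"id": 18, "name": "Drama"}] into ["Drama"].

    Assumption: each nested entry is a dictionary with a "name" key.
    A missing or empty field becomes an empty list.
    """
    return [item["name"] for item in (items or []) if "name" in item]


def combine_movies(folder, output_path, fields):
    """Combine one-movie-per-file JSON files under `folder` into one file.

    `fields` lists the top-level keys to keep as they are. Keep
    `output_path` outside `folder` so a rerun does not read the
    combined file back in as if it were a movie.
    """
    data_list = []
    for root, _dirs, files in os.walk(folder):
        for filename in sorted(files):
            if not filename.endswith(".json"):
                continue
            with open(os.path.join(root, filename), encoding="utf-8") as f:
                movie = json.load(f)
            # Extract: keep only the fields needed for the analysis.
            record = {key: movie.get(key) for key in fields}
            # Transform: reduce the nested dictionaries to plain names.
            record["genres"] = flatten_names(movie.get("genres"))
            record["spoken_language"] = flatten_names(movie.get("spoken_language"))
            data_list.append(record)
    # Load: write every record to a single JSON file.
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(data_list, f)
    return len(data_list)
```

Returning the number of combined records gives a quick sanity check against the number of files in the folder.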

Here I use os.walk to extract the data from the movies folder. Then, while iterating over the stored filenames, the genre and spoken_language columns are transformed. The transformed data is stored in a predefined empty list called data_list. Once the iteration is done, json.dump writes data_list to a single JSON file, resulting in this output:

After movies.json is created, we can see that the file size decreased from 452 MB to around 145 MB. Roughly 68% of the space is freed without losing any of the fields we need! Besides, we now have a more centralized data source that is easier to process.

Image by Author
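To check the saving on your own data, a small sketch like the one below can compare the two sizes. The function names folder_size_bytes and percent_reduction are hypothetical helpers introduced for illustration, and the last line simply plugs in the sizes reported above.

```python
import os


def folder_size_bytes(folder):
    """Sum the sizes of all files under `folder`, including subfolders."""
    total = 0
    for root, _dirs, files in os.walk(folder):
        for filename in files:
            total += os.path.getsize(os.path.join(root, filename))
    return total


def percent_reduction(before, after):
    """Return how much smaller `after` is than `before`, in percent."""
    return (before - after) / before * 100


# Using the sizes reported above (in MB):
print(f"{percent_reduction(452, 145):.1f}% smaller")  # prints 67.9% smaller
```

On a real folder, you would pass folder_size_bytes of the movies folder as the first argument and os.path.getsize of the combined file as the second.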

Conclusion

It is usually better to combine a folder full of JSON files into a single JSON file. The example above shows that the process frees up storage space. It also eases your data processing, since you no longer need to iterate over files in a folder. If you’re interested in this kind of data processing, don’t hesitate to visit my GitHub page. There you can find the full code and clone my repository if you want to try it.

A Data Science Enthusiast