In this article, I will show you an example of JSON data processing: combining a folder of JSON files, each containing one observation per file, into a single JSON file. This kind of processing can significantly shrink your data size and simplify its structure, making the data easier for others to use.
Imagine you received a folder of movie datasets, all in JSON format. You need to run an ETL process on those datasets, i.e. clean them and store them in a data warehouse. But then you find that the folder is quite large.
If you keep the data in this form and apply the usual pandas data processing in Python, you are likely to get an out-of-memory error: the code exhausts your RAM while executing. Fortunately, you can shrink the dataset by combining those per-movie JSON files into a single movies JSON file. A single JSON file also acts as a data warehouse for reproducible data analysis.
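To see why the folder form is a problem, here is a sketch of the kind of naive loading loop that can trigger the error (this is an illustration of the failure mode, not the author's exact code; the function name and folder layout are assumptions):

```python
import json
import os

import pandas as pd


def load_all_naively(source_dir):
    """Parse every per-movie JSON file and hold everything in memory at once."""
    frames = []
    for root, _dirs, files in os.walk(source_dir):
        for name in files:
            if name.endswith(".json"):
                with open(os.path.join(root, name), encoding="utf-8") as f:
                    # One single-row DataFrame per file, all fields kept.
                    frames.append(pd.json_normalize(json.load(f)))
    # With hundreds of thousands of files, materializing all frames at once
    # is where the memory typically runs out.
    return pd.concat(frames, ignore_index=True)
```

Every file is fully parsed and every field is kept, so memory usage grows with the whole folder rather than with the fields you actually need.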
This is the form of a movie JSON file in the folder.
We can see that it is just one dictionary on a single line, representing one movie. Some of the fields are relevant for data analysis; some are not. We want to combine all movie observations into a single JSON file by picking the necessary fields of each movie and writing them as one line, then writing the next movie on the next line, and so on. Before we dump the data, though, we need to transform some fields such as spoken_language. The whole extract, transform, and load process can be done in less than 40 lines of code (maybe fewer — comment if you can trim it!). The code is as follows.
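A minimal sketch of that process follows. The folder path, the list of kept fields, and the function name combine_movie_jsons are assumptions for illustration, not the exact original code:

```python
import json
import os

# Fields assumed relevant for the analysis; adjust to your own dataset.
KEEP_FIELDS = ["id", "title", "release_date"]


def combine_movie_jsons(source_dir, output_file):
    """Walk source_dir, transform each per-movie JSON file, and write
    one movie per line into a single combined JSON file."""
    data_list = []

    # Extract: os.walk collects every JSON file under the folder.
    for root, _dirs, files in os.walk(source_dir):
        for name in sorted(files):
            if not name.endswith(".json"):
                continue
            with open(os.path.join(root, name), encoding="utf-8") as f:
                movie = json.load(f)

            # Transform: keep only the necessary fields and flatten
            # spoken_language from a list of dicts to a list of names.
            record = {k: movie.get(k) for k in KEEP_FIELDS}
            record["spoken_language"] = [
                lang.get("name") for lang in movie.get("spoken_language", [])
            ]
            data_list.append(record)

    # Load: dump each transformed record as one line (line-delimited JSON).
    with open(output_file, "w", encoding="utf-8") as f:
        for record in data_list:
            f.write(json.dumps(record) + "\n")

    return len(data_list)
```

Writing one record per line (rather than one big JSON array) keeps each movie on its own line, which matches the layout described above and lets downstream tools read the file incrementally.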
Here I use os.walk to collect the file names from the movies folder. Then, iterating over the stored file names, the transformation runs on the spoken_language field, and each transformed record is appended to a predefined empty list called data_list. After the iteration is done, json.dump writes data_list into a single JSON file, resulting in this output:
With movies.json created, we can see that the file size decreased from 452 MB to around 145 MB: about 67% of the space is freed without losing any of the information we need. Besides, we now have a more centralized data source that is easier to process.
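Because each movie sits on its own line, the combined file can be read back with pandas in a single call. This sketch assumes the line-delimited layout produced above (the function name and path are illustrative):

```python
import pandas as pd


def load_movies(path="movies.json"):
    """Read the combined line-delimited JSON file into a DataFrame."""
    # lines=True tells pandas that each line is one JSON record.
    return pd.read_json(path, lines=True)
```

From here the whole dataset is one DataFrame, so the usual pandas operations apply without any per-file iteration.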
It is better to combine a folder full of JSON files into a single JSON file. The example above shows that the process saves a significant amount of space. It also simplifies your data processing, since you no longer need to iterate over files in a folder. If you're interested in this kind of data processing, don't hesitate to visit my GitHub page, where you can find the full code and clone the repository if you want to try it yourself.