Sending Multiple Files to a Data Generator in Python


In the current era of deep learning, data is the biggest problem. If the dataset is small, training a deep neural network is obviously hard. But even when the dataset is huge, it is still a problem for anyone who doesn’t have enough computing resources.

The problem that comes with a huge amount of data is that there are rarely enough resources to load it all into memory. Most of us train our models in Colab, which initially gives us only 12 GB of RAM (sometimes 25/35 GB). Recently, I faced exactly this kind of problem: just loading half of my data was crashing the runtime. Most of us know the obvious solution: use a Python generator.

A Python generator is a function that does not load the whole input into memory; it produces values only as they are requested. For deep learning, we never need all the data at a single moment, only a batch_size worth of rows at a time. A generator returns an iterator that hands us the data as we ask for it.
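
To make that concrete, here is a minimal example (just an illustration, not code from my project) of a generator that yields one small batch at a time instead of building the full list in memory:

def batch_indices(total, batch_size):
  # yields small lists lazily instead of building one big list in memory
  batch = []
  for i in range(total):
    batch.append(i)
    if len(batch) == batch_size:
      yield batch
      batch = []
  if batch:
    # the last batch may be smaller than batch_size
    yield batch

for batch in batch_indices(10, batch_size=4):
  print(batch)  # [0, 1, 2, 3] -> [4, 5, 6, 7] -> [8, 9]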

Let’s say you have a CSV file that is 80 GB in size, so you can’t load the whole dataset into memory. Python’s built-in csv module can open the file and read it line by line. We will use it to build our generator:

import csv

def data_generator():
  with open('nlp.csv') as fp:
    reader = csv.reader(fp)
    for row in reader:
      yield row

for row in data_generator():
  print(row)

This code block prints each row of the CSV file. But unlike the traditional approach, it doesn’t load the whole file into memory: it reads one line at a time and hands that line to us for use.
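
If you need batches of rows instead of single rows, one simple way (a sketch, assuming the data_generator above) is to wrap it with itertools.islice so that only batch_size rows are pulled from the file at a time:

from itertools import islice

def batch_generator(batch_size):
  rows = data_generator()  # the line-by-line generator above
  while True:
    batch = list(islice(rows, batch_size))
    if not batch:  # the file is exhausted
      break
    yield batch

for batch in batch_generator(32):
  print(len(batch))  # at most 32 rows per batch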

I know some readers are bored at this point. Everyone knows about Python generators, but that’s not what this article is about. The solution above works for a single file. What will you do when you have three, four, or more CSV files?

Use a Python generator again?

That was the problem I faced earlier. With my limited Python experience, I implemented a few functions that I will share with you today:

import pandas as pd

# BASE_PATH, batch_size and NUMBER_OF_ROWS are constants defined elsewhere in my project

def get_all(*args, chunksize):
  pds = []
  args = args[0]
  for arg in args:
    pds.append(pd.read_csv(f'{BASE_PATH}/{arg}', chunksize=chunksize))
  return pds
  
def merge_all(*args):
  tmp = None
  args = args[0]
  for arg in args:
    if tmp is None:
      tmp = arg
    else:
      tmp = pd.merge(tmp, arg, how='inner', on=['id'])
  return tmp

def data_generator():
  total_row = NUMBER_OF_ROWS
  files = [
    'a.csv',
    'b.csv',
    'c.csv',
    'd.csv',
  ]

  pds = get_all(files, chunksize=batch_size)
  cnt = 0

  while True:
    data_frames = []
    for reader in pds:
      data_frames.append(reader.get_chunk())

    cnt += batch_size
    
    merged = merge_all(data_frames)
    x = merged.iloc[:, 1:].to_numpy()
    y = merged.iloc[:, 0].to_numpy()

    if cnt >= total_row:
      pds = get_all(files, chunksize=batch_size)
      cnt = 0

    yield x, y

This is the code I used in my recent research to merge four files (47 GB in total) and pull out only batch_size rows at a time (16, 32, 64, 128, …). Don’t be overwhelmed by the size of the code; I will explain it step by step.

def get_all(*args, chunksize):
  pds = []
  args = args[0]
  for arg in args:
    pds.append(pd.read_csv(f'{BASE_PATH}/{arg}', chunksize=chunksize))
  return pds

In this part of the code, I passed in all the file names along with the batch_size. I used pandas (pd) to read the CSVs, but unlike the traditional way, I passed the chunksize parameter to read_csv. It then returns a reader object instead of a data frame. I appended all the reader objects to a list and returned it, to be used in the later parts of the code.
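
For reference, this is roughly how a single chunked reader behaves on its own (the file name and chunk size here are just examples):

import pandas as pd

# read_csv with chunksize returns a TextFileReader, not a DataFrame
reader = pd.read_csv('a.csv', chunksize=32)

chunk = reader.get_chunk()  # a DataFrame with (up to) 32 rows
print(chunk.shape)

# the reader object is also iterable, one chunk per iteration
for chunk in pd.read_csv('a.csv', chunksize=32):
  print(chunk.shape)
  break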

def merge_all(*args):
  tmp = None
  args = args[0]
  for arg in args:
    if tmp is None:
      tmp = arg
    else:
      tmp = pd.merge(tmp, arg, how='inner', on=['id'])
  return tmp

In this part, all the data frame objects are passed in. In my case, I had to merge them into a single data frame, using the id column as the key. Pandas already has a built-in function for that (pd.merge): you give it the left and right data frames, the join type, and the column(s) to join on.
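
Here is a tiny, self-contained example of what that merge does for two frames (toy data, not from my dataset):

import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'f1': [0.1, 0.2, 0.3]})
right = pd.DataFrame({'id': [2, 3, 4], 'f2': [10, 20, 30]})

# an inner join keeps only the ids present in both frames
merged = pd.merge(left, right, how='inner', on=['id'])
print(merged)
#    id   f1  f2
# 0   2  0.2  10
# 1   3  0.3  20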

def data_generator():
  total_row = NUMBER_OF_ROWS
  files = [
    'a.csv',
    'b.csv',
    'c.csv',
    'd.csv',
  ]

  pds = get_all(files, chunksize=batch_size)
  cnt = 0

  while True:
    data_frames = []
    for reader in pds:
      data_frames.append(reader.get_chunk())

    cnt += batch_size
    
    merged = merge_all(data_frames)
    x = merged.iloc[:, 1:].to_numpy()
    y = merged.iloc[:, 0].to_numpy()

    if cnt >= total_row:
      pds = get_all(files, chunksize=batch_size)
      cnt = 0

    yield x, y

Now, this is the main generator function that I used in my model’s fit call. First, I call get_all to get all the reader objects. After that, the while loop runs forever: how much data is consumed depends on the model, not on the dataset size or the generator function, so the generator just keeps serving the data in a circular fashion.

reader.get_chunk() returns a data frame with chunksize rows. I collect one chunk from every reader object and merge them using our previous merge_all function. Then I split the result into features and target and return them with the yield keyword at the end of the code.

cnt += batch_size

if cnt >= total_row:
  pds = get_all(files, chunksize=batch_size)
  cnt = 0

What have I done in this part of the code? A deep learning model does not train on the data just once; it reads the data repeatedly, traditionally once per epoch. If we feed the model a numpy array, it manages that repetition on its own, but since we are using a generator, we have to handle it manually. So on the first line I track how many rows I have already read, and in the if condition I check whether I have reached the end of the data. If so, I simply recreate all the reader objects and reset cnt to 0, so the generator starts serving data from the beginning of the files again.
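
For completeness, this is roughly how such a generator plugs into a Keras-style fit call; the model object and the epoch count here are assumptions, not part of my original code:

# NUMBER_OF_ROWS and batch_size are the same constants used in data_generator
steps_per_epoch = NUMBER_OF_ROWS // batch_size  # chunks that make up one full pass

# model is assumed to be an already-compiled tf.keras model
model.fit(
  data_generator(),
  steps_per_epoch=steps_per_epoch,
  epochs=10,
)

With steps_per_epoch set, Keras knows where one "epoch" of the infinite generator ends, which lines up with the reset logic above.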

I suffered a lot with this topic and came up with this solution after a lot of Google searches. I hope this article will save some of your time and internet bandwidth.
