How to prepare data for analysis?

Data preparation is perhaps the most important step in the data analysis process, because it limits the errors and inaccuracies that can arise during data processing. It is also usually the longest part of a data analysis project. Efficient and precise decisions must be based on reliable data.

What is data preparation?

Data preparation is the process of cleaning and transforming raw data before processing and analysis. The ultimate goal of data preparation is to improve the quality, usability and accessibility of the data before making it available to people and data analysis systems.

Preparing data is often time-consuming, but it is a very important step in transforming data into reliable information that is ready to be used for decision-making. It can include a whole range of processes, but in this post we will focus on data integration, data profiling, data cleaning and data governance.

Questions to ask when preparing data

1. Where is your data?

The first step is to identify your data sources and know where they are physically stored. Depending on the company, data can be stored in different places and in different storage systems. The most commonly used storage solutions are:

  • Relational databases (Oracle, MySQL, SQL Server, PostgreSQL,…)
  • NoSQL databases (MongoDB, Cassandra, HBase,…)
  • Structured files (Excel, CSV, QVD,…)
  • Semi-structured files like XML
  • Web services (REST or SOAP)
  • Hadoop data lakes

Before deciding which data sources to use, you will also need to know what permissions are required to access the data, whether external data is reliable or requires verification, and what level of granularity you need.

2. Do you need to change the data?

Depending on the quality of the data that you process, some data may require manual transformation or manipulation in order to make it reliable.

Examples of when you need to modify the data:

  • A dataset uses different formats for the same information.
  • The data is inconsistent or contains duplicate information.
  • You need to group data in new ways.
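
The three cases above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a complete cleaning routine: the sample records, the field names (order_id, date, region, amount) and the list of accepted date formats are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw records: mixed date formats and one duplicated row.
raw_orders = [
    {"order_id": "1", "date": "2023-01-15", "region": "North", "amount": "100.0"},
    {"order_id": "2", "date": "15/02/2023", "region": "South", "amount": "50.5"},
    {"order_id": "2", "date": "15/02/2023", "region": "South", "amount": "50.5"},
    {"order_id": "3", "date": "2023-02-20", "region": "North", "amount": "25.0"},
]

# Assumed set of formats present in the source; extend as needed.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def normalize_date(value):
    """Return the date as an ISO string, trying each known format in turn."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")


def prepare(records):
    """Normalize dates, drop repeated order_ids, and convert types."""
    seen = set()
    cleaned = []
    for row in records:
        if row["order_id"] in seen:
            continue  # keep only the first occurrence of each order
        seen.add(row["order_id"])
        cleaned.append({
            "order_id": int(row["order_id"]),
            "date": normalize_date(row["date"]),
            "region": row["region"],
            "amount": float(row["amount"]),
        })
    return cleaned


def total_by_region(records):
    """Group the cleaned records in a new way: total amount per region."""
    totals = defaultdict(float)
    for row in records:
        totals[row["region"]] += row["amount"]
    return dict(totals)


orders = prepare(raw_orders)
totals = total_by_region(orders)
```

Raising an error on an unknown date format, rather than guessing, is a deliberate choice here: a silent wrong guess is harder to detect later than a failed load.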

Here are the questions you should ask yourself regarding data quality:

  • For each data source, is it complete, precise and up to date?
  • Can this data answer my questions?
  • What should I do to clean up the data? Should I manually modify some values or implement a more systematic approach?
  • Can my data preparation tool connect to all of my data sources?
  • Should I set up a process for modifying the data in its original location (production) or make the changes in a data preparation process?
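
The first question, whether a source is complete and up to date, can be partly answered with a quick profile of the data. The sketch below is only one possible approach: the records, the field names and the cutoff date are hypothetical, and it treats None and the empty string as missing values.

```python
from datetime import date

# Hypothetical customer records; None or "" marks a missing value.
customers = [
    {"customer_ID": 1, "email": "a@example.com", "updated": date(2024, 5, 1)},
    {"customer_ID": 2, "email": "", "updated": date(2021, 1, 10)},
    {"customer_ID": 3, "email": None, "updated": date(2024, 4, 20)},
]


def completeness(records, field):
    """Share of records in which `field` holds a non-empty value."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


def stale_records(records, field, cutoff):
    """Records whose `field` date is older than `cutoff`."""
    return [r for r in records if r[field] < cutoff]


email_ratio = completeness(customers, "email")
stale = stale_records(customers, "updated", date(2023, 1, 1))
```

A low completeness ratio or a long list of stale records is a signal to look at the source itself before spending time on manual corrections.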

If you are using heterogeneous data sources, make sure that the fields used to link them contain the same data type in the same format. For example, if the field “customer_ID” in the “Customer” table holds the customer number as an Integer, then a CSV file that contains customer information must also contain a field that holds the customer number in Integer format.
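
As an illustration of this check, the sketch below reads a CSV source and separates the rows whose link field can be converted to an integer from those that cannot. The CSV content and the load_customers helper are hypothetical; only the field name customer_ID comes from the example above.

```python
import csv
import io

# Hypothetical CSV export; in practice this would be a file on disk.
csv_text = """customer_ID,name
1001,Alice
1002,Bob
A-1003,Carol
"""


def load_customers(handle, key="customer_ID"):
    """Split rows into those whose key parses as an integer and the rest."""
    valid, rejected = [], []
    for row in csv.DictReader(handle):
        try:
            row[key] = int(row[key])
        except (TypeError, ValueError):
            rejected.append(row)  # keep the raw row for later inspection
        else:
            valid.append(row)
    return valid, rejected


valid, rejected = load_customers(io.StringIO(csv_text))
```

Keeping the rejected rows instead of discarding them makes it possible to report back to the owner of the source which customer numbers do not match the expected format.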

Also think about the evolution of your data model.

  • How easy is it to add data sources and make changes to the model later?
  • Will external data sources be available in the future with the same structure?
  • Can I simplify my model without affecting performance?

3. How to import the data?

To import your data, you can either query the production databases directly (not recommended) or load the data into a secondary environment before processing it, so that your requests do not overload the production environment. The questions you need to ask yourself are:

  • How will importing data affect my production environment?
  • How often should I import the data? When should I start loading?
  • How many intermediate environments should I set up?
  • Does the server on which I move my data have the software and hardware necessary to manage the amounts of data I’m dealing with?
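
The idea of loading data into a secondary environment can be sketched as follows. This is a simplified illustration, assuming two in-memory SQLite databases that stand in for the production and staging environments; the table, its columns and the batch size are hypothetical, and a real setup would use the drivers of the actual database systems.

```python
import sqlite3

BATCH_SIZE = 2  # deliberately small for the demonstration

# Stand-in for the production database (hypothetical schema and data).
production = sqlite3.connect(":memory:")
production.execute(
    "CREATE TABLE customer (customer_ID INTEGER PRIMARY KEY, name TEXT)"
)
production.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(1, "Alice"), (2, "Bob"), (3, "Carol")],
)

# Stand-in for the secondary (staging) environment.
staging = sqlite3.connect(":memory:")
staging.execute(
    "CREATE TABLE customer (customer_ID INTEGER PRIMARY KEY, name TEXT)"
)


def copy_customers(source, target, batch_size=BATCH_SIZE):
    """Read the source table once, in batches, and write it to the target."""
    cursor = source.execute(
        "SELECT customer_ID, name FROM customer ORDER BY customer_ID"
    )
    copied = 0
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        target.executemany("INSERT INTO customer VALUES (?, ?)", batch)
        copied += len(batch)
    target.commit()
    return copied


copied = copy_customers(production, staging)
staging_count = staging.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
```

Production is read once with a single simple query, and every later transformation or analysis query runs against the staging copy only.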

4. How to check the results?

At the end of the data preparation process, you must ensure that the final result is accurate and that no mistakes were made during processing. To verify the data, make sure that:

  • The results make sense at a general level.
  • The metrics you see match what you already know about the business.
  • The number of records in the period is not excessively different from that of the previous period.
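
The last check can be automated with a simple comparison of record counts. The sketch below is one possible way to do it; the function names and the 30% tolerance are assumptions to be adapted to what counts as "excessively different" for your business.

```python
def relative_change(current, previous):
    """Relative change in record count between two periods."""
    if previous == 0:
        raise ValueError("previous period has no records to compare against")
    return (current - previous) / previous


def count_is_plausible(current, previous, tolerance=0.30):
    """True if the record count moved by no more than `tolerance` (a fraction)."""
    return abs(relative_change(current, previous)) <= tolerance


# Hypothetical counts for two consecutive periods.
normal_month = count_is_plausible(1150, 1000)   # 15% more records
suspect_month = count_is_plausible(400, 1000)   # 60% fewer records
```

A failed check does not prove that the preparation is wrong, since the business may really have changed, but it tells you which loads deserve a closer look.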

Data preparation was initially reserved for IT teams, because the tools required technical knowledge. Today, data preparation software has improved and allows business users to build their preparations, independently or collaboratively, with little or no technical knowledge.