What is Unity and what does it do?
ThinkData’s transformation engine, Unity, is a visually interactive component used to process, merge, and fix data sets, enabling people to transform raw data sets into ideal, customized data products. Although this kind of modelling will be familiar to anyone who works in the field of data science, these transformations needed to be visually represented so that they could be customized and performed by anyone, from the general public to data engineers.
Unity can be used to design an ideal schema that plugs into the raw data and automates the transformation of that data into your ideal product on a go-forward basis, allowing users to drive an internal standard, transform the data to meet the requirements of their models, and create custom enhancements that facilitate internal processes.
Building Ideal Data Products
Unity lets users aggregate and transform any number of data sets into a single ideal data product called a “schema” (read more on schemas in the definitions section). The data sets (“graphs”) that go into building a schema can be pulled from Namara, an FTP server, or a user’s own cloud deployment. This way, Unity lets users blend together data from multiple places. Schemas can be scheduled to run at regular intervals, ensuring the ideal data product stays current with the graphs that go into building it.
As the name suggests, Unity specializes in aggregating several data sets into a single product. But Unity is also a powerful tool for transforming or fixing data to meet an internal standard. String, math, and logic functions are all available to deploy inside Unity transformations (see “nodes” in the definitions section).
From Raw to Refined to Research
Unity is a critical tool in the Namara platform, falling in the middle of a data management lifecycle. Raw data brought onto Namara has already been standardized into common formats through Namara’s Ingestion engine. This data is then kept up-to-date and monitored for changes through the Puppeteer and Dataspec microservices. Unity gives you greater control of your data, providing a transformation layer on top of the standardized data ingested into Namara.
After you’ve completed your Unity build, you’ll be able to load the ideal schema back into Namara using the Import feature, integrate the data with whatever BI product you prefer, and analyze the data through Namara Analytics.
Creating an ideal data set
Step 1: Find your data
As you know, data can originate from anywhere. For many of us, the data we need is all over the place: our team’s cloud, the Statistics Canada website, or a folder somewhere on our home computer. The first step in solving the problem of data variety is getting it all into one place. For us, that place is Namara.
Since not all data can be imported into Namara, we also let Unity users plug into data from AWS, GCloud, or any other cloud deployment they have access to.
If you’re using data that you’ve downloaded onto your personal computer, we recommend importing it using Namara’s Import feature prior to bringing it onto Unity. Our ingestion pipeline will standardize the data automatically and make it much easier to work with during your Unity build.
Step 2: Pull data into Unity
Once you have tracked down all the data you want to use in your schema, it’s time to bring the data into Unity.
From Unity, select “new schema” to create the environment you’ll pull the data into. From there, you’ll be able to give your schema a title and description. This title and description will be used when you import the data into Namara later, so make sure you choose them well.
After submitting your data set title and description, you will be prompted to add a “graph” to your schema. As mentioned in the definitions section, graphs are the individual data sets that make up your ideal schema. You can name the graph whatever you like; that option is there to help you keep track of the raw data and it does not change the structure of the underlying graph.
Next, you will be prompted to select a reference type. As mentioned previously, this can be data from a cloud provider or Namara. For data that resides in the cloud, you will have to ask an administrator for your organization to set up the credentials that allow Unity to access your cloud securely. For Namara data, select Namara as the reference type, and then input the data set ID and any filter parameters you might want. For more information on the query service, please review our API docs.
Step 3: Generate schema attributes
After adding your first graph to your schema, you will be prompted to create your schema’s properties. These will be the columns in your final data set. If the columns in your ideal data set are the same as or similar to the columns in the graph you created, you may “inject” the graph properties from the graph you just created. Alternatively, you can create the properties one by one by selecting the “add field” button.
When you create properties, you can name the property and choose the property type. This is useful if you know your input graph has columns that obey certain type rules. Geometry columns, for example, should be type “geojson,” while any column that has a monetary value in it should be type “currency.” Please reach out to us if you would like more information on property types.
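As a rough illustration of what a schema's properties amount to, the sketch below models them in Python as ordered name and type pairs. The class names, the `add_field` and `inject` methods, the default type of "string", and the sample column names are all assumptions made for this example; they are not Unity's internal representation or API. Only the type names “geojson” and “currency” come from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class Property:
    """One column of the ideal data set (hypothetical model)."""
    name: str
    prop_type: str = "string"  # assumed default, not confirmed by Unity


@dataclass
class Schema:
    """The ideal data product: title, description, ordered properties."""
    title: str
    description: str
    properties: list = field(default_factory=list)

    def add_field(self, name, prop_type="string"):
        # Mirrors the "add field" button: one property at a time.
        self.properties.append(Property(name, prop_type))

    def inject(self, graph_columns):
        # Mirrors "inject": copy the graph's column names in as properties,
        # skipping any name the schema already has.
        existing = {p.name for p in self.properties}
        for column in graph_columns:
            if column not in existing:
                self.properties.append(Property(column))
                existing.add(column)


schema = Schema("Business licences", "Hypothetical example schema")
schema.inject(["business_name", "licence_fee"])
schema.add_field("location", "geojson")
# Retype a monetary column after injecting it.
schema.properties[1].prop_type = "currency"
```

Keeping the properties in a list rather than a dictionary preserves their order, which matters because the schema edit page lets you reorder them.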
Once you’ve added all the properties you want to have in your schema, you can select “submit” to be taken to the graph overview page. From there, you can add more graphs to your schema, jump into the graph editor, or return to the schema edit page to add, change, or reorder your schema properties.
Step 4: Transform your raw data
After adding all the graphs you want to your schema, you’ll be able to work with the individual graphs in the graph editor: a visually interactive component that lets you transform data in an intuitive and user-friendly way.
From the graph overview page, select the graph you want to edit. You will be taken to a view that looks a bit like a drafting table. This is where you will map the graphs to your ideal schema. On the left-hand side of the graph editor you will see the properties of your raw data. On the right-hand side of the graph editor are the schema properties you built in the last step.
For simple one-to-one connections between attributes (for example, business_name in the raw graph maps to business_name in the ideal schema) you can simply select “auto-map” in the toolbar. All properties with identical names will immediately be joined. If, however, the raw data contains an attribute called business_name and you’ve decided to name that property business_title in your schema, you will have to manually connect the edges. This can be done by selecting the graph edge and dragging an endpoint to the schema edge.
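The auto-map behaviour described here can be sketched in a few lines of Python. This is only an illustration of matching properties by identical name; the function name `auto_map` and the sample property lists are hypothetical, and Unity's own implementation is not documented here.

```python
def auto_map(graph_properties, schema_properties):
    """Return {graph_property: schema_property} for identically named pairs."""
    schema_names = set(schema_properties)
    return {name: name for name in graph_properties if name in schema_names}


# Hypothetical property lists for a raw graph and an ideal schema.
graph_properties = ["business_name", "licence_no", "ward"]
schema_properties = ["business_title", "licence_no", "ward"]

edges = auto_map(graph_properties, schema_properties)

# business_name has no identically named schema property, so auto-map
# leaves it unconnected and the edge has to be added by hand.
edges["business_name"] = "business_title"
```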
These are relatively simple joins between the raw data and the ideal schema. Unity also allows you to perform simple and complex transformations on the data from within the graph editor.
Consider a case where your raw data has multiple properties for postal code but you would prefer a single address property in your schema. By selecting “Add node” from the toolbar you will be able to choose from a suite of transformation tools. For the example above, you might want to choose the “concat” node. This is a node that lets you concatenate multiple properties or attributes into a single output.
After adding the node to the graph editor, you can select the node and modify its endpoint config to postal_code. Your output, in this scenario, would be simply
After applying these changes, you can exit out of the node editor, pull the proper endpoints into the concat node, and link the output to your ideal schema. After you’ve finished you should have something that looks like this:
If, conversely, your input data had a full address and you wanted to parse it into street, city, and postal code, you would be able to do that using any combination of nodes such as “Extract,” “Split,” and “If-then-else.” Below you’ll find an explanation of a few of the nodes currently available in the graph editor. We build more every week as we continue to build products and understand more about the problems people are trying to solve with data.
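To make the address-parsing case concrete, here is a minimal Python sketch of the same idea: extract one element with a regular expression, split the remainder, and branch on what was found. It assumes a comma-separated address with a Canadian-style postal code; the function name, the pattern, and the sample addresses are illustrative assumptions, not Unity node configuration.

```python
import re

# Canadian postal code pattern: letter-digit-letter, digit-letter-digit.
POSTAL_CODE = re.compile(r"[A-Za-z]\d[A-Za-z]\s?\d[A-Za-z]\d")


def parse_address(full_address):
    # "Extract": pull the postal code out with a regular expression.
    match = POSTAL_CODE.search(full_address)
    postal_code = match.group(0).upper() if match else None

    # Remove the postal code, then "Split" what is left on commas.
    remainder = POSTAL_CODE.sub("", full_address)
    parts = [p.strip() for p in remainder.split(",") if p.strip()]

    # "If-then-else": only fill the city when a second part is present.
    if len(parts) >= 2:
        street, city = parts[0], parts[1]
    else:
        street, city = (parts[0] if parts else None), None

    return {"street": street, "city": city, "postal_code": postal_code}


parsed = parse_address("123 Main St, Toronto, M5V 2T6")
```

Real address data is rarely this regular, so a production transformation would likely need more branches than this sketch shows.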
Step 5: Preview and Build
Once you’ve mapped and transformed all the graphs in your schema, you can select “Run” from the graph overview page. From there, all the data from the graphs you added will be pulled into the schema and transformed to your ideal output. When the run is successful, you can download the schema, or head back to Namara to add it to your catalogue using the Import feature.
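Conceptually, a run walks every graph, applies that graph's mapping, and emits rows shaped like the schema. The Python sketch below illustrates that idea only. It assumes that rows from different graphs are stacked into one output and that unmapped schema properties are left empty; the function name, the dictionary layout, and the sample data are hypothetical and do not describe Unity's actual run behaviour.

```python
def run_schema(schema_properties, graphs):
    """Combine rows from every graph into rows shaped like the schema."""
    output = []
    for graph in graphs:
        for row in graph["rows"]:
            shaped = {}
            for prop in schema_properties:
                # mapping: schema property -> column name in this graph
                source = graph["mapping"].get(prop)
                # Schema properties with no edge from this graph stay empty.
                shaped[prop] = row.get(source) if source else None
            output.append(shaped)
    return output


# Two hypothetical graphs with different column names.
graphs = [
    {
        "name": "city_licences",
        "mapping": {"business_title": "business_name", "ward": "ward"},
        "rows": [{"business_name": "Acme Bakery", "ward": "Ward 10"}],
    },
    {
        "name": "regional_licences",
        "mapping": {"business_title": "name"},
        "rows": [{"name": "Birch Cafe", "region": "East"}],
    },
]

product = run_schema(["business_title", "ward"], graphs)
```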
Definitions
Since we’re working with layers of data from many sources, some of the language that’s used might be unfamiliar. Here we’ve included the definitions for a number of Unity’s features.
Schema: This is the ideal product that you’re building. A schema can refer to either the attributes of your ideal data product or the product itself.
Graph: A graph, in the context of Unity, is one of the data sets being used to build your schema.
Node: Nodes are interactive components in the Unity graph editor that let users perform transformations on the raw data. Descriptions of some of the nodes available on Unity are listed below.
Endpoint: An endpoint is the point at which the property connects to another property or a node. These endpoints are connected using an “edge”.
Edge: An edge is the connection between two endpoints.
Property: Properties are the attributes, or column titles, for a data set. If you are using a data set with columns named “Title, Business Name, and Contact Person,” the properties would be “Title,” “Business Name,” and “Contact Person.”
Property Type: Describes the format or type of the data output. A property with type “date” would format the output to correspond to a specific date format, whereas property type “boolean” would expect something true or false.
Prepend/Append: Add a value or string to the beginning or end of each data point.
Concat: Concatenate, or merge, several properties into a single endpoint.
Contains/Starts With: Specify a value using plain text or regular expressions. When the input contains (or starts with) that value, the output of the node will be TRUE.
Copy: Create an exact copy of an endpoint.
Upper Case/Lower Case: Convert the input to title case, upper case, or lower case.
Split: Split the output based on a user-specified regular expression.
Static: Provides a constant value for every row of data within the input data set.
Trim: Trims leading and trailing whitespace in a cell of data.
Average: Calculates a mean average between two or more inputs.
Divide/Subtract/Sum: Performs basic mathematical operations on two or more inputs.
Scale: Scale any specified result. To lower results, provide values less than 1. For example, to halve a result, set scale = 0.5.
Equals: Provides a boolean output based on two or more inputs.
Greater Than/Lesser Than: Provides a boolean output based on a specified conditional integer.
Round: Rounds any output to a specified number of decimal places.
Square Root: Provides the square root for all input values.
Extract: The extract node lets you pull certain elements out of a cell. If a phone number exists in the description field, for example, you could configure the extract node to pull only the phone number and nothing else using a regular expression. The extract node can also be used as a “find and replace” function.
If Then Else: A conditional node that allows users to specify two outcomes based on a configurable prerequisite. For example, if your raw data specifies both a physical address and a mailing address for a business, you could use an if-then-else node to output the physical address when it is present, and output the mailing address when it is not.
Matches: Outputs a boolean result based on whether the input matches a specified result.
Parse JSON/Parse JSON array: This node lets you extract elements of a JSON or JSON array based on specific keys.
MD5: Create a unique identifier based on the input of another column in the raw data set.
Namara: Pull in attributes from another data set on Namara based on a matching attribute in the raw data. Example: raw data describes the location of all watermain breaks by ward in the city of Toronto but does not provide geospatial coordinates for the ward boundaries. Using the Namara node you can pull in a separate data set that provides the polygonal geometry of Toronto’s wards, matching on the ward name. For an example of how this is done, review this blog post.
Script: For developers who want a bit of added control in Unity, we’ve included a script node that allows you to nest scripts inside the graph editor. To enable lean, agile programming, this node is configured for Groovy language. Read more about Groovy syntax here.
Geocode: Input address features (street address, city, postal code, etc) to gather a response from a geocoding API. This node can be configured to pull from any geocoding service.
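For readers who think in code, a few of the nodes above have close analogues in ordinary Python. The functions below are rough equivalents written for illustration, not Unity's implementations. The function names, the `separator` parameter on `concat`, and the sample row are assumptions introduced for this sketch.

```python
import hashlib


def concat(*values, separator=" "):
    """Concat: merge several inputs into one output, skipping empty ones."""
    return separator.join(str(v) for v in values if v not in (None, ""))


def scale(value, factor):
    """Scale: multiply a result; factors below 1 lower it."""
    return value * factor


def if_then_else(condition, then_value, else_value):
    """If Then Else: choose one of two outcomes."""
    return then_value if condition else else_value


def md5_id(value):
    """MD5: derive an identifier from another column's value."""
    return hashlib.md5(str(value).encode("utf-8")).hexdigest()


# Hypothetical row: the physical address is missing.
row = {"physical_address": "", "mailing_address": "PO Box 12", "fee": 80}

# Use the physical address when present, the mailing address otherwise.
address = if_then_else(
    row["physical_address"], row["physical_address"], row["mailing_address"]
)
half_fee = scale(row["fee"], 0.5)  # scale = 0.5 halves the result
row_id = md5_id(address)
full = concat("123 Main St", "Toronto", "M5V 2T6", separator=", ")
```

Note that an MD5 hash gives the same identifier for the same input, so it is only unique across rows when the source column is.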
Get In Touch
If you:
- are looking for clarifications on the instructions;
- notice any mistakes in our documentation; or
- would like to send us any feedback.
We would love to hear from you!