Data Tool

This tool transforms CSV data using a range of techniques: you can filter, group, sort, and clean data, build pivot tables, pseudonymize data, generate hash IDs, cluster similar values, and drop columns. It is useful for quick data wrangling and curation.

Transformation Types Documentation

Filter Rows

Filter rows based on column values containing specific text.

Column: Select the column to filter on.

Value Contains: Enter text to filter for (case-insensitive).

Example: Filter a "Country" column for rows containing "united" to find both "United States" and "United Kingdom".
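
The filter described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation; the function name filter_rows and the sample data are assumptions made for the example.

```python
import csv
import io

def filter_rows(rows, column, needle):
    """Keep rows whose `column` value contains `needle`, ignoring case."""
    needle = needle.casefold()
    # Missing values are treated as empty strings so they never match.
    return [row for row in rows if needle in (row.get(column) or "").casefold()]

# Hypothetical sample data standing in for an uploaded CSV file.
SAMPLE = "Name,Country\nAda,United Kingdom\nBob,United States\nChen,China\n"
rows = list(csv.DictReader(io.StringIO(SAMPLE)))

matches = filter_rows(rows, "Country", "united")
print([row["Country"] for row in matches])
```

Because both sides are case-folded before comparison, "united", "United", and "UNITED" all select the same rows.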

Group & Aggregate

Group data by a column and perform calculations on another column.

Group By: Select a column to group records by (e.g., Category, Country).

Aggregate Column: Select a column (usually numeric) to perform calculations on.

Function: Choose an aggregation function:

Sum: Total of all values
Average: Mean of values
Minimum: Smallest value
Maximum: Largest value
Count: Number of records

Example: Group sales data by "Region" and sum the "Revenue" column to see total revenue by region.
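
As a rough sketch of how grouping and aggregation could work, assuming the aggregate column holds numeric strings: the function name group_aggregate, the lowercase function keys, and the sample rows are hypothetical, not taken from the tool.

```python
from collections import defaultdict
from statistics import mean

# Maps each aggregation function listed above to a Python callable.
AGGREGATORS = {
    "sum": sum,
    "average": mean,
    "minimum": min,
    "maximum": max,
    "count": len,
}

def group_aggregate(rows, group_by, agg_column, function):
    """Group rows by `group_by` and aggregate `agg_column` within each group."""
    groups = defaultdict(list)
    for row in rows:
        # Assumption: every value in the aggregate column parses as a number.
        groups[row[group_by]].append(float(row[agg_column]))
    return {key: AGGREGATORS[function](values) for key, values in groups.items()}

# Hypothetical sales data.
sales = [
    {"Region": "North", "Revenue": "100"},
    {"Region": "South", "Revenue": "250"},
    {"Region": "North", "Revenue": "50"},
]

totals = group_aggregate(sales, "Region", "Revenue", "sum")
print(totals)
```

Passing "count" instead of "sum" returns the number of records per region rather than a total.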

Sort Data

Order the dataset by values in a specific column.

Sort By: Select the column to sort on.

Direction: Choose ascending (A→Z, 1→9) or descending (Z→A, 9→1) order.

Example: Sort a product list by "Price" in descending order to see most expensive items first.
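
A sort like this might be sketched as follows. The sketch assumes that values which look numeric should be compared as numbers rather than as text; whether the tool does this is not stated here. The name sort_rows and the sample products are hypothetical.

```python
def sort_rows(rows, sort_by, descending=False):
    """Sort rows by one column, comparing numeric-looking values as numbers."""
    def key(row):
        value = row[sort_by]
        try:
            # In ascending order, numbers sort before text.
            return (0, float(value), "")
        except ValueError:
            # Non-numeric values are compared as case-insensitive text.
            return (1, 0.0, value.casefold())
    return sorted(rows, key=key, reverse=descending)

# Hypothetical product list; CSV values arrive as strings.
products = [
    {"Product": "Cable", "Price": "9.99"},
    {"Product": "Laptop", "Price": "120"},
    {"Product": "Keyboard", "Price": "35"},
]

by_price = sort_rows(products, "Price", descending=True)
print([row["Product"] for row in by_price])
```

Comparing the raw strings instead would place "9.99" above "120", which is why the key converts to numbers first.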

Clean Data

Automatically clean and prepare data for analysis.

Actions performed:

Remove rows with missing values
Convert numeric strings to numbers
Trim whitespace from text fields

Example: Clean survey data to remove incomplete responses and ensure numeric fields are properly formatted.
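
The three actions listed above could be combined roughly like this. It is a sketch under the assumption that a "missing value" means an empty or whitespace-only field; the names clean_rows and to_number and the survey rows are made up for illustration.

```python
def to_number(text):
    """Convert a numeric string to int or float; return other text unchanged."""
    try:
        return int(text)
    except ValueError:
        pass
    try:
        return float(text)
    except ValueError:
        return text

def clean_rows(rows):
    """Trim whitespace, drop rows with missing values, convert numeric strings."""
    cleaned = []
    for row in rows:
        trimmed = {key: (value or "").strip() for key, value in row.items()}
        # Assumption: an empty field after trimming counts as missing.
        if any(value == "" for value in trimmed.values()):
            continue
        cleaned.append({key: to_number(value) for key, value in trimmed.items()})
    return cleaned

# Hypothetical survey responses.
survey = [
    {"Name": "  Ada ", "Age": "36", "Score": "4.5"},
    {"Name": "Bob", "Age": "", "Score": "3"},
    {"Name": "Chen", "Age": " 41", "Score": "n/a"},
]

cleaned = clean_rows(survey)
print(cleaned)
```

Note that a placeholder such as "n/a" is kept as text in this sketch; only truly empty fields cause a row to be removed.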

Pivot Table

Create a cross-tabulation of data similar to Excel pivot tables.

Row Labels: Select a column for the row dimension.

Column Labels: Select a column for the column dimension.

Example: Create a table showing product sales (counts) by region and category, with regions as rows and categories as columns.
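
A count-based cross-tabulation like the one in the example can be sketched with a Counter. The function name pivot_counts and the sample rows are assumptions for illustration, and the nested-dictionary output is just one possible representation of the table.

```python
from collections import Counter

def pivot_counts(rows, row_label, column_label):
    """Count records for every combination of row label and column label."""
    counts = Counter((row[row_label], row[column_label]) for row in rows)
    row_keys = sorted({r for r, _ in counts})
    col_keys = sorted({c for _, c in counts})
    # Combinations that never occur show up as 0 rather than being omitted.
    return {r: {c: counts[(r, c)] for c in col_keys} for r in row_keys}

# Hypothetical sales records.
sales = [
    {"Region": "East", "Category": "Books"},
    {"Region": "East", "Category": "Toys"},
    {"Region": "West", "Category": "Books"},
    {"Region": "East", "Category": "Books"},
]

table = pivot_counts(sales, "Region", "Category")
for region, cells in table.items():
    print(region, cells)
```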

Pseudonymize Data

Replace identifying information with fictional data while maintaining consistency.

Pseudonymize Columns: Select columns containing sensitive data to replace with fictional values.

Type Selection: Choose the appropriate type for each column:

Full Name: Replaces with fictional full names
First Name: Replaces with fictional first names
Last Name: Replaces with fictional last names
Username: Replaces with fictional usernames

Remove Columns: Completely remove columns that shouldn't be included in the output.

Mapping: Maintains consistency by always replacing a specific value with the same pseudonym.

Example: Pseudonymize customer data for sharing with analysts while protecting privacy.
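
The consistent mapping described above might look something like the sketch below. Everything in it is illustrative: the small name pools, the class Pseudonymizer, the type keys such as "full_name", and the seed parameter are assumptions, not the tool's real internals.

```python
import random

# Hypothetical pools of fictional names; a real tool would use far larger ones.
FIRST_NAMES = ["Alex", "Robin", "Sam", "Morgan", "Casey"]
LAST_NAMES = ["Rivers", "Stone", "Fields", "Brook", "Hill"]

class Pseudonymizer:
    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self._mapping = {}  # (type, original value) -> pseudonym

    def _generate(self, kind):
        first = self._rng.choice(FIRST_NAMES)
        last = self._rng.choice(LAST_NAMES)
        options = {
            "full_name": f"{first} {last}",
            "first_name": first,
            "last_name": last,
            "username": f"{first.lower()}{self._rng.randrange(100, 1000)}",
        }
        return options[kind]  # raises KeyError for an unknown type

    def replace(self, kind, original):
        """Return the same pseudonym every time the same value is seen."""
        key = (kind, original)
        if key not in self._mapping:
            self._mapping[key] = self._generate(kind)
        return self._mapping[key]

def pseudonymize_rows(rows, column_kinds, remove_columns=(), seed=0):
    pseudonymizer = Pseudonymizer(seed)
    result = []
    for row in rows:
        new_row = {}
        for column, value in row.items():
            if column in remove_columns:
                continue  # removed columns never reach the output
            kind = column_kinds.get(column)
            new_row[column] = pseudonymizer.replace(kind, value) if kind else value
        result.append(new_row)
    return result

# Hypothetical customer data.
customers = [
    {"Customer": "Jane Doe", "Email": "jane@example.com", "Order": "A1"},
    {"Customer": "John Roe", "Email": "john@example.com", "Order": "A2"},
    {"Customer": "Jane Doe", "Email": "jane@example.com", "Order": "A3"},
]

shared = pseudonymize_rows(customers, {"Customer": "full_name"}, remove_columns=("Email",))
print(shared)
```

With pools this small, two different people can receive the same pseudonym; a production tool would need a larger pool or a collision check.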

Generate Hash IDs

Create a new column with hash identifiers based on values from selected columns.

ID Column Name: Name for the new column containing hash IDs.

Auto-generate name: Creates a descriptive column name based on selected columns.

Hash Algorithm: Choose between a simple and a more complex hash algorithm.

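Salt: Optionally add a secret key so that the hashes cannot be reproduced by anyone who does not know the salt. The same salt and the same input values always produce the same hash.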

Columns for Hash: Select which columns to include when generating the hash ID.

Example: Create persistent anonymous identifiers from demographic data, or generate unique IDs from multiple fields.
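
One way to build such IDs is sketched below. The tool's actual algorithms are not named here, so SHA-256 (with HMAC when a salt is given) is purely an assumption, as are the function names, the "hash_" prefix for auto-generated column names, and the 16-character truncation.

```python
import hashlib
import hmac

def hash_id(row, columns, salt="", length=16):
    """Derive a hex ID from the selected column values of one row."""
    # A separator character keeps ("ab", "c") distinct from ("a", "bc").
    message = "\x1f".join(str(row[column]) for column in columns).encode("utf-8")
    if salt:
        # Assumption: the salt is used as an HMAC key.
        digest = hmac.new(salt.encode("utf-8"), message, hashlib.sha256).hexdigest()
    else:
        digest = hashlib.sha256(message).hexdigest()
    return digest[:length]

def add_hash_ids(rows, columns, id_column=None, salt=""):
    """Return copies of the rows with a new hash ID column appended."""
    # Hypothetical auto-generated name built from the selected columns.
    id_column = id_column or "hash_" + "_".join(columns)
    return [{**row, id_column: hash_id(row, columns, salt)} for row in rows]

# Hypothetical demographic data.
people = [
    {"BirthYear": "1980", "Zip": "10001"},
    {"BirthYear": "1992", "Zip": "94105"},
    {"BirthYear": "1980", "Zip": "10001"},
]

with_ids = add_hash_ids(people, ["BirthYear", "Zip"], salt="keep-this-secret")
print(with_ids[0])
```

Rows with identical values in the selected columns receive the same ID, which is what makes the identifiers persistent across files as long as the salt stays the same.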

Cluster & Edit Values

Find and merge similar values in a column using various clustering algorithms.

Select Column: Choose a column that may contain variations of the same value.

Clustering Methods:

Key collision: Transforms values into keys that ignore differences in case, word order, etc.
N-gram fingerprint: Creates keys based on character sequences, helpful for spotting typos.
Phonetic fingerprint: Groups values that sound similar using phonetic algorithms.
Nearest neighbor: Groups values based on similarity measures like Levenshtein distance.

Similarity Threshold: For the nearest neighbor method (Levenshtein distance), controls how similar values must be to fall into the same cluster.

Example: Find and standardize variations like "United States", "USA", "U.S.A", and "US" in a Country column.
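
Two of the methods above can be sketched briefly: a key collision fingerprint and the Levenshtein distance used for nearest neighbor matching. The exact normalization steps the tool applies are not documented here, so the ones below (trim, lowercase, strip punctuation, sort unique words) are an assumption, as are the function names.

```python
import re
from collections import defaultdict

def fingerprint(value):
    """Build a key that ignores case, punctuation, and word order."""
    tokens = re.sub(r"[^\w\s]", "", value.strip().lower()).split()
    return " ".join(sorted(set(tokens)))

def cluster_by_key(values):
    """Group values that share a fingerprint; keep groups with real variation."""
    clusters = defaultdict(list)
    for value in values:
        clusters[fingerprint(value)].append(value)
    return [group for group in clusters.values() if len(set(group)) > 1]

def levenshtein(a, b):
    """Number of single-character edits needed to turn `a` into `b`."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]

# Hypothetical column values.
countries = ["United States", "united states", "States, United", "Canada"]
print(cluster_by_key(countries))
print(levenshtein("Germany", "Germnay"))
```

In this sketch "USA" and "U.S.A" share the key "usa", but abbreviations and full names such as "US" and "United States" do not share a key, so merging them still relies on manual editing of the clusters.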

Drop Columns

Remove unnecessary columns from your dataset.

Select columns: Choose which columns to completely remove from the dataset.

Example: Remove sensitive columns like "SSN" or "Phone Number" before sharing data with others, or remove irrelevant columns to simplify your analysis.
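
Dropping columns amounts to copying each row without the selected keys. A minimal sketch, with the function name drop_columns and the sample rows assumed for illustration:

```python
def drop_columns(rows, columns):
    """Return copies of the rows without the named columns."""
    drop = set(columns)
    return [{key: value for key, value in row.items() if key not in drop} for row in rows]

# Hypothetical records containing sensitive fields.
records = [
    {"Name": "Ada", "SSN": "000-00-0000", "Phone Number": "555-0100", "City": "London"},
    {"Name": "Bob", "SSN": "000-00-0001", "Phone Number": "555-0101", "City": "Boston"},
]

shareable = drop_columns(records, ["SSN", "Phone Number"])
print(shareable)
```

The original rows are left untouched, so the full dataset remains available if a column was dropped by mistake.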
