Data Cleanse Pro Tool
Use Data Cleanse Pro to fix common data quality issues. You can replace null values, remove punctuation, HTML tags, modify capitalization, and more!
Important
The existing Data Cleansing tool is still available and will continue to work in existing workflows.
Tool Components
The Data Cleanse Pro tool has 2 anchors.
Input anchor: Use the input anchor to connect the data you want to clean up.
Output anchor: The output anchor outputs the cleaned data.
Configure the Tool
Use the Data Cleanse Pro configuration window to determine how data quality issues are managed.
Remove Data
Select either or both of the checkboxes to remove entire rows and columns of null data.
Important
These options are applied after all other cleansing steps to ensure consistent output when combined with the Replace with Null options. If you want to remove Null Data from the initial input, do not use the Remove Data and Replace with Null options at the same time.
Null Rows:
Remove all rows with a null value in every column.
This option doesn't remove rows with empty string values only.
A message shows how many rows were removed.
Null Columns:
Remove all columns with a null value in every row.
This option doesn't remove columns with empty string values only.
A message shows how many columns were removed.
Select Columns to Cleanse
Use this section to select the columns that you want to cleanse. These columns are populated from the tool's input connection. Column Name identifies the column name and Type identifies the data type of the column (String, Integer, Float, Date, etc.).
You can use the select all checkbox to select all of your columns or the checkboxes associated with the individual columns to select specific columns to clean.
Use the Search input field to search for specific columns by name.
Select the filter icon next to Type to filter this table for columns with a specific data type.
By default, all columns are selected and data types are included. The Unknown column is also selected by default and allows new columns in the data.
Remove Characters
Use this section to select the specific character categories that you want to remove from the column that you selected in the previous section. Several of the below examples use this as the input:
Input | Content |
---|---|
1 |
|
2 |
|
3 |
|
Leading and Trailing Tabs, Line Breaks, and Whitespace: Remove tabs, line breaks, and whitespace at the beginning and end of a string.
Output
Content
1
Hello , Alteryx Community!
2
W e ekly Challe nges
3
Help Documentation !
Tabs, Line Breaks, and Duplicate Whitespace: Replace all consecutive spaces, tabs, and other whitespace characters at any position in a string, with a single space.
Output
Content
1
Hello , Alteryx Community!
2
W e ekly Challe nges
3
Help Documentation !
All WhiteSpace: Remove all whitespace at any position in a string.
Output
Content
1
Hello,AlteryxCommunity!
2
WeeklyChallenges
3
HelpDocumentation!
HTML Tag: Remove HTML, XML, and other tags enclosed in angle brackets. Only the plain text content within the bracket remains. For example, an input of
<h1>Alteryx</h1>
results inAlteryx
.Example Input:
<div class="ExternalClass683BD308D77E4320B52BF37E72FF5C00"><html> <p>HD - Request to update several receiver fields - agreed to unlock file to allow for receiver edits - changes will be noted in notes field<br></p> </html></div>
Example Output:
HD - Request to update several receiver fields - agreed to unlock file to allow for receiver edits - changes will be noted in notes field
Important
Text that includes angle brackets (< and >) is interpreted as containing tags. This means even non-HTML content written using
< >
, for example, comparisons or math expressions, might be affected.For example...
Input:
3 < 5 and 3 > 1
Output:
3 1
: The text< 5 and 3 >
is treated as a tag and removed.
To prevent this, consider replacing angle brackets with alternative symbols (for example,
<
and>
or use parentheses).Invisible Characters: Remove all characters that are invisible formatting markers, such as invisible separators, word-break control, and the Unicode encodings U+200B (zero-width space) and U+00AD (soft hyphen). This option is useful when cells appear empty but are not truly blank.
Letters: Remove all letters, including non-Latin alphabet letters like,
A b Z À é ö
. Use the Except field to enter letters that should not be removed, if any (case-sensitive, no separators). Review Unicode ICU alphabetic categories:Uppercase Letter: A, B, C, Α, Б
Lowercase Letter: a, b, c, α, б
Titlecase Letter: Dž, Lj, Nj (special titlecase letters)
Modifier Letter: ʰ, ˠ, ˡ (phonetic or modifier letters)
Other Letter: क, 日, ש (letters not uppercase or lowercase)
Numbers: Remove all numbers. Use the Except field to enter numbers that should not be removed, if any. Review Unicode ICU numeric values:
Decimal Digit Number: 0-9, ١, ۲ (standard digits)
Letter Number: Ⅳ, Ⅶ, ↀ (Roman numerals, etc.)
Other Number: ½, ², ¾ (fractions, superscripts)
Punctuation and Special Characters: Remove all punctuation and special characters based on Unicode punctuation and symbol definitions used by the ICU library. Use the Except field to enter punctuation symbols that should not be removed.
In Unicode, all punctuation characters fall under the punctuation category:
Connector Punctuation: _ , ‿ , ⁀ (underscore, etc.)
Dash Punctuation: - , — , ‒ (hyphen, dash)
Open Punctuation: ( , [ , { (opening brackets)
Close Punctuation: ) , ] , } (closing brackets)
Initial Quote Punctuation: " , ‘ (left quotes)
Final Quote Punctuation: " , ’ (right quotes)
Other Punctuation: ! , ? , ; (miscellaneous)
Characters are categorized under the symbol general category, which includes:
Math Symbols: Characters like +, −, =, and ∞.
Currency Symbols: Characters like $, €, ¥, and £.
Modifier Symbols: Characters like ^, ˜, and accents used for phonetic transcription.
Other Symbols: Various symbols that include ©, ®, and emoji characters.
Replace String Columns
Determine how to handle data quality issues in string columns.
Replace Nulls with Blanks: Replace null values with a blank string value. A blank registers as " " rather than [Null].
Replace Blanks with Nulls: Replace blank string values with null values. A blank registers as " " rather than [Null]. This is the inverse of the previous option and helps standardize empty columns as true nulls.
Replace Numeric Columns
Determine how to handle data quality issues in numeric columns.
Replace Nulls with 0: Replace null values with a 0 (zero).
Replace 0 with Nulls: Replace 0's (zeros) with null values. This is the inverse of the previous option.
Modify Case
Select the Modify Case checkbox to activate the modify case dropdown. Select one of these options:
lower case: Convert all letters in a string to lowercase.
Title Case: Capitalize the 1st letter of all words in a string.
UPPER CASE: Capitalize all letters in a string.