RapidMiner Studio

In a few clicks, build validated machine learning models for customer churn prediction, predictive maintenance, credit risk modeling, sentiment analysis, and more. Seamlessly combine R or Python code.



Automated Machine Learning

RapidMiner Auto Model builds validated models in five clicks using automated machine learning.

Data Science use cases


Pre-built templates for common use cases including customer churn, predictive maintenance, fraud detection, and many more.


Visual Workflow Designer

Increase productivity across the entire data science team, from analysts to experts.

Accelerate and automate the creation of predictive models in a drag-and-drop visual interface.

A rich library of 1,500+ algorithms and functions ensures the best model for any use case.

“Wisdom of Crowds” provides proactive recommendations at every step to help beginners.


Connect to any data source

Work with all of your data, no matter where it lives.

Create point-and-click connections to databases, enterprise data warehouses, data lakes, cloud storage, business applications, and social media.

Re-use connections at any time and easily share them with anyone who needs access.

Connect to new sources, including social media, with extensions from the RapidMiner Marketplace.


Automated In-database Processing

Run data prep and ETL inside databases to keep your data optimized for advanced analytics.

Query and retrieve data without writing complex SQL.

Harness the power of highly scalable database clusters.

Supports MySQL, PostgreSQL, and Google BigQuery.
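
As an illustration of the pushdown idea (not RapidMiner's own implementation), the sketch below uses Python's built-in sqlite3 module as a stand-in database: the aggregation runs inside the engine, and only the summary rows travel back to the client.

import sqlite3

# In-memory stand-in for a production database such as MySQL or PostgreSQL.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES ('a', 10), ('a', 20), ('b', 5);
""")

# The GROUP BY executes inside the database; only the aggregated
# summary rows come back instead of the raw order table.
for row in con.execute(
    "SELECT customer, SUM(amount) AS total, COUNT(*) AS n "
    "FROM orders GROUP BY customer"
):
    print(row)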


Data Visualization

Understand patterns, trends, and distributions with scatter plots, histograms, line charts, parallel coordinates, box plots, and more.

Explore data using robust statistical overviews and over 30 interactive visualizations.


Data profiling and exploration with the Statistics view

Summary statistics for categorical, numerical, and date-time data, displayed at a glance.

Quickly identify and fix common data quality problems including missing values and outliers.


Data Prep and Blending

Eliminate the hassle of preparing data for predictive modeling.

RapidMiner Turbo Prep offers a fully interactive, point-and-click data prep experience.

Extract, join, filter, and group data across any number of sources.

Create repeatable data prep and ETL processes that can be scheduled and shared.


Machine Learning

Quickly create impactful machine learning models without writing code. Choose from hundreds of supervised and unsupervised machine learning algorithms.

Implement basic and advanced ML techniques including regression, clustering, time-series, text analytics, and deep learning.

Build models that are sensitive to constraints like costs, optimizing predictions for your desired business impact. Use both automated and manual feature engineering to improve model accuracy.


Model Validation

Understand the true performance of a model before deploying to production.

Eliminate overfitting through a unique approach that prevents pre-processing performed during model training from leaking into the application of the model.

Add proven techniques, like cross validation, to a model with just a single mouse click.
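
As a minimal sketch of the same idea outside RapidMiner, the scikit-learn pipeline below keeps pre-processing inside each cross-validation fold, so the scaler is always fitted on the training folds only (the dataset and model choices are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# The scaler is re-fitted on the training folds of every split,
# mirroring the "preprocessing model" idea described above.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=10)
print(f"10-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")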


Explainable Models, Not Black Boxes

Create visual data science workflows that are easy to explain and easy to understand. Each step in the data prep, modeling, and validation process is documented for complete transparency.

Visuals are easy to explain to others in the organization to gain buy-in for deployment.

Supports the Local Interpretable Model-Agnostic Explanations (LIME) framework.
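
For readers who want to see the framework itself, here is an illustrative sketch with the open-source Python lime package (pip install lime scikit-learn). RapidMiner exposes LIME through its own operators rather than this API, and the model and dataset below are arbitrary choices:

from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
model = RandomForestClassifier(random_state=0).fit(data.data, data.target)

explainer = LimeTabularExplainer(
    data.data,
    feature_names=data.feature_names,
    class_names=list(data.target_names),
    mode="classification",
)

# Explain a single prediction as a locally weighted linear model.
explanation = explainer.explain_instance(
    data.data[0], model.predict_proba, num_features=4
)
print(explanation.as_list())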


Get more from R & Python Code

Enable scalable code deployment and collaboration between coders and non-coders. Deploy code-based and code-containing models into a scalable platform.

Eliminate duplicate work and let others re-use your work by uploading code snippets to the RapidMiner repository for use in the simplified visual workflow designer.

Make use of the vibrant Python ecosystem by building on top of specialized, cutting-edge libraries and making them available to the rest of your team through RapidMiner.

Leverage data prep and ETL pipelines others have produced.


Flexible Scoring

Turn predictive insights into business impact. Quickly deploy scored data to spreadsheets and data visualization tools or turn models into production web services with RapidMiner AI Hub.

Integrate with enterprise scheduling tools.

Add RapidMiner Real Time Scoring for demanding high transaction/low latency use cases.


Model Operations

RapidMiner Model Ops delivers an easy way for less advanced users to put models into production and properly manage them.


RapidMiner for Teams

RapidMiner AI Hub – team collaboration, process automation, accelerated model creation, and scaled deployment.

RapidMiner Go – an automated, guided machine learning web interface. Point-and-click data science for domain experts, business users, and analysts.

Request pricing or a 30-minute demo

Let’s create a plan for your specific needs!

Feature List

Data Access & Management

Access, load, and analyze any type of data – both traditional structured data and unstructured data like text, images, and media – and transform unstructured data into structured form.

  • Access to more than 40 file types including SAS, ARFF, Stata, and via URL
  • Wizards for Microsoft Excel & Access, CSV, and database connections
  • Access to NoSQL databases MongoDB and Cassandra
  • Write to Qlik QVX or Tableau TDE files
  • Access to Cloud storage like Dropbox and Amazon S3
  • Access to text documents and web pages, PDF, HTML, and XML
  • Support for all JDBC database connections including Oracle, IBM DB2, Microsoft SQL Server, MySQL, Postgres, Teradata, Ingres, VectorWise, and more
  • Access to the full-text index & search platform Apache Solr
  • Access to Twitter & Salesforce.com
  • Repository-based data management on local systems or central servers via RapidMiner Server
  • Connect to Zapier and trigger Zapier tasks
  • Access to time series data, audio files, images, and many more
  • Enhanced data and metadata editor for repository entries

Data Exploration: Descriptive Statistics

  • Univariate statistics: mean, median, minimum, maximum, standard deviation, and number of missing values and plots
  • Nominal / categorical attributes: number of categories, counts, mode, number of missing values
  • Date attributes: minimum, maximum, number of missing values
  • Distribution plots
  • Bivariate statistics and plots:
    – Covariance matrix
    – Correlation matrix
    – Anova matrix
    – Grouped Anova
    – Transition matrix
    – Transition graph
    – Mutual information matrix
    – Rainflow matrix
  • Scaled and non-scaled mean-deviation plots
  • Plots of attribute weights based on multiple measures of relationship with the target
  • Simple rescaling of axes
  • Plots can be easily copied and pasted into other applications or exported in PNG, SVG, JPEG, EPS, or PDF formats
  • Choose from a variety of different color schemes

Graphs and Information

  • Easy-to-configure charts for fast insight generation from various visualizations, including:
  • Scatter, scatter matrices
  • Line
  • Bubble
  • Parallel
  • Deviation
  • Box
  • 3-D
  • Density
  • Histograms
  • Area
  • Bar charts, stacked bars
  • Pie charts
  • Survey plots
  • Self-organizing maps
  • Andrews curves
  • Quartile
  • Surface / contour plots, time series plots
  • Pareto / lift chart
  • Support for zooming and panning
  • Additional advanced chart engine for arbitrary definition of multiple charts including: on-the-fly grouping, filtering & aggregation

Data Prep Basics

  • Select attributes operator
  • Aggregations for multiple groups and functions like sum, average, median, standard deviation, variance, count, least, mode, minimum, maximum, product, or log product
  • Set operators like join, merge, append, union, or intersect
  • Operators for handling meta data like rename or attribute role definition
  • Filtering rows / examples according to range, missing values, wrong or correct predictions, or specific attribute value
  • Filtering outliers according to distances, densities, local outlier factors, class outlier factors, local correlation integrals, or clustering based outlier detections
  • Identification and removal of duplicates

Sampling

  • Absolute, relative, or probability-based
  • Balanced
  • Stratified
  • Bootstrapping
  • Model-based
  • Kennard-Stone
  • Range

Transformations

  • Normalization and standardization
  • Z-transformation, range transformation, proportion transformation, or interquartile ranges
  • Preprocessing models for applying the same transformations on test / scoring data
  • De-normalization making use of preprocessing models
  • Scaling by weights
  • All kinds of type conversions between numerical attributes, nominal / categorical attributes, and date attributes
  • Operator for guessing correct meta data from existing data sets
  • Adjustment of calendar dates and times
  • Sorting and Pareto sort
  • Shuffling
  • Rotations of data sets: Pivoting, De-Pivoting, and transposing data sets
  • Expression builder for arbitrary transformations on attributes:
    – Statistical functions: round, floor, ceiling, average, minimum, maximum
    – Basic functions: addition, subtraction, multiplication, division, less than, greater than, less or equal, greater or equal, equal, not equal, Boolean not, Boolean and, Boolean or
    – Log and exponential functions: natural logarithm, logarithm base 10, logarithm dualis, exponential, power
    – Trigonometric functions: sine, cosine, tangent, arc sine, arc cosine, arc tangent, hyperbolic sine, hyperbolic cosine, hyperbolic tangent, inverse hyperbolic sine, inverse hyperbolic cosine, inverse hyperbolic tangent
    – Text functions: to string, to number, cut, concatenation, replace and replace all, lower, upper, index, length, character at, compare, contains, equals, starts with, ends with, matches, suffix, prefix, trim, escape HTML
    – Date functions: parse, parse with locale, parse custom, before, after, to string, to string with locale, to string with custom pattern, create current, difference, add, set, and get
    – Miscellaneous functions: if-then-else, square root, signum, random, modulus, sum, binomial, missing binomial, missing
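
A minimal sketch of the preprocessing-model idea mentioned above, using plain NumPy and made-up numbers: the z-transformation parameters are learned once on training data and then re-applied, unchanged, to scoring data (and inverted for de-normalization):

import numpy as np

train = np.array([[1.0, 200.0], [2.0, 240.0], [3.0, 280.0]])
test = np.array([[1.5, 260.0]])

# "Fit" the preprocessing model: remember mean and std of the training set.
mu, sigma = train.mean(axis=0), train.std(axis=0)

z_train = (train - mu) / sigma
z_test = (test - mu) / sigma   # reuse training parameters; never refit on test

# De-normalization using the stored preprocessing model.
restored = z_test * sigma + mu
print(z_test, restored)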

Data Partitioning

  • Ensure high model quality through hold-out data sets
  • Create training, validation, and test data sets
  • Default stratification by the class if available
  • User-defined partitions possible
  • Resulting in example sets usable for modeling or further transformations
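
A sketch of stratified hold-out partitioning with scikit-learn; the 60/20/20 split and the Iris data are illustrative stand-ins for RapidMiner's partitioning operators:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 60/20/20 train/validation/test split, stratified by the class label.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42
)
print(len(X_train), len(X_val), len(X_test))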

Binning

  • Interactive binning by user specification
  • Simple binning
  • Count-based
  • Size-based
  • Frequency-based
  • Entropy-based minimizing the entropy in the induced partitions
  • Handling of missing values as its own group
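
A small pandas sketch of two of the listed strategies, size-based (equal-width) and frequency-based (equal-count) binning, with missing values kept as their own group; the data is made up:

import pandas as pd

ages = pd.Series([18, 22, 25, 31, 40, 44, 52, 60, 71, None])

equal_width = pd.cut(ages, bins=4)   # size-based: four equal-width intervals
equal_freq = pd.qcut(ages, q=4)      # frequency-based: roughly equal counts per bin

# Keep missing values as their own group, as described above.
labeled = equal_width.astype(str).replace("nan", "missing")
print(labeled.value_counts())
print(equal_freq.value_counts())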

Weighting and Selection

  • Attribute weighting
    – 30+ weighting schemes measuring the influence of attributes & forming base or weight-based selections (filter approach)
  • Attribute selection
    – Selection of attributes by user specification
    – Removal of “useless” attributes
    – Removal of attributes unrelated to target based on a chi-square or correlation-based selection criterion
    – Removal of attributes unrelated to target based on arbitrary weighting schemes like information gain, Gini index, and others
    – Removal of attributes with missing values
    – Selection of random attribute subsets
  • Automatic optimization of selections
    – Evolutionary
    – Forward selection
    – Backward elimination
    – Weight-guided
    – Brute-force
  • Attribute space transformations
    – Principal Component Analysis (PCA)
    – Singular Value Decomposition
  • Support for FastMap
  • Plots for principal component coefficients, Eigenvalues, and cumulative variance of Eigenvalues
  • Calculates Eigenvalues and Eigenvectors from correlation and covariance matrices
  • Choose the number of components to be retained
  • Independent Component Analysis (ICA)
  • Generalized Hebbian Algorithm (GHA)
  • Dimensionality reduction with Self-Organizing Maps (SOM)
  • Correspondence Analysis
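
A scikit-learn sketch of the PCA workflow described above: inspect Eigenvalues and cumulative explained variance, then choose how many components to retain (the wine dataset and 95% threshold are arbitrary):

import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)  # PCA on the correlation structure

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Keep enough components to explain 95% of the variance.
n_components = int(np.searchsorted(cumulative, 0.95) + 1)
print(pca.explained_variance_[:n_components], n_components)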

Attribute Generation

  • Operators for generating IDs, copies, concatenations, aggregations, products, Gaussian distributions, and more
  • Automatically optimized generation and detection of latent variables:
    – Evolutionary weighting
    – Forward weighting
    – Backward weighting
  • Multiple algorithms for the automatic creation of new attributes based on arbitrary functions of existing attributes
  • Genetic programming

Similarity Calculation

  • Calculation of similarities between data points
  • Cross Distances operator computes similarities between data points of two data sets
  • Numerical distance measures
    • Euclidean
    • Canberra
    • Chebyshev
    • Correlation
    • Cosine
    • Dice
    • Dynamic Time Warping
    • Inner product
    • Jaccard
    • Kernel-Euclidean
    • Manhattan
    • Max-Product
    • Overlap
  • Nominal / categorical distance measures
    • Nominal
    • Dice
    • Jaccard
    • Kulczynski
    • Rogers-Tanimoto
    • Russel-Rao
    • Simple Matching
  • Mixed Euclidean distance for cases with numerical & nominal attributes
  • Bregman divergences
    • Itakura-Saito
    • Kullback-Leibler
    • Logarithmic loss
    • Logistic loss
    • Mahalanobis
    • Squared Euclidean
    • Squared loss
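
For concreteness, a few of the listed numerical measures computed with SciPy on two made-up rows:

from scipy.spatial import distance

a = [1.0, 0.0, 2.0, 3.0]
b = [0.5, 1.0, 2.5, 2.0]

print("Euclidean:      ", distance.euclidean(a, b))
print("Manhattan:      ", distance.cityblock(a, b))
print("Chebyshev:      ", distance.chebyshev(a, b))
print("Cosine distance:", distance.cosine(a, b))
print("Canberra:       ", distance.canberra(a, b))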

Clustering

  • User-defined number of clusters, or automatic selection of the best number of clusters
  • Support Vector Clustering
  • Several strategies for encoding class into the clustering
  • k-Means (for all available distance and similarity measures)
  • k-Medoids (for all available distance and similarity measures)
  • Kernel k-Means
  • X-Means
  • Cobweb
  • Clope
  • DBScan
  • Expectation Maximization Clustering
  • Self-organizing maps
  • Agglomerative Clustering
  • Top Down Clustering
  • Operators for flattening hierarchical cluster models
  • Extraction of prototypes for centroid-based cluster models
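
A scikit-learn sketch of k-Means with an automatic choice of the number of clusters; the silhouette criterion used here is one common heuristic, whereas RapidMiner's X-Means automates the choice differently:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Try several cluster counts and keep the one with the best silhouette.
best_k, best_score = None, -1.0
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score = k, score

print(f"best k = {best_k}, silhouette = {best_score:.3f}")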

Market Basket Analysis

  • Associations and sequence discovery
  • Measuring quality of rules by support, confidence, Laplace, gain, ps-value, lift, or conviction
  • Interactive filter for frequent item sets
  • Interactive visualization of association rules as a network graph
  • Rules description table
  • User defined rule filtering depending on minimum value for the above criteria or matching criteria for specific items
  • Fast FP-Growth operator (similar to Apriori but far more efficient)
  • Generalized sequential patterns
  • Modular operators for the creation of frequent item sets or association rules only
  • Post-processing to unify item sets
  • Application of association rules to deploy as a recommendation engine
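
An illustrative sketch of FP-Growth plus association rules with the open-source mlxtend package (pip install mlxtend; the exact association_rules signature has varied slightly across mlxtend versions). RapidMiner provides these as native operators; the transactions are made up:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

transactions = [
    ["milk", "bread", "butter"],
    ["bread", "butter"],
    ["milk", "bread"],
    ["milk", "butter"],
]

# One-hot encode the transactions, then mine frequent item sets.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

itemsets = fpgrowth(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])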

Decision Trees

  • Easy-to-understand models
  • Supported methods: classification and regression trees (CART), CHAID, decision stumps, ID3, C4.5, Random Forest, bagging and boosting
  • Support for multi-way trees
  • Gradient Boosted Trees (GBT)
  • Pre-pruning and pruning
  • Split criteria include information gain, gain ratio, accuracy, and Gini index
  • Error-based and confidence-based pruning
  • Distribution shown at tree leaves
  • Height of distribution bars correlates with the number of examples in each leaf
  • Majority class shown at tree leaves
  • Class counts shown as tool tip at tree leaves
  • The darkness of connections correlates with the number of examples on this path
  • Graphical and textual representation of trees
  • Interactive visualization of trees including selecting and moving of nodes
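
A scikit-learn sketch of a pre- and post-pruned CART tree with a Gini split criterion; the parameter values are arbitrary, and "entropy" would correspond to the information-gain criterion listed above:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(
    criterion="gini",      # split criterion; "entropy" = information gain
    max_depth=3,           # pre-pruning
    ccp_alpha=0.01,        # cost-complexity (post-)pruning
    random_state=0,
).fit(X, y)

# Textual representation of the tree, as described above.
print(export_text(tree, feature_names=load_iris().feature_names))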

Rule Induction

  • Recursive technique with easy-to-read results
  • Especially useful for modeling rare events like for subgroup discovery
  • Supported methods: rule induction, single rule induction, single attribute, subgroup discovery, tree to rules
  • Supported splitting criteria include information gain and accuracy
  • Definition of pureness of rules
  • Error-based pruning
  • Easy to read and parse representation of rule sets as textual descriptions or tables

Bayesian Modeling

  • Naïve Bayes
  • Kernel Naïve Bayes
  • Bayes models can be updated and are therefore especially suitable for large data sets or online stream mining

Regression

  • Linear
  • Logistic
  • Generalized Linear Model (H2O)
  • Kernel Logistic Regression
  • Linear Discriminant Analysis (LDA)
  • Quadratic Discriminant Analysis (QDA)
  • Regularized Discriminant Analysis (RDA)
  • Stepwise forward and backward selection
  • Selection with M5’, t-test or iterative t-test
  • Seemingly unrelated regression
  • Vector linear regression
  • Polynomial regression
  • Local polynomial regression
  • Gaussian Processes

Neural networks

  • Flexible network architectures with different activation functions
  • Multiple layers with different numbers of nodes
  • Different training techniques
  • Perceptron
  • Multilayer Perceptron
  • Deep Learning (H2O)
  • Automatic optimization of both learning rate and size adjustment of neural networks during training

Support Vector Machines

  • Powerful and robust modeling techniques for large numbers of dimensions
  • Offers overfitting control by regularization
  • Especially suitable for modeling unstructured information like text data
  • More than 10 different methods for support vector classification, regression, and clustering
  • Support Vector Machine
  • Relevance vector machine
  • Linear, Evolutionary, PSO, Fast Large Margin, Hyper Hyper
  • Kernel functions include dot, radial basis function, polynomial, neural, Anova, Epanechnikov, Gaussian combination, or multiquadric
  • Simple support vector machines suitable for boosting
  • Linear-time support vector machine for fast training, even with large numbers of dimensions and examples

Memory-Based Reasoning

  • k-Nearest Neighbors for classification and regression
  • Locally weighted learning
  • Optimized scoring through a ball-tree search data structure

Model Ensembles

  • Hierarchical models
  • Combination of multiple models to form a potentially stronger model
  • Vote
  • Additive regression
  • Ada boost
  • Bayesian boosting
  • Bagging
  • Stacking
  • Classification by regression
  • Meta cost for defining costs of different error types and finding optimal models that avoid expensive errors
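
A minimal scikit-learn sketch of the voting idea: three heterogeneous base models combined into one soft-voting ensemble; the models and data are illustrative:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

vote = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=5000)),
        ("rf", RandomForestClassifier(random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="soft",  # average predicted class probabilities
)
print(cross_val_score(vote, X, y, cv=5).mean())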

Model Validation

  • RapidMiner Studio follows a stringent modular approach which prevents information used in pre-processing steps from leaking from model training into the application of the model. This approach ensures that no overfitting is introduced and prediction performance is not overestimated
  • Performance criteria for numerical and nominal / categorical targets, including:
    • Accuracy
    • Classification error
    • Kappa
    • Area under curve (AUC)
    • Precision
    • Recall
    • Lift
    • Fallout
    • F-measure
    • False positives
    • False negatives
    • True positives
    • True negatives
    • Sensitivity
    • Specificity
    • Youden index
    • Positive predictive value
    • Negative predictive value
    • PSEP
    • Correlation
    • Spearman rho
    • Kendall tau
    • Squared correlation
    • Absolute error
    • Relative error
    • Normalized absolute error
    • Root mean squared error (RMSE)
    • Root relative squared error (RRSE)
    • Squared error
    • Cross entropy
    • Margin
    • Soft margin loss
    • Logistic loss
  • Significance tests to determine whether, and which, models performed better:
      • T-test
      • Anova
  • Find threshold operator to determine optimal cutoff point for binominal classes
  • Performance estimation for cluster models based on distance calculations, density calculations, or item distributions
  • Embed pre-processing steps into the validation
  • Display multiple results in history to help better evaluate model performance
  • Various techniques for the estimation of model performance:
    • Cross validation (with parallel execution of the folds)
    • Split validation
    • Bootstrapping
    • Batch cross validation
    • Wrapper cross validation
    • Wrapper split validation
  • Visual evaluation techniques:
    • Lift chart
    • ROC curves
    • Confusion matrix
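
For concreteness, a scikit-learn sketch computing a handful of the criteria listed above (accuracy, kappa, AUC, and the confusion matrix) on an illustrative hold-out split:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             roc_auc_score, confusion_matrix)

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
pred = model.predict(X_te)

print("accuracy:", accuracy_score(y_te, pred))
print("kappa:   ", cohen_kappa_score(y_te, pred))
print("AUC:     ", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
print("confusion matrix:\n", confusion_matrix(y_te, pred))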

Scoring

  • Operator for applying models to datasets (Scoring)
  • Support of predictive models, cluster models, preprocessing models, transformation models, and models for missing value imputations
  • Storing of models in central repositories for reuse in other processes and projects
  • Applying a model creates optimal scores by ignoring unused attributes and handling previously unseen values
  • Import and export of RapidMiner models, R models, and Weka models from repository or files
  • Support of PMML 3.2 and 4.0

Background process execution

  • Utility-like process control operations that let you build processes that behave like a program: repeat and loop over tasks, branch flows, and call on system resources
  • Execute multiple processes in parallel
  • Long-running processes can be run in the background while you continue to work on other processes in the foreground, for faster and more effective development iterations
  • Processes running in the background can be monitored; results and logs can be reviewed once they are available
  • The maximum number of processes allowed to run simultaneously can be configured to match the hardware resources and the demands of the processes being executed; the default is the number of cores minus one

Scripting

  • Write scripts for easy-to-complex data preparation and transformation tasks where existing operators might not be sufficient
  • Incorporate procedures from other processes or projects
  • Develop custom models
  • Augment scoring logic by custom post-processing or model application procedures
  • Easy-to-use program development interface:
    – Predefined imports for common data structures
    – Syntactic sugar for simplified data access and alteration
    – Interactive code editor and syntax highlighting
  • Execute command line programs and integrate results and result codes in processes
  • Execution of SQL statements directly in database
  • Seamless integration of various programming languages into the RapidMiner Studio user interface:
    – Execution of Groovy scripts within RapidMiner Studio processes
    – Execution of OS scripts within RapidMiner Studio processes
    – Execution of R scripts within RapidMiner Studio processes
    – Execution of Python scripts within RapidMiner Studio processes
  • Predefined scripted models & transformations available as operators
  • Custom scripts can be stored and executed as their own operators within a process
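
As a sketch of what such an embedded script looks like: RapidMiner's Execute Python operator (Python Scripting extension) calls a function named rm_main, passing connected inputs as pandas DataFrames and treating returned DataFrames as operator outputs. The transformation below is a made-up example:

import pandas as pd

def rm_main(data: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical transformation: derive one new column so the
    # downstream visual workflow can use it like any other attribute.
    data = data.copy()
    numeric = data.select_dtypes("number")
    data["row_mean"] = numeric.mean(axis=1)
    return data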

Process Control

  • Organize segments in sub-processes and reuse them in different projects
  • Repeat execution over a segment of a process
  • Support for loops
    • Loop (basic loop, with parallel execution of the iterations)
    • Attributes (parallel execution of the iterations)
    • Labels
    • Subsets
    • Values (parallel execution of the iterations)
    • Examples
    • Clusters
    • Batches
    • Data Sets
    • Data Fractions
    • Parameters
    • Files (parallel execution of the iterations)
    • Repository entries
  • Branches (if-then-else) based on:
    • Data values
    • Attribute existence
    • Numbers of examples
    • Performance values
    • Existence of files and process inputs
    • Definition of macros
    • Arbitrary expressions
  • Creation of collections of the same type
  • Collection handling: selection, flattening, or looping
  • Remembering and recalling (intermediate) process results for complex process designs
  • Handling expected and unexpected errors and exceptions

Automatic Optimization

  • Automatic selection of best-performing sub-processes
  • Measuring the influence of preprocessing steps by nested cross validations / other validations
  • Automatic selection of best model type and parameters
  • Automatic selection of best attribute subsets
  • Automatic optimization of process parameters, including modeling parameters:
    • Grid
    • Quadratic
    • Evolutionary

Macros

  • Centralized definition of macros / variables containing arbitrary textual or numerical content
  • Usage of macros everywhere in the process design, especially as value for parameters
  • Macros can be defined during the process or in the process context
  • Definition of macros in the context allows for parameterization of complete processes, e.g. for transforming processes into customizable web services
  • Extraction of macro values from data values, meta data or statistics supported
  • Expression engine for calculating arbitrary macro values from existing macros

Logging and Process-Based Reporting

  • Logging can be introduced at arbitrary places within a process
  • Logging can collect parameter values, performance values, or specific values for each operator, e.g. the current generation for evolutionary algorithms
  • Data values can be logged
  • Macro values can be logged
  • Logged values can be transformed into several formats, including data sets and weights, which can be stored, transformed, analyzed, or visualized like any other data set
  • In cases where logging alone is not sufficient, a complete process-based reporting engine allows for the collection of arbitrary results in static reports
  • Different formats like PDF, Excel, HTML, or RTF supported
  • Different reporting styles including a sequential report or portals
  • Support of sections with up to 5 levels
  • Arbitrary process results as well as intermediate results can be transformed into different types of visualizations like tables, charts etc.
  • Support for page breaks and other style information
  • Combination with loops or other process control structures allows for highly-detailed result overviews even for complex process designs

System Requirements

  • Processor: 3 GHz or faster, quad core (minimum: 2 GHz, dual core)
  • RAM: 16 GB (minimum: 4 GB)
  • Hard disk space: more than 100 GB free (minimum: more than 1 GB free)
  • Display: 1280×1024 pixel resolution (minimum)
  • Operating System: Windows 10 (64-bit highly recommended), Windows 8.1, Windows 8, Windows 7, Linux (64-bit only), macOS X 10.10–10.15
  • Java Platform: 64-bit, OpenJDK Java 8