Monday, 12 Jan 2026
Subscribe
logo
  • Global
  • AI
  • Cloud Computing
  • Edge Computing
  • Security
  • Investment
  • Sustainability
  • More
    • Colocation
    • Quantum Computing
    • Regulation & Policy
    • Infrastructure
    • Power & Cooling
    • Design
    • Innovations
    • Blog
Font ResizerAa
Data Center NewsData Center News
Search
  • Global
  • AI
  • Cloud Computing
  • Edge Computing
  • Security
  • Investment
  • Sustainability
  • More
    • Colocation
    • Quantum Computing
    • Regulation & Policy
    • Infrastructure
    • Power & Cooling
    • Design
    • Innovations
    • Blog
Have an existing account? Sign In
Follow US
© 2022 Foxiz News Network. Ruby Design Company. All Rights Reserved.
Data Center News > Blog > Cloud Computing > Machine learning for Java developers: Machine learning data pipelines
Cloud Computing

Machine learning for Java developers: Machine learning data pipelines

Last updated: February 2, 2024 12:22 pm
Published February 2, 2024
Share
Machine learning for Java developers: Machine learning data pipelines
SHARE

The article, Machine learning for Java developers: Algorithms for machine learning, introduced setting up a machine learning algorithm and developing a prediction function in Java. Readers learned the inner workings of a machine learning algorithm and walked through the process of developing and training a model. This article picks up where that one left off. You’ll get a quick introduction to Weka, a machine learning framework for Java. Then, you’ll see how to set up a machine learning data pipeline, with a step-by-step process for taking your machine learning model from development into production. We’ll also briefly discuss how to use Docker containers and REST to deploy a trained ML model in a Java-based production environment.

What to expect from this article

Deploying a machine learning model is not the same as developing one. These are different parts of the software development lifecycle, and often implemented by different teams. Developing a machine learning model requires understanding the underlying data and having a good grasp of mathematics and statistics. Deploying a machine learning model in production is typically a job for someone with both software engineering and operations experience.

This article is about how to make a machine learning model available in a highly scalable production environment. It is assumed that you have some development experience and a basic understanding of machine learning models and algorithms; otherwise, you may want to start by reading Machine learning for Java developers: Algorithms for machine learning.

Let’s start with a quick refresher on supervised learning, including the example application we’ll use to train, deploy, and process a machine learning model for use in production.

Supervised machine learning: A refresher

A simple, supervised machine learning model will illustrate the ML deployment process. The model shown in Figure 1 can be used to predict the expected sale price of a house.

Trained supervised machine learning model for sale price prediction. IDG

Figure 1. Trained supervised machine learning model for sale price prediction.

Recall that a machine learning model is a function with internal, learnable parameters that map inputs to outputs. In the above diagram, a linear regression function, hθ(x), is used to predict the sale price for a house based on a variety of features. The x variables of the function represent the input data. The θ (theta) variables represents the internal, learnable model parameters.

To predict the sale price of a house, you must first create an input data array of x variables. This array contains features such as the size of the lot or the number of rooms in a house. This array is called the feature vector.

Because most machine learning functions require a numerical representation of features, you will likely have to perform some data transformations in order to build a feature vector. For instance, a feature specifying the location of the garage could include labels such as “attached to home” or “built-in,” which have to be mapped to numerical values. When you execute the house-price prediction, the machine learning function will be applied with this input feature vector as well as the internal, trained model parameters. The function’s output is the estimated house price. This output is called a label.

Training the model

Internal, learnable model parameters (θ) are the part of the model that is learned from training data. The learnable parameters will be set during the training process. A supervised machine learning model like the one shown below has to be trained in order to make useful predictions.

See also  A change in the machine learning landscape
An untrained supervised machine learning model IDG

Figure 2. An untrained supervised machine learning model

Typically, the training process starts with an untrained model where all the learnable parameters are set with an initial value such as zero. The model consumes data about various house features along with real house prices. Gradually, it identifies correlations between house features and house prices, as well as the weight of these relationships. The model adjusts its internal, learnable model parameters and uses them to make predictions.

A trained supervised machine learning model. IDG

Figure 3. A trained supervised machine learning model

After the training process, the model will be able to estimate the sale price of a house by assessing its features.

Machine learning algorithms in Java code

The HousePriceModel provides two methods. One of them implements the learning algorithm to train (or fit) the model. The other method is used for predictions.

fit() and predict() methods in a machine learning model IDG

Figure 4. Two methods in a machine learning model

The fit() method

The fit() method is used to train the model. It consumes house features as well as house-sale prices as input parameters but returns nothing. This method requires the correct “answer” to be able to adjust its internal model parameters. Using housing listings paired with sale prices, the learning algorithm looks for patterns in the training data. From these, it produces model parameters that generalize from those patterns. As the input data becomes more accurate, the model’s internal parameters are adjusted.

Listing 1. The fit() method is used to train a machine learning model


// load training data
// ...

// e.g. [{MSSubClass=60.0, LotFrontage=65.0, ...}, {MSSubClass=20.0, ...}]
List<Map<String, Double>> houses = ...;

// e.g. [208500.0, 181500.0, 223500.0, 140000.0, 250000.0, ...]
List<Double> prices = ...;

// create and train the model
var model = new HousePriceModel();
model.fit(houses, prices);

Note that the house features are double typed in the code. This is because the machine learning algorithm used to implement the fit() method requires numbers as input. All house features must be represented numerically so that they can be used as x parameters in the linear regression formula, as shown here:


hθ(x) = θ0 * x0 + ... + θn * xn

The trained house price prediction model could look like what you see below:


price = -490130.8527 * 1 + -241.0244 * MSSubClass + -143.716  * LotFrontage +  … * …

Here, the input house features such as MSSubClass or LotFrontage are represented as x variables. The learnable model parameters (θ) are set with values like -490130.8527 or -241.0244, which have been gained during the training process.

See also  Pushing Back Against Data Centers, One Small Town at a Time

This example uses a simple machine learning algorithm, which requires just a few model parameters. A more complex algorithm, such as for a deep neural network, could require millions of model parameters; that is one of the main reasons why the process of training such algorithms requires high computation power.

The predict() method

Once you have finished training the model, you can use the predict() method to determine the estimated sale price of a house. This method consumes data about house features and produces an estimated sale price. In practice, an agent of a real estate company could enter features such as the size of a lot (lot-area), the number of rooms, or the overall house quality in order to receive an estimated sale price for a given house.

Transforming non-numeric values

You will often be faced with datasets that contain non-numeric values. For instance, the Ames Housing dataset used for the Kaggle House Prices competition includes both numeric and textual listings of house features:

Kaggle House Prices dataset IDG

Figure 5. A sample from the Kaggle House Prices dataset

To make things more complicated, the Kaggle dataset also includes empty values (marked NA), which cannot be processed by the linear regression algorithm shown in Listing 1.

Real-world data records are often incomplete, inconsistent, lacking in desired behaviors or trends, and may contain errors. This typically occurs in cases where the input data has been joined using different sources. Input data must be converted into a clean data set before being fed into a model.

To improve the data, you would need to replace the missing (NA) numeric LotFrontage value. You would also need to replace textual values such as MSZoning “RL” or “RM” with numeric values. These transformations are necessary to convert the raw data into a syntactically correct format that can be processed by your model.

Once you’ve converted your data to a generally readable format, you may still need to make additional changes to improve the quality of input data. For instance, you might remove values not following the general trend of the data, or place infrequently occurring categories into a single umbrella category.

Java-based machine learning with Weka

As you’ve seen, developing and testing a target function requires well-tuned configuration parameters, such as the proper learning rate or iteration count. The example code you’ve seen so far reflects a very small set of the possible configuration parameters, and the examples were simplified to keep the code readable. In practice, you will likely rely on machine learning frameworks, libraries, and tools.

See also  Genesys acquires Radarr Technologies to unify customer experience

Most frameworks or libraries implement an extensive collection of machine learning algorithms. Additionally, they provide convenient high-level APIs to train, validate, and process data models. Weka is one of the most popular frameworks for the JVM.

Weka provides a Java library for programmatic usage, as well as a graphical workbench to train and validate data models. In the code below, the Weka library is used to create a training data set, which includes features and a label. The setClassIndex() method is used to mark the label column. In Weka, the label is defined as a class:


// define the feature and label attributes
ArrayList<Attribute> attributes = new ArrayList<>();
Attribute sizeAttribute = new Attribute("sizeFeature");
attributes.add(sizeAttribute);
Attribute squaredSizeAttribute = new Attribute("squaredSizeFeature");
attributes.add(squaredSizeAttribute);
Attribute priceAttribute = new Attribute("priceLabel");
attributes.add(priceAttribute);


// create and fill the features list with 5000 examples
Instances trainingDataset = new Instances("trainData", attributes, 5000);
trainingDataset.setClassIndex(trainingSet.numAttributes() - 1);
Instance instance = new DenseInstance(3);

instance.setValue(sizeAttribute, 90.0);
instance.setValue(squaredSizeAttribute, Math.pow(90.0, 2));
instance.setValue(priceAttribute, 249.0);
trainingDataset.add(instance);
Instance instance = new DenseInstance(3);
instance.setValue(sizeAttribute, 101.0);
...

The data set or Instance object can also be stored and loaded as a file. Weka uses an ARFF (Attribute Relation File Format), which is supported by the graphical Weka workbench. This data set is used to train the target function, known as a classifier in Weka.

Recall that in order to train a target function, you have to first choose the machine learning algorithm. The code below creates an instance of the LinearRegression classifier. This classifier is trained by calling the buildClassifier() method. The buildClassifier() method tunes the theta parameters based on the training data to find the best-fitting model. Using Weka, you do not have to worry about setting a learning rate or iteration count. Weka also does the feature scaling internally:


Classifier targetFunction = new LinearRegression();
targetFunction.buildClassifier(trainingDataset);

Once it’s established, the target function can be used to predict the price of a house, as shown here:


Instances unlabeledInstances = new Instances("predictionset", attributes, 1);
unlabeledInstances.setClassIndex(trainingSet.numAttributes() - 1);
Instance unlabeled = new DenseInstance(3);
unlabeled.setValue(sizeAttribute, 1330.0);
unlabeled.setValue(squaredSizeAttribute, Math.pow(1330.0, 2));
unlabeledInstances.add(unlabeled);

double prediction  = targetFunction.classifyInstance(unlabeledInstances.get(0));

Weka provides an Evaluation class to validate the trained classifier or model. In the code below, a dedicated validation data set is used to avoid biased results. Measures such as the cost or error rate will be printed to the console. Typically, evaluation results are used to compare models that have been trained using different machine-learning algorithms, or a variant of these:


Evaluation evaluation = new Evaluation(trainingDataset);
evaluation.evaluateModel(targetFunction, validationDataset);
System.out.println(evaluation.toSummaryString("Results", false));

The examples above use linear regression, which predicts a numeric-valued output such as a house price based on input values. Linear regression supports the prediction of continuous, numeric values. To predict binary yes/no values or classifiers, you could use a machine learning algorithm such as decision tree, neural network, or logistic regression:


// using logistic regression
Classifier targetFunction = new Logistic();
targetFunction.buildClassifier(trainingSet);

Source link

Contents
What to expect from this articleSupervised machine learning: A refresherMachine learning algorithms in Java codeJava-based machine learning with Weka
TAGGED: data, developers, Java, Learning, Machine, pipelines
Share This Article
Twitter Email Copy Link Print
Previous Article Google Bard gets image generation and a more capable Gemini Pro to take on ChatGPT Google Bard gets image generation and a more capable Gemini Pro to take on ChatGPT
Next Article More than $13 Billion Investment Opportunities in the APAC Green Data Center Market - Industry Analysis Report & Competitive Market Share & Forecast 2023-2028 More than $13 Billion Investment Opportunities in the APAC Green Data Center Market – Industry Analysis Report & Competitive Market Share & Forecast 2023-2028
Leave a comment

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Your Trusted Source for Accurate and Timely Updates!

Our commitment to accuracy, impartiality, and delivering breaking news as it happens has earned us the trust of a vast audience. Stay ahead with real-time updates on the latest events, trends.
FacebookLike
TwitterFollow
InstagramFollow
YoutubeSubscribe
LinkedInFollow
MediumFollow
- Advertisement -
Ad image

Popular Posts

H5 Data Centers, DE-CIX Enable Turnkey Connectivity at SAT-IX San Antonio

The primary Internet Exchange (IX) to collaborate with DE-CIX as a Turnkey Connectivity Accomplice is…

February 24, 2025

The AI edge in cybersecurity: Predictive tools aim to slash response times

Be a part of our day by day and weekly newsletters for the newest updates…

October 20, 2024

Cambridge Future Tech and Arup sign MoU to develop Deep Rack Venture Studio to revolutionise data centre infrastructure

Cambridge Future Tech (CFT), a frontrunner in scientific enterprise constructing, and Arup, a famend consultancy…

October 14, 2025

Newly Established CloudSquad Achieves Partner Status with Google Cloud

CloudSquad, a newly established integration service supplier specializing in cloud computing and AI, has secured…

April 14, 2024

ALi and Ceva partner to embed scalable AI NPUs in next-gen video platforms

ALi Company and Ceva introduced a strategic partnership to combine Ceva’s NeuPro-Nano and NeuPro-M Neural…

August 18, 2025

You Might Also Like

Portrait of Two Diverse Developers Working on Computers, Typing Lines of Code that Appear on Big Screens Surrounding Them. Male and Female Programmers Creating Innovative Software, Fixing Bugs.
Global Market

At CES, Nvidia launches Vera Rubin platform for AI data centers

By saad
Best 5 AI semantic reasoning tools for databases
Cloud Computing

Best 5 AI semantic reasoning tools for databases

By saad
Why 2026 will redefine the EMEA data centre landscape
Global Market

Why 2026 will redefine the EMEA data centre landscape

By saad
Aumovio turns to cloud computing to scale autonomous vehicle testing
Cloud Computing

Aumovio turns to the cloud to scale autonomous vehicle testing

By saad
Data Center News
Facebook Twitter Youtube Instagram Linkedin

About US

Data Center News: Stay informed on the pulse of data centers. Latest updates, tech trends, and industry insights—all in one place. Elevate your data infrastructure knowledge.

Top Categories
  • Global Market
  • Infrastructure
  • Innovations
  • Investments
Usefull Links
  • Home
  • Contact
  • Privacy Policy
  • Terms & Conditions

© 2024 – datacenternews.tech – All rights reserved

Welcome Back!

Sign in to your account

Lost your password?
We use cookies to ensure that we give you the best experience on our website. If you continue to use this site we will assume that you are happy with it.
You can revoke your consent any time using the Revoke consent button.