
How S&P is using deep web scraping, ensemble learning and Snowflake architecture to collect 5X more data on SMEs

Last updated: June 2, 2025 11:36 pm
Published June 2, 2025


The investing world has a significant problem when it comes to data about small and medium-sized enterprises (SMEs). This has nothing to do with data quality or accuracy; it’s the lack of any data at all.

Assessing SME creditworthiness has been notoriously challenging because small-business financial data isn’t public, and therefore very difficult to access.

S&P Global Market Intelligence, a division of S&P Global and a foremost provider of credit ratings and benchmarks, claims to have solved this longstanding problem. The company’s technical team built RiskGauge, an AI-powered platform that crawls otherwise elusive data from over 200 million websites, processes it through numerous algorithms and generates risk scores.

Built on Snowflake architecture, the platform has increased S&P’s coverage of SMEs by 5X.

“Our objective was expansion and efficiency,” explained Moody Hadi, S&P Global’s head of risk solutions’ new product development. “The project has improved the accuracy and coverage of the data, benefiting clients.”

RiskGauge’s underlying architecture

Counterparty credit management essentially assesses a company’s creditworthiness and risk based on several factors, including financials, probability of default and risk appetite. S&P Global Market Intelligence provides these insights to institutional investors, banks, insurance companies, wealth managers and others.

“Large and financial corporate entities lend to suppliers, but they need to know how much to lend, how frequently to monitor them, what the duration of the loan would be,” Hadi explained. “They rely on third parties to come up with a trustworthy credit score.”

But there has long been a gap in SME coverage. Hadi pointed out that, while large public companies like IBM, Microsoft, Amazon, Google and the rest are required to disclose their quarterly financials, SMEs don’t have that obligation, thus limiting financial transparency. From an investor perspective, consider that there are about 10 million SMEs in the U.S., compared to roughly 60,000 public companies.


S&P Global Market Intelligence claims it now has all of those covered: previously, the firm only had data on about 2 million, but RiskGauge expanded that to 10 million.

The platform, which went into production in January, is based on a system built by Hadi’s team that pulls firmographic data from unstructured web content, combines it with anonymized third-party datasets, and applies machine learning (ML) and advanced algorithms to generate credit scores.

The company uses Snowflake to mine company pages and process them into firmographics drivers (market segmenters) that are then fed into RiskGauge.

The platform’s data pipeline consists of:

  • Crawlers/web scrapers
  • A pre-processing layer
  • Miners
  • Curators
  • RiskGauge scoring

Specifically, Hadi’s team uses Snowflake’s data warehouse and Snowpark Container Services in the middle of the pre-processing, mining and curation steps.
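To make the order of the five stages concrete, here is a minimal sketch that chains them as plain functions. It is not S&P's implementation: every function body, field name and the sample record are hypothetical placeholders, and the scoring stage is left empty because the article does not describe the scoring model.

```python
from functools import reduce
from typing import Callable, Dict, List

Record = Dict[str, object]
Stage = Callable[[Record], Record]


def crawl(record: Record) -> Record:
    # Placeholder for a real fetch: attach hypothetical raw HTML.
    return {**record, "raw_html": "<p>Acme Co, Boston</p>"}


def preprocess(record: Record) -> Record:
    # Placeholder cleaning step: drop the markup, keep the text.
    text = str(record["raw_html"]).replace("<p>", "").replace("</p>", "")
    return {**record, "text": text}


def mine(record: Record) -> Record:
    # Placeholder miner: split "name, location" into two fields.
    name, _, location = str(record["text"]).partition(", ")
    return {**record, "name": name, "location": location}


def curate(record: Record) -> Record:
    # Keep only the curated fields; discard intermediate data.
    return {k: v for k, v in record.items() if k in ("url", "name", "location")}


def score(record: Record) -> Record:
    # The real scoring model is not public, so no score is invented here.
    return {**record, "risk_score": None}


PIPELINE: List[Stage] = [crawl, preprocess, mine, curate, score]


def run_pipeline(record: Record, stages: List[Stage] = PIPELINE) -> Record:
    # Feed the output of each stage into the next, in order.
    return reduce(lambda rec, stage: stage(rec), stages, record)


result = run_pipeline({"url": "https://acme.example"})
```

Keeping each stage as a function that takes and returns a record makes it easy to swap a single stage without touching the others.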

At the end of this process, SMEs are scored based on a combination of financial, business and market risk; 1 being the highest, 100 the lowest. Investors also receive reports on RiskGauge detailing financials, firmographics, business credit reports, historical performance and key developments. They can also compare companies to their peers.

How S&P is collecting valuable company data

Hadi explained that RiskGauge employs a multi-layer scraping process that pulls various details from a company’s web domain, such as basic ‘contact us’ and landing pages and news-related information. The miners go down several URL layers to scrape relevant data.

“As you can imagine, a person can’t do this,” said Hadi. “It’s going to be very time-consuming for a human, especially when you’re dealing with 200 million web pages.” Which, he noted, results in several terabytes of website information.
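One common way to go down several URL layers is a breadth-first crawl with a depth limit. The sketch below illustrates that idea only; the article does not say how RiskGauge traverses a site. The link graph `SITE`, the function name `crawl_layers` and the `max_depth` parameter are all hypothetical, and an in-memory dictionary stands in for real HTTP fetches.

```python
from collections import deque
from typing import Dict, List

# Hypothetical link graph for one company domain: page -> pages it links to.
SITE: Dict[str, List[str]] = {
    "/": ["/contact-us", "/news"],
    "/contact-us": [],
    "/news": ["/news/2025-expansion", "/"],
    "/news/2025-expansion": ["/news/2025-expansion/press-kit"],
    "/news/2025-expansion/press-kit": [],
}


def crawl_layers(links: Dict[str, List[str]], start: str = "/",
                 max_depth: int = 2) -> List[str]:
    """Visit pages breadth-first, going at most max_depth links from start."""
    seen = {start}
    visited: List[str] = []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the deepest allowed layer
        for nxt in links.get(url, []):
            if nxt not in seen:  # avoid revisiting pages and looping forever
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return visited


pages = crawl_layers(SITE, max_depth=2)
```

With `max_depth=2` the press-kit page, three links from the landing page, is not visited; raising the limit trades more coverage for more pages to process.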


After data is collected, the next step is to run algorithms that remove anything that isn’t text; Hadi noted that the system isn’t interested in JavaScript or even HTML tags. Data is cleaned so it becomes human-readable, not code. Then, it’s loaded into Snowflake and several data miners are run against the pages.
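As a rough illustration of this cleaning step, the sketch below uses Python's standard `html.parser` module to keep visible text and drop tags, scripts and styles. The class name `TextExtractor`, the helper `html_to_text` and the sample page are assumptions made for the example; the article does not describe which tools S&P uses.

```python
from html.parser import HTMLParser
from typing import List


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script/style blocks."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        # Join the chunks and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><script>var x = 1;</script>"
    "<style>p{color:red}</style></head>"
    "<body><h1>Acme Co</h1><p>Industrial pumps &amp; valves, Boston.</p></body></html>"
)
clean = html_to_text(sample)
```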

Ensemble algorithms are critical to the prediction process; these types of algorithms combine predictions from several individual models (base models or ‘weak learners’ that are essentially a little better than random guessing) to validate company information such as name, business description, sector, location, and operational activity. The system also factors in any polarity in sentiment around announcements disclosed on the site.

“After we crawl a site, the algorithms hit different components of the pages pulled, and they vote and come back with a recommendation,” Hadi explained. “There is no human in the loop in this process, the algorithms are basically competing with each other. That helps with the efficiency to increase our coverage.”
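The voting idea can be sketched with plain majority voting, which is one simple ensemble technique; the article does not say which ensemble method RiskGauge uses. In this example the "weak learners" are hypothetical keyword rules standing in for trained base models, and the sector labels and keywords are invented for illustration.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional

Voter = Callable[[str], Optional[str]]


def keyword_voter(keyword_to_sector: Dict[str, str]) -> Voter:
    """Build a toy weak learner: return the sector of the first keyword found."""
    def vote(text: str) -> Optional[str]:
        lowered = text.lower()
        for keyword, sector in keyword_to_sector.items():
            if keyword in lowered:
                return sector
        return None  # this voter abstains
    return vote


# Hypothetical base models, each looking at different cues in the page text.
VOTERS: List[Voter] = [
    keyword_voter({"pump": "Industrials", "bank": "Financials"}),
    keyword_voter({"valve": "Industrials", "loan": "Financials"}),
    keyword_voter({"software": "Technology", "factory": "Industrials"}),
]


def ensemble_vote(text: str, voters: List[Voter] = VOTERS) -> Optional[str]:
    """Return the majority label, or None if nobody votes or the top is tied."""
    votes = [v for v in (voter(text) for voter in voters) if v is not None]
    if not votes:
        return None
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the leading labels: no recommendation
    return ranked[0][0]


sector = ensemble_vote("Acme makes pumps and valves, with software for monitoring.")
```

Returning no recommendation on a tie is a design choice made for this sketch; a production system could instead weight voters by their measured accuracy.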

Following that initial load, the system monitors site activity, automatically running weekly scans. It doesn’t update information weekly; only when it detects a change, Hadi added. When performing subsequent scans, a hash key tracks the landing page from the previous crawl, and the system generates another key; if they are identical, no changes were made, and no action is required. However, if the hash keys don’t match, the system will be triggered to update company information.
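The hash comparison described above can be sketched in a few lines. SHA-256 is an assumption here, since the article does not name the hash function, and the function names `page_hash` and `needs_update` are hypothetical.

```python
import hashlib
from typing import Tuple


def page_hash(text: str) -> str:
    """Return a hex digest that changes whenever the page text changes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def needs_update(stored_hash: str, current_text: str) -> Tuple[bool, str]:
    """Compare the stored key with a fresh one; return (changed, new_hash)."""
    new_hash = page_hash(current_text)
    return new_hash != stored_hash, new_hash


# First crawl: store the key for the landing page.
stored = page_hash("Acme Co: industrial pumps and valves.")

# Weekly scan, page unchanged: keys match, so no action is required.
unchanged, _ = needs_update(stored, "Acme Co: industrial pumps and valves.")

# Weekly scan, page edited: keys differ, so company information is refreshed.
changed, stored = needs_update(stored, "Acme Co: pumps, valves and sensors.")
```

Storing only the digest means each weekly scan compares two short strings instead of two full copies of the page.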


This continuous scraping is important to ensure the system remains as up-to-date as possible. “If they’re updating the site often, that tells us they’re alive, right?” Hadi noted.

Challenges with processing speed, giant datasets, unclean websites

There were challenges to overcome when building out the system, of course, particularly due to the sheer size of datasets and the need for quick processing. Hadi’s team had to make trade-offs to balance accuracy and speed.

“We kept optimizing different algorithms to run faster,” he explained. “And tweaking; some algorithms we had were really good, had high accuracy, high precision, high recall, but they were computationally too costly.”

Websites do not always conform to standard formats, requiring flexible scraping methods.

“You hear a lot about designing websites with an exercise like this, because when we originally started, we thought, ‘Hey, every website should conform to a sitemap or XML,’” said Hadi. “And guess what? Nobody follows that.”

They didn’t want to hard code or incorporate robotic process automation (RPA) into the system because sites vary so widely, Hadi said, and they knew the most important information they needed was in the text. This led to the creation of a system that only pulls necessary components of a site, then cleanses it for the actual text and discards code and any JavaScript or TypeScript.

As Hadi noted, “the biggest challenges were around performance and tuning and the fact that websites by design are not clean.”

