The last 12 months have demonstrated the vast capabilities enabled by public web data collection; however, it’s clear that the industry still has room to grow in 2026.
With expected changes to legislation in the dependent AI industry and legal battles ahead, it will be interesting to watch how this plays out as the year unfolds. One thing we can count on: the fundamentals of data collection will remain more important than ever.
Below, top tech experts share their insights into how they expect the data collection landscape to develop, based on their industry expertise, and reveal what 2026 may bring to businesses and AI worldwide.
Fair use of copyrighted material
Denas Grybauskas, Chief Governance and Strategy Officer at Oxylabs, explained: “In US law discussions and potentially practice, we’ll see a growing emphasis put on the transformation of copyrighted work. The fair use doctrine allows transformative use of copyrighted material, which adds something new and makes it different in purpose or character.
“Therefore, much legal discussion will likely focus on whether using content, including web content, for AI training constitutes transformative use sufficient to qualify as fair use.
“At the same time, in cases where the fair use doctrine does not apply, as in jurisdictions such as the EU, the industry will need technological mechanisms for credit attribution and workable ways to remunerate creators, without undermining the openness of the web or the seamlessness of access to public information.”
Agentic systems for data collection
Julius Černiauskas, CEO at Oxylabs, said: “Next year will likely see interesting developments in comprehensive agentic systems for public data collection. Take the process of web scraping, which consists of many small tasks. AI agents can automate these tasks.
“Together, they comprise a multi-agent system that can handle much of the process, driving down costs and democratising public data access by making it more accessible without requiring specific skills or engineering teams.
“Once again, new tools and solutions to automate specific tasks constantly enter the market, something that will multiply next year.”
LLM use for parsing
Juras Juršėnas, COO at Oxylabs, stated: “Over the next 12 months, the use of LLMs for parsing will grow. For the past few years, data parsing has been one of the most impactful AI use cases in public data collection.
“However, it was still limited by cost (for LLM tokens) and by prompt-size constraints. Developers and data teams always used to need to clean the HTML to reduce its size before passing it to the LLM for parsing, which required additional resources. You might now only need to do this in specific cases.
“The number of tools available that can do it for you is booming. Thus, it’s reasonable to expect an increase in LLM usage for parsing.”
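To make the HTML clean-up step mentioned above concrete, here is a minimal sketch in Python using only the standard library. It is an illustration under stated assumptions, not a description of any tool named in this article: the names `slim_html`, `HtmlSlimmer`, `SKIPPED_TAGS` and `KEPT_ATTRS` are hypothetical, and the choice of which tags and attributes to drop is a guess that would need tuning per site.

```python
import html
from html.parser import HTMLParser

# Assumption: these elements rarely contain data worth extracting.
SKIPPED_TAGS = {"script", "style", "noscript", "svg"}
# Assumption: these attributes are the ones most likely to carry useful data.
KEPT_ATTRS = {"href", "src", "alt", "title", "id", "class"}


class HtmlSlimmer(HTMLParser):
    """Rebuilds a smaller HTML string: no scripts, styles, comments or noisy attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def _attrs(self, attrs):
        # Keep only whitelisted attributes that have a non-empty value.
        return "".join(
            f' {name}="{html.escape(value, quote=True)}"'
            for name, value in attrs
            if name in KEPT_ATTRS and value
        )

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif not self.skip_depth:
            self.parts.append(f"<{tag}{self._attrs(attrs)}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never open a skipped region.
        if tag not in SKIPPED_TAGS and not self.skip_depth:
            self.parts.append(f"<{tag}{self._attrs(attrs)}/>")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth:
            return
        text = " ".join(data.split())  # collapse runs of whitespace
        if text:
            self.parts.append(html.escape(text, quote=False))


def slim_html(raw: str) -> str:
    """Return a reduced version of `raw` suitable for a smaller LLM prompt."""
    parser = HtmlSlimmer()
    parser.feed(raw)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)


if __name__ == "__main__":
    page = (
        '<div class="item" style="color:red"><script>track();</script>'
        '<p onclick="x()">Price:   <b>10 &amp; up</b></p></div>'
    )
    slimmed = slim_html(page)
    print(len(page), "->", len(slimmed), "characters")
    print(slimmed)
```

The sketch trades fidelity for size: comments are dropped, whitespace between text and adjacent tags is lost, and an unclosed skipped element would hide everything after it. Whether that trade is acceptable depends on the page and on how much context the chosen model can take.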
Quality vs quantity
Rytis Ulys, Head of Data & AI at Oxylabs, commented: “In 2026, the search for data will focus less on quantity and more on quality. Recent Anthropic research showed that even small amounts of low-quality data can ruin the entire dataset.
“Furthermore, it showed that beyond a certain point, adding more low-quality data yields minimal gain, or even degrades performance, compared to using a targeted, relevant subset.
“As such, the fundamentals of data collection will remain more important than ever. Robust tables and catalogues, quality and lineage, and low-latency query engines have become prerequisites for agents and retrieval, not afterthoughts. Graph and vector-augmented retrieval is moving from blog posts to patterns, observability now spans prompts, tools, and cost, and compliance sits alongside performance on the same plane. Data isn’t fading; it’s been promoted to AI’s control surface.”
A better understanding of online data collection
Based on these insights, we can expect interesting developments in comprehensive agentic systems for public data gathering, growth in the use of LLMs for parsing, and a shift towards quality over quantity in the search for data.
In tandem, over the next 12 months, legal decisions on copyright law will need to be made in both the US and Europe, as the current situation has left many in uncertain territory.
Hopefully, 2026 will bring businesses clarity, with new tools and capabilities to automate processes, as well as a better understanding of web data collection and its role in businesses’ day-to-day operations.
