data Archives - SD Times https://sdtimes.com/tag/data/

Google open sources Java-based differential privacy library https://sdtimes.com/data/google-open-sources-java-based-differential-privacy-library/ Thu, 31 Oct 2024 15:33:10 +0000

Google has announced that it is open sourcing a new Java-based differential privacy library called PipelineDP4j.

Differential privacy, according to Google, is a privacy-enhancing technology (PET) that “allows for analysis of datasets in a privacy-preserving way to help ensure individual information is never revealed.” This enables researchers or analysts to study a dataset without accessing personal data. 
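For intuition, a common building block behind differential privacy is adding calibrated random noise to aggregate results so that no single individual's contribution can be pinned down. The sketch below shows the classic Laplace mechanism for a noisy count; it illustrates the concept only and is not PipelineDP4j's API, and the count, epsilon, and sensitivity values are made up for the example.

```python
# Minimal sketch of the Laplace mechanism, the textbook way to make a count
# differentially private. Not PipelineDP4j's API; purely illustrative.
import numpy as np

def noisy_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Add Laplace noise scaled to sensitivity/epsilon to a raw count."""
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# Smaller epsilon means a stronger privacy guarantee, but noisier answers.
print(noisy_count(true_count=1042, epsilon=0.5))
```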

Google claims that its implementation of differential privacy is the largest in the world, spanning nearly three billion devices. As such, Google has invested heavily in providing access to its differential privacy technologies over the last several years. For instance, in 2019, it open sourced its first differential privacy library, and in 2021, it open sourced its Fully Homomorphic Encryption transpiler.

In the years since, the company has also worked to expand the languages its libraries are available in, which is the basis for today’s news. 

The new library, PipelineDP4j, enables developers to execute highly parallelizable computations in Java, which reduces the barrier to differential privacy for Java developers, Google explained.

“With the addition of this JVM release, we now cover some of the most popular developer languages – Python, Java, Go, and C++ – potentially reaching more than half of all developers worldwide,” Miguel Guevara, product manager on the privacy team at Google, wrote in a blog post.

The company also announced that it is releasing another library, DP-Auditorium, that can audit differential privacy algorithms. 

According to Google, two key steps are needed to effectively test differential privacy: evaluating the privacy guarantee over a fixed dataset and finding the “worst-case” privacy guarantee in a dataset. DP-Auditorium provides tools for both of those steps in a flexible interface. 

It uses samples from the differential privacy mechanism itself and doesn’t need access to the application’s internal properties, Google explained. 

“We’ll continue to build on our long-standing investment in PETs and commitment to helping developers and researchers securely process and protect user data and privacy,” Guevara concluded. 

How Melissa’s Global Phone service cuts down on data errors and saves companies money https://sdtimes.com/data/how-melissas-global-phone-service-cuts-down-on-data-errors-and-saves-companies-money/ Mon, 07 Oct 2024 14:01:09 +0000

Having the correct customer information in your databases is necessary for a number of reasons, but especially when it comes to active contact information like email addresses or phone numbers.

“Data errors cost users time, effort, and money to resolve, so validating phone numbers allows users to spend those valuable resources elsewhere,” explained John DeMatteo, solutions engineer I at Melissa, a company that provides various data verification services, including one called Global Phone that validates phone number data.

For instance, call center employees often ask callers what a good number to call them back would be in case they get disconnected. Validating that number can eliminate user error and thus prevent the frustration of a user who can’t be called back. Or, if you’re doing a mobile campaign, you don’t want to be texting landlines or dead numbers because “it costs money every time you send out a text message,” DeMatteo said during a recent SD Times microwebinar.

RELATED: Validating names in databases with the help of Melissa’s global name verification service

It’s also helpful when cleansing databases or migrating data because you can confirm that the numbers in an existing database are actually valid.

There are a number of common errors in phone number data that validation can sort out, including inconsistent formatting, data type mismatches, disconnected or fake phone numbers, and manual entry errors.

“Global Phone allows customers the ability to standardize and validate phone numbers, to correct and detect any issues that may be present,” said DeMatteo.

The service takes in either a REST request for a single phone number or up to 100 records in a JSON request. All that’s needed is a single phone number, and optionally a country name — Global Phone can detect the country, but supplying it can speed up processing.

Then, Global Phone outputs a JSON file that contains validated, enriched, and standardized phone numbers, as well as result codes that identify information tied to the record, such as whether the number belongs to a cell phone or is a disposable number. It may also return CallerID and carrier information.

“Probably the most important thing is the result code,” DeMatteo explained. “We’re going to be returning information about what the data quality looks like, if there’s any problems with it.”

During the microwebinar, DeMatteo walked through an example of a poorly formatted phone number going through Global Phone.

In his example, the original phone number was ((858)[481]_8931. While it is the correct number of digits for a phone number, it is clearly poorly formatted and contains extra punctuation characters that shouldn’t be there.

Running it through Global Phone put the number into the correct format and also returned specific validation codes: PS01 (valid phone number), PS08 (landline), and PS18 (on the Do Not Call list).
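For a sense of what driving that example through the JSON batch interface could look like from code, here is a rough sketch in Python. The endpoint URL and credential field are placeholders, and the request and response field names are inferred from the inputs and outputs described in this article rather than taken from Melissa’s documentation.

```python
# Hypothetical sketch of a Global Phone batch call; the endpoint, credential
# field, and exact JSON schema are assumptions for illustration only.
import requests

payload = {
    "TransmissionReference": "webinar-demo-batch-1",   # echoed back to help track the batch
    "CustomerID": "<your-license-key>",                # placeholder credential
    "Records": [
        # The malformed number from the example above, plus an optional country hint.
        {"RecordID": "1", "PhoneNumber": "((858)[481]_8931", "Country": "US"},
    ],
}

resp = requests.post("https://<globalphone-endpoint>", json=payload, timeout=30)
for record in resp.json().get("Records", []):
    # Result codes such as PS01 (valid), PS08 (landline), and PS18 (Do Not Call)
    # describe what the service found for each number.
    print(record.get("RecordID"), record.get("Results"))
```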

According to DeMatteo, there are a number of best practices when working with phone data. First, always verify the phone type and active status before sending SMS. Another tip is to use the RecordID and TransmissionReference output fields to better keep track of data.

And for better efficiency, some recommendations are to supply the country information if it’s known and send multiple records at once using JSON batch calls, as that’s going to “give you the best bang for your buck.”

Microsoft open-sources Drasi, a data processing system for detecting and reacting to changes https://sdtimes.com/data/microsoft-open-sources-drasi-a-data-processing-system-for-detecting-and-reacting-to-changes/ Fri, 04 Oct 2024 15:39:27 +0000

Microsoft has announced and is open-sourcing a new data processing system called Drasi that can detect and react to changes in complex systems.

This new project “simplifies the automation of intelligent reactions in dynamic systems, delivering real-time actionable insights without the overhead of traditional data processing methods,” Mark Russinovich, CTO, deputy chief information security officer, and technical fellow at Microsoft Azure, wrote in a blog post.

It watches for events in logs and change feeds without having to copy data to a central data lake or continuously query data sources. Developers can define which changes they want to track, and then Drasi decides if those changes should trigger an action.

“If they do, it executes context-aware reactions based on your business needs. This streamlined process reduces complexity, ensures timely action while the data is most relevant, and prevents important changes from slipping through the cracks,” Russinovich explained.

Drasi can be boiled down into three components: Sources, Continuous Queries, and Reactions. 

Sources connect to data sources like application logs, database updates, or system metrics and continuously monitor for critical changes.

Continuous Queries continuously evaluate incoming changes based on some predefined criteria.

Reactions are triggered when a change satisfies a continuous query, and can include tasks like sending alerts, updating other systems, or performing remediation steps.
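To make that flow concrete, below is a deliberately simplified, Drasi-agnostic sketch of the source, continuous query, and reaction pattern in Python. It illustrates the idea only; Drasi’s actual sources, query language, and reaction configuration are not shown here, and the entity names and threshold are invented.

```python
# Toy illustration of the source -> continuous query -> reaction pattern.
# This is not Drasi's API; names and thresholds are invented for the example.
from dataclasses import dataclass
from typing import Iterable, Iterator

@dataclass
class Change:
    entity: str
    field: str
    value: float

def source(feed: Iterable[Change]) -> Iterator[Change]:
    """Stand-in for a Source: yields changes from a log or change feed."""
    yield from feed

def continuous_query(change: Change) -> bool:
    """Stand-in for a Continuous Query: does this change meet the criteria?"""
    return change.field == "temperature" and change.value > 90.0

def reaction(change: Change) -> None:
    """Stand-in for a Reaction: alert, update another system, or remediate."""
    print(f"ALERT: {change.entity} reported {change.field}={change.value}")

feed = [
    Change("freezer-7", "temperature", 95.2),
    Change("freezer-7", "humidity", 40.0),
]
for change in source(feed):
    if continuous_query(change):
        reaction(change)
```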

According to Microsoft, developers building event-handling mechanisms have often turned to multiple tools, resulting in fragmented and complex architectures. Additionally, a lot of change detection tools don’t have real-time capabilities and instead rely on batch processing, data collation, or delayed event analysis. 

“For businesses that need immediate reactions, even these slight delays can lead to missed opportunities or risks. In short, there is a pressing need for a comprehensive solution that detects and accurately interprets critical events, and automates appropriate, meaningful reactions,” Russinovich wrote.

The project has been submitted to the Cloud Native Computing Foundation (CNCF) as a Sandbox project, meaning that if it’s accepted, it will get support, guidance, and resources from the organization. It is being licensed under the Apache 2.0 license.

MongoDB 8.0 offers significant performance improvements to read throughput, bulk writes, and more https://sdtimes.com/data/mongodb-8-0-offers-significant-performance-improvements-to-read-throughput-bulk-writes-and-more/ Wed, 02 Oct 2024 16:58:43 +0000

MongoDB has announced the release of the latest version of its database platform—MongoDB 8.0. According to the company, this release offers significant performance improvements compared to MongoDB 7.0, such as 36% better read throughput, 56% faster bulk writes, 20% faster concurrent writes during replication, and 200% faster handling of higher volumes of time series data, alongside lower resource usage and costs. 

Some of these performance optimizations were gained through architectural changes that reduce memory usage and query times. It can also now perform more efficient batch processing for inserts, updates, and deletes.

This release also features faster and more cost-efficient horizontal scaling, which allows data to be split — or “sharded” — across multiple servers (shards). In this release, data can be distributed across shards 50x faster for up to 50% lower cost compared to MongoDB 7.0.

Another notable update is that users now have more control over optimizing performance during unpredictable usage spikes and periods of sustained high demand. They can now set a default maximum time limit for running queries, reject recurring types of queries that have caused issues, and set query settings that will persist through events like database restarts.
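The new defaults and persisted settings described above are configured on the server side. As a small point of reference from the driver side, the long-standing per-operation time limit looks like this with PyMongo; the database, collection, and threshold below are made up for the example, and this per-query limit predates the 8.0 additions.

```python
# Cap a single query's server-side execution time with PyMongo.
# This shows only the per-operation limit that drivers have long supported;
# MongoDB 8.0's cluster-wide defaults and persisted query settings are separate.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

# Abort this query on the server if it runs longer than 200 ms.
cursor = orders.find({"status": "pending"}).max_time_ms(200)
for doc in cursor:
    print(doc["_id"])
```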

MongoDB 8.0 also offers better support for search and AI applications through quantized vectors, which are “compressed representations of full-fidelity vectors.” According to the company, these require very little memory and are very fast to retrieve—all while preserving accuracy. 
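Quantization itself is a general technique, independent of how MongoDB implements it: full-precision float values are mapped to a much smaller integer range so each vector takes far less memory. A toy sketch of symmetric scalar quantization to int8 (not MongoDB’s code, just the idea, with an arbitrary 768-dimension vector):

```python
import numpy as np

def quantize_int8(vec: np.ndarray) -> tuple[np.ndarray, float]:
    """Toy symmetric scalar quantization: float32 values -> int8 plus a scale factor."""
    scale = max(float(np.abs(vec).max()) / 127.0, 1e-12)
    q = np.clip(np.round(vec / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Approximate reconstruction of the original vector."""
    return q.astype(np.float32) * scale

vec = np.random.rand(768).astype(np.float32)
q, scale = quantize_int8(vec)
print(q.nbytes, "bytes instead of", vec.nbytes)  # 768 vs 3072: a 4x memory reduction
```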

And finally, it is introducing updates to MongoDB Queryable Encryption, which is a capability that allows users to encrypt sensitive data, store it in their database, and run queries on the encrypted data. Now users can perform range queries, which further reduces the risk of data exposure and exfiltration by keeping it encrypted throughout its full lifecycle. 

According to MongoDB, it has already moved its internal build system over to MongoDB 8.0 and its development team has seen a 75% query latency reduction since switching over. “This was a double win, as it improved the performance of our own tooling, and it set our performance chat room abuzz with excitement in anticipation of delighting external customers. While results may vary based on your particular workload, the point is that we just couldn’t wait to share MongoDB 8.0’s performance gains with customers,” Jim Scharf, CTO of MongoDB, wrote in a blog post.

New Chrome security features seek to better protect user privacy https://sdtimes.com/security/new-chrome-security-features-seek-to-better-protect-user-privacy/ Fri, 13 Sep 2024 15:21:23 +0000

Google is announcing several new Chrome features aimed at better protecting users as they browse the web. 

Safety Check — a tool that checks for compromised passwords, Chrome updates, and other potential security issues in the browser — has been updated to run automatically in the background so that it can be more proactive in protecting users. 

It will now inform users whenever it takes actions, such as revoking permissions from sites that haven’t been visited in a while or flagging potentially unwanted notifications. 

Safety Check also now automatically revokes notification permissions for a site if Google Safe Browsing determines that site deceived users into granting permission in the first place.

In a similar vein, Android users will now be able to unsubscribe from site notifications in one click by tapping the “Unsubscribe” button that will now appear in the notifications drawer. This feature is now available on Pixel devices and will be available on more Android devices down the line.

“This feature has already resulted in a 30 percent reduction in notification volume on supported Pixel devices, and we’re looking forward to bringing it to the broader ecosystem,” Andrew Kamau, product manager from Chrome at Google, wrote in a blog post.

And finally, Chrome will now offer the ability for users to grant website permissions for a single visit to the site. For instance, a user could grant the site access to the phone’s mic, and then once the user leaves the site, Chrome revokes the permission and the site will have to ask again the next time they visit. 

“With these new features, you can continue to rely on Chrome for a safer browsing experience that gives you even more control over how you explore the internet,” Kamau concluded.

Three considerations to assess your data’s readiness for AI https://sdtimes.com/ai/three-considerations-to-assess-your-datas-readiness-for-ai/ Wed, 11 Sep 2024 15:00:54 +0000

Organizations are getting caught up in the hype cycle of AI and generative AI, but in so many cases, they don’t have the data foundation needed to execute AI projects. A third of executives think that less than 50% of their organization’s data is consumable, emphasizing the fact that many organizations aren’t prepared for AI. 

For this reason, it’s critical to lay the right groundwork before embarking on an AI initiative. As you assess your readiness, here are the primary considerations: 

  • Availability: Where is your data? 
  • Catalog: How will you document and harmonize your data?
  • Quality: Is your data of high enough quality to support your AI initiatives?

AI underscores the garbage in, garbage out problem: if you input data into the AI model that’s poor-quality, inaccurate or irrelevant, your output will be, too. These projects are far too involved and expensive, and the stakes are too high, to start off on the wrong data foot.

The importance of data for AI

Data is AI’s stock-in-trade; it is trained on data and then processes data for a designed purpose. When you’re planning to use AI to help solve a problem – even when using an existing large language model, such as a generative AI tool like ChatGPT – you’ll need to feed it the right context for your business (i.e., good data) to tailor the answers for your business context (e.g., for retrieval-augmented generation). It’s not simply a matter of dumping data into a model.

And if you’re building a new model, you have to know what data you’ll use to train it and validate it. That data needs to be split so you can train the model on one dataset, then validate it against a separate dataset to determine whether it’s working.
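As a minimal illustration of that split, here is one common way to hold out a validation set in Python with scikit-learn; the synthetic dataset and the 80/20 ratio are arbitrary stand-ins, not a prescription from the author.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in dataset: 1,000 rows with 20 features and a binary label.
X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=1000)

# Hold out 20% for validation so the model is judged on data it never saw in training.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_val.shape)  # (800, 20) (200, 20)
```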

Challenges to establishing the right data foundation

For many companies, knowing where their data is and the availability of that data is the first big challenge. If you already have some level of understanding of your data – what data exists, what systems it exists in, what the rules are for that data and so on – that’s a good starting point. The fact is, though, that many companies don’t have this level of understanding.

Data isn’t always readily available; it may be residing in many systems and silos. Large companies in particular tend to have very complicated data landscapes. They don’t have a single, curated database where everything that the model needs is nicely organized in rows and columns where they can just retrieve it and use it. 

Another challenge is that the data is not just in many different systems but in many different formats. There are SQL databases, NoSQL databases, graph databases, and data lakes, and sometimes data can only be accessed via proprietary application APIs. There’s structured data, and there’s unstructured data. There’s some data sitting in files, and maybe some is coming from your factories’ sensors in real time, and so on. Depending on what industry you’re in, your data can come from a plethora of different systems and formats. Harmonizing that data is difficult; most organizations don’t have the tools or systems to do that.

Even if you can find your data and put it into one common format (canonical model) that the business understands, now you have to think about data quality. Data is messy; it may look fine from a distance, but when you take a closer look, this data has errors and duplications because you’re getting it from multiple systems and inconsistencies are inevitable. You can’t feed the AI with training data that is of low quality and expect high-quality results. 

How to lay the right foundation: Three steps to success

The first brick of the AI project’s foundation is understanding your data. You must have the ability to articulate what data your business is capturing, what systems it’s living in, how it’s physically implemented versus the business’s logical definition of it, and what the business rules for it are.

Next, you must be able to evaluate your data. That comes down to asking, “What does good data for my business mean?” You need a definition for what good quality looks like, and you need rules in place for validating and cleansing it, and a strategy for maintaining the quality over its lifecycle.
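What those validation and cleansing rules look like depends entirely on the business; the small pandas sketch below is one hypothetical example, with invented column names and thresholds, of turning a quality definition into checks and a cleansing pass.

```python
import pandas as pd

# Invented sample data with the kinds of problems described above.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@example.com", None, "b@example.com", "b@example.com"],
    "age": [34, -1, 29, 29],
})

# Example quality rules: required fields present, values plausible, keys unique.
issues = {
    "missing_email": int(df["email"].isna().sum()),
    "age_out_of_range": int((~df["age"].between(0, 120)).sum()),
    "duplicate_ids": int(df["customer_id"].duplicated().sum()),
}
print(issues)  # {'missing_email': 1, 'age_out_of_range': 1, 'duplicate_ids': 1}

# One possible cleansing pass: drop duplicate keys and rows that break the rules.
clean = (
    df.drop_duplicates(subset="customer_id")
      .dropna(subset=["email"])
      .query("0 <= age <= 120")
)
```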

If you’re able to get the data in a canonical model from heterogeneous systems and you wrangle with it to improve the quality, you still have to address scalability. This is the third foundational step. Many models require a lot of data to train them; you also need lots of data for retrieval-augmented generation, which is a technique for enhancing generative AI models using information obtained from external sources that weren’t included in training the model.  And all of this data is continuously changing and evolving.

You need a methodology for how to create the right data pipeline that scales to handle the load and volume of the data you might feed into it. Initially, you’re so bogged down by figuring out where to get the data from, how to clean it and so on that you might not have fully thought through how challenging it will be when you try to scale it with continuously evolving data. So, you have to consider what platform you’re using to build this project so that that platform is able to then scale up to the volume of data that you’ll bring into it.

Creating the environment for trustworthy data

When working on an AI project, treating data as an afterthought is a sure recipe for poor business outcomes. Anyone who is serious about building and sustaining a business edge by developing and using  AI must start with the data first. The complexity and the challenge of cataloging and readying the data to be used for business purposes is a huge concern, especially because time is of the essence. That’s why you don’t have time to do it wrong; a platform and methodology that help you maintain high-quality data is foundational. Understand and evaluate your data, then plan for scalability, and you will be on your way to better business outcomes.

Podcast: How time series data is revolutionizing data management https://sdtimes.com/data/podcast-how-time-series-data-is-revolutionizing-data-management/ Wed, 04 Sep 2024 19:23:32 +0000

Time series data is an important component of making IoT devices like smart cars or medical equipment work properly, because it captures measurements based on time values.

To learn more about the crucial role time series data plays in today’s connected world, we invited Evan Kaplan, CEO of InfluxData, onto our podcast to talk about this topic.

Here is an edited and abridged version of that conversation:

What is time series data?

It’s actually fairly easy to understand. It’s basically the idea that you’re collecting measurements or instrumentation based on time values. The easiest way to think about it is, say, sensors, sensor analytics, or things like that. Sensors could measure pressure, volume, temperature, humidity, light, and it’s usually recorded as a time-based measurement, a time stamp, if you will, every 30 seconds or every minute or every nanosecond. The idea is that you’re instrumenting systems at scale, and so you want to watch how they perform. One, to look for anomalies, but two, to train future AI models and things like that.

And so that instrumentation stuff is done, typically, with a time series foundation. In the years gone by it might have been done on a general database, but increasingly, because of the amount of data that’s coming through and the real time performance requirements, specialty databases have been built.  A specialized database to handle this sort of stuff really changes the game for system architects building these sophisticated real time systems.

So let’s say you have a sensor in a medical device, and it’s just throwing data off, as you said, rapidly. Now, is it collecting all of it, or is it just flagging when an anomaly comes along?

It’s both about data in motion and data at rest. So it’s collecting the data, and there are some applications that we support that handle billions of points per second — think hundreds or thousands of sensors reading every 100 milliseconds. And we’re looking at the data as it’s being written, and it’s available for being queried almost instantly. There’s almost zero time, but it’s a database, so it stores the data, it holds the data, and it’s capable of long term analytics on the same data.

So storage, is that a big issue? If all this data is being thrown off, and if there are no anomalies, you could be collecting hours of data in which nothing has changed?

If you’re getting data — some regulated industries require that you keep this data around for a really long period of time — it’s really important that you’re skillful at compressing it. It’s also really important that you’re capable of delivering an object storage format, which is not easy for a performance-based system, right? And it’s also really important that you be able to downsample it. And downsample means we’re taking measurements every 10 milliseconds, but every 20 minutes, we want to summarize that. We want to downsample it to look for the signal that was in that 10 minute or 20 minute window. And we downsample it and evict a lot of data and just keep the summary data. So you have to be very good at that kind of stuff. Most databases are not good at eviction or downsampling, so it’s a really specific set of skills that makes it highly useful, not just us, but our competitors too. 
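As a rough illustration of that idea (not InfluxDB’s actual mechanism), downsampling 10-millisecond readings into 20-minute summaries might look like this in pandas; the sensor values and window sizes are invented for the example.

```python
import numpy as np
import pandas as pd

# Simulated sensor: one reading every 10 ms for one hour (360,000 points).
idx = pd.date_range("2024-09-04", periods=360_000, freq="10ms")
raw = pd.Series(20 + np.random.randn(360_000), index=idx, name="temperature")

# Downsample: keep 20-minute summaries, after which the raw points can be evicted.
summary = raw.resample("20min").agg(["mean", "min", "max"])
print(summary)
```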

We were talking about edge devices and now artificial intelligence coming into the picture. So how does time series data augment those systems? Benefit from those advances? Or how can they help move things along even further?

I think it’s pretty darn fundamental. The concept of time series data has been around for a long time. So if you built a system 30 years ago, it’s likely you built it on Oracle or Informix or IBM Db2. The canonical example is financial Wall Street data, where you know how stocks are trading one minute to the next, one second to the next. So it’s been around for a really long time. But what’s new and different about the space is we’re sensifying the physical world at an incredibly fast pace. You mentioned medical devices, but smart cities, public transportation, your cars, your home, your industrial factories, everything’s getting sensored — I know that’s not a real word, but easy to understand.

And so sensors speak time series. That’s their lingua franca. They speak pressure, volume, humidity, temperature, whatever you’re measuring over time. And it turns out, if you want to build a smarter system, an intelligent system, it has to start with sophisticated instrumentation. So I want to have a very good self-driving car, so I want to have a very, very high resolution picture of what that car is doing and what that environment is doing around the car at all times. So I can train a model with all the potential awareness that a human driver or better, might have in the future. In order to do that, I have to instrument. I then have to observe, and then have to re-instrument, and then I have to observe. I run that process of observing, correcting and re-instrumenting over and over again 4 billion times. 

So what are some of the things that we might look forward to in terms of use cases? You mentioned a few of them now with, you know, cities and cars and things like that. So what other areas are you seeing that this can also move into?

So first of all, where we were really strong is energy, aerospace, financial trading, and network telemetry. Our largest customers are everybody from JPMorgan Chase to AT&T to Salesforce to a variety of stuff. So it’s a horizontal capability, that instrumentation capability.

I think what’s really important about our space, and becoming increasingly relevant, is the role that time series data plays in AI, and really the importance of understanding how systems behave. Essentially, what you’re trying to do with AI is you’re trying to say what happened to train your model and what will happen to get the answers from your model and to get your system to perform better. 

And so, “what happened?” is our lingua franca, that’s a fundamental thing we do, getting a very good picture of everything that’s happening around that sensor around that time, all that sort of stuff, collecting high resolution data and then feeding that to training models where people do sophisticated machine learning or robotics training models and then to take action based on that data. So without that instrumentation data, the AI stuff is basically without the foundational pieces, particularly the real world AI, not necessarily talking about the generative LLMs, but I’m talking about cars, robots, cities, factories, healthcare, that sort of stuff.

Pinecone previews new bulk import feature for its serverless offering https://sdtimes.com/data/pinecone-previews-new-bulk-import-feature-for-its-serverless-offering/ Tue, 27 Aug 2024 14:57:32 +0000

Pinecone, a vector database for scaling AI, is introducing a new bulk import feature to make it easier to ingest large amounts of data into its serverless infrastructure. 

According to the company, this new feature, now in early access, is useful in scenarios when a team would want to import over 100 million records (though it currently has a 200 million record limit), onboard a known or new tenant, or migrate production workloads from another provider into Pinecone. 

The company claims that bulk import results in six times lower ingestion costs than comparable upsert-based processes. It costs $1.00/GB; for instance, ingesting 10 million 768-dimensional records costs $30 with bulk import.
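That figure is easy to sanity-check, assuming each dimension is stored as a 4-byte float32 value (an assumption made here for the arithmetic; Pinecone’s billing may count bytes differently):

```python
# Back-of-envelope check of the quoted bulk import cost, assuming float32 values.
records = 10_000_000
dims = 768
bytes_per_value = 4

gigabytes = records * dims * bytes_per_value / 1e9
print(f"{gigabytes:.1f} GB -> ${gigabytes * 1.00:.2f} at $1.00/GB")  # ~30.7 GB -> ~$30.72
```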

RELATED: Pros and cons of 5 AI/ML workflow tools for data scientists today

Because it is an asynchronous, long-running process, customers don’t have to performance tune or monitor the status of their imports; Pinecone takes care of it in the background. 

During the import process, data is read from a secure bucket in the customer’s object storage, which provides them with control over data access, including the ability to revoke Pinecone’s access at any time.

While in early access, Pinecone is limiting bulk import to writing records into a new serverless namespace, meaning that data cannot currently be imported into existing namespaces. Additionally, bulk import is limited to Amazon S3 for serverless AWS regions, but the company will be adding support for Google Cloud Storage and Azure Blob Storage in a couple of weeks.

Pinecone serverless now GA on Google Cloud, Microsoft Azure

Adding to the existing AWS support, Pinecone serverless is now generally available on both Google Cloud and Microsoft Azure.

Google Cloud support is available in us-central1 (Iowa) and europe-west4 (Netherlands), and Microsoft Azure support is available in eastus2 (Virginia), with additional regions coming soon to both clouds.  

This availability also comes with new features in early access, such as backups for serverless indexes for all three clouds available for Standard and Enterprise users, and more granular access controls for the Control Plane and Data Plane, including NoAccess, ReadOnly, and ReadWrite. Pinecone will also add more user roles — Org Owner, Billing Admin, Org Manager, and Org Member — at the Organization and Project levels in a couple of weeks. 

“Bringing Pinecone’s serverless vector database to Google Cloud Marketplace will help customers quickly deploy, manage, and grow the platform on Google Cloud’s trusted, global infrastructure,” said Dai Vu, managing director of Marketplace & ISV GTM Programs at Google Cloud. “Pinecone customers can now easily build knowledgeable AI applications securely and at scale as they progress their digital transformation journeys.”

Pros and cons of 5 AI/ML workflow tools for data scientists today https://sdtimes.com/ai/pros-and-cons-of-5-ai-ml-workflow-tools-for-data-scientists-today/ Fri, 09 Aug 2024 16:17:37 +0000

With businesses uncovering more and more use cases for artificial intelligence and machine learning, data scientists find themselves looking closely at their workflow. There are a myriad of moving pieces in AI and ML development, and they all must be managed with an eye on efficiency and flexible, strong functionality. The challenge now is to evaluate what tools provide which functionalities, and how various tools can be augmented with other solutions to support an end-to-end workflow. So let’s see what some of these leading tools can do.

DVC

DVC offers the capability to manage text, image, audio, and video files across the ML modeling workflow.

The pros: It’s open source, and it has solid data management capacities. It offers custom dataset enrichment and bias removal. It also logs changes in the data quickly, at natural points during the workflow. While you’re using the command line, the process feels quick. And DVC’s pipeline capabilities are language-agnostic.

The cons: DVC’s AI workflow capabilities are limited – there’s no deployment functionality or orchestration. While the pipeline design looks good in theory, it tends to break in practice. There’s no ability to set credentials for object storage as a configuration file, and there’s no UI – everything must be done through code.

MLflow

MLflow is an open-source tool, built on an MLOps platform. 

The pros: Because it’s open source, it’s easy to set up, and requires only one install. It supports all ML libraries, languages, and code, including R. The platform is designed for end-to-end workflow support for modeling and generative AI tools. And its UI feels intuitive, as well as easy to understand and navigate. 

The cons: MLflow’s AI workflow capacities are limited overall. There’s no orchestration functionality, limited data management, and limited deployment functionality. The user has to exercise diligence while organizing work and naming projects – the tool doesn’t support subfolders. It can track parameters, but doesn’t track all code changes – although Git Commit can provide the means for work-arounds. Users will often combine MLflow and DVC to force data change logging. 
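For context on what that parameter and metric tracking looks like in practice, here is a minimal MLflow logging sketch; the experiment name, parameters, and values are invented for illustration.

```python
import mlflow

# Runs are grouped under an experiment name (MLflow has no subfolders,
# so naming discipline matters, as noted above).
mlflow.set_experiment("churn-model")

with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("learning_rate", 0.01)   # parameters are tracked per run
    mlflow.log_metric("rmse", 0.42)           # metrics are tracked per run
    mlflow.set_tag("git_commit", "abc1234")   # tying runs to code changes still leans on Git
```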

Weights & Biases

Weights & Biases is a solution primarily used for MLOps. The company recently added a solution for developing generative AI tools.

The pros: Weights & Biases offers automated tracking, versioning, and visualization with minimal code. As an experiment management tool, it does excellent work. Its interactive visualizations make experiment analysis easy. Collaboration functions allow teams to efficiently share experiments and collect feedback for improving future experiments. And it offers strong model registry management, with dashboards for model monitoring and the ability to reproduce any model checkpoint. 

The cons: Weights & Biases is not open source. There are no pipeline capabilities within its own platform – users will need to turn to PyTorch and Kubernetes for that. Its AI workflow capabilities, including orchestration and scheduling functions, are quite limited. While Weights & Biases can log all code and code changes, that function can simultaneously create unnecessary security risks and drive up the cost of storage. Weights & Biases lacks the ability to manage compute resources at a granular level. For granular tasks, users need to augment it with other tools or systems.

Slurm

Slurm promises workflow management and optimization at scale. 

The pros: Slurm is an open source solution, with a robust and highly scalable scheduling tool for large computing clusters and high-performance computing (HPC) environments. It’s designed to optimize compute resources for resource-intensive AI, HPC, and HTC (High Throughput Computing) tasks. And it delivers real-time reports on job profiling, budgets, and power consumption for resources needed by multiple users. It also comes with customer support for guidance and troubleshooting. 

The cons: Scheduling is the only piece of AI workflow that Slurm solves. It requires a significant amount of Bash scripting to build automations or pipelines. It can’t boot up different environments for each job, and can’t verify all data connections and drivers are valid. There’s no visibility into Slurm clusters in progress. Furthermore, its scalability comes at the cost of user control over resource allocation. Jobs that exceed memory quotas or simply take too long are killed with no advance warning.  

ClearML  

ClearML offers scalability and efficiency across the entire AI workflow, on a single open source platform. 

The pros: ClearML’s platform is built to provide end-to-end workflow solutions for GenAI, LLMops and MLOps at scale. For a solution to truly be called “end-to-end,” it must be built to support workflow for a wide range of businesses with different needs. It must be able to replace multiple stand-alone tools used for AI/ML, but still allow developers to customize its functionality by adding additional tools of their choice, which ClearML does.  ClearML also offers out-of-the-box orchestration to support scheduling, queues, and GPU management. To develop and optimize AI and ML models within ClearML, only two lines of code are required. Like some of the other leading workflow solutions, ClearML is open source. Unlike some of the others, ClearML creates an audit trail of changes, automatically tracking elements data scientists rarely think about – config, settings, etc. – and offering comparisons. Its dataset management functionality connects seamlessly with experiment management. The platform also enables organized, detailed data management, permissions and role-based access control, and sub-directories for sub-experiments, making oversight more efficient.

One important advantage ClearML brings to data teams is its security measures, which are built into the platform. Security is no place to slack, especially while optimizing workflow to manage larger volumes of sensitive data. It’s crucial for developers to trust their data is private and secure, while accessible to those on the data team who need it.
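As a rough illustration of the “two lines of code” point above, registering a training script with ClearML’s SDK looks roughly like the sketch below; the project and task names are placeholders, and everything beyond Task.init is the user’s own training code.

```python
from clearml import Task

# The "two lines": import the SDK and register this run as a ClearML task.
# Project and task names here are placeholders.
task = Task.init(project_name="demo-project", task_name="baseline-experiment")

# From here, ClearML's auto-logging picks up supported frameworks, arguments,
# and console output; the rest of the training script stays unchanged.
```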

The cons: While being designed by developers, for developers, has its advantages, ClearML’s model deployment is done not through a UI but through code. Naming conventions for tracking and updating data can be inconsistent across the platform. For instance, the user will “report” parameters and metrics, but “register” or “update” a model. And it does not support R, only Python.

In conclusion, the field of AI/ML workflow solutions is a crowded one, and it’s only going to grow from here. Data scientists should take the time today to learn about what’s available to them, given their teams’ specific needs and resources.


You may also like…

Data scientists and developers need a better working relationship for AI

How to maximize your ROI for AI in software development

pgEdge introduces advanced logical replication features in v24.7 https://sdtimes.com/data/pgedge-introduces-advanced-logical-replication-features-in-v24-7/ Wed, 07 Aug 2024 16:43:04 +0000

The open-source distributed PostgreSQL platform, pgEdge, has a new release with advanced logical replication features, large object support, and improved error handling.

“These enhancements make pgEdge an even more powerful alternative for legacy multi-master replication technologies, offering greater throughput, flexibility, and control for users,” Phillip Merrick, co-founder and CEO of pgEdge, wrote in a blog post.

pgEdge 24.7 (also called the Constellation Release) adds support for PostgreSQL’s large object logical replication (LOLOR) extension, which provides compatibility for large objects to undergo logical replication. According to pgEdge, this support will allow for “smoother transitions from legacy databases to PostgreSQL without requiring application modifications.”

The release also includes a new advanced error handling and logging mechanism, where replication errors get logged in an exception table, which prevents disruptions to the replication process. 

Replication repair mode can also now be toggled on and off, enabling users to have more control over the replication process. pgEdge explained that this is helpful for controlling replication changes during error resolution.

Another new feature is automatic replication of DDL commands across all cluster nodes, which improves the process of updating database schema. 

And finally, there is a new extension for Snowflake sequences that ensures unique sequence numbers for different regions without needing to make modifications to code or schemas. 

The company also revealed that later this year, it will introduce parallel replication to significantly improve replication throughput. “It promises to reduce replication lag, ensuring timely data synchronization across nodes and maintaining data consistency even in high-demand environments,” Merrick wrote. 


You may also like…

Data scientists and developers need a better working relationship for AI

Software engineering leaders must act to manage integration technical debt
