
DeepSeek's Janus Pro, OpenAI o3-mini, Mistral's 24B Parameter Model, and More

by This Week in AI Engineering, February 10th, 2025



Hello AI Enthusiasts!


Welcome to the fifth edition of "This Week in AI Engineering"!


This week, we’re covering DeepSeek’s new Janus-Pro, a multimodal AI model; OpenAI’s o3-mini, with faster reasoning; and Mistral Small 3, which sets a new bar for model efficiency.


We’ll be getting into all these updates along with some must-know tools to make developing AI agents and apps easier.

Janus-Pro: DeepSeek's New Multimodal AI with Unified Transformer Processing

DeepSeek has unveiled Janus-Pro, an advanced open-source multimodal AI model that significantly outperforms current industry leaders in both image generation and visual understanding tasks while maintaining MIT licensing for commercial use.


Technical Architecture:

  • Model Variants: Available in 1B and 7B parameter versions for flexible deployment options
  • Processing Pipeline: Integrated transformer architecture handling both understanding and generation tasks
  • Resolution Support: Native 1024x1024 image generation with 2.4s average inference time


Performance Metrics:

  • DPG-Bench: 84.2% accuracy, surpassing DALL-E 3 (83.5%)
  • GenEval: 80.0% overall score on text-to-image generation
  • Cross-Model Comparison: Outperforms Show-o (46%), VILA-U (60%), and Emu3-Chat (58%) on multimodal understanding
  • Resource Efficiency: 7B model achieves SOTA while maintaining practical deployment requirements

Integration Features:


  • Open-Source Deployment: Full MIT license with commercial use rights
  • API Access: Comprehensive SDK with Python and REST endpoints
  • Platform Support: Direct integration through HuggingFace and GitHub (a loading sketch follows this list)
  • Documentation: Extensive implementation guides and example code
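
For a quick start, a minimal loading sketch with Hugging Face transformers might look like the following. The checkpoint id matches DeepSeek's published release, but the Auto* classes are an assumption: DeepSeek's official janus package ships dedicated model and processor classes, so consult the repo's example code for production use.

```python
# Minimal sketch: pulling Janus-Pro from Hugging Face with transformers.
# The Auto* classes and trust_remote_code flow are assumptions; DeepSeek's
# official repo ships dedicated Janus classes that may be required instead.
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "deepseek-ai/Janus-Pro-7B"  # assumed Hugging Face checkpoint id
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
```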

OpenAI o3-mini Released: 2500ms Faster Time-to-First-Token

OpenAI has introduced o3-mini, its newest reasoning-optimized model, delivering o1-level performance. The release offers three reasoning-effort levels (low, medium, high) for tuning the tradeoff between performance and speed.


Technical Architecture:

  • Developer Integration: First small reasoning model to support function calling and Structured Outputs (see the sketch after this list)
  • Search Enhancement: Early prototype of web search integration with automated citation linking
  • Enterprise Ready: Full API access with 150 messages/day allocation for Plus/Team users
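
To make the effort levels concrete, here is a minimal call through the official OpenAI Python SDK; the prompt is illustrative, while the model name and reasoning_effort parameter follow OpenAI's documented API.

```python
# Minimal sketch: calling o3-mini with an explicit reasoning-effort level.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="high",  # one of "low", "medium", "high"
    messages=[{"role": "user", "content": "How many primes are below 100?"}],
)
print(response.choices[0].message.content)
```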


Performance Metrics:

  • STEM Excellence: 87.3% accuracy on AIME 2024 (vs 83.3% o1) with high reasoning effort
  • Code Generation: 2130 Codeforces ELO rating, setting new benchmarks for efficient models
  • PhD-level Tasks: 79.7% accuracy on GPQA Diamond, matching full-scale models


Core Features:

  • Cost Optimization: Maintains o1-level reasoning while significantly reducing compute requirements

  • Production Support: Native integration across ChatGPT, Assistants API, and Batch API systems


Mistral Small 3: 24B Parameter Model Achieves 3x Speed with Apache 2.0 License

Mistral AI has unveiled Small 3, a high-efficiency language model that matches the performance of 70B parameter competitors while delivering 150 tokens/s throughput. This open-source release under the Apache 2.0 license marks a significant advancement in model optimization.


Technical Architecture:

  • Streamlined Layer Design: Reduced parameter count while maintaining SOTA performance
  • Optimized Inference: Custom architecture delivering 11-12ms latency per token
  • Resource Efficiency: Full model runs on single RTX 4090 or 32GB MacBook
  • Memory Management: Advanced parameter activation for reduced compute requirements


Performance Metrics:

  • MMLU Score: 81% accuracy matching Llama 3.3 70B
  • Speed Advantage: 3x faster than Llama 3.3 on identical hardware
  • Human Evaluation: Outperforms larger models in blind tests


Core Features:

  • Platform Access: Available through Hugging Face, Ollama, Kaggle, and major cloud providers
  • Enterprise Focus: Optimized for fraud detection, medical triage, and robotics applications
  • Developer Tools: Full API access through la Plateforme with extensive documentation (a quick-start sketch follows this list)
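
As a quick-start sketch against la Plateforme using the official mistralai SDK, something like the following should work; the mistral-small-latest alias is our assumption for the Small 3 endpoint, so confirm the exact model name in Mistral's docs.

```python
# Minimal sketch: querying Mistral Small 3 via la Plateforme.
import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

response = client.chat.complete(
    model="mistral-small-latest",  # assumed alias for Small 3
    messages=[{"role": "user", "content": "Explain tokenization in one line."}],
)
print(response.choices[0].message.content)
```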

Gemini 2.0 Achieves 27% Bug Report Automation with Native Video Processing

Gemini 2.0's video analysis capabilities enable the automated generation of technical bug reports from browser sessions and DevTools data. The system uses native video processing to create precise, developer-friendly bug documentation from raw session recordings.


Technical Architecture:

  • Video Analysis: Direct processing of session recordings without additional latency
  • Automated Tracking: Timestamped reproduction steps linked to session playback
  • DevTools Integration: Real-time capture of console logs and network data
  • Ticket Creation: Direct integration with 9+ issue-tracking platforms


Performance and Features:

  • Bug Report Automation: 27% of early access reports fully generated by AI
  • Processing Speed: Single-click report generation from session data
  • Integration Coverage: Support for Jira, Linear, and 7 additional platforms
  • Accuracy Rate: Precise step reproduction with timestamp synchronization


The model generates reproduction steps with integrated video timestamps, enabling instant navigation to specific moments in session recordings. Its concise reporting style eliminates traditional documentation bloat, allowing developers to quickly grasp and reproduce issues without parsing through excessive text. Developers can check it out HERE.


Berkeley's $30 DeepSeek Replication: Breaking the Cost Barrier in AI Research

Berkeley researchers have demonstrated that DeepSeek R1's core reasoning capabilities can be reproduced for just $30, using a 3B parameter model and reinforcement learning. This breakthrough challenges the notion that advanced AI requires expensive hardware like H100 GPUs.


Findings

  • Training Strategy: A base language model learns through Countdown game interactions
  • Reinforcement Method: Combines structured prompts with ground-truth rewards for iterative improvement (a reward sketch follows this list)
  • Verification System: The model develops self-checking abilities through trial and error
  • Learning Pipeline: Progressive scaling from 0.5B to 3B parameters unlocks advanced reasoning
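
To make "ground-truth rewards" concrete, below is a sketch of the kind of verifiable reward a Countdown setup can use. The function name and exact scoring are illustrative assumptions, not the Berkeley implementation; the key idea is that correctness is machine-checkable: the proposed expression must use only the given numbers and evaluate to the target.

```python
# Illustrative sketch of a verifiable Countdown reward (names and scoring
# are assumptions, not the Berkeley team's exact implementation).
import re

def countdown_reward(expression: str, numbers: list[int], target: int) -> float:
    """Return 1.0 if `expression` uses the given numbers (each at most once)
    and evaluates to `target`; otherwise 0.0."""
    if not re.fullmatch(r"[\d+\-*/() ]+", expression):
        return 0.0  # reject anything that is not plain arithmetic
    pool = list(numbers)
    for n in (int(tok) for tok in re.findall(r"\d+", expression)):
        if n in pool:
            pool.remove(n)
        else:
            return 0.0  # used a number not provided, or too many times
    try:
        value = eval(expression)  # charset was restricted above
    except (SyntaxError, ZeroDivisionError):
        return 0.0
    return 1.0 if value == target else 0.0

print(countdown_reward("6 * 4", [3, 4, 6], 24))  # -> 1.0
```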


Performance Metrics:

  • Training Time: The complete experiment runs in under 19 hours
  • Algorithm Testing: Consistent performance across PPO, GRPO, and PRIME variants
  • Learning Outcome: Matches DeepSeek R1-Zero's problem-solving capabilities
  • Resource Usage: Runs on consumer-grade hardware versus H100 requirements


Development Features:

  • Problem Solving: Model learns to break down complex calculations like multiplication
  • Task Adaptation: Develops specific strategies for different mathematical challenges
  • Open Source: Complete implementation available on GitHub

Tülu 3 Scales to 405B: AI2's Latest Model Challenges DeepSeek V3

AI2 has released Tülu 3 405B, scaling up their successful open-source recipe to build the largest transparent language model to date. With a novel RLVR training approach and full 405B parameter architecture, the model demonstrates that open development can match and exceed closed-source alternatives.


Technical Architecture:

  • Massive Scale Deployment: The model leverages 32 nodes with 256 GPUs in parallel, using vLLM for efficient 16-way tensor parallelism (see the serving sketch after this list)
  • Advanced Weight Management: Implements NCCL broadcast system for seamless weight synchronization
  • Optimized Training: Utilizes 240 GPUs for training while maintaining 16-way inference parallelism
  • Resource Optimization: Employs 8B value model to reduce RLVR computational costs
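
For a rough picture of the serving side, here is a minimal vLLM sketch with 16-way tensor parallelism; the Hugging Face checkpoint id is our assumption, and at 405B parameters each replica still needs 16 high-memory GPUs.

```python
# Minimal sketch: loading a Tulu 3-scale checkpoint with vLLM tensor
# parallelism (checkpoint id is an assumed Hugging Face name).
from vllm import LLM, SamplingParams

llm = LLM(
    model="allenai/Llama-3.1-Tulu-3-405B",  # assumed HF id
    tensor_parallel_size=16,                # shard weights across 16 GPUs
)
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Summarize RLVR in two sentences."], params)
print(outputs[0].outputs[0].text)
```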


Performance Metrics:

  • Base Evaluations: Achieves 88.4% on IFEval and surpasses previous open models on key benchmarks
  • RLVR Enhancement: Shows significant MATH performance gains at 405B scale, similar to DeepSeek-R1 findings
  • Safety Standards: Maintains 86.8% accuracy on comprehensive safety evaluations
  • Processing Speed: Completes inference in 550 seconds with 25-second weight transfers

Kimi k1.5: Advanced Reinforcement Learning Scales to Match o1 Performance

Moonshot AI has released Kimi k1.5, an LLM leveraging reinforcement learning from verifiable rewards (RLVR) to achieve o1-level reasoning without massive compute requirements. The model surpasses GPT-4o and Claude 3.5 Sonnet on key STEM benchmarks while maintaining efficient deployment capabilities.


Technical Architecture:

  • Training Framework: Novel RLVR system for verifiable rewards and self-improvement
  • Context Window: Extended 128k token processing for comprehensive reasoning
  • Parameter Design: Streamlined architecture requiring minimal computational resources


Performance Metrics:

  • AIME Benchmark: 77.5% accuracy (vs GPT-4o's 9.3%)
  • MATH-500: 96.2% score, leading performance in mathematical reasoning
  • Codeforces: 94th percentile ranking in competitive programming
  • MathVista: 74.9 points demonstrating strong multi-modal capabilities


The model validates that strategic reinforcement learning and architecture optimization can match the performance of much larger models, marking a potential shift in scaling approaches.


UI-TARS: ByteDance's GUI Agent Achieves SOTA Performance with Unified Architecture

ByteDance has open-sourced UI-TARS, integrating perception, reasoning, and action capabilities into a single model for automated GUI interaction. Built on Qwen2-VL architecture, the model demonstrates unprecedented performance in automated interface testing and real-world task completion.


Technical Architecture:

  • Single Model Integration: End-to-end processing eliminates the need for separate perception and action models
  • Unified Action Space: Native support for clicks, typing, scrolling, and platform-specific gestures
  • Memory Management: Real-time context tracking with long-term task knowledge retention


Performance Metrics:

  • Android Tasks: 98.1% accuracy on UI element detection
  • Desktop Testing: 95.9% success rate in application control
  • Web Benchmarks: 93.6% score on automated browsing tests
  • Cross-Platform: 91.3% average on combined environment tasks


Key Features:

  • Local Deployment: 7B and 72B variants optimized for vLLM infrastructure
  • API Integration: OpenAI-compatible endpoints for seamless tooling (see the sketch after this list)
  • Development Kit: Midscene.js SDK for browser automation
  • Open License: Apache 2.0 for full commercial usage
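
Since the endpoints are OpenAI-compatible, a locally served UI-TARS (for example, via vLLM) can be driven with the standard OpenAI client. The base URL, served model name, and prompt below are illustrative assumptions.

```python
# Minimal sketch: sending a screenshot plus an instruction to a locally
# served, OpenAI-compatible UI-TARS endpoint (URL and model name assumed).
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="empty")

with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="ui-tars-7b",  # assumed served model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Click the search box and type 'vLLM'."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)  # model's predicted GUI action
```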


The model surpasses previous GUI automation tools by eliminating modular components while achieving higher accuracy through unified processing.


Tools & Releases YOU Should Know About

  1. Chatbot Arena LLM Leaderboard: Chatbot Arena is an open platform for crowdsourced AI benchmarking developed by researchers at UC Berkeley SkyLab and LMArena. With over 1,000,000 user votes, the platform ranks LLMs and AI chatbots using the Bradley-Terry model to generate live leaderboards (a minimal fitting sketch follows this list).


  2. Bolt.diy: An open-source tool derived from Bolt.new, designed to help users build full-stack applications directly in their browsers. It lets users pick from various AI model providers to assist with coding tasks, including OpenAI, HuggingFace, Gemini, DeepSeek, Anthropic, Mistral, LMStudio, xAI, and Groq, and more models can be added via the Vercel AI SDK.


  3. Goose: An open-source, extensible, local AI agent that helps automate engineering tasks. Written in Rust, Goose works with many different AI providers, keeps user information private, and can help with testing and debugging software.
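
For the curious, the sketch below shows how Bradley-Terry scores can be fit to pairwise vote counts; the vote data and the plain gradient-ascent fit are illustrative only, and Chatbot Arena's production pipeline is more involved.

```python
# Minimal sketch: fitting Bradley-Terry scores to pairwise vote counts
# by maximum likelihood (toy data; illustrative only).
import numpy as np

models = ["model-a", "model-b", "model-c"]
# wins[i, j] = number of votes where model i beat model j
wins = np.array([[0, 60, 80],
                 [40, 0, 70],
                 [20, 30, 0]], dtype=float)

scores = np.zeros(len(models))  # latent strengths, one per model
for _ in range(5000):
    # Bradley-Terry: P(i beats j) = sigmoid(scores[i] - scores[j])
    p = 1.0 / (1.0 + np.exp(scores[None, :] - scores[:, None]))
    # Gradient of the log-likelihood sum_ij wins[i,j] * log p[i,j]
    grad = (wins * (1.0 - p)).sum(axis=1) - (wins.T * p).sum(axis=1)
    scores += 1e-3 * grad
    scores -= scores.mean()  # scores are only identified up to a constant

for name, s in sorted(zip(models, scores), key=lambda t: -t[1]):
    print(f"{name}: {s:+.2f}")
```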


And that wraps up this issue of "This Week in AI Engineering," brought to you by jam.dev, the tool that makes it impossible for your team to send you bad bug reports.


Thank you for tuning in! Be sure to share this with your fellow AI enthusiasts and follow for the latest weekly updates.


Until next time, happy building!