# multimodal-ai

Here are 143 public repositories matching this topic...

duixcom / Duix-Avatar

🚀 Truly open-source AI avatar (digital human) toolkit for offline video generation and digital human cloning.

cloning video-generation digital-human cloning-tool ai-avatar ai-avatars video-synthesis multimodal-ai

Updated Oct 16, 2025
C

Denis2054 / Building-Business-Ready-Generative-AI-Systems

This GitHub repository contains the complete code for building Business-Ready Generative AI Systems (GenAISys) from scratch. It guides you through architecting and implementing advanced AI controllers, intelligent agents, and dynamic RAG frameworks. The projects demonstrate practical applications across various domains.

multi-agent-systems ai-agents rag human-centered-ai llms chain-of-thought enterprise-ai agentic-ai ai-architecture multimodal-ai deepseek-r1 context-engineering generative-ai-systems

Updated Aug 9, 2025
Jupyter Notebook

thubZ09 / multimodal-research

Hub for researchers exploring VLMs and Multimodal Learning:)

nlp machine-learning research computer-vision deep-learning multimodal-learning multimodal-deep-learning vision-language multimodal-large-language-models vlms multimodal-ai

Updated Dec 3, 2025

athrael-soju / Snappy

🐊 Snappy's unique approach unifies vision-language late interaction with structured OCR for region-level knowledge retrieval. Like the project? Drop a star! ⭐

Updated Dec 5, 2025
Python

seehiong / prompt-to-puzzle

A web app that dynamically generates playable 'Spot the Difference' games from a single text prompt using a multimodal pipeline with Google's Gemini and Imagen models.

react game typescript computer-vision html5-canvas puzzle-game generative-art text-to-image hackathon-project appwrite google-cloud-run generative-ai google-gemini google-ai-studio spot-the-difference multimodal-ai google-imagen

Updated Sep 13, 2025
TypeScript

kiranbaby14 / TalkMateAI

🎭 Real-time voice-controlled 3D avatar with multimodal AI - speak naturally and watch your AI companion respond with perfect lip-sync

websocket nextjs vlm fastapi huggingface whisper-ai flash-attention-2 multimodal-ai kokoro-tts smolvlm

Updated Jul 5, 2025
TypeScript

sinanuozdemir / oreilly-multimodal-ai

Learn how multimodal AI merges text, image, and audio for smarter models

openai diffusion multimodal deepgram livekit stable-diffusion dreambooth generative-ai llava dalle-3 llama3 multimodal-ai

Updated Jan 21, 2025
Jupyter Notebook

neocortex-link / neocortex-unity-sdk

Neocortex Unity SDK for Smart NPCs and Virtual Assistants

ai game-development npc npcs game-ai ai-agents conversational-ai smart-agent ai-tools ai-agent aiagent smart-agents aiagents multimodal-ai smart-npc smart-npcs unity-llm unityllm

Updated Nov 13, 2025
C#

microsoft / multimodal-ai

Enterprise-ready solution leveraging multimodal Generative AI (Gen AI) to enhance existing or new applications beyond text—implementing RAG, image classification, video analysis, and advanced image embeddings.

python ai azure video-analysis azure-ai enterprise-ai multimodal-ai

Updated Dec 6, 2025
HCL

alperensumeroglu / ai-clips-maker

AI-powered tool to turn long videos into short, viral-ready clips. Combines transcription, speaker diarization, scene detection & 9:16 resizing — perfect for creators & smart automation.

Updated Apr 2, 2025
Python

DmitryRyumin / ICML-2025-Papers

ICML 2025 Papers: Dive into cutting-edge research from the premier machine learning conference. Stay current with breakthroughs in deep learning, generative AI, optimization, reinforcement learning, and beyond. Code implementations included. ⭐ support the future of machine learning research!

machine-learning reinforcement-learning deep-learning optimization reinforcement-learning-algorithms icml ai-research graph-learning diffusion-models generative-ai multimodal-ai icml-2025

Updated Oct 24, 2025

doepking / gemini_multimodal_demo

A demo multimodal AI chat application built with Streamlit and Google's Gemini model. Features include: secure Google OAuth, persistent data storage with Cloud SQL (PostgreSQL), and intelligent function calling. Includes a persona-based newsletter engine to deliver personalized insights.

postgresql google-cloud smtp cloud-sql cloud-run gemini-ai multimodal-ai

Updated Aug 3, 2025
Python

byerlikaya / SmartRAG

⚡ Production-ready .NET Standard 2.1 RAG library with 🤖 multi-AI provider support, 🏢 enterprise vector storage, 📄 intelligent document processing, and 🗄️ multi-database query coordination. 🌍 Cross-platform compatible.

Updated Dec 1, 2025
C#

masfaatanveer / Agentic-AI-Computer

This is a fully autonomous, self-operating computer automation system designed to automate tasks on Windows without any user interaction. It runs scheduled or trigger-based workflows using Python, system tools, and smart agents — ideal for repetitive tasks, bots, or self-executing pipelines.

python bot agent automation autopilot task-runner windows-automation autonomous-system ai-agent ollama gemini-pro-vision claude-3 gpt-4o agentic-ai multimodal-ai self-operating

Updated Aug 3, 2025
Python

umitkacar / awesome-vision-models

Vision Foundation Models: SAM, ViT, CLIP, DINOv2, object detection, segmentation, and multimodal AI for computer vision.

computer-vision sam yolo image-recognition object-detection vit clip semantic-segmentation zero-shot-learning mae instance-segmentation vision-transformers foundation-models visual-understanding open-vocabulary dinov2 grounding-dino multimodal-ai

Updated Nov 10, 2025
Makefile

debanjan06 / geospatial-rag

AI Framework for Remote Sensing Image Analysis using RAG - 88%+ accuracy, multi-modal queries, ChatGPT-like interface

machine-learning computer-vision geospatial pytorch embeddings remote-sensing clip earth-observation rag academic-research langchain multimodal-ai

Updated Jul 27, 2025
Python

michaelbeijer / Supervertaler

The ultimate companion tool for translators and writers. Context-aware AI translation leveraging multiple sources: full document context, images, TM, termbases. Featuring: Prompt Library/Manager, PDF Rescue, TMX Editor, Supervoice (AI voice dictation), Superbench (LLM translation quality benchmarking), Universal Lookup, and CAT tool integration.

python nlp translation ai localization gemini ahk translation-memory prompts memoq claude cafetran translation-tool cat-tool llm prompt-engineering multimodal-ai context-aware-translation

Updated Dec 6, 2025
Python

NxtGenLegend / TreeHacks-ZoneOut

#3 Winner of Best Use of Zoom API at Stanford TreeHacks 2025! An AI-powered meeting assistant that captures video, audio and textual context from Zoom calls using multimodal RAG.

Updated Feb 16, 2025
JavaScript

VectorInstitute / VLDBench

VLDBench: A large-scale benchmark for evaluating Vision-Language Models (VLMs) and Large Language Models (LLMs) on multimodal disinformation detection.

nlp benchmarking machine-learning computer-vision deep-learning datasets benchmark-framework ai-safety llm vlms vision-language-models multimodal-ai disinformation-detection

Updated Jun 10, 2025
Python

sams-tom / Multimodal-AUV

Leveraging Bayesian Neural Networks for multimodal AUV data fusion, enabling precise and uncertainty-aware mapping of underwater environments.

python machine-learning computer-vision deep-learning geospatial-data pytorch remote-sensing uncertainty-quantification underwater-robotics auv environmental-monitoring bayesian-neural-networks data-fusion habitat-mapping marine-science multimodal-ai sonar-data environmental-ai optical-imagery

Updated Oct 24, 2025
Python
