Last active June 5, 2022 12:53

Title | arXiv URL | |
---|---|---|
A Bayesian Permutation training deep representation learning method for speech enhancement with variational autoencoder | http://arxiv.org/abs/2201.09875v1 | |
A COMPARISON OF DISCRETE AND SOFT SPEECH UNITS FOR IMPROVED VOICE CONVERSION | http://arxiv.org/abs/2111.02392v1 | |
A Configurable Multilingual Model is All You Need to Recognize All Languages | http://arxiv.org/abs/2107.05876v1 | |
A Framework for Private Communication with Secret Block Structure | http://arxiv.org/abs/2110.04345v1 | |
A GENERAL FRAMEWORK FOR DISTRIBUTED INFERENCE WITH UNCERTAIN MODELS | http://arxiv.org/abs/2011.10669v1 | |
A GENERALIZED HIERARCHICAL NONNEGATIVE TENSOR DECOMPOSITION | http://arxiv.org/abs/2109.14820v2 | |
A likelihood ratio based domain adaptation method for E2E models | http://arxiv.org/abs/2201.03655v1 | |
A Maximal Correlation Approach to Imposing Fairness in Machine Learning | http://arxiv.org/abs/2012.15259v1 | |
A METHOD TO REVEAL SPEAKER IDENTITY IN DISTRIBUTED ASR TRAINING, AND HOW TO COUNTER IT | http://arxiv.org/abs/2104.07815v1 | |
A NEW DATA AUGMENTATION METHOD FOR INTENT CLASSIFICATION ENHANCEMENT AND ITS APPLICATION ON SPOKEN CONVERSATION DATASETS | http://arxiv.org/abs/2202.10137v1 | |
A NOVEL 1D STATE SPACE FOR EFFICIENT MUSIC RHYTHMIC ANALYSIS | http://arxiv.org/abs/2111.00704v2 | |
A STUDY OF THE ROBUSTNESS OF RAW WAVEFORM BASED SPEAKER EMBEDDINGS UNDER MISMATCHED CONDITIONS | http://arxiv.org/abs/2110.04265v2 | |
A TIME ENCODING APPROACH TO TRAINING SPIKING NEURAL NETWORKS | http://arxiv.org/abs/2110.06735v1 | |
A Variational Bayesian Approach to Learning Latent Variables for Acoustic Knowledge Transfer | http://arxiv.org/abs/2110.08598v2 | |
ACCELERATED INTRAVASCULAR ULTRASOUND IMAGING USING DEEP REINFORCEMENT LEARNING | http://arxiv.org/abs/2201.09522v1 | |
ADAPTIVE GROUP TESTING WITH MISMATCHED MODELS | http://arxiv.org/abs/2110.02265v1 | |
ADIMA: ABUSE DETECTION IN MULTILINGUAL AUDIO | http://arxiv.org/abs/2202.07991v1 | |
ADVERSARIAL ROBUSTNESS BY DESIGN THROUGH ANALOG COMPUTING AND SYNTHETIC GRADIENTS | http://arxiv.org/abs/2101.02115v1 | |
Adversarial sample detection for speaker verification by neural vocoders | http://arxiv.org/abs/2107.00309v4 | |
AECMOS: A SPEECH QUALITY ASSESSMENT METRIC FOR ECHO IMPAIRMENT | http://arxiv.org/abs/2110.03010v3 | |
AERIAL BASE STATION PLACEMENT LEVERAGING RADIO TOMOGRAPHIC MAPS | http://arxiv.org/abs/2109.07372v2 | |
AISHELL-NER: NAMED ENTITY RECOGNITION FROM CHINESE SPEECH | http://arxiv.org/abs/2202.08533v1 | |
AMICABLE EXAMPLES FOR INFORMED SOURCE SEPARATION | http://arxiv.org/abs/2110.05059v2 | |
AN APPROACH TO MISPRONUNCIATION DETECTION AND DIAGNOSIS WITH ACOUSTIC, PHONETIC AND LINGUISTIC (APL) EMBEDDINGS | http://arxiv.org/abs/2110.07274v2 | |
AN ASYMPTOTICALLY OPTIMAL APPROXIMATION OF THE CONDITIONAL MEAN CHANNEL ESTIMATOR BASED ON GAUSSIAN MIXTURE MODELS | http://arxiv.org/abs/2111.11064v1 | |
An Embarrassingly Simple Model for Dialogue Relation Extraction | http://arxiv.org/abs/2012.13873v2 | |
AN INFORMATION MAXIMIZATION BASED BLIND SOURCE SEPARATION APPROACH FOR DEPENDENT AND INDEPENDENT SOURCES | http://arxiv.org/abs/2205.00794v1 | |
AN INVESTIGATION OF THE EFFECTIVENESS OF PHASE FOR AUDIO CLASSIFICATION | http://arxiv.org/abs/2110.02878v3 | |
ANALYZING THE ROBUSTNESS OF UNSUPERVISED SPEECH RECOGNITION | http://arxiv.org/abs/2110.03509v5 | |
APPLYING DIFFERENTIAL PRIVACY TO TENSOR COMPLETION | http://arxiv.org/abs/2110.00539v4 | |
APPROACHES TOWARD PHYSICAL AND GENERAL VIDEO ANOMALY DETECTION | http://arxiv.org/abs/2112.07661v1 | |
ASSEM-VC: REALISTIC VOICE CONVERSION BY ASSEMBLING MODERN SPEECH SYNTHESIS TECHNIQUES | http://arxiv.org/abs/2104.00931v2 | |
ATTENTIVE MAX FEATURE MAP AND JOINT TRAINING FOR ACOUSTIC SCENE CLASSIFICATION | http://arxiv.org/abs/2104.07213v2 | |
ATTRIBUTABLE WATERMARKING OF SPEECH GENERATIVE MODELS | http://arxiv.org/abs/2202.08900v2 | |
AUDIOCLIP: EXTENDING CLIP TO IMAGE, TEXT AND AUDIO | http://arxiv.org/abs/2106.13043v1 | |
AUGMENTING MOLECULAR DEEP GENERATIVE MODELS WITH TOPOLOGICAL DATA ANALYSIS REPRESENTATIONS | http://arxiv.org/abs/2106.04464v2 | |
Automated Audio Captioning using Transfer Learning and Reconstruction Latent Space Similarity Regularization | http://arxiv.org/abs/2108.04692v1 | |
AUTOMATIC DJ TRANSITIONS WITH DIFFERENTIABLE AUDIO EFFECTS AND GENERATIVE ADVERSARIAL NETWORKS | http://arxiv.org/abs/2110.06525v2 | |
BLIND EXTRACTION OF EQUITABLE PARTITIONS FROM GRAPH SIGNALS | http://arxiv.org/abs/2203.05407v1 | |
BLIND REVERBERATION TIME ESTIMATION IN DYNAMIC ACOUSTIC CONDITIONS | http://arxiv.org/abs/2202.11790v1 | |
BLOOM-NET: BLOCKWISE OPTIMIZATION FOR MASKING NETWORKS TOWARD SCALABLE AND EFFICIENT SPEECH ENHANCEMENT | http://arxiv.org/abs/2111.09372v2 | |
BONA FIDE RIESZ PROJECTIONS FOR DENSITY ESTIMATION | http://arxiv.org/abs/2204.13606v1 | |
BUILDING ROBUST SPOKEN LANGUAGE UNDERSTANDING BY CROSS ATTENTION BETWEEN PHONEME SEQUENCE AND ASR HYPOTHESIS | http://arxiv.org/abs/2203.12067v1 | |
Camera Calibration through Camera Projection Loss | http://arxiv.org/abs/2110.03479v3 | |
CAN AUDIO CAPTIONS BE EVALUATED WITH IMAGE CAPTION METRICS? | http://arxiv.org/abs/2110.04684v2 | |
CAPITALIZATION NORMALIZATION FOR LANGUAGE MODELING WITH AN ACCURATE AND EFFICIENT HIERARCHICAL RNN MODEL | http://arxiv.org/abs/2202.08171v1 | |
CASCADING BANDIT UNDER DIFFERENTIAL PRIVACY | http://arxiv.org/abs/2105.11126v2 | |
CAUSAL LINEAR TOPOLOGICAL FILTERS OVER A 2-SIMPLEX | http://arxiv.org/abs/2110.02362v1 | |
CLIMATE AND WEATHER: INSPECTING DEPRESSION DETECTION VIA EMOTION RECOGNITION | http://arxiv.org/abs/2204.14099v2 | |
Cloning one's voice using very limited data in the wild | http://arxiv.org/abs/2110.03347v2 | |
CLSEG: Contrastive Learning of Story Ending Generation | http://arxiv.org/abs/2202.09049v1 | |
COGNITIVE CODING OF SPEECH | http://arxiv.org/abs/2110.04241v1 | |
COLLABORATIVE OBJECT DETECTORS ADAPTIVE TO BANDWIDTH AND COMPUTATION | http://arxiv.org/abs/2105.00591v2 | |
Compressive Scanning Transmission Electron Microscopy | http://arxiv.org/abs/2112.11955v1 | |
CONDITIONAL DIFFUSION PROBABILISTIC MODEL FOR SPEECH ENHANCEMENT | http://arxiv.org/abs/2202.05256v1 | |
Considering user agreement in learning to predict the aesthetic quality | http://arxiv.org/abs/2110.06956v1 | |
CONTEXT MODELING WITH EVIDENCE FILTER FOR MULTIPLE CHOICE QUESTION ANSWERING | http://arxiv.org/abs/2010.02649v1 | |
CONTEXTUAL ADAPTERS FOR PERSONALIZED SPEECH RECOGNITION IN NEURAL TRANSDUCERS | http://arxiv.org/abs/2205.13660v1 | |
CONTINUOUS SPEECH SEPARATION WITH RECURRENT SELECTIVE ATTENTION NETWORK | http://arxiv.org/abs/2110.14838v1 | |
CONTRASTIVE PREDICTION STRATEGIES FOR UNSUPERVISED SEGMENTATION AND CATEGORIZATION OF PHONEMES AND WORDS | http://arxiv.org/abs/2110.15909v2 | |
CONVOLUTIONAL FILTERING IN SIMPLICIAL COMPLEXES | http://arxiv.org/abs/2201.12584v1 | |
CSI CLUSTERING WITH VARIATIONAL AUTOENCODING | http://arxiv.org/abs/2111.09758v2 | |
Data Agnostic Filter Gating for Efficient Deep Networks | http://arxiv.org/abs/2010.15041v1 | |
DEEP DETERMINISTIC INDEPENDENT COMPONENT ANALYSIS FOR HYPERSPECTRAL UNMIXING | http://arxiv.org/abs/2202.02951v2 | |
DEEP HASHING WITH HASH CENTER UPDATE FOR EFFICIENT IMAGE RETRIEVAL | http://arxiv.org/abs/2106.06269v1 | |
DEEP IMPULSE RESPONSES: ESTIMATING AND PARAMETERIZING FILTERS WITH DEEP NETWORKS | http://arxiv.org/abs/2202.03416v1 | |
DEEP ITERATIVE PHASE RETRIEVAL FOR PTYCHOGRAPHY | http://arxiv.org/abs/2202.10573v1 | |
DEEP LEARNING FOR LOCATION BASED BEAMFORMING WITH NLOS CHANNELS | http://arxiv.org/abs/2201.01386v1 | |
DEEP LEARNING FOR PROMINENCE DETECTION IN CHILDREN'S READ SPEECH | http://arxiv.org/abs/2110.14273v1 | |
Deformable VisTR: Spatio temporal deformable attention for video instance segmentation | http://arxiv.org/abs/2203.06318v1 | |
DEMON: IMPROVED NEURAL NETWORK TRAINING WITH MOMENTUM DECAY | http://arxiv.org/abs/1910.04952v4 | |
DEPTH PRUNING WITH AUXILIARY NETWORKS FOR TINYML | http://arxiv.org/abs/2204.10546v1 | |
DETECTING BACKDOOR ATTACKS AGAINST POINT CLOUD CLASSIFIERS | http://arxiv.org/abs/2110.10354v1 | |
DICTIONARY LEARNING WITH UNIFORM SPARSE REPRESENTATIONS FOR ANOMALY DETECTION | http://arxiv.org/abs/2201.03869v1 | |
DIFFERENTIABLE DIGITAL SIGNAL PROCESSING MIXTURE MODEL FOR SYNTHESIS PARAMETER EXTRACTION FROM MIXTURE OF HARMONIC SOUNDS | http://arxiv.org/abs/2202.00200v1 | |
DIFFERENTIABLE PROGRAMMING A LA MOREAU | http://arxiv.org/abs/2012.15458v1 | |
DIFFERENTIABLE WAVETABLE SYNTHESIS | http://arxiv.org/abs/2111.10003v4 | |
DIGRAPH SIGNAL PROCESSING WITH GENERALIZED BOUNDARY CONDITIONS | http://arxiv.org/abs/2005.09762v3 | |
DIRECT DESIGN OF BIQUAD FILTER CASCADES WITH DEEP LEARNING BY SAMPLING RANDOM POLYNOMIALS | http://arxiv.org/abs/2110.03691v2 | |
DISTRIBUTED GRAPH LEARNING WITH SMOOTH DATA PRIORS | http://arxiv.org/abs/2112.05887v1 | |
DISTRIBUTED LINK SPARSIFICATION FOR SCALABLE SCHEDULING USING GRAPH NEURAL NETWORKS | http://arxiv.org/abs/2203.14339v1 | |
DIVERSE AUDIO CAPTIONING VIA ADVERSARIAL TRAINING | http://arxiv.org/abs/2110.06691v2 | |
DO YOU LIVE A HEALTHY LIFE? ANALYZING LIFESTYLE BY VISUAL LIFE LOGGING | http://arxiv.org/abs/2011.12102v1 | |
DYNAMICALLY PRUNING SEGFORMER FOR EFFICIENT SEMANTIC SEGMENTATION | http://arxiv.org/abs/2111.09499v1 | |
ECONOMICS OF SEMANTIC COMMUNICATION SYSTEM IN WIRELESS POWERED INTERNET OF THINGS | http://arxiv.org/abs/2110.01423v1 | |
EFFECT OF NOISE SUPPRESSION LOSSES ON SPEECH DISTORTION AND ASR PERFORMANCE | http://arxiv.org/abs/2111.11606v1 | |
EFFICIENT SEQUENCE TRAINING OF ATTENTION MODELS USING APPROXIMATIVE RECOMBINATION | http://arxiv.org/abs/2110.09245v2 | |
EFFICIENT UNIVERSAL SHUFFLE ATTACK FOR VISUAL OBJECT TRACKING | http://arxiv.org/abs/2203.06898v1 | |
EFFICIENTLY AND GLOBALLY SOLVING JOINT BEAMFORMING AND COMPRESSION PROBLEM IN THE COOPERATIVE CELLULAR NETWORK VIA LAGRANGIAN DUALITY | http://arxiv.org/abs/2110.05085v2 | |
EMGSE: ACOUSTIC/EMG FUSION FOR MULTIMODAL SPEECH ENHANCEMENT | http://arxiv.org/abs/2202.06507v1 | |
ENHANCING AND DISSECTING CROWD COUNTING BY SYNTHETIC DATA | http://arxiv.org/abs/2201.08992v1 | |
ENHANCING UTILITY IN THE WATCHDOG PRIVACY MECHANISM | http://arxiv.org/abs/2110.04724v1 | |
ENVIRONMENTAL SOUND EXTRACTION USING ONOMATOPOEIC WORDS | http://arxiv.org/abs/2112.00209v4 | |
EPIGRAPHICAL RELAXATION FOR MINIMIZING LAYERED MIXED NORMS | http://arxiv.org/abs/2008.04565v2 | |
ESPNET-SLU: ADVANCING SPOKEN LANGUAGE UNDERSTANDING THROUGH ESPNET | http://arxiv.org/abs/2111.14706v2 | |
ESTIMATING THE CONFIDENCE OF SPEECH SPOOFING COUNTERMEASURE | http://arxiv.org/abs/2110.04775v2 | |
EVALUATION OF ORTHOGONAL CHIRP DIVISION MULTIPLEXING FOR AUTOMOTIVE INTEGRATED SENSING AND COMMUNICATIONS | http://arxiv.org/abs/2111.04975v1 | |
EVALUATION OF VIDEO CODING FOR MACHINES WITHOUT GROUND TRUTH | http://arxiv.org/abs/2205.06519v1 | |
EXPLAINING DEEP LEARNING MODELS FOR SPOOFING AND DEEPFAKE DETECTION WITH SHAPLEY ADDITIVE EXPLANATIONS | http://arxiv.org/abs/2110.03309v1 | |
Exploiting Language Model for Efficient Linguistic Steganalysis | http://arxiv.org/abs/2107.12168v3 | |
EXPLORING HETEROGENEOUS CHARACTERISTICS OF LAYERS IN ASR MODELS FOR MORE EFFICIENT TRAINING | http://arxiv.org/abs/2110.04267v2 | |
Exploring Machine Speech Chain for Domain Adaptation | http://arxiv.org/abs/2104.03815v1 | |
Factorized Neural Transducer for Efficient Language Model Adaptation | http://arxiv.org/abs/2110.01500v5 | |
Fast Graph Filters for Decentralized Subspace Projection | http://arxiv.org/abs/2011.07579v1 | |
FAST GRAPH SAMPLING FOR SHORT VIDEO SUMMARIZATION USING GERSHGORIN DISC ALIGNMENT | http://arxiv.org/abs/2110.11420v2 | |
FAST MULTISCALE DIFFUSION ON GRAPHS | http://arxiv.org/abs/2104.14652v1 | |
FAST-RIR: FAST NEURAL DIFFUSE ROOM IMPULSE RESPONSE GENERATOR | http://arxiv.org/abs/2110.04057v2 | |
Feature Imitating Networks | http://arxiv.org/abs/2110.04831v2 | |
Federated Learning Challenges and Opportunities: An Outlook | http://arxiv.org/abs/2202.00807v1 | |
FilterAugment: An Acoustic Environmental Data Augmentation Method | http://arxiv.org/abs/2110.03282v4 | |
FLDP: Flexible strategy for local differential privacy | http://arxiv.org/abs/2203.14875v1 | |
FORENSIC ANALYSIS AND LOCALIZATION OF MULTIPLY COMPRESSED MP3 AUDIO USING TRANSFORMERS | http://arxiv.org/abs/2203.16499v2 | |
FrAUG: A Frame Rate Based Data Augmentation Method for Depression Detection from Speech Signals | http://arxiv.org/abs/2202.05912v1 | |
FUSING ASR OUTPUTS IN JOINT TRAINING FOR SPEECH EMOTION RECOGNITION | http://arxiv.org/abs/2110.15684v2 | |
GENERALIZATION ABILITY OF MOS PREDICTION NETWORKS | http://arxiv.org/abs/2110.02635v3 | |
GENERALIZED TIME DOMAIN VELOCITY VECTOR | http://arxiv.org/abs/2110.06304v4 | |
GENERALIZING AUC OPTIMIZATION TO MULTICLASS CLASSIFICATION FOR AUDIO SEGMENTATION WITH LIMITED TRAINING DATA | http://arxiv.org/abs/2110.14425v1 | |
GENERATING DISENTANGLED ARGUMENTS WITH PROMPTS: A SIMPLE EVENT EXTRACTION FRAMEWORK THAT WORKS | http://arxiv.org/abs/2110.04525v2 | |
GRAPH SIGNAL PROCESSING: VERTEX MULTIPLICATION | http://arxiv.org/abs/2007.04723v1 | |
Hand Gesture Recognition Using Temporal Convolutions and Attention Mechanism | http://arxiv.org/abs/2110.08717v1 | |
HGCN: HARMONIC GATED COMPENSATION NETWORK FOR SPEECH ENHANCEMENT | http://arxiv.org/abs/2201.12755v2 | |
HISTOKT: CROSS KNOWLEDGE TRANSFER IN COMPUTATIONAL PATHOLOGY | http://arxiv.org/abs/2201.11246v1 | |
HODGELETS: LOCALIZED SPECTRAL REPRESENTATIONS OF FLOWS ON SIMPLICIAL COMPLEXES | http://arxiv.org/abs/2109.08728v1 | |
HOW CAN A COGNITIVE RADAR MASK ITS COGNITION? | http://arxiv.org/abs/2110.08608v1 | |
HOW NEURAL PROCESSES IMPROVE GRAPH LINK PREDICTION | http://arxiv.org/abs/2109.14894v1 | |
Identification of Edge Disconnections in Networks Based on Graph Filter Outputs | http://arxiv.org/abs/2102.06428v2 | |
IMPORTANTAUG: A DATA AUGMENTATION AGENT FOR SPEECH | http://arxiv.org/abs/2112.07156v2 | |
IMPROVED META LEARNING FOR LOW RESOURCE SPEECH RECOGNITION | http://arxiv.org/abs/2205.06182v1 | |
IMPROVING ADVERSARIAL WAVEFORM GENERATION BASED SINGING VOICE CONVERSION WITH HARMONIC SIGNALS | http://arxiv.org/abs/2201.10130v1 | |
IMPROVING BIRD CLASSIFICATION WITH UNSUPERVISED SOUND SEPARATION | http://arxiv.org/abs/2110.03209v1 | |
Improving Contextual Coherence in Variational Personalized and Empathetic Dialogue Agents | http://arxiv.org/abs/2202.05971v1 | |
Improving Factored Hybrid HMM Acoustic Modeling without State Tying | http://arxiv.org/abs/2201.09692v1 | |
IMPROVING FEATURE GENERALIZABILITY WITH MULTITASK LEARNING IN CLASS INCREMENTAL LEARNING | http://arxiv.org/abs/2204.12915v1 | |
IMPROVING LYRICS ALIGNMENT THROUGH JOINT PITCH DETECTION | http://arxiv.org/abs/2202.01646v1 | |
Improving Maximum Likelihood Difference Scaling method to measure inter content scale | http://arxiv.org/abs/2203.13186v1 | |
IMPROVING NOISE ROBUSTNESS OF CONTRASTIVE SPEECH REPRESENTATION LEARNING WITH SPEECH RECONSTRUCTION | http://arxiv.org/abs/2110.15430v1 | |
IMPROVING SOURCE SEPARATION BY EXPLICITLY MODELING DEPENDENCIES BETWEEN SOURCES | http://arxiv.org/abs/2203.15140v1 | |
IMPROVING THE FUSION OF ACOUSTIC AND TEXT REPRESENTATIONS IN RNN-T | http://arxiv.org/abs/2201.10240v1 | |
Increasing Loudness in Audio Signals: a perceptually motivated approach to preserve audio quality | http://arxiv.org/abs/2202.08183v1 | |
INCREMENTAL USER EMBEDDING MODELING FOR PERSONALIZED TEXT CLASSIFICATION | http://arxiv.org/abs/2202.06369v1 | |
INFERGRAD: IMPROVING DIFFUSION MODELS FOR VOCODER BY CONSIDERING INFERENCE IN TRAINING | http://arxiv.org/abs/2202.03751v1 | |
INTEGRATING TEXT INPUTS FOR TRAINING AND ADAPTING RNN TRANSDUCER ASR MODELS | http://arxiv.org/abs/2202.13155v1 | |
INTERPRETING INTERMEDIATE CONVOLUTIONAL LAYERS IN UNSUPERVISED ACOUSTIC WORD CLASSIFICATION | http://arxiv.org/abs/2110.02375v2 | |
ISOMETRIC MT: NEURAL MACHINE TRANSLATION FOR AUTOMATIC DUBBING | http://arxiv.org/abs/2112.08682v3 | |
ITOWAVE: ITO STOCHASTIC DIFFERENTIAL EQUATION IS ALL YOU NEED FOR WAVE GENERATION | http://arxiv.org/abs/2201.12519v2 | |
Joint calibration and mapping of satellite altimetry data using trainable variational models | http://arxiv.org/abs/2110.03405v1 | |
JOINT INFERENCE OF MULTIPLE GRAPHS WITH HIDDEN VARIABLES FROM STATIONARY GRAPH SIGNALS | http://arxiv.org/abs/2110.03666v2 | |
JOINT LEARNING OF FEATURE EXTRACTION AND COST AGGREGATION FOR SEMANTIC CORRESPONDENCE | http://arxiv.org/abs/2204.02164v2 | |
JOINT SPEECH RECOGNITION AND AUDIO CAPTIONING | http://arxiv.org/abs/2202.01405v1 | |
JOINT UNSUPERVISED AND SUPERVISED TRAINING FOR MULTILINGUAL ASR | http://arxiv.org/abs/2111.08137v1 | |
LABEL PROPAGATION ACROSS GRAPHS: NODE CLASSIFICATION USING GRAPH NEURAL TANGENT KERNELS | http://arxiv.org/abs/2110.03763v1 | |
LDNET: UNIFIED LISTENER DEPENDENT MODELING IN MOS PREDICTION FOR SYNTHETIC SPEECH | http://arxiv.org/abs/2110.09103v1 | |
Learnable Hypergraph Laplacian for Hypergraph Learning | http://arxiv.org/abs/2106.05701v1 | |
LEARNABLE NONLINEAR COMPRESSION FOR ROBUST SPEAKER VERIFICATION | http://arxiv.org/abs/2202.05236v1 | |
LEARNING CONTINUOUS REPRESENTATION OF AUDIO FOR ARBITRARY SCALE SUPER RESOLUTION | http://arxiv.org/abs/2111.00195v2 | |
LEARNING DECOUPLING FEATURES THROUGH ORTHOGONALITY REGULARIZATION | http://arxiv.org/abs/2203.16772v1 | |
Learning Expanding Graphs for Signal Interpolation | http://arxiv.org/abs/2203.07966v1 | |
LEARNING MUSIC AUDIO REPRESENTATIONS VIA WEAK LANGUAGE SUPERVISION | http://arxiv.org/abs/2112.04214v2 | |
LEARNING SOUND LOCALIZATION BETTER FROM SEMANTICALLY SIMILAR SAMPLES | http://arxiv.org/abs/2202.03007v1 | |
LEARNING TO INTEGRATE VISION DATA INTO ROAD NETWORK DATA | http://arxiv.org/abs/2112.10624v2 | |
LEARNINGS FROM FEDERATED LEARNING IN THE REAL WORLD | http://arxiv.org/abs/2202.03925v1 | |
LEVERAGING LOCAL TEMPORAL INFORMATION FOR MULTIMODAL SCENE CLASSIFICATION | http://arxiv.org/abs/2110.13992v1 | |
Light-SERNet: A lightweight fully convolutional neural network for speech emotion recognition | http://arxiv.org/abs/2110.03435v1 | |
LiteHAR: LIGHTWEIGHT HUMAN ACTIVITY RECOGNITION FROM WIFI SIGNALS WITH RANDOM CONVOLUTION KERNELS | http://arxiv.org/abs/2201.09310v1 | |
LOCALIZATION BASED SEQUENTIAL GROUPING FOR CONTINUOUS SPEECH SEPARATION | http://arxiv.org/abs/2107.06853v1 | |
LOCUNET: FAST URBAN POSITIONING USING RADIO MAPS AND DEEP LEARNING | http://arxiv.org/abs/2202.00738v2 | |
LOW COMPLEXITY EQUALIZATION FOR AFDM IN DOUBLY DISPERSIVE CHANNELS | http://arxiv.org/abs/2203.01875v2 | |
L-SpEx: Localized Target Speaker Extraction | http://arxiv.org/abs/2202.09995v1 | |
MASKED ACOUSTIC UNIT FOR MISPRONUNCIATION DETECTION AND CORRECTION | http://arxiv.org/abs/2108.05517v2 | |
Matching Point Sets with Quantum Circuit Learning | http://arxiv.org/abs/2102.06697v2 | |
MAXIMIZING AUDIO EVENT DETECTION MODEL PERFORMANCE ON SMALL DATASETS THROUGH KNOWLEDGE TRANSFER, DATA AUGMENTATION, AND PRETRAINING: AN ABLATION STUDY | http://arxiv.org/abs/2202.03514v1 | |
METRICGAN-U: UNSUPERVISED SPEECH ENHANCEMENT/DEREVERBERATION BASED ONLY ON NOISY/REVERBERATED SPEECH | http://arxiv.org/abs/2110.05866v1 | |
MIXED TRANSFORMER U-NET FOR MEDICAL IMAGE SEGMENTATION | http://arxiv.org/abs/2111.04734v2 | |
MIXTURE MODEL AUTO-ENCODERS: DEEP CLUSTERING THROUGH DICTIONARY LEARNING | http://arxiv.org/abs/2110.04683v2 | |
MM-DFN: Multimodal Dynamic Fusion Network For Emotion Recognition in Conversations | http://arxiv.org/abs/2203.02385v1 | |
Modeling Intention, Emotion and External World in Dialogue Systems | http://arxiv.org/abs/2202.06476v1 | |
MULTI-ACCDOA: LOCALIZING AND DETECTING OVERLAPPING SOUNDS FROM THE SAME CLASS WITH AUXILIARY DUPLICATING PERMUTATION INVARIANT TRAINING | http://arxiv.org/abs/2110.07124v2 | |
Multichannel Speech Enhancement without Beamforming | http://arxiv.org/abs/2110.13130v2 | |
MULTIMODAL DEPRESSION CLASSIFICATION USING ARTICULATORY COORDINATION FEATURES AND HIERARCHICAL ATTENTION BASED TEXT EMBEDDINGS | http://arxiv.org/abs/2202.06238v1 | |
MULTIPLE OFFSETS MULTILATERATION: A NEW PARADIGM FOR SENSOR NETWORK CALIBRATION WITH UNSYNCHRONIZED REFERENCE NODES | http://arxiv.org/abs/2205.11299v1 | |
MULTISCALE CROWD COUNTING AND LOCALIZATION BY MULTITASK POINT SUPERVISION | http://arxiv.org/abs/2202.09942v1 | |
Multitask Gaussian Process with Hierarchical Latent Interactions | http://arxiv.org/abs/1808.01132v7 | |
MUSIC ENHANCEMENT VIA IMAGE TRANSLATION AND VOCODING | http://arxiv.org/abs/2204.13289v1 | |
MUSIC SOURCE SEPARATION WITH DEEP EQUILIBRIUM MODELS | http://arxiv.org/abs/2110.06494v2 | |
NEAREST SUBSPACE SEARCH IN THE SIGNED CUMULATIVE DISTRIBUTION TRANSFORM SPACE FOR 1D SIGNAL CLASSIFICATION | http://arxiv.org/abs/2110.05606v2 | |
Neural Architecture Search for Speech Emotion Recognition | http://arxiv.org/abs/2203.16928v1 | |
Neural Speech Synthesis on a Shoestring: Improving the Efficiency of LPCNet | http://arxiv.org/abs/2202.11169v1 | |
No More Than 6ft Apart: Robust K-Means via Radius Upper Bounds | http://arxiv.org/abs/2203.02502v1 | |
NONVERBAL SOUND DETECTION FOR DISORDERED SPEECH | http://arxiv.org/abs/2202.07750v1 | |
ON DATA AUGMENTATION FOR GAN TRAINING | http://arxiv.org/abs/2006.05338v3 | |
ON FEDERATED LEARNING WITH ENERGY HARVESTING CLIENTS | http://arxiv.org/abs/2202.06105v1 | |
ON IDENTIFIABLE POLYTOPE CHARACTERIZATION FOR POLYTOPIC MATRIX FACTORIZATION | http://arxiv.org/abs/2204.11534v1 | |
On Language Model Integration for RNN Transducer based Speech Recognition | http://arxiv.org/abs/2110.06841v2 | |
ON LOSS FUNCTIONS AND EVALUATION METRICS FOR MUSIC SOURCE SEPARATION | http://arxiv.org/abs/2202.07968v1 | |
ON STABILITY AND CONVERGENCE OF DISTRIBUTED FILTERS | http://arxiv.org/abs/2102.11250v1 | |
ON THE ACQUISITION OF STATIONARY SIGNALS USING UNIFORM ADCS | http://arxiv.org/abs/2202.05143v2 | |
ON THE INTERPLAY BETWEEN SPARSITY, NATURALNESS, INTELLIGIBILITY, AND PROSODY IN SPEECH SYNTHESIS | http://arxiv.org/abs/2110.01147v2 | |
ON THE STABILITY OF LOW PASS GRAPH FILTER WITH A LARGE NUMBER OF EDGE REWIRES | http://arxiv.org/abs/2110.07234v1 | |
ONE TTS ALIGNMENT TO RULE THEM ALL | http://arxiv.org/abs/2108.10447v1 | |
OPTIMIZING THE CONSUMPTION OF SPIKING NEURAL NETWORKS WITH ACTIVITY REGULARIZATION | http://arxiv.org/abs/2204.01460v1 | |
PARAMETRIC MODELS FOR DOA TRAJECTORY LOCALIZATION | http://arxiv.org/abs/2204.09647v1 | |
PEER COLLABORATIVE LEARNING FOR POLYPHONIC SOUND EVENT DETECTION | http://arxiv.org/abs/2110.03511v1 | |
PERSONALIZED AUTOMATIC SPEECH RECOGNITION TRAINED ON SMALL DISORDERED SPEECH DATASETS | http://arxiv.org/abs/2110.04612v1 | |
Personalized PageRank Graph Attention Networks | http://arxiv.org/abs/2205.14259v1 | |
PERSONALIZED SPEECH ENHANCEMENT: NEW MODELS AND COMPREHENSIVE EVALUATION | http://arxiv.org/abs/2110.09625v1 | |
PHASE CONTINUITY: LEARNING DERIVATIVES OF PHASE SPECTRUM FOR SPEECH ENHANCEMENT | http://arxiv.org/abs/2202.11918v1 | |
PHONOLOGY RECOGNITION IN AMERICAN SIGN LANGUAGE | http://arxiv.org/abs/2110.00453v1 | |
PIXINWAV: RESIDUAL STEGANOGRAPHY FOR HIDING PIXELS IN AUDIO | http://arxiv.org/abs/2106.09814v1 | |
POPO: PESSIMISTIC OFFLINE POLICY OPTIMIZATION | http://arxiv.org/abs/2012.13682v2 | |
Power allocation for wireless federated learning using graph neural networks | http://arxiv.org/abs/2111.07480v2 | |
PRIVACY ATTACKS FOR AUTOMATIC SPEECH RECOGNITION ACOUSTIC MODELS IN A FEDERATED LEARNING FRAMEWORK | http://arxiv.org/abs/2111.03777v2 | |
PRIVACY SENSITIVE SPEECH ANALYSIS USING FEDERATED LEARNING TO ASSESS DEPRESSION | http://arxiv.org/abs/2205.00111v2 | |
PROGRESSIVE CONTINUAL LEARNING FOR SPOKEN KEYWORD SPOTTING | http://arxiv.org/abs/2201.12546v2 | |
PROTOTYPE LEARNING FOR INTERPRETABLE RESPIRATORY SOUND ANALYSIS | http://arxiv.org/abs/2110.03536v4 | |
PSEUDO STRONG LABELS FOR LARGE SCALE WEAKLY SUPERVISED AUDIO TAGGING | http://arxiv.org/abs/2204.13430v1 | |
PSLA: IMPROVING AUDIO TAGGING WITH PRETRAINING, SAMPLING, LABELING, AND AGGREGATION | http://arxiv.org/abs/2102.01243v3 | |
QUANTUM FEDERATED LEARNING WITH QUANTUM DATA | http://arxiv.org/abs/2106.00005v1 | |
RADAR TARGET DETECTION AIDED BY RECONFIGURABLE INTELLIGENT SURFACES | http://arxiv.org/abs/2104.00768v3 | |
REAL ADDITIVE MARGIN SOFTMAX FOR SPEAKER VERIFICATION | http://arxiv.org/abs/2110.09116v1 | |
REAL-M: TOWARDS SPEECH SEPARATION ON REAL MIXTURES | http://arxiv.org/abs/2110.10812v1 | |
RECOVERY OF GRAPH SIGNALS FROM SIGN MEASUREMENTS | http://arxiv.org/abs/2109.12576v1 | |
Reformulating Speaker Diarization as Community Detection With Emphasis On Topological Structure | http://arxiv.org/abs/2204.12112v1 | |
RESCOREBERT: DISCRIMINATIVE SPEECH RECOGNITION RESCORING WITH BERT | http://arxiv.org/abs/2202.01094v3 | |
RESIDUAL RECOVERY ALGORITHM FOR MODULO SAMPLING | http://arxiv.org/abs/2110.03335v1 | |
RETRIEVING SPEAKER INFORMATION FROM PERSONALIZED ACOUSTIC MODELS FOR SPEECH RECOGNITION | http://arxiv.org/abs/2111.04194v1 | |
R-G2P: EVALUATING AND ENHANCING ROBUSTNESS OF GRAPHEME TO PHONEME CONVERSION BY CONTROLLED NOISE INTRODUCING AND CONTEXTUAL INFORMATION INCORPORATION | http://arxiv.org/abs/2202.11194v1 | |
ROBUST CLASSIFICATION WITH FLEXIBLE DISCRIMINANT ANALYSIS IN HETEROGENEOUS DATA | http://arxiv.org/abs/2201.02967v1 | |
RTSNET: DEEP LEARNING AIDED KALMAN SMOOTHING | http://arxiv.org/abs/2110.04717v2 | |
SAFEGUARDING UAV NETWORKS THROUGH INTEGRATED SENSING, JAMMING, AND COMMUNICATIONS | http://arxiv.org/abs/2110.04733v1 | |
SALSA-Lite: A Fast and Effective Feature for Polyphonic Sound Event Localization and Detection with Microphone Arrays | http://arxiv.org/abs/2111.08192v2 | |
SA-SDR: A NOVEL LOSS FUNCTION FOR SEPARATION OF MEETING STYLE DATA | http://arxiv.org/abs/2110.15581v2 | |
Scattering Statistics of Generalized Spatial Poisson Point Processes | http://arxiv.org/abs/1902.03537v2 | |
SCORE DIFFICULTY ANALYSIS FOR PIANO PERFORMANCE EDUCATION BASED ON FINGERING | http://arxiv.org/abs/2203.13010v1 | |
S-DCCRN: Super Wide Band DCCRN with learnable complex feature for speech enhancement | http://arxiv.org/abs/2111.08387v1 | |
SEED: SOUND EVENT EARLY DETECTION VIA EVIDENTIAL UNCERTAINTY | http://arxiv.org/abs/2202.02441v2 | |
SIGNAL PROCESSING ON CELL COMPLEXES | http://arxiv.org/abs/2110.05614v2 | |
Simple Attention Module based Speaker Verification with Iterative noisy label detection | http://arxiv.org/abs/2110.06534v1 | |
SIMPLICIAL CONVOLUTIONAL NEURAL NETWORKS | http://arxiv.org/abs/2110.02585v1 | |
SKETCHED RT3D: HOW TO RECONSTRUCT BILLIONS OF PHOTONS PER SECOND | http://arxiv.org/abs/2203.00952v1 | |
SLUE: NEW BENCHMARK TASKS FOR SPOKEN LANGUAGE UNDERSTANDING EVALUATION ON NATURAL SPEECH | http://arxiv.org/abs/2111.10367v2 | |
SOUND EVENT DETECTION GUIDED BY SEMANTIC CONTEXTS OF SCENES | http://arxiv.org/abs/2110.03243v3 | |
SOUND EVENT DETECTION: A TUTORIAL | http://arxiv.org/abs/2107.05463v1 | |
SOURCE MIXING AND SEPARATION ROBUST AUDIO STEGANOGRAPHY | http://arxiv.org/abs/2110.05054v2 | |
SOURCE SEPARATION BY STEERING PRETRAINED MUSIC MODELS | http://arxiv.org/abs/2110.13071v1 | |
SPATIAL ACTIVE NOISE CONTROL BASED ON INDIVIDUAL KERNEL INTERPOLATION OF PRIMARY AND SECONDARY SOUND FIELDS | http://arxiv.org/abs/2202.04807v1 | |
SPATIAL DATA AUGMENTATION WITH SIMULATED ROOM IMPULSE RESPONSES FOR SOUND EVENT LOCALIZATION AND DETECTION | http://arxiv.org/abs/2110.06501v2 | |
SPATIAL MIXUP: DIRECTIONAL LOUDNESS MODIFICATION AS DATA AUGMENTATION FOR SOUND EVENT LOCALIZATION AND DETECTION | http://arxiv.org/abs/2110.06126v1 | |
SPEAKER GENERATION | http://arxiv.org/abs/2111.05095v1 | |
SPEAKER IDENTITY PRESERVATION IN DYSARTHRIC SPEECH RECONSTRUCTION BY ADVERSARIAL SPEAKER ADAPTATION | http://arxiv.org/abs/2202.09082v1 | |
SPEAKER REINFORCEMENT USING TARGET SOURCE EXTRACTION FOR ROBUST AUTOMATIC SPEECH RECOGNITION | http://arxiv.org/abs/2205.04433v1 | |
SPEECH TASKS RELEVANT TO SLEEPINESS DETERMINED WITH DEEP TRANSFER LEARNING | http://arxiv.org/abs/2111.14684v1 | |
SPELL MY NAME: KEYWORD BOOSTED SPEECH RECOGNITION | http://arxiv.org/abs/2110.02791v1 | |
STABILITY ANALYSIS OF UNFOLDED WMMSE FOR POWER ALLOCATION | http://arxiv.org/abs/2110.07471v2 | |
STABILITY OF NEURAL NETWORKS ON MANIFOLDS TO RELATIVE PERTURBATIONS | http://arxiv.org/abs/2110.04702v1 | |
STABLE AND TRANSFERABLE WIRELESS RESOURCE ALLOCATION POLICIES VIA MANIFOLD NEURAL NETWORKS | http://arxiv.org/abs/2110.04706v1 | |
STUDY OF POSITIONAL ENCODING APPROACHES FOR AUDIO SPECTROGRAM TRANSFORMERS | http://arxiv.org/abs/2110.06999v1 | |
Subjective and Objective Quality Assessment of Mobile Gaming Video | http://arxiv.org/abs/2103.05099v1 | |
Supervised Learning based Sparse Channel Estimation for RIS aided Communications | http://arxiv.org/abs/2202.11997v1 | |
TARGETDROP: A TARGETED REGULARIZATION METHOD FOR CONVOLUTIONAL NEURAL NETWORKS | http://arxiv.org/abs/2010.10716v1 | |
THE DAWN OF QUANTUM NATURAL LANGUAGE PROCESSING | http://arxiv.org/abs/2110.06510v1 | |
THE MIRRORNET: LEARNING AUDIO SYNTHESIZER CONTROLS INSPIRED BY SENSORIMOTOR INTERACTION | http://arxiv.org/abs/2110.05695v4 | |
Threshold Independent Evaluation of Sound Event Detection Scores | http://arxiv.org/abs/2201.13148v1 | |
T-NGA: TEMPORAL NETWORK GRAFTING ALGORITHM FOR LEARNING TO PROCESS SPIKING AUDIO SENSOR EVENTS | http://arxiv.org/abs/2202.03204v1 | |
TO CATCH A CHORUS, VERSE, INTRO, OR ANYTHING ELSE: ANALYZING A SONG WITH STRUCTURAL FUNCTIONS | http://arxiv.org/abs/2205.14700v1 | |
TORCHAUDIO: BUILDING BLOCKS FOR AUDIO AND SPEECH PROCESSING | http://arxiv.org/abs/2110.15018v2 | |
TOWARDS A COMMON SPEECH ANALYSIS ENGINE | http://arxiv.org/abs/2203.00613v1 | |
TOWARDS EXPRESSIVE SPEAKING STYLE MODELLING WITH HIERARCHICAL CONTEXT INFORMATION FOR MANDARIN SPEECH SYNTHESIS | http://arxiv.org/abs/2203.12201v2 | |
TOWARDS IDENTITY PRESERVING NORMAL TO DYSARTHRIC VOICE CONVERSION | http://arxiv.org/abs/2110.08213v1 | |
Towards Interpretability of Speech Pause in Dementia Detection using Adversarial Learning | http://arxiv.org/abs/2111.07454v1 | |
TOWARDS LEARNING UNIVERSAL AUDIO REPRESENTATIONS | http://arxiv.org/abs/2111.12124v2 | |
TOWARDS MEASURING FAIRNESS IN SPEECH RECOGNITION: CASUAL CONVERSATIONS DATASET TRANSCRIPTIONS | http://arxiv.org/abs/2111.09983v1 | |
TOWARDS REDUCING THE NEED FOR SPEECH TRAINING DATA TO BUILD SPOKEN LANGUAGE UNDERSTANDING SYSTEMS | http://arxiv.org/abs/2203.00006v1 | |
TOWARDS SPEAKER AGE ESTIMATION WITH LABEL DISTRIBUTION LEARNING | http://arxiv.org/abs/2202.11424v1 | |
TRAINING STABLE GRAPH NEURAL NETWORKS THROUGH CONSTRAINED LEARNING | http://arxiv.org/abs/2110.03576v2 | |
UNDERWATER IMAGE ENHANCEMENT VIA LEARNING WATER TYPE DESENSITIZED REPRESENTATIONS | http://arxiv.org/abs/2102.00676v2 | |
UNROLLING PARTICLES: UNSUPERVISED LEARNING OF SAMPLING DISTRIBUTIONS | http://arxiv.org/abs/2110.02915v1 | |
UNSUPERVISED SPEECH ENHANCEMENT WITH SPEECH RECOGNITION EMBEDDING AND DISENTANGLEMENT LOSSES | http://arxiv.org/abs/2111.08678v2 | |
Upmixing via style transfer: a variational autoencoder for disentangling spatial images and musical content | http://arxiv.org/abs/2203.12053v1 | |
USING MULTIPLE REFERENCE AUDIOS AND STYLE EMBEDDING CONSTRAINTS FOR SPEECH SYNTHESIS | http://arxiv.org/abs/2110.04451v1 | |
VISION TRANSFORMER EQUIPPED WITH NEURAL RESIZER ON FACIAL EXPRESSION RECOGNITION TASK | http://arxiv.org/abs/2204.02181v1 | |
VOCALSOUND: A DATASET FOR IMPROVING HUMAN VOCAL SOUNDS RECOGNITION | http://arxiv.org/abs/2205.03433v1 | |
VOCBENCH: A NEURAL VOCODER BENCHMARK FOR SPEECH SYNTHESIS | http://arxiv.org/abs/2112.03099v1 | |
VSEGAN: VISUAL SPEECH ENHANCEMENT GENERATIVE ADVERSARIAL NETWORK | http://arxiv.org/abs/2102.02599v2 | |
VU-BERT: A UNIFIED FRAMEWORK FOR VISUAL DIALOG | http://arxiv.org/abs/2202.10787v1 | |
WAV2CLIP: LEARNING ROBUST AUDIO REPRESENTATIONS FROM CLIP | http://arxiv.org/abs/2110.11499v2 | |
WAVEBENDER GAN: AN ARCHITECTURE FOR PHONETICALLY MEANINGFUL SPEECH MANIPULATION | http://arxiv.org/abs/2202.10973v1 | |
WEARABLE SELD DATASET: DATASET FOR SOUND EVENT LOCALIZATION AND DETECTION USING WEARABLE DEVICES AROUND HEAD | http://arxiv.org/abs/2202.08458v1 | |
When BERT Meets Quantum Temporal Convolution Learning for Text Classification in Heterogeneous Computing | http://arxiv.org/abs/2203.03550v1 | |
Win the Lottery Ticket via Fourier Analysis: Frequencies Guided Network Pruning | http://arxiv.org/abs/2201.12712v1 | |
WORD ORDER DOES NOT MATTER FOR SPEECH RECOGNITION | http://arxiv.org/abs/2110.05994v2 |