
Making Statistics Work in the Real World

June 29, 2017 / 04:28

This episode features Wharton statistics Professor Bashor Bataria discussing his research on statistics, probability, and combinatorics. Key topics include graph-based methods, computational complexity, and applications in disease research and natural language processing.

Professor Bataria explains how his research reveals the effectiveness of graph-based methods in statistics, providing theoretical justification for their use. He highlights the interplay between computational efficiency and statistical performance, emphasizing that these methods can handle large datasets.

He discusses practical applications of his research, such as the two-sample problem in gene expression studies, where differences in gene expression levels between patients with diabetes and healthy individuals are analyzed.
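As a rough illustration of what a graph-based two-sample test can look like, the sketch below implements the classic Friedman-Rafsky idea: pool both groups, build a minimum spanning tree over the pooled points, and count the tree edges that join a patient from one group to a patient from the other. Unusually few such edges suggest the two groups differ. The episode does not say which specific tests the research covers, so the choice of this test is an assumption, as are the function names, the simulated Gaussian data, and the mean shift of 1.0. Only the 20 genes and 100 patients per group come from the example in the interview.

```python
import math
import random


def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns n - 1 edges."""
    n = len(points)
    in_tree = [False] * n
    best_dist = [math.inf] * n
    best_from = [-1] * n
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the vertex outside the tree that is closest to the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_from[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_from[v] = u
    return edges


def cross_edges(edges, labels):
    """Number of tree edges whose endpoints carry different group labels."""
    return sum(1 for a, b in edges if labels[a] != labels[b])


def friedman_rafsky_test(sample_x, sample_y, n_perm=500, seed=0):
    """Permutation version of the test; small p-values suggest the groups differ."""
    rng = random.Random(seed)
    pooled = list(sample_x) + list(sample_y)
    labels = [0] * len(sample_x) + [1] * len(sample_y)
    edges = mst_edges(pooled)  # the tree ignores labels, so build it once
    observed = cross_edges(edges, labels)
    hits = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        if cross_edges(edges, shuffled) <= observed:
            hits += 1
    p_value = (hits + 1) / (n_perm + 1)
    return observed, p_value


# Simulated stand-in data (hypothetical): 20 gene expression levels per patient.
data_rng = random.Random(42)
healthy = [[data_rng.gauss(0.0, 1.0) for _ in range(20)] for _ in range(100)]
diabetic = [[data_rng.gauss(1.0, 1.0) for _ in range(20)] for _ in range(100)]

observed, p_value = friedman_rafsky_test(healthy, diabetic)
print(f"cross-group edges: {observed}, permutation p-value: {p_value:.4f}")
```

Because the tree is built once and only the labels are reshuffled, each permutation is cheap, which loosely reflects the computational efficiency mentioned in the episode. A library routine for the spanning tree would be the more practical choice on larger data sets.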

Additionally, Bataria touches on the relevance of his work in natural language processing, particularly in understanding word similarities. He notes the importance of these methods for businesses analyzing customer data from social media.

Looking ahead, he shares his focus on high-dimensional data analysis, where traditional techniques may fall short, and the need for new algorithms to address these challenges.

TL;DR

Professor Bashor Bataria discusses graph-based methods in statistics, their applications in gene expression and natural language processing, and future research directions.

Episode

00:00:02
we're here today with Wharton statistics
00:00:03
Professor Bashor Bataria to talk about
00:00:05
some of his latest research Bashor
00:00:07
thanks for being with us today could you
00:00:08
first of all talk to us a little bit
00:00:10
about give us a brief summary of your
00:00:11
research what kind of question you were
00:00:13
trying to answer right so my research
00:00:15
interests are the intersection of
00:00:16
statistics, probability and combinatorics
00:00:19
so recently numerous very interesting
00:00:20
combinatorial and graph theoretic
00:00:22
problems have emerged in statistics mainly
00:00:24
because of the ubiquitous presence of
00:00:26
network data and the increasing use of
00:00:28
graph based methods in modern analytics
00:00:30
and as a consequence many interesting
00:00:32
connections have emerged between modern
00:00:34
statistical methods and classical
00:00:36
concepts in geometry and probability and
00:00:39
you know as a consequence you can use
00:00:40
them to solve interesting problems in
00:00:42
statistics great and tell us a little
00:00:44
bit about in this study what were the
00:00:46
key takeaways that you took away that
00:00:48
you took away from the study the key
00:00:50
takeaways of my research basically the
00:00:52
interplay between computational
00:00:54
complexity which is the time it takes to
00:00:56
implement one of these methods and
00:00:58
its statistical performance
00:01:00
which is how close it is to the
00:01:01
mathematically best procedure it turns
00:01:04
out that many of these graph based
00:01:05
methods are computationally very
00:01:07
efficient so they can be aptly used to
00:01:09
solve uh and apply in large data sets
00:01:12
and we have also shown that in many
00:01:14
situations these tests have near-optimal
00:01:16
performance guarantees which provide the
00:01:19
theoretical justification required for
00:01:20
using these procedures now had graph
00:01:23
based methods is that something that has
00:01:25
not been used as much in the past or had
00:01:27
been looked at with more skepticism I mean
00:01:29
why is this
00:01:30
right so so graph based methods have
00:01:32
been used in practice for a long time
00:01:34
but I think what comes out of my
00:01:35
research is the answer to the
00:01:37
question of why they work so they were
00:01:38
used and they were working fine before
00:01:41
but here is so here we provide some
00:01:43
theory behind why it works giving the
00:01:44
justification of using it in real
00:01:46
problems great and so getting to that
00:01:48
actually like if I'm a business if I'm a
00:01:50
business person or I own a business I
00:01:52
mean how can I apply this research
00:01:55
practically in my life right so one of
00:01:58
the recent projects we're looking at is
00:01:59
what is known as the two sample problem
00:02:01
so imagine I have a situation where I
00:02:03
want to find out whether a set of genes
00:02:06
regulate or affect the occurrence of a
00:02:08
disease so for example to just
00:02:10
illustrate suppose I have uh uh 20 genes
00:02:13
and I have the gene expression level
00:02:15
data from 100 patients who have diabetes
00:02:18
and I have the same expression level
00:02:19
data for 100 patients who are healthy
00:02:22
and the goal is to find out whether
00:02:24
these 20 genes are expressed
00:02:26
differentially so by that I mean that
00:02:28
their expression levels are
00:02:29
significantly different between these two
00:02:31
sets of patients and uh our research
00:02:34
sort of aims to provide the theoretical
00:02:36
understanding and comparison of the
00:02:38
different methods that are deployed to
00:02:41
understand or answer such questions what
00:02:43
are some other applications for this
00:02:44
research right so another interesting
00:02:46
application of our work is in natural
00:02:48
language processing mainly in problems
00:02:50
which try to understand similarity
00:02:52
between words so imagine the word color
00:02:55
which can be spelled in two ways one
00:02:56
with the letter U one without the letter
00:02:58
U and they are the same words but the
00:03:00
words for instance wolf and fox are both
00:03:03
animals but they are very different
00:03:05
words so in this case in spite of the
00:03:08
amount of data we have the support size
00:03:10
which is basically the collection of all
00:03:12
words is far larger than the
00:03:14
data set itself so one of the methods
00:03:17
we are studying can be used to
00:03:19
analyze such problems as well which I
00:03:21
would think would be pretty interesting
00:03:22
to businesses nowadays just because so
00:03:24
many people are like taking social media
00:03:26
posts and things like that trying to get
00:03:28
data about customers that way right
00:03:30
perfectly yeah yeah yeah sure and so
00:03:32
what's next for this research right so
00:03:34
currently I'm trying to understand or
00:03:36
analyze the methods for analyzing data
00:03:39
in what is known as the high dimensional
00:03:40
setting where you have say 10,000 genes
00:03:43
and only a few hundred patients and uh
00:03:45
you want to find out something about how
00:03:48
the genes affect the disease or the
00:03:49
patients and for these cases different
00:03:52
new techniques are required and I'm
00:03:53
trying to understand the theoretical uh
00:03:56
the uh background of these results and
00:03:58
how these can be used to find to get new
00:04:00
methods and new algorithms great Bashor
00:04:02
thanks for being with us today thank you
00:04:03
thank you
00:04:15
[Music]

Episode Highlights

  • Graph-Based Methods Explained
    Discover why graph-based methods are effective in modern analytics and their computational efficiency.
    “We provide some theory behind why it works.”
    @ 01m 37s
  • Applications in Natural Language Processing
    Learn how research applies to understanding word similarities and customer data analysis.
    “Our methods can analyze problems in natural language processing.”
    @ 03m 17s
  • Future Research Directions
    Exploring new techniques for analyzing high-dimensional data in medical research.
    “New techniques are required for high dimensional settings.”
    @ 03m 40s

Episode Quotes

  • Why they work is the key question.
    Making Statistics Work in the Real World
  • Our research aims to provide theoretical understanding.
    Making Statistics Work in the Real World
  • Analyzing data in high dimensional settings is crucial.
    Making Statistics Work in the Real World

Key Moments

  • Research Overview 00:10
  • Graph Theory Insights 00:22
  • Business Applications 01:55
  • Natural Language Processing 02:46
  • Future Directions 03:34
