OUTREACH


To build a steady data science workforce that is diverse, responsible, and has a solid foundation, the effort must begin at the K-12 level. For this reason, EnCORE's outreach efforts will focus on incorporating a data science syllabus into school curricula through regular visits to middle and high schools, and on providing year-long mentoring to students.

Incorporating a data science syllabus into school curricula through regular visits to middle and high schools, and sustaining year-long mentoring of students, requires exemplary dedication and passion. That is exactly where EnCORE as an institute envisions standing out. A major architect of the outreach program at EnCORE will be Naderi (UCSD), who will spend 20 hours per week coordinating activities across the sites through direct site visits and hybrid offerings. A uniquely experienced team: Naderi's Thinkabit Lab has impacted 14,000+ middle school students directly (on-site) and 60,000+ students indirectly (off-site), and has been replicated at 7+ sites (3 middle schools, 2 universities, and 2 community centers serving entire districts). See the Broadening Participation Plan for more details. After eight months of deliberation with the school authority, we now have an agreement in place with the Sweetwater Union High School District in San Diego, where 68% of students come from underrepresented groups, to work with them throughout the entire school year (see collaboration letter).

Mishne is a member of the Pathways2AI initiative at UCSD, creating content for teaching AI concepts that engages students in grades 6-12. The material will be translated into a bilingual format through a partnership with CREATE.

The following new activities will be conducted as part of the EnCORE institute.

Hands-on Activities for K-12

With Naderi's guidance, the team will identify key Data Science concepts and translate them into experiences that kids can appreciate. For example, infusing art into STEM education improves creativity and innovation and supports broadening participation. These activities are designed to be hands-on, either through a tangible project or virtually using Python and Jupyter Notebooks. The goal is to build confidence in science and change perceptions, increasing motivation to pursue STEM pathways and careers. The team will work with affinity student organizations to help facilitate these activities in the classroom and at community centers, regardless of whether the content is designed for school or out-of-school time. If the content is designed for use during school, the team also anticipates working with the science teacher of the grade the content specifically targets, to ensure that state standards are met in the exercise. This latter strategy is intended not to take time away from the state standards teachers need to cover, but to enhance their current teaching load by overlapping with what they are already doing.

The targeted activities will start with the Sweetwater school district in year one and expand to other San Diego and Los Angeles metro area school districts starting in years two and three. Naderi will work with the Netter Center (Penn), The Fife-Penn CS Academy (Penn), and DDCA (UT Austin) to expand the program to local schools in Pennsylvania and Texas.

K-12 Professional Development

For K-12 activities designed for classroom time, once piloted with the original teacher who co-developed the content, the team will train other teachers to deliver the same exercises. This training will qualify as professional development. Content creators, with Naderi's guidance, will work with the K-12 teachers who co-developed the activities to create a series of trainings enabling other K-12 teachers to adopt the activities in their classrooms. Teachers are incentivized to participate in professional development by being financially compensated (budget allocated) and by receiving credit from their district. Naderi works with administrators at the district level, e.g., the Sweetwater Union High School District, to ensure the professional development training meets their standards for receiving credit.

EnCORE will develop a video contribution: a how-to guide helping parents support their kids who are interested in the Data Science pathway, hosted on the UCTV platform through UCSD Extension (collaboration letter attached). With this contribution, UC Extension will provide in-kind contributions by creating and developing workshops around the video, in person or online. Enrolled parents will be incentivized with credit and a certificate at the end; the credit counts toward the certificate as transcriptable post-graduate credit. The team will also translate the video into a guidebook (downloadable PDF) and time-mark the video to accompany the guidebook. Potentially, these materials can be translated into Spanish and Portuguese.

 

New Activities

Along with continuing the NSF-funded (supplemental) REU program, several new programs will be offered through EnCORE.

Python Bootcamp

Introduction to Principles of Data Science courses often have no programming prerequisite, yet use Python coding for lessons and homework. However, from the PI's experience at HDSI (UCSD), first-generation college students tend to have less programming experience than their peers taking an introductory Data Science course, and there is a correlation between a lack of programming experience and performance in the course. As a result, EnCORE will design programming bootcamps to supplement students' programming exposure, aiming to minimize this gap in performance, share stories to build confidence, and provide opportunities to build relationships with classmates. The camp will be offered virtually on all campuses, with one-hour Zoom sessions for five days over one week. Students will be encouraged to keep their video on. The instructor will go over basic Python concepts and then hand out coding activity sheets for students to work through. Students will be paired in breakout rooms, where one participant shares the screen of their Jupyter notebook and their partner describes what to code, with roles switching periodically.
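A warm-up exercise of the kind these activity sheets might contain could look as follows (a hypothetical example; the function, the data, and the task are illustrative, not actual bootcamp material):

```python
# Pair-programming warm-up: one partner types, the other narrates the logic,
# and the pair checks the result together before switching roles.

def count_above(values, threshold):
    """Count how many numbers in `values` are strictly above `threshold`."""
    count = 0
    for v in values:
        if v > threshold:
            count += 1
    return count

# Example: daily temperatures (F) for one week.
temps = [68, 72, 75, 71, 69, 80, 77]
print(count_above(temps, 70))  # prints 5
```

Exercises at this level let students practice loops, conditionals, and function definitions while discussing the code out loud with a partner.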

Ethics Summer Camp

The EnCORE institute will offer a one-week virtual summer program (with some in-person interactions for southern California students) on Responsible & Ethical Data Science in collaboration with David Danks (Professor of Data Science and Philosophy, HDSI, UCSD; see collaboration letter). The summer program will provide hands-on experience in both (i) understanding the ethical, societal, and psychological impacts of data science and AI systems; and (ii) practical tools and methods to minimize the risks of harm and maximize the possibility of benefit through the use of data science and AI. Roth and Danks will jointly design the curriculum for this program, bringing in complementary expertise.

Tea Chats (hybrid, on the virtual institute; see collaboration plan) are informal social gatherings to facilitate networking among faculty, department staff, and students. To encourage conversation, the format will ask participants to set aside electronic devices.

There will be regular interview sessions with faculty and postdoctoral researchers of EnCORE about their journeys to becoming researchers: discussing their past (childhood), present (their research, explained in a way a child would understand), and future (what they are excited about next).

EnCORE activities span audiences from K-12 and community colleges through undergraduates, graduate students, post-docs, and junior faculty:

- Hands-on Activities (Students)
- Python Bootcamp
- REU
- Research Seminars
- Mentoring
- Professional Development (Teachers)
- Ethics Summer Camp
- Tea with a Professor
- Travel Support
- Outreach to Parents
- Early Career Workshop

Pareto-Optimal Algorithms for Learning in Games

Natalie Collina and Eshwar Ram Arunachaleswaran

Abstract: We study the problem of characterizing optimal learning algorithms for playing repeated games against an adversary with unknown payoffs. In this problem, the first player (called the learner) commits to a learning algorithm against a second player (called the optimizer), and the optimizer best-responds by choosing the optimal dynamic strategy for their (unknown but well-defined) payoff. Classic learning algorithms (such as no-regret algorithms) provide some counterfactual guarantees for the learner, but might perform much more poorly than other learning algorithms against particular optimizer payoffs.

In this paper, we introduce the notion of asymptotically Pareto-optimal learning algorithms. Intuitively, if a learning algorithm is Pareto-optimal, then there is no other algorithm which performs asymptotically at least as well against all optimizers and performs strictly better (by at least $\Omega(T)$) against some optimizer. We show that well-known no-regret algorithms such as Multiplicative Weights and Follow The Regularized Leader are Pareto-dominated. However, while no-regret is not enough to ensure Pareto-optimality, we show that a strictly stronger property, no-swap-regret, is a sufficient condition for Pareto-optimality.
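For readers unfamiliar with the algorithms named above, the Multiplicative Weights update can be sketched in a few lines (a minimal textbook-style illustration of the no-regret algorithm, not the constructions analyzed in the paper; the loss sequence and learning rate are arbitrary choices):

```python
import math

def multiplicative_weights(loss_rounds, eta=0.1):
    """Run Multiplicative Weights over a sequence of loss vectors.

    loss_rounds: list of per-round loss vectors with entries in [0, 1].
    Returns the sequence of probability distributions the learner plays.
    """
    n = len(loss_rounds[0])
    weights = [1.0] * n
    history = []
    for losses in loss_rounds:
        total = sum(weights)
        history.append([w / total for w in weights])
        # Exponentially down-weight actions that suffered loss this round.
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    return history

# Two actions; action 1 never incurs loss, so its probability grows.
rounds = [[1.0, 0.0]] * 50
dists = multiplicative_weights(rounds)
print(dists[0], dists[-1])
```

Against this fixed loss sequence the learner concentrates on the better action, which is exactly the kind of counterfactual guarantee that, as the abstract notes, does not by itself imply Pareto-optimality.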

Proving these results requires us to address various technical challenges specific to repeated play, including the fact that there is no simple characterization of how optimizers who are rational in the long-term best-respond against a learning algorithm over multiple rounds of play. To address this, we introduce the idea of the asymptotic menu of a learning algorithm: the convex closure of all correlated distributions over strategy profiles that are asymptotically implementable by an adversary. Interestingly, we show that all no-swap-regret algorithms share the same asymptotic menu, implying that all no-swap-regret algorithms are “strategically equivalent”.

This talk is based on work with Jon Schneider.

Metric Learning from Lazy Crowds

Geelon So

Abstract: We consider crowd-based metric learning from preference comparisons, where given two items, a user prefers the item that is closer to their latent ideal item. Here, “closeness” is measured with respect to a shared but unknown Mahalanobis distance. Can we recover this distance when we can only obtain very few responses per user?

In this very low-budget regime, we show that generally, nothing at all about the metric is revealed, even with infinitely many users. But when the items have subspace cluster structure, we present a divide-and-conquer approach for metric recovery, and provide theoretical recovery guarantees and empirical validation.

This is joint work with Zhi Wang (UCSD) and Ramya Korlakai Vinayak (UW–Madison).

Random Walks, Conductance, and Resistance for the Connection Graph Laplacian

Zhengchao Wan

Abstract: We investigate the concept of effective resistance in connection graphs, expanding its traditional application from undirected graphs. We propose a robust definition of effective resistance in connection graphs by focusing on the duality of Dirichlet-type and Poisson-type problems on connection graphs. Additionally, we delve into random walks, taking into account both node transitions and vector rotations. This approach introduces novel concepts of effective conductance and resistance matrices for connection graphs, capturing mean rotation matrices corresponding to random walk transitions. Thereby, it provides new theoretical insights for network analysis and optimization.
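For intuition, the classical effective resistance on an ordinary undirected graph, which the talk generalizes to connection graphs, can be computed directly from the Laplacian pseudoinverse (a minimal sketch; the example graph is an arbitrary choice):

```python
import numpy as np

def effective_resistance(L, u, v):
    """Classical effective resistance between nodes u and v of an undirected
    graph with Laplacian L, via the Moore-Penrose pseudoinverse."""
    Lp = np.linalg.pinv(L)
    e = np.zeros(L.shape[0])
    e[u], e[v] = 1.0, -1.0
    return float(e @ Lp @ e)

# Path graph on 3 nodes with unit edge weights.
L = np.array([[ 1.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  1.0]])
print(effective_resistance(L, 0, 2))  # approximately 2.0: two unit resistors in series
```

The connection-graph setting replaces scalar edge weights with rotations, which is what motivates the matrix-valued notions of conductance and resistance in the talk.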

This is based on a joint work with Alexander Cloninger, Gal Mishne, Andreas Oslandsbotn, Sawyer Jack Robertson and Yusu Wang.

Approximability and Inapproximability of Strict-CSPs

Akbar Rafiey

Abstract: We study the approximability and inapproximability of Strict-CSPs. An instance of a Strict-CSP consists of a set of constraints over a set of variables and a cost function over assignments. The goal is to find a minimum-cost assignment to the variables that satisfies all the constraints. Some prominent problems that this framework captures are (Hypergraph) Vertex Cover, Min Sum k-Coloring, Multiway Cut, Min Ones, and others.

We focus on a systematic study of Strict-CSPs of the form Strict-CSPs(H), that is, Strict-CSPs where the type of constraints is limited to predicates from a set H. Our first result is a dichotomy for approximation of Strict-CSPs(H), where H is a binary predicate, i.e., a digraph. We prove that if digraph H has bounded width, then Strict-CSPs(H) is approximable within a constant factor (depending on H); otherwise, there is no approximation for Strict-CSPs(H) unless P=NP.

Second, we study the inapproximability of Strict-CSPs and present the first general hardness of approximation for Strict-CSPs. More precisely, we prove a dichotomy theorem stating that every instance of Strict-CSP(H) (H being a digraph) is either polynomial-time solvable or APX-complete. Moreover, we show the existence of a universal constant $0<\delta<1$ such that it is NP-hard to approximate Strict-CSP(H) within a factor of $(2-\delta)$ for all digraphs H where Strict-CSP(H) is NP-complete.

Buy-many Mechanisms for Many Unit-demand Buyers

Rojin Rezvan

Abstract: A recent line of research has established a novel desideratum for designing approximately revenue-optimal multi-item mechanisms, namely the buy-many constraint. Under this constraint, prices for different allocations made by the mechanism must be subadditive, implying that the price of a bundle cannot exceed the sum of prices of the individual items it contains. This natural constraint has enabled several positive results in multi-item mechanism design, bypassing well-established impossibility results. Our work addresses a main open question from this literature involving the design of buy-many mechanisms for multiple buyers. Our main result is that a simple sequential item pricing mechanism with buyer-specific prices can achieve an O(log m) approximation to the revenue of any buy-many mechanism when all buyers have unit-demand preferences over m items. This is the best possible, as it directly matches the previous results for the single-buyer setting, where no simple mechanism can obtain a better approximation. Our result applies in full generality: even though there are many alternative ways buy-many mechanisms can be defined for multi-buyer settings, our result captures all of them at the same time. We achieve this by directly competing with a more permissive upper bound on the buy-many revenue, obtained via an ex-ante relaxation.
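To make the mechanism format concrete, here is a minimal sketch of sequential item pricing with buyer-specific prices for unit-demand buyers (the values, prices, and tie-breaking rule are hypothetical illustrations, not the paper's construction):

```python
def sequential_item_pricing(buyers, prices):
    """Buyers arrive in order; each unit-demand buyer purchases the single
    remaining item maximizing value minus their buyer-specific price,
    buying only when that utility is strictly positive."""
    sold = set()
    revenue = 0.0
    for b, values in enumerate(buyers):
        best, best_util = None, 0.0
        for item, v in enumerate(values):
            if item in sold:
                continue
            util = v - prices[b][item]
            if util > best_util:
                best, best_util = item, util
        if best is not None:
            sold.add(best)
            revenue += prices[b][best]
    return revenue

# Two buyers, two items, buyer-specific posted prices.
buyers = [[5.0, 2.0], [4.0, 4.0]]
prices = [[3.0, 1.0], [3.0, 2.0]]
print(sequential_item_pricing(buyers, prices))  # 5.0
```

Here buyer 0 takes item 0 (utility 2) and buyer 1, with item 0 gone, takes item 1 (utility 2), for total revenue 5. The paper's contribution is in choosing prices so that such a sequential mechanism competes with any buy-many mechanism's revenue.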

Streaming PCA for Markovian Data

Syamantak Kumar

Abstract: Since its inception in 1982, Oja’s algorithm has become an established method for streaming principal component analysis (PCA). We study the problem of streaming PCA, where the data-points are sampled from an irreducible, aperiodic, and reversible Markov chain. Our goal is to estimate the top eigenvector of the unknown covariance matrix of the stationary distribution. This setting has implications in scenarios where data can solely be sampled from a Markov Chain Monte Carlo (MCMC) type algorithm, and the objective is to perform inference on parameters of the stationary distribution. Most convergence guarantees for Oja’s algorithm in the literature assume that the data-points are sampled IID. For data streams with Markovian dependence, one typically downsamples the data to get a “nearly” independent data stream. In this paper, we obtain the first sharp rate for Oja’s algorithm on the entire data, where we remove the logarithmic dependence on the sample size that results from throwing data away in downsampling strategies.
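For reference, the classical Oja update for the top eigenvector can be sketched as follows (a minimal illustration with i.i.d. Gaussian data as a stand-in for the Markovian streams studied in the talk; the step size and synthetic data are arbitrary choices):

```python
import numpy as np

def oja_top_eigvec(stream, step=0.01, seed=0):
    """Oja's update: w <- normalize(w + step * x * (x . w)) per data point."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(stream.shape[1])
    w /= np.linalg.norm(w)
    for x in stream:
        w += step * x * (x @ w)
        w /= np.linalg.norm(w)
    return w

# Synthetic data whose covariance has top eigenvector along the first axis.
rng = np.random.default_rng(1)
X = rng.standard_normal((5000, 3)) * np.array([3.0, 1.0, 0.5])
w = oja_top_eigvec(X)
print(abs(w[0]))  # should be close to 1 for this data
```

Each point is processed once in a single pass, which is what makes the algorithm attractive for streams where downsampling would discard data.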

A d^{1/2 + o(1)} Monotonicity Tester for Boolean Functions on d-Dimensional Hypergrids

Hadley Black

Abstract: Monotonicity testing of Boolean functions over the n-width, d-dimensional hypergrid is a classic problem in property testing, where the goal is to design a randomized algorithm which can distinguish monotone functions from those which are far from any monotone function while making as few queries as possible. The special case of n = 2 corresponds to the hypercube domain. Here a long line of works exploiting a very interesting connection with isoperimetric inequalities for Boolean functions culminated in a non-adaptive tester making ~O(d^{1/2}) queries in a celebrated paper by Khot, Minzer, and Safra (SICOMP 2018). This is known to be optimal for non-adaptive testers. However, the general case of hypergrids for n > 2 remained open. Very recently, two papers (Black-Chakrabarty-Seshadhri STOC 2023 and Braverman-Khot-Kindler-Minzer ITCS 2023) independently obtained ~O(poly(n) d^{1/2}) query testers for hypergrids. These results are essentially optimal for n < polylog(d), but are far from optimal for n >> polylog(d).

This talk covers our most recent result (appearing at FOCS 2023) which obtains a non-adaptive d^{1/2+o(1)} query tester for all n, resolving the non-adaptive monotonicity testing problem for hypergrids, up to a factor of d^{o(1)}. Our proof relies on many new techniques as well as two key theorems which we proved in earlier works from SODA 2020 and STOC 2023.
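For intuition, the classic random-edge tester for the hypercube case (n = 2), a much simpler predecessor of the testers discussed above, can be sketched as follows (a textbook-style illustration, not the tester from the paper; the example functions are arbitrary):

```python
import random

def edge_tester(f, d, queries=1000, seed=0):
    """Classic random-edge monotonicity tester on the d-dimensional hypercube.

    Samples random edges (x with bit i set to 0, x with bit i set to 1) and
    reports a violation if f decreases along the edge. Always accepts
    monotone functions; rejects functions far from monotone with probability
    depending on their distance to monotonicity and the query budget."""
    rng = random.Random(seed)
    for _ in range(queries):
        x = [rng.randint(0, 1) for _ in range(d)]
        i = rng.randrange(d)
        lo, hi = list(x), list(x)
        lo[i], hi[i] = 0, 1
        if f(tuple(lo)) > f(tuple(hi)):
            return False  # found a violated edge
    return True  # no violation observed

d = 10
monotone = lambda x: int(sum(x) >= 5)  # a monotone threshold function
antimono = lambda x: int(sum(x) < 5)   # far from monotone
print(edge_tester(monotone, d), edge_tester(antimono, d))
```

The results above improve on this naive tester's query complexity and extend the guarantees from the hypercube to general hypergrids.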

SmoothLLMs: Defending LLMs against Adversarial Attacks

Alex Robey

Abstract: Despite efforts to align large language models (LLMs) with human values, widely-used LLMs such as GPT, Llama, Claude, and PaLM are susceptible to jailbreaking attacks, wherein an adversary fools a targeted LLM into generating objectionable content.  To address this vulnerability, we propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on LLMs.  Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs.  SmoothLLM reduces the attack success rate on numerous popular LLMs to below one percentage point, avoids unnecessary conservatism, and admits provable guarantees on attack mitigation.  Moreover, our defense uses exponentially fewer queries than existing attacks and is compatible with any LLM.
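The perturb-then-aggregate idea can be illustrated with a toy sketch (the detector, prompt, and parameters below are hypothetical stand-ins; the actual SmoothLLM defense queries a real LLM and aggregates its responses):

```python
import random
import string

def perturb(prompt, q=0.15, rng=None):
    """Randomly replace a fraction q of the prompt's characters."""
    rng = rng or random.Random()
    chars = list(prompt)
    k = max(1, int(q * len(chars)))
    for i in rng.sample(range(len(chars)), k):
        chars[i] = rng.choice(string.ascii_letters)
    return "".join(chars)

def smoothed_judgment(is_jailbroken, prompt, copies=11, q=0.15, seed=0):
    """Majority vote of a jailbreak check over perturbed copies of a prompt."""
    rng = random.Random(seed)
    votes = sum(is_jailbroken(perturb(prompt, q, rng)) for _ in range(copies))
    return votes > copies // 2

# Toy stand-in for "the attack succeeds": the prompt contains an exact
# adversarial suffix, which character-level perturbation tends to destroy.
detector = lambda p: "!!magic-suffix!!" in p
attack = "Please do something harmful !!magic-suffix!!"
print(detector(attack), smoothed_judgment(detector, attack))
```

Because the adversarial suffix only works verbatim, most perturbed copies no longer trigger it, so the majority vote flips even though the clean prompt succeeds; this brittleness is the finding the defense exploits.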

Composition of Nested Embeddings with an Application to Outlier Removal

Kristin Sheridan

Abstract: We study the design of embeddings into Euclidean space with outliers. Given a metric space $(X,d)$ and an integer $k$, the goal is to embed all but $k$ points in $X$ (called the “”outliers””) into $\ell_2$ with the smallest possible distortion $c$. Finding the optimal distortion $c$ for a given outlier set size $k$, or alternately the smallest $k$ for a given target distortion $c$ are both NP-hard problems. In fact, it is UGC-hard to approximate $k$ to within a factor smaller than $2$ even when the metric sans outliers is isometrically embeddable into $\ell_2$. We consider bi-criteria approximations. Our main result is a polynomial time algorithm that approximates the outlier set size to within an $O(\log^2 k)$ factor and the distortion to within a constant factor.

The main technical component in our result is an approach for constructing a composition of two given embeddings from subsets of $X$ into $\ell_2$ which inherits the distortions of each to within small multiplicative factors. Specifically, given a low $c_S$ distortion embedding from $S\subset X$ into $\ell_2$ and a high(er) $c_X$ distortion embedding from the entire set $X$ into $\ell_2$, we construct a single embedding that achieves the same  distortion $c_S$ over pairs of points in $S$ and an expansion of at most $O(\log k)\cdot c_X$ over the remaining pairs of points, where $k=|X\setminus S|$. Our composition theorem extends to embeddings into arbitrary $\ell_p$ metrics for $p\ge 1$, and may be of independent interest. While unions of embeddings over disjoint sets have been studied previously, to our knowledge, this is the first work to consider compositions of {\em nested} embeddings.

Graph Sparsification by Approximate Matrix Multiplication

Neo Charalambides

Abstract: Graphs arising in statistical problems, signal processing, large networks, combinatorial optimization, and data analysis are often dense, which causes both computational and storage bottlenecks. One way of sparsifying a weighted graph, keeping the same vertices while reducing the number of edges, is through spectral sparsification. We study this problem through the perspective of RandNLA. Specifically, we utilize randomized matrix multiplication to give a clean and simple analysis of how sampling according to edge weights gives a spectral approximation to graph Laplacians, without requiring spectral information. Through the CR-MM algorithm, we attain a simple and computationally efficient sparsifier whose resulting Laplacian estimate is unbiased and of minimum variance. Furthermore, we define a new notion of additive spectral sparsifiers, which has not been considered in the literature.
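The weight-proportional sampling described above can be sketched as follows (a minimal illustration in the spirit of CR-style randomized matrix multiplication; the example graph and sample size are arbitrary choices):

```python
import numpy as np

def laplacian(n, edges):
    """Weighted graph Laplacian from (u, v, w) edge triples."""
    L = np.zeros((n, n))
    for u, v, w in edges:
        L[u, u] += w; L[v, v] += w
        L[u, v] -= w; L[v, u] -= w
    return L

def sample_sparsifier(n, edges, s, seed=0):
    """Sample s edges i.i.d. with probability proportional to weight and
    reweight each by 1/(s * p_i), so the Laplacian estimate is unbiased."""
    rng = np.random.default_rng(seed)
    weights = np.array([w for _, _, w in edges], dtype=float)
    p = weights / weights.sum()
    idx = rng.choice(len(edges), size=s, p=p)
    sampled = {}
    for i in idx:
        u, v, w = edges[i]
        sampled[(u, v)] = sampled.get((u, v), 0.0) + w / (s * p[i])
    return laplacian(n, [(u, v, w) for (u, v), w in sampled.items()])

# Dense weighted graph on 4 nodes; compare Laplacian traces.
edges = [(u, v, 1.0 + u + v) for u in range(4) for v in range(u + 1, 4)]
L = laplacian(4, edges)
Ls = sample_sparsifier(4, edges, s=200)
print(np.trace(L), np.trace(Ls))
```

With probabilities exactly proportional to weight, each sampled edge contributes the same reweighted mass, so the total edge weight (and hence the Laplacian trace) is preserved exactly; this is the minimum-variance property of weight-proportional sampling.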
